AWS Summit ANZ 2022 - Build resilient microservices using fault-tolerant patterns (DEV5)
Aug 16, 2023
This session covers design patterns - including circuit-breaker and saga - for building resilient microservice architectures, and how they lead to application robustness, data consistency and service recovery across network calls. Learn how AWS Fault Injection Simulator helps to uncover performance bottlenecks and edge cases that are otherwise difficult to find in distributed systems. Learn more about AWS webinar series in Australia and New Zealand at https://go.aws/3ChL0Y6
Content
15.28 -> Hello everyone. Welcome to the
session on building resilient
19.04 -> microservices using fault-tolerant
design and patterns.
23.6 -> First off, I just want to say
that I really appreciate you
26.96 -> taking the time to watch this session.
My name is Anitha Deenadayalan.
31.8 -> I'm a Developer Specialist
Solutions Architect
34.32 -> here at AWS, and I'm based out
of Singapore. I work for the
38.96 -> Developer Acceleration (DevAx)
team. In our team, we help our
43.84 -> customers build secure, reliable,
and scalable modern applications
48.24 -> on AWS. You probably want to
know what you can learn from this
54.2 -> session today. I'll be talking
about how distributed
57.92 -> architectures require changes in
the way we think, and how
61.84 -> network plays a key role in the
design of the services. Then
66.08 -> I'll talk about some of the
commonly occurring problems in
69.12 -> microservice architectures, and
the design patterns to resolve
73.04 -> them. I will also talk about why
you should use chaos
76.8 -> engineering, and AWS Fault
Injection Simulator for testing
81.12 -> the distributed applications.
Today, I will discuss the
85.68 -> patterns in the context of a
trip booking application
89.56 -> TravelBuddy. TravelBuddy lets you
book hotels and flights. It's a
94.28 -> monolithic application, a typical
three-tier design: web tier, application
98.96 -> layer, and a relational
database. Monolithic
102.92 -> applications have most of the
functionality within a single
106.28 -> process or a container.
Internally, they can have
109.36 -> multiple components, layers, and
libraries. The application can
115.52 -> have complex interactions, but
they remain within a single
118.64 -> process. But if a particular
service, maybe you can think of
122.52 -> the flight service here - has to
be scaled, there is no way to
126.36 -> scale just the individual
service that's choking. The
129.96 -> entire application has to be
scaled to cater to the incoming
133.4 -> requests. When releasing
changes, the entire application has
137.6 -> to be regression tested and
released. Let's imagine that our
142.44 -> developers have been tasked to
re-write the code in the
146.04 -> microservices architecture.
Developers are now building
149.76 -> microservices to create hundreds
and sometimes thousands of
153.48 -> small interconnected software
components, and very often they
157.52 -> are using different technology
stacks. Here we should remember
161.96 -> that our developers are used to
writing code for monolithic
165.08 -> applications, so they frequently
disregard the unseen participant
169.12 -> in the communication - the
network. That's somewhat
172.16 -> understandable, given that many
middleware technologies have
176.08 -> tried to make the developer
experience of writing client
179.4 -> code to be very close to that
experience of calling a local
183.56 -> function. So if the developers
are not exposed to issues with
the network in their local tests
with mock data, they are less
192.16 -> likely to defend against them.
So what happens when
197.8 -> applications are written with
little error handling on
201.28 -> networking errors? If a network
outage occurs, such applications
206.04 -> infinitely wait for an answer
packet, permanently consuming
memory or other resources. And
when the services come back up
213.6 -> suddenly, an increased load might
hit a running system, causing
217.6 -> operations to return much slower
than anticipated. Without
221.56 -> timeouts or circuit breakers in
place, this increase in latency
225.56 -> will begin to compound, and may
even look like total
228.8 -> unavailability to the system's
users. Okay, let's look at the
235.48 -> TravelBuddy application again.
The different components are now
239.28 -> extracted into microservices,
and they are working together in
242.76 -> an event-driven architecture.
The services have their own
246.4 -> databases, a call to hotel
service may lead to multiple
250.48 -> calls - here they can be flight
service, payments service - before
254.84 -> returning an output to the
caller. Modern application
259.4 -> architectures bring in a lot of
benefits. They have smaller
262.64 -> blast radius for changes, they
have faster release cycles, they
266.52 -> have less constraints with
resources like CPU and memory,
270.08 -> and they can scale on demand.
You can also easily troubleshoot
273.96 -> and deploy changes at a service
level, rather than at an
277.32 -> application level. You can
choose the best technology for
280.76 -> each service. At the same time
now we have hundreds of
284.44 -> services. There could be network
failures, service failures,
288.2 -> service delays due to peak loads.
A single transaction may span
292.6 -> different databases, making it
harder to use two-phase commit.
298.2 -> So I'll be talking about some of
the defensive programming
301.04 -> techniques or patterns that you
can use to design applications
305.12 -> using microservices architectures.
Imagine that you're
308.32 -> calling an API to get
hotel information from the
311.2 -> TravelBuddy application and
there are intermittent network
314.44 -> issues. If you're in front of
your computer, you will probably
318.44 -> be hitting the refresh button
repeatedly until the call
321.32 -> succeeds. But what if another
process is calling this service?
326.16 -> How are you going to make the
same call again? A good way here
332.6 -> would be to implement retries
with backoff from the caller
335.84 -> side. You can define the number
of retries. Based on the
339.6 -> response status, you can retry
with an increased wait time
343.08 -> between the calls. You're
increasing the wait time because
346.68 -> you're providing time for the
network to recover. You may
350.16 -> overload the network bandwidth
if you retry too frequently.
355.44 -> I've provided the code here in .NET,
and you can choose to
358.36 -> implement this in any of the
programming languages
361.56 -> that you prefer.
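(The slide's .NET code isn't captured in the transcript; the fragment below is a minimal sketch of the same idea - retry on transient network failures with an increasing wait between attempts. The names such as GetHotelsAsync, the one-second initial delay and the retry count are illustrative assumptions.)

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class HotelClient
{
    private static readonly HttpClient Http = new HttpClient();

    // One initial attempt plus up to maxRetries retries with exponential backoff.
    public static async Task<string> GetHotelsAsync(string url, int maxRetries = 3)
    {
        var delay = TimeSpan.FromSeconds(1); // initial wait time

        for (var attempt = 1; ; attempt++)
        {
            try
            {
                var response = await Http.GetAsync(url);
                response.EnsureSuccessStatusCode();             // throws on 4xx/5xx
                return await response.Content.ReadAsStringAsync();
            }
            catch (Exception ex) when ((ex is HttpRequestException || ex is TaskCanceledException)
                                       && attempt <= maxRetries)
            {
                // Back off before the next attempt so the network has time to recover.
                await Task.Delay(delay);
                delay = TimeSpan.FromSeconds(delay.TotalSeconds * 2); // exponential backoff
            }
        }
    }
}
```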
366.48 -> The next scenario I want to
talk about is service throttling. Individual
services may be throttled when
370.56 -> there are too many requests.
Microservices communicate
373.76 -> through remote procedure calls,
and it's always possible that
377.16 -> transient errors could occur in
the network connectivity,
380.32 -> causing failures also. Here the
trip service sends an event
385.44 -> notification after processing.
The rewards service starts
388.88 -> processing when the event
arrives. This is an event driven
391.92 -> architecture and the services
are not aware of each other. So
395.96 -> the retry logic cannot be
implemented in the trip service
399.48 -> code as it causes tight
coupling. A better solution here
406.36 -> would be to use AWS Step
Functions to retry and backoff
410.44 -> instead of embedding the retry
in the individual services. So
414.76 -> if a service call fails, the
workflow can try again for a
418.2 -> defined number of times with an
interval and backoff rate. In my
422.64 -> example here, the workflow will
wait for three seconds at first,
426.52 -> and the backoff rate is 1.5. So
the workflow waits for three
430.6 -> times 1.5 in the second try,
which is 4.5 seconds. It will
434.88 -> continue to the next step if it
succeeds in one of the retries.
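(The workflow definition itself isn't shown in the transcript. As a rough sketch, the same 3-second interval and 1.5 backoff rate might be expressed like this with CDK for .NET, which the speaker uses later for the demo; the task name and the rewardsFunction reference are assumptions.)

```csharp
// Inside a CDK Stack constructor; rewardsFunction is an assumed Lambda construct.
using Amazon.CDK;
using Amazon.CDK.AWS.StepFunctions;
using Amazon.CDK.AWS.StepFunctions.Tasks;

var callRewards = new LambdaInvoke(this, "CallRewardsService", new LambdaInvokeProps
{
    LambdaFunction = rewardsFunction
});

// Wait 3 seconds before the first retry, then multiply the interval by 1.5
// on each subsequent attempt (3s, 4.5s, ...), up to MaxAttempts.
callRewards.AddRetry(new RetryProps
{
    Errors = new[] { "States.TaskFailed", "States.Timeout" },
    Interval = Duration.Seconds(3),
    BackoffRate = 1.5,
    MaxAttempts = 3
});
```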
441.2 -> So whenever there is service
throttling, or intermittent
444.4 -> service failures, you can use
AWS Step Functions to retry the
448.6 -> action. When multiple microservices
collaborate to handle
454.56 -> requests, one or more services
may become unavailable or
458.44 -> exhibit a high latency. If it is
a synchronous call, the entire
463.04 -> application can become slow, and
it will lead to poor user
466.32 -> experience. A timeout in the
recommendation service will
470.76 -> propagate to the caller during
synchronous execution. Imagine
476.64 -> the operation invoking the
service has timeouts, and it is
480.2 -> waiting for the service to
respond. The thread will be
483.52 -> blocked until the timeout period
expires. If there are many
concurrent requests, these blocked
requests may hold critical
490.92 -> system resources such as memory,
threads, and database
494 -> connections. Now, the
recommendation service team is
500 -> trying to find the cause of the
issue and fix it, but as you can
504.44 -> see, the application has many
users, and they continuously
508.24 -> retry the operation. This can
cause the entire application to
512.28 -> go down. When faults are due to
unanticipated events, it might
518 -> take much longer to fix the
issue. These faults can range in
522.28 -> severity. They may cause partial
loss of connectivity, or they
526.52 -> could even lead to a complete
failure of the service. In these
530.32 -> situations, it might be
pointless for an application to
534.04 -> continually retry an operation
that's most likely going to
537.84 -> fail. Instead the application
should quickly accept that the
541.92 -> operation has failed, and handle
this failure. The solution here
547.52 -> is to use a circuit breaker
pattern. This pattern can
550.92 -> prevent an application from
repeatedly trying to execute an
554.4 -> operation that's likely to fail.
In this pattern, there is a
559.28 -> circuit breaker process that
routes the calls from the caller
562.92 -> to the callee. For example,
in our TravelBuddy application,
568.56 -> we can add a circuit breaker
process between the trip and the
572 -> recommendation services. When
there are no failures, circuit
577.24 -> breaker routes all the calls to
the recommendation service. If
582.36 -> the recommendation service times
out, the circuit breaker can
585.44 -> detect the timeout and track the
failure.
589.04 -> If the timeouts exceed a
specific threshold value, the
592.52 -> circuit is open. Once the
circuit is open, the circuit
596.44 -> breaker object does not route
any calls to the recommendation
599.44 -> service. Instead, it just
returns an immediate failure.
604.88 -> Meantime, as you can see, the
recommendation service team is
608.4 -> working on fixing the issue. The
circuit breaker can periodically
612.72 -> retry to see if the calls to the
recommendation service are
615.84 -> successful. If you see here, the
service team is taking time to
620.16 -> fix and the service is still not
available. During the retry if
625.08 -> the call to the recommendation
service succeeds, the circuit is
628.08 -> closed, and all further calls
are routed to it again.
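(Before the Step Functions-based demo, here is a plain in-process sketch of the behaviour just described - track failures, open the circuit past a threshold, fail fast while it is open, and let a call through again after a wait. The threshold and timings are illustrative assumptions.)

```csharp
using System;
using System.Threading.Tasks;

public class CircuitBreaker
{
    private readonly int _failureThreshold;
    private readonly TimeSpan _openDuration;
    private int _consecutiveFailures;
    private DateTime _openUntil = DateTime.MinValue;

    public CircuitBreaker(int failureThreshold = 3, TimeSpan? openDuration = null)
    {
        _failureThreshold = failureThreshold;
        _openDuration = openDuration ?? TimeSpan.FromSeconds(30);
    }

    public async Task<T> CallAsync<T>(Func<Task<T>> callee)
    {
        // While the circuit is open, fail immediately instead of calling the service.
        if (DateTime.UtcNow < _openUntil)
            throw new InvalidOperationException("Circuit is open; failing fast.");

        try
        {
            var result = await callee();
            _consecutiveFailures = 0;   // a success closes the circuit again
            return result;
        }
        catch
        {
            // Track the failure; open the circuit once the threshold is exceeded,
            // so calls only go through again after the wait period expires.
            if (++_consecutiveFailures >= _failureThreshold)
                _openUntil = DateTime.UtcNow.Add(_openDuration);
            throw;
        }
    }
}
```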
633.76 -> Next, I'll be showing you a demo of
the circuit breaker pattern
636.52 -> using AWS Step Functions. This
is the architecture for today's
641.68 -> circuit breaker pattern demo.
AWS Step Functions provides the
646.32 -> circuit breaker capabilities
here. I've designed the circuit breaker
650.88 -> as a generic construct so that
it's service agnostic. This is
654.68 -> one of the ways of implementing
the circuit breaker pattern.
658.92 -> Trip service triggers a Step
Functions workflow with the name
of the service it needs to call
through the circuit breaker. In
666.64 -> this example, recommendation
service. We can use the same circuit
671.24 -> breaker with other services. For
example, hotel service can call
675.28 -> payment service with no caller
configuration changes. I'm using
680.96 -> Amazon DynamoDB here for storing
the degraded service status, but
685.44 -> you can also use an in-memory
data store like Amazon
688.76 -> ElastiCache for lower-latency access.
In my demo, I'll first degrade the
695.2 -> recommendation service by
injecting timeouts, and show you
699.08 -> how the service status is stored
in the DynamoDB table. Then I'll
703.96 -> call this degraded service and
show you how the service returns
707.44 -> an immediate failure for
subsequent calls. After a
711.36 -> specified wait, the circuit
breaker will call the degraded
714.92 -> service again. I'm now in the
Step Functions console. I've
721.88 -> already created the Step
Function workflow using AWS CDK.
With CDK, you can create your
cloud application resources
729.4 -> using familiar programming
languages. I'll click on the
736.28 -> definition to see the state
machine now. This is the circuit
741.6 -> breaker state machine
definition. It gets the circuit
746.12 -> status from the DynamoDB
database, checks whether the
749.48 -> circuit is open or closed. If
the circuit is closed, it
752.64 -> executes the Lambda function. If
the Lambda function executes
756.08 -> successfully, the workflow exits
without any error. If the Lambda
760.92 -> function times out or returns
errors repeatedly, the service is marked as
765.2 -> degraded in the update circuit
status step with an expiry
768.96 -> timestamp. The expiry timestamp
is required so that the service
773.48 -> call can be retried after a
short wait time. So if the
777.8 -> circuit is open, then an
immediate failure is returned to
780.72 -> the caller without executing the
Lambda function. To get the
786.76 -> circuit status, I'm using the
AWS SDK service integrations to
791.84 -> query the DynamoDB database. If
the circuit is closed, then the
795.96 -> query will return a count of
zero. Otherwise, it will return
800.68 -> a non-zero positive number.
Execute Lambda step here
805.6 -> provides the retry with
exponential backoff
808.48 -> capabilities. If this step times
out three consecutive
812.64 -> times, the service is marked as degraded
by storing an item in the
816.2 -> DynamoDB table. As I mentioned
earlier, you can modify the
820.4 -> data store to be an in-memory
data store instead of using
824.2 -> DynamoDB to get low latency
access. So I have used CDK for
829.76 -> .NET for creating the
infrastructure. First I create
833.92 -> the roles for the Lambda
functions and the Step Functions.
840.28 -> Next, I'm creating the circuit
breaker table in DynamoDB with
844.04 -> the service name as partition
key, and the expiry timestamp as
847.4 -> the sort key. Then I'm creating
the API gateway and the trip
851.24 -> resource. I then integrate the
trip service Lambda function to
855.32 -> the API Gateway resource
endpoint.
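(A rough CDK for .NET sketch of that infrastructure is shown below; the attribute names, runtime, handler and asset path are assumptions rather than the demo's exact code, and the IAM role creation is omitted.)

```csharp
// Inside a CDK Stack constructor.
using Amazon.CDK.AWS.APIGateway;
using Amazon.CDK.AWS.DynamoDB;
using Amazon.CDK.AWS.Lambda;

// Circuit breaker table: service name as partition key, expiry timestamp as sort key.
var circuitBreakerTable = new Table(this, "CircuitBreakerTable", new TableProps
{
    PartitionKey = new Amazon.CDK.AWS.DynamoDB.Attribute
    {
        Name = "ServiceName", Type = AttributeType.STRING
    },
    SortKey = new Amazon.CDK.AWS.DynamoDB.Attribute
    {
        Name = "ExpiryTimeStamp", Type = AttributeType.NUMBER
    },
    BillingMode = BillingMode.PAY_PER_REQUEST
});

// Trip service Lambda function and its API Gateway endpoint.
var tripFunction = new Function(this, "TripService", new FunctionProps
{
    Runtime = Runtime.DOTNET_6,
    Handler = "TripService::TripService.Function::Handler",
    Code = Code.FromAsset("src/TripService/publish")
});

var api = new RestApi(this, "TravelBuddyApi");
var trip = api.Root.AddResource("trip");
trip.AddMethod("POST", new LambdaIntegration(tripFunction));
```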
873.8 -> So the Step Functions definition
I showed you just now, this is
878.08 -> how I created it. I'm creating
individual task states here and
882.56 -> chaining them up together for
the workflow. As you see here,
887.88 -> I'm creating the state machine
type as Standard to have the
891.68 -> visual workflow. If you're using
this code in production
895 -> workloads, use the state machine
type as Express to get a low
898.96 -> latency execution. I'm
storing the state machine ARN
904.16 -> in an environment variable of
the trip service Lambda
906.68 -> function. As I'll need this for
triggering the Step Functions
910.32 -> workflow from the trip service.
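(As a sketch, the chaining, the Standard workflow type and the environment variable might look like this in CDK for .NET; the state names, the JSON path and the variable name are assumptions, and the individual task states, as well as tripFunction, are assumed to be defined earlier in the stack.)

```csharp
// Inside the same CDK Stack constructor; getCircuitStatus, executeLambda,
// failFast and done are assumed task states defined above.
using Amazon.CDK.AWS.StepFunctions;

var definition = getCircuitStatus
    .Next(new Choice(this, "IsCircuitOpen")
        .When(Condition.NumberGreaterThan("$.circuitStatus.Count", 0), failFast)
        .Otherwise(executeLambda.Next(done)));

var stateMachine = new StateMachine(this, "CircuitBreakerStateMachine", new StateMachineProps
{
    Definition = definition,
    // Standard gives the visual execution history used in the demo;
    // Express is the better fit for low-latency production workloads.
    StateMachineType = StateMachineType.STANDARD
});

// The trip service reads this ARN at runtime to start the workflow.
tripFunction.AddEnvironment("STATE_MACHINE_ARN", stateMachine.StateMachineArn);
```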
This is the code for the trip
918.32 -> service Lambda function. I'm
reading the state machine ARN
921.8 -> from the environment variable
here. The function receives an
927.88 -> input parameter object that
contains the target Lambda name.
931.64 -> I serialise the JSON input and
trigger the Step Functions
934.76 -> workflow here; the workflow will
trigger the target Lambda
937.92 -> function based on the service status.
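(The handler itself isn't reproduced in the transcript; the sketch below shows the general shape using the AWS SDK for .NET Step Functions client, with illustrative class, property and environment-variable names.)

```csharp
using System;
using System.Text.Json;
using System.Threading.Tasks;
using Amazon.StepFunctions;
using Amazon.StepFunctions.Model;

public class TripServiceFunction
{
    private static readonly IAmazonStepFunctions StepFunctions = new AmazonStepFunctionsClient();
    private static readonly string StateMachineArn =
        Environment.GetEnvironmentVariable("STATE_MACHINE_ARN");

    // Input object carrying the name of the target Lambda to call through the circuit breaker.
    public record CircuitBreakerInput(string TargetLambda);

    public async Task<string> Handler(CircuitBreakerInput input)
    {
        // Serialise the input and trigger the Step Functions workflow; the
        // workflow decides whether the target Lambda is actually invoked.
        var execution = await StepFunctions.StartExecutionAsync(new StartExecutionRequest
        {
            StateMachineArn = StateMachineArn,
            Input = JsonSerializer.Serialize(input)
        });

        return execution.ExecutionArn;
    }
}
```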
941.84 -> The Step Functions workflow is service agnostic; you
can extract out the circuit
breaker calling logic here
into a separate DLL or Lambda
948.44 -> layers, and use it to integrate
the circuit breaker workflow for
951.72 -> any service. So far, I've shown
you the state machine definition,
956.52 -> CDK code for creating the
infrastructure, and trip service
code. I'll show the
execution next. First, I'll make
965.52 -> a call to degrade the
recommendation service through
968.44 -> Postman. I'm sending an extra
parameter here to simulate the
972.44 -> timeout. The recommendation service
will recognise this flag and
976.16 -> return a timeout. I'm clicking
on send now - we will see the
981.84 -> string "done" with status 200 OK.
990.16 -> Next, I'm going to show you the
state machine execution, where it
993.32 -> retries a service call and then
updates the service status in
996.8 -> the DynamoDB database. The state
machine is now executing the
1001.04 -> Lambda function; I've injected a
timeout to degrade the service.
1005.76 -> So the Lambda function will be
retried three times, as per
1009.76 -> my configuration. When it
finally times out, an item will
1014.08 -> be inserted in the DynamoDB
database indicating that the
1017.44 -> service has been degraded.
1035.2 -> Let's now look at the DynamoDB table.
The service name is the partition key
1040.64 -> and the expiry timestamp is the
sort key here. So if you see the output
1047.4 -> recommendation service is now
degraded. I'll now rerun the call
1051.68 -> and show you how the service
returns an immediate failure.
1057.44 -> I've removed the simulate
timeout flag and am calling the workflow
1060.04 -> again. I'll click on send.
1068.52 -> If you see here the state
machine has returned an
1070.88 -> immediate failure, as the service
is degraded. It does so by
1074.84 -> checking the DynamoDB table,
finding a record, and failing
1078.2 -> immediately. The recommendation
service is not called in this case.
1092.52 -> I'll call the recommendation
service again now. So the state
1096.32 -> machine has now called the
recommendation service, and it
1099.44 -> has run successfully. The
current time is greater than the
1103.04 -> timestamp of the item stored in
the DynamoDB table, and the
1106.8 -> service is no longer considered
degraded. I have used the AWS Step
1111.36 -> Functions Standard workflow for
the demo as it provides a
1114.68 -> visual interface, and it's easier
for me to show and explain to
1117.92 -> you. But if you're planning to
use this in production
1120.72 -> workloads, you can use the
Express workflow to achieve
1124.08 -> much higher performance and
lower latency. There are
1127.76 -> situations when a single
transaction can span multiple
1130.88 -> data stores. The basic principle
of microservices is that each
1135.32 -> service manages its own data. So
hotel service stores its data
1139.76 -> in its own database, flight
service in its own, and the
1143.12 -> payment service can track the
executed transactions in its
own database. Does anyone remember
1146.28 -> storing BLOB data in relational
1151 -> database columns? How about a
really long text? How many times
1155.6 -> have we converted data from one
representation to another, just
1159.56 -> so that we can store and track
that information in the
database? When using microservices,
services can choose
1169.04 -> the databases that suit their
needs. In our example, hotel
1173.68 -> service uses Amazon Aurora, a
relational database, and the
1177.68 -> flight service uses Amazon
1181.16 -> DynamoDB, which is our NoSQL
1181.16 -> data store. Payment service
1185.12 -> here, we can imagine, talks to an
1185.12 -> external payment SaaS. As I
mentioned earlier, microservices
1189.72 -> communicate through events to
indicate process changes.
1193.64 -> Imagine that a customer is
booking a trip. This may include
1197.52 -> booking the hotel, flight, and
making payment. You're seeing
1201.8 -> the happy path on the screen,
there are no failures in the
1204.6 -> workflow, and the order gets
placed successfully. If you
1208.64 -> notice, this is an example of a
distributed transaction with
1211.84 -> polyglot persistence. The
transaction data gets stored
1215.48 -> across different databases, and
each service writes to its own
1219.36 -> database. Let's imagine that
there was a network failure, and
1225.2 -> the payment gateway has timed
out. So the last step in the
1228.64 -> workflow has failed. At the time
of failure, the hotel database
1233 -> and the flight databases are
already updated. But if you see
1237.08 -> the data, it is now
inconsistent. Hotel and flights
1240.68 -> are booked, but the payment has
failed. And there is no way to
1243.92 -> correct the data, and there are
no options to retry the
1246.88 -> processes that have failed. In a
relational database, we get ACID
1252.16 -> transactions - Atomicity,
Consistency, Isolation, and
1255.8 -> Durability. If a transaction
fails in a relational database,
1259.64 -> the transaction gets rolled
back. But in distributed systems
1263.6 -> like microservices, two-phase
commit is not an option as the
1267.16 -> transaction is distributed
across various databases. In
1272 -> this case, the solution is to
use the Saga orchestration
1274.72 -> pattern. If any transaction
fails in the workflow, the Saga
1278.84 -> orchestrator executes a series
of compensating transactions
1282.52 -> that reverts the changes that
were made by the preceding
1285.72 -> transactions. Let's imagine that
a customer books the hotel when
1291.32 -> the flight tickets are almost
full, another customer secures
the flight tickets, and they
1298.36 -> become unavailable. As we all
1298.36 -> know this can happen when
multiple customers are competing
1301.56 -> to get tickets at the same time.
In this case, the flight booking
1305.56 -> step will fail in the
workflow. The orchestrator will
1309.88 -> now execute the compensatory
transactions: it will run the revert
1313.56 -> flight booking and remove hotel
booking steps, and return the
1317 -> status as failed. As you can
see, for every action that makes
1323.04 -> a change to the datastore there
is an opposite action that
1326.24 -> compensates the change in the
case of a failure. It's almost
1330.36 -> like Newton's Third Law - for
every action there is an equal
1333.44 -> and opposite reaction. A failure
that happens in the last step,
1338.24 -> payment services in this case,
will cause all the compensatory
1341.84 -> transactions to be executed,
before returning a failed state.
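(The session's slide expresses this in Amazon States Language; as a rough equivalent sketch in CDK for .NET, the payment step's failure can be caught and routed to the compensating steps. The step names are assumptions, and the LambdaInvoke task states are assumed to be defined elsewhere in the stack.)

```csharp
// Inside a CDK Stack constructor; bookHotel, bookFlight, processPayment,
// revertFlightBooking and removeHotelBooking are assumed task states.
using Amazon.CDK.AWS.StepFunctions;

// Compensating transactions run for the already-completed steps, then the workflow fails.
var compensate = revertFlightBooking
    .Next(removeHotelBooking)
    .Next(new Fail(this, "TripBookingFailed"));

// If processing the payment throws, route to the compensation chain
// instead of succeeding.
processPayment.AddCatch(compensate, new CatchProps
{
    Errors = new[] { "States.ALL" },
    ResultPath = "$.error"
});

var definition = bookHotel
    .Next(bookFlight)
    .Next(processPayment)
    .Next(new Succeed(this, "TripBooked"));
```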
1348.08 -> The state machine makes it very
easy for us to configure the
1351.08 -> compensatory transactions in
case of failures. Here you can
1355.36 -> see how the Saga orchestration
has been implemented, with the
1358.56 -> Amazon States Language. This
shows the payment processing
1361.92 -> step of the Saga orchestration
pattern, this can be easily
1365.72 -> extended to define the
orchestration of a complex
1368.24 -> transaction. So to summarise - we
solve intermittent network
failures with retries in code,
service failures with AWS Step
1379.32 -> Functions retry workflow,
service delays and performance
1383.36 -> issues with the circuit breaker
pattern, and distributed
1387.6 -> transactions using the Saga
orchestration pattern. These are
1392.28 -> solutions to known problems. Once
outages happen, it is easy
1396.12 -> enough to analyse what went
wrong. Of course you have to
1399.76 -> get the data metrics and traces
to determine why it occurred.
1404.08 -> But how do you find issues that
have not occurred yet?
1408.68 -> There are two types of
complexity in software
1410.8 -> development. One is accidental
complexity. Accidental
1414.8 -> complexity is generated by the
developers themselves, like bad
1418.52 -> variable names, or bad API
definitions. The other is
1422 -> essential complexity. It is the
complexity of the problem that
1426.2 -> we want to solve. It is the
inherent complexity of the
1429.48 -> domains, so we can't remove it
at all. The traditional way of
1433.8 -> tackling this challenge has been
testing. But as distributed
1437.44 -> systems have become more complex,
testing can only verify the
1441.88 -> known conditions in isolated
environments. Answering
1445.88 -> questions like - is this what we
expect, or is this the
1449.68 -> result of that function too? You
know those kinds of questions.
1454.2 -> How can you test something you
don't know yet? One way to
1458.28 -> deal with the unpredictable is
using chaos engineering. By
1462.12 -> using chaos engineering, you can
learn how the system behaves by
1466.04 -> running controlled experiments
on your system. For distributed
1470.08 -> systems, even for a single
request, many services are
1473.12 -> involved. So we need to
aggregate information and
1476.12 -> visualise it to understand the
system status, and to trace the
1479.6 -> request. This is what we call
observability. And without
1483.48 -> observability you don't have
chaos engineering, you will just
1486.2 -> have chaos. If you can't observe
the behaviour of your system you
1490.56 -> can't learn from your experiments.
This is the chaos engineering
1494.64 -> process. Firstly we define the
steady state. Next we make a
1498.68 -> hypothesis and run an
experiment by injecting faults.
1502.52 -> After the experiment, we check
the system behaviour and verify
1505.52 -> the hypothesis. If we see any
difference between the
1508.6 -> hypothesis and actual behaviour,
we need to improve the system.
1512.48 -> Through iterations of this
process, we can improve the
1515.6 -> system and gain confidence in
the system's capability to
1518.92 -> withstand turbulent conditions
in the production environment.
1522.92 -> Fault Injection Simulator is a
fully managed service for chaos
1526.28 -> engineering; it helps you run
experiments. Many similar
1530.12 -> services normally require the
user to install an agent, but FIS
1533.64 -> does not. And you can configure
a stop condition. So if experiments
1538.12 -> go wrong, you can stop them
automatically.
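(As a small illustrative sketch with the AWS SDK for .NET, an experiment can be started from an existing experiment template; the template itself - including its targets, actions and CloudWatch-alarm stop condition - is assumed to have been created already, and the template id is a placeholder.)

```csharp
using System;
using System.Threading.Tasks;
using Amazon.FIS;
using Amazon.FIS.Model;

public static class ChaosExperiment
{
    public static async Task RunAsync()
    {
        var fis = new AmazonFISClient();

        // Start an experiment from an existing FIS experiment template. If the
        // template's stop condition (for example a CloudWatch alarm) fires,
        // FIS stops the experiment automatically.
        var response = await fis.StartExperimentAsync(new StartExperimentRequest
        {
            ExperimentTemplateId = "EXT_TEMPLATE_ID_PLACEHOLDER",
            ClientToken = Guid.NewGuid().ToString()
        });

        Console.WriteLine($"Experiment started: {response.Experiment.Id}");
    }
}
```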
1542.84 -> And is it really mandatory to do experiments in
production? There are advantages
1547.28 -> of running an experiment in the
production environment, but of
1550.44 -> course, it comes with blockers
and risks. For example, if
1554.04 -> you're not familiar with some
tools, there is a risk of making
1557.72 -> a mistake and you may
unintentionally impact the
1560.24 -> users. So it's better that you
start in the development or
1564.08 -> staging environment, and
familiarise yourself with the
1566.96 -> test tools. In some cases, it
may be enough to do the
1570.92 -> experiments in the development
environment. You joined the AWS
1575.24 -> Summit to learn, and you can keep
learning beyond the Summit by
1578.64 -> using these training resources.
Skill Builder is our online
1583.56 -> learning centre that makes it
easier for anyone from beginners
1587.36 -> to experienced professionals
to build AWS Cloud skills. We
1591.92 -> offer 500+ free digital
courses that can help you and
1595.8 -> your team build new cloud skills,
and learn about the latest services.
1602.08 -> And with that, I would
like to thank you again for
1605.32 -> taking the time to listen to my
session. I really hope my
1609.32 -> explanations will help you apply
the patterns to your own
1613.04 -> distributed applications. I just
have one final request for you,
1617.8 -> and that's to fill in the
session survey. It will only
1621.12 -> take a minute of your time and
it really helps me out to know
1624.92 -> what you thought of the session.
I hope you'll enjoy the rest of
1628.52 -> the AWS Summit.
Source: https://www.youtube.com/watch?v=NB3ei9pnHFA