AWS Summit ANZ 2022 - Build resilient microservices using fault-tolerant patterns (DEV5)

This session covers design patterns - including circuit-breaker and saga - for building resilient microservice architectures, and how they lead to application robustness, data consistency and service recovery across network calls. Learn how AWS Fault Injection Simulator helps to uncover performance bottlenecks and edge cases that are otherwise difficult to find in distributed systems.

Learn more about AWS webinar series in Australia and New Zealand at https://go.aws/3ChL0Y6.

Subscribe:
More AWS videos http://bit.ly/2O3zS75
More AWS events videos http://bit.ly/316g9t4

ABOUT AWS
Amazon Web Services (AWS) hosts events, both online and in-person, bringing the cloud computing community together to connect, collaborate, and learn from AWS experts.

AWS is the world’s most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster.

#AWSSummit #AWS #AmazonWebServices #CloudComputing


Content

15.28 -> Hello everyone. Welcome to the session on building resilient
19.04 -> microservices using fault-tolerant design and patterns.
23.6 -> First off, I just want to say that I really appreciate you
26.96 -> taking the time to watch this session. My name is Anitha Deenadayalan
31.8 -> I'm a Developer Specialist Solutions Architect
34.32 -> here at AWS, and I'm based out of Singapore. I work for the
38.96 -> Developer Acceleration, DevAx team. In our team, we help our
43.84 -> customers build secure, reliable, and scalable modern applications
48.24 -> on AWS. You probably want to know what you can learn from this
54.2 -> session today. I'll be talking about how distributed
57.92 -> architectures require changes in the way we think, and how
61.84 -> network plays a key role in the design of the services. Then
66.08 -> I'll talk about some of the commonly occurring problems in
69.12 -> microservice architectures, and the design patterns to resolve
73.04 -> them. I will also talk about why you should use chaos
76.8 -> engineering, and AWS Fault Injection Simulator for testing
81.12 -> the distributed applications. Today, I will discuss the
85.68 -> patterns in the context of a trip booking application
89.56 -> TravelBuddy. TravelBuddy lets you book hotels and flights. It's a
94.28 -> monolithic application - a typical three-tier design with a web tier, an application
98.96 -> layer, and a relational database. Monolithic
102.92 -> applications have most of the functionality within a single
106.28 -> process or a container. Internally, they can have
109.36 -> multiple components, layers, and libraries. The application can
115.52 -> have complex interactions, but they remain within a single
118.64 -> process. But if a particular service, maybe you can think of
122.52 -> the flight service here - has to be scaled, there is no way to
126.36 -> scale just the individual service that's choking. The
129.96 -> entire application has to be scaled to cater to the incoming
133.4 -> requests. When releasing changes, the entire application has
137.6 -> to be regression tested and released. Let's imagine that our
142.44 -> developers have been tasked to re-write the code in the
146.04 -> microservices architecture. Developers are now building
149.76 -> microservices to create hundreds and sometimes thousands of
153.48 -> small interconnected software components, and very often they
157.52 -> are using different technology stacks. Here we should remember
161.96 -> that our developers are used to writing code for monolithic
165.08 -> applications, so they frequently disregard the unseen participant
169.12 -> in the communication - the network. That's somewhat
172.16 -> understandable, given that many middleware technologies have
176.08 -> tried to make the developer experience of writing client
179.4 -> code to be very close to that experience of calling a local
183.56 -> function. So if developers are not exposed to network issues
188.44 -> in their local tests with mocked data, they are less
192.16 -> likely to defend against them. So what happens when
197.8 -> applications are written with little error handling for
201.28 -> networking errors? If a network outage occurs, such applications
206.04 -> wait indefinitely for an answer packet, permanently consuming
209.6 -> memory or other resources. And when the services come back up
213.6 -> suddenly, an increased load might hit a running system, causing
217.6 -> operations to return much slower than anticipated. Without
221.56 -> timeouts or circuit breakers in place, this increase in latency
225.56 -> will begin to compound, and may even look like total
228.8 -> unavailability to the system's users. Okay, let's look at the
235.48 -> TravelBuddy application again. The different components are now
239.28 -> extracted into microservices, and they are working together in
242.76 -> an event-driven architecture. The services have their own
246.4 -> databases. A call to the hotel service may lead to multiple
250.48 -> calls - here to the flight service and payment service - before
254.84 -> returning an output to the caller. Modern application
259.4 -> architectures bring in a lot of benefits. They have smaller
262.64 -> blast radius for changes, they have faster release cycles, they
266.52 -> have fewer constraints on resources like CPU and memory,
270.08 -> and they can scale on demand. You can also easily troubleshoot
273.96 -> and deploy changes at a service level, rather than at an
277.32 -> application level. You can choose the best technology for
280.76 -> each service. At the same time now we have hundreds of
284.44 -> services. There could be network failures, service failures,
288.2 -> service delays due to peak loads. A single transaction may span
292.6 -> different databases, making it harder to use two-phase commit.
298.2 -> So I'll be talking about some of the defensive programming
301.04 -> techniques or patterns that you can use to design applications
305.12 -> using microservices architectures. Imagine that you're
308.32 -> calling an API to get hotel information from the
311.2 -> TravelBuddy application and there are intermittent network
314.44 -> issues. If you're in front of your computer, you will probably
318.44 -> be hitting the refresh button repeatedly until the call
321.32 -> succeeds. But what if another process is calling this service?
326.16 -> How are you going to make the same call again? A good way here
332.6 -> would be to implement retries with backoff from the caller
335.84 -> side. You can define the number of retries. Based on the
339.6 -> response status, you can retry with an increased wait time
343.08 -> between the calls. You're increasing the wait time because
346.68 -> you're providing time for the network to recover. You may
350.16 -> overload the network bandwidth if you retry too frequently.
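A minimal sketch of this caller-side retry with backoff, in C# to match the session's .NET examples. The endpoint, retry count, and delay values below are illustrative assumptions, not the exact code shown in the session.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class HotelApiClient
{
    private static readonly HttpClient Http = new HttpClient();

    // Calls the hotel API, retrying transient failures with exponential backoff.
    public static async Task<string> GetHotelInfoAsync(string url, int maxRetries = 3)
    {
        for (var attempt = 0; attempt <= maxRetries; attempt++)
        {
            try
            {
                var response = await Http.GetAsync(url);
                if (response.IsSuccessStatusCode)
                    return await response.Content.ReadAsStringAsync();
            }
            catch (HttpRequestException)
            {
                // Transient network error - fall through to the backoff below.
            }

            if (attempt == maxRetries) break;

            // Increase the wait time after each failure to give the network time to recover.
            await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt)));
        }

        throw new HttpRequestException($"Call to {url} failed after {maxRetries} retries.");
    }
}
```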
355.44 -> I've provided the code here in .NET, and you can
358.36 -> choose to implement this in any of the programming languages
361.56 -> that you prefer. The next scenario I want to talk about is
366.48 -> service throttling. Individual services may be throttled when
370.56 -> there are too many requests. Microservices communicate
373.76 -> through remote procedure calls, and it's always possible that
377.16 -> transient errors could occur in the network connectivity,
380.32 -> causing failures. Here, the trip service sends an event
385.44 -> notification after processing. The rewards service starts
388.88 -> processing when the event arrives. This is an event driven
391.92 -> architecture and the services are not aware of each other. So
395.96 -> the retry logic cannot be implemented in the trip service
399.48 -> code, as it causes tight coupling. A better solution here
406.36 -> would be to use AWS Step Functions to retry and backoff
410.44 -> instead of embedding the retry in the individual services. So
414.76 -> if a service call fails, the workflow can try again for a
418.2 -> defined number of times with an interval and backoff rate. In my
422.64 -> example here, the workflow will wait for three seconds at first,
426.52 -> and the backoff rate is 1.5. So the workflow waits for three
430.6 -> times 1.5 in the second try, which is 4.5 seconds. It will
434.88 -> continue to the next step if it succeeds in one of the retries.
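As a rough sketch, a retry like the one just described could be configured with AWS CDK for .NET along these lines; the construct names, Lambda runtime, handler, and asset path are assumptions for illustration, not the session's actual code.

```csharp
using Amazon.CDK;
using Amazon.CDK.AWS.Lambda;
using Amazon.CDK.AWS.StepFunctions;
using Amazon.CDK.AWS.StepFunctions.Tasks;
using Constructs;

// Illustrative stack: a single task that retries with a 3-second initial
// interval and a backoff rate of 1.5, as described above.
public class RetryDemoStack : Stack
{
    public RetryDemoStack(Construct scope, string id) : base(scope, id)
    {
        var rewardsFn = new Function(this, "RewardsService", new FunctionProps
        {
            Runtime = Runtime.DOTNET_6,
            Handler = "Rewards::Rewards.Handler::Process",  // hypothetical handler
            Code = Code.FromAsset("src/rewards/publish")    // hypothetical path
        });

        var callRewards = new LambdaInvoke(this, "CallRewardsService", new LambdaInvokeProps
        {
            LambdaFunction = rewardsFn
        });

        // Wait 3s before the first retry, then 4.5s, then 6.75s (backoff rate 1.5).
        callRewards.AddRetry(new RetryProps
        {
            Errors = new[] { "States.Timeout", "States.TaskFailed" },
            Interval = Duration.Seconds(3),
            MaxAttempts = 3,
            BackoffRate = 1.5
        });

        new StateMachine(this, "RetryWorkflow", new StateMachineProps
        {
            Definition = callRewards
        });
    }
}
```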
441.2 -> So whenever there is service throttling, or intermittent
444.4 -> service failures, you can use AWS Step Functions to retry the
448.6 -> action. When multiple microservices collaborate to handle
454.56 -> requests, one or more services may become unavailable or
458.44 -> exhibit high latency. If it is a synchronous call, the entire
463.04 -> application can become slow, and it will lead to poor user
466.32 -> experience. A timeout in the recommendation service will
470.76 -> propagate to the caller during synchronous execution. Imagine
476.64 -> the operation invoking the service has a timeout, and it is
480.2 -> waiting for the service to respond. The thread will be
483.52 -> blocked until the timeout period expires. If there are many
487.08 -> concurrent requests, these blocked requests may hold critical
490.92 -> system resources such as memory, threads, and database
494 -> connections. Now, the recommendation service team is
500 -> trying to find the cause of the issue and fix it, but as you can
504.44 -> see, the application has many users, and they continuously
508.24 -> retry the operation. This can cause the entire application to
512.28 -> go down. When faults are due to unanticipated events, it might
518 -> take much longer to fix the issue. These faults can range in
522.28 -> severity. They may cause partial loss of connectivity, or they
526.52 -> could even lead to a complete failure of the service. In these
530.32 -> situations, it might be pointless for an application to
534.04 -> continually retry an operation that's most likely going to
537.84 -> fail. Instead the application should quickly accept that the
541.92 -> operation has failed, and handle this failure. The solution here
547.52 -> is to use a circuit breaker pattern. This pattern can
550.92 -> prevent an application from repeatedly trying to execute an
554.4 -> operation that's likely to fail. In this pattern, there is a
559.28 -> circuit breaker process that routes the calls from the caller
562.92 -> to the callee. For example, in our TravelBuddy application,
568.56 -> we can add a circuit breaker process between the trip and the
572 -> recommendation services. When there are no failures, circuit
577.24 -> breaker routes all the calls to the recommendation service. If
582.36 -> the recommendation service times out, the circuit breaker can
585.44 -> detect the timeout and track the failure.
589.04 -> If the timeouts exceed a specific threshold value, the
592.52 -> circuit is open. Once the circuit is open, the circuit
596.44 -> breaker object does not route any calls to the recommendation
599.44 -> service. Instead, it just returns an immediate failure.
604.88 -> In the meantime, as you can see, the recommendation service team is
608.4 -> working on fixing the issue. The circuit breaker can periodically
612.72 -> retry to see if the calls to the recommendation service are
615.84 -> successful. If you see here, the service team is taking time to
620.16 -> fix and the service is still not available. During the retry if
625.08 -> the call to the recommendation service succeeds, the circuit is
628.08 -> closed, and all further calls are routed to it again. Next,
633.76 -> I'll be showing you a demo of the circuit breaker pattern
636.52 -> using AWS Step Functions. This is the architecture for today's
641.68 -> circuit breaker pattern demo. AWS Step Functions provides the
646.32 -> circuit breaker capabilities here. I've designed the circuit breaker
650.88 -> as a generic construct so that it's service agnostic. This is
654.68 -> one of the ways of implementing the circuit breaker pattern.
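For intuition, the open/closed behaviour just described can also be pictured as a plain in-process class; this is only a conceptual sketch in C# with illustrative thresholds, not the Step Functions implementation used in the demo.

```csharp
using System;
using System.Threading.Tasks;

public class CircuitBreaker
{
    private readonly int _failureThreshold;
    private readonly TimeSpan _openDuration;
    private int _failureCount;
    private DateTime _openUntil = DateTime.MinValue;

    public CircuitBreaker(int failureThreshold, TimeSpan openDuration)
    {
        _failureThreshold = failureThreshold;
        _openDuration = openDuration;
    }

    public async Task<T> CallAsync<T>(Func<Task<T>> operation)
    {
        // Circuit open: fail immediately instead of calling the degraded service.
        if (DateTime.UtcNow < _openUntil)
            throw new InvalidOperationException("Circuit is open - failing fast.");

        try
        {
            var result = await operation();
            _failureCount = 0;  // a success closes the circuit again
            return result;
        }
        catch
        {
            _failureCount++;
            if (_failureCount >= _failureThreshold)
            {
                // Trip the circuit; after _openDuration the next call is allowed through
                // as a trial, which acts as the periodic retry described above.
                _openUntil = DateTime.UtcNow.Add(_openDuration);
                _failureCount = 0;
            }
            throw;
        }
    }
}
```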
658.92 -> Trip service triggers a Step Functions workflow with the name
662.36 -> of the service it needs to call through the circuit breaker. In
666.64 -> this example, the recommendation service. We can use the same circuit
671.24 -> breaker with other services. For example, hotel service can call
675.28 -> payment service with no caller configuration changes. I'm using
680.96 -> Amazon DynamoDB here for storing the degraded service status, but
685.44 -> you can also use an in-memory data store like Amazon
688.76 -> ElastiCache for lower latency access. In my demo, I'll first degrade the
695.2 -> recommendation service by injecting timeouts, and show you
699.08 -> how the service status is stored in the DynamoDB table. Then I'll
703.96 -> call this degraded service and show you how the service returns
707.44 -> an immediate failure for subsequent calls. After a
711.36 -> specified wait, the circuit breaker will call the degraded
714.92 -> service again. I'm now in the Step Functions console. I've
721.88 -> already created the Step Functions workflow using AWS CDK,
726 -> which you can use to create your cloud application resources
729.4 -> using familiar programming languages. I'll click on the
736.28 -> definition to see the state machine now. This is the circuit
741.6 -> breaker state machine definition. It gets the circuit
746.12 -> status from the DynamoDB database, checks whether the
749.48 -> circuit is open or closed. If the circuit is closed, it
752.64 -> executes the Lambda function. If the Lambda function executes
756.08 -> successfully, the workflow exits without any error. If the Lambda
760.92 -> function times out or returns errors repeatedly, the service is marked as
765.2 -> degraded in the update circuit status step, with an expiry
768.96 -> timestamp. The expiry timestamp is required so that the service
773.48 -> call can be retried after a short wait time. So if the
777.8 -> circuit is open, then an immediate failure is returned to
780.72 -> the caller without executing the Lambda function. To get the
786.76 -> circuit status, I'm using the AWS SDK service integrations to
791.84 -> query the DynamoDB database. If the circuit is closed, then the
795.96 -> query will return a count of zero. Otherwise, it will return
800.68 -> a non-zero positive number. The Execute Lambda step here
805.6 -> provides retry with exponential backoff
808.48 -> capabilities. If this step times out three consecutive
812.64 -> times, the service is marked as degraded by storing an item in the
816.2 -> DynamoDB table. As I mentioned earlier, you can modify the
820.4 -> data store to be an in-memory data store instead of using
824.2 -> DynamoDB to get low latency access. So I have used CDK for
829.76 -> .NET for creating the infrastructure. First I create
833.92 -> the roles for the Lambda functions and the Step Functions.
840.28 -> Next, I'm creating the circuit breaker table in DynamoDB with
844.04 -> the service name as the partition key, and the expiry timestamp as
847.4 -> the sort key. Then I'm creating the API Gateway and the trip
851.24 -> resource. I then integrate the trip service Lambda function to
855.32 -> the API Gateway resource endpoint.
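A rough CDK for .NET sketch of the pieces just described - the circuit breaker table keyed on service name and expiry timestamp, and an API Gateway trip resource backed by the trip service Lambda. The table, attribute, handler, and path names are assumptions, not the exact demo code.

```csharp
using Amazon.CDK;
using Amazon.CDK.AWS.APIGateway;
using Amazon.CDK.AWS.DynamoDB;
using Amazon.CDK.AWS.Lambda;
using Constructs;
using DynamoAttribute = Amazon.CDK.AWS.DynamoDB.Attribute;

public class CircuitBreakerInfraStack : Stack
{
    public CircuitBreakerInfraStack(Construct scope, string id) : base(scope, id)
    {
        // Partition key: service name; sort key: expiry timestamp of the "degraded" record.
        var circuitTable = new Table(this, "CircuitBreakerTable", new TableProps
        {
            PartitionKey = new DynamoAttribute { Name = "ServiceName", Type = AttributeType.STRING },
            SortKey = new DynamoAttribute { Name = "ExpireTimeStamp", Type = AttributeType.NUMBER }
        });

        var tripFn = new Function(this, "TripService", new FunctionProps
        {
            Runtime = Runtime.DOTNET_6,
            Handler = "TripService::TripService.Handler::Handle",  // hypothetical handler
            Code = Code.FromAsset("src/trip/publish")              // hypothetical path
        });

        // POST /trip -> trip service Lambda function.
        var api = new RestApi(this, "TravelBuddyApi");
        var trip = api.Root.AddResource("trip");
        trip.AddMethod("POST", new LambdaIntegration(tripFn));
    }
}
```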
873.8 -> So the Step Functions definition I showed you just now, this is
878.08 -> how I created it. I'm creating individual task states here and
882.56 -> chaining them up together for the workflow. As you see here,
887.88 -> I'm creating the state machine type as Standard to have the
891.68 -> visual workflow. If you're using this code in production
895 -> workloads, use the state machine type as Express to get lower
898.96 -> latency execution. I'm storing the state machine ARN
904.16 -> in an environment variable of the trip service Lambda
906.68 -> function. As I'll need this for triggering the Step Functions
910.32 -> workflow from the trip service. This is the code for the trip
918.32 -> service Lambda function. I'm reading the state machine ARN
921.8 -> from the environment variable here. The function receives an
927.88 -> input parameter object that contains the target Lambda name.
931.64 -> I serialise the JSON input and trigger the Step Functions
934.76 -> workflow here; the workflow will trigger the target Lambda
937.92 -> function based on the service status. The Step Functions
941.84 -> workflow is service agnostic; you can extract out the circuit
945.08 -> breaker calling logic here into a separate DLL or Lambda
948.44 -> layer, and use it to integrate the circuit breaker workflow with
951.72 -> any service. So far, I've shown you the state machine definition,
956.52 -> CDK code for creating the infrastructure, and trip service
960.68 -> code. I'll show the execution next. First, I'll make
965.52 -> a call to degrade the recommendation service through
968.44 -> Postman. I'm sending an extra parameter here to simulate the
972.44 -> timeout. The recommendation service will recognise this flag and
976.16 -> return a timeout. I'm clicking on send now - we will see the
981.84 -> string 'done' and the status 200 OK.
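For reference, a trip service handler like the one described a moment ago might look roughly like this in C#, using the AWS SDK for .NET Step Functions client; the class, property, and environment variable names are assumptions rather than the session's exact code.

```csharp
using System;
using System.Text.Json;
using System.Threading.Tasks;
using Amazon.StepFunctions;
using Amazon.StepFunctions.Model;

public class TripRequest
{
    public string TargetLambda { get; set; }     // name of the service to call via the circuit breaker
    public bool SimulateTimeout { get; set; }    // extra flag used in the demo to degrade the service
}

public class TripServiceHandler
{
    private static readonly IAmazonStepFunctions StepFunctions = new AmazonStepFunctionsClient();

    // The state machine ARN is read from an environment variable set by the CDK stack.
    private static readonly string StateMachineArn =
        Environment.GetEnvironmentVariable("STATE_MACHINE_ARN");  // hypothetical variable name

    public async Task<string> HandleAsync(TripRequest request)
    {
        // Serialise the input and start the circuit breaker workflow; the workflow
        // decides whether the target Lambda function is actually invoked.
        var execution = await StepFunctions.StartExecutionAsync(new StartExecutionRequest
        {
            StateMachineArn = StateMachineArn,
            Input = JsonSerializer.Serialize(request)
        });

        return execution.ExecutionArn;
    }
}
```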
990.16 -> Next, I'm going to show you the state machine execution, where it
993.32 -> retries a service call and then updates the service status in
996.8 -> the DynamoDB database. The state machine is now executing the
1001.04 -> Lambda function, I've injected a timeout to degrade the service.
1005.76 -> So the Lambda function will be retried three times, as per
1009.76 -> my configuration. When it finally times out, an item will
1014.08 -> be inserted in the DynamoDB database indicating that the
1017.44 -> service has been degraded.
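The "is the circuit open?" check behind this amounts to a query like the one below, shown with the AWS SDK for .NET for illustration; the demo itself performs it with the Step Functions SDK service integration, and the table and attribute names here are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

public static class CircuitStatus
{
    private static readonly IAmazonDynamoDB DynamoDb = new AmazonDynamoDBClient();

    // Returns true when a non-expired "degraded" item exists for the service,
    // i.e. the circuit is open and the caller should fail fast.
    public static async Task<bool> IsOpenAsync(string serviceName)
    {
        var now = DateTimeOffset.UtcNow.ToUnixTimeSeconds().ToString();

        var response = await DynamoDb.QueryAsync(new QueryRequest
        {
            TableName = "CircuitBreakerTable",  // assumed table name
            KeyConditionExpression = "ServiceName = :svc AND ExpireTimeStamp > :now",
            ExpressionAttributeValues = new Dictionary<string, AttributeValue>
            {
                [":svc"] = new AttributeValue { S = serviceName },
                [":now"] = new AttributeValue { N = now }
            },
            Select = Select.COUNT
        });

        return response.Count > 0;
    }
}
```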
1035.2 -> Let's now look at the DynamoDB table. The service name is the partition key
1040.64 -> and the expiry timestamp is the sort key here. So if you see the output,
1047.4 -> the recommendation service is now degraded. I'll now rerun the call
1051.68 -> and show you how the service returns an immediate failure.
1057.44 -> I've removed the simulate timeout flag and I'm calling the workflow
1060.04 -> again. I'll click on send.
1068.52 -> If you see here, the state machine has returned an
1070.88 -> immediate failure, as the service is degraded. It checks
1074.84 -> the DynamoDB table, finds a record, and fails
1078.2 -> immediately. The recommendation service is not called in this case.
1092.52 -> I'll call the recommendation service again now. So the state
1096.32 -> machine has now called the recommendation service, and it
1099.44 -> has run successfully. The current time is greater than the
1103.04 -> timestamp of the item stored in the DynamoDB table, and the
1106.8 -> service is no longer considered degraded. I have used the AWS Step
1111.36 -> Functions Standard workflow for the demo, as it provides a
1114.68 -> visual interface, and it's easier for me to show and explain to
1117.92 -> you. But if you're planning to use this in production
1120.72 -> workloads, you can use the Express workflow to achieve
1124.08 -> much higher performance and lower latency. There are
1127.76 -> situations when a single transaction can span multiple
1130.88 -> data stores. The basic principle of microservices is that each
1135.32 -> service manages its own data. So the hotel service stores its data
1139.76 -> in its own database, the flight service in its own, and the
1143.12 -> payment service can track the executed transactions in its
1146.28 -> own database. Does anyone remember storing BLOB data in relational
1151 -> database columns? How about really long text? How many times
1155.6 -> have we converted data from one representation to another, just
1159.56 -> so that we could store and track that information in the
1162.52 -> database? When using microservices, services can choose
1169.04 -> the database that suits their needs. In our example, the hotel
1173.68 -> service uses Amazon Aurora, a relational database, and the
1177.68 -> flight service uses Amazon DynamoDB, which is our NoSQL
1181.16 -> data store. The payment service here, we can imagine, talks to an
1185.12 -> external payment SaaS. As I mentioned earlier, microservices
1189.72 -> communicate through events to indicate process changes.
1193.64 -> Imagine that a customer is booking a trip. This may include
1197.52 -> booking the hotel, flight, and making payment. You're seeing
1201.8 -> the happy path on the screen, there are no failures in the
1204.6 -> workflow, and the order gets placed successfully. If you
1208.64 -> notice, this is an example of a distributed transaction with
1211.84 -> polyglot persistence. The transaction data gets stored
1215.48 -> across different databases, and each service writes to its own
1219.36 -> database. Let's imagine that there was a network failure, and
1225.2 -> the payment gateway has timed out. So the last step in the
1228.64 -> workflow has failed. At the time of failure, the hotel database
1233 -> and the flight databases are already updated. But if you see
1237.08 -> the data, it is now inconsistent. Hotel and flights
1240.68 -> are booked, but the payment has failed. And there is no way to
1243.92 -> correct the data, and there are no options to retry the
1246.88 -> processes that have failed. In a relational database, we get ACID
1252.16 -> transactions - Atomicity, Consistency, Isolation, and
1255.8 -> Durability. If a transaction fails in a relational database,
1259.64 -> the transaction gets rolled back. But in distributed systems
1263.6 -> like microservices, two-phase commit is not an option as the
1267.16 -> transaction is distributed across various databases. In
1272 -> this case, the solution is to use the Saga orchestration
1274.72 -> pattern. If any transaction fails in the workflow, the Saga
1278.84 -> orchestrator executes a series of compensating transactions
1282.52 -> that revert the changes that were made by the preceding
1285.72 -> transactions. Let's imagine that a customer books the hotel when
1291.32 -> the flight tickets are almost sold out; another customer secures
1295.16 -> the last flight tickets, and the flight becomes unavailable. As we all
1298.36 -> know this can happen when multiple customers are competing
1301.56 -> to get tickets at the same time. In this case, the flight booking
1305.56 -> step will fail in the workflow. The orchestrator will
1309.88 -> now execute the compensatory transactions: it will run the revert
1313.56 -> flight booking and remove hotel booking steps, and return the
1317 -> status as failed. As you can see, for every action that makes
1323.04 -> a change to the data store, there is an opposite action that
1326.24 -> compensates the change in the case of a failure. It's almost
1330.36 -> like Newton's Third Law - for every action there is an equal
1333.44 -> and opposite reaction. A failure that happens in the last step,
1338.24 -> the payment service in this case, will cause all the compensatory
1341.84 -> transactions to be executed, before returning a failed state.
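To make the compensation wiring concrete, here is a rough CDK for .NET sketch of a payment step with its compensating branch; the function names, handlers, and asset paths are assumptions based on the talk, not the actual TravelBuddy code.

```csharp
using Amazon.CDK;
using Amazon.CDK.AWS.Lambda;
using Amazon.CDK.AWS.StepFunctions;
using Amazon.CDK.AWS.StepFunctions.Tasks;
using Constructs;

public class SagaStack : Stack
{
    public SagaStack(Construct scope, string id) : base(scope, id)
    {
        // Hypothetical helper to keep the sketch short: one Lambda-backed task per step.
        LambdaInvoke Step(string name, string asset) =>
            new LambdaInvoke(this, name, new LambdaInvokeProps
            {
                LambdaFunction = new Function(this, name + "Fn", new FunctionProps
                {
                    Runtime = Runtime.DOTNET_6,
                    Handler = "Saga::Saga." + name + "::Handle",  // hypothetical handlers
                    Code = Code.FromAsset(asset)                  // hypothetical paths
                })
            });

        var processPayment   = Step("ProcessPayment", "src/payment/publish");
        var revertFlight     = Step("RevertFlightBooking", "src/flight/publish");
        var removeHotel      = Step("RemoveHotelBooking", "src/hotel/publish");
        var bookingFailed    = new Fail(this, "BookingFailed");
        var bookingSucceeded = new Succeed(this, "BookingSucceeded");

        // Compensation chain: undo the flight, then the hotel, then report failure.
        var compensate = revertFlight.Next(removeHotel).Next(bookingFailed);

        // Any error in the payment step triggers the compensating transactions.
        processPayment.AddCatch(compensate, new CatchProps { Errors = new[] { "States.ALL" } });

        new StateMachine(this, "TripSaga", new StateMachineProps
        {
            Definition = processPayment.Next(bookingSucceeded)
        });
    }
}
```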
1348.08 -> The state machine makes it very easy for us to configure the
1351.08 -> compensatory transactions in case of failures. Here you can
1355.36 -> see how the Saga orchestration has been implemented, with the
1358.56 -> Amazon States Language. This shows the payment processing
1361.92 -> step of the Saga orchestration pattern; this can be easily
1365.72 -> extended to define the orchestration of a complex
1368.24 -> transaction. So to summarise - we solve intermittent network
1374.32 -> failures with retries in code, service failures with AWS Step
1379.32 -> Functions retry workflow, service delays and performance
1383.36 -> issues with the circuit breaker pattern, and distributed
1387.6 -> transactions using the Saga orchestration pattern. These are
1392.28 -> solutions to known problems. Once outages happen, it is easy
1396.12 -> enough to analyse what went wrong. Of course, you have to
1399.76 -> get the data, metrics, and traces to determine why it occurred.
1404.08 -> But how do you find issues that have not occurred yet?
1408.68 -> There are two types of complexity in software
1410.8 -> development. One is accidental complexity. Accidental
1414.8 -> complexity is generated by the developers themselves, like bad
1418.52 -> variable names, or bad API definitions. The other is
1422 -> essential complexity. It is the complexity of the problem that
1426.2 -> we want to solve. It is the inherent complexity of the
1429.48 -> domains, so we can't remove it at all. The traditional way of
1433.8 -> tackling this challenge has been testing. But as distributed
1437.44 -> systems have become more complex, testing can only verify the
1441.88 -> known conditions in isolated environments. Answering
1445.88 -> questions like - is this what we expected, or is this the
1449.68 -> result of that function? You know those kinds of questions.
1454.2 -> How can you test something you don't know yet? One way to
1458.28 -> deal with the unpredictable is using chaos engineering. By
1462.12 -> using chaos engineering, you can learn how the system behaves by
1466.04 -> running controlled experiments on your system. For distributed
1470.08 -> systems, even for a single request, many services are
1473.12 -> involved. So we need to aggregate information and
1476.12 -> visualise it to understand the system status, and to trace the
1479.6 -> request. This is what we call observability. And without
1483.48 -> observability you don't have chaos engineering, you will just
1486.2 -> have chaos. If you can't observe the behaviour of your system you
1490.56 -> can't learn from your experiments. This is the chaos engineering
1494.64 -> process. Firstly we define the steady state. Next we make a
1498.68 -> hypothesis and run an experiment by injecting faults.
1502.52 -> After the experiment, we check the system behaviour and verify
1505.52 -> the hypothesis. If we see any difference between the
1508.6 -> hypothesis and actual behaviour, we need to improve the system.
1512.48 -> Through iterations of this process, we can improve the
1515.6 -> system and gain confidence in the system's capability to
1518.92 -> withstand turbulent conditions in the production environment.
1522.92 -> Fault Injection Simulator is a fully managed service for chaos
1526.28 -> engineering; it helps you run experiments. Many similar
1530.12 -> services normally require the user to install an agent, but FIS
1533.64 -> does not. And you can configure a stop condition. So if experiments
1538.12 -> go wrong, you can stop them automatically. And is it really
1542.84 -> mandatory to do experiments in production? There are advantages
1547.28 -> of running an experiment in the production environment, but of
1550.44 -> course, it comes with blockers and risks. For example, if
1554.04 -> you're not familiar with some tools, there is a risk of making
1557.72 -> a mistake and you may unintentionally impact the
1560.24 -> users. So it's better that you start in the development or
1564.08 -> staging environment, and familiarise yourself with the
1566.96 -> test tools. In some cases, it may be enough to do the
1570.92 -> experiments in the development environment. You joined the AWS
1575.24 -> Summit to learn, and you can keep learning beyond the Summit by
1578.64 -> using these training resources. Skill Builder is our online
1583.56 -> learning centre that makes it easier for anyone from beginners
1587.36 -> to experienced professionals to build AWS Cloud skills. We
1591.92 -> offer 500+ free digital courses that can help you and
1595.8 -> your team build new cloud skills, and learn about the latest services.
1602.08 -> And with that, I would like to thank you again for
1605.32 -> taking the time to listen to my session. I really hope my
1609.32 -> explanations will help you apply the patterns to your own
1613.04 -> distributed applications. I just have one final request for you,
1617.8 -> and that's to fill in the session survey. It will only
1621.12 -> take a minute of your time and it really helps me out to know
1624.92 -> what you thought of the session. I hope you'll enjoy the rest of
1628.52 -> the AWS Summit.

Source: https://www.youtube.com/watch?v=NB3ei9pnHFA