AWS Summit ANZ 2022 - Build resilient microservices using fault-tolerant patterns (DEV5)
Aug 16, 2023
This session covers design patterns - including circuit-breaker and saga - for building resilient microservice architectures, and how they lead to application robustness, data consistency and service recovery across network calls. Learn how AWS Fault Injection Simulator helps to uncover performance bottlenecks and edge cases that are otherwise difficult to find in distributed systems. Learn more about AWS webinar series in Australia and New Zealand at https://go.aws/3ChL0Y6
Content
15.28 -> Hello everyone. Welcome to the
session on building resilient
19.04 -> microservices using fault-tolerant
design and patterns.
23.6 -> First off, I just want to say
that I really appreciate you
26.96 -> taking the time to watch this session.
My name is Anitha Deenadayalan.
31.8 -> I'm a Developer Specialist
Solutions Architect
34.32 -> here at AWS, and I'm based out
of Singapore. I work for the
38.96 -> Developer Acceleration (DevAx)
team. In our team, we help our
43.84 -> customers build secure, reliable,
and scalable modern applications
48.24 -> on AWS. You probably want to
know what you can learn from this
54.2 -> session today. I'll be talking
about how distributed
57.92 -> architectures require changes in
the way we think, and how
61.84 -> network plays a key role in the
design of the services. Then
66.08 -> I'll talk about some of the
commonly occurring problems in
69.12 -> microservice architectures, and
the design patterns to resolve
73.04 -> them. I will also talk about why
you should use chaos
76.8 -> engineering, and AWS Fault
Injection Simulator for testing
81.12 -> the distributed applications.
Today, I will discuss the
85.68 -> patterns in the context of a
trip booking application
89.56 -> TravelBuddy. TravelBuddy lets you
book hotels and flights. It's a
94.28 -> monolithic application, a typical
three-tier design: web tier, application
98.96 -> layer, and a relational
database. Monolithic
102.92 -> applications have most of the
functionality within a single
106.28 -> process or a container.
Internally, they can have
109.36 -> multiple components, layers, and
libraries. The application can
115.52 -> have complex interactions, but
they remain within a single
118.64 -> process. But if a particular
service, maybe you can think of
122.52 -> the flight service here - has to
be scaled, there is no way to
126.36 -> scale just the individual
service that's choking. The
129.96 -> entire application has to be
scaled to cater to the incoming
133.4 -> requests. When releasing
changes, the entire application has
137.6 -> to be regression tested and
released. Let's imagine that our
142.44 -> developers have been tasked to
re-write the code in the
146.04 -> microservices architecture.
Developers are now building
149.76 -> microservices to create hundreds
and sometimes thousands of
153.48 -> small interconnected software
components, and very often they
157.52 -> are using different technology
stacks. Here we should remember
161.96 -> that our developers are used to
writing code for monolithic
165.08 -> applications, so they frequently
disregard the unseen participant
169.12 -> in the communication - the
network. That's somewhat
172.16 -> understandable, given that many
middleware technologies have
176.08 -> tried to make the developer
experience of writing client
179.4 -> code to be very close to that
experience of calling a local
183.56 -> function. So if the developers
are not exposed to issues with
the network in their local tests
with mock data, they are less
192.16 -> likely to defend against them.
So what happens when
197.8 -> applications are written with
little error handling on
201.28 -> networking errors? If a network
outage occurs, such applications
206.04 -> infinitely wait for an answer
packet, permanently consuming
memory or other resources. And
when the services come back up
213.6 -> suddenly, an increased load might
hit a running system, causing
217.6 -> operations to return much slower
than anticipated. Without
221.56 -> timeouts or circuit breakers in
place, this increase in latency
225.56 -> will begin to compound, and may
even look like total
228.8 -> unavailability to the system's
users. Okay, let's look at the
235.48 -> TravelBuddy application again.
The different components are now
239.28 -> extracted into microservices,
and they are working together in
242.76 -> an event-driven architecture.
The services have their own
246.4 -> databases, a call to hotel
service may lead to multiple
250.48 -> calls - here they can be flight
service, payments service - before
254.84 -> returning an output to the
caller. Modern application
259.4 -> architectures bring in a lot of
benefits. They have smaller
262.64 -> blast radius for changes, they
have faster release cycles, they
266.52 -> have less constraints with
resources like CPU and memory,
270.08 -> and they can scale on demand.
You can also easily troubleshoot
273.96 -> and deploy changes at a service
level, rather than at an
277.32 -> application level. You can
choose the best technology for
280.76 -> each service. At the same time
now we have hundreds of
284.44 -> services. There could be network
failures, service failures,
288.2 -> service delays due to peak loads.
A single transaction may span
292.6 -> different databases, making it
harder to use two-phase commit.
298.2 -> So I'll be talking about some of
the defensive programming
301.04 -> techniques or patterns that you
can use to design applications
305.12 -> using microservices architectures.
Imagine that you're
308.32 -> calling an API to get
hotel information from the
311.2 -> TravelBuddy application and
there are intermittent network
314.44 -> issues. If you're in front of
your computer, you will probably
318.44 -> be hitting the refresh button
repeatedly until the call
321.32 -> succeeds. But what if another
process is calling this service?
326.16 -> How are you going to make the
same call again? A good way here
332.6 -> would be to implement retries
with backoff from the caller
335.84 -> side. You can define the number
of retries. Based on the
339.6 -> response status, you can retry
with an increased wait time
343.08 -> between the calls. You're
increasing the wait time because
346.68 -> you're providing time for the
network to recover. You may
350.16 -> overload the network bandwidth
if you retry too frequently.
355.44 -> I've provided the code here in .NET,
and you can choose to
358.36 -> implement this in any of the
programming languages
361.56 -> that you prefer.
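(The slide's .NET code isn't captured in the transcript; the fragment below is a minimal sketch of the same idea - retry on transient network failures with an increasing wait between attempts. The names such as GetHotelsAsync, the one-second initial delay and the retry count are illustrative assumptions.)

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class HotelClient
{
    private static readonly HttpClient Http = new HttpClient();

    // One initial attempt plus up to maxRetries retries with exponential backoff.
    public static async Task<string> GetHotelsAsync(string url, int maxRetries = 3)
    {
        var delay = TimeSpan.FromSeconds(1); // initial wait time

        for (var attempt = 1; ; attempt++)
        {
            try
            {
                var response = await Http.GetAsync(url);
                response.EnsureSuccessStatusCode();             // throws on 4xx/5xx
                return await response.Content.ReadAsStringAsync();
            }
            catch (Exception ex) when ((ex is HttpRequestException || ex is TaskCanceledException)
                                       && attempt <= maxRetries)
            {
                // Back off before the next attempt so the network has time to recover.
                await Task.Delay(delay);
                delay = TimeSpan.FromSeconds(delay.TotalSeconds * 2); // exponential backoff
            }
        }
    }
}
```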
366.48 -> The next scenario I want to
talk about is service throttling. Individual
services may be throttled when
370.56 -> there are too many requests.
Microservices communicate
373.76 -> through remote procedure calls,
and it's always possible that
377.16 -> transient errors could occur in
the network connectivity,
380.32 -> causing failures also. Here the
trip service sends an event
385.44 -> notification after processing.
The rewards service starts
388.88 -> processing when the event
arrives. This is an event driven
391.92 -> architecture and the services
are not aware of each other. So
395.96 -> the retry logic cannot be
implemented in the trip service
399.48 -> code as it causes tight
coupling. A better solution here
406.36 -> would be to use AWS Step
Functions to retry and backoff
410.44 -> instead of embedding the retry
in the individual services. So
414.76 -> if a service call fails, the
workflow can try again for a
418.2 -> defined number of times with an
interval and backoff rate. In my
422.64 -> example here, the workflow will
wait for three seconds at first,
426.52 -> and the backoff rate is 1.5. So
the workflow waits for three
430.6 -> times 1.5 in the second try,
which is 4.5 seconds. It will
434.88 -> continue to the next step if it
succeeds in one of the retries.
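(The workflow definition itself isn't shown in the transcript. As a rough sketch, the same 3-second interval and 1.5 backoff rate might be expressed like this with CDK for .NET, which the speaker uses later for the demo; the task name and the rewardsFunction reference are assumptions.)

```csharp
// Inside a CDK Stack constructor; rewardsFunction is an assumed Lambda construct.
using Amazon.CDK;
using Amazon.CDK.AWS.StepFunctions;
using Amazon.CDK.AWS.StepFunctions.Tasks;

var callRewards = new LambdaInvoke(this, "CallRewardsService", new LambdaInvokeProps
{
    LambdaFunction = rewardsFunction
});

// Wait 3 seconds before the first retry, then multiply the interval by 1.5
// on each subsequent attempt (3s, 4.5s, ...), up to MaxAttempts.
callRewards.AddRetry(new RetryProps
{
    Errors = new[] { "States.TaskFailed", "States.Timeout" },
    Interval = Duration.Seconds(3),
    BackoffRate = 1.5,
    MaxAttempts = 3
});
```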
441.2 -> So whenever there is service
throttling, or intermittent
444.4 -> service failures, you can use
AWS Step Functions to retry the
448.6 -> action. When multiple microservices
collaborate to handle
454.56 -> requests, one or more services
may become unavailable or
458.44 -> exhibit a high latency. If it is
a synchronous call, the entire
463.04 -> application can become slow, and
it will lead to poor user
466.32 -> experience. A timeout in the
recommendation service will
470.76 -> propagate to the caller during
synchronous execution. Imagine
476.64 -> the operation invoking the
service has timeouts, and it is
480.2 -> waiting for the service to
respond. The thread will be
483.52 -> blocked until the timeout period
expires. If there are many
concurrent requests, these blocked
requests may hold critical
490.92 -> system resources such as memory,
threads, and database
494 -> connections. Now, the
recommendation service team is
500 -> trying to find the cause of the
issue and fix it, but as you can
504.44 -> see, the application has many
users, and they continuously
508.24 -> retry the operation. This can
cause the entire application to
512.28 -> go down. When faults are due to
unanticipated events, it might
518 -> take much longer to fix the
issue. These faults can range in
522.28 -> severity. They may cause partial
loss of connectivity, or they
526.52 -> could even lead to a complete
failure of the service. In these
530.32 -> situations, it might be
pointless for an application to
534.04 -> continually retry an operation
that's most likely going to
537.84 -> fail. Instead the application
should quickly accept that the
541.92 -> operation has failed, and handle
this failure. The solution here
547.52 -> is to use a circuit breaker
pattern. This pattern can
550.92 -> prevent an application from
repeatedly trying to execute an
554.4 -> operation that's likely to fail.
In this pattern, there is a
559.28 -> circuit breaker process that
routes the calls from the caller
562.92 -> to the callee. For example,
in our TravelBuddy application,
568.56 -> we can add a circuit breaker
process between the trip and the
572 -> recommendation services. When
there are no failures, circuit
577.24 -> breaker routes all the calls to
the recommendation service. If
582.36 -> the recommendation service times
out, the circuit breaker can
585.44 -> detect the timeout and track the
failure.
589.04 -> If the timeouts exceed a
specific threshold value, the
592.52 -> circuit is open. Once the
circuit is open, the circuit
596.44 -> breaker object does not route
any calls to the recommendation
599.44 -> service. Instead, it just
returns an immediate failure.
604.88 -> Meantime, as you can see, the
recommendation service team is
608.4 -> working on fixing the issue. The
circuit breaker can periodically
612.72 -> retry to see if the calls to the
recommendation service are
615.84 -> successful. If you see here, the
service team is taking time to
620.16 -> fix and the service is still not
available. During the retry if
625.08 -> the call to the recommendation
service succeeds, the circuit is
628.08 -> closed, and all further calls
are routed to it again.
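(Before the Step Functions-based demo, here is a plain in-process sketch of the behaviour just described - track failures, open the circuit past a threshold, fail fast while it is open, and let a call through again after a wait. The threshold and timings are illustrative assumptions.)

```csharp
using System;
using System.Threading.Tasks;

public class CircuitBreaker
{
    private readonly int _failureThreshold;
    private readonly TimeSpan _openDuration;
    private int _consecutiveFailures;
    private DateTime _openUntil = DateTime.MinValue;

    public CircuitBreaker(int failureThreshold = 3, TimeSpan? openDuration = null)
    {
        _failureThreshold = failureThreshold;
        _openDuration = openDuration ?? TimeSpan.FromSeconds(30);
    }

    public async Task<T> CallAsync<T>(Func<Task<T>> callee)
    {
        // While the circuit is open, fail immediately instead of calling the service.
        if (DateTime.UtcNow < _openUntil)
            throw new InvalidOperationException("Circuit is open; failing fast.");

        try
        {
            var result = await callee();
            _consecutiveFailures = 0;   // a success closes the circuit again
            return result;
        }
        catch
        {
            // Track the failure; open the circuit once the threshold is exceeded,
            // so calls only go through again after the wait period expires.
            if (++_consecutiveFailures >= _failureThreshold)
                _openUntil = DateTime.UtcNow.Add(_openDuration);
            throw;
        }
    }
}
```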
633.76 -> Next, I'll be showing you a demo of
the circuit breaker pattern
636.52 -> using AWS Step Functions. This
is the architecture for today's
641.68 -> circuit breaker pattern demo.
AWS Step Functions provides the
646.32 -> circuit breaker capabilities
here. I've designed the circuit breaker
650.88 -> as a generic construct so that
it's service agnostic. This is
654.68 -> one of the ways of implementing
the circuit breaker pattern.
658.92 -> Trip service triggers a Step
Functions workflow with the name
of the service it needs to call
through the circuit breaker. In
666.64 -> this example, recommendation
service. We can use the same circuit
671.24 -> breaker with other services. For
example, hotel service can call
675.28 -> payment service with no caller
configuration changes. I'm using
680.96 -> Amazon DynamoDB here for storing
the degraded service status, but
685.44 -> you can also use an in-memory
data store like Amazon
688.76 -> ElastiCache for lower-latency access.
In my demo, I'll first degrade the
695.2 -> recommendation service by
injecting timeouts, and show you
699.08 -> how the service status is stored
in the DynamoDB table. Then I'll
703.96 -> call this degraded service and
show you how the service returns
707.44 -> an immediate failure for
subsequent calls. After a
711.36 -> specified wait, the circuit
breaker will call the degraded
714.92 -> service again. I'm now in the
Step Functions console. I've
721.88 -> already created the Step
Function workflow using AWS CDK.
With CDK, you can create your
cloud application resources
729.4 -> using familiar programming
languages. I'll click on the
736.28 -> definition to see the state
machine now. This is the circuit
741.6 -> breaker state machine
definition. It gets the circuit
746.12 -> status from the DynamoDB
database, checks whether the
749.48 -> circuit is open or closed. If
the circuit is closed, it
752.64 -> executes the Lambda function. If
the Lambda function executes
756.08 -> successfully, the workflow exits
without any error. If the Lambda
760.92 -> function times out or returns
errors repeatedly, the service is marked as
765.2 -> degraded in the update circuit
status step with an expiry
768.96 -> timestamp. The expiry timestamp
is required so that the service
773.48 -> call can be retried after a
short wait time. So if the
777.8 -> circuit is open, then an
immediate failure is returned to
780.72 -> the caller without executing the
Lambda function. To get the
786.76 -> circuit status, I'm using the
AWS SDK service integrations to
791.84 -> query the DynamoDB database. If
the circuit is closed, then the
795.96 -> query will return a count of
zero. Otherwise, it will return
800.68 -> a non-zero positive number.
Execute Lambda step here
805.6 -> provides the retry with
exponential backoff
808.48 -> capabilities. If this step times
out three consecutive
812.64 -> times, the service is marked as degraded
by storing an item in the
816.2 -> DynamoDB table. As I mentioned
earlier, you can modify the
820.4 -> data store to be an in-memory
data store instead of using
824.2 -> DynamoDB to get low latency
access. So I have used CDK for
829.76 -> .NET for creating the
infrastructure. First I create
833.92 -> the roles for the Lambda
functions and the Step Functions.
840.28 -> Next, I'm creating the circuit
breaker table in DynamoDB with
844.04 -> the service name as partition
key, and the expiry timestamp as
847.4 -> the sort key. Then I'm creating
the API gateway and the trip
851.24 -> resource. I then integrate the
trip service Lambda function to
855.32 -> the API Gateway resource
endpoint.
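(A rough CDK for .NET sketch of that infrastructure is shown below; the attribute names, runtime, handler and asset path are assumptions rather than the demo's exact code, and the IAM role creation is omitted.)

```csharp
// Inside a CDK Stack constructor.
using Amazon.CDK.AWS.APIGateway;
using Amazon.CDK.AWS.DynamoDB;
using Amazon.CDK.AWS.Lambda;

// Circuit breaker table: service name as partition key, expiry timestamp as sort key.
var circuitBreakerTable = new Table(this, "CircuitBreakerTable", new TableProps
{
    PartitionKey = new Amazon.CDK.AWS.DynamoDB.Attribute
    {
        Name = "ServiceName", Type = AttributeType.STRING
    },
    SortKey = new Amazon.CDK.AWS.DynamoDB.Attribute
    {
        Name = "ExpiryTimeStamp", Type = AttributeType.NUMBER
    },
    BillingMode = BillingMode.PAY_PER_REQUEST
});

// Trip service Lambda function and its API Gateway endpoint.
var tripFunction = new Function(this, "TripService", new FunctionProps
{
    Runtime = Runtime.DOTNET_6,
    Handler = "TripService::TripService.Function::Handler",
    Code = Code.FromAsset("src/TripService/publish")
});

var api = new RestApi(this, "TravelBuddyApi");
var trip = api.Root.AddResource("trip");
trip.AddMethod("POST", new LambdaIntegration(tripFunction));
```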
873.8 -> So the Step Functions definition
I showed you just now, this is
878.08 -> how I created it. I'm creating
individual task states here and
882.56 -> chaining them up together for
the workflow. As you see here,
887.88 -> I'm creating the state machine
type as Standard to have the
891.68 -> visual workflow. If you're using
this code in production
895 -> workloads, use the state machine
type as Express to get a low
898.96 -> latency execution. I'm
storing the state machine ARN
904.16 -> in an environment variable of
the trip service Lambda
906.68 -> function. As I'll need this for
triggering the Step Functions
910.32 -> workflow from the trip service.
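(As a sketch, the chaining, the Standard workflow type and the environment variable might look like this in CDK for .NET; the state names, the JSON path and the variable name are assumptions, and the individual task states, as well as tripFunction, are assumed to be defined earlier in the stack.)

```csharp
// Inside the same CDK Stack constructor; getCircuitStatus, executeLambda,
// failFast and done are assumed task states defined above.
using Amazon.CDK.AWS.StepFunctions;

var definition = getCircuitStatus
    .Next(new Choice(this, "IsCircuitOpen")
        .When(Condition.NumberGreaterThan("$.circuitStatus.Count", 0), failFast)
        .Otherwise(executeLambda.Next(done)));

var stateMachine = new StateMachine(this, "CircuitBreakerStateMachine", new StateMachineProps
{
    Definition = definition,
    // Standard gives the visual execution history used in the demo;
    // Express is the better fit for low-latency production workloads.
    StateMachineType = StateMachineType.STANDARD
});

// The trip service reads this ARN at runtime to start the workflow.
tripFunction.AddEnvironment("STATE_MACHINE_ARN", stateMachine.StateMachineArn);
```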
This is the code for the trip
918.32 -> service Lambda function. I'm
reading the state machine ARN
921.8 -> from the environment variable
here. The function receives an
927.88 -> input parameter object that
contains the target Lambda name.
931.64 -> I serialise the JSON input and
trigger the Step Functions
934.76 -> workflow here; the workflow will
trigger the target Lambda
937.92 -> function based on the service status.
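(The handler itself isn't reproduced in the transcript; the sketch below shows the general shape using the AWS SDK for .NET Step Functions client, with illustrative class, property and environment-variable names.)

```csharp
using System;
using System.Text.Json;
using System.Threading.Tasks;
using Amazon.StepFunctions;
using Amazon.StepFunctions.Model;

public class TripServiceFunction
{
    private static readonly IAmazonStepFunctions StepFunctions = new AmazonStepFunctionsClient();
    private static readonly string StateMachineArn =
        Environment.GetEnvironmentVariable("STATE_MACHINE_ARN");

    // Input object carrying the name of the target Lambda to call through the circuit breaker.
    public record CircuitBreakerInput(string TargetLambda);

    public async Task<string> Handler(CircuitBreakerInput input)
    {
        // Serialise the input and trigger the Step Functions workflow; the
        // workflow decides whether the target Lambda is actually invoked.
        var execution = await StepFunctions.StartExecutionAsync(new StartExecutionRequest
        {
            StateMachineArn = StateMachineArn,
            Input = JsonSerializer.Serialize(input)
        });

        return execution.ExecutionArn;
    }
}
```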
941.84 -> The Step Functions workflow is service agnostic; you
can extract out the circuit
breaker calling logic here
into a separate DLL or Lambda
948.44 -> layers, and use it to integrate
the circuit breaker workflow for
951.72 -> any service. So far, I've shown
you the state machine definition,
956.52 -> CDK code for creating the
infrastructure, and trip service
code. I'll show the
execution next. First, I'll make
965.52 -> a call to degrade the
recommendation service through
968.44 -> Postman. I'm sending an extra
parameter here to simulate the
972.44 -> timeout. The recommendation service
will recognise this flag and
976.16 -> return a timeout. I'm clicking
on send now - we will see the
981.84 -> string "done" with status 200 OK.
990.16 -> Next, I'm going to show you the
state machine execution, where it
993.32 -> retries a service call and then
updates the service status in
996.8 -> the DynamoDB database. The state
machine is now executing the
1001.04 -> Lambda function; I've injected a
timeout to degrade the service.
1005.76 -> So the Lambda function will be
retried three times, as per
1009.76 -> my configuration. When it
finally times out, an item will
1014.08 -> be inserted in the DynamoDB
database indicating that the
1017.44 -> service has been degraded.
1035.2 -> Let's now look at the DynamoDB table.
The service name is the partition key
1040.64 -> and the expiry timestamp is the
sort key here. So if you see the output
1047.4 -> recommendation service is now
degraded. I'll now rerun the call
1051.68 -> and show you how the service
returns an immediate failure.
1057.44 -> I've removed the simulate
timeout flag and am calling the workflow
1060.04 -> again. I'll click on send.
1068.52 -> If you see here the state
machine has returned an
1070.88 -> immediate failure, as the service
is degraded. It does so by
1074.84 -> checking the DynamoDB table,
finding a record, and failing
1078.2 -> immediately. The recommendation
service is not called in this case.
1092.52 -> I'll call the recommendation
service again now. So the state
1096.32 -> machine has now called the
recommendation service, and it
1099.44 -> has run successfully. The
current time is greater than the
1103.04 -> timestamp of the item stored in
the DynamoDB table, and the
1106.8 -> service is no longer considered
degraded. I have used the AWS Step
1111.36 -> Functions Standard workflow for
the demo as it provides a
1114.68 -> visual interface, and it's easier
for me to show and explain to
1117.92 -> you. But if you're planning to
use this in production
1120.72 -> workloads, you can use the
Express workflow to achieve
1124.08 -> much higher performance and
lower latency. There are
1127.76 -> situations when a single
transaction can span multiple
1130.88 -> data stores. The basic principle
of microservices is that each
1135.32 -> service manages its own data. So
hotel service stores its data
1139.76 -> in its own database, flight
service in its own, and the
1143.12 -> payment service can track the
executed transactions in its
own database. Does anyone remember
1146.28 -> storing BLOB data in relational
1151 -> database columns? How about a
really long text? How many times
1155.6 -> have we converted data from one
representation to another, just
1159.56 -> so that we can store and track
that information in the
database? When using microservices,
services can choose
1169.04 -> the databases that suit their
needs. In our example, hotel
1173.68 -> service uses Amazon Aurora, a
relational database, and the
1177.68 -> flight service uses Amazon
1181.16 -> DynamoDB, which is our NoSQL
1181.16 -> data store. Payment service
1185.12 -> here, we can imagine, talks to an
1185.12 -> external payment SaaS. As I
mentioned earlier, microservices
1189.72 -> communicate through events to
indicate process changes.
1193.64 -> Imagine that a customer is
booking a trip. This may include
1197.52 -> booking the hotel, flight, and
making payment. You're seeing
1201.8 -> the happy path on the screen,
there are no failures in the
1204.6 -> workflow, and the order gets
placed successfully. If you
1208.64 -> notice, this is an example of a
distributed transaction with
1211.84 -> polyglot persistence. The
transaction data gets stored
1215.48 -> across different databases, and
each service writes to its own
1219.36 -> database. Let's imagine that
there was a network failure, and
1225.2 -> the payment gateway has timed
out. So the last step in the
1228.64 -> workflow has failed. At the time
of failure, the hotel database
1233 -> and the flight databases are
already updated. But if you see
1237.08 -> the data, it is now
inconsistent. Hotel and flights
1240.68 -> are booked, but the payment has
failed. And there is no way to
1243.92 -> correct the data, and there are
no options to retry the
1246.88 -> processes that have failed. In a
relational database, we get ACID
1252.16 -> transactions - Atomicity,
Consistency, Isolation, and
1255.8 -> Durability. If a transaction
fails in a relational database,
1259.64 -> the transaction gets rolled
back. But in distributed systems
1263.6 -> like microservices, two-phase
commit is not an option as the
1267.16 -> transaction is distributed
across various databases. In
1272 -> this case, the solution is to
use the Saga orchestration
1274.72 -> pattern. If any transaction
fails in the workflow, the Saga
1278.84 -> orchestrator executes a series
of compensating transactions
1282.52 -> that reverts the changes that
were made by the preceding
1285.72 -> transactions. Let's imagine that
a customer books the hotel when
1291.32 -> the flight tickets are almost
full, another customer secures
the flight tickets, and they
1298.36 -> become unavailable. As we all
1298.36 -> know this can happen when
multiple customers are competing
1301.56 -> to get tickets at the same time.
In this case, the flight booking
1305.56 -> step will fail in the
workflow. The orchestrator will
1309.88 -> now execute the compensatory
transactions: it will run the revert
1313.56 -> flight booking and remove hotel
booking steps, and return the
1317 -> status as failed. As you can
see, for every action that makes
1323.04 -> a change to the datastore there
is an opposite action that
1326.24 -> compensates the change in the
case of a failure. It's almost
1330.36 -> like Newton's Third Law - for
every action there is an equal
1333.44 -> and opposite reaction. A failure
that happens in the last step,
1338.24 -> payment services in this case,
will cause all the compensatory
1341.84 -> transactions to be executed,
before returning a failed state.
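(The session's slide expresses this in Amazon States Language; as a rough equivalent sketch in CDK for .NET, the payment step's failure can be caught and routed to the compensating steps. The step names are assumptions, and the LambdaInvoke task states are assumed to be defined elsewhere in the stack.)

```csharp
// Inside a CDK Stack constructor; bookHotel, bookFlight, processPayment,
// revertFlightBooking and removeHotelBooking are assumed task states.
using Amazon.CDK.AWS.StepFunctions;

// Compensating transactions run for the already-completed steps, then the workflow fails.
var compensate = revertFlightBooking
    .Next(removeHotelBooking)
    .Next(new Fail(this, "TripBookingFailed"));

// If processing the payment throws, route to the compensation chain
// instead of succeeding.
processPayment.AddCatch(compensate, new CatchProps
{
    Errors = new[] { "States.ALL" },
    ResultPath = "$.error"
});

var definition = bookHotel
    .Next(bookFlight)
    .Next(processPayment)
    .Next(new Succeed(this, "TripBooked"));
```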
1348.08 -> The state machine makes it very
easy for us to configure the
1351.08 -> compensatory transactions in
case of failures. Here you can
1355.36 -> see how the Saga orchestration
has been implemented, with the
1358.56 -> Amazon States Language. This
shows the payment processing
1361.92 -> step of the Saga orchestration
pattern, this can be easily
1365.72 -> extended to define the
orchestration of a complex
1368.24 -> transaction. So to summarise - we
solve intermittent network
failures with retries in code,
service failures with AWS Step
1379.32 -> Functions retry workflow,
service delays and performance
1383.36 -> issues with the circuit breaker
pattern, and distributed
1387.6 -> transactions using the Saga
orchestration pattern. These are
1392.28 -> solutions to known problems. Once
outages happen, it is easy
1396.12 -> enough to analyse what went
wrong. Of course you have to
1399.76 -> get the data metrics and traces
to determine why it occurred.
1404.08 -> But how do you find issues that
have not occurred yet?
1408.68 -> There are two types of
complexity in software
1410.8 -> development. One is accidental
complexity. Accidental
1414.8 -> complexity is generated by the
developers themselves, like bad
1418.52 -> variable names, or bad API
definitions. The other is
1422 -> essential complexity. It is the
complexity of the problem that
1426.2 -> we want to solve. It is the
inherent complexity of the
1429.48 -> domains, so we can't remove it
at all. The traditional way of
1433.8 -> tackling this challenge has been
testing. But as distributed
1437.44 -> systems have become more complex,
testing can only verify the
1441.88 -> known conditions in isolated
environments. Answering
1445.88 -> questions like - is this what we
expect, or is this the
1449.68 -> result of that function too? You
know those kinds of questions.
1454.2 -> How can you test something you
don't know yet? One way to
1458.28 -> deal with the unpredictable is
using chaos engineering. By
1462.12 -> using chaos engineering, you can
learn how the system behaves by
1466.04 -> running controlled experiments
on your system. For distributed
1470.08 -> systems, even for a single
request, many services are
1473.12 -> involved. So we need to
aggregate information and
1476.12 -> visualise it to understand the
system status, and to trace the
1479.6 -> request. This is what we call
observability. And without
1483.48 -> observability you don't have
chaos engineering, you will just
1486.2 -> have chaos. If you can't observe
the behaviour of your system you
1490.56 -> can't learn from your experiments.
This is the chaos engineering
1494.64 -> process. Firstly we define the
steady state. Next we make a
1498.68 -> hypothesis and run an
experiment by injecting faults.
1502.52 -> After the experiment, we check
the system behaviour and verify
1505.52 -> the hypothesis. If we see any
difference between the
1508.6 -> hypothesis and actual behaviour,
we need to improve the system.
1512.48 -> Through iterations of this
process, we can improve the
1515.6 -> system and gain confidence in
the system's capability to
1518.92 -> withstand turbulent conditions
in the production environment.
1522.92 -> Fault Injection Simulator is a
fully managed service for chaos
1526.28 -> engineering; it helps you run
experiments. Many similar
1530.12 -> services normally require the
user to install an agent, but FIS
1533.64 -> does not. And you can configure
a stop condition. So if experiments
1538.12 -> go wrong, you can stop them
automatically.
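(As a small illustrative sketch with the AWS SDK for .NET, an experiment can be started from an existing experiment template; the template itself - including its targets, actions and CloudWatch-alarm stop condition - is assumed to have been created already, and the template id is a placeholder.)

```csharp
using System;
using System.Threading.Tasks;
using Amazon.FIS;
using Amazon.FIS.Model;

public static class ChaosExperiment
{
    public static async Task RunAsync()
    {
        var fis = new AmazonFISClient();

        // Start an experiment from an existing FIS experiment template. If the
        // template's stop condition (for example a CloudWatch alarm) fires,
        // FIS stops the experiment automatically.
        var response = await fis.StartExperimentAsync(new StartExperimentRequest
        {
            ExperimentTemplateId = "EXT_TEMPLATE_ID_PLACEHOLDER",
            ClientToken = Guid.NewGuid().ToString()
        });

        Console.WriteLine($"Experiment started: {response.Experiment.Id}");
    }
}
```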
1542.84 -> And is it really mandatory to do experiments in
production? There are advantages
1547.28 -> of running an experiment in the
production environment, but of
1550.44 -> course, it comes with blockers
and risks. For example, if
1554.04 -> you're not familiar with some
tools, there is a risk of making
1557.72 -> a mistake and you may
unintentionally impact the
1560.24 -> users. So it's better that you
start in the development or
1564.08 -> staging environment, and
familiarise yourself with the
1566.96 -> test tools. In some cases, it
may be enough to do the
1570.92 -> experiments in the development
environment. You joined the AWS
1575.24 -> Summit to learn, and you can keep
learning beyond the Summit by
1578.64 -> using these training resources.
Skill Builder is our online
1583.56 -> learning centre that makes it
easier for anyone from beginners
1587.36 -> to experienced professionals
to build AWS Cloud skills. We
1591.92 -> offer 500+ free digital
courses that can help you and
1595.8 -> your team build new cloud skills,
and learn about the latest services.
1602.08 -> And with that, I would
like to thank you again for
1605.32 -> taking the time to listen to my
session. I really hope my
1609.32 -> explanations will help you apply
the patterns to your own
1613.04 -> distributed applications. I just
have one final request for you,
1617.8 -> and that's to fill in the
session survey. It will only
1621.12 -> take a minute of your time and
it really helps me out to know
1624.92 -> what you thought of the session.
I hope you'll enjoy the rest of
1628.52 -> the AWS Summit.
Source: https://www.youtube.com/watch?v=NB3ei9pnHFA