AWS re:Invent 2022 - Building confidence through chaos engineering on AWS (ARC307)

Distributed systems create opportunities to improve resilience that are not addressed by traditional approaches to development and testing. To solve these unknowns in distributed systems, chaos engineering was created 11 years ago with the mission to create methodologies and tools that help build a culture of resilience in the presence of unexpected outcomes. This session provides you with an understanding of what chaos engineering is and what it isn’t, where it provides value, and how you can create your own chaos engineering program to verify the resilience of your distributed mission-critical systems on AWS.



Content

0.12 -> - Welcome, everyone.
2.52 -> We're here today to talk about building confidence
6.75 -> through chaos engineering.
8.88 -> And in this session, you will learn
11.7 -> what chaos engineering is and what it isn't,
15.99 -> what the value is of chaos engineering,
19.74 -> and how you can get started with chaos engineering
22.98 -> within your own firms.
26.1 -> But more importantly, I will show you how you can combine
30.72 -> the power of chaos engineering and continuous resilience
35.7 -> and build a process that you can scale chaos engineering
39.9 -> across your organization in a controlled and secure way
44.7 -> to help your developers and engineers
47.73 -> with secure, reliable and robust workloads
51.87 -> that ultimately leads to a great customer experience.
56.91 -> My name is Laurent Domb, and so let's get started.
63.27 -> First, I will introduce you to chaos engineering,
68.04 -> and we're gonna see what it is and what it isn't.
72.42 -> I will also go through the various aspects
75.48 -> when we're thinking about prerequisites
78.03 -> for chaos engineering and what you need to get started.
82.98 -> We will then dive into continuous resilience
86.16 -> and why continuous resilience is so important
89.52 -> when we're thinking about resilient applications on AWS.
96.57 -> And combined with chaos engineering
98.46 -> and continuous resilience, I will walk you through
101.13 -> our Chaos Engineering/Continuous Resilience program
104.97 -> that we use to help our customers
107.31 -> to build chaos engineering practices and programs
110.25 -> that they can scale across their organizations.
114.18 -> And last, I will show you some great workshops
117.24 -> that we have in AWS where you can get started
120.81 -> with chaos engineering on your own.
126.6 -> So when you're thinking about chaos engineering,
129.81 -> chaos engineering is not new.
132.15 -> Chaos engineering has been around for over 10 years,
137.46 -> and there are many companies
138.9 -> that have already adopted chaos engineering
141.3 -> and have taken the mechanisms
144.39 -> in trying to find the known unknowns,
147.84 -> these are things that we are aware of
150.9 -> but don't fully understand in our systems,
153.84 -> and chase the unknown unknowns,
155.61 -> which are things that we are neither aware of
158.1 -> nor fully understand.
160.38 -> And through chaos engineering, these various companies
164.55 -> were able to find deficiencies within their environments
168.48 -> and prevent large scale events,
171.03 -> and therefore, ultimately,
172.98 -> have a better experience for their customers.
177.36 -> And yet when you're thinking about chaos engineering,
181.5 -> in many ways, it's not how we see chaos engineering.
187.23 -> There is still a perception that chaos engineering
189.78 -> is that thing which blows up production,
192.09 -> or where we randomly just shut down things
195.27 -> within an environment.
197.61 -> And that is exactly not what chaos engineering is about.
203.43 -> When we're thinking about chaos engineering,
206.55 -> we should look at it from a much different perspective.
210.84 -> Many of you have probably seen
213.69 -> the shared responsibility model for resilience.
218.58 -> When you're thinking about
219.57 -> the shared responsibility model for resilience,
222.27 -> there are two sections, the blue and the orange.
227.64 -> In the resilience of the cloud, we at AWS are responsible
232.98 -> for the resilience of the facilities,
235.02 -> the network, the storage, the services that you consume.
238.77 -> But you as a customer, you're responsible
242.1 -> for how and which services you use,
244.62 -> where you place, for example, your workloads.
246.93 -> Think about zonal services like EC2
251.43 -> where you place your data, and how you fail over
255.09 -> if something happens within your environment.
259.32 -> But think about the challenges that come
263.16 -> when you're looking at the shared responsibility model.
267.87 -> How can you make sure that if a service fails
272.19 -> that you're consuming, that is in the orange,
276.12 -> that your workload is resilient?
279.36 -> How do you know if something fails
283.2 -> that your workload can fail over?
286.92 -> And this is where chaos engineering comes into play.
290.82 -> When you're thinking about the workloads
292.56 -> that you are running in the blue,
295.89 -> what you can influence is the primary dependency
299.07 -> that you're consuming in AWS.
301.53 -> If you're using EC2, if you're using Lambda,
304.92 -> if you're using SQS, if you're using ElastiCache,
309.27 -> these are the services that you can impact
312.81 -> with chaos engineering in a safe and controlled way,
316.59 -> and you can figure out mechanisms
318.87 -> on how your components within your application
321.6 -> can gracefully fail over to another service.
326.28 -> So when you're thinking about chaos engineering,
330.69 -> what it provides you is improved operational readiness,
335.31 -> because your teams will get trained on what to do
339.27 -> if a certain service fails.
341.16 -> You will have mechanisms in place
343.74 -> to be able to fail over automatically.
346.32 -> You will have great observability in place
350.01 -> because you will realize what is missing
353.55 -> within your observability that you haven't seen
356.64 -> when you're running these experiments in a controlled way.
360.33 -> And ultimately, you will learn to build
363.6 -> more resilient workloads on AWS.
369.63 -> And when you're thinking about all these together,
371.64 -> what does it lead to?
372.84 -> Of course, happy customers.
375.18 -> And that's what chaos engineering is about.
377.76 -> It's all about us building great workloads
381.12 -> that ultimately lead to a great customer experience.
386.37 -> And so when you think about chaos engineering,
392.67 -> it's all about building controlled experiments.
398.58 -> If we know that an experiment will fail,
401.91 -> we're not gonna run the experiment.
404.37 -> If we know that we're gonna inject a fault
407.73 -> and that fault will trigger a bug
409.68 -> that brings down our system,
411.24 -> we're not gonna run the experiment.
413.61 -> We already know what happens.
416.648 -> And what we wanna make sure is if we have an experiment,
420.51 -> that, by definition, that experiment
423.09 -> should be tolerated by the system and should be fail-safe.
429.12 -> Because what we want to understand
432.36 -> is "Is our system resilient to component failures?"
438.3 -> Many of you might have a similar architecture
442.23 -> to the one you see here on this slide.
445.23 -> But when you're thinking about it,
446.97 -> let's say you're using Redis on EC2 or ElastiCache,
452.37 -> what's your confidence level if Redis fails?
455.58 -> Do you have mechanisms in place
457.02 -> to make sure that your database
458.52 -> does not get fully overrun with requests
460.89 -> if your cache suddenly fails?
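One common way to answer that question is to bound the fallback path. The sketch below is illustrative only (the cache and database objects are hypothetical interfaces): if the cache is unreachable, requests fall back to the database, but a semaphore caps how many may do so at once so a cold or failed cache cannot overrun the database.

```python
import threading

# Illustrative sketch: 'cache' and 'database' are hypothetical client objects.
# If the cache is down, fall back to the database, but cap concurrent database
# reads so a failed cache cannot overrun the database with requests.
db_fallback_slots = threading.BoundedSemaphore(value=50)  # tunable limit

def get_item(key, cache, database):
    try:
        value = cache.get(key)
        if value is not None:
            return value
    except ConnectionError:
        pass  # cache unavailable; degrade to the bounded database path below

    if not db_fallback_slots.acquire(blocking=False):
        # Shed load instead of letting every request pile onto the database.
        raise RuntimeError("database fallback saturated, shedding request")
    try:
        return database.read(key)
    finally:
        db_fallback_slots.release()
```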
463.68 -> Or what if you think about latency
466.26 -> that suddenly gets injected between two microservices
470.01 -> and you create a retry storm?
472.29 -> Do you have mechanisms to mitigate that
475.02 -> with exponential backoff and jitter?
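For reference, a minimal sketch of the exponential backoff with jitter mentioned here; the function and parameter names are illustrative, and the delay is capped with full jitter so many clients retrying at once do not synchronize into a retry storm.

```python
import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry a flaky call with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))  # full jitter
```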
479.31 -> And what if, on top of that, you have cascading failures
483.51 -> and an entire AZ goes out of commission?
487.2 -> Are you confident that you can fail over
489.18 -> from one availability zone to another?
494.07 -> And think about impacts that you might have
496.56 -> on a regional service.
499.98 -> What is your confidence level
502.05 -> that if an entire region, or a service in a region
505.08 -> that you rely on, is impacted,
507.51 -> and because of your SLAs,
509.13 -> you have to fail over into a secondary region?
512.73 -> What's your confidence level
514.53 -> that your runbooks and failover playbooks
516.6 -> are all up to date and you can say,
518.167 -> "Yes, I can run through them"?
521.25 -> And so when you're thinking about chaos engineering
525.12 -> and we're thinking about the services
527.04 -> that we build on a daily basis,
531.6 -> they're all based on tradeoffs
534.18 -> that we have every single day.
536.91 -> Now, when you're thinking about everyone here in the room,
539.79 -> we all want to build awesome workloads,
542.19 -> resilient workloads, robust workloads.
545.34 -> But the reality is we're all under pressure,
547.8 -> there's a certain budget that I can use,
551.28 -> there's a certain time that I need to deliver,
553.53 -> and certain features.
555.48 -> But in a distributed system,
557.61 -> there is no way that every single person
559.95 -> understands the hundreds or thousands of microservices
562.83 -> that communicate with each other.
565.86 -> And ultimately, what happens
567.21 -> is if I think that I'm depending on a soft dependency
571.35 -> where someone suddenly changes code,
573.24 -> that becomes a hard dependency, and what happens?
576.75 -> We suddenly have an event.
579.54 -> And when you're thinking about these events,
581.58 -> usually they happen,
583.2 -> you're somewhere in a restaurant or on vacation
585.512 -> and you get called at 2:00 in the morning,
587.73 -> and everybody runs and tries to fix
590.49 -> and bring the system back up.
593.79 -> And the challenge with this is once the system is back up,
597.72 -> you just go back to business as usual
601.77 -> until the same challenge happens again.
605.43 -> And it's not because we don't wanna fix it,
608.31 -> but it's because good intentions, they don't work.
611.4 -> And this is where mechanisms come into play
614.49 -> like chaos engineering and continuous resilience.
619.74 -> Now, I mentioned in the beginning
622.71 -> that there are many companies
625.05 -> that already have adopted chaos engineering.
629.04 -> And these are just some of the verticals
632.04 -> of all companies that have adopted chaos engineering,
635.64 -> and some of them already started five to six years ago.
639.9 -> But I want to give you a few examples
642.87 -> of a very regulated industry,
644.79 -> the financial services industry,
648.03 -> where you have very large companies
650.97 -> that have adopted chaos engineering.
654.84 -> One of these companies also spoke here
656.94 -> at re:Invent this year, which is Capital One.
660.54 -> Capital One wrote many great blog posts
663.03 -> that you can find under the Chaos Engineering Stories,
666.03 -> the link that you have here,
668.4 -> and have explained how they were thinking about building
671.55 -> their chaos engineering story and processes,
674.94 -> what they were looking at in regards to resilience
678.3 -> and readiness of their applications.
681.06 -> But they also, over five years,
683.52 -> have built a tool, Cloud Doctor, that uses various services
687.66 -> helping their developers to delineate their services,
692.13 -> fault injections and reports
694.86 -> when they execute the chaos experiments
697.53 -> to help them build better workloads on AWS.
702.18 -> There are others like the National Australia Bank
706.71 -> that looked at observability
709.77 -> and defined observability as being key to chaos engineering,
714.27 -> looking at aspects like errors, traffic,
717.96 -> various tracing, metrics and logging, as well as saturation
722.04 -> that has to be part of chaos engineering.
724.65 -> Because if they don't see that, they define this as chaos.
729.9 -> And then you have others like Intuit
732.42 -> that shared a great story about how they were thinking
736.23 -> migrating from on-premises to the cloud,
740.19 -> and how they were thinking about resilience,
742.89 -> and how the resilience was different
744.87 -> from doing an FMEA analysis after the fact,
747.93 -> going to chaos engineering
749.88 -> and trying to understand if one obsoletes the other,
753.54 -> but realized that they still need both,
756.36 -> and also built a process to help their developers
759.96 -> to automate chaos experiments from start to end.
764.88 -> And there are many more stories like that
767.67 -> that I could talk about today.
769.26 -> Unfortunately, we don't have enough time.
771.87 -> But if you're flying back today or tomorrow, look at the URL
775.691 -> and there are many of these stories
777.12 -> that can help you get started with chaos engineering.
781.5 -> So there are many more customers
784.02 -> that will adopt chaos engineering next year.
791.94 -> There's a great study by Gartner that was done
791.94 -> for the infrastructure and operations leader's guide
795.24 -> that said that 40% of companies
798.39 -> will adopt chaos engineering next year.
802.02 -> And they're doing that because they think
804.66 -> that they can increase customer experience by 20%.
809.49 -> Think about how many more happy customers you're gonna have
813.27 -> with such a number.
815.91 -> So let's get to the prerequisites
817.59 -> on how you can get started with chaos engineering.
823.35 -> So first, you need basic monitoring,
827.61 -> and if you have observability, that's great.
831.24 -> Then, you need to have organizational awareness.
834.84 -> We need to think about real world events
838.5 -> when we're injecting our faults.
840.72 -> And then, of course,
841.89 -> if we find a deficiency within our environment,
846 -> we need to commit and go and fix it
847.59 -> whether it's resilience or security-focused.
850.02 -> So let's dive a little bit more into this.
855.06 -> So when you're thinking about metrics,
858.06 -> many of us have really great metrics.
861.696 -> In chaos engineering, we call metrics 'known knowns.'
866.46 -> These are things that we are aware of
869.01 -> and we fully understand.
871.98 -> And when you're thinking about metrics,
874.383 -> it's CPU percentage, it's memory, and so on,
877.92 -> and it's all great.
879.66 -> But in a distributed system,
882.18 -> you're gonna look at many, many different dashboards
884.43 -> and metrics to figure out what's going on
886.47 -> within your environment.
888.48 -> And so when we're starting with chaos engineering,
890.46 -> many times, when we're running the first experiment,
893.31 -> even if we're trying to make sure
894.78 -> that we're seeing everything,
895.89 -> we realize we can't see it.
900.33 -> And this is what leads us to observability.
904.8 -> Observability helps us find the needle in the haystack.
910.29 -> We can start looking at the highest level, at our baseline,
915.06 -> look at the graph.
917.01 -> And even if we have absolutely no idea what's going on,
920.07 -> we're gonna understand where we are.
923.19 -> We can drill down all the way to tracing,
925.83 -> like AWS X-Ray, and understand it,
928.47 -> but there are also other stacks
929.73 -> from an open source perspective.
931.44 -> And if you use them, that's perfectly fine.
935.1 -> So when you're thinking about observability,
936.897 -> and this is the key,
938.52 -> observability is based on three pillars.
943.14 -> You have metrics, you have logging, and you have tracing.
947.1 -> Now, why is that important?
948.33 -> Because you wanna make sure that you embed, for example,
952.08 -> metrics within your logs,
954.81 -> so that if you're looking at the high level,
958.53 -> steady state that you might have, and you wanna drill in,
961.95 -> that as soon as you get to the stage
963.69 -> from a tracing to a log,
965.16 -> that you see what was going on and can correlate.
968.4 -> And so at any point in time,
969.78 -> you understand where your application is.
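To make that concrete, here is a minimal sketch of embedding a metric directly in a log line using the CloudWatch Embedded Metric Format; the namespace, dimension, and metric names are placeholders. When such a line lands in CloudWatch Logs, the metric is extracted automatically, and the same record still carries the trace id for correlation.

```python
import json
import time

def log_transaction(latency_ms, trace_id):
    """Emit one log line in CloudWatch Embedded Metric Format (EMF): the record
    carries a metric plus the trace id, so a spike on a dashboard can be followed
    straight down to the traces and logs that produced it."""
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "PaymentService",   # placeholder namespace
                "Dimensions": [["Service"]],
                "Metrics": [{"Name": "TransactionLatency", "Unit": "Milliseconds"}],
            }],
        },
        "Service": "checkout",                   # placeholder dimension value
        "TransactionLatency": latency_ms,
        "traceId": trace_id,                     # correlate with X-Ray or your tracer
    }
    print(json.dumps(record))  # CloudWatch Logs extracts the metric from this line
```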
973.41 -> Let me give you an example.
976.92 -> When you're looking at this graph, every single one of you,
980.97 -> even if you have absolutely no idea what that workload is,
985.23 -> sees that there are a few issues.
989.28 -> You look at the spikes and you're gonna say,
990.997 -> "Hmm, something happened there."
994.17 -> And if we would drill down, we would've seen
996.51 -> that we have a process which ran out of control,
998.76 -> and suddenly, CPU spiked.
1001.7 -> Every one of you is able to look at that graph down here
1004.46 -> and say, "Wait a minute, why did this drop?"
1007.67 -> And if you would drill into it, you would realize
1009.83 -> that I had an issue with my Kubernetes cluster
1012.05 -> and the pods suddenly start restarting.
1016.22 -> And every one of you
1018.11 -> sees that we suddenly had a huge impact somewhere
1021.59 -> which caused 500 errors,
1023.39 -> which was caused by node failures within my cluster.
1027.71 -> That's what observability is about.
1030.89 -> We can look at the graph and we see either way
1034.94 -> where we have to go and drill into.
1037.64 -> Now, this is a very observability and SRE view,
1040.49 -> and, of course, we want developers to have a similar view,
1043.34 -> and this is where the tracing aspects come into play.
1046.85 -> You want to provide the developers
1049.1 -> with the aspects of understanding the interactions
1051.44 -> with the microservices.
1052.61 -> And especially when you're thinking
1053.87 -> about chaos engineering and experiments,
1056.93 -> you want them to understand
1058.1 -> what is the impact of the experiment.
1061.4 -> And what we shouldn't forget is the user experience
1064.49 -> and what the user sees when we're running these experiments.
1069.65 -> Because if you're thinking about that baseline
1071.75 -> and we're running an experiment,
1074.06 -> and the baseline doesn't move,
1075.47 -> means that the customer is super happy.
1077.81 -> And that also means that we're resilient to such a failure.
1084.14 -> So now that we understand the observability aspects,
1088.61 -> I'd like to move on to the organizational awareness.
1094.79 -> Now, what we have found
1097.19 -> is that when you're starting with a small team
1099.83 -> and you enable the small team on chaos engineering,
1104.03 -> and they build common faults
1107 -> that can be injected across the organization,
1110.09 -> and then enable the decentralized development teams
1112.91 -> on chaos engineering, that works fairly well.
1117.2 -> Now, why is that?
1118.85 -> If you're thinking about, many of you that sit in the room,
1121.94 -> you have hundreds if not thousands of development teams.
1125.42 -> There is no way that that central team
1127.28 -> will understand every single workload that is around you.
1129.89 -> There is also no way that that central team
1132.05 -> will get the power to basically inject failures everywhere.
1136.04 -> But those development teams already have IAM permission
1139.13 -> to access their environments
1140.75 -> and do things in their environments.
1142.88 -> And so it's much easier to help them run experiments
1147.29 -> than having a central team that runs it all.
1150.77 -> And that also helps with building customized experiments
1154.94 -> for those various development teams,
1156.95 -> that they eventually then can share with others
1158.927 -> and the learnings that came out of it.
1162.23 -> And key to all this, of course,
1163.76 -> is having an executive sponsor
1166.19 -> that helps you make resilience part of the journey
1170.03 -> of a software development life cycle,
1172.4 -> and also shift the responsibility for resilience
1175.79 -> to those development teams.
1179.72 -> And then we need to think about real world events
1183.65 -> and examples.
1186.68 -> Now, what we see most,
1187.94 -> when we're looking at all failures that our customers have,
1192.62 -> is code and configuration errors.
1195.35 -> And so think about the faults that you can inject
1198.2 -> when you're thinking about deployments,
1200.66 -> or think about the experiments that you can do and say,
1203.877 -> "Well, do we even realize that we have a faulty deployment
1206.87 -> and do we see it within observability?"
1211.34 -> And when you're thinking about infrastructure,
1213.41 -> what if you have an EC2 instance that fails,
1216.29 -> or suddenly, an EKS cluster
1218.15 -> where a load balancer doesn't pass traffic?
1221 -> Are you able to mitigate such events?
1225.71 -> What about data and state?
1228.77 -> This is not just about a cache drift,
1230.9 -> but what if, suddenly, your database runs out of disc space?
1235.19 -> Do you have mechanisms to, one, detect that,
1237.29 -> but to mitigate this as well?
1239.99 -> And then, of course, my favorite, which is dependencies.
1243.53 -> Do you understand all the dependencies
1246.14 -> that you have within your system?
1248.96 -> And also, third party dependencies.
1250.82 -> What if you're dependent on, let's say, a third party IdP?
1253.957 -> Do you have mechanisms in place to fall back?
1257.21 -> And how do you prove that you can?
1260.42 -> And then last, of course, natural disasters,
1263.45 -> when we're thinking about human error, for example,
1266.87 -> or, again, events like Hurricane Sandy and others.
1271.73 -> And how you can fail over and how you can simulate that,
1274.28 -> again, in a controlled way
1275.413 -> through chaos engineering experiments.
1279.98 -> And then the last prerequisite, truly, is about making sure
1286.7 -> that if we're finding a deficiency within our systems
1291.74 -> that is related to security or resilience,
1294.71 -> that we go and we remediate it.
1298.58 -> Because it's worth nothing if we build new features
1302.18 -> but our service is not available.
1304.43 -> So we need to have that executive sponsorship
1306.89 -> and we need to be able to prioritize these.
1311.39 -> And so this brings us to continuous resilience.
1316.58 -> And so when you're thinking about resilience,
1319.67 -> resilience is not a one time thing.
1322.94 -> Resilience should be part of our everyday life
1325.79 -> when we're thinking about building resilient workloads,
1328.55 -> from the bottom all the way up to the application itself.
1332.63 -> And so continuous resilience is a life cycle that helps us
1336.02 -> think about our workload from a steady state point of view
1341.42 -> and work towards mitigating events like we just went through
1346.16 -> from code and configuration all the way
1348.2 -> to the very unlikely events in disaster recovery
1351.89 -> to safe experimentation within our pipelines,
1356.06 -> outside of our pipelines,
1357.68 -> because errors happen all the time,
1359.6 -> not just when we provision new code,
1363.17 -> and making sure that we learn from the faults
1369.23 -> that surfaced during the experiments,
1371.12 -> that we learned to anticipate what to do,
1374.3 -> be able to mitigate these various faults,
1377.75 -> and then also provide those learnings
1379.82 -> throughout the organization
1381.44 -> so that others can learn from this experiment.
1385.7 -> And so when you take continuous resilience
1387.71 -> and chaos engineering, and you put them together,
1391.49 -> that's what leads us
1393.56 -> to the Chaos Engineering and Continuous Resilience program.
1398.45 -> And that's a program
1399.47 -> that we have built over the last two years at AWS
1403.16 -> and have helped many customers run through it,
1406.46 -> which enabled them to build a chaos engineering program
1409.49 -> within their own firm, and scale it
1412.34 -> across various organizations and development teams
1414.95 -> so that they can build controlled experiments
1417.65 -> within their environment.
1419.6 -> And so usually, when we're starting on this journey,
1424.97 -> it's a GameDay that we're preparing for.
1429.02 -> Not another GameDay, as you might think,
1431.06 -> where we're just running for two hours
1432.89 -> and we're checking if something was fine or not.
1435.32 -> Especially when we're starting out with chaos engineering,
1438.95 -> it's important to truly plan what we want to execute on.
1444.32 -> And so setting expectations is a big part of it.
1449.27 -> So key to that, because you're gonna need
1451.97 -> quite a few people that you want to involve,
1454.58 -> is project planning.
1456.41 -> And usually, the first time, when we do this,
1458.75 -> it might be between a week and three weeks
1461.39 -> where we're planning the GameDay,
1463.31 -> the various people that we want in the GameDay,
1465.98 -> like the chaos champion that will advocate
1468.44 -> the GameDay throughout the company.
1470.93 -> The development teams, if there are SREs,
1473.27 -> we're gonna bring them in.
1474.611 -> Observability and incident response.
1477.35 -> And then once we have all the roles
1479.15 -> and responsibilities for the GameDay,
1481.97 -> we're gonna think about what is it
1484.61 -> that we want to run chaos experiments on.
1487.79 -> And when you're thinking about chaos engineering,
1489.95 -> it's not just about resilience,
1491.63 -> it can be about security as well.
1494.45 -> And so the contribution is a list of what's important to you.
1498.98 -> That can be resilience, that can be availability,
1501.41 -> that can be security, that can be durability.
1503.69 -> That's something which you define.
1506.45 -> And then, of course, we wanna make sure
1508.82 -> that there is a clear outcome
1511.19 -> on what we wanna achieve with the chaos experiment.
1515.3 -> In our case, when we're starting out,
1517.76 -> what we want to prove to the organization and the sponsors
1521.84 -> is that we can run an experiment
1523.73 -> in a safe and controlled way
1525.35 -> without impacting our customers.
1527.87 -> And then we can take those learnings and share it,
1531.14 -> either if we found something or not,
1533.48 -> with our customers to be able to make sure
1538.1 -> that the business units understand
1539.99 -> how to mitigate future failures, if we found something,
1543.53 -> or have the confidence that we're resilient
1545.72 -> to the faults that we injected.
1549.14 -> So then we define the workload.
1553.49 -> And so for this presentation, I chose the workload,
1557.39 -> this is a payments workload and it's running on EKS
1562.1 -> and some databases and message brokers with Kafka.
1568.49 -> And so important there too
1569.78 -> is when you're choosing a workload,
1571.94 -> make sure that when you're starting out,
1574.28 -> don't choose the most critical workload that you have
1578.87 -> and then impact it and everyone is unhappy.
1581.03 -> Choose a workload that you know, even if degraded,
1584.72 -> if it has some customer impact, that it's still fine.
1588.08 -> And usually, we have metrics that allow that
1590.21 -> when you're thinking about SLOs for your service.
1594.41 -> So once you've chosen a workload, we're gonna make sure
1600.44 -> that our chaos experiments that we wanna run are safe,
1603.56 -> and we do that through a discovery phase for the workload.
1610.67 -> And so that discovery phase
1613.28 -> will involve quite a bit of architecture.
1617.66 -> We're gonna dive into it.
1619.34 -> All of you know the well-architected review.
1623.3 -> And so when we're thinking
1624.38 -> about the well-architected review,
1626.45 -> it's not just about clicking the buttons in the tool,
1630.23 -> but we're taking about a day
1633.05 -> to go through the various designs of the architecture,
1638.18 -> and we wanna understand how the architecture
1641.87 -> and the workloads and the components within your workloads
1644.84 -> speak to each other.
1646.79 -> What mechanisms do they have in place like retries?
1650.9 -> What mechanisms do they have in regards to circuit breakers
1653.63 -> and have you implemented them?
1656.09 -> Do you have runbooks and playbooks in place
1659.21 -> in case we have to roll back?
1664.37 -> And we wanna make sure
1666.53 -> that you have the observability in place,
1670.07 -> and for example, health checks as well
1672.68 -> when we execute something
1674.15 -> so that your system automatically can recover.
1679.22 -> And if we have all that information
1680.9 -> and we see that there is a deficiency
1685.01 -> that might impact internal or external customers,
1689.84 -> that's where we stop.
1692.15 -> If we have known issues,
1694.34 -> we're gonna have to go and fix these first
1696.62 -> before we move on within the process.
1700.67 -> Now, if everything is fine, we're gonna say,
1703.287 -> "Okay, let's move on to the definition of the experiments."
1709.34 -> And that's a very exciting part.
1712.07 -> So when you're thinking about our system
1714.77 -> that we just saw before,
1716.9 -> we can now think about what can go wrong
1719.33 -> within our environment,
1720.74 -> and if we already have or have not mechanisms in place.
1723.47 -> For example, if I have a third party provider IdP,
1727.7 -> do I have a break glass account in place
1729.47 -> where I can prove that I can log in if something happens?
1733.04 -> What about my EKS cluster?
1735.65 -> And if I have a node that fails,
1738.38 -> do I know the cold boot time for the node itself?
1741.5 -> Do I know the cold boot time for the pods themselves?
1744.08 -> And how long is it gonna take for me
1746.42 -> for these to be live again so that they can take traffic
1749.27 -> and my customers are not gonna get disconnected
1751.73 -> or have a bad experience?
1754.1 -> Or think about someone
1755.39 -> misconfiguring an auto scaling group and health checks,
1758.81 -> which suddenly marks most of the instances as unhealthy.
1763.64 -> Do you have mechanisms to detect that?
1766.67 -> And what does that mean, again, for your customers
1769.64 -> and the teams that operate the environment?
1776 -> And then think about this scenario
1778.58 -> where someone pushed the configuration change,
1780.71 -> and suddenly, your cluster cannot pull
1782.75 -> from your container registry anymore.
1785.6 -> That means that you cannot launch any containers.
1788.99 -> Do you have mechanisms to mitigate that?
1792.297 -> And there are a few more scenarios that we can think about
1794.96 -> like events with Kafka.
1798.32 -> Are you gonna lose messages if the broker suddenly reboots
1801.29 -> or you lose a partition?
1802.46 -> Do you have mechanisms in place to mitigate that?
1805.01 -> Or the Aurora database flipping over,
1808.46 -> do your applications know
1809.78 -> that they need to go to the other endpoint?
1813.86 -> And so these are all infrastructure faults
1815.78 -> and some of the developers which might sit there will say,
1818.157 -> "Yeah, I mean, you know, it's mostly easy to fix,"
1821.15 -> but think about latency.
1823.55 -> Latency and jitter, when you inject that,
1825.8 -> and the cascading effects
1827.03 -> that can happen within your system,
1828.53 -> these are all things we wanna make sure we understand.
1831.77 -> And with fault injection and controlled experiments,
1834.32 -> we're able to do that.
1836.03 -> And then lastly, think about challenges
1839.57 -> that your clients might have to connect to your environment.
1845.42 -> So for our experiment, what we wanted to achieve
1849.41 -> is prove that we can execute and understand
1852.89 -> a brownout scenario.
1856.19 -> What a brownout scenario is
1858.56 -> is that our client that connects to us
1861.08 -> expects the response in a certain amount of milliseconds.
1866.45 -> And if we do not provide that,
1868.13 -> the client is just gonna go and back off.
1871.1 -> But the challenge is, when you have a brownout,
1874.07 -> that your server still is trying to compute
1876.38 -> whatever it needs to compute to return to the client,
1879.2 -> and that's wasted cycles.
1881.36 -> And so that inflection point is called a brownout.
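As a small illustration of that inflection point, the hedged sketch below shows a client that only waits 100 milliseconds before backing off; the URL is a placeholder. The server may keep computing the response the client has already abandoned, and those are the wasted cycles that define the brownout.

```python
import requests

PAYMENTS_URL = "https://payments.example.internal/charge"  # placeholder endpoint

def charge(payload):
    try:
        # The client only waits 100 ms; if no answer arrives it gives up and backs off.
        return requests.post(PAYMENTS_URL, json=payload, timeout=0.1).json()
    except requests.exceptions.Timeout:
        # The server may still be computing a response nobody will read --
        # those wasted cycles are the brownout.
        return {"status": "deferred"}
```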
1885.47 -> Now, before we can think about an experiment
1887.81 -> to simulate a brownout within our EKS environment,
1890.93 -> we need to understand the steady state
1894.77 -> and what the steady state is and what it isn't.
1898.88 -> So when you're thinking
1899.713 -> about defining a steady state for your workload,
1903.71 -> that's the high-level top metric
1905.75 -> that you track for your service.
1907.73 -> So for example, for a payment system,
1910.28 -> that's transactions per second.
1913.22 -> When you're thinking about retail, that's orders per second.
1917.51 -> Streaming, for example, stream starts per second
1920.54 -> or playback start events, for media.
1924.53 -> And when you're looking at that line, you see very quickly
1927.86 -> if you have an order drop or a transaction drop,
1931.58 -> that something you injected within the environment
1934.1 -> probably caused that drop.
1937.25 -> Now, once we understand what that steady state is,
1943.22 -> we're gonna think about the hypothesis.
1947.6 -> And so the hypothesis is key
1950.21 -> when you're thinking about the experiment
1953.33 -> because the hypothesis will define at the end,
1956.51 -> did your experiment turn out as you expect
1960.53 -> or did you learn something new that you didn't expect?
1965.39 -> And so the importance here is, as you see,
1969.05 -> we're saying we're expecting a transaction rate,
1972.26 -> so 300 transactions per second,
1975.62 -> and we think that even if 40% of our nodes fail
1981.47 -> within our environment,
1984.17 -> still 99% of all requests to our APIs should be successful,
1990.35 -> so the 99th percentile,
1991.94 -> and return a response within 100 milliseconds.
1997.22 -> What we also would want to define
1999.74 -> is because we know our systems,
2001.51 -> we're gonna say, "Okay, based on our experience,
2004.33 -> the nodes should come back within five minutes.
2007.93 -> Pods should get scheduled within eight and be available,
2011.62 -> and traffic will flow again to those pods.
2016.66 -> And alerts will fire after three minutes."
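One way to make such a hypothesis executable is to encode its thresholds as CloudWatch alarms before the experiment. A sketch, with a hypothetical namespace and metric name, that mirrors the 100 ms p99 latency expectation and the roughly three-minute alerting window:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Placeholder namespace and metric name; thresholds mirror the hypothesis:
# p99 API latency must stay under 100 ms, evaluated over three one-minute periods,
# which roughly matches "alerts will fire after three minutes".
cloudwatch.put_metric_alarm(
    AlarmName="payments-api-p99-latency",
    Namespace="PaymentService",
    MetricName="ApiLatency",
    ExtendedStatistic="p99",
    Period=60,
    EvaluationPeriods=3,
    Threshold=100.0,                # milliseconds
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="breaching",   # no data during the experiment is itself a bad sign
)
```

The same alarm can later act as the stop condition for the experiment itself.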
2021.7 -> And once we are all agreeing on that hypothesis,
2025.03 -> then we're gonna go and fill out the experiment template.
2031.39 -> And so when you're thinking about
2033.64 -> the experiment itself and the template,
2036.55 -> we're gonna make sure
2037.57 -> that we're very clearly defining what we wanna run.
2041.8 -> We're gonna have the definition of the workload itself,
2045.61 -> what experiment and action we wanna run,
2047.98 -> in our case, it's gonna be terminating 40% of nodes,
2052.45 -> because if you're thinking about the steady state load
2055.03 -> you have on the system and you remove nodes,
2058.51 -> that's the brownout scenario which you wanna simulate,
2062.59 -> the environment that we're gonna run in,
2065.5 -> and in our case, we're always starting with the process
2068.32 -> in a lower environment and not in production,
2072.13 -> the duration that we wanna run,
2074.41 -> and now, think about as well,
2075.66 -> in this case, we're gonna run 30 minutes.
2078.4 -> But you might run experiments where you say,
2080.417 -> "I'm gonna run 30 minutes with five minutes intervals,"
2084.13 -> to make sure that you can look at the graphs
2086.05 -> based on the experiment,
2087.31 -> staggering experiments that you're running
2089.38 -> to understand the impact of the experiment.
2092.86 -> And then, of course,
2094.63 -> because we want to do this in a controlled way,
2096.79 -> we need to be very clear
2098.65 -> what the fault isolation boundary is of our experiment
2101.86 -> and we're gonna clearly define that as well.
2105.01 -> And the alarms that are in place
2106.9 -> that would trigger the experiment to roll back
2111.31 -> if it gets out of bound.
2113.32 -> And that's key, because we wanna make sure
2116.23 -> that we're practicing safe chaos engineering experiments.
2120.22 -> We also wanna make sure that we understand
2122.95 -> where is the observability and what are we looking at
2125.71 -> when we're running the experiment.
2128.83 -> And then you would also add the hypothesis again
2131.38 -> to the template as well.
2133.54 -> You also see two empty lines there,
2136.9 -> which are the findings and the correction of error.
2140.2 -> And when we're thinking about the experiment itself,
2143.32 -> good or bad, we're always gonna have an end report
2147.49 -> where we might celebrate that our system is resilient
2150.73 -> or we might celebrate that we found something
2153.1 -> that we didn't know,
2155.23 -> and just helped our organization
2157.12 -> to mitigate a large scale event.
2160.78 -> So once we have the experiment ready,
2165.46 -> we're gonna think about priming the environment
2168.43 -> for our experiment.
2171.25 -> But before we go there, I want to walk you through
2175.12 -> an entire cycle on how we execute an experiment.
2181.3 -> So first, we have to check if the system is still healthy,
2186.13 -> because if you remember, in the beginning, we said,
2189.527 -> "If we know the system will fail
2191.5 -> or the experiment will fail,
2193.63 -> we're not gonna run the experiment."
2196.09 -> So once we see that the system's healthy, we're gonna check,
2200.05 -> is the experiment that we wanted to run still valid?
2202.63 -> Because it might be that the developer already fixed the bug
2205.48 -> that we thought might exist if we run an experiment.
2209.56 -> And if we see it is, then comes something very important.
2215.59 -> We're gonna create a control and experimental group
2218.617 -> and we're gonna make sure that that's defined.
2222.37 -> And I'm gonna go into that in a few seconds.
2225.58 -> And if we see that the control and the experimental group
2228.1 -> is there and defined, and up and running,
2231.67 -> then we start generating load against the control
2235.21 -> and the experimental group in our environment.
2237.85 -> And we're checking again, is the steady state that we have
2242.29 -> within the tolerance that we think it should be or not?
2245.56 -> If it is within tolerance, then now, finally,
2248.29 -> we can go and run the experiment against the target.
2253.03 -> And then, again, we check,
2254.38 -> is it in tolerance based on what we think?
2256.57 -> And if it isn't, then the stop condition is gonna kick in
2259.6 -> and it's gonna roll back.
2261.43 -> And if it is, that experiment turns into a regression test.
2266.567 -> Why?
2267.4 -> Because now we understand
2269.53 -> what that experiment does to our system.
2271.75 -> We know we can mitigate it and it's predictable.
2276.97 -> So I mentioned the aspects
2279.91 -> of a control and experimental group.
2283.36 -> When you're thinking about chaos engineering
2285.43 -> and running experiments,
2288.19 -> the goal always is, one, that it's controlled,
2290.47 -> and two, that you have minimal to no impact
2292.96 -> to your customers when you're running it.
2296.83 -> So one way you can do that is what we call
2300.4 -> not just having synthetic load that you generate,
2304.12 -> but also synthetic resources.
2306.7 -> For example, you spin up new EKS clusters, synthetic ones:
2313.21 -> one that you inject a fault into
2315.61 -> and another one which stays healthy,
2317.56 -> in the same environment that you are in.
2320.44 -> And so you're not impacting existing resources
2324.76 -> that might have customer traffic,
2326.77 -> but new resources with exactly the same code base
2329.53 -> as the other ones where you understand what happens
2333.46 -> in a certain failure scenario.
2337 -> So once we prime the experiment
2338.65 -> and we see that control and experimental group are healthy,
2342.73 -> and I see the steady state,
2345.64 -> I can move on and think about running the experiment itself.
2352.69 -> Now, running a chaos engineering experiment
2355.45 -> requires great tools that let you run the experiment safely.
2364 -> And so when you're thinking about tools,
2365.92 -> there are various out there that you can use and consume.
2370.75 -> In AWS, we released Fault Injection Simulator
2374.65 -> last year in March.
2376.96 -> And when you're thinking about one of the first slides
2379.81 -> with the shared responsibility model for resilience,
2383.59 -> Fault Injection Simulator helps you quite a bit with that
2387.49 -> because the faults that you can inject,
2389.29 -> the actions that you can run
2392.47 -> are running against the AWS API directly.
2396.85 -> And you can inject faults against your primary dependency
2400 -> to make sure that you can create mechanisms,
2403.72 -> that you can survive a component failure within your system.
2408.31 -> Now, two faults and actions that I wanna highlight
2411.43 -> that we just recently released are the following.
2415.3 -> There is now integration with Litmus Chaos and Chaos Mesh.
2420.58 -> And the great thing about this is that now it provides you
2423.7 -> with a widened scope of faults that you can inject,
2426.55 -> for example, into your Kubernetes cluster
2429.61 -> to Fault Injection Simulator via a single pane of glass.
2434.41 -> And then we also released
2436.24 -> the Network Connectivity Disruption
2438.16 -> that allows you to simulate, for example,
2440.32 -> availability zone issues that you might have and events
2444.37 -> to understand how your workloads react
2447.61 -> if you suddenly have a disruption
2449.83 -> within an availability zone.
2452.26 -> And we're also working on various other deep checks
2455.95 -> where you can flip a switch and have impact
2458.62 -> into an AWS service during the experiment.
2462.4 -> And once the experiment is over,
2464.86 -> your service will just work just fine.
2467.77 -> And so that will help you
2469.54 -> to again validate and verify the entire process
2474.01 -> that your systems are able to survive component failures.
2479.05 -> Now, if you want to run actions
2480.97 -> against, let's say, EC2 systems,
2484.36 -> you also have the capability to run these through SSM.
2489.46 -> Now, think about it where these come into play.
2492.64 -> When we're thinking about running experiments,
2495.37 -> there are various ways
2496.57 -> on how you can create disruptions within the system.
2500.83 -> Let's say you have various microservices
2503.77 -> that run and consume a database.
2507.88 -> Now, you might say,
2508.713 -> "Well, how can I create a fault within the database
2511.78 -> without having impact to all those microservices?"
2517.12 -> And the answer to that is you can inject faults
2521.86 -> within the microservices itself, for example, packet loss,
2526.99 -> that would result in exactly the same behavior
2529.18 -> as the application not being able to write to the database
2532.12 -> because the traffic is not getting there.
2534.94 -> And so it's important there to widen the scope
2537.58 -> and think about the experiments that you can run
2541.36 -> and see with the actions that you have
2543.4 -> on how you can simulate those various experiments.
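As a sketch of that idea, the managed SSM document AWSFIS-Run-Network-Packet-Loss can drop a percentage of packets on an instance's network interface, which looks to the application much like the database being unreachable. The instance id and parameter values below are placeholders, and the parameter names are assumptions to verify against the document as published in your region:

```python
import boto3

ssm = boto3.client("ssm")

# Placeholder instance id; parameter names are assumptions -- check them against
# the AWSFIS-Run-Network-Packet-Loss document in your region before running.
ssm.send_command(
    DocumentName="AWSFIS-Run-Network-Packet-Loss",
    InstanceIds=["i-0123456789abcdef0"],
    Parameters={
        "LossPercent": ["15"],
        "Interface": ["eth0"],
        "DurationSeconds": ["300"],
        "InstallDependencies": ["True"],
    },
)
```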
2547.99 -> So in our case, because we want to do that brownout
2552.22 -> that I showed before, we would use the EKS action
2555.61 -> that can terminate a certain amount of nodes,
2558.1 -> a percentage of nodes within our cluster,
2560.74 -> and we would run them.
2563.38 -> Now, I mentioned that we want to have
2567.76 -> a tool that we can trust,
2570.19 -> so that if something goes wrong,
2573.4 -> an alert automatically kicks in
2576.13 -> and helps us roll back the experiment.
2580.24 -> And Fault Injection Simulator has these mechanisms.
2583 -> So when you build an experiment, you can define,
2586.39 -> what are my alarms that have to kick in
2589.24 -> based on the experiment?
2591.25 -> And if something goes wrong, the experiment stops
2594.67 -> and you can roll back the experiment.
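Putting the pieces together, here is a hedged sketch of what the experiment described above could look like as a Fault Injection Simulator experiment template created with boto3: it terminates 40% of the instances in one EKS node group and stops automatically if a CloudWatch alarm, such as the steady-state alarm sketched earlier, fires. The role, node group, and alarm ARNs are placeholders for your own resources.

```python
import boto3

fis = boto3.client("fis")

# Sketch of the experiment template described above. All ARNs are placeholders;
# the stop condition rolls the experiment back if the steady-state alarm fires.
fis.create_experiment_template(
    clientToken="payments-brownout-v1",
    description="Terminate 40% of EKS nodes; stop if the steady-state alarm fires",
    roleArn="arn:aws:iam::123456789012:role/fis-experiment-role",
    stopConditions=[{
        "source": "aws:cloudwatch:alarm",
        "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:payments-api-p99-latency",
    }],
    targets={
        "payments-nodes": {
            "resourceType": "aws:eks:nodegroup",
            "resourceArns": [
                "arn:aws:eks:us-east-1:123456789012:nodegroup/payments/ng-1/abc123",
            ],
            "selectionMode": "ALL",
        },
    },
    actions={
        "terminate-nodes": {
            "actionId": "aws:eks:terminate-nodegroup-instances",
            "parameters": {"instanceTerminationPercentage": "40"},
            "targets": {"Nodegroups": "payments-nodes"},
        },
    },
)
```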
2598.33 -> And so in our case, everything was fine
2602.32 -> and we said, "Okay, well, now we have confidence
2604.99 -> based on the observability that we have
2607.18 -> for this experiment to move up to the next environment."
2614.71 -> Now, here, it's key, as soon as you get into production,
2619.75 -> you have to think about the guardrails
2622.06 -> that are important in your production environment.
2627.52 -> When we're running chaos engineering experiments
2629.83 -> in production, especially when you're thinking
2632.17 -> about running them for the first time,
2634.09 -> please don't run them at peak hours.
2636.91 -> It's probably not the best idea.
2640.36 -> And also make sure, because in many ways,
2642.85 -> when you're running those experiments
2644.35 -> in lower level environments,
2645.79 -> your permissions might be much more permissive
2648.61 -> than you have them in a production environment.
2650.83 -> You gotta make sure
2652.39 -> that you have the observability in place,
2655.03 -> that you have the permissions
2656.2 -> to execute the various experiments,
2659.8 -> and that you have the observability in place
2662.05 -> to be able to see what's going on
2663.79 -> with that environment as well.
2666.4 -> Also key is to understand that the fault boundary changed
2670.42 -> because we're in production now.
2672.79 -> So make sure you understand that as well
2674.8 -> and be able to understand
2676.06 -> what is the risk if I'm running this experiment
2678.58 -> within the production environment.
2681.13 -> Because, again, we wanna make sure
2683.59 -> that we're not impacting our customers.
2688.18 -> And the last one, which we often see not being up to date,
2692.5 -> is that runbooks and playbooks are not up to date
2694.9 -> based on the latest changes that were made to the workload.
2699.16 -> So make sure you have all these, and if we see that we do,
2703.27 -> we're finally at the stage
2704.44 -> where we can think about moving up to production.
2709.27 -> So here, again, we're gonna think
2710.98 -> about priming the environment for production.
2715.12 -> And so you've seen this picture before
2717.19 -> in the lower level environment.
2719.83 -> But if we're in a production environment
2722.74 -> and we don't have a mirrored environment
2725.11 -> that some of our customers do where they split the traffic
2727.9 -> and have a chaos engineering environment in production
2730.15 -> and another environment,
2732.25 -> we can also use a canary
2734.2 -> to say that we're gonna take a percentage of real user traffic,
2739.774 -> and we're gonna start bringing that real user traffic
2742.87 -> to the control and experimental group.
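One way to realize that kind of split, assuming the control and experimental groups sit behind the same Application Load Balancer listener as separate target groups (the ARNs below are placeholders), is weighted forwarding:

```python
import boto3

elbv2 = boto3.client("elbv2")

# Placeholder ARNs: keep ~99% of real user traffic on the control group and
# send ~1% to the experimental group behind the same listener.
elbv2.modify_listener(
    ListenerArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:listener/app/payments/aaa/bbb",
    DefaultActions=[{
        "Type": "forward",
        "ForwardConfig": {
            "TargetGroups": [
                {"TargetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/control/111", "Weight": 99},
                {"TargetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/experiment/222", "Weight": 1},
            ],
        },
    }],
)
```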
2746.44 -> Now, keep in mind, at this point in time,
2748.42 -> nothing should go wrong.
2750.61 -> We have the control and experimental group there.
2752.77 -> We haven't injected the fault.
2754.24 -> We should be able to see from an observability perspective
2757.87 -> that's all thumbs up.
2760.6 -> And once we see that that truly happened,
2764.68 -> that's where your heart starts pounding most of the time
2767.38 -> when you're running it the first time.
2769.69 -> We're gonna get to running the experiment
2771.58 -> actually in production.
2775.87 -> But see, when you're thinking
2776.92 -> about running the experiment in production,
2780.91 -> it's very different from an event
2783.28 -> that happens out of nothing,
2785.53 -> where there's no one in the room, you have to page everyone,
2788.23 -> and people are running around.
2790.45 -> Like here, we already ran through all these stages
2794.26 -> that we have this confidence
2795.97 -> that our workload should be perfectly fine
2798.7 -> with what we've seen.
2801.31 -> And if there is anything that would happen,
2804.61 -> you have the entire team of experts in that room
2807.64 -> looking at that dashboard.
2809.95 -> And if they see that there is a spike
2811.81 -> which they didn't expect,
2812.86 -> that experiment is done,
2814.51 -> and you're gonna fix that issue right away when you see it.
2820.96 -> So the aspect of running experiments in production,
2824.05 -> especially when you're thinking about running it
2825.82 -> in a GameDay style is very different
2828.43 -> from when you're thinking about a real world event,
2831.04 -> and in many ways, helps you to find deficiencies
2834.22 -> within production environments as well
2836.59 -> that might not have occurred in lower level environments.
2842.23 -> So now, whether something happened or not,
2845.74 -> always, after a chaos engineering GameDay
2848.83 -> and after experiments that you also run automatically
2851.8 -> like at Capital One,
2854.77 -> we're gonna go into a post-mortem or correction of error.
2860.29 -> Now, key here, when we're thinking about chaos engineering
2865.21 -> is that we're very transparent and blameless
2868.78 -> in regards to what we have found within our system.
2871.93 -> Because only this way, we're gonna learn
2874.09 -> and be able to tell others
2876.4 -> what we've seen within the environment.
2879.34 -> But there's certain questions which we need to ask
2881.86 -> when we're looking at the experiment.
2883.42 -> For example, how did we communicate with each other
2886.51 -> during the experiment?
2888.58 -> Was there someone that we had to bring in
2893.08 -> to make sense of what we saw on the screen
2895.6 -> because there is this guy
2896.95 -> who just knows almost everything, but he wasn't there?
2901.87 -> And what are, for example, some of the mechanisms
2904.93 -> that we want to use to share the learnings, good or bad,
2909.34 -> that we found during the experiments?
2914.08 -> Now, think about it.
2915.1 -> You want to make sure that your business units
2918.07 -> and the various developers see those findings
2920.83 -> because you want to share these with them
2923.29 -> so that they don't make the same mistakes.
2925.87 -> It's advertisement for you that you can say,
2927.887 -> "Hey, I found these various issues,
2932.71 -> and for that, I was able to mitigate X, Y and Z."
2936.37 -> There's another customer
2937.63 -> that is also here at re:Invent, Riot Games.
2941.53 -> They wrote a great story about how they were looking at
2945.07 -> and did a chaos engineering GameDay on Redis
2949.03 -> and were able to find various issues within that,
2952.84 -> and also were able to help another team
2955.12 -> with chaos engineering and help them have a seamless launch,
2959.86 -> but previous to that, found faults within their environments
2963.64 -> in regards to configuration errors
2965.71 -> with load balancers and circuit breakers
2968.14 -> that weren't implemented.
2970.12 -> And they shared this with the development team
2972.19 -> so they see, "Wow, I get really a lot of value out of this."
2976.15 -> And so it's important to promote that as well.
2982 -> Now, we're not at the end yet.
2985.592 -> What we have done now is created a great level of confidence
2989.62 -> that we know we can run this experiment
2992.89 -> without impact to our environments.
2995.68 -> And this is where the continuous resilience aspect
2998.2 -> comes into play.
3000.36 -> As I said in the beginning, resilience is not an afterthought.
3003.27 -> It should be with you all the time.
3005.97 -> And this is where we're thinking about
3008.37 -> the automation of those experiments.
3013.2 -> Now, chaos experiments,
3014.34 -> you can think about it as individual experiments.
3016.77 -> If you're thinking about the organizational awareness
3019.53 -> that we had in the beginning,
3021.21 -> you will have various teams
3022.44 -> that have specific experiments they wanna run,
3025.41 -> and that's what you would run individually
3027.33 -> to prove a certain point.
3029.82 -> We would also have the experiments that we run in the pipeline.
3035.61 -> But as mentioned, we need to make sure
3038.58 -> that experiments also run outside of the pipeline
3042.03 -> because faults happen all the time.
3043.65 -> They don't just happen when I push my code.
3046.2 -> They happen day in, day out, morning, at night, whenever.
3051 -> And then, use the GameDays to bring the teams together
3054.93 -> and make sure that you understand
3057.09 -> not just the aspects of how I recover the apps,
3060.63 -> but also look at it
3061.65 -> from a continuity of operations perspective,
3064.83 -> on how your processes work,
3067.14 -> and whether people are alerted in certain ways
3069.51 -> when you're running experiments
3071.49 -> so that they see what they need to do or not.
3078.03 -> So to make it easier for our customers,
3081.51 -> we have built, of course, various templates and handbooks
3085.17 -> that when we're going through the experiments with them,
3088.41 -> we share, like the "Chaos Engineering Handbook"
3092.16 -> that shows the business value of chaos engineering
3094.86 -> and how it helps with resilience,
3097.62 -> the chaos engineering templates,
3099.75 -> as well as the correction of error template,
3102.87 -> and also the various aspects of the reports
3106.02 -> that we share with the customers
3108.18 -> when we're running through the program.
3112.35 -> Now, next, I just wanna show you some resources
3116.79 -> that you can start with
3118.56 -> when you're thinking about chaos engineering
3120.6 -> on your own time.
3123.72 -> And so we have great chaos engineering workshops
3130.38 -> that go from resilience to security.
3134.46 -> But when I run through these workshops,
3136.44 -> what I usually do is I start with the observability one.
3143.13 -> And the reason I do that is because that workshop
3147.69 -> builds an entire system that provides me
3151.71 -> with everything in the stack of observability
3154.62 -> and I have to do absolutely nothing to get it
3156.99 -> outside of pressing a button.
3159.87 -> And once I have that and I have the observability
3162.36 -> from top down to tracing and logging,
3165.75 -> I'm going to the chaos engineering workshop
3167.85 -> and I'm looking at the experiments that I have there
3170.13 -> and it starts with some database fault injection,
3172.83 -> and containers and EC2,
3174.48 -> and shows you how you can do that in the pipeline.
3177.78 -> You take those experiments
3179.22 -> and you run it against the pet shop
3181.05 -> within the observability workshop.
3184.38 -> And that gives you a great view
3186.27 -> of what's going on within your system.
3188.64 -> If you inject those faults,
3190.02 -> you will see them right away within those dashboards
3193.2 -> with no effort in observability.
3195.45 -> And that's important because, again,
3197.28 -> as the National Australia Bank said,
3198.847 -> "You need to see what's happening within your system,
3201.27 -> otherwise, even your best architecture
3203.04 -> is worth nothing."
3206.43 -> And another workshop, and there's a white paper
3208.53 -> that was released as well
3209.7 -> about fault isolation boundaries in AWS
3212.88 -> is the Testing Resilience Using Availability Zone Failures.
3217.77 -> Think about running chaos experiments
3220.32 -> that would trigger the availability zone failures.
3224.49 -> And this workshop also shows you
3226.38 -> how to use various embedded metric formats
3228.6 -> within CloudWatch Logs to be able to add these
3231.51 -> to the observability of the pet shop
3233.85 -> where you then see how it fails over.
3237.78 -> So to make it easier for you to find all these,
3242.28 -> I've provided you with some QR codes
3245.04 -> and the slides will be available as well
3247.95 -> where you can find those various workshops
3250.14 -> and blogs and then white papers
3252.06 -> that are interesting to read.
3256.89 -> So in this session, you've learned about
3260.76 -> what chaos engineering is and what it isn't,
3266.04 -> but more importantly, what the power of chaos engineering is
3269.34 -> and what you can achieve with it,
3271.5 -> and how you can build your own chaos engineering program
3276.39 -> and scale it to build more resilient
3279.24 -> and robust workloads in AWS
3282.96 -> that provide your customers with a better user experience.
3287.97 -> So if you have any questions, I will be down here
3291.33 -> for the next few minutes or 20 minutes or so.
3293.73 -> And else, wish you a great rest of re:Invent,
3297.06 -> and enjoy re:Play tonight.
3298.2 -> Thank you.
3299.059 -> (audience applauding)

Source: https://www.youtube.com/watch?v=tm5GEePP1PY