AWS re:Invent 2022 - Building confidence through chaos engineering on AWS (ARC307)

Distributed systems create opportunities to improve resilience that are not addressed by traditional approaches to development and testing. To solve these unknowns in distributed systems, chaos engineering was created 11 years ago with the mission to create methodologies and tools that help build a culture of resilience in the presence of unexpected outcomes. This session provides you with an understanding of what chaos engineering is and what it isn’t, where it provides value, and how you can create your own chaos engineering program to verify the resilience of your distributed mission-critical systems on AWS.



Content

0.12 -> - Welcome, everyone.
2.52 -> We're here today to talk about building confidence
6.75 -> through chaos engineering.
8.88 -> And in this session, you will learn
11.7 -> what chaos engineering is and what it isn't,
15.99 -> what the value is of chaos engineering,
19.74 -> and how you can get started with chaos engineering
22.98 -> within your own firms.
26.1 -> But more importantly, I will show you how you can combine
30.72 -> the power of chaos engineering and continuous resilience
35.7 -> and build a process that you can scale chaos engineering
39.9 -> across your organization in a controlled and secure way
44.7 -> to help your developers and engineers
47.73 -> with secure, reliable and robust workloads
51.87 -> that ultimately leads to a great customer experience.
56.91 -> My name is Laurent Domb, and so let's get started.
63.27 -> First, I will introduce you to chaos engineering,
68.04 -> and we're gonna see what it is and what it isn't.
72.42 -> I will also go through the various aspects
75.48 -> when we're thinking about prerequisites
78.03 -> for chaos engineering and what you need to get started.
82.98 -> We will then dive into continuous resilience
86.16 -> and why continuous resilience is so important
89.52 -> when we're thinking about resilient applications on AWS.
96.57 -> And combined with chaos engineering
98.46 -> and continuous resilience, I will walk you through
101.13 -> our Chaos Engineering/Continuous Resilience program
104.97 -> that we use to help our customers
107.31 -> to build chaos engineering practices and programs
110.25 -> that they can scale across their organizations.
114.18 -> And last, I will show you some great workshops
117.24 -> that we have in AWS where you can get started
120.81 -> with chaos engineering on your own.
126.6 -> So when you're thinking about chaos engineering,
129.81 -> chaos engineering is not new.
132.15 -> Chaos engineering has been around for over 10 years,
137.46 -> and there are many companies
138.9 -> that have already adopted chaos engineering
141.3 -> and have taken the mechanisms
144.39 -> in trying to find the known unknowns,
147.84 -> these are things that we are aware of
150.9 -> but don't fully understand in our systems,
153.84 -> and chase the unknown unknowns,
155.61 -> which are things that we are neither aware of
158.1 -> nor fully understand.
160.38 -> And through chaos engineering, these various companies
164.55 -> were able to find deficiencies within their environments
168.48 -> and prevent large scale events,
171.03 -> and therefore, ultimately,
172.98 -> have a better experience for their customers.
177.36 -> And yet when you're thinking about chaos engineering,
181.5 -> in many ways, it's not how we see chaos engineering.
187.23 -> There is still a perception that chaos engineering
189.78 -> is that thing which blows up production,
192.09 -> or where we randomly just shut down things
195.27 -> within an environment.
197.61 -> And that is exactly not what chaos engineering is about.
203.43 -> When we're thinking about chaos engineering,
206.55 -> we should look at it from a much different perspective.
210.84 -> Many of you have probably seen
213.69 -> the shared responsibility model for resilience.
218.58 -> When you're thinking about
219.57 -> the shared responsibility model for resilience,
222.27 -> there are two sections, the blue and the orange.
227.64 -> In the resilience of the cloud, we at AWS are responsible
232.98 -> for the resilience of the facilities,
235.02 -> the network, the storage, the services that you consume.
238.77 -> But you as a customer, you're responsible
242.1 -> for how and which services you use,
244.62 -> where you place, for example, your workloads.
246.93 -> Think about zonal services like EC2
251.43 -> where you place your data, and how you fail over
255.09 -> if something happens within your environment.
259.32 -> But think about the challenges that come
263.16 -> when you're looking at the shared responsibility model.
267.87 -> How can you make sure that if a service fails
272.19 -> that you're consuming, that is in the orange,
276.12 -> that your workload is resilient?
279.36 -> How do you know if something fails
283.2 -> that your workload can fail over?
286.92 -> And this is where chaos engineering comes into play.
290.82 -> When you're thinking about the workloads
292.56 -> that you are running in the blue,
295.89 -> what you can influence is the primary dependency
299.07 -> that you're consuming in AWS.
301.53 -> If you're using EC2, if you're using Lambda,
304.92 -> if you're using SQS, if you're using ElastiCache,
309.27 -> these are the services that you can impact
312.81 -> with chaos engineering in a safe and controlled way,
316.59 -> and you can figure out mechanisms
318.87 -> on how your components within your application
321.6 -> can gracefully fail over to another service.
326.28 -> So when you're thinking about chaos engineering,
330.69 -> what it provides you is improved operational readiness,
335.31 -> because your teams will get trained on what to do
339.27 -> if a certain service fails.
341.16 -> You will have mechanisms in place
343.74 -> to be able to fail over automatically.
346.32 -> You will have great observability in place
350.01 -> because you will realize what is missing
353.55 -> within your observability that you haven't seen
356.64 -> when you're running these experiments in a controlled way.
360.33 -> And ultimately, you will learn to build
363.6 -> more resilient workloads on AWS.
369.63 -> And when you're thinking about all these together,
371.64 -> what does it lead to?
372.84 -> Of course, happy customers.
375.18 -> And that's what chaos engineering is about.
377.76 -> It's all about us building great workloads
381.12 -> that ultimately lead to a great customer experience.
386.37 -> And so when you think about chaos engineering,
392.67 -> it's all about building controlled experiments.
398.58 -> If we know that an experiment will fail,
401.91 -> we're not gonna run the experiment.
404.37 -> If we know that we're gonna inject a fault
407.73 -> and that fault will trigger a bug
409.68 -> that brings down our system,
411.24 -> we're not gonna run the experiment.
413.61 -> We already know what happens.
416.648 -> And what we wanna make sure is if we have an experiment,
420.51 -> that, by definition, that experiment
423.09 -> should be tolerated by the system and should be fail-safe.
429.12 -> Because what we want to understand
432.36 -> is "Is our system resilient to component failures?"
438.3 -> Many of you might have a similar architecture
442.23 -> to the one you see here on this slide.
445.23 -> But when you're thinking about it,
446.97 -> let's say you're using Redis on EC2 or ElastiCache,
452.37 -> what's your confidence level if Redis fails?
455.58 -> Do you have mechanisms in place
457.02 -> to make sure that your database
458.52 -> does not get fully overrun with requests
460.89 -> if your cache suddenly fails?
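One common way to answer that question is to bound the fallback path. The sketch below is illustrative only (the cache and database objects are hypothetical interfaces): if the cache is unreachable, requests fall back to the database, but a semaphore caps how many may do so at once so a cold or failed cache cannot overrun the database.

```python
import threading

# Illustrative sketch: 'cache' and 'database' are hypothetical client objects.
# If the cache is down, fall back to the database, but cap concurrent database
# reads so a failed cache cannot overrun the database with requests.
db_fallback_slots = threading.BoundedSemaphore(value=50)  # tunable limit

def get_item(key, cache, database):
    try:
        value = cache.get(key)
        if value is not None:
            return value
    except ConnectionError:
        pass  # cache unavailable; degrade to the bounded database path below

    if not db_fallback_slots.acquire(blocking=False):
        # Shed load instead of letting every request pile onto the database.
        raise RuntimeError("database fallback saturated, shedding request")
    try:
        return database.read(key)
    finally:
        db_fallback_slots.release()
```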
463.68 -> Or what if you think about latency
466.26 -> that suddenly gets injected between two microservices
470.01 -> and you create a retry storm?
472.29 -> Do you have mechanisms to mitigate that
475.02 -> with exponential backoff and jitter?
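For reference, a minimal sketch of the exponential backoff with jitter mentioned here; the function and parameter names are illustrative, and the delay is capped with full jitter so many clients retrying at once do not synchronize into a retry storm.

```python
import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry a flaky call with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))  # full jitter
```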
479.31 -> And what if, on top of that, you have cascading failures
483.51 -> and an entire AZ goes out of commission?
487.2 -> Are you confident that you can fail over
489.18 -> from one availability zone to another?
494.07 -> And think about impacts that you might have
496.56 -> on a regional service.
499.98 -> What is your confidence level
502.05 -> that if an entire region, or a service in a region
505.08 -> that you rely on, is impacted,
507.51 -> and because of your SLAs,
509.13 -> you have to fail over into a secondary region?
512.73 -> What's your confidence level
514.53 -> that your runbooks and failover playbooks
516.6 -> are all up to date and you can say,
518.167 -> "Yes, I can run through them"?
521.25 -> And so when you're thinking about chaos engineering
525.12 -> and we're thinking about the services
527.04 -> that we build on a daily basis,
531.6 -> they're all based on tradeoffs
534.18 -> that we have every single day.
536.91 -> Now, when you're thinking about everyone here in the room,
539.79 -> we all want to build awesome workloads,
542.19 -> resilient workloads, robust workloads.
545.34 -> But the reality is we're all under pressure,
547.8 -> there's a certain budget that I can use,
551.28 -> there's a certain time that I need to deliver,
553.53 -> and certain features.
555.48 -> But in a distributed system,
557.61 -> there is no way that every single person
559.95 -> understands the hundreds or thousands of microservices
562.83 -> that communicate with each other.
565.86 -> And ultimately, what happens
567.21 -> is if I think that I'm depending on a soft dependency
571.35 -> where someone suddenly changes code,
573.24 -> that becomes a hard dependency, and what happens?
576.75 -> We suddenly have an event.
579.54 -> And when you're thinking about these events,
581.58 -> usually they happen,
583.2 -> you're somewhere in a restaurant or on vacation
585.512 -> and you get called at 2:00 in the morning,
587.73 -> and everybody runs and tries to fix
590.49 -> and bring the system back up.
593.79 -> And the challenge with this is once the system is back up,
597.72 -> you just go back to business as usual
601.77 -> until the same challenge happens again.
605.43 -> And it's not because we don't wanna fix it,
608.31 -> but it's because good intentions, they don't work.
611.4 -> And this is where mechanisms come into play
614.49 -> like chaos engineering and continuous resilience.
619.74 -> Now, I mentioned in the beginning
622.71 -> that there are many companies
625.05 -> that already have adopted chaos engineering.
629.04 -> And these are just some of the verticals
632.04 -> of all companies that have adopted chaos engineering,
635.64 -> and some of them already started five to six years ago.
639.9 -> But I want to give you a few examples
642.87 -> of a very regulated industry,
644.79 -> the financial services industry,
648.03 -> where you have very large companies
650.97 -> that have adopted chaos engineering.
654.84 -> One of these companies also spoke here
656.94 -> at re:Invent this year, which is Capital One.
660.54 -> Capital One wrote many great blog posts
663.03 -> that you can find under the Chaos Engineering Stories,
666.03 -> the link that you have here,
668.4 -> and have explained how they were thinking about building
671.55 -> their chaos engineering story and processes,
674.94 -> what they were looking at in regards to resilience
678.3 -> and readiness of their applications.
681.06 -> But they also, over five years,
683.52 -> have built a tool, Cloud Doctor, that uses various services
687.66 -> helping their developers to delineate their services,
692.13 -> fault injections and reports
694.86 -> when they execute the chaos experiments
697.53 -> to help them build better workloads on AWS.
702.18 -> There are others like the National Australia Bank
706.71 -> that looked at observability
709.77 -> and defined observability as being key to chaos engineering,
714.27 -> looking at aspects like errors, traffic,
717.96 -> various tracing, metrics and logging, as well as saturation
722.04 -> that has to be part of chaos engineering.
724.65 -> Because if they don't see that, they define this as chaos.
729.9 -> And then you have others like Intuit
732.42 -> that shared a great story about how they were thinking
736.23 -> migrating from on-premises to the cloud,
740.19 -> and how they were thinking about resilience,
742.89 -> and how the resilience was different
744.87 -> from doing an FMEA analysis after the fact,
747.93 -> going to chaos engineering
749.88 -> and trying to understand if one obsoletes the other,
753.54 -> but realized that they still need both,
756.36 -> and also built a process to help their developers
759.96 -> to automate chaos experiments from start to end.
764.88 -> And there are many more stories like that
767.67 -> that I could talk about today.
769.26 -> Unfortunately, we don't have enough time.
771.87 -> But if you're flying back today or tomorrow, look at the URL
775.691 -> and there are many of these stories
777.12 -> that can help you get started with chaos engineering.
781.5 -> So there are many more customers
784.02 -> that will adopt chaos engineering next year.
791.94 -> There's a great study by Gartner that was done
791.94 -> for the infrastructure and operations leader's guide
795.24 -> that said that 40% of companies
798.39 -> will adopt chaos engineering next year.
802.02 -> And they're doing that because they think
804.66 -> that they can increase customer experience by 20%.
809.49 -> Think about how many more happy customers you're gonna have
813.27 -> with such a number.
815.91 -> So let's get to the prerequisites
817.59 -> on how you can get started with chaos engineering.
823.35 -> So first, you need basic monitoring,
827.61 -> and if you have observability, that's great.
831.24 -> Then, you need to have organizational awareness.
834.84 -> We need to think about real world events
838.5 -> when we're injecting our faults.
840.72 -> And then, of course,
841.89 -> if we find a deficiency within our environment,
846 -> we need to commit and go and fix it
847.59 -> whether it's resilience or security-focused.
850.02 -> So let's dive a little bit more into this.
855.06 -> So when you're thinking about metrics,
858.06 -> many of us have really great metrics.
861.696 -> In chaos engineering, we call metrics 'known knowns.'
866.46 -> These are things that we are aware of
869.01 -> and we fully understand.
871.98 -> And when you're thinking about metrics,
874.383 -> it's CPU percentage, it's memory, and so on,
877.92 -> and it's all great.
879.66 -> But in a distributed system,
882.18 -> you're gonna look at many, many different dashboards
884.43 -> and metrics to figure out what's going on
886.47 -> within your environment.
888.48 -> And so when we're starting with chaos engineering,
890.46 -> many times, when we're running the first experiment,
893.31 -> even if we're trying to make sure
894.78 -> that we're seeing everything,
895.89 -> we realize we can't see it.
900.33 -> And this is what leads us to observability.
904.8 -> Observability helps us find the needle in the haystack.
910.29 -> We can start looking at the highest level, at our baseline,
915.06 -> look at the graph.
917.01 -> And even if we have absolutely no idea what's going on,
920.07 -> we're gonna understand where we are.
923.19 -> We can drill down all the way to tracing,
925.83 -> like AWS X-Ray, and understand it,
928.47 -> but there are also other stacks
929.73 -> from an open source perspective.
931.44 -> And if you use them, that's perfectly fine.
935.1 -> So when you're thinking about observability,
936.897 -> and this is the key,
938.52 -> observability is based on three pillars.
943.14 -> You have metrics, you have logging, and you have tracing.
947.1 -> Now, why is that important?
948.33 -> Because you wanna make sure that you embed, for example,
952.08 -> metrics within your logs,
954.81 -> so that if you're looking at the high level,
958.53 -> steady state that you might have, and you wanna drill in,
961.95 -> that as soon as you get to the stage
963.69 -> from a tracing to a log,
965.16 -> that you see what was going on and can correlate.
968.4 -> And so at any point in time,
969.78 -> you understand where your application is.
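To make that concrete, here is a minimal sketch of embedding a metric directly in a log line using the CloudWatch Embedded Metric Format; the namespace, dimension, and metric names are placeholders. When such a line lands in CloudWatch Logs, the metric is extracted automatically, and the same record still carries the trace id for correlation.

```python
import json
import time

def log_transaction(latency_ms, trace_id):
    """Emit one log line in CloudWatch Embedded Metric Format (EMF): the record
    carries a metric plus the trace id, so a spike on a dashboard can be followed
    straight down to the traces and logs that produced it."""
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "PaymentService",   # placeholder namespace
                "Dimensions": [["Service"]],
                "Metrics": [{"Name": "TransactionLatency", "Unit": "Milliseconds"}],
            }],
        },
        "Service": "checkout",                   # placeholder dimension value
        "TransactionLatency": latency_ms,
        "traceId": trace_id,                     # correlate with X-Ray or your tracer
    }
    print(json.dumps(record))  # CloudWatch Logs extracts the metric from this line
```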
973.41 -> Let me give you an example.
976.92 -> When you're looking at this graph, every single one of you,
980.97 -> even if you have absolutely no idea what that workload is,
985.23 -> sees that there are a few issues.
989.28 -> You look at the spikes and you're gonna say,
990.997 -> "Hmm, something happened there."
994.17 -> And if we would drill down, we would've seen
996.51 -> that we have a process which ran out of control,
998.76 -> and suddenly, CPU spiked.
1001.7 -> Every one of you is able to look at that graph down here
1004.46 -> and say, "Wait a minute, why did this drop?"
1007.67 -> And if you would drill into it, you would realize
1009.83 -> that I had an issue with my Kubernetes cluster
1012.05 -> and the pods suddenly start restarting.
1016.22 -> And every one of you
1018.11 -> sees that we suddenly had a huge impact somewhere
1021.59 -> which caused 500 errors,
1023.39 -> which was caused by node failures within my cluster.
1027.71 -> That's what observability is about.
1030.89 -> We can look at the graph and we see either way
1034.94 -> where we have to go and drill into.
1037.64 -> Now, this is a very observability and SRE view,
1040.49 -> and, of course, we want developers to have a similar view,
1043.34 -> and this is where the tracing aspects come into play.
1046.85 -> You want to provide the developers
1049.1 -> with the aspects of understanding the interactions
1051.44 -> with the microservices.
1052.61 -> And especially when you're thinking
1053.87 -> about chaos engineering and experiments,
1056.93 -> you want them to understand
1058.1 -> what is the impact of the experiment.
1061.4 -> And what we shouldn't forget is the user experience
1064.49 -> and what the user sees when we're running these experiments.
1069.65 -> Because if you're thinking about that baseline
1071.75 -> and we're running an experiment,
1074.06 -> and the baseline doesn't move,
1075.47 -> means that the customer is super happy.
1077.81 -> And that also means that we're resilient to such a failure.
1084.14 -> So now that we understand the observability aspects,
1088.61 -> I'd like to move on to the organizational awareness.
1094.79 -> Now, what we have found
1097.19 -> is that when you're starting with a small team
1099.83 -> and you enable the small team on chaos engineering,
1104.03 -> and they build common faults
1107 -> that can be injected across the organization,
1110.09 -> and then enable the decentralized development teams
1112.91 -> on chaos engineering, that works fairly well.
1117.2 -> Now, why is that?
1118.85 -> If you're thinking about, many of you that sit in the room,
1121.94 -> you have hundreds if not thousands of development teams.
1125.42 -> There is no way that that central team
1127.28 -> will understand every single workload that is around you.
1129.89 -> There is also no way that that central team
1132.05 -> will get the power to basically inject failures everywhere.
1136.04 -> But those development teams already have IAM permission
1139.13 -> to access their environments
1140.75 -> and do things in their environments.
1142.88 -> And so it's much easier to help them run experiments
1147.29 -> than having a central team that runs it all.
1150.77 -> And that also helps with building customized experiments
1154.94 -> for those various development teams,
1156.95 -> that they eventually then can share with others
1158.927 -> and the learnings that came out of it.
1162.23 -> And key to all this, of course,
1163.76 -> is having an executive sponsor
1166.19 -> that helps you make resilience part of the journey
1170.03 -> of a software development life cycle,
1172.4 -> and also shift the responsibility for resilience
1175.79 -> to those development teams.
1179.72 -> And then we need to think about real world events
1183.65 -> and examples.
1186.68 -> Now, what we see most,
1187.94 -> when we're looking at all failures that our customers have,
1192.62 -> is code and configuration errors.
1195.35 -> And so think about the faults that you can inject
1198.2 -> when you're thinking about deployments,
1200.66 -> or think about the experiments that you can do and say,
1203.877 -> "Well, do we even realize that we have a faulty deployment
1206.87 -> and do we see it within observability?"
1211.34 -> And when you're thinking about infrastructure,
1213.41 -> what if you have an EC2 instance that fails,
1216.29 -> or suddenly, an EKS cluster
1218.15 -> where a load balancer doesn't pass traffic?
1221 -> Are you able to mitigate such events?
1225.71 -> What about data and state?
1228.77 -> This is not just about a cache drift,
1230.9 -> but what if, suddenly, your database runs out of disc space?
1235.19 -> Do you have mechanisms to, one, detect that,
1237.29 -> but to mitigate this as well?
1239.99 -> And then, of course, my favorite, which is dependencies.
1243.53 -> Do you understand all the dependencies
1246.14 -> that you have within your system?
1248.96 -> And also, third party dependencies.
1250.82 -> What if you're dependent on, let's say, a third party IdP?
1253.957 -> Do you have mechanisms in place to fall back?
1257.21 -> And how do you prove that you can?
1260.42 -> And then last, of course, natural disasters,
1263.45 -> when we're thinking about human error, for example,
1266.87 -> or, again, events like Hurricane Sandy and others.
1271.73 -> And how you can fail over and how you can simulate that,
1274.28 -> again, in a controlled way
1275.413 -> through chaos engineering experiments.
1279.98 -> And then the last prerequisite, truly, is about making sure
1286.7 -> that if we're finding a deficiency within our systems
1291.74 -> that is related to security or resilience,
1294.71 -> that we go and we remediate it.
1298.58 -> Because it's worth nothing if we build new features
1302.18 -> but our service is not available.
1304.43 -> So we need to have that executive sponsorship
1306.89 -> and we need to be able to prioritize these.
1311.39 -> And so this brings us to continuous resilience.
1316.58 -> And so when you're thinking about resilience,
1319.67 -> resilience is not a one time thing.
1322.94 -> Resilience should be part of our everyday life
1325.79 -> when we're thinking about building resilient workloads,
1328.55 -> from the bottom all the way up to the application itself.
1332.63 -> And so continuous resilience is a life cycle that helps us
1336.02 -> think about our workload from a steady state point of view
1341.42 -> and work towards mitigating events like we just went through
1346.16 -> from code and configuration all the way
1348.2 -> to the very unlikely events in disaster recovery
1351.89 -> to safe experimentation within our pipelines,
1356.06 -> outside of our pipelines,
1357.68 -> because errors happen all the time,
1359.6 -> not just when we provision new code,
1363.17 -> and making sure that we learn from the faults
1369.23 -> that surfaced during the experiments,
1371.12 -> that we learned to anticipate what to do,
1374.3 -> be able to mitigate these various faults,
1377.75 -> and then also provide those learnings
1379.82 -> throughout the organization
1381.44 -> so that others can learn from this experiment.
1385.7 -> And so when you take continuous resilience
1387.71 -> and chaos engineering, and you put them together,
1391.49 -> that's what leads us
1393.56 -> to the Chaos Engineering and Continuous Resilience program.
1398.45 -> And that's a program
1399.47 -> that we have built over the last two years at AWS
1403.16 -> and have helped many customers run through it,
1406.46 -> which enabled them to build a chaos engineering program
1409.49 -> within their own firm, and scale it
1412.34 -> across various organizations and development teams
1414.95 -> so that they can build controlled experiments
1417.65 -> within their environment.
1419.6 -> And so usually, when we're starting on this journey,
1424.97 -> it's a GameDay that we're preparing for.
1429.02 -> Not another GameDay, as you might think,
1431.06 -> where we're just running for two hours
1432.89 -> and we're checking if something was fine or not.
1435.32 -> Especially when we're starting out with chaos engineering,
1438.95 -> it's important to truly plan what we want to execute on.
1444.32 -> And so setting expectations is a big part of it.
1449.27 -> So key to that, because you're gonna need
1451.97 -> quite a few people that you want to involve,
1454.58 -> is project planning.
1456.41 -> And usually, the first time, when we do this,
1458.75 -> it might be between a week and three weeks
1461.39 -> where we're planning the GameDay,
1463.31 -> the various people that we want in the GameDay,
1465.98 -> like the chaos champion that will advocate
1468.44 -> the GameDay throughout the company.
1470.93 -> The development teams, if there are SREs,
1473.27 -> we're gonna bring them in.
1474.611 -> Observability and incident response.
1477.35 -> And then once we have all the roles
1479.15 -> and responsibilities for the GameDay,
1481.97 -> we're gonna think about what is it
1484.61 -> that we want to run chaos experiments on.
1487.79 -> And when you're thinking about chaos engineering,
1489.95 -> it's not just about resilience,
1491.63 -> it can be about security as well.
1494.45 -> And so the contribution is a list of what's important to you.
1498.98 -> That can be resilience, that can be availability,
1501.41 -> that can be security, that can be durability.
1503.69 -> That's something which you define.
1506.45 -> And then, of course, we wanna make sure
1508.82 -> that there is a clear outcome
1511.19 -> on what we wanna achieve with the chaos experiment.
1515.3 -> In our case, when we're starting out,
1517.76 -> what we want to prove to the organization and the sponsors
1521.84 -> is that we can run an experiment
1523.73 -> in a safe and controlled way
1525.35 -> without impacting our customers.
1527.87 -> And then we can take those learnings and share it,
1531.14 -> either if we found something or not,
1533.48 -> with our customers to be able to make sure
1538.1 -> that the business units understand
1539.99 -> how to mitigate future failures, if we found something,
1543.53 -> or have the confidence that we're resilient
1545.72 -> to the faults that we injected.
1549.14 -> So then we define the workload.
1553.49 -> And so for this presentation, I chose the workload,
1557.39 -> this is a payments workload and it's running on EKS
1562.1 -> and some databases and message brokers with Kafka.
1568.49 -> And so important there too
1569.78 -> is when you're choosing a workload,
1571.94 -> make sure that when you're starting out,
1574.28 -> don't choose the most critical workload that you have
1578.87 -> and then impact it and everyone is unhappy.
1581.03 -> Choose a workload that you know, even if degraded,
1584.72 -> if it has some customer impact, that it's still fine.
1588.08 -> And usually, we have metrics that allow that
1590.21 -> when you're thinking about SLOs for your service.
1594.41 -> So once you've chosen a workload, we're gonna make sure
1600.44 -> that our chaos experiments that we wanna run are safe,
1603.56 -> and we do that through a discovery phase for the workload.
1610.67 -> And so that discovery phase
1613.28 -> will involve quite a bit of architecture.
1617.66 -> We're gonna dive into it.
1619.34 -> All of you know the well-architected review.
1623.3 -> And so when we're thinking
1624.38 -> about the well-architected review,
1626.45 -> it's not just about clicking the buttons in the tool,
1630.23 -> but we're taking about a day
1633.05 -> to go through the various designs of the architecture,
1638.18 -> and we wanna understand how the architecture
1641.87 -> and the workloads and the components within your workloads
1644.84 -> speak to each other.
1646.79 -> What mechanisms do they have in place like retries?
1650.9 -> What mechanisms do they have in regards to circuit breakers
1653.63 -> and have you implemented them?
1656.09 -> Do you have runbooks and playbooks in place
1659.21 -> in case we have to roll back?
1664.37 -> And we wanna make sure
1666.53 -> that you have the observability in place,
1670.07 -> and for example, health checks as well
1672.68 -> when we execute something
1674.15 -> so that your system automatically can recover.
1679.22 -> And if we have all that information
1680.9 -> and we see that there is a deficiency
1685.01 -> that might impact internal or external customers,
1689.84 -> that's where we stop.
1692.15 -> If we have known issues,
1694.34 -> we're gonna have to go and fix these first
1696.62 -> before we move on within the process.
1700.67 -> Now, if everything is fine, we're gonna say,
1703.287 -> "Okay, let's move on to the definition of the experiments."
1709.34 -> And that's a very exciting part.
1712.07 -> So when you're thinking about our system
1714.77 -> that we just saw before,
1716.9 -> we can now think about what can go wrong
1719.33 -> within our environment,
1720.74 -> and if we already have or have not mechanisms in place.
1723.47 -> For example, if I have a third party provider IdP,
1727.7 -> do I have a break glass account in place
1729.47 -> where I can prove that I can log in if something happens?
1733.04 -> What about my EKS cluster?
1735.65 -> And if I have a node that fails,
1738.38 -> do I know the cold boot time for the node itself?
1741.5 -> Do I know the cold boot time for the pods themselves?
1744.08 -> And how long is it gonna take for me
1746.42 -> for these to be live again so that they can take traffic
1749.27 -> and my customers are not gonna get disconnected
1751.73 -> or have a bad experience?
1754.1 -> Or think about someone
1755.39 -> misconfiguring an auto scaling group and health checks,
1758.81 -> which suddenly marks most of the instances as unhealthy.
1763.64 -> Do you have mechanisms to detect that?
1766.67 -> And what does that mean, again, for your customers
1769.64 -> and the teams that operate the environment?
1776 -> And then think about this scenario
1778.58 -> where someone pushed the configuration change,
1780.71 -> and suddenly, your cluster cannot pull
1782.75 -> from your container registry anymore.
1785.6 -> That means that you cannot launch any containers.
1788.99 -> Do you have mechanisms to mitigate that?
1792.297 -> And there are a few more scenarios that we can think about
1794.96 -> like events with Kafka.
1798.32 -> Are you gonna lose messages if the broker suddenly reboots
1801.29 -> or you lose a partition?
1802.46 -> Do you have mechanisms in place to mitigate that?
1805.01 -> Or the Aurora database flipping over,
1808.46 -> do your applications know
1809.78 -> that they need to go to the other endpoint?
1813.86 -> And so these are all infrastructure faults
1815.78 -> and some of the developers which might sit there will say,
1818.157 -> "Yeah, I mean, you know, it's mostly easy to fix,"
1821.15 -> but think about latency.
1823.55 -> Latency and jitter, when you inject that,
1825.8 -> and the cascading effects
1827.03 -> that can happen within your system,
1828.53 -> these are all things we wanna make sure we understand.
1831.77 -> And with fault injection and controlled experiments,
1834.32 -> we're able to do that.
1836.03 -> And then lastly, think about challenges
1839.57 -> that your clients might have to connect to your environment.
1845.42 -> So for our experiment, what we wanted to achieve
1849.41 -> is prove that we can execute and understand
1852.89 -> a brownout scenario.
1856.19 -> What a brownout scenario is
1858.56 -> is that our client that connects to us
1861.08 -> expects the response in a certain amount of milliseconds.
1866.45 -> And if we do not provide that,
1868.13 -> the client is just gonna go and back off.
1871.1 -> But the challenge is, when you have a brownout,
1874.07 -> that your server still is trying to compute
1876.38 -> whatever it needs to compute to return to the client,
1879.2 -> and that's wasted cycles.
1881.36 -> And so that inflection point is called a brownout.
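As a small illustration of that inflection point, the hedged sketch below shows a client that only waits 100 milliseconds before backing off; the URL is a placeholder. The server may keep computing the response the client has already abandoned, and those are the wasted cycles that define the brownout.

```python
import requests

PAYMENTS_URL = "https://payments.example.internal/charge"  # placeholder endpoint

def charge(payload):
    try:
        # The client only waits 100 ms; if no answer arrives it gives up and backs off.
        return requests.post(PAYMENTS_URL, json=payload, timeout=0.1).json()
    except requests.exceptions.Timeout:
        # The server may still be computing a response nobody will read --
        # those wasted cycles are the brownout.
        return {"status": "deferred"}
```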
1885.47 -> Now, before we can think about an experiment
1887.81 -> to simulate a brownout within our EKS environment,
1890.93 -> we need to understand the steady state
1894.77 -> and what the steady state is and what it isn't.
1898.88 -> So when you're thinking
1899.713 -> about defining a steady state for your workload,
1903.71 -> that's the high-level top metric
1905.75 -> that you track for your service.
1907.73 -> So for example, for a payment system,
1910.28 -> that's transactions per second.
1913.22 -> When you're thinking about retail, that's orders per second.
1917.51 -> Streaming, for example, stream starts per second
1920.54 -> or playback start events, for media.
1924.53 -> And when you're looking at that line, you see very quickly
1927.86 -> if you have an order drop or a transaction drop,
1931.58 -> that something you injected within the environment
1934.1 -> probably caused that drop.
1937.25 -> Now, once we understand what that steady state is,
1943.22 -> we're gonna think about the hypothesis.
1947.6 -> And so the hypothesis is key
1950.21 -> when you're thinking about the experiment
1953.33 -> because the hypothesis will define at the end,
1956.51 -> did your experiment turn out as you expect
1960.53 -> or did you learn something new that you didn't expect?
1965.39 -> And so the importance here is, as you see,
1969.05 -> we're saying we're expecting a transaction rate,
1972.26 -> so 300 transactions per second,
1975.62 -> and we think that even if 40% of our nodes fail
1981.47 -> within our environment,
1984.17 -> still 99% of all requests to our APIs should be successful,
1990.35 -> so the 99th percentile,
1991.94 -> and return a response within 100 milliseconds.
1997.22 -> What we also would want to define
1999.74 -> is because we know our systems,
2001.51 -> we're gonna say, "Okay, based on our experience,
2004.33 -> the nodes should come back within five minutes.
2007.93 -> Pods should get scheduled within eight and be available,
2011.62 -> and traffic will flow again to those pods.
2016.66 -> And alerts will fire after three minutes."
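One way to make such a hypothesis executable is to encode its thresholds as CloudWatch alarms before the experiment. A sketch, with a hypothetical namespace and metric name, that mirrors the 100 ms p99 latency expectation and the roughly three-minute alerting window:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Placeholder namespace and metric name; thresholds mirror the hypothesis:
# p99 API latency must stay under 100 ms, evaluated over three one-minute periods,
# which roughly matches "alerts will fire after three minutes".
cloudwatch.put_metric_alarm(
    AlarmName="payments-api-p99-latency",
    Namespace="PaymentService",
    MetricName="ApiLatency",
    ExtendedStatistic="p99",
    Period=60,
    EvaluationPeriods=3,
    Threshold=100.0,                # milliseconds
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="breaching",   # no data during the experiment is itself a bad sign
)
```

The same alarm can later act as the stop condition for the experiment itself.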
2021.7 -> And once we are all agreeing on that hypothesis,
2025.03 -> then we're gonna go and fill out the experiment template.
2031.39 -> And so when you're thinking about
2033.64 -> the experiment itself and the template,
2036.55 -> we're gonna make sure
2037.57 -> that we're very clearly defining what we wanna run.
2041.8 -> We're gonna have the definition of the workload itself,
2045.61 -> what experiment and action we wanna run,
2047.98 -> in our case, it's gonna be terminating 40% of nodes,
2052.45 -> because if you're thinking about the steady state load
2055.03 -> you have on the system and you remove nodes,
2058.51 -> that's the brownout scenario which you wanna simulate,
2062.59 -> the environment that we're gonna run in,
2065.5 -> and in our case, we're always starting with the process
2068.32 -> in a lower environment and not in production,
2072.13 -> the duration that we wanna run,
2074.41 -> and now, think about as well,
2075.66 -> in this case, we're gonna run 30 minutes.
2078.4 -> But you might run experiments where you say,
2080.417 -> "I'm gonna run 30 minutes with five minutes intervals,"
2084.13 -> to make sure that you can look at the graphs
2086.05 -> based on the experiment,
2087.31 -> staggering experiments that you're running
2089.38 -> to understand the impact of the experiment.
2092.86 -> And then, of course,
2094.63 -> because we want to do this in a controlled way,
2096.79 -> we need to be very clear
2098.65 -> what the fault isolation boundary is of our experiment
2101.86 -> and we're gonna clearly define that as well.
2105.01 -> And the alarms that are in place
2106.9 -> that would trigger the experiment to roll back
2111.31 -> if it gets out of bound.
2113.32 -> And that's key, because we wanna make sure
2116.23 -> that we're practicing safe chaos engineering experiments.
2120.22 -> We also wanna make sure that we understand
2122.95 -> where is the observability and what are we looking at
2125.71 -> when we're running the experiment.
2128.83 -> And then you would also add the hypothesis again
2131.38 -> to the template as well.
2133.54 -> You also see two empty lines there,
2136.9 -> which are the findings and the correction of error.
2140.2 -> And when we're thinking about the experiment itself,
2143.32 -> good or bad, we're always gonna have an end report
2147.49 -> where we might celebrate that our system is resilient
2150.73 -> or we might celebrate that we found something
2153.1 -> that we didn't know,
2155.23 -> and just helped our organization
2157.12 -> to mitigate a large scale event.
2160.78 -> So once we have the experiment ready,
2165.46 -> we're gonna think about priming the environment
2168.43 -> for our experiment.
2171.25 -> But before we go there, I want to walk you through
2175.12 -> an entire cycle on how we execute an experiment.
2181.3 -> So first, we have to check if the system is still healthy,
2186.13 -> because if you remember, in the beginning, we said,
2189.527 -> "If we know the system will fail
2191.5 -> or the experiment will fail,
2193.63 -> we're not gonna run the experiment."
2196.09 -> So once we see that the system's healthy, we're gonna check,
2200.05 -> is the experiment that we wanted to run still valid?
2202.63 -> Because it might be that the developer already fixed the bug
2205.48 -> that we thought might exist if we run an experiment.
2209.56 -> And if we see it is, then comes something very important.
2215.59 -> We're gonna create a control and experimental group
2218.617 -> and we're gonna make sure that that's defined.
2222.37 -> And I'm gonna go into that in a few seconds.
2225.58 -> And if we see that the control and the experimental group
2228.1 -> is there and defined, and up and running,
2231.67 -> then we start generating load against the control
2235.21 -> and the experimental group in our environment.
2237.85 -> And we're checking again, is the steady state that we have
2242.29 -> within the tolerance that we think it should be or not?
2245.56 -> If it is within tolerance, then now, finally,
2248.29 -> we can go and run the experiment against the target.
2253.03 -> And then, again, we check,
2254.38 -> is it in tolerance based on what we think?
2256.57 -> And if it isn't, then the stop condition is gonna kick in
2259.6 -> and it's gonna roll back.
2261.43 -> And if it is, that experiment turns into a regression test.
2266.567 -> Why?
2267.4 -> Because now we understand
2269.53 -> what that experiment does to our system.
2271.75 -> We know we can mitigate it and it's predictable.
2276.97 -> So I mentioned the aspects
2279.91 -> of a control and experimental group.
2283.36 -> When you're thinking about chaos engineering
2285.43 -> and running experiments,
2288.19 -> the goal always is, one, that it's controlled,
2290.47 -> and two, that you have minimal to no impact
2292.96 -> to your customers when you're running it.
2296.83 -> So one way you can do that is what we call
2300.4 -> not just having synthetic load that you generate,
2304.12 -> but also synthetic resources.
2306.7 -> For example, you spin up new EKS clusters, synthetic ones:
2313.21 -> one that you inject a fault into
2315.61 -> and another one which stays healthy,
2317.56 -> in the same environment that you are in.
2320.44 -> And so you're not impacting existing resources
2324.76 -> that might have customer traffic,
2326.77 -> but new resources with exactly the same code base
2329.53 -> as the other ones where you understand what happens
2333.46 -> in a certain failure scenario.
2337 -> So once we prime the experiment
2338.65 -> and we see that control and experimental group are healthy,
2342.73 -> and I see the steady state,
2345.64 -> I can move on and think about running the experiment itself.
2352.69 -> Now, running a chaos engineering experiment
2355.45 -> requires great tools that let you run the experiment safely.
2364 -> And so when you're thinking about tools,
2365.92 -> there are various out there that you can use and consume.
2370.75 -> In AWS, we released Fault Injection Simulator
2374.65 -> last year in March.
2376.96 -> And when you're thinking about one of the first slides
2379.81 -> with the shared responsibility model for resilience,
2383.59 -> Fault Injection Simulator helps you quite a bit with that
2387.49 -> because the faults that you can inject,
2389.29 -> the actions that you can run
2392.47 -> are running against the AWS API directly.
2396.85 -> And you can inject faults against your primary dependency
2400 -> to make sure that you can create mechanisms,
2403.72 -> that you can survive a component failure within your system.
2408.31 -> Now, two faults and actions that I wanna highlight
2411.43 -> that we just recently released are the following.
2415.3 -> There is now integration with Litmus Chaos and Chaos Mesh.
2420.58 -> And the great thing about this is that now it provides you
2423.7 -> with a widened scope of faults that you can inject,
2426.55 -> for example, into your Kubernetes cluster
2429.61 -> to Fault Injection Simulator via a single pane of glass.
2434.41 -> And then we also released
2436.24 -> the Network Connectivity Disruption
2438.16 -> that allows you to simulate, for example,
2440.32 -> availability zone issues that you might have and events
2444.37 -> to understand how your workloads react
2447.61 -> if you suddenly have a disruption
2449.83 -> within an availability zone.
2452.26 -> And we're also working on various other deep checks
2455.95 -> where you can flip a switch and have impact
2458.62 -> into an AWS service during the experiment.
2462.4 -> And once the experiment is over,
2464.86 -> your service will just work just fine.
2467.77 -> And so that will help you
2469.54 -> to again validate and verify the entire process
2474.01 -> that your systems are able to survive component failures.
2479.05 -> Now, if you want to run actions
2480.97 -> against, let's say, EC2 systems,
2484.36 -> you also have the capability to run these through SSM.
2489.46 -> Now, think about it where these come into play.
2492.64 -> When we're thinking about running experiments,
2495.37 -> there are various ways
2496.57 -> on how you can create disruptions within the system.
2500.83 -> Let's say you have various microservices
2503.77 -> that run and consume a database.
2507.88 -> Now, you might say,
2508.713 -> "Well, how can I create a fault within the database
2511.78 -> without having impact to all those microservices?"
2517.12 -> And the answer to that is you can inject faults
2521.86 -> within the microservices itself, for example, packet loss,
2526.99 -> that would result in exactly the same behavior
2529.18 -> as the application not being able to write to the database
2532.12 -> because the traffic is not getting there.
2534.94 -> And so it's important there to widen the scope
2537.58 -> and think about the experiments that you can run
2541.36 -> and see with the actions that you have
2543.4 -> on how you can simulate those various experiments.
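As a sketch of that idea, the managed SSM document AWSFIS-Run-Network-Packet-Loss can drop a percentage of packets on an instance's network interface, which looks to the application much like the database being unreachable. The instance id and parameter values below are placeholders, and the parameter names are assumptions to verify against the document as published in your region:

```python
import boto3

ssm = boto3.client("ssm")

# Placeholder instance id; parameter names are assumptions -- check them against
# the AWSFIS-Run-Network-Packet-Loss document in your region before running.
ssm.send_command(
    DocumentName="AWSFIS-Run-Network-Packet-Loss",
    InstanceIds=["i-0123456789abcdef0"],
    Parameters={
        "LossPercent": ["15"],
        "Interface": ["eth0"],
        "DurationSeconds": ["300"],
        "InstallDependencies": ["True"],
    },
)
```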
2547.99 -> So in our case, because we want to do that brownout
2552.22 -> that I showed before, we would use the EKS action
2555.61 -> that can terminate a certain amount of nodes,
2558.1 -> a percentage of nodes within our cluster,
2560.74 -> and we would run them.
2563.38 -> Now, I mentioned that we want to have
2567.76 -> a tool that we can trust,
2570.19 -> so that if something goes wrong,
2573.4 -> an alert automatically kicks in
2576.13 -> and helps us roll back the experiment.
2580.24 -> And Fault Injection Simulator has these mechanisms.
2583 -> So when you build an experiment, you can define,
2586.39 -> what are my alarms that have to kick in
2589.24 -> based on the experiment?
2591.25 -> And if something goes wrong, the experiment stops
2594.67 -> and you can roll back the experiment.
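Putting the pieces together, here is a hedged sketch of what the experiment described above could look like as a Fault Injection Simulator experiment template created with boto3: it terminates 40% of the instances in one EKS node group and stops automatically if a CloudWatch alarm, such as the steady-state alarm sketched earlier, fires. The role, node group, and alarm ARNs are placeholders for your own resources.

```python
import boto3

fis = boto3.client("fis")

# Sketch of the experiment template described above. All ARNs are placeholders;
# the stop condition rolls the experiment back if the steady-state alarm fires.
fis.create_experiment_template(
    clientToken="payments-brownout-v1",
    description="Terminate 40% of EKS nodes; stop if the steady-state alarm fires",
    roleArn="arn:aws:iam::123456789012:role/fis-experiment-role",
    stopConditions=[{
        "source": "aws:cloudwatch:alarm",
        "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:payments-api-p99-latency",
    }],
    targets={
        "payments-nodes": {
            "resourceType": "aws:eks:nodegroup",
            "resourceArns": [
                "arn:aws:eks:us-east-1:123456789012:nodegroup/payments/ng-1/abc123",
            ],
            "selectionMode": "ALL",
        },
    },
    actions={
        "terminate-nodes": {
            "actionId": "aws:eks:terminate-nodegroup-instances",
            "parameters": {"instanceTerminationPercentage": "40"},
            "targets": {"Nodegroups": "payments-nodes"},
        },
    },
)
```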
2598.33 -> And so in our case, everything was fine
2602.32 -> and we said, "Okay, well, now we have confidence
2604.99 -> based on the observability that we have
2607.18 -> for this experiment to move up to the next environment."
2614.71 -> Now, here, it's key, as soon as you get into production,
2619.75 -> you have to think about the guardrails
2622.06 -> that are important in your production environment.
2627.52 -> When we're running chaos engineering experiments
2629.83 -> in production, especially when you're thinking
2632.17 -> about running them for the first time,
2634.09 -> please don't run them at peak hours.
2636.91 -> It's probably not the best idea.
2640.36 -> And also make sure, because in many ways,
2642.85 -> when you're running those experiments
2644.35 -> in lower level environments,
2645.79 -> your permissions might be much more permissive
2648.61 -> than you have them in a production environment.
2650.83 -> You gotta make sure
2652.39 -> that you have the observability in place,
2655.03 -> that you have the permissions
2656.2 -> to execute the various experiments,
2659.8 -> and that you have the observability in place
2662.05 -> to be able to see what's going on
2663.79 -> with that environment as well.
2666.4 -> Also key is to understand that the fault boundary changed
2670.42 -> because we're in production now.
2672.79 -> So make sure you understand that as well
2674.8 -> and be able to understand
2676.06 -> what is the risk if I'm running this experiment
2678.58 -> within the production environment.
2681.13 -> Because, again, we wanna make sure
2683.59 -> that we're not impacting our customers.
2688.18 -> And the last one, which we often see not being up to date,
2692.5 -> is that runbooks and playbooks are not up to date
2694.9 -> based on the latest changes that were made to the workload.
2699.16 -> So make sure you have all these, and if we see that we do,
2703.27 -> we're finally at the stage
2704.44 -> where we can think about moving up to production.
2709.27 -> So here, again, we're gonna think
2710.98 -> about priming the environment for production.
2715.12 -> And so you've seen this picture before
2717.19 -> in the lower level environment.
2719.83 -> But if we're in a production environment
2722.74 -> and we don't have a mirrored environment
2725.11 -> that some of our customers do where they split the traffic
2727.9 -> and have a chaos engineering environment in production
2730.15 -> and another environment,
2732.25 -> we can also use a canary
2734.2 -> to say that we're gonna take a percentage of real user traffic,
2739.774 -> and we're gonna start bringing that real user traffic
2742.87 -> to the control and experimental group.
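One way to realize that kind of split, assuming the control and experimental groups sit behind the same Application Load Balancer listener as separate target groups (the ARNs below are placeholders), is weighted forwarding:

```python
import boto3

elbv2 = boto3.client("elbv2")

# Placeholder ARNs: keep ~99% of real user traffic on the control group and
# send ~1% to the experimental group behind the same listener.
elbv2.modify_listener(
    ListenerArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:listener/app/payments/aaa/bbb",
    DefaultActions=[{
        "Type": "forward",
        "ForwardConfig": {
            "TargetGroups": [
                {"TargetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/control/111", "Weight": 99},
                {"TargetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/experiment/222", "Weight": 1},
            ],
        },
    }],
)
```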
2746.44 -> Now, keep in mind, at this point in time,
2748.42 -> nothing should go wrong.
2750.61 -> We have the control and experimental group there.
2752.77 -> We haven't injected the fault.
2754.24 -> We should be able to see from an observability perspective
2757.87 -> that's all thumbs up.
2760.6 -> And once we see that that truly happened,
2764.68 -> that's where your heart starts pounding most of the time
2767.38 -> when you're running it the first time.
2769.69 -> We're gonna get to running the experiment
2771.58 -> actually in production.
2775.87 -> But see, when you're thinking
2776.92 -> about running the experiment in production,
2780.91 -> it's very different from an event
2783.28 -> that happens out of nothing,
2785.53 -> where there's no one in the room, you have to page everyone,
2788.23 -> and people are running around.
2790.45 -> Like here, we already ran through all these stages
2794.26 -> that we have this confidence
2795.97 -> that our workload should be perfectly fine
2798.7 -> with what we've seen.
2801.31 -> And if there is anything that would happen,
2804.61 -> you have the entire team of experts in that room
2807.64 -> looking at that dashboard.
2809.95 -> And if they see that there is a spike
2811.81 -> which they didn't expect,
2812.86 -> that experiment is done,
2814.51 -> and you're gonna fix that issue right away when you see it.
2820.96 -> So the aspect of running experiments in production,
2824.05 -> especially when you're thinking about running it
2825.82 -> in a GameDay style is very different
2828.43 -> from when you're thinking about a real world event,
2831.04 -> and in many ways, helps you to find deficiencies
2834.22 -> within production environments as well
2836.59 -> that might not have occurred in lower level environments.
2842.23 -> So now, whether something happened or not,
2845.74 -> always, after a chaos engineering GameDay
2848.83 -> and after experiments that you also run automatically
2851.8 -> like at Capital One,
2854.77 -> we're gonna go into a post-mortem or correction of error.
2860.29 -> Now, key here, when we're thinking about chaos engineering
2865.21 -> is that we're very transparent and blameless
2868.78 -> in regards to what we have found within our system.
2871.93 -> Because only this way, we're gonna learn
2874.09 -> and be able to tell others
2876.4 -> what we've seen within the environment.
2879.34 -> But there's certain questions which we need to ask
2881.86 -> when we're looking at the experiment.
2883.42 -> For example, how did we communicate with each other
2886.51 -> during the experiment?
2888.58 -> Was there someone that we had to bring in
2893.08 -> to make sense of what we saw on the screen
2895.6 -> because there is this guy
2896.95 -> who just knows almost everything, but he wasn't there?
2901.87 -> And what are, for example, some of the mechanisms
2904.93 -> that we want to use to share the learnings, good or bad,
2909.34 -> that we found during the experiments?
2914.08 -> Now, think about it.
2915.1 -> You want to make sure that your business units
2918.07 -> and the various developers see those findings
2920.83 -> because you want to share these with them
2923.29 -> so that they don't make the same mistakes.
2925.87 -> It's advertisement for you that you can say,
2927.887 -> "Hey, I found these various issues,
2932.71 -> and for that, I was able to mitigate X, Y and Z."
2936.37 -> There's another customer
2937.63 -> that is also here at re:Invent, Riot Games.
2941.53 -> They wrote a great story about how they were looking at
2945.07 -> and did a chaos engineering GameDay on Redis
2949.03 -> and were able to find various issues within that,
2952.84 -> and also were able to help another team
2955.12 -> with chaos engineering and help them have a seamless launch,
2959.86 -> but previous to that, found faults within their environments
2963.64 -> in regards to configuration errors
2965.71 -> with load balancers and circuit breakers
2968.14 -> that weren't implemented.
2970.12 -> And they shared this with the development team
2972.19 -> so they see, "Wow, I get really a lot of value out of this."
2976.15 -> And so it's important to promote that as well.
2982 -> Now, we're not at the end yet.
2985.592 -> What we have done now is created a great level of confidence
2989.62 -> that we know we can run this experiment
2992.89 -> without impact to our environments.
2995.68 -> And this is where the continuous resilience aspect
2998.2 -> comes into play.
3000.36 -> As I said in the beginning, resilience is not an afterthought.
3003.27 -> It should be with you all the time.
3005.97 -> And this is where we're thinking about
3008.37 -> the automation of those experiments.
3013.2 -> Now, chaos experiments,
3014.34 -> you can think about it as individual experiments.
3016.77 -> If you're thinking about the organizational awareness
3019.53 -> that we had in the beginning,
3021.21 -> you will have various teams
3022.44 -> that have specific experiments they wanna run,
3025.41 -> and that's what you would run individually
3027.33 -> to prove a certain point.
3029.82 -> We would also have the experiments that we run in the pipeline.
3035.61 -> But as mentioned, we need to make sure
3038.58 -> that experiments also run outside of the pipeline
3042.03 -> because faults happen all the time.
3043.65 -> They don't just happen when I push my code.
3046.2 -> They happen day in, day out, morning, at night, whenever.
3051 -> And then, use the GameDays to bring the teams together
3054.93 -> and make sure that you understand
3057.09 -> not just the aspects of how I recover the apps,
3060.63 -> but also look at it
3061.65 -> from a continuity of operations perspective,
3064.83 -> on how your processes work,
3067.14 -> and whether people are alerted in certain ways
3069.51 -> when you're running experiments
3071.49 -> so that they see what they need to do or not.
3078.03 -> So to make it easier for our customers,
3081.51 -> we have built, of course, various templates and handbooks
3085.17 -> that when we're going through the experiments with them,
3088.41 -> we share, like the "Chaos Engineering Handbook"
3092.16 -> that shows the business value of chaos engineering
3094.86 -> and how it helps with resilience,
3097.62 -> the chaos engineering templates,
3099.75 -> as well as the correction of error template,
3102.87 -> and also the various aspects of the reports
3106.02 -> that we share with the customers
3108.18 -> when we're running through the program.
3112.35 -> Now, next, I just wanna show you some resources
3116.79 -> that you can start with
3118.56 -> when you're thinking about chaos engineering
3120.6 -> on your own time.
3123.72 -> And so we have great chaos engineering workshops
3130.38 -> that go from resilience to security.
3134.46 -> But when I run through these workshops,
3136.44 -> what I usually do is I start with the observability one.
3143.13 -> And the reason I do that is because that workshop
3147.69 -> builds an entire system that provides me
3151.71 -> with everything in the stack of observability
3154.62 -> and I have to do absolutely nothing to get it
3156.99 -> outside of pressing a button.
3159.87 -> And once I have that and I have the observability
3162.36 -> from top down to tracing and logging,
3165.75 -> I'm going to the chaos engineering workshop
3167.85 -> and I'm looking at the experiments that I have there
3170.13 -> and it starts with some database fault injection,
3172.83 -> and containers and EC2,
3174.48 -> and shows you how you can do that in the pipeline.
3177.78 -> You take those experiments
3179.22 -> and you run it against the pet shop
3181.05 -> within the observability workshop.
3184.38 -> And that gives you a great view
3186.27 -> of what's going on within your system.
3188.64 -> If you inject those faults,
3190.02 -> you will see them right away within those dashboards
3193.2 -> with no effort in observability.
3195.45 -> And that's important because, again,
3197.28 -> as the National Australia Bank said,
3198.847 -> "You need to see what's happening within your system,
3201.27 -> otherwise, even your best architecture
3203.04 -> is worth nothing."
3206.43 -> And another workshop, and there's a white paper
3208.53 -> that was released as well
3209.7 -> about fault isolation boundaries in AWS
3212.88 -> is the Testing Resilience Using Availability Zone Failures.
3217.77 -> Think about running chaos experiments
3220.32 -> that would trigger the availability zone failures.
3224.49 -> And this workshop also shows you
3226.38 -> how to use various embedded metric formats
3228.6 -> within CloudWatch Logs to be able to add these
3231.51 -> to the observability of the pet shop
3233.85 -> where you then see how it fails over.
3237.78 -> So to make it easier for you to find all these,
3242.28 -> I've provided you with some QR codes
3245.04 -> and the slides will be available as well
3247.95 -> where you can find those various workshops
3250.14 -> and blogs and then white papers
3252.06 -> that are interesting to read.
3256.89 -> So in this session, you've learned about
3260.76 -> what chaos engineering is and what it isn't,
3266.04 -> but more importantly, what the power of chaos engineering is
3269.34 -> and what you can achieve with it,
3271.5 -> and how you can build your own chaos engineering program
3276.39 -> and scale it to build more resilient
3279.24 -> and robust workloads in AWS
3282.96 -> that provide your customers with a better user experience.
3287.97 -> So if you have any questions, I will be down here
3291.33 -> for the next few minutes or 20 minutes or so.
3293.73 -> And else, wish you a great rest of re:Invent,
3297.06 -> and enjoy re:Play tonight.
3298.2 -> Thank you.
3299.059 -> (audience applauding)

Source: https://www.youtube.com/watch?v=tm5GEePP1PY