AWS re:Invent 2022 - Building modern apps: Architecting for observability & resilience (ARC217-L)


Cloud computing is transforming architecture design and application delivery at organizations across the world. As cloud architectures evolve, new design patterns are essential. Architecting for resilience, observability, security, and emerging trends serves as the foundation that empowers builders to innovate, optimize their workloads, and scale adoption over time. In this session, hear from AWS customers and cloud experts about their proven architecture best practices, tools, and blueprints for architecting for reliability, observability, and modernization on AWS.

Learn more at: https://go.aws/3ucTJpm


Content

0.234 -> [music playing]
2.503 -> Please welcome Vice President, Technology
5.339 -> and Customer Solutions, AWS, Francessca Vasquez.
9.676 -> [music playing]
17.718 -> Welcome everyone. Welcome to re:Invent 2022.
22.656 -> I can't hear you all. Welcome to re:Invent 2022!
26.96 -> My name is Francessca Vasquez and I have the honor and privilege
30.964 -> to lead AWS's Solutions Architecture and Customer Solutions organization.
36.436 -> I also have the opportunity to lead our AWS
40.307 -> Resilience Customer and Partner program
43.644 -> and I am so excited to have you here today.
47.714 -> On behalf of the AWS Architecture Organization
52.519 -> and our broader community, we'd like to dedicate this session
57.724 -> on building modern applications through the lens of observability
61.762 -> and resilience to David Grimm.
64.998 -> He was a solutions architect, a technologist,
68.468 -> and an amazing thought leader that we lost here recently.
73.207 -> So, about a year ago, on December 25th,
78.512 -> we witnessed one of the largest innovations
82.049 -> in human history go into production.
85.185 -> Over 29,000 engineers and scientists worked on this product
90.624 -> and this project that would impact the entire human race.
96.063 -> And this innovation was arguably one of the largest resilience
102.069 -> and observability initiatives ever taken.
107.708 -> What I'm referring to is the James Webb Telescope,
111.512 -> which you heard referenced earlier in Adam's keynote.
115.148 -> It is the largest most powerful telescope ever built.
119.987 -> It will allow scientists to look at what our universe was like,
124.124 -> about 200 million years after the Big Bang.
128.595 -> Just incredible innovation.
132.599 -> The James Webb Telescope has been under development for over 20 years.
138.839 -> $10 billion budget for the program,
141.508 -> which was up from the original $1 billion forecasted in 2002.
147.181 -> And as the team was doing their own resilience testing,
151.518 -> they found that during a vibration test
155.856 -> that the screws holding the actual sun shield together failed
160.794 -> and they also popped loose.
163.263 -> This led to another 10 months
165.265 -> and an additional $800 million in costs.
168.535 -> This testing was done to simulate the actual testing
172.406 -> of riding an Ariane 5 rocket to space.
176.71 -> This telescope had to actually travel a million miles
181.081 -> and then open up remotely.
183.684 -> And did I mention, there's only one of these?
187.688 -> NASA actually had to build the necessary observability
192.092 -> and resilience capabilities into one device.
196.163 -> They had to build this system to be able to withstand
200.234 -> the micrometeoroid strikes and temperature swings,
204.404 -> that ranged anywhere from 230 degrees
208.075 -> Fahrenheit to negative 394 degrees Fahrenheit.
212.613 -> And it took about a month to get to its final location,
216.884 -> which is about a million miles from Earth.
221.421 -> And when I tell you this, this space telescope, it's really big.
227.728 -> It represents about a three-story building.
231.698 -> Its width and length are equal to that of a tennis court.
236.603 -> And to actually get this thing to launch,
239.273 -> you actually fold it up before the rocket can go off.
243.677 -> And finally, the team at NASA and our European Space Agency,
249.116 -> they had to create the tools to test and overcome more
253.22 -> than 300 single points of failure.
258.158 -> They had to create the necessary observability
260.727 -> tooling to actually make this possible.
263.564 -> And the photo that you happen to see up here is of the Tarantula Nebula,
269.036 -> which is 161,000 light-years away.
276.076 -> Today, we have an action packed agenda.
280.814 -> We'll be on a very deep exploration
283.917 -> on how we think about resilience and observability
287.554 -> with the same mission critical application lens,
291.525 -> as a NASA space telescope.
294.228 -> We're also going to hear from two amazing customers
297.965 -> who will be sharing their best practices
300.634 -> on how to build and design modern applications with resiliency in mind.
305.839 -> Finally, we're going to close off the day today with strategies
310.31 -> and resources to help you all think about continuous improvement.
318.252 -> Now, before we begin, I want to level set everyone
323.19 -> on some important definitions just so we're on the same mental model.
327.528 -> First off, how we think about observability.
330.764 -> It really describes how well you can understand
333.433 -> what is happening with your system,
335.335 -> often by instrumenting it to collect metrics, logs,
339.006 -> and traces. Resilience, the ability of a workload
343.911 -> to recover from infrastructure, service or application disruptions.
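
As a minimal sketch of what "instrumenting" a system to collect metrics, logs, and traces can look like, the Python below wraps a function so each call produces a log line, a latency metric, and a trace ID. The names `instrument`, `METRICS`, and `lookup_order` are hypothetical, and the in-memory list is an assumption standing in for a real metrics backend.

```python
import functools
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("demo")

# In-memory stand-in for a metrics backend (assumption for illustration).
METRICS = []


def instrument(func):
    """Record a log line, a latency metric, and a trace ID for each call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        trace_id = uuid.uuid4().hex          # correlates log and metric
        start = time.perf_counter()
        outcome = "error"                    # overwritten on success
        try:
            result = func(*args, **kwargs)
            outcome = "ok"
            return result
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            METRICS.append({"name": func.__name__, "outcome": outcome,
                            "latency_ms": elapsed_ms, "trace_id": trace_id})
            log.info("trace=%s call=%s outcome=%s latency_ms=%.2f",
                     trace_id, func.__name__, outcome, elapsed_ms)
    return wrapper


@instrument
def lookup_order(order_id):
    """Hypothetical business function used only to exercise the wrapper."""
    if order_id < 0:
        raise ValueError("order_id must be non-negative")
    return {"order_id": order_id, "status": "shipped"}
```

Calling `lookup_order(7)` appends one entry to `METRICS` and emits one log line; a failing call is recorded with `outcome` set to `"error"` before the exception propagates.
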
349.483 -> Both observability and resilience
352.619 -> are a critical and foundational element
356.023 -> to AWS's Well-Architected Framework.
359.293 -> They specifically live in the reliability
362.896 -> and operational excellence pillars.
365.132 -> So, two of the six pillars of an architectural framework that we use.
369.636 -> Similar to NASA's approach with the Webb Telescope at AWS,
374.975 -> we want to partner with all of you
377.11 -> to help you automatically be able to recover from failures,
381.281 -> to build and test recovery procedures
384.351 -> to help you scale and to help implement changes
388.522 -> using ongoing automation so that we can reduce manual errors.
393.66 -> So, to cover off on our building applications and resiliency,
398.732 -> please give a warm welcome to my colleague Shaown Nandi,
402.736 -> who is the Director of Solutions Architecture
405.038 -> and Customer Success for our strategic customers.
408.008 -> Thank you so much. Welcome to re:Invent!
411.078 -> [music playing]
419.486 -> Thank you Francessca.
420.854 -> What an incredible story to hear about the telescope.
423.39 -> It is great to be here with all of you today.
425.626 -> I see customers, I see friends in the audience.
428.762 -> Thank you for taking time,
429.997 -> especially with all the exciting World Cup action.
432.199 -> I hope there's some football, American soccer
434.268 -> fans in the audience, maybe.
435.869 -> Yes.
439.206 -> For those tuning in via the stream,
441.308 -> please thank you for joining us virtually.
443.844 -> In business, the resilience of workloads is critical.
446.847 -> Today we're going to hear from two customers
448.482 -> from FINRA and Capital One,
450.35 -> and we'll walk through how we work with all our customers
453.053 -> to ensure their workloads are Well-Architected
455.022 -> and resilient on AWS.
456.657 -> So, let's dive right in.
458.926 -> What is resilience?
460.427 -> Resilience refers to the ability for workloads,
463.23 -> workloads are your applications, your products,
466.2 -> your business processes, to respond and quickly recover from failure.
470.771 -> A workload can be simple as a single application
473.34 -> running in a single AWS account,
475.242 -> or it might be a set of products that span multiple accounts.
478.445 -> There are three mental models I want you to consider
481.315 -> as we go through today's presentation.
483.45 -> Think about how you can build systems to be highly available
486.82 -> with resistance to common failure modes,
489.156 -> how to recover your system
490.757 -> if you run into one of these rare failure scenarios.
493.927 -> And underpinning all of this is the idea of continuous resilience.
498.065 -> This is where you're implementing DevOps practices like CI/CD
500.934 -> to automate your delivery pipelines.
503.604 -> Introducing failure on an ongoing basis
506.039 -> to test your system chain and test your teams for weaknesses
510.377 -> and implementing ongoing observability
512.579 -> and monitoring practices.
515.449 -> As with security and sustainability,
518.619 -> resilience is a shared responsibility.
520.921 -> AWS is responsible for resilience of the cloud.
524.424 -> You, our customers, are responsible for resilience
527.394 -> of your workloads in the cloud.
530.163 -> This shared model helps relieve your operational burdens,
532.866 -> the customer's operational burdens, as AWS operates,
536.27 -> manages, controls, the host operating system,
539.072 -> the virtualization layer, all of the infrastructure.
542.242 -> We're responsible for all the services
544.144 -> that are offered in the AWS Cloud.
546.78 -> Customer responsibility is determined by the services you select.
550.684 -> You have to carefully consider
552.386 -> which services you choose as responsibilities will vary
555.289 -> depending on how they integrate and what regulatory frameworks apply.
558.959 -> And often you'll hear us
560.527 -> recommending using our higher-level services.
562.863 -> You heard about the serverless announcements
564.264 -> this morning at Adam's keynote,
565.933 -> those tend to have less operational responsibilities that sit with you.
569.837 -> As a solution architecture leader,
571.371 -> my teams partner with all of you
573.006 -> to make these decisions. You're not alone in figuring this out.
578.045 -> AWS from day one has built resilience into our culture.
582.216 -> We use a service ownership model internally,
585.419 -> which incentivizes our teams to continuously
587.621 -> improve their operations.
589.189 -> We've organized our engineering and product management
591.592 -> efforts into small multi-disciplinary teams.
594.394 -> We call them two-pizza teams.
595.829 -> Most of you heard this term.
597.264 -> We like it 'cause we think those teams can be fed by two pizzas,
600.667 -> depends I guess, on the size of the pizzas.
602.736 -> And these teams own a service end-to-end.
605.706 -> What that means is the ownership is not
608.842 -> just of designing and launching their service,
611.111 -> but you have to operate it during production,
613.714 -> be on call for issues as they arise.
616.817 -> For customers who use this structure, it's a major cultural shift.
620.42 -> The idea that your responsibility for the service never really ends.
624.958 -> In AWS all new services are reviewed for launch
628.495 -> using an operational readiness review process we call ORR.
631.798 -> It's basically a set of questions,
634.067 -> a checklist that uses known best practices and a standardized runbook.
638.038 -> When we roll out new services or update existing ones,
640.908 -> we use safe continuous deployment pipelines
644.011 -> that automate pre-production testing,
646.146 -> support automatic rollbacks and stagger deployments.
649.016 -> When we launch our services or even add features, we start small.
652.586 -> We start with a single instance,
654.354 -> go across an AZ, roll out across multiple AZs, finish a Region,
658.525 -> and then ultimately roll to our other Regions.
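
The staggered rollout with automatic rollback described here could be sketched roughly as follows. This is an illustration only, not AWS's actual pipeline: the stage names simply mirror the order given in the talk, and `deploy`, `healthy`, and `rollback` are hypothetical callbacks a real pipeline would supply.

```python
from typing import Callable, List, Sequence

# Hypothetical stage names mirroring the order described in the talk.
STAGES = ("single instance", "one AZ", "multiple AZs",
          "full Region", "other Regions")


def staged_rollout(deploy: Callable[[str], None],
                   healthy: Callable[[str], bool],
                   rollback: Callable[[str], None],
                   stages: Sequence[str] = STAGES) -> List[str]:
    """Deploy stage by stage, widening the blast radius only after a
    health check passes. On the first unhealthy stage, roll back every
    stage touched so far (newest first) and return an empty list."""
    done: List[str] = []
    for stage in stages:
        deploy(stage)
        done.append(stage)
        if not healthy(stage):
            for touched in reversed(done):
                rollback(touched)
            return []
    return done
```

The point of the ordering is that a bad change is caught while it affects one instance or one AZ, long before it reaches every Region.
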
661.061 -> And if any issues arise, we leverage our correction of error process.
665.632 -> This is where we go and understand what the root cause was.
669.136 -> This is not about placing blame.
671.605 -> This is about diving deep to find the true reason
674.341 -> that something failed.
675.876 -> And after an issue is mitigated, we drive company-wide engineering
679.012 -> sprints to ensure the issue is fixed across all AWS services.
683.75 -> The learnings become part of the ORR process,
685.986 -> goes right back to the top and ensures similar issues don't reoccur.
692.292 -> When we think about resilience in the cloud,
694.928 -> there are four key areas we focus on.
697.364 -> First, you have to anticipate what's going to happen.
700.434 -> To do this, we use code reviews, failure-oriented programming,
704.204 -> immutability, simple designs, whenever possible.
707.808 -> Second, monitoring, we're going to talk a lot more about this,
710.444 -> health checks, tracing, alarms, dashboarding.
713.814 -> Responding, this can be the longest part of any incident
717.184 -> and you minimize that by identifying event-driven patterns,
720.153 -> using machine learning powered operations.
722.923 -> And finally learning.
724.291 -> I talk about the CoE process that we've built.
727.06 -> We look at our logs and all that learning is fed right back in
730.864 -> to anticipate better the next time and we come full circle.
735.569 -> When we think about your responsibility
737.437 -> for resilience in the cloud,
739.406 -> you also need to think about resilience threat modeling.
742.075 -> There are typical categories of failure,
743.577 -> you see them up on the screen.
745.179 -> For example, for code deployments,
747.381 -> what happens when you have a failed deployment?
749.55 -> What are you set up to have occur?
751.285 -> Do you have instrumentation to detect it?
753.487 -> Can your CI/CD system automatically roll back?
757.09 -> In core infrastructure, what if a single instance is terminated?
760.961 -> Have you designed to make sure this will not impact you?
764.131 -> And if there's an impairment in a single AZ
766.033 -> or you experience a gray failure, what's going to happen?
769.636 -> In terms of data and state,
771.338 -> what if your customers overwhelm your service
773.74 -> or your database gets corrupted?
775.609 -> What happens if a third-party dependency fails?
778.178 -> Do you gracefully degrade?
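
One way to picture graceful degradation is a page that still renders when an optional dependency is down. The sketch below is a hedged illustration: `get_page`, `fetch_recommendations`, and `DEFAULT_RECOMMENDATIONS` are hypothetical names, and a real service would catch narrower exception types.

```python
from typing import Callable, Dict, List

# Hypothetical static fallback served when the dependency is unavailable.
DEFAULT_RECOMMENDATIONS: List[str] = ["bestsellers"]


def get_page(user_id: str,
             fetch_recommendations: Callable[[str], List[str]]) -> Dict[str, object]:
    """Build a page; if the optional dependency fails, serve a reduced
    page instead of failing the whole request."""
    try:
        recs = fetch_recommendations(user_id)
        degraded = False
    except Exception:
        recs = list(DEFAULT_RECOMMENDATIONS)
        degraded = True   # worth emitting as a metric so the failure is visible
    return {"user": user_id, "recommendations": recs, "degraded": degraded}
```
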
779.913 -> I talked to a CIO this morning about a challenge
781.915 -> they faced a couple weeks ago.
783.483 -> A major application outage.
785.219 -> They had a dependent system, it was a login system,
788.522 -> federation system that failed.
790.39 -> A queue built up 1.2 million requests out there
793.627 -> and it took them seven hours to respond
796.129 -> and detect. They didn't have automatic rollback in place.
799.199 -> And they knew, they knew that it was a challenge after the fact
801.835 -> they hadn't prioritized fixing it and they want to fix it.
804.771 -> You have to consider these.
806.406 -> Now you also have to think about unlikely scenarios.
809.109 -> You may choose not to engineer for the really unlikely ones,
811.979 -> but you should consider what'll happen.
813.48 -> What if a natural disaster impacts an entire coast in the United States?
818.151 -> Or my favorite panic scenario,
820.12 -> what are you going to do during the zombie apocalypse?
822.422 -> How will you keep your systems up?
823.891 -> I keep a bag packed just in case.
828.428 -> A key part of resilient workloads
830.163 -> is making sure you have a strong foundation.
832.533 -> Start with AWS infrastructure.
835.169 -> With the launch of Hyderabad last week, we have 30 AWS Regions
839.039 -> and each Region has at least three Availability Zones.
842.075 -> And each AZ is made up of multiple physically separated data centers.
846.18 -> Each region has two independent, fully redundant transit centers.
850.117 -> On top of that is your network.
852.152 -> Your network should always be redundant, always available,
855.055 -> and seamlessly routed.
856.69 -> On top of that, you've got your data,
859.226 -> you have to have confidence in the resiliency of your data.
861.895 -> There are so many forms: file systems, block, databases, in-memory caches.
866.967 -> Consider how eventual consistency impacts design.
870.003 -> And finally, your application.
872.806 -> Your highly resistant application should be able to self-heal.
876.376 -> We like you to use microservices app architecture
878.812 -> when you're building new, we know many existing apps don't do that.
881.849 -> You want to decouple interdependencies,
884.017 -> have loose coupling when possible and always remove state
887.221 -> when you can from app components.
890.457 -> Your goal is to build systems that never fail.
893.193 -> The reality is failures do happen.
895.996 -> So, there are a few things you can do to help reduce the impact of failure.
899.9 -> First, very basic: set timeouts.
903.036 -> Did you know that many frameworks default to infinite timeouts?
906.139 -> That is just asking for trouble.
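
To make the timeout point concrete, here is a small sketch using Python's standard `socket` module, where a socket blocks indefinitely unless a timeout is set. The helper names `read_reply` and `open_connection` and the 2-second default are assumptions for illustration.

```python
import socket


def read_reply(conn: socket.socket, timeout_s: float = 2.0) -> bytes:
    """Read one reply, but never block longer than timeout_s.
    Raises socket.timeout if nothing arrives in time."""
    conn.settimeout(timeout_s)   # without this, recv can block forever
    return conn.recv(1024)


def open_connection(host: str, port: int, timeout_s: float = 2.0) -> socket.socket:
    """Connect with a bounded wait instead of relying on OS defaults."""
    return socket.create_connection((host, port), timeout=timeout_s)
```

A caller that catches `socket.timeout` can then fail fast or fall back, rather than holding a thread and a connection open while the dependency hangs.
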
908.675 -> You need retry with backoff.
910.644 -> When you don't back off, you're going to end up with a retry storm
913.347 -> and a subsequent cascading failure.
915.382 -> One retry is usually enough to be resilient
918.719 -> to intermittent errors.
920.053 -> Retry once, fail fast.
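
A rough sketch of retry with backoff, defaulting to the single retry suggested here. The function name, the delay values, and the use of random jitter are assumptions for illustration; production code would usually retry only on errors known to be transient.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(op: Callable[[], T],
                    max_retries: int = 1,
                    base_delay_s: float = 0.1,
                    max_delay_s: float = 2.0,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call op; on failure retry up to max_retries times, sleeping a
    jittered, exponentially growing delay before each retry."""
    attempt = 0
    while True:
        try:
            return op()
        except Exception:
            if attempt >= max_retries:
                raise                      # retry budget spent: fail fast
            cap = min(max_delay_s, base_delay_s * (2 ** attempt))
            sleep(random.uniform(0, cap))  # jitter keeps clients from retrying in lockstep
            attempt += 1
```

The `sleep` parameter exists only so the delay can be replaced in tests; the backoff and jitter are what keep many clients from hammering a recovering service at the same instant.
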
922.222 -> That example I mentioned earlier, 1.2 million up in the queue.
926.193 -> Yeah, clearing that was a pretty big pain.
928.495 -> You want to limit the sizes of your queues
930.564 -> and make sure you rate limit your APIs and load shed when needed.
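
The three controls mentioned here (bounded queues, rate limiting, load shedding) can be sketched with the standard library alone. `BoundedInbox` and `TokenBucket` are hypothetical names, and the token bucket is one common rate-limiting approach, not something the talk prescribes.

```python
import queue
import time
from typing import Callable


class BoundedInbox:
    """Queue with a hard size limit; new work is shed when it is full."""

    def __init__(self, max_size: int):
        self._q = queue.Queue(maxsize=max_size)
        self.shed = 0

    def offer(self, item) -> bool:
        try:
            self._q.put_nowait(item)
            return True
        except queue.Full:
            self.shed += 1   # reject now instead of building a backlog
            return False

    def size(self) -> int:
        return self._q.qsize()


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `rate_per_s` tokens per second."""

    def __init__(self, rate_per_s: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate, self.capacity, self.clock = rate_per_s, capacity, clock
        self.tokens = capacity
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

With a size limit in place, a stalled downstream system produces rejected requests that callers can see right away, rather than a queue of over a million entries that has to be drained later.
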
936.87 -> Before failures occur, it's important to test.
940.674 -> Continuous testing is imperative in understanding
943.477 -> how your system will react to unknowns.
945.679 -> This includes strategies like chaos engineering,
947.781 -> conducting game days, practicing failovers.
951.084 -> We use chaos to prove or disprove our assumptions about our system's
955.155 -> capability to handle disruptive events.
957.491 -> Chaos stresses an application
959.259 -> in testing or production environments.
961.161 -> We create disruptive events artificially: server outages,
964.031 -> API throttling.
965.332 -> Amazon has been purposely injecting controlled failure
968.402 -> into limited environments since the early 2000s.
971.405 -> That's how we ensure readiness for the most adverse of circumstances.
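
As a toy illustration of injecting controlled failure to test an assumption, the sketch below wraps a call so that a chosen fraction of invocations fail. It is not the API of any AWS service; `with_fault_injection`, `InjectedFault`, and `run_experiment` are hypothetical names, and a seeded random generator is assumed so runs are repeatable.

```python
import random
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Raised in place of a real failure during an experiment."""


def with_fault_injection(op: Callable[[], T], failure_rate: float,
                         rng: random.Random) -> Callable[[], T]:
    """Wrap op so that roughly failure_rate of calls raise InjectedFault."""
    def wrapped() -> T:
        if rng.random() < failure_rate:
            raise InjectedFault("simulated dependency outage")
        return op()
    return wrapped


def run_experiment(op: Callable[[], T], calls: int) -> Dict[str, int]:
    """Call op repeatedly and count how many calls hit an injected fault."""
    ok, failed = 0, 0
    for _ in range(calls):
        try:
            op()
            ok += 1
        except InjectedFault:
            failed += 1
    return {"ok": ok, "failed": failed}
```

In a real experiment the interesting measurement is what callers do when the fault appears (retry, fall back, alarm), which is the assumption being proved or disproved.
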
976.476 -> AWS is investing heavily in resilient services.
979.913 -> We just spoke about chaos engineering.
981.715 -> I want to call out the AWS Fault Injection Simulator,
984.818 -> a fully managed service.
986.153 -> We love those, that simulates real-world failure to uncover hidden bugs,
990.224 -> monitoring blind spots and performance bottlenecks.
993.293 -> Our resilience hub, AWS Resilience Hub
996.463 -> provides a central place to define, validate,
998.832 -> and track the resilience of your applications on AWS.
1002.236 -> And it pulls in the best practices from our Well-Architected Framework
1005.339 -> so you can benefit from what all our other customers
1007.741 -> have learned and done.
1009.309 -> And lastly, I'd like to highlight the Amazon Route
1012.546 -> 53 Application Recovery Controller.
1015.182 -> It enables you to control your application recovery
1017.651 -> across multiple AWS Regions, Availability Zones and on-prem.
1022.256 -> It makes recovery simpler and more reliable
1024.825 -> by eliminating the manual steps
1026.093 -> required by traditional tools and processes.
1028.529 -> And I'm excited today to announce a new enhancement
1031.798 -> to the Route 53 Application Recovery Controller.
1035.068 -> We're adding a new feature in preview today called Zonal Shift.
1039.306 -> It's built on Elastic Load Balancer inclusive of ALBs and NLBs.
1044.144 -> During a failure, removing your application in an AZ can be complex.
1048.715 -> It can involve configuration steps across EC2, ELB, Auto Scaling.
1053.387 -> Customers like yourself have been asking us for a simple,
1055.656 -> reliable, and easy-to-use tool
1057.524 -> that can help recover from AZ impairments.
1059.96 -> With Zonal Shift, when you build your applications with ELB
1063.33 -> and have cross-zone traffic disabled, you get a built-in control
1067.301 -> for shifting application traffic away from an AZ with a single action.
1071.205 -> To learn more about this preview, there's a session tomorrow, ARC 329
1075.142 -> at 10:45 AM Pacific, breakout session.
1077.344 -> Please go check it out. It's a pretty cool feature.
1080.48 -> Now I'm excited to introduce our first customer Will Meyer,
1084.351 -> Managing VP of Cloud and Connectivity for Capital One.
1087.588 -> Please join me in welcoming Will Meyer to the stage.
1090.29 -> [music playing]
1096.496 -> Thank you Shaown and good afternoon, everybody.
1098.532 -> This is awesome to see all of you.
1101.101 -> You know, I spend a fair bit of time
1102.369 -> thinking about what are the differences
1104.071 -> between just being on the cloud and really thriving on the cloud.
1108.041 -> And I think true system resilience is one of those things,
1110.777 -> so it's great to be able to talk about it with you.
1115.582 -> If you're not familiar with Capital One,
1116.85 -> we are a financial services institution,
1118.819 -> basically an information business.
1121.488 -> Our success is based on our ability to process data, generate insight
1125.058 -> that we can use to help our customers
1126.927 -> financially, for example, by giving them credit.
1130.497 -> And so, when you think about it being resilient to external change
1133.767 -> and managing risk are baked into our business model,
1136.803 -> not just our tech stacks.
1138.238 -> We've been doing it for a while,
1139.439 -> since before the public cloud was a thing.
1141.675 -> But I think we knew pretty early that it would be an accelerator for us.
1144.845 -> There were Capital One folks on stage at re:Invent in 2015
1148.849 -> talking about our intention to go all-in on AWS.
1152.386 -> We did that.
1153.654 -> We rebuilt and migrated thousands of workloads, petabytes of data.
1158.559 -> We built a security and controls framework
1162.062 -> that was appropriate to the level of trust
1164.031 -> that our customers place in us and also what our regulators demand.
1168.235 -> And in 2020 we closed our last data center.
1170.938 -> We have hundreds of teams that are doing everything from online
1174.408 -> user experience to advanced machine learning,
1177.11 -> call center, back office, all on AWS.
1180.681 -> It has been challenging, but it's also been super fun.
1183.217 -> And AWS has been a tremendous enabler for our teams and for our business.
1189.556 -> We also have some battle scars and I think we have learned along the way
1193.327 -> that we're making a lot of good progress on sane defaults
1195.863 -> but the cloud isn't perfectly plug and play quite yet.
1198.866 -> There is complexity, there are trade-offs everywhere.
1201.935 -> I think we see that when we talk about cost.
1203.704 -> I know many of us are focused on reducing waste in our cloud spend.
1207.14 -> I think we see it in security and compliance.
1209.409 -> We want the right thing to be the easy thing,
1212.145 -> but we are still working on easy.
1214.314 -> And so, as we talk about observability and resilience in general,
1217.184 -> I think it's important to not just look at them
1218.852 -> as a set of isolated patterns,
1220.42 -> but really as an integrated part of your overall approach
1223.39 -> to managing the cloud.
1228.095 -> You know, you all work with the cloud every day.
1229.763 -> I think it's easy to forget just how far the conversation
1232.533 -> about cloud resilience has really come.
1234.568 -> You know, I remember being asked,
1235.869 -> "Hey, what happens when Amazon has a big holiday e-commerce spike,
1239.706 -> and all of a sudden there's no more spare capacity to run on AWS?"
1242.843 -> And you know, it never really worked that way,
1244.545 -> but I think it's a good reminder of just how much has changed.
1247.414 -> AWS is serving incredibly demanding sectors.
1251.018 -> We see really world-class engineering talents
1253.887 -> working on resilience of public cloud infrastructure,
1257.024 -> including through open-source.
1259.226 -> At Capital One, we have had tremendous outcomes.
1262.196 -> We have fewer customer impacting incidents.
1264.631 -> We are faster to recover when we do have them.
1267.201 -> We attribute that largely to the partnership with AWS.
1269.77 -> It has been really powerful,
1271.338 -> and I know many of you have seen the same.
1274.308 -> You also all know, I think that the cloud isn't actually magic
1277.845 -> and you don't get this completely for free.
1280.347 -> Just like when we talk about costs or security or compliance, with resilience,
1284.618 -> AWS gives you incredibly powerful tools and concepts,
1288.088 -> but you need to use those tools
1289.456 -> and you need to embrace those concepts
1291.258 -> and integrate them into your ecosystem and your ways of working.
1294.828 -> And we work hard to do that.
1297.564 -> We run our US businesses in two Regions in east and west,
1300.868 -> multiple AZs in each.
1302.97 -> Our most critical workloads,
1304.104 -> are active-active with latency-based routing.
1307.341 -> We do auto-scale, although honestly, we tend to run a bit overprovisioned,
1311.178 -> partly because we want to be able to fail our entire business,
1314.314 -> all divisions over into one of those Regions within minutes,
1317.417 -> which we do occasionally.
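
The routing idea behind an active-active, latency-based setup with full regional failover can be pictured with a few lines of Python. This is a conceptual sketch only, not how any DNS service implements it; `pick_region` and the Region labels "east" and "west" are assumptions.

```python
from typing import Dict, Optional


def pick_region(latency_ms: Dict[str, float],
                healthy: Dict[str, bool]) -> Optional[str]:
    """Return the healthy Region with the lowest measured latency,
    or None when no Region is healthy."""
    candidates = {region: ms for region, ms in latency_ms.items()
                  if healthy.get(region, False)}
    if not candidates:
        return None
    return min(candidates, key=candidates.get)
```

When both Regions are healthy each user lands on the closer one; marking one Region unhealthy sends all traffic to the other, which is why the surviving Region needs enough capacity to carry the whole business.
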
1319.853 -> And this particular topology does not make sense for everyone,
1323.457 -> but whatever does make sense,
1324.625 -> focus on standardizing the deployment pattern
1327.194 -> that you want your teams to have.
1329.363 -> We have spent a lot of time organizing and defining
1331.365 -> our non-negotiable requirements.
1333.433 -> We've organized them into a couple of different tiers.
1336.036 -> This platinum, this multi-Region,
1339.173 -> multi-AZ active-active version we call our platinum standard.
1343.41 -> And we really hold ourselves accountable to that.
1346.613 -> We've also made a bet on a company-wide deployment pipeline
1349.416 -> that lets us do blue-green deployments
1351.084 -> across Regions consistently and safely.
1352.986 -> That's a big investment, but we have found that it is worth it.
1357.591 -> AWS also talks a lot about powerful architectural concepts;
1361.895 -> you heard about a few of them a minute ago.
1364.231 -> We're also big fans of static stability,
1366.2 -> thinking about how your system behaves in isolation
1369.336 -> so that when everything around it is going haywire,
1371.205 -> you don't actually need to take any action to remain stable.
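
One way to put a number on static stability is to ask how much capacity must already be running in each AZ so that losing an AZ requires no scaling action at all. The helper below is a back-of-the-envelope sketch; the function name and the assumption of evenly spread load are illustrative.

```python
import math


def instances_per_az(peak_instances: int, az_count: int,
                     az_losses_tolerated: int = 1) -> int:
    """Instances to keep running in every AZ so that peak load is still
    covered after losing az_losses_tolerated AZs, with no scaling action."""
    surviving = az_count - az_losses_tolerated
    if surviving < 1:
        raise ValueError("need at least one surviving AZ")
    return math.ceil(peak_instances / surviving)
```

For a peak of 12 instances spread over 3 AZs this gives 6 per AZ, or 18 in total, so a statically stable fleet is deliberately overprovisioned relative to its peak.
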
1374.107 -> I think that's important when you think about how much complexity there
1376.743 -> is in many of these cloud architectures.
1379.279 -> Loose coupling is great; it's how we build evolutionary architecture.
1382.182 -> But with microservices everywhere and event-driven everything,
1385.519 -> these things can be hard to reason about.
1388.188 -> And I think when we look at production incidents
1390.157 -> much more than straight failures,
1391.725 -> what we see are complicated degraded performance,
1395.796 -> partial failures across multiple systems
1398.866 -> interacting in some kind of complex way.
1400.734 -> And I think, by the way,
1401.935 -> that's true in incidents within the clouds as well.
1404.404 -> Those do happen.
1408.041 -> You also need the tools.
1409.776 -> So, I mentioned our deployment pipeline,
1411.411 -> but everything starts with how you effect change
1413.614 -> in the environment that has to be managed,
1415.315 -> and that means with infrastructure as code,
1417.284 -> not with the AWS Console, that's absolutely critical.
1420.387 -> We've also invested in tooling that helps us reason
1423.79 -> about the state of our infrastructure in sort of larger logical units.
1427.728 -> We built kind of a data layer
1429.062 -> and a management plane that helps us do things like coordinate
1432.032 -> those large-scale failures still with appropriate access control.
1435.335 -> I'll make just a quick plug for Cloud Doctor,
1436.97 -> which you can see in our booth.
1439.206 -> We're also really excited about all the investments
1440.841 -> we see coming from AWS.
1442.276 -> A few were just mentioned, Fault Injection Simulator
1444.578 -> is a favorite of ours.
1448.382 -> Lots of amazing tooling continue to be invested in by AWS.
1452.119 -> We also know that no matter how good the tools are,
1455.556 -> things are going to fail.
1456.857 -> And so, we practice for that.
1458.325 -> Yes, doing exercises with tech teams is key,
1460.594 -> but also think about it, cross-department.
1462.829 -> How do you engage customer comms, call center,
1465.699 -> decision makers that you may need if you're going to disable a capability
1468.468 -> in the heat of the moment?
1469.603 -> We think company-wide, organization-wide game days
1472.806 -> can be really powerful for that.
1477.444 -> I'll just admit I couldn't resist an opportunity
1479.446 -> to mention our former president,
1481.114 -> but he sort of said something about software engineering.
1484.518 -> Resilience comes from learning and adaptation.
1489.423 -> Sometimes that learning is in the heat of the moment, right?
1492.226 -> How do you coordinate across multiple teams to debug
1495.762 -> and fix these complicated system
1497.464 -> interactions that we're talking about?
1499.166 -> Collaborative problem solving really is the new normal.
1501.735 -> We can't just all look at our own APIs and contracts
1504.004 -> and say, you know, "Hey, it's not me."
1506.206 -> So, think about how you incentivize that.
1507.641 -> Sometimes the learning is in the follow-ups.
1510.143 -> At Capital One, we talk a lot about blameless postmortems,
1512.646 -> and we spend the time really digging deep when things fail.
1516.383 -> We aren't trying to assign blame,
1517.985 -> but we are trying to find the truth and it really matters a lot.
1520.587 -> And so, invest in that and in particular
1523.09 -> invest in tracking the follow-ups.
1524.725 -> I know AWS talked about that a moment ago.
1526.693 -> That part is important.
1529.162 -> I think at the end of the day,
1530.33 -> we have found that we spend a lot of time
1532.799 -> working on the systems that help us learn and improve.
1536.47 -> And to a large extent, the time we spend looking backward
1539.006 -> is the way we speed up moving forward.
1543.644 -> I want to make a quick side point on serverless.
1546.18 -> I think we at Capital One have made a pretty intentional investment
1549.583 -> in adopting more and more managed services,
1552.586 -> and I think it's interesting to talk about that
1554.121 -> in the context of resilience.
1556.523 -> Over the years, I think we have all embraced distributed DevOps,
1559.493 -> we have shifted left, and you know,
1561.361 -> we might have horizontal SRE teams or platform teams,
1563.931 -> but generally all of our application teams
1566.233 -> have a lot of ownership of their infrastructure.
1569.203 -> And that has a bunch of benefits that I think we all understand.
1573.34 -> It also makes some pretty big asks of those teams:
1576.243 -> be cost efficient, be resilient, patch your vulnerabilities.
1580.38 -> It's not quite the simplicity that the cloud promised.
1582.916 -> I think we also see the need to build internal platforms
1586.153 -> and abstraction layers
1587.354 -> that then help those teams to do all of those things.
1590.257 -> And this is where we think serverless fits in.
1592.659 -> We want to move teams up the stack.
1594.161 -> We want to lean into the shared responsibility model
1596.597 -> and basically get AWS to do as much work
1598.832 -> as possible on our own resilience.
1602.369 -> There is some potential risk with that operationally, sure.
1604.605 -> But I think we see the resilience
1606.206 -> of the managed services to be good and improving.
1609.776 -> People also talk about lock-in,
1610.944 -> but I think we've already seen containers kind of surpass our VMs.
1614.515 -> We see functional event-driven models pretty much everywhere.
1618.085 -> I think in most domains we are building lighter-weight applications
1621.154 -> and we're even starting to talk about being able
1623.357 -> to run them on-prem or at the edge or anywhere in between.
1627.194 -> And so, we think this move up the stack really
1628.896 -> is on the right side of history
1631.131 -> and we want to continue to offload undifferentiated heavy lifting,
1634.635 -> not only to improve our productivity,
1636.537 -> but to improve our resilience over time as well.
1641.341 -> Parting thoughts:
1642.476 -> Yes, do the textbook stuff.
1644.144 -> Start with the Well-Architected Framework,
1645.846 -> but make it your own.
1647.281 -> Be intentional about setting standards
1649.149 -> and expectations for architectural resilience with your teams.
1652.619 -> Educate your teams, track your progress,
1654.988 -> integrate those standards into your tools,
1657.391 -> start all the way at the beginning of the development process,
1659.426 -> build the guardrails that enforce resilience requirements
1662.696 -> right into the infrastructure.
1665.632 -> Interrogate all those dependencies
1667.267 -> and use architectural tools like static stability to mitigate them.
1671.471 -> And then most important, think about how you cultivate
1674.041 -> the mechanisms, to use the AWS word, for learning and response,
1678.345 -> both during emergencies, which you should be practicing for,
1681.148 -> and in the follow-ups that help you improve over time.
1684.384 -> I think like a lot of things in tech,
1685.686 -> resilience isn't just about tech,
1687.454 -> it is also about you as a responsive and resilient organization as well.
1692.826 -> Thanks for your time. I'll hand you back to Shaown.
1695.062 -> [music playing]
1701.802 -> Thank you, Will.
1703.036 -> Some really great learning out there
1704.404 -> that people can start applying to their workloads
1706.406 -> and implementing immediately.
1707.708 -> I especially love the little bit on serverless at the end.
1710.41 -> I gave a shout out to it earlier.
1712.513 -> It is such a good way to start to reduce your operational burden
1715.282 -> and put it in someone else's, i.e., our, hands to some extent.
1718.785 -> A founding member of the Amazon EC2 team put it really nicely.
1722.656 -> "You can't legislate against failure.
1724.992 -> Focus on fast detection and response."
1727.427 -> We all know there are going to be failures.
1729.73 -> So, this is where observability gives you the ability
1732.466 -> to efficiently detect, investigate, and respond.
1736.57 -> Often customers don't detect issues as soon as they begin.
1739.006 -> There's a lag from when the issue starts to when you find it.
1742.676 -> You can respond to failures quicker if you alert near the source.
1746.613 -> Investigation is where people spend the most amount of time
1750.35 -> during an operational event.
1751.685 -> This is the largest contributor to downtime.
1754.154 -> I mentioned that incident earlier, seven hours to investigate.
1758.292 -> Leverage logs, metrics, and tracing to help you investigate
1762.362 -> quickly and understand the root cause.
1764.164 -> Your time is valuable.
1765.933 -> Focus on the stuff that matters during an operational event.
1768.902 -> There's nothing worse than trying to fix something
1771.104 -> and making the situation worse.
1773.24 -> And I mentioned our CoE process earlier.
1775.509 -> Make sure you conduct a post-event analysis
1777.611 -> to help you determine how you could have prevented this.
1779.98 -> It will probably happen again if you don't.
1782.449 -> Your goal should be to ensure that doesn't happen.
1784.585 -> You never have repetition in these errors,
1786.753 -> and if it does, you know how to identify it faster
1789.69 -> and remediate it automatically.
1792.626 -> Our philosophy of monitoring is to ensure
1794.728 -> we measure the things customers care about
1797.564 -> and measure them from multiple perspectives.
1799.766 -> We want to continuously introspect those metrics and question them.
1804.705 -> This is all to understand the customer experience.
1807.407 -> Instrumentation allows us to learn about our system,
1810.21 -> give operators real-time feedback
1812.446 -> on how the system is operating and feed data into alarms.
1815.949 -> This helps us detect and respond to events when they happen.
1818.886 -> We need to make sure we're monitoring the right things
1821.221 -> and asking the right questions.
1823.857 -> We want to ask these questions all about our systems.
1827.127 -> Why is it operating that way?
1828.262 -> You need to add instrumentation, go right to the top of the stack
1831.832 -> and figure out what's going on.
1833.367 -> The instrumentation produces logs, metrics and traces.
1836.77 -> We use alarms and dashboards to analyze those,
1840.007 -> and then we ask more questions.
1842.442 -> Why did that thing go into alarm?
1844.144 -> What was actually going on when I saw this spike in the dashboard?
1847.047 -> And to answer that, we need more instrumentation
1849.483 -> and we go through the same circle over and over again
1852.352 -> and it improves operations and more importantly,
1854.955 -> it improves our end customer experience.
1857.624 -> This is the virtuous cycle of monitoring.
1861.361 -> Think about a real-world example.
1862.896 -> A service that customers are calling through a Load Balancer.
1865.832 -> So, you get a call trying to get some product info,
1868.635 -> it goes and hits the load balancer.
1871.004 -> Questions you want to ask: What product are we looking up?
1873.473 -> Who called the API?
1874.575 -> What was this end customer?
1875.843 -> What type was it, coming through a website,
1877.911 -> coming through some other area?
1879.78 -> Did we find the item in our local cache
1882.049 -> or did we have to punch out and get to a remote cache?
1884.885 -> How long did it take to read from the cache?
1887.154 -> How full is the local cache?
1889.056 -> How long did the query take?
1891.124 -> You went out to a remote database, perhaps; did the query succeed?
1894.862 -> And how long did it take to go back and populate the caches?
1897.531 -> Was the cache full?
1898.732 -> Did you have to evict items?
1900.467 -> And how big was that product info object that you went and fetched?
1903.737 -> And what was the response code from the server?
1906.473 -> And finally, what was the latency?
1908.876 -> If I was operating this service in production,
1911.512 -> I would need so much instrumentation in this code
1913.947 -> to be able to understand its behavior in production.
1915.916 -> That's a good thing.
1917.05 -> I need the ability to troubleshoot failed requests, slow requests,
1920.487 -> I want to monitor for trends, signs that different dependencies
1923.79 -> are under-scaled or misbehaving, there's a lot there.
1926.693 -> Don't oversimplify.
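The questions in the product-info example above can be sketched as one structured measurement record per request. This is a minimal illustration, not the instrumentation AWS uses: the function name `get_product_info`, the record field names, and the dictionaries standing in for the local cache and the remote database are all assumptions made for the example.

```python
import json
import time


def get_product_info(product_id, caller, local_cache, remote_db):
    """Look up a product and return (value, record), where record is one
    structured measurement describing how the request was served."""
    start = time.perf_counter()
    # Who called, and what product are we looking up?
    record = {"product_id": product_id, "caller": caller}

    # Did we find the item in the local cache, and how full is the cache?
    value = local_cache.get(product_id)
    record["cache_hit"] = value is not None
    record["cache_size"] = len(local_cache)

    if value is None:
        # Cache miss: time the (hypothetical) remote query and record its outcome.
        query_start = time.perf_counter()
        value = remote_db.get(product_id)  # a dict stands in for a remote database
        record["query_ms"] = (time.perf_counter() - query_start) * 1000
        record["query_succeeded"] = value is not None
        if value is not None:
            local_cache[product_id] = value  # populate the cache for next time

    # Response code, response size, and end-to-end latency.
    record["status"] = 200 if value is not None else 404
    record["response_bytes"] = len(json.dumps(value)) if value is not None else 0
    record["latency_ms"] = (time.perf_counter() - start) * 1000
    return value, record


if __name__ == "__main__":
    cache = {}
    db = {"p-1": {"name": "widget", "price": 9.99}}
    for pid in ("p-1", "p-1", "p-2"):
        _, rec = get_product_info(pid, "website", cache, db)
        print(json.dumps(rec))  # one log line per request
```

Emitting one record per request keeps every question answerable after the fact; metrics and alarms can then be derived from the same records.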
1929.463 -> And we have a couple types of metrics we categorize.
1932.733 -> First, health metrics.
1934.368 -> Am I failing?
1935.569 -> It doesn't answer the question, why am I failing?
1938.405 -> Health metrics, the alarms are there to alert you to issues.
1941.375 -> And then the diagnostic metrics.
1943.243 -> What's the value of this thing I measured?
1945.345 -> Why isn't my system working?
1947.648 -> These both fall into three essential categories.
1950.784 -> The customer experience metrics are the ones that let you detect
1953.687 -> that your customer has a problem
1955.389 -> and the service is not responding to them.
1957.491 -> Once you've found a problem, you can use impact assessment metrics
1961.495 -> to measure the number and percentage of customers,
1963.797 -> resources, workloads impacted,
1966.567 -> and then you can use operational health metrics
1969.536 -> to determine why the impact is occurring.
1972.339 -> When the why is discovered, responders
1974.608 -> and automation can take action to go and resolve the event.
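An impact assessment metric of the kind described above, the number and percentage of customers impacted, could be computed roughly as follows. This is a sketch under assumptions: the request shape (dictionaries with `customer` and `ok` keys) and the function name `assess_impact` are hypothetical and not taken from the talk.

```python
def assess_impact(requests):
    """Summarize customer impact from a batch of request outcomes.

    requests: list of dicts with 'customer' and 'ok' keys (hypothetical shape).
    Returns (impacted_customers, total_customers, impacted_percent).
    """
    customers = {r["customer"] for r in requests}
    # A customer counts as impacted if at least one of their requests failed.
    impacted = {r["customer"] for r in requests if not r["ok"]}
    percent = 100.0 * len(impacted) / len(customers) if customers else 0.0
    return len(impacted), len(customers), percent


if __name__ == "__main__":
    sample = [
        {"customer": "a", "ok": True},
        {"customer": "b", "ok": False},
        {"customer": "b", "ok": True},
        {"customer": "c", "ok": True},
        {"customer": "d", "ok": True},
    ]
    print(assess_impact(sample))
```

Counting distinct customers rather than failed requests keeps one noisy caller from overstating how widespread the impact is.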
1978.979 -> There are three commonly agreed upon pillars of observability.
1982.916 -> It's metrics, logs, and traces.
1985.385 -> Metrics are the numeric data measured at various time intervals,
1989.189 -> request rates, error rates, durations, CPU percentage.
1993.026 -> Logs are your timestamped records of distinct events
1995.362 -> that occurred within an application
1996.697 -> or a system, such as a failure, an error,
1998.966 -> or just a state transformation.
2000.734 -> And traces represent a single user's journey
2002.87 -> across multiple applications and systems,
2005.372 -> usually with microservices, great modern architecture.
2010.11 -> We have a broad suite of observability capabilities.
2012.679 -> We have native services that integrate deeply
2015.349 -> with our AWS services.
2017.217 -> We have Container Insights and Lambda Insights.
2018.986 -> You heard about some new ones being launched
2020.487 -> this morning from Adam.
2021.889 -> We have open-source tools: Managed Grafana,
2026.66 -> Managed Prometheus, big fan.
2028.729 -> We give you lots of options, lots of choices.
2030.864 -> You can pick the right tool for your workload.
2034.334 -> With that, I am super excited to introduce Kym Weiland,
2038.672 -> Vice President of Enterprise
2039.873 -> Operations at FINRA, who's going to talk more
2041.909 -> about the importance of observability.
2043.81 -> Please welcome Kym.
2045.379 -> [music playing]
2054.688 -> Good afternoon.
2055.822 -> FINRA.
2056.957 -> FINRA is the Financial Industry Regulatory Authority
2061.094 -> and we help play a critical role
2063.931 -> in ensuring the integrity of America's financial system.
2067.701 -> We write and enforce rules governing
2069.736 -> the ethical activity of brokers in the United States,
2073.273 -> we examine firms for compliance for those rules,
2076.91 -> we foster market transparency and we educate investors.
2082.516 -> In addition, we also do big data, lots of big data.
2087.888 -> FINRA has processed peak volumes
2089.656 -> of over 600 billion market events per day.
2092.693 -> We do this in support of 24 exchanges,
2095.762 -> and we have had to run upwards of 300,000 compute nodes
2099.7 -> in a single day.
2101.702 -> Currently, we have a storage footprint of about 500 petabytes.
2108.375 -> We did not get here overnight.
2110.978 -> We started in early 2014, 2015,
2113.18 -> and by 2016 we were doing about 39 billion transactions.
2117.451 -> We had about a hundred RDS Instances.
2120.32 -> That grew by 2019 where we were doing about 190 billion transactions
2125.792 -> and we were up to about 500 databases at that point
2129.496 -> with about a 50 petabyte storage footprint.
2133.5 -> But between 2019 and 2022, there was unprecedented market volatility.
2139.273 -> And we have grown to an average data
2141.308 -> intake of over 450 billion events per day,
2147.381 -> over 500 petabyte storage footprint, and we have over 1200 databases.
2154.421 -> For 2026, what does the production look like?
2157.991 -> Simply bigger.
2159.86 -> So, the question is, how did we get here?
2163.33 -> We got here by starting
2164.565 -> with some very fundamental architectural principles.
2168.635 -> We are a data-centric organization
2171.271 -> and therefore, it started with our data management.
2174.107 -> And this is more than just storage.
2176.944 -> It includes data registration, data lineage, data lifecycle,
2180.814 -> including security data classifications.
2183.984 -> We wanted to ensure that we took advantage of cloud elasticity.
2188.355 -> Everything from serverless to flexibility
2191.525 -> with instance types, flexibility with storage types,
2195.562 -> and again, making sure that it was done in such a way
2198.098 -> that we could do on-demand scalability.
2201.568 -> From an architecture perspective,
2203.804 -> you have to start with a core framework of services.
2207.407 -> All of our application teams understand
2209.176 -> the supported blueprints, there's centralized messaging,
2213.18 -> we are API-focused and we prefer open source.
2218.018 -> But security is also a key part of that architectural principle.
2223.423 -> And it's not just encryption. Configuration, authorizations
2229.062 -> and observability, in terms of it being auditable.
2233.834 -> DevOps: infrastructure as code, that is core,
2238.138 -> including your CI/CD and all of the automated testing
2241.909 -> that needs to happen in that area.
2244.645 -> Operations needs to be resilient and again, automated.
2249.483 -> It also needs to be performant
2251.351 -> and you have to have that enterprise observability
2254.188 -> in order to ensure all of those are balanced.
2257.624 -> The output of all of that data should be done
2260.093 -> in such a way that compliance is also done as code
2263.864 -> and you can do analytics in order to show it.
2268.468 -> To manage this is a balancing act.
2270.07 -> It's a balancing act between innovation,
2272.072 -> optimization and adoption.
2275.242 -> Innovation.
2276.844 -> AWS has new services all the time.
2280.848 -> In order to do that, you have to have service evaluations
2283.35 -> on how you integrate those services into your architecture,
2287.487 -> into your observability.
2289.556 -> You can do that through R&D, through POC.
2292.693 -> We also are very active in preview participation
2296.196 -> and it allows us to give informed feedback to those services.
2300.701 -> For optimization of workloads, it is observability of capacity,
2304.705 -> but also those cost efficiencies in order to get the most efficient,
2309.176 -> cost-effective workloads.
2311.745 -> And then adoption: delivery focused automation.
2316.984 -> Putting that automation in the hands of the teams
2319.486 -> with the provisioning guardrails
2321.555 -> so that they can do the right thing the easiest way.
2325.626 -> But that automation has to be handled at scale.
2330.13 -> So, again, observability.
2331.365 -> Monitoring.
2332.432 -> Everyone is familiar with monitoring.
2334.001 -> That's what they think of with observability,
2336.37 -> but it is far more than that.
2337.604 -> It is compliance, everything down to records management,
2342.176 -> oversight, policy enforcement.
2345.145 -> It is security.
2346.48 -> Yes, it is auditable, but it's also do you have those standards
2350.317 -> in your architectural principles for security logging,
2353.153 -> for security control, operational scorecards.
2358.358 -> These are not just testing,
2360.093 -> but what is your resilience posture of your cloud environment?
2364.464 -> Can you measure it?
2365.732 -> Can you view it?
2368.202 -> Everything including AWS Trusted Advisor,
2371.071 -> great for cost optimization.
2373.874 -> And then application health status, the highest layer,
2378.912 -> and this is important not only for those core connectivities,
2383.25 -> is your application up?
2385.052 -> But also what are the dependencies of those applications,
2388.255 -> either direct dependencies or indirect dependencies.
2393.861 -> About three years ago we embarked on a strategic vision
2397.03 -> to do the multi-Region disaster recovery.
2401.268 -> And we did so basing it on some core services,
2405.439 -> Amazon S3 with the Cross-Region Replication for data,
2410.077 -> Aurora Global Database, KMS multi-Region keys,
2415.449 -> and the DynamoDB Global Tables specifically around parameter store.
2420.387 -> Combining those again to make a specific strategic vision
2425.158 -> for what that architecture would look like
2427.928 -> and it continues to grow.
2429.63 -> We continue to infuse resilience
2432.165 -> through those architecture and services.
2435.068 -> You always want to constantly learn and evaluate opportunities
2438.939 -> for efficiencies between those Regions
2442.142 -> while of course ensuring data encryption, data security,
2445.779 -> and your data durability.
2450.083 -> So, observability specifically for resilience.
2454.254 -> We have annual tests, sometimes more than that of the multi-Region.
2458.792 -> So, we created what we call FINRA canary,
2461.395 -> which of course reports into our birdcage
2463.964 -> in order to check the health of the systems in the multiple Regions,
2469.336 -> but not just the system itself or its core dependencies,
2474.775 -> but also its downstream dependencies.
2477.544 -> So, it cares about itself and it cares about its friends.
2481.515 -> To do this, we've implemented a simple red light,
2484.885 -> yellow light, green light, implementation
2489.056 -> and you can look at them across regions and go,
2491.491 -> my app is good, my app is good, but my friends aren't so happy.
2495.596 -> And that may be expected.
2497.064 -> Not every app's position in disaster recovery
2499.967 -> may be up at any given time,
2502.302 -> but knowing where that is, and when both your app
2505.806 -> becomes happy and your friends
2508.108 -> and all your neighbors that you depend on,
2510.31 -> that's when you go green.
2512.846 -> And that allows a level of observability
2515.516 -> to be able to be seen across the enterprise
2518.785 -> and at scale and across Regions.
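The red light, yellow light, green light idea described above can be sketched as a small status rollup over an application and its direct and indirect dependencies. This is an illustration only, not FINRA's canary: the function name `rollup_status`, the input dictionaries, and the specific rule (red when the app itself is unhealthy, yellow when the app is healthy but any dependency is not, green otherwise) are assumptions made for the example.

```python
def rollup_status(app, healthy, depends_on):
    """Return 'red', 'yellow', or 'green' for one app in one Region.

    healthy:    dict of app name -> bool (the app's own health check result)
    depends_on: dict of app name -> list of direct dependency names
    Unknown apps are treated as unhealthy (an assumption of this sketch).
    """
    if not healthy.get(app, False):
        return "red"  # the app itself is not happy

    # Walk direct and indirect dependencies, guarding against cycles.
    seen = {app}
    stack = list(depends_on.get(app, []))
    while stack:
        dep = stack.pop()
        if dep in seen:
            continue
        seen.add(dep)
        if not healthy.get(dep, False):
            return "yellow"  # the app is fine, but a "friend" is not
        stack.extend(depends_on.get(dep, []))
    return "green"  # the app and everything it depends on are healthy


if __name__ == "__main__":
    health = {"orders": True, "pricing": True, "catalog": False}
    deps = {"orders": ["pricing"], "pricing": ["catalog"]}
    for name in health:
        print(name, rollup_status(name, health, deps))
```

Running the same rollup per Region gives the side-by-side view the talk describes, where an app can be green in one Region and yellow in another.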
2520.854 -> It's gone?
2526.627 -> There we go.
2528.128 -> So, I will leave you with some parting thoughts
2530.597 -> about managing this at scale.
2532.933 -> Again, integrated security, hands-off operations.
2537.871 -> Infrastructure as code.
2540.474 -> You are not going to go on a server and fix something
2544.645 -> because that server may not be there tomorrow.
2546.98 -> It may be a different server.
2548.282 -> So, make sure it is a hands-off operation.
2551.585 -> Automation, automation of onboarding to your fleet
2555.556 -> and make sure that it is self-reporting
2557.391 -> and the audit trails are there,
2560.027 -> but most importantly it is delivery focused,
2564.097 -> self-service automation that those teams can use
2568.535 -> to make the right thing, the easy thing.
2572.105 -> With that, I will pass the stage back to Shaown.
2575.108 -> [music playing]
2582.616 -> Thank you, Kym.
2583.717 -> You know, as you heard Kym talk about the importance
2587.054 -> of maintaining observability.
2589.156 -> Resilience is not a one and done thing or a checklist project.
2592.492 -> You must continuously invest in improving your resilience,
2595.028 -> which is what we're going to close with today.
2596.964 -> By the way, I have to give her a shout out
2598.398 -> for calling dependencies your friends.
2600.934 -> They're your friends until they break
2603.103 -> and then you're quite angry at those dependencies.
2604.838 -> But it's nice when they can be your friends.
2608.175 -> Part of continuous improvement
2609.877 -> is making sure that every workload is reviewed
2611.979 -> using the AWS Well-Architected Framework.
2614.381 -> Not just once, but every time
2616.416 -> you make a significant change to your workload.
2618.719 -> The Well-Architected Framework is a set of questions,
2620.888 -> design principles that enables you to build and deploy faster
2624.591 -> to release value more often.
2626.193 -> Understand where you have risks in your architecture.
2628.862 -> You want to intentionally know about those risks
2632.165 -> and ensure that you've made intentional architectural decisions
2637.738 -> that highlight how they will impact business outcomes.
2640.507 -> And you want to make sure your teams know all
2642.242 -> about the best practices we've learned
2644.211 -> from reviewing thousands of customers' architectures on AWS.
2648.649 -> The reliability pillar of Well-Architected outlines
2651.318 -> some key design principles.
2653.22 -> Automatically recovering from failure.
2654.821 -> Think back to Will's examples at Capital One.
2657.491 -> You want to test recovery procedures.
2659.092 -> We talked about this earlier,
2660.561 -> the importance of chaos engineering and continuous testing.
2663.73 -> You want to be able to scale horizontally
2665.766 -> to increase workload availability and plan for your capacity needs.
2670.771 -> Right-size your instances and resources
2672.673 -> to match your workload's needs.
2674.107 -> You heard Will talk a little about being over-provisioned.
2676.81 -> Getting close to right will also, as a bonus,
2679.246 -> help you optimize your costs, something we all care about.
2681.982 -> And manage change and automation.
2686.587 -> Performing operations is---
2689.256 -> Operational excellence principles are something
2691.558 -> that are also really important.
2692.826 -> It's the second pillar in Well-Architected.
2694.928 -> The first one to think about is performing operations as code.
2698.465 -> You can define your entire workload,
2700.467 -> applications, infrastructure as code and update it with code.
2704.404 -> That's the game changer
2705.639 -> if you get to that level of automation.
2707.574 -> You can design workloads to allow components
2709.409 -> to be updated regularly to increase
2711.245 -> the flow of beneficial changes into your workloads,
2714.515 -> and as you use your operations procedures,
2716.884 -> look for opportunities to improve them.
2719.353 -> Use proactive threat modeling to identify potential sources of failure
2723.557 -> so they can be removed or mitigated ahead of time.
2726.193 -> And finally, drive improvement through lessons
2728.428 -> learned from all operational events.
2730.397 -> That Correction of Error process I keep coming through -- back to.
2733.567 -> It's something that people forget about.
2735.469 -> They get worried about blame.
2736.937 -> Do not think about blame, think about the learning.
2740.44 -> One of our favorite mechanisms for doing this is game days.
2743.443 -> After your design for resilience is in place,
2746.113 -> you want to make sure that it works in production.
2748.515 -> And a game day is a way to ensure that everything works as planned.
2751.685 -> Use game days to regularly exercise
2753.654 -> your procedures for responding to events and failures
2756.256 -> as close to production as possible;
2758.458 -> this includes in production when possible.
2760.727 -> And you want to use the people who will be
2762.563 -> actually involved in failure scenarios.
2764.631 -> It's a very common mistake to have a paper exercise
2767.568 -> and none of the folks who are actually on call at 3:00 AM
2769.837 -> are involved.
2771.004 -> Game days should simulate a failure or an event to test systems,
2774.842 -> processes and team responses.
2776.944 -> The purpose is to perform the actions the team would perform
2779.746 -> as if that event happened.
2781.415 -> And you have to do this regularly.
2782.816 -> You want muscle memory on how to respond.
2784.818 -> You don't want someone reading the manual.
2786.92 -> I'm going to say it again: at 3:00 AM at night.
2788.455 -> It's always 3:00 AM somewhere, that's for sure, and avoid that.
2793.694 -> Infrastructure event management.
2795.295 -> This is something we offer to our large enterprise support customers.
2798.832 -> It's available to you if you're an enterprise support customer.
2801.268 -> It's focused on planning and support for business-critical events.
2805.706 -> Think about steps you might take to prepare your workload
2807.975 -> to handle a 10x traffic increase due to a product launch.
2811.512 -> Like our latest Kindle, I went and ordered one.
2813.28 -> I'm sort of excited about writing on a Kindle.
2816.016 -> Customer onboarding, it could be tied to an ad campaign,
2818.619 -> anything that's going to happen
2819.887 -> that's going to cause a large change in how you operate.
2823.257 -> Our enterprise support team can work with you
2825.392 -> over a timeline of several weeks
2826.727 -> to figure out how your infrastructure's configured,
2829.296 -> make sure service limits are right,
2832.099 -> make sure auto scaling groups are in place,
2833.867 -> look at your Load Balancers.
2835.502 -> They'll also raise awareness internally about your event
2838.438 -> and prepare support engineers to be ready
2840.274 -> to tackle any issues that come up.
2842.376 -> We use this for events like Prime Day internally.
2845.045 -> We can do it with you as well.
2847.281 -> And for those really exceptional cases,
2850.117 -> I'm thinking about the Cyber Monday this week
2851.818 -> for those retailers out there, we offer joint mission control.
2855.255 -> This is just like NASA, right, going back to where we started,
2858.559 -> think about an all-hands-on-deck event
2860.427 -> to make sure the right people are potentially in the room with you.
2864.031 -> During Prime Day, we have hundreds of engineers who come together,
2866.7 -> both virtually and in person to prepare for the worst-case
2870.17 -> scenario while being ready and hoping for the best.
2874.741 -> Finally, I want to give you a couple resources.
2877.644 -> These are things you can reach out to get more information.
2880.08 -> Three really good things here.
2881.782 -> First, the Well-Architected Framework;
2883.183 -> we have a set of pages that is full of information.
2886.053 -> Everyone should have checked out the Well-Architected Framework
2887.921 -> if you're running in AWS.
2889.69 -> Second, a brand-new whitepaper published
2892.292 -> by one of our principal SAs,
2893.493 -> Mike Haken, on fault isolation boundaries, really compelling.
2897.397 -> I saw Corey Quinn tweeting about it the day it came out.
2899.8 -> I was on vacation.
2901.101 -> Got me excited that it got published.
2902.636 -> Really good information there.
2904.037 -> And finally, our AWS Solutions Library,
2906.874 -> which is full of Solutions for all purposes.
2909.243 -> In this case, we've just launched the resilience guidance.
2912.88 -> We have backup and restore, failover and failback and many more.
2916.683 -> Please check out the Solutions Library.
2920.654 -> All that said, because resilience is continuous,
2923.757 -> you have to think about it as a journey instead of a destination.
2926.76 -> You are never quite done.
2928.629 -> We realized that all of you,
2929.93 -> our customers, are in different spots in that resilience journey.
2933.233 -> Some of you are just starting out.
2935.035 -> Some of you are further along.
2936.47 -> We heard from Will, we heard from Kym;
2938.038 -> they're pretty far along in that journey.
2939.84 -> But we have one thing all of us have in common.
2942.442 -> We're all builders on this journey.
2944.044 -> Whether you're an executive building a business strategy around resilience
2947.08 -> or a developer building resilience
2949.016 -> through the application or part of a cloud CoE
2951.051 -> or DevOps team building guardrails out,
2954.321 -> AWS can provide you with the right guidance,
2956.49 -> the services and infrastructure to enable your success.
2959.293 -> I want to thank all of you for spending this hour with us,
2962.563 -> from me and Francessca, and please have a great re:Invent.
2965.699 -> [applause]

Source: https://www.youtube.com/watch?v=GamnNc6ZMew