AWS re:Invent 2022 - Building modern apps: Architecting for observability & resilience (ARC217-L)

Aug 16, 2023

Cloud computing is transforming architecture design and application delivery at organizations across the world. As cloud architectures evolve, new design patterns are essential. Architecting for resilience, observability, security, and emerging trends serves as the foundation that empowers builders to innovate, optimize their workloads, and scale adoption over time. In this session, hear from AWS customers and cloud experts about their proven architecture best practices, tools, and blueprints for architecting for reliability, observability, and modernization on AWS.

Learn more at: https://go.aws/3ucTJpm

Content

0.234 -> [music playing]

2.503 -> Please welcome Vice President, Technology

5.339 -> and Customer Solutions, AWS, Francessca Vasquez.

9.676 -> [music playing]

17.718 -> Welcome everyone. Welcome to re:Invent 2022.

22.656 -> I can't hear you all. Welcome to re:Invent 2022!

26.96 -> My name is Francessca Vasquez and I have the honor and privilege

30.964 -> to lead AWS's Solutions Architecture and Customer Solutions organization.

36.436 -> I also have the opportunity to lead our AWS

40.307 -> Resilience Customer and Partner program

43.644 -> and I am so excited to have you here today.

47.714 -> On behalf of the AWS Architecture Organization

52.519 -> and our broader community, we'd like to dedicate this session

57.724 -> on building modern applications through the lens of observability

61.762 -> and resilience to David Grimm.

64.998 -> He was a solutions architect, a technologist,

68.468 -> and an amazing thought leader that we lost here recently.

73.207 -> So, about a year ago, on December 25th,

78.512 -> we witnessed one of the largest innovations

82.049 -> in human history go into production.

85.185 -> Over 29,000 engineers and scientists worked on this product

90.624 -> and this project that would impact the entire human race.

96.063 -> And this innovation was arguably one of the largest resilience

102.069 -> and observability initiatives ever taken.

107.708 -> What I'm referring to is the James Webb Telescope,

111.512 -> which you heard referenced earlier in Adam's keynote.

115.148 -> It is the largest, most powerful telescope ever built.

119.987 -> It will allow scientists to look at what our universe was like,

124.124 -> about 200 million years after the Big Bang.

128.595 -> Just incredible innovation.

132.599 -> The James Webb Telescope has been under development for over 20 years.

138.839 -> $10 billion budget for the program,

141.508 -> which was up from the original $1 billion forecasted in 2002.

147.181 -> And as the team was doing their own resilience testing,

151.518 -> they found that during a vibration test

155.856 -> that the screws holding the actual sun shield together failed

160.794 -> and they also popped loose.

163.263 -> This led to another 10 months

165.265 -> and an additional $800 million in costs.

168.535 -> This testing was done to simulate the actual testing

172.406 -> of riding an Ariane 5 rocket to space.

176.71 -> This telescope had to actually travel a million miles

181.081 -> and then open up remotely.

183.684 -> And did I mention, there's only one of these?

187.688 -> NASA actually had to build the necessary observability

192.092 -> and resilience capabilities into one device.

196.163 -> They had to build this system to be able to withstand

200.234 -> the micro meteor strikes and temperature swings,

204.404 -> that ranged anywhere from 230 degrees

208.075 -> Fahrenheit to negative 394 degrees Fahrenheit.

212.613 -> And it took about a month to get to its final location,

216.884 -> which is about a million miles from Earth.

221.421 -> And when I tell you this, this space telescope, it's really big.

227.728 -> It represents about a three-story building.

231.698 -> Its width and length are equal to that of a tennis court.

236.603 -> And to actually get this thing to launch,

239.273 -> you actually fold it up before the rocket can go off.

243.677 -> And finally, the team at NASA and our European Space Agency,

249.116 -> they had to create the tools to test and overcome more

253.22 -> than 300 single points of failure.

258.158 -> They had to create the necessary observability

260.727 -> tooling to actually make this possible.

263.564 -> And the photo that you happen to see up here is of the Tarantula Nebula,

269.036 -> which is 161,000 light-years away.

276.076 -> Today, we have an action-packed agenda.

280.814 -> We'll be on a very deep exploration

283.917 -> on how we think about resilience and observability

287.554 -> with the same mission critical application lens,

291.525 -> as a NASA space telescope.

294.228 -> We're also going to hear from two amazing customers

297.965 -> who will be sharing their best practices

300.634 -> on how to build and design modern applications with resiliency in mind.

305.839 -> Finally, we're going to close off the day today with strategies

310.31 -> and resources to help you all think about continuous improvement.

318.252 -> Now, before we begin, I want to level set everyone

323.19 -> on some important definitions just so we're on the same mental model.

327.528 -> First off, how we think about observability.

330.764 -> It really describes how well you can understand

333.433 -> what is happening with your system,

335.335 -> often by instrumenting it to collect metrics, logs,

339.006 -> and traces. Resilience, the ability of a workload

343.911 -> to recover from infrastructure, service or application disruptions.

349.483 -> Both observability and resilience

352.619 -> are a critical and foundational element

356.023 -> to AWS's Well-Architected Framework.

359.293 -> They specifically live in the reliability

362.896 -> and operational excellence pillars.

365.132 -> So, two of the six pillars of an architectural framework that we use.

369.636 -> Similar to NASA's approach with the Webb Telescope, at AWS

374.975 -> we want to partner with all of you

377.11 -> to help you automatically be able to recover from failures,

381.281 -> to build and test recovery procedures

384.351 -> to help you scale and to help implement changes

388.522 -> using ongoing automation so that we can reduce manual errors.

393.66 -> So, to cover off on our building applications and resiliency,

398.732 -> please give a warm welcome to my colleague Shaown Nandi,

402.736 -> who is the Director of Solutions Architecture

405.038 -> and Customer Success for our strategic customers.

408.008 -> Thank you so much. Welcome to re:Invent!

411.078 -> [music playing]

419.486 -> Thank you Francessca.

420.854 -> What an incredible story to hear about the telescope.

423.39 -> It is great to be here with all of you today.

425.626 -> I see customers, I see friends in the audience.

428.762 -> Thank you for taking time,

429.997 -> especially with all the exciting World Cup action.

432.199 -> I hope there's some football, American soccer

434.268 -> fans in the audience, maybe.

435.869 -> Yes.

439.206 -> For those tuning in via the stream,

441.308 -> please thank you for joining us virtually.

443.844 -> In business, the resilience of workloads is critical.

446.847 -> Today we're going to hear from two customers

448.482 -> from FINRA and Capital One,

450.35 -> and we'll walk through how we work with all our customers

453.053 -> to ensure their workloads are Well-Architected

455.022 -> and resilient on AWS.

456.657 -> So, let's dive right in.

458.926 -> What is resilience?

460.427 -> Resilience refers to the ability for workloads,

463.23 -> workloads are your applications, your products,

466.2 -> your business processes, to respond and quickly recover from failure.

470.771 -> A workload can be as simple as a single application

473.34 -> running in a single AWS account,

475.242 -> or it might be a set of products that span multiple accounts.

478.445 -> There are three mental models I want you to consider

481.315 -> as we go through today's presentation.

483.45 -> Think about how you can build systems to be highly available

486.82 -> with resistance to common failure modes,

489.156 -> how to recover your system

490.757 -> if you run into one of these rare failure scenarios.

493.927 -> And underpinning all of this is the idea of continuous resilience.

498.065 -> This is where you're implementing DevOps practices like CI/CD

500.934 -> to automate your delivery pipelines.

503.604 -> Introducing failure on an ongoing basis

506.039 -> to test your system chain and test your teams for weaknesses

510.377 -> and implementing ongoing observability

512.579 -> and monitoring practices.

515.449 -> As with security and sustainability,

518.619 -> resilience is a shared responsibility.

520.921 -> AWS is responsible for resilience of the cloud.

524.424 -> You, our customers, are responsible for resilience

527.394 -> of your workloads in the cloud.

530.163 -> This shared model helps relieve your operational burdens,

532.866 -> the customer's operational burdens, as AWS operates,

536.27 -> manages, controls, the host operating system,

539.072 -> the virtualization layer, all of the infrastructure.

542.242 -> We're responsible for all the services

544.144 -> that are offered in the AWS Cloud.

546.78 -> Customer responsibility is determined by the services you select.

550.684 -> You have to carefully consider

552.386 -> which services you choose as responsibilities will vary

555.289 -> depending on how they integrate and what regulatory frameworks apply.

558.959 -> And often you'll hear us

560.527 -> recommending using our higher-level services.

562.863 -> You heard about the serverless announcements

564.264 -> this morning at Adam's keynote,

565.933 -> those tend to have less operational responsibilities that sit with you.

569.837 -> As a solution architecture leader,

571.371 -> my teams partner with all of you

573.006 -> to make these decisions. You're not alone in figuring this out.

578.045 -> AWS from day one has built resilience into our culture.

582.216 -> We use a service ownership model internally,

585.419 -> which incentivizes our teams to continuously

587.621 -> improve their operations.

589.189 -> We've organized our engineering and product management

591.592 -> efforts into small multi-disciplinary teams.

594.394 -> We call them two-pizza teams.

595.829 -> Most of you heard this term.

597.264 -> We like it 'cause we think those teams can be fed by two pizzas,

600.667 -> depends I guess, on the size of the pizzas.

602.736 -> And these teams own a service end-to-end.

605.706 -> What that means is the ownership is not

608.842 -> just of designing and launching their service,

611.111 -> but you have to operate it during production,

613.714 -> be on calls for issues as they arise.

616.817 -> For customers who use this structure, it's a major cultural shift.

620.42 -> The idea that your responsibility for the service never really ends.

624.958 -> In AWS all new services are reviewed for launch

628.495 -> using an operational readiness review process we call ORR.

631.798 -> It's basically a set of questions,

634.067 -> a checklist that uses known best practices and a standardized runbook.

638.038 -> When we roll out new services or update existing ones,

640.908 -> we use safe continuous deployment pipelines

644.011 -> that automate pre-production testing,

646.146 -> support automatic rollbacks and stagger deployments.

649.016 -> When we launch our services or even add features, we start small.

652.586 -> We start with a single instance,

654.354 -> go across an AZ, roll out across multiple AZs, finish a Region,

658.525 -> and then ultimately roll to our other Regions.
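
[Editor's sketch] The staggered rollout described here (one instance, one AZ, more AZs, a Region, then other Regions) can be illustrated with a small wave-by-wave loop that rolls back as soon as a wave looks unhealthy. This is a minimal sketch, not AWS's internal pipeline: the function name `staged_rollout`, the `WAVES` labels, and the `deploy`, `healthy`, and `rollback` callbacks are all hypothetical placeholders.

```python
def staged_rollout(waves, deploy, healthy, rollback):
    """Deploy one wave at a time; undo everything if a wave is unhealthy.

    All three callbacks are hypothetical hooks supplied by the caller.
    Returns (succeeded, waves_deployed).
    """
    deployed = []
    for wave in waves:
        deploy(wave)
        deployed.append(wave)
        if not healthy(wave):
            # Automatic rollback, most recent wave first.
            for done in reversed(deployed):
                rollback(done)
            return False, deployed
    return True, deployed


# Illustrative wave names mirroring the order given in the talk.
WAVES = ["one instance", "one AZ", "remaining AZs in Region", "other Regions"]
```

The point of starting small is that a bad change is caught while its blast radius is a single instance or AZ, long before it reaches every Region.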

661.061 -> And if any issues arise, we leverage our correction of error process.

665.632 -> This is where we go and understand what the root cause was.

669.136 -> This is not about placing blame.

671.605 -> This is about diving deep to find the true reason

674.341 -> that something failed.

675.876 -> And after an issue is mitigated, we drive company-wide engineering

679.012 -> sprints to ensure the issue is fixed across all AWS services.

683.75 -> The learnings become part of the ORR process,

685.986 -> goes right back to the top and ensures similar issues don't reoccur.

692.292 -> When we think about resilience in the cloud,

694.928 -> there are four key areas we focus on.

697.364 -> First, you have to anticipate what's going to happen.

700.434 -> To do this, we use code reviews, failure-oriented programming,

704.204 -> immutability, simple designs, whenever possible.

707.808 -> Second, monitoring, we're going to talk a lot more about this,

710.444 -> health checks, tracing, alarms, dashboarding.

713.814 -> Responding, this can be the longest part of any incident

717.184 -> and you minimize that by identifying event-driven patterns,

720.153 -> using machine learning powered operations.

722.923 -> And finally learning.

724.291 -> I talk about the CoE process that we've built.

727.06 -> We look at our logs and all that learning is fed right back in

730.864 -> to anticipate better the next time and we come full circle.

735.569 -> When we think about your responsibility

737.437 -> for resilience in the cloud,

739.406 -> you also need to think about resilience threat modeling.

742.075 -> There are typical categories of failure,

743.577 -> you see them up on the screen.

745.179 -> For example, for code deployments,

747.381 -> what happens when you have a failed deployment?

749.55 -> What are you set up to have occur?

751.285 -> Do you have instrumentation to detect it?

753.487 -> Can your CI/CD system automatically roll back?

757.09 -> In core infrastructure, what if a single instance is terminated?

760.961 -> Have you designed to make sure this will not impact you?

764.131 -> And if there's an impairment in a single AZ

766.033 -> or you experience a gray failure, what's going to happen?

769.636 -> In terms of data and state,

771.338 -> what if your customers overwhelm your service

773.74 -> or your database gets corrupted?

775.609 -> What happens if a third-party dependency fails?

778.178 -> Do you gracefully degrade?
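
[Editor's sketch] Graceful degradation usually means returning a reduced but usable answer when a dependency is down, rather than propagating the error. The example below assumes a hypothetical `DependencyError` and a caller-supplied `fetch_remote` function; neither comes from the talk.

```python
class DependencyError(Exception):
    """Hypothetical error raised when a third-party call fails or times out."""


def load_profile(user_id, fetch_remote, default_profile):
    """Return (profile, degraded).

    If the dependency fails, serve a copy of the default profile and flag
    the response as degraded so callers and dashboards can see it.
    """
    try:
        return fetch_remote(user_id), False
    except DependencyError:
        return dict(default_profile), True
```

Returning the `degraded` flag alongside the data keeps the fallback observable, which matters for the detection questions raised above.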

779.913 -> I talked to a CIO this morning about a challenge

781.915 -> they faced a couple weeks ago.

783.483 -> A major application outage.

785.219 -> They had a dependent system, it was a login system,

788.522 -> federation system that failed.

790.39 -> A queue built up 1.2 million requests out there

793.627 -> and it took them seven hours to respond

796.129 -> and detect. They didn't have automatic rollback in place.

799.199 -> And they knew, they knew that it was a challenge after the fact

801.835 -> they hadn't prioritized fixing it and they want to fix it.

804.771 -> You have to consider these.

806.406 -> Now you also have to think about unlikely scenarios.

809.109 -> You may choose not to engineer for the really unlikely ones,

811.979 -> but you should consider what'll happen.

813.48 -> What if a natural disaster impacts an entire coast in the United States?

818.151 -> Or my favorite panic scenario,

820.12 -> what are you going to do during the zombie apocalypse?

822.422 -> How will you keep your systems up?

823.891 -> I keep a bag packed just in case.

828.428 -> A key part of resilience workloads

830.163 -> is making sure you have a strong foundation.

832.533 -> Start with AWS infrastructure.

835.169 -> With the launch of Hyderabad last week, we have 30 AWS Regions

839.039 -> and each Region has at least three Availability Zones.

842.075 -> And each AZ is multiple physically separated data centers.

846.18 -> Each region has two independent, fully redundant transit centers.

850.117 -> On top of that is your network.

852.152 -> Your network should always be redundant, always available,

855.055 -> and seamlessly routed.

856.69 -> On top of that, you've got your data,

859.226 -> you have to have confidence in the resiliency of your data.

861.895 -> There's so many forms: file system, block, databases, in-memory caches.

866.967 -> consider how eventual consistency impacts design.

870.003 -> And finally, your application.

872.806 -> Your highly resistant application should be able to self-heal.

876.376 -> We like you to use microservices app architecture

878.812 -> when you're building new, we know many existing apps don't do that.

881.849 -> You want to decouple interdependencies,

884.017 -> have loose coupling when possible and always remove state

887.221 -> when you can from app components.

890.457 -> Your goal is to build systems that never fail.

893.193 -> The reality is failures do happen.

895.996 -> So, there are a few things you can do to help reduce the impact of failure.

899.9 -> First, very basic set timeouts.

903.036 -> Did you know that many frameworks default to infinite timeouts?

906.139 -> That is just asking for trouble.

908.675 -> You need retry with backoff.

910.644 -> When you don't backoff, you're going to end up with a retry storm.

913.347 -> A subsequent cascading failure.

915.382 -> One retry is usually enough to be resilient

918.719 -> to intermittent errors.

920.053 -> Retry once, fail fast.
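
[Editor's sketch] The advice above (explicit timeouts, retry with backoff, one retry, then fail fast) can be sketched as follows. The names `TransientError` and `call_with_retry` are hypothetical, and the jittered exponential delay is one common choice of backoff, not something the speaker specified. The wrapped `operation` is assumed to set its own explicit timeout.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying, such as a timeout."""


def call_with_retry(operation, max_retries=1, base_delay=0.1, max_delay=2.0,
                    sleep=time.sleep, rng=random.random):
    """Call operation(); on TransientError retry at most max_retries times.

    Between attempts, wait a random delay below an exponentially growing
    cap. Once the retry budget is spent, re-raise immediately (fail fast).
    """
    attempt = 0
    while True:
        try:
            return operation()
        except TransientError:
            if attempt >= max_retries:
                raise
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * cap)  # jitter spreads out simultaneous retries
            attempt += 1
```

The default of `max_retries=1` follows the "retry once" guidance; the random jitter is there so that many clients failing at the same moment do not all retry at the same moment.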

922.222 -> That example I mentioned earlier, 1.2 million up in the queue.

926.193 -> Yeah, clearing that was a pretty big pain.

928.495 -> You want to limit the sizes of your queues

930.564 -> and make sure you rate limit your APIs and load shed when needed.
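
[Editor's sketch] A bounded queue that rejects work when full is the simplest form of the queue limiting and load shedding mentioned here. The class `BoundedIntake` and its method names are hypothetical; the sketch uses only Python's standard `queue` module.

```python
import queue


class BoundedIntake:
    """Accept work into a fixed-size queue and shed load when it is full."""

    def __init__(self, max_depth):
        self._q = queue.Queue(maxsize=max_depth)
        self.shed = 0  # count of rejected requests, useful as a metric

    def submit(self, request):
        """Return True if queued, False if rejected because the queue is full."""
        try:
            self._q.put_nowait(request)
            return True
        except queue.Full:
            self.shed += 1
            return False

    def take(self):
        """Return the next request, or None if nothing is waiting."""
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None
```

Rejecting early keeps the backlog small enough to drain quickly after a dependency recovers, in contrast to the 1.2 million queued requests in the earlier example.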

936.87 -> Before failures occur, it's important to test.

940.674 -> Continuous testing is imperative in understanding

943.477 -> how your system will react to unknowns.

945.679 -> This includes strategies like chaos engineering,

947.781 -> conducting game days, practicing failovers.

951.084 -> We use chaos to prove or disprove our assumptions about our system's

955.155 -> capability to handle disruptive events.

957.491 -> Chaos stresses an application

959.259 -> in testing or production environments.

961.161 -> We create disruptive events artificially: server outages,

964.031 -> API throttling.

965.332 -> Amazon has been purposely injecting controlled failure

968.402 -> into limited environments since the early 2000s.

971.405 -> That's how we ensure readiness for the most adverse of circumstances.

976.476 -> AWS is investing heavily in resilient services.

979.913 -> We just spoke about chaos engineering.

981.715 -> I want to call out the AWS Fault Injection Simulator,

984.818 -> a fully managed service.

986.153 -> We love those, that simulates real-world failure to uncover hidden bugs,

990.224 -> monitoring blind spots and performance bottlenecks.

993.293 -> Our resilience hub, AWS Resilience Hub

996.463 -> provides a central place to define, validate,

998.832 -> and track the resilience of your applications on AWS.

1002.236 -> And it pulls in the best practices from our Well-Architected Framework

1005.339 -> so you can benefit from what all our other customers

1007.741 -> have learned and done.

1009.309 -> And lastly, I'd like to highlight the Amazon Route

1012.546 -> 53 Application Recovery Controller.

1015.182 -> It enables you to control your application recovery

1017.651 -> across multiple AWS Regions, Availability Zones and on-prem.

1022.256 -> It makes recovery simpler and more reliable

1024.825 -> by eliminating the manual steps

1026.093 -> required by traditional tools and processes.

1028.529 -> And I'm excited today to announce a new enhancement

1031.798 -> to the Route 53 Application Recovery Controller.

1035.068 -> We're adding a new feature in preview today called Zonal Shift.

1039.306 -> It's built on Elastic Load Balancer inclusive of ALBs and NLBs.

1044.144 -> During a failure, removing your application in an AZ can be complex.

1048.715 -> It can involve configuration steps across EC2, ELB, auto-scaling.

1053.387 -> Customers like yourself have been asking us for a simple,

1055.656 -> reliable and easy-to-use tool

1057.524 -> that can help recover from AZ impairments.

1059.96 -> With Zonal Shift, when you build your applications with ELB

1063.33 -> and have cross-zone traffic disabled, you get a built-in control

1067.301 -> for shifting application traffic away from an AZ with a single action.

1071.205 -> To learn more about this preview, there's a session tomorrow, ARC 329

1075.142 -> at 10:45 AM Pacific, breakout session.

1077.344 -> Please go check it out. It's a pretty cool feature.

1080.48 -> Now I'm excited to introduce our first customer Will Meyer,

1084.351 -> Managing VP of Cloud and Connectivity for Capital One.

1087.588 -> Please join me in welcoming Will Meyer to the stage.

1090.29 -> [music playing]

1096.496 -> Thank you Shaown and good afternoon, everybody.

1098.532 -> This is awesome to see all of you.

1101.101 -> You know, I spend a fair bit of time

1102.369 -> thinking about what are the differences

1104.071 -> between just being on the cloud and really thriving on the cloud.

1108.041 -> And I think true system resilience is one of those things,

1110.777 -> so it's great to be able to talk about it with you.

1115.582 -> If you're not familiar with Capital One,

1116.85 -> we are a financial services institution,

1118.819 -> basically an information business.

1121.488 -> Our success is based on our ability to process data, generate insight

1125.058 -> that we can use to help our customers

1126.927 -> financially, for example, by giving them credit.

1130.497 -> And so, when you think about it being resilient to external change

1133.767 -> and managing risk are baked into our business model,

1136.803 -> not just our tech stacks.

1138.238 -> We've been doing it for a while,

1139.439 -> since before the public cloud was a thing.

1141.675 -> But I think we knew pretty early that it would be an accelerator for us.

1144.845 -> There were Capital One folks on stage at re:Invent in 2015

1148.849 -> talking about our intention to go all-in on AWS.

1152.386 -> We did that.

1153.654 -> We rebuilt and migrated thousands of workloads, petabytes of data.

1158.559 -> We built a security and controls framework

1162.062 -> that was appropriate to the level of trust

1164.031 -> that our customers place in us and also what our regulators demand.

1168.235 -> And in 2020 we closed our last data center.

1170.938 -> We have hundreds of teams that are doing everything from online

1174.408 -> user experience to advanced machine learning,

1177.11 -> call center, back office, all on AWS.

1180.681 -> It has been challenging, but it's also been super fun.

1183.217 -> And AWS has been a tremendous enabler for our teams and for our business.

1189.556 -> We also have some battle scars and I think we have learned along the way

1193.327 -> that we're making a lot of good progress on sane defaults

1195.863 -> but the cloud isn't perfectly plug and play quite yet.

1198.866 -> There is complexity, there are trade-offs everywhere.

1201.935 -> I think we see that when we talk about cost.

1203.704 -> I know many of us are focused on reducing waste in our cloud spend.

1207.14 -> I think we see it in security and compliance.

1209.409 -> We want the right thing to be the easy thing,

1212.145 -> but we are still working on easy.

1214.314 -> And so, as we talk about observability and resilience in general,

1217.184 -> I think it's important to not just look at them

1218.852 -> as a set of isolated patterns,

1220.42 -> but really as an integrated part of your overall approach

1223.39 -> to managing the cloud.

1228.095 -> You know, you all work with the cloud every day.

1229.763 -> I think it's easy to forget just how far the conversation

1232.533 -> about cloud resilience has really come.

1234.568 -> You know, I remember being asked,

1235.869 -> "Hey, what happens when Amazon has a big holiday e-commerce spike,

1239.706 -> and all of a sudden there's no more spare capacity to run on AWS?"

1242.843 -> And you know, it never really worked that way,

1244.545 -> but I think it's a good reminder of just how much has changed.

1247.414 -> AWS is serving incredibly demanding sectors.

1251.018 -> We see really world-class engineering talents

1253.887 -> working on resilience of public cloud infrastructure,

1257.024 -> including through open-source.

1259.226 -> At Capital One, we have had tremendous outcomes.

1262.196 -> We have fewer customer impacting incidents.

1264.631 -> We are faster to recover when we do have them.

1267.201 -> We attribute that largely to the partnership with AWS.

1269.77 -> It has been really powerful,

1271.338 -> and I know many of you have seen the same.

1274.308 -> You also all know, I think that the cloud isn't actually magic

1277.845 -> and you don't get this completely for free.

1280.347 -> Just like when we talk about costs or security or compliance, with resilience,

1284.618 -> AWS gives you incredibly powerful tools and concepts,

1288.088 -> but you need to use those tools

1289.456 -> and you need to embrace those concepts

1291.258 -> and integrate them into your ecosystem and your ways of working.

1294.828 -> And we work hard to do that.

1297.564 -> We run our US businesses in two Regions in east and west,

1300.868 -> multiple AZs in each.

1302.97 -> Our most critical workloads,

1304.104 -> are active-active with latency-based routing.

1307.341 -> We do auto-scale, although honestly, we tend to run a bit overprovisioned,

1311.178 -> partly because we want to be able to fail our entire business,

1314.314 -> all divisions over into one of those Regions within minutes,

1317.417 -> which we do occasionally.
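
[Editor's sketch] The overprovisioning trade-off mentioned here comes down to simple arithmetic: if the business must keep running after losing some locations without scaling up first, the surviving locations must already hold peak capacity. The function below is a hypothetical illustration with made-up numbers; the talk does not describe how Capital One actually sizes its fleet.

```python
import math


def per_location_capacity(peak_units, locations, tolerated_losses=1):
    """Capacity each location needs so the survivors can carry full peak load.

    peak_units: capacity needed at peak across the whole system.
    locations: number of Regions (or AZs) serving traffic.
    tolerated_losses: how many locations may be lost at once.
    """
    survivors = locations - tolerated_losses
    if survivors < 1:
        raise ValueError("at least one location must survive")
    return math.ceil(peak_units / survivors)
```

With two Regions and one tolerated loss, each Region must be sized for 100% of peak, so the fleet runs at roughly twice what steady-state traffic needs. That extra capacity is the cost of being able to fail over within minutes.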

1319.853 -> And this particular topology does not make sense for everyone,

1323.457 -> but whatever does make sense,

1324.625 -> focus on standardizing the deployment pattern

1327.194 -> that you want your teams to have.

1329.363 -> We have spent a lot of time organizing and defining

1331.365 -> our non-negotiable requirements.

1333.433 -> We've organized them into a couple of different tiers.

1336.036 -> This platinum, this multi-Region,

1339.173 -> multi-AZ active-active version we call our platinum standard.

1343.41 -> And we really hold ourselves accountable to that.

1346.613 -> We've also made a bet on a company-wide deployment pipeline

1349.416 -> that lets us do blue-green deployments

1351.084 -> across Regions consistently and safely.

1352.986 -> That's a big investment, but we have found that it is worth it.

1357.591 -> AWS also talks a lot about powerful architectural concepts;

1361.895 -> you heard about a few of them a minute ago.

1364.231 -> We're also big fans of static stability,

1366.2 -> thinking about how your system behaves in isolation

1369.336 -> so that when everything around it is going haywire,

1371.205 -> you don't actually need to take any action to remain stable.
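
[Editor's sketch] One common way to get this "no action needed" property is to keep serving the last known good value when a dependency cannot be reached. The class `LastKnownGood` and its `loader` callback are hypothetical; this illustrates the idea under that assumption and is not Capital One's implementation.

```python
class LastKnownGood:
    """Serve the most recent successful value when a refresh fails."""

    def __init__(self, loader, initial):
        self._loader = loader  # hypothetical callable that fetches fresh state
        self._value = initial
        self.stale = False

    def get(self):
        try:
            self._value = self._loader()
            self.stale = False
        except Exception:
            # Dependency unreachable: keep the current value and flag it.
            # No operator action or reconfiguration is required.
            self.stale = True
        return self._value
```

The `stale` flag keeps the condition visible to monitoring even though the caller continues to work normally.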

1374.107 -> I think that's important when you think about how much complexity there

1376.743 -> is in many of these cloud architectures.

1379.279 -> Loose coupling is great; it's how we build evolutionary architecture.

1382.182 -> But with microservices everywhere and event-driven everything,

1385.519 -> these things can be hard to reason about.

1388.188 -> And I think when we look at production incidents

1390.157 -> much more than straight failures,

1391.725 -> what we see are complicated degraded performance,

1395.796 -> partial failures across multiple systems

1398.866 -> interacting in some kind of complex way.

1400.734 -> And I think, by the way,

1401.935 -> that's true in incidents within the clouds as well.

1404.404 -> Those do happen.

1408.041 -> You also need the tools.

1409.776 -> So, I mentioned our deployment pipeline,

1411.411 -> but everything starts with how you effect change

1413.614 -> in the environment that has to be managed,

1415.315 -> and that means with infrastructure as code,

1417.284 -> not with the AWS Console, that's absolutely critical.

1420.387 -> We've also invested in tooling that helps us reason

1423.79 -> about the state of our infrastructure in sort of larger logical units.

1427.728 -> We built kind of a data layer

1429.062 -> and a management plane that helps us do things like coordinate

1432.032 -> those large-scale failovers, still with appropriate access control.

1435.335 -> I'll make just a quick plug for Cloud Doctor,

1436.97 -> which you can see in our booth.

1439.206 -> We're also really excited about all the investments

1440.841 -> we see coming from AWS.

1442.276 -> A few were just mentioned, Fault Injection Simulator

1444.578 -> is a favorite of ours.

1448.382 -> Lots of amazing tooling continues to be invested in by AWS.

1452.119 -> We also know that no matter how good the tools are,

1455.556 -> things are going to fail.

1456.857 -> And so, we practice for that.

1458.325 -> Yes, doing exercises with tech teams is key,

1460.594 -> but also think about it cross-department.

1462.829 -> How do you engage customer comms, call center,

1465.699 -> decision makers that you may need if you're going to disable a capability

1468.468 -> in the heat of the moment?

1469.603 -> We think company-wide, organization-wide game days

1472.806 -> can be really powerful for that.

1477.444 -> I'll just admit I couldn't resist an opportunity

1479.446 -> to mention our former president,

1481.114 -> but he sort of said something about software engineering.

1484.518 -> Resilience comes from learning and adaptation.

1489.423 -> Sometimes that learning is in the heat of the moment, right?

1492.226 -> How do you coordinate across multiple teams to debug

1495.762 -> and fix these complicated system

1497.464 -> interactions that we're talking about?

1499.166 -> Collaborative problem solving really is the new normal.

1501.735 -> We can't just all look at our own APIs and contracts

1504.004 -> and say, you know, "Hey, it's not me."

1506.206 -> So, think about how you incentivize that.

1507.641 -> Sometimes the learning is in the follow-ups.

1510.143 -> At Capital One, we talk a lot about blameless postmortems,

1512.646 -> and we spend the time really digging deep when things fail.

1516.383 -> We aren't trying to assign blame,

1517.985 -> but we are trying to find the truth and it really matters a lot.

1520.587 -> And so, invest in that and in particular

1523.09 -> invest in tracking the follow-ups.

1524.725 -> I know AWS talked about that a moment ago.

1526.693 -> That part is important.

1529.162 -> I think at the end of the day,

1530.33 -> we have found that we spend a lot of time

1532.799 -> working on the systems that help us learn and improve.

1536.47 -> And to a large extent, the time we spend looking backward

1539.006 -> is the way we speed up moving forward.

1543.644 -> I want to make a quick side point on serverless.

1546.18 -> I think we at Capital One have made a pretty intentional investment

1549.583 -> in adopting more and more managed services,

1552.586 -> and I think it's interesting to talk about that

1554.121 -> in the context of resilience.

1556.523 -> Over the years, I think we have all embraced distributed DevOps,

1559.493 -> we have shifted left, and you know,

1561.361 -> we might have horizontal SRE teams or platform teams,

1563.931 -> but generally all of our application teams

1566.233 -> have a lot of ownership of their infrastructure.

1569.203 -> And that has a bunch of benefits that I think we all understand.

1573.34 -> It also makes some pretty big asks of those teams:

1576.243 -> be cost efficient, be resilient, patch your vulnerabilities.

1580.38 -> It's not quite the simplicity that the cloud promised.

1582.916 -> I think we also see the need to build internal platforms

1586.153 -> and abstraction layers

1587.354 -> that then help those teams to do all of those things.

1590.257 -> And this is where we think serverless fits in.

1592.659 -> We want to move teams up the stack.

1594.161 -> We want to lean into the shared responsibility model

1596.597 -> and basically get AWS to do as much work

1598.832 -> as possible on our own resilience.

1602.369 -> There is some potential risk with that operationally, sure.

1604.605 -> But I think we see the resilience

1606.206 -> of the managed services to be good and improving.

1609.776 -> People also talk about lock-in,

1610.944 -> but I think we've already seen containers kind of surpass our VMs.

1614.515 -> We see functional event-driven models pretty much everywhere.

1618.085 -> I think in most domains we are building lighter-weight applications

1621.154 -> and we're even starting to talk about being able

1623.357 -> to run them on-prem or at the edge or anywhere in between.

1627.194 -> And so, we think this move up the stack really

1628.896 -> is on the right side of history

1631.131 -> and we want to continue to offload undifferentiated heavy lifting,

1634.635 -> not only to improve our productivity,

1636.537 -> but to improve our resilience over time as well.

1641.341 -> Parting thoughts:

1642.476 -> Yes, do the textbook stuff.

1644.144 -> Start with the Well-Architected Framework,

1645.846 -> but make it your own.

1647.281 -> Be intentional about setting standards

1649.149 -> and expectations for architectural resilience with your teams.

1652.619 -> Educate your teams, track your progress,

1654.988 -> integrate those standards into your tools,

1657.391 -> start all the way at the beginning of the development process,

1659.426 -> build the guardrails that enforce resilience requirements

1662.696 -> right into the infrastructure.

1665.632 -> Interrogate all those dependencies

1667.267 -> and use architectural tools like static stability to mitigate them.

1671.471 -> And then most important, think about how you cultivate

1674.041 -> the mechanisms, to use the AWS word, for learning and response,

1678.345 -> both during emergencies, which you should be practicing for

1681.148 -> and in the follow-ups that help you improve over time.

1684.384 -> I think like a lot of things in tech,

1685.686 -> resilience isn't just about tech,

1687.454 -> it is also about you as a responsive and resilient organization as well.

1692.826 -> Thanks for your time. I'll hand you back to Shaown.

1695.062 -> [music playing]

1701.802 -> Thank you, Will.

1703.036 -> Some really great learning out there

1704.404 -> that people can start applying to their workloads

1706.406 -> and implementing immediately.

1707.708 -> I especially love the little bit on serverless at the end.

1710.41 -> I gave a shout out to it earlier.

1712.513 -> It is such a good way to start to reduce your operational burden

1715.282 -> and put it in someone else's, i.e. our, hands to some extent.

1718.785 -> A founding member of the Amazon EC2 team put it really nicely.

1722.656 -> "You can't legislate against failure.

1724.992 -> Focus on fast detection and response."

1727.427 -> We all know there are going to be failures.

1729.73 -> So, this is where observability gives you the ability

1732.466 -> to efficiently detect, investigate, and respond.

1736.57 -> Often customers don't detect issues as soon as they begin.

1739.006 -> There's a lag from when the issue starts to when you find it.

1742.676 -> You can respond to failures quicker if you alert near the source.

1746.613 -> Investigation is where people spend the most amount of time

1750.35 -> during an operational event.

1751.685 -> This is the largest contributor to downtime.

1754.154 -> I mentioned that incident earlier, seven hours to investigate.

1758.292 -> Leverage logs, metrics, and tracing to help you investigate

1762.362 -> quickly and understand the root cause.

1764.164 -> Your time is valuable.

1765.933 -> Focus on the stuff that matters during an operational event.

1768.902 -> There's nothing worse than trying to fix something

1771.104 -> and making the situation worse.

1773.24 -> And I mentioned our COE (correction of error) process earlier.

1775.509 -> Make sure you conduct a post-event analysis

1777.611 -> to help you determine how you could have prevented this.

1779.98 -> It will probably happen again if you don't.

1782.449 -> Your goal should be to ensure that doesn't happen,

1784.585 -> that you never have repetition in these errors,

1786.753 -> and if you do, that you know how to identify it faster

1789.69 -> and remediate it automatically.

1792.626 -> Our philosophy of monitoring is to ensure

1794.728 -> we measure the things customers care about

1797.564 -> and measure them from multiple perspectives.

1799.766 -> We want to continuously introspect those metrics and question them.

1804.705 -> This is all to understand the customer experience.

1807.407 -> Instrumentation allows us to learn about our system,

1810.21 -> give operators real-time feedback

1812.446 -> on how the system is operating and feed data into alarms.

1815.949 -> This helps us detect and respond to events when they happen.
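
As a rough illustration of how instrumentation can feed data into an alarm, the sketch below evaluates a hypothetical error-rate metric with an "M out of N datapoints" rule. The function name, the threshold, and the sample values are assumptions made for this example; the talk does not show a specific alarm configuration.

```python
from typing import Sequence


def should_alarm(datapoints: Sequence[float], threshold: float,
                 datapoints_to_alarm: int, evaluation_periods: int) -> bool:
    """Return True when at least `datapoints_to_alarm` of the most recent
    `evaluation_periods` datapoints are above `threshold`."""
    recent = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in recent if value > threshold)
    return breaching >= datapoints_to_alarm


# Hypothetical per-minute error rates (fraction of failed requests).
error_rates = [0.001, 0.002, 0.08, 0.09, 0.001, 0.12]
alarm = should_alarm(error_rates, threshold=0.05,
                     datapoints_to_alarm=3, evaluation_periods=5)
print("ALARM" if alarm else "OK")
```

Requiring several breaching datapoints rather than one is a common way to trade a little detection delay for fewer false alarms; the right values depend on the workload.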

1818.886 -> We need to make sure we're monitoring the right things

1821.221 -> and asking the right questions.

1823.857 -> We want to ask these questions all about our systems.

1827.127 -> Why is it operating that way?

1828.262 -> You need to add instrumentation, go right to the top of the stack

1831.832 -> and figure out what's going on.

1833.367 -> The instrumentation produces logs, metrics and traces.

1836.77 -> We use alarms and dashboards to analyze those,

1840.007 -> and then we ask more questions.

1842.442 -> Why did that thing go into alarm?

1844.144 -> What was actually going on when I saw this spike in the dashboard?

1847.047 -> And to answer that, we need more instrumentation

1849.483 -> and we go through the same circle over and over again

1852.352 -> and it improves operations and more importantly,

1854.955 -> it improves our end customer experience.

1857.624 -> This is the virtuous cycle of monitoring.

1861.361 -> Think about a real-world example.

1862.896 -> A service that customers are calling through a Load Balancer.

1865.832 -> So, you get a call trying to get some product info,

1868.635 -> it goes and hits the load balancer.

1871.004 -> Questions you want to ask: What product are we looking up?

1873.473 -> Who called the API?

1874.575 -> What was this end customer?

1875.843 -> What type was it, coming through a website,

1877.911 -> coming through some other area?

1879.78 -> Did we find the item in our local cache

1882.049 -> or did we have to punch out and get to a remote cache?

1884.885 -> How long did it take to read from the cache?

1887.154 -> How full is the local cache?

1889.056 -> How long did the query take?

1891.124 -> You went out to a remote database, perhaps; did the query succeed?

1894.862 -> And how long did it take to go back and populate the caches?

1897.531 -> Was the cache full?

1898.732 -> Did you have to evict items?

1900.467 -> And how big was that product info object that you went and fetched?

1903.737 -> And what was the response code from the server?

1906.473 -> And finally, what was the latency?

1908.876 -> If I was operating this service in production,

1911.512 -> I would need so much instrumentation in this code

1913.947 -> to be able to understand its behavior in production.

1915.916 -> That's a good thing.

1917.05 -> I need the ability to troubleshoot failed requests, slow requests,

1920.487 -> I want to monitor for trends, signs that different dependencies

1923.79 -> are under-scaled or misbehaving, there's a lot there.

1926.693 -> Don't oversimplify.
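
The questions in this example can be sketched as one structured record per request. This is a minimal sketch, not the service described in the talk: the function `get_product_info`, the field names, and the dictionary used as a stand-in database are all hypothetical.

```python
import json
import time
from typing import Callable, Dict, Optional


def get_product_info(product_id: str, caller: str,
                     local_cache: Dict[str, dict],
                     query_database: Callable[[str], Optional[dict]]) -> dict:
    """Look up a product and return one structured record describing the request."""
    start = time.perf_counter()
    record = {"operation": "GetProductInfo", "product_id": product_id,
              "caller": caller, "cache_size": len(local_cache)}

    item = local_cache.get(product_id)
    record["cache_hit"] = item is not None
    if item is None:
        # Cache miss: time the remote query and note whether it succeeded.
        query_start = time.perf_counter()
        item = query_database(product_id)
        record["query_ms"] = (time.perf_counter() - query_start) * 1000
        record["query_succeeded"] = item is not None
        if item is not None:
            local_cache[product_id] = item

    record["status_code"] = 200 if item is not None else 404
    record["response_bytes"] = len(json.dumps(item)) if item is not None else 0
    record["latency_ms"] = (time.perf_counter() - start) * 1000
    return record


# Usage with a stand-in database (a plain dict lookup).
database = {"p-1": {"name": "widget", "price": 9.99}}
cache: Dict[str, dict] = {}
first = get_product_info("p-1", "website", cache, database.get)
second = get_product_info("p-1", "website", cache, database.get)
print(json.dumps(first))
print(json.dumps(second))
```

Emitting one wide record per request keeps the product, the caller, the cache behavior, the response code, and the latency together, so a failed or slow request can be examined without joining several log lines.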

1929.463 -> And we have a couple types of metrics we categorize.

1932.733 -> First, health metrics.

1934.368 -> Am I failing?

1935.569 -> It doesn't answer the question, why am I failing?

1938.405 -> Health metrics, the alarms are there to alert you to issues.

1941.375 -> And then the diagnostic metrics.

1943.243 -> What's the value of this thing I measured?

1945.345 -> Why isn't my system working?

1947.648 -> These both fall into three essential categories.

1950.784 -> The customer experience metrics are the ones that let you detect

1953.687 -> that your customer has a problem

1955.389 -> and the service is not responding to them.

1957.491 -> Once you've found a problem, you can use impact assessment metrics

1961.495 -> to measure the number and percentage of customers,

1963.797 -> resources, and workloads impacted,

1966.567 -> and then you can use operational health metrics

1969.536 -> to determine why the impact is occurring.

1972.339 -> When the why is discovered, responders

1974.608 -> and automation can take action to go and resolve the event.
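
To make the difference between a customer experience metric and an impact assessment metric concrete, here is a small sketch computed over the same request records. The record shape and the sample data are assumptions for illustration only.

```python
from typing import List, Tuple

# Hypothetical request records: (customer_id, succeeded).
Record = Tuple[str, bool]
requests: List[Record] = [("c1", True), ("c1", False), ("c2", True),
                          ("c3", False), ("c3", False), ("c4", True)]


def error_rate(records: List[Record]) -> float:
    """Customer experience metric: fraction of requests that failed."""
    if not records:
        return 0.0
    failures = sum(1 for _, ok in records if not ok)
    return failures / len(records)


def impacted_customer_percentage(records: List[Record]) -> float:
    """Impact assessment metric: percentage of distinct customers
    that saw at least one failed request."""
    customers = {customer for customer, _ in records}
    if not customers:
        return 0.0
    impacted = {customer for customer, ok in records if not ok}
    return 100.0 * len(impacted) / len(customers)


print(error_rate(requests))
print(impacted_customer_percentage(requests))
```

The first number says that something is wrong; the second says how many customers are affected. Neither explains why, which is the job of the operational health metrics.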

1978.979 -> There are three commonly agreed upon pillars of observability.

1982.916 -> It's metrics, logs, and traces.

1985.385 -> Metrics are the numeric data measured at various time intervals,

1989.189 -> request rates, error rates, durations, CPU percentage.

1993.026 -> Logs are your timestamped records of distinct events

1995.362 -> that occurred within an application

1996.697 -> or a system, such as a failure, an error,

1998.966 -> or just a state transformation.

2000.734 -> And traces represent a single user's journey

2002.87 -> across multiple applications and systems.

2005.372 -> Usually with microservices, great modern architecture.

2010.11 -> We have a broad suite of observability capabilities.

2012.679 -> We have native services that integrate deeply

2015.349 -> with our AWS services.

2017.217 -> We have Container Insights and Lambda Insights.

2018.986 -> You heard about some new ones being launched

2020.487 -> this morning from Adam.

2021.889 -> We have open-source tools: Managed Grafana,

2026.66 -> Managed Prometheus, big fan.

2028.729 -> We give you lots of options, lots of choices.

2030.864 -> You can pick the right tool for your workload.

2034.334 -> With that, I am super excited to introduce Kym Weiland,

2038.672 -> Vice President of Enterprise

2039.873 -> Operations at FINRA, who's going to talk more

2041.909 -> about the importance of observability.

2043.81 -> Please welcome Kym.

2045.379 -> [music playing]

2054.688 -> Good afternoon.

2055.822 -> FINRA.

2056.957 -> FINRA is the Financial Industry Regulatory Authority

2061.094 -> and we help play a critical role

2063.931 -> in ensuring the integrity of America's financial system.

2067.701 -> We write and enforce rules governing

2069.736 -> the ethical activity of brokers in the United States,

2073.273 -> we examine firms for compliance with those rules,

2076.91 -> we foster market transparency and we educate investors.

2082.516 -> In addition, we also do big data, lots of big data.

2087.888 -> FINRA has processed peak volumes

2089.656 -> of over 600 billion market events per day.

2092.693 -> We do this in support of 24 exchanges,

2095.762 -> and we have had to run upwards of 300,000 compute nodes

2099.7 -> in a single day.

2101.702 -> Currently, we have a storage footprint of about 500 petabytes.

2108.375 -> We did not get here overnight.

2110.978 -> We started in early 2014, 2015,

2113.18 -> and by 2016 we were doing about 39 billion transactions.

2117.451 -> We had about a hundred RDS Instances.

2120.32 -> That grew by 2019 where we were doing about 190 billion transactions

2125.792 -> and we were up to about 500 databases at that point

2129.496 -> with about a 50 petabyte storage footprint.

2133.5 -> But between 2019 and 2022, there was unprecedented market volatility.

2139.273 -> And we have grown to an average data

2141.308 -> intake of over 450 billion events per day,

2147.381 -> over 500 petabyte storage footprint, and we have over 1200 databases.

2154.421 -> For 2026, what does the projection look like?

2157.991 -> Simply bigger.

2159.86 -> So, the question is, how did we get here?

2163.33 -> We got here by starting

2164.565 -> with some very fundamental architectural principles.

2168.635 -> We are a data-centric organization

2171.271 -> and therefore, it started with our data management.

2174.107 -> And this is more than just storage.

2176.944 -> It includes data registration, data lineage, data lifecycle,

2180.814 -> including security data classifications.

2183.984 -> We wanted to ensure that we took advantage of cloud elasticity.

2188.355 -> Everything from serverless to flexibility

2191.525 -> with instance types, flexibility with storage types,

2195.562 -> and again, making sure that it was done in such a way

2198.098 -> that we could do on-demand scalability.

2201.568 -> From an architecture perspective,

2203.804 -> you have to start with a core framework of services.

2207.407 -> All of our application teams understand

2209.176 -> the supported blueprints, there's centralized messaging,

2213.18 -> we are API-focused and we prefer open source.

2218.018 -> But security is also a key part of that architectural principle.

2223.423 -> And it's not just encryption. Configuration, authorizations

2229.062 -> and observability, in terms of it being auditable.

2233.834 -> DevOps: infrastructure as code, that is core,

2238.138 -> including your CI/CD and all of the automated testing

2241.909 -> that needs to happen in that area.

2244.645 -> Operations needs to be resilient and again, automated.

2249.483 -> It also needs to be performant

2251.351 -> and you have to have that enterprise observability

2254.188 -> in order to ensure all of those are balanced.

2257.624 -> The output of all of that data should be done

2260.093 -> in such a way that compliance is also done as code

2263.864 -> and you can do analytics in order to show it.
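
A minimal sketch of what "compliance as code" can look like: rules are plain functions evaluated against resource descriptions, and the findings become data you can analyze. The resource fields and rule names below are hypothetical and are not FINRA's actual controls.

```python
from typing import Callable, Dict, List

# Hypothetical resource descriptions; the field names are illustrative only.
resources = [
    {"id": "bucket-a", "encrypted": True, "classification": "confidential"},
    {"id": "bucket-b", "encrypted": False, "classification": "public"},
    {"id": "db-1", "encrypted": True},
]

Rule = Callable[[dict], bool]
rules: Dict[str, Rule] = {
    "encryption-at-rest": lambda r: r.get("encrypted") is True,
    "has-data-classification": lambda r: "classification" in r,
}


def evaluate(resources: List[dict], rules: Dict[str, Rule]) -> List[dict]:
    """Return one finding per (resource, rule) pair that fails."""
    return [{"resource": r["id"], "rule": name}
            for r in resources
            for name, check in rules.items()
            if not check(r)]


findings = evaluate(resources, rules)
compliance_rate = 1 - len(findings) / (len(resources) * len(rules))
for finding in findings:
    print(finding)
print(f"compliance rate: {compliance_rate:.0%}")
```

Because the rules and the findings are both data, the same check can run in a pipeline before deployment and again as a periodic audit.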

2268.468 -> To manage this is a balancing act.

2270.07 -> It's a balancing act between innovation,

2272.072 -> optimization and adoption.

2275.242 -> Innovation.

2276.844 -> AWS has new services all the time.

2280.848 -> In order to do that, you have to have service evaluations

2283.35 -> on how you integrate those services into your architecture,

2287.487 -> into your observability.

2289.556 -> You can do that through R&D, through POC.

2292.693 -> We also are very active in preview participation

2296.196 -> and it allows us to give informed feedback to those services.

2300.701 -> For optimization of workloads, it is observability of capacity,

2304.705 -> but also those cost efficiencies in order to get the most efficient,

2309.176 -> cost-effective workloads.

2311.745 -> And then adoption: delivery focused automation.

2316.984 -> Putting that automation in the hands of the teams

2319.486 -> with the provisioning guardrails

2321.555 -> so that they can do the right thing the easiest way.

2325.626 -> But that automation has to be handled at scale.

2330.13 -> So, again, observability.

2331.365 -> Monitoring.

2332.432 -> Everyone is familiar with monitoring.

2334.001 -> That's what they think of with observability,

2336.37 -> but it is far more than that.

2337.604 -> It is compliance, everything down to records management,

2342.176 -> oversight, policy enforcement.

2345.145 -> It is security.

2346.48 -> Yes, it is auditable, but it's also: do you have those standards

2350.317 -> in your architectural principles for security logging,

2353.153 -> for security control? Operational scorecards.

2358.358 -> These are not just testing,

2360.093 -> but what is your resilience posture of your cloud environment?

2364.464 -> Can you measure it?

2365.732 -> Can you view it?

2368.202 -> Everything including AWS Trusted Advisor,

2371.071 -> great for cost optimization.

2373.874 -> And then application health status, the highest layer,

2378.912 -> and this is important not only for those core connectivities,

2383.25 -> Is your application up?

2385.052 -> But also what are the dependencies of those applications,

2388.255 -> either direct dependencies or indirect dependencies.

2393.861 -> About three years ago we embarked on a strategic vision

2397.03 -> to do the multi-Region disaster recovery.

2401.268 -> And we did so basing it on some core services,

2405.439 -> Amazon S3 with the Cross-Region Replication for data,

2410.077 -> Aurora Global Database, KMS multi-Region keys,

2415.449 -> and the DynamoDB Global Tables specifically around parameter store.

2420.387 -> Combining those again to make a specific strategic vision

2425.158 -> for what that architecture would look like

2427.928 -> and it continues to grow.

2429.63 -> We continue to infuse resilience

2432.165 -> through those architecture and services.

2435.068 -> You always want to constantly learn and evaluate opportunities

2438.939 -> for efficiencies between those Regions

2442.142 -> while of course ensuring data encryption, data security,

2445.779 -> and your data durability.

2450.083 -> So, observability specifically for resilience.

2454.254 -> We have annual tests, sometimes more than that, of the multi-Region.

2458.792 -> So, we created what we call FINRA canary,

2461.395 -> which of course reports into our birdcage

2463.964 -> in order to check the health of the systems in the multiple Regions,

2469.336 -> but not just the system itself or its core dependencies,

2474.775 -> but also its downstream dependencies.

2477.544 -> So, it cares about itself and it cares about its friends.

2481.515 -> To do this, we've implemented a simple red light,

2484.885 -> yellow light, green light implementation,

2489.056 -> and you can look at them across regions and go,

2491.491 -> my app is good, my app is good, but my friends aren't so happy.

2495.596 -> And that may be expected.

2497.064 -> Not every app's position in disaster recovery

2499.967 -> may be up at any given time,

2502.302 -> but knowing where that is, and when both you,

2505.806 -> your app, become happy, and your friends

2508.108 -> and all your neighbors that you depend on,

2510.31 -> that's when you go green.

2512.846 -> And that allows a level of observability

2515.516 -> to be able to be seen across the enterprise

2518.785 -> and at scale and across Regions.
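
The red, yellow, green idea can be sketched as a small function over a dependency graph. The exact rules FINRA uses are not given in the talk, so the rule here is an assumption: red when the app itself is down, yellow when the app is up but a direct or indirect dependency is down, green otherwise. App and Region names are placeholders.

```python
from enum import Enum
from typing import Dict, List


class Light(Enum):
    RED = "red"
    YELLOW = "yellow"
    GREEN = "green"


def app_light(app: str, healthy: Dict[str, bool],
              depends_on: Dict[str, List[str]]) -> Light:
    """Assumed rule: red if the app itself is down, yellow if it is up but any
    direct or indirect dependency is down, green otherwise."""
    if not healthy[app]:
        return Light.RED
    seen = set()
    stack = list(depends_on.get(app, []))
    while stack:
        dep = stack.pop()
        if dep in seen:
            continue  # already checked; also guards against cycles
        seen.add(dep)
        if not healthy[dep]:
            return Light.YELLOW
        stack.extend(depends_on.get(dep, []))
    return Light.GREEN


# Hypothetical apps, dependencies, and per-Region health.
depends_on = {"orders": ["pricing"], "pricing": ["reference-data"]}
regions = {
    "region-a": {"orders": True, "pricing": True, "reference-data": True},
    "region-b": {"orders": True, "pricing": True, "reference-data": False},
}
status = {region: app_light("orders", health, depends_on)
          for region, health in regions.items()}
for region, light in status.items():
    print(region, light.value)
```

In this example the app is healthy in both Regions, but in the second Region one of its indirect dependencies is not, so it reports yellow there rather than green.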

2520.854 -> It's gone?

2526.627 -> There we go.

2528.128 -> So, I will leave you with some parting thoughts

2530.597 -> about managing this at scale.

2532.933 -> Again, integrated security, hands-off operations.

2537.871 -> Infrastructure as code.

2540.474 -> You are not going to go on a server and fix something

2544.645 -> because that server may not be there tomorrow.

2546.98 -> It may be a different server.

2548.282 -> So, make sure it is a hands-off operation.

2551.585 -> Automation, automation of onboarding to your fleet

2555.556 -> and make sure that it is self-reporting

2557.391 -> and the audit trails are there,

2560.027 -> but most importantly it is delivery focused,

2564.097 -> self-service automation that those teams can use

2568.535 -> to make the right thing, the easy thing.

2572.105 -> With that, I will pass the stage back to Shaown.

2575.108 -> [music playing]

2582.616 -> Thank you, Kym.

2583.717 -> You know, as you heard Kym talk about the importance

2587.054 -> of maintaining observability.

2589.156 -> Resilience is not a one and done thing or a checklist project.

2592.492 -> You must continuously invest in improving your resilience,

2595.028 -> which is what we're going to close with today.

2596.964 -> By the way, I have to give her a shout out

2598.398 -> for calling dependencies your friends.

2600.934 -> They're your friends until they break

2603.103 -> and then you're quite angry at those dependencies.

2604.838 -> But it's nice when they can be your friends.

2608.175 -> Part of continuous improvement

2609.877 -> is making sure that every workload is reviewed

2611.979 -> using the AWS Well-Architected Framework.

2614.381 -> Not just once, but every time

2616.416 -> you make a significant change to your workload.

2618.719 -> The Well-Architected Framework is a set of questions,

2620.888 -> design principles that enables you to build and deploy faster

2624.591 -> to release value more often.

2626.193 -> Understand where you have risks in your architecture.

2628.862 -> You want to intentionally know about those risks

2632.165 -> and ensure that you've made intentional architectural decisions

2637.738 -> that highlight how they will impact business outcomes.

2640.507 -> And you want to make sure your teams know all

2642.242 -> about the best practices we've learned

2644.211 -> from reviewing thousands of customers' architectures on AWS.

2648.649 -> The reliability pillar of Well-Architected outlines

2651.318 -> some key design principles.

2653.22 -> Automatically recovering from failure.

2654.821 -> Think back to Will's examples at Capital One.

2657.491 -> You want to test recovery procedures.

2659.092 -> We talked about this earlier,

2660.561 -> the importance of chaos engineering and continuous testing.

2663.73 -> You want to be able to scale horizontally

2665.766 -> to increase workload availability and plan for your capacity needs.

2670.771 -> Right-size your instances and resources

2672.673 -> to match your workload's needs.

2674.107 -> You heard Will talk a little about being over-provisioned.

2676.81 -> Getting close to right will also, as a bonus,

2679.246 -> help you optimize your costs, something we all care about.

2681.982 -> And manage change through automation.

2686.587 -> Performing operations is---

2689.256 -> Operational excellence principles are something

2691.558 -> that are also really important.

2692.826 -> It's the second pillar in Well-Architected.

2694.928 -> The first one to think about is performing operations as code.

2698.465 -> You can define your entire workload,

2700.467 -> applications, infrastructure as code and update it with code.

2704.404 -> That's the game changer

2705.639 -> if you get to that level of automation.

2707.574 -> You can design workloads to allow components

2709.409 -> to be updated regularly to increase

2711.245 -> the flow of beneficial changes into your workloads,

2714.515 -> and as you use your operations procedures,

2716.884 -> look for opportunities to improve them.

2719.353 -> Use proactive threat modeling to identify potential sources of failure

2723.557 -> so they can be removed or mitigated ahead of time.

2726.193 -> And finally, drive improvement through lessons

2728.428 -> learned from all operational events.

2730.397 -> That correction of error process I keep coming back to.

2733.567 -> It's something that people forget about.

2735.469 -> They get worried about blame.

2736.937 -> Do not think about blame, think about the learning.

2740.44 -> One of our favorite mechanisms for doing this is game days.

2743.443 -> After your design for resilience is in place,

2746.113 -> you want to make sure that it works in production.

2748.515 -> And a game day is a way to ensure that everything works as planned.

2751.685 -> Use game days to regularly exercise

2753.654 -> your procedures for responding to events and failures

2756.256 -> as close to production as possible;

2758.458 -> this includes in production when possible.

2760.727 -> And you want to use the people who will be

2762.563 -> actually involved in failure scenarios.

2764.631 -> It's a very common mistake to have a paper exercise

2767.568 -> and none of the folks who are actually on call at 3:00 AM

2769.837 -> are involved.

2771.004 -> Game days should simulate a failure or an event to test systems,

2774.842 -> processes and team responses.

2776.944 -> The purpose is to perform the actions the team would perform

2779.746 -> as if that event happened.

2781.415 -> And you have to do this regularly.

2782.816 -> You want muscle memory on how to respond.

2784.818 -> You don't want someone reading the manual,

2786.92 -> I'm going to say it again, at 3:00 AM at night.

2788.455 -> It's always 3:00 AM somewhere, that's for sure, and avoid that.
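
A game day exercises people and processes as much as systems, so code can only cover part of it. For the system part, a minimal fault-injection sketch might look like the following. The wrapper, the exception type, and the fallback value are hypothetical, and this is plain Python rather than any specific AWS fault-injection tooling.

```python
import random
from typing import Callable


class DependencyError(Exception):
    """Raised when the (simulated) dependency fails."""


def with_fault_injection(call: Callable[[], str], failure_rate: float,
                         rng: random.Random) -> Callable[[], str]:
    """Wrap `call` so that it fails with probability `failure_rate`."""
    def wrapped() -> str:
        if rng.random() < failure_rate:
            raise DependencyError("injected failure")
        return call()
    return wrapped


def fetch_with_fallback(call: Callable[[], str], fallback: str) -> str:
    """The behavior under test: serve a fallback when the dependency fails."""
    try:
        return call()
    except DependencyError:
        return fallback


# Exercise: inject a 100% failure rate and confirm the fallback is served.
broken = with_fault_injection(lambda: "live price", failure_rate=1.0,
                              rng=random.Random(0))
result = fetch_with_fallback(broken, fallback="cached price")
print(result)
```

Running an experiment like this regularly, and as close to production as is safe, is what turns a documented fallback into one the team has actually seen work.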

2793.694 -> Infrastructure event management.

2795.295 -> This is something we offer to our large enterprise support customers.

2798.832 -> It's available to you if you're an enterprise support customer.

2801.268 -> It's focused on planning and support for business-critical events.

2805.706 -> Think about steps you might take to prepare your workload

2807.975 -> to handle a 10x traffic increase due to a product launch.

2811.512 -> Like our latest Kindle, I went and ordered one.

2813.28 -> I'm sort of excited about writing on a Kindle.

2816.016 -> Customer onboarding, it could be tied to an ad campaign,

2818.619 -> anything that's going to happen

2819.887 -> that's going to cause a large change in how you operate.

2823.257 -> Our enterprise support team can work with you

2825.392 -> over a timeline of several weeks

2826.727 -> to figure out how your infrastructure's configured,

2829.296 -> make sure service limits are right,

2832.099 -> make sure auto scaling groups are in place,

2833.867 -> look at your Load Balancers.

2835.502 -> They'll also raise awareness internally about your event

2838.438 -> and prepare support engineers to be ready

2840.274 -> to tackle any issues that come up.

2842.376 -> We use this for events like Prime Day internally.

2845.045 -> We can do it with you as well.
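
The kind of arithmetic behind preparing for a 10x traffic increase can be sketched as a headroom check against service limits and Auto Scaling maximums. Every number below is a made-up placeholder; real values would come from your own load tests and account limits.

```python
import math


def instances_needed(peak_rps: float, rps_per_instance: float,
                     target_utilization: float = 0.6) -> int:
    """Instances required to serve `peak_rps` while keeping each instance
    at or below `target_utilization` of its measured capacity."""
    return math.ceil(peak_rps / (rps_per_instance * target_utilization))


current_peak_rps = 2_000   # hypothetical current peak
rps_per_instance = 250     # hypothetical, from load testing
instance_quota = 100       # hypothetical service limit
asg_max_size = 40          # hypothetical Auto Scaling group maximum

needed = instances_needed(10 * current_peak_rps, rps_per_instance)
quota_ok = needed <= instance_quota
asg_ok = needed <= asg_max_size
print(f"needed={needed} quota_ok={quota_ok} asg_ok={asg_ok}")
```

With these placeholder numbers, both the limit and the group maximum would need to be raised before the event, which is the sort of gap you want to find weeks ahead rather than on launch day.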

2847.281 -> And for those really exceptional cases,

2850.117 -> I'm thinking about the Cyber Monday this week

2851.818 -> for those retailers out there, we offer joint mission control.

2855.255 -> This is just like NASA, right, going back to where we started,

2858.559 -> think about an all-hands-on-deck event

2860.427 -> to make sure the right people are potentially in the room with you.

2864.031 -> During Prime Day, we have hundreds of engineers who come together,

2866.7 -> both virtually and in person to prepare for the worst-case

2870.17 -> scenario while being ready and hoping for the best.

2874.741 -> Finally, I want to give you a couple resources.

2877.644 -> These are things you can reach out to get more information.

2880.08 -> Three really good things here.

2881.782 -> First, the Well-Architected Framework;

2883.183 -> we have a set of pages that is full of information.

2886.053 -> Everyone should have checked out the Well-Architected Framework

2887.921 -> if you're running in AWS.

2889.69 -> Second, a brand-new whitepaper published

2892.292 -> by one of our principal SAs,

2893.493 -> Mike Haken, on fault isolation boundaries, really compelling.

2897.397 -> I saw Corey Quinn tweeting about it the day it came out.

2899.8 -> I was on vacation.

2901.101 -> Got me excited that it got published.

2902.636 -> Really good information there.

2904.037 -> And finally, our AWS Solutions Library,

2906.874 -> which is full of Solutions for all purposes.

2909.243 -> In this case, we've just launched the resilience guidance.

2912.88 -> We have backup and restore, failover and failback and many more.

2916.683 -> Please check out the Solutions Library.

2920.654 -> All that said, because resilience is continuous,

2923.757 -> you have to think about it as a journey instead of a destination.

2926.76 -> You are never quite done.

2928.629 -> We realized that all of you,

2929.93 -> our customers, are in different spots in that resilience journey.

2933.233 -> Some of you are just starting out.

2935.035 -> Some of you are further along.

2936.47 -> We heard from Will, we heard from Kym;

2938.038 -> they're pretty far along in that journey.

2939.84 -> But we have one thing all of us have in common.

2942.442 -> We're all builders on this journey.

2944.044 -> Whether you're an executive building a business strategy around resilience

2947.08 -> or a developer building resilience

2949.016 -> through the application or part of a cloud CoE

2951.051 -> or DevOps team building guardrails out,

2954.321 -> AWS can provide you with the right guidance,

2956.49 -> the services and infrastructure to enable your success.

2959.293 -> I want to thank all of you for spending this hour with us,

2962.563 -> from me and Francessca, and please have a great re:Invent.

2965.699 -> [applause]

Source: https://www.youtube.com/watch?v=GamnNc6ZMew