AWS re:Invent 2022 - Building modern apps: Architecting for observability & resilience (ARC217-L)
Aug 16, 2023
Cloud computing is transforming architecture design and application delivery at organizations across the world. As cloud architectures evolve, new design patterns are essential. Architecting for resilience, observability, security, and emerging trends serves as the foundation that empowers builders to innovate, optimize their workloads, and scale adoption over time. In this session, hear from AWS customers and cloud experts about their proven architecture best practices, tools, and blueprints for architecting for reliability, observability, and modernization on AWS. Learn more at: https://go.aws/3ucTJpm
Content
0.234 -> [music playing]
2.503 -> Please welcome
Vice President, Technology
5.339 -> and Customer Solutions,
AWS, Francessca Vasquez.
9.676 -> [music playing]
17.718 -> Welcome everyone.
Welcome to re:Invent 2022.
22.656 -> I can't hear you all.
Welcome to re:Invent 2022!
26.96 -> My name is Francessca Vasquez
and I have the honor and privilege
30.964 -> to lead AWS's Solutions Architecture
and Customer Solutions organization.
36.436 -> I also have the opportunity
to lead our AWS
40.307 -> Resilience Customer and Partner program
43.644 -> and I am so excited
to have you here today.
47.714 -> On behalf of the AWS
Architecture Organization
52.519 -> and our broader community,
we'd like to dedicate this session
57.724 -> on building modern applications
through the lens of observability
61.762 -> and resilience to David Grimm.
64.998 -> He was a solutions architect,
a technologist,
68.468 -> and an amazing thought leader
that we lost here recently.
73.207 -> So, about a year ago, on December 25th,
78.512 -> we witnessed one of
the largest innovations
82.049 -> in human history go into production.
85.185 -> Over 29,000 engineers
and scientists worked on this product
90.624 -> and this project that would
impact the entire human race.
96.063 -> And this innovation was arguably
one of the largest resilience
102.069 -> and observability
initiatives ever taken.
107.708 -> What I'm referring to
is the James Webb Telescope,
111.512 -> which you heard referenced
earlier in Adam's keynote.
115.148 -> It is the largest most powerful
telescope ever built.
119.987 -> It will allow scientists to look
at what our universe was like,
124.124 -> about 200 million years
after the Big Bang.
128.595 -> Just incredible innovation.
132.599 -> The James Webb Telescope has been
under development for over 20 years.
138.839 -> $10 billion budget for the program,
141.508 -> which was up from the original
$1 billion forecasted in 2002.
147.181 -> And as the team was doing
their own resilience testing,
151.518 -> they found that during a vibration test
155.856 -> that the screws holding
the actual sun shield together failed
160.794 -> and they also popped loose.
163.263 -> This led to another 10 months
165.265 -> and an additional $800 million in costs.
168.535 -> This testing was done to simulate
the actual conditions
172.406 -> of riding an Ariane 5 rocket to space.
176.71 -> This telescope had to actually
travel a million miles
181.081 -> and then open up remotely.
183.684 -> And did I mention,
there's only one of these?
187.688 -> NASA actually had to build
the necessary observability
192.092 -> and resilience
capabilities into one device.
196.163 -> They had to build this system
to be able to withstand
200.234 -> the micrometeoroid strikes
and temperature swings,
204.404 -> that ranged anywhere from 230 degrees
208.075 -> Fahrenheit to negative
394 degrees Fahrenheit.
212.613 -> And it took about a month
to get to its final location,
216.884 -> which is about
a million miles from Earth.
221.421 -> And when I tell you this, this space
telescope, it's really big.
227.728 -> It represents about
a three-story building.
231.698 -> Its width and length are equal
to that of a tennis court.
236.603 -> And to actually get this thing
to launch,
239.273 -> you actually fold it up
before the rocket can go off.
243.677 -> And finally, the team at NASA
and our European Space Agency,
249.116 -> they had to create the tools to test
and overcome more
253.22 -> than 300 single points of failure.
258.158 -> They had to create
the necessary observability
260.727 -> tooling to actually make this possible.
263.564 -> And the photo that you happen to see
up here is of the Tarantula Nebula,
269.036 -> which is 161,000 light-years away.
276.076 -> Today, we have an action-packed agenda.
280.814 -> We'll be on a very deep exploration
283.917 -> on how we think about resilience
and observability
287.554 -> with the same mission-critical
application lens,
291.525 -> as a NASA space telescope.
294.228 -> We're also going to hear
from two amazing customers
297.965 -> who will be sharing their best practices
300.634 -> on how to build and design modern
applications with resiliency in mind.
305.839 -> Finally, we're going to close off
the day today with strategies
310.31 -> and resources to help you all think
about continuous improvement.
318.252 -> Now, before we begin,
I want to level set everyone
323.19 -> on some important definitions just
so we're on the same mental model.
327.528 -> First off, how we think
about observability.
330.764 -> It really describes how well
you can understand
333.433 -> what is happening with your system,
335.335 -> often by instrumenting it
to collect metrics, logs,
339.006 -> and traces. Resilience,
the ability of a workload
343.911 -> to recover from infrastructure,
service or application disruptions.
349.483 -> Both observability and resilience
352.619 -> are a critical and foundational element
356.023 -> to AWS's Well-Architected Framework.
359.293 -> They specifically live in the reliability
362.896 -> and operational excellence pillars.
365.132 -> So, two of the six pillars of an
architectural framework that we use.
369.636 -> Similar to NASA's approach
with the Webb Telescope at AWS,
374.975 -> we want to partner with all of you
377.11 -> to help you automatically be able
to recover from failures,
381.281 -> to build and test recovery procedures
384.351 -> to help you scale
and to help implement changes
388.522 -> using ongoing automation so that
we can reduce manual errors.
393.66 -> So, to cover off on our building
applications and resiliency,
398.732 -> please give a warm welcome
to my colleague Shaown Nandi,
402.736 -> who is the Director
of Solutions Architecture
405.038 -> and Customer Success
for our strategic customers.
408.008 -> Thank you so much. Welcome to re:Invent!
411.078 -> [music playing]
419.486 -> Thank you Francessca.
420.854 -> What an incredible story
to hear about the telescope.
423.39 -> It is great to be here
with all of you today.
425.626 -> I see customers, I see friends
in the audience.
428.762 -> Thank you for taking time,
429.997 -> especially with all the exciting
World Cup action.
432.199 -> I hope there's some football,
American soccer
434.268 -> fans in the audience, maybe.
435.869 -> Yes.
439.206 -> For those tuning in via the stream,
441.308 -> please thank you
for joining us virtually.
443.844 -> In business, the resilience
of workloads is critical.
446.847 -> Today we're going to hear
from two customers
448.482 -> from FINRA and Capital One,
450.35 -> and we'll walk through
how we work with all our customers
453.053 -> to ensure their workloads
are Well-Architected
455.022 -> and resilient on AWS.
456.657 -> So, let's dive right in.
458.926 -> What is resilience?
460.427 -> Resilience refers to the ability
for workloads,
463.23 -> workloads are your applications,
your products,
466.2 -> your business processes, to respond
and quickly recover from failure.
470.771 -> A workload can be as simple as
a single application
473.34 -> running in a single AWS account,
475.242 -> or it might be a set of products
that span multiple accounts.
478.445 -> There are three mental models
I want you to consider
481.315 -> as we go through today's presentation.
483.45 -> Think about how you can build systems
to be highly available
486.82 -> with resistance to common failure modes,
489.156 -> how to recover your system
490.757 -> if you run into one of
these rare failure scenarios.
493.927 -> And underpinning all of this
is the idea of continuous resilience.
498.065 -> This is where you're implementing
DevOps practices like CI/CD
500.934 -> to automate your delivery pipelines.
503.604 -> Introducing failure on an ongoing basis
506.039 -> to test your system chain
and test your teams for weaknesses
510.377 -> and implementing ongoing observability
512.579 -> and monitoring practices.
515.449 -> As with security and sustainability,
518.619 -> resilience is a shared responsibility.
520.921 -> AWS is responsible
for resilience of the cloud.
524.424 -> You, our customers, are
responsible for resilience
527.394 -> of your workloads in the cloud.
530.163 -> This shared model helps relieve
your operational burdens,
532.866 -> the customer's operational burdens,
as AWS operates,
536.27 -> manages, controls,
the host operating system,
539.072 -> the virtualization layer,
all of the infrastructure.
542.242 -> We're responsible for all the services
544.144 -> that are offered in the AWS Cloud.
546.78 -> Customer responsibility is determined
by the services you select.
550.684 -> You have to carefully consider
552.386 -> which services you choose
as responsibilities will vary
555.289 -> depending on how they integrate
and what regulatory frameworks apply.
558.959 -> And often you'll hear us
560.527 -> recommending using
our higher-level services.
562.863 -> You heard about
the serverless announcements
564.264 -> this morning at Adam's keynote,
565.933 -> those tend to have less operational
responsibilities that sit with you.
569.837 -> As a solution architecture leader,
571.371 -> my teams partner with all of you
573.006 -> to make these decisions.
You're not alone in figuring this out.
578.045 -> AWS from day one has built
resilience into our culture.
582.216 -> We use a service
ownership model internally,
585.419 -> which incentivizes our teams
to continuously
587.621 -> improve their operations.
589.189 -> We've organized our engineering
and product management
591.592 -> efforts
into small multi-disciplinary teams.
594.394 -> We call them two-pizza teams.
595.829 -> Most of you heard this term.
597.264 -> We like it 'cause we think
those teams can be fed by two pizzas,
600.667 -> depends I guess,
on the size of the pizzas.
602.736 -> And these teams own
a service end-to-end.
605.706 -> What that means is the ownership is not
608.842 -> just of designing
and launching their service,
611.111 -> but you have to operate it
during production,
613.714 -> be on calls for issues as they arise.
616.817 -> For customers who use this structure,
it's a major cultural shift.
620.42 -> The idea that your responsibility
for the service never really ends.
624.958 -> In AWS all new services
are reviewed for launch
628.495 -> using an operational readiness
review process we call ORR.
631.798 -> It's basically a set of questions,
634.067 -> a checklist that uses known best
practices and a standardized runbook.
638.038 -> When we roll out new services
or update existing ones,
640.908 -> we use safe continuous
deployment pipelines
644.011 -> that automate pre-production testing,
646.146 -> support automatic rollbacks
and stagger deployments.
649.016 -> When we launch our services
or even add features, we start small.
652.586 -> We start with a single instance,
654.354 -> go across an AZ, roll out
across multiple AZs, finish a Region,
658.525 -> and then ultimately roll
to our other Regions.
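The staggered rollout described here can be sketched as a loop over waves with a health gate after each one. This is a minimal illustration, not AWS's actual pipeline: the wave names, the `staged_rollout` function, and the `deploy`, `healthy`, and `rollback` callbacks are all hypothetical.

```python
from typing import Callable, List

# Hypothetical waves, in the order described in the talk.
WAVES: List[str] = [
    "single instance", "one AZ", "multiple AZs", "full Region", "other Regions",
]

def staged_rollout(deploy: Callable[[str], None],
                   healthy: Callable[[str], bool],
                   rollback: Callable[[str], None]) -> List[str]:
    """Deploy wave by wave; on the first unhealthy wave, undo every wave so far.

    Returns the list of waves left deployed (empty if a rollback happened).
    """
    done: List[str] = []
    for wave in WAVES:
        deploy(wave)
        done.append(wave)
        if not healthy(wave):
            # Automatic rollback, newest wave first.
            for finished in reversed(done):
                rollback(finished)
            return []
    return done
```

Starting with the smallest wave means a bad change is caught while it affects one instance rather than a whole Region.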
661.061 -> And if any issues arise, we leverage
our correction of error process.
665.632 -> This is where we go and understand
what the root cause was.
669.136 -> This is not about placing blame.
671.605 -> This is about diving deep to find
the true reason
674.341 -> that something failed.
675.876 -> And after an issue is mitigated,
we drive company-wide engineering
679.012 -> sprints to ensure the issue
is fixed across all AWS services.
683.75 -> The learnings become part
of the ORR process,
685.986 -> goes right back to the top and
ensures similar issues don't reoccur.
692.292 -> When we think about resilience
in the cloud,
694.928 -> there are four key areas we focus on.
697.364 -> First, you have to anticipate
what's going to happen.
700.434 -> To do this, we use code reviews,
failure-oriented programming,
704.204 -> immutability, simple designs,
whenever possible.
707.808 -> Second, monitoring, we're going
to talk a lot more about this,
710.444 -> health checks,
tracing, alarms, dashboarding.
713.814 -> Responding, this can be the longest
part of any incident
717.184 -> and you minimize that by
identifying event-driven patterns,
720.153 -> using machine learning
powered operations.
722.923 -> And finally learning.
724.291 -> I talked about the COE process
that we've built.
727.06 -> We look at our logs and all
that learning is fed right back in
730.864 -> to anticipate better the next time
and we come full circle.
735.569 -> When we think about your responsibility
737.437 -> for resilience in the cloud,
739.406 -> you also need to think
about resilience threat modeling.
742.075 -> There are typical categories of failure,
743.577 -> you see them up on the screen.
745.179 -> For example, for code deployments,
747.381 -> what happens when you have
a failed deployment?
749.55 -> What are you set up to have occur?
751.285 -> Do you have instrumentation
to detect it?
753.487 -> Can your CI/CD system
automatically roll back?
757.09 -> In core infrastructure, what if
a single instance is terminated?
760.961 -> Have you designed to make sure
this will not impact you?
764.131 -> And if there's an impairment
in a single AZ
766.033 -> or you experience a gray failure,
what's going to happen?
769.636 -> In terms of data and state,
771.338 -> what if your customers
overwhelm your service
773.74 -> or your database gets corrupted?
775.609 -> What happens if a third-party
dependency fails?
778.178 -> Do you gracefully degrade?
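One way to picture graceful degradation when a dependency fails is a small wrapper that serves a fallback and stops calling the dependency after repeated failures. This is a simplified sketch under assumptions: `DegradingClient`, its parameters, and the "guest mode" fallback in the usage note are hypothetical, and a production circuit breaker would also need a way to close again after the dependency recovers.

```python
from typing import Callable

class DegradingClient:
    """Serve a fallback when a dependency fails, and stop calling it after
    max_failures consecutive errors so requests do not pile up behind it."""

    def __init__(self, call: Callable[[], str], fallback: Callable[[], str],
                 max_failures: int = 3) -> None:
        self.call = call
        self.fallback = fallback
        self.max_failures = max_failures
        self.failures = 0

    def get(self) -> str:
        if self.failures >= self.max_failures:
            # Circuit is open: skip the dependency entirely.
            return self.fallback()
        try:
            result = self.call()
        except Exception:
            self.failures += 1
            return self.fallback()
        self.failures = 0  # a success resets the consecutive-failure count
        return result
```

For example, `DegradingClient(fetch_profile, lambda: "guest mode")` would keep answering with a reduced experience while a (hypothetical) `fetch_profile` dependency is down.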
779.913 -> I talked to a CIO this morning
about a challenge
781.915 -> they faced a couple weeks ago.
783.483 -> A major application outage.
785.219 -> They had a dependent system,
it was a login system,
788.522 -> federation system that failed.
790.39 -> A queue built up 1.2 million
requests out there
793.627 -> and it took them seven hours to respond
796.129 -> and detect. They didn't have
automatic rollback in place.
799.199 -> And they knew, they knew that it was
a challenge after the fact
801.835 -> they hadn't prioritized fixing it
and they want to fix it.
804.771 -> You have to consider these.
806.406 -> Now you also have to think
about unlikely scenarios.
809.109 -> You may choose not to engineer
for the really unlikely ones,
811.979 -> but you should consider what'll happen.
813.48 -> What if a natural disaster impacts
an entire coast in the United States?
818.151 -> Or my favorite panic scenario,
820.12 -> what are you going to do
during the zombie apocalypse?
822.422 -> How will you keep your systems up?
823.891 -> I keep a bag packed just in case.
828.428 -> A key part of resilience workloads
830.163 -> is making sure you have
a strong foundation.
832.533 -> Start with AWS infrastructure.
835.169 -> With the launch of Hyderabad
last week, we have 30 AWS Regions
839.039 -> and each Region has at least
three Availability Zones.
842.075 -> And each AZ is a multiple
physically separated data center.
846.18 -> Each region has two independent,
fully redundant transit centers.
850.117 -> On top of that is your network.
852.152 -> Your network should always be
redundant, always available,
855.055 -> and seamlessly routed.
856.69 -> On top of that, you've got your data,
859.226 -> you have to have confidence
in the resiliency of your data.
861.895 -> There's so many forms, file system,
block databases, in-memory caches,
866.967 -> consider how eventual
consistency impacts design.
870.003 -> And finally, your application.
872.806 -> Your highly resistant application
should be able to self-heal.
876.376 -> We like you to use microservices
app architecture
878.812 -> when you're building new, we know
many existing apps don't do that.
881.849 -> You want to decouple interdependencies,
884.017 -> have loose coupling when possible
and always remove state
887.221 -> when you can from app components.
890.457 -> Your goal is to build systems
that never fail.
893.193 -> The reality is failures do happen.
895.996 -> So, there are a few things you can do
to help reduce the impact of failure.
899.9 -> First, very basic set timeouts.
903.036 -> Did you know that many frameworks
default to infinite timeouts?
906.139 -> That is just asking for trouble.
908.675 -> You need retry with backoff.
910.644 -> When you don't backoff, you're going
to end up with a retry storm.
913.347 -> A subsequent cascading failure.
915.382 -> One retry is usually enough
to be resilient
918.719 -> to intermittent errors.
920.053 -> Retry once, fail fast.
922.222 -> That example I mentioned earlier,
1.2 million up in the queue.
926.193 -> Yeah, clearing that was
a pretty big pain.
928.495 -> You want to limit the sizes
of your queues
930.564 -> and make sure you rate limit
your APIs and load shed when needed.
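The advice above (explicit timeouts, a single retry with backoff, bounded queues, load shedding) can be sketched in a few lines. This is an illustrative sketch, not code from the talk: `call_with_retry`, `try_enqueue`, the default values, and the use of random jitter on the backoff delay are all assumptions.

```python
import queue
import random
import time
from typing import Any, Callable, TypeVar

T = TypeVar("T")

def call_with_retry(op: Callable[[float], T], timeout_s: float = 2.0,
                    retries: int = 1, base_delay_s: float = 0.1,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call op with an explicit timeout; retry a limited number of times.

    op receives the timeout and is expected to enforce it, for example by
    passing it on to an HTTP client. The default is one retry, then fail fast.
    """
    attempt = 0
    while True:
        try:
            return op(timeout_s)
        except Exception:
            if attempt >= retries:
                raise  # fail fast instead of feeding a retry storm
            # Exponential backoff with random jitter (an assumption here).
            sleep(random.uniform(0, base_delay_s * (2 ** attempt)))
            attempt += 1

def try_enqueue(q: "queue.Queue[Any]", item: Any) -> bool:
    """Load shedding: reject new work when the bounded queue is full."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        return False
```

Creating the queue with `queue.Queue(maxsize=...)` is what keeps a backlog from growing without limit; callers that get `False` can return an error immediately rather than wait.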
936.87 -> Before failures occur,
it's important to test.
940.674 -> Continuous testing is imperative
in understanding
943.477 -> how your system will react to unknowns.
945.679 -> This includes strategies like chaos
engineering,
947.781 -> conducting game days,
practicing failovers.
951.084 -> We use chaos to prove or disprove
our assumptions about our system's
955.155 -> capability to handle disruptive events.
957.491 -> Chaos stresses an application
959.259 -> in testing or production environments.
961.161 -> We create disruptive events
artificially: server outages,
964.031 -> API throttling.
965.332 -> Amazon has been purposely
injecting controlled failure
968.402 -> into limited environments
since the early 2000s.
971.405 -> That's how we ensure readiness for
the most adverse of circumstances.
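As a rough illustration of injecting controlled failure, the sketch below wraps a function so that a configurable fraction of calls fail. It is a toy, not how Amazon or any AWS service does it: `inject_faults` and `InjectedFault` are hypothetical names, and the random generator is passed in so an experiment can be repeated with the same seed.

```python
import random
from functools import wraps
from typing import Any, Callable

class InjectedFault(Exception):
    """Raised in place of a real outage during an experiment."""

def inject_faults(failure_rate: float, rng: random.Random) -> Callable:
    """Decorator that makes roughly failure_rate of calls fail on purpose."""
    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        @wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            if rng.random() < failure_rate:
                raise InjectedFault(f"injected failure in {fn.__name__}")
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```

An experiment would state a hypothesis first (for example, "callers keep responding when 10% of dependency calls fail"), apply the decorator in a limited environment, and then compare what was observed against that hypothesis.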
976.476 -> AWS is investing heavily
in resilient services.
979.913 -> We just spoke about chaos engineering.
981.715 -> I want to call out the AWS
Fault Injection Simulator,
984.818 -> a fully managed service.
986.153 -> We love those. It simulates
real-world failures to uncover hidden bugs,
990.224 -> monitoring blind spots
and performance bottlenecks.
993.293 -> Our resilience hub, AWS Resilience Hub
996.463 -> provides a central place
to define, validate,
998.832 -> and track the resilience
of your applications on AWS.
1002.236 -> And it pulls in the best practices
from our Well-Architected Framework
1005.339 -> so you can benefit from
what all our other customers
1007.741 -> have learned and done.
1009.309 -> And lastly, I'd like to highlight
the Amazon Route
1012.546 -> 53 Application Recovery Controller.
1015.182 -> It enables you to control
your application recovery
1017.651 -> across multiple AWS Regions,
Availability Zones, and on-prem.
1022.256 -> It makes recovery simpler
and more reliable
1024.825 -> by eliminating the manual steps
1026.093 -> required by traditional tools
and processes.
1028.529 -> And I'm excited today to announce
a new enhancement
1031.798 -> to the Route 53 Application
Recovery Controller.
1035.068 -> We're adding a new feature
in preview today called Zonal Shift.
1039.306 -> It's built on Elastic Load
Balancer inclusive of ALBs and NLBs.
1044.144 -> During a failure, removing your
application in an AZ can be complex.
1048.715 -> It can involve configuration steps
across EC2, ELB, auto-scaling.
1053.387 -> Customers like yourself have been
asking us for a simple,
1055.656 -> reliable, and easy-to-use tool
1057.524 -> that can help recover
from AZ impairments.
1059.96 -> With Zonal Shift, when you build
your applications with ELB
1063.33 -> and have cross-zone traffic disabled,
you get a built-in control
1067.301 -> for shifting application traffic away
from an AZ with a single action.
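To make the idea concrete, here is a toy model of shifting traffic away from one AZ when cross-zone load balancing is disabled. It does not call or imitate the real Route 53 Application Recovery Controller API: `ZonalRouter`, `shift_away`, `cancel_shift`, and the AZ and target names are hypothetical.

```python
from typing import Dict, List, Set

class ZonalRouter:
    """Toy load balancer in which each AZ serves only its own targets
    (that is, cross-zone load balancing is disabled)."""

    def __init__(self, targets_by_az: Dict[str, List[str]]) -> None:
        self.targets_by_az = targets_by_az
        self.shifted_away: Set[str] = set()

    def shift_away(self, az: str) -> None:
        """Single action: stop sending traffic to the impaired AZ."""
        self.shifted_away.add(az)

    def cancel_shift(self, az: str) -> None:
        """Return the AZ to service once it has recovered."""
        self.shifted_away.discard(az)

    def active_targets(self) -> List[str]:
        """Targets that still receive traffic, in AZ name order."""
        return [
            target
            for az, targets in sorted(self.targets_by_az.items())
            if az not in self.shifted_away
            for target in targets
        ]
```

The point of the model is that recovery is one reversible call on the router, with no changes needed to the instances or scaling groups behind it.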
1071.205 -> To learn more about this preview,
there's a session tomorrow, ARC 329
1075.142 -> at 10:45 AM Pacific, breakout session.
1077.344 -> Please go check it out.
It's a pretty cool feature.
1080.48 -> Now I'm excited to introduce
our first customer Will Meyer,
1084.351 -> Managing VP of Cloud
and Connectivity for Capital One.
1087.588 -> Please join me in welcoming
Will Meyer to the stage.
1090.29 -> [music playing]
1096.496 -> Thank you Shaown and good afternoon,
everybody.
1098.532 -> This is awesome to see all of you.
1101.101 -> You know, I spend a fair bit of time
1102.369 -> thinking about what are the differences
1104.071 -> between just being on the cloud
and really thriving on the cloud.
1108.041 -> And I think true system resilience
is one of those things,
1110.777 -> so it's great to be able to talk
about it with you.
1115.582 -> If you're not familiar with Capital One,
1116.85 -> we are a financial services institution,
1118.819 -> basically an information business.
1121.488 -> Our success is based on our ability
to process data, generate insight
1125.058 -> that we can use to help our customers
1126.927 -> financially, for example,
by giving them credit.
1130.497 -> And so, when you think about it
being resilient to external change
1133.767 -> and managing risk are baked
into our business model,
1136.803 -> not just our tech stacks.
1138.238 -> We've been doing it for a while,
1139.439 -> since before the public cloud
was a thing.
1141.675 -> But I think we knew pretty early that
it would be an accelerator for us.
1144.845 -> There were Capital One folks on stage
at re:Invent in 2015
1148.849 -> talking about our intention
to go all-in on AWS.
1152.386 -> We did that.
1153.654 -> We rebuilt and migrated thousands
of workloads, petabytes of data.
1158.559 -> We built a security
and controls framework
1162.062 -> that was appropriate
to the level of trust
1164.031 -> that our customers place in us
and also what our regulators demand.
1168.235 -> And in 2020 we closed
our last data center.
1170.938 -> We have hundreds of teams
that are doing everything from online
1174.408 -> user experience
to advanced machine learning,
1177.11 -> call center, back office, all on AWS.
1180.681 -> It has been challenging,
but it's also been super fun.
1183.217 -> And AWS has been a tremendous enabler
for our teams and for our business.
1189.556 -> We also have some battle scars and
I think we have learned along the way
1193.327 -> that we're making a lot of
good progress on sane defaults,
1195.863 -> but the cloud isn't perfectly plug
and play quite yet.
1198.866 -> There is complexity,
there are trade-offs everywhere.
1201.935 -> I think we see that
when we talk about cost.
1203.704 -> I know many of us are focused on
reducing waste in our cloud spend.
1207.14 -> I think we see it in security
and compliance.
1209.409 -> We want the right thing to be
the easy thing,
1212.145 -> but we are still working on easy.
1214.314 -> And so, as we talk about
observability and resilience in general,
1217.184 -> I think it's important
to not just look at them
1218.852 -> as a set of isolated patterns,
1220.42 -> but really as an integrated part
of your overall approach
1223.39 -> to managing the cloud.
1228.095 -> You know, you all work with the cloud
every day.
1229.763 -> I think it's easy to forget
just how far the conversation
1232.533 -> about cloud resilience has really come.
1234.568 -> You know, I remember being asked,
1235.869 -> "Hey, what happens when Amazon has
a big holiday e-commerce spike,
1239.706 -> and all of a sudden there's no more
spare capacity to run on AWS?"
1242.843 -> And you know,
it never really worked that way,
1244.545 -> but I think it's a good reminder
of just how much has changed.
1247.414 -> AWS is serving incredibly
demanding sectors.
1251.018 -> We see really world-class
engineering talents
1253.887 -> working on resilience
of public cloud infrastructure,
1257.024 -> including through open-source.
1259.226 -> At Capital One, we have had
tremendous outcomes.
1262.196 -> We have fewer customer
impacting incidents.
1264.631 -> We are faster to recover
when we do have them.
1267.201 -> We attribute that largely
to the partnership with AWS.
1269.77 -> It has been really powerful,
1271.338 -> and I know many of you
have seen the same.
1274.308 -> You also all know, I think that
the cloud isn't actually magic
1277.845 -> and you don't get
this completely for free.
1280.347 -> Just like when we talk about costs or
security or compliance and resilience,
1284.618 -> AWS gives you incredibly
powerful tools and concepts,
1288.088 -> but you need to use those tools
1289.456 -> and you need to embrace those concepts
1291.258 -> and integrate them into your
ecosystem and your ways of working.
1294.828 -> And we work hard to do that.
1297.564 -> We run our US businesses
in two Regions in east and west,
1300.868 -> multiple AZs in each.
1302.97 -> Our most critical workloads,
1304.104 -> are active-active
with latency-based routing.
1307.341 -> We do auto-scale, although honestly,
we tend to run a bit overprovisioned,
1311.178 -> partly because we want to be able
to fail our entire business,
1314.314 -> all divisions over into one of
those Regions within minutes,
1317.417 -> which we do occasionally.
1319.853 -> And this particular topology
does not make sense for everyone,
1323.457 -> but whatever does make sense,
1324.625 -> focus on standardizing
the deployment pattern
1327.194 -> that you want your teams to have.
1329.363 -> We have spent a lot of time
organizing and defining
1331.365 -> our non-negotiable requirements.
1333.433 -> We've organized them into
a couple of different tiers.
1336.036 -> This platinum, this multi-Region,
1339.173 -> multi-AZ active-active version
we call our platinum standard.
1343.41 -> And we really hold ourselves
accountable to that.
1346.613 -> We've also made a bet
on a company-wide deployment pipeline
1349.416 -> that lets us do blue-green deployments
1351.084 -> across Regions consistently and safely.
1352.986 -> That's a big investment, but we
have found that it is worth it.
1357.591 -> AWS also talks a lot about
powerful architectural concepts;
1361.895 -> you heard about a few of them
a minute ago.
1364.231 -> We're also big fans of static stability,
1366.2 -> thinking about how your system
behaves in isolation
1369.336 -> so that when everything
around it is going haywire,
1371.205 -> you don't actually need to take
any action to remain stable.
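One common way to get this property is to provision enough capacity up front that losing a location requires no scaling action at all. The arithmetic can be sketched as follows; `capacity_per_location` and the numbers in the usage note are illustrative assumptions, not Capital One's figures.

```python
import math

def capacity_per_location(peak_load: int, locations: int,
                          tolerated_losses: int = 1) -> int:
    """Capacity each AZ or Region must hold so that the survivors can carry
    peak_load with no scaling action after tolerated_losses locations fail."""
    surviving = locations - tolerated_losses
    if surviving < 1:
        raise ValueError("need at least one surviving location")
    return math.ceil(peak_load / surviving)
```

With a peak of 300 units across three AZs, each AZ needs 150 units (450 in total, 50% above peak). With two Regions, each Region must be able to carry the full peak on its own, which matches the overprovisioning trade-off mentioned earlier.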
1374.107 -> I think that's important when you
think about how much complexity there
1376.743 -> is in many of these cloud architectures.
1379.279 -> Loose coupling is great; it's how
we build evolutionary architecture.
1382.182 -> But with microservices everywhere
and event-driven everything,
1385.519 -> these things can be hard
to reason about.
1388.188 -> And I think when we look
at production incidents
1390.157 -> much more than straight failures,
1391.725 -> what we see are complicated
degraded performance,
1395.796 -> partial failures across multiple systems
1398.866 -> interacting in some kind of complex way.
1400.734 -> And I think, by the way,
1401.935 -> that's true in incidents
within the cloud as well.
1404.404 -> Those do happen.
1408.041 -> You also need the tools.
1409.776 -> So, I mentioned our deployment pipeline,
1411.411 -> but everything starts
with how you effect change
1413.614 -> in the environment
that has to be managed,
1415.315 -> and that means
with infrastructure as code,
1417.284 -> not with the AWS Console,
that's absolutely critical.
1420.387 -> We've also invested in tooling
that helps us reason
1423.79 -> about the state of our infrastructure
in sort of larger logical units.
1427.728 -> We built kind of a data layer
1429.062 -> and a management plane that helps us
do things like coordinate
those large-scale failovers, still
with appropriate access control.
1435.335 -> I'll make just a quick plug
for Cloud Doctor,
1436.97 -> which you can see in our booth.
1439.206 -> We're also really excited
about all the investments
1440.841 -> we see coming from AWS.
1442.276 -> A few were just mentioned,
Fault Injection Simulator
1444.578 -> is a favorite of ours.
1448.382 -> Lots of amazing tooling continues
to be invested in by AWS.
1452.119 -> We also know that no matter
how good the tools are,
1455.556 -> things are going to fail.
1456.857 -> And so, we practice for that.
1458.325 -> Yes, doing exercises
with tech teams is key,
1460.594 -> but also think about it,
cross-department.
1462.829 -> How do you engage customer comms,
call center,
1465.699 -> decision makers that you may need if
you're going to disable a capability
1468.468 -> in the heat of the moment?
1469.603 -> We think company-wide,
organization-wide game days
1472.806 -> can be really powerful for that.
1477.444 -> I'll just admit I couldn't resist
an opportunity
1479.446 -> to mention our former president,
1481.114 -> but he sort of said something
about software engineering.
1484.518 -> Resilience comes from learning
and adaptation.
1489.423 -> Sometimes that learning is
in the heat of the moment, right?
1492.226 -> How do you coordinate across
multiple teams to debug
1495.762 -> and fix these complicated system
1497.464 -> interactions that we're talking about?
1499.166 -> Collaborative problem solving
really is the new normal.
1501.735 -> We can't just all look at our own
APIs and contracts
1504.004 -> and say, you know, "Hey, it's not me."
1506.206 -> So, think about
how you incentivize that.
1507.641 -> Sometimes the learning
is in the follow-ups.
1510.143 -> At Capital One, we talk a lot
about blameless postmortems,
1512.646 -> and we spend the time
really digging deep when things fail.
1516.383 -> We aren't trying to assign blame,
1517.985 -> but we are trying to find the truth
and it really matters a lot.
1520.587 -> And so, invest in that and in particular
1523.09 -> invest in tracking the follow-ups.
1524.725 -> I know AWS talked about
that a moment ago.
1526.693 -> That part is important.
1529.162 -> I think at the end of the day,
1530.33 -> we have found that we spend
a lot of time
1532.799 -> working on the systems
that help us learn and improve.
1536.47 -> And to a large extent, the time
we spend looking backward
1539.006 -> is the way we speed up moving forward.
1543.644 -> I want to make a quick side
point on serverless.
1546.18 -> I think we at Capital One have made
a pretty intentional investment
1549.583 -> in adopting more
and more managed services,
1552.586 -> and I think it's interesting
to talk about that
1554.121 -> in the context of resilience.
1556.523 -> Over the years, I think we have
all embraced distributed DevOps,
1559.493 -> we have shifted left, and you know,
1561.361 -> we might have horizontal
SRE teams or platform teams,
1563.931 -> but generally all of
our application teams
1566.233 -> have a lot of ownership
of their infrastructure.
1569.203 -> And that has a bunch of benefits
that I think we all understand.
1573.34 -> It also makes some pretty big
asks of those teams:
1576.243 -> be cost efficient, be resilient,
patch your vulnerabilities.
1580.38 -> It's not quite the simplicity
that the cloud promised.
1582.916 -> I think we also see the need
to build internal platforms
1586.153 -> and abstraction layers
1587.354 -> that then help those teams
to do all of those things.
1590.257 -> And this is where we think
serverless fits in.
1592.659 -> We want to move teams up the stack.
1594.161 -> We want to lean into
the shared responsibility model
1596.597 -> and basically get AWS to do as much work
1598.832 -> as possible on our own resilience.
1602.369 -> There is some potential risk
with that operationally, sure.
1604.605 -> But I think we see the resilience
1606.206 -> of the managed services
to be good and improving.
1609.776 -> People also talk about lock-in,
1610.944 -> but I think we've already seen
containers kind of surpass our VMs.
1614.515 -> We see functional event-driven
models pretty much everywhere.
1618.085 -> I think in most domains we are
building lighter-weight applications
1621.154 -> and we're even starting to talk
about being able
1623.357 -> to run them on-prem or at the edge
or anywhere in between.
1627.194 -> And so, we think this move up
the stack really
1628.896 -> is on the right side of history
1631.131 -> and we want to continue to offload
undifferentiated heavy lifting,
1634.635 -> not only to improve our productivity,
1636.537 -> but to improve our resilience
over time as well.
1641.341 -> Parting thoughts:
1642.476 -> Yes, do the textbook stuff.
1644.144 -> Start with
the Well-Architected Framework,
1645.846 -> but make it your own.
1647.281 -> Be intentional about setting standards
1649.149 -> and expectations for architectural
resilience with your teams.
1652.619 -> Educate your teams, track your progress,
1654.988 -> integrate those standards
into your tools,
1657.391 -> start all the way at the beginning
of the development process,
1659.426 -> build the guardrails that
enforce resilience requirements
1662.696 -> right into the infrastructure.
1665.632 -> Interrogate all those dependencies
1667.267 -> and use architectural tools like
static stability to mitigate them.
1671.471 -> And then most important,
think about how you cultivate
1674.041 -> the mechanisms, to use the AWS word,
for learning and response,
1678.345 -> both during emergencies,
which you should be practicing for
1681.148 -> and in the follow-ups
that help you improve over time.
1684.384 -> I think like a lot of things in tech,
1685.686 -> resilience isn't just about tech,
1687.454 -> it is also about you as a responsive
and resilient organization as well.
1692.826 -> Thanks for your time.
I'll hand you back to Shaown.
1695.062 -> [music playing]
1701.802 -> Thank you, Will.
1703.036 -> Some really great learning out there
1704.404 -> that people can start applying
to their workloads
1706.406 -> and implementing immediately.
1707.708 -> I especially love the little bit
on serverless at the end.
1710.41 -> I gave a shout out to it earlier.
1712.513 -> It is such a good way to start
to reduce your operational burden
1715.282 -> and put it in someone else's,
i.e., our, hands to some extent.
1718.785 -> A founding member of the Amazon
EC2 team put it really nicely.
1722.656 -> "You can't legislate against failure.
1724.992 -> Focus on fast detection and response."
1727.427 -> We all know there are going
to be failures.
1729.73 -> So, this is where observability
gives you the ability
1732.466 -> to efficiently detect,
investigate, and respond.
1736.57 -> Often customers don't detect issues
as soon as they begin.
1739.006 -> There's a lag from when the issue
starts to when you find it.
1742.676 -> You can respond to failures quicker
if you alert near the source.
1746.613 -> Investigation is where people spend
the most amount of time
1750.35 -> during an operational event.
1751.685 -> This is the largest contributor
to downtime.
1754.154 -> I mentioned that incident earlier,
seven hours to investigate.
1758.292 -> Leverage logs, metrics, and tracing
to help you investigate
1762.362 -> quickly and understand the root cause.
1764.164 -> Your time is valuable.
1765.933 -> Focus on the stuff that matters
during an operational event.
1768.902 -> There's nothing worse
than trying to fix something
1771.104 -> and making the situation worse.
1773.24 -> And I mentioned our CoE process earlier.
1775.509 -> Make sure you conduct
a post-event analysis
1777.611 -> to help you determine
how you could have prevented this.
1779.98 -> It will probably happen again
if you don't.
1782.449 -> Your goal should be to ensure
that doesn't happen.
1784.585 -> You never have repetition
in these errors,
1786.753 -> and if you do,
you know how to identify it faster
1789.69 -> and remediate it automatically.
1792.626 -> Our philosophy of monitoring
is to ensure
1794.728 -> we measure the things
customers care about
1797.564 -> and measure them
from multiple perspectives.
1799.766 -> We want to continuously introspect
those metrics and question them.
1804.705 -> This is all to understand
the customer experience.
1807.407 -> Instrumentation allows us
to learn about our system,
1810.21 -> give operators real-time feedback
1812.446 -> on how the system is operating
and feed data into alarms.
1815.949 -> This helps us detect and respond
to events when they happen.
1818.886 -> We need to make sure
we're monitoring the right things
1821.221 -> and asking the right questions.
1823.857 -> We want to ask these questions
all about our systems.
1827.127 -> Why is it operating that way?
1828.262 -> You need to add instrumentation,
go right to the top of the stack
1831.832 -> and figure out what's going on.
1833.367 -> The instrumentation produces logs,
metrics and traces.
1836.77 -> We use alarms and dashboards
to analyze those,
1840.007 -> and then we ask more questions.
1842.442 -> Why did that thing go into alarm?
1844.144 -> What was actually going on when
I saw this spike in the dashboard?
1847.047 -> And to answer that, we need
more instrumentation
1849.483 -> and we go through the same
circle over and over again
1852.352 -> and it improves operations
and more importantly,
1854.955 -> it improves our end customer experience.
1857.624 -> This is the virtuous cycle
of monitoring.
1861.361 -> Think about a real-world example.
1862.896 -> A service that customers are calling
through a Load Balancer.
1865.832 -> So, you get a call trying to get
some product info,
1868.635 -> it goes and hits the load balancer.
1871.004 -> Questions you want to ask:
What product are we looking up?
1873.473 -> Who called the API?
1874.575 -> What was this end customer?
1875.843 -> What type was it,
coming through a website,
1877.911 -> coming through some other area?
1879.78 -> Did we find the item in our local cache
1882.049 -> or did we have to punch out
and get to a remote cache?
1884.885 -> How long did it take
to read from the cache?
1887.154 -> How full is the local cache?
1889.056 -> How long did the query take?
1891.124 -> You went out to a remote database,
perhaps; did the query succeed?
1894.862 -> And how long did it take to go
back and populate the caches?
1897.531 -> Was the cache full?
1898.732 -> Did you have to evict items?
1900.467 -> And how big was that product info
object that you went and fetched?
1903.737 -> And what was the response
code from the server?
1906.473 -> And finally, what was the latency?
1908.876 -> If I was operating this service
in production,
1911.512 -> I would need so much
instrumentation in this code
1913.947 -> to be able to understand
its behavior in production.
1915.916 -> That's a good thing.
1917.05 -> I need the ability to troubleshoot
failed requests, slow requests,
1920.487 -> I want to monitor for trends,
signs that different dependencies
1923.79 -> are under-scaled or misbehaving,
there's a lot there.
1926.693 -> Don't oversimplify.
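The questions above map fairly directly onto per-request instrumentation. What follows is a minimal sketch, not the speaker's code: the names `RequestMetrics`, `timed`, and `get_product_info` are hypothetical, and the cache and database are stand-in dictionaries. It records, for one product lookup, who called, whether the local cache hit, how long each step took, how large the response was, and the status code.

```python
import time
from dataclasses import dataclass, field


@dataclass
class RequestMetrics:
    """One record per request, holding answers to the questions above."""
    product_id: str
    caller: str
    timings_ms: dict = field(default_factory=dict)
    properties: dict = field(default_factory=dict)

    def timed(self, name, fn, *args):
        """Run fn(*args), record its duration under `name`, return its result."""
        start = time.perf_counter()
        try:
            return fn(*args)
        finally:
            self.timings_ms[name] = (time.perf_counter() - start) * 1000.0


def get_product_info(product_id, caller, local_cache, remote_db):
    """Hypothetical handler: look up a product and record what happened."""
    m = RequestMetrics(product_id=product_id, caller=caller)

    # Did we find the item in the local cache, and how long did the read take?
    item = m.timed("local_cache_read", local_cache.get, product_id)
    m.properties["local_cache_hit"] = item is not None

    if item is None:
        # Cache miss: go to the (stand-in) remote database and time the query.
        item = m.timed("db_query", remote_db.get, product_id)
        m.properties["db_query_succeeded"] = item is not None
        if item is not None:
            local_cache[product_id] = item  # populate the cache for next time

    m.properties["local_cache_size"] = len(local_cache)
    m.properties["response_size"] = len(item) if item is not None else 0
    m.properties["status_code"] = 200 if item is not None else 404
    return item, m
```

In a real service the record would be emitted as a structured log line or metric at the end of each request, so failed and slow requests can be investigated afterward.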
1929.463 -> And we have a couple types
of metrics we categorize.
1932.733 -> First, health metrics.
1934.368 -> Am I failing?
1935.569 -> It doesn't answer the question,
why am I failing?
1938.405 -> Health metrics, the alarms
are there to alert you to issues.
1941.375 -> And then the diagnostic metrics.
1943.243 -> What's the value of this thing
I measured?
1945.345 -> Why isn't my system working?
1947.648 -> These both fall into
three essential categories.
1950.784 -> The customer experience metrics
are the ones that let you detect
1953.687 -> that your customer has a problem
1955.389 -> and the service
is not responding to them.
1957.491 -> Once you've found a problem,
you can use impact assessment metrics
1961.495 -> to measure the number
and percentage of customers,
1963.797 -> resources, and workloads impacted,
1966.567 -> and then you can use
operational health metrics
1969.536 -> to determine why
the impact is occurring.
1972.339 -> When the why is discovered,
responders
1974.608 -> and automation can take action to go
and resolve the event.
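As a rough illustration of the first two categories, the sketch below takes one window of request results and computes a customer experience signal (the error rate, with an alarm) and impact assessment numbers (how many customers, and what percentage, saw a failure). The function name `assess_impact`, the tuple layout, and the 5% threshold are assumptions made for the example, not anything from the talk.

```python
def assess_impact(requests, error_rate_threshold=0.05):
    """Summarize one time window of requests.

    requests: list of (customer_id, succeeded) tuples.
    Returns an alarm flag plus impact assessment numbers.
    """
    total = len(requests)
    failed_customers = [cid for cid, ok in requests if not ok]
    error_rate = len(failed_customers) / total if total else 0.0

    all_customers = {cid for cid, _ in requests}
    impacted = set(failed_customers)
    percent_impacted = (
        100.0 * len(impacted) / len(all_customers) if all_customers else 0.0
    )

    return {
        # Health / customer experience: are we failing?
        "alarm": error_rate > error_rate_threshold,
        "error_rate": error_rate,
        # Impact assessment: how many customers are affected?
        "customers_impacted": len(impacted),
        "percent_customers_impacted": percent_impacted,
    }
```

This only answers "am I failing, and for whom?"; the operational health metrics that explain why would come from the diagnostic measurements of each dependency.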
1978.979 -> There are three commonly agreed
upon pillars of observability.
1982.916 -> It's metrics, logs, and traces.
1985.385 -> Metrics are the numeric data
measured at various time intervals,
1989.189 -> request rates, error rates,
durations, CPU percentage.
1993.026 -> Logs are your timestamped records
of distinct events
1995.362 -> that occurred within an application
1996.697 -> or a system, such as a failure, an error,
1998.966 -> or just a state transformation.
2000.734 -> And traces represent a single
user's journey
2002.87 -> across multiple applications
and systems,
2005.372 -> usually with microservices,
great modern architecture.
2010.11 -> We have a broad suite
of observability capabilities.
2012.679 -> We have native services
that integrate deeply
2015.349 -> with our AWS services.
2017.217 -> We have Container Insights
and Lambda Insights.
2018.986 -> You heard about some new ones
being launched
2020.487 -> this morning from Adam.
2021.889 -> We have open-source tools:
Managed Grafana,
2026.66 -> Managed Prometheus, big fan.
2028.729 -> We give you lots of options,
lots of choices.
2030.864 -> You can pick the right tool
for your workload.
2034.334 -> With that, I am super excited
to introduce Kym Weiland,
2038.672 -> Vice President of Enterprise
2039.873 -> Operations at FINRA,
who's going to talk more
2041.909 -> about the importance of observability.
2043.81 -> Please welcome Kym.
2045.379 -> [music playing]
2054.688 -> Good afternoon.
2055.822 -> FINRA.
2056.957 -> FINRA is the Financial
Industry Regulatory Authority
2061.094 -> and we help play a critical role
2063.931 -> in ensuring the integrity
of America's financial system.
2067.701 -> We write and enforce rules governing
2069.736 -> the ethical activity of brokers
in the United States,
2073.273 -> we examine firms for
compliance with those rules,
2076.91 -> we foster market transparency
and we educate investors.
2082.516 -> In addition, we also do big data,
lots of big data.
2087.888 -> FINRA has processed peak volumes
2089.656 -> of over 600 billion market
events per day.
2092.693 -> We do this in support of 24 exchanges,
2095.762 -> and we have had to run upwards
of 300,000 compute nodes
2099.7 -> in a single day.
2101.702 -> Currently, we have a storage
footprint of about 500 petabytes.
2108.375 -> We did not get here overnight.
2110.978 -> We started in early 2014, 2015,
2113.18 -> and by 2016 we were doing
about 39 billion transactions.
2117.451 -> We had about a hundred RDS Instances.
2120.32 -> That grew by 2019 where we were doing
about 190 billion transactions
2125.792 -> and we were up to about
500 databases at that point
2129.496 -> with about a 50 petabyte
storage footprint.
2133.5 -> But between 2019 and 2022, there was
unprecedented market volatility.
2139.273 -> And we have grown to an average data
2141.308 -> intake of over 450 billion
events per day,
2147.381 -> over 500 petabyte storage footprint,
and we have over 1200 databases.
2154.421 -> For 2026, what does
the projection look like?
2157.991 -> Simply bigger.
2159.86 -> So, the question is,
how did we get here?
2163.33 -> We got here by starting
2164.565 -> with some very fundamental
architectural principles.
2168.635 -> We are a data-centric organization
2171.271 -> and therefore, it started
with our data management.
2174.107 -> And this is more than just storage.
2176.944 -> It includes data registration,
data lineage, data lifecycle,
2180.814 -> including security data classifications.
2183.984 -> We wanted to ensure that we took
advantage of cloud elasticity.
2188.355 -> Everything from serverless to flexibility
2191.525 -> with instance types,
flexibility with storage types,
2195.562 -> and again, making sure
that it was done in such a way
2198.098 -> that we could do on-demand scalability.
2201.568 -> From an architecture perspective,
2203.804 -> you have to start with
a core framework of services.
2207.407 -> All of our application teams understand
2209.176 -> the supported blueprints,
there's centralized messaging,
2213.18 -> we are API-focused
and we prefer open source.
2218.018 -> But security is also a key part
of that architectural principle.
2223.423 -> And it's not just encryption.
Configuration, authorizations
2229.062 -> and observability,
in terms of it being auditable.
2233.834 -> DevOps: infrastructure as code,
that is core,
2238.138 -> including your CI/CD
and all of the automated testing
2241.909 -> that needs to happen in that area.
2244.645 -> Operations needs to be
resilient and again, automated.
2249.483 -> It also needs to be performant
2251.351 -> and you have to have
that enterprise observability
2254.188 -> in order to ensure
all of those are balanced.
2257.624 -> The output of all of that data
should be done
2260.093 -> in such a way that compliance
is also done as code
2263.864 -> and you can do analytics
in order to show it.
2268.468 -> To manage this is a balancing act.
2270.07 -> It's a balancing act between innovation,
2272.072 -> optimization and adoption.
2275.242 -> Innovation.
2276.844 -> AWS has new services all the time.
2280.848 -> In order to do that, you have
to have service evaluations
2283.35 -> on how you integrate those services
into your architecture,
2287.487 -> into your observability.
2289.556 -> You can do that through R&D,
through POC.
2292.693 -> We also are very active
in preview participation
2296.196 -> and it allows us to give informed
feedback to those services.
2300.701 -> For optimization of workloads,
it is observability of capacity,
2304.705 -> but also those cost efficiencies
in order to get the most efficient,
2309.176 -> cost-effective workloads.
2311.745 -> And then adoption:
delivery focused automation.
2316.984 -> Putting that automation
in the hands of the teams
2319.486 -> with the provisioning guardrails
2321.555 -> so that they can do
the right thing the easiest way.
2325.626 -> But that automation
has to be handled at scale.
2330.13 -> So, again, observability.
2331.365 -> Monitoring.
2332.432 -> Everyone is familiar with monitoring.
2334.001 -> That's what they think of
with observability,
2336.37 -> but it is far more than that.
2337.604 -> It is compliance, everything down
to records management,
2342.176 -> oversight, policy enforcement.
2345.145 -> It is security.
2346.48 -> Yes, it is auditable, but it's also:
do you have those standards
2350.317 -> in your architectural principles
for security logging,
2353.153 -> for security control?
Operational scorecards.
2358.358 -> These are not just testing,
2360.093 -> but what is your resilience
posture of your cloud environment?
2364.464 -> Can you measure it?
2365.732 -> Can you view it?
2368.202 -> Everything including AWS
Trusted Advisor,
2371.071 -> great for cost optimization.
2373.874 -> And then application health status,
the highest layer,
2378.912 -> and this is important not only
for those core connectivities:
2383.25 -> is your application up?
2385.052 -> But also what are the dependencies
of those applications,
2388.255 -> either direct dependencies
or indirect dependencies.
2393.861 -> About three years ago we embarked
on a strategic vision
2397.03 -> to do the multi-Region
disaster recovery.
2401.268 -> And we did so basing it
on some core services,
2405.439 -> Amazon S3 with the Cross-Region
Replication for data,
2410.077 -> Aurora Global Database,
KMS multi-Region keys,
2415.449 -> and the DynamoDB Global Tables
specifically around parameter store.
2420.387 -> Combining those again to make
a specific strategic vision
2425.158 -> for what that architecture
would look like
2427.928 -> and it continues to grow.
2429.63 -> We continue to infuse resilience
2432.165 -> through those architecture and services.
2435.068 -> You always want to constantly
learn and evaluate opportunities
2438.939 -> for efficiencies between those Regions
2442.142 -> while of course ensuring data
encryption, data security,
2445.779 -> and your data durability.
2450.083 -> So, observability specifically
for resilience.
2454.254 -> We have annual tests, sometimes more
than that of the multi-Region.
2458.792 -> So, we created what we call
FINRA canary,
2461.395 -> which of course reports
into our birdcage
2463.964 -> in order to check the health of
the systems in the multiple Regions,
2469.336 -> but not just the system itself
or its core dependencies,
2474.775 -> but also its downstream dependencies.
2477.544 -> So, it cares about itself
and it cares about its friends.
2481.515 -> To do this, we've implemented
a simple red light,
2484.885 -> yellow light,
green light, implementation
2489.056 -> and you can look at them
across regions and go,
2491.491 -> my app is good, my app is good,
but my friends aren't so happy.
2495.596 -> And that may be expected.
2497.064 -> Not every app's position
in disaster recovery
2499.967 -> may be up at any given time,
2502.302 -> but knowing where that is,
and when both
2505.806 -> your app is happy and your friends
2508.108 -> and all your neighbors
that you depend on are happy too,
2510.31 -> that's when you go green.
2512.846 -> And that allows a level of observability
2515.516 -> to be able to be seen
across the enterprise
2518.785 -> and at scale and across Regions.
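One way to picture the red, yellow, green roll-up is the sketch below. It is an illustration only, not FINRA's canary: the names `Light`, `overall_light`, and `lights_by_region` are hypothetical, and the mapping (red when the app itself is unhealthy, yellow when the app is fine but a dependency is not, green when everything is healthy) is an assumption based on the description above.

```python
from enum import Enum


class Light(Enum):
    RED = "red"
    YELLOW = "yellow"
    GREEN = "green"


def overall_light(app_healthy, dependency_health):
    """Combine an app's own health with the health of its dependencies.

    dependency_health maps a dependency name to True (healthy) or False.
    """
    if not app_healthy:
        return Light.RED
    if all(dependency_health.values()):
        return Light.GREEN
    # "My app is good, but my friends aren't so happy."
    return Light.YELLOW


def lights_by_region(region_reports):
    """region_reports maps region -> (app_healthy, dependency_health)."""
    return {
        region: overall_light(app_ok, deps)
        for region, (app_ok, deps) in region_reports.items()
    }
```

Evaluating the same app in each Region gives the side-by-side view described above, where an app can be green in one Region and yellow in another because of a downstream dependency.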
2520.854 -> It's gone?
2526.627 -> There we go.
2528.128 -> So, I will leave you with some
parting thoughts
2530.597 -> about managing this at scale.
2532.933 -> Again, integrated security,
hands-off operations.
2537.871 -> Infrastructure as code.
2540.474 -> You are not going to go on a server
and fix something
2544.645 -> because that server
may not be there tomorrow.
2546.98 -> It may be a different server.
2548.282 -> So, make sure it is
a hands-off operation.
2551.585 -> Automation, automation
of onboarding to your fleet
2555.556 -> and make sure that it is self-reporting
2557.391 -> and the audit trails are there,
2560.027 -> but most importantly it
is delivery focused,
2564.097 -> self-service automation
that those teams can use
2568.535 -> to make the right thing the easy thing.
2572.105 -> With that, I will pass the stage
back to Shaown.
2575.108 -> [music playing]
2582.616 -> Thank you, Kym.
2583.717 -> You know, you heard Kym
talk about the importance
2587.054 -> of maintaining observability.
2589.156 -> Resilience is not a
one-and-done thing or a checklist project.
2592.492 -> You must continuously invest
in improving your resilience,
2595.028 -> which is what we're going
to close with today.
2596.964 -> By the way, I have to give her
a shout out
2598.398 -> for calling dependencies your friends.
2600.934 -> They're your friends until they break
2603.103 -> and then you're quite angry
at those dependencies.
2604.838 -> But it's nice when
they can be your friends.
2608.175 -> Part of continuous improvement
2609.877 -> is making sure that
every workload is reviewed
2611.979 -> using the AWS
Well-Architected Framework.
2614.381 -> Not just once, but every time
2616.416 -> you make a significant change
to your workload.
2618.719 -> The Well-Architected Framework
is a set of questions and
2620.888 -> design principles that enable you
to build and deploy faster
2624.591 -> to release value more often.
2626.193 -> Understand where you have risks
in your architecture.
2628.862 -> You want to intentionally know
about those risks
2632.165 -> and ensure that you've made
intentional architectural decisions
2637.738 -> that highlight how they
will impact business outcomes.
2640.507 -> And you want to make sure
your teams know all
2642.242 -> about the best practices we've learned
2644.211 -> from reviewing thousands of
customers' architectures on AWS.
2648.649 -> The reliability pillar of
Well-Architected outlines
2651.318 -> some key design principles.
2653.22 -> Automatically recovering from failure.
2654.821 -> Think back to Will's
examples at Capital One.
2657.491 -> You want to test recovery procedures.
2659.092 -> We talked about this earlier,
2660.561 -> the importance of chaos engineering
and continuous testing.
2663.73 -> You want to be able to scale
horizontally
2665.766 -> to increase workload availability
and plan for your capacity needs.
2670.771 -> Right-size your instances and resources
2672.673 -> to match your workload's needs.
2674.107 -> You heard Will talk a little
about being over-provisioned.
2676.81 -> Getting close to right will also,
as a bonus,
2679.246 -> help you optimize your costs,
something we all care about.
2681.982 -> And manage change through automation.
2686.587 -> Performing operations is---
2689.256 -> Operational excellence
principles are something
2691.558 -> that are also really important.
2692.826 -> It's the second pillar
in Well-Architected.
2694.928 -> The first one to think about
is performing operations as code.
2698.465 -> You can define your entire workload,
2700.467 -> applications, infrastructure
as code and update it with code.
2704.404 -> That's the game changer
2705.639 -> if you get to that level of automation.
2707.574 -> You can design workloads
to allow components
2709.409 -> to be updated regularly to increase
2711.245 -> the flow of beneficial
changes into your workloads,
2714.515 -> and as you use
your operations procedures,
2716.884 -> look for opportunities to improve them.
2719.353 -> Use proactive threat modeling to
identify potential sources of failure
2723.557 -> so they can be removed
or mitigated ahead of time.
2726.193 -> And finally, drive improvement
through lessons
2728.428 -> learned from all operational events.
2730.397 -> That Correction of Error process
I keep coming back to.
2733.567 -> It's something that people forget about.
2735.469 -> They get worried about blame.
2736.937 -> Do not think about blame,
think about the learning.
2740.44 -> One of our favorite mechanisms
for doing this is game days.
2743.443 -> After your design
for resilience is in place,
2746.113 -> you want to make sure
that it works in production.
2748.515 -> And a game day is a way to ensure
that everything works as planned.
2751.685 -> Use game days to regularly exercise
2753.654 -> your procedures for responding
to events and failures
2756.256 -> as close to production as possible;
2758.458 -> this includes in production
when possible.
2760.727 -> And you want to use the people
who will be
2762.563 -> actually involved in failure scenarios.
2764.631 -> It's a very common mistake
to have a paper exercise
2767.568 -> and none of the folks who are
actually on call at 3:00 AM
2769.837 -> are involved.
2771.004 -> Game days should simulate a failure
or an event to test systems,
2774.842 -> processes and team responses.
2776.944 -> The purpose is to perform the actions
the team would perform
2779.746 -> as if that event happened.
2781.415 -> And you have to do this regularly.
2782.816 -> You want muscle memory
on how to respond.
2784.818 -> You don't want someone
reading the manual,
2786.92 -> I'm going to say it again,
at 3:00 AM.
2788.455 -> It's always 3:00 AM somewhere,
that's for sure, so avoid that.
2793.694 -> Infrastructure event management.
2795.295 -> This is something we offer to our
large enterprise support customers.
2798.832 -> It's available to you if you're
an enterprise support customer.
2801.268 -> It's focused on planning and support
for business-critical events.
2805.706 -> Think about steps you might take
to prepare your workload
2807.975 -> to handle a 10x traffic increase
due to a product launch.
2811.512 -> Like our latest Kindle,
I went and ordered one.
2813.28 -> I'm sort of excited
about writing on a Kindle.
2816.016 -> Customer onboarding, it could be
tied to an ad campaign,
2818.619 -> anything that's going to happen
2819.887 -> that's going to cause a large change
in how you operate.
2823.257 -> Our enterprise support team
can work with you
2825.392 -> over a timeline of several weeks
2826.727 -> to figure out how
your infrastructure's configured,
2829.296 -> make sure service limits are right,
2832.099 -> make sure auto scaling
groups are in place,
2833.867 -> look at your Load Balancers.
2835.502 -> They'll also raise awareness
internally about your event
2838.438 -> and prepare support engineers
to be ready
2840.274 -> to tackle any issues that come up.
2842.376 -> We use this for events
like Prime Day internally.
2845.045 -> We can do it with you as well.
2847.281 -> And for those really exceptional cases,
2850.117 -> I'm thinking about
the Cyber Monday this week
2851.818 -> for those retailers out there,
we offer joint mission control.
2855.255 -> This is just like NASA, right,
going back to where we started,
2858.559 -> think about an all-hands-on-deck event
2860.427 -> to make sure the right people
are potentially in the room with you.
2864.031 -> During Prime Day, we have hundreds
of engineers who come together,
2866.7 -> both virtually and in person
to prepare for the worst-case
2870.17 -> scenario while being ready
and hoping for the best.
2874.741 -> Finally, I want to give you
a couple resources.
2877.644 -> These are things you can reach out
to get more information.
2880.08 -> Three really good things here.
2881.782 -> First, the Well-Architected Framework;
2883.183 -> we have a set of pages
that is full of information.
2886.053 -> Everyone should have checked out
the Well-Architected Framework
2887.921 -> if you're running in AWS.
2889.69 -> Second, a brand-new whitepaper published
2892.292 -> by one of our principal SAs,
2893.493 -> Mike Haken, on fault isolation
boundaries, really compelling.
2897.397 -> I saw Corey Quinn tweeting
about it the day it came out.
2899.8 -> I was on vacation.
2901.101 -> Got me excited that it got published.
2902.636 -> Really good information there.
2904.037 -> And finally, our AWS Solutions Library,
2906.874 -> which is full of Solutions
for all purposes.
2909.243 -> In this case, we've just
launched the resilience guidance.
2912.88 -> We have backup and restore,
failover and failback and many more.
2916.683 -> Please check out the Solutions Library.
2920.654 -> All that said, because
resilience is continuous,
2923.757 -> you have to think about it as
a journey instead of a destination.
2926.76 -> You are never quite done.
2928.629 -> We realized that all of you,
2929.93 -> our customers, are in different spots
in that resilience journey.
2933.233 -> Some of you are just starting out.
2935.035 -> Some of you are further along.
2936.47 -> We heard from Will, we heard from Kym;
2938.038 -> they're pretty far along
in that journey.
2939.84 -> But we have one thing
all of us have in common.
2942.442 -> We're all builders on this journey.
2944.044 -> Whether you're an executive building
a business strategy around resilience
2947.08 -> or a developer building resilience
2949.016 -> through the application
or part of a cloud CoE
2951.051 -> or DevOps team building guardrails out,
2954.321 -> AWS can provide you
with the right guidance,
2956.49 -> the services and infrastructure
to enable your success.
2959.293 -> I want to thank all of you
for spending this hour with us,
2962.563 -> from me and Francessca,
and please have a great re:Invent.
2965.699 -> [applause]
Source: https://www.youtube.com/watch?v=GamnNc6ZMew