AWS re:Invent 2022 - AWS Incident Detection and Response (SUP201)
In this session, learn how to monitor, detect, and manage incidents with AWS services for a quick resolution, and get visibility into the operational excellence of a real-world banking platform. AWS Support’s new offering, AWS Incident Detection and Response, offers proactive monitoring and incident management to reduce the opportunities for failure on workloads and accelerate recovery from critical incidents. Learn how JPMorgan Chase has collaborated with AWS to enable AWS Incident Detection and Response during the Chase.com migration to AWS and for its critical observability and resiliency requirements.
Content
1.027 -> - Good afternoon and again,
1.86 -> thank you for being here as well.
3.18 -> My name is Michael Proctor.
4.38 -> I'm a Site Reliability Engineer
6.09 -> in the Public Cloud
Enablement team in Chase.
10.29 -> Been with the firm for 25 years,
12.15 -> which is great 'cause there's
a lot of different roles.
15.36 -> By way of quick introduction,
16.8 -> I've started programming
in Fortran on punch cards
20.46 -> when I was at university in South Africa
22.29 -> and at that point,
resiliency and redundancy
24.42 -> were putting two rubber bands
26.07 -> around the deck of cards you had
27.33 -> and making sure you had a paper
copy saved somewhere else.
31.68 -> - Excellent. Michael, thank
you for being with us.
34.83 -> Okay, so before we dive
into those two stories
37.2 -> that I mentioned,
38.1 -> we're gonna do a quick
customer survey here.
40.08 -> I wanna find out a little
bit about your needs
42.51 -> and what brought you to this
particular session today.
44.64 -> So I've got three questions
we'd like to start with.
48.06 -> So question number one is
within the last six months,
51.21 -> how many of you within
your teams have dealt
54.021 -> with a major IT outage that
lasted more than one hour?
57.39 -> Just raise your hand so I can
see how many have dealt with
59.94 -> those kind of incidents.
62.25 -> Okay, fair number.
63.81 -> My second question for you is,
66.66 -> of those systems that
experience those outages,
69.27 -> how many of them were
classified as business critical
71.7 -> systems by your business and by the users?
76.86 -> Okay, fair number.
78.87 -> My third question for
you is for those systems,
83.07 -> had you done monitoring, testing,
chaos resiliency testing,
89.19 -> that should have captured the
particular failure mode that
91.65 -> presented during that failure?
96.84 -> Okay.
98.04 -> All right, so we're gonna
look at some market data
100.38 -> in a moment that suggests that
the number of these outages
102.96 -> of critical systems and the
duration of how long it takes
106.05 -> to recover from them,
106.95 -> there's probably more of
them and they take longer
108.78 -> to recover from than you
perhaps might expect,
112.38 -> and what we can do about that.
114.21 -> Okay, so we're gonna go through this today
117.03 -> in four different sections.
118.59 -> We're gonna talk about the needs
119.91 -> for these type of critical applications,
121.74 -> what you need to make them more reliable.
123.99 -> We're going to go through the
release of this new service
127.11 -> called AWS Support Incident
Detection And Response,
130.44 -> which, by the way, is quite a mouthful.
132.45 -> So you're probably gonna
hear me use the acronym IDR
135.42 -> for the rest of that talk.
136.38 -> Whenever I say IDR, that
means the new AWS Service,
139.65 -> for Incident Detection and Response.
141.93 -> We're going to move from that to again,
144.24 -> the story of how Chase was one
of the first major customers
146.42 -> of that and the benefits
they've seen as a result.
149.85 -> And then finally, we're
gonna talk about some things
151.41 -> that we think you can take away
152.43 -> to help make your
applications more resilient
155.1 -> and to have fewer incidents
156.36 -> and recover from incidents more quickly.
163.5 -> Okay, so let's start with a quote
165.09 -> 'cause we always like to start
these talks with a quote,
167.22 -> and this one happens to be,
168.45 -> as you can see from Sun
Tzu and "The Art of War."
171.51 -> This particular talk is
all about preparedness.
173.82 -> It's about preparedness
for critical situations
176.07 -> and for IT incidents specifically.
178.41 -> And this particular quote, I think,
180.12 -> has some important things to
say about preparedness, okay?
183.582 -> And so what the quote is saying is,
186.582 -> if you only know yourself,
189.51 -> or in this case your workload
and the failure modes
191.97 -> it's subject to on a surface level,
195.15 -> you're gonna have probably
some bad outcomes, okay?
198.99 -> If you understand the
workload and the failure modes
201.6 -> it's subject to on a deeper level,
204.48 -> you're gonna have much better outcomes
206.49 -> and you're gonna be able to take
207.75 -> what might be unexpected
events or incidents
210.48 -> and turn them into things
211.92 -> that are expected and more routine,
214.08 -> things that you can recover from
215.52 -> more quickly, okay?
217.59 -> So we're gonna talk about preparedness
219.96 -> through the lens of a
number of mechanisms,
222.69 -> from reviews, architectural
and operational,
225.27 -> that you do up front
226.74 -> to the automation of
deployment and management,
229.44 -> to better monitoring and alerting
232.56 -> and through things like
chaos testing, okay?
235.8 -> And how a service like IDR can
help drive and enforce those
240.48 -> types of best practices.
243.731 -> Okay, so at AWS we like to work backwards
246.42 -> from customer requirements
and what customers need.
248.76 -> So what have customers told
us in this particular space?
252.18 -> Customers tell us that despite
their attempts to build
254.7 -> resilient systems,
255.84 -> and despite putting a lot
of effort into implementing,
258.78 -> monitoring, and alerting for those systems
261.6 -> that most frequently
their first indication
264.06 -> that a critical incident is
evolving is when customers
267.09 -> start telling them that
they have a problem
268.8 -> and start reporting things
into their help desks,
271.2 -> which is obviously a reactive
posture that none of us
273.69 -> wanna be in for our customers.
277.5 -> Once the incident has begun,
279.3 -> the amount of time it
takes to dive into that
281.73 -> and understand the root cause
282.93 -> and what's happening
is far longer than they
285.09 -> or their users can tolerate.
288.06 -> The frequency of these incidents
289.89 -> is much higher than is desired,
292.68 -> and that erodes their trust and
reliability of their systems
295.86 -> with their customers.
297.9 -> And so they really are looking
for solutions to help them
300.69 -> reduce the frequency and the
duration of these incidents.
304.59 -> And so that's what we're
gonna talk about today.
308.25 -> So let's quantify some
of those customer quotes
311.28 -> that we just went through.
312.96 -> There was a market study done
earlier this year by IDC,
316.74 -> which had some interesting findings.
318.57 -> They looked at large Enterprises
320.01 -> like many of you probably work for,
322.02 -> and what they found was that
the average Enterprise is
325.26 -> experiencing over 29 of these
sorts of incidents per year,
330.12 -> okay, over two a month.
331.56 -> As somebody who works as a
technical account manager,
334.32 -> helping customers avoid and
respond to such incidents,
337.86 -> that's obviously way too high a number.
341.1 -> Very interestingly,
341.94 -> they found that on average it
takes five hours to recover
345.87 -> from one of these incidents.
347.52 -> We're gonna explore today why is that?
350.19 -> Why would it take so long for
a system that's probably been
352.98 -> through tons of testing, has
tons of alarms defined for it,
357.54 -> why does it take five hours
to realize something is going
360.36 -> wrong, fix it, or avoid the
problem from escalating further?
366.24 -> Finally, the cost to those organizations
368.46 -> for remediating those sequences of events
371.07 -> runs in the order of 13.5 million
per organization per year.
377.58 -> I don't know what size
your organization is,
379.5 -> what size your IT budget is,
381.12 -> but 13.5 million is a lot
of money to me anyway.
384.9 -> Imagine what you could do
using that money to invest in
388.44 -> inventing better solutions
and improving your solutions
391.44 -> on behalf of your users
and your customers.
395.07 -> So disruptions are costly.
396.72 -> There's some that you can quantify.
398.94 -> There are others that are
more difficult to quantify,
401.4 -> but they are every bit as important
403.5 -> so we're gonna talk about
some secondary impacts.
406.62 -> This picture here in the
lower right-hand corner,
408.63 -> you can see the poor operations guy
410.52 -> that's trying to hold up the
falling domino of effects
413.7 -> that's happening during one
of these incidents, right?
416.04 -> So the question there is what
level of burnout do you have
419.82 -> in your staff from
these multiple incidents
422.04 -> that are happening per year?
424.23 -> How much attrition does that
drive of valuable skills
427.11 -> across your organization?
428.67 -> How much time do you spend
trying to hire people to replace
431.79 -> those people who have burnt out
and moved on to another job?
435.66 -> That's one sort of secondary impact.
438.42 -> There's reputational damage, okay,
440.54 -> we talked about customers getting fed up
442.89 -> with repeat incidents, right?
444.75 -> I'm sure you've all heard
the quote that says,
446.347 -> "A reputation takes a lifetime to earn
448.53 -> and a moment to lose."
450.12 -> Very difficult to
quantify the cost of that.
454.47 -> If you work in regulatory
and regulated environments
457.71 -> as our customer here does,
459.72 -> there's cost associated with
those incidents when you meet
462.27 -> certain reporting timeframes
and recovery timeframes.
466.53 -> And again, there's cascading effects
468.99 -> down the chain of your
partners and others
471.45 -> that use your system who get impacted
473.37 -> by these incidents as well.
477.57 -> All right, so that's
the problem statement.
480.36 -> Why is this important?
481.32 -> Why do we need to reduce the frequency?
483.18 -> Why do we need to reduce the cost of this?
484.71 -> So what's the solution?
486.57 -> Okay, so this service that we developed,
488.67 -> which we're gonna call IDR here today,
490.98 -> it went into general availability
just a month or so ago
493.62 -> in September, if you
haven't seen that yet.
496.86 -> So this service at its essence is AWS
501.33 -> and the customer's specific application team
503.82 -> working together to define a set of alerts
507 -> that are leading indicators of problems
508.92 -> with that particular workload.
510.21 -> And so then we work to set up automation
513.03 -> to detect those failure
modes and we feed those
515.88 -> not only to the customer's
operation teams and consoles,
519.33 -> but we also feed them directly
into AWS's internal incident
523.71 -> management systems and
Incident Management engineers.
526.74 -> So effectively what you
are getting is the ability
530.01 -> to benefit from the same
type of rich monitoring
533.85 -> and Incident Management
that we apply internally
536.34 -> to our AWS services as well as we apply
539.43 -> externally to AWS Managed
Services customers
542.52 -> that some of you may be.
544.32 -> And so you can benefit from
that type of response time
547.86 -> and detection by having
that automation in place
551.4 -> and agreeing on those signals and alarms
553.86 -> gives us the ability to
commit to having an incident
557.19 -> spun up with the right AWS resources on it
560.28 -> in 15 minutes or less.
562.59 -> Now that might sound similar initially to
565.11 -> if you're Enterprise
Support customers today,
567.03 -> you know that when you open a critical,
568.83 -> the highest level support case,
570.84 -> there's also a 15-minute response
time objective for those.
574.26 -> The difference is the clock on that starts
576.48 -> when you open a support case.
578.61 -> The clock on an IDR incident
begins when the system
581.82 -> detects degradation or
problems in your system
585.15 -> and our experience is that the lag between
587.91 -> when a system detects a
problem with your application
591.03 -> and when it actually gets diagnosed
592.89 -> and a support case gets
opened is very significant.
595.59 -> We're gonna look at how
significant in a minute.
598.08 -> So that's the difference between
Enterprise Support and IDR,
601.41 -> which is an add-on offering
which Enterprise Support
604.02 -> customers can purchase.
605.82 -> And our AWS Managed
Services customers that have
608.43 -> Enterprise Support get IDR
610.23 -> as a default part of their service.
614.25 -> Okay, so how does onboarding
to this service work?
617.64 -> So we said before in our opening quote,
619.687 -> "The most important thing
is to know yourself,
621.72 -> know your workload, and
know the failure mode."
623.43 -> So the very first thing that
we do is hopefully if this
626.4 -> workload is in production or near to it,
628.86 -> you've gone through a set of architectural
630.6 -> and operational reviews
using something like
632.423 -> the AWS Well-Architected Framework.
634.95 -> If that hasn't happened
for whatever reason,
636.96 -> for this particular workload
that we're onboarding to IDR,
639.84 -> then we'll work with you
and go through that review,
641.82 -> make sure that we understand
the key solutions,
644.7 -> dependencies, failure modes,
monitoring, et cetera.
649.14 -> The next thing that we do
is we look at your current
651.87 -> monitoring alerting setup.
653.28 -> And do you have alerts
that are appropriate
655.32 -> for the type of Incident
Management triggering
658.29 -> that we're talking about here?
659.61 -> If you have those,
660.63 -> if they happen to be
defined in CloudWatch,
662.97 -> then we configure an EventBridge rule,
which feeds that directly
665.88 -> to our Incident Management platform
667.89 -> and our Incident Management engineers.
670.17 -> If you happen to use one
of our great third-party
673.11 -> monitoring partners and
not have those alarms
676.05 -> in CloudWatch yet, that's
fine, we'll work with you
678.21 -> and get those set up
in CloudWatch as well.
681.57 -> We're working on a future
development where we can take
684.78 -> alarms directly from some of those leading
686.61 -> third-party providers into IDR.
688.83 -> But in the initial GA
release of September,
692.04 -> CloudWatch and EventBridge
693.63 -> are the foundations for that feed.
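As a rough illustration of the plumbing being described here, the sketch below builds a CloudWatch alarm definition and an EventBridge event pattern that matches that alarm entering the ALARM state. The alarm name, metric namespace, and thresholds are all hypothetical examples, not part of the actual IDR onboarding, which defines these jointly with the AWS team:

```python
import json

# Hypothetical names for illustration only; real IDR onboarding agrees
# these signals between the application team and AWS.
ALARM_NAME = "checkout-p99-latency-high"


def metric_alarm_request():
    """Payload for CloudWatch put_metric_alarm: alarm when a p99 latency
    metric stays above 2000 ms for 3 consecutive 1-minute periods."""
    return {
        "AlarmName": ALARM_NAME,
        "Namespace": "MyApp",            # hypothetical custom namespace
        "MetricName": "LatencyP99",
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 3,
        "Threshold": 2000,               # milliseconds
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "breaching",  # missing data counts as unhealthy
    }


def alarm_state_change_pattern():
    """EventBridge event pattern matching this alarm transitioning to
    ALARM; the rule's target would be the incident-management feed."""
    return {
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": {
            "alarmName": [ALARM_NAME],
            "state": {"value": ["ALARM"]},
        },
    }


if __name__ == "__main__":
    # With boto3 these payloads would be sent via
    # cloudwatch.put_metric_alarm(**metric_alarm_request()) and
    # events.put_rule(Name=..., EventPattern=json.dumps(alarm_state_change_pattern())).
    print(json.dumps(alarm_state_change_pattern(), indent=2))
```

The pattern keys follow the standard CloudWatch-to-EventBridge event shape; only alarms named in the agreed set would match the rule.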
696.24 -> Okay, so once we get
those alarms agreed to
698.82 -> between the application, the
customer, and the AWS team,
702.18 -> we get those tested.
703.2 -> The final step is looking
at what are your runbooks?
706.26 -> What do you do today when
problems begin to evolve
709.23 -> and we make sure that you have
the runbooks on your side,
712.83 -> we have runbooks on our side
for Incident Response team
715.41 -> that say how are we gonna
guide you through these things
718.56 -> should they occur.
720.03 -> In the AWS case, we implement our runbooks
723.39 -> in the AWS Systems Manager service
726.15 -> so that our runbooks are automated
728.34 -> to the greatest degree possible
729.87 -> so that the communication
and coordination that happens
732.66 -> is repeatable and reliable
734.28 -> and happens very quickly
during an incident.
737.37 -> So that's how the
onboarding process looks.
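To make the runbook step above concrete, here is a minimal sketch of what an automated Systems Manager Automation runbook skeleton could look like, expressed as the document content you would register with SSM. The step names, SNS topic ARN, and message text are invented placeholders, not the actual IDR runbooks:

```python
import json

# Toy Automation runbook skeleton; schemaVersion "0.3" is the SSM
# Automation document schema. All names below are hypothetical.
RUNBOOK = {
    "schemaVersion": "0.3",
    "description": "Capture alarm state and notify responders when a "
                   "monitored alarm fires.",
    "parameters": {
        "AlarmName": {"type": "String"},
    },
    "mainSteps": [
        {
            # Pull the current alarm details so responders start with data.
            "name": "describeAlarm",
            "action": "aws:executeAwsApi",
            "inputs": {
                "Service": "cloudwatch",
                "Api": "DescribeAlarms",
                "AlarmNames": ["{{ AlarmName }}"],
            },
        },
        {
            # Page the on-call channel; topic ARN is a placeholder.
            "name": "notifyOnCall",
            "action": "aws:executeAwsApi",
            "inputs": {
                "Service": "sns",
                "Api": "Publish",
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:oncall",
                "Message": "Alarm {{ AlarmName }} fired; starting incident bridge.",
            },
        },
    ],
}

if __name__ == "__main__":
    # Would be registered via ssm.create_document(
    #     Content=json.dumps(RUNBOOK),
    #     Name="incident-first-response", DocumentType="Automation").
    print(json.dumps(RUNBOOK, indent=2))
```

Encoding the first-response steps this way is what makes the communication and coordination repeatable, as the talk describes.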
741.33 -> So once customers have
adopted this service
743.58 -> and have been onboarded to it,
744.72 -> what value have they been seeing from it?
746.64 -> We've been doing this with beta customers
749.73 -> throughout the year
750.563 -> and with production
customers since September.
754.56 -> So what are some of the benefits?
757.23 -> From an observability standpoint,
759 -> I'm sure you'll know that
implementing reliable alerting on
762.6 -> complex distributed systems is hard.
764.94 -> There's many stacks,
there's many components,
767.4 -> different API entry points,
769.32 -> many different metrics,
770.46 -> some of which are good
indicators of customer-impacting
772.77 -> problems, some of which are not.
774.99 -> So picking those out to have
reliable signals and indicators
778.98 -> is a difficult problem.
780.57 -> What we have found is that doing this
783.36 -> in tandem with the customer,
784.71 -> with your expertise about your
application and your monitors
788.31 -> and bringing our AWS
experts to bear as well,
791.1 -> we end up with a richer,
792.33 -> deeper set of monitors in place
for IDR-enabled customers.
796.11 -> And so that's one benefit.
798.93 -> A second benefit is that
early incident detection piece
801.93 -> that I tried to indicate before.
804.42 -> you know, getting rid of
the initial manual diagnosis
808.02 -> and triggering things earlier,
810.06 -> triggering directly to 24/7 teams
812.91 -> that we have around the world
814.14 -> that are already monitoring our services,
818.505 -> in terms of faster resolution time,
821.49 -> several things are leading to that value.
824.67 -> The time that we spend
together working on the alarms
828.18 -> and working on the playbooks
results in incidents
832.23 -> being diagnosed faster,
834.06 -> the fact that we can bring,
835.62 -> because we're looking at our
telemetry of what services may
839.1 -> be involved,
839.933 -> we can bring the right
engineers into the call
841.86 -> from the beginning rather than
waiting for escalations later
844.62 -> to bring them in.
845.64 -> And so we've got the right
people on the phone faster
848.67 -> with the right data in front
of them, and that all leads
851.43 -> to a much quicker resolution of incidents.
855.93 -> The final value comes from
857.85 -> a continuous improvement standpoint.
860.22 -> So once we've been through
one of these incidents,
863.04 -> we will produce a
post-incident report for you
865.05 -> that tells what happened,
what were the steps,
867.36 -> how long did each take,
868.89 -> and we can use that to
drive corrective action
871.38 -> and figure out how we
can avoid this next time
873.87 -> and make sure you don't
have any repeat incidents.
876.93 -> So those are some of the values
that customers are seeing
879.36 -> from onboarding to this IDR service.
883.8 -> So just to wrap up this
section on the offering,
886.14 -> before I hand it off to Michael,
888.69 -> let's go back to that
five-hour recovery time
891.21 -> and look at again,
892.86 -> let's get behind what that is,
and how does IDR as a service
896.73 -> actually help decrease
that very significantly.
900.78 -> So assume you have an application,
903.39 -> it's stood up all or in part on AWS.
906.99 -> You've got your application
alarms on your side
909.15 -> that feed your teams in
terms of what's going on
910.8 -> at the application layer.
912.27 -> Our services are obviously instrumented
914.07 -> and have telemetry of
what's going on with them.
916.83 -> So say a problem begins to evolve,
919.2 -> whatever the source might be,
920.52 -> application problem,
maybe a service problem,
922.89 -> alarms begin to trigger in
different places, right?
926.13 -> At some point you begin
a customer investigation
928.41 -> to find out, "Okay, what's going on here?"
930.51 -> You're looking at recent changes.
931.83 -> What's causing this?
933.09 -> Is it a problem in our infrastructure?
934.77 -> Is it a problem in AWS?
936.207 -> And at some point you're gonna
look at some log entry or
938.85 -> some metric and say, "Okay, I
think this is an AWS problem.
942.27 -> I need help resolving this from AWS."
944.73 -> You're gonna open an
appropriate support case
946.68 -> to the right team to deal with that.
949.11 -> That gets us down to the white box.
951.51 -> Our experience from our
measurements suggests that that is
955.38 -> usually in the range of
two to three hours from the
958.17 -> initial trigger to the initial
customer investigation,
961.47 -> to the time that they're
convinced that AWS needs to help
963.81 -> with the problem and opens a support case.
966 -> So, at that point,
968.37 -> your technical account
managers are getting paged out,
970.62 -> we're all on a bridge together,
971.61 -> we're doing a joint
investigation in the blue phase.
975.24 -> There may be pipelines on
both sides that may need
977.79 -> to be rolled back to deal with a problem.
980.49 -> There may be multiple mitigation paths
982.29 -> that we're investigating to try to resolve
984.06 -> whatever the root cause
of the incident is.
986.58 -> And eventually we find what works,
988.26 -> we fix the problem, things are recovered,
990.51 -> everything is great,
992.04 -> that second half from
the white bar on down
994.14 -> takes another two to
three hours on average.
996.57 -> So you put together the
front-end investigation
999.15 -> and the back-end joint investigation
1000.86 -> and that's where your five
hours can typically come from,
1003.83 -> even for a well-instrumented,
well-prepared kind of app.
1008.36 -> So how does IDR help with that?
1010.01 -> Well, as I said before,
1011.75 -> IDR helps by short-circuiting
the whole top half
1014.51 -> of that investigation by the
alarms that we've put in place,
1018.38 -> when they trigger,
1019.213 -> we already know there's a problem
1020.93 -> and we have some indication
of where the problem is.
1023.54 -> So the investigation can
hopefully start in the right place
1026.87 -> and proceed much more quickly.
1030.14 -> You get right into that blue phase
1031.97 -> within 15 minutes or less,
1033.74 -> and that phase itself
proceeds much more quickly
1037.34 -> because of the runbooks
1038.48 -> and the joint preparation that we've done
1040.19 -> through Well-Architected, et cetera.
1042.32 -> And our experience through the
deployment of this offering
1045.05 -> is that, as you would expect,
1046.85 -> it results in much shorter
response times for customers,
1051.62 -> or recovery times rather,
1053.3 -> and we've actually had some
of our customers tell us that
1055.88 -> the telemetry that we've
agreed to and set up
1058.46 -> has actually helped them detect things
1060.59 -> and resolve them before
they actually became
1063.59 -> customer-impacting problems at all.
1065.06 -> So you move from the range
of detection to prevention,
1067.4 -> which obviously is the ultimate goal.
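The timeline arithmetic described above can be sketched as a simple model. The baseline phase durations are the averages quoted in the talk (two to three hours each, taking the midpoints); the 15-minute engagement figure is IDR's stated commitment, but the shortened joint-investigation figure below is purely an illustrative assumption, not a quoted statistic:

```python
# Rough model of the incident timeline described above, in hours.
# Pre-IDR: customer detection, self-diagnosis, and opening a support
# case (~2-3h), then joint investigation and mitigation (~2-3h).
baseline = {
    "customer_detection_and_diagnosis": 2.5,  # midpoint of the 2-3h range
    "joint_investigation_and_fix": 2.5,       # midpoint of the 2-3h range
}

# With IDR: agreed alarms feed AWS incident management directly, so the
# first phase collapses to the <=15-minute engagement commitment; the
# joint phase also shrinks thanks to prepared runbooks (assumed figure).
with_idr = {
    "automated_detection_and_engagement": 0.25,  # 15 minutes
    "joint_investigation_and_fix": 1.5,          # illustrative assumption
}

print(f"baseline MTTR ~{sum(baseline.values()):.2f}h, "
      f"with IDR ~{sum(with_idr.values()):.2f}h")
```

The model just makes explicit where the five-hour average comes from, and why removing the front-end diagnosis phase is the biggest single lever.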
1070.28 -> So that's a little bit about
Incident Detection Response
1073.19 -> and why we created it and
what we're seeing from it.
1075.2 -> I'd like to turn the podium
over to my friend, Mr. Proctor,
1078.53 -> and hear about the amazing
migration of Chase.com
1081.2 -> that happened this year and
how IDR was part of that.
1084.83 -> - Thank you, Tom.
1086.54 -> Thank you, everybody, and again,
1087.44 -> thank you for spending time with us today.
1089.6 -> So Tom described something
that I think will resonate
1091.79 -> with so many people, you know,
1093.32 -> the ability to know that at some point
1095.51 -> in your triaging of an incident
1098.72 -> that you don't have to
make a call, open a case,
1101.6 -> try to describe to somebody
who doesn't know your
1103.4 -> infrastructure necessarily
or hasn't really spent time
1105.8 -> in your alerting what the problem is
1107.75 -> while you're also simultaneously trying
1109.43 -> to do a bunch of other things.
1110.42 -> So I think that that will resonate.
1112.58 -> I think, as well, the important component
1114.68 -> around preparedness is really critical.
1116.87 -> And so having a good sense
of how you can be prepared
1120.71 -> not just in the we'll hand this off to AWS
1123.62 -> and partner with them,
1124.73 -> but in the sense of actually
going through processes
1127.37 -> and pre-production activity
that will leave you
1130.01 -> with a sense of confidence
that you can move forward.
1133.49 -> So at Chase we spend a lot
of time thinking about scale,
1137.75 -> the impacts, et cetera, and
have a really solid view
1141.29 -> of what operational excellence is.
1143.442 -> And so a lot of what we're
talking about really falls into
1145.49 -> this operational excellence
concept of making sure
1148.67 -> that we are as prepared
as we can be.
1153.5 -> So I'm sure you have measures as well,
1155.18 -> that you use KPIs perhaps,
1157.01 -> that tell you how many
customers were impacted
1159.71 -> in a specific issue,
1160.73 -> how long it took to mitigate and repair
1162.29 -> and some of those numbers
that Tom showed us
1164.39 -> on the previous slide, we do the same.
1167.33 -> And I think, you know, driving towards
1169.19 -> making those better all
the time is something
1170.87 -> that any organization,
no matter the size,
1174.26 -> is really gonna focus upon.
1176.57 -> So I'm gonna try to share some
learnings in the preparation
and the processes that
we did prior to going live
1182.75 -> with the migration that Tom talks about.
1185.27 -> And basically this was a,
1186.71 -> it's our digital front-end
that serves both Chase.com
1189.86 -> and our Chase mobile app,
1191.39 -> largely an Apache Tomcat-based deployment.
1194.39 -> And we'll talk about the
specifics of that deployment
1196.52 -> in a little while, but that's
the context of the discussion.
1201.05 -> Before I jump forward,
just a quick show of hands,
1204.59 -> who is already a Chase customer
1206.15 -> either through banking,
investments, credit card,
1208.46 -> lending products, et cetera.
1211.1 -> Thank you for being a customer
1212.36 -> and I don't ask the question
out of simple curiosity
1215.69 -> and I'm not gonna chase anybody
who didn't raise their hand
1217.55 -> and offer you a product
or anything like that.
1219.56 -> It's really around scale and
I think scale is important for
1222.29 -> a lot of different institutions.
1224.03 -> You know, either you
have a very niche market
1225.41 -> or you have a very broad
market, you have competitors,
1227.87 -> et cetera, so the scale
that you're at now
1230.417 -> and the scale where you
would like to be is important
1233.36 -> in how you strategize to move forward.
1236.36 -> The scale that you can see here,
1237.701 -> speaking about Chase as a whole,
1240.89 -> I think is relatively
large by any standard.
1243.17 -> We have over 250,000
employees across the globe,
1246.98 -> as you can see,
1248.81 -> have relationships in the US
with one in two households,
1253.16 -> and that's why the hands
were roughly one in two,
1255.53 -> I would say maybe,
maybe a little bit less.
1257.9 -> And then just two other metrics
speaking about the amount of
1260.513 -> payments that are processed
in a period of time
1263.3 -> to give you a sense again of the scale.
1265.82 -> But JPMC or JP Morgan
Chase is not a monolith.
1268.49 -> We're actually comprised of
four major lines of business,
1271.34 -> eight different corporate groups
1274.64 -> and the lines of business are very focused
1276.26 -> on different customer segments
and the needs that they have.
1280.1 -> I'll run through them very
quickly because I think
1282.11 -> it gives a sense of
context in this discussion.
1285.53 -> First we have the Consumer
and Community Banking,
1288.29 -> that's really the one
that's focused on the US,
1290.36 -> handles all your retail activity.
1291.89 -> As you can see, we have
branches in all the 48 states,
1295.16 -> the lower 48 states,
1296.6 -> and you can access your banking
1299.09 -> or complete your banking
requirements really
1301.13 -> in any way you choose,
branches, through the telephone,
1304.22 -> ATMs are pretty versatile.
1306.05 -> And then increasingly and more and more,
1308.81 -> I think we're seeing digital
adoption through the Chase.com
1311.48 -> or the mobile app.
1313.04 -> We also have the Corporate
and Investment Bank,
1316.46 -> well known as the JP Morgan brand.
1318.44 -> Something that, you know, you can see
1319.97 -> provides advice and such
to corporate customers
1323.66 -> in over a hundred countries.
1325.55 -> We have the Asset and Wealth Management,
1327.38 -> which is very focused on customers
1330.92 -> with very specific financial needs,
1332.99 -> and then we have the Commercial Bank,
1334.76 -> which is focused on businesses
1336.44 -> in over 20 countries around the world.
1338.93 -> Now none of this would be possible
1341.72 -> without global technology,
and I think I indicated
1343.79 -> Global Technology is one
of the corporate groups
1346.73 -> and resiliency and preparedness,
I think, as I said,
1350.48 -> falling into operational excellence
1351.86 -> is really one of the cornerstones
for trying to make sure
1354.26 -> that we can grow our base
and grow our presence
1356.42 -> and grow the brands.
1358.31 -> Looking then at global
technology and global technology
1361.76 -> here, not just the name
of the corporate group,
1363.8 -> but also the way we've
constructed technology
1366.35 -> around the world, I think
you can see based on
1369.221 -> just the few metrics
that are on the screen,
1371.93 -> it's large as well by any measure.
1373.79 -> 20% of the workforce
actually has an IT role
1376.88 -> of one nature or another.
1378.921 -> We have, you know, 6,400
applications in production
1383.12 -> and you can see there are the number
1384.65 -> of active digital customers and so on.
1387.44 -> Now with that many applications,
with that many customers,
1390.2 -> you can imagine that the
opportunity for things to go wrong
1393.29 -> and for customer impacts to
grow really quickly is large.
1395.99 -> And so being prepared quickly
means that we can avoid some
1398.9 -> of those things that Tom showed.
1400.88 -> We don't wanna be the
guy holding the dominoes
1403.07 -> and we certainly don't want
to be seeing our name in print
1406.55 -> because of some major issue that occurred.
1408.11 -> So, you know, the importance
of making sure we're prepared
1411.71 -> I think is quite significant.
1413.96 -> There are a couple of
strategies or strategic pillars
1415.79 -> inside of global technology
1416.99 -> that I'd just like to touch on quickly.
1419.36 -> First one is, you know, the products,
1420.77 -> platforms, and experiences,
and this goes to supporting
1423.98 -> business needs for
differentiated experiences.
1426.2 -> Obviously we're all in competitive spaces
1428.99 -> and being able to rapidly deploy
and implement new releases,
1434.54 -> software development, and infrastructure.
1436.28 -> Looking here specifically in this context
1438.74 -> and at AWS about adopting Elastic Cloud
1441.86 -> and making sure that we can
leverage it for our deployments
1445.16 -> and as well as our engineering
1447.44 -> to bring a lot of acceleration to it
1450.89 -> given, you know, comparing it
1452.21 -> to some of our on-prem deployments.
1455.54 -> You've heard a lot about data
1456.95 -> and about the importance of
data throughout the conference.
1459.86 -> Yesterday's keynote had a big chunk about
1461.66 -> data clearly unlocking the power of data,
1463.85 -> not just for customer purposes,
1466.22 -> but, you know, as Tom was saying,
1467.36 -> the telemetry and the
importance of knowing
1470.03 -> that you can establish with confidence
1471.65 -> what the health of your application is,
1473.15 -> what the health of your workload,
1474.14 -> and the infrastructure is important.
1476.48 -> But none of this would be possible
1478.16 -> without the ability to protect customers
1479.9 -> and the firm as a whole, you know.
1482.27 -> We all know the world we
live in in terms of threats,
1484.43 -> et cetera so embedding security
and privacy at every layer
1487.76 -> is really critical.
1490.55 -> So I'm almost at a point
where we'll see a slide
1493.04 -> that has AWS icons and stuff,
1495.32 -> so just give us a couple of minutes there.
1498.05 -> The Chase.com migration
I think is critical
1500.03 -> and I know that a lot of folk
who've attended some of the,
1503.15 -> like the CTO talk on Monday
around how you choose
1507.47 -> what you're gonna send to
the cloud is important.
1509.84 -> So I'm gonna touch very
quickly on why the Chase.com
1512.24 -> migration really made sense to us
1514.07 -> and it really fulfilled a bunch
of the strategic priorities
1516.47 -> that technology has and can
support the business with.
1519.17 -> First is the leveraging of,
you know, data and technology
1522.23 -> and we touched on that
on the slide before,
1524.12 -> but the ability to actually
take advantage of what the cloud
1526.61 -> offers is clearly critical, you know,
1528.86 -> driving customer engagement
through enhanced experiences
1532.76 -> and clearly the reverse of that is true.
1534.83 -> Whenever you have an outage
1536.09 -> or you have some sort of failure,
1537.77 -> you're reducing your customer engagement
1539.48 -> and you're placing
friction in front of them
1541.31 -> and at some point you're
gonna lose a customer.
1543.92 -> Building around very dedicated,
very specific protections
1547.37 -> that will take care of the
security that we require
1550.7 -> for maintaining a very
strong and well-regulated
1553.88 -> controls and risk environment is important
1556.16 -> and being able to adopt various
cloud-specific techniques
1560.27 -> there is important as well.
1561.95 -> And finally, and it
may not seem intuitive,
1563.75 -> but the whole idea of focusing
on the employee experience,
1567.53 -> especially on engineers and
others in the technology space,
1570.38 -> of making sure that we
can reduce friction.
1572.33 -> And we've heard this in other talks
1573.65 -> through the last couple of days, you know,
1575.57 -> reducing friction of having to manually
1578.9 -> do the same task repetitively,
1581.03 -> of taking time to get
things into production,
1582.95 -> or going through loads of hoops.
1584.63 -> I think we see a lot of
opportunity by adopting the cloud
1588.74 -> for this specific application
1589.94 -> and others that will
follow in that regard.
1593.63 -> So the Chase.com migration had
several demanding objectives
1596.84 -> that were outside of the
strategic priorities.
1599.691 -> We're moving out of some
legacy data centers.
1602.75 -> The ability to be able to
deliver at the same speed
1605.51 -> and with the same metrics or
better than those data centers
1608.27 -> was clearly important.
1609.83 -> Here we're focusing on an objective
1611.84 -> of four nines of availability
1613.19 -> and I know availability is critical
1614.87 -> as all the hands went up a little earlier.
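For context, a four-nines objective implies a hard downtime budget; a quick illustrative calculation (not from the talk):

```python
def downtime_budget_minutes(availability: float, period_days: int = 365) -> float:
    """Minutes of allowed downtime for a given availability over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability)

# Four nines (99.99%) leaves roughly 52.6 minutes of downtime per year,
# or about 4.3 minutes per 30-day month.
yearly = downtime_budget_minutes(0.9999)
monthly = downtime_budget_minutes(0.9999, period_days=30)
```

That tight budget is why automated detection and mitigation, rather than purely manual response, features so heavily in the rest of the talk.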
1617.3 -> Clearly cost is a factor,
1618.68 -> and it'll be a factor for
any size of organization,
1621.26 -> but also the ability to
engineer the end-to-end solution
1624.11 -> was really paramount.
1625.61 -> Making sure that we can
automate specifically things
1628.25 -> like our deployments,
our component releases,
1631.34 -> the self-healing and mitigation, you know,
1633.59 -> in order to try to
avoid some of the things
1635.24 -> that Tom was talking about earlier,
1637.04 -> and also testing and
validation and certification.
1642.35 -> The objective was to complete this
program by the end of 2022.
1647.03 -> We actually completed the migration
1648.68 -> of our last customer wave in late October.
1653.3 -> The total volume of customers migrated
1655.22 -> and I should say we didn't
migrate data into the cloud,
1658.13 -> we migrated the processes
that pointed customers
1660.83 -> to various infrastructure
to drive all of that,
1664.52 -> all those sessions, all of that
activity through the cloud,
1666.95 -> was a hundred million customers
1668.09 -> so a hundred million profiles now point
1670.76 -> all of those customer
sessions to run into the cloud
1673.43 -> even though not everybody is active
1675.17 -> as you saw some of the earlier numbers.
1679.94 -> I'm gonna quickly just touch on
1682.04 -> the preparedness that Tom spoke about
1683.6 -> and he spoke about resiliency
testing and the like,
1685.7 -> and just share with you how we processed
1688.7 -> through the engagement,
1691.07 -> and I must also note that the team
1693.65 -> that was built to complete this migration
1696.2 -> was above and beyond just the
application team owning it.
1699.23 -> It was a broad range of people
1700.67 -> from a wide range of components of Chase
1704.9 -> as well as AWS and our ProServe colleagues
1707.99 -> that have partnered to
make this a reality.
1711.62 -> We also engaged the digital SRE
team to really focus on the
1716.041 -> observability and the
resiliency component of it.
1718.25 -> And to that end,
1719.33 -> we started with a Failure
Mode And Effects Analysis.
1722.63 -> So this is a way of producing
a complete inventory
1725.15 -> of all the failure modes
you might anticipate
1727.01 -> in an environment in both the application
1729.83 -> as well as the infrastructure,
1731.69 -> of making sure your understanding of what
1733.97 -> might trigger those events,
1735.47 -> what you would do to mitigate them,
1737.24 -> what the potential is for them to happen,
1739.61 -> and that can sometimes be a SWAG,
1741.32 -> but it's still an important thing to have.
1743.36 -> What your projected outcomes would be
1744.98 -> in terms of customer impacts,
1746.63 -> and to use those attributes
as a way of trying to rank
1749.45 -> different failure modes against each other
1751.16 -> and to then focus on the
ones that are most important.
1754.25 -> And once you have the
ability to lay them out,
1757.97 -> lay out the failure modes
and you've ranked them,
1759.95 -> to then work through building scenarios
1761.63 -> that will help you take care of the how,
1763.88 -> how will you trigger them,
1764.87 -> how will you know they've happened,
1766.4 -> how will you assess the impacts?
1768.95 -> And sometimes that requires
more than just documenting it,
1773 -> it's actually building
processes that will do that.
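The FMEA ranking Michael describes (likelihood, detectability, customer impact) is commonly scored as a risk priority number; a minimal sketch, with invented failure modes and scales rather than Chase's actual inventory:

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    name: str
    severity: int    # customer impact, 1 (negligible) .. 10 (catastrophic)
    occurrence: int  # likelihood, 1 (rare) .. 10 (frequent) -- can be a SWAG
    detection: int   # 1 (certain to detect) .. 10 (effectively undetectable)

    @property
    def rpn(self) -> int:
        """Risk priority number: the classic FMEA ranking score."""
        return self.severity * self.occurrence * self.detection

def rank(modes: list[FailureMode]) -> list[FailureMode]:
    """Highest-risk failure modes first, so game days target them first."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)

inventory = [
    FailureMode("AZ loss", severity=8, occurrence=2, detection=2),
    FailureMode("ElastiCache node failure", severity=5, occurrence=4, detection=3),
    FailureMode("Backend data center latency", severity=7, occurrence=3, detection=6),
]
ranked = rank(inventory)
```

Note how a hard-to-detect, moderate-impact mode can outrank a severe but obvious one; that is exactly why detection belongs in the ranking.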
1777.35 -> Having the scenarios then
leads to scheduled game days.
1780.35 -> And I think Tom mentioned
the word game day
1781.88 -> and that's really an important component
1783.38 -> of this entire cycle.
1785.42 -> And that's basically scheduling an event
1787.58 -> with a series of scenarios potentially
1789.83 -> where you can engage all
the appropriate stakeholders
1793.1 -> from the application teams,
1794.93 -> the architects, et cetera,
1796.76 -> as well as the operational
teams to all have the experience
1799.79 -> of what goes on when something fails.
1801.8 -> And to do so outside of the realm
1804.08 -> of when customers are really impacted
1805.61 -> and are gonna be, you
know, aware of the fact
1807.59 -> that you've got a
problem and to spend time
1809.93 -> working through those scenarios,
you know, driving them,
1812.63 -> leveraging tools such as Gremlin
1814.76 -> or AWS Fault Injection Simulator
1816.62 -> to inject the failures that you see
1819.02 -> and then work towards a process
1821.45 -> where you can do a postmortem,
1822.77 -> the same kind of thing you
would do in a production space.
1824.51 -> But to do so really focused
on not just the nature
1829.19 -> of the failure or the
architecture but the application,
1833.24 -> the documentation, how you
would respond to things,
1836.12 -> and obviously the detection
and such, as well.
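That inject-detect-assess-postmortem loop can be sketched as a tiny harness; here the injection step is a stub standing in for a Gremlin attack or a Fault Injection Simulator experiment, and all names are illustrative:

```python
import time
from dataclasses import dataclass, field

@dataclass
class GameDayResult:
    scenario: str
    detected: bool
    time_to_detect_s: float
    findings: list = field(default_factory=list)

def run_scenario(name, inject, probe, timeout_s=60.0, poll_s=1.0):
    """Inject a fault, then poll a health probe until the failure is visible.

    inject: callable that triggers the fault (e.g. start an FIS experiment)
    probe:  callable returning False once monitoring sees the failure
    """
    inject()
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        if not probe():
            # Detection worked; elapsed time feeds the postmortem review
            return GameDayResult(name, True, time.monotonic() - start)
        time.sleep(poll_s)
    # Never detected within the window: that is itself the key finding
    return GameDayResult(name, False, timeout_s,
                         findings=["fault was never detected: add a monitor"])
```

A scenario that times out undetected is arguably the most valuable game-day outcome, since it exposes a monitoring gap before customers do.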
1839.75 -> Hopefully at the back-end
of that you come up
1841.13 -> with a bunch of improvements,
1842.12 -> especially if this is an early adoption
1844.13 -> of a cloud deployment
1845.21 -> and you're working through
potential alternatives
1847.4 -> to the way that you
might do your deployment.
1850.13 -> Deploying, building the solutions
or building improvements,
1852.83 -> deploying them and retesting
them is critical obviously.
1855.65 -> And at the same time, you may
also need to revise your FMEA,
1859.4 -> so to revise your Failure
Mode And Effect Analysis
1861.86 -> 'cause you may have potentially
reduced a failure mode
1864.5 -> by the improvements you've
got or even potentially
1866.66 -> have increased one or added to it.
1869.66 -> And once you feel comfortable
that you've accomplished
1871.431 -> the insights or you've
gained the insight you need,
1874.22 -> you've perfected it as much as you can
1876.83 -> for those specific failure
modes is to then iterate
1879.14 -> and to expand your scope
to go through failure modes
1881.75 -> that maybe didn't make
it into the first wave.
1884.39 -> Those are all important ways of doing this
1886.94 -> on a regular basis.
1888.08 -> And we created a bunch of
scenarios that we tagged
1891.44 -> as certification scenarios and we run them
1894.05 -> as certification game days
1895.34 -> and we run them deliberately at the point
1896.9 -> where we expect to have large releases,
1899.42 -> where we've had component
level upgrades, EKS upgrades,
1902.33 -> things like that,
1903.59 -> as well as running them in
production once a month,
1906.05 -> and that took some doing
in terms of actually trying
1908.66 -> to get them running in production.
1910.31 -> I should say by way of disclosure
1912.8 -> that the sharded version
of our customer base
1914.96 -> that experiences these production failures
1916.82 -> through these game days
are employee accounts
1919.64 -> and everybody who did
so volunteered to do so,
1922.43 -> but at some point I think the
goal is that we would be able
1924.2 -> to run these kinds of certification tests
1926.66 -> in a production environment
knowing that the impact
1929.12 -> to customers would be zero
because of the mitigation steps
1932.39 -> and the telemetry and all the processes
1934.7 -> that we've built around it.
1937.91 -> Now let's jump into looking at
the architecture of Chase.com
1940.64 -> and this is a fairly busy slide
but I'm sure it'll resonate
1943.76 -> with you as well.
1946.28 -> As I indicated, we have
a sharded solution,
1948.59 -> so sharded customer base with
multi-account, multi-region,
1952.73 -> multi-AZ and the philosophy of making sure
1955.4 -> that we can have as much
redundancy and isolation
1958.22 -> as possible so that we can
minimize blast radius impact
1961.07 -> for any issue that does occur.
1964.146 -> In front of the application VPC,
1965.78 -> we've got multiple layers of
security and services that help
1969.98 -> take care of our inputs.
1971.69 -> You can see Route 53 there as well.
1974.39 -> The application layer is
built out of a series of pods
1977.39 -> inside of EKS running
Kubernetes obviously.
1981.14 -> And that's where we have, you know,
1983.158 -> the full application layer.
1985.49 -> The managed services that we
rely on apart from the ones
1990.08 -> that I already indicated
specifically are around ElastiCache
1993.14 -> and RDS for the small amount
of data that we require
1995.75 -> to be able to operate our system.
1999.55 -> But the bulk of our activity
2001.39 -> in terms of fulfilling
customer requests occurs
2003.76 -> at the back-end in our
corporate data centers
2005.71 -> and that's plural.
2006.91 -> You can also see, you know,
the focus on redundancy
2010.63 -> in terms of some of the
icons on the diagram,
2014.86 -> and then from an
operational point of view,
2016.54 -> underpinning all of that,
2017.92 -> really are a load of functions
2019.66 -> and just working through
them from left to right.
2022.12 -> We run a lot of synthetics,
2023.71 -> many of them are driven out
of our Digital Robotics team
2026.83 -> that run tests both in
the pre-production space
2029.11 -> as well as production, giving
us good signals and insights
2031.78 -> into how our processes are performing.
2034.69 -> We leverage that with ThousandEyes
2036.4 -> and with Dynatrace as well.
2038.53 -> And Chaos test services
both in pre-production
2041.11 -> and production, and in
the cloud and on-prem
2043.39 -> we leverage Gremlin.
2044.89 -> We also have a series of scripts
and other bespoke solutions
2048.55 -> that we've written to supplement
what Gremlin does for us.
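A synthetic test ultimately reduces to running a canned customer journey on a schedule and judging error rate and latency against thresholds; a toy evaluator, with made-up thresholds rather than anything from Chase's Digital Robotics tooling:

```python
def evaluate_synthetics(samples, max_error_rate=0.01, p95_latency_ms=500.0):
    """Judge a batch of synthetic-transaction results.

    samples: list of (latency_ms, succeeded) tuples from synthetic runs.
    """
    if not samples:
        return "no-data"          # a silent canary is itself an alert
    failures = sum(1 for _, ok in samples if not ok)
    if failures / len(samples) > max_error_rate:
        return "unhealthy"
    latencies = sorted(l for l, _ in samples)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    if p95 > p95_latency_ms:
        return "degraded"         # up, but slower than customers will tolerate
    return "healthy"
```

Distinguishing "degraded" from "unhealthy" matters because, as the talk stresses, a service can pass binary up/down checks while still causing real business impact.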
2053.08 -> From the monitoring point of view,
2057.52 -> we've made sure that we've
built monitors to cover
2059.89 -> all the failure modes that
were built out of our FMEA
2062.86 -> as well as to provide
rollups and aggregations
2065.89 -> of what we can see in
terms of both workload
2069.19 -> and health for the infrastructure.
2071.71 -> And we've done that in
multiple different tools.
2075.64 -> People may wonder why the multiple tools,
2077.38 -> and we'll talk about that in a moment,
2079.27 -> but we rely in the cloud on Datadog
2081.85 -> supplemented by CloudWatch.
2083.77 -> On-prem, we have Splunk and Dynatrace
2086.35 -> and a bespoke operations portal.
2088.78 -> Dynatrace though is also
deployed in our cloud environment
2091.45 -> so it gives us the ability to
trace and to connect the dots
2094.21 -> between a lot of the processing
that occurs in the cloud
2097.96 -> deep into our
on-prem infrastructure
2101.23 -> and the services that are hosted there.
2105.61 -> The monitoring platforms that we have
2107.17 -> all basically emit alerts and signals
2111.76 -> into our Corporate Alert
Hub and therefore seamlessly
2114.97 -> into our Corporate Incident
Management Process.
2118.36 -> So like Tom described, a
pretty robust comprehensive
2122.23 -> AWS Incident Management process.
2123.43 -> We believe we have a very similar one
2126.16 -> and that is critical in
terms of getting signals
2129.67 -> to the right people at the right time.
2132.82 -> Not described here, but
important to notice,
2135.22 -> we have a series of automated
processes to assess health,
2139.24 -> like a health check service,
2140.35 -> both for the infrastructure
as well as the application.
2143.05 -> And that gives us the ability
to automatically route traffic
2145.9 -> appropriately when needed
because of the multi-region
2149.23 -> and the multi-AZ components of it,
2152.11 -> we're able to route traffic
away from environments that are
2154.75 -> experiencing an issue.
2155.583 -> So we'll focus on having a primary,
2157.48 -> having our sessions run in primary regions
2160.06 -> compared to secondary,
2161.38 -> but we always have the
redundancy that we can fail over
2163.42 -> to the secondary region
and that becomes helpful
2165.94 -> and useful at least to this
stage in our progression
2168.97 -> in order to ensure that our deployments
2171.19 -> can be successfully tested
prior to customer traffic
2174.01 -> actually running through them.
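The routing decision described here, serve from a healthy primary and fail over to the secondary, can be sketched as a small function; the region names and health inputs are illustrative, and in practice this logic would sit behind Route 53 and the health check service:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class RegionHealth:
    name: str
    healthy: bool     # rollup from the automated health check service
    is_primary: bool

def route_target(regions: List[RegionHealth]) -> Optional[str]:
    """Prefer a healthy primary; otherwise fail over to a healthy secondary."""
    healthy = [r for r in regions if r.healthy]
    if not healthy:
        return None               # nothing can take traffic: page humans
    for r in healthy:
        if r.is_primary:
            return r.name
    return healthy[0].name        # fail over to the secondary region

regions = [RegionHealth("us-east-1", healthy=False, is_primary=True),
           RegionHealth("us-west-2", healthy=True, is_primary=False)]
target = route_target(regions)
```

The same secondary region doubles as the place to validate deployments before customer traffic reaches them, as Michael notes.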
2177.28 -> Oh, and finally,
2179.92 -> the same structures that
you see here are created
2184.12 -> and deployed in our
pre-production environments
2186.007 -> and that gives us, you know,
2187.66 -> the resiliency testing opportunities
2190.54 -> that I mentioned a moment
ago where we can really run
2192.97 -> under performance load,
2194.5 -> mimicking customer
traffic in production.
2198.25 -> We can run our resiliency
tests and our performance tests
2201.55 -> and to run the resiliency
tests under load,
2204.13 -> I think that's really important
'cause you can get a really
2206.35 -> good sense of whether your
expectations were correct
2208.93 -> in terms of both the resiliency
2210.73 -> as well as the impacts and
your mitigation strategies
2214.15 -> and the fact that you can detect
2215.44 -> and observe these issues promptly.
2219.1 -> So building that out across your platforms
2221.41 -> is really important.
2225.61 -> Now it's clear from the
number of components
2227.8 -> that you see here
2228.73 -> and from this logical
groupings of infrastructure
2231.76 -> that multiple teams could
get involved in the event
2234.46 -> that we have an issue and that's true,
2238.84 -> we have a multi-tiered,
multi-level support structure.
2241.48 -> Today at its heart is
a mission control group
2244.99 -> that is on-prem 24/7 in
multiple places in the world
2248.68 -> and that are co-located
with various IT disciplines
2252.34 -> to really allow us to
respond as fast as possible
2254.8 -> and their primary role is
to mitigate an incident
2257.44 -> and to escalate appropriately as needed.
2260.35 -> So we expanded that to include
additional support teams
2263.86 -> within Chase and
obviously the expansion goes
2266.44 -> to what Tom described
earlier in terms of IDR.
2272.44 -> I'm sorry, my notes have
disappeared below the bottom here
2275.29 -> and I'm struggling to find them.
2279.88 -> I don't wanna leave anything out.
2282.37 -> So, as Tom indicated, you
know, because of our focus
2284.74 -> on operational excellence
because of our requirement
2288.55 -> and our need to ensure
that we can resolve issues
2291.07 -> as fast as we can and
to expand the tooling
2293.62 -> that's available to us, we
went through the process
2295.9 -> of working through the
well-architected review
2299.14 -> that Tom described and we chose to adopt
2303.19 -> and onboard to the IDR offering.
2307.33 -> And by doing so, working with
them in terms of creating
2310.75 -> the appropriate CloudWatch
alerts, identifying the signals,
2314.14 -> collaborating on runbooks and
what various signals will mean
2317.32 -> and how we would respond and
how we would be supported
2320.83 -> was really important.
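The CloudWatch alarms agreed during IDR onboarding look roughly like the following sketch; the metric name, thresholds, and topic ARN are invented, and in practice this dict would be the kwargs for boto3's `cloudwatch.put_metric_alarm`:

```python
def business_impact_alarm(app: str, metric: str, threshold: float,
                          sns_topic_arn: str) -> dict:
    """Build an illustrative CloudWatch alarm definition for a custom metric."""
    return {
        "AlarmName": f"{app}-{metric}-business-impact",
        "Namespace": f"Custom/{app}",     # application-emitted metric
        "MetricName": metric,
        "Statistic": "Average",
        "Period": 60,                     # evaluate every minute
        "EvaluationPeriods": 3,           # three bad minutes before alarming
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "breaching",  # silence is suspicious too
        "AlarmActions": [sns_topic_arn],  # fan out to the alert hub and IDR
    }

alarm = business_impact_alarm("chase-web", "LoginFailureRate", 0.05,
                              "arn:aws:sns:us-east-1:123456789012:alerts")
```

Alarming on a business-level metric like a login failure rate, rather than a raw infrastructure counter, reflects the best practice both speakers return to.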
2322.33 -> Additionally, doing connections
in terms of communication
2326.41 -> and status is critical as well.
2328.96 -> So it doesn't... you don't
just get a call out of the blue,
2332.26 -> you're aware of how this process will flow
2334.75 -> and it doesn't interrupt
your triaging activities.
2338.47 -> We're at the point now in the
migration, as I indicated,
2341.35 -> the customer migration
was completed in October,
2344.74 -> of working through the remaining activity,
2347.53 -> that is processing in
our legacy infrastructure
2349.9 -> for this application.
2351.52 -> And we're considering at this stage
2354.34 -> that we've been very successful
and have accomplished
2357.04 -> the goals we wanted to.
2358.48 -> Obviously it's not a single journey.
2360.7 -> We're not at the destination,
2361.84 -> a lot of work to be done going forward,
2364.15 -> but the iterative processing
and the continual focus
2368.47 -> on improvement that you
can gain through doing
2371.71 -> resiliency activities,
through the resiliency tests,
2374.56 -> through ongoing training, and
awareness of the support teams
2379.3 -> is not gonna stop.
2380.2 -> And it will be a foundation
not just for this application
2383.08 -> but for other applications
that are going to migrate
2385.42 -> to the cloud in the coming year.
2388.72 -> Finally, we spoke a little bit
about the multiple platforms
2393.46 -> that help us with our
observability and we have
2395.439 -> Datadog, Splunk, Dynatrace and
others and people will say,
2397.75 -> "Well, why would you do that?"
2399.16 -> Some of it is legacy.
2400.45 -> I know people who've worked in
organizations that are large
2403.45 -> and perhaps have had a long
presence in an IT space
2407.41 -> will have legacy platforms
that they're bringing forward
2409.21 -> with them, that's part of it.
2410.5 -> But also they tend to focus
on very specific needs
2413.38 -> and I'm not saying this is
our end solution by any means,
2415.99 -> but it's a great step forward
that helps us leverage
2418.9 -> what we know, adopt things
that are brand new to us,
2422.83 -> and then work through the
process of refining that
2425.29 -> as time passes, and that's
a journey that I think
2428.59 -> we're still well underway with.
2434.74 -> So looking then at Best
practices, calls to action,
2437.71 -> I'll take the best practices
2438.82 -> and Tom will return to
look at the call to action
2440.68 -> and then we'll be happy to take questions.
2443.8 -> Clearly having a very
clear sense of the measures
2446.156 -> that you need to be able
to determine whether or not
2449.59 -> there's business impact is important.
2451.39 -> I think that came up a
lot in what Tom said.
2453.01 -> It's certainly been something
that we've had to focus on
2454.84 -> as well.
2456.22 -> We've heard in other talks
just the volume of telemetry
2459.28 -> that comes out of the cloud environment
2461.56 -> plus all the additional metrics
and such that you will get
2465.25 -> out of your application
along with logs and the like
2467.92 -> can really make it a bit
daunting to determine
2471.79 -> which are the signals that
will really help you out.
2474.433 -> You know, you can have literally
2475.266 -> hundreds and hundreds of metrics
2476.92 -> that you need to wade through
in any normal deployment
2480.34 -> in the cloud.
2482.23 -> It reminds me a little bit of,
2483.4 -> and if I may borrow from "The
Rime of the Ancient Mariner,"
2485.56 -> where the Mariner's sitting
in a be-calmed ocean,
2488.65 -> all he can see is water from end-to-end,
2490.75 -> and he and his crew are all
dying of thirst and he says,
2492.737 -> "Water, water, everywhere,
nor any drop to drink."
2495.58 -> It's a little bit the same for technology
2497.26 -> where we can have metrics,
2498.58 -> metrics, fill my screen,
nor is there sense in that
2501.37 -> 'cause if you can't gain the sense
2502.99 -> out of what the metrics tell you,
2504.49 -> if you can't get the
insights you need to be able
2506.26 -> to really understand what's going on,
2508.18 -> and what the health of your workload
2509.47 -> and your infrastructure is,
then you'll be at a loss.
2513.31 -> And that can be especially bad
2514.6 -> when you're trying to triage an issue.
2517.66 -> Business impact is also a
very important component.
2520.87 -> I think maturity in terms
of alerting really focuses
2523.78 -> on not just the binary
up or down of a service,
2526.93 -> but what's the business impact potentially
2528.64 -> associated with that, up or down.
2531.37 -> We also need to alarm to IDR
2533.257 -> and to the corporate support
teams at the same time,
2536.998 -> you know, this is a partnership
and it's a progression.
2540.1 -> There's a lot of runbooks that we have
2541.96 -> that will be supplemented
by what IDR can do for us,
2545.2 -> but certainly IDR is not intended
nor would we want it to be
2548.74 -> the first line of defense in
any issue that we experience.
2553.21 -> Again, a maturity component, you know,
2555.16 -> alarming at multiple levels,
not just at workload,
2557.83 -> at an ALB perhaps or components
here that it may have failed
2561.97 -> some liveness check, but
rather trying to do that
2566.11 -> in a structured way that
would give you better insights
2568.51 -> into where things might be happening.
2570.28 -> And if you can correlate
some of the signals,
2572.28 -> it really gives you much better indication
2574.03 -> of how a segment of
your processing perhaps
2576.76 -> or a segment of your request
structures might be failing
2579.46 -> because of a certain issue in a component
2581.2 -> or even a backend system.
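Correlating signals across layers, rather than paging on any single component check, is the idea behind CloudWatch composite alarms; a toy correlation rule, with invented signal names:

```python
def correlate(signals: dict) -> str:
    """signals maps signal name -> currently firing. Returns a triage hint."""
    alb_5xx = signals.get("alb_5xx", False)
    pod_errors = signals.get("pod_errors", False)
    backend_latency = signals.get("backend_latency", False)
    if alb_5xx and pod_errors and backend_latency:
        return "backend-system"    # the whole path degrades from the back end
    if pod_errors and not alb_5xx:
        return "application-layer" # app failing while the edge still looks fine
    if alb_5xx and not pod_errors:
        return "edge-or-network"   # errors before requests reach the pods
    return "inconclusive"
```

Even a crude rule like this points responders at a layer instead of a single blinking component, which shortens triage considerably.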
2585.97 -> Being able to validate runbooks and again,
2588.73 -> going through the resiliency testing
2590.08 -> because there's nothing like
practice to really help you
2592.3 -> determine whether or not
something's working as you expect
2594.94 -> to help you with the tooling
that you would otherwise
2597.1 -> be relying upon in a crisis is important.
2601.12 -> But it also helps you
go through the runbooks,
2602.95 -> the communication paths,
brings familiarity to people
2606.55 -> on both the corporate side as
well as on the AWS IDR side.
2610.63 -> And finally I'd say,
you know, the alerting
2613.87 -> and the tuning of your alerts
needs a little bit of time
2616.36 -> to bake in in production.
2617.32 -> And so you need to be,
just leave a little time
2619.33 -> for that to happen.
2620.56 -> No matter how comprehensive
the performance testing
2623.35 -> can be and how much you try to mimic what
2625.36 -> your production activity is
in terms of customer behavior,
2628.39 -> you're gonna find instances
where that changes
2630.61 -> and it can especially change
under the threat of an incident
2633.55 -> or when an incident's
unfolding, customers will retry
2636.49 -> or tech will move, you know,
inadvertently somewhere else.
2639.85 -> And just the ability to know
how to catch those signals
2642.97 -> is really important.
2645.01 -> So with that, I would
just close by saying,
2648.353 -> a well-architected review,
which Tom mentioned,
2650.14 -> and I think I touched on earlier as well,
2652.09 -> is really a great place to start.
2653.2 -> It gives you a nice
comprehensive view of things
2654.973 -> that you can take care of
2656.68 -> and areas you should be focused upon.
2658.66 -> Engaging with AWS in that
process is even more valuable
2663.28 -> given their insights into
the processing as well.
2665.77 -> And there are some segments
that focus very specifically
2667.84 -> on different business domains
like financial services.
2672.34 -> So with that, here's Tom for the closing
2674.41 -> and the call to action.
2678.79 -> - Okay, thank you, Michael, so much
2680.23 -> for the experience and
sharing that with us today.
2683.74 -> Okay, so you've heard what IDR is,
2686.17 -> why we built it,
2687.16 -> and you've heard from one
of our largest customers
2688.96 -> about their experience implementing it
2690.58 -> so I hope that that has been helpful.
2693.91 -> So where do we go forward from here?
2695.59 -> How can this help you?
2696.52 -> So this chart here is trying
to show how you can think about
2700.9 -> applying some of these lessons
2702.13 -> learned to your own environments.
2703.84 -> So it starts with looking
through your portfolio
2706.6 -> and your applications
2707.44 -> and deciding which critical
applications would benefit
2710.11 -> from the type of increased responsiveness
2713.29 -> that we talked about here today.
2715.12 -> I think the key thing, like
in many things is, you know,
2718.93 -> start somewhere.
2720.25 -> You may not have a digital platform
2722.5 -> that serves hundreds of
millions of users at a time,
2725.74 -> but you've got something that's
critical to your business.
2727.87 -> So start with that.
2729.37 -> Go through the orderly life
cycle that Michael showed you
2732.13 -> in terms of the reviews and
the testing and all of that.
2739 -> You know, if you're interested
in using IDR as a mechanism
2741.64 -> to help formalize and drive that forward,
2743.47 -> certainly talk to your
account team, to your TAMs,
2745.36 -> we'd be happy to work with you on that.
2747.61 -> And since we are here in Vegas,
2749.95 -> I'm gonna use the gambling
analogy and, you know,
2751.87 -> say be prepared.
2753.55 -> Don't gamble with your customer experience
2755.68 -> or the availability of your application.
2759.25 -> All right, so just to wrap things up,
2761.02 -> if you're looking for information,
2762.85 -> we've got some good pages
out here that can give you
2765.31 -> more information on IDR
2766.417 -> and that will be available in the deck.
2768.76 -> Most interestingly might be,
2770.95 -> we've talked a little bit about
the value of the offering,
2772.99 -> you might have some questions
about cost for value.
2776.23 -> There's a pricing page that
describes how IDR is priced,
2779.89 -> and to just summarize it in
a nutshell, it's very similar
2782.92 -> to the pricing of Enterprise
Support: a tiered model
2785.86 -> based on usage.
2787.12 -> It's roughly about 40% of the base cost
2790.12 -> of Enterprise Support.
2791.98 -> So if you wanna talk more
about, you know, scenarios,
2794.32 -> we're happy to do that
after the talk here,
2797.11 -> and the FAQ as well has some great
2799.57 -> Frequently Asked Questions
that you can review at
2802.18 -> your leisure.
2803.59 -> So with that, we will open it to questions.