AWS re:Invent 2022 - AWS Incident Detection and Response (SUP201)
In this session, learn how to monitor, detect, and manage incidents with AWS services for a quick resolution, and get visibility into the operational excellence of a real-world banking platform. AWS Support’s new offering, AWS Incident Detection and Response, offers proactive monitoring and incident management to reduce the opportunities for failure on workloads and accelerate recovery from critical incidents. Learn how JPMorgan Chase has collaborated with AWS to enable AWS Incident Detection and Response during the Chase.com migration to AWS and for its critical observability and resiliency requirements.
Content
1.027 -> - Good afternoon and again,
1.86 -> thank you for being here as well.
3.18 -> My name is Michael Proctor.
4.38 -> I'm a Site Reliability Engineer
6.09 -> in the Public Cloud
Enablement team in Chase.
10.29 -> Been with the firm for 25 years,
12.15 -> which is great 'cause there's
a lot of different roles.
15.36 -> By way of quick introduction,
16.8 -> I've started programming
in Fortran on punch cards
20.46 -> when I was at university in South Africa
22.29 -> and at that point,
resiliency and redundancy
24.42 -> were putting two rubber bands
26.07 -> around the deck of cards you had
27.33 -> and making sure you had a paper
copy saved somewhere else.
31.68 -> - Excellent. Michael, thank
you for being with us.
34.83 -> Okay, so before we dive
into those two stories
37.2 -> that I mentioned,
38.1 -> we're gonna do a quick
customer survey here.
40.08 -> I wanna find out a little
bit about your needs
42.51 -> and what brought you to this
particular session today.
44.64 -> So I've got three questions
we'd like to start with.
48.06 -> So question number one is
within the last six months,
51.21 -> how many of you within
your teams have dealt
54.021 -> with a major IT outage that
lasted more than one hour?
57.39 -> Just raise your hand so I can
see how many have dealt with
59.94 -> those kind of incidents.
62.25 -> Okay, fair number.
63.81 -> My second question for you is,
66.66 -> of those systems that
experience those outages,
69.27 -> how many of them were
classified as business critical
71.7 -> systems by your business and by the users?
76.86 -> Okay, fair number.
78.87 -> My third question for
you is for those systems,
83.07 -> had you done monitoring, testing,
chaos resiliency testing,
89.19 -> that should have captured the
particular failure mode that
91.65 -> presented during that failure?
96.84 -> Okay.
98.04 -> All right, so we're gonna
look at some market data
100.38 -> in a moment that suggests that
the number of these outages
102.96 -> of critical systems and the
duration of how long it takes
106.05 -> to recover from them,
106.95 -> there's probably more of
them and they take longer
108.78 -> to recover from than you
perhaps might expect,
112.38 -> and what we can do about that.
114.21 -> Okay, so we're gonna go through this today
117.03 -> in four different sections.
118.59 -> We're gonna talk about the needs
119.91 -> for these type of critical applications,
121.74 -> what you need to make them more reliable.
123.99 -> We're going to go through the
release of this new service
127.11 -> called AWS Support Incident
Detection And Response,
130.44 -> which, by the way, is quite a mouthful.
132.45 -> So you're probably gonna
hear me use the acronym IDR
135.42 -> for the rest of that talk.
136.38 -> Whenever I say IDR, that
means the new AWS Service,
139.65 -> for Incident Detection and Response.
141.93 -> We're going to move from that to again,
144.24 -> the story of how Chase was one
of the first major customers
146.42 -> of that and the benefits
they've seen as a result.
149.85 -> And then finally, we're
gonna talk about some things
151.41 -> that we think you can take away
152.43 -> to help make your
applications more resilient
155.1 -> and to have fewer incidents
156.36 -> and recover from incidents more quickly.
163.5 -> Okay, so let's start with a quote
165.09 -> 'cause we always like to start
these talks with a quote,
167.22 -> and this one happens to be,
168.45 -> as you can see from Sun
Tzu and "The Art of War."
171.51 -> This particular talk is
all about preparedness.
173.82 -> It's about preparedness
for critical situations
176.07 -> and for IT incidents specifically.
178.41 -> And this particular quote, I think,
180.12 -> has some important things to
say about preparedness, okay?
183.582 -> And so what the quote is saying is,
186.582 -> if you only know yourself,
189.51 -> or in this case your workload
and the failure modes
191.97 -> it's subject to on a surface level,
195.15 -> you're gonna have probably
some bad outcomes, okay?
198.99 -> If you understand the
workload and the failure modes
201.6 -> it's subject to on a deeper level,
204.48 -> you're gonna have much better outcomes
206.49 -> and you're gonna be able to take
207.75 -> what might be unexpected
events or incidents
210.48 -> and turn them into things
211.92 -> that are expected and more routine,
214.08 -> things that you can recover from
215.52 -> more quickly, okay?
217.59 -> So we're gonna talk about preparedness
219.96 -> through the lens of a
number of mechanisms,
222.69 -> from reviews, architectural
and operational,
225.27 -> that you do up front
226.74 -> to the automation of
deployment and management,
229.44 -> to better monitoring and alerting
232.56 -> and through things like
chaos testing, okay?
235.8 -> And how a service like IDR can
help drive and enforce those
240.48 -> types of best practices.
243.731 -> Okay, so at AWS we like to work backwards
246.42 -> from customer requirements
and what customers need.
248.76 -> So what have customers told
us in this particular space?
252.18 -> Customers tell us that despite
their attempts to build
254.7 -> resilient systems,
255.84 -> and despite putting a lot
of effort into implementing,
258.78 -> monitoring, and alerting for those systems
261.6 -> that most frequently
their first indication
264.06 -> that a critical incident is
evolving is when customers
267.09 -> start telling them that
they have a problem
268.8 -> and start reporting things
into their help desks,
271.2 -> which is obviously a reactive
posture that none of us
273.69 -> wanna be in for our customers.
277.5 -> Once the incident has begun,
279.3 -> the amount of time it
takes to dive into that
281.73 -> and understand the root cause
282.93 -> and what's happening
is far longer than they
285.09 -> or their users can tolerate.
288.06 -> The frequency of these incidents
289.89 -> is much higher than is desired,
292.68 -> and that erodes their trust and
reliability of their systems
295.86 -> with their customers.
297.9 -> And so they really are looking
for solutions to help them
300.69 -> reduce the frequency and the
duration of these incidents.
304.59 -> And so that's what we're
gonna talk about today.
308.25 -> So let's quantify some
of those customer quotes
311.28 -> that we just went through.
312.96 -> There was a market study done
earlier this year by IDC,
316.74 -> which had some interesting findings.
318.57 -> They looked at large Enterprises
320.01 -> like many of you probably work for,
322.02 -> and what they found was that
the average Enterprise is
325.26 -> experiencing over 29 of these
sorts of incidents per year,
330.12 -> okay, over two a month.
331.56 -> As somebody who works as a
technical account manager,
334.32 -> helping customers avoid and
respond to such incidents,
337.86 -> that's obviously way too high a number.
341.1 -> Very interestingly,
341.94 -> they found that on average it
takes five hours to recover
345.87 -> from one of these incidents.
347.52 -> We're gonna explore today why is that?
350.19 -> Why would it take so long for
a system that's probably been
352.98 -> through tons of testing, has
tons of alarms defined for it,
357.54 -> why does it take five hours
to realize something is going
360.36 -> wrong, fix it, or avoid the
problem from escalating further?
366.24 -> Finally, the cost to those organizations
368.46 -> for remediating those sequences of events
371.07 -> runs in the order of 13.5 million
per organization per year.
377.58 -> I don't know what size
your organization is,
379.5 -> what size your IT budget is,
381.12 -> but 13.5 million is a lot
of money to me anyway.
384.9 -> Imagine what you could do
using that money to invest in
388.44 -> inventing better solutions
and improving your solutions
391.44 -> on behalf of your users
and your customers.
395.07 -> So disruptions are costly.
396.72 -> There's some that you can quantify.
398.94 -> There are others that are
more difficult to quantify,
401.4 -> but they are every bit as important
403.5 -> so we're gonna talk about
some secondary impacts.
406.62 -> This picture here in the
lower right-hand corner,
408.63 -> you can see the poor operations guy
410.52 -> that's trying to hold up the
falling domino of effects
413.7 -> that's happening during one
of these incidents, right?
416.04 -> So the question there is what
level of burnout do you have
419.82 -> in your staff from
these multiple incidents
422.04 -> that are happening per year?
424.23 -> How much attrition does that
drive of valuable skills
427.11 -> across your organization?
428.67 -> How much time do you spend
trying to hire people to replace
431.79 -> those people who have burnt out
and moved on to another job?
435.66 -> That's one sort of secondary impact.
438.42 -> There's reputational damage, okay,
440.54 -> we talked about customers getting fed up
442.89 -> with repeat incidents, right?
444.75 -> I'm sure you've all heard
the quote that says,
446.347 -> "A reputation takes a lifetime to earn
448.53 -> and a moment to lose."
450.12 -> Very difficult to
quantify the cost of that.
454.47 -> If you work in regulatory
and regulated environments
457.71 -> as our customer here does,
459.72 -> there's cost associated with
those incidents when you meet
462.27 -> certain reporting timeframes
and recovery timeframes.
466.53 -> And again, there's cascading effects
468.99 -> down the chain of your
partners and others
471.45 -> that use your system who get impacted
473.37 -> by these incidents as well.
477.57 -> All right, so that's
the problem statement.
480.36 -> Why is this important?
481.32 -> Why do we need to reduce the frequency?
483.18 -> Why do we need to reduce the cost of this?
484.71 -> So what's the solution?
486.57 -> Okay, so this service that we developed,
488.67 -> which we're gonna call IDR here today,
490.98 -> it went into general availability
just a month or so ago
493.62 -> in September, if you
haven't seen that yet.
496.86 -> So this service at its essence is AWS
501.33 -> and the customer's specific application team
503.82 -> working together to define a set of alerts
507 -> that are leading indicators of problems
508.92 -> with that particular workload.
510.21 -> And so then we work to set up automation
513.03 -> to detect those failure
modes and we feed those
515.88 -> not only to the customer's
operation teams and consoles,
519.33 -> but we also feed them directly
into AWS's internal incident
523.71 -> management systems and
Incident Management engineers.
526.74 -> So effectively what you
are getting is the ability
530.01 -> to benefit from the same
type of rich monitoring
533.85 -> and Incident Management
that we apply internally
536.34 -> to our AWS services as well as we apply
539.43 -> externally to AWS Managed
Services customers
542.52 -> that some of you may be.
544.32 -> And so you can benefit from
that type of response time
547.86 -> and detection by having
that automation in place
551.4 -> and agreeing on those signals and alarms
553.86 -> gives us the ability to
commit to having an incident
557.19 -> spun up with the right AWS resources on it
560.28 -> in 15 minutes or less.
562.59 -> Now that might sound similar initially to
565.11 -> if you're Enterprise
Support customers today,
567.03 -> you know that when you open a critical,
568.83 -> the highest level support case,
570.84 -> there's also a 15-minute response
time objective for those.
574.26 -> The difference is the clock on that starts
576.48 -> when you open a support case.
578.61 -> The clock on an IDR incident
begins when the system
581.82 -> detects degradation or
problems in your system
585.15 -> and our experience is that the lag between
587.91 -> when a system detects a
problem with your application
591.03 -> and when it actually gets diagnosed
592.89 -> and a support case gets
opened is very significant.
595.59 -> We're gonna look at how
significant in a minute.
598.08 -> So that's the difference between
Enterprise Support and IDR,
601.41 -> which is an add-on offering
which Enterprise Support
604.02 -> customers can purchase.
605.82 -> And our AWS Managed
Services customers that have
608.43 -> Enterprise Support get IDR
610.23 -> as a default part of their service.
614.25 -> Okay, so how does onboarding
to this service work?
617.64 -> So we said before in our opening quote,
619.687 -> "The most important thing
is to know yourself,
621.72 -> know your workload, and
know the failure mode."
623.43 -> So the very first thing that
we do is hopefully if this
626.4 -> workload is in production or near to it,
628.86 -> you've gone through a set of architectural
630.6 -> and operational reviews
using something like
632.423 -> the AWS Well-Architected Framework.
634.95 -> If that hasn't happened
for whatever reason,
636.96 -> for this particular workload
that we're onboarding to IDR,
639.84 -> then we'll work with you
and go through that review,
641.82 -> make sure that we understand
the key solutions,
644.7 -> dependencies, failure modes,
monitoring, et cetera.
649.14 -> The next thing that we do
is we look at your current
651.87 -> monitoring alerting setup.
653.28 -> And do you have alerts
that are appropriate
655.32 -> for the type of Incident
Management triggering
658.29 -> that we're talking about here?
659.61 -> If you have those,
660.63 -> if they happen to be
defined in CloudWatch,
662.97 -> then we configure an EventBridge rule,
which feeds that directly
665.88 -> to our Incident Management platform
667.89 -> and our Incident Management engineers.
670.17 -> If you happen to use one
of our great third-party
673.11 -> monitoring partners and
not have those alarms
676.05 -> in CloudWatch yet, that's
fine, we'll work with you
678.21 -> and get those set up
in CloudWatch as well.
681.57 -> We're working on a future
development where we can take
684.78 -> alarms directly from some of those leading
686.61 -> third-party providers into IDR.
688.83 -> But in the initial GA
release of September,
692.04 -> CloudWatch and EventBridge
693.63 -> are the foundations for that feed.
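As a rough illustration of the plumbing being described here, the sketch below builds a CloudWatch alarm definition and an EventBridge event pattern that matches that alarm entering the ALARM state. The alarm name, metric namespace, and thresholds are all hypothetical examples, not part of the actual IDR onboarding, which defines these jointly with the AWS team:

```python
import json

# Hypothetical names for illustration only; real IDR onboarding agrees
# these signals between the application team and AWS.
ALARM_NAME = "checkout-p99-latency-high"


def metric_alarm_request():
    """Payload for CloudWatch put_metric_alarm: alarm when a p99 latency
    metric stays above 2000 ms for 3 consecutive 1-minute periods."""
    return {
        "AlarmName": ALARM_NAME,
        "Namespace": "MyApp",            # hypothetical custom namespace
        "MetricName": "LatencyP99",
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 3,
        "Threshold": 2000,               # milliseconds
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "breaching",  # missing data counts as unhealthy
    }


def alarm_state_change_pattern():
    """EventBridge event pattern matching this alarm transitioning to
    ALARM; the rule's target would be the incident-management feed."""
    return {
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": {
            "alarmName": [ALARM_NAME],
            "state": {"value": ["ALARM"]},
        },
    }


if __name__ == "__main__":
    # With boto3 these payloads would be sent via
    # cloudwatch.put_metric_alarm(**metric_alarm_request()) and
    # events.put_rule(Name=..., EventPattern=json.dumps(alarm_state_change_pattern())).
    print(json.dumps(alarm_state_change_pattern(), indent=2))
```

The pattern keys follow the standard CloudWatch-to-EventBridge event shape; only alarms named in the agreed set would match the rule.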
696.24 -> Okay, so once we get
those alarms agreed to
698.82 -> between the application, the
customer, and the AWS team,
702.18 -> we get those tested.
703.2 -> The final step is looking
at what are your runbooks?
706.26 -> What do you do today when
problems begin to evolve
709.23 -> and we make sure that you have
the runbooks on your side,
712.83 -> we have runbooks on our side
for Incident Response team
715.41 -> that say how are we gonna
guide you through these things
718.56 -> should they occur.
720.03 -> In the AWS case, we implement our runbooks
723.39 -> in the AWS Systems Manager service
726.15 -> so that our runbooks are automated
728.34 -> to the greatest degree possible
729.87 -> so that the communication
and coordination that happens
732.66 -> is repeatable and reliable
734.28 -> and happens very quickly
during an incident.
737.37 -> So that's how the
onboarding process looks.
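To make the runbook step above concrete, here is a minimal sketch of what an automated Systems Manager Automation runbook skeleton could look like, expressed as the document content you would register with SSM. The step names, SNS topic ARN, and message text are invented placeholders, not the actual IDR runbooks:

```python
import json

# Toy Automation runbook skeleton; schemaVersion "0.3" is the SSM
# Automation document schema. All names below are hypothetical.
RUNBOOK = {
    "schemaVersion": "0.3",
    "description": "Capture alarm state and notify responders when a "
                   "monitored alarm fires.",
    "parameters": {
        "AlarmName": {"type": "String"},
    },
    "mainSteps": [
        {
            # Pull the current alarm details so responders start with data.
            "name": "describeAlarm",
            "action": "aws:executeAwsApi",
            "inputs": {
                "Service": "cloudwatch",
                "Api": "DescribeAlarms",
                "AlarmNames": ["{{ AlarmName }}"],
            },
        },
        {
            # Page the on-call channel; topic ARN is a placeholder.
            "name": "notifyOnCall",
            "action": "aws:executeAwsApi",
            "inputs": {
                "Service": "sns",
                "Api": "Publish",
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:oncall",
                "Message": "Alarm {{ AlarmName }} fired; starting incident bridge.",
            },
        },
    ],
}

if __name__ == "__main__":
    # Would be registered via ssm.create_document(
    #     Content=json.dumps(RUNBOOK),
    #     Name="incident-first-response", DocumentType="Automation").
    print(json.dumps(RUNBOOK, indent=2))
```

Encoding the first-response steps this way is what makes the communication and coordination repeatable, as the talk describes.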
741.33 -> So once customers have
adopted this service
743.58 -> and have been onboarded to it,
744.72 -> what value have they been seeing from it?
746.64 -> We've been doing this with beta customers
749.73 -> throughout the year
750.563 -> and with production
customers since September.
754.56 -> So what are some of the benefits?
757.23 -> From an observability standpoint,
759 -> I'm sure you'll know that
implementing reliable alerting on
762.6 -> complex distributed systems is hard.
764.94 -> There's many stacks,
there's many components,
767.4 -> different API entry points,
769.32 -> many different metrics,
770.46 -> some of which are good
indicators of customer-impacting
772.77 -> problems, some of which are not.
774.99 -> So picking those out to have
reliable signals and indicators
778.98 -> is a difficult problem.
780.57 -> What we have found is that doing this
783.36 -> in tandem with the customer,
784.71 -> with your expertise about your
application and your monitors
788.31 -> and bringing our AWS
experts to bear as well,
791.1 -> we end up with a richer,
792.33 -> deeper set of monitors in place
for IDR-enabled customers.
796.11 -> And so that's one benefit.
798.93 -> A second benefit is that
early incident detection piece
801.93 -> that I tried to indicate before.
804.42 -> you know, getting rid of
the initial manual diagnosis
808.02 -> and triggering things earlier,
810.06 -> triggering directly to 24/7 teams
812.91 -> that we have around the world
814.14 -> that are already monitoring our services,
818.505 -> in terms of faster resolution time,
821.49 -> several things are leading to that value.
824.67 -> The time that we spend
together working on the alarms
828.18 -> and working on the playbooks
results in incidents
832.23 -> being diagnosed faster,
834.06 -> the fact that we can bring,
835.62 -> because we're looking at our
telemetry of what services may
839.1 -> be involved,
839.933 -> we can bring the right
engineers into the call
841.86 -> from the beginning rather than
waiting for escalations later
844.62 -> to bring them in.
845.64 -> And so we've got the right
people on the phone faster
848.67 -> with the right data in front
of them, and that all leads
851.43 -> to a much quicker resolution of incidents.
855.93 -> The final value comes from
857.85 -> a continuous improvement standpoint.
860.22 -> So once we've been through
one of these incidents,
863.04 -> we will produce a
post-incident report for you
865.05 -> that tells what happened,
what were the steps,
867.36 -> how long did each take,
868.89 -> and we can use that to
drive corrective action
871.38 -> and figure out how we
can avoid this next time
873.87 -> and make sure you don't
have any repeat incidents.
876.93 -> So those are some of the values
that customers are seeing
879.36 -> from onboarding to this IDR service.
883.8 -> So just to wrap up this
section on the offering,
886.14 -> before I hand it off to Michael,
888.69 -> let's go back to that
five-hour recovery time
891.21 -> and look at again,
892.86 -> let's get behind what that is,
and how does IDR as a service
896.73 -> actually help decrease
that very significantly.
900.78 -> So assume you have an application,
903.39 -> it's stood up all or in part on AWS.
906.99 -> You've got your application
alarms on your side
909.15 -> that feed your teams in
terms of what's going on
910.8 -> at the application layer.
912.27 -> Our services are obviously instrumented
914.07 -> and have telemetry of
what's going on with them.
916.83 -> So say a problem begins to evolve,
919.2 -> whatever the source might be,
920.52 -> application problem,
maybe a service problem,
922.89 -> alarms begin to trigger in
different places, right?
926.13 -> At some point you begin
a customer investigation
928.41 -> to find out, "Okay, what's going on here?"
930.51 -> You're looking at recent changes.
931.83 -> What's causing this?
933.09 -> Is it a problem in our infrastructure?
934.77 -> Is it a problem in AWS?
936.207 -> And at some point you're gonna
look at some log entry or
938.85 -> some metric and say, "Okay, I
think this is an AWS problem.
942.27 -> I need help resolving this from AWS."
944.73 -> You're gonna open an
appropriate support case
946.68 -> to the right team to deal with that.
949.11 -> That gets us down to the white box.
951.51 -> Our experience from our
measurements suggests that that is
955.38 -> usually in the range of
two to three hours from the
958.17 -> initial trigger to the initial
customer investigation,
961.47 -> to the time that they're
convinced that AWS needs to help
963.81 -> with the problem and opens a support case.
966 -> So, at that point,
968.37 -> your technical account
managers are getting paged out,
970.62 -> we're all on a bridge together,
971.61 -> we're doing a joint
investigation in the blue phase.
975.24 -> There may be pipelines on
both sides that may need
977.79 -> to be rolled back to deal with a problem.
980.49 -> There may be multiple mitigation paths
982.29 -> that we're investigating to try to resolve
984.06 -> whatever the root cause
of the incident is.
986.58 -> And eventually we find what works,
988.26 -> we fix the problem, things are recovered,
990.51 -> everything is great,
992.04 -> that second half from
the white bar on down
994.14 -> takes another two to
three hours on average.
996.57 -> So you put together the
front-end investigation
999.15 -> and the back-end joint investigation
1000.86 -> and that's where your five
hours can typically come from,
1003.83 -> even for a well-instrumented,
well-prepared kind of app.
1008.36 -> So how does IDR help with that?
1010.01 -> Well, as I said before,
1011.75 -> IDR helps by short-circuiting
the whole top half
1014.51 -> of that investigation by the
alarms that we've put in place,
1018.38 -> when they trigger,
1019.213 -> we already know there's a problem
1020.93 -> and we have some indication
of where the problem is.
1023.54 -> So the investigation can
hopefully start in the right place
1026.87 -> and proceed much more quickly.
1030.14 -> You get right into that blue phase
1031.97 -> within 15 minutes or less,
1033.74 -> and that phase itself
proceeds much more quickly
1037.34 -> because of the runbooks
1038.48 -> and the joint preparation that we've done
1040.19 -> through Well-Architected, et cetera.
1042.32 -> And our experience through the
deployment of this offering
1045.05 -> is that, as you would expect,
1046.85 -> it results in much shorter
response times for customers,
1051.62 -> or recovery times rather,
1053.3 -> and we've actually had some
of our customers tell us that
1055.88 -> the telemetry that we've
agreed to and set up
1058.46 -> has actually helped them detect things
1060.59 -> and resolve them before
they actually became
1063.59 -> customer-impacting problems at all.
1065.06 -> So you move from the range
of detection to prevention,
1067.4 -> which obviously is the ultimate goal.
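The timeline arithmetic described above can be sketched as a simple model. The baseline phase durations are the averages quoted in the talk (two to three hours each, taking the midpoints); the 15-minute engagement figure is IDR's stated commitment, but the shortened joint-investigation figure below is purely an illustrative assumption, not a quoted statistic:

```python
# Rough model of the incident timeline described above, in hours.
# Pre-IDR: customer detection, self-diagnosis, and opening a support
# case (~2-3h), then joint investigation and mitigation (~2-3h).
baseline = {
    "customer_detection_and_diagnosis": 2.5,  # midpoint of the 2-3h range
    "joint_investigation_and_fix": 2.5,       # midpoint of the 2-3h range
}

# With IDR: agreed alarms feed AWS incident management directly, so the
# first phase collapses to the <=15-minute engagement commitment; the
# joint phase also shrinks thanks to prepared runbooks (assumed figure).
with_idr = {
    "automated_detection_and_engagement": 0.25,  # 15 minutes
    "joint_investigation_and_fix": 1.5,          # illustrative assumption
}

print(f"baseline MTTR ~{sum(baseline.values()):.2f}h, "
      f"with IDR ~{sum(with_idr.values()):.2f}h")
```

The model just makes explicit where the five-hour average comes from, and why removing the front-end diagnosis phase is the biggest single lever.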
1070.28 -> So that's a little bit about
Incident Detection Response
1073.19 -> and why we created it and
what we're seeing from it.
1075.2 -> I'd like to turn the podium
over to my friend, Mr. Proctor,
1078.53 -> and hear about the amazing
migration of Chase.com
1081.2 -> that happened this year and
how IDR was part of that.
1084.83 -> - Thank you, Tom.
1086.54 -> Thank you, everybody, and again,
1087.44 -> thank you for spending time with us today.
1089.6 -> So Tom described something
that I think will resonate
1091.79 -> with so many people, you know,
1093.32 -> the ability to know that at some point
1095.51 -> in your triaging of an incident
1098.72 -> that you don't have to
make a call, open a case,
1101.6 -> try to describe to somebody
who doesn't know your
1103.4 -> infrastructure necessarily
or hasn't really spent time
1105.8 -> in your alerting what the problem is
1107.75 -> while you're also simultaneously trying
1109.43 -> to do a bunch of other things.
1110.42 -> So I think that that will resonate.
1112.58 -> I think, as well, the important component
1114.68 -> around preparedness is really critical.
1116.87 -> And so having a good sense
of how you can be prepared
1120.71 -> not just in the we'll hand this off to AWS
1123.62 -> and partner with them,
1124.73 -> but in the sense of actually
going through processes
1127.37 -> and pre-production activity
that will leave you
1130.01 -> with a sense of confidence
that you can move forward.
1133.49 -> So at Chase we spend a lot
of time thinking about scale,
1137.75 -> the impacts, et cetera, and
have a really solid view
1141.29 -> of what operational excellence is.
1143.442 -> And so a lot of what we're
talking about really falls into
1145.49 -> this operational excellence
concept of making sure
1148.67 -> that we are as prepared
as we can be.
1153.5 -> So I'm sure you have measures as well,
1155.18 -> that you use KPIs perhaps,
1157.01 -> that tell you how many
customers were impacted
1159.71 -> in a specific issue,
1160.73 -> how long it took to mitigate and repair
1162.29 -> and some of those numbers
that Tom showed us
1164.39 -> on the previous slide, we do the same.
1167.33 -> And I think, you know, driving towards
1169.19 -> making those better all
the time is something
1170.87 -> that any organization,
no matter the size,
1174.26 -> is really gonna focus upon.
1176.57 -> So I'm gonna try to share some
learnings in the preparation
and the processes that
we did prior to going live
1182.75 -> with the migration that Tom talks about.
1185.27 -> And basically this was a,
1186.71 -> it's our digital front-end
that serves both Chase.com
1189.86 -> and our Chase mobile app,
1191.39 -> largely an Apache Tomcat-based deployment.
1194.39 -> And we'll talk about the
specifics of that deployment
1196.52 -> in a little while, but that's
the context of the discussion.
1201.05 -> Before I jump forward,
just a quick show of hands,
1204.59 -> who is already a Chase customer
1206.15 -> either through banking,
investments, credit card,
1208.46 -> lending products, et cetera.
1211.1 -> Thank you for being a customer
1212.36 -> and I don't ask the question
out of simple curiosity
1215.69 -> and I'm not gonna chase anybody
who didn't raise their hand
1217.55 -> and offer you a product
or anything like that.
1219.56 -> It's really around scale and
I think scale is important for
1222.29 -> a lot of different institutions.
1224.03 -> You know, either you
have a very niche market
1225.41 -> or you have a very broad
market, you have competitors,
1227.87 -> et cetera, so the scale
that you're at now
1230.417 -> and the scale where you
would like to be is important
1233.36 -> in how you strategize to move forward.
1236.36 -> The scale that you can see here,
1237.701 -> speaking about Chase as a whole,
1240.89 -> I think is relatively
large by any standard.
1243.17 -> We have over 250,000
employees across the globe,
1246.98 -> as you can see,
1248.81 -> have relationships in the US
with one in two households,
1253.16 -> and that's why the hands
were roughly one in two,
1255.53 -> I would say maybe,
maybe a little bit less.
1257.9 -> And then just two other metrics
speaking about the amount of
1260.513 -> payments that are processed
in a period of time
1263.3 -> to give you a sense again of the scale.
1265.82 -> But JPMC or JP Morgan
Chase is not a monolith.
1268.49 -> We're actually comprised of
four major lines of business,
1271.34 -> eight different corporate groups
1274.64 -> and the lines of business are very focused
1276.26 -> on different customer segments
and the needs that they have.
1280.1 -> I'll run through them very
quickly because I think
1282.11 -> it gives a sense of
context in this discussion.
1285.53 -> First we have the Consumer
and Community Banking,
1288.29 -> that's really the one
that's focused on the US,
1290.36 -> handles all your retail activity.
1291.89 -> As you can see, we have
branches in all the 48 states,
1295.16 -> the lower 48 states,
1296.6 -> and you can access your banking
1299.09 -> or complete your banking
requirements really
1301.13 -> in any way you choose,
branches, through the telephone,
1304.22 -> ATMs are pretty versatile.
1306.05 -> And then increasingly and more and more,
1308.81 -> I think we're seeing digital
adoption through the Chase.com
1311.48 -> or the mobile app.
1313.04 -> We also have the Corporate
and Investment Bank,
1316.46 -> well known as the JP Morgan brand.
1318.44 -> Something that, you know, you can see
1319.97 -> provides advice and such
to corporate customers
1323.66 -> in over a hundred countries.
1325.55 -> We have the Asset and Wealth Management,
1327.38 -> which is very focused on customers
1330.92 -> with very specific financial needs,
1332.99 -> and then we have the Commercial Bank,
1334.76 -> which is focused on businesses
1336.44 -> in over 20 countries around the world.
1338.93 -> Now none of this would be possible
1341.72 -> without global technology,
and I think I indicated
1343.79 -> Global Technology is one
of the corporate groups
1346.73 -> and resiliency and preparedness,
I think, as I said,
1350.48 -> falling into operational excellence
1351.86 -> is really one of the cornerstones
for trying to make sure
1354.26 -> that we can grow our base
and grow our presence
1356.42 -> and grow the brands.
1358.31 -> Looking then at global
technology and global technology
1361.76 -> here, not just the name
of the corporate group,
1363.8 -> but also the way we've
constructed technology
1366.35 -> around the world, I think
you can see based on
1369.221 -> just the few metrics
that are on the screen,
1371.93 -> it's large as well by any measure.
1373.79 -> 20% of the workforce
actually has an IT role
1376.88 -> of one nature or another.
1378.921 -> We have, you know, 6,400
applications in production
1383.12 -> and you can see there are the number
1384.65 -> of active digital customers and so on.
1387.44 -> Now with that many applications,
with that many customers,
1390.2 -> you can imagine that the
opportunity for things to go wrong
1393.29 -> and for customer impacts to
grow really quickly is large.
1395.99 -> And so being prepared quickly
means that we can avoid some
1398.9 -> of those things that Tom showed.
1400.88 -> We don't wanna be the
guy holding the dominoes
1403.07 -> and we certainly don't want
to be seeing our name in print
1406.55 -> because of some major issue that occurred.
1408.11 -> So, you know, the importance
of making sure we're prepared
1411.71 -> I think is quite significant.
1413.96 -> There are a couple of
strategies or strategic pillars
1415.79 -> inside of global technology
1416.99 -> that I'd just like to touch on quickly.
1419.36 -> First one is, you know, the products,
1420.77 -> platforms, and experiences,
and this goes to supporting
1423.98 -> business needs for
differentiated experiences.
1426.2 -> Obviously we're all in competitive spaces
1428.99 -> and being able to rapidly deploy
and implement new releases,
1434.54 -> software development, and infrastructure.
1436.28 -> Looking here specifically in this context
1438.74 -> and at AWS about adopting Elastic Cloud
1441.86 -> and making sure that we can
leverage it for our deployments
1445.16 -> and as well as our engineering
1447.44 -> to bring a lot of acceleration to it
1450.89 -> given, you know, comparing it
1452.21 -> to some of our on-prem deployments.
1455.54 -> You've heard a lot about data
1456.95 -> and about the importance of
data throughout the conference.
1459.86 -> Yesterday's keynote had a big chunk about
1461.66 -> data clearly unlocking the power of data,
1463.85 -> not just for customer purposes,
1466.22 -> but, you know, as Tom was saying,
1467.36 -> the telemetry and the
importance of knowing
1470.03 -> that you can establish with confidence
1471.65 -> what the health of your application is,
1473.15 -> what the health of your workload,
1474.14 -> and the infrastructure is important.
1476.48 -> But none of this would be possible
1478.16 -> without the ability to protect customers
1479.9 -> and the firm as a whole, you know.
1482.27 -> We all know the world we
live in in terms of threats,
1484.43 -> et cetera so embedding security
and privacy at every layer
1487.76 -> is really critical.
1490.55 -> So I'm almost at a point
where we'll see a slide
1493.04 -> that has AWS icons and stuff,
1495.32 -> so just give us a couple of minutes there.
1498.05 -> The Chase.com migration
I think is critical
1500.03 -> and I know that a lot of folk
who've attended some of the,
1503.15 -> like the CTO talk on Monday
around how you choose
1507.47 -> what you're gonna send to
the cloud is important.
1509.84 -> So I'm gonna touch very
quickly on why the Chase.com
1512.24 -> migration really made sense to us
1514.07 -> and it really fulfilled a bunch
of the strategic priorities
1516.47 -> that technology has and can
support the business with.
1519.17 -> First is the leveraging of,
you know, data and technology
1522.23 -> and we touched on that
on the slide before,
1524.12 -> but the ability to actually
take advantage of what the cloud
1526.61 -> offers is clearly critical, you know,
1528.86 -> driving customer engagement
through enhanced experiences
1532.76 -> and clearly the reverse of that is true.
1534.83 -> Whenever you have an outage
1536.09 -> or you have some sort of failure,
1537.77 -> you're reducing your customer engagement
1539.48 -> and you're placing
friction in front of them
1541.31 -> and at some point you're
gonna lose a customer.
1543.92 -> Building around very dedicated,
very specific protections
1547.37 -> that will take care of the
security that we require
1550.7 -> for maintaining a very
strong and well-regulated
1553.88 -> controls and risk environment is important
1556.16 -> and being able to adopt various
cloud-specific techniques
1560.27 -> there is important as well.
1561.95 -> And finally, and it
may not seem intuitive,
1563.75 -> but the whole idea of focusing
on the employee experience,
1567.53 -> especially on engineers and
others in the technology space,
1570.38 -> of making sure that we
can reduce friction.
1572.33 -> And we've heard this in other talks
1573.65 -> through the last couple of days, you know,
1575.57 -> reducing friction of having to manually
1578.9 -> do the same task repetitively,
1581.03 -> of taking time to get
things into production,
1582.95 -> or going through loads of hoops.
1584.63 -> I think we see a lot of
opportunity by adopting the cloud
1588.74 -> for this specific application
1589.94 -> and others that will
follow in that regard.
1593.63 -> So the Chase.com migration had
several demanding objectives
1596.84 -> that were outside of the
strategic priorities.
1599.691 -> We're moving out of some
legacy data centers.
1602.75 -> The ability to be able to
deliver at the same speed
1605.51 -> and with the same metrics or
better than those data centers
1608.27 -> was clearly important.
1609.83 -> Here we're focusing on an objective
1611.84 -> of four nines of availability
1613.19 -> and I know availability is critical
1614.87 -> as all the hands went up a little earlier.
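For context, a four-nines objective implies a hard downtime budget; a quick illustrative calculation (not from the talk):

```python
def downtime_budget_minutes(availability: float, period_days: int = 365) -> float:
    """Minutes of allowed downtime for a given availability over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability)

# Four nines (99.99%) leaves roughly 52.6 minutes of downtime per year,
# or about 4.3 minutes per 30-day month.
yearly = downtime_budget_minutes(0.9999)
monthly = downtime_budget_minutes(0.9999, period_days=30)
```

That tight budget is why automated detection and mitigation, rather than purely manual response, features so heavily in the rest of the talk.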
1617.3 -> Clearly cost is a factor,
1618.68 -> and it'll be a factor for
any size of organization,
1621.26 -> but also the ability to
engineer the end-to-end solution
1624.11 -> was really paramount.
1625.61 -> Making sure that we can
automate specifically things
1628.25 -> like our deployments,
our component releases,
1631.34 -> the self-healing and mitigation, you know,
1633.59 -> in order to try to
avoid some of the things
1635.24 -> that Tom was talking about earlier,
1637.04 -> and also testing and
validation and certification.
1642.35 -> The objective was to complete this
program by the end of 2022.
1647.03 -> We actually completed the migration
1648.68 -> of our last customer wave in late October.
1653.3 -> The total volume of customers migrated
1655.22 -> and I should say we didn't
migrate data into the cloud,
1658.13 -> we migrated the processes
that pointed customers
1660.83 -> to various infrastructure
to drive all of that,
1664.52 -> all those sessions, all of that
activity through the cloud,
1666.95 -> was a hundred million customers
1668.09 -> so a hundred million profiles now point
1670.76 -> all of those customer
sessions to run into the cloud
1673.43 -> even though not everybody is active
1675.17 -> as you saw some of the earlier numbers.
1679.94 -> I'm gonna quickly just touch on
1682.04 -> the preparedness that Tom spoke about
1683.6 -> and he spoke about resiliency
testing and the like,
1685.7 -> and just share with you how we processed
1688.7 -> through the engagement,
1691.07 -> and I must also note that the team
1693.65 -> that was built to complete this migration
1696.2 -> was above and beyond just the
application team owning it.
1699.23 -> It was a broad range of people
1700.67 -> from a wide range of components of Chase
1704.9 -> as well as AWS and our ProServe colleagues
1707.99 -> that have partnered to
make this a reality.
1711.62 -> We also engaged the digital SRE
team to really focus on the
1716.041 -> observability and the
resiliency component of it.
1718.25 -> And to that end,
1719.33 -> we started with a Failure
Mode And Effects Analysis.
1722.63 -> So this is a way of producing
a complete inventory
1725.15 -> of all the failure modes
you might anticipate
1727.01 -> in an environment in both the application
1729.83 -> as well as the infrastructure,
1731.69 -> of making sure your understanding of what
1733.97 -> might trigger those events,
1735.47 -> what you would do to mitigate them,
1737.24 -> what the potential is for them to happen,
1739.61 -> and that can sometimes be a SWAG,
1741.32 -> but it's still an important thing to have.
1743.36 -> What your projected outcomes would be
1744.98 -> in terms of customer impacts,
1746.63 -> and to use those attributes
as a way of trying to rank
1749.45 -> different failure modes against each other
1751.16 -> and to then focus on the
ones that are most important.
1754.25 -> And once you have the
ability to lay them out,
1757.97 -> lay out the failure modes
and you've ranked them,
1759.95 -> to then work through building scenarios
1761.63 -> that will help you take care of the how,
1763.88 -> how will you trigger them,
1764.87 -> how will you know they've happened,
1766.4 -> how will you assess the impacts?
1768.95 -> And sometimes that requires
more than just documenting it,
1773 -> it's actually building
processes that will do that.
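The FMEA ranking Michael describes (likelihood, detectability, customer impact) is commonly scored as a risk priority number; a minimal sketch, with invented failure modes and scales rather than Chase's actual inventory:

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    name: str
    severity: int    # customer impact, 1 (negligible) .. 10 (catastrophic)
    occurrence: int  # likelihood, 1 (rare) .. 10 (frequent) -- can be a SWAG
    detection: int   # 1 (certain to detect) .. 10 (effectively undetectable)

    @property
    def rpn(self) -> int:
        """Risk priority number: the classic FMEA ranking score."""
        return self.severity * self.occurrence * self.detection

def rank(modes: list[FailureMode]) -> list[FailureMode]:
    """Highest-risk failure modes first, so game days target them first."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)

inventory = [
    FailureMode("AZ loss", severity=8, occurrence=2, detection=2),
    FailureMode("ElastiCache node failure", severity=5, occurrence=4, detection=3),
    FailureMode("Backend data center latency", severity=7, occurrence=3, detection=6),
]
ranked = rank(inventory)
```

Note how a hard-to-detect, moderate-impact mode can outrank a severe but obvious one; that is exactly why detection belongs in the ranking.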
1777.35 -> Having the scenarios then
leads to scheduled game days.
1780.35 -> And I think Tom mentioned
the word game day
1781.88 -> and that's really an important component
1783.38 -> of this entire cycle.
1785.42 -> And that's basically scheduling an event
1787.58 -> with a series of scenarios potentially
1789.83 -> where you can engage all
the appropriate stakeholders
1793.1 -> from the application teams,
1794.93 -> the architects, et cetera,
1796.76 -> as well as the operational
teams to all have the experience
1799.79 -> of what goes on when something fails.
1801.8 -> And to do so outside of the realm
1804.08 -> of when customers are really impacted
1805.61 -> and are gonna be, you
know, aware of the fact
1807.59 -> that you've got a
problem and to spend time
1809.93 -> working through those scenarios,
you know, driving them,
1812.63 -> leveraging tools such as Gremlin
1814.76 -> or AWS Fault Injection Simulator
1816.62 -> to inject the failures that you see
1819.02 -> and then work towards a process
1821.45 -> where you can do a postmortem,
1822.77 -> the same kind of thing you
would do in a production space.
1824.51 -> But to do so really focused
on not just the nature
1829.19 -> of the failure or the
architecture but the application,
1833.24 -> the documentation, how you
would respond to things,
1836.12 -> and obviously the detection
and such, as well.
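That inject-detect-assess-postmortem loop can be sketched as a tiny harness; here the injection step is a stub standing in for a Gremlin attack or a Fault Injection Simulator experiment, and all names are illustrative:

```python
import time
from dataclasses import dataclass, field

@dataclass
class GameDayResult:
    scenario: str
    detected: bool
    time_to_detect_s: float
    findings: list = field(default_factory=list)

def run_scenario(name, inject, probe, timeout_s=60.0, poll_s=1.0):
    """Inject a fault, then poll a health probe until the failure is visible.

    inject: callable that triggers the fault (e.g. start an FIS experiment)
    probe:  callable returning False once monitoring sees the failure
    """
    inject()
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        if not probe():
            # Detection worked; elapsed time feeds the postmortem review
            return GameDayResult(name, True, time.monotonic() - start)
        time.sleep(poll_s)
    # Never detected within the window: that is itself the key finding
    return GameDayResult(name, False, timeout_s,
                         findings=["fault was never detected: add a monitor"])
```

A scenario that times out undetected is arguably the most valuable game-day outcome, since it exposes a monitoring gap before customers do.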
1839.75 -> Hopefully at the back-end
of that you come up
1841.13 -> with a bunch of improvements,
1842.12 -> especially if this is an early adoption
1844.13 -> of a cloud deployment
1845.21 -> and you're working through
potential alternatives
1847.4 -> to the way that you
might do your deployment.
1850.13 -> Deploying, building the solutions
or building improvements,
1852.83 -> deploying them and retesting
them is critical obviously.
1855.65 -> And at the same time, you may
also need to revise your FMEA,
1859.4 -> so to revise your Failure
Mode And Effect Analysis
1861.86 -> 'cause you may have potentially
reduced a failure mode
1864.5 -> by the improvements you've
got or even potentially
1866.66 -> have increased one or added to it.
1869.66 -> And once you feel comfortable
that you've accomplished
1871.431 -> the insights or you've
gained the insight you need,
1874.22 -> you've perfected it as much as you can
1876.83 -> for those specific failure
modes is to then iterate
1879.14 -> and to expand your scope
to go through failure modes
1881.75 -> that maybe didn't make
it into the first wave.
1884.39 -> Those are all important ways of doing this
1886.94 -> on a regular basis.
1888.08 -> And we created a bunch of
scenarios that we tagged
1891.44 -> as certification scenarios and we run them
1894.05 -> as certification game days
1895.34 -> and we run them deliberately at the point
1896.9 -> where we expect to have large releases,
1899.42 -> where we've had component
level upgrades, EKS upgrades,
1902.33 -> things like that,
1903.59 -> as well as running them in
production once a month,
1906.05 -> and that took some doing
in terms of actually trying
1908.66 -> to get them running in production.
1910.31 -> I should say by way of disclosure
1912.8 -> that the sharded version
of our customer base
1914.96 -> that experiences these production failures
1916.82 -> through these game days
are employee accounts
1919.64 -> and everybody who did
so volunteered to do so,
1922.43 -> but at some point I think the
goal is that we would be able
1924.2 -> to run these kinds of certification tests
1926.66 -> in a production environment
knowing that the impact
1929.12 -> to customers would be zero
because of the mitigation steps
1932.39 -> and the telemetry and all the processes
1934.7 -> that we've built around it.
1937.91 -> Now let's jump into looking at
the architecture of Chase.com
1940.64 -> and this is a fairly busy slide
but I'm sure it'll resonate
1943.76 -> with you as well.
1946.28 -> As I indicated, we have
a sharded solution,
1948.59 -> so sharded customer base with
multi-account, multi-region,
1952.73 -> multi-AZ and the philosophy of making sure
1955.4 -> that we can have as much
redundancy and isolation
1958.22 -> as possible so that we can
minimize blast radius impact
1961.07 -> for any issue that does occur.
1964.146 -> In front of the application VPC,
1965.78 -> we've got multiple layers of
security and services that help
1969.98 -> take care of our inputs.
1971.69 -> You can see Route 53 there as well.
1974.39 -> The application layer is
built out of a series of pods
1977.39 -> inside of EKS running
Kubernetes obviously.
1981.14 -> And that's where we have, you know,
1983.158 -> the full application layer.
1985.49 -> The managed services that we
rely on apart from the ones
1990.08 -> that I already indicated
specifically are around ElastiCache
1993.14 -> and RDS for the small amount
of data that we require
1995.75 -> to be able to operate our system.
1999.55 -> But the bulk of our activity
2001.39 -> in terms of fulfilling
customer requests occurs
2003.76 -> at the back-end in our
corporate data centers
2005.71 -> and that's plural.
2006.91 -> You can also see, you know,
the focus on redundancy
2010.63 -> in terms of some of the
icons on the diagram,
2014.86 -> and then from an
operational point of view,
2016.54 -> underpinning all of that,
2017.92 -> really are a load of functions
2019.66 -> and just working through
them from left to right.
2022.12 -> We run a lot of synthetics,
2023.71 -> many of them are driven out
of our Digital Robotics team
2026.83 -> that run tests both in
the pre-production space
2029.11 -> as well as production, giving
us good signals and insights
2031.78 -> into how our processes are performing.
2034.69 -> We leverage that with ThousandEyes
2036.4 -> and with Dynatrace as well.
2038.53 -> And Chaos test services
both in pre-production
2041.11 -> and production, and in
the cloud and on-prem
2043.39 -> we leverage Gremlin.
2044.89 -> We also have a series of scripts
and other bespoke solutions
2048.55 -> that we've written to supplement
what Gremlin does for us.
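A synthetic test ultimately reduces to running a canned customer journey on a schedule and judging error rate and latency against thresholds; a toy evaluator, with made-up thresholds rather than anything from Chase's Digital Robotics tooling:

```python
def evaluate_synthetics(samples, max_error_rate=0.01, p95_latency_ms=500.0):
    """Judge a batch of synthetic-transaction results.

    samples: list of (latency_ms, succeeded) tuples from synthetic runs.
    """
    if not samples:
        return "no-data"          # a silent canary is itself an alert
    failures = sum(1 for _, ok in samples if not ok)
    if failures / len(samples) > max_error_rate:
        return "unhealthy"
    latencies = sorted(l for l, _ in samples)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    if p95 > p95_latency_ms:
        return "degraded"         # up, but slower than customers will tolerate
    return "healthy"
```

Distinguishing "degraded" from "unhealthy" matters because, as the talk stresses, a service can pass binary up/down checks while still causing real business impact.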
2053.08 -> From the monitoring point of view,
2057.52 -> we've made sure that we've
built monitors to cover
2059.89 -> all the failure modes that
were built out of our FMEA
2062.86 -> as well as to provide
rollups and aggregations
2065.89 -> of what we can see in
terms of both workload
2069.19 -> and health for the infrastructure.
2071.71 -> And we've done that in
multiple different tools.
2075.64 -> People may wonder why the multiple tools,
2077.38 -> and we'll talk about that in a moment,
2079.27 -> but we rely in the cloud on Datadog
2081.85 -> supplemented by CloudWatch.
2083.77 -> On-prem, we have Splunk and Dynatrace
2086.35 -> and a bespoke operations portal.
2088.78 -> Dynatrace though is also
deployed in our cloud environment
2091.45 -> so it gives us the ability to
trace and to connect the dots
2094.21 -> between a lot of the processing
that occurs in the cloud
2097.96 -> deep into our
on-prem infrastructure
2101.23 -> and the services that are hosted there.
2105.61 -> The monitoring platforms that we have
2107.17 -> all basically emit alerts and signals
2111.76 -> into our Corporate Alert
Hub and therefore seamlessly
2114.97 -> into our Corporate Incident
Management Process.
2118.36 -> So like Tom described, a
pretty robust comprehensive
2122.23 -> AWS Incident Management process.
2123.43 -> We believe we have a very similar one
2126.16 -> and that is critical in
terms of getting signals
2129.67 -> to the right people at the right time.
2132.82 -> Not described here, but
important to notice,
2135.22 -> we have a series of automated
processes to assess health,
2139.24 -> like a health check service,
2140.35 -> both for the infrastructure
as well as the application.
2143.05 -> And that gives us the ability
to automatically route traffic
2145.9 -> appropriately when needed
because of the multi-region
2149.23 -> and the multi-AZ components of it,
2152.11 -> we're able to route traffic
away from environments that are
2154.75 -> experiencing an issue.
2155.583 -> So we'll focus on having a primary,
2157.48 -> having our sessions run in primary regions
2160.06 -> compared to secondary,
2161.38 -> but we always have the
redundancy that we can fail over
2163.42 -> to the secondary region
and that becomes helpful
2165.94 -> and useful at least to this
stage in our progression
2168.97 -> in order to ensure that our deployments
2171.19 -> can be successfully tested
prior to customer traffic
2174.01 -> actually running through them.
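The routing decision described here, serve from a healthy primary and fail over to the secondary, can be sketched as a small function; the region names and health inputs are illustrative, and in practice this logic would sit behind Route 53 and the health check service:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class RegionHealth:
    name: str
    healthy: bool     # rollup from the automated health check service
    is_primary: bool

def route_target(regions: List[RegionHealth]) -> Optional[str]:
    """Prefer a healthy primary; otherwise fail over to a healthy secondary."""
    healthy = [r for r in regions if r.healthy]
    if not healthy:
        return None               # nothing can take traffic: page humans
    for r in healthy:
        if r.is_primary:
            return r.name
    return healthy[0].name        # fail over to the secondary region

regions = [RegionHealth("us-east-1", healthy=False, is_primary=True),
           RegionHealth("us-west-2", healthy=True, is_primary=False)]
target = route_target(regions)
```

The same secondary region doubles as the place to validate deployments before customer traffic reaches them, as Michael notes.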
2177.28 -> Oh, and finally,
2179.92 -> the same structures that
you see here are created
2184.12 -> and deployed in our
pre-production environments
2186.007 -> and that gives us, you know,
2187.66 -> the resiliency testing opportunities
2190.54 -> that I mentioned a moment
ago where we can really run
2192.97 -> under performance load,
2194.5 -> mimicking customer
traffic in production.
2198.25 -> We can run our resiliency
tests and our performance tests
2201.55 -> and to run the resiliency
tests under load,
2204.13 -> I think that's really important
'cause you can get a really
2206.35 -> good sense of whether your
expectations were correct
2208.93 -> in terms of both the resiliency
2210.73 -> as well as the impacts and
your mitigation strategies
2214.15 -> and the fact that you can detect
2215.44 -> and observe these issues promptly.
2219.1 -> So building that out across your platforms
2221.41 -> is really important.
2225.61 -> Now it's clear from the
number of components
2227.8 -> that you see here
2228.73 -> and from this logical
groupings of infrastructure
2231.76 -> that multiple teams could
get involved in the event
2234.46 -> that we have an issue and that's true,
2238.84 -> we have a multi-tiered,
multi-level support structure.
2241.48 -> Today at its heart is
a mission control group
2244.99 -> that is on-prem 24/7 in
multiple places in the world
2248.68 -> and that are co-located
with various IT disciplines
2252.34 -> to really allow us to
respond as fast as possible
2254.8 -> and their primary role is
to mitigate an incident
2257.44 -> and to escalate appropriately as needed.
2260.35 -> So we expanded that to include
additional support teams
2263.86 -> within Chase and
obviously the expansion goes
2266.44 -> to what Tom described
earlier in terms of IDR.
2272.44 -> I'm sorry, my notes have
disappeared below the bottom here
2275.29 -> and I'm struggling to find them.
2279.88 -> I don't wanna leave anything out.
2282.37 -> So, as Tom indicated, you
know, because of our focus
2284.74 -> on operational excellence
because of our requirement
2288.55 -> and our need to ensure
that we can resolve issues
2291.07 -> as fast as we can and
to expand the tooling
2293.62 -> that's available to us, we
went through the process
2295.9 -> of working through the
well-architected review
2299.14 -> that Tom described and we chose to adopt
2303.19 -> and onboard to the IDR offering.
2307.33 -> And by doing so, working with
them in terms of creating
2310.75 -> the appropriate CloudWatch
alerts, identifying the signals,
2314.14 -> collaborating on runbooks and
what various signals will mean
2317.32 -> and how we would respond and
how we would be supported
2320.83 -> was really important.
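The CloudWatch alarms agreed during IDR onboarding look roughly like the following sketch; the metric name, thresholds, and topic ARN are invented, and in practice this dict would be the kwargs for boto3's `cloudwatch.put_metric_alarm`:

```python
def business_impact_alarm(app: str, metric: str, threshold: float,
                          sns_topic_arn: str) -> dict:
    """Build an illustrative CloudWatch alarm definition for a custom metric."""
    return {
        "AlarmName": f"{app}-{metric}-business-impact",
        "Namespace": f"Custom/{app}",     # application-emitted metric
        "MetricName": metric,
        "Statistic": "Average",
        "Period": 60,                     # evaluate every minute
        "EvaluationPeriods": 3,           # three bad minutes before alarming
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "breaching",  # silence is suspicious too
        "AlarmActions": [sns_topic_arn],  # fan out to the alert hub and IDR
    }

alarm = business_impact_alarm("chase-web", "LoginFailureRate", 0.05,
                              "arn:aws:sns:us-east-1:123456789012:alerts")
```

Alarming on a business-level metric like a login failure rate, rather than a raw infrastructure counter, reflects the best practice both speakers return to.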
2322.33 -> Additionally, doing connections
in terms of communication
2326.41 -> and status is critical as well.
2328.96 -> So it doesn't... you don't
just get a call out of the blue,
2332.26 -> you're aware of how this process will flow
2334.75 -> and it doesn't interrupt
your triaging activities.
2338.47 -> We're at the point now in the
migration, as I indicated,
2341.35 -> the customer migration
was completed in October,
2344.74 -> of working through the remaining activity,
2347.53 -> that is processing in
our legacy infrastructure
2349.9 -> for this application.
2351.52 -> And we're considering at this stage
2354.34 -> that we've been very successful
and have accomplished
2357.04 -> the goals we wanted to.
2358.48 -> Obviously it's not a single journey.
2360.7 -> We're not at the destination,
2361.84 -> a lot of work to be done going forward,
2364.15 -> but the iterative processing
and the continual focus
2368.47 -> on improvement that you
can gain through doing
2371.71 -> resiliency activities,
through the resiliency tests,
2374.56 -> through ongoing training, and
awareness of the support teams
2379.3 -> is not gonna stop.
2380.2 -> And it will be a foundation
not just for this application
2383.08 -> but for other applications
that are going to migrate
2385.42 -> to the cloud in the coming year.
2388.72 -> Finally, we spoke a little bit
about the multiple platforms
2393.46 -> that help us with our
observability and we have
2395.439 -> Datadog, Splunk, Dynatrace and
others and people will say,
2397.75 -> "Well, why would you do that?"
2399.16 -> Some of it is legacy.
2400.45 -> I know people who've worked in
organizations that are large
2403.45 -> and perhaps have had a long
presence in an IT space
2407.41 -> will have legacy platforms
that they're bringing forward
2409.21 -> with them, that's part of it.
2410.5 -> But also they tend to focus
on very specific needs
2413.38 -> and I'm not saying this is
our end solution by any means,
2415.99 -> but it's a great step forward
that helps us leverage
2418.9 -> what we know, adopt things
that are brand new to us,
2422.83 -> and then work through the
process of refining that
2425.29 -> as time passes, and that's
a journey that I think
2428.59 -> we're still well underway with.
2434.74 -> So looking then at Best
practices, calls to action,
2437.71 -> I'll take the best practices
2438.82 -> and Tom will return to
look at the call to action
2440.68 -> and then we'll be happy to take questions.
2443.8 -> Clearly having a very
clear sense of the measures
2446.156 -> that you need to be able
to determine whether or not
2449.59 -> there's business impact is important.
2451.39 -> I think that came up a
lot in what Tom said.
2453.01 -> It's certainly been something
that we've had to focus on
2454.84 -> as well.
2456.22 -> We've heard in other talks
just the volume of telemetry
2459.28 -> that comes out of the cloud environment
2461.56 -> plus all the additional metrics
and such that you will get
2465.25 -> out of your application
along with logs and the like
2467.92 -> can really make it a bit
daunting to determine
2471.79 -> which are the signals that
will really help you out.
2474.433 -> You know, you can have literally
2475.266 -> hundreds and hundreds of metrics
2476.92 -> that you need to wade through
in any normal deployment
2480.34 -> in the cloud.
2482.23 -> It reminds me a little bit of,
2483.4 -> and if I may borrow from "The
Rime of the Ancient Mariner,"
2485.56 -> where the Mariner's sitting
in a be-calmed ocean,
2488.65 -> all he can see is water from end-to-end,
2490.75 -> and he and his crew are all
dying of thirst and he says,
2492.737 -> "Water, water, everywhere,
nor any drop to drink."
2495.58 -> It's a little bit the same for technology
2497.26 -> where we can have metrics,
2498.58 -> metrics, fill my screen,
nor is there sense in that
2501.37 -> 'cause if you can't gain the sense
2502.99 -> out of what the metrics tell you,
2504.49 -> if you can't get the
insights you need to be able
2506.26 -> to really understand what's going on,
2508.18 -> and what the health of your workload
2509.47 -> and your infrastructure is,
then you'll be at a loss.
2513.31 -> And that can be especially bad
2514.6 -> when you're trying to triage an issue.
2517.66 -> Business impact is also a
very important component.
2520.87 -> I think maturity in terms
of alerting really focuses
2523.78 -> on not just the binary
up or down of a service,
2526.93 -> but what's the business impact potentially
2528.64 -> associated with that, up or down.
2531.37 -> We also need to alarm to IDR
2533.257 -> and to the corporate support
teams at the same time,
2536.998 -> you know, this is a partnership
and it's a progression.
2540.1 -> There's a lot of runbooks that we have
2541.96 -> that will be supplemented
by what IDR can do for us,
2545.2 -> but certainly IDR is not intended
nor would we want it to be
2548.74 -> the first line of defense in
any issue that we experience.
2553.21 -> Again, a maturity component, you know,
2555.16 -> alarming at multiple levels,
not just at workload,
2557.83 -> at an ALB perhaps or components
here that it may have failed
2561.97 -> some liveness check, but
rather trying to do that
2566.11 -> in a structured way that
would give you better insights
2568.51 -> into where things might be happening.
2570.28 -> And if you can correlate
some of the signals,
2572.28 -> it really gives you much better indication
2574.03 -> of how a segment of
your processing perhaps
2576.76 -> or a segment of your request
structures might be failing
2579.46 -> because of a certain issue in a component
2581.2 -> or even a backend system.
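Correlating signals across layers, rather than paging on any single component check, is the idea behind CloudWatch composite alarms; a toy correlation rule, with invented signal names:

```python
def correlate(signals: dict) -> str:
    """signals maps signal name -> currently firing. Returns a triage hint."""
    alb_5xx = signals.get("alb_5xx", False)
    pod_errors = signals.get("pod_errors", False)
    backend_latency = signals.get("backend_latency", False)
    if alb_5xx and pod_errors and backend_latency:
        return "backend-system"    # the whole path degrades from the back end
    if pod_errors and not alb_5xx:
        return "application-layer" # app failing while the edge still looks fine
    if alb_5xx and not pod_errors:
        return "edge-or-network"   # errors before requests reach the pods
    return "inconclusive"
```

Even a crude rule like this points responders at a layer instead of a single blinking component, which shortens triage considerably.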
2585.97 -> Being able to validate runbooks and again,
2588.73 -> going through the resiliency testing
2590.08 -> because there's nothing like
practice to really help you
2592.3 -> determine whether or not
something's working as you expect
2594.94 -> to help you with the tooling
that you would otherwise
2597.1 -> be relying upon in a crisis is important.
2601.12 -> But it also helps you
go through the runbooks,
2602.95 -> the communication paths,
brings familiarity to people
2606.55 -> on both the corporate side as
well as on the AWS IDR side.
2610.63 -> And finally I'd say,
you know, the alerting
2613.87 -> and the tuning of your alerts
needs a little bit of time
2616.36 -> to bake in in production.
2617.32 -> And so you need to be,
just leave a little time
2619.33 -> for that to happen.
2620.56 -> No matter how comprehensive
the performance testing
2623.35 -> can be and how much you try to mimic what
2625.36 -> your production activity is
in terms of customer behavior,
2628.39 -> you're gonna find instances
where that changes
2630.61 -> and it can especially change
under the threat of an incident
2633.55 -> or when an incident's
unfolding, customers will retry
2636.49 -> or tech will move, you know,
inadvertently somewhere else.
2639.85 -> And just the ability to know
how to catch those signals
2642.97 -> is really important.
2645.01 -> So with that, I would
just close by saying,
2648.353 -> a well-architected review,
which Tom mentioned,
2650.14 -> and I think I touched on earlier as well,
2652.09 -> is really a great place to start.
2653.2 -> It gives you a nice
comprehensive view of things
2654.973 -> that you can take care of
2656.68 -> and areas you should be focused upon.
2658.66 -> Engaging with AWS in that
process is even more valuable
2663.28 -> given their insights into
the processing as well.
2665.77 -> And there are some segments
that focus very specifically
2667.84 -> on different business domains
like financial services.
2672.34 -> So with that, here's Tom for the closing
2674.41 -> and the call to action.
2678.79 -> - Okay, thank you, Michael, so much
2680.23 -> for the experience and
sharing that with us today.
2683.74 -> Okay, so you've heard what IDR is,
2686.17 -> why we built it,
2687.16 -> and you've heard from one
of our largest customers
2688.96 -> about their experience implementing it
2690.58 -> so I hope that that has been helpful.
2693.91 -> So where do we go forward from here?
2695.59 -> How can this help you?
2696.52 -> So this chart here is trying
to show how you can think about
2700.9 -> applying some of these lessons
2702.13 -> learned to your own environments.
2703.84 -> So it starts with looking
through your portfolio
2706.6 -> and your applications
2707.44 -> and deciding which critical
applications would benefit
2710.11 -> from the type of increased responsiveness
2713.29 -> that we talked about here today.
2715.12 -> I think the key thing, like
in many things is, you know,
2718.93 -> start somewhere.
2720.25 -> You may not have a digital platform
2722.5 -> that serves hundreds of
millions of users at a time,
2725.74 -> but you've got something that's
critical to your business.
2727.87 -> So start with that.
2729.37 -> Go through the orderly life
cycle that Michael showed you
2732.13 -> in terms of the reviews and
the testing and all of that.
2739 -> You know, if you're interested
in using IDR as a mechanism
2741.64 -> to help formalize and drive that forward,
2743.47 -> certainly talk to your
account team, to your TAMs,
2745.36 -> we'd be happy to work with you on that.
2747.61 -> And since we are here in Vegas,
2749.95 -> I'm gonna use the gambling
analogy and, you know,
2751.87 -> say be prepared.
2753.55 -> Don't gamble with your customer experience
2755.68 -> or the availability of your application.
2759.25 -> All right, so just to wrap things up,
2761.02 -> if you're looking for information,
2762.85 -> we've got some good pages
out here that can give you
2765.31 -> more information on IDR
2766.417 -> and that will be available in the deck.
2768.76 -> Most interestingly might be,
2770.95 -> we've talked a little bit about
the value of the offering,
2772.99 -> you might have some questions
about cost for value.
2776.23 -> There's a pricing page that
describes how IDR is priced,
2779.89 -> and to just summarize it in
a nutshell, it's very similar
2782.92 -> to the pricing of Enterprise
Support: a tiered model
2785.86 -> based on usage.
2787.12 -> It's roughly about 40% of the base cost
2790.12 -> of Enterprise Support.
2791.98 -> So if you wanna talk more
about, you know, scenarios,
2794.32 -> we're happy to do that
after the talk here,
2797.11 -> and the FAQ as well has some great
2799.57 -> Frequently Asked Questions
that you can review at
2802.18 -> your leisure.
2803.59 -> So with that, we will open it to questions.