AWS re:Invent 2022 - Developing an observability strategy (COP302)

Aug 16, 2023


Do you have a plan for your observability? Do you understand what the stakeholders for your applications want to observe? As you move to the cloud or mature your operations within the cloud, it’s important to optimize your observability to help your stakeholders understand how your applications are operating. Join this session to learn how to define your observability strategy for the future in order to serve the requirements of all of your stakeholders and to help ensure that you can deliver successful business outcomes.


Content

1.83 -> - Okay. Hi everyone.

3.93 -> Welcome to developing an observability strategy.

9 -> So, my name's Alex.

11.55 -> I'm a specialist solutions architect,

13.65 -> specializing in observability.

16.68 -> Done just about over 20 years in IT.

20.37 -> Got a seven year old daughter.

22.56 -> I used to play ultimate Frisbee,

23.82 -> but I got too old and a bit too fat as well.

28.38 -> So, now I try and play, I play golf instead.

32.31 -> So, we're gonna talk about how to design an observability

36.03 -> strategy and we're also gonna go into a few

37.89 -> technical things as well.

41.01 -> - Hi, my name is Ania Develter.

42.99 -> I'm also a specialist solutions architect on the cloud

45.84 -> operations team here at AWS.

48.9 -> I have been in IT for over 20 years as well.

51.99 -> Three and a half of those years here at AWS.

54.96 -> I'm a mom of two, I've got a 20 and a 15 year old daughter,

58.71 -> or daughters, and also I'm a cat lover.

61.86 -> I actually have four cats and two dogs and I don't talk,

64.71 -> I don't stop talking about them, and they actually feature

68.04 -> in our observability workshop, that's how much I love them.

72.691 -> - Hi, I'm Igor Sedukhin.

74.22 -> I'm a general manager for application observability at AWS.

79.2 -> I'm likewise a dad of one great ten-year-old daughter,

84.24 -> and love snow.

85.44 -> So, I get to do it quite a bit, when I get back,

87 -> it seems like it's dumped.

90.06 -> So, talk a little bit about observability.

93.54 -> If there is any business of value that has to perform

97.35 -> to expectations, it pretty much has to be observed.

101.417 -> And so observability has to deliver and scale with your

105.9 -> business needs to billions of runtime instances

109.83 -> to millions of users.

112.89 -> It needs to be there when you need it the most.

116.19 -> It has to be as available or better than what it observes.

120.87 -> And more importantly, it actually needs to make your life

124.35 -> and cloud operations simpler.

126.72 -> And so we'll explore that aspect of how do you approach

130.8 -> that, how do you make it simple?

135.66 -> So, before we launch into the session,

138.15 -> let's talk a little bit about the context

140.79 -> in which observability exists.

142.47 -> Cloud operations.

143.97 -> Where in the AWS life cycle would you encounter

148.29 -> the need to observe something?

150.66 -> So, first, you're highly likely to set up your environment

155.07 -> for compliant operations.

156.99 -> So, create your AWS organizations.

162.06 -> Set up IAM roles,

163.74 -> make sure you have compliance controls in the environment.

167.28 -> That was the first step.

168.9 -> Second, you're migrating or developing an application.

173.82 -> Maybe a serverless application directly in AWS,

177.63 -> and you'd use CloudFormation templates and all different

180.66 -> tools that we have to develop that application.

184.11 -> Once the application is in production,

186.87 -> you would highly likely need to observe

189.45 -> and see if it performs as expected.

192 -> And that's where you would encounter tools,

194.61 -> a family of tools in CloudWatch to be able to observe

198.48 -> and explore your application as it runs and serves users.

204.39 -> So, let's get into it.

207.12 -> - Okay, let's talk about the important metrics.

210.81 -> CPU, RAM, and disk metrics are the core measurements

215.25 -> of any system.

217.23 -> Even serverless services run on servers, right?

222.51 -> Yes, these metrics are important.

225.66 -> But are they actually important to your business

228.99 -> and your application?

231.03 -> You should only worry about those metrics for the purpose

234.45 -> of cost optimization, scaling, or capacity management.

239.34 -> But should you care about these metrics?

242.22 -> Do they tell you anything about how your application

244.83 -> is running, or more importantly, do they give you any

249 -> insights into your customer's experience?

254.37 -> Now, we're gonna break from convention here and ask you to get

258.21 -> your phone out and follow the QR code that you see

261.27 -> on the screen there.

262.77 -> It'll take you to a very simple film website.

266.43 -> We just ask you to browse around,

268.53 -> maybe search for a film by year.

271.222 -> - Yeah, I'm sorry, this is very basic.

272.91 -> You can blame me, I built this.

274.71 -> But it's gonna serve a purpose.

282.285 -> Right - Okay, so what- Sorry.

283.868 -> (laughter)

284.701 -> So, while you're browsing around, we would appreciate

287.16 -> some audience participation here.

289.92 -> Tell us about the metrics that you care about

293.22 -> in your organization.

294.39 -> And there's no right or wrong answer here.

296.73 -> - So, anyone shout out what's important?

299.227 -> - [Audience Member 1] Latency.

300.06 -> - Latency we've got.

301.65 -> - Anyone else?

304.428 -> - [Audience Member 2] Response time, throughput.

306.257 -> - Response time and throughput.

308.256 -> - [Audience Member 3] Lack of errors

309.51 -> - Lack of errors.

310.513 -> - [Audience Member 4] Freshness.

312.172 -> - Freshness. - Freshness.

313.505 -> - [Audience Member 5] Usage of data.

314.901 -> - Usage of data.

316.087 -> - [Audience Member 6] Hits per second.

318.12 -> - Sorry?

318.953 -> - [Audience Member 6] Hits per second.

319.786 -> - Hits per second. Thank you.

321.54 -> Anyone else?

324.51 -> - Another one about failures, I didn't quite-

327.96 -> - Price did you say?

329.587 -> - [Audience Member 7] 4xx and 5xx errors.

330.93 -> - Yeah, 4xx and 5xx errors, yeah.

333.27 -> - Hold on. This gentleman, I can't hear 'em.

336.24 -> - Cost. Yeah, cost, very important.

340.83 -> - Igor, do you wanna tell us more about

342.96 -> the metrics that your team cares about?

344.91 -> - Yeah, exactly.

346.02 -> At AWS, we run services for you.

348.51 -> And this is one of the most important questions

350.76 -> we actually ask our teams,

353.79 -> what is the most important metric for your,

356.46 -> for the business that that service performs,

358.47 -> in relation to you as a customer.

361.11 -> And so, as you know, it was mentioned,

363.87 -> tends to be things like, does the console load,

367.56 -> is it available for you to use it?

370.8 -> Is the latency of the APIs or of the user experiences

374.19 -> adequate, right?

375.48 -> And is availability adequate?

378.09 -> And so what we do, well, you know,

382.35 -> with that metric is, we, in fact,

384.57 -> have a company-wide meeting every week

388.26 -> where we spin the wheel, and maybe you've read about it,

392.7 -> and there's an AWS wheel.

394.86 -> And it picks the service, one or two, that will show

398.85 -> to the company what metrics they observe and whether or not

403.53 -> the operations are healthy, right?

405.78 -> And we do this kind of friendly inspection of each other

410.37 -> to ensure that, you know, these metrics,

413.19 -> the key metrics that we observe on your behalf, right,

418.53 -> are actually indeed serving the needs of that business,

422.18 -> of that service.

423.48 -> So, this is the most critical question.

427.5 -> - [Ania] Okay. And what about amazon.com Alex?

430.65 -> - Yeah, so I guess the most important metric

434.34 -> in terms of amazon.com or any of our other retail sites

438.48 -> is how many orders are we getting?

442.14 -> How many orders are we processing?

444.03 -> And if that falls below a certain threshold,

446.07 -> then we know that there's a problem.
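
The "orders fall below a threshold" check could be expressed as a CloudWatch alarm. The following is a minimal sketch, assuming a hypothetical custom metric `OrdersPlaced` in a hypothetical namespace `Shop/Business`. The dictionary keys are parameter names of the CloudWatch `PutMetricAlarm` API; the alarm name, threshold, period, and SNS topic are illustrative values, not figures from the talk.

```python
def orders_alarm_params(min_orders_per_period: float, topic_arn: str) -> dict:
    """Build PutMetricAlarm parameters for a 'too few orders' alarm."""
    return {
        "AlarmName": "orders-below-threshold",      # hypothetical name
        "Namespace": "Shop/Business",               # hypothetical namespace
        "MetricName": "OrdersPlaced",               # hypothetical metric
        "Statistic": "Sum",
        "Period": 300,                              # seconds per data point
        "EvaluationPeriods": 3,                     # three periods in a row
        "Threshold": min_orders_per_period,
        "ComparisonOperator": "LessThanThreshold",
        # No data at all is treated as a problem: no orders are being recorded.
        "TreatMissingData": "breaching",
        "AlarmActions": [topic_arn],
    }


params = orders_alarm_params(
    100, "arn:aws:sns:us-east-1:123456789012:orders-oncall"
)
print(params["AlarmName"], params["ComparisonOperator"], params["Threshold"])

# To create the alarm for real (needs boto3 and AWS credentials):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Treating missing data as breaching is a design choice for this sketch: for an order metric, silence is usually as worrying as a low number.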

448.41 -> And this should become obvious why we're talking about this

451.92 -> and why that's an important metric.

453.87 -> - Okay, so we're gonna take this back to the context

456.45 -> of that film website that you just browsed.

458.55 -> So, if you're still browsing, you can stop searching

460.41 -> for films now, thank you.

461.82 -> And Alex is gonna show you the metrics that we care about.

468.09 -> - Okay, so here I've got a, I've got a dashboard.

472.65 -> And I'll just go through what we've got on this dashboard.

476.58 -> So, we care about availability,

480.39 -> we've got a hundred percent availability, that's all good.

483.9 -> Then I've got some SLA statuses here.

487.2 -> So, there's two, there's three SLAs here.

491.73 -> Availability SLA.

493.59 -> We've got page load SLAs.

495.24 -> We want 99% of our page loads to happen within two seconds.

501 -> And we've also got an API response SLA.

503.16 -> We want our APIs to respond, again, 99% of them to respond

507.15 -> within a second.

509.46 -> I've got some other metrics here that I care about.

511.44 -> So, I've got sessions in the last hour.

514.53 -> Searches in the last hour.

517.02 -> So, these are other, kind of, important,

520.05 -> would be important business metrics if the website

523.32 -> that you looked at was a business.

526.65 -> So, when we look at our page load times,

529.86 -> I've got these separated out into different percentiles.

533.28 -> Percentiles are really important.

535.26 -> If you don't use percentiles, averages can hide problems.

540.39 -> So, we can see here at P50,

543.6 -> and I'll explain what that is in a minute.

547.32 -> We've got 850 milliseconds for page load time.

550.53 -> Great.

551.55 -> So, what this means is that 50% of our users,

555.48 -> the maximum page load time for them is 850 milliseconds.

560.91 -> At 90% it goes up to over a second.

564.96 -> And at 99% it goes up to 18.8 seconds.

569.76 -> That's definitely above the two second mark.

573.24 -> And we're using CloudWatch real user monitoring

576.499 -> to get some of this data.

579.45 -> And what I have here on the right hand side is metrics

584.46 -> that come directly from real user monitoring.

586.83 -> Let me try and get rid of that.

590.46 -> So, we've got the percentage of page load times

592.98 -> that happen under two seconds.

594.66 -> We want that to be above 99%, but it's 96.7.

599.34 -> We've got some of them 0.25% happening

603.09 -> between two and eight seconds.

605.1 -> And we've got 3% that are over eight seconds.

607.95 -> So, there's something wrong.
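
The three buckets on the dashboard (under two seconds, two to eight seconds, over eight seconds) can be reproduced from raw page load times. This is a rough sketch with invented sample data chosen to resemble the numbers on screen; it is not how CloudWatch RUM computes them internally.

```python
def bucket_shares(load_times_s, fast_s=2.0, slow_s=8.0):
    """Return the percentage of page loads in each of three latency buckets."""
    total = len(load_times_s)
    if total == 0:
        raise ValueError("need at least one page load sample")
    fast = sum(1 for t in load_times_s if t < fast_s)
    slow = sum(1 for t in load_times_s if t > slow_s)
    middle = total - fast - slow
    return {
        "under": 100 * fast / total,
        "between": 100 * middle / total,
        "over": 100 * slow / total,
    }


def meets_sla(load_times_s, limit_s=2.0, target_pct=99.0):
    """True when at least target_pct percent of loads finish under limit_s."""
    return bucket_shares(load_times_s, fast_s=limit_s)["under"] >= target_pct


# Invented sample: 1,000 page loads, most fast, a few very slow.
sample = [0.9] * 967 + [5.0] * 3 + [12.0] * 30
shares = bucket_shares(sample)
print(shares)
print("SLA met:", meets_sla(sample))
```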

608.91 -> We'll look at this, we'll look at this later.

613.11 -> I've got some metrics here from load balancer as well.

616.98 -> And these metrics are basically the same metrics,

620.49 -> but they're from our API.

623.19 -> So, again, if we look at 99%,

627.66 -> if we look at the 99th percentile,

629.13 -> so that's the maximum response time for-

632.91 -> if we take into account 99% of requests, it's 25.2 seconds.

637.98 -> Again, this is really bad.

640.83 -> And then the reason we ask you to log on

645.361 -> and have a look and play around is to show you

648.54 -> these metrics as well.

650.4 -> So, you can see here at the top, we've got 47,000 requests

657.3 -> coming from the US.

659.25 -> Some of those, or a lot of those,

661.02 -> And you'll see a lot of these at the top.

665.61 -> They're coming from some containers I've got running

668.37 -> in the background generating traffic.

670.89 -> But it looks like some people are using SIM cards that are

673.68 -> going back to their country.

675.6 -> 'Cause we've got, looks like Canada and Australia I think.

681.93 -> And then we've got browsers as well and operating systems.

686.43 -> So, we've got 2,250 hits from Android

693.48 -> and 2000 from iOS.

696.21 -> So, put your hand up if you are using an Android phone.

702.84 -> And who's using an iPhone?

706.23 -> Yeah, so I'd say that was roughly about half and half.

709.74 -> So, we know that works, which is good.

713.88 -> So, this is a kind of dashboard of the things that

719.28 -> I want to see if I'm running this very basic site.

724.26 -> We get some of this information from real user monitoring.

727.5 -> So, I'm just going to take you through

729.24 -> real user monitoring very quickly.

732.93 -> And this also demonstrates why averages

736.47 -> are sometimes not good.

738.51 -> So, we look at our average load time, it's one and a half,

742.53 -> 1.7 seconds, that seems like it's under our SLA.

747.54 -> But as you saw from the percentiles, it's not.

749.55 -> So, again, definitely needs percentiles.
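
A small worked example of why the average can sit under the SLA while the tail does not. The sample values below are invented for illustration, and the percentile function uses the simple nearest-rank definition, which may differ slightly from the interpolation CloudWatch applies.

```python
import math
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Invented page load times in seconds: 97 ordinary loads and 3 very slow ones.
samples = [0.8] * 90 + [1.2] * 7 + [18.0] * 3

average = mean(samples)
p50 = percentile(samples, 50)
p90 = percentile(samples, 90)
p99 = percentile(samples, 99)

# The average stays under a two second target while p99 is 18 seconds.
print(f"mean={average:.3f}s p50={p50}s p90={p90}s p99={p99}s")
```

Three slow loads out of a hundred barely move the mean, which is why an SLA phrased as "99% within two seconds" has to be checked against p99, not the average.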

753.3 -> Then we've got data here. This is web vitals data.

758.1 -> So, this gives us data about load performance,

761.19 -> first input delay, so that measures interactivity,

765.27 -> and cumulative layout shifts.

766.86 -> So, that measures how visually stable

770.902 -> your page is when it loads.
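
Web vitals are usually read against published "good" and "poor" thresholds. The sketch below assumes the load metric is Largest Contentful Paint and uses the commonly published Core Web Vitals thresholds; neither the metric name nor the thresholds are stated in the talk, so treat both as assumptions.

```python
# (good_up_to, poor_above) per vital. Units: LCP seconds, FID milliseconds,
# CLS unitless layout-shift score. Values are the commonly published
# Core Web Vitals thresholds, assumed here rather than taken from the talk.
THRESHOLDS = {
    "LCP": (2.5, 4.0),
    "FID": (100, 300),
    "CLS": (0.1, 0.25),
}


def rate_vital(name, value):
    """Classify a web vital measurement as good, needs improvement, or poor."""
    good_up_to, poor_above = THRESHOLDS[name]
    if value <= good_up_to:
        return "good"
    if value <= poor_above:
        return "needs improvement"
    return "poor"


for name, value in [("LCP", 1.7), ("FID", 250), ("CLS", 0.3)]:
    print(name, value, rate_vital(name, value))
```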

774.15 -> And then we've also got a breakdown of what happens

778.08 -> when the page is loading.

779.73 -> So, for our site here, time to first byte

783.484 -> is the biggest, and DOM processing time.

786.99 -> So, we could use this information to maybe cut down

791.46 -> on the time to first byte and DOM processing time.

795.48 -> And we can look at this by browser and device.

801.39 -> And I'm just gonna show you quickly the configuration,

804.6 -> because all you need to do to enable this is

807 -> this JavaScript snippet here.

809.34 -> You just add that into the head of your pages.

813.57 -> Okay, so I'm gonna go back to the presentation.

815.7 -> - It would be interesting to note that one of my teams,

818.34 -> actually, is RUM.

820.8 -> So, RUM is the team inside of the application observability,

824.79 -> you know, organization.

826.29 -> And we use RUM ourselves to observe RUM to know whether it

831.03 -> performs well for you in a very similar fashion,

833.43 -> where you can see the metrics of page load,

836.04 -> or reaction time, or-

838.32 -> And then, you know, we're able to trace it back

840.42 -> to the service behavior.

841.44 -> So, we use our own tools to make sure our services

844.92 -> run well for you.

848.67 -> - Okay, so this is Amazon's vision.

853.65 -> So, it's our vision statement.

855.6 -> And we strive to be Earth's most customer-centric company,

859.38 -> Earth's best employer, and Earth's safest place to work.

862.56 -> Why am I talking about vision?

864.72 -> So, anything that any of us do, any of us on stage,

868.74 -> or anything that any of you do, anyone in this room,

873 -> it should all align to your organization's vision.

876.39 -> That all of us, really basically, we're all employed

879.24 -> to fulfill the vision of our organization.

882.54 -> And for us, like I said, that's to be Earth's

884.283 -> most customer-centric company.

890.4 -> And the way that we strive to be Earth's most

893.28 -> customer-centric company is by starting with our customers

898.26 -> and working backwards.

900.54 -> So, when you think about observability,

902.67 -> you should think about what your customers want

904.89 -> and what they need.

907.62 -> And customer obsession is also

909.51 -> one of our leadership principles.

910.83 -> And we say that leaders start with their customers

913.437 -> and work backwards.

914.97 -> It's really fundamental to us, but it's also really

917.85 -> fundamental to how you should think about observability.

923.97 -> So, we wanna care-

926.85 -> We wanna think about what do our customers care about,

930.42 -> what do they need, what do they want?

932.4 -> And these are, basically, some examples

934.11 -> of what customers generally care about.

937.26 -> In this case, it's in the context of an e-commerce site.

941.22 -> So, you know, all of you, I imagine, have used

944.49 -> an e-commerce site, hopefully ours.

948.39 -> And you care about things like delivery.

950.79 -> You want your stuff to be delivered quickly.

955.65 -> Or at least know when it's gonna be delivered.

957.78 -> Obviously everybody cares about the price.

960.96 -> You care about security and privacy,

963.66 -> you care about how long the page takes to load,

966.57 -> and you also care if you can find the product

968.616 -> that you're actually looking for.

970.8 -> And things like page speed

973.77 -> really, really affect the business.

976.71 -> So, a 2017 study showed that 53% of mobile users

983.79 -> will just abandon the page if it takes

985.41 -> more than three seconds to load.

988.05 -> And I'm just giving that example of a thing

990.54 -> you need to think about in terms of how metric observations

996.27 -> can affect business outcomes.

1000.68 -> So, once you've kind of thought about what your customer

1003.74 -> requirements are, you need to map these requirements

1006.8 -> to stakeholders in your organization.

1010.1 -> So, you need to talk to them to understand

1013.52 -> the metrics that they care about that could

1016.64 -> impact customer requirements.

1018.86 -> And this is because we might not know, right?

1021.86 -> As technical people, we might not know what

1025.1 -> the customer requirements are.

1026.6 -> But the stakeholders, it's their job to know.

1028.82 -> It's their job to care about what the customers care about.

1033.733 -> And it might seem easy for something like

1035.87 -> an e-commerce site, probably most of it seems obvious,

1040.25 -> but I'm gonna give you another stat now.

1042.11 -> So, there's a Forrester Research report that said

1045.11 -> 43% of visitors, when they go to a site,

1048.86 -> they'll go straight to the search.

1052.34 -> And searches are two to three times more likely to convert

1056.45 -> compared to non searches.

1058.37 -> So, search is really, really important,

1060.68 -> and it has to work really, really well.

1063.175 -> And a bad search experience will take people to another site

1065.87 -> and it's gonna impact your business.

1068.33 -> Another example might be that a logistics manager,

1072.53 -> or again, a product manager might understand the impact

1076.82 -> of an item being out of stock.

1079.28 -> Obviously if an item's out of stock or it has

1082.07 -> a long delivery time, people are gonna be less likely

1084.47 -> to buy it.

1088.1 -> - Okay, so this might sound familiar to

1090.68 -> most people in this room, but very often business

1093.59 -> stakeholders pop up and ask questions that actually relate

1097.64 -> to metrics in your application.

1100.588 -> - Okay, Ania, can you tell me how many customers

1102.77 -> are abandoning their shopping carts because

1104.72 -> the delivery time is too long.

1106.55 -> - I'll have to look that up for you Alex.

1107.96 -> So, I'll have to get back to you later.

1110.84 -> - How much does it cost to run?

1113.64 -> - Igor, I am in the middle of a really important project.

1116.3 -> I will have to get back to you later.

1118.73 -> - How many attempted attacks have we had in the last month?

1122.27 -> - I'll get back to you shortly.

1124.67 -> I believe we do have that data.

1127.82 -> - How long does it take to load our front page?

1131.264 -> - Let me check my phone.

1135.587 -> - And how many times are we producing no search results

1138.08 -> when somebody searches for something?

1139.88 -> - We don't actually record that data, Alex, I'm sorry.

1142.04 -> - Really?

1143.84 -> - Okay, does this sound familiar?

1146.48 -> It's like a game of whack-a-mole.

1148.37 -> Finance asks how much, a product manager will ask

1151.61 -> how long to release a feature.

1153.56 -> But how do we get this information to your stakeholders?

1158 -> How do you even know what data to collect?

1161.69 -> You work backwards from the customer.

1168.11 -> Okay, so you want to make sure that your customers

1171.59 -> have a great customer experience.

1174.02 -> So, you start with those customer requirements

1176.33 -> and work backwards.

1178.07 -> You then work with your stakeholders because they understand

1181.01 -> those requirements really well.

1183.44 -> You then work together to derive KPIs.

1187.19 -> Once you have those KPIs,

1188.78 -> you identify metrics, which you then collect.

1193.28 -> Once you have the metrics you can alert, and you can act

1196.97 -> when business outcome or customer experience is at risk.

1200.84 -> And then once you've had to act,

1203 -> you then make improvements based on what was wrong.

1207.38 -> And that, in turn, improves customer experience and creates

1210.74 -> this virtuous cycle of continuous improvement and continuous

1216.38 -> improvement of customer experience.
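
The chain from customer requirement to stakeholder, KPI, data source, and alert condition can be written down as plain data before any tooling is chosen. This is a minimal sketch; the class, the field names, and the example rows are hypothetical, loosely based on the e-commerce examples mentioned earlier in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kpi:
    requirement: str   # what the customer wants
    stakeholder: str   # who represents that requirement internally
    kpi: str           # the agreed indicator
    source: str        # "metrics", "logs", or "traces"
    alert_when: str    # condition that puts the outcome at risk


KPIS = [
    Kpi("Pages load quickly", "Product manager",
        "Share of page loads under 2 s", "metrics",
        "share drops below 99%"),
    Kpi("Customers find what they search for", "Product manager",
        "Share of searches with zero results", "logs",
        "share rises above an agreed level"),
    Kpi("Orders keep flowing", "Country manager",
        "Orders per 5 minutes", "metrics",
        "count falls below an agreed threshold"),
]


def kpis_by_source(kpis):
    """Group KPI names by where their data has to be collected from."""
    grouped = {}
    for item in kpis:
        grouped.setdefault(item.source, []).append(item.kpi)
    return grouped


print(kpis_by_source(KPIS))
```

Grouping by source shows at a glance which KPIs need log extraction and which are already available as metrics.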

1218.6 -> - And what's really, really important here

1220.73 -> is you can't do continuous improvement

1223.28 -> or generally make improvements to your products

1227.21 -> or your applications without this data

1230.15 -> unless you just happen to get lucky.

1232.79 -> Which can happen, but don't rely on luck.

1235.85 -> - Yeah, you can't improve what you don't measure.

1237.89 -> So, that's important.

1239.84 -> - Okay, so now that you understand why you need

1243.44 -> observability, how do you go about it?

1246.5 -> Now this is key.

1247.37 -> You need to develop a strategy.

1250.19 -> So, first think, what do I collect?

1253.76 -> So, we already talked about how important

1256.52 -> customer obsession, customer requirements are,

1260.03 -> how important it is to work with those stakeholders,

1263 -> and to derive those KPIs.

1265.34 -> And the KPIs will inform you what you need to collect.

1269.87 -> And then next, you need to ask yourself, in your strategy,

1274.55 -> what do I observe?

1276.77 -> Well, first you're gonna have to identify the sources

1279.89 -> of the information of those metrics.

1282.74 -> All those things that will help you

1284.21 -> measure against those KPIs.

1287.24 -> Now, these could be in multiple places,

1289.07 -> but typically you'll find them in either metrics,

1291.56 -> logs, and traces, or all of them at once.

1294.32 -> But it is really important that the data

1296.63 -> that you care about the most is that data

1299.15 -> that gives you insights into your customer experience.

1304.73 -> Once you have that data,

1306.86 -> you then develop a strategy on, how do you act?

1311.57 -> You need to develop an alerting strategy so that you

1315.74 -> are aware when outcomes are at risk.

1319.16 -> You also have to evaluate business impact.

1322.67 -> And you also have to plan, or at least define a strategy

1325.97 -> of how you intend to act and what your expectations are.

1330.14 -> And your strategy should allow you to observe your

1333.32 -> application based on its purpose to your customers.

1341.75 -> - Okay, so, as well as having this strategy,

1344.42 -> you need to have a plan.

1345.65 -> How do I go about doing this?

1347.6 -> So, when you are designing your strategy,

1351.08 -> these are the, basically, the steps that we

1354.2 -> think you should take.

1355.79 -> So, work backwards from your customer.

1358.73 -> Again, I'm gonna repeat that.

1360.08 -> Work out what they want and what they need.

1363.74 -> Talk to the stakeholders that represent the requirements

1366.17 -> of your customer because they know what your customer needs.

1373.52 -> And that will help you to identify your KPIs.

1377.51 -> You can then use those KPIs to identify

1380.09 -> where you're gonna get the data from.

1383.24 -> And you might need to extract this data if-

1386.03 -> Obviously you're gonna need metrics and you might

1387.65 -> need to extract this from logs or traces.

1391.4 -> And from there build stakeholder dashboards

1393.8 -> to reflect these KPIs.

1395.483 -> Then you don't have to play this whack-a-mole game.

1400.19 -> And then design your alerts for when business outcomes

1403.58 -> are at risk.

1405.11 -> That's the critical thing.

1409.79 -> And then you need to design, or use an existing,

1412.58 -> alerting strategy that classifies impact and severity.

1416.33 -> So, obviously, things like business impact,

1418.85 -> business critical impact, business critical severity,

1421.28 -> and you might have low impact and low severity.

1425 -> And you also need to route alerts to people

1428.24 -> that can actually fix issues or make a decision.

1434.21 -> Every single alarm that you have should have a plan.

1438.17 -> So, either a runbook or, ideally for a technical issue,

1442.46 -> that should be automated or at least semi-automated.

1446.6 -> And when you've got non-technical issues,

1448.52 -> you should still alert stakeholders.

1450.26 -> For example, if you've got drop in orders in a certain

1454.19 -> country, you may want to alert the country manager,

1459.2 -> the managers for that country to let them know

1461.12 -> that they've got a drop in orders.
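
One way to keep these rules honest is to store each alert with its severity, its owner, and its plan, and then check the list. The sketch below is illustrative only: the severity levels, channel names, and example alerts are hypothetical and not an AWS feature.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    LOW = "low"
    HIGH = "high"
    CRITICAL = "critical"


@dataclass(frozen=True)
class AlertRule:
    name: str
    severity: Severity
    technical: bool  # False for business alerts such as a drop in orders
    owner: str       # someone who can fix the issue or make a decision
    plan: str        # runbook or playbook reference; empty means no plan


def missing_plan(rules):
    """Names of alerts that break the 'every alarm has a plan' rule."""
    return [rule.name for rule in rules if not rule.plan.strip()]


def route(rule):
    """Pick a notification channel from severity and alert type."""
    if rule.severity is Severity.CRITICAL:
        return f"page:{rule.owner}"
    if rule.technical:
        return f"ticket:{rule.owner}"
    return f"email:{rule.owner}"


RULES = [
    AlertRule("api-p99-latency", Severity.CRITICAL, True,
              "api-oncall", "runbooks/api-latency"),
    AlertRule("orders-drop-country", Severity.HIGH, False,
              "country-manager", "playbooks/orders-drop"),
    AlertRule("disk-usage", Severity.LOW, True, "platform-team", ""),
]

print("alerts without a plan:", missing_plan(RULES))
for rule in RULES:
    print(rule.name, "->", route(rule))
```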

1467.42 -> Okay, so we looked at some generic metrics on the dashboard

1471.98 -> earlier around, like, page load speed and response time.

1478.25 -> In the bottom right-hand corner of this slide,

1481.94 -> you'll see a chart that shows our search results

1485.99 -> being returned from our site.

1488.75 -> Now, technically this might not seem like

1492.08 -> an important metric, why do we care about this?

1494.66 -> It doesn't tell us if our application's running or not.

1498.32 -> And this is what we're talking about, in terms of,

1502.13 -> you know, it's not about CPU, RAM, and disk utilization.

1507.71 -> If...

1510.89 -> So, search is really important to this particular business

1515.69 -> or this application, this films application.

1519.92 -> And if people are searching for something on your website,

1522.92 -> and you don't have it, well maybe you should have it.

1525.08 -> Or maybe people are commonly misspelling a product

1528.71 -> and it returns no search results and then you

1531.86 -> might wanna rethink your search algorithm.

1533.84 -> So, you might want to go from naive pattern matching

1536.75 -> to natural language processing.

1540.47 -> So, I'm gonna talk a little bit about

1542.45 -> how we made this graph.

1543.92 -> So, this is using CloudWatch Contributor Insights

1547.94 -> and this analyzes specific fields in log events.

1551.78 -> And this allows you to identify outliers.

1554.72 -> This could be expensive if you had to slice and dice

1556.927 -> all of this information.

1559.49 -> And in fact, I've already spoken to two customers this week

1565.85 -> and they had a perfect,

1567.41 -> both had perfect use cases for using Contributor Insights,

1571.34 -> wanted to look at millions of devices and they wanted

1574.7 -> to know or be alerted when they were having lots of errors

1579.86 -> with a specific device.

1581.63 -> That would be hugely expensive if you were collecting

1584.18 -> metrics for millions of devices.

1587 -> But, by using Contributor Insights, you can slice and dice

1590.63 -> this information at a much lower cost.

1594.41 -> So, in this case, we're using it to count the total

1596.93 -> number of searches for each year that's being

1600.92 -> searched for in our application.

1603.98 -> Because we have a finite number of years to search,

1607.37 -> we can identify searches that aren't yielding any results.
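
To make the idea concrete, here is a local sketch of the kind of counting described: group search events by one log field and compare against the years that actually have films. It does not call Contributor Insights, and the JSON field names (`action`, `year`) and the sample log lines are hypothetical.

```python
import json
from collections import Counter


def searches_by_year(log_lines):
    """Count search events per searched year from JSON log lines."""
    counts = Counter()
    for line in log_lines:
        event = json.loads(line)
        if event.get("action") == "search":
            counts[event["year"]] += 1
    return counts


def searches_without_results(counts, years_with_films):
    """Searched years that cannot return any film, with their search counts."""
    return {year: n for year, n in counts.items() if year not in years_with_films}


logs = [
    '{"action": "search", "year": 1994}',
    '{"action": "search", "year": 1947}',
    '{"action": "search", "year": 1994}',
    '{"action": "view", "film": "example"}',
]

counts = searches_by_year(logs)
print(counts.most_common(3))
print(searches_without_results(counts, years_with_films={1994, 2001}))
```

A managed service does this grouping continuously over the log stream, so there is no need to publish one metric per distinct value.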

1611.48 -> So, if you search for a year,

1614.54 -> I can't remember which ones yielded no results,

1616.76 -> but let's say you searched for 1947 and you didn't get any

1620.33 -> films from 1947, but you really wanted to buy a film

1623.3 -> from then, then that would be bad news,

1626.9 -> and we might want to do something.

1630.83 -> Nothing technical, but something business related

1633.35 -> to improve it.

1634.79 -> And we use this a lot internally as well.

1637.43 -> So, how do you go about doing this?

1639.8 -> - Correct Alex.

1640.79 -> So, all of our services serve millions of customers, right?

1644.24 -> And we observe percentiles of aggregates, of course,

1648.68 -> to measure availability or compute latency.

1652.1 -> But it is very important for us to understand the specific

1656.3 -> issues or behaviors of each individual customer.

1658.73 -> And we deploy this approach to understand the outliers.

1661.76 -> Like, in a big set of customers,

1664.7 -> there may be one who experienced this particularly

1667.52 -> long latency because one of the services could have had

1670.73 -> a stuck partition or an instance, right?

1674.69 -> And so we deployed this approach ourselves to understand,

1678.05 -> you know, and better manage services for you, to make sure we

1682.52 -> understand each of you, not as an average, but individually.

1686.51 -> So.

1689.96 -> - Okay, we are gonna talk about designing dashboards next.

1692.57 -> So, we've added a QR code to an Amazon Builders' Library article

1696.5 -> called Building Dashboards For Operational Visibility.

1699.65 -> You should check it out. It's a great article.

1703.43 -> Okay, so when designing a dashboarding strategy,

1707.96 -> we already talked about stakeholder dashboards,

1710.36 -> and those stakeholders typically come to you

1712.4 -> and you know it's always at the most inconvenient time,

1714.56 -> you are in the middle of a project and they always

1716.72 -> need the data straight away.

1718.82 -> And it's important for them.

1720.44 -> So, when you're designing dashboards,

1722.87 -> keep in mind those stakeholder dashboards,

1724.73 -> they are really important.

1726.41 -> So, examples there would be, a cost dashboard for the CFO,

1730.25 -> or a service audit dashboard for the CISO.

1734.72 -> But the goal of a good observability strategy is to detect

1739.94 -> issues before they affect your customers.

1743.78 -> And dashboards are that human facing view

1747.35 -> into your system that summarize your system behavior.

1750.92 -> And this is why you need those other high level dashboards,

1754.13 -> those more application specific dashboards,

1756.74 -> such as a customer experience dashboard

1759.08 -> or a system or service level dashboard.

1763.22 -> Okay, so, the main reason for this talk is to create

1766.04 -> an observability strategy centered around your customers.

1769.34 -> But those technical dashboards are also important

1772.43 -> and can also affect business.

1774.59 -> And this is why you also need to design those low-level

1777.23 -> dashboards, those dependency dashboards,

1779.78 -> infrastructure dashboards, microservice dashboards.

1783.71 -> And this is where you will see those

1785.3 -> CPU, RAM, and disk metrics.
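
As an illustration of a high-level widget, the sketch below builds a CloudWatch dashboard body with one graph of load balancer response time at p50, p90, and p99. The widget layout follows the CloudWatch dashboard body JSON structure; the load balancer identifier, region, and title are placeholder values.

```python
import json


def percentile_widget(title, namespace, metric, dim_name, dim_value, region):
    """One metric widget plotting p50, p90 and p99 of the same metric."""
    return {
        "type": "metric",
        "x": 0, "y": 0, "width": 12, "height": 6,
        "properties": {
            "title": title,
            "region": region,
            "period": 300,
            "metrics": [
                [namespace, metric, dim_name, dim_value,
                 {"stat": stat, "label": stat}]
                for stat in ("p50", "p90", "p99")
            ],
        },
    }


widget = percentile_widget(
    title="API response time",
    namespace="AWS/ApplicationELB",
    metric="TargetResponseTime",
    dim_name="LoadBalancer",
    dim_value="app/films/0123456789abcdef",  # placeholder identifier
    region="us-east-1",                      # placeholder region
)
dashboard_body = json.dumps({"widgets": [widget]})
print(dashboard_body)

# To publish it (needs boto3 and AWS credentials):
#   import boto3
#   boto3.client("cloudwatch").put_dashboard(
#       DashboardName="customer-experience", DashboardBody=dashboard_body)
```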

1791.84 -> Now onto designing alarms.

1794.45 -> So, when you're coming up with your alarm alert strategy,

1798.05 -> be very deliberate about what you alert on,

1801.5 -> because this will have a massive impact

1804.71 -> on your business outcome.

1807.5 -> If you alert too little, you will lose visibility.

1811.49 -> But if you alert too much, you could also lose visibility.

1817.67 -> So first, define a strategy.

1820.34 -> What is a warning? What is an alarm?

1823.22 -> And what should be an actionable alert?

1826.28 -> Now alerts must be actionable.

1829.13 -> Why would you alert if you don't want to take action?

1831.8 -> So, I always apply that 3:00 AM rule to all the alerts.

1835.01 -> If you got that alert at three in the morning,

1837.5 -> would you read the information and turn over and think,

1839.63 -> I can sort that tomorrow?

1841.76 -> Perhaps that shouldn't be an alert, then.

1844.91 -> So, also make sure that those notifications are meaningful

1849.11 -> and that they are received and routed to the right people.

1853.61 -> And then define expected actions with the use of playbooks,

1858.11 -> so that anyone who's responsible for responding to an alert,

1862.43 -> no matter how long they've been with the organization,

1865.46 -> how well they understand the business or the application,

1868.61 -> it's so that they know exactly the process

1871.04 -> that you expect them to follow.

1873.38 -> And then, where possible, remediate with the use of runbooks

1877.22 -> for automated or semi-automated remediation.

1883.46 -> Now, alert fatigue and alarm noise can be a real business

1888.56 -> impacting issue, because if you have too many alarms,

1892.67 -> it is really easy to miss something important.

1895.79 -> So, where possible create meaningful alarms.

1899.51 -> So, in CloudWatch we've got composite alarms,

1901.7 -> and they allow you to use AND, OR, or NOT operators

1905.9 -> to combine multiple metric alarms

1908.06 -> to create those more meaningful alerts.

1912.2 -> So, for example, a lower number of orders might be okay,

1917.3 -> but actually if it's, at the same time there's

1919.73 -> a lower number of search results, that might indicate

1922.643 -> that there is a customer facing issue in your application.

1926.75 -> So, by using an AND operator in a composite alarm,

1930.8 -> you can be alerted when the thresholds for orders

1934.64 -> and for search results both fall below a certain level.

1938 -> So, you will only be alerted then

1939.5 -> as you understand that there's a correlation.

1941.6 -> So, you don't need to know when just one's below its level,

1944.93 -> only when both are below their levels.
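
The orders-and-search example can be sketched in Python roughly as follows. This is a minimal sketch, not the configuration used in the talk: the alarm names `orders-low` and `search-results-low` are hypothetical, and the helper functions are illustrative. The rule string uses the `ALARM("...")` and `AND` syntax that CloudWatch composite alarm rules accept, and the local evaluator only mimics that rule so the behaviour can be checked without an AWS account.

```python
def alarm_rule_all(*alarm_names: str) -> str:
    """Build a composite alarm rule that is true only when every child is in ALARM."""
    return " AND ".join(f'ALARM("{name}")' for name in alarm_names)


def composite_in_alarm(child_states: dict, *alarm_names: str) -> bool:
    """Local stand-in for the AND rule: True only if all named children are in ALARM."""
    return all(child_states.get(name) == "ALARM" for name in alarm_names)


# Hypothetical child alarm names, not taken from the talk.
CHILDREN = ("orders-low", "search-results-low")

request = {
    "AlarmName": "orders-and-search-both-low",  # hypothetical name
    "AlarmRule": alarm_rule_all(*CHILDREN),
}
# With AWS credentials configured, this payload could be sent with:
#   boto3.client("cloudwatch").put_composite_alarm(**request)

print(request["AlarmRule"])
# Only one child in ALARM: the composite stays quiet.
print(composite_in_alarm({"orders-low": "ALARM", "search-results-low": "OK"}, *CHILDREN))
# Both children in ALARM: the composite fires.
print(composite_in_alarm({"orders-low": "ALARM", "search-results-low": "ALARM"}, *CHILDREN))
```

Keeping the rule as a generated string makes it easy to add or remove child alarms without hand-editing the expression.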

1946.82 -> Now, another feature of composite alarms

1949.37 -> is alarm suppression.

1951.62 -> So, here you can create an alarm, for example,

1954.95 -> to indicate there is a deployment in progress,

1958.37 -> and that will act as a suppressor in the composite alarm.

1963.89 -> And when that suppressor alarm is in the alarm state,

1968.15 -> any other alarm in that composite alarm is suppressed,

1972.29 -> no alert is generated, and no action is taken.
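
A rough Python sketch of the suppression idea, under stated assumptions: the alarm names are hypothetical, the wait and extension periods of 120 seconds are arbitrary, and `actions_fire` is a simplified local model that ignores those periods. The `ActionsSuppressor`, `ActionsSuppressorWaitPeriod`, and `ActionsSuppressorExtensionPeriod` keys are parameters of the CloudWatch `PutCompositeAlarm` API.

```python
def actions_fire(composite_state: str, suppressor_state: str) -> bool:
    """Simplified model: actions run only if the composite is in ALARM
    and the suppressor alarm is not in ALARM."""
    return composite_state == "ALARM" and suppressor_state != "ALARM"


# Hypothetical names; the periods (in seconds) are arbitrary assumptions.
request = {
    "AlarmName": "customer-facing-issue",
    "AlarmRule": 'ALARM("orders-low") AND ALARM("search-results-low")',
    "ActionsSuppressor": "deployment-in-progress",
    "ActionsSuppressorWaitPeriod": 120,
    "ActionsSuppressorExtensionPeriod": 120,
}
# With AWS credentials configured, this payload could be sent with:
#   boto3.client("cloudwatch").put_composite_alarm(**request)

print(actions_fire("ALARM", "OK"))     # no deployment running, so actions run
print(actions_fire("ALARM", "ALARM"))  # deployment running, so actions are suppressed
```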

1976.07 -> And you can also nest multiple composite alarms

1979.7 -> for those complex scenarios, such as disaster recovery.

1983.72 -> And then lastly, one other way to reduce alarm noise

1986.66 -> would be to leverage anomaly detection alarms in CloudWatch.

1990.95 -> Now, these don't have those static thresholds

1993.11 -> that you set yourself.

1994.58 -> Instead they follow, or they alert or alarm when that value

2000.76 -> falls outside of the expected value,

2003.76 -> that's based on that anomaly detection model.
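
To illustrate the difference from a static threshold, here is a deliberately simplified Python sketch. It is not CloudWatch's anomaly detection model, which is trained on the metric's history; here the "expected" band is just the mean plus or minus a multiple of the standard deviation of some made-up data points. In CloudWatch itself the band comes from the `ANOMALY_DETECTION_BAND` metric math function.

```python
from statistics import mean, stdev


def expected_band(history, width=2.0):
    """Return (lower, upper) as mean +/- width * sample standard deviation."""
    centre = mean(history)
    spread = stdev(history)
    return centre - width * spread, centre + width * spread


def is_anomalous(value, history, width=2.0):
    """True when the value falls outside the expected band."""
    lower, upper = expected_band(history, width)
    return value < lower or value > upper


# Made-up history of a metric, for illustration only.
history = [100, 102, 98, 101, 99, 100, 103, 97]

print(expected_band(history))
print(is_anomalous(101, history))  # inside the band
print(is_anomalous(60, history))   # well outside the band
```

The band moves with the data, so there is no fixed number to maintain by hand as traffic patterns change.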

2008.38 -> - [Alex] Okay, so I'm gonna do another quick demo.

2013.78 -> Cool, okay, this one's gonna be pretty quick.

2016.84 -> So, I've highlighted some alarms here already.

2021.52 -> So, I've got three CloudWatch alarms here.

2024.55 -> I've got tolerated or frustrated navigation.

2028.6 -> This is basically when real user monitoring is telling me

2032.35 -> that I've got a page load speed of either between

2034.84 -> two and eight seconds or above eight seconds,

2037.36 -> which is the metric I showed in the dashboard

2039.51 -> at the beginning.

2041.65 -> I've also got an alarm that's telling me

2044.89 -> if I'm being throttled by DynamoDB.

2048.82 -> And then I've got a special type of alarm here that says,

2051.88 -> DynamoDB throttling affecting page load time.

2056.958 -> And this is a composite alarm.

2058.93 -> And this is what the composite alarm looks like.

2062.02 -> So, it contains the child alarms, so the tolerated

2066.07 -> or frustrated navigation, and the DynamoDB read throttles.

2070.45 -> And there's my alarm rules.

2072.85 -> This is what Ania was talking about,

2074.65 -> having ANDs, or ORs, or both in the rules.

2081.97 -> So, you can see, as well, the history of this alarm.

2085.18 -> This alarm has been going off a lot,

2086.59 -> and we'll talk about that.

2089.5 -> And then I'm just gonna go to a dashboard that, kind of,

2092.44 -> shows why we might use this.

2096.04 -> So, at the moment one of these alarms is going off.

2098.29 -> So, we've got DynamoDB read throttles going off.

2103.69 -> Put your hand up if you had a really slow page load

2106.45 -> when you accessed the site.

2109.87 -> Anyone?

2111.73 -> Yeah, cool. That's on purpose.

2116.68 -> So, we've got some DynamoDB read throttles,

2120.37 -> and this is a graph showing the alarm.

2123.82 -> And we've got tolerated or frustrated navigation.

2126.22 -> We've got another graph showing alarm.

2128.83 -> So, I then correlated this but-

2130.9 -> And this is in the bottom-left.

2133.48 -> And you can see that the correlation

2136.15 -> is not immediately obvious.

2138.01 -> So, what I actually used is CloudWatch metric maths

2141.34 -> to multiply the frustrated navigation by 10.

2147.7 -> And the reason behind that is, when the application is

2152.5 -> calling DynamoDB and it's being throttled,

2155.26 -> it will attempt to retry up to nine times.

2157.87 -> So, if I multiply it by 10,

2161.74 -> we should see, roughly, the correlation.

2164.14 -> And here you can see we've got this correlation.

2167.32 -> When we're having read throttles, we are getting

2169.99 -> this poor navigation experience.
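
The scaling step can be sketched in Python as below. The data points are made up, and the factor of 10 follows the reasoning in the talk (one attempt plus up to nine retries). In CloudWatch metric math this would be an expression along the lines of `m1 * 10`, where `m1` is an assumed metric id. Note that multiplying a series by a constant does not change its correlation coefficient; it only brings the two series onto a comparable scale so the relationship is visible on one graph.

```python
def scale(series, factor):
    """Multiply every data point by a constant factor."""
    return [value * factor for value in series]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Made-up per-minute counts, for illustration only.
frustrated_navigation = [0, 1, 3, 2, 0, 4]
read_throttles = [0, 9, 31, 18, 1, 42]

scaled = scale(frustrated_navigation, 10)
print(scaled)
print(round(pearson(scaled, read_throttles), 3))
```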

2172.33 -> And when both of these alarms go off,

2174.55 -> so either side of this middle one,

2177.04 -> that will then cause this alarm in the middle to go off.

2180.91 -> And that's what I care about.

2182.26 -> I might not care if we get the occasional amount of

2185.5 -> read throttling, or the occasional frustrated navigation,

2189.82 -> but if one is causing the other,

2192.58 -> I really wanna know about that so I can fix it.

2195.79 -> And we will do that in a minute.

2198.94 -> - So, this is something by the way we use,

2200.59 -> we practice ourselves quite a bit to reduce alarm noise,

2204.01 -> is anomalies by themselves are not really worth looking at.

2209.71 -> It's very important to understand if they're in the context

2212.47 -> of something that relates to a customer experience.

2215.77 -> That really helps our teams deal with just volume of alarms.

2223.57 -> - Cool, so, once you have a metric and you've identified,

2229.021 -> you know, you've got a problem, like we've done just then.

2232.63 -> You've kind of, then you've got a race against time.

2235.526 -> And in a distributed system particularly,

2238.06 -> how do you find out quickly why there's a problem?

2242.47 -> And part of being able to do this is having a good

2245.71 -> logging and tracing strategy.

2248.53 -> Having data from distributed systems in one single console

2253.3 -> is gonna help you get to the root cause more quickly,

2256.48 -> rather than having to look in numerous different places

2258.82 -> or even having to log onto instances

2260.56 -> and see what's going on.

2263.23 -> So, there's a couple of things you can do

2264.82 -> once you've got all this as well.

2268.09 -> You can use services which offer additional insights.

2270.43 -> So, we've got things like,

2272.71 -> well you've seen Contributor Insights.

2275.05 -> We've got Container Insights,

2276.58 -> which gives you some more information about containers.

2279.25 -> Lambda Insights, just gives you more information

2282.46 -> about your Lambda functions.

2284.83 -> And then we've also got ServiceLens,

2286.01 -> which is a great place to start to visualize issues

2289.18 -> which have been correlated from your

2292.6 -> metrics, traces, and logs.

2294.58 -> And the key thing is here that, you know,

2296.333 -> a metric and then an alarm tells you that,

2299.68 -> maybe you've got a problem but it doesn't tell you where

2302.38 -> the problem is, and that's where you need

2304.27 -> logging and tracing.

2310.33 -> So, logging and tracing help you get to that root cause

2313.33 -> more quickly and hopefully either reduce the impact

2316.63 -> or get rid of it completely.

2319.18 -> And as I said, metrics help you.

2321.463 -> They might help you identify an issue,

2323.44 -> they might help you with predicting trends.

2327.04 -> But it's logs and traces that complete the picture.

2331.24 -> And I'm sure we all know what logs are,

2333.46 -> but I'll say it anyway, they're an immutable record

2335.95 -> of an event that's taking place in your application.

2340.42 -> And you know, we've all, I'm sure, looked at logs

2343.39 -> to try and investigate why there's been a problem.

2346.42 -> And that's one place where you can deep dive

2348.01 -> and find out why there's been an issue.

2350.8 -> Traces on the other hand, they take a user centric view

2353.83 -> or transaction centric view of the path the request takes

2358.63 -> through your services.

2360.85 -> And they can tell you the impact,

2362.32 -> either to the path, or to the user.

2365.35 -> So, we've got a couple of examples here.

2369.46 -> So, in this logs example on the left,

2374.23 -> this is from an Apache web server and I'm just looking

2379.39 -> for events with 403 or 404 and this might tell me where

2383.11 -> I've got, potentially where I've got bots

2385.45 -> coming to the site.

2388 -> And this is where it's also important to use, if you can,

2392.59 -> to use structured logs.

2395.14 -> And in the case of Apache web server, and there are lots

2397.72 -> of other applications like this,

2400.033 -> this is a one line change to the config file

2403.087 -> in Apache web server to enable structured logs.

2408.04 -> And this makes it much easier to query your logs,

2412.18 -> to use Contributor Insights,

2415.08 -> and also to extract metrics from your logs.

2418.45 -> And you can do that using log metric filters.
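
As a small illustration of why structured logs are easier to query, here is a Python sketch that counts 403 and 404 responses per client from JSON log lines. The field names (`remote_ip`, `request`, `status`) and the sample lines are assumptions for this sketch, not the format from the demo. In CloudWatch, a comparable metric filter on JSON logs would use a pattern of the form `{ $.status = 404 }`.

```python
import json
from collections import Counter

# Made-up structured access-log lines; field names are assumptions.
LOG_LINES = [
    '{"remote_ip": "203.0.113.7", "request": "GET /admin", "status": 403}',
    '{"remote_ip": "203.0.113.7", "request": "GET /wp-login.php", "status": 404}',
    '{"remote_ip": "198.51.100.2", "request": "GET /films", "status": 200}',
    '{"remote_ip": "203.0.113.7", "request": "GET /.env", "status": 404}',
]


def count_client_errors(lines, statuses=(403, 404)):
    """Count matching responses per client IP from JSON-formatted log lines."""
    counts = Counter()
    for line in lines:
        event = json.loads(line)  # no regex needed: every field is addressable by name
        if event["status"] in statuses:
            counts[event["remote_ip"]] += 1
    return counts


suspects = count_client_errors(LOG_LINES)
print(suspects.most_common(1))
```

A client that accumulates many 403s and 404s in a short window is a reasonable candidate for the bot traffic mentioned above.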

2420.94 -> And traces kind of stick all this information together

2424.03 -> and they allow you to correlate

2426.4 -> business transactions with events.

2430.57 -> So, I'm going to do a demo now troubleshooting

2433.33 -> why we've got this issue.

2438.94 -> Okay, so I hope you can see this.

2440.77 -> We've got a service map, here, of our application.

2445.18 -> So, just very quickly, is a pretty basic application.

2449.02 -> There's pretty much two or three parts to it.

2451.69 -> We've got some EC2 instances behind a load balancer

2455.53 -> in an auto scaling group.

2457.39 -> And basically they call DynamoDB and that's

2459.4 -> where the films are.

2461.53 -> And then we've got an API, which basically does

2465.4 -> the same thing, and that's using an API gateway,

2468.16 -> going to a lambda function, that's also making calls

2471.01 -> to the same DynamoDB database.

2475 -> On this map we can look at metrics as well.

2479.74 -> So, we can look at average latency.

2482.17 -> So, the average latency on my EC2 instances is 514

2485.98 -> milliseconds and we've got 72 transactions

2489.34 -> per minute going on.

2490.93 -> I can even click on a link going between two nodes

2497.32 -> on the map, and that's gonna show me

2501.07 -> the same kind of metrics.

2503.074 -> So, in this case it's, sorry,

2504.49 -> it's error rates or okay rates.

2507.16 -> In this case everything's okay, a hundred percent okay.

2509.253 -> And the transactions per minute between the two nodes.

2512.89 -> But what we're gonna do now is troubleshoot.

2515.62 -> And we can see from the legend here that,

2519.25 -> if I've got red, that means I've got a fault.

2522.37 -> So, I'm gonna click on this API gateway.

2528.43 -> And we can see now immediately, we've got some metrics with,

2532.84 -> for latency, for requests, and for faults.

2536.71 -> And we can see here we've got a lot of latency

2538.897 -> and we've got a lot of faults.

2541.18 -> And you can see that they line up, right?

2544.54 -> The latency lines up with the faults.

2547.48 -> So, what we can do from here is click on the faults

2550.99 -> and go to View Filtered Traces.

2555.07 -> So, it's automatically filled in a search for me.

2560.56 -> And I can see a list of all of the traces

2562.72 -> that have resulted in a fault.

2565.15 -> And now I just need to go to one of these,

2568.21 -> and I immediately go to an overview of this individual

2571.87 -> transaction that's taken place.

2574.39 -> In this case, it's a synthetic canary that's done

2577.99 -> the transaction, it's gone to the API gateway,

2581.47 -> gone to the lambda function,

2582.97 -> there's a call made to the DynamoDB service,

2585.22 -> and there's a call made to my DynamoDB table.

2590.38 -> We can then start to look at what's happening.

2593.29 -> Why have we got this fault?

2595.18 -> So, the fault is exposed in the canary,

2599.35 -> but it's actually downstream.

2601.12 -> Exposed in the API gateway stage,

2603.07 -> but it's actually downstream.

2605.05 -> And it's downstream in the lambda function,

2607.18 -> and it's when it's calling DynamoDB.

2610.75 -> So, when I look at this in more detail,

2612.31 -> I can get an overview here of the trace

2615.61 -> and when it happened.

2617.8 -> But the key here is, when I go to exceptions.

2621.04 -> And this exception tells me exactly what went wrong.

2624.1 -> So, I've gone from a service map, just doing a few clicks,

2627.7 -> and gone right down to the error message.

2630.82 -> So, I get an error here, and it's telling me, basically,

2635.59 -> I'm being throttled by DynamoDB.

2637.387 -> And that's my issue.

2640.27 -> It also tells, I've also got a stack trace,

2641.89 -> and that tells me exactly where in my lambda function

2645.76 -> the problem's occurring.

2646.75 -> So, I could look at the code if I needed to.

2649.45 -> In this case I don't need to, because I know

2651.19 -> it's a throttling issue.

2654.91 -> We've also got correlation here.

2657.58 -> So, we can also see every single log event that happened

2662.77 -> that's been correlated with the trace ID.

2667.66 -> And if I expand one of these,

2669.52 -> you'll see I've got that same error occurring

2673.15 -> in my log as well.

2676.93 -> So, we can now look at DynamoDB,

2682.482 -> and we can see here

2684.34 -> this chart on the very left shows how much provisioned

2690.19 -> capacity I've got, which is one.

2694.99 -> And then how much consumed, which is the blue line,

2697.48 -> which is way above it.

2699.01 -> So, it's peaking at about three

2701.8 -> and I've only got capacity for one.

2704.08 -> And this is causing lots of throttling requests.

2707.92 -> It's pretty much throttling all the time.

2710.59 -> So, what I need to do to fix this, is go to the settings

2718.96 -> and edit these capacity units.

2720.73 -> I could edit it, set it to 10.

2722.92 -> I could also do some other things.

2724.15 -> I could set it to autoscaling or set it to on demand,

2727.21 -> that would also fix it.

2729.52 -> Click Save, and within a minute or so that should

2733.9 -> completely fix the issue.
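
The same fix could be scripted rather than clicked through the console. The sketch below only builds the request payloads; the table name `films` and the write capacity of 1 are assumptions, while the read values (provisioned 1, peak of about 3, new value 10) come from the demo. The `ProvisionedThroughput` and `BillingMode` keys are parameters of the DynamoDB `UpdateTable` API.

```python
def is_under_provisioned(provisioned_units, peak_consumed_units):
    """True when peak consumption exceeds what the table is provisioned for."""
    return peak_consumed_units > provisioned_units


def provisioned_update(table_name, read_units, write_units):
    """Payload that raises provisioned capacity to fixed values."""
    return {
        "TableName": table_name,
        "ProvisionedThroughput": {
            "ReadCapacityUnits": read_units,
            "WriteCapacityUnits": write_units,
        },
    }


def on_demand_update(table_name):
    """Payload that switches the table to on-demand billing instead."""
    return {"TableName": table_name, "BillingMode": "PAY_PER_REQUEST"}


# Values from the demo: provisioned 1 read unit, consumption peaking at about 3.
print(is_under_provisioned(1, 3))

# "films" and the write capacity of 1 are assumptions for this sketch.
request = provisioned_update("films", read_units=10, write_units=1)
print(request)
# With AWS credentials configured, either payload could be sent with:
#   boto3.client("dynamodb").update_table(**request)
```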

2735.88 -> Obviously things aren't gonna be this easy all of the time,

2739.03 -> but essentially what I want to show you is that you

2742 -> can go in this ServiceLens map,

2748.03 -> from having an overview

2750.4 -> of your application, seeing where there are faults

2753.4 -> or errors, and then drilling down into the individual

2756.7 -> traces, and going straight to an error or a stack trace.

2762.13 -> And, nearly forgot to do this.

2766.03 -> X-Ray Insights also automatically

2769.63 -> discovers anomalies with your traces.

2773.05 -> So, here Insights has discovered

2776.83 -> that we've got this problem.

2778.96 -> It's even established the root cause of the issue.

2782.44 -> And if I click on this,

2786.97 -> I can see the root cause details.

2789.31 -> So, this has given me a map.

2791.38 -> This might be difficult to see, but it's given me a map,

2794.71 -> and it tells me here the root cause,

2798.387 -> and it tells me that it's DynamoDB.

2804.637 -> And I can now get back to the slides.

2809.8 -> - [Igor] One other technique, just to mention,

2811.9 -> that we use commonly to help with logs and traces.

2815.14 -> Alex talked about Apache logs, system logs,

2818.8 -> and then, you know, traces.

2821.05 -> Is to send in or upload your own custom events,

2824.11 -> the business critical events, say purchase completed,

2826.78 -> or you know, user deleted, or whatever.

2829.57 -> And then they actually really help in your investigations

2832.15 -> 'cause otherwise all those processes just look like, you know,

2836.2 -> A talks to B, now you can contextualize it quite a bit, so.

2840.22 -> - [Alex] And also just to add to that, when you are,

2841.69 -> when you're doing tracing, you can add

2843.91 -> annotations to traces.

2845.23 -> So, things like, if you wanna add a customer ID

2848.65 -> or something like that, you could add that to your trace.
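
A minimal Python sketch of the idea, using plain dictionaries in place of real trace segments. The segment names, the `customer_id` key, and the values are hypothetical. With the AWS X-Ray SDK for Python, the equivalent step is a `put_annotation(key, value)` call on the current segment or subsegment, and annotated traces can then be searched by that key.

```python
def annotate(segment, key, value):
    """Return a copy of the segment with an extra annotation attached."""
    updated = dict(segment)
    updated["annotations"] = {**segment.get("annotations", {}), key: value}
    return updated


def find_by_annotation(segments, key, value):
    """Return the segments whose annotation `key` equals `value`."""
    return [s for s in segments if s.get("annotations", {}).get(key) == value]


# Hypothetical segments and customer IDs, for illustration only.
segments = [
    annotate({"name": "get-films", "fault": True}, "customer_id", "c-1001"),
    annotate({"name": "get-films", "fault": False}, "customer_id", "c-2002"),
    {"name": "health-check", "fault": False},
]

matches = find_by_annotation(segments, "customer_id", "c-1001")
print(matches)
```

Being able to search by a business identifier turns "some request failed" into "this customer's request failed", which is usually the question a stakeholder is asking.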

2853.9 -> So, now you've got your metrics that your stakeholders

2856.54 -> care about that reflect what your customers care about,

2859.397 -> you know, the customer experience.

2864.37 -> And you've got dashboards for your stakeholders.

2868.42 -> They don't have to wait anymore.

2869.53 -> You don't have to spend, you know,

2870.76 -> your valuable time trying to find these metrics

2873.79 -> or running reports for people.

2876.88 -> You've added traces to help troubleshoot issues

2879.61 -> and get to the root cause quickly.

2881.8 -> And you've also added, oh- (laughs)

2886.63 -> You've also added important metrics

2887.693 -> so you can help identify potential issues.

2890.86 -> I thought I hadn't done the animation

2892.321 -> and then I think it had.

2893.89 -> So, everybody's happy and everybody's smiling.

2898.06 -> - Okay, so the key takeaways from today's session,

2901.45 -> although those CPU, RAM, and disks, and other

2905.29 -> technical metrics are important,

2908.08 -> they don't tell you anything about the customer experience,

2911.68 -> or they don't give you any insight into that.

2914.23 -> So, your customers are what's important,

2916.93 -> and their experience is what you should be

2919.15 -> monitoring and observing.

2921.82 -> This is why you should work backwards from your customers,

2924.85 -> work with your stakeholders, derive those KPIs,

2928.51 -> and collect metrics that actually matter.

2932.5 -> - And look, if you're gonna take away one thing

2935.23 -> from this session, it's this: speak to your stakeholders,

2939.22 -> even if you have to force them to speak to you,

2941.32 -> to find out what matters to your customer,

2944.59 -> because that's what you should be observing.

2949.81 -> Right, just on a side note,

2951.88 -> we have a new observability training course.

2954.28 -> We released it last week.

2957.64 -> It covers an introduction,

2960.52 -> it covers how to use the CloudWatch agent.

2962.89 -> So, how to install it, how to configure it.

2966.07 -> Some of the alerting and insights that we've shown you,

2970.27 -> more into application monitoring and all the application

2973.57 -> monitoring tools that we have.

2975.82 -> And also some of the open source tools that we have as well.

2980.08 -> I might be biased 'cause I created it,

2982.09 -> but it's well worth going and having a look.

2987.55 -> Please complete the session survey.

2990.16 -> We are also in the expo,

2992.59 -> so we're in the cloud ops section

2994 -> and we've got an observability stand there,

2997.93 -> and that QR code points you to

2999.94 -> other observability sessions.

3004.83 -> And yeah, thank you very much.

3006.75 -> Again, please don't forget to fill in the survey,

3009.54 -> and we'll either be here until we get-

3011.542 -> (audience applauds)

Source: https://www.youtube.com/watch?v=Ub3ATriFapQ