AWS re:Invent 2022 - Developing an observability strategy (COP302)

Do you have a plan for your observability? Do you understand what the stakeholders for your applications want to observe? As you move to the cloud or mature your operations within the cloud, it’s important to optimize your observability to help your stakeholders understand how your applications are operating. Join this session to learn how to define your observability strategy for the future in order to serve the requirements of all of your stakeholders and to help ensure that you can deliver successful business outcomes.

Learn more about AWS re:Invent at https://go.aws/3ikK4dD.


Content

1.83 -> - Okay. Hi everyone.
3.93 -> Welcome to developing an observability strategy.
9 -> So, my name's Alex.
11.55 -> I'm a specialist solutions architect,
13.65 -> specializing in observability.
16.68 -> Done just about over 20 years in IT.
20.37 -> Got a seven year old daughter.
22.56 -> I used to play ultimate Frisbee,
23.82 -> but I got too old and a bit too fat as well.
28.38 -> So, now I try and play, I play golf instead.
32.31 -> So, we're gonna talk about how to design an observability
36.03 -> strategy and we're also gonna go into a few
37.89 -> technical things as well.
41.01 -> - Hi, my name is Ania Develter.
42.99 -> I'm also a specialist solutions architect on the cloud
45.84 -> operations team here at AWS.
48.9 -> I have been in IT for over 20 years as well.
51.99 -> Three and a half of those years here at AWS.
54.96 -> I'm a mom of two, I've got a 20 and a 15 year old daughter,
58.71 -> or daughters, and also I'm a cat lover.
61.86 -> I actually have four cats and two dogs and I don't talk,
64.71 -> I don't stop talking about them, and they actually feature
68.04 -> in our observability workshop, that's how much I love them.
72.691 -> - Hi, I'm Igor Sedukhin.
74.22 -> I'm a general manager for application observability at AWS.
79.2 -> I'm, likewise, a dad of one great ten-year-old daughter,
84.24 -> and love snow.
85.44 -> So, I get to do it quite a bit, when I get back,
87 -> it seems like it's dumped.
90.06 -> So, talk a little bit about observability.
93.54 -> If there is any business of value that has to perform
97.35 -> to expectations, it pretty much has to be observed.
101.417 -> And so observability has to deliver and scale with your
105.9 -> business needs to billions of runtime instances
109.83 -> to millions of users.
112.89 -> It needs to be there when you need it the most.
116.19 -> It has to be as available or better than what it observes.
120.87 -> And more importantly, it actually needs to make your life
124.35 -> and cloud operations simpler.
126.72 -> And so we'll explore that aspect of how do you approach
130.8 -> that, how do you make it simple?
135.66 -> So, before we launch into the session,
138.15 -> let's talk a little bit about the context
140.79 -> in which observability exists.
142.47 -> Cloud operations.
143.97 -> Where in the AWS life cycle would you encounter
148.29 -> the need to observe something?
150.66 -> So, first, you're highly likely to set up your environment
155.07 -> for compliant operations.
156.99 -> So, create your AWS Organizations.
162.06 -> Set up IAM roles,
163.74 -> make sure you have compliance controls in the environment.
167.28 -> That was the first step.
168.9 -> Second, you're migrating or developing an application.
173.82 -> Maybe a serverless application directly in AWS,
177.63 -> and you'd use CloudFormation templates and all different
180.66 -> tools that we have to develop that application.
184.11 -> Once the application is in production,
186.87 -> you would highly likely need to observe
189.45 -> and see if it performs as expected.
192 -> And that's where you would encounter tools,
194.61 -> a family of tools in CloudWatch to be able to observe
198.48 -> and explore your application as it runs and serves users.
204.39 -> So, let's get into it.
207.12 -> - Okay, let's talk about the important metrics.
210.81 -> CPU, RAM, and disk metrics are the core measurements
215.25 -> of any system.
217.23 -> Even serverless services run on servers, right?
222.51 -> Yes, these metrics are important.
225.66 -> But are they actually important to your business
228.99 -> and your application?
231.03 -> You should only worry about those metrics for the purpose
234.45 -> of cost optimization, scaling, or capacity management.
239.34 -> But should you care about these metrics?
242.22 -> Do they tell you anything about how your application
244.83 -> is running, or more importantly, do they give you any
249 -> insights into your customer's experience?
254.37 -> Now, we're gonna break from convention here and ask you to get
258.21 -> your phone out and follow the QR code that you see
261.27 -> on the screen there.
262.77 -> It'll take you to a very simple film website.
266.43 -> We just ask you to browse around,
268.53 -> maybe search for a film by year.
271.222 -> - Yeah, I'm sorry, this is very basic.
272.91 -> You can blame me, I built this.
274.71 -> But it's gonna serve a purpose.
282.285 -> Right - Okay, so what- Sorry.
283.868 -> (laughter)
284.701 -> So, while you're browsing around, we would appreciate
287.16 -> some audience participation here.
289.92 -> Tell us about the metrics that you care about
293.22 -> in your organization.
294.39 -> And there's no right or wrong answer here.
296.73 -> - So, anyone shout out what's important?
299.227 -> - [Audience Member 1] Latency.
300.06 -> - Latency we've got.
301.65 -> - Anyone else?
304.428 -> - [Audience Member 2] Response time, throughput.
306.257 -> - Response time and throughput.
308.256 -> - [Audience Member 3] Lack of errors
309.51 -> - Lack of errors.
310.513 -> - [Audience Member 4] Freshness.
312.172 -> - Freshness. - Freshness.
313.505 -> - [Audience Member 5] Usage of data.
314.901 -> - Usage of data.
316.087 -> - [Audience Member 6] Hits per second.
318.12 -> - Sorry?
318.953 -> - [Audience Member 6] Hits per second.
319.786 -> - Hits per second. Thank you.
321.54 -> Anyone else?
324.51 -> - Another one about failures, I didn't quite-
327.96 -> - Price did you say?
329.587 -> - [Audience Member 7] 4xx and 5xx errors.
330.93 -> - Yeah, 4xx and 5xx errors, yeah.
333.27 -> - Hold on. This gentleman, I can't hear 'em.
336.24 -> - Cost. Yeah, cost, very important.
340.83 -> - Igor, do you wanna tell us more about
342.96 -> the metrics that your team cares about?
344.91 -> - Yeah, exactly.
346.02 -> At AWS, we run services for you.
348.51 -> And this is one of the most important questions
350.76 -> we actually ask our teams,
353.79 -> what is the most important metric for your,
356.46 -> for the business that that service performs,
358.47 -> in relation to you as a customer.
361.11 -> And so, as you know, it was mentioned,
363.87 -> tends to be things like, does the console load,
367.56 -> is it available for you to use it?
370.8 -> Is the latency of the APIs or of the user experiences
374.19 -> adequate, right?
375.48 -> And is availability adequate?
378.09 -> And so what we do, well, you know,
382.35 -> with that metric is, we, in fact,
384.57 -> have a company-wide meeting every week
388.26 -> where we spin the wheel, and maybe you've read about it,
392.7 -> and there's an AWS wheel.
394.86 -> And it picks the service, one or two, that will show
398.85 -> to the company what metrics they observe and whether or not
403.53 -> the operations are healthy, right?
405.78 -> And we do this kind of friendly inspection of each other
410.37 -> to ensure that, you know, these metrics,
413.19 -> the key metrics that we observe on your behalf, right,
418.53 -> are actually, indeed, serving the needs of that business,
422.18 -> of that service.
423.48 -> So, this is the most critical question.
427.5 -> - [Ania] Okay. And what about amazon.com Alex?
430.65 -> - Yeah, so I guess the most important metric
434.34 -> in terms of amazon.com or any of our other retail sites
438.48 -> is how many orders are we getting?
442.14 -> How many orders are we processing?
444.03 -> And if that falls below a certain threshold,
446.07 -> then we know that there's a problem.
448.41 -> And this should become obvious why we're talking about this
451.92 -> and why that's an important metric.
453.87 -> - Okay, so we're gonna take this back to the context
456.45 -> of that film website that you just browsed.
458.55 -> So, if you're still browsing, you can stop searching
460.41 -> for films now, thank you.
461.82 -> And Alex is gonna show you the metrics that we care about.
468.09 -> - Okay, so here I've got a, I've got a dashboard.
472.65 -> And I'll just go through what we've got on this dashboard.
476.58 -> So, we care about availability,
480.39 -> we've got a hundred percent availability, that's all good.
483.9 -> Then I've got some SLA statuses here.
487.2 -> So, there's two, there's three SLAs here.
491.73 -> Availability SLA.
493.59 -> We've got page load SLAs.
495.24 -> We want 99% of our page loads to happen within two seconds.
501 -> And we've also got an API response SLA.
503.16 -> We want our APIs to respond, again, 99% of them to respond
507.15 -> within a second.
509.46 -> I've got some other metrics here that I care about.
511.44 -> So, I've got sessions in the last hour.
514.53 -> Searches in the last hour.
517.02 -> So, these are other, kind of, important,
520.05 -> would be important business metrics if the website
523.32 -> that you looked at was a business.
526.65 -> So, when we look at our page load times,
529.86 -> I've got these separated out into different percentiles.
533.28 -> Percentiles are really important.
535.26 -> If you don't use percentiles, averages can hide problems.
540.39 -> So, we can see here at P50,
543.6 -> and I'll explain what that is in a minute.
547.32 -> We've got 850 milliseconds for page load time.
550.53 -> Great.
551.55 -> So, what this means is that 50% of our users,
555.48 -> the maximum page load time for them is 850 milliseconds.
560.91 -> At 90% it goes up to over a second.
564.96 -> And at 99% it goes up to 18.8 seconds.
569.76 -> That's definitely above the two second mark.
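To see why an average can look healthy while the tail is not, here is a minimal Python sketch. The sample load times are invented for illustration (they are not the dashboard's data), and `percentile` is a hypothetical helper using the nearest-rank method; CloudWatch computes its own percentile statistics and may interpolate differently.

```python
import math
from statistics import mean


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Multiply before dividing so whole-number ranks stay exact.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical page load times in seconds: 97 fast loads and 3 very slow ones.
load_times = [0.8] * 97 + [19.0] * 3

summary = {
    "mean": round(mean(load_times), 3),
    "p50": percentile(load_times, 50),
    "p90": percentile(load_times, 90),
    "p99": percentile(load_times, 99),
}
print(summary)
```

With this invented data the mean stays under the two second target while the 99th percentile does not, which is the effect the dashboard shows.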
573.24 -> And we're using CloudWatch real user monitoring
576.499 -> to get some of this data.
579.45 -> And what I have here on the right hand side is metrics
584.46 -> that come directly from real user monitoring.
586.83 -> Let me try and get rid of that.
590.46 -> So, we've got the percentage of page load times
592.98 -> that happen under two seconds.
594.66 -> We want that to be above 99%, but it's 96.7.
599.34 -> We've got some of them, 0.25%, happening
603.09 -> between two and eight seconds.
605.1 -> And we've got 3% that are over eight seconds.
607.95 -> So, there's something wrong.
608.91 -> We'll look at this, we'll look at this later.
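Bucket percentages like these can be derived from raw page load samples and compared with the 99% target. This is a minimal sketch, assuming invented sample data and hypothetical helper names (`bucket_shares`, `meets_sla`); only the two second and eight second boundaries and the 99% target come from the talk.

```python
def bucket_shares(load_times, fast_limit=2.0, slow_limit=8.0):
    """Return the percentage of samples under fast_limit, between the limits, and over slow_limit."""
    total = len(load_times)
    if total == 0:
        raise ValueError("load_times must not be empty")
    fast = sum(1 for t in load_times if t < fast_limit)
    slow = sum(1 for t in load_times if t > slow_limit)
    middle = total - fast - slow
    return {
        "under_fast": 100 * fast / total,
        "between": 100 * middle / total,
        "over_slow": 100 * slow / total,
    }


def meets_sla(shares, target_percent=99.0):
    """The SLA is met when enough page loads finish under the fast limit."""
    return shares["under_fast"] >= target_percent


# Hypothetical samples (seconds): 967 fast, 3 in the middle bucket, 30 slow.
samples = [0.9] * 967 + [4.0] * 3 + [12.0] * 30
shares = bucket_shares(samples)
print(shares, meets_sla(shares))
```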
613.11 -> I've got some metrics here from load balancer as well.
616.98 -> And these metrics are basically the same metrics,
620.49 -> but they're from our API.
623.19 -> So, again, if we look at 99%,
627.66 -> if we look at the 99th percentile,
629.13 -> so that's the maximum response time for-
632.91 -> if we take into account 99% of requests, it's 25.2 seconds.
637.98 -> Again, this is really bad.
640.83 -> And then the reason we ask you to log on
645.361 -> and have a look and play around is to show you
648.54 -> these metrics as well.
650.4 -> So, you can see here at the top, we've got 47,000 requests
657.3 -> coming from the US.
659.25 -> Some of those, or a lot of those,
661.02 -> and you'll see a lot of these at the top,
665.61 -> They're coming from some containers I've got running
668.37 -> in the background generating traffic.
670.89 -> But it looks like some people are using SIM cards that are
673.68 -> going back to their country.
675.6 -> 'Cause we've got, looks like Canada and Australia I think.
681.93 -> And then we've got browsers as well and operating systems.
686.43 -> So, we've got 2,250 hits from Android
693.48 -> and 2000 from iOS.
696.21 -> So, put your hand up if you are using an Android phone.
702.84 -> And who's using an iPhone?
706.23 -> Yeah, so I'd say that was roughly about half and half.
709.74 -> So, we know that works, which is good.
713.88 -> So, this is a kind of dashboard of the things that
719.28 -> I want to see if I'm running this very basic site.
724.26 -> We get some of this information from real user monitoring.
727.5 -> So, I'm just going to take you through
729.24 -> real user monitoring very quickly.
732.93 -> And this also demonstrates why averages
736.47 -> are sometimes not good.
738.51 -> So, we look at our average load time, it's one and a half,
742.53 -> 1.7 seconds, that seems like it's under our SLA.
747.54 -> But as you saw from the percentiles, it's not.
749.55 -> So, again, definitely needs percentiles.
753.3 -> Then we've got data here. This is web vitals data.
758.1 -> So, this gives us data about load performance,
761.19 -> first input delay, so that measures interactivity,
765.27 -> and cumulative layout shifts.
766.86 -> So, that measures how visually stable
770.902 -> your page is when it loads.
774.15 -> And then we've also got a breakdown of what happens
778.08 -> when the page is loading.
779.73 -> So, for our site here, time to first byte
783.484 -> is the biggest, and DOM processing time.
786.99 -> So, we could use this information to maybe cut down
791.46 -> on the time to first byte and DOM processing time.
795.48 -> And we can look at this by browser and device.
801.39 -> And I'm just gonna show you quickly the configuration,
804.6 -> because all you need to do to enable this is
807 -> this JavaScript snippet here.
809.34 -> You just add that into the head of your pages.
813.57 -> Okay, so I'm gonna go back to the presentation.
815.7 -> - It would be interesting to note that one of my teams,
818.34 -> actually, is RUM.
820.8 -> So, RUM is the team inside of the application observability,
824.79 -> you know, organization.
826.29 -> And we use RUM ourselves to observe RUM to know whether
831.03 -> it performs well for you in a very similar fashion,
833.43 -> where you can see the metrics of page load,
836.04 -> or reaction time, or-
838.32 -> And then, you know, we're able to trace it back
840.42 -> to the service behavior.
841.44 -> So, we use our own tools to make sure our services
844.92 -> run well for you.
848.67 -> - Okay, so this is Amazon's vision.
853.65 -> So, it's our vision statement.
855.6 -> And we strive to be Earth's most customer-centric company,
859.38 -> Earth's best employer, and Earth's safest place to work.
862.56 -> Why am I talking about vision?
864.72 -> So, anything that any of us do, any of us on stage,
868.74 -> or anything that any of you do, anyone in this room,
873 -> it should all align to your organization's vision.
876.39 -> That all of us, really basically, we're all employed
879.24 -> to fulfill the vision of our organization.
882.54 -> And for us, like I said, that's to be Earth's
884.283 -> most customer-centric company.
890.4 -> And the way that we strive to be Earth's most
893.28 -> customer-centric company is by starting with our customers
898.26 -> and working backwards.
900.54 -> So, when you think about observability,
902.67 -> you should think about what your customers want
904.89 -> and what they need.
907.62 -> And customer obsession is also
909.51 -> one of our leadership principles.
910.83 -> And we say that leaders start with their customers
913.437 -> and work backwards.
914.97 -> It's really fundamental to us, but it's also really
917.85 -> fundamental to how you should think about observability.
923.97 -> So, we wanna care-
926.85 -> We wanna think about what do our customers care about,
930.42 -> what do they need, what do they want?
932.4 -> And these are, basically, some examples
934.11 -> of what customers generally care about.
937.26 -> In this case, it's in the context of an e-commerce site.
941.22 -> So, you know, all of you, I imagine, have used
944.49 -> an e-commerce site, hopefully ours.
948.39 -> And you care about things like delivery.
950.79 -> You want your stuff to be delivered quickly.
955.65 -> Or at least know when it's gonna be delivered.
957.78 -> Obviously everybody cares about the price.
960.96 -> You care about security and privacy,
963.66 -> you care about how long the page takes to load,
966.57 -> and you also care if you can find the product
968.616 -> that you're actually looking for.
970.8 -> And things like page speed
973.77 -> really, really affect the business.
976.71 -> So, a 2017 study showed that 53% of mobile users
983.79 -> will just abandon the page if it takes
985.41 -> more than three seconds to load.
988.05 -> And I'm just giving that example of a thing
990.54 -> you need to think about in terms of how metric observations
996.27 -> can affect business outcomes.
1000.68 -> So, once you've kind of thought about what your customer
1003.74 -> requirements are, you need to map these requirements
1006.8 -> to stakeholders in your organization.
1010.1 -> So, you need to talk to them to understand
1013.52 -> the metrics that they care about that could
1016.64 -> impact customer requirements.
1018.86 -> And this is because we might not know, right?
1021.86 -> As technical people, we might not know what
1025.1 -> the customer requirements are.
1026.6 -> But the stakeholders, it's their job to know.
1028.82 -> It's their job to care about what the customers care about.
1033.733 -> And it might seem easy for something like
1035.87 -> an e-commerce site, probably most of it seems obvious,
1040.25 -> but I'm gonna give you another stat now.
1042.11 -> So, there's a Forrester Research report that said
1045.11 -> 43% of visitors, when they go to a site,
1048.86 -> they'll go straight to the search.
1052.34 -> And searches are two to three times more likely to convert
1056.45 -> compared to non searches.
1058.37 -> So, search is really, really important,
1060.68 -> and it has to work really, really well.
1063.175 -> And a bad search experience will take people to another site
1065.87 -> and it's gonna impact your business.
1068.33 -> Another example might be that a logistics manager,
1072.53 -> or again, a product manager might understand the impact
1076.82 -> of an item being out of stock.
1079.28 -> Obviously if an item's out of stock or it has
1082.07 -> a long delivery time, people are gonna be less likely
1084.47 -> to buy it.
1088.1 -> - Okay, so this might sound familiar to
1090.68 -> most people in this room, but very often business
1093.59 -> stakeholders pop up and ask questions that actually relate
1097.64 -> to metrics in your application.
1100.588 -> - Okay, Ania, can you tell me how many customers
1102.77 -> are abandoning their shopping carts because
1104.72 -> the delivery time is too long.
1106.55 -> - I'll have to look that up for you Alex.
1107.96 -> So, I'll have to get back to you later.
1110.84 -> - How much does it cost to run?
1113.64 -> - Igor, I am in the middle of a really important project.
1116.3 -> I will have to get back to you later.
1118.73 -> - How many attempted attacks have we had in the last month?
1122.27 -> - I'll get back to you shortly.
1124.67 -> I believe we do have that data.
1127.82 -> - How long does it take to load our front page?
1131.264 -> - Let me check my phone.
1135.587 -> - And how many times are we producing no search results
1138.08 -> when somebody searches for something?
1139.88 -> - We don't actually record that data, Alex, I'm sorry.
1142.04 -> - Really?
1143.84 -> - Okay, does this sound familiar?
1146.48 -> It's like a game of whack-a-mole.
1148.37 -> Finance asks how much, a product manager will ask
1151.61 -> how long to release a feature.
1153.56 -> But how do we get this information to your stakeholders?
1158 -> How do you even know what data to collect?
1161.69 -> You work backwards from the customer.
1168.11 -> Okay, so you want to make sure that your customers
1171.59 -> have a great customer experience.
1174.02 -> So, you start with those customer requirements
1176.33 -> and work backwards.
1178.07 -> You then work with your stakeholders because they understand
1181.01 -> those requirements really well.
1183.44 -> You then work together to derive KPIs.
1187.19 -> Once you have those KPIs,
1188.78 -> you identify metrics, which you then collect.
1193.28 -> Once you have the metrics you can alert, and you can act
1196.97 -> when business outcome or customer experience is at risk.
1200.84 -> And then once you've had to act,
1203 -> you then make improvements based on what was wrong.
1207.38 -> And that, in turn, improves customer experience and creates
1210.74 -> this virtuous cycle of continuous improvement and continuous
1216.38 -> improvement of customer experience.
1218.6 -> - And what's really, really important here
1220.73 -> is you can't do continuous improvement
1223.28 -> or generally make improvements to your products
1227.21 -> or your applications without this data
1230.15 -> unless you just happen to get lucky.
1232.79 -> Which can happen, but don't rely on luck.
1235.85 -> - Yeah, you can't improve what you don't measure.
1237.89 -> So, that's important.
1239.84 -> - Okay, so now that you understand why you need
1243.44 -> observability, how do you go about it?
1246.5 -> Now this is key.
1247.37 -> You need to develop a strategy.
1250.19 -> So, first think, what do I collect?
1253.76 -> So, we already talked about how important
1256.52 -> customer obsession, customer requirements are,
1260.03 -> how important it is to work with those stakeholders,
1263 -> and to derive those KPIs.
1265.34 -> And the KPIs will inform you what you need to collect.
1269.87 -> And then next, you need to ask yourself, in your strategy,
1274.55 -> what do I observe?
1276.77 -> Well, first you're gonna have to identify the sources
1279.89 -> of the information of those metrics.
1282.74 -> All those things that will help you
1284.21 -> measure against those KPIs.
1287.24 -> Now, these could be in multiple places,
1289.07 -> but typically you'll find them in metrics,
1291.56 -> logs, or traces, or all of them at once.
1294.32 -> But it is really important that the data
1296.63 -> that you care about the most is that data
1299.15 -> that gives you insights into your customer experience.
1304.73 -> Once you have that data,
1306.86 -> you then develop a strategy on, how do you act?
1311.57 -> You need to develop an alerting strategy so that you
1315.74 -> are aware when outcomes are at risk.
1319.16 -> You also have to evaluate business impact.
1322.67 -> And you also have to plan, or at least define a strategy
1325.97 -> of how you intend to act and what your expectations are.
1330.14 -> And your strategy should allow you to observe your
1333.32 -> application based on its purpose to your customers.
1341.75 -> - Okay, so, as well as having this strategy,
1344.42 -> you need to have a plan.
1345.65 -> How do I go about doing this?
1347.6 -> So, when you are designing your strategy,
1351.08 -> these are the, basically, the steps that we
1354.2 -> think you should take.
1355.79 -> So, work backwards from your customer.
1358.73 -> Again, I'm gonna repeat that.
1360.08 -> Work out what they want and what they need.
1363.74 -> Talk to the stakeholders that represent the requirements
1366.17 -> of your customer because they know what your customer needs.
1373.52 -> And that will help you to identify your KPIs.
1377.51 -> You can then use those KPIs to identify
1380.09 -> where you're gonna get the data from.
1383.24 -> And you might need to extract this data if-
1386.03 -> Obviously you're gonna need metrics and you might
1387.65 -> need to extract this from logs or traces.
1391.4 -> And from there build stakeholder dashboards
1393.8 -> to reflect these KPIs.
1395.483 -> Then you don't have to play this whack-a-mole game.
1400.19 -> And then design your alerts for when business outcomes
1403.58 -> are at risk.
1405.11 -> That's the critical thing.
1409.79 -> And then you need to design, or use an existing,
1412.58 -> alerting strategy that classifies impact and severity.
1416.33 -> So, obviously, things like business impact,
1418.85 -> business critical impact, business critical severity,
1421.28 -> and you might have low impact and low severity.
1425 -> And you also need to route alerts to people
1428.24 -> that can actually fix issues or make a decision.
1434.21 -> Every single alarm that you have should have a plan.
1438.17 -> So, either a runbook or, ideally for a technical issue,
1442.46 -> that should be automated or at least semi-automated.
1446.6 -> And when you've got non-technical issues,
1448.52 -> you should still alert stakeholders.
1450.26 -> For example, if you've got a drop in orders in a certain
1454.19 -> country, you may want to alert the country manager,
1459.2 -> the managers for that country to let them know
1461.12 -> that they've got a drop in orders.
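The rules above (classify impact and severity, route to someone who can act, and give every alarm a plan) can be sketched as a small alarm catalog check. Everything here is hypothetical: the `AlarmDefinition` fields, the alarm names, the owners, and the runbook paths are illustrative assumptions, not an AWS data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlarmDefinition:
    name: str
    impact: str                     # e.g. "business-critical" or "low"
    severity: str                   # e.g. "critical" or "low"
    owner: str                      # someone who can fix the issue or make a decision
    runbook: Optional[str] = None   # documented or automated response, if any


def alarms_without_plan(alarms):
    """Names of alarms that have no runbook, so nobody knows what to do when they fire."""
    return [alarm.name for alarm in alarms if not alarm.runbook]


def route(alarm):
    """Page the owner for critical severity; otherwise send a lower-urgency notification."""
    channel = "page" if alarm.severity == "critical" else "notify"
    return f"{channel}:{alarm.owner}"


# Hypothetical catalog mixing technical and business alarms.
catalog = [
    AlarmDefinition("api-p99-latency", "business-critical", "critical",
                    "api-on-call", "runbooks/api-latency.md"),
    AlarmDefinition("orders-drop-by-country", "business-critical", "critical",
                    "country-manager"),
    AlarmDefinition("disk-usage-warning", "low", "low",
                    "infra-team", "runbooks/disk-cleanup.md"),
]

print(alarms_without_plan(catalog))
print([route(alarm) for alarm in catalog])
```

A check like `alarms_without_plan` could run as part of a review so that an alarm without a plan is caught before it pages anyone.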
1467.42 -> Okay, so we looked at some generic metrics on the dashboard
1471.98 -> earlier around, like, page load speed and response time.
1478.25 -> In the bottom right-hand corner of this slide,
1481.94 -> you'll see a chart that shows our search results
1485.99 -> being returned from our site.
1488.75 -> Now, technically this might not seem like
1492.08 -> an important metric, why do we care about this?
1494.66 -> It doesn't tell us if our application's running or not.
1498.32 -> And this is what we're talking about, in terms of,
1502.13 -> you know, it's not about CPU, RAM, and disk utilization.
1507.71 -> If...
1510.89 -> So, search is really important to this particular business
1515.69 -> or this application, this films application.
1519.92 -> And if people are searching for something on your website,
1522.92 -> and you don't have it, well maybe you should have it.
1525.08 -> Or maybe people are commonly misspelling a product
1528.71 -> and it returns no search results and then you
1531.86 -> might wanna rethink your search algorithm.
1533.84 -> So, you might want to go from naive pattern matching
1536.75 -> to natural language processing.
1540.47 -> So, I'm gonna talk a little bit about
1542.45 -> how we made this graph.
1543.92 -> So, this is using CloudWatch Contributor Insights
1547.94 -> and this analyzes specific fields in log events.
1551.78 -> And this allows you to identify outliers.
1554.72 -> This could be expensive if you had to slice and dice
1556.927 -> all of this information.
1559.49 -> And in fact, I've already spoken to two customers this week
1565.85 -> and they had a perfect,
1567.41 -> both had perfect use cases for using Contributor Insights,
1571.34 -> wanted to look at millions of devices and they wanted
1574.7 -> to know or be alerted when they were having lots of errors
1579.86 -> with a specific device.
1581.63 -> That would be hugely expensive if you were collecting
1584.18 -> metrics for millions of devices.
1587 -> But, by using Contributor Insights, you can slice and dice
1590.63 -> this information at a much lower cost.
1594.41 -> So, in this case, we're using it to count the total
1596.93 -> number of searches for each year that's being
1600.92 -> searched for in our application.
1603.98 -> Because we have a finite number of years to search,
1607.37 -> we can identify searches that aren't yielding any results.
1611.48 -> So, if you search for a year,
1614.54 -> I can't remember which ones yielded no results,
1616.76 -> but let's say you searched for 1947 and you didn't get any
1620.33 -> films from 1947, but you really wanted to buy a film
1623.3 -> from then, then that would be bad news,
1626.9 -> and we might want to do something.
1630.83 -> It wouldn't be technical, but something business related
1633.35 -> to improve it.
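The idea of ranking the values of one log field, and spotting the values that yield no results, can be illustrated locally. This is a plain-Python sketch of that kind of analysis, not the Contributor Insights API; the JSON log shape (`year`, `results`) and the sample lines are assumptions about what such a search log might look like.

```python
import json
from collections import Counter

# Hypothetical search log events, one JSON object per line.
log_lines = [
    '{"year": 1994, "results": 12}',
    '{"year": 1947, "results": 0}',
    '{"year": 1994, "results": 12}',
    '{"year": 1947, "results": 0}',
    '{"year": 2001, "results": 7}',
    '{"year": 1947, "results": 0}',
]


def summarize_searches(lines):
    """Count searches per year and, separately, searches per year that returned nothing."""
    searches = Counter()
    empty = Counter()
    for line in lines:
        event = json.loads(line)
        searches[event["year"]] += 1
        if event["results"] == 0:
            empty[event["year"]] += 1
    return searches, empty


searches_by_year, empty_by_year = summarize_searches(log_lines)
print(searches_by_year.most_common(3))   # top contributors by search count
print(empty_by_year.most_common(3))      # years people want but the site does not have
```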
1634.79 -> And we use this a lot internally as well.
1637.43 -> So, how do you go about doing this?
1639.8 -> - Correct Alex.
1640.79 -> So, all of our services serve millions of customers, right?
1644.24 -> And we observe percentiles of aggregates, of course,
1648.68 -> to measure availability or compute latency.
1652.1 -> But it is very important for us to understand the specific
1656.3 -> issues or behaviors of each individual customer.
1658.73 -> And we deploy this approach to understand the outliers.
1661.76 -> Like, in a big set of customers,
1664.7 -> there may be one who experienced this particularly
1667.52 -> long latency because one of the services could have had
1670.73 -> a stuck partition or an instance, right?
1674.69 -> And so we deployed this approach ourselves to understand,
1678.05 -> you know, and better manage services for you, to make sure we
1682.52 -> understand each of you, not as an average, but individually.
1686.51 -> So.
1689.96 -> - Okay, we are gonna talk about designing dashboards next.
1692.57 -> So, we've added a QR code to an Amazon Builders' Library article
1696.5 -> called Building Dashboards for Operational Visibility.
1699.65 -> You should check it out. It's a great article.
1703.43 -> Okay, so when designing a dashboarding strategy,
1707.96 -> we already talked about stakeholder dashboards,
1710.36 -> and those stakeholders typically come to you
1712.4 -> and you know it's always at the most inconvenient time,
1714.56 -> you are in the middle of a project and they always
1716.72 -> need the data straight away.
1718.82 -> And it's important for them.
1720.44 -> So, when you're designing dashboards,
1722.87 -> keep in mind those stakeholder dashboards,
1724.73 -> they are really important.
1726.41 -> So, examples there would be, a cost dashboard for the CFO,
1730.25 -> or a service audit dashboard for the CISO.
1734.72 -> But the goal of a good observability strategy is to detect
1739.94 -> issues before they affect your customers.
1743.78 -> And dashboards are that human facing view
1747.35 -> into your system that summarize your system behavior.
1750.92 -> And this is why you need those other high level dashboards,
1754.13 -> those more application specific dashboards,
1756.74 -> such as customer experience dashboard
1759.08 -> or system or a service level dashboard.
1763.22 -> Okay, so, the main reason for this talk is to create
1766.04 -> an observability strategy centered around your customers.
1769.34 -> But those technical dashboards are also important
1772.43 -> and can also affect business.
1774.59 -> And this is why you also need to design those low-level
1777.23 -> dashboards, those dependency dashboards,
1779.78 -> infrastructure dashboards, microservice dashboards.
1783.71 -> And this is where you will see those
1785.3 -> CPU, RAM, and disk metrics.
1791.84 -> Now onto designing alarms.
1794.45 -> So, when you're coming up with your alarm alert strategy,
1798.05 -> be very deliberate about what you alert on,
1801.5 -> because this will have a massive impact
1804.71 -> on your business outcome.
1807.5 -> If you alert too little, you will lose visibility.
1811.49 -> But if you alert too much, you could also lose visibility.
1817.67 -> So first, define a strategy.
1820.34 -> What is a warning? What is an alarm?
1823.22 -> And what should be an actionable alert?
1826.28 -> Now alerts must be actionable.
1829.13 -> Why would you alert if you don't want to take action?
1831.8 -> So, I always apply that 3:00 AM rule to all the alerts.
1835.01 -> If you got that alert at three in the morning,
1837.5 -> would you read the information, turn over, and think,
1839.63 -> I can sort that tomorrow?
1841.76 -> Perhaps that shouldn't be an alert, then.
1844.91 -> So, also make sure that those notifications are meaningful
1849.11 -> and that they are received and routed to the right people.
1853.61 -> And then define expected actions with the use of playbooks,
1858.11 -> so that anyone who's responsible for responding to an alert,
1862.43 -> no matter how long they've been with the organization,
1865.46 -> how well they understand the business or the application,
1868.61 -> it's so that they know exactly the process
1871.04 -> that you expect them to follow.
1873.38 -> And then, where possible, remediate with the use of runbooks
1877.22 -> for automated or semi-automated remediation.
1883.46 -> Now, alert fatigue and alarm noise can be a real business
1888.56 -> impacting issue, because if you have too many alarms,
1892.67 -> it is really easy to miss something important.
1895.79 -> So, where possible create meaningful alarms.
1899.51 -> So, in CloudWatch we've got composite alarms,
1901.7 -> and they allow you to use AND, OR, or NOT operators
1905.9 -> to combine multiple metric alarms
1908.06 -> to create those more meaningful alerts.
1912.2 -> So, for example, a lower number of orders might be okay,
1917.3 -> but actually if it's, at the same time there's
1919.73 -> a lower number of search results, that might indicate
1922.643 -> that there is a customer facing issue in your application.
1926.75 -> So, by using an AND operator in a composite alarm,
1930.8 -> you can be alerted when the thresholds for orders
1934.64 -> and for search results both fall below a certain level.
1938 -> So, you will only be alerted then
1939.5 -> as you understand that there's a correlation.
1941.6 -> So, you don't need to know when just one is below its level,
1944.93 -> only when both are below their levels.
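The orders and search results example can be sketched as a composite alarm rule. This is a minimal sketch, assuming two child metric alarms already exist; the alarm names `orders-below-threshold` and `search-results-below-threshold` and the helper `all_in_alarm` are hypothetical. The `ALARM("...")` and `AND` syntax is the CloudWatch composite alarm rule expression format.

```python
def all_in_alarm(*alarm_names: str) -> str:
    """Build a composite alarm rule that is true only when every child alarm is in ALARM."""
    if not alarm_names:
        raise ValueError("at least one alarm name is required")
    # Each child alarm is referenced by name and joined with the AND operator.
    return " AND ".join(f'ALARM("{name}")' for name in alarm_names)


# Hypothetical child alarm names, used for illustration only.
rule = all_in_alarm("orders-below-threshold", "search-results-below-threshold")
print(rule)
```

The resulting string is what would be supplied as the rule of the composite alarm, so a drop in only one of the two metrics does not notify anyone.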
1946.82 -> Now, another feature of composite alarms
1949.37 -> is alarm suppression.
1951.62 -> So, here you can create an alarm, for example,
1954.95 -> to indicate there is a deployment in progress,
1958.37 -> and that will act as a suppressor in the composite alarm.
1963.89 -> And when that suppressor alarm is in the alarm state,
1968.15 -> any other alarm in that composite alarm is suppressed,
1972.29 -> no alert is generated, and no action is taken.
1976.07 -> And you can also nest multiple composite alarms
1979.7 -> for those complex scenarios, such as disaster recovery.
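Alarm suppression during a deployment could be sketched as follows. The function only builds the request parameters for the CloudWatch `PutCompositeAlarm` API (for example, to pass to boto3's `put_composite_alarm`); it makes no AWS call. The alarm names and the 60 second wait and extension periods are assumptions chosen for illustration.

```python
def composite_alarm_request(name, rule, suppressor, wait_seconds=60, extension_seconds=60):
    """Return PutCompositeAlarm parameters with an actions suppressor attached."""
    return {
        "AlarmName": name,
        "AlarmRule": rule,
        # While the suppressor alarm is in ALARM, the composite alarm's actions are held back.
        "ActionsSuppressor": suppressor,
        # How long to wait for the suppressor to enter ALARM before running actions anyway.
        "ActionsSuppressorWaitPeriod": wait_seconds,
        # How long to keep suppressing after the suppressor leaves ALARM.
        "ActionsSuppressorExtensionPeriod": extension_seconds,
    }


# Hypothetical names: "deployment-in-progress" is an alarm you would set during a release.
request = composite_alarm_request(
    name="customer-facing-issue",
    rule='ALARM("orders-below-threshold") AND ALARM("search-results-below-threshold")',
    suppressor="deployment-in-progress",
)
```

Keeping the request as a plain dictionary makes it easy to review or unit test before anything is sent to CloudWatch.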
1983.72 -> And then lastly, one other way to reduce alarm noise
1986.66 -> would be to leverage anomaly detection alarms in CloudWatch.
1990.95 -> Now, these don't have those static thresholds
1993.11 -> that you set yourself.
1994.58 -> Instead they follow, or they alert or alarm when that value
2000.76 -> falls outside of the expected value,
2003.76 -> that's based on that anomaly detection model.
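An anomaly detection alarm replaces the static threshold with a band expression. The sketch below builds parameters in the shape the CloudWatch `PutMetricAlarm` API accepts for this kind of alarm; it does not call AWS. The namespace `ExampleShop`, the metric `OrdersPlaced`, and the evaluation settings are hypothetical values, not taken from the talk.

```python
def anomaly_alarm_request(alarm_name, namespace, metric_name, band_width=2, period=300):
    """Return PutMetricAlarm parameters that alarm when a metric leaves its expected band."""
    return {
        "AlarmName": alarm_name,
        # Alarm when the value is above the upper or below the lower edge of the band.
        "ComparisonOperator": "LessThanLowerOrGreaterThanUpperThreshold",
        "EvaluationPeriods": 3,
        # The threshold is the band expression below, not a fixed number.
        "ThresholdMetricId": "band",
        "Metrics": [
            {
                "Id": "m1",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {"Namespace": namespace, "MetricName": metric_name},
                    "Period": period,
                    "Stat": "Average",
                },
            },
            {
                "Id": "band",
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": "expected range",
                "ReturnData": True,
            },
        ],
    }


# Hypothetical custom metric, for illustration only.
alarm = anomaly_alarm_request("orders-anomaly", "ExampleShop", "OrdersPlaced")
```

A larger `band_width` widens the expected range, which trades sensitivity for fewer notifications.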
2008.38 -> - [Alex] Okay, so I'm gonna do another quick demo.
2013.78 -> Cool, okay, this one's gonna be pretty quick.
2016.84 -> So, I've highlighted some alarms here already.
2021.52 -> So, I've got three CloudWatch alarms here.
2024.55 -> I've got tolerated or frustration, frustrated navigation.
2028.6 -> This is basically when real user monitoring is telling me
2032.35 -> that I've got a page load speed of either between
2034.84 -> two and eight seconds or above eight seconds,
2037.36 -> which is the metric I showed in the dashboard
2039.51 -> at the beginning.
2041.65 -> I've also got an alarm that's telling me
2044.89 -> if I'm being throttled by DynamoDB.
2048.82 -> And then I've got a special type of alarm here that says,
2051.88 -> DynamoDB throttling affecting page load time.
2056.958 -> And this is a composite alarm.
2058.93 -> And this is what the composite alarm looks like.
2062.02 -> So, it contains the child alarms, so the tolerated
2066.07 -> or frustrated navigation, and the DynamoDB read throttles.
2070.45 -> And there's my alarm rules.
2072.85 -> This is what Ania was talking about,
2074.65 -> having ANDs, or ORs, or both in the rules.
2081.97 -> So, you can see, as well, the history of this alarm.
2085.18 -> This alarm has been going off a lot,
2086.59 -> and we'll talk about that.
2089.5 -> And then I'm just gonna go to a dashboard that, kind of,
2092.44 -> shows why we might use this.
2096.04 -> So, at the moment one of these alarms is going off.
2098.29 -> So, we've got DynamoDB read throttles going off.
2103.69 -> Put your hand up if you had a really slow page load
2106.45 -> when you accessed the site.
2109.87 -> Anyone?
2111.73 -> Yeah, cool. That's on purpose.
2116.68 -> So, we've got some DynamoDB read throttles,
2120.37 -> and this is a graph showing the alarm.
2123.82 -> And we've got tolerated or frustrated navigation.
2126.22 -> We've got another graph showing alarm.
2128.83 -> So, I then correlated this but-
2130.9 -> And this is in the bottom-left.
2133.48 -> And you can see that the correlation
2136.15 -> is not immediately obvious.
2138.01 -> So, what I actually used is CloudWatch metric math
2141.34 -> to multiply the frustrated navigation by 10.
2147.7 -> And the reason behind that is, when the application is
2152.5 -> calling DynamoDB and it's being throttled,
2155.26 -> it will attempt to retry up to nine times.
2157.87 -> So, if I multiply it by 10,
2161.74 -> we should see, roughly, the correlation.
2164.14 -> And here you can see we've got this correlation.
2167.32 -> When we're having read throttles, we are getting
2169.99 -> this poor navigation experience.
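The scaling step in this demo can be sketched as a set of metric queries with one math expression. The structure follows the CloudWatch `GetMetricData` query format, and `ReadThrottleEvents` in the `AWS/DynamoDB` namespace is a standard DynamoDB metric. The RUM metric name, its dimension, the table name `films`, and the application name `example-app` are assumptions for illustration and may differ from what the demo used.

```python
def scaled_correlation_queries(table_name, app_name, scale=10, period=60):
    """Return metric queries that plot read throttles next to a scaled navigation metric."""
    return [
        {
            "Id": "throttles",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/DynamoDB",
                    "MetricName": "ReadThrottleEvents",
                    "Dimensions": [{"Name": "TableName", "Value": table_name}],
                },
                "Period": period,
                "Stat": "Sum",
            },
            "ReturnData": True,
        },
        {
            "Id": "frustrated",
            "MetricStat": {
                "Metric": {
                    # Assumed RUM metric and dimension names; check them against your account.
                    "Namespace": "AWS/RUM",
                    "MetricName": "NavigationFrustratedTransaction",
                    "Dimensions": [{"Name": "application_name", "Value": app_name}],
                },
                "Period": period,
                "Stat": "Sum",
            },
            # The raw series is hidden; only the scaled version is returned.
            "ReturnData": False,
        },
        {
            "Id": "scaled",
            # Metric math: multiply by 10 to account for one call plus up to nine retries.
            "Expression": f"frustrated * {scale}",
            "Label": f"Frustrated navigation x{scale}",
            "ReturnData": True,
        },
    ]


queries = scaled_correlation_queries("films", "example-app")
```

Scaling does not change the underlying relationship between the two series; it only puts them on a comparable axis so the correlation is visible on one graph.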
2172.33 -> And when both of these alarms go off,
2174.55 -> so either side of this middle one,
2177.04 -> that will then cause this alarm in the middle to go off.
2180.91 -> And that's what I care about.
2182.26 -> I might not care if we get the occasional amount of
2185.5 -> read throttling, or the occasional frustrated navigation,
2189.82 -> but if one is causing the other,
2192.58 -> I really wanna know about that so I can fix it.
2195.79 -> And we will do that in a minute.
2198.94 -> - So, this is something by the way we use,
2200.59 -> we practice ourselves quite a bit to reduce alarm noise,
2204.01 -> is anomalies by themselves are not really worth looking at.
2209.71 -> It's very important to understand if they're in the context
2212.47 -> of something that relates to a customer experience.
2215.77 -> That really helps our teams deal with just volume of alarms.
2223.57 -> - Cool, so, once you have a metric and you've identified,
2229.021 -> you know, you've got a problem, like we've done just then.
2232.63 -> You've kind of, then you've got a race against time.
2235.526 -> And in a distributed system particularly,
2238.06 -> how do you find out quickly why there's a problem?
2242.47 -> And part of being able to do this is having a good
2245.71 -> logging and tracing strategy.
2248.53 -> Having data from distributed systems in one single console
2253.3 -> is gonna help you get to the root cause more quickly,
2256.48 -> rather than having to look in numerous different places
2258.82 -> or even having to log onto instances
2260.56 -> and see what's going on.
2263.23 -> So, there's a couple of things you can do
2264.82 -> once you've got all this as well.
2268.09 -> You can use services which offer additional insights.
2270.43 -> So, we've got things like,
2272.71 -> well, you've seen Contributor Insights.
2275.05 -> We've got Container Insights,
2276.58 -> which gives you some more information about containers.
2279.25 -> Lambda Insights, which gives you more information
2282.46 -> about your Lambda functions.
2284.83 -> And then we've also got ServiceLens,
2286.01 -> which is a great place to start to visualize issues
2289.18 -> which have been correlated from your
2292.6 -> metrics, traces, and logs.
2294.58 -> And the key thing is here that, you know,
2296.333 -> a metric and then an alarm tells you that,
2299.68 -> maybe you've got a problem but it doesn't tell you where
2302.38 -> the problem is, and that's where you need
2304.27 -> logging and tracing.
2310.33 -> So, logging and tracing help you get to that root cause
2313.33 -> more quickly and hopefully either reduce the impact
2316.63 -> or get rid of it completely.
2319.18 -> And as I said, metrics help you.
2321.463 -> They might help you identify an issue,
2323.44 -> they might help you with predicting trends.
2327.04 -> But it's logs and traces that complete the picture.
2331.24 -> And I'm sure we all know what logs are,
2333.46 -> but I'll say it anyway, they're an immutable record
2335.95 -> of an event that's taking place in your application.
2340.42 -> And you know, we've all, I'm sure, looked at logs
2343.39 -> to try and investigate why there's been a problem.
2346.42 -> And that's one place where you can deep dive
2348.01 -> and find out why there's been an issue.
2350.8 -> Traces on the other hand, they take a user centric view
2353.83 -> or transaction centric view of the path the request takes
2358.63 -> through your services.
2360.85 -> And they can tell you the impact,
2362.32 -> either to the path, or to the user.
2365.35 -> So, we've got a couple of examples here.
2369.46 -> So, in this logs example on the left,
2374.23 -> this is from an Apache web server and I'm just looking
2379.39 -> for events with 403 or 404 and this might tell me where
2383.11 -> I've got, potentially where I've got bots
2385.45 -> coming to the site.
2388 -> And this is where it's also important to use, if you can,
2392.59 -> to use structured logs.
2395.14 -> And in the case of Apache web server, and there are lots
2397.72 -> of other applications like this,
2400.033 -> this is a one line change to the config file
2403.087 -> in Apache web server to enable structured logs.
2408.04 -> And this makes it much easier to query your logs,
2412.18 -> to use Contributor Insights,
2415.08 -> and also to extract metrics from your logs.
2418.45 -> And you can do that using log metric filters.
2420.94 -> And traces kind of stick all this information together
2424.03 -> and they allow you to correlate
2426.4 -> business transactions with events.
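To illustrate why structured logs are easier to query, here is a minimal sketch that finds the clients generating the most 403 and 404 responses, similar in spirit to the logs example on the slide. The JSON field names (`client_ip`, `request`, `status`) and the sample lines are hypothetical; a real Apache JSON log format would use whatever fields you configure.

```python
import json
from collections import Counter

# Hypothetical structured access log lines, one JSON object per line.
SAMPLE_LOG_LINES = [
    '{"client_ip": "203.0.113.7", "request": "GET /admin", "status": 403}',
    '{"client_ip": "203.0.113.7", "request": "GET /wp-login.php", "status": 404}',
    '{"client_ip": "198.51.100.4", "request": "GET /films", "status": 200}',
    '{"client_ip": "198.51.100.9", "request": "GET /missing", "status": 404}',
]


def top_contributors(lines, statuses=(403, 404)):
    """Count matching events per client IP, most frequent first."""
    counts = Counter()
    for line in lines:
        event = json.loads(line)  # structured logs parse without fragile regexes
        if event["status"] in statuses:
            counts[event["client_ip"]] += 1
    return counts.most_common()


contributors = top_contributors(SAMPLE_LOG_LINES)
error_count = sum(count for _, count in contributors)  # a metric extracted from logs
```

With unstructured lines, the same query would depend on position-based parsing of each line, which tends to break when the log format changes.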
2430.57 -> So, I'm going to do a demo now troubleshooting
2433.33 -> why we've got this issue.
2438.94 -> Okay, so I hope you can see this.
2440.77 -> We've got a service map, here, of our application.
2445.18 -> So, just very quickly, it's a pretty basic application.
2449.02 -> There's pretty much two or three parts to it.
2451.69 -> We've got some EC2 instances behind a load balancer
2455.53 -> in an auto scaling group.
2457.39 -> And basically they call DynamoDB and that's
2459.4 -> where the films are.
2461.53 -> And then we've got an API, which basically does
2465.4 -> the same thing, and that's using an API Gateway,
2468.16 -> going to a Lambda function, that's also making calls
2471.01 -> to the same DynamoDB database.
2475 -> On this map we can look at metrics as well.
2479.74 -> So, we can look at average latency.
2482.17 -> So, the average latency on my EC2 instances is 514
2485.98 -> milliseconds and we've got 72 transactions
2489.34 -> per minute going on.
2490.93 -> I can even click on a link going between two nodes
2497.32 -> on the map, and that's gonna show me
2501.07 -> the same kind of metrics.
2503.074 -> So, in this case it's, sorry,
2504.49 -> it's error rates or okay rates.
2507.16 -> In this case everything's okay, a hundred percent okay.
2509.253 -> And the transactions per minute between the two nodes.
2512.89 -> But what we're gonna do now is troubleshoot.
2515.62 -> And we can see from the legend here that,
2519.25 -> if I've got red, that means I've got a fault.
2522.37 -> So, I'm gonna click on this API Gateway.
2528.43 -> And we can see now immediately, we've got some metrics with,
2532.84 -> for latency, for requests, and for faults.
2536.71 -> And we can see here we've got a lot of latency
2538.897 -> and we've got a lot of faults.
2541.18 -> And you can see that they line up, right?
2544.54 -> The latency lines up with the faults.
2547.48 -> So, what we can do from here is click on the faults
2550.99 -> and go to View Filtered Traces.
2555.07 -> So, it's automatically filled in a search for me.
2560.56 -> And I can see a list of all of the traces
2562.72 -> that have resulted in a fault.
2565.15 -> And now I just need to go to one of these,
2568.21 -> and I immediately go to an overview of this individual
2571.87 -> transaction that's taken place.
2574.39 -> In this case, it's a synthetic canary that's done
2577.99 -> the transaction, it's gone to the API Gateway,
2581.47 -> gone to the Lambda function,
2582.97 -> there's a call made to the DynamoDB service,
2585.22 -> and there's a call made to my DynamoDB table.
2590.38 -> We can then start to look at what's happening.
2593.29 -> Why have we got this fault?
2595.18 -> So, the fault is exposed in the canary,
2599.35 -> but it's actually downstream.
2601.12 -> Exposed in the API Gateway stage,
2603.07 -> but it's actually downstream.
2605.05 -> And it's downstream in the Lambda function,
2607.18 -> and it's when it's calling DynamoDB.
2610.75 -> So, when I look at this in more detail,
2612.31 -> I can get an overview here of the trace
2615.61 -> and when it happened.
2617.8 -> But the key here is, when I go to exceptions.
2621.04 -> And this exception tells me exactly what went wrong.
2624.1 -> So, I've gone from a service map, just doing a few clicks,
2627.7 -> and gone right down to the error message.
2630.82 -> So, I get an error here, and it's telling me, basically,
2635.59 -> I'm being throttled by DynamoDB.
2637.387 -> And that's my issue.
2640.27 -> It also tells, I've also got a stack trace,
2641.89 -> and that tells me exactly where in my Lambda function
2645.76 -> the problem's occurring.
2646.75 -> So, I could look at the code if I needed to.
2649.45 -> In this case I don't need to, because I know
2651.19 -> it's a throttling issue.
2654.91 -> We've also got correlation here.
2657.58 -> So, we can also see every single log event that happened
2662.77 -> that's been correlated with the trace ID.
2667.66 -> And if I expand one of these,
2669.52 -> you'll see I've got that same error occurring
2673.15 -> in my log as well.
2676.93 -> So, we can now look at DynamoDB,
2682.482 -> and we can see here
2684.34 -> this chart on the very left shows how much provisioned
2690.19 -> capacity I've got, which is one.
2694.99 -> And then how much consumed, which is the blue line,
2697.48 -> which is way above it.
2699.01 -> So, it's peaking at about three
2701.8 -> and I've only got capacity for one.
2704.08 -> And this is causing lots of throttling requests.
2707.92 -> It's pretty much throttling all the time.
2710.59 -> So, what I need to do to fix this, is go to the settings
2718.96 -> and edit these capacity units.
2720.73 -> I could edit it, set it to 10.
2722.92 -> I could also do some other things.
2724.15 -> I could set it to autoscaling or set it to on demand,
2727.21 -> that would also fix it.
2729.52 -> Click Save, and within a minute or so that should
2733.9 -> completely fix the issue.
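The two fixes mentioned here, raising provisioned capacity or switching to on-demand, can be sketched as parameters for the DynamoDB `UpdateTable` API. The function only builds the request and makes no AWS call. The table name `films` and the write capacity value are assumptions; the demo only mentions setting capacity to 10.

```python
def capacity_fix_request(table_name, read_units=None, write_units=None):
    """Return UpdateTable parameters: provisioned capacity if units are given, else on-demand."""
    if read_units is None and write_units is None:
        # On-demand billing removes the need to size capacity up front.
        return {"TableName": table_name, "BillingMode": "PAY_PER_REQUEST"}
    if read_units is None or write_units is None:
        # UpdateTable expects read and write units together when changing provisioned throughput.
        raise ValueError("provide both read_units and write_units")
    return {
        "TableName": table_name,
        "ProvisionedThroughput": {
            "ReadCapacityUnits": read_units,
            "WriteCapacityUnits": write_units,
        },
    }


# Hypothetical table name; raise read capacity from 1 to 10 as in the demo.
provisioned = capacity_fix_request("films", read_units=10, write_units=1)
on_demand = capacity_fix_request("films")
```

Auto scaling, the third option mentioned, is configured through a separate service and is not covered by this sketch.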
2735.88 -> Obviously things aren't gonna be this easy all of the time,
2739.03 -> but essentially what I want to show you is that you
2742 -> can go in this ServiceLens map,
2748.03 -> from having an overview
2750.4 -> of your application, seeing where there are faults
2753.4 -> or errors, and then drilling down into the individual
2756.7 -> traces, and going straight to an error or a stack trace.
2762.13 -> And, nearly forgot to do this.
2766.03 -> X-Ray Insights also automatically
2769.63 -> discovers anomalies with your traces.
2773.05 -> So, here insights has discovered
2776.83 -> that we've got this problem.
2778.96 -> It's even established the root cause of the issue.
2782.44 -> And if I click on this,
2786.97 -> I can see the root cause details.
2789.31 -> So, this has given me a map.
2791.38 -> This might be difficult to see, but it's given me a map,
2794.71 -> and it tells me here the root cause,
2798.387 -> and it tells me that it's DynamoDB.
2804.637 -> And I cannot get back to the slides.
2809.8 -> - [Igor] One other technique, just to mention,
2811.9 -> that we use commonly to help with logs and traces.
2815.14 -> Alex talked about Apache logs, system logs,
2818.8 -> and then, you know, traces.
2821.05 -> Is to send in or upload your own custom events,
2824.11 -> the business critical events, say purchase completed,
2826.78 -> or you know, user deleted, or whatever.
2829.57 -> And then they actually really help in your investigations
2832.15 -> 'cause then all those processes otherwise look, you know,
2836.2 -> A talks to B, now you can contextualize it quite a bit, so.
2840.22 -> - [Alex] And also just to add to that, when you are,
2841.69 -> when you're doing tracing, you can add
2843.91 -> annotations to traces.
2845.23 -> So, things like, if you wanna add a customer ID
2848.65 -> or something like that, you could add that to your trace.
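Custom business events and trace annotations both add business context to technical data. The sketch below shows one way to shape such an event as a structured log line that carries a trace ID and a few annotations. It is a plain Python illustration, not the X-Ray SDK; the event name, field names, and the sample trace ID are assumptions.

```python
import json
import time


def business_event(name, trace_id, **annotations):
    """Serialize a business event as one JSON log line that can be correlated with a trace."""
    record = {
        "event": name,
        "timestamp": int(time.time()),
        "trace_id": trace_id,        # lets the event be matched to the request's trace
        "annotations": annotations,  # searchable key/value context, e.g. a customer ID
    }
    return json.dumps(record, sort_keys=True)


# Hypothetical event, trace ID, and annotation values.
line = business_event(
    "purchase_completed",
    trace_id="1-5759e988-bd862e3fe1be46a994272793",
    customer_id="C-1042",
    order_total=59.90,
)
print(line)
```

With events like this alongside system logs, an investigation can start from "purchase completed for this customer" rather than from "A talks to B".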
2853.9 -> So, now you've got your metrics that your stakeholders
2856.54 -> care about that reflect what your customers care about,
2859.397 -> you know, the customer experience.
2864.37 -> And you've got dashboards for your stakeholders.
2868.42 -> They don't have to wait anymore.
2869.53 -> You don't have to spend, you know,
2870.76 -> your valuable time trying to find these metrics
2873.79 -> or running reports for people.
2876.88 -> You've added traces to help troubleshoot issues
2879.61 -> and get to the root cause quickly.
2881.8 -> And you've also added, oh- (laughs)
2886.63 -> You've also added important metrics
2887.693 -> so you can help identify potential issues.
2890.86 -> I thought I hadn't done the animation
2892.321 -> and then I think it had.
2893.89 -> So, everybody's happy and everybody's smiling.
2898.06 -> - Okay, so the key takeaways from today's session,
2901.45 -> although those CPU, RAM, and disks, and other
2905.29 -> technical metrics are important,
2908.08 -> they don't tell you anything about the customer experience,
2911.68 -> or they don't give you any insight into that.
2914.23 -> So, your customers are what's important,
2916.93 -> and their experience is what you should be
2919.15 -> monitoring and observing.
2921.82 -> This is why you should work backwards from your customers,
2924.85 -> work with your stakeholders, derive those KPIs,
2928.51 -> and collect metrics that actually matter.
2932.5 -> - And look, if you're gonna take away one thing
2935.23 -> from this session is, speak to your stakeholders,
2939.22 -> even if you have to force them to speak to you,
2941.32 -> to find out what matters to your customer,
2944.59 -> because that's what you should be observing.
2949.81 -> Right, just on a side note,
2951.88 -> we have a new observability training course.
2954.28 -> We released it last week.
2957.64 -> It covers an introduction,
2960.52 -> it covers how to use the CloudWatch agent.
2962.89 -> So, how to install it, how to configure it.
2966.07 -> Some of the alerting and insights that we've shown you,
2970.27 -> more into application monitoring and all the application
2973.57 -> monitoring tools that we have.
2975.82 -> And also some of the open source tools that we have as well.
2980.08 -> I might be biased 'cause I created it,
2982.09 -> but it's well worth going and having a look.
2987.55 -> Please complete the session survey.
2990.16 -> We are also in the expo,
2992.59 -> so we're in the cloud ops section
2994 -> and we've got an observability stand there,
2997.93 -> and that QR code points you to
2999.94 -> other observability sessions.
3004.83 -> And yeah, thank you very much.
3006.75 -> Again, please don't forget to fill in the survey,
3009.54 -> and we'll either be here until we get-
3011.542 -> (audience applauds)

Source: https://www.youtube.com/watch?v=Ub3ATriFapQ