AWS re:Invent 2022 - Developing an observability strategy (COP302)
Do you have a plan for your observability? Do you understand what the stakeholders for your applications want to observe? As you move to the cloud or mature your operations within the cloud, it’s important to optimize your observability to help your stakeholders understand how your applications are operating. Join this session to learn how to define your observability strategy for the future in order to serve the requirements of all of your stakeholders and to help ensure that you can deliver successful business outcomes.
Content
1.83 -> - Okay. Hi everyone.
3.93 -> Welcome to developing an
observability strategy.
9 -> So, my name's Alex.
11.55 -> I'm a specialist solutions architect,
13.65 -> specializing in observability.
16.68 -> I've done just over 20 years in IT.
20.37 -> Got a seven year old daughter.
22.56 -> I used to play ultimate Frisbee,
23.82 -> but I got too old and
a bit too fat as well.
28.38 -> So, now I try and play,
I play golf instead.
32.31 -> So, we're gonna talk about
how to design an observability
36.03 -> strategy and we're also
gonna go into a few
37.89 -> technical things as well.
41.01 -> - Hi, my name is Ania Develter.
42.99 -> I'm also a specialist solutions
architect on the cloud
45.84 -> operations team here at AWS.
48.9 -> I have been in IT for
over 20 years as well.
51.99 -> Three and a half of
those years here at AWS.
54.96 -> I'm a mom of two, I've got a
20 and a 15 year old daughter,
58.71 -> or daughters, and also I'm a cat lover.
61.86 -> I actually have four cats and
two dogs, and
64.71 -> I don't stop talking about
them, and they actually feature
68.04 -> in our observability workshop,
that's how much I love them.
72.691 -> - Hi, I'm Igor Sedukhin.
74.22 -> I'm a general manager for
application observability at AWS.
79.2 -> I'm a dad of one great
ten-year-old daughter,
84.24 -> and love snow.
85.44 -> So, I get to do it quite
a bit, when I get back,
87 -> it seems like it's dumped.
90.06 -> So, let's talk a little bit about observability.
93.54 -> If there is any business of
value that has to perform
97.35 -> to expectations, it pretty
much has to be observed.
101.417 -> And so observability has to
deliver and scale with your
105.9 -> business needs to billions
of runtime instances
109.83 -> to millions of users.
112.89 -> It needs to be there when
you need it the most.
116.19 -> It has to be as available or
better than what it observes.
120.87 -> And more importantly, it
actually needs to make your life
124.35 -> and cloud operations simpler.
126.72 -> And so we'll explore that
aspect of how do you approach
130.8 -> that, how do you make it simple?
135.66 -> So, before we launch into the session,
138.15 -> let's talk a little bit about the context
140.79 -> in which observability exists.
142.47 -> Cloud operations.
143.97 -> Where in the AWS life cycle
would you encounter
148.29 -> the need to observe something?
150.66 -> So, first, you're highly likely
to set up your environment
155.07 -> for compliant operations.
156.99 -> So, create your AWS organizations.
162.06 -> Set up IAM roles,
163.74 -> make sure you have compliance
controls in the environment.
167.28 -> That was the first step.
168.9 -> Second, you're migrating or
developing an application.
173.82 -> Maybe a serverless
application directly in AWS,
177.63 -> and you'd use CloudFormation
templates and all different
180.66 -> tools that we have to
develop that application.
184.11 -> Once the application is in production,
186.87 -> you would highly likely need to observe
189.45 -> and see if it performs as expected.
192 -> And that's where you
would encounter tools,
194.61 -> a family of tools in CloudWatch
to be able to observe
198.48 -> and explore your application
as it runs and serves users.
204.39 -> So, let's get into it.
207.12 -> - Okay, let's talk about
the important metrics.
210.81 -> CPU, RAM, and disk metrics
are the core measurements
215.25 -> of any system.
217.23 -> Even serverless services
run on servers, right?
222.51 -> Yes, these metrics are important.
225.66 -> But are they actually
important to your business
228.99 -> and your application?
231.03 -> You should only worry about
those metrics for the purpose
234.45 -> of cost optimization, scaling,
or capacity management.
239.34 -> But should you care about these metrics?
242.22 -> Do they tell you anything
about how your application
244.83 -> is running, or more importantly,
do they give you any
249 -> insights into your customer's experience?
254.37 -> Now, we're gonna break from
convention here and ask you to get
258.21 -> your phone out and follow
the QR code that you see
261.27 -> on the screen there.
262.77 -> It'll take you to a very
simple film website.
266.43 -> We just ask you
to browse around,
268.53 -> maybe search for a film by year.
271.222 -> - Yeah, I'm sorry, this is very basic.
272.91 -> You can blame me, I built this.
274.71 -> But it's gonna serve a purpose.
282.285 -> Right
- Okay, so what- Sorry.
283.868 -> (laughter)
284.701 -> So, while you're browsing
around, we would appreciate
287.16 -> some audience participation here.
289.92 -> Tell us about the metrics
that you care about
293.22 -> in your organization.
294.39 -> And there's no right or wrong answer here.
296.73 -> - So, anyone shout out what's important?
299.227 -> - [Audience Member 1] Latency.
300.06 -> - Latency we've got.
301.65 -> - Anyone else?
304.428 -> - [Audience Member 2]
Response time, throughput.
306.257 -> - Response time and throughput.
308.256 -> - [Audience Member 3] Lack of errors
309.51 -> - Lack of errors.
310.513 -> - [Audience Member 4] Freshness.
312.172 -> - Freshness.
- Freshness.
313.505 -> - [Audience Member 5] Usage of data.
314.901 -> - Usage of data.
316.087 -> - [Audience Member 6] Hits per second.
318.12 -> - Sorry?
318.953 -> - [Audience Member 6] Hits per second.
319.786 -> - Hits per second. Thank you.
321.54 -> Anyone else?
324.51 -> - Another one about
failures, I didn't quite-
327.96 -> - Price did you say?
329.587 -> - [Audience Member 7] 4xx and 5xx errors.
330.93 -> - Yeah, 4xx and 5xx errors, yeah.
333.27 -> - Hold on. This gentleman,
I can't hear 'em.
336.24 -> - Cost. Yeah, cost, very important.
340.83 -> - Igor, do you wanna tell us more about
342.96 -> the metrics that your team cares about?
344.91 -> - Yeah, exactly.
346.02 -> At AWS, we run services for you.
348.51 -> And this is one of the
most important questions
350.76 -> we actually ask our teams,
353.79 -> what is the most
important metric for your,
356.46 -> for the business that
that service performs,
358.47 -> in relation to you as a customer.
361.11 -> And so, as was mentioned,
363.87 -> it tends to be things like,
does the console load,
367.56 -> is it available for you to use it?
370.8 -> Is the latency of the APIs
or of the user experiences
374.19 -> adequate, right?
375.48 -> And is availability adequate?
378.09 -> And so what we do, well, you know,
382.35 -> with that metric is, we, in fact,
384.57 -> have a company-wide meeting every week
388.26 -> where we spin the wheel, and
maybe you've read about it,
392.7 -> and there's an AWS wheel.
394.86 -> And it picks the service,
one or two, that will show
398.85 -> to the company what metrics
they observe and whether or not
403.53 -> the operations are healthy, right?
405.78 -> And we do this kind of friendly
inspection of each other
410.37 -> to ensure that, you know, these metrics,
413.19 -> the key metrics that we
observe on your behalf, right,
418.53 -> are actually indeed serving
the needs of that business,
422.18 -> of that service.
423.48 -> So, this is the most critical question.
427.5 -> - [Ania] Okay. And what
about amazon.com, Alex?
430.65 -> - Yeah, so I guess the
most important metric
434.34 -> in terms of amazon.com or
any of our other retail sites
438.48 -> is how many orders are we getting?
442.14 -> How many orders are we processing?
444.03 -> And if that falls below
a certain threshold,
446.07 -> then we know that there's a problem.
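The "orders falling below a threshold" idea can be sketched as a small check. This is a minimal illustration only: the class name, the threshold of 100 orders per minute, and the three-period rule are assumptions made for the example, not how amazon.com actually alarms.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class OrderRateAlarm:
    """Hypothetical alarm on a business metric (orders per minute)."""

    threshold: float           # assumed minimum healthy orders per minute
    periods_to_alarm: int = 3  # assumed number of consecutive low periods

    def evaluate(self, orders_per_minute: Sequence[float]) -> bool:
        """Return True when the most recent periods are all below threshold."""
        if len(orders_per_minute) < self.periods_to_alarm:
            return False  # not enough data to decide yet
        recent = orders_per_minute[-self.periods_to_alarm:]
        return all(count < self.threshold for count in recent)


alarm = OrderRateAlarm(threshold=100)
sustained_drop = alarm.evaluate([120, 90, 80, 70])   # last three all low
single_dip = alarm.evaluate([120, 90, 130, 70])      # recovered in between
```

Requiring several consecutive low periods is one way to avoid alarming on a single noisy data point.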
448.41 -> And this should become obvious
why we're talking about this
451.92 -> and why that's an important metric.
453.87 -> - Okay, so we're gonna take
this back to the context
456.45 -> of that film website
that you just browsed.
458.55 -> So, if you're still browsing,
you can stop searching
460.41 -> for films now, thank you.
461.82 -> And Alex is gonna show you the
metrics that we care about.
468.09 -> - Okay, so here I've got
a, I've got a dashboard.
472.65 -> And I'll just go through what
we've got on this dashboard.
476.58 -> So, we care about availability,
480.39 -> we've got a hundred percent
availability, that's all good.
483.9 -> Then I've got some SLA statuses here.
487.2 -> So, there's two, there's three SLAs here.
491.73 -> Availability SLA.
493.59 -> We've got a page load SLA.
495.24 -> We want 99% of our page loads
to happen within two seconds.
501 -> And we've also got an API response SLA.
503.16 -> We want our APIs to respond,
again, 99% of them to respond
507.15 -> within a second.
509.46 -> I've got some other metrics
here that I care about.
511.44 -> So, I've got sessions in the last hour.
514.53 -> Searches in the last hour.
517.02 -> So, these are other, kind of, important,
520.05 -> would be important business
metrics if the website
523.32 -> that you looked at was a business.
526.65 -> So, when we look at our page load times,
529.86 -> I've got these separated out
into different percentiles.
533.28 -> Percentiles are really important.
535.26 -> If you don't use percentiles,
averages can hide problems.
540.39 -> So, we can see here at P50,
543.6 -> and I'll explain what that is in a minute.
547.32 -> We've got 850 milliseconds
for page load time.
550.53 -> Great.
551.55 -> So, what this means is
that 50% of our users,
555.48 -> the maximum page load time
for them is 850 milliseconds.
560.91 -> At 90% it goes up to over a second.
564.96 -> And at 99% it goes up to 18.8 seconds.
569.76 -> That's definitely above
the two second mark.
573.24 -> And we're using CloudWatch
real user monitoring
576.499 -> to get some of this data.
579.45 -> And what I have here on the
right hand side is metrics
584.46 -> that come directly from
real user monitoring.
586.83 -> Let me try and get rid of that.
590.46 -> So, we've got the percentage
of page load times
592.98 -> that happen under two seconds.
594.66 -> We want that to be above
99%, but it's 96.7.
599.34 -> We've got some of them, 0.25%, happening
603.09 -> between two and eight seconds.
605.1 -> And we've got 3% that
are over eight seconds.
607.95 -> So, there's something wrong.
608.91 -> We'll look at this,
we'll look at this later.
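The bucketed view on the dashboard (under two seconds, two to eight seconds, over eight seconds) can be sketched as follows. This is an assumption-laden illustration: the function names and the sample data are invented, and only the two-second target, the eight-second boundary, and the 99% goal come from the talk.

```python
def page_load_buckets(load_times_s, target_s=2.0, slow_s=8.0):
    """Split page load times (seconds) into three percentage buckets."""
    total = len(load_times_s)
    if total == 0:
        raise ValueError("no page load samples")
    fast = sum(1 for t in load_times_s if t < target_s)
    slow = sum(1 for t in load_times_s if t > slow_s)
    middle = total - fast - slow  # everything between the two boundaries
    return {
        "under_target_pct": 100 * fast / total,
        "between_pct": 100 * middle / total,
        "over_slow_pct": 100 * slow / total,
    }


def meets_sla(buckets, required_pct=99.0):
    """True when enough page loads finished under the target time."""
    return buckets["under_target_pct"] >= required_pct


# Made-up sample: 96 fast loads, 1 middling load, 3 very slow loads.
samples = [1.0] * 96 + [5.0] + [10.0] * 3
buckets = page_load_buckets(samples)
sla_ok = meets_sla(buckets)
```

With this sample the SLA check fails, which mirrors the situation on the dashboard where fewer than 99% of page loads finish within two seconds.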
613.11 -> I've got some metrics here
from the load balancer as well.
616.98 -> And these metrics are
basically the same metrics,
620.49 -> but they're from our API.
623.19 -> So, again, if we look at 99%,
627.66 -> if we look at the 99th percentile,
629.13 -> so that's the maximum response time for-
632.91 -> if we take into account 99% of
requests, it's 25.2 seconds.
637.98 -> Again, this is really bad.
640.83 -> And then the reason we asked you to log on
645.361 -> and have a look and play
around is to show you
648.54 -> these metrics as well.
650.4 -> So, you can see here at the
top, we've got 47,000 requests
657.3 -> coming from the US.
659.25 -> Some of those, or a lot of those,
661.02 -> and you'll see a lot of these at the top.
665.61 -> They're coming from some
containers I've got running
668.37 -> in the background generating traffic.
670.89 -> But it looks like some people
are using SIM cards that are
673.68 -> going back to their country.
675.6 -> 'Cause we've got, looks like
Canada and Australia I think.
681.93 -> And then we've got browsers
as well and operating systems.
686.43 -> So, we've got 2,250 hits from Android
693.48 -> and 2,000 from iOS.
696.21 -> So, put your hand up if you
are using an Android phone.
702.84 -> And who's using an iPhone?
706.23 -> Yeah, so I'd say that was
roughly about half and half.
709.74 -> So, we know that works, which is good.
713.88 -> So, this is a kind of
dashboard of the things that
719.28 -> I want to see if I'm running
this very basic site.
724.26 -> We get some of this information
from real user monitoring.
727.5 -> So, I'm just going to take you through
729.24 -> real user monitoring very quickly.
732.93 -> And this also demonstrates why averages
736.47 -> are sometimes not good.
738.51 -> So, we look at our average
load time, it's one and a half,
742.53 -> 1.7 seconds, that seems
like it's under our SLA.
747.54 -> But as you saw from the
percentiles, it's not.
749.55 -> So, again, you definitely need percentiles.
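To see how an average can sit under a two-second target while the tail is far over it, here is a minimal sketch. The sample data is made up for illustration, and the nearest-rank percentile method is an assumption; monitoring tools may compute percentiles differently.

```python
import math
from statistics import mean


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up page load times in seconds: most users are fast, a few are very slow.
load_times = [0.8] * 95 + [19.0] * 5

average = mean(load_times)
p50 = percentile(load_times, 50)
p90 = percentile(load_times, 90)
p99 = percentile(load_times, 99)
```

In this sample the average stays below two seconds because the 95 fast loads dominate it, while the 99th percentile lands on one of the very slow loads.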
753.3 -> Then we've got data here.
This is web vitals data.
758.1 -> So, this gives us data
about load performance,
761.19 -> first input delay, so
that measures interactivity,
765.27 -> and cumulative layout shifts.
766.86 -> So, that measures how visually stable
770.902 -> your page is when it loads.
774.15 -> And then we've also got a
breakdown of what happens
778.08 -> when the page is loading.
779.73 -> So, for our site here, time to first byte
783.484 -> is the biggest, and DOM processing time.
786.99 -> So, we could use this
information to maybe cut down
791.46 -> on the time to first byte
and DOM processing time.
795.48 -> And we can look at this
by browser and device.
801.39 -> And I'm just gonna show you
quickly the configuration,
804.6 -> because all you need
to do to enable this is
807 -> this JavaScript snippet here.
809.34 -> You just add that into
the head of your pages.
813.57 -> Okay, so I'm gonna go
back to the presentation.
815.7 -> - It would be interesting to
note that one of my teams,
818.34 -> actually, is RUM.
820.8 -> So, RUM is the team inside of
the application observability,
824.79 -> you know, organization.
826.29 -> And we use RUM ourselves to
observe RUM to know whether it
831.03 -> performs well for you in
a very similar fashion,
833.43 -> where you can see the
metrics of page load,
836.04 -> or reaction time, or-
838.32 -> And then, you know, we're
able to trace it back
840.42 -> to the service behavior.
841.44 -> So, we use our own tools
to make sure our services
844.92 -> run well for you.
848.67 -> - Okay, so this is Amazon's vision.
853.65 -> So, it's our vision statement.
855.6 -> And we strive to be Earth's
most customer-centric company,
859.38 -> Earth's best employer, and
Earth's safest place to work.
862.56 -> Why am I talking about vision?
864.72 -> So, anything that any of
us do, any of us on stage,
868.74 -> or anything that any of you
do, anyone in this room,
873 -> it should all align to
your organization's vision.
876.39 -> That all of us, really
basically, we're all employed
879.24 -> to fulfill the vision of our organization.
882.54 -> And for us, like I said,
that's to be Earth's
884.283 -> most customer-centric company.
890.4 -> And the way that we
strive to be Earth's most
893.28 -> customer-centric company is
by starting with our customers
898.26 -> and working backwards.
900.54 -> So, when you think about observability,
902.67 -> you should think about
what your customers want
904.89 -> and what they need.
907.62 -> And customer obsession is also
909.51 -> one of our leadership principles.
910.83 -> And we say that leaders
start with their customers
913.437 -> and work backwards.
914.97 -> It's really fundamental to
us, but it's also really
917.85 -> fundamental to how you should
think about observability.
923.97 -> So, we wanna care-
926.85 -> We wanna think about what
do our customers care about,
930.42 -> what do they need, what do they want?
932.4 -> And these are, basically, some examples
934.11 -> of what customers generally care about.
937.26 -> In this case, it's in the
context of an e-commerce site.
941.22 -> So, you know, all of
you, I imagine, have used
944.49 -> an e-commerce site, hopefully ours.
948.39 -> And you care about things like delivery.
950.79 -> You want your stuff to
be delivered quickly.
955.65 -> Or at least know when
it's gonna be delivered.
957.78 -> Obviously everybody cares about the price.
960.96 -> You care about security and privacy,
963.66 -> you care about how long
the page takes to load,
966.57 -> and you also care if
you can find the product
968.616 -> that you're actually looking for.
970.8 -> And things like page speed
973.77 -> really, really affect the business.
976.71 -> So, a 2017 study showed
that 53% of mobile users
983.79 -> will just abandon the page if it takes
985.41 -> more than three seconds to load.
988.05 -> And I'm just giving
that example of a thing
990.54 -> you need to think about in
terms of how metric observations
996.27 -> can affect business outcomes.
1000.68 -> So, once you've kind of thought
about what your customer
1003.74 -> requirements are, you need
to map these requirements
1006.8 -> to stakeholders in your organization.
1010.1 -> So, you need to talk to them to understand
1013.52 -> the metrics that they
care about that could
1016.64 -> impact customer requirements.
1018.86 -> And this is because we
might not know, right?
1021.86 -> As technical people,
we might not know what
1025.1 -> the customer requirements are.
1026.6 -> But the stakeholders,
it's their job to know.
1028.82 -> It's their job to care about
what the customers care about.
1033.733 -> And it might seem easy for something like
1035.87 -> an e-commerce site, probably
most of it seems obvious,
1040.25 -> but I'm gonna give you another stat now.
1042.11 -> So, there's a Forrester
Research report that said
1045.11 -> 43% of visitors, when they go to a site,
1048.86 -> they'll go straight to the search.
1052.34 -> And searches are two to three
times more likely to convert
1056.45 -> compared to non searches.
1058.37 -> So, search is really, really important,
1060.68 -> and it has to work really, really well.
1063.175 -> And a bad search experience
will take people to another site
1065.87 -> and it's gonna impact your business.
1068.33 -> Another example might be
that a logistics manager,
1072.53 -> or again, a product manager
might understand the impact
1076.82 -> of an item being out of stock.
1079.28 -> Obviously if an item's
out of stock or it has
1082.07 -> a long delivery time, people
are gonna be less likely
1084.47 -> to buy it.
1088.1 -> - Okay, so this might sound familiar to
1090.68 -> most people in this room,
but very often business
1093.59 -> stakeholders pop up and ask
questions that actually relate
1097.64 -> to metrics in your application.
1100.588 -> - Okay, Ania, can you
tell me how many customers
1102.77 -> are abandoning their
shopping carts because
1104.72 -> the delivery time is too long?
1106.55 -> - I'll have to look that up for you Alex.
1107.96 -> So, I'll have to get back to you later.
1110.84 -> - How much does it cost to run?
1113.64 -> - Igor, I am in the middle of
a really important project.
1116.3 -> I will have to get back to you later.
1118.73 -> - How many attempted attacks
have we had in the last month?
1122.27 -> - I'll get back to you shortly.
1124.67 -> I believe we do have that data.
1127.82 -> - How long does it take
to load our front page?
1131.264 -> - Let me check my phone.
1135.587 -> - And how many times are we
producing no search results
1138.08 -> when somebody searches for something?
1139.88 -> - We don't actually record
that data, Alex, I'm sorry.
1142.04 -> - Really?
1143.84 -> - Okay, does this sound familiar?
1146.48 -> It's like a game of whack-a-mole.
1148.37 -> Finance asks how much, a
product manager will ask
1151.61 -> how long to release a feature.
1153.56 -> But how do we get this
information to your stakeholders?
1158 -> How do you even know what data to collect?
1161.69 -> You work backwards from the customer.
1168.11 -> Okay, so you want to make
sure that your customers
1171.59 -> have a great customer experience.
1174.02 -> So, you start with those
customer requirements
1176.33 -> and work backwards.
1178.07 -> You then work with your
stakeholders because they understand
1181.01 -> those requirements really well.
1183.44 -> You then work together to derive KPIs.
1187.19 -> Once you have those KPIs,
1188.78 -> you identify metrics,
which you then collect.
1193.28 -> Once you have the metrics you
can alert, and you can act
1196.97 -> when business outcome or
customer experience is at risk.
1200.84 -> And then once you've had to act,
1203 -> you then make improvements
based on what was wrong.
1207.38 -> And that, in turn, improves
customer experience and creates
1210.74 -> this virtuous cycle of continuous
improvement and continuous
1216.38 -> improvement of customer experience.
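The chain described here (customer requirement, stakeholder, KPI, metric, alert) can be written down as data. This is only a sketch under stated assumptions: the `Kpi` class, the metric names, the stakeholder assignments, and the 5% zero-result threshold are hypothetical; only the 99% page load goal and the 96.7% reading come from the demo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kpi:
    """Hypothetical record linking a customer requirement to an alertable metric."""

    customer_requirement: str
    stakeholder: str
    metric_name: str
    threshold: float
    higher_is_better: bool

    def at_risk(self, observed: float) -> bool:
        """True when the observed value is on the wrong side of the threshold."""
        if self.higher_is_better:
            return observed < self.threshold
        return observed > self.threshold


KPIS = [
    Kpi("Pages load quickly", "Product manager",
        "page_load_under_2s_pct", 99.0, True),
    Kpi("I can find the product I want", "Product manager",
        "zero_result_search_pct", 5.0, False),  # assumed threshold
]


def kpis_at_risk(kpis, observations):
    """Return the metric names whose latest observation breaches its threshold."""
    return [
        kpi.metric_name
        for kpi in kpis
        if kpi.metric_name in observations
        and kpi.at_risk(observations[kpi.metric_name])
    ]


breached = kpis_at_risk(
    KPIS, {"page_load_under_2s_pct": 96.7, "zero_result_search_pct": 1.2}
)
```

Keeping the requirement and stakeholder next to each metric makes it easier to answer "who cares about this alert, and why" when it fires.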
1218.6 -> - And what's really, really important here
1220.73 -> is you can't do continuous improvement
1223.28 -> or generally make
improvements to your products
1227.21 -> or your applications without this data
1230.15 -> unless you just happen to get lucky.
1232.79 -> Which can happen, but don't rely on luck.
1235.85 -> - Yeah, you can't improve
what you don't measure.
1237.89 -> So, that's important.
1239.84 -> - Okay, so now that you
understand why you need
1243.44 -> observability, how do you go about it?
1246.5 -> Now this is key.
1247.37 -> You need to develop a strategy.
1250.19 -> So, first think, what do I collect?
1253.76 -> So, we already talked about how important