AWS Summit SF 2022 - Full-stack observability and application monitoring with AWS (COP310)

Downtime in applications and services costs money. It is critical to resolve issues as close to real time as possible. Seeing how your applications and infrastructure are doing will help you achieve service-level objectives (SLOs) and improve availability, reliability, and performance. In this session, learn how you can use AWS to monitor and observe your applications, workloads, and resources; make data-informed decisions; and drive your business outcomes. Whether your application is in the cloud, on premises and moving to the cloud, or built on open-source technologies, AWS offers a full-stack observability solution that is fully managed, cloud-native, hyperscale, and includes application performance monitoring (APM) as well as managed open-source solutions.

Learn more about AWS Summits at https://go.aws/3zwaA9T.


Content

0.171 -> (upbeat music)
8.59 -> Hello, thank you for joining me here today.
10.99 -> I'm Rich McDonough.
12.16 -> I'm a Senior Worldwide Specialist Solutions Architect.
17.29 -> Oh, are you unable to hear?
20.55 -> Let's give people one, you're good now?
21.96 -> Excellent, okay, I'll start over (laughs).
23.871 -> Thank you for joining me here today.
24.73 -> I'm Rich McDonough.
25.563 -> I'm a Specialist Solutions Architect
27.05 -> with Amazon Web Services.
28.91 -> And we're gonna talk today about full-stack observability
32.48 -> and application monitoring with AWS.
35.71 -> First, a little preview of the agenda.
37.74 -> We're gonna talk about what observability is.
40.38 -> This is a very important discussion for us to have
42.503 -> to make sure that nomenclature is correct,
44.94 -> to make sure that we're all talking about the same thing.
47.89 -> We're gonna talk about strategies
49.73 -> and service level objectives.
52 -> This is key for everything that we do.
54.13 -> We don't monitor things just
56.25 -> for the sake of monitoring them.
57.46 -> We're gonna talk about why we monitor, what we monitor.
60.796 -> And then we're gonna talk about cloud native patterns
63.64 -> that we see every day with our customers.
65.88 -> These would be things that involve a number
67.91 -> of our cloud native services.
69.44 -> We'll walk through these as we go.
72.21 -> Then I'm gonna hand things over to my partner, Imaya,
75.24 -> who is going to discuss open-source patterns
78.1 -> that exist in the same realm.
80.44 -> This is definitely not a case
82.54 -> where you have to choose either cloud native or open-source.
86.85 -> This is definitely a better together story.
89.62 -> We'll talk about next steps.
90.93 -> And then we're gonna talk a little bit
92.24 -> about how this is part of a larger story
95.06 -> around cloud operations.
97.92 -> So first, what is observability?
101.97 -> We like to think about observability
104.32 -> as being somewhat more than just monitoring.
109.72 -> So when we talk about level setting on nomenclature,
113.51 -> observability really means understanding
115.62 -> what your systems are doing based on the telemetry
120.1 -> that your systems emit.
122.18 -> This is just a fancy way of saying,
123.63 -> you know what your servers or your applications are doing
126.81 -> and how they're doing it for how long and how well.
129.84 -> You can draw an analogy from a car.
132.01 -> You'll know if your car engine is on
134.15 -> because it's gonna be warm, it's gonna be humming.
136.15 -> You'll be driving down the highway.
137.56 -> With observability though,
138.94 -> you're gonna understand more about the car itself.
142.69 -> You'll have the speedometer, the tachometer, oil pressure,
148.05 -> all of these things that indicate as signals
150.97 -> what is happening inside of your actual vehicle.
155.15 -> Now for our purposes, we don't typically collect data
157.42 -> in terms of oil pressure or miles per gallon.
160.42 -> We certainly can for some customers.
163.07 -> But we typically talk more
164.97 -> about basic data types that we collect.
168.4 -> These come in the form of logs, metrics and traces.
173.11 -> You'll often, including today,
174.93 -> hear logs, metrics and traces described
176.96 -> as the three pillars of observability,
180.12 -> and our customers build out
181.67 -> their full-stack observability practices
184.73 -> based on these data types.
188.35 -> Now, in order to maintain
190.5 -> and evolve your operational excellence
193.86 -> and continually meet your business objectives,
196.03 -> and that's really the important thing,
197.64 -> is to make sure business objectives are met all the time,
201.59 -> you need to understand how to do this instrumentation.
207.84 -> Now, monitoring itself is a little bit simpler
211.45 -> depending on your viewpoint.
214.02 -> And we talk a lot about the three pillars of observability.
216.98 -> We'll get into that more in just a few minutes.
220.758 -> It's important to disambiguate though.
222.28 -> When we talk about monitoring,
223.81 -> it's very often more in the context of red light,
226.51 -> green light, is a system on or is it off?
230.23 -> Is it in an error state or is it in a good state?
233.19 -> And you miss a little bit of nuance.
235.04 -> So today I'm gonna talk a lot
236.87 -> about the example of an e-commerce application.
240.613 -> It's ubiquitous, obviously we have a lot of experience
244.57 -> operating e-commerce applications and websites,
248.73 -> and I'm gonna come back
249.563 -> to this example a few times for context.
254.47 -> But you have to ask the question,
256.29 -> how do you know if your application is down?
260.07 -> When does simply monitoring not quite fill
262.83 -> in the gaps well enough?
264.81 -> So for example, let's say your e-commerce site relies
267.85 -> on a third party payment processor.
271.4 -> If your e-commerce application
273.19 -> is for all intents and purposes working,
275.44 -> you can add products to a cart,
277.21 -> you can go through all the process
279.43 -> of proceeding to a checkout,
281.77 -> and then your purchase fails at the end.
285.49 -> Is your site actually working?
286.91 -> I'll make the case that it is not,
288.5 -> but you need to be able to understand why it is not working
291.3 -> and how to fix it.
292.58 -> In order to do so, you need more data.
295.18 -> In order to get this data, you need signals,
297.897 -> and that of course leads back to observability.
301.98 -> The three pillars of observability,
304.33 -> logs, metrics and traces,
307.06 -> these have important characteristics.
309.24 -> And on this slide, we have, pardon me,
312.6 -> we have not just AWS Services,
315.41 -> but also a little bit of an open-source service mentioned
319.41 -> over in the metrics section.
322.09 -> These services have some properties in common.
326.52 -> Logs are sequence aware, this is important.
331 -> You need to make sure that your logging data
332.61 -> is received in the right order.
334.32 -> You need to ensure that people aren't able to inject data
337.67 -> in the wrong order.
338.89 -> That way, you can reconstruct a series of events
341.3 -> and see what is actually happening in your environment.
344.7 -> Metrics are obviously crucial, we'll come back
347.84 -> to the most common metric people usually pick on,
350.53 -> so things like CPU utilization.
353.92 -> This is another signal that you use
355.42 -> to understand what is happening in your environment.
357.46 -> And then there's application traces, which is very often
360.3 -> where we find our customers need the most help.
363.21 -> Your app, as it's running,
365.23 -> should ideally tell you what it is doing
368.09 -> and how long it's taking to do it,
370.21 -> how well it's actually doing it.
372.27 -> This will include data like partners
374.43 -> that you've integrated with,
376.12 -> what the response time was from those integrations,
379.05 -> what the status codes were as a result,
381.37 -> and so on, and other deep operational details
384.5 -> of what has actually taken place.
389.03 -> So what you see here is a view
392.28 -> of all of the components that can be used
395.99 -> to build your observability full stack solution.
400.15 -> You don't need to use all of them,
403.894 -> but you need to use the ones
404.727 -> that help serve your business objectives.
408.62 -> So building an observability practice, it requires tools.
411.63 -> It doesn't just happen.
413.57 -> Here we see both the cloud native side
416.89 -> and the open-source managed side of the fence.
420.27 -> I talk about this diagram usually from the bottom up,
423.79 -> starting with the Collectors.
425.71 -> So to understand your workload, you need data.
428.35 -> That data comes from an agent.
432.31 -> On the bottom left hand side, we have the CloudWatch agent,
435.87 -> which is used very commonly by millions of customers
439.43 -> to collect logs and metrics
443.72 -> from applications, from servers, from EC2 instances,
449.06 -> and from other applications you might be running
452.93 -> in your environment. Containerized applications
454.87 -> can use the CloudWatch agent as well.
457.62 -> There's also the X-Ray agent,
459.07 -> which is used exclusively for application trace data.
463.61 -> This data is then fed up into CloudWatch
467.81 -> in the form of metrics, logs, traces, et cetera.
471.95 -> And as we move further up the stack,
475.08 -> we have services that are higher level and can do insight
480.07 -> and analysis across broader sets of applications.
483.97 -> You see this insights tier, Container Insights,
486.69 -> Lambda Insights, Contributor Insights.
490.01 -> These services can give you a broad view
492.98 -> across an entire application stack all at once
496.52 -> and it can tell you if you have an impact
499.24 -> across your entire environment
501.61 -> and let you drill down more narrowly
503.37 -> to see where an actual problem is taking place.
507.23 -> On top of all of this is ServiceLens,
509.55 -> which is a node view of your application stack complete
515.21 -> with your response time with the services that speak
519.07 -> to other services, how well they're doing so,
522.24 -> and with what frequency.
523.73 -> It will tell you data like the response time
526.87 -> from a particular application in your environment.
530.92 -> Moving over to the open-source side,
533.25 -> there are some very similar tools.
536.18 -> Over on the bottom right hand side,
538.4 -> we have the AWS Distro for OpenTelemetry.
541.93 -> This is a distribution of OpenTelemetry
546.86 -> that Amazon Web Services maintains
549.16 -> as an open-source project.
551.01 -> And people use this as part
553 -> of an OTel distributed tracing framework
558.8 -> to send data into both AWS X-Ray,
563.51 -> which is a part of CloudWatch,
565.57 -> or into other services such as OpenSearch
568.86 -> or into Zipkin and Jaeger.
571.29 -> Then customers often have their own do it yourself
574.67 -> or ISV product that may sit
577.93 -> in between Amazon Managed Grafana
581.31 -> and the rest of this data that is being collected.
584.42 -> There is definitely not one size fits all
587.33 -> when it comes to an observability story,
589.58 -> but these are the tools that you can use to compose the story.
594.83 -> So now let's flip the view on its side for just a moment.
598.02 -> We're gonna focus more on the cloud native side
600.92 -> for the next little while.
603.7 -> If you take all those previous services
605.92 -> and view them in terms of where you would implement them
608.84 -> in your environment,
611 -> infrastructure belongs to the primitives,
614.1 -> so logs, metrics, traces, dashboards for visibility.
618.96 -> Your application level observability itself
621.43 -> is going to come from insight services.
623.75 -> Again, able to correlate data
625.47 -> from across a wide set of applications,
629.29 -> Contributor Insights, Application Insights,
631.45 -> Container Insights, particularly powerful
634.37 -> for a distributed application,
636 -> where you need to have a composed view
638.74 -> of the state of your environment, of your health.
643.1 -> And then we have digital experience monitoring as well,
645.8 -> an even higher level super set that has the ability
650.5 -> to measure end users' actual interactions
654.09 -> with your application,
656.05 -> quantify these actions as positive or negative,
660.24 -> and then give you meaningful data from the field
663.88 -> about how well your app is actually performing
666.64 -> in a customer's own web browser.
668.85 -> We also have Evidently,
670.63 -> which can be used for feature launches,
673.51 -> A/B tests, experimentation, and Synthetics,
676.84 -> which is a way of doing actual synthetic browser testing.
680.25 -> So the aforementioned e-commerce example,
682.66 -> a Synthetic canary would probe an e-commerce site,
687.3 -> it will add an item to a cart, it could perform a login.
689.94 -> It could even perform a checkout if you so desire.
696.06 -> Now, this is where we have to talk about strategy.
700.8 -> There's two basic approaches that you take, pardon me,
706.46 -> when you begin looking at your observability solution.
710.95 -> Do you wanna watch things from the outside in
714.36 -> or do you wanna watch them from the inside out?
717.473 -> And there's reasons why you might choose one over the other.
721.5 -> Now, these reasons of course are tied directly back
725.17 -> to your service level objectives
727.45 -> and what matters to your business.
731.03 -> I'm gonna come back
731.863 -> to the CPU utilization example again and again,
734.27 -> because it's a great example
735.77 -> of where people sometimes look at the wrong data
739.56 -> and they may be alerting on something that doesn't matter,
741.94 -> depending on the workload.
743.167 -> And we wanna make sure that you are looking
744.87 -> at the right things.
749.98 -> So the first question you have to ask
752.34 -> when looking at a full stack observability solution,
755.25 -> what does good look like?
757.06 -> Is good measured in terms of sales,
760.48 -> is it a very low response time, or is good something
766.23 -> that maybe means something very different
768.46 -> for your application?
769.73 -> Perhaps it's a batch processing workload
771.8 -> that needs to generate reports across millions of records.
775.6 -> Good doesn't look the same for everybody,
777.83 -> but regardless, you can use the same tools
780.55 -> to obtain that view of what good looks like.
784.68 -> So e-commerce example, page response times,
787.8 -> probably a very good metric for you to watch,
790.47 -> because it is tightly correlated with things like sales.
795.04 -> So you've probably heard of the rule of three,
798.9 -> which is anecdotal,
801.27 -> I think it still holds some validity though.
804.31 -> If your webpage doesn't load in under about three seconds,
808.09 -> about half of your users will go away,
810 -> will go visit another site.
811.87 -> This is important, this is why response time matters,
815.21 -> particularly when you're having an outside in strategy
818.11 -> for full-stack observability.
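One way to turn the three-second rule of thumb into something measurable is a simple service level indicator: the share of page loads that finish under the threshold. The sketch below is illustrative only. The function names, the sample data, and the 95% target are assumptions, not anything shown in the session.

```python
from typing import Sequence


def page_load_sli(load_times_s: Sequence[float], threshold_s: float = 3.0) -> float:
    """Return the fraction of page loads that finished under the threshold."""
    if not load_times_s:
        raise ValueError("need at least one page load sample")
    good = sum(1 for t in load_times_s if t < threshold_s)
    return good / len(load_times_s)


def meets_slo(load_times_s: Sequence[float], target: float = 0.95,
              threshold_s: float = 3.0) -> bool:
    """True when the measured SLI is at or above the (assumed) SLO target."""
    return page_load_sli(load_times_s, threshold_s) >= target


# Hypothetical page load times in seconds, e.g. gathered from real users.
samples = [0.8, 1.2, 2.9, 3.4, 1.1, 0.9, 5.0, 1.7, 2.2, 1.0]
print(f"SLI: {page_load_sli(samples):.0%}, meets 95% SLO: {meets_slo(samples)}")
```

In practice the samples would come from front-end telemetry rather than a hard-coded list, and the target would be chosen from the business objective rather than picked up front.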
820.57 -> What does good look like?
821.51 -> Good looks like customers having a webpage load fast,
824.07 -> having a good experience and using your application,
826.7 -> having a delightful experience.
829.01 -> Failed purchases is another excellent example.
831.84 -> This is the kind of thing that you need to measure.
835.7 -> And I would ask the question,
837.06 -> how does monitoring your disk space
839.84 -> or your CPU really tell you anything about failed purchases?
844.35 -> It may or it may not.
845.66 -> It depends on how you're implementing your solution.
849.16 -> Now, going over to a batch processing workload,
851.88 -> let's say you have one that processes millions
854.4 -> of customer health records per night,
857.48 -> and it has a complex landscape of internal APIs
860.52 -> that must integrate with internal databases,
863.25 -> perhaps even partner integrations that have to take place.
867.37 -> The things that might matter more for you
869.31 -> could include things like looking for slow SQL queries.
872.64 -> That could be a very important indicator
875.4 -> of your application outcomes being at risk,
879.67 -> looking for integration health,
881.6 -> again, a key metric.
884.68 -> And then we have containerized applications.
887.97 -> What happens if you have a workload that doesn't need
892.31 -> to care about CPU usage?
894.23 -> Or if a disk fills up, maybe it doesn't matter,
897.59 -> and maybe you don't need to spend time looking at that
899.033 -> and you should be building your SLOs
902.73 -> and your observability strategy against things
905.41 -> that are important.
907.14 -> In either case, the outcome remains the same.
911 -> You should observe what matters to your business.
914.82 -> We don't monitor or observe just
916.68 -> because we're diligent DevOps engineers or sysadmins.
920.9 -> We have a stake in seeing successful business outcomes.
928.19 -> So I gotta reiterate this point, ooh, went too far,
932.06 -> business outcomes shape your approach
934.31 -> to observability and monitoring.
939.18 -> So I'm gonna challenge you for just a moment here.
941.46 -> Can you think of an example where you would want
944.26 -> to have an application with as high a CPU utilization
948.41 -> for as long as possible?
951.38 -> 'Cause I absolutely can, and I've had workloads
954.67 -> that I've needed to do this with.
956.33 -> Again, the batch processing workload,
958.5 -> the kind of thing that needs to run for hours and hours,
963.12 -> but maybe it's running on a single threaded application.
966.04 -> Maybe it's one where throwing a larger instance size at it
970.61 -> to make it run faster
971.89 -> is actually going to be counterproductive
975.29 -> and could make your workload cost more money per run,
979.33 -> in which case your service level objective
982.02 -> and the things that you need to monitor
983.94 -> are the ones that make sure your CPU is well utilized,
987.16 -> that you're getting the most value out
989.32 -> of your compute cycles.
991.37 -> In this case, this is actually looking
992.7 -> for the exact opposite of what you would
994.94 -> with a typical external-facing e-commerce application.
999.94 -> What you monitor has to be important
1002.34 -> for your business objectives.
1005.14 -> So I've seen this example in the wild.
1007.1 -> This is real and it's especially true
1009.41 -> with containerized applications
1011.02 -> where you're measuring more for the outcome
1013.96 -> and less for what all the specific signals may be.
1022.89 -> So we refer to both signals and insights
1026.41 -> when discussing observability,
1028.18 -> and the goal is to always help you make informed,
1030.75 -> data-driven decisions with these signals as a data point.
1036.75 -> When you can draw a line from a slow application
1040.84 -> due to a CPU that has been overloaded
1044.23 -> on an overtaxed server,
1045.53 -> which in turn leads to poor webpage response time,
1048.75 -> and that in turn decreases your customer's sentiment
1051.9 -> and jeopardizes your SLAs, your sales and other objectives,
1055.75 -> now you're in the right head space
1057.22 -> to talk about full-stack observability.
1059.79 -> Data feeds up, and stopping a problem
1062.73 -> from becoming an issue means knowing
1065.2 -> when problems are occurring.
1067.88 -> So let's get into the weeds now with a bit of an example
1070.51 -> of an outside in strategy for full-stack observability.
1077.15 -> So here you see the CloudWatch services
1083.25 -> that form a part of an observability solution.
1086.94 -> Now going through these quickly,
1088.2 -> I'd mentioned very briefly Synthetics as a tool that is used
1091.55 -> to proactively test an application using the same paths
1095.8 -> that your users do in order to see that it
1099.08 -> is behaving in an expected way.
1102.66 -> There's CloudWatch RUM,
1103.55 -> that's actually short for real user monitoring.
1105.95 -> This is used to measure what a customer or a Synthetic probe
1113.66 -> has actually experienced in a web application.
1117.96 -> So if you have a customer in a remote geography,
1121.07 -> let's say that they're 3,000 miles away,
1125.32 -> and if they have a bad response time to a webpage,
1128.29 -> but your customers locally have fast response times,
1131.25 -> that tells you something.
1132.51 -> You need to measure those customers
1134.4 -> and their outcomes and their experiences on your webpage
1138.86 -> as that's an important data point
1140.29 -> that leads to overall sentiment
1142 -> and can affect your objectives.
1145.01 -> Evidently, as mentioned previously,
1146.81 -> is used to do A/B testing, launches, and experiments,
1152.03 -> and measure the results of these experiments
1155.33 -> as experienced by real customers.
1157.51 -> So there's a definite focus here
1159.54 -> on real customer experiences.
1162.35 -> You need to measure these things
1163.6 -> in order to make meaningful decisions based on it.
1167.33 -> But then moving a little bit further over,
1169.15 -> we have metrics and Metrics Insights,
1171.45 -> well, metrics you're probably very familiar with.
1174.2 -> We've all seen graphs of disk space usage and so on.
1178.08 -> Metrics Insights is a new feature of CloudWatch
1181.61 -> that allows us to query that data using SQL syntax.
1186.26 -> It's very powerful and it lets you aggregate
1188.78 -> across thousands of different metrics all in one query.
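As a rough illustration of what a SQL-style query over metrics can look like, the sketch below builds a Metrics Insights query string that ranks EC2 instances by average CPU utilization. The helper names are hypothetical, and the dictionary shape is an assumption modeled on one entry of the `MetricDataQueries` parameter of the CloudWatch GetMetricData API. Check both against the current CloudWatch documentation before relying on them.

```python
def top_cpu_query(limit: int = 10) -> str:
    """Build a Metrics Insights query for the busiest EC2 instances by CPU."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return (
        "SELECT AVG(CPUUtilization) "
        'FROM SCHEMA("AWS/EC2", InstanceId) '
        "GROUP BY InstanceId "
        "ORDER BY AVG() DESC "
        f"LIMIT {limit}"
    )


def as_metric_data_query(expression: str, query_id: str = "q1",
                         period_s: int = 300) -> dict:
    """Wrap the query in the (assumed) shape of a MetricDataQueries entry."""
    return {"Id": query_id, "Expression": expression, "Period": period_s}


query = as_metric_data_query(top_cpu_query(limit=5))
print(query["Expression"])
```

The point of the example is the aggregation: one statement covers every instance that reports the metric, instead of one graph or alarm per instance.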
1193.03 -> We also have anomaly detection.
1195.01 -> You feed this data into an anomaly detection model
1198.03 -> and it can help you find the data, the thresholds,
1202.99 -> these alerts that you maybe didn't even know
1205.88 -> that you needed to look for.
1208.67 -> Getting back to the question of what does good look like,
1211.43 -> if you can't always answer that question, maybe that's okay.
1214.97 -> And maybe anomaly detection can help you figure that out.
1217.97 -> And then of course, there's logs, dashboards, obviously,
1222.38 -> and X-Ray, which is used to collect all of this trace data.
1229.86 -> And now all you do is plug in your web application
1232.46 -> and have it send this data into these services.
1237.44 -> So think of what you could do with a website
1240.43 -> that you are about to launch a new feature on,
1242.99 -> and you don't know if this feature
1244.57 -> is going to impact your sales positively or negatively,
1248.93 -> ideally positively, but you don't know.
1252.57 -> Using this outside in approach,
1255.17 -> making sure that you're watching for things
1257.24 -> at every tier of your application,
1259.72 -> you can do this measurement.
1261.09 -> You can do a safe launch
1262.14 -> with Evidently across a small subset
1264.41 -> of your audience, say 10%.
1266.92 -> You can measure the impact, you can do a correlation,
1270.95 -> and you can have a much more successful outcome
1272.81 -> with less pain as a result.
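The idea of exposing a new feature to a small, stable slice of the audience can be sketched with deterministic hashing. This is not how Evidently is implemented or called. It is a minimal stand-in for the general technique, with a hypothetical function name and a made-up feature name, so that the same user always lands in the same cohort.

```python
import hashlib


def in_launch_cohort(user_id: str, feature: str, percent: float) -> bool:
    """Deterministically place a user in or out of a launch cohort.

    The hash of feature and user ID is mapped to a bucket in 0..9999, and
    the user is in the cohort when the bucket falls below the requested
    percentage (10 percent corresponds to buckets 0..999).
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


# Hypothetical users and feature name, for illustration only.
users = [f"user-{i}" for i in range(10_000)]
exposed = sum(in_launch_cohort(u, "new-checkout", 10) for u in users)
print(f"{exposed / len(users):.1%} of users see the new feature")
```

Because assignment depends only on the feature name and the user ID, metrics such as completed purchases can be compared between the two groups over the life of the launch.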
1279.92 -> So here you see some of the typical signals
1282.22 -> that would be used to create your service level objectives,
1285.34 -> keeping page load times under three seconds is ubiquitous.
1289.03 -> The common wisdom though is of course,
1291.44 -> it is somewhat anecdotal, as I said,
1293.29 -> I'll stick to the three second rule for the time being.
1299.12 -> Purchases completed successfully,
1301.34 -> this is a very important SLO,
1303.4 -> which I think a lot of people really need to lean in on.
1307.27 -> I'll make the case that if you don't have 100%
1311.43 -> of your purchases being successfully completed,
1314.05 -> there might be something that you need to look at.
1316.37 -> And maybe there's something about your workload
1317.97 -> that is in jeopardy.
1319.11 -> Maybe there's something about your audience
1321.52 -> that might be causing a problem.
1325.89 -> JavaScript and HTML errors
1328 -> directly impact the customer's experience,
1331.22 -> they absolutely do.
1332.61 -> If there are customers that are frustrated,
1334.77 -> then they are probably going to go elsewhere.
1340.38 -> So I have to reiterate, all of these examples
1343.46 -> are related entirely to your end user's behavior
1348.44 -> and what they observe on the application.
1351.53 -> But now let's talk about what's happening in the backend,
1354.46 -> 'cause it is a different story,
1355.96 -> although it is tightly related.
1360.26 -> So you see many of the same services here.
1362.39 -> There's two that I don't have icons for.
1364.21 -> So Logs Insights and Contributor Insights.
1367.57 -> These are, in order, Logs Insights
1371.03 -> is a search engine for your logging data.
1374.83 -> When data is received by CloudWatch Logs,
1377.93 -> it can be automatically parsed and indexed,
1382.35 -> and you can look for specific values in this logging data.
1386.86 -> It has a very powerful effect of allowing you
1390.98 -> to search across gigabytes
1392.77 -> or even potentially terabytes of data in one pane of glass,
1397.03 -> all inside of the CloudWatch console.
1400.71 -> Contributor Insights is in many ways
1402.47 -> what I call the unsung hero of CloudWatch.
1405.43 -> So this allows you to look for your top talkers
1408.9 -> in your environment,
1411.36 -> and you can do so across more than one dimension.
1414.17 -> So as an example, if you were to stream some
1417.61 -> of your application logging data,
1418.92 -> that includes the country in which a customer lives
1423.66 -> and the city in which a customer lives,
1426.28 -> imagine if you could have a time series graph
1429.83 -> that shows you how many of your customers come from where,
1433.26 -> and you can watch this graph as it proceeds through time,
1437.16 -> and you can alarm on it as well, you can create thresholds.
1439.87 -> If you have a sudden drop in customers coming from Australia
1445.04 -> or coming from Georgia,
1447.75 -> that might be something that you need to have visibility of.
1452.75 -> Contributor Insights can do this for you.
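The "top talkers across more than one dimension" idea can be illustrated with a few lines of plain Python that count log events by a combination of keys. This is only a sketch of the concept, not the Contributor Insights rule format or API. The event fields and the function name are assumptions.

```python
from collections import Counter
from typing import Iterable, Mapping, Sequence


def top_contributors(events: Iterable[Mapping[str, str]],
                     keys: Sequence[str] = ("country", "city"),
                     n: int = 3) -> list:
    """Count events by the given combination of keys and return the top n."""
    counts = Counter(
        tuple(event.get(key, "unknown") for key in keys) for event in events
    )
    return counts.most_common(n)


# Hypothetical structured log events from an e-commerce application.
events = [
    {"country": "AU", "city": "Sydney"},
    {"country": "AU", "city": "Sydney"},
    {"country": "US", "city": "Atlanta"},
    {"country": "AU", "city": "Melbourne"},
    {"country": "US", "city": "Atlanta"},
    {"country": "AU", "city": "Sydney"},
]

for contributor, count in top_contributors(events, n=2):
    print(contributor, count)
```

Running the same count per time window would give the time series described above, and a sudden drop for one key combination is the kind of change a threshold could catch.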
1456.5 -> So the things that will feed into this,
1459.83 -> I'm showing just the AWS Services here,
1463.307 -> but they're unsurprising.
1464.91 -> The mechanisms for sending this data
1466.55 -> will vary a little bit depending on your technology stack.
1469.86 -> Anything that runs on a server, so an EC2 instance,
1474.6 -> it can use the CloudWatch agent
1475.93 -> to send logging data and metrics.
1478.391 -> AWS native services like RDS
1482.13 -> will send their data into CloudWatch
1483.46 -> without requiring any special configuration.
1487.18 -> And likewise with Lambda functions,
1488.71 -> Lambda has a very easy path
1490.31 -> to send its logging data into CloudWatch.
1495.79 -> The three core data types to observability,
1498.15 -> the aforementioned logs, metrics and traces,
1501.15 -> they're still used to build your practices on.
1503.5 -> The CloudWatch alarms and anomaly detection
1505.7 -> that I had spoken of,
1507.94 -> they work just as well across these backend signals
1510.78 -> as they do for your front-end signals.
1513.33 -> And of course, we're not forgetting about those workloads
1515.89 -> that don't run on AWS.
1519.06 -> And we'll have a little bit more to talk about
1520.94 -> in that regard in a few minutes.
1523.99 -> So examples of SLOs that are backend focused,
1528.48 -> slow queries, vitally important for most customers.
1533.42 -> These things can kill a workload
1535.46 -> and they can grind an otherwise very large
1538.32 -> and healthy database server to an actual painful halt.
1542.1 -> So having objectives that keep those targets low
1546.36 -> is typically very important.
1548.26 -> And we've already talked about CPU quite a lot.
1550.82 -> Disk usage and IOPS are also very crucial,
1553.91 -> especially on those aforementioned database servers
1556.57 -> that I keep on harping on about.
1558.77 -> Those slow queries might not be the result
1561.11 -> of bad queries, by the way.
1562.36 -> I've seen this too, where you have queries
1564.33 -> that are actually very well written, very well vetted,
1569.38 -> however, maybe the server doesn't have disk
1571.6 -> that can keep up with it.
1573.29 -> You need to understand your disk IOPS for some workloads,
1575.86 -> highly, highly important.
1577.58 -> And then, of course there's these super common KPIs,
1579.98 -> response times, errors, faults, retries.
1585.4 -> If you have error budgets that are part of your application,
1588.46 -> if that's part of your SLO story,
1590.37 -> those are key for you to watch.
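For readers who have not worked with error budgets, the arithmetic is short: an SLO target implies a number of failures you are allowed, and the budget is how much of that allowance is left. The sketch below assumes a request-based SLO. The function name and the example numbers are hypothetical.

```python
def error_budget_remaining(total_requests: int, failed_requests: int,
                           slo_target: float) -> float:
    """Return the fraction of the error budget still unspent.

    1.0 means no failures yet, 0.0 means the budget is exactly used up,
    and a negative value means the SLO has been breached.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    allowed_failures = total_requests * (1 - slo_target)
    return 1 - failed_requests / allowed_failures


# Example: a 99.9% success SLO over one million requests allows about
# 1,000 failures, so 250 failures leaves roughly three quarters of the budget.
remaining = error_budget_remaining(1_000_000, 250, 0.999)
print(f"Error budget remaining: {remaining:.0%}")
```

A shrinking budget is a useful signal to alarm on, because it ties the raw counts of errors, faults, and retries back to the objective they put at risk.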
1592.45 -> And these are the sorts of things that your outside in,
1596.2 -> sorry, pardon me, inside out strategy is built on,
1600.03 -> these are all internal facing signals.
1604.45 -> And this would be an example, of course,
1606.08 -> of putting all of this together.
1608.86 -> This is where most people will land.
1611.35 -> You don't just need to have one story.
1613.83 -> You need to be inclusive.
1615.56 -> Your outside in story and your inside out
1617.84 -> have to come together to watch the full stack.
1620.66 -> So you'll know what's happening in every environment.
1623.3 -> So if there is a knock-on impact
1625.03 -> to one stack of your application,
1627.44 -> you'll see it affect your outcomes elsewhere.
1631.89 -> And you need to be able to understand why.
1633.81 -> Not every customer is gonna have a solution
1635.28 -> that looks like this, but a lot of you will.
1639.25 -> Now, the common element to all of this is CloudWatch
1644.18 -> in the cloud native model
1645.41 -> that we're discussing here right now.
1647.758 -> CloudWatch is highly resilient and fault tolerant.
1650.39 -> It's built for scale in ways that most customers
1653.72 -> can't actually build their own services on.
1659.11 -> And the way that we operate these services,
1660.63 -> the way that we build them,
1661.72 -> we build our services using our services
1664.36 -> and our own skillset.
1665.68 -> So CloudWatch is a very compelling offer
1668.26 -> for those who need to have a good observability story.
1671.65 -> But what about your hybrid and on-premises workloads?
1675.47 -> What about those things that just don't run in AWS today?
1679.12 -> They're not second class citizens in this journey.
1682.12 -> They're not shut out from using all of the tools.
1685.69 -> They can absolutely consume the same tooling, and why not?
1691.4 -> So hybrid, on-premises, very distributed workloads,
1696.09 -> they produce data and they speak to the internet,
1698.04 -> just like everything else does.
1700.24 -> That's really the whole basis of cloud based solutions.
1704.65 -> They deliver through the internet
1706.16 -> and with AWS, it's with the pay-as-you-go model.
1711.42 -> Now when I said internet,
1712.75 -> I of course could mean, more broadly, through a VPN.
1715.51 -> You don't have to go over the public internet,
1717.63 -> I would make the case, and a lot of times,
1719.392 -> you shouldn't even consider it.
1723.51 -> Or you could be taking advantage of Direct Connect
1726.26 -> and have truly private access into CloudWatch
1729.36 -> from your on-premises environment.
1732.85 -> Either way, the workloads that you operate
1735.35 -> in AWS in this model look almost exactly the same
1739.54 -> as the workloads that you operate on premises or elsewhere.
1743.06 -> And you can run the CloudWatch agent or the X-Ray daemon
1746.57 -> in a remote data center in almost the same exact way
1751.33 -> that you would in AWS, it's very low friction.
1754.48 -> So we encourage you to think of AWS and our monitoring tools
1759.61 -> as an extension of your on-premises or hybrid environment.
1768.26 -> So then, I'm actually gonna hand things over
1770.17 -> to my colleague, Imaya.
1772.12 -> He's gonna talk about the open-source story.
1780.69 -> All right, thank you, Rich.
1782.97 -> Hello, everyone, my name is Imaya.
1784.79 -> I'm a Principal Solution Architect
1786.13 -> and I focus on open-source observability.
1791.27 -> First of all, like Rich mentioned,
1794.048 -> CloudWatch is an AWS native solution,
1795.56 -> I hope you all can hear me, is an AWS native solution
1798.55 -> that is built on top of, in other words,
1803.53 -> many dozens of AWS services automatically vend logs
1806.3 -> and metrics to CloudWatch.
1809.29 -> It supports several features that are specifically built
1812.92 -> for specific workloads like containers,
1814.59 -> like Container Insights,
1815.59 -> Lambda with Lambda Insights, and so on,
1817.65 -> and Contributor Insights and all that.
1820.133 -> There's a lot of features in CloudWatch.
1823.036 -> At the same time, it was actually
1825.63 -> when it comes to open-source, last week,
1827.05 -> I was reading an article, basically a survey report
1833.581 -> that basically they talked to a lot of customers,
1835.85 -> small, medium, and large,
1836.95 -> even enterprises that have more than 30,000 employees
1839.64 -> that have large enterprise IT departments.
1842.34 -> They all wanna adopt open-source software.
1844.78 -> It's not only observability based software,
1846.94 -> but open-source software in general,
1849.16 -> but they raised three main concerns.
1852.5 -> And the number one concern is security.
1854.88 -> They are not really sure that the open-source software
1857.6 -> is actually secure or not, whether it'll be compliant
1860.59 -> with the demands that they have in their environment.
1863.3 -> And by the way, the recent Log4j issue that,
1867.24 -> if y'all remember that in the last month or so,
1870.47 -> caused a lot of anxiety in these customers.
1873.46 -> And number two is they don't know what version
1877.04 -> of the open-source software to use
1878.71 -> and how long they should wait
1880.28 -> to upgrade to the newer version, and so on.
1883.27 -> So they had a lot of confusion there.
1885.41 -> And number three, if they opened up open-source software,
1887.93 -> everybody used their software that they wanted to use,
1891.14 -> and they really did not, the operations team
1893.95 -> have no control over what software the company's using
1897.98 -> and who to support and what happens
1900.46 -> if a vulnerability is discovered
1901.81 -> and who should maintain that, and all that.
1903.45 -> There's a lot of such confusions.
1905.72 -> However, the same thing applies
1907.46 -> to open-source observability software as well.
1912.972 -> Some of such problems or challenges are solved
1921.08 -> by the managed service,
1923.68 -> like Managed Prometheus Service, for example.
1926.26 -> Amazon Managed Service for Prometheus
1927.43 -> is a fully managed metric monitoring solution.
1930.5 -> It is fully PromQL compatible.
1932.5 -> And we have Amazon Managed Grafana,
1934.91 -> which is a fully managed visualization solution
1938.53 -> that allows you to create Grafana environments
1941.76 -> that you really don't have to manage at all.
1943.3 -> It's fully managed by AWS.
1944.71 -> And also we have AWS Distro for OpenTelemetry,
1947.4 -> which is basically a redistribution
1949.56 -> of the OpenTelemetry project, which is part of CNCF,
1953.37 -> or Cloud Native Computing Foundation, right?
1956.05 -> Let's take a look at every one of these
1958.16 -> in depth a little bit.
1959.86 -> So let's talk about Prometheus,
1961.703 -> Prometheus is a great metric monitoring solution, right?
1964.71 -> So it is very popular in the monitoring world
1968.62 -> because of its querying language called PromQL,
1972.54 -> which is extremely powerful,
1973.73 -> simple at the same time, it's very powerful.
1975.8 -> It has a really good alerting solution, Alertmanager.
1980.05 -> It can support recording rules.
1982.04 -> You can have a lot of rules to change,
1984 -> to aggregate metrics that you're ingesting.
1985.99 -> It supports high cardinality metric collection,
1988.36 -> and it has dozens and dozens of exporters.
1991.18 -> Exporters are little agents that you can run
1992.54 -> on any environment to collect metrics
1994.01 -> from specific workloads, like for example, JMX Exporter
1996.5 -> is one good example that allows you to export metrics
1999.53 -> from JVM-based applications, and Node Exporter,
2003.16 -> which basically exports metrics in Prometheus format
2005.21 -> from Linux based machines.
2006.38 -> And there are close to 100 such exporters
2008.95 -> that again simply deploy in your environment
2010.64 -> and collect metrics from using any of the Collectors,
2014.17 -> like Prometheus servers, and so on.
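To make "metrics in Prometheus format" concrete, here is a minimal Python sketch of the text exposition format that exporters serve for a Collector to scrape. It is an illustration only, not code from any real exporter; the metric name `demo_queue_depth` and its labels are hypothetical.

```python
def render_sample(name, labels, value):
    """Render one sample line in the Prometheus text exposition format."""
    if not labels:
        return f"{name} {value}"
    parts = []
    for key in sorted(labels):
        # Label values are double-quoted; backslash, quote and newline are escaped.
        escaped = (
            str(labels[key])
            .replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
        )
        parts.append(f'{key}="{escaped}"')
    label_str = ",".join(parts)
    return name + "{" + label_str + "} " + str(value)


def render_metric(name, help_text, metric_type, samples):
    """Render a metric family: HELP and TYPE comment lines, then the samples."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    lines.extend(render_sample(name, labels, value) for labels, value in samples)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical gauge with one sample per queue.
    print(render_metric(
        "demo_queue_depth",
        "Number of items waiting in the demo queue.",
        "gauge",
        [({"queue": "orders"}, 12), ({"queue": "emails"}, 3)],
    ))
```

A real exporter would serve this text over HTTP, typically on a `/metrics` path, and keep the values current; the sketch only shows the shape of what gets scraped.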
2015.82 -> And it's very easy to adopt, and it's one of the reasons.
2018.79 -> Another one is it really works very well with Kubernetes.
2022.398 -> It supports Kubernetes service discovery.
2024.46 -> So it is very popular there as well.
2027.36 -> However, like I mentioned about those issues
2031.4 -> with open-source software,
2032.57 -> the same things apply to this too.
2034.6 -> The customers really do not have an easier way
2038.03 -> to run Prometheus in a large scale environment.
2041.25 -> Leaving out even the security challenges
2044.41 -> that they might have,
2045.75 -> it's even much harder to kind of provision
2048.71 -> and maintain a really large Prometheus environment.
2050.53 -> So if you are collecting metrics from large workloads,
2054.65 -> like big, large EKS clusters,
2056.52 -> or even EC2 instances, and so on,
2058.51 -> you have to provision a large amount of storage,
2060.11 -> you have to keep the Prometheus server
2061.82 -> and the collection agents, all of them up and running,
2064.32 -> and you have to be able to have that environment
2069.36 -> or that becomes one of the most critical workloads for you,
2073.07 -> which is not really where you wanna be.
2074.91 -> Because if your application systems are down
2078.348 -> or if this one goes down,
2079.181 -> you're basically running blind,
2080.22 -> so it becomes one of the most crucial workloads,
2083.64 -> which is very challenging.
2085.38 -> So customers don't wanna do this because it's a challenge
2091.922 -> that is really not a business application per se.
2094.69 -> It is not a money making application.
2096.384 -> It's an observability solution.
2097.34 -> So what Managed Prometheus Service does is it allows you
2101.41 -> to create Prometheus compatible environments.
2105.72 -> So when you go to the Managed Prometheus Service,
2107.59 -> you create something called a workspace,
2108.99 -> which is it's a service that's based on Cortex,
2111.52 -> which is a CNCF project.
2113.29 -> And it is a multi-tenant environment
2115.54 -> where you basically create a workspace
2116.93 -> just by giving it a name, and automatically,
2119.51 -> you get a highly available PromQL compatible environment,
2124.41 -> where you can send your metrics in
2126.23 -> and you can query metrics from, and so on.
2128.226 -> So it is automatically deployed across multiple AZs.
2131.74 -> So it is highly available and it scales based on needs.
2135.4 -> And you don't really pay any upfront costs for that.
2139.84 -> And it is also fully PromQL compatible.
2141.67 -> What that means is if you're using Prometheus today,
2143.71 -> you can simply bring the same PromQL queries
2145.75 -> to Managed Prometheus Service, and all of that should work.
2147.76 -> And same thing applies to your alerting
2150.31 -> and recording rules and all of that.
2151.82 -> So it's fully compatible
2153.47 -> with the existing solution that we have.
2156.12 -> There are no servers to manage,
2157.35 -> there's no capacity to provision whatsoever,
2160.06 -> simply create a workspace, and you get going.
2162.36 -> And like I mentioned before,
2166.15 -> it really supports a lot of environments,
2167.75 -> whether it is an AWS environment, EC2 or EKS
2171.61 -> or ECS, whatever, it works.
2174.29 -> And even, you can use the Managed Prometheus Service
2178.59 -> to collect metrics from even on-prem
2180.45 -> or any other environments
2181.59 -> that you're running your workloads on.
2182.71 -> As long as you're able to make calls
2184.77 -> to the Managed Prometheus Service through AWS,
2188.03 -> through authenticated calls, it works.
2189.84 -> You need to make your calls through SigV4.
2191.8 -> SigV4 is the protocol that is used by AWS SDKs and CLIs,
2195.08 -> and all of that to make calls to all and any AWS service.
2198.42 -> And this uses Go SDK, so it's going
2200.5 -> to go through the Go SDK credential provider chain.
2202.31 -> So multiple options for you there
2203.6 -> to kind of provide credentials to the Collector.
2207.44 -> So as long as you do that,
2208.37 -> you'll be able to send metrics, right?
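For readers who have not looked inside SigV4, the following Python sketch shows only the signing-key derivation step, the HMAC-SHA256 chain described in the AWS Signature Version 4 documentation. It is a partial illustration: a full request signature also needs the canonical request and string-to-sign, which the AWS SDKs build for you. The secret key is a made-up placeholder, and the signing name `aps` for the Managed Prometheus Service is an assumption to verify against the service documentation.

```python
import hashlib
import hmac


def _hmac_sha256(key: bytes, message: str) -> bytes:
    """Return the raw HMAC-SHA256 digest of message under key."""
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()


def derive_signing_key(secret_key: str, date_stamp: str, region: str, service: str) -> bytes:
    """Derive a SigV4 signing key: date -> region -> service -> 'aws4_request'."""
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")


def sign(signing_key: bytes, string_to_sign: str) -> str:
    """Sign an already-built string-to-sign and return the hex signature."""
    return hmac.new(signing_key, string_to_sign.encode("utf-8"), hashlib.sha256).hexdigest()


if __name__ == "__main__":
    # Placeholder credentials and scope, for illustration only.
    key = derive_signing_key("EXAMPLE-SECRET-KEY", "20220421", "us-west-2", "aps")
    print(len(key), "byte signing key derived")
```

Because the key is scoped to a date, region, and service, a signature made for one service or day is not valid for another. In practice you would let an SDK or a SigV4-aware Collector do all of this rather than hand-rolling it.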
2210.9 -> And the way it works is this.
2212.94 -> So when you create a Prometheus workspace,
2214.9 -> you get an ingestion endpoint, you get a querying endpoint,
2218.25 -> basically you'll be able to send metrics
2219.84 -> through the ingestion endpoint,
2221.048 -> through the querying endpoint,
2222.86 -> you'll be able to query the metrics.
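As a rough picture of the two endpoints, this Python sketch builds the ingestion (remote write) and query URLs for a workspace. The URL pattern reflects my understanding of the service's endpoint layout and should be treated as an assumption; the console shows the real values for your workspace. The workspace ID `ws-EXAMPLE` is hypothetical, and any real request to these URLs would still need SigV4 signing.

```python
from urllib.parse import urlencode


def workspace_endpoints(region: str, workspace_id: str) -> dict:
    """Return the assumed ingestion and query URLs for one workspace."""
    base = f"https://aps-workspaces.{region}.amazonaws.com/workspaces/{workspace_id}"
    return {
        "remote_write": f"{base}/api/v1/remote_write",  # metrics are sent here
        "query": f"{base}/api/v1/query",                # PromQL instant queries go here
    }


def instant_query_url(region: str, workspace_id: str, promql: str) -> str:
    """Build a query URL with the PromQL expression URL-encoded."""
    query_endpoint = workspace_endpoints(region, workspace_id)["query"]
    return query_endpoint + "?" + urlencode({"query": promql})


if __name__ == "__main__":
    for name, url in workspace_endpoints("us-west-2", "ws-EXAMPLE").items():
        print(name, url)
    # An ordinary PromQL expression, passed through unchanged apart from encoding.
    print(instant_query_url("us-west-2", "ws-EXAMPLE", "sum by (job) (rate(http_requests_total[5m]))"))
```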
2223.83 -> And for long term storage, we use S3 buckets.
2227.1 -> So that's all internally, let's say S3 buckets
2229.52 -> is not visible for you, S3 is what we use in the background.
2232.72 -> And you have recording rules, you have an alert manager.
2235.63 -> So with Alert Manager, you can define alerts.
2237.48 -> You can create thresholds
2238.84 -> that will trigger alerts to destinations
2241.7 -> that you want through SNS.
2243.04 -> And also you can take actions,
2244.9 -> like auto scaling and so on as well.
2247.55 -> And that environment is fully managed by AWS.
2251 -> I'm showing that just so you understand
2253.19 -> that what is sitting behind that workspace,
2255.21 -> but that's fully managed by AWS.
2256.377 -> All you do is just give it a name, and that's all, right?
2258.75 -> And all on the left hand side,
2259.78 -> what you see is the collection mechanism.
2262.3 -> You can use any of your favorite Collectors,
2265.23 -> like the AWS Distro for OpenTelemetry Collector,
2267.67 -> which we will talk about in a little bit,
2269.25 -> or a Prometheus server, for example, any of that,
2271.72 -> even there's a Grafana agent.
2273.85 -> And you can even write your own application to send metrics.
2278.67 -> As long as you can use SigV4, you're good to go.
2282.58 -> So you can collect your metrics
2283.88 -> through those different Collectors,
2286.19 -> send the metrics to the Prometheus service
2287.84 -> and we'll take it, right?
2289.13 -> And then on the right hand side, you see the querying.
2292.11 -> And you can use Managed Grafana or your own Grafana,
2296.04 -> or you can obviously, it's all HTTP API,
2298.47 -> so you can make HTTP API calls
2299.69 -> and they have to be again authenticated through SigV4,
2302.19 -> as long as you're doing that, you'll be able to make calls.
2304.37 -> It's pretty straightforward, right?
2306.92 -> And we're talking about Grafana, right?
2310.12 -> So Grafana is a rich visualization solution, right?
2316.265 -> Grafana is very popular because of its simplicity.
2318.86 -> At the same time, it is very powerful.
2320.53 -> It doesn't store any data.
2321.81 -> You never send your data to Grafana per se.
2325.34 -> It has a lot of data sources that it supports.
2327.55 -> And you basically connect to your destination.
2332.18 -> Like for example, you can connect
2333.45 -> to Managed Prometheus Service,
2335.11 -> and you will use a native querying language.
2336.92 -> If you connect to Prometheus, you will use PromQL.
2339.23 -> You can connect to CloudWatch,
2340.157 -> you can use the CloudWatch, if you querying logs,
2343.16 -> using Logs Insights, if you're using metrics,
2345.007 -> you will either use Metrics Insights
2346.49 -> or CloudWatch Metric Math functions, and so on.
2348.63 -> There are several dozens of data sources that are available.
2352.03 -> But again, Grafana's ability is in its capacity
2358.04 -> to provide that single pane view of,
2361.475 -> a glass view of different data sources.
2363.61 -> In one dashboard, you can see data from CloudWatch,
2367.2 -> maybe Prometheus, maybe Oracle database, SQL Server,
2369.79 -> or maybe some other third party monitoring solution,
2371.68 -> like Datadog, and so on.
2373.203 -> It's all possible in just one dashboard,
2376.05 -> even in one widget in Grafana,
2378.69 -> just in one widget, just one graph,
2381.39 -> you can actually see data from multiple data sources.
2383.46 -> So it's very powerful.
2384.73 -> However, again, while customers love using Grafana,
2389.25 -> if they're using Grafana in their environment,
2392.07 -> it becomes yet another challenge
2394.24 -> in terms of managing it, patching it,
2397.05 -> securing it, and providing authentication,
2399.32 -> integrating with other AWS services and so on,
2402.49 -> it becomes a challenge.
2403.45 -> And that's where Managed Grafana comes into play.
2405.52 -> So you create Grafana, Managed Grafana,
2407.58 -> you create a workspace,
2408.83 -> and you get a highly available Grafana environment deployed
2411.92 -> across three availability zones.
2414.63 -> It is secured in the sense the control plane
2417.15 -> is IAM authenticated because you're doing it
2420.44 -> through the AWS console.
2421.72 -> And the Grafana environment to provision users itself,
2424.21 -> you can use AWS SSO or any SAML based identity provider.
2428.93 -> So when we launched the SAML
2430.69 -> based identity provider feature,
2433.19 -> we partnered with OneLogin, Okta, CyberArk,
2437.65 -> or sorry, Ping Identity and Azure AD, right?
2440.82 -> But it's SAML based, so technically any SAML based,
2445.13 -> SAML 2.0 based identity provider will work.
2447.03 -> So you can actually integrate
2448.3 -> your existing SAML based enterprises
2450.24 -> to provision users into Grafana.
2452.4 -> It is pay-as-you-go, there's no upfront cost or whatever.
2460.36 -> This service was launched in partnership with Grafana Labs,
2464.89 -> and there's by default,
2467.54 -> you get the Grafana open-source version.
2470.02 -> And you can also, if you want,
2471.61 -> upgrade to the Grafana Enterprise version
2473.42 -> from within the console.
2474.71 -> So when you do that, what happens
2475.87 -> is you actually get additional data source plugins,
2478.34 -> like you see in the column in the middle,
2480.87 -> AppDynamics, Datadog, and Dynatrace,
2482.59 -> and all the way to Wavefront, right?
2483.74 -> So all those data, so if you're using any
2485.81 -> of those data sources, then the Enterprise version
2487.77 -> of Grafana is the way to go.
2489.07 -> And out of the box, when you create Grafana,
2492.357 -> you get all these data sources installed for you.
2495.17 -> So you don't have to install anything and manage all that.
2499.39 -> So if you run Grafana yourself,
2500.9 -> you would basically be copying files,
2502.57 -> changing configuration, rebooting the servers and so on.
2504.844 -> You don't do any of that with the Managed Grafana, right?
2508.12 -> That is the visualization part.
2509.69 -> Then we move on to OpenTelemetry.
2512.75 -> So OpenTelemetry is a CNCF project, right?
2516.91 -> So it's a Cloud Native Computing Foundation project.
2519.48 -> So one of the biggest challenges in collecting signals,
2523.34 -> logs, metrics and traces from applications
2525.41 -> is that there are numerous SDKs and agents
2530.26 -> and Collectors that are involved.
2531.78 -> So if you liked a logging solution,
2533.95 -> you have to use the logging vendor's SDK,
2536.49 -> logging solution vendor's agent to collect logs.
2540.24 -> And if you like the metric solution, same thing,
2542.18 -> same SDK, Collector, theirs,
2544.13 -> and then same with tracing solution.
2545.62 -> Now all of a sudden you have multiple agents
2547.56 -> and multiple SDKs.
2549.022 -> Okay, now you wanna change your metric monitoring vendor.
2553.54 -> So you wanna move from X to Y, vendor Y, so what happens?
2558.15 -> Now it's not as easy as simply replacing a Collector.
2560.45 -> No, because you have instrumented your applications
2562.65 -> to expose metrics using that SDK
2565.48 -> that was provided by the vendor,
2567.26 -> now you have to go rewrite your code,
2568.61 -> rewrite your application,
2569.55 -> that could take weeks, even months.
2571.25 -> Customers don't have the money or the time to do that.
2573.89 -> So which is an industry problem,
2575.38 -> which is a vendor lock-in situation, right?
2577.27 -> Which is everybody understands that, all vendors,
2580.31 -> and that's what the CNCF based OpenTelemetry project
2584.19 -> actually tries to solve that.
2585.51 -> There are a lot of companies that are contributing to this.
2588.97 -> And the idea is to create standardization
2592.38 -> on specification of how these signals should look like
2595.5 -> and should be structured and how they should behave.
2597.68 -> Number two, create SDKs for different languages
2602.1 -> to make use of those SDKs to create signals based
2605.43 -> on the specifications that we defined.
2607.16 -> Number three, create an agent
2609.65 -> to be able to deal with those signals
2612.71 -> and send to any destination.
2614.1 -> So with OpenTelemetry project,
2616.48 -> when it's mature, right now,
2617.93 -> the tracing specification SDK and the Collector,
2620.68 -> that's all GA, it's available for you to use,
2623.05 -> metrics is almost GA, it's available.
2626.69 -> But the Prometheus support is already there.
2628.501 -> Prometheus metric support is there.
2630.474 -> And number three, the logs are not GA, it's not ready yet.
2634.06 -> It's still being worked upon, right?
2637.707 -> It would also solve two main problems, right?
2640.64 -> One is the ability to simply replace a vendor,
2648.09 -> change anytime you want, because it's all standardized,
2651.13 -> you can simply change the configuration.
2652.57 -> You can send the traces, today you're sending to service X,
2655.73 -> and then tomorrow if you don't like it,
2656.563 -> all you have to do is change the pipeline
2658.57 -> with the OpenTelemetry Collector that they'll use
2660.16 -> and automatically the signals
2661.34 -> are going to a different destination.
2662.45 -> The applications can be left untouched.
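The "change the pipeline, leave the application alone" idea can be sketched in Python by modeling a Collector configuration as a plain dictionary that mirrors the usual receivers / exporters / service.pipelines layout. This is an illustration, not a real Collector API: real configurations are YAML files, the helper `switch_trace_exporter` is hypothetical, and the `otlphttp` endpoint URL is a placeholder.

```python
import copy

# Applications send OTLP to the Collector; only the exporter side names a destination.
BASE_CONFIG = {
    "receivers": {"otlp": {"protocols": {"grpc": {}, "http": {}}}},
    "exporters": {"awsxray": {"region": "us-west-2"}},
    "service": {
        "pipelines": {
            "traces": {"receivers": ["otlp"], "exporters": ["awsxray"]},
        }
    },
}


def switch_trace_exporter(config: dict, name: str, settings: dict) -> dict:
    """Return a copy of config whose traces pipeline uses a different exporter."""
    new_config = copy.deepcopy(config)
    new_config["exporters"] = {name: settings}
    new_config["service"]["pipelines"]["traces"]["exporters"] = [name]
    return new_config


if __name__ == "__main__":
    # Placeholder destination standing in for "service Y".
    moved = switch_trace_exporter(
        BASE_CONFIG, "otlphttp", {"endpoint": "https://collector.example.com"}
    )
    print(moved["service"]["pipelines"]["traces"])
```

The receiver section, which is what instrumented applications talk to, is identical before and after the switch; that is the sense in which the applications are left untouched.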
2665.52 -> The second problem that it's solving is correlation.
2669.719 -> It's very hard to do correlation,
2671.68 -> even if you're using just one vendor, one service provider,
2674.58 -> but when you're using multiple vendors,
2676.62 -> multiple service providers, it's even harder.
2678.49 -> It's almost next to impossible.
2680.16 -> So that's the other problem that OpenTelemetry project
2683.4 -> is trying to solve is trying
2684.74 -> to add context between these signals,
2687.71 -> so in order for you to connect and correlate these signals,
2692.25 -> when some incident happens, so you can reduce MTTR,
2694.57 -> which is mean time to resolution, right?
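One common way this shared context shows up in practice is a trace ID that travels with each request and is stamped onto log lines. The Python sketch below uses the W3C Trace Context `traceparent` header layout (version, trace ID, span ID, flags), which OpenTelemetry can propagate; the helper names and the JSON log shape are assumptions made for this illustration, not part of any SDK.

```python
import json
import secrets


def new_traceparent() -> str:
    """Create a traceparent value: version 00, random IDs, sampled flag 01."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"


def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its four fields, checking field lengths."""
    parts = header.split("-")
    if len(parts) != 4:
        raise ValueError("traceparent must have four dash-separated fields")
    version, trace_id, span_id, flags = parts
    if len(trace_id) != 32 or len(span_id) != 16:
        raise ValueError("unexpected trace or span ID length")
    return {"version": version, "trace_id": trace_id, "span_id": span_id, "flags": flags}


def log_line(message: str, traceparent: str) -> str:
    """Emit a JSON log line carrying the IDs that tie it to a trace."""
    ctx = parse_traceparent(traceparent)
    return json.dumps(
        {"message": message, "trace_id": ctx["trace_id"], "span_id": ctx["span_id"]}
    )


if __name__ == "__main__":
    header = new_traceparent()
    print(header)
    print(log_line("payment failed", header))
```

With the same `trace_id` on the trace and on the log line, a backend can jump from a slow span to the logs written while it ran, which is where the time saved during an incident comes from.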
2697.28 -> What is AWS doing with it?
2699.536 -> AWS is basically redistributing the OpenTelemetry project.
2706.842 -> It's called AWS Distro for OpenTelemetry.
2709.18 -> So what we do is every single line
2710.57 -> of code that we write goes
2711.49 -> into the main upstream OpenTelemetry project.
2713.9 -> We take it, we have it go through the AppSec process
2717.93 -> to make sure it is secure and it works the way we want,
2720.77 -> and it goes through rigorous testing internally,
2722.83 -> and we redistribute it.
2724.927 -> And it's not only that we are doing, by the way,
2728.442 -> if you do that, if you use the OpenTelemetry,
2730.09 -> ADOT in short, Collector, you also get AWS support.
2734.7 -> But we are not just redistributing,
2736.52 -> but we are also contributing.
2738.09 -> So we have been actively working with the community.
2739.95 -> We are part of the CNCF community
2742.22 -> that defines specifications, works
2743.89 -> on different parts of the component.
2747.43 -> We have contributed to several exporters
2749.94 -> and receivers and so on,
2751.54 -> exporters are basically ones that export the signals
2753.35 -> to different destinations and receivers are the ones
2755.17 -> that's receiving signals in different format and so on.
2757.85 -> So we've been actively working and making,
2761.88 -> we've been trying hard to make it really easy
2764.06 -> for you to deploy the OpenTelemetry Collector as well.
2768.29 -> For example, if you've seen the Amazon ECS console,
2771.82 -> when you create a task definition today,
2773.14 -> you can simply go and specify that,
2775.72 -> yes, I want to collect traces, it's just a check box
2777.71 -> and you select it, and it'll ask you, okay,
2779.51 -> so what format are you collecting traces?
2781.34 -> Are you using the X-Ray SDK or the OpenTelemetry SDK?
2783.523 -> And you can select that, and then you can select that,
2786.78 -> okay, after that, it'll automatically be coming.
2788.89 -> Same thing with the metrics too.
2790.6 -> You can simply go check a check box.
2792.68 -> And it's all very easy for you to deploy.
2794.89 -> All you have to do is just specify, give those inputs,
2797.18 -> and we will take care of deploying the Collector,
2799.14 -> configuring it, and sending where you wanna send.
2801.74 -> We also have a trimmed down version
2803.95 -> of the OpenTelemetry Collector published
2805.5 -> as a Lambda layer that you can deploy as a Lambda extension,
2808.94 -> and that would help you to collect traces
2811.18 -> and also metrics in certain languages
2813.33 -> and framework to send to different destinations.
2817.312 -> And we also are working on making it even easier
2821.62 -> for you to use and deploy on EKS as well.
2824.97 -> And that's this, the OpenTelemetry Kubernetes Operator.
2828.68 -> So we have something called ADOT,
2830.347 -> OpenTelemetry ADOT Operator.
2832.69 -> So, which is a Kubernetes operator
2834.46 -> that allows you to deploy the OpenTelemetry Collector
2837.58 -> through the operator
2838.413 -> so the operator can manage the resource for you.
2840.66 -> So it is basically you giving,
2843.04 -> handing the controls over to the operator
2844.95 -> so it can automatically manage
2846.4 -> while you are specifying the custom resource
2849.33 -> and how you want the operator to deploy and so on.
2852 -> So one really, really interesting advantage
2856.01 -> of using this OpenTelemetry or the ADOT operator
2858.65 -> is that if you have a large EKS cluster,
2861.99 -> and if you have let's say 10,000 targets
2864.43 -> where you wanna scrape all the metrics,
2867.02 -> it's a busy environment.
2868.46 -> So at that point, what about the Collector's health
2873.33 -> and performance and what about the availability
2875.77 -> of the Collector itself?
2876.77 -> That becomes a challenge, right?
2878.04 -> So with the operator what you can do
2880.23 -> is you can deploy it as a StatefulSet, for example,
2882.85 -> and you can tell the operator to deploy the Collector
2886.52 -> in a StatefulSet manner, as a StatefulSet,
2889.6 -> and say that you want to deploy,
2891.28 -> let's say five copies of my Collector,
2893.42 -> and it will deploy five copies of the Collector.
2895.81 -> It'll equally shard the load between these Collectors.
2899.2 -> So in our example of 10,000 targets,
2903.6 -> it's 2,000 targets per Collector, and it's equally divided.
2906.83 -> And so, you have kind of the load that is shared.
2911.84 -> So not one particular Collector
2913.93 -> or one particular pod in this case
2915.52 -> is overloaded and it's kind of dying
2917.16 -> and that's an issue and so on.
2918.78 -> So it makes it a lot easier.
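The arithmetic behind the 10,000-target example can be sketched in a few lines of Python. This is a simple round-robin split, chosen because it divides the targets exactly evenly; it is not a description of the algorithm the operator actually uses, and the target names are hypothetical.

```python
def assign_targets(targets, collector_count):
    """Split scrape targets across collectors round-robin, in sorted order."""
    shards = [[] for _ in range(collector_count)]
    for index, target in enumerate(sorted(targets)):
        shards[index % collector_count].append(target)
    return shards


if __name__ == "__main__":
    # Hypothetical pod endpoints standing in for 10,000 scrape targets.
    targets = [f"pod-{n:05d}:9090" for n in range(10_000)]
    shards = assign_targets(targets, 5)
    for collector, shard in enumerate(shards):
        print(f"collector-{collector}: {len(shard)} targets")
```

Sorting first makes the assignment deterministic, so every replica would compute the same split from the same target list. A real allocator also has to cope with targets and Collectors coming and going, which this sketch ignores.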
2919.84 -> So by the way, today, it's 12:20,
2924.66 -> so as of 10 o'clock, if you saw the EKS console,
2928.1 -> we also have added the ability
2930.69 -> for you to deploy the ADOT Operator as an EKS add-on.
2934.44 -> So you can simply deploy it as an add-on,
2936.84 -> and then you can specify the CRD
2939.84 -> and the Collector will be there.
2942.58 -> So in the next couple of hours,
2945.64 -> we will update the documentation.
2948.25 -> There'll be a blog post that will go out, and so on.
2949.85 -> So it's one of the ways for us
2955.38 -> to make collecting metrics and traces
2957.73 -> from Kubernetes environments easy, all right?
2960.38 -> Then putting all of that together,
2962.92 -> you basically have flexible options, right?
2966.48 -> Whether you're running your workload
2968.96 -> on premises environment or in AWS environment,
2972.087 -> whether it's using EC2 or EKS or Lambda or anything,
2977.38 -> you have different options.
2979.2 -> And also, even if you use two different solutions,
2982.01 -> like maybe you're using CloudWatch for logs
2983.67 -> and Prometheus for metrics,
2985.48 -> it's all possible to put all of that together
2987.69 -> and make sense through maybe something like Grafana,
2990.84 -> for example, right?
2992.06 -> You have different data sources that Grafana supports
2994.29 -> that you can actually visualize and make use of.
2995.77 -> So putting all of it together,
2998.14 -> you can choose AWS native observability solutions
3002.25 -> or open source solution for the need that you have,
3005.65 -> but you have options to pick and choose which one you want,
3009.36 -> but you can also have a mix of both
3011.28 -> and put all that together
3012.14 -> and still make sense of your environment.
3015.99 -> Having said that, I'm handing over to Rich, thank you.
3023.529 -> Thank you, sir.
3024.88 -> I trust everyone can still hear me.
3026.33 -> I can hear me, that's good.
3028.03 -> Okay, next steps, so this is a lot of information.
3030.18 -> There's a lot of options.
3032.842 -> We're not gonna leave you completely hanging.
3034.88 -> We have a workshop along
3036.94 -> with 17 different modules contained therein
3040.13 -> that takes you through everything
3041.43 -> that we've talked about here today.
3043.51 -> And the workshop is in five different languages.
3045.77 -> So the URL is a little bit truncated there.
3047.75 -> It's catalog.workshops.aws/observability.
3052.55 -> Now this workshop is our go-to tool
3056.78 -> for helping our customers learn their way
3059.35 -> through these services.
3060.25 -> And it's not just limited to the cloud native services.
3063.33 -> We have modules
3064.42 -> for the open-source managed services as well.
3067.58 -> It's a fantastic resource.
3068.85 -> I highly encourage everybody to go visit it
3072.96 -> and try out the content for yourself.
3076.14 -> Now, the workshop, these tools
3079.6 -> that we've discussed and observability
3081.83 -> on the whole doesn't exist entirely on its own.
3084.233 -> It's part of a larger cloud operations story.
3089.02 -> And we'll typically just refer to this as Cloud Ops
3091.68 -> when we are here inside of AWS.
3093.4 -> So the question is why operate these things
3096.44 -> in the cloud in the first place?
3098.27 -> Why not just use ISV solutions
3102.27 -> that you operate yourself, COTS solutions?
3105.74 -> And the answers are really displayed here.
3107.87 -> So enterprises, they typically wanna be in the cloud.
3110.46 -> A lot of them are in the process
3111.91 -> of doing their digital transformation
3113.76 -> and they're asking us for help.
3115.91 -> But sometimes when they hear about our services,
3117.75 -> they can find it to be complex.
3120.41 -> And that slows them down from quickly understanding
3122.81 -> how AWS can help.
3125.53 -> The Cloud Ops approach, it's our solution
3130.82 -> and also a global marketing initiative
3133.04 -> for what enterprises need to do
3135.51 -> in order to build and operate successfully in the cloud.
3139.63 -> And why are they choosing AWS for Cloud Ops?
3142.27 -> And this is some of the reasons here.
3143.73 -> I particularly like to zoom in on the carbon savings.
3146.45 -> So when you move into AWS, we operate at a scale
3150.15 -> that again, most customers just can't do on their own,
3153.053 -> just the economies of scale.
3154.89 -> And this helps us do things
3156.64 -> like reduce your overall carbon footprint
3159.16 -> for your IT operations by up to 88%.
3166.09 -> Now, I could talk a lot about how enterprises
3167.94 -> are increasingly turning to the cloud
3170.32 -> to achieve their business outcomes.
3173.09 -> And there are large portfolios of legacy applications
3177.3 -> that are likewise coming into AWS.
3180.36 -> And this includes Intel based processes as well,
3182.96 -> and data centers that wanna have much more agility
3187.89 -> than they might have had previously.
3190.07 -> And there's this misconception
3192.02 -> that you need to sacrifice a lot of your governance
3196.03 -> in order to achieve agility.
3197.68 -> And what we've done as our guiding principle,
3199.52 -> our North Star
3200.87 -> when we design these cloud operation services
3204.18 -> is to make sure that you don't have that sacrifice
3206.77 -> of agility when you build out your governance solution.
3211.72 -> So we have a depth of experience with cloud operations
3216.46 -> that we like to share with our customers
3218.47 -> as we build our products and services.
3223.13 -> This would be another view of this same process,
3226.22 -> but a little bit more of a flow model.
3227.53 -> So on the left, we see people, process and technology
3231.75 -> as inputs into the automation and security
3236.28 -> that allows you to enable,
3237.68 -> sustain, operate your environment.
3240.33 -> And the output of this approach
3242.9 -> are gonna be the controls that you need,
3246.58 -> the agility that you want,
3248.74 -> the ease of use that helps you to use the platform
3252.43 -> in a very fast and effective manner.
3255.52 -> This is the AWS view on cloud operations.
3259.58 -> This is our goal is to enable your people, your processes,
3263.44 -> and your technology to be enabled, to be secure, to grow,
3268.05 -> to migrate in a healthy and satisfactory way into the cloud,
3272.53 -> and to then to continue to operate
3274.55 -> and continue to evolve your workloads once they are there,
3277.86 -> giving you better business outcomes in the end.
3283.87 -> So we highly encourage people to visit these two links.
3289.06 -> And I'll leave this slide up here for just a moment
3290.58 -> for those on your phones.
3292.15 -> So the AWS Skill Builder, it's free courses.
3295.22 -> You don't have to pay, it's super easy to access.
3298.5 -> We have video training, we have lab material.
3303.55 -> It's a fantastic resource
3305.2 -> for everybody in your organization.
3307.23 -> And there is of course the certification path as well.
3310.11 -> There are a lot of benefits to being AWS Certified.
3312.63 -> We do have a whole training
3314.04 -> and certification part of the organization that spends a lot
3318.71 -> of their time building high quality training.
3321.38 -> And then if you choose to become AWS Certified,
3324.24 -> it's industry recognized, it stands out.
3327.44 -> It's a very important thing for a lot of employers
3330.95 -> when they're going to the field
3332.66 -> and trying to find resources themselves.
3334.71 -> But we are always refreshing these certification processes
3338.7 -> as well, so there are new exam guides that are coming out.
3341.85 -> There are new versions of the exams
3343.43 -> that are coming out over time.
3346.42 -> We encourage people to be certified,
3349 -> to maintain their certifications.
3352.68 -> And otherwise just use the QR codes,
3355.468 -> and go find this for yourselves.
3358.07 -> And that would be it for our content today.
3359.99 -> I wanna thank you all for hanging out with us.
3362.47 -> This was fantastic.
3363.862 -> (upbeat music)

Source: https://www.youtube.com/watch?v=or7uFFyHIX0