AWS Summit SF 2022 - Full-stack observability and application monitoring with AWS (COP310)

Aug 16, 2023

Downtime in applications and services costs money. It is critical to resolve issues as close to real time as possible. Seeing how your applications and infrastructure are doing will help you achieve service-level objectives (SLOs) and improve availability, reliability, and performance. In this session, learn how you can use AWS to monitor and observe your applications, workloads, and resources; make data-informed decisions; and drive your business outcomes. Whether your application is in the cloud, on premises and moving to the cloud, or built on open-source technologies, AWS offers a full-stack observability solution that is fully managed, cloud-native, hyperscale, and includes application performance monitoring (APM) as well as managed open-source solutions.

Content

0.171 -> (upbeat music)

8.59 -> Hello, thank you for joining me here today.

10.99 -> I'm Rich McDonough.

12.16 -> I'm a Senior Worldwide Specialist Solutions Architect.

17.29 -> Oh, are you unable to hear?

20.55 -> Let's give people one, you're good now?

21.96 -> Excellent, okay, I'll start over (laughs).

23.871 -> Thank you for joining me here today.

24.73 -> I'm Rich McDonough.

25.563 -> I'm a Specialist Solutions Architect

27.05 -> with Amazon Web Services.

28.91 -> And we're gonna talk today about full-stack observability

32.48 -> and application monitoring with AWS.

35.71 -> First, a little preview of the agenda.

37.74 -> We're gonna talk about what observability is.

40.38 -> This is a very important discussion for us to have

42.503 -> to make sure that nomenclature is correct,

44.94 -> to make sure that we're all talking about the same thing.

47.89 -> We're gonna talk about strategies

49.73 -> and service level objectives.

52 -> This is key for everything that we do.

54.13 -> We don't monitor things just

56.25 -> for the sake of monitoring them.

57.46 -> We're gonna talk about why we monitor, what we monitor.

60.796 -> And then we're gonna talk about cloud native patterns

63.64 -> that we see every day with our customers.

65.88 -> These would be things that involve a number

67.91 -> of our cloud native services.

69.44 -> We'll walk through these as we go.

72.21 -> Then I'm gonna hand things over to my partner, Imaya,

75.24 -> who is going to discuss open-source patterns

78.1 -> that exist in the same realm.

80.44 -> This is definitely not a case

82.54 -> where you have to choose either cloud native or open-source.

86.85 -> This is definitely a better together story.

89.62 -> We'll talk about next steps.

90.93 -> And then we're gonna talk a little bit

92.24 -> about how this is part of a larger story

95.06 -> around cloud operations.

97.92 -> So first, what is observability?

101.97 -> We like to think about observability

104.32 -> as being somewhat more than just monitoring.

109.72 -> So when we talk about level setting on nomenclature,

113.51 -> observability really means understanding

115.62 -> what your systems are doing based on the telemetry

120.1 -> that your systems emit.

122.18 -> This is just a fancy way of saying,

123.63 -> you know what your servers or your applications are doing

126.81 -> and how they're doing it for how long and how well.

129.84 -> You can draw an analogy from a car.

132.01 -> You'll know if your car engine is on

134.15 -> because it's gonna be warm, it's gonna be humming.

136.15 -> You'll be driving down the highway.

137.56 -> With observability though,

138.94 -> you're gonna understand more about the car itself.

142.69 -> You'll have the speedometer, the tachometer, oil pressure,

148.05 -> all of these things that indicate as signals

150.97 -> what is happening inside of your actual vehicle.

155.15 -> Now for our purposes, we don't typically collect data

157.42 -> in terms of oil pressure or miles per gallon.

160.42 -> We certainly can for some customers.

163.07 -> But we typically talk more

164.97 -> about basic data types that we collect.

168.4 -> These come in the form of logs, metrics and traces.

173.11 -> You'll often, including today,

174.93 -> hear logs, metrics and traces described

176.96 -> as the three pillars of observability,

180.12 -> and our customers build out

181.67 -> their full-stack observability practices

184.73 -> based on these data types.

188.35 -> Now, in order to maintain

190.5 -> and evolve your operational excellence

193.86 -> and continually meet your business objectives,

196.03 -> and that's really the important thing,

197.64 -> is to make sure business objectives are met all the time,

201.59 -> you need to understand how to do this instrumentation.

207.84 -> Now, monitoring itself is a little bit simpler

211.45 -> depending on your viewpoint.

214.02 -> And we talk a lot about the three pillars of observability.

216.98 -> We'll get into that more in just a few minutes.

220.758 -> It's important to disambiguate though.

222.28 -> When we talk about monitoring,

223.81 -> it's very often more in the context of red light,

226.51 -> green light, is a system on or is it off?

230.23 -> Is it in an error state or is it in a good state?

233.19 -> And you miss a little bit of nuance.

235.04 -> So today I'm gonna talk a lot

236.87 -> about the example of an e-commerce application.

240.613 -> It's ubiquitous, obviously we have a lot of experience

244.57 -> operating e-commerce applications and websites,

248.73 -> and I'm gonna come back

249.563 -> to this example a few times for context.

254.47 -> But you have to ask the question,

256.29 -> how do you know if your application is down?

260.07 -> When does simply monitoring not quite fill

262.83 -> in the gaps well enough?

264.81 -> So for example, let's say your e-commerce site relies

267.85 -> on a third party payment processor.

271.4 -> If your e-commerce application

273.19 -> is for all intents and purposes working,

275.44 -> you can add products to a cart,

277.21 -> you can go through all the process

279.43 -> of proceeding to a checkout,

281.77 -> and then your purchase fails at the end.

285.49 -> Is your site actually working?

286.91 -> I'll make the case that it is not,

288.5 -> but you need to be able to understand why it is not working

291.3 -> and how to fix it.

292.58 -> In order to do so, you need more data.

295.18 -> In order to get this data, you need signals,

297.897 -> and that of course leads back to observability.

301.98 -> The three pillars of observability,

304.33 -> logs, metrics and traces,

307.06 -> these have important characteristics.

309.24 -> And on this slide, we have, pardon me,

312.6 -> we have not just AWS Services,

315.41 -> but also a little bit of an open-source service mentioned

319.41 -> over in the metrics section.

322.09 -> These services have some properties in common.

326.52 -> Logs are sequence aware, this is important.

331 -> You need to make sure that your logging data

332.61 -> is received in the right order.

334.32 -> You need to ensure that people aren't able to inject data

337.67 -> in the wrong order.

338.89 -> That way, you can reconstruct a series of events

341.3 -> and see what is actually happening in your environment.

344.7 -> Metrics are obviously crucial, we'll come back

347.84 -> to the most common metric people usually pick on,

350.53 -> so things like CPU utilization.

353.92 -> This is another signal that you use

355.42 -> to understand what is happening in your environment.

357.46 -> And then there's application traces, which is very often

360.3 -> where we find our customers need the most help.

363.21 -> Your app, as it's running,

365.23 -> should ideally tell you what it is doing

368.09 -> and how long it's taking to do it,

370.21 -> how well it's actually doing it.

372.27 -> This will include data like partners

374.43 -> that you've integrated with,

376.12 -> what the response time was from those integrations,

379.05 -> what the status codes were as a result,

381.37 -> and so on, and other deep operational details

384.5 -> of what has actually taken place.

389.03 -> So what you see here is a view

392.28 -> of all of the components that can be used

395.99 -> to build your observability full stack solution.

400.15 -> You don't need to use all of them,

403.894 -> but you need to use the ones

404.727 -> that help serve your business objectives.

408.62 -> So building an observability practice, it requires tools.

411.63 -> It doesn't just happen.

413.57 -> Here we see both the cloud native side

416.89 -> and the open-source managed side of the fence.

420.27 -> I talk about this diagram usually from the bottom up,

423.79 -> starting with the Collectors.

425.71 -> So to understand your workload, you need data.

428.35 -> That data comes from an agent.

432.31 -> On the bottom left hand side, we have the CloudWatch agent,

435.87 -> which is used very commonly by millions of customers

439.43 -> to collect logs and metrics

443.72 -> from applications, from servers, from EC2 instances,

449.06 -> and from other applications you might be running

452.93 -> in your environment. Containerized applications

454.87 -> can use the CloudWatch agent as well.

457.62 -> There's also the X-Ray agent,

459.07 -> which is used exclusively for application trace data.

463.61 -> This data is then fed up into CloudWatch

467.81 -> in the form of metrics, logs, traces, et cetera.

471.95 -> And as we move further up the stack,

475.08 -> we have services that are higher level and can do insight

480.07 -> and analysis across broader sets of applications.

483.97 -> You see this insights tier, Container Insights,

486.69 -> Lambda Insights, Contributor Insights.

490.01 -> These services can give you a broad view

492.98 -> across an entire application stack all at once

496.52 -> and it can tell you if you have an impact

499.24 -> across your entire environment

501.61 -> and let you drill down more narrowly

503.37 -> to see where an actual problem is taking place.

507.23 -> On top of all of this is ServiceLens,

509.55 -> which is a node view of your application stack complete

515.21 -> with your response time with the services that speak

519.07 -> to other services, how well they're doing so,

522.24 -> and with what frequency.

523.73 -> It will tell you data like the response time

526.87 -> from a particular application in your environment.

530.92 -> Moving over to the open-source side,

533.25 -> there are some very similar tools.

536.18 -> Over on the bottom right hand side,

538.4 -> we have the AWS Distro for OpenTelemetry.

541.93 -> This is a distribution of OpenTelemetry

546.86 -> that Amazon Web Services maintains

549.16 -> as an open-source project.

551.01 -> And people use this as part

553 -> of an OTel distributed tracing framework

558.8 -> to send data into both AWS X-Ray,

563.51 -> which is a part of CloudWatch,

565.57 -> or into other services such as OpenSearch

568.86 -> or into Zipkin and Jaeger.

571.29 -> Then customers often have their own do it yourself

574.67 -> or ISV product that may sit

577.93 -> in between Amazon Managed Grafana

581.31 -> and the rest of this data that is being collected.

584.42 -> There is definitely not one size fits all

587.33 -> when it comes to an observability story,

589.58 -> but these are the tools that you can use to compose the story.

594.83 -> So now let's flip the view on its side for just a moment.

598.02 -> We're gonna focus more on the cloud native side

600.92 -> for the next little while.

603.7 -> If you take all those previous services

605.92 -> and view them in terms of where you would implement them

608.84 -> in your environment,

611 -> infrastructure belongs to the primitives,

614.1 -> so logs, metrics, traces, dashboards for visibility.

618.96 -> Your application level observability itself

621.43 -> is going to come from insight services.

623.75 -> Again, able to correlate data

625.47 -> from across a wide set of applications,

629.29 -> Contributor Insights, Application Insights,

631.45 -> Container Insights, particularly powerful

634.37 -> for a distributed application,

636 -> where you need to have a composed view

638.74 -> of the state of your environment, of your health.

643.1 -> And then we have digital experience monitoring as well,

645.8 -> an even higher level superset that has the ability

650.5 -> to measure end users' actual interactions

654.09 -> with your application,

656.05 -> quantify these actions as positive or negative,

660.24 -> and then give you meaningful data from the field

663.88 -> about how well your app is actually performing

666.64 -> in a customer's own web browser.

668.85 -> We also have Evidently,

670.63 -> which can be used for feature launches,

673.51 -> A/B tests, experimentation, and Synthetics,

676.84 -> which is a way of doing actual synthetic browser testing.

680.25 -> So the aforementioned e-commerce example,

682.66 -> a Synthetics canary would probe an e-commerce site,

687.3 -> it will add an item to a cart, it could perform a login.

689.94 -> It could even perform a checkout if you so desire.

696.06 -> Now, this is where we have to talk about strategy.

700.8 -> There's two basic approaches that you take, pardon me,

706.46 -> when you begin looking at your observability solution.

710.95 -> Do you wanna watch things from the outside in

714.36 -> or do you wanna watch them from the inside out?

717.473 -> And there's reasons why you might choose one over the other.

721.5 -> Now, these reasons of course are tied directly back

725.17 -> to your service level objectives

727.45 -> and what matters to your business.

731.03 -> I'm gonna come back

731.863 -> to the CPU utilization example again and again,

734.27 -> because it's a great example

735.77 -> of where people sometimes look at the wrong data

739.56 -> and they may be alerting on something that doesn't matter,

741.94 -> depending on the workload.

743.167 -> And we wanna make sure that you are looking

744.87 -> at the right things.

749.98 -> So the first question you have to ask

752.34 -> when looking at a full stack observability solution,

755.25 -> what does good look like?

757.06 -> Is good measured in terms of sales,

760.48 -> is it a very low response time, or is good something

766.23 -> that maybe means something very different

768.46 -> for your application?

769.73 -> Perhaps it's a batch processing workload

771.8 -> that needs to generate reports across millions of records.

775.6 -> Good doesn't look the same for everybody,

777.83 -> but regardless, you can use the same tools

780.55 -> to obtain that view of what good looks like.

784.68 -> So e-commerce example, page response times,

787.8 -> probably a very good metric for you to watch,

790.47 -> because it is tightly correlated with things like sales.

795.04 -> So you've probably heard of the rule of three,

798.9 -> which is anecdotal,

801.27 -> I think it still holds some validity though.

804.31 -> If your webpage doesn't load in under about three seconds,

808.09 -> about half of your users will go away,

810 -> will go visit another site.

811.87 -> This is important, this is why response time matters,

815.21 -> particularly when you're having an outside in strategy

818.11 -> for full-stack observability.

820.57 -> What does good look like?

821.51 -> Good looks like customers having a webpage fast,

824.07 -> having a good experience and using your application,

826.7 -> having a delightful experience.

829.01 -> Failed purchases is another excellent example.

831.84 -> This is the kind of thing that you need to measure.

835.7 -> And I would ask the question,

837.06 -> how does monitoring your disk space

839.84 -> or your CPU really tell you anything about failed purchases?

844.35 -> It may or it may not.

845.66 -> It depends on how you're implementing your solution.

849.16 -> Now, going over to a batch processing workload,

851.88 -> let's say you have one that processes millions

854.4 -> of customer health records per night,

857.48 -> and it has a complex landscape of internal APIs

860.52 -> that must integrate with internal databases,

863.25 -> perhaps even partner integrations that have to take place.

867.37 -> The things that might matter more for you

869.31 -> could include things like looking for slow SQL queries.

872.64 -> That could be a very important indicator

875.4 -> of your application outcomes being at risk,

879.67 -> looking for integration health,

881.6 -> again, a key metric.

884.68 -> And then we have containerized applications.

887.97 -> What happens if you have a workload that doesn't need

892.31 -> to care about CPU usage?

894.23 -> Or if a disk fills up, maybe it doesn't matter,

897.59 -> and maybe you don't need to spend time looking at that

899.033 -> and you should be building your SLOs

902.73 -> and your observability strategy against things

905.41 -> that are important.

907.14 -> In either case, the outcome remains the same.

911 -> You should observe what matters to your business.

914.82 -> We don't monitor or observe just

916.68 -> because we're diligent DevOps engineers or sysadmins.

920.9 -> We have a stake in seeing successful business outcomes.

928.19 -> So I gotta reiterate this point, ooh, went too far,

932.06 -> business outcomes shape your approach

934.31 -> to observability and monitoring.

939.18 -> So I'm gonna challenge you for just a moment here.

941.46 -> Can you think of an example where you would want

944.26 -> to have an application with as high a CPU utilization

948.41 -> for as long as possible?

951.38 -> 'Cause I absolutely can, and I've had workloads

954.67 -> that I've needed to do this with.

956.33 -> Again, the batch processing workload,

958.5 -> the kind of thing that needs to run for hours and hours,

963.12 -> but maybe it's running on a single threaded application.

966.04 -> Maybe it's one where throwing a larger instance size at it

970.61 -> to make it run faster

971.89 -> is actually going to be counter productive

975.29 -> and could make your workload cost more money per run,

979.33 -> in which case your service level objective

982.02 -> and the things that you need to monitor

983.94 -> are the ones that make sure your CPU is well utilized,

987.16 -> that you're getting the most value out

989.32 -> of your compute cycles.

991.37 -> In this case, this is actually looking

992.7 -> for the exact opposite of what you would

994.94 -> with a typical external-facing e-commerce application.

999.94 -> What you monitor has to be important

1002.34 -> for your business objectives.

1005.14 -> So I've seen this example in the wild.

1007.1 -> This is real and it's especially true

1009.41 -> with containerized applications

1011.02 -> where you're measuring more for the outcome

1013.96 -> and less for what all the specific signals may be.

1022.89 -> So we refer to both signals and insights

1026.41 -> when discussing observability,

1028.18 -> and the goal is to always help you make informed,

1030.75 -> data-driven decisions with these signals as a data point.

1036.75 -> When you can draw a line from a slow application

1040.84 -> due to a CPU that has been overloaded

1044.23 -> on an overtaxed server,

1045.53 -> which in turn leads to poor webpage response time,

1048.75 -> and that in turn decreases your customer's sentiment

1051.9 -> and jeopardizes your SLAs, your sales and other objectives,

1055.75 -> now you're in the right head space

1057.22 -> to talk about full-stack observability.

1059.79 -> Data feeds up, and stopping a problem

1062.73 -> from becoming an issue means knowing

1065.2 -> when problems are occurring.

1067.88 -> So let's get into the weeds now with a bit of an example

1070.51 -> of an outside in strategy for full-stack observability.

1077.15 -> So here you see the CloudWatch services

1083.25 -> that form a part of an observability solution.

1086.94 -> Now going through these quickly,

1088.2 -> I'd mentioned very briefly Synthetics as a tool that is used

1091.55 -> to proactively test an application using the same paths

1095.8 -> that your users do in order to see that it

1099.08 -> is behaving in an expected way.

1102.66 -> There's CloudWatch RUM,

1103.55 -> that's actually short for real user monitoring.

1105.95 -> This is used to measure what a customer or a Synthetic probe

1113.66 -> has actually experienced in a web application.

1117.96 -> So if you have a customer in a remote geography,

1121.07 -> let's say that they're 3,000 miles away,

1125.32 -> and if they have a bad response time to a webpage,

1128.29 -> but your customers locally have fast response times,

1131.25 -> that tells you something.

1132.51 -> You need to measure those customers

1134.4 -> and their outcomes and their experiences on your webpage

1138.86 -> as that's an important data point

1140.29 -> that leads to overall sentiment

1142 -> and can affect your objectives.

1145.01 -> Evidently, as mentioned previously,

1146.81 -> is used to do A/B testing, launches, and experiments,

1152.03 -> and measure the results of these experiments

1155.33 -> as experienced by real customers.

1157.51 -> So there's a definite focus here

1159.54 -> on real customer experiences.

1162.35 -> You need to measure these things

1163.6 -> in order to make meaningful decisions based on it.

1167.33 -> But then moving a little bit further over,

1169.15 -> we have metrics and Metrics Insights,

1171.45 -> well, metrics you're probably very familiar with.

1174.2 -> We've all seen graphs of disk space usage and so on.

1178.08 -> Metrics Insights is a new feature of CloudWatch

1181.61 -> that allows us to query that data using SQL syntax.

1186.26 -> It's very powerful and it lets you aggregate

1188.78 -> across thousands of different metrics all in one query.
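
As a rough illustration of the SQL-style querying described here, the sketch below builds one Metrics Insights query and wraps it in the shape that CloudWatch's GetMetricData API accepts for a metric data query. The specific query (top ten EC2 instances by average CPU) and the helper name `build_metrics_insights_request` are assumptions for illustration, not something shown in the session. Nothing is sent to AWS.

```python
import json


def build_metrics_insights_request(query: str, period_seconds: int = 300) -> dict:
    """Wrap a Metrics Insights query as one MetricDataQueries entry.

    The helper name and the default period are illustrative assumptions.
    """
    return {"Id": "q1", "Expression": query, "Period": period_seconds}


# Hypothetical query: the ten EC2 instances with the highest average CPU.
QUERY = (
    "SELECT AVG(CPUUtilization) "
    'FROM SCHEMA("AWS/EC2", InstanceId) '
    "GROUP BY InstanceId "
    "ORDER BY AVG() DESC "
    "LIMIT 10"
)

request = build_metrics_insights_request(QUERY)
print(json.dumps(request, indent=2))
```

With credentials configured, an entry like this could be passed as `MetricDataQueries=[request]`, together with a start and end time, to the CloudWatch `get_metric_data` call in boto3.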

1193.03 -> We also have anomaly detection.

1195.01 -> You feed this data into an anomaly detection model

1198.03 -> and it can help you find the data, the thresholds,

1202.99 -> these alerts that you maybe didn't even know

1205.88 -> that you needed to look for.

1208.67 -> Getting back to the question of what does good look like,

1211.43 -> if you can't always answer that question, maybe that's okay.

1214.97 -> And maybe an anomaly detection can help you figure that out.

1217.97 -> And then of course, there's logs, dashboards, obviously,

1222.38 -> and X-Ray, which is used to collect all of this trace data.

1229.86 -> And now all you do is plug in your web application

1232.46 -> and have it send this data into these services.

1237.44 -> So think of what you could do with a website

1240.43 -> that you are about to launch a new feature on,

1242.99 -> and you don't know if this feature

1244.57 -> is going to impact your sales positively or negatively,

1248.93 -> ideally positively, but you don't know.

1252.57 -> Using this outside in approach,

1255.17 -> making sure that you're watching for things

1257.24 -> at every tier of your application,

1259.72 -> you can do this measurement.

1261.09 -> You can do a safe launch

1262.14 -> with Evidently across a small subset

1264.41 -> of your audience, say 10%.

1266.92 -> You can measure the impact, you can do a correlation,

1270.95 -> and you can have a much more successful outcome

1272.81 -> with less pain as a result.

1279.92 -> So here you see some of the typical signals

1282.22 -> that would be used to create your service level objectives,

1285.34 -> keeping page load times under three seconds is ubiquitous.

1289.03 -> The common wisdom though is of course,

1291.44 -> it is somewhat anecdotal, as I said,

1293.29 -> I'll stick to the three second rule for the time being.

1299.12 -> Purchases completed successfully,

1301.34 -> this is a very important SLO,

1303.4 -> which I think a lot of people really need to lean in on.

1307.27 -> I'll make the case that if you don't have 100%

1311.43 -> of your purchases being successfully completed,

1314.05 -> there might be something that you need to look at.

1316.37 -> And maybe there's something about your workload

1317.97 -> that is in jeopardy.

1319.11 -> Maybe there's something about your audience

1321.52 -> that might be causing a problem.

1325.89 -> JavaScript and HTML errors

1328 -> directly impact the customer's experience,

1331.22 -> they absolutely do.

1332.61 -> If there are customers that are frustrated,

1334.77 -> then they are probably going to go elsewhere.
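
The front-end signals above can be turned into numbers fairly directly. The following is a minimal sketch, assuming you already have a list of page load times and a list of purchase outcomes. The sample data, the three-second threshold taken from the talk, and the helper `fraction_meeting` are all illustrative.

```python
def fraction_meeting(values, predicate):
    """Return the fraction of values for which predicate(value) is true."""
    values = list(values)
    if not values:
        raise ValueError("need at least one observation")
    return sum(1 for v in values if predicate(v)) / len(values)


# Hypothetical observations, e.g. gathered from real user monitoring.
page_load_seconds = [1.2, 2.8, 3.5, 0.9, 4.1, 2.0, 1.7, 2.9]
purchases_succeeded = [True, True, False, True, True, True, True, True, True, True]

# Share of page loads that finished in under three seconds.
fast_pages = fraction_meeting(page_load_seconds, lambda s: s < 3.0)

# Share of purchases that completed successfully.
successful_purchases = fraction_meeting(purchases_succeeded, lambda ok: ok)

print(f"pages under 3s: {fast_pages:.0%}")
print(f"purchases completed: {successful_purchases:.0%}")
```

Comparing these fractions against a target (say, 95% of pages under three seconds) is what turns a raw signal into a service level objective.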

1340.38 -> So I have to reiterate, all of these examples

1343.46 -> are related entirely to your end user's behavior

1348.44 -> and what they observe on the application.

1351.53 -> But now let's talk about what's happening in the backend,

1354.46 -> 'cause it is a different story,

1355.96 -> although it is tightly related.

1360.26 -> So you see many of the same services here.

1362.39 -> There's two that I don't have icons for.

1364.21 -> So Logs Insights and Contributor Insights.

1367.57 -> These are, in order, Logs Insights

1371.03 -> is a search engine for your logging data.

1374.83 -> When data is received by CloudWatch Logs,

1377.93 -> it can be automatically parsed and indexed,

1382.35 -> and you can look for specific values in this logging data.

1386.86 -> It has a very powerful effect of allowing you

1390.98 -> to search across gigabytes

1392.77 -> or even potentially terabytes of data in one pane of glass,

1397.03 -> all inside of the CloudWatch console.
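
To make the log search idea concrete, here is a small sketch that assembles the parameters for a Logs Insights query looking for recent error lines. The log group name is hypothetical, and the helper `build_logs_insights_query` is an assumption for illustration. The code only builds the request and does not call AWS.

```python
import time
from typing import Optional


def build_logs_insights_query(
    log_group: str, minutes: int = 60, now: Optional[float] = None
) -> dict:
    """Build parameters for a Logs Insights query over the last `minutes`."""
    end = int(now if now is not None else time.time())
    return {
        "logGroupName": log_group,
        "startTime": end - minutes * 60,  # epoch seconds
        "endTime": end,
        "queryString": (
            "fields @timestamp, @message "
            "| filter @message like /ERROR/ "
            "| sort @timestamp desc "
            "| limit 20"
        ),
    }


# Hypothetical log group for the e-commerce checkout service.
params = build_logs_insights_query(
    "/hypothetical/checkout-service", minutes=30, now=1_700_000_000
)
print(params["queryString"])
```

With credentials configured, these parameters match what the CloudWatch Logs `start_query` call in boto3 expects. Results are then fetched separately with `get_query_results`.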

1400.71 -> Contributor Insights is in many ways

1402.47 -> what I call the unsung hero of CloudWatch.

1405.43 -> So this allows you to look for your top talkers

1408.9 -> in your environment,

1411.36 -> and you can do so across more than one dimension.

1414.17 -> So as an example, if you were to stream some

1417.61 -> of your application logging data,

1418.92 -> that includes the country in which a customer lives

1423.66 -> and the city in which a customer lives,

1426.28 -> imagine if you could have a time series graph

1429.83 -> that shows you how many of your customers come from where,

1433.26 -> and you can watch this graph as it proceeds through time,

1437.16 -> and you can alarm on it as well, you can create thresholds.

1439.87 -> If you have a sudden drop in customers coming from Australia

1445.04 -> or coming from Georgia,

1447.75 -> that might be something that you need to have visibility of.

1452.75 -> Contributor Insights can do this for you.
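
The "top talkers across more than one dimension" idea can be sketched in plain Python. This is not the Contributor Insights API, only an illustration of the counting it performs. The event fields `country` and `city` follow the example in the talk, and the sample events are made up.

```python
from collections import Counter


def top_contributors(events, keys=("country", "city"), n=3):
    """Count events per combination of `keys` and return the n largest."""
    counts = Counter(tuple(event[k] for k in keys) for event in events)
    return counts.most_common(n)


# Hypothetical parsed application log events.
events = [
    {"country": "AU", "city": "Sydney"},
    {"country": "AU", "city": "Sydney"},
    {"country": "US", "city": "Atlanta"},
    {"country": "AU", "city": "Melbourne"},
    {"country": "US", "city": "Atlanta"},
    {"country": "AU", "city": "Sydney"},
]

ranked = top_contributors(events)
for contributor, count in ranked:
    print(contributor, count)
```

Running the same count per time window gives the time series described above, and a threshold on any one contributor's count is the kind of thing you would alarm on.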

1456.5 -> So the things that will feed into this,

1459.83 -> I'm showing just the AWS Services here,

1463.307 -> but they're unsurprising.

1464.91 -> The mechanisms for sending this data

1466.55 -> will vary a little bit depending on your technology stack.

1469.86 -> Anything that runs on a server, so an EC2 instance,

1474.6 -> it can use the CloudWatch agent

1475.93 -> to send logging data and metrics.

1478.391 -> AWS native services like RDS

1482.13 -> will send their data into CloudWatch

1483.46 -> without requiring any special configuration.

1487.18 -> And likewise with Lambda functions,

1488.71 -> Lambda has a very easy path

1490.31 -> to send its logging data into CloudWatch.

1495.79 -> The three core data types to observability,

1498.15 -> the aforementioned logs, metrics and traces,

1501.15 -> they're still used to build your practices on.

1503.5 -> The CloudWatch alarms and anomaly detection

1505.7 -> that I had spoken of,

1507.94 -> they work just as well across these backend signals

1510.78 -> as they do for your front-end signals.

1513.33 -> And of course, we're not forgetting about those workloads

1515.89 -> that don't run on AWS.

1519.06 -> And we'll have a little bit more to talk about

1520.94 -> in that regard in a few minutes.

1523.99 -> So examples of SLOs that are backend focused,

1528.48 -> slow queries, vitally important for most customers.

1533.42 -> These things can kill a workload

1535.46 -> and they can grind an otherwise very large

1538.32 -> and healthy database server to an actual painful halt.

1542.1 -> So having objectives that keep those targets low

1546.36 -> is typically very important.

1548.26 -> And we've already talked about CPU quite a lot.

1550.82 -> Disk usage and IOPS are also very crucial,

1553.91 -> especially on those aforementioned database servers

1556.57 -> that I keep on harping on about.

1558.77 -> Those slow queries might not be the result

1561.11 -> of bad queries, by the way.

1562.36 -> I've seen this too, where you have queries

1564.33 -> that are actually very well written, very well vetted,

1569.38 -> however, maybe the server doesn't have disk

1571.6 -> that can keep up with it.

1573.29 -> You need to understand your disk IOPS for some workloads,

1575.86 -> highly, highly important.

1577.58 -> And then, of course there's these super common KPIs,

1579.98 -> response times, errors, faults, retries.

1585.4 -> If you have error budgets that are part of your application,

1588.46 -> if that's part of your SLO story,

1590.37 -> those are key for you to watch.

1592.45 -> And these are the sorts of things that your outside in,

1596.2 -> sorry, pardon me, inside out strategy are built on,

1600.03 -> these are all internal facing signals.
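
For the error budget mentioned above, a minimal sketch of the arithmetic follows, assuming a request-based SLO. The 99.9% target, the request counts, and the helper `error_budget` are illustrative assumptions rather than figures from the session.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Compute how many failures an SLO allows and how much budget is left."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1")
    allowed = (1.0 - slo_target) * total_requests
    remaining = allowed - failed_requests
    return {
        "allowed_failures": allowed,
        "remaining": remaining,
        "exhausted": remaining < 0,
    }


# Hypothetical month: one million requests, 400 failures, 99.9% target.
budget = error_budget(0.999, 1_000_000, 400)
print(f"allowed failures: {budget['allowed_failures']:.0f}")
print(f"remaining budget: {budget['remaining']:.0f}")
print(f"budget exhausted: {budget['exhausted']}")
```

The same calculation works for faults or retries. What changes is which events count as failures against the objective.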

1604.45 -> And this would be an example, of course,

1606.08 -> of putting all of this together.

1608.86 -> This is where most people will land.

1611.35 -> You don't just need to have one story.

1613.83 -> You need to be inclusive.

1615.56 -> Your outside in story and your inside out

1617.84 -> have to come together to watch the full stack.

1620.66 -> So you'll know what's happening in every environment.

1623.3 -> So if there is a knock-on impact

1625.03 -> to one stack of your application,

1627.44 -> you'll see it affect your outcomes elsewhere.

1631.89 -> And you need to be able to understand why.

1633.81 -> Not every customer is gonna have a solution

1635.28 -> that looks like this, but a lot of you will.

1639.25 -> Now, the common element to all of this is CloudWatch

1644.18 -> in the cloud native model

1645.41 -> that we're discussing here right now.

1647.758 -> CloudWatch is highly resilient and fault tolerant.

1650.39 -> It's built for scale in ways that most customers

1653.72 -> can't actually build their own services on.

1659.11 -> And the way that we operate these services,

1660.63 -> the way that we build them,

1661.72 -> we build our services using our services

1664.36 -> and our own skillset.

1665.68 -> So CloudWatch is a very compelling offer

1668.26 -> for those who need to have a good observability story.

1671.65 -> But what about your hybrid and on-premises workloads?

1675.47 -> What about those things that just don't run in AWS today?

1679.12 -> They're not second class citizens in this journey.

1682.12 -> They're not shut out from using all of the tools.

1685.69 -> They can absolutely consume the same tooling, and why not?

1691.4 -> So hybrid, on-premises, very distributed workloads,

1696.09 -> they produce data and they speak to the internet,

1698.04 -> just like everything else does.

1700.24 -> That's really the whole basis of cloud based solutions.

1704.65 -> They deliver through the internet

1706.16 -> and with AWS, it's with the pay-as-you-go model.

1711.42 -> Now when I said internet,

1712.75 -> I of course mean that more broadly, it could be through a VPN.

1715.51 -> You don't have to go over the public internet,

1717.63 -> I would make the case, and a lot of times,

1719.392 -> you shouldn't even consider it.

1723.51 -> Or you could be taking advantage of Direct Connect

1726.26 -> and have truly private access into CloudWatch

1729.36 -> from your on-premises environment.

1732.85 -> Either way, the workloads that you operate

1735.35 -> in AWS in this model look almost exactly the same

1739.54 -> as the workloads that you operate on premises or elsewhere.

1743.06 -> And you can run the CloudWatch agent or the X-Ray daemon

1746.57 -> in a remote data center in almost the same exact way

1751.33 -> that you would in AWS, it's very low friction.

1754.48 -> So we encourage you to think of AWS and our monitoring tools

1759.61 -> as an extension of your on-premises or hybrid environment.

1768.26 -> So then, I'm actually gonna hand things over

1770.17 -> to my colleague, Imaya.

1772.12 -> He's gonna talk about the open-source story.

1780.69 -> All right, thank you, Rich.

1782.97 -> Hello, everyone, my name is Imaya.

1784.79 -> I'm a Principal Solutions Architect

1786.13 -> and I focus on open-source observability.

1791.27 -> First of all, like Rich mentioned,

1794.048 -> CloudWatch is an AWS native solution,

1795.56 -> I hope you all can hear me, is an AWS native solution

1798.55 -> that is built on top of, in other words,

1803.53 -> many dozens of AWS services send logs

1806.3 -> and metrics to CloudWatch automatically.

1809.29 -> It supports several features that are specifically built

1812.92 -> for specific workloads like containers,

1814.59 -> like Container Insights,

1815.59 -> Lambda with Lambda Insights, and so on,

1817.65 -> and Contributor Insights and all that.

1820.133 -> There's a lot of features in CloudWatch.

1823.036 -> At the same time, it was actually

1825.63 -> when it comes to open-source, last week,

1827.05 -> I was reading an article, basically a survey report

1833.581 -> where they basically talked to a lot of customers,

1835.85 -> small, medium, and large,

1836.95 -> even enterprises that have more than 30,000 employees

1839.64 -> that have large enterprise IT departments.

1842.34 -> They all wanna adopt open-source software.

1844.78 -> It's not only observability based software,

1846.94 -> but open-source software in general,

1849.16 -> but they raised three main concerns.

1852.5 -> And the number one concern is security.

1854.88 -> They are not really sure whether the open-source software

1857.6 -> is actually secure or not, whether it'll be compliant

1860.59 -> with the demands that they have in their environment.

1863.3 -> And by the way, the recent Log4j issue that,

1867.24 -> if y'all remember that in the last month or so,

1870.47 -> caused a lot of anxiety in these customers.

1873.46 -> And number two is they don't know what version

1877.04 -> of the open-source software to use

1878.71 -> and how long they should wait

1880.28 -> to upgrade to the newer version, and so on.

1883.27 -> So they had a lot of confusion there.

1885.41 -> And number three, if they opened up open-source software,

1887.93 -> everybody used their software that they wanted to use,

1891.14 -> and they really did not, the operations team

1893.95 -> has no control over what software the company's using

1897.98 -> and who to support and what happens

1900.46 -> if a vulnerability is discovered

1901.81 -> and who should maintain that, and all that.

1903.45 -> There's a lot of such confusion.

1905.72 -> However, the same thing applies

1907.46 -> to open-source observability software as well.

1912.972 -> Some of such problems or challenges are solved

1921.08 -> by the managed service,

1923.68 -> like Managed Prometheus Service, for example.

1926.26 -> Amazon Managed Service for Prometheus

1927.43 -> is a fully managed metric monitoring solution.

1930.5 -> It is fully PromQL compatible.

1932.5 -> And we have Amazon Managed Grafana,

1934.91 -> which is a fully managed visualization solution

1938.53 -> that allows you to create Grafana environments

1941.76 -> that you really don't have to manage at all.

1943.3 -> It's fully managed by AWS.

1944.71 -> And also we have AWS Distro for OpenTelemetry,

1947.4 -> which is basically a redistribution

1949.56 -> of the OpenTelemetry project, which is part of CNCF,

1953.37 -> or Cloud Native Computing Foundation, right?

1956.05 -> Let's take a look at every one of these

1958.16 -> in depth a little bit.

1959.86 -> So let's talk about Prometheus,

1961.703 -> Prometheus is a great metric monitoring solution, right?

1964.71 -> So it is very popular in the monitoring world

1968.62 -> because of its querying language called PromQL,

1972.54 -> which is extremely powerful,

1973.73 -> simple at the same time, it's very powerful.

1975.8 -> It has a really good alerting solution, Alertmanager.

1980.05 -> It can support recording rules.

1982.04 -> You can have a lot of rules to change,

1984 -> to aggregate metrics that you're ingesting.

1985.99 -> It supports high cardinality metric collection,

1988.36 -> and it has dozens and dozens of exporters.

1991.18 -> Exporters are little agents that you can run

1992.54 -> on any environment to collect metrics

1994.01 -> from specific workloads, like for example, JMX Exporter

1996.5 -> is one good example that allows you to export metrics

1999.53 -> from JVM-based applications, and Node Exporter,

2003.16 -> which basically exports metrics in Prometheus format

2005.21 -> from Linux based machines.
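
To make "metrics in Prometheus format" a little more concrete, here is a minimal Python sketch of how one sample line in the Prometheus text exposition format is laid out. The helper name and the sample metric are illustrative assumptions; real exporters such as Node Exporter generate this output for you.

```python
def _escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline inside a label value."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n"))


def render_metric(name: str, labels: dict, value: float) -> str:
    """Render a single sample as one line of exposition text."""
    if not labels:
        return f"{name} {value}"
    label_text = ",".join(
        f'{key}="{_escape_label_value(str(val))}"'
        for key, val in sorted(labels.items())
    )
    return f"{name}{{{label_text}}} {value}"


# Hypothetical sample resembling what a host-level exporter might expose.
print(render_metric("node_cpu_seconds_total", {"cpu": "0", "mode": "idle"}, 1234.5))
```

Each distinct combination of label values is a separate time series, which is why label choice drives cardinality.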

2006.38 -> And there are close to 100 such exporters

2008.95 -> that you can again simply deploy in your environment

2010.64 -> and collect metrics from, using any of the Collectors,

2014.17 -> like Prometheus servers, and so on.

2015.82 -> And it's very easy to adopt, and it's one of the reasons.

2018.79 -> Another one is it really works very well with Kubernetes.

2022.398 -> It supports Kubernetes service discovery.

2024.46 -> So it is very popular there as well.

2027.36 -> However, like I mentioned about those issues

2031.4 -> with open-source software,

2032.57 -> the same things apply to this too.

2034.6 -> The customers really do not have an easy way

2038.03 -> to run Prometheus in a large scale environment.

2041.25 -> Leaving out even the security challenges

2044.41 -> that they might have,

2045.75 -> it's even much harder to kind of provision

2048.71 -> and maintain a really large Prometheus environment.

2050.53 -> So if you are collecting metrics from large workloads,

2054.65 -> like big, large EKS clusters,

2056.52 -> or even EC2 instances, and so on,

2058.51 -> you have to provision a large amount of storage,

2060.11 -> you have to keep the Prometheus server

2061.82 -> and the collection agents, all of them up and running,

2064.32 -> and you have to be able to have that environment

2069.36 -> or that becomes one of the most critical workloads for you,

2073.07 -> which is not really where you wanna be.

2074.91 -> Because if your application systems are down

2078.348 -> or if this one goes down,

2079.181 -> you're basically running blind,

2080.22 -> so it becomes one of the most crucial workloads,

2083.64 -> which is very challenging.

2085.38 -> So customers don't wanna do this because it's a challenge

2091.922 -> that is really not a business application per se.

2094.69 -> It is not a money making application.

2096.384 -> It's an observability solution.

2097.34 -> So what Managed Prometheus Service does is it allows you

2101.41 -> to create Prometheus compatible environments.

2105.72 -> So when you go to the Managed Prometheus Service,

2107.59 -> you create something called a workspace,

2108.99 -> and it's a service that's based on Cortex,

2111.52 -> which is a CNCF project.

2113.29 -> And it is a multi-tenant environment

2115.54 -> where you basically create a workspace

2116.93 -> just by giving it a name, and automatically,

2119.51 -> you get a highly available PromQL compatible environment,

2124.41 -> where you can send your metrics in

2126.23 -> and you can query metrics from, and so on.

2128.226 -> So it is automatically deployed across multiple AZs.

2131.74 -> So it is highly available and it scales based on needs.

2135.4 -> And you don't really pay any upfront costs for that.

2139.84 -> And it is also fully PromQL compatible.

2141.67 -> What that means is if you're using Prometheus today,

2143.71 -> you can simply bring the same PromQL queries

2145.75 -> to Managed Prometheus Service, and all of that should work.

2147.76 -> And same thing applies to your alerting

2150.31 -> and recording rules and all of that.

2151.82 -> So it's fully compatible

2153.47 -> with the existing solution that we have.

2156.12 -> There are no servers to manage,

2157.35 -> there's no capacity to provision whatsoever,

2160.06 -> simply create a workspace, and you get going.

2162.36 -> And like I mentioned before,

2166.15 -> it really supports a lot of environments,

2167.75 -> whether it is an AWS environment, EC2 or EKS

2171.61 -> or ECS, whatever, it works.

2174.29 -> And even, you can use the Managed Prometheus Service

2178.59 -> to collect metrics from even on-prem

2180.45 -> or any other environments

2181.59 -> that you're running your workloads on.

2182.71 -> As long as you're able to make calls

2184.77 -> to the Managed Prometheus Service through AWS,

2188.03 -> through authenticated calls, it works.

2189.84 -> You need to make your calls through SigV4.

2191.8 -> SigV4 is the protocol that is used by the AWS SDKs and CLIs,

2195.08 -> and all of that to make calls to all and any AWS service.

2198.42 -> And this uses Go SDK, so it's going

2200.5 -> to go through the Go SDK credential provider chain.

2202.31 -> So multiple options for you there

2203.6 -> to kind of provide credentials to the Collector.

2207.44 -> So as long as you do that,

2208.37 -> you'll be able to send metrics, right?
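
For a sense of what SigV4 involves under the hood, the following Python sketch shows only the signing-key derivation step, using the standard library. It assumes the publicly documented HMAC-SHA256 chain and the service signing name `aps`; the secret key, date, and string to sign are placeholders, and a real request also needs a canonical request and headers that the SDK or Collector builds for you.

```python
import hashlib
import hmac


def _hmac_sha256(key: bytes, message: str) -> bytes:
    """Return the raw HMAC-SHA256 digest of message under key."""
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()


def derive_signing_key(secret_key: str, date_stamp: str,
                       region: str, service: str) -> bytes:
    """Derive a SigV4 signing key scoped to one date, region, and service."""
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")


def sign(signing_key: bytes, string_to_sign: str) -> str:
    """Return the hex signature that goes into the Authorization header."""
    return hmac.new(signing_key, string_to_sign.encode("utf-8"),
                    hashlib.sha256).hexdigest()


# Placeholder credentials and scope, for illustration only.
key = derive_signing_key("EXAMPLE-SECRET-KEY", "20220421", "us-west-2", "aps")
print(len(key), len(sign(key, "placeholder-string-to-sign")))
```

In practice you would rarely write this yourself; the point is that any client able to produce this signature from valid credentials can call the service.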

2210.9 -> And the way it works is this.

2212.94 -> So when you create a Prometheus workspace,

2214.9 -> you get an ingestion endpoint, you get a querying endpoint,

2218.25 -> basically you'll be able to send metrics

2219.84 -> through the ingestion endpoint,

2221.048 -> through the querying endpoint,

2222.86 -> you'll be able to query the metrics.
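
The two endpoints can be pictured with a small Python sketch that only builds the URLs. The host pattern and paths below are assumptions based on the public documentation for the service, and the workspace ID is hypothetical; any request to these URLs would still need to be SigV4 signed.

```python
from urllib.parse import urlencode


def workspace_endpoints(region: str, workspace_id: str) -> dict:
    """Return the assumed ingestion and query URLs for one workspace."""
    base = (f"https://aps-workspaces.{region}.amazonaws.com"
            f"/workspaces/{workspace_id}")
    return {
        "ingest": f"{base}/api/v1/remote_write",  # Prometheus remote write
        "query": f"{base}/api/v1/query",          # Prometheus instant query
    }


def instant_query_url(region: str, workspace_id: str, promql: str) -> str:
    """Build an instant-query URL carrying a PromQL expression."""
    query_endpoint = workspace_endpoints(region, workspace_id)["query"]
    return query_endpoint + "?" + urlencode({"query": promql})


# "ws-example" is a made-up workspace ID.
print(instant_query_url("us-west-2", "ws-example", "up"))
```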

2223.83 -> And for long term storage, we use S3 buckets.

2227.1 -> So that's all internal, let's say the S3 buckets

2229.52 -> are not visible to you, S3 is what we use in the background.

2232.72 -> And you have recording rules, you have an alert manager.

2235.63 -> So with Alert Manager, you can define alerts.

2237.48 -> You can create thresholds

2238.84 -> that will trigger alerts to destinations

2241.7 -> that you want through SNS.

2243.04 -> And also you can take actions,

2244.9 -> like auto scaling and so on as well.

2247.55 -> And that environment is fully managed by AWS.

2251 -> I'm showing that just so you understand

2253.19 -> what is sitting behind that workspace,

2255.21 -> but that's fully managed by AWS.

2256.377 -> All you do is just give it a name, and that's all, right?

2258.75 -> And on the left hand side,

2259.78 -> what you see is the collection mechanism.

2262.3 -> You can use any of your favorite Collectors,

2265.23 -> like the AWS Distro for OpenTelemetry Collector,

2267.67 -> which we will talk about in a little bit,

2269.25 -> or a Prometheus server, for example, any of that,

2271.72 -> even there's a Grafana agent.

2273.85 -> And you can even write your own application to send metrics.

2278.67 -> As long as you can use SigV4, you're good to go.

2282.58 -> So you can collect your metrics

2283.88 -> through those different Collectors,

2286.19 -> send the metrics to the Prometheus service

2287.84 -> and we'll take it, right?

2289.13 -> And then on the right hand side, you see the querying.

2292.11 -> And you can use Managed Grafana or your own Grafana,

2296.04 -> or you can obviously, it's all HTTP APIs,

2298.47 -> so you can make HTTP API calls

2299.69 -> and they have to be again authenticated through SigV4,

2302.19 -> as long as you're doing that, you'll be able to make calls.

2304.37 -> It's pretty straightforward, right?

2306.92 -> And we're talking about Grafana, right?

2310.12 -> So Grafana is a rich visualization solution, right?

2316.265 -> Grafana is very popular because of its simplicity.

2318.86 -> At the same time, it is very powerful.

2320.53 -> It doesn't store any data.

2321.81 -> You never send your data to Grafana per se.

2325.34 -> It has a lot of data sources that it supports.

2327.55 -> And you basically connect to your destination.

2332.18 -> Like for example, you can connect

2333.45 -> to Managed Prometheus Service,

2335.11 -> and you will use a native querying language.

2336.92 -> If you connect to Prometheus, you will use PromQL.

2339.23 -> You can connect to CloudWatch,

2340.157 -> you can use the CloudWatch, if you're querying logs,

2343.16 -> using Logs Insights, if you're using metrics,

2345.007 -> you will either use Metrics Insights

2346.49 -> or CloudWatch Metric Math functions, and so on.

2348.63 -> There are several dozens of data sources that are available.

2352.03 -> But again, Grafana's ability is in its capacity

2358.04 -> to provide that single pane view of,

2361.475 -> a single pane of glass view of different data sources.

2363.61 -> In one dashboard, you can see data from CloudWatch,

2367.2 -> maybe Prometheus, maybe Oracle Database, SQL Server,

2369.79 -> or maybe some other third party monitoring solution,

2371.68 -> like Datadog, and so on.

2373.203 -> It's all possible in just one dashboard,

2376.05 -> even in one widget in Grafana,

2378.69 -> just in one widget, just one graph,

2381.39 -> you can actually see data from multiple data sources.

2383.46 -> So it's very powerful.

2384.73 -> However, again, while customers love using Grafana,

2389.25 -> if they're using Grafana in their environment,

2392.07 -> it becomes yet another challenge

2394.24 -> in terms of managing it, patching it,

2397.05 -> securing it, and providing authentication,

2399.32 -> integrating with other AWS services and so on,

2402.49 -> it becomes a challenge.

2403.45 -> And that's where Managed Grafana comes into play.

2405.52 -> So you create Grafana, Managed Grafana,

2407.58 -> you create a workspace,

2408.83 -> and you get a highly available Grafana environment deployed

2411.92 -> across three availability zones.

2414.63 -> It is secured in the sense the control plane

2417.15 -> is IAM authenticated because you're doing it

2420.44 -> through the AWS console.

2421.72 -> And to provision users in the Grafana environment itself,

2424.21 -> you can use AWS SSO or any SAML based identity provider.

2428.93 -> So when we launched the SAML

2430.69 -> based identity provider feature,

2433.19 -> we partnered with OneLogin, Okta, CyberArk,

2437.65 -> or sorry, Ping Identity and Azure AD, right?

2440.82 -> But it's SAML based, so technically any SAML based,

2445.13 -> SAML 2.0 based identity provider will work.

2447.03 -> So you can actually integrate

2448.3 -> your existing SAML based enterprise identity provider

2450.24 -> to provision users into Grafana.

2452.4 -> It is pay-as-you-go, there's no upfront cost or whatever.

2460.36 -> This service was launched in partnership with Grafana Labs,

2464.89 -> and by default,

2467.54 -> you get the Grafana open-source version.

2470.02 -> And you can also, if you want,

2471.61 -> upgrade to the Grafana Enterprise version

2473.42 -> from within the console.

2474.71 -> So when you do that, what happens

2475.87 -> is you actually get additional data source plugins,

2478.34 -> like you see in the column in the middle,

2480.87 -> AppDynamics, Datadog, and Dynatrace,

2482.59 -> and all the way to Wavefront, right?

2483.74 -> So all those data, so if you're using any

2485.81 -> of those data sources, then the Enterprise version

2487.77 -> of Grafana is the way to go.

2489.07 -> And out of the box, when you create Grafana,

2492.357 -> you get all these data sources installed for you.

2495.17 -> So you don't have to install anything and manage all that.

2499.39 -> So if you run Grafana yourself,

2500.9 -> you would basically be copying files,

2502.57 -> changing configuration, rebooting the servers and so on.

2504.844 -> You don't do any of that with the Managed Grafana, right?

2508.12 -> That is the visualization part.

2509.69 -> Then we move on to OpenTelemetry.

2512.75 -> So OpenTelemetry is a CNCF project, right?

2516.91 -> So it's a Cloud Native Computing Foundation project.

2519.48 -> So one of the biggest challenges in collecting signals,

2523.34 -> logs, metrics and traces from applications

2525.41 -> is that there are numerous SDKs and agents

2530.26 -> and Collectors that are involved.

2531.78 -> So if you like a logging solution,

2533.95 -> you have to use the logging vendor's SDK,

2536.49 -> logging solution vendor's agent to collect logs.

2540.24 -> And if you like the metric solution, same thing,

2542.18 -> same SDK, Collector, theirs,

2544.13 -> and then same with tracing solution.

2545.62 -> Now all of a sudden you have multiple agents

2547.56 -> and multiple SDKs.

2549.022 -> Okay, now you wanna change your metric monitoring vendor.

2553.54 -> So you wanna move from X to Y, vendor Y, so what happens?

2558.15 -> Now it's not as easy as simply replacing a Collector.

2560.45 -> No, because you have instrumented your applications

2562.65 -> to expose metrics using that SDK

2565.48 -> that was provided by the vendor,

2567.26 -> now you have to go rewrite your code,

2568.61 -> rewrite your application,

2569.55 -> that could take weeks, even months.

2571.25 -> Customers don't have the money or the time to do that.

2573.89 -> So this is an industry problem,

2575.38 -> which is a vendor lock-in situation, right?

2577.27 -> Everybody understands that, all vendors,

2580.31 -> and that's what the CNCF based OpenTelemetry project

2584.19 -> actually tries to solve.

2585.51 -> There are a lot of companies that are contributing to this.

2588.97 -> And the idea is to create standardization

2592.38 -> on the specification of what these signals should look like

2595.5 -> and should be structured and how they should behave.

2597.68 -> Number two, create SDKs for different languages

2602.1 -> to make use of those SDKs to create signals based

2605.43 -> on the specifications that we defined.

2607.16 -> Number three, create an agent

2609.65 -> to be able to deal with those signals

2612.71 -> and send to any destination.

2614.1 -> So with OpenTelemetry project,

2616.48 -> when it's mature, right now,

2617.93 -> the tracing specification, SDK, and the Collector,

2620.68 -> that's all GA, it's available for you to use,

2623.05 -> metrics is almost GA, it's available.

2626.69 -> But the Prometheus support is already there.

2628.501 -> Prometheus metric support is there.

2630.474 -> And number three, logs is not GA, it's not ready yet.

2634.06 -> It's still being worked upon, right?

2637.707 -> It would also solve two main problems, right?

2640.64 -> One is the ability to simply replace a vendor,

2648.09 -> change anytime you want, because it's all standardized,

2651.13 -> you can simply change the configuration.

2652.57 -> You can send the traces, today you're sending to service X,

2655.73 -> and then tomorrow if you don't like it,

2656.563 -> all you have to do is change the pipeline

2658.57 -> in the OpenTelemetry Collector that you use

2660.16 -> and automatically the signals

2661.34 -> are going to a different destination.

2662.45 -> The applications can be left untouched.
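
The "change the pipeline, leave the application alone" idea can be sketched by modeling a Collector configuration as a plain Python dictionary that mirrors the usual receivers, exporters, and service.pipelines layout. The helper function, the region, and the vendor endpoint are hypothetical, and a real Collector reads this structure from a YAML file rather than Python.

```python
import copy


def swap_trace_exporter(config: dict, name: str, settings: dict) -> dict:
    """Return a copy of config whose traces pipeline sends to a new exporter."""
    updated = copy.deepcopy(config)          # leave the caller's config intact
    updated["exporters"][name] = settings    # declare the new destination
    updated["service"]["pipelines"]["traces"]["exporters"] = [name]
    return updated


# Today: applications send OTLP to the Collector, which forwards to service X.
collector_config = {
    "receivers": {"otlp": {"protocols": {"grpc": {}}}},
    "exporters": {"awsxray": {"region": "us-west-2"}},
    "service": {
        "pipelines": {
            "traces": {"receivers": ["otlp"], "exporters": ["awsxray"]},
        }
    },
}

# Tomorrow: point the same pipeline at a made-up OTLP endpoint for vendor Y.
new_config = swap_trace_exporter(
    collector_config, "otlp", {"endpoint": "vendor-y.example.com:4317"}
)
print(new_config["service"]["pipelines"]["traces"])
```

The receiver side is unchanged, which is why the instrumented applications do not need to be rebuilt or redeployed.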

2665.52 -> The second problem that it's solving is correlation.

2669.719 -> It's very hard to do correlation,

2671.68 -> even if you're using just one vendor, one service provider,

2674.58 -> but when you're using multiple vendors,

2676.62 -> multiple service providers, it's even harder.

2678.49 -> It's almost next to impossible.

2680.16 -> So that's the other problem that OpenTelemetry project

2683.4 -> is trying to solve, it's trying

2684.74 -> to add context between these signals,

2687.71 -> so in order for you to connect and correlate these signals,

2692.25 -> when some incident happens, so you can reduce MTTR,

2694.57 -> which is mean time to resolution, right?

2697.28 -> What is AWS doing with it?

2699.536 -> AWS is basically redistributing the OpenTelemetry project.

2706.842 -> It's called AWS Distro for OpenTelemetry.

2709.18 -> So what we do is every single line

2710.57 -> of code that we write goes

2711.49 -> into the main upstream OpenTelemetry project.

2713.9 -> We take it, we have it go through the AppSec process

2717.93 -> to make sure it is secure and it works the way we want,

2720.77 -> and it goes through rigorous testing internally,

2722.83 -> and we redistribute it.

2724.927 -> And it's not only that we are doing, by the way,

2728.442 -> if you do that, if you use the OpenTelemetry,

2730.09 -> ADOT in short, Collector, you also get AWS support.

2734.7 -> But we are not only redistributing,

2736.52 -> but we are also contributing.

2738.09 -> So we have been actively working with the community.

2739.95 -> We are part of the CNCF community

2742.22 -> that defines specifications, works

2743.89 -> on different parts of the component.

2747.43 -> We have contributed to several exporters

2749.94 -> and receivers and so on,

2751.54 -> exporters are basically ones that export the signals

2753.35 -> to different destinations and receivers are the ones

2755.17 -> that receive signals in different formats and so on.

2757.85 -> So we've been actively working and making,

2761.88 -> we've been trying hard to make it really easy

2764.06 -> for you to deploy the OpenTelemetry Collector as well.

2768.29 -> For example, if you look at the Amazon ECS console,

2771.82 -> when you create a task definition today,

2773.14 -> you can simply go and specify that,

2775.72 -> yes, I want to collect traces, it's just a check box

2777.71 -> and you select it, and it'll ask you, okay,

2779.51 -> so what format are you collecting traces?

2781.34 -> Are you using the X-Ray SDK or the OpenTelemetry SDK?

2783.523 -> And you can select that, and then you can select that,

2786.78 -> okay, after that, it'll automatically be coming.

2788.89 -> Same thing with the metrics too.

2790.6 -> You can simply go check a check box.

2792.68 -> And it's all very easy for you to deploy.

2794.89 -> All you have to do is just specify, give those inputs,

2797.18 -> and we will take care of deploying the Collector,

2799.14 -> configuring it, and sending where you wanna send.

2801.74 -> We also have a trimmed down version

2803.95 -> of the OpenTelemetry Collector published

2805.5 -> as a Lambda layer that you can deploy as a Lambda extension,

2808.94 -> and that would help you to collect traces

2811.18 -> and also metrics in certain languages

2813.33 -> and framework to send to different destinations.

2817.312 -> And we also are working on making it even easier

2821.62 -> for you to use and deploy on EKS as well.

2824.97 -> And that's this, the OpenTelemetry Kubernetes Operator.

2828.68 -> So we have something called ADOT,

2830.347 -> OpenTelemetry ADOT Operator.

2832.69 -> So, which is a Kubernetes operator

2834.46 -> that allows you to deploy the OpenTelemetry Collector

2837.58 -> through the operator

2838.413 -> so the operator can manage the resource for you.

2840.66 -> So it is basically you giving,

2843.04 -> handing the controls over to the operator

2844.95 -> so it can automatically manage

2846.4 -> while you are specifying the custom resource

2849.33 -> and how you want the operator to deploy and so on.

2852 -> So one really, really interesting advantage

2856.01 -> of using this OpenTelemetry or the ADOT operator

2858.65 -> is that if you have a large EKS cluster,

2861.99 -> and if you have let's say 10,000 targets

2864.43 -> where you wanna scrape all the metrics,

2867.02 -> it's a busy environment.

2868.46 -> So at that point, what about the Collector's health

2873.33 -> and performance and what about the availability

2875.77 -> of the Collector itself?

2876.77 -> That becomes a challenge, right?

2878.04 -> So with the operator what you can do

2880.23 -> is you can deploy it as a StatefulSet, for example,

2882.85 -> and you can tell the operator to deploy the Collector

2886.52 -> in a StatefulSet manner, as a StatefulSet,

2889.6 -> and say that you want to deploy,

2891.28 -> let's say five copies of my Collector,

2893.42 -> and it will deploy five copies of the Collector.

2895.81 -> It'll equally shard the load between these Collectors.

2899.2 -> So in our example of 10,000 targets,

2903.6 -> that's 2,000 targets per Collector, and it's equally divided.

2906.83 -> And so, you have kind of the load that is shared.

2911.84 -> So not one particular Collector

2913.93 -> or one particular pod in this case

2915.52 -> is overloaded and it's kind of dying

2917.16 -> and that's an issue and so on.

2918.78 -> So it makes it a lot easier.
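
The 10,000 targets across five Collectors example works out as below. This Python sketch uses a simple round-robin split purely to illustrate the arithmetic; it is an assumption for the example and not a description of how the operator actually allocates targets. The target names are made up.

```python
def shard_targets(targets: list, collector_count: int) -> list:
    """Split scrape targets into near-equal shards, one per Collector replica."""
    if collector_count < 1:
        raise ValueError("collector_count must be at least 1")
    shards = [[] for _ in range(collector_count)]
    for index, target in enumerate(targets):
        shards[index % collector_count].append(target)
    return shards


# Hypothetical target list standing in for pods discovered in a large cluster.
targets = [f"pod-{n}:9090" for n in range(10_000)]
shards = shard_targets(targets, 5)
print([len(shard) for shard in shards])  # [2000, 2000, 2000, 2000, 2000]
```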

2919.84 -> So by the way, today, it's 12:20,

2924.66 -> so as of 10 o'clock, if you saw the EKS console,

2928.1 -> we also have added the ability

2930.69 -> for you to deploy the ADOT Operator as an EKS add-on.

2934.44 -> So you can simply deploy it as an add-on,

2936.84 -> and then you can specify the CRD

2939.84 -> and the Collector will be there.

2942.58 -> So in the next couple of hours,

2945.64 -> we will update the documentation.

2948.25 -> There'll be a blog post that will go out, and so on.

2949.85 -> So it's one of the ways for us

2955.38 -> to make collecting metrics and traces

2957.73 -> from Kubernetes environments easy, all right?

2960.38 -> Then putting all of that together,

2962.92 -> you basically have flexible options, right?

2966.48 -> Whether you're running your workload

2968.96 -> in an on-premises environment or in an AWS environment,

2972.087 -> whether it's using EC2 or EKS or Lambda or anything,

2977.38 -> you have different options.

2979.2 -> And also, even if you use two different solutions,

2982.01 -> like maybe you're using CloudWatch for logs

2983.67 -> and Prometheus for metrics,

2985.48 -> it's all possible to put all of that together

2987.69 -> and make sense through maybe something like Grafana,

2990.84 -> for example, right?

2992.06 -> You have different data sources that Grafana supports

2994.29 -> that you can actually visualize and make use of.

2995.77 -> So putting all of it together,

2998.14 -> you can choose AWS native observability solutions

3002.25 -> or open-source solutions for the need that you have,

3005.65 -> but you have options to pick and choose which one you want,

3009.36 -> but you can also have a mix of both

3011.28 -> and put all that together

3012.14 -> and still make sense of your environment.

3015.99 -> Having said that, I'm handing over to Rich, thank you.

3023.529 -> Thank you, sir.

3024.88 -> I trust everyone can still hear me.

3026.33 -> I can hear me, that's good.

3028.03 -> Okay, next steps, so this is a lot of information.

3030.18 -> There's a lot of options.

3032.842 -> We're not gonna leave you completely hanging.

3034.88 -> We have a workshop along

3036.94 -> with 17 different modules contained therein

3040.13 -> that takes you through everything

3041.43 -> that we've talked about here today.

3043.51 -> And the workshop is in five different languages.

3045.77 -> So the URL is a little bit truncated there.

3047.75 -> It's catalog.workshops.aws/observability.

3052.55 -> Now this workshop is our go-to tool

3056.78 -> for helping our customers learn their way

3059.35 -> through these services.

3060.25 -> And it's not just limited to the cloud native services.

3063.33 -> We have modules

3064.42 -> for the open-source managed services as well.

3067.58 -> It's a fantastic resource.

3068.85 -> I highly encourage everybody to go visit it

3072.96 -> and try out the content for yourself.

3076.14 -> Now, the workshop, these tools

3079.6 -> that we've discussed and observability

3081.83 -> on the whole don't exist entirely on their own.

3084.233 -> It's part of a larger cloud operations story.

3089.02 -> And we'll typically just refer to this as Cloud Ops

3091.68 -> when we are here inside of AWS.

3093.4 -> So the question is why operate these things

3096.44 -> in the cloud in the first place?

3098.27 -> Why not just use ISV solutions

3102.27 -> that you operate yourself, COTS solutions?

3105.74 -> And the answers are really displayed here.

3107.87 -> So enterprises, they typically wanna be in the cloud.

3110.46 -> A lot of them are in the process

3111.91 -> of doing their digital transformation

3113.76 -> and they're asking us for help.

3115.91 -> But sometimes when they hear about our services,

3117.75 -> they can find it to be complex.

3120.41 -> And that slows them down from quickly understanding

3122.81 -> how AWS can help.

3125.53 -> The Cloud Ops approach, it's our solution

3130.82 -> and also a global marketing initiative

3133.04 -> for what enterprises need to do

3135.51 -> in order to build and operate successfully in the cloud.

3139.63 -> And why are they choosing AWS for Cloud Ops?

3142.27 -> And these are some of the reasons here.

3143.73 -> I particularly like to zoom in on the carbon savings.

3146.45 -> So when you move into AWS, we operate at a scale

3150.15 -> that again, most customers just can't do on their own,

3153.053 -> just the economies of scale.

3154.89 -> And this helps us do things

3156.64 -> like reduce your overall carbon footprint

3159.16 -> for your IT operations by up to 88%.

3166.09 -> Now, I could talk a lot about how enterprises

3167.94 -> are increasingly turning to the cloud

3170.32 -> to achieve their business outcomes.

3173.09 -> And there are large portfolios of legacy applications

3177.3 -> that are likewise coming into AWS.

3180.36 -> And this includes Intel based processes as well,

3182.96 -> and data centers that wanna have much more agility

3187.89 -> than they might have had previously.

3190.07 -> And there's this misconception

3192.02 -> that you need to sacrifice a lot of your governance

3196.03 -> in order to achieve agility.

3197.68 -> And what we've done as our guiding principle,

3199.52 -> our North Star

3200.87 -> when we design these cloud operation services

3204.18 -> is to make sure that you don't have that sacrifice

3206.77 -> of agility when you build out your governance solution.

3211.72 -> So we have a depth of experience with cloud operations

3216.46 -> that we like to share with our customers

3218.47 -> as we build our products and services.

3223.13 -> This would be another view of this same process,

3226.22 -> but a little bit more of a flow model.

3227.53 -> So on the left, we see people, process and technology

3231.75 -> as inputs into the automation and security

3236.28 -> that allows you to enable,

3237.68 -> sustain, operate your environment.

3240.33 -> And the output of this approach

3242.9 -> are gonna be the controls that you need,

3246.58 -> the agility that you want,

3248.74 -> the ease of use that helps you to use the platform

3252.43 -> in a very fast and effective manner.

3255.52 -> This is the AWS view on cloud operations.

3259.58 -> Our goal is to enable your people, your processes,

3263.44 -> and your technology to be enabled, to be secure, to grow,

3268.05 -> to migrate in a healthy and satisfactory way into the cloud,

3272.53 -> and then to continue to operate

3274.55 -> and continue to evolve your workloads once they are there,

3277.86 -> giving you better business outcomes in the end.

3283.87 -> So we highly encourage people to visit these two links.

3289.06 -> And I'll leave this slide up here for just a moment

3290.58 -> for those on your phones.

3292.15 -> So the AWS Skill Builder, it's free courses.

3295.22 -> You don't have to pay, it's super easy to access.

3298.5 -> We have video training, we have lab material.

3303.55 -> It's a fantastic resource

3305.2 -> for everybody in your organization.

3307.23 -> And there is of course the certification path as well.

3310.11 -> There are a lot of benefits to being AWS Certified.

3312.63 -> We do have a whole training

3314.04 -> and certification part of the organization that spends a lot

3318.71 -> of their time building high quality training.

3321.38 -> And then if you choose to become AWS Certified,

3324.24 -> it's industry recognized, it stands out.

3327.44 -> It's a very important thing for a lot of employers

3330.95 -> when they're going to the field

3332.66 -> and trying to find resources themselves.

3334.71 -> But we are always refreshing these certification processes

3338.7 -> as well, so there are new exam guides that are coming out.

3341.85 -> There are new versions of the exams

3343.43 -> that are coming out over time.

3346.42 -> We encourage people to be certified,

3349 -> to maintain their certifications.

3352.68 -> And otherwise just use the QR codes,

3355.468 -> and go find this for yourselves.

3358.07 -> And that would be it for our content today.

3359.99 -> I wanna thank you all for hanging out with us.

3362.47 -> This was fantastic.

3363.862 -> (upbeat music)

Source: https://www.youtube.com/watch?v=or7uFFyHIX0