AWS Innovate 2022 - Ensure Reliability and Uptime with Observability Solutions | AWS Events

Aug 16, 2023

Running a business in a 24-by-7 world requires you have an observability solution for finding and fixing application and system issues quickly. With the growth of distributed systems running in the cloud, the need for a comprehensive open source observability solution is more acute. In this video, learn how you can use AWS Analytics and observability solutions, including machine learning services, to detect and resolve anomalies, and deliver exceptional customer experiences and service availability.

Learn more at: https://go.aws/3smA0Tm

ABOUT AWS
Amazon Web Services (AWS) hosts events, both online and in-person, bringing the cloud computing community together to connect, collaborate, and learn from AWS experts.

AWS is the world’s most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster.

#OpenSearchService #AWS #AmazonWebServices #CloudComputing

Content

0 -> (cheerful, electronic music)

9.38 -> - Hi everybody, my name is Jon Handler.

11.44 -> I'm a solutions architect with AWS

13.46 -> and I cover our OpenSearch Service

18 -> and the open-source project.

20.13 -> Today we're gonna talk about reliability and uptime

23.96 -> and how to build observability with various solutions.

28.66 -> So let's start out by talking about what is observability.

33.68 -> So observability is collecting

36.92 -> and analyzing data from applications

41.23 -> so that you can understand what the issues are,

44.8 -> get alerts for those issues,

46.9 -> troubleshoot those issues and resolve those issues.

50.34 -> Observability really refers to the collection

53.66 -> of instrumentation and other metric data

57.25 -> around your application

58.73 -> that enables you to solve these issues.

63.84 -> So why do we really need observability?

66.36 -> Well, anytime we're building some kind of software solution,

70.39 -> there are various problems that you're gonna have.

73.78 -> There's denial of service, service outages,

77.23 -> cost overruns, component dependencies,

79.74 -> networking connectivity, all kinds of challenges

83.62 -> that you're gonna face.

85.42 -> Ultimately, those are gonna lead to downtime

88.67 -> for your application, for your end-users,

91.59 -> for your customers and really

94.15 -> there's a cost to that to your business.

96.94 -> Not only is there a financial cost,

99.64 -> which can be as much as $80,000 to $100,000 a day,

104.14 -> there's also a cost in fatigue for your developers,

109.71 -> in loss of trust with your customers,

113.52 -> and eventually, this downtime

116.31 -> is gonna lead to poor business outcomes.

120.53 -> So what are the goals for doing observability?

124.797 -> You really want to improve the overall end-user experience

131.03 -> for your applications and services.

133.49 -> And observability is the suite of tools

136.28 -> and the data you use to provide that better experience.

144.01 -> Getting specific, observability really rests

147.14 -> on three different kinds of information.

150.96 -> The first information is log information.

154.9 -> Anybody who's ever done any debugging knows

157.95 -> that when something goes wrong,

159.9 -> the first thing you need to do is go look at the logs.

163.09 -> The logs have the information that lets you know

167.11 -> what happened during the execution of the application

171.01 -> that ended up causing the problem.

174.13 -> The second kind of data is metrics.

176.78 -> Metrics give you the ability to monitor what's going on,

180.84 -> how your application is performing, and with trend analysis,

184.61 -> you're able to predict and find trends

187.98 -> that will let you know

188.87 -> that you're headed in a poor direction.

192.13 -> Traces are logs that are produced from your application

197.75 -> that capture the course of the processing

201.5 -> of a single event through your system.

204.56 -> And with these three kinds of information,

208.12 -> you can go in and figure out exactly what's going on.

212.48 -> We have (mutters) services

214.25 -> that address these various concerns

218.24 -> and we're gonna go into depth on all of them.

224.69 -> You really want to, ultimately,

227.73 -> influence and improve both your operational

230.7 -> and your business outcomes.

232.78 -> So on the operational side,

234.74 -> by employing these tools to look at the information,

239.53 -> you'll achieve better visibility into

243.01 -> the underlying functioning of your application.

249.8 -> This is gonna enable you to troubleshoot problems

253.89 -> in near realtime.

256 -> Many of these systems, most of these systems,

257.93 -> can flow data in in near realtime,

260.38 -> they can alert you when problems are occurring

263.69 -> or when problems, hopefully, are about to occur,

267.8 -> and you can go in and use these tools

270.97 -> to look at your log data, to look at your traces,

273.67 -> to look at your timing and to figure out

276.24 -> what it is that you need to fix.

279.62 -> On the business side, observability is gonna give you

283.3 -> a more resilient application.

285.53 -> By using these tools,

287.82 -> you'll be able to build out your applications

290.1 -> in a better way and keep them up and running longer.

296.18 -> End-goal, as I said, is to improve customer experience.

300.26 -> So your application is serving

302 -> whatever business need you have

303.95 -> and keeping your customers happy

306.26 -> by running a low-latency, low error-rate,

309.73 -> seamless, kind of application will, ultimately,

313.38 -> give your customers the best experience.

318.38 -> We have a number of different workloads and use cases

321.75 -> where observability is really emerging

323.82 -> as a very important tool.

326.33 -> Microservices and containers,

328.84 -> and especially the move to a service-oriented architecture,

332.05 -> is driving a lot of need to understand

335.82 -> how your application is performing

337.48 -> and how your services are performing

339.74 -> within that application.

341.22 -> So we see a lot of usage in the microservices architecture.

347.47 -> We also see a lot of use cases

349.16 -> around digital experience monitoring,

351.03 -> so when you have a web or a mobile application

355.64 -> that is providing some service to your end-users,

359.3 -> you wanna be able to dig in and understand

362.15 -> what are the dependencies and how is it working.

367.02 -> On the operational side, and of course,

368.6 -> observability really flows out of the DevOps,

371.53 -> kind of, methodology.

373.31 -> So when we're looking at an operational situation,

376.76 -> observability is really gonna, again,

378.21 -> help you dig in, troubleshoot, and fix issues.

382.77 -> And finally, looking at data lakes

385.44 -> enables you to understand how your information

388.18 -> is flowing in and how it's related to one another.

395.92 -> So, on the mechanics, with observability, it enables you,

402.81 -> again, to detect when either issues are happening

407.06 -> or issues are about to happen so that you can investigate,

412.86 -> and investigate, here, is a kind of broader bucket,

417.63 -> where even if it's not a realtime problem that's happening,

421.93 -> you still have the ability to dig in

424.67 -> and figure out, over the past, what has happened

429.27 -> in order to remedy and improve your software.

434.44 -> And finally, again, remediating and improving the software

437.94 -> is one of the ultimate goals.

440.56 -> So to be able to remediate issues

443.9 -> and bring back that high quality user experience.

451.53 -> So this is a complicated question,

454.09 -> is there one observability tool to rule them all?

457.96 -> Sadly, the answer is no.

460.37 -> At the moment, this is an emerging segment

463.19 -> and we're seeing many different technologies

466 -> that enable this observability, kind of, use case

470.5 -> where there're competing possibilities,

474.09 -> a lot of different, you know,

476.16 -> both AWS services and open source technologies

479.57 -> that give you the fundamental capabilities

481.94 -> that you're looking for,

483.76 -> but you need a way to pick a single choice to go forward.

490.17 -> We're gonna review all of the different tools

493 -> that are out there and try and highlight

494.62 -> some of the pros and cons to help you make that decision.

501.25 -> At a high level,

502.14 -> what you need to accomplish with observability

505.21 -> is really an end-to-end pipeline

508.48 -> that's going to start out with instrumentation

512.049 -> to pull or collect the logs, metrics and traces

518.048 -> into a single stream.

520.82 -> That stream, you're gonna ingest that stream

524.1 -> into some kind of collection and storage layer

527.08 -> and then that storage layer can employ

530.38 -> monitoring and alerting for realtime interaction

535.07 -> with the underlying data,

537.32 -> but you also wanna be able to search and index that data

540.89 -> in order to go and find the problems that you're having.

544.01 -> And then, finally, we end up with a visualization layer

548.35 -> where you can build dashboards

550.55 -> that enable you to correlate data

552.14 -> that's coming in through this pipeline

554.58 -> and also figure out what's going on.

558.05 -> Peeling back the onion just a tiny bit,

560.61 -> we have a number of our different technologies here.

564.64 -> So on the instrumentation side,

566.85 -> we have our CloudWatch agent, we have an X-Ray agent,

570.46 -> we have AWS Distro for OpenTelemetry.

573.86 -> We also have Fluent Bit that enables you, again,

578.96 -> to pull in those data sources and start to work with them.

584.74 -> Of course, if we have data in CloudWatch,

587.17 -> if you have data in CloudWatch,

588.53 -> there's Amazon CloudWatch ServiceLens

591.57 -> that enables you to get Container Insights,

594.95 -> Lambda Insights, a bunch of Contributor Insights,

599.04 -> and other synthetic kinds of data.

602.45 -> In the collection and storage layer,

604.23 -> we have CloudWatch metrics, CloudWatch logs,

607.69 -> X-Ray, Prometheus, and Amazon Managed Service for Prometheus,

612.51 -> as well as Amazon OpenSearch service.

614.3 -> So these are, kind of, the containers

616.42 -> where you can flow this data.

619.03 -> On the dashboarding side, then,

621.07 -> we have Grafana and Amazon managed Grafana,

624.45 -> as well as OpenSearch dashboards

626.45 -> that enable you to build visualizations

628.95 -> and dig in with service maps and trace graphs

633.94 -> and the richer kinds of experiences around this data.

640.22 -> So let's talk through some of the open-source alternatives.

644.57 -> First of these we're gonna talk about is OpenSearch

646.61 -> and OpenSearch dashboards.

649.686 -> OpenSearch is a community-driven, open-source,

653.18 -> search and analytic suite.

654.71 -> It's derived from Apache 2.0

657.48 -> licensed Elasticsearch 7.10.2 and Kibana.

662.93 -> The OpenSearch project is really the way that we are

668.03 -> carrying forward that core search and analytics capability

672.81 -> into the open-source world.

676.95 -> The project itself consists of OpenSearch,

680.09 -> which is a search and analytics engine,

685.11 -> as well as OpenSearch dashboards,

686.97 -> which is a UI and visualization layer,

690.11 -> and includes plugins that enable things like alerting,

693.96 -> anomaly detection, field-level security,

698.61 -> a bunch of different functionality

700.38 -> that's provided by those plugins.

703.3 -> It enables you to ingest your data,

707.17 -> you can secure it, and analyze it with aggregations,

711.49 -> view it, and bring all the logs,

714.33 -> metrics and traces into one place.

718.03 -> It does have an observability suite that's built into it,

721.6 -> along with anomaly detection and alerting, as I mentioned,

725.24 -> and this enables you to identify

727.43 -> and react to issues in your solutions.

732.3 -> Prometheus, also community-driven, open-source.

735.81 -> It's a systems monitoring and alerting toolkit.

740.37 -> It collects and stores metrics, primarily,

743.64 -> they are time series and they have key value pairs,

747.76 -> called labels, that enable you to bring those metrics in.

752.61 -> It has a server, where it's gonna store that data,

755.92 -> an alert manager and a push gateway

758.65 -> to facilitate viewing and working with that data.

764.42 -> Usually you will handle visualizations with Grafana.

769.96 -> It really, again, is more on the metric side.

773.2 -> So it enables you to gain insights

775.45 -> from numeric measurements.

777.16 -> So this is stuff like your CPU use,

780.35 -> your latencies, et cetera.
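
As an illustration of the labeled time series described above (this sketch is not from the talk), the Python below renders one sample in the style of the Prometheus text exposition format: a metric name, key-value labels in braces, then the numeric value. The function name `format_sample` and the metric and label names are hypothetical, chosen only for the example.

```python
def format_sample(name: str, labels: dict[str, str], value: float) -> str:
    """Render one metric sample as: name{key="value",...} value"""

    def escape(text: str) -> str:
        # Label values escape backslashes, double quotes and newlines.
        return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

    # Sort the labels so the output is stable from run to run.
    label_text = ",".join(f'{key}="{escape(val)}"' for key, val in sorted(labels.items()))
    return f"{name}{{{label_text}}} {value}"


if __name__ == "__main__":
    # Hypothetical CPU measurement for a single host.
    print(format_sample("cpu_usage_percent", {"host": "web-1", "region": "us-east-1"}, 72.5))
```

Each distinct combination of label values is its own time series, which is what lets you slice a metric such as CPU use by host or by region later on.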

786.24 -> Grafana is a visualization tool

789.3 -> that is, again, community-driven, open-source,

792.36 -> it's an analytics platform.

794.64 -> You can query your data, visualize your data,

796.83 -> alert on your data, in order to understand

799.54 -> what's going on with your metrics, logs and traces.

802.72 -> It does connect to many different back-ends.

805.81 -> So the data source plugins enable you

807.84 -> to connect to things like MongoDB, Dynatrace,

811.01 -> ServiceNow, Amazon OpenSearch service,

814.02 -> Prometheus,

815.65 -> a lot of different back-ends that you can connect it to.

819.91 -> You can define alert rules.

821.82 -> So again, alerting is a key capability

825.62 -> within the observability stack.

827.16 -> You want to be able to send out alerts when things go wrong.

831.74 -> With Grafana, you can send alerts to places like Slack,

834.87 -> PagerDuty, and OpsGenie.

839.97 -> You can also work with different time ranges.

843.07 -> That gives you the ability to kind of drill in

845.44 -> and understand now versus then.

851.57 -> So one of the foundational elements of observability

855.46 -> is the OpenTelemetry standard, and the OpenTelemetry stack.

860.88 -> It's an open-source project

863.66 -> designed for the creation and management of telemetry data,

866.67 -> such as traces, metrics and logs.

869.75 -> It does support a lot of wire formats,

872.03 -> Jaeger, Zipkin and Prometheus.

875 -> This enables you to bring that data in

878.78 -> and bring it into a common format.

884.51 -> It's an evolving project.

887.388 -> A lot of different people contributing to it,

889.34 -> and we currently have, we'll get to,

893.92 -> the AWS Distro for OpenTelemetry,

897.25 -> which brings it into a more managed situation.

902.3 -> Currently works with traces and working on metrics and logs,

907.01 -> but not entirely there yet.

910.948 -> Fluent Bit is in the collection layer,

913.44 -> again, this is the instrumentation, also open-source,

917.3 -> very fast, light-weight, scalable,

920.83 -> forwards logs, metrics from their sources.

927.42 -> It has a plugin-based ecosystem

929.62 -> that enables you to collect and filter

932.63 -> and transform and augment your data.

937.38 -> It is true open-source, so it's vendor agnostic

941.81 -> and really comes as a derivative of Fluentd.

947.9 -> Integrates with Prometheus and OpenSearch,

951.46 -> Amazon CloudWatch, Amazon X-Ray, Amazon S3,

954.43 -> a lot of different destinations

956.67 -> where you can send this data and forward it on.

962.37 -> In the OpenSearch project, we have Data Prepper.

965.88 -> Data Prepper is in the collection and transformation.

969.34 -> So it's also open-source

971.8 -> and it processes observability data.

976.06 -> So it gives you features that enable you to filter

978.92 -> and enrich and transform the data

981.839 -> that's coming through the pipe

984.71 -> and enable downstream analytics and visualization.

991.18 -> It, right now, supports the processing

993.32 -> of distributed trace data, log ingestion,

996.53 -> and is moving towards supporting metric data in the future.

1003.37 -> Also integrates with Jaeger, Zipkin,

1005.84 -> OpenTelemetry and Fluent Bit.

1011.44 -> So, we're gonna take all of those pieces

1014.03 -> and build them into one example architecture,

1018.36 -> that is an open-source-focused architecture.

1021.09 -> On the collections side,

1022.87 -> we have our OpenTelemetry collector

1025.61 -> that's going to bring in traces

1028.32 -> and get those to Data Prepper, which is gonna prepare them.

1033.61 -> We also have Fluent Bit collecting metrics

1037.33 -> and bringing those into Data Prepper.

1041.43 -> Fluent Bit also can tail and collect log data,

1044.62 -> which it's gonna flow directly to OpenSearch.

1047.76 -> So we have these two pathways that come into OpenSearch,

1050.44 -> the Data Prepper pathway for traces and metrics,

1054.02 -> logs flowing from Fluent Bit, directly to OpenSearch.

1058.14 -> We can also take metrics out to Prometheus from Fluent Bit

1063.13 -> for our metric evaluation.

1067.66 -> On the visualization side,

1069.29 -> then OpenSearch dashboards will connect with OpenSearch

1072.5 -> and enable you to build visualizations

1074.55 -> and employ the more rich experience within OpenSearch,

1079.16 -> specifically around traces and log data.

1085.14 -> We can use Grafana to connect both to OpenSearch

1088.47 -> and Prometheus and that gives us additional ability

1092.5 -> to bring that metric data into the same location, visually,

1096.52 -> as the log and trace data.
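
To make the example architecture above easier to scan, here is a small Python sketch (not from the talk) that writes the described pathways down as data. The component names come from the walkthrough; the names `PIPELINES`, `DASHBOARD_SOURCES`, `stores_for` and `dashboards_for` are hypothetical helpers for illustration.

```python
# Each signal maps to one or more paths; the last hop of a path is the store.
PIPELINES = {
    "traces": [("OpenTelemetry Collector", "Data Prepper", "OpenSearch")],
    "metrics": [
        ("Fluent Bit", "Data Prepper", "OpenSearch"),
        ("Fluent Bit", "Prometheus"),
    ],
    "logs": [("Fluent Bit", "OpenSearch")],
}

# Which stores each visualization tool connects to in this architecture.
DASHBOARD_SOURCES = {
    "OpenSearch Dashboards": {"OpenSearch"},
    "Grafana": {"OpenSearch", "Prometheus"},
}


def stores_for(signal: str) -> list[str]:
    """Return the stores where a given signal type ends up."""
    return sorted({path[-1] for path in PIPELINES[signal]})


def dashboards_for(store: str) -> list[str]:
    """Return the visualization tools that can read from a given store."""
    return sorted(name for name, sources in DASHBOARD_SOURCES.items() if store in sources)


if __name__ == "__main__":
    for signal in PIPELINES:
        print(signal, "->", stores_for(signal))
    print("Prometheus is read by", dashboards_for("Prometheus"))
```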

1105.15 -> So we've gone through

1106.39 -> a number of the different open-source alternatives.

1110.55 -> Many of those have a Managed Solution, within AWS,

1114.38 -> that will make it easier to deploy and run and operate

1118.71 -> those observability components.

1121.5 -> First of those we'll talk about is

1122.78 -> Amazon Managed Service for Prometheus.

1126.56 -> It's Prometheus compatible.

1128.55 -> It's a monitoring and alerting service, and again,

1131.81 -> you use it for monitoring containerized applications

1134.62 -> and infrastructure at scale.

1137.32 -> It automatically scales for ingestion and storage

1140.72 -> and alerting and querying of the metrics

1143.39 -> as your workload grows or shrinks.

1146.97 -> It's integrated with Amazon EKS,

1149.21 -> Elastic Kubernetes Service, Elastic Container Service,

1153.14 -> and AWS Distro for OpenTelemetry.

1158.78 -> We said a little bit about Amazon OpenSearch service,

1161.1 -> but Amazon OpenSearch service is a managed service

1163.97 -> that makes it easy to deploy, operate and scale OpenSearch

1167.83 -> and legacy Elasticsearch clusters in the AWS Cloud.

1171.84 -> It has some built-in observability tooling

1174.29 -> with trace analytics panel,

1176.98 -> event analytics panel and a log analytics panel.

1181.62 -> It also comes with anomaly detection features

1184.73 -> that automatically learn the normal behavior of your system

1188.33 -> and can generate alerts

1189.84 -> when that behavior leaves the normal.

1194.7 -> It's integrated with Kinesis Data Firehose,

1197.29 -> CloudWatch logs and other tooling.

1202.599 -> ServiceLens enables you to run observability

1206.47 -> on top of your CloudWatch data,

1208.87 -> integrate traces, metrics, logs and alarms into one place.

1214.29 -> It integrates CloudWatch with AWS X-Ray,

1217.72 -> also to provide an end-to-end service map

1220.5 -> and view of your application.

1223.24 -> You can do correlation against Lambda functions,

1226.13 -> API Gateways, Java applications,

1229.41 -> either in containers or on EC2.

1234.46 -> Amazon Managed Grafana is a secure production-ready,

1237.76 -> open-source distribution, supported by AWS.

1241.77 -> We developed it in collaboration with Grafana Labs

1244.9 -> and it enables you to connect to data sources

1246.93 -> like Amazon OpenSearch service,

1248.84 -> Amazon Managed service for Prometheus,

1251.27 -> X-Ray, CloudWatch, TimeStream,

1253.86 -> it has security built in and enables you

1257.51 -> to connect securely with all of these data sources

1260.5 -> in a way that preserves their integrity.

1263.1 -> There are some pre-built dashboards

1265.25 -> that give you faster access

1267.44 -> to the insights that you're looking for.

1270.19 -> AWS X-Ray collects data about requests

1272.52 -> that your application serves and gives you tools

1275.68 -> to view and gain insights into that data.

1279.72 -> X-Ray receives traces from your application,

1282.64 -> in addition to AWS services, like AWS Lambda,

1286.39 -> that enable you to bring that trace data in

1289.23 -> and view service maps and trace graphs

1291.79 -> and the other tools that you're used to.

1297.94 -> Amazon OpenSearch service is a managed service

1301.47 -> that enables you to deploy, scale and operate OpenSearch

1305.88 -> within the AWS Cloud.

1309.49 -> It supports OpenSearch versions 1.0 and 1.1,

1313.6 -> as of today, as well as legacy Elasticsearch versions

1317.74 -> from 1.5 to 7.10, with visualization capabilities

1322.07 -> provided by OpenSearch dashboards and Kibana,

1325.625 -> for the 1.5 to 7.10 versions.

1331.5 -> Finally, we'll talk a little bit about AWS Distro

1333.56 -> for OpenTelemetry, secure, production-ready,

1337.03 -> AWS supported distribution of the OpenTelemetry project.

1340.81 -> It is backed by AWS support

1343.43 -> and gives you one-click deploy

1346.137 -> from the ECS and Lambda consoles.

1350.75 -> There are exporters that enable monitoring solutions,

1353.58 -> like AMP, Amazon Managed service for Prometheus,

1357.436 -> CloudWatch, X-Ray, OpenSearch service

1359.92 -> and other third party solutions.

1363.9 -> We're gonna dig in, just a touch,

1365.18 -> on Amazon OpenSearch service,

1366.81 -> just to give you a feel for some of the specifics

1369.37 -> around what observability looks like in practice.

1374.26 -> So again, we have our logs, metrics and traces,

1377.7 -> we're gonna look at, first, logs.

1380.86 -> So within Amazon OpenSearch service

1382.92 -> and from OpenSearch dashboards,

1386.29 -> you can do all kinds of visualizations

1388.35 -> that enable you, first of all,

1390.71 -> to see in aggregate what's happening.

1394.4 -> We have live tailing of the logs,

1397.05 -> including surrounding events.

1398.43 -> So this enables you to dig in

1400.8 -> and really look at the logs and figure out,

1403.33 -> okay, here's some kind of gross statistics

1406.5 -> about what's happening

1407.72 -> and here are some specific log lines, you know,

1409.98 -> OpenSearch is ultimately a search engine,

1411.93 -> I can search for my errors, I can look at what's going on

1415.02 -> and what's happening around that in my log files,

1418.42 -> in a time-based kind of way.
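
Searching logs for errors in a time window, as described above, might look like the following request body. This is a minimal sketch, not something shown in the talk: it uses the standard query DSL (`bool`, `match`, `range`), and the field names `message` and `@timestamp` are assumptions about how your log documents are indexed. The helper name `build_error_query` is hypothetical.

```python
import json


def build_error_query(text: str = "error", minutes: int = 15, size: int = 50) -> dict:
    """Build a search body: full-text match on the log message, limited to recent events."""
    return {
        "size": size,
        # Newest events first, so the latest errors appear at the top.
        "sort": [{"@timestamp": {"order": "desc"}}],
        "query": {
            "bool": {
                "must": [{"match": {"message": text}}],
                # Date math: only events from the last `minutes` minutes.
                "filter": [{"range": {"@timestamp": {"gte": f"now-{minutes}m"}}}],
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_error_query(), indent=2))
```

The body would be sent to the `_search` endpoint of whichever index holds the logs; widening the time range around a hit is one way to look at the surrounding events.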

1423.83 -> Amazon OpenSearch service and OpenSearch Dashboards

1426.89 -> provide a complete, sort of, trace analytics experience,

1433.19 -> and when we talk about, you know, sort of trace analytics,

1435.62 -> there are a couple of major components that we see.

1439.02 -> The first of those is trace spans.

1442.13 -> Again, traces provide that end-to-end view

1445.24 -> of the processing of a request within your system.

1448.68 -> They're all connected together by a single trace ID,

1451.58 -> so a request comes in, it's assigned a trace ID,

1454.89 -> you carry that trace ID through your software,

1458.2 -> instrumenting with calls to send out that trace data,

1462.57 -> or log that trace data, all again, based on that trace.

1466.48 -> So, with this, we can get a hierarchical,

1468.92 -> you can get a hierarchical view of the processing

1471.97 -> of your request and especially the latencies

1475.27 -> and any errors that occurred

1476.83 -> in the processing of that request.

1479.15 -> This enables you to really drill in

1480.56 -> and figure out where is the time going in my application?

1484.33 -> Is my database very slow?

1486.66 -> Or perhaps there's a particular code section

1488.99 -> that is taking most of my latency.

1491.78 -> That gives you the opportunity to dig in

1493.71 -> and really look at that piece of code

1496.51 -> to figure out where the bottleneck is

1499.24 -> and really to remediate that.
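
The trace ID and span idea above can be sketched in a few lines of Python. This is a toy model, not the OpenTelemetry API and not code from the talk: the `Span` class, `render` and `slowest_leaf` are hypothetical names, and the durations are made-up numbers. It shows every span carrying the same trace ID, the hierarchical view, and how that view points at where the time is going.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Span:
    trace_id: str          # shared by every span that belongs to one request
    name: str
    duration_ms: float
    error: bool = False
    children: list["Span"] = field(default_factory=list)

    def child(self, name: str, duration_ms: float, error: bool = False) -> "Span":
        """Create a child span that inherits this span's trace ID."""
        span = Span(self.trace_id, name, duration_ms, error)
        self.children.append(span)
        return span


def render(span: Span, depth: int = 0) -> list[str]:
    """Return the span tree as indented lines with latency and error flag."""
    flag = " ERROR" if span.error else ""
    lines = [f"{'  ' * depth}{span.name} {span.duration_ms:.0f} ms{flag}"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


def slowest_leaf(span: Span) -> Span:
    """Find the leaf span with the largest duration, a likely bottleneck."""
    if not span.children:
        return span
    return max((slowest_leaf(c) for c in span.children), key=lambda s: s.duration_ms)


if __name__ == "__main__":
    # A request comes in and is assigned a trace ID.
    root = Span(uuid.uuid4().hex, "GET /checkout", 480)
    root.child("db.query orders", 350)
    root.child("render template", 40)
    print("\n".join(render(root)))
    print("slowest:", slowest_leaf(root).name)
```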

1502.18 -> Your service map gives you a higher level view

1505.49 -> that's an end-to-end view of all of the microservices

1509.58 -> that you've touched in the processing of your request.

1512.68 -> And again, this all aggregates up

1514.08 -> so you get a view of where is the latency going?

1517.8 -> Again, if there are errors, they'll show up here.

1521.17 -> So this lets you look at your components,

1522.87 -> figure out how they're connected,

1524.61 -> and figure out, you know,

1526.12 -> the dependencies and where there might be a challenge

1529.91 -> with latency around those dependencies.

1533.92 -> And finally, you have trace groups,

1535.32 -> trace groups enable you to bring trace information

1539.67 -> into a grouped format around

1542.69 -> particular activities in the application.

1545.07 -> So this way, you can again, look and figure out

1547.76 -> where is the latency going,

1549.8 -> and where do I need to figure out and fix something?

1556.17 -> We recently have added application analytics.

1559.7 -> This enables you to build application views

1563.95 -> across log, trace, and metric data.

1567.61 -> You select log sources or trace groups

1569.62 -> or services to be part of an application.

1572.44 -> It enables you to monitor availability

1574.52 -> and drill into detailed views

1576.57 -> on the traces and service logs.

1580.05 -> This gives you, again, the span ID and trace ID,

1583.34 -> you can trace into what's going on

1585.1 -> and figure out any issues that you're having.

1589.63 -> One of the features that's a kind of sideways feature,

1592.3 -> but super useful, with OpenSearch dashboards,

1595.87 -> we support a feature called Notebooks.

1598.06 -> Notebooks are documents that enable you

1600.71 -> to put cards onto that document

1603.22 -> to bring all kinds of different information together

1606.53 -> and really tell a story about a particular event

1609.5 -> or something that happened.

1611.46 -> You can export those as PDFs or PNGs

1614.64 -> and you can share them around

1617.41 -> and enable everybody to know what's going on.

1623.74 -> With OpenSearch and Amazon OpenSearch service,

1626.9 -> we provide machine learning kind of innovations,

1630.8 -> and chief amongst these,

1631.92 -> and especially within the observability space,

1635.19 -> is our streaming anomaly detection.

1637.94 -> With streaming anomaly detection,

1640.4 -> the system will automatically learn the correct behavior

1643.87 -> or the normal behavior of a metric that you're sending in,

1647.49 -> metric or metrics.

1649.587 -> It uses Random Cut Forest to predict

1653.4 -> when things are going off the rails

1655.72 -> and is integrated with alerting to send you alerts

1658.95 -> when things like your CPU is suddenly spiking

1661.45 -> or your traffic is suddenly spiking.

1664.2 -> It brings that information to you.

1667.75 -> Recent improvement enables you to,

1671.01 -> basically collect or group by categories within your data.

1675.04 -> So, the typical use case for this would be

1677.45 -> if you're running 1,000 servers,

1680.56 -> and you wanna look at the CPU utilization of,

1684.71 -> you know, you actually wanna group that down by the host

1688.25 -> so that you can see if there's a particular host

1690.81 -> that's exhibiting anomalous behavior,

1693.17 -> you can get an alert for that specific host.
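
To illustrate the per-host grouping described above, here is a deliberately simple Python sketch. It is not Random Cut Forest and not the OpenSearch anomaly detection plugin: it swaps in a plain z-score against each host's own history, only to show why grouping by host isolates the one misbehaving server. The function name `anomalous_hosts`, the threshold and the sample values are all assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def anomalous_hosts(samples, threshold: float = 3.0, min_history: int = 5) -> list[str]:
    """samples: iterable of (host, cpu_percent) pairs in time order.

    Flags hosts whose latest value is more than `threshold` standard
    deviations away from that same host's earlier values.
    """
    history = defaultdict(list)
    for host, value in samples:
        history[host].append(value)  # group the stream by host

    flagged = []
    for host, values in history.items():
        *past, latest = values
        if len(past) < min_history:
            continue  # not enough data to know what "normal" looks like
        spread = pstdev(past) or 1e-9  # avoid dividing by zero on flat series
        if abs(latest - mean(past)) / spread > threshold:
            flagged.append(host)
    return sorted(flagged)


if __name__ == "__main__":
    web1 = [40, 42, 41, 39, 40, 41]
    web2 = [40, 41, 39, 40, 42, 95]  # sudden CPU spike on the last reading
    stream = []
    for a, b in zip(web1, web2):
        stream += [("web-1", a), ("web-2", b)]
    print(anomalous_hosts(stream))
```

Averaged across the whole fleet, a single host spiking could disappear into the noise; keeping a separate baseline per host is what makes the alert specific.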

1699.46 -> Just a quick pass at, sort of,

1702.1 -> how all of this would work in a container service,

1706.73 -> but you have your VPC with your availability zones,

1712.6 -> you have your user sending application traffic in

1716.45 -> via a load balancer,

1718.05 -> it's really all hitting your Kubernetes application,

1722.52 -> which potentially is running with Amazon RDS,

1725.7 -> as a database layer, you're gonna use,

1729.675 -> whether it's the OTEL Collector or Fluent Bit

1732.86 -> or one of the other architectures,

1734.92 -> to bring that data out, to Amazon Managed Grafana,

1738.9 -> Amazon Managed Service for Prometheus, Amazon OpenSearch service,

1742.79 -> and then use Grafana for dashboarding

1745.95 -> or OpenSearch Dashboards for dashboards.

1750.33 -> Just looking at a little bit deeper,

1753.26 -> we have our metrics, traces and logs,

1755.86 -> within the worker node then,

1757.8 -> we have our application pod and container.

1761.3 -> Fluent Bit is gonna tail standard out and standard error,

1765.41 -> gonna forward that to Amazon OpenSearch service.

1769.17 -> We have our OTEL Collector Container.

1771.57 -> OTEL is OpenTelemetry, which is going to

1775.13 -> send metrics and traces off to Prometheus

1779.32 -> as well as Amazon OpenSearch service.

1784.78 -> In the prior diagram, I had a box that said,

1787.617 -> "Buffering and delivery,"

1789.9 -> I wanted to share one of our more cost-efficient

1794.06 -> and good architectures

1796.62 -> in terms of bringing that data across.

1798.4 -> So if you have either Fluentd or Fluent Bit,

1801.24 -> we generally would look at S3 as a staging area

1804.8 -> where you're sending all of your metrics, traces and logs.

1809.86 -> You can trigger off of the bucket notification

1814.4 -> and use a Lambda to queue up the object create into SQS.

1821.5 -> We then have a Lambda that's gonna pull

1823.78 -> the original object, parse it,

1827.2 -> and prepare it to deliver to Amazon OpenSearch service,

1831.05 -> using the bulk API.

1832.88 -> This is a very common pattern that we use.

1836.28 -> It is both, again, cost-efficient,

1838.71 -> it gives you, basically, a backup of all of your data,

1842.66 -> which exists in S3, so that makes it easy to replay

1845.86 -> or work with other tools like Athena

1850.964 -> and this enables you to, again,

1852.95 -> keep that data over the long term at low cost.
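
The last step of that pattern, preparing parsed records for the bulk API, can be sketched as below. This is an illustration rather than the code used in the talk: the bulk request body is newline-delimited JSON, with an action line followed by the document itself for each record. The helper names `parse_log_lines` and `build_bulk_body`, the index name, and the assumption that the staged S3 object holds one JSON document per line are all hypothetical.

```python
import json


def parse_log_lines(raw: str) -> list[dict]:
    """Parse an object's contents, assuming one JSON document per line."""
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


def build_bulk_body(index: str, docs: list[dict]) -> str:
    """Build a newline-delimited bulk body: one action line, then one document line."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    # The bulk body has to end with a newline.
    return "\n".join(lines) + "\n" if lines else ""


if __name__ == "__main__":
    raw_object = '{"level": "error", "message": "db timeout"}\n{"level": "info", "message": "ok"}\n'
    body = build_bulk_body("app-logs", parse_log_lines(raw_object))
    print(body, end="")
```

In the Lambda described above, `raw_object` would be the contents of the object fetched from S3, and `body` would be posted to the domain's `_bulk` endpoint with the appropriate request signing.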

1859.41 -> So three takeaways here,

1861.18 -> observability really is gonna allow you

1864.53 -> to measure and monitor the behavior

1867.27 -> of your applications and infrastructure.

1870.12 -> And the end-goal of all of this observability

1872.6 -> is, again, to bring a better end-user experience

1877.13 -> for better business outcomes for your software.

1882.425 -> We have open-source technologies across the board,

1887.74 -> also powered by AWS, in many cases,

1891.33 -> that enable you to measure and monitor this behavior.

1896.49 -> How can we help?

1898.61 -> Please learn more, look at our OpenSearch Service free trial

1903.92 -> and we would love to hear about what you do.

1908.12 -> Thanks very much for your time and attention today,

1910.22 -> really appreciate it.

Source: https://www.youtube.com/watch?v=1E-ffpHHC5g