AWS Innovate 2022 - Ensure Reliability and Uptime with Observability Solutions | AWS Events
Aug 16, 2023
Running a business in a 24-by-7 world requires you have an observability solution for finding and fixing application and system issues quickly. With the growth of distributed systems running in the cloud, the need for a comprehensive open source observability solution is more acute. In this video, learn how you can use AWS Analytics and observability solutions, including machine learning services, to detect and resolve anomalies, and deliver exceptional customer experiences and service availability. Learn more at: https://go.aws/3smA0Tm
Content
0 -> (cheerful, electronic music)
9.38 -> - Hi everybody, my name is Jon Handler.
11.44 -> I'm a solutions architect with AWS
13.46 -> and I cover Amazon OpenSearch Service
18 -> and the open-source OpenSearch project.
20.13 -> Today we're gonna talk
about reliability and uptime
23.96 -> and how to build observability
with various solutions.
28.66 -> So let's start out by talking
about what is observability.
33.68 -> So observability is collecting,
36.92 -> analyzing data from applications
41.23 -> so that you can understand
what the issues are,
44.8 -> get alerts for those issues,
46.9 -> troubleshoot those issues
and resolve those issues.
50.34 -> Observability really
refers to the collection
53.66 -> of instrumentation and other metric data
57.25 -> around your application
58.73 -> that enables you to solve these issues.
63.84 -> So why do we really need observability?
66.36 -> Well, anytime we're building
some kind of software solution,
70.39 -> there are various problems
that you're gonna have.
73.78 -> There's denial of
service, service outages,
77.23 -> cost overruns, component dependencies,
79.74 -> networking connectivity,
all kinds of challenges
83.62 -> that you're gonna face.
85.42 -> Ultimately, those are
gonna lead to downtime
88.67 -> for your application, for your end-users,
91.59 -> for your customers and really
94.15 -> there's a cost to that to your business.
96.94 -> Not only is there a financial cost,
99.64 -> which can be as much as
$80,000 to $100,000 a day,
104.14 -> there's also a cost in
fatigue for your developers,
109.71 -> in loss of trust with your customers,
113.52 -> and eventually, this downtime
116.31 -> is gonna lead to poor business outcomes.
120.53 -> So what are the goals
for doing observability?
124.797 -> You really want to improve the
overall end-user experience
131.03 -> for your applications and services.
133.49 -> And observability is the suite of tools
136.28 -> and the data you use to
provide that better experience.
144.01 -> Getting specific,
observability really rests
147.14 -> on three different kinds of information.
150.96 -> The first information is log information.
154.9 -> Anybody who's ever done
any debugging knows
157.95 -> that when something goes wrong,
159.9 -> the first thing you need to
do is go look at the logs.
163.09 -> The logs have the information
that lets you know
167.11 -> what happened during the
execution of the application
171.01 -> that ended up causing the problem.
174.13 -> The second kind of data is metrics.
176.78 -> Metrics give you the ability
to monitor what's going on,
180.84 -> how your application is performing,
and, with trend analysis,
184.61 -> to predict and find trends
187.98 -> that will let you know
188.87 -> that you're headed in a poor direction.
192.13 -> Traces are logs that are
produced from your application
197.75 -> that capture the course of the processing
201.5 -> of a single event through your system.
204.56 -> And with these three kinds of information,
208.12 -> you can go in and figure
out exactly what's going on.
212.48 -> We have (mutters) services
214.25 -> that address these various concerns
218.24 -> and we're gonna go into
depth on all of them.
224.69 -> You really want to, ultimately,
227.73 -> influence and improve
both your operational
230.7 -> and your business outcomes.
232.78 -> So on the operational side,
234.74 -> by employing these tools
to look at the information,
239.53 -> you'll achieve better visibility into
243.01 -> the underlying functioning
of your application.
249.8 -> This is gonna enable you
to troubleshoot problems
253.89 -> in near realtime.
256 -> Many of these systems,
most of these systems,
257.93 -> can flow data in in near realtime,
260.38 -> they can alert you when
problems are occurring
263.69 -> or when problems, hopefully,
are about to occur,
267.8 -> and you can go in and use these tools
270.97 -> to look at your log data,
to look at your traces,
273.67 -> to look at your timing and to figure out
276.24 -> what it is that you need to fix.
279.62 -> On the business side,
observability is gonna give you
283.3 -> a more resilient application.
285.53 -> By using these tools,
287.82 -> you'll be able to build
out your applications
290.1 -> in a better way and keep
them up and running longer.
296.18 -> End-goal, as I said, is to
improve customer experience.
300.26 -> So your application is serving
302 -> whatever business need you have
303.95 -> and keeping your customers happy
306.26 -> by running a low-latency, low error-rate,
309.73 -> seamless, kind of
application will, ultimately,
313.38 -> give your customers the best experience.
318.38 -> We have a number of different
workloads and use cases
321.75 -> where observability is really emerging
323.82 -> as a very important tool.
326.33 -> Microservices and containers,
328.84 -> and especially the move to a
service-oriented architecture,
332.05 -> is driving a lot of need to understand
335.82 -> how your application is performing
337.48 -> and how your services are performing
339.74 -> within that application.
341.22 -> So we see a lot of usage in
the microservices architecture.
347.47 -> We also see a lot of use cases
349.16 -> around digital experience monitoring,
351.03 -> so when you have a web
or a mobile application
355.64 -> that is providing some
service to your end-users,
359.3 -> you wanna be able to dig in and understand
362.15 -> what are the dependencies
and how is it working.
367.02 -> On the operational side, and of course,
368.6 -> observability really
flows out of the DevOps,
371.53 -> kind of, methodology.
373.31 -> So when we're looking at
an operational situation,
376.76 -> observability is really gonna, again,
378.21 -> help you dig in,
troubleshoot, and fix issues.
382.77 -> And finally, looking at data lakes
385.44 -> enables you to understand
how your information
388.18 -> is flowing in and how it's
related to one another.
395.92 -> So, on the mechanics, with
observability, it enables you,
402.81 -> again, to detect when
either issues are happening
407.06 -> or issues are about to happen
so that you can investigate,
412.86 -> and investigate, here, is
a kind of broader bucket,
417.63 -> where even if it's not a realtime
problem that's happening,
421.93 -> you still have the ability to dig in
424.67 -> and figure out, over the
past, what has happened
429.27 -> in order to remedy and
improve your software.
434.44 -> And finally, again, remediating
and improving the software
437.94 -> is one of the ultimate goals.
440.56 -> So to be able to remediate issues
443.9 -> and bring back that high
quality user experience.
451.53 -> So this is a complicated question,
454.09 -> is there one observability
tool to rule them all?
457.96 -> Sadly, the answer is no.
460.37 -> At the moment, this is an emerging segment
463.19 -> and we're seeing many
different technologies
466 -> that enable this observability,
kind of, use case
470.5 -> where there're competing possibilities,
474.09 -> a lot of different, you know,
476.16 -> both AWS services and
open source technologies
479.57 -> that give you the fundamental capabilities
481.94 -> that you're looking for,
483.76 -> but you need a way to pick a
single choice to go forward.
490.17 -> We're gonna review all
of the different tools
493 -> that are out there and try and highlight
494.62 -> some of the pros and cons to
help you make that decision.
501.25 -> At a high level,
502.14 -> what you need to accomplish
with observability
505.21 -> is really an end-to-end pipeline
508.48 -> that's going to start
out with instrumentation
512.049 -> to pull or collect the
logs, metrics and traces
518.048 -> into a single stream.
520.82 -> That stream, you're
gonna ingest that stream
524.1 -> into some kind of
collection and storage layer
527.08 -> and then that storage layer can employ
530.38 -> monitoring and alerting
for realtime interaction
535.07 -> with the underlying data,
537.32 -> but you also wanna be able
to search and index that data
540.89 -> in order to go and find the
problems that you're having.
544.01 -> And then, finally, we end up
with a visualization layer
548.35 -> where you can build dashboards
550.55 -> that enable you to correlate data
552.14 -> that's coming in through this pipeline
554.58 -> and also figure out what's going on.
558.05 -> Peeling back the onion just a tiny bit,
560.61 -> we have a number of our
different technologies here.
564.64 -> So on the instrumentation side,
566.85 -> we have our CloudWatch agent,
we have an X-Ray agent,
570.46 -> we have AWS Distro for OpenTelemetry.
573.86 -> We also have Fluent Bit
that enables you, again,
578.96 -> to pull in those data sources
and start to work with them.
584.74 -> Of course, if we have data in CloudWatch,
587.17 -> if you have data in CloudWatch,
588.53 -> there's Amazon CloudWatch ServiceLens
591.57 -> that enables you to
get Container Insights,
594.95 -> Lambda Insights, a bunch
of Contributor Insights,
599.04 -> and other synthetic kind of data.
602.45 -> In the collection and storage layer,
604.23 -> we have CloudWatch
metrics, CloudWatch logs,
607.69 -> X-Ray, Prometheus, and
Amazon Managed Service for Prometheus,
612.51 -> as well as Amazon OpenSearch Service.
614.3 -> So these are, kind of, the containers
616.42 -> where you can flow this data.
619.03 -> On the dashboarding side, then,
621.07 -> we have Grafana and
Amazon Managed Grafana,
624.45 -> as well as OpenSearch Dashboards
626.45 -> that enable you to build visualizations
628.95 -> and dig in with service
maps and trace graphs
633.94 -> and the richer kind of
experiences around this data.
640.22 -> So let's talk through some of
the open-source alternatives.
644.57 -> First of these we're gonna
talk about is OpenSearch
646.61 -> and OpenSearch dashboards.
649.686 -> OpenSearch is a
community-driven, open-source,
653.18 -> search and analytic suite.
654.71 -> It's derived from Apache 2.0
657.48 -> licensed Elasticsearch 7.10.2 and Kibana.
662.93 -> The OpenSearch project is
really the way that we are
668.03 -> carrying forward that core
search and analytics capability
672.81 -> into the open-source world.
676.95 -> The project itself consists of OpenSearch,
680.09 -> which is a search engine
and analytics engine,
685.11 -> as well as OpenSearch dashboards,
686.97 -> which is a UI and visualization layer,
690.11 -> and includes plugins that
enable things like alerting,
693.96 -> anomaly detection, deep level security,
698.61 -> a bunch of different functionality
700.38 -> that's provided by those plugins.
703.3 -> It enables you to ingest your data,
707.17 -> you can secure it, and
analyze it with aggregations,
711.49 -> view it, and bring all the logs,
714.33 -> metrics and traces into one place.
718.03 -> It does have an observability
suite that's built into it,
721.6 -> along with anomaly detection
and alerting, as I mentioned,
725.24 -> and this enables you to identify
727.43 -> and react to issues in your solutions.
732.3 -> Prometheus, also
community-driven, open-source.
735.81 -> It's a systems monitoring
and alerting toolkit.
740.37 -> It collects and stores metrics, primarily,
743.64 -> they are time series and
they have key value pairs,
747.76 -> called labels, that enable
you to bring those metrics in.
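To make the idea of labeled time series concrete, here is a minimal sketch in Python that renders one sample in the Prometheus text exposition style (metric name, key-value labels in braces, then the value). The metric name, label names, and the `format_sample` helper are illustrative assumptions, not something shown in the talk, and a real application would normally use a Prometheus client library rather than formatting lines by hand.

```python
def format_sample(name, labels, value):
    """Render one sample as: name{key="value",...} value"""
    def escape(text):
        # Label values escape backslash, double quote, and newline.
        return (
            str(text)
            .replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
        )

    # Sorting the labels keeps the output stable between runs.
    pairs = ",".join(
        '{}="{}"'.format(key, escape(val)) for key, val in sorted(labels.items())
    )
    if pairs:
        return "{}{{{}}} {}".format(name, pairs, value)
    return "{} {}".format(name, value)


# Hypothetical metric: request latency labeled by method and handler.
sample = format_sample(
    "http_request_duration_seconds",
    {"method": "GET", "handler": "/checkout"},
    0.27,
)
print(sample)
```

Each distinct combination of label values identifies its own time series, which is what lets you slice the same metric by host, handler, or status later on.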
752.61 -> It has a server, where
it's gonna store that data,
755.92 -> an alert manager and a push gateway
758.65 -> to facilitate viewing and
working with that data.
764.42 -> Usually you will handle
visualizations with Grafana.
769.96 -> It really, again, is
more on the metric side.
773.2 -> So it enables you to gain insights
775.45 -> from numeric measurements.
777.16 -> So this is stuff like your CPU use,
780.35 -> your latencies, et cetera.
786.24 -> Grafana is a visualization tool
789.3 -> that is, again,
community-driven, open-source,
792.36 -> it's an analytics platform.
794.64 -> You can query your data,
visualize your data,
796.83 -> alert on your data, in order to understand
799.54 -> what's going on with your
metrics, logs and traces.
802.72 -> It does connect to many
different back-ends.
805.81 -> So the data source plugins enable you
807.84 -> to connect to things
like MongoDB, Dynatrace,
811.01 -> ServiceNow, Amazon OpenSearch Service,
814.02 -> Prometheus,
815.65 -> a lot of different back-ends
that you can connect it to.
819.91 -> You can define alert rules.
821.82 -> So again, alerting is a key capability
825.62 -> within the observability stack.
827.16 -> You want to be able to send out
alerts when things go wrong.
831.74 -> With Grafana, you can send
alerts to places like Slack,
834.87 -> PagerDuty, and OpsGenie.
839.97 -> You can also work with
different time ranges.
843.07 -> That gives you the ability
to kind of drill in
845.44 -> and understand now versus then.
851.57 -> So one of the foundational
elements of observability
855.46 -> is the OpenTelemetry standard,
and the OpenTelemetry stack.
860.88 -> It's an open-source project
863.66 -> designed for the creation and
management of telemetry data,
866.67 -> such as traces, metrics and logs.
869.75 -> It does support a lot of wire formats,
872.03 -> Jaeger, Zipkin and Prometheus.
875 -> This enables you to bring that data in
878.78 -> and bring it into a common format.
884.51 -> It's an evolving project.
887.388 -> A lot of different people
contributing to it,
889.34 -> and we currently have, we'll get to,
893.92 -> the AWS Distro for OpenTelemetry,
897.25 -> which brings it into a
more managed situation.
902.3 -> Currently works with traces and
working on metrics and logs,
907.01 -> but not entirely there yet.
910.948 -> Fluent Bit is in the collection layer,
913.44 -> again, this is the
instrumentation, also open-source,
917.3 -> very fast, light-weight, scalable,
920.83 -> forwards logs, metrics from their sources.
927.42 -> It has a plugin-based ecosystem
929.62 -> that enables you to collect and filter
932.63 -> and transform and augment your data.
937.38 -> It is true open-source,
so it's vendor agnostic
941.81 -> and really comes as a
derivative of Fluentd.
947.9 -> Integrates with Prometheus and OpenSearch,
951.46 -> Amazon CloudWatch,
Amazon X-Ray, Amazon S3,
954.43 -> a lot of different destinations
956.67 -> where you can send this
data and forward it on.
962.37 -> In the OpenSearch project,
we have Data Prepper.
965.88 -> Data Prepper is in the
collection and transformation.
969.34 -> So it's also open-source
971.8 -> and it processes observability data.
976.06 -> So it gives you features
that enable you to filter
978.92 -> and enrich and transform the data
981.839 -> that's coming through the pipe
984.71 -> and enable downstream
analytics and visualization.
991.18 -> It, right now, supports the processing
993.32 -> of distributed trace data, log ingestion,
996.53 -> and is moving towards supporting
metric data in the future.
1003.37 -> Also integrates with Jaeger, Zipkin,
1005.84 -> OpenTelemetry and Fluent Bit.
1011.44 -> So, we're gonna take all of those pieces
1014.03 -> and build them into one
example architecture,
1018.36 -> that is an open-source
focused architecture.
1021.09 -> On the collections side,
1022.87 -> we have our OpenTelemetry collector
1025.61 -> that's going to bring in traces
1028.32 -> and get those to Data Prepper,
which is gonna prepare them.
1033.61 -> We also have Fluent Bit collecting metrics
1037.33 -> and bringing those into Data Prepper.
1041.43 -> Fluent Bit also can tail
and collect log data,
1044.62 -> which it's gonna flow
directly to OpenSearch.
1047.76 -> So we have these two pathways
that come into OpenSearch,
1050.44 -> the Data Prepper pathway
for traces and metrics,
1054.02 -> logs flowing from Fluent Bit,
directly to OpenSearch.
1058.14 -> We can also take metrics out
to Prometheus from Fluent Bit
1063.13 -> for our metric evaluation.
1067.66 -> On the visualization side,
1069.29 -> then OpenSearch dashboards
will connect with OpenSearch
1072.5 -> and enable you to build visualizations
1074.55 -> and employ the more rich
experience within OpenSearch,
1079.16 -> specifically around traces and log data.
1085.14 -> We can use Grafana to
connect both to OpenSearch
1088.47 -> and Prometheus and that
gives us additional ability
1092.5 -> to bring that metric data into
the same location, visually,
1096.52 -> as the log and trace data.
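The example architecture just described can be summarized as a small routing table. The sketch below is only a way to restate the talk's data paths in Python; the dictionary names and helper functions are hypothetical and do not correspond to any real configuration format for these tools.

```python
# Signal paths as described in the talk: (collector, ..., store).
ROUTES = {
    "traces": [("OpenTelemetry Collector", "Data Prepper", "OpenSearch")],
    "metrics": [
        ("Fluent Bit", "Data Prepper", "OpenSearch"),
        ("Fluent Bit", "Prometheus"),
    ],
    "logs": [("Fluent Bit", "OpenSearch")],
}

# Which visualization tools read from which store.
DASHBOARDS = {
    "OpenSearch": ["OpenSearch Dashboards", "Grafana"],
    "Prometheus": ["Grafana"],
}


def stores_for(signal):
    """Return the stores a signal type ends up in (last hop of each path)."""
    return sorted({path[-1] for path in ROUTES[signal]})


def dashboards_for(signal):
    """Return every dashboard tool that can display a signal type."""
    tools = set()
    for store in stores_for(signal):
        tools.update(DASHBOARDS[store])
    return sorted(tools)


for signal in ROUTES:
    print(signal, "->", stores_for(signal), "->", dashboards_for(signal))
```

Reading the table this way makes the design choice visible: Grafana is the one tool that can see both stores, which is why it is used to bring metrics alongside log and trace data.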
1105.15 -> So we've gone through
1106.39 -> a number of the different
open-source alternatives.
1110.55 -> Many of those have a Managed
Solution, within AWS,
1114.38 -> that will make it easier to
deploy and run and operate
1118.71 -> those observability components.
1121.5 -> First of those we'll talk about is
1122.78 -> Amazon Managed Service for Prometheus.
1126.56 -> It's Prometheus compatible.
1128.55 -> It's a monitoring and
alerting service, and again,
1131.81 -> you use it for monitoring
containerized applications
1134.62 -> and infrastructure at scale.
1137.32 -> It automatically scales
for ingestion and storage
1140.72 -> and alerting and querying of the metrics
1143.39 -> as your workload grows or shrinks.
1146.97 -> It's integrated with Amazon EKS,
1149.21 -> Elastic Kubernetes Service,
Elastic Container Service,
1153.14 -> and AWS Distro for OpenTelemetry.
1158.78 -> We said a little bit about
Amazon OpenSearch service,
1161.1 -> but Amazon OpenSearch
service is a managed service
1163.97 -> that makes it easy to deploy,
operate and scale OpenSearch
1167.83 -> and legacy Elasticsearch
clusters in the AWS Cloud.
1171.84 -> It has some built-in observability tooling
1174.29 -> with trace analytics panel,
1176.98 -> event analytics panel and
a log analytics panel.
1181.62 -> It also comes with
anomaly detection features
1184.73 -> that automatically learn the
normal behavior of your system
1188.33 -> and can generate alerts
1189.84 -> when that behavior leaves the normal.
1194.7 -> It's integrated with
Kinesis Data Firehose,
1197.29 -> CloudWatch logs and other tooling.
1202.599 -> ServiceLens enables you
to run observability
1206.47 -> on top of your CloudWatch data,
1208.87 -> integrate traces, metrics,
logs and alarms into one place.
1214.29 -> It integrates CloudWatch with AWS X-Ray,
1217.72 -> also to provide an end-to-end service map
1220.5 -> and view of your application.
1223.24 -> You can do correlation
against Lambda functions,
1226.13 -> API Gateways, Java applications,
1229.41 -> and either in containers or on EC2.
1234.46 -> Amazon Managed Grafana is
a secure production-ready,
1237.76 -> open-source distribution,
supported by AWS.
1241.77 -> We developed it in
collaboration with Grafana Labs
1244.9 -> and it enables you to
connect to data sources
1246.93 -> like Amazon OpenSearch Service,
1248.84 -> Amazon Managed Service for Prometheus,
1251.27 -> X-Ray, CloudWatch, Timestream,
1253.86 -> it has security built in and enables you
1257.51 -> to connect securely with
all of these data sources
1260.5 -> in a way that preserves their integrity.
1263.1 -> There are some pre-built dashboards
1265.25 -> that give you faster access
1267.44 -> to the insights that you're looking for.
1270.19 -> AWS X-Ray collects data about requests
1272.52 -> that your application
serves and gives you tools
1275.68 -> to view and gain insights into that data.
1279.72 -> X-Ray receives traces
from your application,
1282.64 -> in addition to AWS
services, like AWS Lambda,
1286.39 -> that enable you to
bring that trace data in
1289.23 -> and view service maps and trace graphs
1291.79 -> and the other tools that you're used to.
1297.94 -> Amazon OpenSearch service
is a managed service
1301.47 -> that enables you to deploy,
scale and operate OpenSearch
1305.88 -> within the AWS Cloud.
1309.49 -> It supports OpenSearch
versions 1.0 and 1.1,
1313.6 -> as of today, as well as
legacy Elasticsearch versions
1317.74 -> from 1.5 to 7.10, with
visualization capabilities
1322.07 -> provided by OpenSearch
dashboards and Kibana,
1325.625 -> for the 1.5 to 7.10 versions.
1331.5 -> Finally, we'll talk a
little bit about AWS Distro
1333.56 -> for OpenTelemetry,
secure, production-ready,
1337.03 -> AWS supported distribution
of the OpenTelemetry project.
1340.81 -> It is backed by AWS support
1343.43 -> and gives you one-click deploy
1346.137 -> from the ECS and Lambda consoles.
1350.75 -> There are exporters that
enable monitoring solutions,
1353.58 -> like AMP, Amazon Managed
service for Prometheus,
1357.436 -> CloudWatch, X-Ray, OpenSearch service
1359.92 -> and other third party solutions.
1363.9 -> We're gonna dig in, just a touch,
1365.18 -> on Amazon OpenSearch service,
1366.81 -> just to give you a feel
for some of the specifics
1369.37 -> around what observability
looks like in practice.
1374.26 -> So again, we have our
logs, metrics and traces,
1377.7 -> we're gonna look at, first, logs.
1380.86 -> So within Amazon OpenSearch service
1382.92 -> and from OpenSearch dashboards,
1386.29 -> you can do all kinds of visualizations
1388.35 -> that enable you, first of all,
1390.71 -> to see in aggregate what's happening.
1394.4 -> We have live tailing of the logs,
1397.05 -> including surrounding events.
1398.43 -> So this enables you to dig in
1400.8 -> and really look at the
logs and figure out,
1403.33 -> okay, here's some kind of gross statistics
1406.5 -> about what's happening
1407.72 -> and here are some specific
log lines, you know,
1409.98 -> OpenSearch is ultimately a search engine,
1411.93 -> I can search for my errors,
I can look at what's going on
1415.02 -> and what's happening around
that in my log files,
1418.42 -> in a time-based kind of way.
1423.83 -> Amazon OpenSearch service
and OpenSearch Dashboards
1426.89 -> provide a complete, sort of,
trace analytics experience,
1433.19 -> and when we talk about, you
know, sort of trace analytics,
1435.62 -> there are a couple of major
components that we see.
1439.02 -> The first of those is trace spans.
1442.13 -> Again, traces provide that end-to-end view
1445.24 -> of the processing of a
request within your system.
1448.68 -> They're all connected
together by a single trace ID,
1451.58 -> so a request comes in,
it's assigned a trace ID,
1454.89 -> you carry that trace ID
through your software,
1458.2 -> instrumenting with calls to
send out that trace data,
1462.57 -> or log that trace data, all
again, based on that trace.
1466.48 -> So, with this, we can get a hierarchical,
1468.92 -> you can get a hierarchical
view of the processing
1471.97 -> of your request and
especially the latencies
1475.27 -> and any errors that occurred
1476.83 -> in the processing of that request.
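As a rough illustration of how a shared trace ID and parent links produce that hierarchical view, here is a minimal Python sketch. The `Span` class, the span names, and the durations are invented for the example; real instrumentation would come from a tracing SDK such as OpenTelemetry rather than hand-built objects.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str             # shared by every span in one request
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    duration_ms: float
    error: bool = False


def render_tree(spans, parent_id=None, depth=0):
    """Return indented lines showing each span under its parent."""
    lines = []
    for span in spans:
        if span.parent_id == parent_id:
            flag = " ERROR" if span.error else ""
            lines.append(
                "{}{} {:.0f} ms{}".format("  " * depth, span.name, span.duration_ms, flag)
            )
            lines.extend(render_tree(spans, span.span_id, depth + 1))
    return lines


trace_id = uuid.uuid4().hex  # assigned once, when the request arrives
spans = [
    Span(trace_id, "a", None, "GET /checkout", 480.0),
    Span(trace_id, "b", "a", "inventory-service", 60.0),
    Span(trace_id, "c", "a", "orders-db query", 390.0),
]

tree = render_tree(spans)
print("\n".join(tree))

# The slowest child span points at where the time is going.
slowest_child = max(
    (s for s in spans if s.parent_id is not None), key=lambda s: s.duration_ms
)
print("slowest child span:", slowest_child.name)
```

In this made-up request the database query accounts for most of the latency, which is the kind of conclusion the trace view is meant to surface.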
1479.15 -> This enables you to really drill in
1480.56 -> and figure out where is the
time going in my application?
1484.33 -> Is my database very slow?
1486.66 -> Or perhaps there's a
particular code section
1488.99 -> that is taking most of my latency.
1491.78 -> That gives you the opportunity to dig in
1493.71 -> and really look at that piece of code
1496.51 -> to figure out where the bottleneck is
1499.24 -> and really to remediate that.
1502.18 -> Your service map gives
you a higher level view
1505.49 -> that's an end-to-end view
of all of the microservices
1509.58 -> that you've touched in the
processing of your request.
1512.68 -> And again, this all aggregates up
1514.08 -> so you get a view of where
is the latency going?
1517.8 -> Again, if there are errors,
they'll show up here.
1521.17 -> So this lets you look at your components,
1522.87 -> figure out how they're connected,
1524.61 -> and figure out, you know,
1526.12 -> the dependencies and where
there might be a challenge
1529.91 -> with latency around those dependencies.
1533.92 -> And finally, you have trace groups,
1535.32 -> trace groups enable you
to bring trace information
1539.67 -> into a grouped format around
1542.69 -> particular activities in the application.
1545.07 -> So this way, you can
again, look and figure out
1547.76 -> where is the latency going,
1549.8 -> and where do I need to
figure out and fix something?
1556.17 -> We recently have added
application analytics.
1559.7 -> This enables you to
build application views
1563.95 -> across log, trace, and metric data.
1567.61 -> You select log sources or trace groups
1569.62 -> or services to be part of an application.
1572.44 -> It enables you to monitor availability
1574.52 -> and drill into detailed views
1576.57 -> on the traces and service logs.
1580.05 -> This gives you, again,
the span ID and trace ID,
1583.34 -> you can trace into what's going on
1585.1 -> and figure out any issues
that you're having.
1589.63 -> One of the features that's
a kind of sideways feature,
1592.3 -> but super useful, with
OpenSearch dashboards,
1595.87 -> we support a feature called Notebooks.
1598.06 -> Notebooks are documents that enable you
1600.71 -> to put cards onto that document
1603.22 -> to bring all kinds of
different information together
1606.53 -> and really tell a story
about a particular event
1609.5 -> or something that happened.
1611.46 -> You can export those as PDFs or PNGs
1614.64 -> and you can share them around
1617.41 -> and enable everybody to
know what's going on.
1623.74 -> With OpenSearch and
Amazon OpenSearch service,
1626.9 -> we provide machine learning
kind of innovations,
1630.8 -> and chief amongst these,
1631.92 -> and especially within
the observability space,
1635.19 -> is our streaming anomaly detection.
1637.94 -> With streaming anomaly detection,
1640.4 -> the system will automatically
learn the correct behavior
1643.87 -> or the normal behavior of a
metric that you're sending in,
1647.49 -> metric or metrics.
1649.587 -> It uses Random Cut Forest to predict
1653.4 -> when things are going off the rails
1655.72 -> and is integrated with
alerting to send you alerts
1658.95 -> when things like your
CPU is suddenly spiking
1661.45 -> or your traffic is suddenly spiking.
1664.2 -> It brings that information to you.
1667.75 -> Recent improvement enables you to,
1671.01 -> basically collect or group by
categories within your data.
1675.04 -> So, the typical use case for this would be
1677.45 -> if you're running 1,000 servers,
1680.56 -> and you wanna look at
the CPU utilization of,
1684.71 -> you know, you actually wanna
group that down by the host
1688.25 -> so that you can see if
there's a particular host
1690.81 -> that's exhibiting anomalous behavior,
1693.17 -> you can get an alert
for that specific host.
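The per-host grouping idea can be sketched with a much simpler stand-in than Random Cut Forest. The Python example below uses a plain z-score against each host's own history; it is explicitly not the algorithm OpenSearch uses, and the threshold, minimum history length, host names, and CPU values are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def find_anomalies(readings, threshold=3.0, min_history=5):
    """Flag (host, value) readings far from that host's own history.

    readings: iterable of (host, cpu_percent) pairs in arrival order.
    """
    history = defaultdict(list)
    anomalies = []
    for host, value in readings:
        past = history[host]
        if len(past) >= min_history:
            mu = mean(past)
            sigma = pstdev(past)
            # Guard against a flat history, where sigma is zero.
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                anomalies.append((host, value))
        past.append(value)
    return anomalies


readings = []
for cpu in [20, 22, 21, 19, 20, 21]:
    readings.append(("host-a", cpu))
    readings.append(("host-b", cpu + 30))  # host-b normally runs hotter
readings.append(("host-a", 95))  # sudden spike on one host
readings.append(("host-b", 52))  # ordinary value for host-b

anomalies = find_anomalies(readings)
print(anomalies)
```

Because each host is compared only with its own baseline, a value that is normal for one host does not mask a spike on another, which is the benefit of grouping by category.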
1699.46 -> Just a quick pass at, sort of,
1702.1 -> how all of this would work
in a container service,
1706.73 -> but you have your VPC with
your availability zones,
1712.6 -> you have your user sending
application traffic in
1716.45 -> via a load balancer,
1718.05 -> it's really all hitting
your Kubernetes application,
1722.52 -> which potentially is
running with Amazon RDS,
1725.7 -> as a database layer, you're gonna use,
1729.675 -> whether it's the OTEL
Collector or Fluent Bit
1732.86 -> or one of the other architectures,
1734.92 -> to bring that data out, to
Amazon Managed Grafana,
1738.9 -> Amazon Managed Service for Prometheus, Amazon
OpenSearch Service,
1742.79 -> and then use Grafana for dashboarding
1745.95 -> or OpenSearch Dashboards for dashboarding.
1750.33 -> Just looking at a little bit deeper,
1753.26 -> we have our metrics, traces and logs,
1755.86 -> within the worker node then,
1757.8 -> we have our application pod and container.
1761.3 -> Fluent Bit is gonna tail
standard out and standard error,
1765.41 -> gonna forward that to
Amazon OpenSearch service.
1769.17 -> We have our OTEL Collector Container.
1771.57 -> OTEL is OpenTelemetry, which is going to
1775.13 -> send metrics and traces off to Prometheus
1779.32 -> as well as Amazon OpenSearch service.
1784.78 -> In the prior diagram,
I had a box that said,
1787.617 -> "Buffering and delivery,"
1789.9 -> I wanted to share one of
our more cost-efficient
1794.06 -> and good architectures
1796.62 -> in terms of bringing that data across.
1798.4 -> So if you have either
Fluentd or Fluent Bit,
1801.24 -> we generally would look
at S3 as a staging area
1804.8 -> where you're sending all of
your metrics, traces and logs.
1809.86 -> You can trigger off of
the bucket notification
1814.4 -> and use a Lambda to queue up
the object create into SQS.
1821.5 -> We then have a Lambda that's gonna pull
1823.78 -> the original object, parse it,
1827.2 -> and prepare it to deliver to
Amazon OpenSearch service,
1831.05 -> using the bulk API.
1832.88 -> This is a very common pattern that we use.
1836.28 -> It is both, again, cost-efficient,
1838.71 -> it gives you, basically, a
backup of all of your data,
1842.66 -> exists in S3, so that
makes it easy to replay
1845.86 -> or work with other tools like Athena
1850.964 -> and this enables you to, again,
1852.95 -> keep that data over the
long term at low cost.
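The parse-and-prepare step of that second Lambda might look roughly like the Python sketch below, which turns the text of a staged object into a newline-delimited bulk request body. The log line layout, the `parse_log_line` helper, and the index name `app-logs` are assumptions for illustration; fetching the object from S3, reading the SQS message, signing the request, and sending it are deliberately left out.

```python
import json


def parse_log_line(line):
    """Hypothetical parser for lines shaped like '<timestamp> <level> <message>'."""
    timestamp, level, message = line.split(" ", 2)
    return {"@timestamp": timestamp, "level": level, "message": message}


def build_bulk_body(index_name, documents):
    """Build the newline-delimited JSON body that the _bulk endpoint expects."""
    lines = []
    for doc in documents:
        # Each document is preceded by an action line naming the target index.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    # The bulk body must end with a trailing newline.
    return "\n".join(lines) + "\n"


# Stand-in for the contents of one object pulled from the staging bucket.
raw_object = (
    "2022-03-01T12:00:00Z ERROR payment gateway timeout\n"
    "2022-03-01T12:00:01Z INFO retry succeeded\n"
)

docs = [parse_log_line(line) for line in raw_object.splitlines() if line]
body = build_bulk_body("app-logs", docs)
print(body, end="")
```

Batching many documents into one bulk request keeps the number of calls to the domain low, and because the original object stays in S3 the same function can be rerun to replay data.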
1859.41 -> So three takeaways here,
1861.18 -> observability really is gonna allow you
1864.53 -> to measure and monitor the behavior
1867.27 -> of your applications and infrastructure.
1870.12 -> And the end-goal of all
of this observability
1872.6 -> is, again, to bring a
better end-user experience
1877.13 -> for better business
outcomes for your software.
1882.425 -> We have open-source
technologies across the board,
1887.74 -> also powered by AWS, in many cases,
1891.33 -> that enable you to measure
and monitor this behavior.
1896.49 -> How can we help?
1898.61 -> Please learn more, look at our
OpenSearch Service free trial,
1903.92 -> and we would love to hear about what you do.
1908.12 -> Thanks very much for your
time and attention today,
1910.22 -> really appreciate it.
Source: https://www.youtube.com/watch?v=1E-ffpHHC5g