Episode 1: Introduction to Amazon DevOps Guru | Amazon Web Services
Aug 16, 2023
In this first episode of Amazon DevOps Guru deep dive series, we will provide a high-level foundation on how DevOps Guru helps identify and remediate operational issues and the basics on how to use the service. Learn more about Amazon DevOps Guru by visiting https://go.aws/3I7RPvy
Content
0 -> (bright music)
5.04 -> - [Rafael] Hello, and welcome
to a new video series
7.56 -> focusing on Amazon DevOps Guru,
10.03 -> a machine learning
powered service from AWS,
12.59 -> helping developers
13.56 -> and operators automatically
detect anomalies,
16.21 -> and improve application availability.
18.792 -> My name is Rafael Ramos.
20.83 -> I'm a senior solutions architect with AWS,
23.67 -> and I'll be leading you
through this first episode.
26.69 -> In this introduction, we'll
establish a foundation
29.54 -> about how AWS is bringing AI and ML
32.56 -> into the development
and operations workflows,
35.197 -> to materialize what we
call NextGen DevOps.
39.27 -> We'll also cover the basic
concepts of DevOps Guru,
42.26 -> and typical use cases.
45.19 -> In subsequent episodes,
46.6 -> we'll dive deeper into
concepts, best practices,
49.288 -> and how to use DevOps
Guru in specific contexts.
53.9 -> Regardless of whether
you are a small business,
56.34 -> or an enterprise,
57.203 -> applications are becoming more
distributed and more complex.
61.48 -> Because of this, maintaining and improving
63.98 -> application availability,
65.161 -> and resolving operational
issues is difficult.
68.862 -> Solutions demand additional
dashboards, alarms,
72.3 -> and metrics, that allow you to monitor
74.225 -> all the IT resources and applications.
77.457 -> Every time you add a
new service or resource,
80.94 -> you need to complete the
observability coverage.
83.792 -> This entails manually enabling monitoring,
87.28 -> configuring all the alarms,
and setting various metrics.
90.658 -> You also need to define
different thresholds
93.48 -> for alarms, which is a very tricky task.
96.598 -> From an operational perspective,
98.586 -> configuring all that manually,
100.92 -> and collecting aggregated metrics
102.75 -> from so many different
resources can be overwhelming.
106.81 -> When an issue occurs,
108.31 -> there are multiple alarms
you must deal with.
111.07 -> Then you need to stitch together
all the various data points
114.69 -> and correlate the failures
with potential causes
117.3 -> for the issue.
118.43 -> This process usually takes a
significant amount of time.
121.92 -> And because of that,
122.872 -> your mean time to recovery,
or MTTR, is affected.
126.951 -> Your customers are then
impacted with downtime,
130.07 -> or poor user experience.
131.69 -> Think about the number of
alarms, metrics, and traces,
135.31 -> that an operator must track,
137.34 -> in order to find the
root cause of an issue
140.24 -> in a microservices environment,
142.22 -> with dozens, or hundreds of applications.
145.302 -> Then, think about the number
of alarms and thresholds
148.093 -> you need to enable just
to have full coverage.
151.845 -> This is why Amazon
DevOps Guru was created.
155.41 -> With its ML-based anomaly detectors,
158.7 -> DevOps Guru makes it easy
160.11 -> to improve application availability,
162.47 -> and this is the key aspect
I want to highlight
165.52 -> in this introduction.
167.61 -> Our approach towards NextGen DevOps
169.767 -> is to apply machine learning
171.45 -> to operational data
from your AWS resources.
174.9 -> So when anomalous behavior is detected,
177.867 -> instead of having noise from
several different alarms,
181.48 -> DevOps Guru provides an intelligent digest
183.808 -> of various events and
anomalies that you can use
187.07 -> to quickly find the root cause of an issue,
189.38 -> and reduce the MTTR.
191.42 -> DevOps Guru is a fully managed service,
194.16 -> which means you don't need
machine learning expertise.
197.341 -> It uses pretrained models,
199.64 -> informed by years of experience
202.1 -> from Amazon.com and AWS
operational excellence.
206.28 -> These models can identify
anomalous application behavior
209.571 -> such as increased latency, error rates,
212.9 -> and resource constraints,
214.91 -> and surface critical issues
216.73 -> that could cause potential
outages, or service disruptions.
221.18 -> You don't have to manually configure
222.528 -> tons of alarms and thresholds
across your AWS resources.
227.01 -> With these pre-trained ML models,
229.745 -> DevOps Guru implicitly sets
thresholds dynamically,
232.903 -> based on the behavior of the application,
235.74 -> and alerts you
236.573 -> only when the implicit
thresholds are breached.
239.24 -> For example,
240.61 -> if your application typically handles
242.186 -> 10 requests per second,
244.7 -> and suddenly the number of
requests increases to 1000,
248.153 -> you might be under a DDoS attack,
250.83 -> and DevOps Guru will notify you.
253.434 -> On the other hand, if these
1000 requests per second
256.665 -> become the standard traffic,
259.07 -> you don't have to change
the threshold manually,
261.58 -> because DevOps Guru anomaly detectors
263.27 -> will adapt accordingly.
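The adaptive-threshold idea in this example can be illustrated with a minimal sketch. This is not how DevOps Guru's models work internally (the video does not describe them). It assumes a simple rolling mean and standard deviation, and the class name, window size, and multipliers are all hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


class RollingBaseline:
    """Toy detector (hypothetical): flags values far from the recent average."""

    def __init__(self, window=20, k=3.0, min_samples=5):
        self.values = deque(maxlen=window)  # only the most recent samples count
        self.k = k
        self.min_samples = min_samples

    def observe(self, value):
        """Return True if `value` is anomalous, then add it to the baseline."""
        anomalous = False
        if len(self.values) >= self.min_samples:
            mu = mean(self.values)
            sigma = pstdev(self.values)
            # Allow at least 50% deviation so a flat baseline tolerates small jitter.
            band = max(self.k * sigma, 0.5 * mu)
            anomalous = abs(value - mu) > band
        self.values.append(value)
        return anomalous


detector = RollingBaseline()
steady = [detector.observe(10) for _ in range(20)]      # normal traffic, 10 req/s
spike = detector.observe(1000)                          # sudden surge
settling = [detector.observe(1000) for _ in range(40)]  # surge becomes the norm
print(any(steady), spike, settling[-1])  # → False True False
```

The surge is flagged when it first appears, but once the window contains only the new traffic level, the same value is treated as normal without anyone editing a threshold.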
265.64 -> Enabling DevOps Guru is as
simple as clicking a button,
269.25 -> and providing guidance about
your application boundaries.
272.97 -> To understand the applications' behavior,
275.29 -> DevOps Guru automatically starts
276.94 -> to monitor the AWS resources,
278.955 -> by ingesting CloudWatch vended metrics
282.18 -> and AWS CloudTrail events.
284.482 -> On its dashboard,
286.46 -> DevOps Guru doesn't display
fancy graphs, charts,
289.34 -> or running counters.
290.84 -> This is by design,
291.673 -> because we want to offload you
293.42 -> from monitoring the system's health.
295.302 -> DevOps Guru takes care of this aspect,
297.89 -> and only notifies you when
an anomaly is detected.
301.68 -> When this happens,
302.99 -> DevOps Guru correlates issues, and trails,
305.689 -> to provide you with a
meaningful sequence of events,
309.11 -> which is called an insight.
311.26 -> Instead of dealing with multiple alarms
313.05 -> that don't pinpoint the symptoms,
314.68 -> and root cause of an issue,
316.563 -> a DevOps Guru Insight
318.27 -> groups together the information you need
319.905 -> about the failing resources,
322.14 -> and provides prescriptive guidance,
324.18 -> with recommendations to
resolve the issue quickly.
327.84 -> Now let's look
328.69 -> at how DevOps Guru
works behind the scenes.
331.964 -> This diagram depicts a simple
serverless application,
335.315 -> defined by a CloudFormation template.
338.5 -> After it's deployed,
the CloudFormation stack
341.33 -> defines the application
boundaries for DevOps Guru.
344.262 -> This application has three main components
347.177 -> we can see in the diagram:
349.3 -> an API Gateway endpoint
351.33 -> that invokes a Lambda function
352.94 -> to process the business logic,
355.04 -> and a DynamoDB table,
where we store the data.
358.31 -> Let's say that someone
359.155 -> accidentally updates the DynamoDB table
362.15 -> to reduce its read capacity.
365.07 -> At the same time,
366.129 -> there is a surge of HTTP
traffic to the application.
370.6 -> Eventually, the DynamoDB
table will be throttled,
373.232 -> because it can't keep
up with the new number
375.97 -> of read requests.
377.72 -> In this hypothetical scenario,
379.64 -> the operator would see all
three components failing.
383.3 -> The DynamoDB table would be throttled.
386.31 -> The Lambda function would
have failed executions,
389.147 -> and the API Gateway endpoint would respond
392.06 -> with HTTP 500 errors.
394.82 -> To resolve this issue,
396.23 -> the operator would need to understand
398.08 -> the root cause of the issue,
399.415 -> and what the side effects are.
402.46 -> In this case, even though
the Lambda function
405.03 -> and the API Gateway
endpoint are both failing,
408.78 -> they're just side
effects of the root cause
411.081 -> which is the reduced capacity
of the DynamoDB table.
414.97 -> After resolving the incident,
416.89 -> the operator would have to spend
more time on a postmortem,
420.24 -> to identify that the read capacity
421.91 -> of the DynamoDB table was decreased
425.16 -> because of a bug in a
previous application release.
428.59 -> This is how DevOps Guru can help you
430.57 -> with machine learning
based anomaly detectors.
434.1 -> It monitors your application,
436.15 -> by ingesting operational data
438.2 -> coming from CloudWatch and CloudTrail.
441.158 -> In this example, DevOps Guru
checks the relevant events
444.63 -> on CloudTrail, and compares them
446.89 -> with the operational
metrics from CloudWatch.
450.09 -> Based on this data, when
it detects the anomaly,
453.18 -> it creates an insight, which
presents the relevant metrics,
456.638 -> so the operator can quickly find
458.662 -> and fix the root cause of the incident.
461.487 -> Taking it a step further, DevOps Guru
465.05 -> can identify that someone has
changed the DynamoDB table
469.11 -> read capacity, just before
the anomaly was detected,
473.06 -> and recommend rolling back this change.
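The correlation step described here, matching a recent change to the anomaly that followed it, can be sketched in a few lines. This is an illustration only, not DevOps Guru's actual logic. The function name, the 30-minute lookback, and the sample events are all made up for the example.

```python
from datetime import datetime, timedelta


def changes_before_anomaly(change_events, anomaly_time, lookback=timedelta(minutes=30)):
    """Return change events that happened shortly before the anomaly, newest first."""
    window_start = anomaly_time - lookback
    candidates = [
        event for event in change_events
        if window_start <= event["time"] <= anomaly_time
    ]
    return sorted(candidates, key=lambda event: event["time"], reverse=True)


# Made-up sample data, loosely shaped like audit-log entries.
events = [
    {"time": datetime(2023, 8, 16, 9, 0), "name": "DeployRelease", "resource": "orders-fn"},
    {"time": datetime(2023, 8, 16, 11, 50), "name": "UpdateTable", "resource": "orders-table"},
]
anomaly_detected_at = datetime(2023, 8, 16, 12, 0)

suspects = changes_before_anomaly(events, anomaly_detected_at)
print([event["name"] for event in suspects])  # → ['UpdateTable']
```

Only the table update falls inside the lookback window, so it is the one change worth reviewing first, which mirrors the rollback recommendation in the scenario.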
476.17 -> One thing to keep in mind
477.88 -> is that after DevOps Guru is enabled,
480.48 -> it takes at least 24 hours
482.67 -> to create a reliable behavior
baseline of the application.
486.67 -> Let's take a closer look
487.599 -> at how to enable DevOps
Guru on the console.
491.31 -> Navigate to the settings,
492.71 -> and you'll be able to enable
it with just a few clicks.
496.41 -> You can either choose to
analyze all AWS resources
500.03 -> in the current AWS account and region,
503.24 -> choose specific CloudFormation stacks,
505.86 -> or AWS tags, to define
your coverage boundary.
510.3 -> Let's select choose
later, and click enable.
513.854 -> Now we can see the DevOps Guru dashboard.
516.608 -> As a summary, it displays the
number of resources analyzed
520.442 -> in the last hour, as well as
ongoing and impacted insights.
524.608 -> We haven't selected our
coverage boundary yet.
527.927 -> Let's do it now,
529.25 -> by selecting which CloudFormation
stack we want to monitor.
533 -> Click on analyze resources,
535.146 -> and then on the manage AWS
resources selection button.
539.82 -> Let's select the option to
analyze all AWS resources
543.028 -> in the specified CloudFormation
stacks in this region
547.29 -> and select which
CloudFormation stack we want.
550.253 -> Then click on the save
button, and confirm.
554.58 -> DevOps Guru is now
monitoring the resources
557.44 -> from that CloudFormation stack,
559.225 -> and will create an insight every
time it detects an anomaly.
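The same coverage boundary can also be set programmatically. The sketch below only builds the request for the DevOps Guru `UpdateResourceCollection` operation and does not call AWS. The helper name and the stack name are hypothetical, and the boto3 call in the trailing comment assumes boto3 is installed and credentials are configured.

```python
def stack_collection_request(stack_names, action="ADD"):
    """Build keyword arguments for the UpdateResourceCollection operation."""
    if action not in ("ADD", "REMOVE"):
        raise ValueError("action must be 'ADD' or 'REMOVE'")
    if not stack_names:
        raise ValueError("at least one stack name is required")
    return {
        "Action": action,
        "ResourceCollection": {"CloudFormation": {"StackNames": list(stack_names)}},
    }


request = stack_collection_request(["my-serverless-app"])  # hypothetical stack name
print(request)

# With boto3 available, the request would be sent like this (not executed here):
#   boto3.client("devops-guru").update_resource_collection(**request)
```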
563.323 -> You probably want to be
notified when an anomaly occurs.
567.839 -> If we go back to settings,
569.696 -> you'll see an option to
set SNS notifications.
574.105 -> Click on set up
notifications, add SNS topic,
578.749 -> and create a new SNS topic.
581.823 -> Give it a name, and click save.
585.64 -> Now every time DevOps
Guru creates an insight
588.117 -> it'll send a notification
to this SNS topic.
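For an existing topic, the notification channel can be registered through the `AddNotificationChannel` operation instead of the console. As a hedged sketch, the helper below (a hypothetical name) checks that the value looks like an SNS topic ARN and builds the request. The ARN shown is a placeholder, and nothing is sent to AWS.

```python
def sns_channel_request(topic_arn):
    """Build keyword arguments for the AddNotificationChannel operation."""
    parts = topic_arn.split(":")
    # An SNS topic ARN has six colon-separated parts: arn:partition:sns:region:account:name
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "sns":
        raise ValueError(f"not an SNS topic ARN: {topic_arn!r}")
    return {"Config": {"Sns": {"TopicArn": topic_arn}}}


# Placeholder ARN for illustration only.
channel = sns_channel_request("arn:aws:sns:us-east-1:123456789012:devops-guru-insights")
print(channel)

# With boto3 available, the request would be sent like this (not executed here):
#   boto3.client("devops-guru").add_notification_channel(**channel)
```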
591.96 -> If DevOps Guru sounds like
something you want to try,
595.72 -> you can start with the free tier.
598.35 -> At the time of this recording
600.22 -> the free tier includes analysis
of 7,200 AWS resource hours,
605.791 -> for each resource group A and B,
609.27 -> and usage of 10,000 DevOps
Guru API calls per month,
612.9 -> for three months.
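To put the 7,200 resource hours in perspective, here is a rough calculation. It assumes the allowance applies per month, that resources are analyzed around the clock, and a 30-day month. The function name is hypothetical, and current pricing pages should be checked for up-to-date figures.

```python
FREE_RESOURCE_HOURS = 7_200  # per resource group, as quoted in the video


def always_on_resources_covered(free_hours=FREE_RESOURCE_HOURS, days_in_month=30):
    """Number of continuously analyzed resources the allowance covers in one month."""
    hours_per_resource = 24 * days_in_month  # one resource analyzed all month
    return free_hours // hours_per_resource


print(always_on_resources_covered())  # → 10
```

Under these assumptions, the allowance covers about 10 always-on resources in each resource group.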
614.74 -> And one last thing.
616.78 -> DevOps Guru has a cost estimator tool.
619.569 -> If you want to know how much
621.27 -> monitoring a CloudFormation
stack will cost you,
624.49 -> go to the tool, select a stack,
626.99 -> and click estimate monthly cost.
630.315 -> Lastly, I want to leave you
632.44 -> with links to a few additional resources.
635.687 -> The developer documentation
includes detailed information
639.64 -> on how to use DevOps Guru and its APIs.
642.6 -> We also regularly publish blog posts
645.6 -> that go into detail on new
features and scenarios.
649.41 -> And since it can be useful
to have hands-on experience,
652.507 -> we have a workshop where
you can get started,
655.53 -> and see DevOps Guru in action.
657.8 -> I hope this episode gave you a good sense
660.2 -> of how DevOps Guru applies AI and ML
663.35 -> to your development
and operations workflow,
666.38 -> to improve your application's availability.
669.04 -> Please join us for upcoming episodes,
671.66 -> where we will dive deeper
into specific topics
674.81 -> related to DevOps Guru.
676.46 -> You will see insights in action,
678.47 -> and how it's applicable
for containers, databases,
681.44 -> and serverless.
682.6 -> Thanks for watching.
Source: https://www.youtube.com/watch?v=2uA8q-8mTZY