Detecting and diagnosing performance issues in Amazon Aurora - AWS Online Tech Talks
As applications scale in size and sophistication, it becomes increasingly challenging for Amazon Aurora customers to detect and resolve relational database performance bottlenecks and other operational issues quickly. Today, customers have access to Amazon DevOps Guru for RDS, a fully managed, ML-powered service that detects operational and performance-related issues for all Amazon Aurora engines and dozens of other resource types. However, determining the exact cause of a relational database performance issue and how to fix it can be complicated and time consuming. In this session, we will look at a new capability for Aurora through which customers can detect and diagnose performance issues and get recommendations on how to fix them. We will also use this time to demo the new functionality and talk about how it works.
Learning Objectives: * Objective 1: Understand the benefits of leveraging Amazon DevOps Guru for RDS. * Objective 2: Learn what metrics DevOps Guru measures and how to interpret them. * Objective 3: See firsthand how to deploy DevOps Guru, with a step-by-step walk-through of how to use the product to pinpoint performance issues.
☁️ AWS Online Tech Talks cover a wide range of topics and expertise levels through technical deep dives, demos, customer examples, and live Q&A with AWS experts. Builders can choose from bite-sized 15-minute sessions, insightful fireside chats, immersive virtual workshops, interactive office hours, or on-demand tech talks to watch at your own pace. Join us to fuel your learning journey with AWS.
#AWS
Content
2.15 -> [Music]
7.6 -> good day everyone thank you for joining
10.24 -> us today for this session on detecting
12.96 -> and diagnosing performance issues in
14.96 -> amazon aurora
16.72 -> as most of you know by now we launched
19.84 -> devops guru for rds at reinvent 2021
24.32 -> so we'll be focusing our talk on that new
27.039 -> capability for today
29.679 -> i would like to first invite maxim to
32.32 -> introduce himself maxim
35.84 -> hello everyone my name is maxim
37.92 -> karchenko and i'm a principal database
40.48 -> engineer who
42 -> is working on devops guru and devops
44.239 -> guru for rds
47.12 -> awesome and my name is shayan sanyal i'm
49.68 -> a principal database specialist
51.039 -> solutions architect and i specialize in
53.52 -> the aurora postgres engine
55.84 -> we are here to talk to you today about
58.559 -> devops guru for rds
60.879 -> and this is a great new capability for
63.52 -> devops guru that brings in-depth machine
66.08 -> learning to help you manage your
67.92 -> database performance as part of the full
70.479 -> application stack so let's get started
75.28 -> today we are going to start out by
77.439 -> setting some context by telling you
79.68 -> about amazon rds in general amazon
82.72 -> aurora and performance insights which
85.439 -> are foundational concepts and building
87.439 -> blocks of devops guru for rds
90.799 -> then we'll take some time to understand
92.88 -> an important metric that we use
95.759 -> called database load
97.759 -> we'll learn how it's derived and why it's
100.24 -> important
101.92 -> then we'll walk through devops guru for
104.32 -> rds including how to enable it how it
107.68 -> helps you
109.04 -> how it works
110.96 -> after that i'm going to hand it over to
113.119 -> my colleague maxim who will take you
115.119 -> through a demo of devops guru for rds
118.32 -> so you can get a sense of how it can
120.479 -> help you detect
122 -> understand and react to performance
124.24 -> concerns that may arise on even the best
127.68 -> tuned databases
129.92 -> we'll then wrap up the session with some
131.92 -> pricing information
136.64 -> okay so first let's level set about
139.28 -> amazon rds and amazon aurora
142.72 -> as many of you already know rds and
145.52 -> aurora are what we call managed database
148.239 -> services
149.68 -> in essence they are automatically
151.68 -> managed databases that take care of the
154 -> mundane day-to-day tasks for you
157.36 -> these are things like provisioning a new
159.68 -> database scaling up and down backup and
162.56 -> recovery high availability and disaster
165.44 -> recovery and the list goes on
168.64 -> basically everything you would be
170.48 -> scripting yourself or doing manually on
173.04 -> a conventional database
174.959 -> we automate all of that in amazon rds
177.68 -> and amazon aurora
181.68 -> now you may ask what is so special about
184.56 -> amazon aurora as compared to amazon rds
188.72 -> well as you might already know aurora is
191.84 -> our cloud native database engine
194.8 -> we designed it to meet the needs of
196.8 -> enterprise customers who are accustomed
198.959 -> to the legacy commercial on-prem
201.44 -> databases
202.879 -> these are customers who need powerful
205.2 -> and full-featured databases but are
207.68 -> getting tired of the legacy databases
210.239 -> punitive licensing their significant
212.879 -> expenses their lack of cloud native
215.2 -> capabilities etc
217.599 -> so amazon aurora is our answer to that
220.319 -> need
221.68 -> at the sql prompt amazon aurora looks
224.48 -> and feels just like postgres or mysql
228.239 -> but behind the scenes it features a
230.4 -> distributed fault tolerant self-healing
233.519 -> storage system that auto scales up to
236.48 -> 128 terabytes per database
240.48 -> each aurora cluster replicates
242.799 -> automatically across three availability
245.04 -> zones while at the same time delivering
247.68 -> high performance and availability with
250.08 -> up to 15 low latency read replicas
253.92 -> so in essence aurora is a cloud native
257.12 -> massively scalable and available
258.959 -> database
260.959 -> it aims to address high-end commercial
264.08 -> and enterprise use cases in the way that
266.72 -> mysql and postgres don't do by
268.72 -> themselves
270.32 -> and aurora aims to do it at a lower cost
272.72 -> to customers than the legacy commercial
274.8 -> databases
278.88 -> now as i've mentioned in rds and aurora
281.6 -> we want to make it easy to set up and
283.52 -> manage databases
285.84 -> well an important part of managing
288.08 -> databases is managing performance
291.28 -> even the best tuned databases need some
294.24 -> kind of performance help at one time or
296.56 -> another
298.16 -> and being able to see and understand
300.08 -> what's happening in your database is key
303.84 -> we know that relational databases have a
305.919 -> reputation for being complex and
308 -> difficult to tune so in 2017 we
312 -> introduced a dashboard called
313.84 -> performance insights
316.479 -> performance insights shows you your
318.88 -> current activity
321.199 -> it shows you your database metrics
324.4 -> and it shows you your top sql commands
327.039 -> all in a simple straightforward console
329.68 -> screen
331.6 -> but the core innovation of performance
333.36 -> insights is actually a single metric
335.68 -> called database load
339.12 -> this metric is the key to simplifying
341.759 -> understanding database performance
345.12 -> it's at the heart of performance
346.88 -> insights and it's at the heart of devops
349.759 -> guru for rds
353.68 -> now we are just going to take a small
355.44 -> detour to explain this db load metric a
358.88 -> little bit more
361.039 -> so what is database load
363.759 -> well
364.56 -> simply stated it is a count of how many
367.44 -> database sessions are active at any
369.919 -> given time
371.759 -> performance insights collects this count
374.08 -> once every second
376.72 -> what do i mean by a session
378.88 -> simply put a session is just a
381.199 -> connection to the database
383.6 -> but we don't just count all sessions
386.56 -> we count all active sessions that is
388.88 -> connections for which the database is in
390.96 -> the middle of fulfilling a request
394.88 -> so that means we are counting sessions
397.199 -> that are not idle but instead active in
400.16 -> a call
402.4 -> now it turns out
404.08 -> especially if you average this count
407.84 -> over 1-minute
408.96 -> 5-minute 15-minute or 1-hour intervals
413.199 -> this gives us a very accurate measure of
415.919 -> how stressed a database is
419.12 -> in performance insights we usually show
421.759 -> this metric in period averages
424.8 -> so because we are averaging the metric
426.639 -> over periods of time the unit of measure
429.36 -> we use for database load is called
431.759 -> average active sessions or aas in short
436.639 -> in essence db load is rds and aurora's
440.8 -> central metric for measuring performance
442.88 -> health
444.56 -> we at amazon have found it to be the
446.88 -> metric that's the most reflective of
449.28 -> service quality
450.56 -> from the customer perspective
455.52 -> just to dive deeper for a minute i want
458 -> to show you exactly how we derive this
460.16 -> metric from your database activity
464.319 -> as we mentioned a session is a
466.319 -> connection
467.84 -> here's a connection connection one and
470.479 -> here is the connection making a request
472.72 -> to the database
475.039 -> down below here we have a 20 second
477.44 -> timeline just for illustration purposes
481.44 -> but databases and database connections
484.08 -> don't usually just make one request
488.08 -> they make many such requests in
490.08 -> succession
492.4 -> and this is just one connection
495.44 -> there may be many such connections from
497.68 -> many applications to any given database
500.639 -> and each connection may fulfill many
502.72 -> different requests and these requests
504.879 -> will likely have a variety of durations
508.639 -> so what performance insights does is it
511.039 -> comes in and it samples the number of
513.599 -> active sessions once a second
517.279 -> so if we take a look at just the first
519.519 -> sample here from the top there is one
522.719 -> two three four five
526.24 -> active sessions at the time of the
527.68 -> sample
529.2 -> similarly if i went ahead and counted
531.68 -> all of these
533.2 -> and just put the sum below
535.279 -> you will see
536.399 -> that what we get is a count of
539.2 -> active sessions at each sample
542.24 -> as you can see this is what produces our
545.04 -> database load metric
547.279 -> it's that simple it's just a simple
549.04 -> count
550.88 -> so that's database load in its most
553.519 -> basic form
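To make the sampling concrete, here is a minimal sketch in Python of the same idea against an Aurora PostgreSQL instance, assuming the psycopg2 driver and placeholder connection details. Performance Insights does this collection for you, so this is purely illustrative, not how the service is implemented.

```python
# Illustrative sketch: approximate Performance Insights' database load
# by sampling the active-session count once a second for a minute,
# then averaging. Connection string is a placeholder.
import time
import psycopg2

conn = psycopg2.connect(
    "host=mycluster.example.com dbname=postgres user=admin password=secret"
)
conn.autocommit = True

samples = []
with conn.cursor() as cur:
    for _ in range(60):
        # Active sessions = connections in the middle of fulfilling a
        # request, excluding this monitoring session itself.
        cur.execute(
            "SELECT count(*) FROM pg_stat_activity "
            "WHERE state = 'active' AND pid <> pg_backend_pid()"
        )
        samples.append(cur.fetchone()[0])
        time.sleep(1)

aas = sum(samples) / len(samples)
print(f"db load over the last minute ~ {aas:.2f} average active sessions")
```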
555.44 -> but now i want to show you something
557.2 -> interesting
559.12 -> during a database request a database
561.6 -> does all kinds of things
564.64 -> it might spend some of this request time
566.56 -> doing reads and writes and it might
569.2 -> spend some of this time waiting on locks
571.2 -> and so on
573.04 -> it seems like this metric and our
575.68 -> sampling offers an opportunity not only
578.08 -> to show
579.2 -> how many requests are being handled at a
581.04 -> time
581.839 -> but also to show you what those requests
584 -> are doing
587.12 -> now because every active database
588.959 -> request is doing something at every
591.04 -> given moment in time
592.88 -> like running on cpu
594.64 -> reading from disk or something else
598 -> now i want to introduce this concept
599.839 -> called wait events
602.959 -> most modern databases are instrumented
605.44 -> with what are called wait events
608.16 -> and these wait events are just names or
610.72 -> labels for what a request is doing at
613.04 -> any given time
615.44 -> and this illustration here is a bit of a
617.839 -> simplified version but each of these
620.32 -> colors represents a different wait
622.32 -> state or a wait event in an imaginary
624.88 -> database
626.56 -> now real databases like postgres mysql
629.6 -> oracle or sql server all have different
632.72 -> names for their dozens of various wait
634.56 -> events
635.68 -> but just for illustration here let's say
638.079 -> there's a database type
640.16 -> with just four wait events cpu
643.36 -> read and write
645.04 -> and
645.839 -> locks
649.12 -> you can see these different waits
651.12 -> reflected in the active session requests
654.48 -> so if we not only count up the active
656.8 -> sessions but also count how many are
660.24 -> there in each wait state
662.88 -> this means that we can build a very nice
665.36 -> visualization
666.8 -> something like this
668.959 -> a visualization that characterizes the
671.279 -> prevalence of various wait events in a
673.36 -> database's active sessions
677.839 -> this type of visualization is at the
680.079 -> heart of performance insights
682.399 -> keep in mind we are still talking about
684.24 -> database load
687.519 -> that key core metric
690 -> can tell you when a database is having
692.48 -> performance issues
695.12 -> and the wait event breakdown can tell
697.12 -> you why it is having performance issues
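As a rough illustration of where such a breakdown can come from, Aurora PostgreSQL exposes each session's current wait event in pg_stat_activity, so one query can slice the instantaneous active-session count by wait state. This is a sketch for intuition, not how Performance Insights is implemented; connection details are placeholders.

```python
# Sketch: slice the current active sessions by wait event, the same
# dimension Performance Insights charts as colored bands of db load.
import psycopg2

conn = psycopg2.connect(
    "host=mycluster.example.com dbname=postgres user=admin password=secret"
)
with conn.cursor() as cur:
    # In pg_stat_activity, a NULL wait_event on an active backend
    # means the session is running on CPU.
    cur.execute(
        "SELECT coalesce(wait_event_type, 'CPU') AS wait_type, "
        "       coalesce(wait_event, 'CPU')      AS wait_event, "
        "       count(*)                          AS sessions "
        "FROM pg_stat_activity "
        "WHERE state = 'active' AND pid <> pg_backend_pid() "
        "GROUP BY 1, 2 ORDER BY sessions DESC"
    )
    for wait_type, wait_event, sessions in cur.fetchall():
        print(f"{wait_type}:{wait_event} -> {sessions} active sessions")
```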
702 -> so now that we have a way of finding
704.16 -> performance issues and also a way of
706.56 -> understanding why we are having them
708.959 -> what's our first instinct
711.36 -> well my first instinct is to automate it
715.04 -> at aws we automate to reduce our
717.68 -> customers operational burden
720.639 -> and we also heard our customers loud and
723.279 -> clear telling us that they didn't want
725.04 -> to have to browse around in performance
726.88 -> insights looking for problems and
729.04 -> analyzing weight events manually
732 -> they wanted specific analysis and
734.16 -> prescriptive advice
736.16 -> so we got to work
738.079 -> and so at reinvent this past year
740.959 -> we launched this brand new capability
743.2 -> called devops guru for rds
746.959 -> now for a long time customers have been
749.36 -> telling us that they like a lot of
751.36 -> things about performance insights and
752.959 -> cloud watch which let them explore all
756.079 -> kinds of issues around database
757.36 -> performance and troubleshooting
760.32 -> but customers have also told us loud and
762.8 -> clear that they would like a little more
764.8 -> help tracking down potential problems
767.839 -> and even more importantly figuring out
769.76 -> what to do about them
772.32 -> this is where devops guru comes in
776.24 -> since its release in may of 2021 devops
780 -> guru has been helping customers by
782.399 -> telling them about unusual and
784.56 -> problematic performance behavior
786.56 -> throughout their application stack
790.32 -> we are now raising the bar for database
792.639 -> diagnostics bringing detailed database
795.04 -> specific capabilities to devops guru
798.72 -> devops guru for rds goes several steps
801.76 -> further than performance insights by
803.76 -> using machine learning to detect and
806.24 -> diagnose performance problems in your
808.32 -> databases
809.6 -> in order to help you fix those problems
811.76 -> very quickly
816.48 -> let's jump right in and take a look at
818.959 -> what this new capability can do
822.399 -> so devops guru for rds uses machine
824.8 -> learning and anomaly detection to look
827.36 -> for unusual and problematic performance
829.519 -> events in your databases
832.399 -> when it finds one it performs analysis
835.04 -> on the anomaly to determine what kinds
837.44 -> of conditions are contributing to it
840.8 -> just some of the conditions that devops
842.959 -> guru can highlight are
845.6 -> database wait states also known as
847.839 -> wait events that are creating
849.279 -> bottlenecks for users
851.76 -> the top sql commands that are being sent
853.839 -> to the database by the application and
856.16 -> that are contributing chiefly to the
858 -> anomaly
859.92 -> any other notable metrics like memory
862.079 -> pressure and io that might also be
864.32 -> contributing
866.16 -> finally devops guru for rds suggests
868.959 -> what you can do next to address the
871.76 -> issue
875.36 -> let's take a moment to highlight a few
877.36 -> key things about this new capability
879.6 -> that we think make it particularly
881.199 -> useful for developers and database folks
885.199 -> first of all devops guru for rds
887.519 -> monitors for anomalies using that
890.16 -> special metric called database load
893.279 -> as we've discussed this metric registers
895.839 -> any kind of database contention
898.56 -> across every possible type of bottleneck
903.36 -> our anomaly detection uses a database's
905.92 -> metrics history to look back in time
908.88 -> and establish a baseline workload
911.04 -> pattern then it zeros in on significant
914.399 -> deviations from that pattern
918.079 -> this provides a much more accurate and
920.56 -> relevant set of findings than say
922.88 -> conventional static monitoring
924.48 -> thresholds which typically can be very
927.04 -> noisy and can overload operators with
929.839 -> alerts and alarms for things that aren't
932.399 -> typically problems
935.519 -> so devops guru for rds analyzes these
937.759 -> anomalies pulls them apart and
940.16 -> identifies causes based on algorithms
942.88 -> developed from years of performance
944.88 -> analysis at amazon across thousands and
947.68 -> thousands of databases
951.519 -> as a customer you clearly benefit
954.8 -> from these explanations of the analysis
956.959 -> results and you're very
959.36 -> quickly able to find the heaviest
960.959 -> hitting bottlenecks on your database
963.04 -> clusters
965.6 -> once you find them devops guru for rds
968 -> provides clear next steps for what you
970.16 -> can do about them
974.959 -> i think we should pause for a minute
976.8 -> just to explain what devops guru is and
979.68 -> how devops guru for rds adds to it
985.04 -> devops guru is a service that's been
986.959 -> available on aws for a while now
989.68 -> it looks across your whole application
991.839 -> stack for problematic behavior and tells
994.56 -> you what's going on
996.8 -> devops guru already includes rds and
999.6 -> aurora in its analysis but up to now
1002.72 -> it hasn't included a database deep dive
1006.88 -> so devops guru for rds adds a few
1009.44 -> special things to raise the bar in
1011.279 -> devops guru
1013.04 -> first of all it fine-tunes anomaly
1015.44 -> detection for aurora to accommodate the
1018.16 -> particular workload patterns of
1019.6 -> relational databases
1021.92 -> then as we have mentioned before it
1024.4 -> shows the customer the results of a
1026.16 -> database specific analysis and
1027.919 -> recommendation process
1032.88 -> so here's a breakdown of how this new
1035.28 -> capability functions
1038.16 -> it all starts with performance insights
1040.959 -> the database tuning metrics and
1042.88 -> telemetry tool for rds and aurora
1046.079 -> devops guru for rds relies on this data
1048.799 -> and only works if you have performance
1050.64 -> insights enabled on your aurora clusters
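If Performance Insights isn't already enabled, it can be turned on per DB instance from the RDS console or programmatically; here is a minimal boto3 sketch, where the instance identifier is a placeholder and 7 days is the free retention tier.

```python
# Sketch: enable Performance Insights on an Aurora instance with boto3.
# The instance identifier below is a placeholder.
import boto3

rds = boto3.client("rds")
rds.modify_db_instance(
    DBInstanceIdentifier="apg-1-instance-1",   # hypothetical instance name
    EnablePerformanceInsights=True,
    PerformanceInsightsRetentionPeriod=7,      # days; 7 is the free tier
    ApplyImmediately=True,
)
```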
1055.52 -> devops guru for rds then ingests these
1058.32 -> numeric metrics only
1060.32 -> note that it will never
1062.4 -> ingest any customer sql commands or any
1064.799 -> other sensitive data
1068.08 -> it particularly monitors the database
1070.4 -> load metric for anomalies comparing
1073.2 -> historical baselines to current activity
1075.52 -> to pinpoint unusual behavior
1078.48 -> one of the optimizations devops guru
1080.32 -> for rds has made for database workloads
1082.64 -> is to account for seasonality
1085.52 -> in many database workloads there are
1087.84 -> regular periodic spikes of activity from
1090.799 -> say batch jobs or etl jobs reporting and
1094.4 -> other scheduled tasks
1097.12 -> devops guru for rds only highlights
1099.36 -> anomalies that are not periodic and that
1102.559 -> occur outside of the regular course of
1104.559 -> normal database activity
1108.24 -> once an anomaly is detected devops guru
1110.96 -> for rds performs analysis on the metrics
1114.16 -> that make up the anomaly
1116 -> making note of the most prevalent wait
1118.24 -> states sql commands and other metrics
1120.88 -> that coincide with the anomaly
1123.44 -> a rule-based algorithm then processes
1125.679 -> these results and generates and stores a
1128.64 -> simple straightforward explanation and
1130.88 -> recommendations
1133.84 -> devops guru for rds sends the basic
1136.4 -> anomaly information including severity
1139.28 -> time frame and resource ids to the main
1141.679 -> devops guru service console
1144.48 -> and then devops guru has the ability to
1146.799 -> also send sns events and integrates any
1149.84 -> database anomalies into the big picture
1152.32 -> of your full application stack
1155.039 -> here in the devops guru console is where
1157.44 -> customers usually first find out about a
1159.84 -> database anomaly
1163.28 -> any time devops guru for rds has
1165.36 -> detected an anomaly in an aurora cluster
1167.84 -> the devops guru console will highlight
1169.76 -> it and direct you to the devops guru
1171.84 -> for rds detail page for that anomaly
1175.679 -> on that detail page devops guru for rds
1178.24 -> displays the analysis results providing
1181.12 -> the context necessary to understand what
1183.36 -> is contributing to the database issue
1186.24 -> and more importantly also the results of
1189.039 -> the analysis that point you to the best
1191.679 -> next steps for resolving the problem
1197.36 -> now that i've had a chance to introduce
1199.2 -> you to the feature
1200.64 -> let's talk a little bit about how to get
1202.72 -> it working
1204.4 -> as i mentioned devops guru for rds is
1207.039 -> part of devops guru
1208.799 -> and that means to use it you have to
1210.96 -> turn devops guru on for your application
1213.919 -> stack or the account
1217.28 -> you can find devops guru in the aws
1219.44 -> console under machine learning
1224.4 -> once you click there if you haven't
1226.159 -> enabled devops guru for your account
1228.24 -> you'll just arrive here
1230.08 -> on this page just click get started
1235.039 -> on the getting started page you can
1236.64 -> learn a little more about devops guru
1239.28 -> and also see the permissions that the
1241.2 -> service needs in order to do its job
1245.12 -> scrolling down you can see and choose
1247.919 -> which resources in your account you want
1250.08 -> devops guru to monitor
1252.08 -> you can choose to enable all the
1253.6 -> resources in your account or if you
1255.84 -> click choose later you can later select
1258.24 -> any resources with a particular tag or a
1261.039 -> particular cloud formation stack
1264 -> remember devops guru monitors your whole
1266.64 -> application
1267.84 -> so enabling by tag or a cloud formation
1270.64 -> stack helps it differentiate one
1273.039 -> application from another
1275.76 -> here you can also set up sns to receive
1278.48 -> alerts when devops guru detects
1280.32 -> something is happening in your
1281.6 -> environment
1283.44 -> once you've made your choices just click
1285.679 -> enable
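The same choices can also be scripted. Below is a minimal boto3 sketch, assuming a hypothetical CloudFormation stack name and SNS topic ARN, that adds resources to DevOps Guru's coverage and registers a notification channel.

```python
# Sketch: enable DevOps Guru for the resources in one CloudFormation
# stack and register an SNS topic for alerts. Names are placeholders.
import boto3

guru = boto3.client("devops-guru")

# Tell DevOps Guru which resources to analyze (here: one CFN stack).
guru.update_resource_collection(
    Action="ADD",
    ResourceCollection={"CloudFormation": {"StackNames": ["inventory-tracker"]}},
)

# Receive notifications when insights are created.
guru.add_notification_channel(
    Config={"Sns": {"TopicArn": "arn:aws:sns:us-east-1:123456789012:devops-guru-alerts"}}
)
```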
1289.12 -> when you first enable devops guru it
1291.84 -> needs some time to establish a baseline
1294.72 -> and begin detecting unusual patterns
1298.24 -> you could start seeing insights in as
1300.24 -> soon as two hours after enabling it
1302.4 -> depending on the size of the account
1304.4 -> and of course whether something
1305.76 -> anomalous is happening
1307.84 -> this dashboard will give you a summary
1310.08 -> across all your monitored resources
1314.799 -> once devops guru begins detecting things
1317.679 -> you will find everything listed in the
1319.6 -> insights page
1321.44 -> the individual insights have names that
1323.679 -> are indicative of one or more anomalous
1326.159 -> metric that devops guru detected
1329.2 -> and if you click into an insight it
1330.96 -> shows you the insight detail page
1334.32 -> with respect to rds any given insight
1337.12 -> in devops guru may or may not have an
1339.6 -> aurora component
1341.2 -> in this example we have a few lambda
1343.28 -> errors and an rds db load anomaly
1349.2 -> on the insight detail page from the top
1351.76 -> you'll find things like insight severity
1355.12 -> start and end times of the insight and
1356.96 -> the status of the anomaly
1359.84 -> in this example you'll notice this is an
1361.76 -> ongoing anomaly so you'll see the status
1363.919 -> as ongoing
1365.52 -> you can also see aggregated metrics that
1368.08 -> devops guru detected
1370.08 -> that were anomalous at the same time
1373.6 -> now this paints a picture of the overall
1376.159 -> issue to help you understand the full
1378.559 -> extent and possible origins of an issue
1381.12 -> in your app stack
1383.44 -> looking at this particular example we
1385.2 -> have some anomalous metrics from rds and
1387.84 -> lambda
1389.679 -> you will also see that there is a
1391.52 -> special button on the rds metric finding
1394.96 -> inviting you to view the detailed
1396.72 -> analysis
1398.24 -> this is your indication that devops guru
1400.32 -> for rds has been triggered and a deep
1402.88 -> analysis of the database component of
1404.799 -> this insight is available
1409.12 -> below the aggregated metrics section
1410.96 -> you'll find some relevant events
1413.039 -> distributed either over a timeline view
1415.28 -> or a tabular view
1417.12 -> things like infrastructure events
1419.679 -> such as auto
1421.52 -> scaling activity etc
1423.44 -> that occurred on your app stack during
1425.12 -> this anomaly and that devops guru used to
1427.36 -> generate these insights
1431.12 -> the primary metric section summarizes
1433.52 -> the db anomaly which is the top level
1436.08 -> anomaly within the insight
1438.4 -> you can think of this anomaly as the
1440.08 -> general problem that is experienced by
1442 -> your database instance
1446.32 -> finally you'll see the analysis and
1448.559 -> recommendations section which describes
1451.039 -> a specific finding that requires
1452.88 -> investigation
1454.72 -> each finding corresponds to a set of
1457.12 -> related metrics
1458.799 -> and also conveniently linked here you
1460.799 -> will find our brand new documentation
1463.679 -> which provides diagnostic guidelines for
1466 -> each of these wait events
1468.64 -> you'll notice for this example here
1470.32 -> there's also a sql id
1472.96 -> and if you hover over the sql id it
1475.44 -> shows you the actual sql statement
1478.32 -> clicking on this sql id will in fact
1480.72 -> take you to a view in performance
1482.32 -> insights dashboard that shows the
1484.72 -> details and history of the sql statement
1486.72 -> itself
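For readers who prefer the API to the console, the same top-SQL slicing is available programmatically from Performance Insights. A sketch using boto3's pi client follows; note the Identifier is the instance's DbiResourceId (a placeholder below), not its instance name.

```python
# Sketch: ask Performance Insights for the top SQL by db load over the
# last hour. Identifier is the instance's DbiResourceId (placeholder).
from datetime import datetime, timedelta
import boto3

pi = boto3.client("pi")
now = datetime.utcnow()
resp = pi.describe_dimension_keys(
    ServiceType="RDS",
    Identifier="db-ABCDEFGHIJKL123456",        # hypothetical DbiResourceId
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Metric="db.load.avg",
    GroupBy={"Group": "db.sql", "Limit": 10},  # top 10 SQL statements
)
for key in resp["Keys"]:
    # Total = that statement's contribution to average active sessions.
    print(key["Total"], key["Dimensions"].get("db.sql.statement"))
```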
1489.679 -> now in addition to the devops guru
1491.679 -> console
1492.799 -> you can also get a glimpse of an ongoing
1494.88 -> anomaly on the rds dashboard
1497.679 -> in this case it's the apg-1 instance 1
1501.279 -> db instance in our aurora postgres
1503.76 -> cluster that is seeing an increased
1506.24 -> database load of 272 times above normal
1510.4 -> that likely impacted our application
1512.159 -> performance
1515.76 -> drilling down further we will see that
1518.4 -> the anomaly is also visible in the
1520.4 -> performance insights dashboard and is
1522.72 -> highlighted as a high severity anomaly
1525.44 -> in the db load graph
1527.6 -> in this dashboard you can also see the
1529.6 -> top sql queries
1531.279 -> in the top sql view below that might be
1533.679 -> contributing to the relevant anomaly
1538.799 -> and with that i would like to invite my
1540.96 -> colleague maxim to take over and walk
1543.84 -> through a demo of the kinds of findings
1546.32 -> and advice you can expect from devops
1548.88 -> guru for rds
1551.919 -> in this demo we are going to look at
1554.72 -> two separate scenarios
1556.559 -> to see how devops guru can help you
1559.919 -> troubleshoot performance issues in your
1562.24 -> applications
1564.72 -> now to set things up
1566.64 -> for these scenarios we'll consider a
1568.64 -> case of an online bookstore
1572 -> this bookstore is selling books
1574.799 -> via its website out of its warehouse
1578.24 -> and as any online business it runs a
1581.2 -> number of applications to to support its
1584.48 -> operations
1586.88 -> and of course
1588.24 -> whenever there is a production software
1590.48 -> environment
1591.6 -> there is also a
1593.279 -> support engineer
1595.76 -> who is tasked to monitor that environment
1598.799 -> react to problems and address the
1601.44 -> various issues
1604.799 -> one day that on call engineer receives a
1607.919 -> notification from devops guru
1610.32 -> that tells her that
1612.559 -> a major performance anomaly
1614.96 -> is happening with one of their
1616.799 -> applications
1619.279 -> so let's
1621.12 -> take a look at this case and see how
1623.279 -> devops guru can help troubleshoot this
1625.919 -> particular issue
1629.84 -> so the first screen that
1632.799 -> we are going to see is the overall
1635.36 -> application dashboard
1637.44 -> so in this case we can see that there is
1639.279 -> a single unhealthy application
1641.679 -> called inventory tracker
1644.24 -> and
1645.039 -> we actually have several different
1646.48 -> applications but
1648.08 -> most of them are
1649.84 -> healthy but
1651.36 -> one of them is experiencing
1653.919 -> an ongoing performance issue
1656.88 -> now if you're wondering about how to
1658.799 -> define these applications
1661.12 -> there are several ways to do it you
1663.84 -> can either use cloud formation stacks
1666.32 -> that tie all of the resources together
1668.88 -> in a single application
1671.12 -> or you can use special tags that you
1674.08 -> can assign to
1676.08 -> different groups of resources to mark
1678.64 -> what
1680.08 -> represents a particular application
1683.919 -> but going back there is currently an
1686.32 -> ongoing issue with the application
1688.159 -> called inventory tracker
1690.799 -> so let's take a look at
1692.48 -> exactly what is going on there
1698.96 -> so here we can see that there is a
1700.48 -> single ongoing anomaly
1703.039 -> with
1704 -> high severity that started at
1706.64 -> 12:19
1708.84 -> utc if i click on the
1711.52 -> link for that anomaly
1713.76 -> we're going to see the main devops guru
1716.799 -> finding
1718.64 -> and i'm just going to briefly comment
1721.36 -> on what exactly you're seeing and how
1724.399 -> exactly devops guru
1726.72 -> identified the reason for the ongoing
1728.96 -> performance issue
1731.76 -> well as you can see here there is a
1734.32 -> number of metrics
1736.159 -> that
1737.2 -> devops guru found
1739.679 -> to be experiencing anomalous behavior
1744 -> now here you can see that there are five
1745.919 -> different metrics that are
1749.44 -> spread across two different types of
1752 -> resources so there is an rds database
1755.36 -> and there is also a lambda function
1758.399 -> and
1760.08 -> devops guru shows you the exact timeline
1763.12 -> of when the anomaly was detected
1765.679 -> right over here separately for each metric
769.919 -> and you can also see the graphed anomalies
1774.48 -> and these graphs would give you a more
1776.48 -> precise picture of what exactly the
779.279 -> problematic behavior of these metrics
781.52 -> looks like
1784.48 -> but going back to the main screen
1788.48 -> here looking at these metrics
791.36 -> we can confirm that our application
794.799 -> which in this case is represented by a
796.88 -> lambda function
1798.799 -> is indeed having some
1801.2 -> serious issues
1804 -> so we can take a look at
1805.919 -> its metrics in detail
1809.52 -> and see that
1812.48 -> the duration of this function
1815.6 -> or
1816.559 -> estimated time to run for this function
1819.2 -> has increased dramatically
1821.76 -> and also this function
1823.679 -> is now throwing
1825.36 -> quite a few errors so it's definitely in
1828.159 -> bad shape
1831.36 -> now of course
1833.679 -> this lambda function is not the only
1835.84 -> part of the application
1838.48 -> in the modern world applications are
1840.64 -> typically built
1842 -> in a distributed fashion where
1844.72 -> different pieces of business logic are
1847.679 -> assigned across multiple different
1850.72 -> tiers
1852.08 -> and
1852.96 -> when you investigate performance issues
1856.24 -> the first question that people typically
1858.64 -> ask is
1859.76 -> well
1860.64 -> which tiers in my application stack are
1863.6 -> actually responsible for the problem
1867.76 -> and here devops guru
1870 -> has given us the first important clue
1874 -> because
1876.24 -> in this picture we not only see that the
1878.96 -> lambda function is having an issue
1881.919 -> but also that the
1884 -> database instance
1886.24 -> that is a part of
1888.399 -> this application stack is having a
1890.32 -> correlated issue so a performance issue
1893.44 -> that happens at the same time
1897.519 -> because we know that our lambda function
1900.88 -> issues frequent calls to the backend
1903.279 -> database
1905.2 -> this is an important observation
1907.76 -> so
1908.48 -> here we actually can make a connection
1911.76 -> that
1914.399 -> these two problems are likely related
1917.36 -> and in fact if you want to find out the
1920.08 -> real root cause of this problem
1922.559 -> you probably have to look at the
1923.84 -> database
1926.399 -> just because it's downstream
1929.919 -> well
1930.88 -> the good news is
1932.96 -> devops guru has a dedicated component
1936.399 -> that can do a deep dive on the database
1939.039 -> level
1940 -> run analysis on what's going on with
1943.12 -> the database performance and report
1945.519 -> results back
1948.08 -> in this case
1950.399 -> this analysis has already been run and
1952.48 -> completed and
1954.559 -> its findings were linked to
1957.76 -> the main devops guru finding
1961.12 -> and we can access them here
1964.72 -> now let's click on view analysis button
1967.2 -> and see
1968.32 -> what exactly are the issues with the
1970.72 -> database
1977.679 -> here is the database-centric view of
1980.399 -> this particular performance issue
1983.44 -> and there are several pieces of
1984.96 -> important information here
1988.32 -> first of all we can see that
1991.039 -> there is a pretty significant spike of
1994.72 -> database load
1996.559 -> measured in average active sessions so
1999.2 -> something my colleague has mentioned
2000.96 -> before
2003.2 -> and to put it very simply
2006.08 -> this spike means that
2008.799 -> there are a lot of database users or
2011.76 -> sessions
2013.2 -> who are currently executing database
2015.279 -> calls
2016.64 -> and are waiting for the database to
2018.32 -> respond back
2020.64 -> and these waits are quite long at
2023.919 -> this moment which is why we have
2027.279 -> this performance issue and which is why
2029.679 -> the database at the moment is considered
2032 -> to be unhealthy
2036.32 -> if we look at the spike we can notice
2038.72 -> that the spike is
2041.76 -> large in both the absolute values right
2044.88 -> so we have the
2047.32 -> 1400 average active sessions at the top
2050.72 -> of the spike
2052.159 -> and the normal
2055.44 -> number of average active sessions for
2057.52 -> this database should be less than 2
2061.28 -> which is represented by the number of
2062.879 -> vcpus here
2066 -> but also the spike is very unusual
2069.76 -> as we can see here the typical database
2072.56 -> workload is around zero average active
2076.24 -> sessions
2077.359 -> in other words this
2079.2 -> database is typically
2081.04 -> very idle
2082.96 -> but in this case we can see that the
2085.119 -> spike is actually
2086.639 -> more than a thousand times
2088.8 -> higher than the typical database
2092 -> workload activity
2093.76 -> so this makes it a very unusual spike
2098.56 -> well in any case
2101.2 -> the large and unusual spike represents a
2104.96 -> very acute
2106.48 -> and severe performance issue that we are
2108.88 -> observing on the database level
2113.04 -> but why exactly are we seeing this
2114.56 -> problem
2115.52 -> what's going on with the database
2119.04 -> and here is where the analysis part of
2121.52 -> devops guru for rds comes in
2124.64 -> so we can look at the analysis section
2128.839 -> and as a big picture
2133.68 -> essentially what we're seeing here is
2135.599 -> that the database is
2138.8 -> experiencing heavy
2140.8 -> database locking conditions
2144.72 -> now
2145.92 -> just to step back a little
2148.16 -> locking conditions are one of those
2150 -> things that are very hard to detect
2153.04 -> using
2154.4 -> quote-unquote traditional monitoring
2156.8 -> approaches
2158.88 -> for example many organizations rely on
2162.64 -> resource-based monitoring such as
2165.599 -> cpu utilization for example
2168.4 -> and
2170.64 -> just to give you a sense of
2173.2 -> why cpu utilization is not the best
2175.52 -> metric to detect this particular type of
2177.52 -> problem
2178.72 -> we are going to take a quick look at cpu
2180.96 -> utilization that
2182.88 -> is measured on this database
2191.359 -> so here
2193.599 -> we see that the way that cpu utilization
2196.16 -> has been unfolding
2198 -> over the last three hours
2200.4 -> and there is definitely an increase
2202.96 -> which is correlated to the type of the
2204.64 -> problem
2205.92 -> but
2206.88 -> even at the peak as you can see the
2209.119 -> maximum cpu utilization
2211.52 -> is just 35 percent and this number is
2216.24 -> usually too low to trigger any
2218.48 -> actionable alarms out of that
2221.2 -> in other words
2222.48 -> if you watch cpu utilization you'll
2224.56 -> probably miss that particular issue
2227.119 -> and that's why relying on database load
2229.68 -> metric is better
2232.32 -> and that's why we've chosen the database
2234.56 -> load metric as our main metric
2239.119 -> but going back to our case
2243.52 -> so
2244.72 -> the database
2246 -> has
2246.8 -> been experiencing locking conditions
2249.599 -> but what kind of locking
2251.92 -> well the devops guru for rds has
2254.48 -> determined that as well
2256.8 -> so there is a particular type of
2258.72 -> database locking called lock:tuple
2262.079 -> that is currently occurring in our
2265.04 -> database
2266.48 -> and we know that it's lock:tuple
2268.72 -> because it's the wait event
2271.52 -> once again
2273.359 -> that my colleague has mentioned before
2276 -> that is dominating this particular spike
2278.8 -> of database load
2282 -> and
2283.52 -> lock:tuple sounds a little bit
2286.079 -> cryptic
2287.599 -> and we of course recognize that fact and
2290.64 -> that's why in devops guru you can click on
2293.28 -> the wait event name
2295.839 -> and see the simplified description of
2298.4 -> what is really going on with the
2299.839 -> database
2301.839 -> and it looks like in this case
2304.96 -> we have a
2306.24 -> situation where there are multiple
2308.8 -> concurrent
2310.16 -> database sessions that are competing for
2312.4 -> the same exact database records
2316.48 -> so that's easy to understand
2319.76 -> now we can also see the
2322.4 -> actual sql
2324.24 -> that is mainly participating in that
2327.76 -> competition so in other words this
2330 -> sql is a likely culprit for the
2332.079 -> problem that we're seeing
2336.88 -> okay so
2338.88 -> now we understand the problem in a
2340.8 -> little more detail
2342.96 -> but what can we do to actually solve it
2346.48 -> how can we
2348.16 -> address the problem and
2350.56 -> make sure that we no longer see this
2353.92 -> database locking issue
2357.28 -> and that's where troubleshooting
2359.44 -> documentation comes in
2362.4 -> so here at rds
2365.119 -> we've looked at the
2367.2 -> typical wait events that are occurring
2369.68 -> in our fleet
2371.28 -> and we've created detailed
2373.119 -> troubleshooting guides for all of these
2375.599 -> events
2377.44 -> and these guides are part of regular
2380 -> aurora documentation now and the way
2382.64 -> that they are structured
2384.48 -> is that there is usually
2387.28 -> a section that describes the likely
2389.68 -> causes of increased waits
2393.28 -> as well as the action items that you can
2395.52 -> take to troubleshoot and address them
2398.88 -> and these action items frequently
2400.72 -> include the actual code references
2404.319 -> or sql queries that you can just
2407.359 -> copy and paste and run
2410.8 -> so for example in this case we can refer
2413.599 -> to one of the documents
2416.4 -> that describes how to troubleshoot
2419.76 -> locking scenarios
2423.68 -> we can copy
2426.4 -> the recommended
2429.839 -> sql that is provided in these documents
2433.359 -> and we are going to
2436.24 -> query the blocking tree
2439.119 -> to see
2440.319 -> what is the root cause of this
2442 -> particular issue
2451.68 -> so in this case you can notice that
2454.319 -> there is a single session
2456.4 -> that is holding
2458.24 -> the entire tree
2460.4 -> so we know where the session is coming
2462.56 -> from
2463.92 -> we know which statement this session is
2465.76 -> running
2467.04 -> and we also know which particular
2469.04 -> database record the session is competing
2472.319 -> on
2473.04 -> and everybody else is waiting for
2475.2 -> this session to unblock this record
2478.56 -> so essentially if we just kill that one
2480.56 -> particular session
2482.4 -> that would address this particular
2484.8 -> performance problem
2487.2 -> so it's as simple as that
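The query in the troubleshooting guide is more elaborate, but a minimal version of the same blocking-tree idea can be assembled from standard PostgreSQL functions; the sketch below is for illustration and is not copied from the guide. Connection details and the terminated pid are placeholders.

```python
# Sketch: list sessions waiting on a lock and who blocks them, then end
# the session at the root of the blocking tree. Standard PostgreSQL
# functions; connection string and pid are placeholders.
import psycopg2

conn = psycopg2.connect(
    "host=mycluster.example.com dbname=postgres user=admin password=secret"
)
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute(
        "SELECT pid, pg_blocking_pids(pid) AS blocked_by, "
        "       left(query, 60)            AS query "
        "FROM pg_stat_activity "
        "WHERE cardinality(pg_blocking_pids(pid)) > 0"
    )
    for pid, blocked_by, query in cur.fetchall():
        print(f"pid {pid} waits on {blocked_by}: {query}")

    # Ending the root blocker releases everyone queued behind it
    # (pid 12345 is a placeholder).
    cur.execute("SELECT pg_terminate_backend(%s)", (12345,))
```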
2492.8 -> all right
2494.48 -> so going back
2496.64 -> just to summarize what we've seen in
2498.56 -> this example
2500.4 -> so devops guru for rds
2503.359 -> first recognized a performance issue that
2506.24 -> is pretty hard to detect
2510.319 -> it also established an important
2512.4 -> connection between the application issue and
2515.119 -> the database issue
2517.359 -> and directed us towards
2519.52 -> analyzing database performance because
2521.839 -> it was the most likely culprit
2525.359 -> on the database level it discovered the
2528.16 -> nature of the database issue that was
2530.16 -> going on
2531.359 -> and also explained why this issue
2534.16 -> was happening
2535.52 -> why did we see those locks and what
2537.28 -> types of locks we saw
2540.88 -> it highlighted the sql that was
2542.96 -> likely the main contributor
2545.44 -> to the problem
2548.16 -> and finally it provided detailed
2550.64 -> troubleshooting
2552.24 -> guidelines to
2554.8 -> address this problem at its
2557.359 -> root cause
2560.56 -> so these issues are happening
2564.8 -> in the real world quite frequently
2567.92 -> and
2569.04 -> they usually take some time to
2571.2 -> resolve
2572.56 -> but hopefully with the help of devops
2574.24 -> guru the time to resolution can be a
2576.72 -> little bit faster
2581.92 -> all right so i'm going to switch gears
2584.8 -> and look at one more scenario
2588.72 -> so the nice thing about devops guru is
2592.24 -> that
2593.04 -> you don't necessarily need to be on the
2595.04 -> system when a performance incident is
2597.92 -> going on
2600.24 -> devops guru
2601.839 -> captures and records all of the
2603.92 -> performance incidents and it maintains a
2606.72 -> historical record of them
2609.28 -> so essentially you can always go back to
2611.44 -> a past incident
2613.119 -> to investigate an issue that
2614.96 -> happened
2616.16 -> some time ago
2617.68 -> maybe because
2619.04 -> of a customer request
2621.119 -> to see what was going on
2623.04 -> yesterday at 1 pm
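That lookback can also be scripted: the DevOps Guru API lists past insights. A minimal boto3 sketch with an arbitrary seven-day window:

```python
# Sketch: list the reactive insights DevOps Guru recorded over the
# last 7 days, so past incidents can be reviewed after the fact.
from datetime import datetime, timedelta
import boto3

guru = boto3.client("devops-guru")
resp = guru.list_insights(
    StatusFilter={
        "Any": {
            "Type": "REACTIVE",
            "StartTimeRange": {
                "FromTime": datetime.utcnow() - timedelta(days=7),
                "ToTime": datetime.utcnow(),
            },
        }
    }
)
for insight in resp["ReactiveInsights"]:
    print(insight["Severity"], insight["Name"],
          insight["InsightTimeRange"]["StartTime"])
```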
2626 -> so let's take a look at one of those
2627.68 -> incidents
2629.599 -> that happened in the past
2634.96 -> so here you can see that there is a
2638.4 -> history of
2641.92 -> different anomalies that happened with
2643.839 -> different applications
2646.4 -> so let's look at one of the particular
2648.72 -> anomalies that
2650.56 -> happened a couple of days ago
2652.72 -> with the shipping application
2658.96 -> now when we look at the main anomaly
2660.72 -> screen
2662.8 -> we can see a very similar picture to the
2665.28 -> first example
2666.96 -> so there is a number of metrics and
2670.88 -> devops guru detected anomalies in those
2673.119 -> metrics
2674.16 -> and we can see the exact timeline of
2675.92 -> anomalies
2677.359 -> and also the graphed representation of
2679.839 -> anomalies
2682.96 -> and also just like in the first case
2686.88 -> devops guru combined anomalies from two
2689.2 -> different resources
2690.96 -> the lambda function or
2693.119 -> quote-unquote the front end of our
2694.64 -> application
2696 -> and the
2697.04 -> database or quote-unquote the back end of
2700.48 -> our application
2702.319 -> and because we see those two resources
2705.2 -> in the same anomaly and because we
2708.16 -> know that our lambda function actually
2710.24 -> frequently queries that particular
2712.48 -> database
2713.839 -> we can
2714.96 -> reasonably infer that once again the
2717.839 -> database is likely the
2720 -> root cause of that issue and if we want
2722.64 -> to find out what's going on we need to
2724.64 -> look at the database
2727.839 -> now there is one thing that is
2729.44 -> interesting here
2731.68 -> we
2732.4 -> actually have detected anomalies on
2735.52 -> more metrics
2737.52 -> and one of those metrics is called
2740 -> disk queue depth
2742.24 -> it is happening on the database resource
2746.48 -> and
2748.319 -> because we see an anomaly on this metric
2751.28 -> we can reasonably conclude that the
2753.52 -> actual
2754.72 -> problem is likely related to io
2758.56 -> but let's confirm that
2761.119 -> so if we click on the view analysis
2762.96 -> button
2764.88 -> we can see the
2766.319 -> database centric view of this
2767.839 -> performance issue
2770.56 -> and if we look at the main graph
2773.2 -> we can confirm
2774.48 -> that
2776.88 -> this particular performance issue
2779.359 -> is indeed related to io because the
2783.28 -> most dominating wait event that we see
2785.52 -> is io:xactsync which is io related
2791.68 -> so if we look at the analysis
2794.079 -> we once again can pinpoint the
2798.079 -> actual sql
2799.599 -> that
2800.839 -> experienced those io issues
2804.72 -> and also the simplified description of
2808.8 -> the problem
2810.24 -> what exactly is happening on the io
2813.04 -> and it looks like in this case what is
2814.8 -> happening is that there is
2817.92 -> something related to commits
2820.48 -> perhaps the increased number of commits
2824.96 -> we can corroborate that theory by
2827.04 -> looking through the troubleshooting
2828.56 -> documentation
2830.16 -> if we open the link
2832.72 -> and read the likely causes of increased
2835.52 -> waits and
2839.2 -> suggested action items
2841.839 -> one of the action items is actually to
2844 -> reduce the number of commits and this is
2847.599 -> probably the most typical case for this
2849.68 -> particular type of problem
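To make that action item concrete, here is a hedged before-and-after sketch of commit batching; the table, the data, and the connection string are hypothetical, and in practice you would use one approach or the other, not both.

```python
# Sketch: cut IO:XactSync waits by committing once per batch instead of
# once per row. Table, rows, and connection string are placeholders.
import psycopg2

conn = psycopg2.connect(
    "host=mycluster.example.com dbname=postgres user=admin password=secret"
)
rows = [(i, f"item-{i}") for i in range(10000)]

# Anti-pattern: autocommit makes every INSERT its own transaction, so
# each row forces a transaction-log sync (an IO:XactSync wait).
conn.autocommit = True
with conn.cursor() as cur:
    for r in rows:
        cur.execute("INSERT INTO inventory VALUES (%s, %s)", r)

# Better: one transaction per 1,000 rows -> roughly 10 log syncs
# instead of 10,000.
conn.autocommit = False
with conn.cursor() as cur:
    for i, r in enumerate(rows, start=1):
        cur.execute("INSERT INTO inventory VALUES (%s, %s)", r)
        if i % 1000 == 0:
            conn.commit()
conn.commit()  # commit any remaining rows
```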
2853.68 -> now there is one thing that i haven't
2855.68 -> mentioned yet
2857.76 -> a very nice feature that devops guru
2861.359 -> provides
2862.8 -> and this is the timeline of events that
2865.599 -> it captures
2867.68 -> now
2869.2 -> whenever you investigate a performance
2871.119 -> issue it's really important to see
2875.119 -> whether
2876.24 -> there was something
2877.92 -> in the environment a configuration
2881.68 -> or perhaps a code
2883.44 -> that changed very recently
2886.559 -> in other words if you see for example
2888.559 -> that a particular database parameter
2891.839 -> or particular piece of code was changed
2894.64 -> right prior to the time when the anomaly
2897.28 -> has actually started
2899.28 -> it's very likely that
2902.24 -> there is a causal relationship between
2904.24 -> them
2905.28 -> in other words you change the database
2907.119 -> parameter and that
2909.52 -> probably triggered the anomaly
2912.64 -> now let's look at the timeline
2915.119 -> for this particular incident
2926.24 -> and here we can see
2929.359 -> that right prior to the anomaly start
2932.88 -> there was actually a change in code
2936.88 -> which
2937.76 -> in our case is probably important
2940.559 -> so
2941.52 -> the action item here would be to look
2943.599 -> at that piece of code
2945.44 -> and
2946.72 -> find out why exactly
2950.8 -> it
2952.079 -> made the application issue
2954.319 -> more commits
2956.88 -> and that's something that that we can do
2962.72 -> i hope that these two examples
2966.24 -> showed you that
2968.8 -> devops guru and devops guru for rds
2971.599 -> are useful tools that
2974.8 -> can help you
2976.16 -> troubleshoot performance issues better
2978.48 -> and faster
2980.8 -> and as a side note
2982.8 -> i'd like to say that we are constantly
2985.04 -> working on improving these tools
2987.68 -> and
2989.599 -> we want to make them even more useful
2992.16 -> in the future so
2994.319 -> please
2995.2 -> keep watching the space
2997.28 -> thank you
3000.4 -> thanks maxim those are some great
3002.88 -> examples of the kinds of insights devops
3005.44 -> guru for rds can provide
3008.64 -> now let's talk a little bit about
3010.319 -> pricing
3012.48 -> so devops guru for rds is offered at no
3015.76 -> additional charge as part of the
3017.44 -> existing price that devops guru charges
3020.079 -> you for analyzing amazon rds resources
3024.079 -> devops guru segments the resource types
3026.48 -> it evaluates into two groups
3028.8 -> group a which consists of lambda and s3
3032 -> and group b which consists of amazon rds
3035.2 -> ec2 redshift clusters and 25 other aws
3038.72 -> resource types
3040.8 -> you will also notice that group a
3043.28 -> and group b both are priced very
3045.76 -> economically where group a
3048.4 -> actually equates to approximately two
3050.319 -> dollars per resource per month
3052.559 -> and group b equates to approximately
3054.8 -> three dollars per resource per month
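As a rough sanity check on those monthly figures, and assuming the launch-era list rates of about $0.0028 per resource-hour for group A and $0.0042 for group B (the talk doesn't quote hourly rates, so treat these as an assumption):

```python
# Rough arithmetic behind the quoted monthly figures, assuming the
# launch-era rates of $0.0028 (group A) and $0.0042 (group B) per
# resource-hour and a 730-hour month. Rates are an assumption, not
# quoted in the talk.
hours_per_month = 730
print(f"group A: ${0.0028 * hours_per_month:.2f} per resource per month")  # ~$2.04
print(f"group B: ${0.0042 * hours_per_month:.2f} per resource per month")  # ~$3.07
```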
3058.24 -> if you choose to you can also
3060.72 -> opt to use tags to control costs by
3064 -> enabling devops guru only for your
3065.92 -> aurora resources
3068.64 -> for more information about using aws
3070.88 -> tags with devops guru
3072.8 -> please visit the topic working with
3074.72 -> resource tags in our documentation page
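For reference, a minimal boto3 sketch of that tag-based enablement; the tag key and values are hypothetical, and the key must begin with the devops-guru- prefix.

```python
# Sketch: limit DevOps Guru coverage to resources carrying a specific
# tag, e.g. only the Aurora cluster. Key and value are placeholders;
# the key must begin with the "devops-guru-" prefix.
import boto3

guru = boto3.client("devops-guru")
guru.update_resource_collection(
    Action="ADD",
    ResourceCollection={
        "Tags": [
            {"AppBoundaryKey": "devops-guru-apps", "TagValues": ["aurora-prod"]}
        ]
    },
)
```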
3080.72 -> so to summarize
3082.8 -> today we didn't only want to tell you
3084.88 -> about this great new capability called
3087.119 -> devops guru for rds
3089.119 -> but we also wanted to provide you with
3091.28 -> some foundational knowledge that you
3093.2 -> could use and that could help you
3096 -> understand why
3097.599 -> it matters to find and resolve these
3099.599 -> performance disruptions
3102.24 -> amazon rds and amazon aurora are
3104.559 -> designed to take care of the complicated
3106.48 -> parts of running a database
3108.88 -> but performance issues are a complicated
3111.04 -> part
3111.92 -> and until recently we only had
3114.319 -> performance insights to help you shine a
3117.359 -> ray of light onto that topic
3120.319 -> but now we have devops guru for rds
3124.24 -> using machine learning and all of our
3126.16 -> experience managing a fleet of amazon
3128.4 -> databases we are going one step further
3131.76 -> towards making performance management an
3134.079 -> automated part of our managed database
3136.72 -> service
3140.88 -> we would encourage you to leverage these
3143.119 -> helpful resources which will help you to