Detecting and diagnosing performance issues in Amazon Aurora - AWS Online Tech Talks
As applications scale in size and sophistication, it becomes increasingly challenging for Amazon Aurora customers to detect and resolve relational database performance bottlenecks and other operational issues quickly. Today, customers have access to Amazon DevOps Guru for RDS, a fully managed, ML-powered service that detects operational and performance-related issues for all Amazon Aurora engines and dozens of other resource types. However, determining the exact cause of a relational database performance issue and how to fix it can be complicated and time consuming. In this session, we will look at a new capability for Aurora through which customers can detect and diagnose performance issues and get recommendations on how to fix them. We will also use this time to demo the new functionality and talk about how it works.
Learning Objectives: * Objective 1: Understand the benefits of leveraging Amazon DevOps Guru for RDS. * Objective 2: Learn what metrics DevOps Guru measures and how to interpret them. * Objective 3: See firsthand how to deploy DevOps Guru, with a step-by-step walk-through of how to use the product to pinpoint performance issues.
☁️ AWS Online Tech Talks cover a wide range of topics and expertise levels through technical deep dives, demos, customer examples, and live Q&A with AWS experts. Builders can choose from bite-sized 15-minute sessions, insightful fireside chats, immersive virtual workshops, interactive office hours, or on-demand tech talks to watch at your own pace. Join us to fuel your learning journey with AWS.
#AWS
Content
2.15 -> [Music]
7.6 -> good day everyone thank you for joining
10.24 -> us today for this session on detecting
12.96 -> and diagnosing performance issues in
14.96 -> amazon aurora
16.72 -> as most of you know by now we launched
19.84 -> devops guru for rds at reinvent 2021
24.32 -> so we'll be focusing our talk on that new
27.039 -> capability for today
29.679 -> i would like to first invite maxim to
32.32 -> introduce himself maxim
35.84 -> hello everyone my name is maxim
37.92 -> karchenko and i'm a principal database
40.48 -> engineer who
42 -> is working on devops guru and devops
44.239 -> guru for rds
47.12 -> awesome and my name is shayan sanyal i'm
49.68 -> a principal database specialist
51.039 -> solutions architect and i specialize in
53.52 -> the aurora postgres engine
55.84 -> we are here to talk to you today about
58.559 -> devops guru for rds
60.879 -> and this is a great new capability for
63.52 -> devops guru that brings in-depth machine
66.08 -> learning to help you manage your
67.92 -> database performance as part of the full
70.479 -> application stack so let's get started
75.28 -> today we are going to start out by
77.439 -> setting some context by telling you
79.68 -> about amazon rds in general amazon
82.72 -> aurora and performance insights which
85.439 -> are foundational concepts and building
87.439 -> blocks of devops guru for rds
90.799 -> then we'll take some time to understand
92.88 -> an important metric that we use
95.759 -> called database load
97.759 -> we'll learn how it's derived and why it's
100.24 -> important
101.92 -> then we'll walk through devops guru for
104.32 -> rds including how to enable it how it
107.68 -> helps you
109.04 -> how it works
110.96 -> after that i'm going to hand it over to
113.119 -> my colleague maxim who will take you
115.119 -> through a demo of devops guru for rds
118.32 -> so you can get a sense of how it can
120.479 -> help you detect
122 -> understand and react to performance
124.24 -> concerns that may arise on even the best
127.68 -> tuned databases
129.92 -> we'll then wrap up the session with some
131.92 -> pricing information
136.64 -> okay so first let's level set about
139.28 -> amazon rds and amazon aurora
142.72 -> as many of you already know rds and
145.52 -> aurora are what we call managed database
148.239 -> services
149.68 -> in essence they are automatically
151.68 -> managed databases that take care of the
154 -> mundane day-to-day tasks for you
157.36 -> these are things like provisioning a new
159.68 -> database scaling up and down backup and
162.56 -> recovery high availability and disaster
165.44 -> recovery and the list goes on
168.64 -> basically everything you would be
170.48 -> scripting yourself or doing manually on
173.04 -> a conventional database
174.959 -> we automate all of that in amazon rds
177.68 -> and amazon aurora
181.68 -> now you may ask what is so special about
184.56 -> amazon aurora as compared to amazon rds
188.72 -> well as you might already know aurora is
191.84 -> our cloud native database engine
194.8 -> we designed it to meet the needs of
196.8 -> enterprise customers who are accustomed
198.959 -> to the legacy commercial on-prem
201.44 -> databases
202.879 -> these are customers who need powerful
205.2 -> and full-featured databases but are
207.68 -> getting tired of the legacy databases
210.239 -> punitive licensing their significant
212.879 -> expenses their lack of cloud native
215.2 -> capabilities etc
217.599 -> so amazon aurora is our answer to that
220.319 -> need
221.68 -> at the sql prompt amazon aurora looks
224.48 -> and feels just like postgres or mysql
228.239 -> but behind the scenes it features a
230.4 -> distributed fault tolerant self-healing
233.519 -> storage system that auto scales up to
236.48 -> 128 terabytes per database
240.48 -> each aurora cluster replicates
242.799 -> automatically across three availability
245.04 -> zones while at the same time delivering
247.68 -> high performance and availability with
250.08 -> up to 15 low latency read replicas
253.92 -> so in essence aurora is a cloud native
257.12 -> massively scalable and available
258.959 -> database
260.959 -> it aims to address high-end commercial
264.08 -> and enterprise use cases in the way that
266.72 -> mysql and postgres don't do by
268.72 -> themselves
270.32 -> and aurora aims to do it at a lower cost
272.72 -> to customers than the legacy commercial
274.8 -> databases
278.88 -> now as i've mentioned in rds and aurora
281.6 -> we want to make it easy to set up and
283.52 -> manage databases
285.84 -> well an important part of managing
288.08 -> databases is managing performance
291.28 -> even the best tuned databases need some
294.24 -> kind of performance help at one time or
296.56 -> another
298.16 -> and being able to see and understand
300.08 -> what's happening in your database is key
303.84 -> we know that relational databases have a
305.919 -> reputation for being complex and
308 -> difficult to tune so in 2017 we
312 -> introduced a dashboard called
313.84 -> performance insights
316.479 -> performance insights shows you your
318.88 -> current activity
321.199 -> it shows you your database metrics
324.4 -> and it shows you your top sql commands
327.039 -> all in a simple straightforward console
329.68 -> screen
331.6 -> but the core innovation of performance
333.36 -> insights is actually a single metric
335.68 -> called database load
339.12 -> this metric is the key to simplifying
341.759 -> understanding database performance
345.12 -> it's at the heart of performance
346.88 -> insights and it's at the heart of devops
349.759 -> guru for rds
353.68 -> now we are just going to take a small
355.44 -> detour to explain this db load metric a
358.88 -> little bit more
361.039 -> so what is database load
363.759 -> well
364.56 -> simply stated it is a count of how many
367.44 -> database sessions are active at any
369.919 -> given time
371.759 -> performance insights collects this count
374.08 -> once every second
376.72 -> what do i mean by a session
378.88 -> simply put a session is just a
381.199 -> connection to the database
383.6 -> but we don't just count all sessions
386.56 -> we count all active sessions that is
388.88 -> connections for which the database is in
390.96 -> the middle of fulfilling a request
394.88 -> so that means we are counting sessions
397.199 -> that are not idle but instead active in
400.16 -> a call
402.4 -> now it turns out
404.08 -> especially if you average this count
407.84 -> over 1-minute
408.96 -> 5-minute 15-minute or 1-hour intervals
413.199 -> this gives us a very accurate measure of
415.919 -> how stressed a database is
419.12 -> in performance insights we usually show
421.759 -> this metric in period averages
424.8 -> so because we are averaging the metric
426.639 -> over periods of time the unit of measure
429.36 -> we use for database load is called
431.759 -> average active sessions or aas in short
436.639 -> in essence db load is rds and aurora's
440.8 -> central metric for measuring performance
442.88 -> health
444.56 -> we at amazon have found it to be the
446.88 -> metric that's the most reflective of
449.28 -> service quality
450.56 -> from the customer perspective
455.52 -> just to dive deeper for a minute i want
458 -> to show you exactly how we derive this
460.16 -> metric from your database activity
464.319 -> as we mentioned a session is a
466.319 -> connection
467.84 -> here's a connection connection one and
470.479 -> here is the connection making a request
472.72 -> to the database
475.039 -> down below here we have a 20 second
477.44 -> timeline just for illustration purposes
481.44 -> but databases and database connections
484.08 -> don't usually just make one request
488.08 -> they make many such requests in
490.08 -> succession
492.4 -> and this is just one connection
495.44 -> there may be many such connections from
497.68 -> many applications to any given database
500.639 -> and each connection may fulfill many
502.72 -> different requests and these requests
504.879 -> will likely have a variety of durations
508.639 -> so what performance insights does is it
511.039 -> comes in and it samples the number of
513.599 -> active sessions once a second
517.279 -> so if we take a look at just the first
519.519 -> sample here from the top there is one
522.719 -> two three four five
526.24 -> active sessions at the time of the
527.68 -> sample
529.2 -> similarly if i went ahead and counted
531.68 -> all of these
533.2 -> and just put the sum below
535.279 -> you will see
536.399 -> that what we get is a count of
539.2 -> active sessions at each sample
542.24 -> as you can see this is what produces our
545.04 -> database load metric
547.279 -> it's that simple it's just a simple
549.04 -> count
550.88 -> so that's database load in its most
553.519 -> basic form
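To make the sampling concrete, here is a minimal sketch in Python of the same idea against an Aurora PostgreSQL instance, assuming the psycopg2 driver and placeholder connection details. Performance Insights does this collection for you, so this is purely illustrative, not how the service is implemented.

```python
# Illustrative sketch: approximate Performance Insights' database load
# by sampling the active-session count once a second for a minute,
# then averaging. Connection string is a placeholder.
import time
import psycopg2

conn = psycopg2.connect(
    "host=mycluster.example.com dbname=postgres user=admin password=secret"
)
conn.autocommit = True

samples = []
with conn.cursor() as cur:
    for _ in range(60):
        # Active sessions = connections in the middle of fulfilling a
        # request, excluding this monitoring session itself.
        cur.execute(
            "SELECT count(*) FROM pg_stat_activity "
            "WHERE state = 'active' AND pid <> pg_backend_pid()"
        )
        samples.append(cur.fetchone()[0])
        time.sleep(1)

aas = sum(samples) / len(samples)
print(f"db load over the last minute ~ {aas:.2f} average active sessions")
```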
555.44 -> but now i want to show you something
557.2 -> interesting
559.12 -> during a database request a database
561.6 -> does all kinds of things
564.64 -> it might spend some of this request time
566.56 -> doing reads and writes and it might
569.2 -> spend some of this time waiting on locks
571.2 -> and so on
573.04 -> it seems like this metric and our
575.68 -> sampling offers an opportunity not only
578.08 -> to show
579.2 -> how many requests are being handled at a
581.04 -> time
581.839 -> but also to show you what those requests
584 -> are doing
587.12 -> now because every active database
588.959 -> request is doing something at every
591.04 -> given moment in time
592.88 -> like running on cpu
594.64 -> reading from disk or something else
598 -> now i want to introduce this concept
599.839 -> called wait events
602.959 -> most modern databases are instrumented
605.44 -> with what are called wait events
608.16 -> and these wait events are just names or
610.72 -> labels for what a request is doing at
613.04 -> any given time
615.44 -> and this illustration here is a bit of a
617.839 -> simplified version but each of these
620.32 -> colors represents a different wait
622.32 -> state or a wait event in an imaginary
624.88 -> database
626.56 -> now real databases like postgres mysql
629.6 -> oracle or sql server all have different
632.72 -> names for their dozens of various wait
634.56 -> events
635.68 -> but just for illustration here let's say
638.079 -> there's a database type
640.16 -> with just four wait events cpu
643.36 -> read and write
645.04 -> and
645.839 -> locks
649.12 -> you can see these different waits
651.12 -> reflected in the active session requests
654.48 -> so if we not only count up the active
656.8 -> sessions but also count how many are
660.24 -> there in each wait state
662.88 -> this means that we can build a very nice
665.36 -> visualization
666.8 -> something like this
668.959 -> a visualization that characterizes the
671.279 -> prevalence of various wait events in a
673.36 -> database's active sessions
677.839 -> this type of visualization is at the
680.079 -> heart of performance insights
682.399 -> keep in mind we are still talking about
684.24 -> database load
687.519 -> that key core metric
690 -> can tell you when a database is having
692.48 -> performance issues
695.12 -> and the wait event breakdown can tell
697.12 -> you why it is having performance issues
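As a rough illustration of where such a breakdown can come from, Aurora PostgreSQL exposes each session's current wait event in pg_stat_activity, so one query can slice the instantaneous active-session count by wait state. This is a sketch for intuition, not how Performance Insights is implemented; connection details are placeholders.

```python
# Sketch: slice the current active sessions by wait event, the same
# dimension Performance Insights charts as colored bands of db load.
import psycopg2

conn = psycopg2.connect(
    "host=mycluster.example.com dbname=postgres user=admin password=secret"
)
with conn.cursor() as cur:
    # In pg_stat_activity, a NULL wait_event on an active backend
    # means the session is running on CPU.
    cur.execute(
        "SELECT coalesce(wait_event_type, 'CPU') AS wait_type, "
        "       coalesce(wait_event, 'CPU')      AS wait_event, "
        "       count(*)                          AS sessions "
        "FROM pg_stat_activity "
        "WHERE state = 'active' AND pid <> pg_backend_pid() "
        "GROUP BY 1, 2 ORDER BY sessions DESC"
    )
    for wait_type, wait_event, sessions in cur.fetchall():
        print(f"{wait_type}:{wait_event} -> {sessions} active sessions")
```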
702 -> so now that we have a way of finding
704.16 -> performance issues and also a way of
706.56 -> understanding why we are having them
708.959 -> what's our first instinct
711.36 -> well my first instinct is to automate it
715.04 -> at aws we automate to reduce our
717.68 -> customers operational burden
720.639 -> and we also heard our customers loud and
723.279 -> clear telling us that they didn't want
725.04 -> to have to browse around in performance
726.88 -> insights looking for problems and
729.04 -> analyzing weight events manually
732 -> they wanted specific analysis and
734.16 -> prescriptive advice
736.16 -> so we got to work
738.079 -> and so at reinvent this past year
740.959 -> we launched this brand new capability
743.2 -> called devops guru for rds
746.959 -> now for a long time customers have been
749.36 -> telling us that they like a lot of
751.36 -> things about performance insights and
752.959 -> cloud watch which let them explore all
756.079 -> kinds of issues around database
757.36 -> performance and troubleshooting
760.32 -> but customers have also told us loud and
762.8 -> clear that they would like a little more
764.8 -> help tracking down potential problems
767.839 -> and even more importantly figuring out
769.76 -> what to do about them
772.32 -> this is where devops guru comes in
776.24 -> since its release in may of 2021 devops
780 -> guru has been helping customers by
782.399 -> telling them about unusual and
784.56 -> problematic performance behavior
786.56 -> throughout their application stack
790.32 -> we are now raising the bar for database
792.639 -> diagnostics bringing detailed database
795.04 -> specific capabilities to devops guru
798.72 -> devops guru for rds goes several steps
801.76 -> further than performance insights by
803.76 -> using machine learning to detect and
806.24 -> diagnose performance problems in your
808.32 -> databases
809.6 -> in order to help you fix those problems
811.76 -> very quickly
816.48 -> let's jump right in and take a look at
818.959 -> what this new capability can do
822.399 -> so devops guru for rds uses machine
824.8 -> learning and anomaly detection to look
827.36 -> for unusual and problematic performance
829.519 -> events in your databases
832.399 -> when it finds one it performs analysis
835.04 -> on the anomaly to determine what kinds
837.44 -> of conditions are contributing to it
840.8 -> just some of the conditions that devops
842.959 -> guru can highlight are
845.6 -> database wait states also known as
847.839 -> wait events that are creating
849.279 -> bottlenecks for users
851.76 -> the top sql commands that are being sent
853.839 -> to the database by the application and
856.16 -> that are contributing chiefly to the
858 -> anomaly
859.92 -> any other notable metrics like memory
862.079 -> pressure and io that might also be
864.32 -> contributing
866.16 -> finally devops guru for rds suggests
868.959 -> what you can do next to address the
871.76 -> issue
875.36 -> let's take a moment to highlight a few
877.36 -> key things about this new capability
879.6 -> that we think make it particularly
881.199 -> useful for developers and database folks
885.199 -> first of all devops guru for rds
887.519 -> monitors for anomalies using that
890.16 -> special metric called database load
893.279 -> as we've discussed this metric registers
895.839 -> any kind of database contention
898.56 -> across every possible type of bottleneck
903.36 -> our anomaly detection uses a database's
905.92 -> metrics history to look back in time
908.88 -> and establish a baseline workload
911.04 -> pattern then it zeros in on significant
914.399 -> deviations from that pattern
918.079 -> this provides a much more accurate and
920.56 -> relevant set of findings than say
922.88 -> conventional static monitoring
924.48 -> thresholds which typically can be very
927.04 -> noisy and can overload operators with
929.839 -> alerts and alarms for things that aren't
932.399 -> typically problems
935.519 -> so devops guru for rds analyzes these
937.759 -> anomalies pulls them apart and
940.16 -> identifies causes based on algorithms
942.88 -> developed from years of performance
944.88 -> analysis at amazon across thousands and
947.68 -> thousands of databases
951.519 -> as a customer you clearly benefit
954.8 -> from these explanations of the analysis
956.959 -> results and you're very
959.36 -> quickly able to find the heaviest
960.959 -> hitting bottlenecks on your database
963.04 -> clusters
965.6 -> once you find them devops guru for rds
968 -> provides clear next steps for what you
970.16 -> can do about them
974.959 -> i think we should pause for a minute
976.8 -> just to explain what devops guru is and
979.68 -> how devops guru for rds adds to it
985.04 -> devops guru is a service that's been
986.959 -> available on aws for a while now
989.68 -> it looks across your whole application
991.839 -> stack for problematic behavior and tells
994.56 -> you what's going on
996.8 -> devops guru already includes rds and
999.6 -> aurora in its analysis but up to now
1002.72 -> it hasn't included a database deep dive
1006.88 -> so devops guru for rds adds a few
1009.44 -> special things to raise the bar in
1011.279 -> devops guru
1013.04 -> first of all it fine-tunes anomaly
1015.44 -> detection for aurora to accommodate the
1018.16 -> particular workload patterns of
1019.6 -> relational databases
1021.92 -> then as we have mentioned before it
1024.4 -> shows the customer the results of a
1026.16 -> database specific analysis and
1027.919 -> recommendation process
1032.88 -> so here's a breakdown of how this new
1035.28 -> capability functions
1038.16 -> it all starts with performance insights
1040.959 -> the database tuning metrics and
1042.88 -> telemetry tool for rds and aurora
1046.079 -> devops guru for rds relies on this data
1048.799 -> and only works if you have performance
1050.64 -> insights enabled on your aurora clusters
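If Performance Insights isn't already enabled, it can be turned on per DB instance from the RDS console or programmatically; here is a minimal boto3 sketch, where the instance identifier is a placeholder and 7 days is the free retention tier.

```python
# Sketch: enable Performance Insights on an Aurora instance with boto3.
# The instance identifier below is a placeholder.
import boto3

rds = boto3.client("rds")
rds.modify_db_instance(
    DBInstanceIdentifier="apg-1-instance-1",   # hypothetical instance name
    EnablePerformanceInsights=True,
    PerformanceInsightsRetentionPeriod=7,      # days; 7 is the free tier
    ApplyImmediately=True,
)
```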
1055.52 -> devops guru for rds then ingests these
1058.32 -> numeric metrics only
1060.32 -> note that it will never
1062.4 -> ingest any customer sql commands or any
1064.799 -> other sensitive data
1068.08 -> it particularly monitors the database
1070.4 -> load metric for anomalies comparing
1073.2 -> historical baselines to current activity
1075.52 -> to pinpoint unusual behavior
1078.48 -> one of the optimizations devops guru
1080.32 -> for rds has made for database workloads
1082.64 -> is to account for seasonality
1085.52 -> in many database workloads there are
1087.84 -> regular periodic spikes of activity from
1090.799 -> say batch jobs or etl jobs reporting and
1094.4 -> other scheduled tasks
1097.12 -> devops guru for rds only highlights
1099.36 -> anomalies that are not periodic and that
1102.559 -> occur outside of the regular course of
1104.559 -> normal database activity
1108.24 -> once an anomaly is detected devops guru
1110.96 -> for rds performs analysis on the metrics
1114.16 -> that make up the anomaly
1116 -> making note of the most prevalent wait
1118.24 -> states sql commands and other metrics
1120.88 -> that coincide with the anomaly
1123.44 -> a rule-based algorithm then processes
1125.679 -> these results and generates and stores a
1128.64 -> simple straightforward explanation and
1130.88 -> recommendations
1133.84 -> devops guru for rds sends the basic
1136.4 -> anomaly information including severity
1139.28 -> time frame and resource ids to the main
1141.679 -> devops guru service console
1144.48 -> and then devops guru has the ability to
1146.799 -> also send sns events and integrates any
1149.84 -> database anomalies into the big picture
1152.32 -> of your full application stack
1155.039 -> here in the devops guru console is where
1157.44 -> customers usually first find out about a
1159.84 -> database anomaly
1163.28 -> any time devops guru for rds has
1165.36 -> detected an anomaly in an aurora cluster
1167.84 -> the devops guru console will highlight
1169.76 -> it and direct you to the devops guru
1171.84 -> for rds detail page for that anomaly
1175.679 -> on that detail page devops guru for rds
1178.24 -> displays the analysis results providing
1181.12 -> the context necessary to understand what
1183.36 -> is contributing to the database issue
1186.24 -> and more importantly also the results of
1189.039 -> the analysis that point you to the best
1191.679 -> next steps for resolving the problem
1197.36 -> now that i've had a chance to introduce
1199.2 -> you to the feature
1200.64 -> let's talk a little bit about how to get
1202.72 -> it working
1204.4 -> as i mentioned devops guru for rds is
1207.039 -> part of devops guru
1208.799 -> and that means to use it you have to
1210.96 -> turn devops guru on for your application
1213.919 -> stack or the account
1217.28 -> you can find devops guru in the aws
1219.44 -> console under machine learning
1224.4 -> once you click there if you haven't
1226.159 -> enabled devops guru for your account
1228.24 -> you'll just arrive here
1230.08 -> on this page just click get started
1235.039 -> on the getting started page you can
1236.64 -> learn a little more about devops guru
1239.28 -> and also see the permissions that the
1241.2 -> service needs in order to do its job
1245.12 -> scrolling down you can see and choose
1247.919 -> which resources in your account you want
1250.08 -> devops guru to monitor
1252.08 -> you can choose to enable all the
1253.6 -> resources in your account or if you
1255.84 -> click choose later you can later select
1258.24 -> any resources with a particular tag or a
1261.039 -> particular cloud formation stack
1264 -> remember devops guru monitors your whole
1266.64 -> application
1267.84 -> so enabling by tag or a cloud formation
1270.64 -> stack helps it differentiate one
1273.039 -> application from another
1275.76 -> here you can also set up sns to receive
1278.48 -> alerts when devops guru detects
1280.32 -> something is happening in your
1281.6 -> environment
1283.44 -> once you've made your choices just click
1285.679 -> enable
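The same choices can also be scripted. Below is a minimal boto3 sketch, assuming a hypothetical CloudFormation stack name and SNS topic ARN, that adds resources to DevOps Guru's coverage and registers a notification channel.

```python
# Sketch: enable DevOps Guru for the resources in one CloudFormation
# stack and register an SNS topic for alerts. Names are placeholders.
import boto3

guru = boto3.client("devops-guru")

# Tell DevOps Guru which resources to analyze (here: one CFN stack).
guru.update_resource_collection(
    Action="ADD",
    ResourceCollection={"CloudFormation": {"StackNames": ["inventory-tracker"]}},
)

# Receive notifications when insights are created.
guru.add_notification_channel(
    Config={"Sns": {"TopicArn": "arn:aws:sns:us-east-1:123456789012:devops-guru-alerts"}}
)
```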
1289.12 -> when you first enable devops guru it
1291.84 -> needs some time to establish a baseline
1294.72 -> and begin detecting unusual patterns
1298.24 -> you could start seeing insights in as
1300.24 -> soon as two hours after enabling it
1302.4 -> depending on the size of the account
1304.4 -> and of course whether something
1305.76 -> anomalous is happening
1307.84 -> this dashboard will give you a summary
1310.08 -> across all your monitored resources
1314.799 -> once devops guru begins detecting things
1317.679 -> you will find everything listed in the
1319.6 -> insights page
1321.44 -> the individual insights have names that
1323.679 -> are indicative of one or more anomalous
1326.159 -> metric that devops guru detected
1329.2 -> and if you click into an insight it
1330.96 -> shows you the insight detail page
1334.32 -> with respect to rds any given insight
1337.12 -> in devops guru may or may not have an
1339.6 -> aurora component
1341.2 -> in this example we have a few lambda
1343.28 -> errors and an rds db load anomaly
1349.2 -> on the insight detail page from the top
1351.76 -> you'll find things like insight severity
1355.12 -> start and end times of the insight and
1356.96 -> the status of the anomaly
1359.84 -> in this example you'll notice this is an
1361.76 -> ongoing anomaly so you'll see the status
1363.919 -> as ongoing
1365.52 -> you can also see aggregated metrics that
1368.08 -> devops guru detected
1370.08 -> that were anomalous at the same time
1373.6 -> now this paints a picture of the overall
1376.159 -> issue to help you understand the full
1378.559 -> extent and possible origins of an issue
1381.12 -> in your app stack
1383.44 -> looking at this particular example we
1385.2 -> have some anomalous metrics from rds and
1387.84 -> lambda
1389.679 -> you will also see that there is a
1391.52 -> special button on the rds metric finding
1394.96 -> inviting you to view the detailed
1396.72 -> analysis
1398.24 -> this is your indication that devops guru
1400.32 -> for rds has been triggered and a deep
1402.88 -> analysis of the database component of
1404.799 -> this insight is available
1409.12 -> below the aggregated metrics section
1410.96 -> you'll find some relevant events
1413.039 -> distributed either over a timeline view
1415.28 -> or a tabular view
1417.12 -> things like infrastructure events
1419.679 -> such as auto
1421.52 -> scaling activity etc
1423.44 -> that occurred on your app stack during
1425.12 -> this anomaly and that devops guru used to
1427.36 -> generate these insights
1431.12 -> the primary metric section summarizes
1433.52 -> the db anomaly which is the top level
1436.08 -> anomaly within the insight
1438.4 -> you can think of this anomaly as the
1440.08 -> general problem that is experienced by
1442 -> your database instance
1446.32 -> finally you'll see the analysis and
1448.559 -> recommendations section which describes
1451.039 -> a specific finding that requires
1452.88 -> investigation
1454.72 -> each finding corresponds to a set of
1457.12 -> related metrics
1458.799 -> and also conveniently linked here you
1460.799 -> will find our brand new documentation
1463.679 -> which provides diagnostic guidelines for
1466 -> each of these wait events
1468.64 -> you'll notice for this example here
1470.32 -> there's also a sql id
1472.96 -> and if you hover over the sql id it
1475.44 -> shows you the actual sql statement
1478.32 -> clicking on this sql id will in fact
1480.72 -> take you to a view in performance
1482.32 -> insights dashboard that shows the
1484.72 -> details and history of the sql statement
1486.72 -> itself
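For readers who prefer the API to the console, the same top-SQL slicing is available programmatically from Performance Insights. A sketch using boto3's pi client follows; note the Identifier is the instance's DbiResourceId (a placeholder below), not its instance name.

```python
# Sketch: ask Performance Insights for the top SQL by db load over the
# last hour. Identifier is the instance's DbiResourceId (placeholder).
from datetime import datetime, timedelta
import boto3

pi = boto3.client("pi")
now = datetime.utcnow()
resp = pi.describe_dimension_keys(
    ServiceType="RDS",
    Identifier="db-ABCDEFGHIJKL123456",        # hypothetical DbiResourceId
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Metric="db.load.avg",
    GroupBy={"Group": "db.sql", "Limit": 10},  # top 10 SQL statements
)
for key in resp["Keys"]:
    # Total = that statement's contribution to average active sessions.
    print(key["Total"], key["Dimensions"].get("db.sql.statement"))
```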
1489.679 -> now in addition to the devops guru
1491.679 -> console
1492.799 -> you can also get a glimpse of an ongoing
1494.88 -> anomaly on the rds dashboard
1497.679 -> in this case it's the apg-1 instance 1
1501.279 -> db instance in our aurora postgres
1503.76 -> cluster that is seeing an increased
1506.24 -> database load of 272 times above normal
1510.4 -> that likely impacted our application
1512.159 -> performance
1515.76 -> drilling down further we will see that
1518.4 -> the anomaly is also visible in the
1520.4 -> performance insights dashboard and is
1522.72 -> highlighted as a high severity anomaly
1525.44 -> in the db load graph
1527.6 -> in this dashboard you can also see the
1529.6 -> top sql queries
1531.279 -> in the top sql view below that might be
1533.679 -> contributing to the relevant anomaly
1538.799 -> and with that i would like to invite my
1540.96 -> colleague maxim to take over and walk
1543.84 -> through a demo of the kinds of findings
1546.32 -> and advice you can expect from devops
1548.88 -> guru for rds
1551.919 -> in this demo we are going to look at
1554.72 -> two separate scenarios
1556.559 -> to see how devops guru can help you
1559.919 -> troubleshoot performance issues in your
1562.24 -> applications
1564.72 -> now to set things up
1566.64 -> for these scenarios we'll consider a
1568.64 -> case of an online bookstore
1572 -> this bookstore is selling books
1574.799 -> via its website out of its warehouse
1578.24 -> and as any online business it runs a
1581.2 -> number of applications to to support its
1584.48 -> operations
1586.88 -> and of course
1588.24 -> whenever there is a production software
1590.48 -> environment
1591.6 -> there is also a
1593.279 -> support engineer
1595.76 -> who is tasked to monitor that environment
1598.799 -> react to problems and address the
1601.44 -> various issues
1604.799 -> one day that on call engineer receives a
1607.919 -> notification from devops guru
1610.32 -> that tells her that
1612.559 -> a major performance anomaly
1614.96 -> is happening with one of their
1616.799 -> applications
1619.279 -> so let's
1621.12 -> take a look at this case and see how
1623.279 -> devops guru can help troubleshoot this
1625.919 -> particular issue
1629.84 -> so the first screen that
1632.799 -> we are going to see is the overall
1635.36 -> application dashboard
1637.44 -> so in this case we can see that there is
1639.279 -> a single unhealthy application
1641.679 -> called inventory tracker
1644.24 -> and
1645.039 -> we actually have several different
1646.48 -> applications but
1648.08 -> most of them are
1649.84 -> healthy but
1651.36 -> one of them is experiencing
1653.919 -> an ongoing performance issue
1656.88 -> now if you're wondering about how to
1658.799 -> define these applications
1661.12 -> there are several ways to do it you
1663.84 -> can either use cloud formation stacks
1666.32 -> that tie all of the resources together
1668.88 -> in a single application
1671.12 -> or you can use special tags that you
1674.08 -> can assign to
1676.08 -> different groups of resources to mark
1678.64 -> what
1680.08 -> represents a particular application
1683.919 -> but going back there is currently an
1686.32 -> ongoing issue with the application
1688.159 -> called inventory tracker
1690.799 -> so let's take a look at
1692.48 -> exactly what is going on there
1698.96 -> so here we can see that there is a
1700.48 -> single ongoing anomaly
1703.039 -> with
1704 -> high severity that started at
1706.64 -> 12:19
1708.84 -> utc if i click on the
1711.52 -> link for that anomaly
1713.76 -> we're going to see the main devops guru
1716.799 -> finding
1718.64 -> and i'm just going to briefly comment
1721.36 -> on what exactly you're seeing and how
1724.399 -> exactly devops guru
1726.72 -> identified the reason for the ongoing
1728.96 -> performance issue
1731.76 -> well as you can see here there is a
1734.32 -> number of metrics
1736.159 -> that
1737.2 -> devops guru found
1739.679 -> to be experiencing anomalous behavior
1744 -> now here you can see that there are five
1745.919 -> different metrics that are
1749.44 -> spread across two different types of
1752 -> resources so there is an rds database
1755.36 -> and there is also a lambda function
1758.399 -> and
1760.08 -> devops guru shows you the exact timeline
1763.12 -> of when the anomaly was detected
1765.679 -> right over here separately for each metric
769.919 -> and you can also see the graphed anomalies
1774.48 -> and these graphs would give you a more
1776.48 -> precise picture of what exactly the
779.279 -> problematic behavior of these metrics
781.52 -> looks like
1784.48 -> but going back to the main screen
1788.48 -> here looking at these metrics
791.36 -> we can confirm that our application
794.799 -> which in this case is represented by a
796.88 -> lambda function
1798.799 -> is indeed having some
1801.2 -> serious issues
1804 -> so we can take a look at
1805.919 -> its metrics in detail
1809.52 -> and see that
1812.48 -> the duration of this function
1815.6 -> or
1816.559 -> estimated time to run for this function
1819.2 -> has increased dramatically
1821.76 -> and also this function
1823.679 -> is now throwing
1825.36 -> quite a few errors so it's definitely in
1828.159 -> bad shape
1831.36 -> now of course
1833.679 -> this lambda function is not the only
1835.84 -> part of the application
1838.48 -> in the modern world applications are
1840.64 -> typically built
1842 -> in a distributed fashion where
1844.72 -> different pieces of business logic are
1847.679 -> assigned across multiple different
1850.72 -> tiers
1852.08 -> and
1852.96 -> when you investigate performance issues
1856.24 -> the first question that people typically
1858.64 -> ask is
1859.76 -> well
1860.64 -> which tiers in my application stack are
1863.6 -> actually responsible for the problem
1867.76 -> and here devops guru
1870 -> has given us the first important clue
1874 -> because
1876.24 -> in this picture we not only see that the
1878.96 -> lambda function is having an issue
1881.919 -> but also that the
1884 -> database instance
1886.24 -> that is a part of
1888.399 -> this application stack is having a
1890.32 -> correlated issue so a performance issue
1893.44 -> that happens at the same time
1897.519 -> because we know that our lambda function
1900.88 -> issues frequent calls to the backend
1903.279 -> database
1905.2 -> this is an important observation
1907.76 -> so
1908.48 -> here we actually can make a connection
1911.76 -> that
1914.399 -> these two problems are likely related
1917.36 -> and in fact if you want to find out the
1920.08 -> real root cause of this problem
1922.559 -> you probably have to look at the
1923.84 -> database
1926.399 -> just because it's downstream
1929.919 -> well
1930.88 -> the good news is
1932.96 -> devops guru has a dedicated component
1936.399 -> that can do a deep dive on the database
1939.039 -> level
1940 -> run analysis on what's going on with
1943.12 -> the database performance and report
1945.519 -> results back
1948.08 -> in this case
1950.399 -> this analysis has already been run and
1952.48 -> completed and
1954.559 -> its findings were linked to
1957.76 -> the main devops guru finding
1961.12 -> and we can access them here
1964.72 -> now let's click on view analysis button
1967.2 -> and see
1968.32 -> what exactly are the issues with the
1970.72 -> database
1977.679 -> here is the database-centric view of
1980.399 -> this particular performance issue
1983.44 -> and there are several pieces of
1984.96 -> important information here
1988.32 -> first of all we can see that
1991.039 -> there is a pretty significant spike of
1994.72 -> database load
1996.559 -> measured in average active sessions so
1999.2 -> something my colleague has mentioned
2000.96 -> before
2003.2 -> and to put it very simply
2006.08 -> this spike means that
2008.799 -> there are a lot of database users or
2011.76 -> sessions
2013.2 -> who are currently executing database
2015.279 -> calls
2016.64 -> and are waiting for the database to
2018.32 -> respond back
2020.64 -> and these waits are quite long at
2023.919 -> this moment which is why we have
2027.279 -> this performance issue and which is why
2029.679 -> the database at the moment is considered
2032 -> to be unhealthy
2036.32 -> if we look at the spike we can notice
2038.72 -> that the spike is
2041.76 -> large in both the absolute values right
2044.88 -> so we have the
2047.32 -> 1400 average active sessions at the top
2050.72 -> of the spike
2052.159 -> and the normal
2055.44 -> number of average active sessions for
2057.52 -> this database should be less than 2
2061.28 -> which is represented by the number of
2062.879 -> vcpus here
2066 -> but also the spike is very unusual
2069.76 -> as we can see here the typical database
2072.56 -> workload is around zero average active
2076.24 -> sessions
2077.359 -> in other words this
2079.2 -> database is typically
2081.04 -> very idle
2082.96 -> but in this case we can see that the
2085.119 -> spike is actually
2086.639 -> more than a thousand times
2088.8 -> higher than the typical database
2092 -> workload activity
2093.76 -> so this makes it a very unusual spike
2098.56 -> well in any case
2101.2 -> the large and unusual spike represents a
2104.96 -> very acute
2106.48 -> and severe performance issue that we are
2108.88 -> observing on the database level
2113.04 -> but why exactly are we seeing this
2114.56 -> problem
2115.52 -> what's going on with the database
2119.04 -> and here is where the analysis part of
2121.52 -> devops guru for rds comes in
2124.64 -> so we can look at the analysis section
2128.839 -> and as a big picture
2133.68 -> essentially what we're seeing here is
2135.599 -> that the database is
2138.8 -> experiencing heavy
2140.8 -> database locking conditions
2144.72 -> now
2145.92 -> just to step back a little
2148.16 -> locking conditions are one of those
2150 -> things that are very hard to detect
2153.04 -> using
2154.4 -> quote-unquote traditional monitoring
2156.8 -> approaches
2158.88 -> for example many organizations rely on
2162.64 -> resource-based monitoring such as
2165.599 -> cpu utilization for example
2168.4 -> and
2170.64 -> just to give you a sense of
2173.2 -> why cpu utilization is not the best
2175.52 -> metric to detect this particular type of
2177.52 -> problem
2178.72 -> we are going to take a quick look at cpu
2180.96 -> utilization that
2182.88 -> is measured on this database
2191.359 -> so here
2193.599 -> we see that the way that cpu utilization
2196.16 -> has been unfolding
2198 -> over the last three hours
2200.4 -> and there is definitely an increase
2202.96 -> which is correlated to the type of the
2204.64 -> problem
2205.92 -> but
2206.88 -> even at the peak as you can see the
2209.119 -> maximum cpu utilization
2211.52 -> is just 35 percent and this number is
2216.24 -> usually too low to trigger any
2218.48 -> actionable alarms out of that
2221.2 -> in other words
2222.48 -> if you watch cpu utilization you'll
2224.56 -> probably miss that particular issue
2227.119 -> and that's why relying on database load
2229.68 -> metric is better
2232.32 -> and that's why we've chosen the database
2234.56 -> load metric as our main metric
2239.119 -> but going back to our case
2243.52 -> so
2244.72 -> the database
2246 -> has
2246.8 -> been experiencing locking conditions
2249.599 -> but what kind of locking
2251.92 -> well the devops guru for rds has
2254.48 -> determined that as well
2256.8 -> so there is a particular type of
2258.72 -> database locking called lock:tuple
2262.079 -> that is currently occurring in our
2265.04 -> database
2266.48 -> and we know that it's lock:tuple
2268.72 -> because it's the wait event
2271.52 -> once again
2273.359 -> that my colleague has mentioned before
2276 -> that is dominating this particular spike
2278.8 -> of database load
2282 -> and
2283.52 -> lock:tuple sounds a little bit
2286.079 -> cryptic
2287.599 -> and we of course recognize that fact and
2290.64 -> that's why in devops guru you can click on
2293.28 -> the wait event name
2295.839 -> and see the simplified description of
2298.4 -> what is really going on with the
2299.839 -> database
2301.839 -> and it looks like in this case
2304.96 -> we have a
2306.24 -> situation where there are multiple
2308.8 -> concurrent
2310.16 -> database sessions that are competing for
2312.4 -> the same exact database records
2316.48 -> so that's easy to understand
2319.76 -> now we can also see the
2322.4 -> actual sql
2324.24 -> that is mainly participating in that
2327.76 -> competition so in other words this
2330 -> sql is a likely culprit for the
2332.079 -> problem that we're seeing
2336.88 -> okay so
2338.88 -> now we understand the problem in a
2340.8 -> little more detail
2342.96 -> but what can we do to actually solve it
2346.48 -> how can we
2348.16 -> address the problem and
2350.56 -> make sure that we no longer see this
2353.92 -> database locking issue
2357.28 -> and that's where troubleshooting
2359.44 -> documentation comes in
2362.4 -> so here at rds
2365.119 -> we've looked at the
2367.2 -> typical wait events that are occurring
2369.68 -> in our fleet
2371.28 -> and we've created detailed
2373.119 -> troubleshooting guides for all of these
2375.599 -> events
2377.44 -> and these guides are part of regular
2380 -> aurora documentation now and the way
2382.64 -> that they are structured
2384.48 -> is that there is usually
2387.28 -> a section that describes the likely
2389.68 -> causes of increased waits
2393.28 -> as well as the action items that you can
2395.52 -> take to troubleshoot and address them
2398.88 -> and these action items frequently
2400.72 -> include the actual code references
2404.319 -> or sql queries that you can just
2407.359 -> copy and paste and run
2410.8 -> so for example in this case we can refer
2413.599 -> to one of the documents
2416.4 -> that describes how to troubleshoot
2419.76 -> locking scenarios
2423.68 -> we can copy
2426.4 -> the recommended
2429.839 -> sql that is provided in these documents
2433.359 -> and we are going to
2436.24 -> query the blocking tree
2439.119 -> to see
2440.319 -> what is the root cause of this
2442 -> particular issue
2451.68 -> so in this case you can notice that
2454.319 -> there is a single session
2456.4 -> that is holding
2458.24 -> the entire tree
2460.4 -> so we know where the session is coming
2462.56 -> from
2463.92 -> we know which statement this session is
2465.76 -> running
2467.04 -> and we also know which particular
2469.04 -> database record the session is competing
2472.319 -> on
2473.04 -> and everybody else is waiting for
2475.2 -> this session to unblock this record
2478.56 -> so essentially if we just kill that one
2480.56 -> particular session
2482.4 -> that would address this particular
2484.8 -> performance problem
2487.2 -> so it's as simple as that
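The query in the troubleshooting guide is more elaborate, but a minimal version of the same blocking-tree idea can be assembled from standard PostgreSQL functions; the sketch below is for illustration and is not copied from the guide. Connection details and the terminated pid are placeholders.

```python
# Sketch: list sessions waiting on a lock and who blocks them, then end
# the session at the root of the blocking tree. Standard PostgreSQL
# functions; connection string and pid are placeholders.
import psycopg2

conn = psycopg2.connect(
    "host=mycluster.example.com dbname=postgres user=admin password=secret"
)
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute(
        "SELECT pid, pg_blocking_pids(pid) AS blocked_by, "
        "       left(query, 60)            AS query "
        "FROM pg_stat_activity "
        "WHERE cardinality(pg_blocking_pids(pid)) > 0"
    )
    for pid, blocked_by, query in cur.fetchall():
        print(f"pid {pid} waits on {blocked_by}: {query}")

    # Ending the root blocker releases everyone queued behind it
    # (pid 12345 is a placeholder).
    cur.execute("SELECT pg_terminate_backend(%s)", (12345,))
```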
2492.8 -> all right
2494.48 -> so going back
2496.64 -> just to summarize what we've seen in
2498.56 -> this example
2500.4 -> so devops guru for rds
2503.359 -> first recognized a performance issue that
2506.24 -> is pretty hard to detect
2510.319 -> it also established an important
2512.4 -> connection between the application issue and
2515.119 -> the database issue
2517.359 -> and directed us towards
2519.52 -> analyzing database performance because
2521.839 -> it was the most likely culprit
2525.359 -> on the database level it discovered the
2528.16 -> nature of the database issue that was
2530.16 -> going on
2531.359 -> and also explained why this issue
2534.16 -> was happening
2535.52 -> why did we see those locks and what
2537.28 -> types of locks we saw
2540.88 -> it highlighted the sql that was
2542.96 -> likely the main contributor
2545.44 -> to the problem
2548.16 -> and finally it provided detailed
2550.64 -> troubleshooting
2552.24 -> guidelines to
2554.8 -> address this problem at its
2557.359 -> root cause
2560.56 -> so these issues are happening
2564.8 -> in the real world quite frequently
2567.92 -> and
2569.04 -> they usually take some time to
2571.2 -> resolve
2572.56 -> but hopefully with the help of devops
2574.24 -> guru the time to resolution can be a
2576.72 -> little bit faster
2581.92 -> all right so i'm going to switch gears
2584.8 -> and look at one more scenario
2588.72 -> so the nice thing about devops guru is
2592.24 -> that
2593.04 -> you don't necessarily need to be on the
2595.04 -> system when a performance incident is
2597.92 -> going on
2600.24 -> devops guru
2601.839 -> captures and records all of the
2603.92 -> performance incidents and it maintains a
2606.72 -> historical record of them
2609.28 -> so essentially you can always go back to
2611.44 -> a past incident
2613.119 -> to investigate an issue that
2614.96 -> happened
2616.16 -> some time ago
2617.68 -> maybe because
2619.04 -> of a customer request
2621.119 -> to see what was going on
2623.04 -> yesterday at 1 pm
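That lookback can also be scripted: the DevOps Guru API lists past insights. A minimal boto3 sketch with an arbitrary seven-day window:

```python
# Sketch: list the reactive insights DevOps Guru recorded over the
# last 7 days, so past incidents can be reviewed after the fact.
from datetime import datetime, timedelta
import boto3

guru = boto3.client("devops-guru")
resp = guru.list_insights(
    StatusFilter={
        "Any": {
            "Type": "REACTIVE",
            "StartTimeRange": {
                "FromTime": datetime.utcnow() - timedelta(days=7),
                "ToTime": datetime.utcnow(),
            },
        }
    }
)
for insight in resp["ReactiveInsights"]:
    print(insight["Severity"], insight["Name"],
          insight["InsightTimeRange"]["StartTime"])
```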
2626 -> so let's take a look at one of those
2627.68 -> incidents
2629.599 -> that happened in the past
2634.96 -> so here you can see that there is a
2638.4 -> history of
2641.92 -> different anomalies that happened with
2643.839 -> different applications
2646.4 -> so let's look at one of the particular
2648.72 -> anomalies that
2650.56 -> happened a couple of days ago
2652.72 -> with the shipping application
2658.96 -> now when we look at the main anomaly
2660.72 -> screen
2662.8 -> we can see a very similar picture to the
2665.28 -> first example
2666.96 -> so there is a number of metrics and
2670.88 -> devops guru detected anomalies in those
2673.119 -> metrics
2674.16 -> and we can see the exact timeline of
2675.92 -> anomalies
2677.359 -> and also the graphed representation of
2679.839 -> anomalies
2682.96 -> and also just like in the first case
2686.88 -> devops guru combined anomalies from two
2689.2 -> different resources
2690.96 -> the lambda function or
2693.119 -> quote-unquote the front end of our
2694.64 -> application
2696 -> and the
2697.04 -> database or quote-unquote the back end of
2700.48 -> our application
2702.319 -> and because we see those two resources
2705.2 -> in the same anomaly and because we
2708.16 -> know that our lambda function actually
2710.24 -> frequently queries that particular
2712.48 -> database
2713.839 -> we can
2714.96 -> reasonably infer that once again the
2717.839 -> database is likely the
2720 -> root cause of that issue and if we want
2722.64 -> to find out what's going on we need to
2724.64 -> look at the database
2727.839 -> now there is one thing that is
2729.44 -> interesting here
2731.68 -> we
2732.4 -> actually have detected anomalies on
2735.52 -> more metrics
2737.52 -> and one of those metrics is called
2740 -> disk queue depth
2742.24 -> it is happening on the database resource
2746.48 -> and
2748.319 -> because we see an anomaly on this metric
2751.28 -> we can reasonably conclude that the
2753.52 -> actual
2754.72 -> problem is likely related to io
2758.56 -> but let's confirm that
2761.119 -> so if we click on the view analysis
2762.96 -> button
2764.88 -> we can see the
2766.319 -> database centric view of this
2767.839 -> performance issue
2770.56 -> and if we look at the main graph
2773.2 -> we can confirm
2774.48 -> that
2776.88 -> this particular performance issue
2779.359 -> is indeed related to io because the
2783.28 -> most dominating wait event that we see
2785.52 -> is io:xactsync which is io related
2791.68 -> so if we look at the analysis
2794.079 -> we once again can pinpoint the
2798.079 -> actual sql
2799.599 -> that
2800.839 -> experienced those io issues
2804.72 -> and also the simplified description of
2808.8 -> the problem
2810.24 -> what exactly is happening on the io
2813.04 -> and it looks like in this case what is
2814.8 -> happening is that there is
2817.92 -> something related to commits
2820.48 -> perhaps the increased number of commits
2824.96 -> we can corroborate that theory by
2827.04 -> looking through the troubleshooting
2828.56 -> documentation
2830.16 -> if we open the link
2832.72 -> and read the likely causes of increased
2835.52 -> waits and
2839.2 -> suggested action items
2841.839 -> one of the action items is actually to
2844 -> reduce the number of commits and this is
2847.599 -> probably the most typical case for this
2849.68 -> particular type of problem
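To make that action item concrete, here is a hedged before-and-after sketch of commit batching; the table, the data, and the connection string are hypothetical, and in practice you would use one approach or the other, not both.

```python
# Sketch: cut IO:XactSync waits by committing once per batch instead of
# once per row. Table, rows, and connection string are placeholders.
import psycopg2

conn = psycopg2.connect(
    "host=mycluster.example.com dbname=postgres user=admin password=secret"
)
rows = [(i, f"item-{i}") for i in range(10000)]

# Anti-pattern: autocommit makes every INSERT its own transaction, so
# each row forces a transaction-log sync (an IO:XactSync wait).
conn.autocommit = True
with conn.cursor() as cur:
    for r in rows:
        cur.execute("INSERT INTO inventory VALUES (%s, %s)", r)

# Better: one transaction per 1,000 rows -> roughly 10 log syncs
# instead of 10,000.
conn.autocommit = False
with conn.cursor() as cur:
    for i, r in enumerate(rows, start=1):
        cur.execute("INSERT INTO inventory VALUES (%s, %s)", r)
        if i % 1000 == 0:
            conn.commit()
conn.commit()  # commit any remaining rows
```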
2853.68 -> now there is one thing that i haven't
2855.68 -> mentioned yet
2857.76 -> a very nice feature that devops guru
2861.359 -> provides
2862.8 -> and this is the timeline of events that
2865.599 -> it captures
2867.68 -> now
2869.2 -> whenever you investigate a performance
2871.119 -> issue it's really important to see
2875.119 -> whether
2876.24 -> there was something
2877.92 -> in the environment a configuration
2881.68 -> or perhaps a code
2883.44 -> that changed very recently
2886.559 -> in other words if you see for example
2888.559 -> that a particular database parameter
2891.839 -> or particular piece of code was changed
2894.64 -> right prior to the time when the anomaly
2897.28 -> has actually started
2899.28 -> it's very likely that
2902.24 -> there is a causal relationship between
2904.24 -> them
2905.28 -> in other words you change the database
2907.119 -> parameter and that
2909.52 -> probably triggered the anomaly
2912.64 -> now let's look at the timeline
2915.119 -> for this particular incident
2926.24 -> and here we can see
2929.359 -> that right prior to the anomaly start
2932.88 -> there was actually a change in code
2936.88 -> which
2937.76 -> in our case is probably important
2940.559 -> so
2941.52 -> the action item here would be to look
2943.599 -> at that piece of code
2945.44 -> and
2946.72 -> find out why exactly
2950.8 -> it
2952.079 -> made the application issue
2954.319 -> more commits
2956.88 -> and that's something that that we can do
2962.72 -> i hope that these two examples
2966.24 -> showed you that
2968.8 -> devops guru and devops guru for rds
2971.599 -> are useful tools that
2974.8 -> can help you
2976.16 -> troubleshoot performance issues better
2978.48 -> and faster
2980.8 -> and as a side note
2982.8 -> i'd like to say that we are constantly
2985.04 -> working on improving these tools
2987.68 -> and
2989.599 -> we want to make them even more useful
2992.16 -> in the future so
2994.319 -> please
2995.2 -> keep watching the space
2997.28 -> thank you
3000.4 -> thanks maxim those are some great
3002.88 -> examples of the kinds of insights devops
3005.44 -> guru for rds can provide
3008.64 -> now let's talk a little bit about
3010.319 -> pricing
3012.48 -> so devops guru for rds is offered at no
3015.76 -> additional charge as part of the
3017.44 -> existing price that devops guru charges
3020.079 -> you for analyzing amazon rds resources
3024.079 -> devops guru segments the resource types
3026.48 -> it evaluates into two groups
3028.8 -> group a which consists of lambda and s3
3032 -> and group b which consists of amazon rds
3035.2 -> ec2 redshift clusters and 25 other aws
3038.72 -> resource types
3040.8 -> you will also notice that group a
3043.28 -> and group b both are priced very
3045.76 -> economically where group a
3048.4 -> actually equates to approximately two
3050.319 -> dollars per resource per month
3052.559 -> and group b equates to approximately
3054.8 -> three dollars per resource per month
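As a rough sanity check on those monthly figures, and assuming the launch-era list rates of about $0.0028 per resource-hour for group A and $0.0042 for group B (the talk doesn't quote hourly rates, so treat these as an assumption):

```python
# Rough arithmetic behind the quoted monthly figures, assuming the
# launch-era rates of $0.0028 (group A) and $0.0042 (group B) per
# resource-hour and a 730-hour month. Rates are an assumption, not
# quoted in the talk.
hours_per_month = 730
print(f"group A: ${0.0028 * hours_per_month:.2f} per resource per month")  # ~$2.04
print(f"group B: ${0.0042 * hours_per_month:.2f} per resource per month")  # ~$3.07
```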
3058.24 -> if you choose to you can also
3060.72 -> opt to use tags to control costs by
3064 -> enabling devops guru only for your
3065.92 -> aurora resources
3068.64 -> for more information about using aws
3070.88 -> tags with devops guru
3072.8 -> please visit the topic working with
3074.72 -> resource tags in our documentation page
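For reference, a minimal boto3 sketch of that tag-based enablement; the tag key and values are hypothetical, and the key must begin with the devops-guru- prefix.

```python
# Sketch: limit DevOps Guru coverage to resources carrying a specific
# tag, e.g. only the Aurora cluster. Key and value are placeholders;
# the key must begin with the "devops-guru-" prefix.
import boto3

guru = boto3.client("devops-guru")
guru.update_resource_collection(
    Action="ADD",
    ResourceCollection={
        "Tags": [
            {"AppBoundaryKey": "devops-guru-apps", "TagValues": ["aurora-prod"]}
        ]
    },
)
```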
3080.72 -> so to summarize
3082.8 -> today we didn't only want to tell you
3084.88 -> about this great new capability called
3087.119 -> devops guru for rds
3089.119 -> but we also wanted to provide you with
3091.28 -> some foundational knowledge that you
3093.2 -> could use and that could help you
3096 -> understand why
3097.599 -> it matters to find and resolve these
3099.599 -> performance disruptions
3102.24 -> amazon rds and amazon aurora are
3104.559 -> designed to take care of the complicated
3106.48 -> parts of running a database
3108.88 -> but performance issues are a complicated
3111.04 -> part
3111.92 -> and until recently we only had
3114.319 -> performance insights to help you shine a
3117.359 -> ray of light onto that topic
3120.319 -> but now we have devops guru for rds
3124.24 -> using machine learning and all of our
3126.16 -> experience managing a fleet of amazon
3128.4 -> databases we are going one step further
3131.76 -> towards making performance management an
3134.079 -> automated part of our managed database
3136.72 -> service
3140.88 -> we would encourage you to leverage these
3143.119 -> helpful resources which will help you to