AWS re:Invent 2020: Serverless data preparation with AWS Glue
Aug 16, 2023
The first step in an analytics or machine learning project is to prepare your data to obtain quality results. AWS Glue is a serverless extract, transform, and load (ETL) service with a recent series of innovations that make data preparation simpler, faster, and cheaper. Join this session and listen to AWS Glue general manager Mehul A. Shah showcase the service's new visual experience that makes it easier to author, debug, and manage your ETL jobs. This session dives deep into the new AWS Glue engine that offers 10 times faster job start times and improved support for data extraction, streaming ETL, orchestrating ETL workflows, and more. Learn more about re:Invent 2020 at http://bit.ly/3c4NSdY
Content
Hi, I'm Mehul Shah, the general manager of AWS Glue and AWS Lake Formation, and today I'm going to tell you about AWS Glue and how it provides serverless data preparation capabilities.

Before I get started, let me give you an overview of my talk. First, I'm going to tell you about AWS Glue and the modern use cases that are driving its growth. Then, for the bulk of my talk, I'm going to speak about some of the cool new features it offers to make Glue faster, cheaper, and easier to use for you. And then, finally, I'm going to conclude.
So why data preparation? What is data preparation? Data preparation is the process of collecting all the data that you need, transforming it, cleaning it, and normalizing it so you can get quality data for running analytics, to get insights about your business, understand how it's operating and what you can do to optimize it, and to build machine learning models so that you can do prediction and classification. Data preparation is the first mile of any of these activities, and data preparation is hard: if you don't get it right, you're going to end up with incorrect results.
So what are the factors that make it hard? Well, there are three of them. The first is that there is a lot of data. Across our customer base, we see customers growing the amount of data they manage by 10x every five years, and this data is becoming increasingly diverse. It's no longer just structured data that they want to manage or analyze. They want to analyze semi-structured data, which is data like machine logs, network logs, IoT device events, clickstreams, mobile events, and social feeds, and they also want to analyze unstructured data like text.

Second, for most data preparation projects there's always some amount of customization that you need. It's not a simple one-click-and-go process, and we find that our customers end up writing hand-coded scripts for most of their data preparation jobs. These scripts tend to be brittle and error-prone as the data evolves.

And then, finally, many of our customers are still managing their own infrastructure. They're managing their own virtual machines and sizing them, building their own clusters and managing their life cycle, scheduling jobs on those clusters, monitoring them, and making sure they're running successfully. They're also building a variety of metadata stores so they can organize their data, and managing those metadata stores themselves.

Because we saw all that, back in 2017 we decided to build and launch AWS Glue. Glue at that time was billed as a fully managed extract, transform, and load (ETL) service. It was designed for developers and programmers, people like us; we were also developers and programmers, and we knew what was necessary to get this job done.
Since then, you, our customers, have pulled us in many different directions, and in particular you've told us that you want us to broaden the service. It's more general now: it's a serverless data preparation service, and it not only serves data engineers and programmers, it also serves ETL specialists, data scientists, business analysts, and more. You actually helped us grow. Today we have hundreds of thousands of active customers on the Glue service running millions of jobs on a daily basis, and Glue is available in 22 regions across the globe.

Here's a selection, a small subset, of some of the customers and partners that are on our service. You can see that these enterprises span the gamut from small to big, from financial services to the auto industry to media companies, and so on. Glue is applicable in a lot more contexts than a single vertical.

And we really want to thank you for all the things that you've said and for noticing all the things that we've done over the past two years. Here's what you've been saying, in many different languages and in many different ways: Glue is faster, cheaper, and better than ever. Thank you so much. So let me tell you a little bit about the modern use cases that are driving the growth of Glue.
Before I get there, let me quickly give you an overview of what Glue and its components are. The first, central piece of Glue is its serverless ETL engine, which is based on Apache Spark. You basically give us your Apache Spark scripts, or you can give us pure Python scripts as well. We automatically spin up the servers that are necessary to run those scripts, determine the capacity that's necessary, and execute them; you pay for what you use, and then we shut those machines down for you. We manage the entire life cycle. It's entirely serverless. We also give you a number of visual tools to interactively develop your jobs, and we'll automatically compile those jobs down to Apache Spark scripts that run on our scalable big data infrastructure.

We also give you the Glue Data Catalog. This is a centralized, fully managed metadata store. It's Hive Metastore compatible, and there are a number of AWS services that are already integrated with this store, as well as third-party partners and open source tools. We also provide crawlers that automatically scan your S3 buckets and your databases, infer the schema for your data and the table structures, and then automatically load the Data Catalog for you. Crawlers also support schema evolution. And then, finally, you can put all of this together into complex workflows to run your data pipelines in a reliable fashion; this is our Glue Workflows component.
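To make the engine piece concrete, here is a minimal sketch of the kind of Spark-based job script Glue runs, using the awsglue library. The database, table, and S3 path names are illustrative assumptions, not values from the session.

    import sys
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions

    # Standard Glue job boilerplate: resolve the job name and set up contexts.
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read a table registered in the Glue Data Catalog (illustrative names).
    source = glue_context.create_dynamic_frame.from_catalog(
        database="sales_db", table_name="raw_orders")

    # Rename and cast a couple of columns.
    mapped = ApplyMapping.apply(
        frame=source,
        mappings=[("order_id", "string", "order_id", "string"),
                  ("amount", "string", "amount", "double")])

    # Write the result back to S3 as Parquet.
    glue_context.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={"path": "s3://example-bucket/curated/orders/"},
        format="parquet")

    job.commit()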
What's Glue used for? Well, there are three main use cases that we see. The first is building data lakes on AWS. We find a lot of our customers taking all of their data and putting it into Amazon S3, which is a ubiquitous, low-cost, highly durable object store with 11 nines of durability. They're breaking down their data silos and their operational databases, using Glue to ingest data from those silos into S3, and then processing that data from stage to stage and refining it. Finally, when it's ready for production, they'll run Glue crawlers to load and maintain the Data Catalog that points to all this data (see the sketch after this paragraph), and they'll use a sister service called AWS Lake Formation to secure the data lake that they've created. Then a number of analytics services that we have integrate with Glue and Lake Formation so you can access that data securely and analyze it: Amazon QuickSight, which lets you do business intelligence; Amazon Athena, which lets you run SQL queries over your data lake in a serverless fashion; EMR, which lets you do big data processing; Amazon Redshift, which lets you do data warehousing; and SageMaker, which lets you do ML.
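The crawler step mentioned above can also be scripted. Here is a hedged boto3 sketch for defining and starting a crawler over a curated S3 prefix; the crawler name, IAM role ARN, database, and path are placeholders, not values from the session.

    import boto3

    glue = boto3.client("glue")

    # Define a crawler that scans an S3 prefix and keeps the catalog table
    # in sync as the schema evolves (all names below are placeholders).
    glue.create_crawler(
        Name="curated-orders-crawler",
        Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
        DatabaseName="sales_db",
        Targets={"S3Targets": [{"Path": "s3://example-bucket/curated/orders/"}]},
        SchemaChangePolicy={"UpdateBehavior": "UPDATE_IN_DATABASE",
                            "DeleteBehavior": "LOG"},
    )

    # Run it on demand; a schedule can also be attached.
    glue.start_crawler(Name="curated-orders-crawler")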
We also find our customers loading data warehouses with AWS Glue, typical ETL processing: they're getting data out of their operational databases, like Amazon Aurora, combining it with data in Amazon S3, and loading their enterprise data warehouse so they can run daily or nightly reports.

And increasingly we see our customers running data preparation for AI, ML, and data science workloads. We see them cleaning and enriching their data, extracting features, and separating out training sets so that they can build their ML models. Then they're taking their data and running inference, prediction, and classification, using Glue to drive that data into the ML services. We also see customers having their data scientists use notebooks connected to Glue for data exploration and data experimentation, so that they can come up with new ways of analyzing their data.
All right, so now that you know what Glue is used for, let's talk about some of the recent innovations that we've added to Glue. Since two years ago, when we last talked to you about Glue and its features, we've added over 50 new major features to Glue, and a number of new regions.
There are three trends that are driving these new features. The first trend is that you're putting more demanding workloads onto our service. When we first started, most of the jobs that you were running were basically latency-agnostic, long-running, nightly or hourly ETL jobs for loading your data warehouses, loading your data lakes, and running reports. Our original engine, 0.9, and the next version, 1.0, really were great because they allowed you to do that in a serverless fashion without having to spin up any machines. Job startup was tolerable, on the order of tens of minutes, if you will, and we had a 10-minute minimum billing duration.

Well, since then, you've put a lot more real-time, micro-batch workloads on Glue. These workloads are latency sensitive and in some ways require continuous operation, so you can build things like responsive dashboards, do monitoring, and build other kinds of streaming or semi-streaming applications. The old engines weren't really going to be able to support these kinds of workload demands, so we built a brand new engine for you. In the summer we released Glue 2.0, a new engine for micro-batch and real-time workloads.
OK, so what's better? Well, almost everything is better about this engine. It starts up 10x faster and is much more predictable, and it's more cost effective: we've dropped the minimum billing duration from 10 minutes to one minute, and customers that have moved over to Glue 2.0 see an average of 45 percent cost savings. So now they can run their latency-sensitive workloads at a much cheaper rate. We've built a brand new job execution engine underneath the covers to make this all work.
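For reference, choosing the new engine comes down to the job's Glue version setting. Here is a hedged boto3 sketch, with the job name, role ARN, and script location as placeholders rather than values from the session.

    import boto3

    glue = boto3.client("glue")

    # Create a Spark ETL job that runs on the Glue 2.0 engine.
    glue.create_job(
        Name="nightly-orders-etl",
        Role="arn:aws:iam::123456789012:role/GlueJobRole",
        Command={"Name": "glueetl",
                 "ScriptLocation": "s3://example-bucket/scripts/orders_etl.py",
                 "PythonVersion": "3"},
        GlueVersion="2.0",        # the faster-starting engine described above
        WorkerType="G.1X",
        NumberOfWorkers=10,
    )

    # Start a run; with Glue 2.0 the per-job minimum billing is one minute.
    glue.start_job_run(JobName="nightly-orders-etl")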
I'm going to tell you about that, but before I get there, let me tell you how Spark works, so you have some background. When Glue runs a job on Spark, Spark takes those scripts and breaks them up. Spark is a data-parallel system, so it's going to take your scripts, your jobs, and divide them into stages, and each stage is going to run over your data, which is partitioned into shards. For each stage and for each partition, there's going to be a single task that's ready for execution. What the Spark scheduler does is take those tasks and schedule them on executors, which are attached to the virtual machines that comprise your cluster.
All right, so for the previous versions of Glue, 0.9 and 1.0, what we were doing was sending those jobs to our job manager, and the job manager was scheduling them on a cluster. It would pick a cluster through three different mechanisms. Either the cluster was already allocated to a particular user and customer, in which case your job startup would be on the order of a few seconds; or it would take a cluster from the already provisioned clusters in our warm pool, attach it to a user and its job, and then start running, in which case it would take tens of seconds; or we would spin up a brand new cluster from EC2 on demand, in which case startup time would take on the order of minutes. Then we would submit the job, the script, to the cluster, Spark would take over, and it would start scheduling the tasks onto the executors.

When you do it this way, what happens is that you have high latency, because your job only starts after the entire cluster is provisioned. And if you end up needing capacity that's not within the T-shirt-size clusters that we have in our warm pool, you're going to end up going to EC2 and waiting for the cluster to get built up. That leads to high latency and high variability.
Glue 2.0 is completely different. Glue 2.0 integrates the scheduling of the tasks with the provisioning of the machines. Let me show you how. You submit the job to the job manager, and the job manager then submits the job to a virtual cluster. Now, if that cluster already has some machines set up, the job starts immediately. The Spark scheduler will start working, and it will ask for tasks to be attached to executors. If it needs a new executor and there aren't any machines for that executor to be attached to, then we're going to dynamically grow the virtual cluster. We're going to go to the warm pool, and now we don't have to go and get another whole cluster; we can get one machine at a time. So our warm pools are much more flexible, and they're spread across multiple Availability Zones for additional resiliency. The warm pool already has pre-configured machines with Spark on them, but if you can't find a machine in the warm pool, or it's depleted, we can go to EC2 and get one on demand. If you go to the warm pool, the startup time is on the order of a couple of seconds, and if you go to EC2, it's on the order of tens of seconds. Once the virtual machine is provisioned, we will then schedule the executor on that machine, and the task on that executor, and Spark will start working with that cluster as it grows.

The cool thing here is that the job starts when the first executor is ready, so your start time is reduced, and the variance in your start time is also reduced. The other nice thing is that your job will actually run and make progress even when it's not at full capacity, for example when it's starting up or when there's some kind of unavailability. In those cases it will still continue to make progress, so it gives you graceful degradation.
All right, let's compare the two. At the top we have the distribution of job startup times for 1.0, and on the bottom, for 2.0. Now, the scales on the x-axis are not aligned, but that's okay; we'll fix that in a second. There are three things to notice here. If you get a job that gets executed on an existing cluster, and this is most of the cases, then the startup times are going to be short; we call this a warm start. If you end up hitting the warm pool, it'll be a little bit slower; that's the second hump that you see, but it's still pretty fast. And then, of course, if you need to go provision new machines, configure them, and go to EC2, it's going to take much longer.

With Glue 2.0 versus Glue 1.0, you see a different distribution. The reason the 1.0 distribution for cold start is much wider is that every job needs a different capacity, and it only starts when the entire cluster is provisioned, so you get a wide distribution of start times. With 2.0, you start the job almost immediately, as soon as the first machine is ready, so the distribution is much narrower. And if we actually fix the time scales, you'll see that with 2.0 everything is just flat-out much faster. The warm start is under 10 seconds, compared to one minute for Glue 1.0, and Glue 1.0 job start times for cold start are on the order of 8 to 10 minutes on average, while with Glue 2.0 it's around 30 to 35 seconds. So it's literally an order of magnitude faster.
We have a number of customers that have moved over to Glue 2.0, such as Ibotta and Marketing Evolution, and they can confirm that start times are much faster, they're saving a lot more money, and the system is much more reliable and resilient.
All right, the second trend I want to talk about is the fact that our customers have many more types of personas that want to be able to use Glue and its powerful serverless engine: not just engineers, but also ETL specialists, data scientists, and business analysts. For them, we launched a brand new interface to Glue called Glue Studio. It's a visual interface for building ETL jobs, for Glue jobs. You basically draw or diagram out an ETL workflow visually, without ever having to code; you get a canvas-based tool where you can diagram it out. We also give you ways to drop in specialized code for advanced transformations that the visual tool can't easily express, so you can do this entirely without coding, but if you want to code, that's also an option. We also allow you to monitor thousands of jobs through a single pane of glass. We then convert these workflows into Apache Spark scripts, so you can still take advantage of the distributed processing of Spark without the learning curve.
Let me give you an example of how this works. In this example, we're going to take ventilator readings from hospital ventilators and put them into a stream. We're going to enrich them with hospital information and then store them in S3, so that we can subsequently analyze how well our ventilators are doing in a hospital system. We're going to consume the data from two different sources: one is a streaming source, and the second is a static source that sits in a relational database. We'll combine them using a join, clean the data using some ML, store it in S3, monitor the jobs, and then analyze the resulting data set.

If you take a look at the ventilator readings, the raw readings are in JSON format, with a bunch of different fields that identify the ventilator, some of the metrics the ventilator is reporting, the manufacturer, the hospital that the ventilator is in, and the operating status. Now, depending on the ventilator, some ventilators actually give you an operating status and others don't, because they're just manufactured differently. So in the real data you're going to see empty fields. This is how real data is: it's dirty.
So let me show you how you can clean it up and process it using Glue Studio. This is what you get with Studio: a blank canvas, and then you can start adding nodes to build up a graph. There are three different types of nodes: sources, transformations, and targets. In this case, the source is the ventilator stream coming from a Kinesis stream. And if you want to take a look at the output of that data set and what its schema looks like, you can look at the output schema directly in Glue Studio.

The hospital information comes from a data set that's sitting in the public AWS COVID-19 data lake. We're going to store it in an RDS database; it's basically structured data about hospitals across the US. Since it's an RDS database, we're going to create a new source and configure it to be an RDS source. Then we're going to combine those two data sets with a transformation called a join: we'll take the records coming in from the ventilators and join them, on the hospital ID, with the hospital information that's in the relational database.

And then, finally, for those missing values, we're going to drop in a custom transformation. The custom transformation allows you to write Python or Scala code; in this case, we've written Python code. We're calling a library that's embedded with Glue ETL that automatically fills in missing values. It's called the imputer, and what it does is automatically train itself on the data that's streaming through; based on the values of the other columns that it sees, it tries to predict what the value for operating status should be. Ten lines of code, and you can pretty quickly use ML to clean your data (a sketch of what such a custom transform node can look like follows below).

The output then goes into an S3 bucket, and we're going to organize the output by state and also by event time, so all those records are distributed across different directories. That way, when we analyze them, we can go to just the directories of interest to reduce the amount of data that we have to process and analyze.
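The session's custom transform calls Glue's built-in ML imputer; that exact code is not reproduced here. As a hedged sketch of the custom-transform hook itself, here is what a Glue Studio custom transform node can look like, with a simple rule-based fill standing in for the imputer and with hypothetical column names (operating_status, pressure_alarm).

    from awsglue.dynamicframe import DynamicFrame, DynamicFrameCollection
    from pyspark.sql.functions import col, lit, when

    # Glue Studio custom transforms receive a DynamicFrameCollection and must
    # return one. This sketch fills empty operating_status values from another
    # (hypothetical) column instead of using the ML imputer from the talk.
    def FillOperatingStatus(glueContext, dfc) -> DynamicFrameCollection:
        df = dfc.select(list(dfc.keys())[0]).toDF()
        filled = df.withColumn(
            "operating_status",
            when(col("operating_status").isNull() | (col("operating_status") == ""),
                 when(col("pressure_alarm") == lit(True), lit("WARNING"))
                 .otherwise(lit("OK")))
            .otherwise(col("operating_status")))
        out = DynamicFrame.fromDF(filled, glueContext, "filled_status")
        return DynamicFrameCollection({"filled_status": out}, glueContext)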
Here's what the monitoring dashboard looks like. It gives you a single pane of glass that tells you what's going on in your entire system: the number of jobs that have succeeded and failed, how many are running, and there are different views where you can come in, drill down, and find the jobs of interest to understand their details and what's going on.

You can also attach a notebook to quickly analyze the data sitting in your S3 buckets using Glue. In this case, what we're showing you are the missing values that were imputed: you'll see that the nulls have been turned into OKs or warnings based on the other column values that you see on the left-hand side. Through those notebooks you can also run analyses; in this particular case, we're running a five-minute window analysis showing the number of unhealthy ventilators in each five-minute window, and you can run this on a continuous basis.
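As a rough illustration of that five-minute window analysis, assuming the curated readings have been written to S3 with event_time, state, and operating_status columns (names and the "not OK means unhealthy" rule are assumptions, not from the session), a notebook cell might run something like this.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, window

    spark = SparkSession.builder.appName("ventilator-health").getOrCreate()

    # Read the curated readings written by the Glue Studio job (path illustrative).
    readings = spark.read.parquet("s3://example-bucket/ventilator/curated/")

    # Count readings that are not OK in each five-minute window, per state.
    unhealthy_per_window = (
        readings
        .where(col("operating_status") != "OK")
        .groupBy(window(col("event_time"), "5 minutes"), col("state"))
        .count()
        .orderBy("window"))

    unhealthy_per_window.show(truncate=False)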
We have a number of customers using Glue Studio across their organizations, and they confirm that it's much easier and much faster to deploy jobs using Studio than before, because the learning curve is much smaller and you don't have to know Spark to get started.
All right, now the third trend is that we're seeing our customers store more and more data with the Glue Data Catalog. The Glue Data Catalog keeps track of your table structures, including your partitions. Partitions are effectively directories; this is what we did in the last example, and I'm going to use the same idea here. You put your data into different directories so you can get to that data very quickly when you're doing analysis. Originally, people had a manageable number of partitions in their Data Catalog, tens of thousands, if you will. Here's an example where we've organized our data by region, year, month, and day. But increasingly, with micro-batch workloads and more data coming into the system, we see customers storing hundreds of thousands, if not millions, of partitions with the Data Catalog and adding additional levels of granularity, with fine-grained partitions based not just on day but also on hour, and sometimes minute and second. So the number of partitions has exploded, and that has affected how you can do analytics with the Glue Data Catalog.
Let's see how that works. Imagine, for example, that you're using EMR Spark to run a query. You submit the query, Spark consults the Data Catalog, and the Data Catalog then scans through all the partitions it has to find the relevant ones and sends them back to Spark; that's the query planning phase. Then Spark executes the query, retrieves the data, and sends the results back. Well, query planning here is proportional to the time it takes to scan through all the partitions, and if you have millions of partitions, this can be significant. And of course, if you ask queries that are only looking for things at the leaves of the partition hierarchy, you still have to scan through all of them; you can't quickly pinpoint which partitions matter, so it's going to take a while.
What we added is a new feature called partition indexes. Partition indexes improve the query planning phase. You can create a partition index on any partition column or any combination of partition columns; in this case, we've created one on year, month, day, and hour. You can create as many of them as you want, so you can add region and category for another one. The nice thing about partition indexes is that they also support range-based predicates, so you can very quickly narrow down, for example, all of the partitions that span from, say, 10 AM to 1 PM. EMR Hive and EMR Spark use partition indexes today, and more applications are to come.
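As a hedged sketch, an index like the year/month/day/hour one just described can be added to an existing catalog table through the Glue API; the database, table, and index names here are placeholders, and the index keys must be a subset of the table's partition keys.

    import boto3

    glue = boto3.client("glue")

    # Add a partition index over the time-based partition keys.
    glue.create_partition_index(
        DatabaseName="iot_db",
        TableName="device_events",
        PartitionIndex={
            "IndexName": "by_time",
            "Keys": ["year", "month", "day", "hour"],
        },
    )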
Let's see the impact of partition indexes on a real-world-like query. Here's a query where we're trying to count the number of IoT devices that are actually sending out data, and in this particular case we're looking for literally just one day's worth of data, here in November.
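As a hedged illustration of the query being described (the table and column names are assumptions, not taken from the session), it is essentially a count restricted to one day's partitions, so planning only needs the partitions for that day.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("iot-daily-count").getOrCreate()

    # Filter on the partition columns so planning can prune to a single day.
    daily_devices = spark.sql("""
        SELECT COUNT(DISTINCT device_id) AS reporting_devices
        FROM iot_db.device_events
        WHERE year = '2020' AND month = '11' AND day = '15'
    """)
    daily_devices.show()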
We had about 50,000 partitions in this data set, and with or without a partition index, the query took roughly seven seconds, which is manageable. But as we scaled the number of partitions, going from day-level down to hour- and minute-level partitions, the query time without a partition index started blowing up: it went from seven seconds to nearly 64 seconds. Digging a little deeper, what you see is that the run time, shown in orange, stayed roughly the same at about seven seconds, but the query planning time blew up and added another 50 seconds of latency. By using an index, you save about 99 percent of that query planning time. So depending on how many partitions you have and how big the runtime of your query is, you can literally see orders of magnitude improvement in your query planning, and certainly some of our bigger customers have seen this. For example, VMware has a petabyte-scale data lake on S3, and previously they couldn't run their analytics within tight SLAs; now they can, because they're using partition indexes.
All right, now one last thing. I want to introduce a brand new feature. It's going to be available at the end of December 2020, and it's called Glue custom connectors. Prior to this, if you wanted to connect to on-premises databases that we didn't have first-class support for, or to proprietary stores or SaaS applications like Salesforce and ServiceNow, you had to be a Spark expert. Well, now you can have a connector that you build on your own and reuse in Glue Studio, or you can pick up a connector from AWS Marketplace, where there are a bunch of third parties that have developed connectors, and use it directly in Glue Studio to access the data.

We support a variety of connector types. We support Athena data source interface connectors, where you basically have to fill out about six different interfaces; we support Spark Data Source V2 interface connectors; and we also support connectors on the AWS Marketplace. How does it work? Well, you simply register or create the connector in Glue Studio, create a job using one of those connectors, configure the connector, and off you go.
Pretty simple. Now I want to briefly touch on two new capabilities in Glue. The first is Glue DataBrew. It's a new visual interface for cleaning and normalizing your data. It profiles your data to detect patterns and anomalies, and you can choose from over 250 built-in cleaning transforms and visually apply them at scale. With this and Glue Studio, Glue is a one-stop shop for authoring end-to-end data preparation pipelines. The second new capability is the Glue Schema Registry with the Data Catalog. It allows you to enforce schema evolution rules for schemas in your Data Catalog, and this is useful for getting data quality in your streaming applications: for getting data quality when you're processing streams from Kafka or Kinesis and running Kinesis Data Analytics or Apache Flink.
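As a hedged sketch of putting the Schema Registry to work (the registry and schema names and the Avro definition are illustrative assumptions), you can create a registry and register a schema with a compatibility rule through the Glue API.

    import boto3

    glue = boto3.client("glue")

    # Create a registry, then register an Avro schema whose future versions
    # must stay backward compatible with earlier ones.
    glue.create_registry(RegistryName="ventilator-streams")

    glue.create_schema(
        RegistryId={"RegistryName": "ventilator-streams"},
        SchemaName="ventilator-readings",
        DataFormat="AVRO",
        Compatibility="BACKWARD",
        SchemaDefinition="""
        {"type": "record", "name": "Reading",
         "fields": [{"name": "ventilator_id", "type": "string"},
                    {"name": "operating_status", "type": ["null", "string"],
                     "default": null}]}
        """,
    )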
All right, now to conclude. I've told you about a number of new features that have made Glue faster and cheaper. We've shown you tools that make Glue accessible to non-programmers: ETL specialists, data scientists, and data analysts. We've shown you new features to speed up your analytics using Glue, and we're about to release Glue custom connectors, which allow you to access hundreds of data sources with Glue. Glue now supports multiple personas and use cases, with persona-specific journeys that go beyond developers, to let you continue to break down silos and enable collaboration in your organization. I recommend that you check out Glue at the following website, and, if you're interested, check out these other sessions at re:Invent, where customers have used Glue in production with a tremendous amount of success. Thank you so much, and if you're interested, please complete the session survey at the end of the talk. Thank you.
Source: https://www.youtube.com/watch?v=pT5lAYTCYJ4