AWS re:Invent 2020: Serverless data preparation with AWS Glue
Aug 16, 2023
The first step in an analytics or machine learning project is to prepare your data to obtain quality results. AWS Glue is a serverless extract, transform, and load (ETL) service with a recent series of innovations that make data preparation simpler, faster, and cheaper. Join this session and listen to AWS Glue general manager Mehul A. Shah showcase the service's new visual experience that makes it easier to author, debug, and manage your ETL jobs. This session dives deep into the new AWS Glue engine that offers 10 times faster job start times and improved support for data extraction, streaming ETL, orchestrating ETL workflows, and more. Learn more about re:Invent 2020 at http://bit.ly/3c4NSdY
Content
Hi, I'm Mehul Shah, the general manager of AWS Glue and AWS Lake Formation, and today I'm going to tell you about AWS Glue and how it provides serverless data preparation capabilities.

Before I get started, let me give you an overview of my talk. First, I'm going to tell you about AWS Glue and the modern use cases that are driving its growth. Then, for the bulk of my talk, I'm going to speak about some of the cool new features it offers to make Glue faster, cheaper, and easier to use for you. And then, finally, I'm going to conclude.
So why data preparation? What is data preparation? Data preparation is the process of collecting all the data that you need, transforming it, cleaning it, and normalizing it so you can get quality data for running analytics, to get insights about your business, understand how it's operating and what you can do to optimize it, and to build machine learning models so that you can do prediction and classification. Data preparation is the first mile of any of these activities, and data preparation is hard: if you don't get it right, you're going to end up with incorrect results.
So what are the factors that make it hard? Well, there are three of them. The first is that there is a lot of data. Across our customer base, we see customers growing the amount of data they manage by 10x every five years, and this data is becoming increasingly diverse. It's no longer just structured data that they want to manage or analyze. They want to analyze semi-structured data, which is data like machine logs, network logs, IoT device events, clickstreams, mobile events, and social feeds, and they also want to analyze unstructured data like text.

Second, for most data preparation projects there's always some amount of customization that you need. It's not a simple one-click-and-go process, and we find that our customers end up writing hand-coded scripts for most of their data preparation jobs. These scripts tend to be brittle and error-prone as the data evolves.

And then, finally, many of our customers are still managing their own infrastructure. They're managing their own virtual machines and sizing them, building their own clusters and managing their life cycle, scheduling jobs on those clusters, monitoring them, and making sure they're running successfully. They're also building a variety of metadata stores so they can organize their data, and managing those metadata stores themselves.

Because we saw all that, back in 2017 we decided to build and launch AWS Glue. Glue at that time was billed as a fully managed extract, transform, and load (ETL) service. It was designed for developers and programmers, people like us; we were also developers and programmers, and we knew what was necessary to get this job done.
Since then, you, our customers, have pulled us in many different directions, and in particular you've told us that you want us to broaden the service. It's more general now: it's a serverless data preparation service, and it not only serves data engineers and programmers, it also serves ETL specialists, data scientists, business analysts, and more. You actually helped us grow. Today we have hundreds of thousands of active customers on the Glue service running millions of jobs on a daily basis, and Glue is available in 22 regions across the globe.

Here's a selection, a small subset, of some of the customers and partners that are on our service. You can see that these enterprises span the gamut from small to big, from financial services to the auto industry to media companies, and so on. Glue is applicable in a lot more contexts than a single vertical.

And we really want to thank you for all the things that you've said and for noticing all the things that we've done over the past two years. Here's what you've been saying, in many different languages and in many different ways: Glue is faster, cheaper, and better than ever. Thank you so much. So let me tell you a little bit about the modern use cases that are driving the growth of Glue.
Before I get there, let me quickly give you an overview of what Glue and its components are. The first, central piece of Glue is its serverless ETL engine, which is based on Apache Spark. You basically give us your Apache Spark scripts, or you can give us pure Python scripts as well. We automatically spin up the servers that are necessary to run those scripts, determine the capacity that's necessary, and execute them; you pay for what you use, and then we shut those machines down for you. We manage the entire life cycle. It's entirely serverless. We also give you a number of visual tools to interactively develop your jobs, and we'll automatically compile those jobs down to Apache Spark scripts that run on our scalable big data infrastructure.

We also give you the Glue Data Catalog. This is a centralized, fully managed metadata store. It's Hive Metastore compatible, and there are a number of AWS services that are already integrated with this store, as well as third-party partners and open source tools. We also provide crawlers that automatically scan your S3 buckets and your databases, infer the schema for your data and the table structures, and then automatically load the Data Catalog for you. Crawlers also support schema evolution. And then, finally, you can put all of this together into complex workflows to run your data pipelines in a reliable fashion; this is our Glue Workflows component.
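To make the engine piece concrete, here is a minimal sketch of the kind of Spark-based job script Glue runs, using the awsglue library. The database, table, and S3 path names are illustrative assumptions, not values from the session.

    import sys
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions

    # Standard Glue job boilerplate: resolve the job name and set up contexts.
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read a table registered in the Glue Data Catalog (illustrative names).
    source = glue_context.create_dynamic_frame.from_catalog(
        database="sales_db", table_name="raw_orders")

    # Rename and cast a couple of columns.
    mapped = ApplyMapping.apply(
        frame=source,
        mappings=[("order_id", "string", "order_id", "string"),
                  ("amount", "string", "amount", "double")])

    # Write the result back to S3 as Parquet.
    glue_context.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={"path": "s3://example-bucket/curated/orders/"},
        format="parquet")

    job.commit()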
What's Glue used for? Well, there are three main use cases that we see. The first is building data lakes on AWS. We find a lot of our customers taking all of their data and putting it into Amazon S3, which is a ubiquitous, low-cost, highly durable object store with 11 nines of durability. They're breaking down their data silos and their operational databases, using Glue to ingest data from those silos into S3, and then processing that data from stage to stage and refining it. Finally, when it's ready for production, they'll run Glue crawlers to load and maintain the Data Catalog that points to all this data (see the sketch after this paragraph), and they'll use a sister service called AWS Lake Formation to secure the data lake that they've created. Then a number of analytics services that we have integrate with Glue and Lake Formation so you can access that data securely and analyze it: Amazon QuickSight, which lets you do business intelligence; Amazon Athena, which lets you run SQL queries over your data lake in a serverless fashion; EMR, which lets you do big data processing; Amazon Redshift, which lets you do data warehousing; and SageMaker, which lets you do ML.
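The crawler step mentioned above can also be scripted. Here is a hedged boto3 sketch for defining and starting a crawler over a curated S3 prefix; the crawler name, IAM role ARN, database, and path are placeholders, not values from the session.

    import boto3

    glue = boto3.client("glue")

    # Define a crawler that scans an S3 prefix and keeps the catalog table
    # in sync as the schema evolves (all names below are placeholders).
    glue.create_crawler(
        Name="curated-orders-crawler",
        Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
        DatabaseName="sales_db",
        Targets={"S3Targets": [{"Path": "s3://example-bucket/curated/orders/"}]},
        SchemaChangePolicy={"UpdateBehavior": "UPDATE_IN_DATABASE",
                            "DeleteBehavior": "LOG"},
    )

    # Run it on demand; a schedule can also be attached.
    glue.start_crawler(Name="curated-orders-crawler")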
We also find our customers loading data warehouses with AWS Glue, typical ETL processing: they're getting data out of their operational databases, like Amazon Aurora, combining it with data in Amazon S3, and loading their enterprise data warehouse so they can run daily or nightly reports.

And increasingly we see our customers running data preparation for AI, ML, and data science workloads. We see them cleaning and enriching their data, extracting features, and separating out training sets so that they can build their ML models. Then they're taking their data and running inference, prediction, and classification, using Glue to drive that data into the ML services. We also see customers having their data scientists use notebooks connected to Glue for data exploration and data experimentation, so that they can come up with new ways of analyzing their data.
All right, so now that you know what Glue is used for, let's talk about some of the recent innovations that we've added to Glue. Since two years ago, when we last talked to you about Glue and its features, we've added over 50 new major features to Glue, and a number of new regions.
There are three trends that are driving these new features. The first trend is that you're putting more demanding workloads onto our service. When we first started, most of the jobs that you were running were basically latency-agnostic, long-running, nightly or hourly ETL jobs for loading your data warehouses, loading your data lakes, and running reports. Our original engine, 0.9, and the next version, 1.0, really were great because they allowed you to do that in a serverless fashion without having to spin up any machines. Job startup was tolerable, on the order of tens of minutes, if you will, and we had a 10-minute minimum billing duration.

Well, since then, you've put a lot more real-time, micro-batch workloads on Glue. These workloads are latency sensitive and in some ways require continuous operation, so you can build things like responsive dashboards, do monitoring, and build other kinds of streaming or semi-streaming applications. The old engines weren't really going to be able to support these kinds of workload demands, so we built a brand new engine for you. In the summer we released Glue 2.0, a new engine for micro-batch and real-time workloads.
OK, so what's better? Well, almost everything is better about this engine. It starts up 10x faster and is much more predictable, and it's more cost effective: we've dropped the minimum billing duration from 10 minutes to one minute, and customers that have moved over to Glue 2.0 see an average of 45 percent cost savings. So now they can run their latency-sensitive workloads at a much cheaper rate. We've built a brand new job execution engine underneath the covers to make this all work.
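For reference, choosing the new engine comes down to the job's Glue version setting. Here is a hedged boto3 sketch, with the job name, role ARN, and script location as placeholders rather than values from the session.

    import boto3

    glue = boto3.client("glue")

    # Create a Spark ETL job that runs on the Glue 2.0 engine.
    glue.create_job(
        Name="nightly-orders-etl",
        Role="arn:aws:iam::123456789012:role/GlueJobRole",
        Command={"Name": "glueetl",
                 "ScriptLocation": "s3://example-bucket/scripts/orders_etl.py",
                 "PythonVersion": "3"},
        GlueVersion="2.0",        # the faster-starting engine described above
        WorkerType="G.1X",
        NumberOfWorkers=10,
    )

    # Start a run; with Glue 2.0 the per-job minimum billing is one minute.
    glue.start_job_run(JobName="nightly-orders-etl")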
I'm going to tell you about that, but before I get there, let me tell you how Spark works, so you have some background. When Glue runs a job on Spark, Spark takes those scripts and breaks them up. Spark is a data-parallel system, so it's going to take your scripts, your jobs, and divide them into stages, and each stage is going to run over your data, which is partitioned into shards. For each stage and for each partition, there's going to be a single task that's ready for execution. What the Spark scheduler does is take those tasks and schedule them on executors, which are attached to the virtual machines that comprise your cluster.
All right, so for the previous versions of Glue, 0.9 and 1.0, what we were doing was sending those jobs to our job manager, and the job manager was scheduling them on a cluster. It would pick a cluster through three different mechanisms. Either the cluster was already allocated to a particular user and customer, in which case your job startup would be on the order of a few seconds; or it would take a cluster from the already provisioned clusters in our warm pool, attach it to a user and its job, and then start running, in which case it would take tens of seconds; or we would spin up a brand new cluster from EC2 on demand, in which case startup time would take on the order of minutes. Then we would submit the job, the script, to the cluster, Spark would take over, and it would start scheduling the tasks onto the executors.

When you do it this way, what happens is that you have high latency, because your job only starts after the entire cluster is provisioned. And if you end up needing capacity that's not within the T-shirt-size clusters that we have in our warm pool, you're going to end up going to EC2 and waiting for the cluster to get built up. That leads to high latency and high variability.
Glue 2.0 is completely different. Glue 2.0 integrates the scheduling of the tasks with the provisioning of the machines. Let me show you how. You submit the job to the job manager, and the job manager then submits the job to a virtual cluster. Now, if that cluster already has some machines set up, the job starts immediately. The Spark scheduler will start working, and it will ask for tasks to be attached to executors. If it needs a new executor and there aren't any machines for that executor to be attached to, then we're going to dynamically grow the virtual cluster. We're going to go to the warm pool, and now we don't have to go and get another whole cluster; we can get one machine at a time. So our warm pools are much more flexible, and they're spread across multiple Availability Zones for additional resiliency. The warm pool already has pre-configured machines with Spark on them, but if you can't find a machine in the warm pool, or it's depleted, we can go to EC2 and get one on demand. If you go to the warm pool, the startup time is on the order of a couple of seconds, and if you go to EC2, it's on the order of tens of seconds. Once the virtual machine is provisioned, we will then schedule the executor on that machine, and the task on that executor, and Spark will start working with that cluster as it grows.

The cool thing here is that the job starts when the first executor is ready, so your start time is reduced, and the variance in your start time is also reduced. The other nice thing is that your job will actually run and make progress even when it's not at full capacity, for example when it's starting up or when there's some kind of unavailability. In those cases it will still continue to make progress, so it gives you graceful degradation.
All right, let's compare the two. At the top we have the distribution of job startup times for 1.0, and on the bottom, for 2.0. Now, the scales on the x-axis are not aligned, but that's okay; we'll fix that in a second. There are three things to notice here. If you get a job that gets executed on an existing cluster, and this is most of the cases, then the startup times are going to be short; we call this a warm start. If you end up hitting the warm pool, it'll be a little bit slower; that's the second hump that you see, but it's still pretty fast. And then, of course, if you need to go provision new machines, configure them, and go to EC2, it's going to take much longer.

With Glue 2.0 versus Glue 1.0, you see a different distribution. The reason the 1.0 distribution for cold start is much wider is that every job needs a different capacity, and it only starts when the entire cluster is provisioned, so you get a wide distribution of start times. With 2.0, you start the job almost immediately, as soon as the first machine is ready, so the distribution is much narrower. And if we actually fix the time scales, you'll see that with 2.0 everything is just flat-out much faster. The warm start is under 10 seconds, compared to one minute for Glue 1.0, and Glue 1.0 job start times for cold start are on the order of 8 to 10 minutes on average, while with Glue 2.0 it's around 30 to 35 seconds. So it's literally an order of magnitude faster.
We have a number of customers that have moved over to Glue 2.0, such as Ibotta and Marketing Evolution, and they can confirm that start times are much faster, they're saving a lot more money, and the system is much more reliable and resilient.
All right, the second trend I want to talk about is the fact that our customers have many more types of personas that want to be able to use Glue and its powerful serverless engine: not just engineers, but also ETL specialists, data scientists, and business analysts. For them, we launched a brand new interface to Glue called Glue Studio. It's a visual interface for building ETL jobs, for Glue jobs. You basically draw or diagram out an ETL workflow visually, without ever having to code; you get a canvas-based tool where you can diagram it out. We also give you ways to drop in specialized code for advanced transformations that the visual tool can't easily express, so you can do this entirely without coding, but if you want to code, that's also an option. We also allow you to monitor thousands of jobs through a single pane of glass. We then convert these workflows into Apache Spark scripts, so you can still take advantage of the distributed processing of Spark without the learning curve.
Let me give you an example of how this works. In this example, we're going to take ventilator readings from hospital ventilators and put them into a stream. We're going to enrich them with hospital information and then store them in S3, so that we can subsequently analyze how well our ventilators are doing in a hospital system. We're going to consume the data from two different sources: one is a streaming source, and the second is a static source that sits in a relational database. We'll combine them using a join, clean the data using some ML, store it in S3, monitor the jobs, and then analyze the resulting data set.

If you take a look at the ventilator readings, the raw readings are in JSON format, with a bunch of different fields that identify the ventilator, some of the metrics the ventilator is reporting, the manufacturer, the hospital that the ventilator is in, and the operating status. Now, depending on the ventilator, some ventilators actually give you an operating status and others don't, because they're just manufactured differently. So in the real data you're going to see empty fields. This is how real data is: it's dirty.
So let me show you how you can clean it up and process it using Glue Studio. This is what you get with Studio: a blank canvas, and then you can start adding nodes to build up a graph. There are three different types of nodes: sources, transformations, and targets. In this case, the source is the ventilator stream coming from a Kinesis stream. And if you want to take a look at the output of that data set and what its schema looks like, you can look at the output schema directly in Glue Studio.

The hospital information comes from a data set that's sitting in the public AWS COVID-19 data lake. We're going to store it in an RDS database; it's basically structured data about hospitals across the US. Since it's an RDS database, we're going to create a new source and configure it to be an RDS source. Then we're going to combine those two data sets with a transformation called a join: we'll take the records coming in from the ventilators and join them, on the hospital ID, with the hospital information that's in the relational database.

And then, finally, for those missing values, we're going to drop in a custom transformation. The custom transformation allows you to write Python or Scala code; in this case, we've written Python code. We're calling a library that's embedded with Glue ETL that automatically fills in missing values. It's called the imputer, and what it does is automatically train itself on the data that's streaming through; based on the values of the other columns that it sees, it tries to predict what the value for operating status should be. Ten lines of code, and you can pretty quickly use ML to clean your data (a sketch of what such a custom transform node can look like follows below).

The output then goes into an S3 bucket, and we're going to organize the output by state and also by event time, so all those records are distributed across different directories. That way, when we analyze them, we can go to just the directories of interest to reduce the amount of data that we have to process and analyze.
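The session's custom transform calls Glue's built-in ML imputer; that exact code is not reproduced here. As a hedged sketch of the custom-transform hook itself, here is what a Glue Studio custom transform node can look like, with a simple rule-based fill standing in for the imputer and with hypothetical column names (operating_status, pressure_alarm).

    from awsglue.dynamicframe import DynamicFrame, DynamicFrameCollection
    from pyspark.sql.functions import col, lit, when

    # Glue Studio custom transforms receive a DynamicFrameCollection and must
    # return one. This sketch fills empty operating_status values from another
    # (hypothetical) column instead of using the ML imputer from the talk.
    def FillOperatingStatus(glueContext, dfc) -> DynamicFrameCollection:
        df = dfc.select(list(dfc.keys())[0]).toDF()
        filled = df.withColumn(
            "operating_status",
            when(col("operating_status").isNull() | (col("operating_status") == ""),
                 when(col("pressure_alarm") == lit(True), lit("WARNING"))
                 .otherwise(lit("OK")))
            .otherwise(col("operating_status")))
        out = DynamicFrame.fromDF(filled, glueContext, "filled_status")
        return DynamicFrameCollection({"filled_status": out}, glueContext)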
Here's what the monitoring dashboard looks like. It gives you a single pane of glass that tells you what's going on in your entire system: the number of jobs that have succeeded and failed, how many are running, and there are different views where you can come in, drill down, and find the jobs of interest to understand their details and what's going on.

You can also attach a notebook to quickly analyze the data sitting in your S3 buckets using Glue. In this case, what we're showing you are the missing values that were imputed: you'll see that the nulls have been turned into OKs or warnings based on the other column values that you see on the left-hand side. Through those notebooks you can also run analyses; in this particular case, we're running a five-minute window analysis showing the number of unhealthy ventilators in each five-minute window, and you can run this on a continuous basis.
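As a rough illustration of that five-minute window analysis, assuming the curated readings have been written to S3 with event_time, state, and operating_status columns (names and the "not OK means unhealthy" rule are assumptions, not from the session), a notebook cell might run something like this.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, window

    spark = SparkSession.builder.appName("ventilator-health").getOrCreate()

    # Read the curated readings written by the Glue Studio job (path illustrative).
    readings = spark.read.parquet("s3://example-bucket/ventilator/curated/")

    # Count readings that are not OK in each five-minute window, per state.
    unhealthy_per_window = (
        readings
        .where(col("operating_status") != "OK")
        .groupBy(window(col("event_time"), "5 minutes"), col("state"))
        .count()
        .orderBy("window"))

    unhealthy_per_window.show(truncate=False)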
We have a number of customers using Glue Studio across their organizations, and they confirm that it's much easier and much faster to deploy jobs using Studio than before, because the learning curve is much smaller and you don't have to know Spark to get started.
All right, now the third trend is that we're seeing our customers store more and more data with the Glue Data Catalog. The Glue Data Catalog keeps track of your table structures, including your partitions. Partitions are effectively directories; this is what we did in the last example, and I'm going to use the same idea here. You put your data into different directories so you can get to that data very quickly when you're doing analysis. Originally, people had a manageable number of partitions in their Data Catalog, tens of thousands, if you will. Here's an example where we've organized our data by region, year, month, and day. But increasingly, with micro-batch workloads and more data coming into the system, we see customers storing hundreds of thousands, if not millions, of partitions with the Data Catalog and adding additional levels of granularity, with fine-grained partitions based not just on day but also on hour, and sometimes minute and second. So the number of partitions has exploded, and that has affected how you can do analytics with the Glue Data Catalog.
Let's see how that works. Imagine, for example, that you're using EMR Spark to run a query. You submit the query, Spark consults the Data Catalog, and the Data Catalog then scans through all the partitions it has to find the relevant ones and sends them back to Spark; that's the query planning phase. Then Spark executes the query, retrieves the data, and sends the results back. Well, query planning here is proportional to the time it takes to scan through all the partitions, and if you have millions of partitions, this can be significant. And of course, if you ask queries that are only looking for things at the leaves of the partition hierarchy, you still have to scan through all of them; you can't quickly pinpoint which partitions matter, so it's going to take a while.
What we added is a new feature called partition indexes. Partition indexes improve the query planning phase. You can create a partition index on any partition column or any combination of partition columns; in this case, we've created one on year, month, day, and hour. You can create as many of them as you want, so you can add region and category for another one. The nice thing about partition indexes is that they also support range-based predicates, so you can very quickly narrow down, for example, all of the partitions that span from, say, 10 AM to 1 PM. EMR Hive and EMR Spark use partition indexes today, and more applications are to come.
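As a hedged sketch, an index like the year/month/day/hour one just described can be added to an existing catalog table through the Glue API; the database, table, and index names here are placeholders, and the index keys must be a subset of the table's partition keys.

    import boto3

    glue = boto3.client("glue")

    # Add a partition index over the time-based partition keys.
    glue.create_partition_index(
        DatabaseName="iot_db",
        TableName="device_events",
        PartitionIndex={
            "IndexName": "by_time",
            "Keys": ["year", "month", "day", "hour"],
        },
    )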
Let's see the impact of partition indexes on a real-world-like query. Here's a query where we're trying to count the number of IoT devices that are actually sending out data, and in this particular case we're looking for literally just one day's worth of data, here in November.
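As a hedged illustration of the query being described (the table and column names are assumptions, not taken from the session), it is essentially a count restricted to one day's partitions, so planning only needs the partitions for that day.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("iot-daily-count").getOrCreate()

    # Filter on the partition columns so planning can prune to a single day.
    daily_devices = spark.sql("""
        SELECT COUNT(DISTINCT device_id) AS reporting_devices
        FROM iot_db.device_events
        WHERE year = '2020' AND month = '11' AND day = '15'
    """)
    daily_devices.show()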
We had about 50,000 partitions in this data set, and with or without a partition index, the query took roughly seven seconds, which is manageable. But as we scaled the number of partitions, going from day-level down to hour- and minute-level partitions, the query time without a partition index started blowing up: it went from seven seconds to nearly 64 seconds. Digging a little deeper, what you see is that the run time, shown in orange, stayed roughly the same at about seven seconds, but the query planning time blew up and added another 50 seconds of latency. By using an index, you save about 99 percent of that query planning time. So depending on how many partitions you have and how big the runtime of your query is, you can literally see orders of magnitude improvement in your query planning, and certainly some of our bigger customers have seen this. For example, VMware has a petabyte-scale data lake on S3, and previously they couldn't run their analytics within tight SLAs; now they can, because they're using partition indexes.
All right, now one last thing. I want to introduce a brand new feature. It's going to be available at the end of December 2020, and it's called Glue custom connectors. Prior to this, if you wanted to connect to on-premises databases that we didn't have first-class support for, or to proprietary stores or SaaS applications like Salesforce and ServiceNow, you had to be a Spark expert. Well, now you can have a connector that you build on your own and reuse in Glue Studio, or you can pick up a connector from AWS Marketplace, where there are a bunch of third parties that have developed connectors, and use it directly in Glue Studio to access the data.

We support a variety of connector types. We support Athena data source interface connectors, where you basically have to fill out about six different interfaces; we support Spark Data Source V2 interface connectors; and we also support connectors on the AWS Marketplace. How does it work? Well, you simply register or create the connector in Glue Studio, create a job using one of those connectors, configure the connector, and off you go.
Pretty simple. Now I want to briefly touch on two new capabilities in Glue. The first is Glue DataBrew. It's a new visual interface for cleaning and normalizing your data. It profiles your data to detect patterns and anomalies, and you can choose from over 250 built-in cleaning transforms and visually apply them at scale. With this and Glue Studio, Glue is a one-stop shop for authoring end-to-end data preparation pipelines. The second new capability is the Glue Schema Registry with the Data Catalog. It allows you to enforce schema evolution rules for schemas in your Data Catalog, and this is useful for getting data quality in your streaming applications: for getting data quality when you're processing streams from Kafka or Kinesis and running Kinesis Data Analytics or Apache Flink.
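As a hedged sketch of putting the Schema Registry to work (the registry and schema names and the Avro definition are illustrative assumptions), you can create a registry and register a schema with a compatibility rule through the Glue API.

    import boto3

    glue = boto3.client("glue")

    # Create a registry, then register an Avro schema whose future versions
    # must stay backward compatible with earlier ones.
    glue.create_registry(RegistryName="ventilator-streams")

    glue.create_schema(
        RegistryId={"RegistryName": "ventilator-streams"},
        SchemaName="ventilator-readings",
        DataFormat="AVRO",
        Compatibility="BACKWARD",
        SchemaDefinition="""
        {"type": "record", "name": "Reading",
         "fields": [{"name": "ventilator_id", "type": "string"},
                    {"name": "operating_status", "type": ["null", "string"],
                     "default": null}]}
        """,
    )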
All right, now to conclude. I've told you about a number of new features that have made Glue faster and cheaper. We've shown you tools that make Glue accessible to non-programmers: ETL specialists, data scientists, and data analysts. We've shown you new features to speed up your analytics using Glue, and we're about to release Glue custom connectors, which allow you to access hundreds of data sources with Glue. Glue now supports multiple personas and use cases, with persona-specific journeys that go beyond developers, to let you continue to break down silos and enable collaboration in your organization. I recommend that you check out Glue at the following website, and, if you're interested, check out these other sessions at re:Invent, where customers have used Glue in production with a tremendous amount of success. Thank you so much, and if you're interested, please complete the session survey at the end of the talk. Thank you.
Source: https://www.youtube.com/watch?v=pT5lAYTCYJ4