Serverless Data Integration for a Modern Data Infrastructure with AWS Glue - AWS Online Tech Talks

When modernizing your data architecture, data integration and data movement are essential components. Learn how AWS Glue’s serverless data integration service lets users at all skill levels discover, combine, and prepare data at petabyte scale.

Learning Objectives:
* Objective 1: Learn how AWS Glue offers tailored tools for business and technical users.
* Objective 2: See how you can use AWS Glue to quickly create a centralized data catalog.
* Objective 3: Understand how AWS Glue supports event driven ETL.

To learn more about the services featured in this talk, please visit: https://aws.amazon.com/glue

Subscribe to AWS Online Tech Talks:
https://www.youtube.com/@AWSOnlineTec

Follow Amazon Web Services:
Official Website: https://aws.amazon.com/what-is-aws
Twitch: https://twitch.tv/aws
Twitter: https://twitter.com/awsdevelopers
Facebook: https://facebook.com/amazonwebservices
Instagram: https://instagram.com/amazonwebservices

☁️ AWS Online Tech Talks cover a wide range of topics and expertise levels through technical deep dives, demos, customer examples, and live Q&A with AWS experts. Builders can choose from bite-sized 15-minute sessions, insightful fireside chats, immersive virtual workshops, interactive office hours, or on-demand tech talks to watch at their own pace. Join us to fuel your learning journey with AWS.

#AWS


Content

2.12 -> [Music]
8.4 -> welcome everybody to our discussion on
10.24 -> serverless data integration
12.32 -> my name is zach mitchell i'm a big data
14.4 -> architect with aws glue and aws lake
16.88 -> formation
18.4 -> and today we're going to be discussing
20.8 -> glue
22.88 -> specifically
24.16 -> our learning objectives today are
26.72 -> getting a brief overview of what glue is
28.64 -> just to refresh everybody's memory
31.359 -> discussing glue and how it applies to
33.36 -> different users
35.2 -> discussing glue for different workloads
37.2 -> and how you can work with glue for
39.12 -> varying workflow types
41.36 -> and then discussing how glue can meet
42.879 -> the requirements of your scaling needs
46 -> we understand workloads aren't always
47.76 -> the same size
49.76 -> and so let's discuss that and finally
51.76 -> we'll discuss glue centralized usage of
53.92 -> a data catalog
58.559 -> so what is aws glue
61.76 -> as an overview glue is a fast open
64.96 -> scalable integration service as it says
68.479 -> we are a fully serverless data
70.32 -> integration service built on the
72.64 -> foundations of apache spark and python
76.08 -> there are no servers or compute to manage
78.88 -> it scales as you need it
80.799 -> and you pay for what you use
84.24 -> the key capabilities of glue include our
86.72 -> scalable data integration engine uh with
89.119 -> our built-in transforms
90.88 -> designed for simplifying uh data
92.72 -> integration tasks
94.4 -> um again it's our serverless execution
96.96 -> engine that is only charging you for
98.96 -> what you need
100.56 -> and it's the ability to monitor
102.56 -> all of those jobs
104.399 -> in a centralized system
107.119 -> and easily identify and troubleshoot
109.04 -> issues
112 -> we have a centralized and unified data
114.079 -> governance strategy that centers around
116.159 -> the glue data catalog that we'll get to
118.159 -> in a little while we have glue crawlers
121.119 -> which simplify the ability to get data
123.759 -> into the catalog
125.52 -> and to have all of your data represented
127.439 -> therein
128.479 -> and we have fine grained security and
130.879 -> access control
132.16 -> by aws lake formation
137.599 -> glue allows you to connect to and ingest
139.599 -> data from hundreds of different data
141.44 -> sources
142.879 -> we have native connectors for the most
145.2 -> popular systems like redshift
147.36 -> and mongodb
149.44 -> and we have a connector marketplace
151.44 -> for things such as saas providers like
153.68 -> salesforce
155.28 -> we also expose an api to allow you to
157.36 -> easily build your own connectors
159.84 -> through your own custom interfaces
165.36 -> glue is not persona specific and we're
168.56 -> going to get into this today
170.48 -> ideally quite deeply
173.519 -> glue is designed to be used by a variety
175.68 -> of people
176.879 -> to explore data
178.4 -> to help generate code when necessary
181.12 -> and to allow you to really dive in
183.599 -> when you want to get your hands dirty
187.04 -> it allows you to
189.44 -> work together and collaborate together
192 -> even when you have a variety of skill
193.76 -> sets within a team
198.64 -> so let's get into the data integration
200.4 -> overview first because we can
202.08 -> describe what glue is
203.92 -> but let's take a step back and describe
205.44 -> what data integration is and why it's
207.28 -> necessary
209.04 -> data integration is a loop it's the
211.12 -> process of discovering your data and
212.799 -> combining that data
214.879 -> transforming it and cleaning it such
216.959 -> that it makes sense for your business
220.319 -> so you can get quality you can run
221.92 -> useful analytics you can build your
223.84 -> machine learning models so on and so
225.44 -> forth
227.84 -> it involves
229.12 -> centralizing that data and that data
230.72 -> centralization usually revolves around
232.48 -> the catalog i need to know what data i
234.64 -> have
235.76 -> i need to be able to move that data
238.239 -> from its original source and its
239.68 -> original shape i need to be able to
241.2 -> change it to fit a different source in a
243.12 -> different shape and that of course
244.799 -> requires me to be able to connect to
246.879 -> those sources and those destinations
250.319 -> however data integration is extremely
252.72 -> hard
254.239 -> so if you don't get it right you're
255.76 -> going to get incorrect results
258.88 -> the problem is
260.32 -> you want those results because those
262.079 -> results provide meaningful insights
264.56 -> they provide increased collaboration
267.199 -> once you realize that hey two teams are
269.12 -> likely doing the same thing
271.28 -> and these all lead to faster decisions
273.84 -> as you increase that collaboration and
275.52 -> you gain faster and more meaningful
277.199 -> insights
278.96 -> so why is it hard
281.12 -> the first reason it's hard of course is
282.56 -> because data is growing and changing
285.28 -> ever more
287.44 -> it is growing rapidly and exponentially
290.32 -> data sources seem to crop up all over
292.639 -> the place
293.84 -> when you think you finally understand
295.199 -> where your data is somebody
297.04 -> brings in a new data source and you have
299.12 -> to deal with it
300.72 -> and with increased diversity of sources
304 -> comes a diversity in the style and type
306.16 -> of data
307.52 -> no longer is all data sitting in a
309.039 -> relational table in a database now
311.36 -> you've got data in flat files you've got
313.84 -> data in semi-structured stores like
316.88 -> opensearch
317.84 -> and even graph stores like neptune
321.199 -> all of these have to be dealt with
322.479 -> typically in different ways
325.44 -> you've got different personas who are
327.12 -> now dealing with data no longer is it
329.36 -> simply one data engineering team
332.32 -> that produces reports about
335.28 -> the data
337.039 -> you now have personas at the business
339.36 -> level that simply want to
341.44 -> get into the data with no code or very
343.919 -> little coding experience
345.84 -> you have developers again that original
348.4 -> central team that still need to do their
350.72 -> job and still need to do the
352.88 -> deep dive and tinkering with the data
355.199 -> they have
356.319 -> and then you increasingly have
357.919 -> purpose-built users of data analysts and
360.08 -> scientists whose sole job it is
363.199 -> to go and look beyond the data that
365.36 -> currently exists and derive again
367.52 -> increasingly diverse insights from that
370.56 -> data
372.96 -> you have an increased number of
374.24 -> applications
375.6 -> interacting with your data as well
378.16 -> more and more of those applications are
380.24 -> becoming real time
382 -> or near real-time and therefore are very
383.68 -> sla sensitive and waiting a day or two
386.24 -> or even an hour
387.759 -> for data is no longer acceptable
391.28 -> your applications are scaling ever more
393.44 -> rapidly
395.759 -> and of course as you're trying to
398.08 -> balance all of these
399.68 -> these demands of time sensitivity and
403.12 -> highly scalable nature of your data
405.759 -> you come up across budget constraints
408.56 -> because these things cost money
411.039 -> and so how do you balance the needs of
413.039 -> your applications
414.96 -> with your budget
416.96 -> well
418.24 -> the problem is traditional solutions
419.84 -> aren't suited for this
421.759 -> you don't have a scalable infrastructure
424.479 -> your infrastructure back in the day was
426.319 -> complex to install and to maintain
429.759 -> and was often very very rigid
433.28 -> that rigidity came with a high cost both
436 -> the people cost and having to deal with
437.68 -> it
438.56 -> as well as a cost for you know extra
441.599 -> licenses for advanced functionality
444.72 -> for other transforms centralized
446.72 -> cataloging
448 -> you seem to need a different piece of
449.36 -> software for each piece of your
451.44 -> data infrastructure
454 -> and
454.8 -> a lot of times that was all based off
456.56 -> proprietary tooling that locked you in
459.28 -> and
460.16 -> you were stuck
461.599 -> your data was in and you can't go
463.039 -> anywhere because your data is sitting in
464.479 -> that engine it becomes a critical
466.479 -> component of your business
470 -> so
471.36 -> against all this
473.84 -> how do you get
476.24 -> all of this data into the hands of the
478.8 -> right people
480.479 -> well that requires the right tooling
484.24 -> so let's get into the tooling that glue
486.16 -> provides
487.68 -> typically in one of these presentations
489.44 -> i would actually go the simplest tooling
491.68 -> to the most advanced
493.52 -> but today i want to work a little
494.72 -> backwards
496.16 -> to show you the simplification that can
498.639 -> occur when we go from the advanced
500.56 -> tooling that's built at a really low
502.72 -> level
504.319 -> to the simpler tooling that is built
506.479 -> toward no-code personas
509.84 -> so our advanced tool and our most modern
512.8 -> tool is called glue interactive sessions
516.32 -> with interactive sessions we provide an
518.88 -> interactive data integration experience
521.599 -> we provide interactive development
523.599 -> of glue jobs
525.12 -> and interactive data exploration
528.32 -> interactive sessions replaces glue
530.16 -> development endpoints if you've used
532.16 -> those in the past
535.839 -> interactive sessions provides the rapid
538.24 -> creation of serverless spark
540.959 -> you are able to launch an interactive
543.279 -> session within about 30 seconds
546.24 -> from the time you decide you want to go
548.56 -> work with your data
550.959 -> you can configure and install any
552.64 -> packages you need inside glue's
554.8 -> serverless environment
556.72 -> right from the same place
558.32 -> and
559.2 -> every time you go to work with
560.959 -> interactive sessions
562.88 -> you get dedicated resources you're not
565.6 -> sharing a set of compute
567.76 -> with your neighbor or your colleague
569.519 -> you're not sharing a set of compute with
571.04 -> anyone else it is dedicated to this
573.36 -> specific task of your session
577.279 -> it's important to note that interactive
579.04 -> sessions is extraordinarily cost
580.72 -> effective
581.839 -> you literally only pay for what you use
584.16 -> interactive sessions aggressively clean
586.399 -> themselves up and time out so you're not
589.04 -> wasting idle compute
591.279 -> and all the billing is per second
597.76 -> the main interface into interactive
600.16 -> sessions is a jupyter kernel
602.8 -> it's easy to install on your local
604.8 -> macbook or pc
606.64 -> so you can run it locally you can run it
608.48 -> in the cloud either on sagemaker or in
610.56 -> glue studio
612.16 -> and because you can run it locally you
614.16 -> can connect it to your favorite ide
616.959 -> so let's go ahead and dive in real quick
618.72 -> and
619.44 -> let me show you a demo
621.12 -> of this kernel
622.72 -> now this particular demo i recorded a
625.2 -> while back
626.32 -> but i recorded it on an airplane
628.88 -> at 40,000 feet with no bandwidth because
632.399 -> i was on an airplane but i wanted to see
634.48 -> if i could use interactive sessions from
636.8 -> anywhere
638.399 -> and specifically in the use case i
640.32 -> wanted to go pull up a data set in the
643.36 -> glue data catalog
645.12 -> but first i need to configure my session
648.079 -> sessions are configured via jupyter
650 -> magics
651.12 -> magics are simple commands that are
653.44 -> prefixed with a percent sign
655.44 -> and allow you to specify the
656.64 -> configuration of glue
658.959 -> and any setting you want in glue
661.839 -> if you need to remember what magics are
663.36 -> you simply call help
664.959 -> to see what you can do now back to what
667.12 -> i was doing
668.399 -> so in sessions i was looking for a table
670.24 -> in the covid-19 data lake when i was
672.56 -> doing this particular demo that's right
674.72 -> i was looking for the county populations
676.56 -> table
678.079 -> now that i remember the table name i can
679.839 -> go look at the data so i'm going to use
681.92 -> an sql magic to select from the county
683.839 -> populations table and
685.68 -> look at the data sure enough there it is
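
A sketch of what such a %%sql cell might look like; the database and table names (covid-19, county_populations) and the column used for sorting later are assumptions based on the public COVID-19 data lake.

```
%%sql
-- peek at the table; the sorted variant used next just adds an order by
SELECT * FROM `covid-19`.county_populations LIMIT 10
```
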
689.279 -> so now i want to see if i can find the
690.56 -> largest counties by population
693.36 -> so i'm going to go again run that query
695.12 -> again
696.16 -> this time sorting by population estimate
703.12 -> once i've sorted by the population
704.56 -> estimate you can see that the sorting is
706 -> not correct
707.6 -> in this case it looks like the
709.519 -> population estimate was a string
711.76 -> and so we're going to import glue and
713.76 -> we're going to use glue's transformation
716.56 -> engine
717.519 -> to figure them out
719.6 -> now you'll notice i made a mistake
721.839 -> the beautiful thing about interactive
723.2 -> sessions is if i make a mistake i can
725.519 -> simply correct it i don't have to wait
727.44 -> minutes in between mistakes to go fix
729.68 -> them like i used to
732.399 -> so sure enough i can see that county
733.839 -> populations was a string
736.399 -> and i could go back
737.92 -> and
738.8 -> fix that string
741.2 -> by casting it over
743.36 -> to an integer
745.2 -> using apply mapping
746.88 -> so that's what we're going to do real
748.079 -> quick we'll go ahead and clean up the
750.48 -> entire dynamic frame
753.279 -> we are going to
754.959 -> rename the column to get rid of the
756.88 -> spacing because nothing likes the
758.32 -> spacing
761.12 -> and so now we're going to keep the two
762.56 -> columns we had
765.44 -> now that we're happy with columns we can
767.519 -> go ahead and create a temporary view
770.48 -> in a data frame to see if we can query
772.72 -> on top of them
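
In the notebook, the cleanup just described would look roughly like the sketch below; the database, table, and column names are assumptions carried over from the demo, so adjust them to your own catalog.

```python
# a sketch of the cleanup steps above (names are assumptions from the demo)
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import ApplyMapping

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

# read the catalog table as a dynamic frame
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="covid-19", table_name="county_populations"
)

# keep the two columns we care about, cast the estimate to a long,
# and rename it to drop the space (unmapped columns are dropped)
cleaned = ApplyMapping.apply(
    frame=dyf,
    mappings=[
        ("county", "string", "county", "string"),
        ("population estimate 2018", "string", "population_estimate_2018", "long"),
    ],
)

# expose it as a temporary view and query it with spark sql
cleaned.toDF().createOrReplaceTempView("county_populations_clean")
spark.sql("""
    SELECT county, population_estimate_2018
    FROM county_populations_clean
    ORDER BY population_estimate_2018 DESC
    LIMIT 10
""").show()
```
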
787.2 -> and sure enough now county populations
789.279 -> are properly sorted and we can see the
790.88 -> top 10 counties by population in the
793.12 -> covid-19 data set
796.48 -> it's worth noting that this entire demo
798.88 -> was done sitting on an airplane
800.88 -> it was done sitting in visual studio
802.639 -> code
804 -> and so this is showing interactive
805.839 -> sessions running in an ide local to my
808.079 -> macbook
809.36 -> essentially being able to connect to
811.12 -> glue's distributed compute service from
813.519 -> anywhere i have an internet connection
818.48 -> worth calling out again is the
819.76 -> configuration and why jupyter magics
821.92 -> were used to configure interactive
823.519 -> sessions
825.04 -> jupyter magics
826.639 -> are again these snippets of code that
828.48 -> state this is how i want my
830.72 -> session configured when they're run as
832.72 -> the first cell
834.72 -> we take that configuration we
836.72 -> apply it to your glue session
838.72 -> and environment
840.16 -> these are the same parameters as you
841.6 -> would use in a glue job if you're familiar
843.6 -> with those you can pass any
845.199 -> configuration you need to glue including
848.079 -> custom spark config
849.839 -> in this case enabling the kryo
851.36 -> serializer
853.36 -> as long as you do that before running
854.88 -> your first cell of code
856.8 -> your config stays and off you go
859.6 -> if you forget something say you forget
861.44 -> the connection to your database or you
863.68 -> discover you need another database
865.199 -> connection
866.56 -> no problem simply add it to your magic
869.04 -> list
870.959 -> restart the kernel and rerun your code
874 -> and within seconds you have a brand new
876.079 -> cluster with a brand new configuration
878.72 -> and off you go
880.639 -> the other thing worth calling out here
882 -> is this particular configuration assumes
884.32 -> i'm wanting to use glue
885.68 -> streaming
887.04 -> i can use interactive sessions with glue
888.88 -> structured streaming
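
For reference, a first configuration cell along these lines might look like the sketch below; the magic names follow the glue interactive sessions documentation, while the specific values, and the kryo setting passed through %%configure, are illustrative assumptions.

```
# first cell of the notebook: line magics that configure the session
# (the values here are placeholders)
%idle_timeout 30          # stop the session after 30 idle minutes
%glue_version 3.0         # glue runtime version for the session
%worker_type G.1X         # worker size
%number_of_workers 5      # dedicated workers for this session
%streaming                # request a streaming session type, if needed

# a separate %%configure cell can pass job-style parameters, for example
# custom spark config such as the kryo serializer (assumed syntax)
%%configure
{
    "conf": "spark.serializer=org.apache.spark.serializer.KryoSerializer"
}
```
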
892.56 -> now
893.68 -> this is really really great
896.48 -> if i'm a developer and if i love working
898.88 -> in code
900.32 -> however
901.68 -> there are many customers who want to get
904.16 -> the power of glue and the flexibility of
906.079 -> it without getting into the code
909.04 -> so that's where glue studio comes
911.04 -> in glue studio is our visual job
914 -> authoring and monitoring tool you don't
916.16 -> have to write code
918.32 -> you simply have to drag and drop nodes
921.92 -> in glue studio you can preview data at
924.16 -> each step of the way
926.56 -> you get real-time schema inference
929.68 -> and just like the rest of glue
932.399 -> it supports hundreds of connectors
935.199 -> and yes you can still use transforms in
938.399 -> sql or custom code
940.639 -> as needed if you have the capability and
942.56 -> the desire to do so
944.16 -> while not required they do come in handy
946.8 -> when you want to do something that we
948.16 -> don't support directly out of the box
952.959 -> glue studio also offers a
955.12 -> monitoring environment for all of your
957.36 -> jobs in glue
958.8 -> through a single pane of glass
962.639 -> okay let's get into a demo of glue
964.16 -> studio let's go create a job with a
966.639 -> visual source and target
968.399 -> this will pre-populate for us three
970.399 -> nodes a source a destination and apply
973.44 -> mapping
974.639 -> so let's go ahead and select the source
976.32 -> from s3
977.68 -> we'll go to the catalog table that we
979.279 -> created out of the previous demos data
982.48 -> so we called that demo populations
985.44 -> now before i can start a data preview
987.68 -> i have to fill in the rest of my
989.759 -> information for my nodes in this case
992.639 -> my s3 bucket needs a location to write
995.519 -> data to
999.44 -> once we've filled that out we can go
1000.88 -> back to my source bucket and we click
1002.32 -> data preview
1004.959 -> and once we select a role
1006.8 -> preview will start and might take a few
1008.48 -> seconds
1010.16 -> but within a few seconds we will start
1011.839 -> to get a data preview
1014.079 -> data previews are powered by interactive
1015.92 -> sessions the same things we've been
1017.68 -> discussing
1018.8 -> in the last few demos
1024.079 -> data previews are reasonably quick the
1026 -> first one often takes a few more seconds
1028.4 -> to spin up the compute but afterward
1030.559 -> everything is very very quick
1034.4 -> we can see here that it's the same data
1036.24 -> we were working with earlier excellent
1039.28 -> so let's go ahead now
1041.039 -> and
1042 -> you know let's add a transform
1044.959 -> let's go add a pii detection transform
1048.559 -> just for fun
1050.48 -> and let's look for a person's name
1054.08 -> because we can select which ones we want
1056.96 -> no no
1058.08 -> let's just go ahead and do all of them
1060.08 -> let's include all detection types let me
1062 -> just see everything
1063.76 -> let's go preview the data let's see if
1065.36 -> we can figure out
1066.88 -> what has pii
1071.039 -> after giving it a second to load
1074.24 -> again we see the same
1075.919 -> preview
1076.88 -> you'll notice that we're only previewing
1078.4 -> five of the seven fields so let's go
1080.48 -> ahead and select the rest of the fields
1083.039 -> and you can see that detected entities
1085.44 -> says there are counties
1087.919 -> the county column has a person's name in
1090.4 -> it
1091.6 -> okay
1093.039 -> is that really a problem let's go scroll
1094.88 -> over and find out
1101.36 -> you can see that most of the county
1102.559 -> names are names
1104.32 -> mccracken pope
1106.84 -> grant they're county names while
1109.039 -> they're detected by pii as proper names
1111.36 -> they're not
1113.2 -> but let's assume they were let's go
1115.039 -> ahead and redact them
1116.72 -> with a text
1118.48 -> field so we're just going to
1120 -> asterisk them out
1123.12 -> let's preview the data again to make
1124.48 -> sure it was taken and sure enough
1127.6 -> there go the proper names
1130.32 -> any county that was identified as a
1131.84 -> proper name is gone
1134.72 -> awesome now let's continue on and let's
1137.2 -> assume we want to do something with this
1138.96 -> and write this out to our data catalog
1141.919 -> let's move that apply mapping back to
1143.919 -> our detect pii
1145.84 -> you'll see the output schema gets rid of
1147.679 -> the detect struct because we
1150.32 -> don't want that in our output
1152.559 -> we just want the clean data
1154.4 -> without the personally identifiable
1156 -> information
1157.2 -> in this case the county names that you
1159.28 -> know are proper names
1162.96 -> and now that we're done with our
1164.4 -> data we can select a bucket
1166.72 -> we can select a database in a table we
1168.559 -> want to write to in this case we'll call
1170.24 -> it populations cleansed
1173.76 -> and let's go ahead and save
1176.24 -> save our transform ah that's right we need
1178.559 -> to name the job
1180.08 -> let's go ahead and call the job clean
1181.6 -> populations sounds good to me
1184.4 -> and you can go look at the script
1186.32 -> this script is generated dynamically
1188.24 -> every time you change a field or
1190.24 -> property
1191.52 -> in the visual editor
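
The generated script follows the standard glue job skeleton; the sketch below is illustrative rather than the exact generated output, and the database, table, and bucket names are placeholders.

```python
# rough shape of a glue studio generated script (names are placeholders)
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# source node: the catalog table backed by s3
source = glueContext.create_dynamic_frame.from_catalog(
    database="demo", table_name="demo_populations"
)

# transform nodes (apply mapping, pii detection, etc.) would appear here

# target node: write out to s3 (the real generated code may also update the catalog)
glueContext.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-output-bucket/populations_cleansed/"},
    format="parquet",
)

job.commit()
```
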
1194.72 -> once we're satisfied with it we can go
1196.08 -> to job details
1197.52 -> all i'm doing today is enabling auto
1199.12 -> scaling
1200.799 -> we'll save the job
1203.6 -> and once we're happy with it we can
1204.96 -> click run
1207.36 -> now once we've run it we can go to the
1208.799 -> run details page
1210.48 -> to get information on this particular
1212.48 -> run and how it's going
1214 -> and if we wanted to keep going we could
1215.84 -> go to schedules and we can click on
1217.12 -> schedules
1220.08 -> worth calling out in addition to the
1221.679 -> visual editor
1223.76 -> glue studio offers
1225.679 -> glue studio notebooks
1228 -> they're the same notebooks as
1230.72 -> interactive sessions locally
1232.799 -> running in jupyter
1234.559 -> they're free
1236.08 -> serverless and they offer one-click job
1238.559 -> execution and scheduling
1241.6 -> it's the same thing you saw me running
1243.36 -> in visual studio code except it's
1245.12 -> running in jupyter notebooks
1248 -> sitting in glue studio nothing for
1250.159 -> you to host nothing for you to manage
1252.48 -> the same magics
1254 -> that i was showing you earlier apply
1256.08 -> here and run in here
1257.679 -> and of course there is built-in
1258.72 -> monitoring support just like there is
1260.159 -> for glue jobs
1264.799 -> now
1266.24 -> it's all nice and good that you can
1267.6 -> integrate your data in a batch mode
1270.08 -> as we've been seeing but what about your
1271.76 -> other data
1274.159 -> glue has several execution modes we've
1276.48 -> got batch which is your standard job
1279.12 -> that runs at a set schedule
1281.28 -> you've got streaming modes that allow
1283.44 -> glue to run continually
1286.159 -> ingesting data as it hits a stream
1289.36 -> you have event based execution for glue
1292.559 -> that will kick off a glue job based
1294.559 -> off of an event trigger
1296.4 -> and then you have an interactive api
1298.96 -> that allows you to integrate glue into
1300.4 -> interactive applications
1304.08 -> so let's talk about batch for a minute
1306.32 -> why batch
1308.159 -> simply batch is there for
1310.159 -> scale and for reuse it's the thing we
1312.559 -> typically think about when we think data
1314.159 -> integration or etl
1316.799 -> but
1317.6 -> batch goes beyond just a single job
1319.679 -> often it goes into workflows and
1321.76 -> pipelines as well
1323.6 -> so let's talk about workflows
1326.64 -> glue workflows allow you to orchestrate
1328.96 -> jobs
1330.4 -> with glue and with other aws services
1333.679 -> you can use glue spark jobs as well as
1336.08 -> glue python shell
1338.48 -> you can monitor the execution of the
1340.48 -> entire workflow in one place
1343.2 -> you can use triggers
1344.88 -> either schedule based on-demand
1347.6 -> or event-based triggers
1349.76 -> inside your workflows
1352 -> and of course you have easy access to
1353.44 -> monitoring logs the whole reason for
1355.76 -> workflows within glue
1357.84 -> is to encapsulate the entirety of your
1359.76 -> data integration pipeline in one place
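
A minimal boto3 sketch of wiring such a workflow together; the workflow, job, and crawler names and the schedule are placeholders.

```python
import boto3

glue = boto3.client("glue")

# the workflow container
glue.create_workflow(Name="daily-report-workflow")

# a scheduled trigger that starts the first job every night at 2am utc
glue.create_trigger(
    Name="daily-report-start",
    WorkflowName="daily-report-workflow",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",
    Actions=[{"JobName": "clean-populations"}],
    StartOnCreation=True,
)

# a conditional trigger that starts a crawler once the job succeeds
glue.create_trigger(
    Name="crawl-after-clean",
    WorkflowName="daily-report-workflow",
    Type="CONDITIONAL",
    Predicate={"Conditions": [
        {"LogicalOperator": "EQUALS", "JobName": "clean-populations", "State": "SUCCEEDED"}
    ]},
    Actions=[{"CrawlerName": "populations-crawler"}],
    StartOnCreation=True,
)
```
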
1365.039 -> now
1366.08 -> workflows are great
1368.559 -> but oftentimes you're repeating the same
1370.64 -> process
1371.76 -> you might be repeating the same workflow
1374.88 -> for sales for marketing for another
1377.2 -> department
1379.12 -> now these come with their own challenges
1381.84 -> if you're repeating these continually
1384.4 -> you're coding them manually
1386.64 -> you're making mistakes because you have
1388.08 -> to change the same thing over and over
1390.24 -> again
1392.08 -> you have some poor data engineer who has
1393.76 -> to do this 50 times in this job
1396.559 -> and at some point it fails to scale
1400.08 -> so
1401.36 -> how do we avoid that developer spending
1403.2 -> that valuable time
1404.96 -> and how do we avoid the errors and the
1406.72 -> pain
1408.48 -> glue gives us custom blueprints
1411.36 -> these are workflows
1414.84 -> that are templatized
1417.76 -> you simply give glue a script
1420.24 -> a configuration file
1422.4 -> essentially how you want us to
1424.559 -> prepare and launch the environment
1427.2 -> for your workflow
1428.08 -> and a layout file
1430.48 -> to say hey
1432.32 -> what should the inputs and outputs to my
1435.84 -> workflow be
1435.84 -> and we do the rest
1437.919 -> you can then instantiate that workflow
1440.32 -> as necessary
1442.32 -> so your end users can just say cool i
1444.96 -> need my daily report workflow for a new
1447.039 -> report and i can fill in the boxes
1450.32 -> click submit
1451.919 -> and your glue workflow
1454.96 -> creates a copy of itself with the
1456.48 -> correct parameters and does its thing
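
A hedged boto3 sketch of registering a blueprint archive and then instantiating it with parameters; the archive location, role ARN, and parameter names are assumptions, since the parameters are defined by the blueprint's own configuration file.

```python
import json
import boto3

glue = boto3.client("glue")

# register the packaged blueprint (zip containing the layout script and config) once
glue.create_blueprint(
    Name="daily-report-blueprint",
    BlueprintLocation="s3://my-blueprints/daily_report.zip",
)

# each team "fills in the boxes" by starting a run with its own parameters
glue.start_blueprint_run(
    BlueprintName="daily-report-blueprint",
    Parameters=json.dumps({
        "WorkflowName": "sales-daily-report",
        "SourceDatabase": "sales",
        "OutputPath": "s3://my-data-lake/sales/reports/",
    }),
    RoleArn="arn:aws:iam::123456789012:role/GlueBlueprintRole",
)
```
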
1461.6 -> the next mode we talked about is
1462.88 -> streaming
1465.44 -> glue streaming is built on spark
1467.12 -> structured streaming
1468.64 -> it is fully serverless
1470.64 -> and you can build jobs visually
1473.279 -> interactively with interactive sessions
1476 -> or in a more traditional manner like
1477.679 -> you have always done with an ide
1480.08 -> and spark
1481.919 -> you can easily connect from kinesis or
1483.44 -> kafka directly in the visual editor
1486.96 -> so streaming is a great way to move your
1489.039 -> spark jobs from batch
1491.44 -> into a micro batch and streaming method
1493.919 -> because really the difference between
1495.2 -> streaming and batch is only a few lines
1497.52 -> of code
1498.96 -> glue makes it fairly easy to migrate
1501.52 -> from one to the other either visually
1504.24 -> or in a traditional manner
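
A sketch of what a glue streaming job might look like in code, assuming a kinesis-backed catalog table; the database, table, paths, and window size are placeholders.

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# streaming sources are read as a spark data frame rather than a dynamic frame
stream_df = glueContext.create_data_frame.from_catalog(
    database="streaming_db",
    table_name="clickstream_kinesis",
    additional_options={"startingPosition": "TRIM_HORIZON", "inferSchema": "true"},
)

def process_batch(batch_df, batch_id):
    # each micro batch arrives as an ordinary data frame, so the same
    # transforms used in a batch job can be applied here
    if not batch_df.rdd.isEmpty():
        batch_df.write.mode("append").parquet("s3://my-data-lake/clickstream/")

# glue drives spark structured streaming in micro batches
glueContext.forEachBatch(
    frame=stream_df,
    batch_function=process_batch,
    options={
        "windowSize": "60 seconds",
        "checkpointLocation": "s3://my-data-lake/checkpoints/clickstream/",
    },
)
```
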
1508.08 -> the next method we discussed was event
1511.36 -> driven
1512.84 -> integration so how do we let your data
1516.159 -> drive your work
1520.32 -> glue etl now integrates with amazon
1522.799 -> eventbridge
1525.279 -> amazon eventbridge has hundreds of
1527.039 -> built-in sources
1529.039 -> allows connections to custom
1530.4 -> applications and your saas applications
1533.52 -> they provide all of these triggers that
1535.919 -> allow you to kick off a workflow
1539.12 -> that includes glue jobs crawlers and
1542.159 -> other things
1543.279 -> and the whole purpose of this
1545.279 -> is that you let your data
1548.08 -> do the work you might have data that
1550 -> comes in daily
1551.52 -> you know once a day it's sometime
1553.679 -> roughly in the
1554.84 -> evening my clients upload their data to
1558.159 -> my s3 bucket and i have to go get it
1560.559 -> so every night i kick it off and i go
1562.159 -> get the data and i hope it's all there
1564.559 -> with eventbridge
1566.24 -> the second the data lands
1568.96 -> you can go ahead and just get the data
1571.2 -> you don't have to run it on a
1572.799 -> schedule you can do it in response to
1575.6 -> the event of your data landing in your
1577.44 -> s3 bucket
1579.84 -> this allows you to move closer to that
1582.72 -> near real time
1584.96 -> goal that everybody seems to be seeking
1587.679 -> without the great expense and
1590.159 -> hassle of maintaining a continually
1592.32 -> streaming environment when you don't
1594.24 -> likely need one for every use case
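
A hedged sketch of the wiring involved: an event trigger starts the workflow, and an EventBridge rule routes S3 object-created events to it. The bucket, workflow, account, and role names are placeholders, and the bucket is assumed to have EventBridge notifications enabled.

```python
import json
import boto3

glue = boto3.client("glue")
events = boto3.client("events")

# the workflow's start trigger fires on events instead of a schedule
glue.create_trigger(
    Name="start-on-new-data",
    WorkflowName="ingest-client-uploads",
    Type="EVENT",
    Actions=[{"JobName": "ingest-client-uploads-job"}],
    EventBatchingCondition={"BatchSize": 1},
)

# an eventbridge rule matching object-created events for the upload bucket
events.put_rule(
    Name="client-upload-landed",
    EventPattern=json.dumps({
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {"bucket": {"name": ["my-client-uploads"]}},
    }),
)

# point the rule at the glue workflow, with a role that lets eventbridge notify glue
events.put_targets(
    Rule="client-upload-landed",
    Targets=[{
        "Id": "glue-workflow",
        "Arn": "arn:aws:glue:us-east-1:123456789012:workflow/ingest-client-uploads",
        "RoleArn": "arn:aws:iam::123456789012:role/EventBridgeInvokeGlueRole",
    }],
)
```
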
1599.76 -> and finally
1601.279 -> the last method we'd like to talk about
1603.039 -> for integrating glue is
1605.279 -> interactivity using interactive apps
1609.84 -> using glue interactive sessions you can
1612 -> integrate glue directly into your own
1614.799 -> applications using the glue apis
1618 -> this includes glue streaming
1620.799 -> and it allows you to extend glue
1624.08 -> into any application that has access to
1626.48 -> aws
1627.84 -> on-premise
1629.12 -> or in the cloud
1631.44 -> so let's go ahead and see how this might
1632.72 -> be done
1634.4 -> so here i'm going to have a simple bash
1636.96 -> application
1638.64 -> called detect pii
1641.279 -> and this is going to run what amounts to
1642.96 -> the same code we ran earlier
1645.039 -> again on the same table i'm not trying
1646.96 -> to do anything fancy with data
1649.52 -> and this thing's going to go out and
1651.44 -> it's going to run the detect pii systems
1654.48 -> against interactive sessions and it's
1656.159 -> going to ask hey is this actually pii
1659.84 -> well in this table no
1661.84 -> queen anne's maryland is a county
1666.64 -> it's not pii
1669.12 -> so we're just gonna put no
1671.52 -> what about castro texas is that
1673.76 -> a name
1674.96 -> no again it's a county so no
1678.48 -> and so on and so forth
1680.48 -> and one south dakota now then there's no
1682.559 -> pii in here
1684.72 -> at this point i don't think there is pii
1686.64 -> so i'm going to exit
1689.2 -> now
1690.72 -> if there was pii i could have said yes
1693.76 -> and something else in my application
1695.2 -> could have happened
1696.399 -> such as
1697.44 -> locking it down inside lake formation or
1700.24 -> notifying the data owner that they have
1701.919 -> potential pii within their data
1705.6 -> while automatic detection is great and
1707.6 -> extremely useful there are times that
1710.64 -> auditing is appropriate
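
Under the hood, an application like this drives the interactive sessions API; a minimal boto3 sketch might look like the following, with the role ARN, session id, and the submitted statement as placeholders.

```python
import time
import boto3

glue = boto3.client("glue")

# create a dedicated serverless session for the application
glue.create_session(
    Id="pii-audit-session",
    Role="arn:aws:iam::123456789012:role/GlueInteractiveSessionRole",
    Command={"Name": "glueetl", "PythonVersion": "3"},
    IdleTimeout=15,
)

# wait for the session to become ready before submitting work
while glue.get_session(Id="pii-audit-session")["Session"]["Status"] != "READY":
    time.sleep(5)

# submit a statement and poll until it finishes
stmt = glue.run_statement(
    SessionId="pii-audit-session",
    Code='print(spark.sql("SELECT count(*) FROM `covid-19`.county_populations").collect())',
)
while True:
    result = glue.get_statement(SessionId="pii-audit-session", Id=stmt["Id"])["Statement"]
    if result["State"] in ("AVAILABLE", "ERROR", "CANCELLED"):
        break
    time.sleep(2)

print(result.get("Output"))
glue.delete_session(Id="pii-audit-session")
```
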
1714.64 -> so this is all nice and great but how
1716.159 -> does glue work how does glue do all of
1718.48 -> these things and scale
1720.48 -> across all of these personas and all of
1722.08 -> these workflows and use cases
1725.84 -> glue runs a scalable execution model
1728.72 -> where a job is kicked off
1731.279 -> by a job manager
1733.2 -> and the job is divided into stages
1736 -> and the data is divided into partitions
1739.76 -> and the job manager takes those stages
1743.12 -> and pairs them with partitions to
1744.96 -> schedule tasks
1747.039 -> on a worker or a node of compute
1750.64 -> and glue serverlessly scales to
1752.88 -> thousands of workers
1754.72 -> and executes these tasks in parallel on
1758.08 -> the workers
1759.52 -> and of course you only pay for the
1760.72 -> compute that's used
1763.44 -> glue announced auto scaling at
1765.52 -> re:invent and we launched it at the san
1767.2 -> francisco summit
1770 -> and so
1771.039 -> now in addition to just
1774.08 -> setting a serverless compute you check
1775.679 -> the box as i showed earlier
1777.84 -> and we will only provision the compute
1779.44 -> you need
1780.64 -> for any given task
1785.12 -> so
1787.2 -> let's dive deeper real quick into
1788.96 -> integrating at scale
1791.679 -> because
1792.559 -> scaling challenges are hard
1794.88 -> you've got different business events
1797.12 -> schema and size variants
1799.279 -> source variants
1800.799 -> and this again all impacts your cost and
1802.799 -> capacity
1806.159 -> you've got resource prediction problems
1808.799 -> tuning's hard and yeah
1811.919 -> everyone typically over provisions
1813.919 -> because the last thing you want is for
1815.44 -> it to fail i'd rather pay 20 percent more than
1818.48 -> have to be woken up in the middle of the
1819.84 -> night
1823.679 -> so that's where auto scaling comes in
1825.84 -> as discussed auto scaling is a check box
1828.96 -> it reduces cost
1831.36 -> simply put you check the box
1834.24 -> and you don't have to think about the
1835.2 -> capacity planning
1837.039 -> you enable auto scaling
1839.279 -> you set the maximum number of workers
1840.96 -> you want to allocate to your auto
1842.24 -> scaling job
1844.399 -> and you let it go
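
Outside the console checkbox, the same setting can be applied when defining a job through the API; a sketch assuming glue 3.0 and the documented --enable-auto-scaling job parameter, with the job, script, and role names as placeholders.

```python
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="clean-populations",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-scripts/clean_populations.py",
        "PythonVersion": "3",
    },
    GlueVersion="3.0",
    WorkerType="G.1X",
    NumberOfWorkers=50,  # the ceiling; auto scaling provisions only what each stage needs
    DefaultArguments={"--enable-auto-scaling": "true"},
)
```
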
1850.64 -> so let's go ahead and look at how auto
1852.32 -> scaling works within a multi-stage job
1854.799 -> if we have a jdbc source
1856.88 -> and we want to read from it apply a
1858.399 -> custom transform a simple mapping and
1860.799 -> write it out
1863.6 -> you're going to have a few connections
1864.88 -> when you start the jdbc job then you're
1867.039 -> going to have high parallelization after
1868.96 -> the data is read into memory in spark
1871.84 -> now
1872.799 -> in a traditional cluster without auto
1874.96 -> scaling
1876 -> you would have to spin up all the
1877.12 -> workers first
1879.44 -> then you would have to do
1881.039 -> the small read
1882.88 -> and the big writes and you'd have a
1884.08 -> bunch of idle compute
1885.6 -> with auto scaling we only have to
1887.679 -> provision the workers that need to do
1889.039 -> the read when it's time to do the read
1892.08 -> when it's time to do the advanced
1893.44 -> transforms we can scale up as needed
1896.88 -> not before
1899.279 -> and that way you're saving yourself a
1902.24 -> significant amount of compute
1904.08 -> and a good potential cost savings
1910.24 -> now all of this is nice and good
1913.76 -> but if i've got so much more data
1916 -> that i can interact with in my company
1918 -> how can i unify that how can i
1920.32 -> centralize
1921.919 -> all of my data and centrally govern it
1925.84 -> glue data catalog
1927.76 -> is our mechanism for central data
1931.12 -> organization
1932.88 -> it is a meta store for data lakes
1935.2 -> it's highly scalable and durable it's
1937.36 -> extremely cost effective it offers
1939.679 -> security compliance and auditing
1941.039 -> capabilities and it is hive meta store
1943.44 -> compliant
1944.559 -> with an open source
1946.72 -> connector to attach all of your favorite
1948.88 -> hive compatible systems
1955.2 -> now the easiest way to get data into your
1957.6 -> catalog
1959.279 -> is to use a glue crawler
1961.679 -> glue crawlers
1963.919 -> are little serverless applications that
1966 -> connect to your data sources
1969.12 -> automatically discover your schema and
1970.96 -> extract that schema
1975.279 -> and then they write that schema into the
1975.279 -> data catalog
1976.64 -> every time they re-read your data source
1979.279 -> they will update your schema as needed
1981.6 -> inside the catalog
1984.72 -> you're able to specify your own
1986.159 -> classifiers for custom files
1988.559 -> this is a big deal because you can even
1990.88 -> use glue crawler for those old school
1993.679 -> flat files
1995.12 -> with classifiers and grok
1997.36 -> this can allow you to remove some of
1999.039 -> your legacy systems that currently exist
2001.519 -> because it's really hard to deal with
2003.279 -> this old data
2004.72 -> and of course crawlers can run on demand
2007.6 -> as part of a schedule or part of a
2009.2 -> workflow
2010.32 -> or in response to events
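
A minimal boto3 sketch of creating and running a crawler over an S3 prefix; the names, path, role, and schedule are placeholders.

```python
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="populations-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="demo",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/populations/"}]},
    SchemaChangePolicy={"UpdateBehavior": "UPDATE_IN_DATABASE", "DeleteBehavior": "LOG"},
    Schedule="cron(0 3 * * ? *)",  # or omit and run on demand, from a workflow, or on events
)

glue.start_crawler(Name="populations-crawler")
```
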
2015.2 -> event based crawlers on s3 for example
2019.279 -> only crawl when a new object hits
2021.679 -> remember there's really no such thing as
2023.2 -> an update in
2024.88 -> s3 it's a new object
2027.279 -> or not
2028.72 -> so
2029.76 -> you can have a broader coverage
2032.48 -> with crawlers because you can say hey
2034.399 -> monitor all of these
2036.24 -> folders these prefixes in s3 let
2038.32 -> me know when anything changes
2040.96 -> and because we're not having to list the
2042.48 -> buckets crawlers become faster and more
2044.88 -> cost effective
2046.96 -> making it easier to maintain your data
2048.8 -> lake
2049.599 -> at every stage in your data journey
2052.56 -> now
2053.52 -> we also offer lake formation
2055.919 -> for fine grain access control
2058.399 -> and while i'm not going to go too deep
2059.76 -> into lake formation i did want to call
2061.28 -> out governed tables
2062.879 -> lake formation governed tables are
2064.56 -> a table type on top of s3
2067.76 -> that provides atomic transactions
2070.96 -> on top of your tables
2072.639 -> so that you can atomically maintain and
2075.2 -> update
2076.24 -> your objects
2077.76 -> you can automatically compact your data
2080 -> as necessary to make sure it's
2081.599 -> performant
2082.96 -> for these distributed analytics
2084.48 -> applications such as amazon athena
2086.879 -> it allows you time travel in your table
2089.359 -> so you can look at the table
2091.44 -> and the data as it existed at any given
2093.44 -> point in time
2095.44 -> this allows you to do
2097.839 -> cleaner testing
2099.52 -> of
2100.8 -> ml models and training it allows you a
2102.96 -> better understanding of when your data
2105.2 -> has come in and how it's come in and be
2106.64 -> able to compare what's changed when
2110.48 -> there are a lot of advantages to
2112.48 -> automatic time travel within your tables
2117.92 -> thank you all for coming i hope you've
2119.359 -> learned something
2120.64 -> i look forward to seeing what you guys
2122.72 -> build on aws glue
2125.04 -> thank you very much
2130.68 -> [Music]

Source: https://www.youtube.com/watch?v=cdk_6bpmZYE