Serverless Data Integration for a Modern Data Infrastructure with AWS Glue - AWS Online Tech Talks

When modernizing your data architecture, data integration and data movement are essential components. Learn how AWS Glue’s serverless data integration service lets users at all skill levels discover, combine, and prepare data at petabyte scale.

Learning Objectives:
* Objective 1: Learn how AWS Glue offers tailored tools for business and technical users.
* Objective 2: See how you can use AWS Glue to quickly create a centralized data catalog.
* Objective 3: Understand how AWS Glue supports event driven ETL.

To learn more about the services featured in this talk, please visit: https://aws.amazon.com/glue

Subscribe to AWS Online Tech Talks:
https://www.youtube.com/@AWSOnlineTec

Follow Amazon Web Services:
Official Website: https://aws.amazon.com/what-is-aws
Twitch: https://twitch.tv/aws
Twitter: https://twitter.com/awsdevelopers
Facebook: https://facebook.com/amazonwebservices
Instagram: https://instagram.com/amazonwebservices

☁️ AWS Online Tech Talks cover a wide range of topics and expertise levels through technical deep dives, demos, customer examples, and live Q&A with AWS experts. Builders can choose from bite-sized 15-minute sessions, insightful fireside chats, immersive virtual workshops, interactive office hours, or on-demand tech talks to watch at their own pace. Join us to fuel your learning journey with AWS.

#AWS


Content

2.12 -> [Music]
8.4 -> welcome everybody to our discussion on
10.24 -> serverless data integration
12.32 -> my name is zach mitchell i'm a big data
14.4 -> architect with aws glue and aws lake
16.88 -> formation
18.4 -> and today we're going to be discussing
20.8 -> glue
22.88 -> specifically
24.16 -> our learning objectives today are
26.72 -> getting a brief overview of what glue is
28.64 -> just to refresh everybody's memory
31.359 -> discussing glue and how it applies to
33.36 -> different users
35.2 -> discussing glue for different workloads
37.2 -> and how you can work with glue for
39.12 -> varying workflow types
41.36 -> and then discussing how glue can meet
42.879 -> the requirements of your scaling needs
46 -> we understand workloads aren't always
47.76 -> the same size
49.76 -> and so let's discuss that and finally
51.76 -> we'll discuss glue centralized usage of
53.92 -> a data catalog
58.559 -> so what is aws glue
61.76 -> as an overview glue is a fast open
64.96 -> scalable integration service as it says
68.479 -> we are a fully serverless data
70.32 -> integration service built on the
72.64 -> foundations of apache spark and python
76.08 -> there are no servers or compute to manage
78.88 -> it scales as you need it
80.799 -> and you pay for what you use
84.24 -> the key capabilities of glue include our
86.72 -> scalable data integration engine uh with
89.119 -> our built-in transforms
90.88 -> designed for simplifying uh data
92.72 -> integration tasks
94.4 -> um again it's our serverless execution
96.96 -> engine that is only charging you for
98.96 -> what you need
100.56 -> and it's the ability to monitor
102.56 -> all of those jobs
104.399 -> in a centralized system
107.119 -> and easily identify and troubleshoot
109.04 -> issues
112 -> we have a centralized and unified data
114.079 -> governance strategy that centers around
116.159 -> the glue data catalog that we'll get to
118.159 -> in a little while we have glue crawlers
121.119 -> which simplify the ability to get data
123.759 -> into the catalog
125.52 -> and to have all of your data represented
127.439 -> therein
128.479 -> and we have fine grained security and
130.879 -> access control
132.16 -> by aws lake formation
137.599 -> glue allows you to connect to and ingest
139.599 -> data from hundreds of different data
141.44 -> sources
142.879 -> we have native connectors for the most
145.2 -> popular systems like redshift
147.36 -> and mongodb
149.44 -> and we have a connector marketplace
151.44 -> for things such as saas providers like
153.68 -> salesforce
155.28 -> we also expose an api to allow you to
157.36 -> easily build your own connectors
159.84 -> through your own custom interfaces
165.36 -> glue is not persona specific and we're
168.56 -> going to get into this today
170.48 -> ideally quite deeply
173.519 -> glue is designed to be used by a variety
175.68 -> of people
176.879 -> to explore data
178.4 -> to help generate code when necessary
181.12 -> and to allow you to really dive in
183.599 -> when you want to get your hands dirty
187.04 -> it allows you to
189.44 -> work together and collaborate together
192 -> even when you have a variety of skill
193.76 -> sets within a team
198.64 -> so let's get into the data integration
200.4 -> overview first because we can
202.08 -> describe what glue is
203.92 -> but let's take a step back and describe
205.44 -> what data integration is and why it's
207.28 -> necessary
209.04 -> data integration is a loop it's the
211.12 -> process of discovering your data and
212.799 -> combining that data
214.879 -> transforming it and cleaning it such
216.959 -> that it makes sense for your business
220.319 -> so you can get quality you can run
221.92 -> useful analytics you can build your
223.84 -> machine learning models so on and so
225.44 -> forth
227.84 -> it involves
229.12 -> centralizing that data and that data
230.72 -> centralization usually revolves around
232.48 -> the catalog i need to know what data i
234.64 -> have
235.76 -> i need to be able to move that data
238.239 -> from its original source and its
239.68 -> original shape i need to be able to
241.2 -> change it to fit a different source in a
243.12 -> different shape and that of course
244.799 -> requires me to be able to connect to
246.879 -> those sources and those destinations
250.319 -> however data integration is extremely
252.72 -> hard
254.239 -> so if you don't get it right you're
255.76 -> going to get incorrect results
258.88 -> the problem is
260.32 -> you want those results because those
262.079 -> results provide meaningful insights
264.56 -> they provide increased collaboration
267.199 -> once you realize that hey two teams are
269.12 -> likely doing the same thing
271.28 -> and these all lead to faster decisions
273.84 -> as you increase that collaboration and
275.52 -> you gain faster and more meaningful
277.199 -> insights
278.96 -> so why is it hard
281.12 -> the first reason it's hard of course is
282.56 -> because data is growing and changing
285.28 -> ever more
287.44 -> it is growing rapidly and exponentially
290.32 -> data sources seem to crop up all over
292.639 -> the place
293.84 -> when you think you finally understand
295.199 -> where your data is somebody
297.04 -> brings in a new data source and you have
299.12 -> to deal with it
300.72 -> and with increased diversity of sources
304 -> comes a diversity in the style and type
306.16 -> of data
307.52 -> no longer is all data sitting in a
309.039 -> relational table in a database now
311.36 -> you've got data in flat files you've got
313.84 -> data in semi-structured stores like
316.88 -> opensearch
317.84 -> and even graph stores like neptune
321.199 -> all of these have to be dealt with
322.479 -> typically in different ways
325.44 -> you've got different personas who are
327.12 -> now dealing with data no longer is it
329.36 -> simply one data engineering team
332.32 -> that produces reports about
335.28 -> the data
337.039 -> you now have personas at the business
339.36 -> level that simply want to
341.44 -> get into the data with no code or very
343.919 -> little coding experience
345.84 -> you have developers again that original
348.4 -> central team that still need to do their
350.72 -> job and still need to do the
352.88 -> deep dive and tinkering with the data
355.199 -> they have
356.319 -> and then you increasingly have
357.919 -> purpose-built users of data analysts and
360.08 -> scientists whose sole job it is
363.199 -> to go and look beyond the data that
365.36 -> currently exists and derive again
367.52 -> increasingly diverse insights from that
370.56 -> data
372.96 -> you have an increased number of
374.24 -> applications
375.6 -> interacting with your data as well
378.16 -> more and more of those applications are
380.24 -> becoming real time
382 -> or near real-time and therefore are very
383.68 -> sla sensitive and waiting a day or two
386.24 -> or even an hour
387.759 -> for data is no longer acceptable
391.28 -> your applications are scaling ever more
393.44 -> rapidly
395.759 -> and of course as you're trying to
398.08 -> balance all of these
399.68 -> these demands of time sensitivity and
403.12 -> highly scalable nature of your data
405.759 -> you come up across budget constraints
408.56 -> because these things cost money
411.039 -> and so how do you balance the needs of
413.039 -> your applications
414.96 -> with your budget
416.96 -> well
418.24 -> the problem is traditional solutions
419.84 -> aren't suited for this
421.759 -> you don't have a scalable infrastructure
424.479 -> your infrastructure back in the day was
426.319 -> complex to install and to maintain
429.759 -> and was often very very rigid
433.28 -> that rigidity came with a high cost both
436 -> the people cost and having to deal with
437.68 -> it
438.56 -> as well as a cost for you know extra
441.599 -> licenses for advanced functionality
444.72 -> for other transforms centralized
446.72 -> cataloging
448 -> you seem to need a different piece of
449.36 -> software for each piece of your
451.44 -> data infrastructure
454 -> and
454.8 -> a lot of times that was all based off
456.56 -> proprietary tooling that locked you in
459.28 -> and
460.16 -> you were stuck
461.599 -> your data was in and you can't go
463.039 -> anywhere because your data is sitting in
464.479 -> that engine it becomes a critical
466.479 -> component of your business
470 -> so
471.36 -> against all this
473.84 -> how do you get
476.24 -> all of this data into the hands of the
478.8 -> right people
480.479 -> well that requires the right tooling
484.24 -> so let's get into the tooling that glue
486.16 -> provides
487.68 -> typically in one of these presentations
489.44 -> i would actually go the simplest tooling
491.68 -> to the most advanced
493.52 -> but today i want to work a little
494.72 -> backwards
496.16 -> to show you the simplification that can
498.639 -> occur when we go from the advanced
500.56 -> tooling that's built at a really low
502.72 -> level
504.319 -> to the simpler tooling that is built
506.479 -> toward no-code personas
509.84 -> so our advanced tool and our most modern
512.8 -> tool is called glue interactive sessions
516.32 -> with interactive sessions we provide an
518.88 -> interactive data integration experience
521.599 -> we provide interactive development
523.599 -> of glue jobs
525.12 -> and interactive data exploration
528.32 -> interactive sessions replaces glue
530.16 -> development endpoints if you've used
532.16 -> those in the past
535.839 -> interactive sessions provides the rapid
538.24 -> creation of serverless spark
540.959 -> you are able to launch an interactive
543.279 -> session within about 30 seconds
546.24 -> from the time you decide you want to go
548.56 -> work with your data
550.959 -> you can configure and install any
552.64 -> packages you need inside glue's
554.8 -> serverless environment
556.72 -> right from the same place
558.32 -> and
559.2 -> every time you go to work with
560.959 -> interactive sessions
562.88 -> you get dedicated resources you're not
565.6 -> sharing a set of compute
567.76 -> with your neighbor or your colleague
569.519 -> you're not sharing a set of compute with
571.04 -> anyone else it is dedicated to this
573.36 -> specific task of your session
577.279 -> it's important to note that interactive
579.04 -> sessions is extraordinarily cost
580.72 -> effective
581.839 -> you literally only pay for what you use
584.16 -> interactive sessions aggressively clean
586.399 -> themselves up and time out so you're not
589.04 -> wasting idle compute
591.279 -> and all the billing is per second
597.76 -> the main interface into interactive
600.16 -> sessions is a jupyter kernel
602.8 -> it's easy to install on your local
604.8 -> macbook or pc
606.64 -> so you can run it locally you can run it
608.48 -> in the cloud either on sagemaker or in
610.56 -> glue studio
612.16 -> and because you can run it locally you
614.16 -> can connect it to your favorite ide
616.959 -> so let's go ahead and dive in real quick
618.72 -> and
619.44 -> let me show you a demo
621.12 -> of this kernel
622.72 -> now this particular demo i recorded a
625.2 -> while back
626.32 -> but i recorded it on an airplane
628.88 -> at 40,000 feet with no bandwidth because
632.399 -> i was on an airplane but i wanted to see
634.48 -> if i could use interactive sessions from
636.8 -> anywhere
638.399 -> and specifically in the use case i
640.32 -> wanted to go pull up a data set in the
643.36 -> glue data catalog
645.12 -> but first i need to configure my session
648.079 -> sessions are configured via jupyter
650 -> magics
651.12 -> magics are simple commands that are
653.44 -> prefixed with a percent sign
655.44 -> and allow you to specify the
656.64 -> configuration of glue
658.959 -> and any setting you want in glue
661.839 -> if you need to remember what magics are
663.36 -> you simply call help
664.959 -> to see what you can do now back to what
667.12 -> i was doing
668.399 -> so in sessions i was looking for a table
670.24 -> in the covid-19 data lake when i was
672.56 -> doing this particular demo that's right
674.72 -> i was looking for the county populations
676.56 -> table
678.079 -> now that i remember the table name i can
679.839 -> go look at the data so i'm going to use
681.92 -> an sql magic to select from the county
683.839 -> populations table and
685.68 -> look at the data sure enough there it is
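
A sketch of what such a %%sql cell might look like; the database and table names (covid-19, county_populations) and the column used for sorting later are assumptions based on the public COVID-19 data lake.

```
%%sql
-- peek at the table; the sorted variant used next just adds an order by
SELECT * FROM `covid-19`.county_populations LIMIT 10
```
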
689.279 -> so now i want to see if i can find the
690.56 -> largest counties by population
693.36 -> so i'm going to go again run that query
695.12 -> again
696.16 -> this time sorting by population estimate
703.12 -> once i've sorted by the population
704.56 -> estimate you can see that the sorting is
706 -> not correct
707.6 -> in this case it looks like the
709.519 -> population estimate was a string
711.76 -> and so we're going to import glue and
713.76 -> we're going to use glue's transformation
716.56 -> engine
717.519 -> to figure them out
719.6 -> now you'll notice i made a mistake
721.839 -> the beautiful thing about interactive
723.2 -> sessions is if i make a mistake i can
725.519 -> simply correct it i don't have to wait
727.44 -> minutes in between mistakes to go fix
729.68 -> them like i used to
732.399 -> so sure enough i can see that county
733.839 -> populations was a string
736.399 -> and i could go back
737.92 -> and
738.8 -> fix that string
741.2 -> by casting it over
743.36 -> to an integer
745.2 -> using apply mapping
746.88 -> so that's what we're going to do real
748.079 -> quick we'll go ahead and clean up the
750.48 -> entire dynamic frame
753.279 -> we are going to
754.959 -> rename the column to get rid of the
756.88 -> spacing because nothing likes the
758.32 -> spacing
761.12 -> and so now we're going to keep the two
762.56 -> columns we had
765.44 -> now that we're happy with columns we can
767.519 -> go ahead and create a temporary view
770.48 -> in a data frame to see if we can query
772.72 -> on top of them
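
In the notebook, the cleanup just described would look roughly like the sketch below; the database, table, and column names are assumptions carried over from the demo, so adjust them to your own catalog.

```python
# a sketch of the cleanup steps above (names are assumptions from the demo)
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import ApplyMapping

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

# read the catalog table as a dynamic frame
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="covid-19", table_name="county_populations"
)

# keep the two columns we care about, cast the estimate to a long,
# and rename it to drop the space (unmapped columns are dropped)
cleaned = ApplyMapping.apply(
    frame=dyf,
    mappings=[
        ("county", "string", "county", "string"),
        ("population estimate 2018", "string", "population_estimate_2018", "long"),
    ],
)

# expose it as a temporary view and query it with spark sql
cleaned.toDF().createOrReplaceTempView("county_populations_clean")
spark.sql("""
    SELECT county, population_estimate_2018
    FROM county_populations_clean
    ORDER BY population_estimate_2018 DESC
    LIMIT 10
""").show()
```
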
787.2 -> and sure enough now county populations
789.279 -> are properly sorted and we can see the
790.88 -> top 10 counties by population in the
793.12 -> covid-19 data set
796.48 -> it's worth noting that this entire demo
798.88 -> was done sitting on an airplane
800.88 -> it was done sitting in visual studio
802.639 -> code
804 -> and so this is showing interactive
805.839 -> sessions running in an ide local to my
808.079 -> macbook
809.36 -> essentially being able to connect to
811.12 -> glue's distributed compute service from
813.519 -> anywhere i have an internet connection
818.48 -> worth calling out again is the
819.76 -> configuration and why jupyter magics
821.92 -> were used to configure interactive
823.519 -> sessions
825.04 -> jupyter magics
826.639 -> are again these snippets of code that
828.48 -> state this is how i want my
830.72 -> session configured when they're run as
832.72 -> the first cell
834.72 -> we take that configuration we
836.72 -> apply it to your glue session
838.72 -> and environment
840.16 -> these are the same parameters as you
841.6 -> would use in a glue job if you're familiar
843.6 -> with those you can pass any
845.199 -> configuration you need to glue including
848.079 -> custom spark config
849.839 -> in this case enabling the kryo
851.36 -> serializer
853.36 -> as long as you do that before running
854.88 -> your first cell of code
856.8 -> your config stays and off you go
859.6 -> if you forget something say you forget
861.44 -> the connection to your database or you
863.68 -> discover you need another database
865.199 -> connection
866.56 -> no problem simply add it to your magic
869.04 -> list
870.959 -> restart the kernel and rerun your code
874 -> and within seconds you have a brand new
876.079 -> cluster with a brand new configuration
878.72 -> and off you go
880.639 -> the other thing worth calling out here
882 -> is this particular configuration assumes
884.32 -> i'm wanting to use glue
885.68 -> streaming
887.04 -> i can use interactive sessions with glue
888.88 -> structured streaming
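
For reference, a first configuration cell along these lines might look like the sketch below; the magic names follow the glue interactive sessions documentation, while the specific values, and the kryo setting passed through %%configure, are illustrative assumptions.

```
# first cell of the notebook: line magics that configure the session
# (the values here are placeholders)
%idle_timeout 30          # stop the session after 30 idle minutes
%glue_version 3.0         # glue runtime version for the session
%worker_type G.1X         # worker size
%number_of_workers 5      # dedicated workers for this session
%streaming                # request a streaming session type, if needed

# a separate %%configure cell can pass job-style parameters, for example
# custom spark config such as the kryo serializer (assumed syntax)
%%configure
{
    "conf": "spark.serializer=org.apache.spark.serializer.KryoSerializer"
}
```
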
892.56 -> now
893.68 -> this is really really great
896.48 -> if i'm a developer and if i love working
898.88 -> in code
900.32 -> however
901.68 -> there are many customers who want to get
904.16 -> the power of glue and the flexibility of
906.079 -> it without getting into the code
909.04 -> so that's where glue studio comes
911.04 -> in glue studio is our visual job
914 -> authoring and monitoring tool you don't
916.16 -> have to write code
918.32 -> you simply have to drag and drop nodes
921.92 -> in glue studio you can preview data at
924.16 -> each step of the way
926.56 -> you get real-time schema inference
929.68 -> and just like the rest of glue
932.399 -> it supports hundreds of connectors
935.199 -> and yes you can still use transforms in
938.399 -> sql or custom code
940.639 -> as needed if you have the capability and
942.56 -> the desire to do so
944.16 -> while not required they do come in handy
946.8 -> when you want to do something that we
948.16 -> don't support directly out of the box
952.959 -> glue studio also offers a
955.12 -> monitoring environment for all of your
957.36 -> jobs in glue
958.8 -> through a single pane of glass
962.639 -> okay let's get into a demo of glue
964.16 -> studio let's go create a job with a
966.639 -> visual source and target
968.399 -> this will pre-populate for us three
970.399 -> nodes a source a destination and apply
973.44 -> mapping
974.639 -> so let's go ahead and select the source
976.32 -> from s3
977.68 -> we'll go to the catalog table that we
979.279 -> created out of the previous demos data
982.48 -> so we called that demo populations
985.44 -> now before i can start a data preview
987.68 -> i have to fill in the rest of my
989.759 -> information for my nodes in this case
992.639 -> my s3 bucket needs a location to write
995.519 -> data to
999.44 -> once we've filled that out we can go
1000.88 -> back to my source bucket and we click
1002.32 -> data preview
1004.959 -> and once we select a role
1006.8 -> preview will start and might take a few
1008.48 -> seconds
1010.16 -> but within a few seconds we will start
1011.839 -> to get a data preview
1014.079 -> data previews are powered by interactive
1015.92 -> sessions the same things we've been
1017.68 -> discussing
1018.8 -> in the last few demos
1024.079 -> data previews are reasonably quick the
1026 -> first one often takes a few more seconds
1028.4 -> to spin up the compute but afterward
1030.559 -> everything is very very quick
1034.4 -> we can see here that it's the same data
1036.24 -> we were working with earlier excellent
1039.28 -> so let's go ahead now
1041.039 -> and
1042 -> you know let's add a transform
1044.959 -> let's go add a pii detection transform
1048.559 -> just for fun
1050.48 -> and let's look for a person's name
1054.08 -> because we can select which ones we want
1056.96 -> no no
1058.08 -> let's just go ahead and do all of them
1060.08 -> let's include all detection types let me
1062 -> just see everything
1063.76 -> let's go preview the data let's see if
1065.36 -> we can figure out
1066.88 -> what has pii
1071.039 -> after giving it a second to load
1074.24 -> again we see the same
1075.919 -> preview
1076.88 -> you'll notice that we're only previewing
1078.4 -> five of the seven fields so let's go
1080.48 -> ahead and select the rest of the fields
1083.039 -> and you can see that detected entities
1085.44 -> says there are counties
1087.919 -> the county column has a person's name in
1090.4 -> it
1091.6 -> okay
1093.039 -> is that really a problem let's go scroll
1094.88 -> over and find out
1101.36 -> you can see that most of the county
1102.559 -> names are names
1104.32 -> mccracken pope
1106.84 -> grant they're county names while
1109.039 -> they're detected by pii as proper names
1111.36 -> they're not
1113.2 -> but let's assume they were let's go
1115.039 -> ahead and redact them
1116.72 -> with a text
1118.48 -> field so we're just going to
1120 -> asterisk them out
1123.12 -> let's preview the data again to make
1124.48 -> sure it was taken and sure enough
1127.6 -> there go the proper names
1130.32 -> any county that was identified as a
1131.84 -> proper name is gone
1134.72 -> awesome now let's continue on and let's
1137.2 -> assume we want to do something with this
1138.96 -> and write this out to our data catalog
1141.919 -> let's move that apply mapping back to
1143.919 -> our detect pii
1145.84 -> you'll see the output schema gets rid of
1147.679 -> the detect struct because we
1150.32 -> don't want that in our output
1152.559 -> we just want the clean data
1154.4 -> without the personally identifiable
1156 -> information
1157.2 -> in this case the county names that you
1159.28 -> know are proper names
1162.96 -> and now that we're done with our
1164.4 -> data we can select a bucket
1166.72 -> we can select a database in a table we
1168.559 -> want to write to in this case we'll call
1170.24 -> it populations cleansed
1173.76 -> and let's go ahead and save
1176.24 -> save our transform ah that's right we need
1178.559 -> to name the job
1180.08 -> let's go ahead and call the job clean
1181.6 -> populations sounds good to me
1184.4 -> and you can go look at the script
1186.32 -> this script is generated dynamically
1188.24 -> every time you change a field or
1190.24 -> property
1191.52 -> in the visual editor
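
The generated script follows the standard glue job skeleton; the sketch below is illustrative rather than the exact generated output, and the database, table, and bucket names are placeholders.

```python
# rough shape of a glue studio generated script (names are placeholders)
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# source node: the catalog table backed by s3
source = glueContext.create_dynamic_frame.from_catalog(
    database="demo", table_name="demo_populations"
)

# transform nodes (apply mapping, pii detection, etc.) would appear here

# target node: write out to s3 (the real generated code may also update the catalog)
glueContext.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-output-bucket/populations_cleansed/"},
    format="parquet",
)

job.commit()
```
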
1194.72 -> once we're satisfied with it we can go
1196.08 -> to job details
1197.52 -> all i'm doing today is enabling auto
1199.12 -> scaling
1200.799 -> we'll save the job
1203.6 -> and once we're happy with it we can
1204.96 -> click run
1207.36 -> now once we've run it we can go to the
1208.799 -> run details page
1210.48 -> to get information on this particular
1212.48 -> run and how it's going
1214 -> and if we wanted to keep going we could
1215.84 -> go to schedules and we can click on
1217.12 -> schedules
1220.08 -> worth calling out in addition to the
1221.679 -> visual editor
1223.76 -> glue studio offers
1225.679 -> glue studio notebooks
1228 -> they're the same notebooks as
1230.72 -> interactive sessions locally
1232.799 -> running in jupyter
1234.559 -> they're free
1236.08 -> serverless and they offer one-click job
1238.559 -> execution and scheduling
1241.6 -> it's the same thing you saw me running
1243.36 -> in visual studio code except it's
1245.12 -> running in jupyter notebooks
1248 -> sitting in glue studio nothing for
1250.159 -> you to host nothing for you to manage
1252.48 -> the same magics
1254 -> that i was showing you earlier apply
1256.08 -> here and run in here
1257.679 -> and of course there is built-in
1258.72 -> monitoring support just like there is
1260.159 -> for glue jobs
1264.799 -> now
1266.24 -> it's all nice and good that you can
1267.6 -> integrate your data in a batch mode
1270.08 -> as we've been seeing but what about your
1271.76 -> other data
1274.159 -> glue has several execution modes we've
1276.48 -> got batch which is your standard job
1279.12 -> that runs at a set schedule
1281.28 -> you've got streaming modes that allow
1283.44 -> glue to run continually
1286.159 -> ingesting data as it hits a stream
1289.36 -> you have event based execution for glue
1292.559 -> that will kick off a glue job based
1294.559 -> off of an event trigger
1296.4 -> and then you have an interactive api
1298.96 -> that allows you to integrate glue into
1300.4 -> interactive applications
1304.08 -> so let's talk about batch for a minute
1306.32 -> why batch
1308.159 -> simply batch is there for
1310.159 -> scale and for reuse it's the thing we
1312.559 -> typically think about when we think data
1314.159 -> integration or etl
1316.799 -> but
1317.6 -> batch goes beyond just a single job
1319.679 -> often it goes into workflows and
1321.76 -> pipelines as well
1323.6 -> so let's talk about workflows
1326.64 -> glue workflows allow you to orchestrate
1328.96 -> jobs
1330.4 -> with glue and with other aws services
1333.679 -> you can use glue spark jobs as well as
1336.08 -> glue python shell
1338.48 -> you can monitor the execution of the
1340.48 -> entire workflow in one place
1343.2 -> you can use triggers
1344.88 -> either schedule based on-demand
1347.6 -> or event-based triggers
1349.76 -> inside your workflows
1352 -> and of course you have easy access to
1353.44 -> monitoring logs the whole reason for
1355.76 -> workflows within glue
1357.84 -> is to encapsulate the entirety of your
1359.76 -> data integration pipeline in one place
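
A minimal boto3 sketch of wiring such a workflow together; the workflow, job, and crawler names and the schedule are placeholders.

```python
import boto3

glue = boto3.client("glue")

# the workflow container
glue.create_workflow(Name="daily-report-workflow")

# a scheduled trigger that starts the first job every night at 2am utc
glue.create_trigger(
    Name="daily-report-start",
    WorkflowName="daily-report-workflow",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",
    Actions=[{"JobName": "clean-populations"}],
    StartOnCreation=True,
)

# a conditional trigger that starts a crawler once the job succeeds
glue.create_trigger(
    Name="crawl-after-clean",
    WorkflowName="daily-report-workflow",
    Type="CONDITIONAL",
    Predicate={"Conditions": [
        {"LogicalOperator": "EQUALS", "JobName": "clean-populations", "State": "SUCCEEDED"}
    ]},
    Actions=[{"CrawlerName": "populations-crawler"}],
    StartOnCreation=True,
)
```
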
1365.039 -> now
1366.08 -> workflows are great
1368.559 -> but oftentimes you're repeating the same
1370.64 -> process
1371.76 -> you might be repeating the same workflow
1374.88 -> for sales for marketing for another
1377.2 -> department
1379.12 -> now these come with their own challenges
1381.84 -> if you're repeating these continually
1384.4 -> you're coding them manually
1386.64 -> you're making mistakes because you have
1388.08 -> to change the same thing over and over
1390.24 -> again
1392.08 -> you have some poor data engineer who has
1393.76 -> to do this 50 times in this job
1396.559 -> and at some point it fails to scale
1400.08 -> so
1401.36 -> how do we avoid that developer spending
1403.2 -> that valuable time
1404.96 -> and how do we avoid the errors and the
1406.72 -> pain
1408.48 -> glue gives us custom blueprints
1411.36 -> these are workflows
1414.84 -> that are templatized
1417.76 -> you simply give glue a script
1420.24 -> a configuration file
1422.4 -> essentially how you want us to
1424.559 -> prepare and launch the environment
1427.2 -> for your workflow
1428.08 -> and a layout file
1430.48 -> to say hey
1432.32 -> what should the inputs and outputs to my
1435.84 -> workflow be
1435.84 -> and we do the rest
1437.919 -> you can then instantiate that workflow
1440.32 -> as necessary
1442.32 -> so your end users can just say cool i
1444.96 -> need my daily report workflow for a new
1447.039 -> report and i can fill in the boxes
1450.32 -> click submit
1451.919 -> and your glue workflow
1454.96 -> creates a copy of itself with the
1456.48 -> correct parameters and does its thing
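
A hedged boto3 sketch of registering a blueprint archive and then instantiating it with parameters; the archive location, role ARN, and parameter names are assumptions, since the parameters are defined by the blueprint's own configuration file.

```python
import json
import boto3

glue = boto3.client("glue")

# register the packaged blueprint (zip containing the layout script and config) once
glue.create_blueprint(
    Name="daily-report-blueprint",
    BlueprintLocation="s3://my-blueprints/daily_report.zip",
)

# each team "fills in the boxes" by starting a run with its own parameters
glue.start_blueprint_run(
    BlueprintName="daily-report-blueprint",
    Parameters=json.dumps({
        "WorkflowName": "sales-daily-report",
        "SourceDatabase": "sales",
        "OutputPath": "s3://my-data-lake/sales/reports/",
    }),
    RoleArn="arn:aws:iam::123456789012:role/GlueBlueprintRole",
)
```
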
1461.6 -> the next mode we talked about is
1462.88 -> streaming
1465.44 -> glue streaming is built on spark
1467.12 -> structured streaming
1468.64 -> it is fully serverless
1470.64 -> and you can build jobs visually
1473.279 -> interactively with interactive sessions
1476 -> or in a more traditional manner like
1477.679 -> you have always done with an ide
1480.08 -> and spark
1481.919 -> you can easily connect from kinesis or
1483.44 -> kafka directly in the visual editor
1486.96 -> so streaming is a great way to move your
1489.039 -> spark jobs from batch
1491.44 -> into a micro batch and streaming method
1493.919 -> because really the difference between
1495.2 -> streaming and batch is only a few lines
1497.52 -> of code
1498.96 -> glue makes it fairly easy to migrate
1501.52 -> from one to the other either visually
1504.24 -> or in a traditional manner
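
A sketch of what a glue streaming job might look like in code, assuming a kinesis-backed catalog table; the database, table, paths, and window size are placeholders.

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# streaming sources are read as a spark data frame rather than a dynamic frame
stream_df = glueContext.create_data_frame.from_catalog(
    database="streaming_db",
    table_name="clickstream_kinesis",
    additional_options={"startingPosition": "TRIM_HORIZON", "inferSchema": "true"},
)

def process_batch(batch_df, batch_id):
    # each micro batch arrives as an ordinary data frame, so the same
    # transforms used in a batch job can be applied here
    if not batch_df.rdd.isEmpty():
        batch_df.write.mode("append").parquet("s3://my-data-lake/clickstream/")

# glue drives spark structured streaming in micro batches
glueContext.forEachBatch(
    frame=stream_df,
    batch_function=process_batch,
    options={
        "windowSize": "60 seconds",
        "checkpointLocation": "s3://my-data-lake/checkpoints/clickstream/",
    },
)
```
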
1508.08 -> the next method we discussed was event
1511.36 -> driven
1512.84 -> integration so how do we let your data
1516.159 -> drive your work
1520.32 -> glue etl now integrates with amazon
1522.799 -> eventbridge
1525.279 -> amazon eventbridge has hundreds of
1527.039 -> built-in sources
1529.039 -> allows connections to custom
1530.4 -> applications and your saas applications
1533.52 -> they provide all of these triggers that
1535.919 -> allow you to kick off a workflow
1539.12 -> that includes glue jobs crawlers and
1542.159 -> other things
1543.279 -> and the whole purpose of this
1545.279 -> is that you let your data
1548.08 -> do the work you might have data that
1550 -> comes in daily
1551.52 -> you know once a day it's sometime
1553.679 -> roughly in the
1554.84 -> evening my clients upload their data to
1558.159 -> my s3 bucket and i have to go get it
1560.559 -> so every night i kick it off and i go
1562.159 -> get the data and i hope it's all there
1564.559 -> with eventbridge
1566.24 -> the second the data lands
1568.96 -> you can go ahead and just get the data
1571.2 -> you don't have to run it on a
1572.799 -> schedule you can do it in response to
1575.6 -> the event of your data landing in your
1577.44 -> s3 bucket
1579.84 -> this allows you to move closer to that
1582.72 -> near real time
1584.96 -> goal that everybody seems to be seeking
1587.679 -> without the great expense and
1590.159 -> hassle of maintaining a continually
1592.32 -> streaming environment when you don't
1594.24 -> likely need one for every use case
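
A hedged sketch of the wiring involved: an event trigger starts the workflow, and an EventBridge rule routes S3 object-created events to it. The bucket, workflow, account, and role names are placeholders, and the bucket is assumed to have EventBridge notifications enabled.

```python
import json
import boto3

glue = boto3.client("glue")
events = boto3.client("events")

# the workflow's start trigger fires on events instead of a schedule
glue.create_trigger(
    Name="start-on-new-data",
    WorkflowName="ingest-client-uploads",
    Type="EVENT",
    Actions=[{"JobName": "ingest-client-uploads-job"}],
    EventBatchingCondition={"BatchSize": 1},
)

# an eventbridge rule matching object-created events for the upload bucket
events.put_rule(
    Name="client-upload-landed",
    EventPattern=json.dumps({
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {"bucket": {"name": ["my-client-uploads"]}},
    }),
)

# point the rule at the glue workflow, with a role that lets eventbridge notify glue
events.put_targets(
    Rule="client-upload-landed",
    Targets=[{
        "Id": "glue-workflow",
        "Arn": "arn:aws:glue:us-east-1:123456789012:workflow/ingest-client-uploads",
        "RoleArn": "arn:aws:iam::123456789012:role/EventBridgeInvokeGlueRole",
    }],
)
```
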
1599.76 -> and finally
1601.279 -> the last method we'd like to talk about
1603.039 -> for integrating glue is
1605.279 -> interactivity using interactive apps
1609.84 -> using glue interactive sessions you can
1612 -> integrate glue directly into your own
1614.799 -> applications using the glue apis
1618 -> this includes glue streaming
1620.799 -> and it allows you to extend glue
1624.08 -> into any application that has access to
1626.48 -> aws
1627.84 -> on-premise
1629.12 -> or in the cloud
1631.44 -> so let's go ahead and see how this might
1632.72 -> be done
1634.4 -> so here i'm going to have a simple bash
1636.96 -> application
1638.64 -> called detect pii
1641.279 -> and this is going to run what amounts to
1642.96 -> the same code we ran earlier
1645.039 -> again on the same table i'm not trying
1646.96 -> to do anything fancy with data
1649.52 -> and this thing's going to go out and
1651.44 -> it's going to run the detect pii systems
1654.48 -> against interactive sessions and it's
1656.159 -> going to ask hey is this actually pii
1659.84 -> well in this table no
1661.84 -> queen anne's maryland is a county
1666.64 -> it's not pii
1669.12 -> so we're just gonna put no
1671.52 -> what about castro texas is that
1673.76 -> a name
1674.96 -> no again it's a county so no
1678.48 -> and so on and so forth
1680.48 -> and one south dakota now then there's no
1682.559 -> pii in here
1684.72 -> at this point i don't think there is pii
1686.64 -> so i'm going to exit
1689.2 -> now
1690.72 -> if there was pii i could have said yes
1693.76 -> and something else in my application
1695.2 -> could have happened
1696.399 -> such as
1697.44 -> locking it down inside lake formation or
1700.24 -> notifying the data owner that they have
1701.919 -> potential pii within their data
1705.6 -> while automatic detection is great and
1707.6 -> extremely useful there are times that
1710.64 -> auditing is appropriate
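
Under the hood, an application like this drives the interactive sessions API; a minimal boto3 sketch might look like the following, with the role ARN, session id, and the submitted statement as placeholders.

```python
import time
import boto3

glue = boto3.client("glue")

# create a dedicated serverless session for the application
glue.create_session(
    Id="pii-audit-session",
    Role="arn:aws:iam::123456789012:role/GlueInteractiveSessionRole",
    Command={"Name": "glueetl", "PythonVersion": "3"},
    IdleTimeout=15,
)

# wait for the session to become ready before submitting work
while glue.get_session(Id="pii-audit-session")["Session"]["Status"] != "READY":
    time.sleep(5)

# submit a statement and poll until it finishes
stmt = glue.run_statement(
    SessionId="pii-audit-session",
    Code='print(spark.sql("SELECT count(*) FROM `covid-19`.county_populations").collect())',
)
while True:
    result = glue.get_statement(SessionId="pii-audit-session", Id=stmt["Id"])["Statement"]
    if result["State"] in ("AVAILABLE", "ERROR", "CANCELLED"):
        break
    time.sleep(2)

print(result.get("Output"))
glue.delete_session(Id="pii-audit-session")
```
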
1714.64 -> so this is all nice and great but how
1716.159 -> does glue work how does glue do all of
1718.48 -> these things and scale
1720.48 -> across all of these personas and all of
1722.08 -> these workflows and use cases
1725.84 -> glue runs a scalable execution model
1728.72 -> where a job is kicked off
1731.279 -> by a job manager
1733.2 -> and the job is divided into stages
1736 -> and the data is divided into partitions
1739.76 -> and the job manager takes those stages
1743.12 -> and pairs them with partitions to
1744.96 -> schedule tasks
1747.039 -> on a worker or a node of compute
1750.64 -> and glue serverlessly scales to
1752.88 -> thousands of workers
1754.72 -> and executes these tasks in parallel on
1758.08 -> the workers
1759.52 -> and of course you only pay for the
1760.72 -> compute that's used
1763.44 -> glue announced auto scaling at
1765.52 -> re:invent and we launched it at the san
1767.2 -> francisco summit
1770 -> and so
1771.039 -> now in addition to just
1774.08 -> setting a serverless compute you check
1775.679 -> the box as i showed earlier
1777.84 -> and we will only provision the compute
1779.44 -> you need
1780.64 -> for any given task
1785.12 -> so
1787.2 -> let's dive deeper real quick into
1788.96 -> integrating at scale
1791.679 -> because
1792.559 -> scaling challenges are hard
1794.88 -> you've got different business events
1797.12 -> schema and size variants
1799.279 -> source variants
1800.799 -> and this again all impacts your cost and
1802.799 -> capacity
1806.159 -> you've got resource prediction problems
1808.799 -> tuning's hard and yeah
1811.919 -> everyone typically over provisions
1813.919 -> because the last thing you want is for
1815.44 -> it to fail i'd rather pay 20 percent more than
1818.48 -> have to be woken up in the middle of the
1819.84 -> night
1823.679 -> so that's where auto scaling comes in
1825.84 -> as discussed auto scaling is a check box
1828.96 -> it reduces cost
1831.36 -> simply put you check the box
1834.24 -> and you don't have to think about the
1835.2 -> capacity planning
1837.039 -> you enable auto scaling
1839.279 -> you set the maximum number of workers
1840.96 -> you want to allocate to your auto
1842.24 -> scaling job
1844.399 -> and you let it go
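
Outside the console checkbox, the same setting can be applied when defining a job through the API; a sketch assuming glue 3.0 and the documented --enable-auto-scaling job parameter, with the job, script, and role names as placeholders.

```python
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="clean-populations",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-scripts/clean_populations.py",
        "PythonVersion": "3",
    },
    GlueVersion="3.0",
    WorkerType="G.1X",
    NumberOfWorkers=50,  # the ceiling; auto scaling provisions only what each stage needs
    DefaultArguments={"--enable-auto-scaling": "true"},
)
```
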
1850.64 -> so let's go ahead and look at how auto
1852.32 -> scaling works within a multi-stage job
1854.799 -> if we have a jdbc source
1856.88 -> and we want to read from it apply a
1858.399 -> custom transform a simple mapping and
1860.799 -> write it out
1863.6 -> you're going to have a few connections
1864.88 -> when you start the jdbc job then you're
1867.039 -> going to have high parallelization after
1868.96 -> the data is read into memory in spark
1871.84 -> now
1872.799 -> in a traditional cluster without auto
1874.96 -> scaling
1876 -> you would have to spin up all the
1877.12 -> workers first
1879.44 -> then you would have to do
1881.039 -> the small read
1882.88 -> and the big writes and you'd have a
1884.08 -> bunch of idle compute
1885.6 -> with auto scaling we only have to
1887.679 -> provision the workers that need to do
1889.039 -> the read when it's time to do the read
1892.08 -> when it's time to do the advanced
1893.44 -> transforms we can scale up as needed
1896.88 -> not before
1899.279 -> and that way you're saving yourself a
1902.24 -> significant amount of compute
1904.08 -> and a good potential cost savings
1910.24 -> now all of this is nice and good
1913.76 -> but if i've got so much more data
1916 -> that i can interact with in my company
1918 -> how can i unify that how can i
1920.32 -> centralize
1921.919 -> all of my data and centrally govern it
1925.84 -> glue data catalog
1927.76 -> is our mechanism for central data
1931.12 -> organization
1932.88 -> it is a meta store for data lakes
1935.2 -> it's highly scalable and durable it's
1937.36 -> extremely cost effective it offers
1939.679 -> security compliance and auditing
1941.039 -> capabilities and it is hive meta store
1943.44 -> compliant
1944.559 -> with an open source
1946.72 -> connector to attach all of your favorite
1948.88 -> hive compatible systems
1955.2 -> now the easiest way to get data into your
1957.6 -> catalog
1959.279 -> is to use a glue crawler
1961.679 -> glue crawlers
1963.919 -> are little serverless applications that
1966 -> connect to your data sources
1969.12 -> automatically discover your schema and
1970.96 -> extract that schema
1975.279 -> and then they write that schema into the
1975.279 -> data catalog
1976.64 -> every time they re-read your data source
1979.279 -> they will update your schema as needed
1981.6 -> inside the catalog
1984.72 -> you're able to specify your own
1986.159 -> classifiers for custom files
1988.559 -> this is a big deal because you can even
1990.88 -> use glue crawler for those old school
1993.679 -> flat files
1995.12 -> with classifiers and grok
1997.36 -> this can allow you to remove some of
1999.039 -> your legacy systems that currently exist
2001.519 -> because it's really hard to deal with
2003.279 -> this old data
2004.72 -> and of course crawlers can run on demand
2007.6 -> as part of a schedule or part of a
2009.2 -> workflow
2010.32 -> or in response to events
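
A minimal boto3 sketch of creating and running a crawler over an S3 prefix; the names, path, role, and schedule are placeholders.

```python
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="populations-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="demo",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/populations/"}]},
    SchemaChangePolicy={"UpdateBehavior": "UPDATE_IN_DATABASE", "DeleteBehavior": "LOG"},
    Schedule="cron(0 3 * * ? *)",  # or omit and run on demand, from a workflow, or on events
)

glue.start_crawler(Name="populations-crawler")
```
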
2015.2 -> event based crawlers on s3 for example
2019.279 -> only crawl when a new object hits
2021.679 -> remember there's really no such thing as
2023.2 -> an update in
2024.88 -> s3 it's a new object
2027.279 -> or not
2028.72 -> so
2029.76 -> you can have a broader coverage
2032.48 -> with crawlers because you can say hey
2034.399 -> monitor all of these
2036.24 -> folders these prefixes in s3 let
2038.32 -> me know when anything changes
2040.96 -> and because we're not having to list the
2042.48 -> buckets crawlers become faster and more
2044.88 -> cost effective
2046.96 -> making it easier to maintain your data
2048.8 -> lake
2049.599 -> at every stage in your data journey
2052.56 -> now
2053.52 -> we also offer lake formation
2055.919 -> for fine grain access control
2058.399 -> and while i'm not going to go too deep
2059.76 -> into lake formation i did want to call
2061.28 -> out governed tables
2062.879 -> lake formation governed tables are
2064.56 -> a table type on top of s3
2067.76 -> that provides atomic transactions
2070.96 -> on top of your tables
2072.639 -> so that you can atomically maintain and
2075.2 -> update
2076.24 -> your objects
2077.76 -> you can automatically compact your data
2080 -> as necessary to make sure it's
2081.599 -> performant
2082.96 -> for these distributed analytics
2084.48 -> applications such as amazon athena
2086.879 -> it allows you time travel in your table
2089.359 -> so you can look at the table
2091.44 -> and the data as it existed at any given
2093.44 -> point in time
2095.44 -> this allows you to do
2097.839 -> cleaner testing
2099.52 -> of
2100.8 -> ml models and training it allows you a
2102.96 -> better understanding of when your data
2105.2 -> has come in and how it's come in and be
2106.64 -> able to compare what's changed when
2110.48 -> there are a lot of advantages to
2112.48 -> automatic time travel within your tables
2117.92 -> thank you all for coming i hope you've
2119.359 -> learned something
2120.64 -> i look forward to seeing what you guys
2122.72 -> build on aws glue
2125.04 -> thank you very much
2130.68 -> [Music]

Source: https://www.youtube.com/watch?v=cdk_6bpmZYE