
AWS re:Invent 2020: Zero-code data preparation with AWS Glue DataBrew
Data preparation is a critical step to get data ready for analytics or machine learning. As data continues to grow in size and complexity, you need to expand the number of people preparing and unlocking value in your data. In this session, dive deep into AWS Glue DataBrew, a new visual data preparation tool that enables data analysts and data scientists to clean and normalize data without writing code. See a walkthrough of how AWS Glue DataBrew works, popular use cases, and best practices for data preparation across all your data stores.
Content
Welcome to AWS re:Invent 2020 and to this session on zero-code data preparation with AWS Glue DataBrew. My name is Surbhi Dangi, and I'm a senior product manager for DataBrew.
Today our customers are using data at an unprecedented pace for analytics and machine learning. With the explosion in the number of data sources and the types of data flowing in, data preparation can cost businesses valuable time and resources if not done efficiently.

With that, we have an exciting agenda for today's session. We're going to cover the challenges that customers face with data preparation, how customers can take advantage of self-service data preparation with DataBrew in an organization, and key features of the tool. We'll take DataBrew for a spin with an interactive demo, and then wrap up with some of the most common use cases our customers are using DataBrew for.
To take a step back: with a constant influx of information, customers are scrambling to capture all the right data points, organize and process the data, and convert it into actionable insights. Yet in the hustle to transform raw data into usable information, we tend to forget how sophisticated and challenging the data preparation process is, and how many individuals in an organization it encompasses.

The way this came about is that a lot of customers tell us they love the scalability and flexibility of code-based data preparation in services like AWS Glue today, but they would also benefit from allowing business users, data analysts, and data scientists to visually explore and experiment with data independently, without writing any code.
So let's take a look at what data preparation looks like today. Preparing data for analytics and machine learning involves several necessary and time-consuming tasks. These include extracting data, cleaning it, normalizing it to specific forms, loading it into databases, data warehouses, and the data lake, and finally orchestrating scalable ETL workflows.
For extracting, orchestrating, and loading data at scale, data engineers and ETL developers, who are typically skilled in SQL or programming languages like Python and Scala, prepare their data in code-based environments. We've seen ETL developers love the newer, more visual interfaces of modern ETL tools, and so AWS recently introduced AWS Glue Studio, a new visual interface that helps ETL developers author, run, and monitor ETL jobs without having to write code. But once the data has been reliably moved, the underlying data still needs to be prepared: it still needs to be cleaned and normalized. The data scientists and data analysts that operate in lines of business understand this data best and want to get their hands on it. In the end, code-based heavy lifting is still required today to prepare data at scale.
In an effort to spot anomalies in the data, we have highly skilled data engineers and ETL developers writing custom workflows and custom code to pull data from different sources and pivot, transform, slice, and dice the data multiple times before they can iterate with data analysts and data scientists to identify what needs to be fixed in the data. So you'll find data engineers and ETL developers, and of course the data analysts and data scientists who are very familiar with this data, often working together to determine the kinds of transformations needed further upstream in the data preparation process. After they've developed these transforms, data engineers and ETL developers still need to schedule custom workflows on an ongoing basis so that new incoming data can automatically be transformed.
This is a pretty elaborate process, where each time a data analyst finds a mistake in a specific cell in the data, or finds an anomaly or a pattern mismatch, they go back to data engineering and the ETL developers; and each time a transformation needs to be changed or added, everyone needs to get involved again and all of these data preparation tasks need to be orchestrated again. This iterative process, as you can imagine, can take several weeks or sometimes even months to complete. As a result, customers end up spending up to 80 percent of their time cleaning and normalizing data instead of analyzing the data and extracting value from it.
To summarize, there are several challenges with traditional data preparation today. First, traditional data preparation at scale is still largely manual. It's done by cherry-picking small samples of the data in Excel sheets or in Jupyter notebook environments, and it needs a lot of code-based heavy lifting to really work at scale. With more and more data sitting in your data lake or your data warehouses, and different kinds of data, structured and unstructured, data from logs, data from third-party applications, it's becoming more and more important to have ad hoc cleaning as part of the data preparation process.

Second, we notice that data arrives in many different formats: compressed, partitioned, periodic, transactional. Oftentimes, to enable business users, large amounts of that data still need to be moved into silos or even outside the company's virtual private cloud, or VPC, to be prepped or even just explored for analysis. This causes security concerns at times for customers.
And finally, as you go further upstream in the data consumer chain, the transformations are very specific to the context and meaning of the data. A specific cell might be out of place, or just a specific format of the data might be out of place, and these spot fixes need to be made. In this kind of dependent process there is a low factor of reuse of work: when a data analyst spots an anomaly, can they, number one, be empowered to go ahead and fix that anomaly, and number two, can the data engineering team, or the data source, actually take in the fix that's already been done by the data analyst? Tooling today is very specific to point use cases and does not look at data preparation as a whole and at how an organization can be empowered by it with some factor of reusability.
In the spirit of being customer obsessed, we listened to our customers, listened to their pain points, understood their challenges in data preparation, and launched AWS Glue DataBrew just a few weeks ago. For those of you who are not very familiar with AWS Glue, it is our serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. AWS Glue provides both visual and code-based interfaces to make overall data integration easier.
Specifically, today I'm going to share more about AWS Glue DataBrew. DataBrew is a new visual data preparation tool that enables data analysts and data scientists to clean and normalize data without writing any code. It provides these users with powerful capabilities to explore, manipulate, and merge new data sources, all without the assistance of code, and helps them self-serve the data preparation tasks that they come across in their work.

When we talk about data analysts and data scientists, it's really important to understand that they come in many teams, roles, and job functions: analysts such as legal analysts, marketing analysts, financial analysts, and environmental analysts, as well as data scientists such as researchers working on vaccine development, for example, whose core job function is not to learn Python but really to work with data, understand the patterns, and analyze the data. For all of these personas and users, DataBrew provides an easy-to-use platform to prepare their data directly from data lakes.
With Glue DataBrew, end users can easily access and visually explore data across their data stores, including your Amazon S3 data lake, your Amazon Redshift data warehouse, and your Amazon RDS databases. You can also upload a file from your local disk, which we've seen is a very common use case amongst the analyst persona, or even connect to data coming from third-party systems like Salesforce, Marketo, and so on. Customers can also choose from over 250 built-in transformations. These transformations really help you combine, pivot, and transpose data without writing any code.
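As a rough illustration of what connecting one of these sources looks like outside the console, here is a minimal boto3 sketch that registers an existing Glue Data Catalog table (for example, one pointing at Redshift or RDS) as a DataBrew dataset. The database, table, bucket, and other names are placeholders, and the exact parameter set should be checked against the boto3 DataBrew reference.

```python
import boto3

databrew = boto3.client("databrew")

# Register an existing Glue Data Catalog table (e.g. backed by Redshift or RDS)
# as a DataBrew dataset. All names below are placeholders for illustration.
databrew.create_dataset(
    Name="sales-catalog-dataset",
    Input={
        "DataCatalogInputDefinition": {
            "DatabaseName": "analytics_db",   # Glue Data Catalog database
            "TableName": "sales_orders",      # catalog table to expose to DataBrew
            "TempDirectory": {                # scratch S3 location for JDBC-backed sources
                "Bucket": "my-databrew-temp-bucket",
                "Key": "tmp/",
            },
        }
    },
)
```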
DataBrew recommends cleaning and normalization steps: things like filtering anomalies in your data, recognizing and dealing with invalid or misclassified values, removing duplicate data, as well as standardizing data, for example date-time formats or even just uppercase to lowercase characters.
DataBrew also comes with a set of transformations for data science. These are really helpful because we use advanced machine learning techniques like natural language processing directly within the transformations, so you can, for example, convert the words "yearly" or "year-long" to "year". It really helps you convert common words to their base or root form, and this is just one example of the data science transformations that we provide.

You can save these cleaning and normalization steps into a workflow, which is typically called a recipe, and this recipe can then automatically be run on future incoming data. As you can see, there are a lot of operations and transformations you can perform on your data using DataBrew.
Another important aspect of this is being able to understand what changed, at what point in time, and by whom: which transformations were applied on which datasets, and how the data changed and evolved with the usage of DataBrew. So we provide a very visual lineage view for all operations that happen on the data.
And lastly, being on the data lake, scale is important. Being able to operate at scale and to schedule these transformations to run repeatedly on new incoming data is part of DataBrew. All of DataBrew's actions are serverless: DataBrew is fully serverless and fully managed, so you don't have to configure, provision, or manage any resources to operate DataBrew.
With this, we're going to switch over to a quick demo of the product and then cover some popular use cases.

DataBrew can be accessed through the Analytics category on the AWS Management Console. When you land on the console, you can see the way to create a project. There are some key entities in DataBrew, such as datasets, projects, recipes, and jobs. You can go ahead and create a dataset when you land on the DataBrew console: you can browse your S3 buckets, connect through the Glue Data Catalog to your Redshift tables or RDS tables, or even upload a file from your local disk. In this case, I'm going to take a New York City bike dataset and just look at what this dataset contains and what columns exist in it. Once I've created a dataset, I can not only take a look at the sample but also understand 40 to 50 statistics about how this data is laid out: things like duplicate values, missing values, as well as correlations for all numerical values in this dataset.
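For reference, the same flow the console walks through here (create a dataset from S3, then profile it) can also be driven through the DataBrew APIs. Below is a minimal boto3 sketch under assumed bucket, key, and IAM role names; treat all parameter values as placeholders.

```python
import boto3

databrew = boto3.client("databrew")

# 1. Register a CSV file in S3 as a DataBrew dataset (names are placeholders).
databrew.create_dataset(
    Name="nyc-bike-trips",
    Format="CSV",
    Input={"S3InputDefinition": {"Bucket": "my-data-lake", "Key": "bike/trips.csv"}},
)

# 2. Create a profile job that computes the per-column statistics
#    (missing values, duplicates, correlations, ...) and writes them to S3.
databrew.create_profile_job(
    Name="nyc-bike-trips-profile",
    DatasetName="nyc-bike-trips",
    RoleArn="arn:aws:iam::123456789012:role/DataBrewServiceRole",  # assumed role
    OutputLocation={"Bucket": "my-databrew-output", "Key": "profiles/"},
)

# 3. Kick off a run of the profile job.
run = databrew.start_job_run(Name="nyc-bike-trips-profile")
print(run["RunId"])
```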
I'm starting to get a picture of this dataset. This dataset, for example, has a bike ID column, so I'm interested in the cardinality to make sure the rows are unique, and I also want to make sure there's no anomaly in the bike IDs, so I'll take a quick glance. Then I also look at the user type to understand how many user types even exist in this dataset. Once I'm done, I can switch over to the lineage view and understand how many projects are connected to this dataset and what jobs are running. In this case, I'm going to go ahead and actually open this dataset up to do some data preparation. When I click on the project, I land in a visual working window where we're working with a first-n sample of this dataset.
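The visual working window corresponds to a DataBrew project, which ties together a dataset, a recipe, and a sample. A minimal boto3 sketch of creating such a project is below, assuming the dataset from the previous sketch, a recipe named "nyc-bike-recipe" (a sketch of building one follows the column-merge step later in the demo), and a placeholder IAM role.

```python
import boto3

databrew = boto3.client("databrew")

# A project working on the first 500 rows of the dataset, mirroring the
# sample-based editing shown in the console. All names are placeholders,
# and "nyc-bike-recipe" is assumed to exist already (see the recipe sketch
# after the column-merge step below).
databrew.create_project(
    Name="nyc-bike-project",
    DatasetName="nyc-bike-trips",
    RecipeName="nyc-bike-recipe",
    RoleArn="arn:aws:iam::123456789012:role/DataBrewServiceRole",
    Sample={"Type": "FIRST_N", "Size": 500},
)
```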
As you can see, when I click on column headers, I can really understand detailed statistics about each individual column. I notice we have latitude and longitude columns, and maybe for my analysis I want to combine these two columns, so I just go ahead and initiate a transformation to merge them with a separator and rename the column. This gives me a real-time preview of what my column is going to look like when I apply the transformation, which is a really easy way to visually understand the impact of your transformation and what it's doing to your data.
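Each of these visual edits becomes a step in the project's recipe. As a rough sketch of what that looks like through the API, the snippet below creates a recipe with a column-merge step followed by a rename; the operation and parameter names are illustrative assumptions and should be verified against the DataBrew recipe actions reference.

```python
import boto3

databrew = boto3.client("databrew")

# Illustrative recipe steps: merge two columns with a separator, then rename
# the result. Operation/parameter names here are assumptions; check the
# DataBrew recipe actions reference for the exact spelling.
steps = [
    {
        "Action": {
            "Operation": "MERGE",
            "Parameters": {
                "sourceColumns": '["start_station_latitude","start_station_longitude"]',
                "delimiter": ",",
                "targetColumn": "lat_long",
            },
        }
    },
    {
        "Action": {
            "Operation": "RENAME",
            "Parameters": {
                "sourceColumn": "lat_long",
                "targetColumn": "start_location",
            },
        }
    },
]

databrew.create_recipe(Name="nyc-bike-recipe", Steps=steps)
```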
Next, I'm going to look at some aggregates for this data, just to understand it. I'm interested in understanding, based on user type, how many bike hours, or what the trip duration is, for each type. I can also look at other analyses that I can add onto this dataset. Once I select my attributes, I can see an immediate preview of this analysis. This is really helpful, especially when you're preparing data for machine learning and you're really trying to understand or engineer the features for your machine learning models.
In this case it looks good: I have my customer and subscriber values, so I'm going to pull those columns back into my visual working window. Once I've done that, as you can see on the right-hand side, every time I've applied a transformation it's been added to what we call the recipe. One more transformation I'm going to do is for my machine learning models: I would rather give them numerical values. So I quickly went ahead and looked at the two values in my sample, confirmed that only those two values exist in my population, and then mapped both values. Let's say there was actually a "lifetime" value that didn't show up in my sample but existed in my population: I can quickly check between the profile view and the grid view and make sure it's all solid. One of the last things: my analysis is incomplete without looking at maintenance data.
So what I'm doing is bringing in bike maintenance data via a join, specifying my join keys, and looking at a preview. What this helps me do is really bring in datasets from disparate sources: I joined a CSV with an Excel file here, just to understand how a recent maintenance affects the trip duration for the bike. This is a really great way to experiment with all kinds of joins: you can look at inner joins, outer joins, and so on.
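Joining across disparate sources works because each source is first registered as its own DataBrew dataset. Here is a hedged boto3 sketch of registering the Excel maintenance file; the Excel format options shown (sheet name, header row) are assumptions, and all names are placeholders.

```python
import boto3

databrew = boto3.client("databrew")

# Register the Excel maintenance file as a second dataset so it can be joined
# with the CSV trip data inside the project. Names and options are illustrative.
databrew.create_dataset(
    Name="bike-maintenance",
    Format="EXCEL",
    FormatOptions={"Excel": {"SheetNames": ["maintenance"], "HeaderRow": True}},
    Input={
        "S3InputDefinition": {
            "Bucket": "my-data-lake",
            "Key": "bike/maintenance.xlsx",
        }
    },
)
```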
This looks quite good, but clearly I've messed up my schema a little bit, so I can go into the schema view, double-click on columns, and rename them, change them, hide them, or delete them as required. Once I'm ready, I can go ahead and rename this column, and this will get added as a recipe step. Once again, when I'm ready with my recipe, I can review the changes I've made, I can even switch samples around, and really make sure that this recipe I've created actually applies to my dataset and works well. Once I'm done, remember we've been working on just a first-n, 500-row sample; I can now go ahead and apply this recipe to the full population data. So now I'm running it on the actual dataset, and what I did was just specify an S3 location for where to store my output.
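Applying the recipe to the full dataset corresponds to a recipe job. The boto3 sketch below creates a job from the project and runs it, writing the transformed output to an assumed S3 location; the names, role, and output format are placeholders.

```python
import boto3

databrew = boto3.client("databrew")

# Create a recipe job from the project so the recipe built on the 500-row
# sample is applied to the full dataset. Output location/format are assumed.
databrew.create_recipe_job(
    Name="nyc-bike-clean-job",
    ProjectName="nyc-bike-project",
    RoleArn="arn:aws:iam::123456789012:role/DataBrewServiceRole",
    Outputs=[
        {
            "Location": {"Bucket": "my-databrew-output", "Key": "bike/cleaned/"},
            "Format": "PARQUET",
        }
    ],
)

# Run it against the full data.
databrew.start_job_run(Name="nyc-bike-clean-job")
```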
So in a nutshell, with all of that exploration, I basically took a dataset from S3, joined it with another dataset, created some transformations, ran a profile, and produced an output. This visual explorer really helps you understand and keep track of all the changes that are happening in your dataset. This is one of the most common workflows in DataBrew, and you can go ahead and experiment with data and develop interesting use cases on this foundation.
Next, we're going to do a quick recap of what you saw in the demo. In this demo you saw how we could profile a large dataset, use those statistics to inform transformations on the sample that you're visually exploring, and then apply these transformations back to the large dataset. In addition, you can also operationalize these at scale. Recipes are reusable: you can publish them, so two data analysts who are collaborating can each publish recipes in the account and import them back into a project for reuse. You can also use APIs to perform all the actions that you saw in the UI and really schedule jobs, so every time a new dataset lands in an S3 bucket, you can parameterize the input file name to pick up that latest dataset and then just schedule a job to perform these transformations pretty easily and orchestrate them at scale.
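Two of the API calls mentioned here are publishing a recipe for reuse and putting a job on a schedule. The boto3 sketch below shows both; the cron expression, description, and names are assumptions for illustration.

```python
import boto3

databrew = boto3.client("databrew")

# Publish the working recipe so other analysts can import and reuse it.
databrew.publish_recipe(
    Name="nyc-bike-recipe",
    Description="Cleans and joins the NYC bike trip and maintenance data",
)

# Run the recipe job every day at 06:00 UTC so newly arriving data is
# transformed automatically. The cron expression is illustrative.
databrew.create_schedule(
    Name="nyc-bike-daily",
    JobNames=["nyc-bike-clean-job"],
    CronExpression="cron(0 6 * * ? *)",
)
```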
Now that you have a glimpse of what DataBrew looks like, we're going to talk through some common use cases that we've seen customers use DataBrew for.

One of the most popular is ad hoc experimentation and data exploration for business reporting. Data analysts commonly bring in data either from a local file, like an Excel spreadsheet on their computer (you'd be surprised how many spreadsheets still get passed around in organizations), or from data that's been made available to them through the Glue Data Catalog. They connect to this data, perform the analysis in DataBrew, and then put the files back into S3. From S3 you can create a SPICE dataset in Amazon QuickSight, which then automatically pulls in new data as it lands and refreshes your business report. So this is a really great use case for your common monthly, weekly, or daily reports, as well as the ad hoc analyses you need to do when you first get hold of a dataset and need to get familiar with it. Things like enriching data with joins and unions are common. Let's say you have monthly data coming in and you want to produce a yearly report: just specify a folder in DataBrew. Let's say you want to update the schema in bulk: again, create a recipe and then use that recipe on different datasets to harmonize all of the schemas.
Another popular use case we've seen is data quality, and we see this very commonly with customers that get a lot of data feeds from different providers and are really looking to set up business rules. What they do is, every time a dataset comes in, they first profile it with DataBrew, and then, using EventBridge and Lambda functions, they code up specific business rules that apply to the data. Some interesting scenarios here: you may want to compare the profiles of two datasets to understand the difference if something is wrong, or you may want to determine whether a specific dataset needs to be marked for manual review. These are very interesting use cases because, with a variety of datasets now in the system, in different shapes, sizes, and formats, data quality and its practice in the organization become increasingly important, so data profiles, as well as all the statistics around them, become critical.
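As a rough sketch of the EventBridge-plus-Lambda pattern described here, the handler below reads a DataBrew profile output from S3 and applies a simple business rule. The event shape, the profile JSON field names, and the threshold are all hypothetical placeholders; the real profile output layout is documented in the DataBrew developer guide.

```python
import json
import boto3

s3 = boto3.client("s3")

# Hypothetical rule: flag the dataset for manual review if more than 5% of
# values in any column are missing, according to the profile job output.
MISSING_RATIO_THRESHOLD = 0.05


def handler(event, context):
    # Assumed event shape: an EventBridge rule passes the S3 location of the
    # profile JSON that the DataBrew profile job wrote.
    bucket = event["profileBucket"]
    key = event["profileKey"]

    profile = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())

    flagged = []
    # "columns", "name", "missingValuesCount", and "rowCount" are placeholder
    # field names used only to illustrate the rule; check the actual schema.
    for column in profile.get("columns", []):
        missing = column.get("missingValuesCount", 0)
        rows = profile.get("rowCount", 1) or 1
        if missing / rows > MISSING_RATIO_THRESHOLD:
            flagged.append(column.get("name"))

    if flagged:
        # In a real setup you might publish to SNS or tag the dataset here.
        print(f"Dataset needs manual review; columns over threshold: {flagged}")
    return {"needsReview": bool(flagged), "columns": flagged}
```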
Pre-processing data for machine learning has also been very critical, even in training machine learning models. Customers typically bring data from S3, look at that data in DataBrew, and feature engineer it really easily: with DataBrew you can categorically map data, tokenize text columns, and normalize them, really doing a data sanity check before it goes into a machine learning model. A lot of folks use DataBrew as a plugin to their Jupyter notebooks: directly within the notebook environment they do the data prep in one tab with DataBrew and then continue coding in their notebook environment on another tab. So take a look at the DataBrew JupyterLab plugin for accelerating some of the machine learning pre-processing that happens with data today.
One of the final use cases I want to talk about is orchestrating data preparation workflows. There's a lot of ad hoc, one-off data preparation, but the real efficiencies are driven when you can reuse a lot of the data preparation that individuals are doing in different teams. So what you can do is take data that's sitting in Redshift, for example, and bring it into DataBrew; then, once the data is prepared in DataBrew and you run the recipe, you can trigger, say, a Glue crawler to put that data back into Redshift and into your Data Catalog. Of course, this is just one such example of orchestrating data preparation, but you can build similar workflows orchestrated via Step Functions or even Lambda functions.
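A minimal sketch of that chaining with boto3 is shown below: run the DataBrew recipe job, wait for it to finish, then kick off a Glue crawler so the prepared output shows up in the Data Catalog. The job, crawler, and state names are assumptions, and in practice you would drive this from Step Functions or an EventBridge-triggered Lambda rather than a polling loop.

```python
import time
import boto3

databrew = boto3.client("databrew")
glue = boto3.client("glue")

# Start the DataBrew recipe job that writes prepared data back to S3.
run_id = databrew.start_job_run(Name="nyc-bike-clean-job")["RunId"]

# Poll until the run finishes (a Step Functions state machine or an
# EventBridge rule would normally replace this loop).
while True:
    state = databrew.describe_job_run(Name="nyc-bike-clean-job", RunId=run_id)["State"]
    if state not in ("STARTING", "RUNNING", "WAITING"):
        break
    time.sleep(30)

# On success, crawl the output location so it lands in the Glue Data Catalog.
if state == "SUCCEEDED":
    glue.start_crawler(Name="databrew-output-crawler")  # assumed crawler name
```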
With this, I want to leave you with the availability and how to use DataBrew. DataBrew is generally available. It's accessible via the AWS Management Console, APIs, as well as a plugin for Jupyter notebooks, and as you can see, getting started with DataBrew is very, very easy: you can try one of our sample datasets or even bring your own datasets and start brewing. Feel free to reach out to us at databrew-feedback@amazon.com, and thank you so much for taking the time to attend this session. I hope you enjoyed it and learned something today.
Source: https://www.youtube.com/watch?v=S1fmtDHB0Qs