
AWS re:Invent 2020: Zero-code data preparation with AWS Glue DataBrew
Data preparation is a critical step to get data ready for analytics or machine learning. As data continues to grow in size and complexity, you need to expand the number of people preparing and unlocking value in your data. In this session, dive deep into AWS Glue DataBrew, a new visual data preparation tool that enables data analysts and data scientists to clean and normalize data without writing code. See a walkthrough of how AWS Glue DataBrew works, popular use cases, and best practices for data preparation across all your data stores.
Content
Welcome to AWS re:Invent 2020 and to this session on zero-code data preparation with AWS Glue DataBrew. My name is Surbhi Dangi, and I'm a senior product manager for DataBrew.
Today our customers are using data at an unprecedented pace for analytics and machine learning. With the explosion in the number of data sources and the types of data flowing in, data preparation can cost businesses valuable time and resources if not done efficiently.

With that, we have an exciting agenda for today's session. We're going to cover the challenges that customers face with data preparation, how customers can take advantage of self-service data preparation with DataBrew in an organization, and key features of the tool. We'll take DataBrew for a spin with an interactive demo, and then wrap up with some of the most common use cases our customers are using DataBrew for.
To take a step back: with a constant influx of information, customers are scrambling to capture all the right data points, organize and process the data, and convert it into actionable insights. Yet in the hustle to transform raw data into usable information, we tend to forget how sophisticated and challenging the data preparation process is, and how many individuals in an organization it encompasses.

The way this came about is that a lot of customers tell us they love the scalability and flexibility of code-based data preparation in services like AWS Glue today, but they would also benefit from allowing business users, data analysts, and data scientists to visually explore and experiment with data independently, without writing any code.
So let's take a look at what data preparation looks like today. Preparing data for analytics and machine learning involves several necessary and time-consuming tasks. These include extracting data, cleaning it, normalizing it to specific forms, loading it into databases, data warehouses, and the data lake, and finally orchestrating scalable ETL workflows.
For extracting, orchestrating, and loading data at scale, data engineers and ETL developers, who are typically skilled in SQL or programming languages like Python and Scala, prepare their data in code-based environments. We've seen ETL developers love the newer, more visual interfaces of modern ETL tools, and so AWS recently introduced AWS Glue Studio, a new visual interface that helps ETL developers author, run, and monitor ETL jobs without having to write code. But once the data has been reliably moved, the underlying data still needs to be prepared: it still needs to be cleaned and normalized. The data scientists and data analysts that operate in lines of business understand this data best and want to get their hands on it. In the end, code-based heavy lifting is still required today to prepare data at scale.
In an effort to spot anomalies in the data, we have highly skilled data engineers and ETL developers writing custom workflows and custom code to pull data from different sources and pivot, transform, slice, and dice the data multiple times before they can iterate with data analysts and data scientists to identify what needs to be fixed in the data. So you'll find data engineers and ETL developers, and of course the data analysts and data scientists who are very familiar with this data, often working together to determine the kinds of transformations needed further upstream in the data preparation process. After they've developed these transforms, data engineers and ETL developers still need to schedule custom workflows on an ongoing basis so that new incoming data can automatically be transformed.
This is a pretty elaborate process, where each time a data analyst finds a mistake in a specific cell in the data, or finds an anomaly or a pattern mismatch, they go back to data engineering and the ETL developers; and each time a transformation needs to be changed or added, everyone needs to get involved again and all of these data preparation tasks need to be orchestrated again. This iterative process, as you can imagine, can take several weeks or sometimes even months to complete. As a result, customers end up spending up to 80 percent of their time cleaning and normalizing data instead of analyzing the data and extracting value from it.
To summarize, there are several challenges with traditional data preparation today. First, traditional data preparation at scale is still largely manual. It's done by cherry-picking small samples of the data in Excel sheets or in Jupyter notebook environments, and it needs a lot of code-based heavy lifting to really work at scale. With more and more data sitting in your data lake or your data warehouses, and different kinds of data, structured and unstructured, data from logs, data from third-party applications, it's becoming more and more important to have ad hoc cleaning as part of the data preparation process.

Second, we notice that data arrives in many different formats: compressed, partitioned, periodic, transactional. Oftentimes, to enable business users, large amounts of that data still need to be moved into silos or even outside the company's virtual private cloud, or VPC, to be prepped or even just explored for analysis. This causes security concerns at times for customers.
And finally, as you go further upstream in the data consumer chain, the transformations are very specific to the context and meaning of the data. A specific cell might be out of place, or just a specific format of the data might be out of place, and these spot fixes need to be made. In this kind of dependent process there is a low factor of reuse of work: when a data analyst spots an anomaly, can they, number one, be empowered to go ahead and fix that anomaly, and number two, can the data engineering team, or the data source, actually take in the fix that's already been done by the data analyst? Tooling today is very specific to point use cases and does not look at data preparation as a whole and at how an organization can be empowered by it with some factor of reusability.
In the spirit of being customer obsessed, we listened to our customers, listened to their pain points, understood their challenges in data preparation, and launched AWS Glue DataBrew just a few weeks ago. For those of you who are not very familiar with AWS Glue, it is our serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. AWS Glue provides both visual and code-based interfaces to make overall data integration easier.
Specifically, today I'm going to share more about AWS Glue DataBrew. DataBrew is a new visual data preparation tool that enables data analysts and data scientists to clean and normalize data without writing any code. It provides these users with powerful capabilities to explore, manipulate, and merge new data sources, all without the assistance of code, and helps them self-serve the data preparation tasks that they come across in their work.

When we talk about data analysts and data scientists, it's really important to understand that they come in many teams, roles, and job functions: analysts such as legal analysts, marketing analysts, financial analysts, and environmental analysts, as well as data scientists such as researchers working on vaccine development, for example, whose core job function is not to learn Python but really to work with data, understand the patterns, and analyze the data. For all of these personas and users, DataBrew provides an easy-to-use platform to prepare their data directly from data lakes.
With Glue DataBrew, end users can easily access and visually explore data across their data stores, including your Amazon S3 data lake, your Amazon Redshift data warehouse, and your Amazon RDS databases. You can also upload a file from your local disk, which we've seen is a very common use case amongst the analyst persona, or even connect to data coming from third-party systems like Salesforce, Marketo, and so on. Customers can also choose from over 250 built-in transformations. These transformations really help you combine, pivot, and transpose data without writing any code.
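As a rough illustration of what connecting one of these sources looks like outside the console, here is a minimal boto3 sketch that registers an existing Glue Data Catalog table (for example, one pointing at Redshift or RDS) as a DataBrew dataset. The database, table, bucket, and other names are placeholders, and the exact parameter set should be checked against the boto3 DataBrew reference.

```python
import boto3

databrew = boto3.client("databrew")

# Register an existing Glue Data Catalog table (e.g. backed by Redshift or RDS)
# as a DataBrew dataset. All names below are placeholders for illustration.
databrew.create_dataset(
    Name="sales-catalog-dataset",
    Input={
        "DataCatalogInputDefinition": {
            "DatabaseName": "analytics_db",   # Glue Data Catalog database
            "TableName": "sales_orders",      # catalog table to expose to DataBrew
            "TempDirectory": {                # scratch S3 location for JDBC-backed sources
                "Bucket": "my-databrew-temp-bucket",
                "Key": "tmp/",
            },
        }
    },
)
```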
DataBrew recommends cleaning and normalization steps: things like filtering anomalies in your data, recognizing and dealing with invalid or misclassified values, removing duplicate data, as well as standardizing data, for example date-time formats or even just uppercase to lowercase characters.
DataBrew also comes with a set of transformations for data science. These are really helpful because we use advanced machine learning techniques like natural language processing directly within the transformations, so you can, for example, convert the words "yearly" or "year-long" to "year". It really helps you convert common words to their base or root form, and this is just one example of the data science transformations that we provide.

You can save these cleaning and normalization steps into a workflow, which is typically called a recipe, and this recipe can then automatically be run on future incoming data. As you can see, there are a lot of operations and transformations you can perform on your data using DataBrew.
Another important aspect of this is being able to understand what changed, at what point in time, and by whom: which transformations were applied on which datasets, and how the data changed and evolved with the usage of DataBrew. So we provide a very visual lineage view for all operations that happen on the data.
And lastly, being on the data lake, scale is important. Being able to operate at scale and to schedule these transformations to run repeatedly on new incoming data is part of DataBrew. All of DataBrew's actions are serverless: DataBrew is fully serverless and fully managed, so you don't have to configure, provision, or manage any resources to operate DataBrew.
With this, we're going to switch over to a quick demo of the product and then cover some popular use cases.

DataBrew can be accessed through the Analytics category on the AWS Management Console. When you land on the console, you can see the way to create a project. There are some key entities in DataBrew, such as datasets, projects, recipes, and jobs. You can go ahead and create a dataset when you land on the DataBrew console: you can browse your S3 buckets, connect through the Glue Data Catalog to your Redshift tables or RDS tables, or even upload a file from your local disk. In this case, I'm going to take a New York City bike dataset and just look at what this dataset contains and what columns exist in it. Once I've created a dataset, I can not only take a look at the sample but also understand 40 to 50 statistics about how this data is laid out: things like duplicate values, missing values, as well as correlations for all numerical values in this dataset.
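For reference, the same flow the console walks through here (create a dataset from S3, then profile it) can also be driven through the DataBrew APIs. Below is a minimal boto3 sketch under assumed bucket, key, and IAM role names; treat all parameter values as placeholders.

```python
import boto3

databrew = boto3.client("databrew")

# 1. Register a CSV file in S3 as a DataBrew dataset (names are placeholders).
databrew.create_dataset(
    Name="nyc-bike-trips",
    Format="CSV",
    Input={"S3InputDefinition": {"Bucket": "my-data-lake", "Key": "bike/trips.csv"}},
)

# 2. Create a profile job that computes the per-column statistics
#    (missing values, duplicates, correlations, ...) and writes them to S3.
databrew.create_profile_job(
    Name="nyc-bike-trips-profile",
    DatasetName="nyc-bike-trips",
    RoleArn="arn:aws:iam::123456789012:role/DataBrewServiceRole",  # assumed role
    OutputLocation={"Bucket": "my-databrew-output", "Key": "profiles/"},
)

# 3. Kick off a run of the profile job.
run = databrew.start_job_run(Name="nyc-bike-trips-profile")
print(run["RunId"])
```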
I'm starting to get a picture of this dataset. This dataset, for example, has a bike ID column, so I'm interested in the cardinality to make sure the rows are unique, and I also want to make sure there's no anomaly in the bike IDs, so I'll take a quick glance. Then I also look at the user type to understand how many user types even exist in this dataset. Once I'm done, I can switch over to the lineage view and understand how many projects are connected to this dataset and what jobs are running. In this case, I'm going to go ahead and actually open this dataset up to do some data preparation. When I click on the project, I land in a visual working window where we're working with a first-n sample of this dataset.
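The visual working window corresponds to a DataBrew project, which ties together a dataset, a recipe, and a sample. A minimal boto3 sketch of creating such a project is below, assuming the dataset from the previous sketch, a recipe named "nyc-bike-recipe" (a sketch of building one follows the column-merge step later in the demo), and a placeholder IAM role.

```python
import boto3

databrew = boto3.client("databrew")

# A project working on the first 500 rows of the dataset, mirroring the
# sample-based editing shown in the console. All names are placeholders,
# and "nyc-bike-recipe" is assumed to exist already (see the recipe sketch
# after the column-merge step below).
databrew.create_project(
    Name="nyc-bike-project",
    DatasetName="nyc-bike-trips",
    RecipeName="nyc-bike-recipe",
    RoleArn="arn:aws:iam::123456789012:role/DataBrewServiceRole",
    Sample={"Type": "FIRST_N", "Size": 500},
)
```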
As you can see, when I click on column headers, I can really understand detailed statistics about each individual column. I notice we have latitude and longitude columns, and maybe for my analysis I want to combine these two columns, so I just go ahead and initiate a transformation to merge them with a separator and rename the column. This gives me a real-time preview of what my column is going to look like when I apply the transformation, which is a really easy way to visually understand the impact of your transformation and what it's doing to your data.
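Each of these visual edits becomes a step in the project's recipe. As a rough sketch of what that looks like through the API, the snippet below creates a recipe with a column-merge step followed by a rename; the operation and parameter names are illustrative assumptions and should be verified against the DataBrew recipe actions reference.

```python
import boto3

databrew = boto3.client("databrew")

# Illustrative recipe steps: merge two columns with a separator, then rename
# the result. Operation/parameter names here are assumptions; check the
# DataBrew recipe actions reference for the exact spelling.
steps = [
    {
        "Action": {
            "Operation": "MERGE",
            "Parameters": {
                "sourceColumns": '["start_station_latitude","start_station_longitude"]',
                "delimiter": ",",
                "targetColumn": "lat_long",
            },
        }
    },
    {
        "Action": {
            "Operation": "RENAME",
            "Parameters": {
                "sourceColumn": "lat_long",
                "targetColumn": "start_location",
            },
        }
    },
]

databrew.create_recipe(Name="nyc-bike-recipe", Steps=steps)
```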
Next, I'm going to look at some aggregates for this data, just to understand it. I'm interested in understanding, based on user type, how many bike hours, or what the trip duration is, for each type. I can also look at other analyses that I can add onto this dataset. Once I select my attributes, I can see an immediate preview of this analysis. This is really helpful, especially when you're preparing data for machine learning and you're really trying to understand or engineer the features for your machine learning models.
In this case it looks good: I have my customer and subscriber values, so I'm going to pull those columns back into my visual working window. Once I've done that, as you can see on the right-hand side, every time I've applied a transformation it's been added to what we call the recipe. One more transformation I'm going to do is for my machine learning models: I would rather give them numerical values. So I quickly went ahead and looked at the two values in my sample, confirmed that only those two values exist in my population, and then mapped both values. Let's say there was actually a "lifetime" value that didn't show up in my sample but existed in my population: I can quickly check between the profile view and the grid view and make sure it's all solid. One of the last things: my analysis is incomplete without looking at maintenance data.
So what I'm doing is bringing in bike maintenance data via a join, specifying my join keys, and looking at a preview. What this helps me do is really bring in datasets from disparate sources: I joined a CSV with an Excel file here, just to understand how a recent maintenance affects the trip duration for the bike. This is a really great way to experiment with all kinds of joins: you can look at inner joins, outer joins, and so on.
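Joining across disparate sources works because each source is first registered as its own DataBrew dataset. Here is a hedged boto3 sketch of registering the Excel maintenance file; the Excel format options shown (sheet name, header row) are assumptions, and all names are placeholders.

```python
import boto3

databrew = boto3.client("databrew")

# Register the Excel maintenance file as a second dataset so it can be joined
# with the CSV trip data inside the project. Names and options are illustrative.
databrew.create_dataset(
    Name="bike-maintenance",
    Format="EXCEL",
    FormatOptions={"Excel": {"SheetNames": ["maintenance"], "HeaderRow": True}},
    Input={
        "S3InputDefinition": {
            "Bucket": "my-data-lake",
            "Key": "bike/maintenance.xlsx",
        }
    },
)
```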
This looks quite good, but clearly I've messed up my schema a little bit, so I can go into the schema view, double-click on columns, and rename them, change them, hide them, or delete them as required. Once I'm ready, I can go ahead and rename this column, and this will get added as a recipe step. Once again, when I'm ready with my recipe, I can review the changes I've made, I can even switch samples around, and really make sure that this recipe I've created actually applies to my dataset and works well. Once I'm done, remember we've been working on just a first-n, 500-row sample; I can now go ahead and apply this recipe to the full population data. So now I'm running it on the actual dataset, and what I did was just specify an S3 location for where to store my output.
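Applying the recipe to the full dataset corresponds to a recipe job. The boto3 sketch below creates a job from the project and runs it, writing the transformed output to an assumed S3 location; the names, role, and output format are placeholders.

```python
import boto3

databrew = boto3.client("databrew")

# Create a recipe job from the project so the recipe built on the 500-row
# sample is applied to the full dataset. Output location/format are assumed.
databrew.create_recipe_job(
    Name="nyc-bike-clean-job",
    ProjectName="nyc-bike-project",
    RoleArn="arn:aws:iam::123456789012:role/DataBrewServiceRole",
    Outputs=[
        {
            "Location": {"Bucket": "my-databrew-output", "Key": "bike/cleaned/"},
            "Format": "PARQUET",
        }
    ],
)

# Run it against the full data.
databrew.start_job_run(Name="nyc-bike-clean-job")
```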
So in a nutshell, with all of that exploration, I basically took a dataset from S3, joined it with another dataset, created some transformations, ran a profile, and produced an output. This visual explorer really helps you understand and keep track of all the changes that are happening in your dataset. This is one of the most common workflows in DataBrew, and you can go ahead and experiment with data and develop interesting use cases on this foundation.
Next, we're going to do a quick recap of what you saw in the demo. In this demo you saw how we could profile a large dataset, use those statistics to inform transformations on the sample that you're visually exploring, and then apply these transformations back to the large dataset. In addition, you can also operationalize these at scale. Recipes are reusable: you can publish them, so two data analysts who are collaborating can each publish recipes in the account and import them back into a project for reuse. You can also use APIs to perform all the actions that you saw in the UI and really schedule jobs, so every time a new dataset lands in an S3 bucket, you can parameterize the input file name to pick up that latest dataset and then just schedule a job to perform these transformations pretty easily and orchestrate them at scale.
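Two of the API calls mentioned here are publishing a recipe for reuse and putting a job on a schedule. The boto3 sketch below shows both; the cron expression, description, and names are assumptions for illustration.

```python
import boto3

databrew = boto3.client("databrew")

# Publish the working recipe so other analysts can import and reuse it.
databrew.publish_recipe(
    Name="nyc-bike-recipe",
    Description="Cleans and joins the NYC bike trip and maintenance data",
)

# Run the recipe job every day at 06:00 UTC so newly arriving data is
# transformed automatically. The cron expression is illustrative.
databrew.create_schedule(
    Name="nyc-bike-daily",
    JobNames=["nyc-bike-clean-job"],
    CronExpression="cron(0 6 * * ? *)",
)
```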
Now that you have a glimpse of what DataBrew looks like, we're going to talk through some common use cases that we've seen customers use DataBrew for.

One of the most popular is ad hoc experimentation and data exploration for business reporting. Data analysts commonly bring in data either from a local file, like an Excel spreadsheet on their computer (you'd be surprised how many spreadsheets still get passed around in organizations), or from data that's been made available to them through the Glue Data Catalog. They connect to this data, perform the analysis in DataBrew, and then put the files back into S3. From S3 you can create a SPICE dataset in Amazon QuickSight, which then automatically pulls in new data as it lands and refreshes your business report. So this is a really great use case for your common monthly, weekly, or daily reports, as well as the ad hoc analyses you need to do when you first get hold of a dataset and need to get familiar with it. Things like enriching data with joins and unions are common. Let's say you have monthly data coming in and you want to produce a yearly report: just specify a folder in DataBrew. Let's say you want to update the schema in bulk: again, create a recipe and then use that recipe on different datasets to harmonize all of the schemas.
Another popular use case we've seen is data quality, and we see this very commonly with customers that get a lot of data feeds from different providers and are really looking to set up business rules. What they do is, every time a dataset comes in, they first profile it with DataBrew, and then, using EventBridge and Lambda functions, they code up specific business rules that apply to the data. Some interesting scenarios here: you may want to compare the profiles of two datasets to understand the difference if something is wrong, or you may want to determine whether a specific dataset needs to be marked for manual review. These are very interesting use cases because, with a variety of datasets now in the system, in different shapes, sizes, and formats, data quality and its practice in the organization become increasingly important, so data profiles, as well as all the statistics around them, become critical.
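As a rough sketch of the EventBridge-plus-Lambda pattern described here, the handler below reads a DataBrew profile output from S3 and applies a simple business rule. The event shape, the profile JSON field names, and the threshold are all hypothetical placeholders; the real profile output layout is documented in the DataBrew developer guide.

```python
import json
import boto3

s3 = boto3.client("s3")

# Hypothetical rule: flag the dataset for manual review if more than 5% of
# values in any column are missing, according to the profile job output.
MISSING_RATIO_THRESHOLD = 0.05


def handler(event, context):
    # Assumed event shape: an EventBridge rule passes the S3 location of the
    # profile JSON that the DataBrew profile job wrote.
    bucket = event["profileBucket"]
    key = event["profileKey"]

    profile = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())

    flagged = []
    # "columns", "name", "missingValuesCount", and "rowCount" are placeholder
    # field names used only to illustrate the rule; check the actual schema.
    for column in profile.get("columns", []):
        missing = column.get("missingValuesCount", 0)
        rows = profile.get("rowCount", 1) or 1
        if missing / rows > MISSING_RATIO_THRESHOLD:
            flagged.append(column.get("name"))

    if flagged:
        # In a real setup you might publish to SNS or tag the dataset here.
        print(f"Dataset needs manual review; columns over threshold: {flagged}")
    return {"needsReview": bool(flagged), "columns": flagged}
```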
Pre-processing data for machine learning has also been very critical, even in training machine learning models. Customers typically bring data from S3, look at that data in DataBrew, and feature engineer it really easily: with DataBrew you can categorically map data, tokenize text columns, and normalize them, really doing a data sanity check before it goes into a machine learning model. A lot of folks use DataBrew as a plugin to their Jupyter notebooks: directly within the notebook environment they do the data prep in one tab with DataBrew and then continue coding in their notebook environment on another tab. So take a look at the DataBrew JupyterLab plugin for accelerating some of the machine learning pre-processing that happens with data today.
One of the final use cases I want to talk about is orchestrating data preparation workflows. There's a lot of ad hoc, one-off data preparation, but the real efficiencies are driven when you can reuse a lot of the data preparation that individuals are doing in different teams. So what you can do is take data that's sitting in Redshift, for example, and bring it into DataBrew; then, once the data is prepared in DataBrew and you run the recipe, you can trigger, say, a Glue crawler to put that data back into Redshift and into your Data Catalog. Of course, this is just one such example of orchestrating data preparation, but you can build similar workflows orchestrated via Step Functions or even Lambda functions.
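A minimal sketch of that chaining with boto3 is shown below: run the DataBrew recipe job, wait for it to finish, then kick off a Glue crawler so the prepared output shows up in the Data Catalog. The job, crawler, and state names are assumptions, and in practice you would drive this from Step Functions or an EventBridge-triggered Lambda rather than a polling loop.

```python
import time
import boto3

databrew = boto3.client("databrew")
glue = boto3.client("glue")

# Start the DataBrew recipe job that writes prepared data back to S3.
run_id = databrew.start_job_run(Name="nyc-bike-clean-job")["RunId"]

# Poll until the run finishes (a Step Functions state machine or an
# EventBridge rule would normally replace this loop).
while True:
    state = databrew.describe_job_run(Name="nyc-bike-clean-job", RunId=run_id)["State"]
    if state not in ("STARTING", "RUNNING", "WAITING"):
        break
    time.sleep(30)

# On success, crawl the output location so it lands in the Glue Data Catalog.
if state == "SUCCEEDED":
    glue.start_crawler(Name="databrew-output-crawler")  # assumed crawler name
```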
With this, I want to leave you with the availability and how to use DataBrew. DataBrew is generally available. It's accessible via the AWS Management Console, APIs, as well as a plugin for Jupyter notebooks, and as you can see, getting started with DataBrew is very, very easy: you can try one of our sample datasets or even bring your own datasets and start brewing. Feel free to reach out to us at databrew-feedback@amazon.com, and thank you so much for taking the time to attend this session. I hope you enjoyed it and learned something today.
Source: https://www.youtube.com/watch?v=S1fmtDHB0Qs