AWS re:Invent 2020: Data pipelines with Amazon Managed Workflows for Apache Airflow

Apache Airflow continues to gain adoption for its ability to configure data pipelines as scripts, monitor them with a great user interface, and extend their functionality through a set of community-developed and custom plugins. In this session, see how to get started with Airflow using Amazon Managed Workflows for Apache Airflow, a platform that makes it easy for data engineers and scientists to execute their extract-transform-load (ETL) jobs and data pipelines in the cloud. See a demonstration of how to create your environment, configure access, and create and run a workflow. Also, learn the methods for monitoring through Airflow’s UI and Amazon CloudWatch.

Learn more about re:Invent 2020 at http://bit.ly/3c4NSdY



Content

2.08 -> Welcome, and thank you for joining this session. My name is John Jackson, and today we're going to be talking about creating data pipelines with Amazon Managed Workflows for Apache Airflow. This session is for anyone who's interested in Apache Airflow, or data pipelines and workflows in general: whether you're new to Airflow and want to learn what AWS has to make it easier to get started, or new to AWS and want to learn more about our workflow tools, or if you're experienced with Airflow and want to hear how Managed Workflows makes Apache Airflow easier to deploy, maintain, and scale.
30.96 -> We're going to start out talking about the general data pipeline problem, and for the data engineers and data scientists in the crowd, this is going to be very high level, but it sets the stage as to why Airflow is so important. I'm then going to introduce Airflow itself and explain some of its main components, benefits, and challenges, again at a very high level, whatever your experience with Airflow. After this I'm going to introduce Amazon Managed Workflows for Apache Airflow, or MWAA: how it works and what benefits it provides. I'll then give you a quick demonstration of how to build a very rudimentary data pipeline with Amazon MWAA, and then we're going to discuss a little bit about how Amazon MWAA works with integrations and open source, and when to choose it for your orchestration needs.
80.479 -> So why should we be interested in data pipelines in general? Well, many organizations have built their entire business around data. In order to benefit from that data, you have to take it from where it was produced and move it to where it can be used, and that's the basic definition of a data pipeline. Data pipelines, that general movement of data, have become more complicated over time. For example, movie studios used to be able to track their sales by counting the number of physical DVDs produced, or if you're a little older like I am, VHS and maybe even Betamax too. The data pipeline was very simple: the number produced, minus the number in the warehouse, is how many sales there were. And since the studios controlled most of this process, analysis was easy; you might add a little bit extra for in-theater movies, but other than that, that's about it. But now studios have to track sales of not just physical items and theater tickets sold, but pull in digital sales and online rentals, and they have to track stream counts from a variety of different sources. There are also social media, broadcast, and marketing campaigns to track and correlate with their effects on those sales, rentals, and tickets.
148.16 -> Each of these services typically exposes an API or some other shared data source, and these APIs are all different. Sometimes they may not be accessible, and when they are, the format may not be exactly the way you want it, and it needs to be manipulated so it can actually be used. Sometimes even the content of the data isn't exactly correct; for example, some services may list a movie as "The ___" while others use "___, The".
173.2 -> So let's take a look at how even a simple version of this use case can be complicated. This data is managed through a series of tasks called a workflow, and this type of workflow is going to be constructed in such a way that certain tasks need to finish before the next task can begin. For example, we may want to make sure all the data has been pulled from all the different streams, rentals, and purchases before we try to merge it together, and we want to make sure that we do all this in a certain order, without going back. This type of workflow is called a directed acyclic graph, or DAG.
205.84 -> In our case, the DAG workflow will start by connecting to each service and requesting the play counts, the views, and that sort of information. After that we're going to process the data: we might send it to EMR, or maybe an ECS or EKS container, and convert it, say, from a CSV to a Parquet format, or maybe fix those titles. After we do all that data processing, maybe we'll send it to something like Glue or Athena to do analytics, or maybe something like SageMaker that can actually extract additional information to be combined in. And then after all this is done and we have a clean data set, we'll want to upload this information to something like Redshift for further analysis, or DynamoDB for storage. And if there's a failure anywhere along the way, we're going to want to be able to resume from just that point, because any of these tasks might take quite a bit of time, and we're not going to want to have to go back and restart them all.
259.04 -> Apache Airflow has emerged as a fantastic tool to manage this sort of data pipeline. It lets us define the flow from a selection of hundreds of built-in operators and community-created plugins, including integrations to 16 different AWS services and third-party tools such as Apache Spark and Hadoop. And Apache Airflow has a very active open-source community, so fixes and features happen on a regular basis, and there are always plenty of places to go to get help.
288 -> In Airflow we define these DAGs in Python, and you can see here how that previous workflow we just showed would be reproduced in Airflow as one of these Python DAGs and then displayed in the user interface.
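As a rough illustration (not the exact code shown on screen), here is a minimal sketch of what a DAG along those lines could look like in Python; the task names, callables, and schedule are hypothetical placeholders:

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    # Hypothetical placeholder callables standing in for the real work
    # (pulling play counts, converting formats, loading to a warehouse).
    def pull_counts(**kwargs):
        print("pull play counts and views from each service")

    def transform(**kwargs):
        print("convert CSV to Parquet, fix titles")

    def load(**kwargs):
        print("load the clean data set into the warehouse")

    default_args = {
        "owner": "data-eng",
        "retries": 1,
        "retry_delay": timedelta(minutes=5),
    }

    with DAG(
        dag_id="movie_sales_pipeline",      # hypothetical name
        default_args=default_args,
        start_date=datetime(2020, 12, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        pull = PythonOperator(task_id="pull_counts", python_callable=pull_counts, provide_context=True)
        process = PythonOperator(task_id="transform", python_callable=transform, provide_context=True)
        publish = PythonOperator(task_id="load", python_callable=load, provide_context=True)

        # Each task only runs after its upstream task finishes, which is the
        # "certain tasks need to finish before the next task can begin" property.
        pull >> process >> publish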
304.08 -> So, assuming we didn't cover your exact use case, why would you be interested in Apache Airflow? Well, there are three main categories of use cases. The first, and by far the most common, is ETL, or extract, transform, load. This is what almost all installations of Airflow do at least some of the time, and that is really facilitating that data pipeline, that movement of data. An emerging area is around AI/ML, artificial intelligence and machine learning, and really this has to do with the sheer volume of data you have to pre-process. Let's say, for example, you're creating a pipeline for autonomous vehicles. When the car comes into the garage, you have to offload all that data from the vehicle and get it to a source where it can be used for your machine learning models. But each vehicle might have a slightly different camera system or lidar, so all this data has to be normalized, and the camera color space needs to be modified so that it can properly be analyzed by something like SageMaker. And then the third category for Airflow is really around DevOps and IT operations. These are a great fit for Airflow because Airflow is very flexible; it can streamline and automate day-to-day tasks, and in fact plenty of customers actually use Airflow to manage Airflow, for example maintaining the metadata database that Airflow uses and keeping only a certain amount of historical data.
387.919 -> Speaking of that metadata database, what are the Airflow components? Again, for those of you that use Airflow this is going to be really obvious, but for everyone else: there are four main components to Airflow. There's the scheduler, which, not surprisingly, schedules the work, and the worker, which, not surprisingly, does the work. You then have a web server component that lets you access the user interface like what you saw earlier, and then there's the metadata database I just referred to, which stores all the state information and all the historical tasks.
416.96 -> But there are a number of challenges around self-managing Airflow. The first is really just around setup: it's typically a very manual process, so much so that some customers end up developing their own tooling around it. Scaling can be a challenge; you can do this with EC2 Auto Scaling or Kubernetes, but these bring their own complexity into the mix. For security, supporting role-based authorization and authentication can be a little challenging with Airflow: you often have to authenticate in one place and then have the user log into the Airflow user interface to give them the proper authorization role. Upgrades and patches can be challenging; sometimes there will be a breaking change in an upgrade, and just maintaining the hundreds of libraries and dependencies that Airflow needs can be a challenge as well. And then finally there's ongoing maintenance. Airflow is running arbitrary Python code; it can crash, it can have issues, and being able to know there's a failure and recover from it is super important, especially in a production system.
479.599 -> So, listening to the customer demand around these challenges, AWS has released Amazon Managed Workflows for Apache Airflow. This was really built to address the challenges I just mentioned while still providing everything great that Airflow has to offer.
494.879 -> First and foremost is setup. Amazon MWAA is not a branch of Airflow that we're just compatible with; we're actually running the exact same open-source Airflow software that you can download and run today, we've just made it a little easier to work with. It's got the same great user interface, and we've also made it really easy to stand up new environments, either from the AWS console, the command line interface, the API, or CloudFormation.
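For instance, here is a hedged sketch of creating an environment through the API with boto3; the names, ARNs, subnets, and security group are placeholders, and the exact parameter set should be checked against the MWAA API your boto3 version exposes:

    import boto3

    mwaa = boto3.client("mwaa", region_name="us-east-1")  # assumed region

    response = mwaa.create_environment(
        Name="mwaa-demo",                                       # hypothetical environment name
        AirflowVersion="1.10.12",                               # assumed launch-time Airflow version
        SourceBucketArn="arn:aws:s3:::airflow-my-demo-bucket",  # bucket must start with "airflow-"
        DagS3Path="dags",                                       # folder in the bucket holding DAG files
        ExecutionRoleArn="arn:aws:iam::111122223333:role/mwaa-execution-role",  # placeholder
        NetworkConfiguration={
            "SubnetIds": ["subnet-aaaa1111", "subnet-bbbb2222"],  # two private subnets
            "SecurityGroupIds": ["sg-0123456789abcdef0"],
        },
        WebserverAccessMode="PUBLIC_ONLY",  # or PRIVATE_ONLY for a VPC endpoint
        MaxWorkers=5,
    )
    print(response["Arn"])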
522.159 -> We've also made scaling easy. Really what we do is monitor the queues of tasks, and when we see that you need additional capacity, we'll add additional containers. This is all running on the back end using Amazon ECS on AWS Fargate, so it's very easy for us to bring up more containers, and it's also using a Celery executor, so it lets you control the parallelism and how many tasks actually run per container.
548.48 -> For security, we have full integration with IAM for both authentication and authorization, which means you can provision your users' access to the UI: not just whether they have access to it, but what they're actually allowed to do when they get there. For example, I may have a developer role or developer group that I want to give full administration or ops privileges on dev systems, but I only want them to have view-only access on production systems. That access to the Airflow UI can be either through a VPC endpoint into your virtual private cloud, which we'll talk about in a second, or we can provide a URL whereby you can access it directly from the internet, again still secured by IAM.
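As a hedged example of what that IAM-gated access looks like programmatically, a caller whose IAM policy allows it can exchange credentials for a short-lived web login token; the sign-in URL pattern in the comment is my assumption, so verify it against the current MWAA documentation:

    import boto3

    mwaa = boto3.client("mwaa", region_name="us-east-1")  # assumed region

    # Requires an IAM identity allowed to call airflow:CreateWebLoginToken
    # for this environment (and the matching Airflow role for scoped access).
    token = mwaa.create_web_login_token(Name="mwaa-demo")  # hypothetical environment name

    hostname = token["WebServerHostname"]
    web_token = token["WebToken"]

    # Assumed single-sign-on URL pattern for the hosted Airflow UI:
    print("https://{}/aws_mwaa/aws-console-sso?login=true#{}".format(hostname, web_token))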
594.16 -> Speaking of IAM, the workers actually assume an IAM execution role, and this simplifies access to AWS services. If I want to give a given Airflow environment access to EMR, I simply give that particular environment's worker execution role the ability to access my EMR cluster, and away it goes. The workers and scheduler run in your VPC, which means any assets that perhaps live on premises or are connected to other secured resources, as long as they're accessible from the VPC you've attached the workers to, will be accessible from Airflow.
631.76 -> Around upgrades, what we've done is allow you to designate a maintenance window, and this is when we'll do any patching that needs to be done, for security fixes and things like that, but it's also when we will perform any upgrades that are needed. And then if there are any issues, we allow you to roll back from that upgrade as well.
651.6 -> And finally, around maintenance, we have integrated monitoring with Amazon CloudWatch, which makes it very easy to monitor one or more environments centrally and create alarms and alerts. As I mentioned, it's running on Fargate, so it's by nature multi-Availability Zone, and therefore on any failures, or if there are any unhealthy containers, we swap those out on your behalf, and we'll also automatically restart on any failures that occur.
680.399 -> So how does it work? Well, it's actually fairly simple. We create an MWAA environment, which is really just that open-source version of Airflow with all the connections, wiring, and everything else you need to actually make it work. You then simply provide those directed acyclic graphs, the Python DAG code, in a folder in an S3 bucket that you designate, and we'll take care of making sure that any containers being used get updated with those DAGs. After that, you're just accessing the Airflow UI, as I mentioned, either right from a URL or through a VPC endpoint.
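Publishing a workflow is therefore just an object upload; a minimal sketch with boto3, where the bucket name, local file, and dags/ folder are placeholders:

    import boto3

    s3 = boto3.client("s3")

    # The environment watches the designated dags/ prefix; dropping a new .py
    # file there is all it takes for the workflow to appear in the Airflow UI.
    s3.upload_file(
        Filename="copy_s3_file.py",          # local DAG file (hypothetical name)
        Bucket="airflow-my-demo-bucket",     # must carry the "airflow-" prefix
        Key="dags/copy_s3_file.py",
    )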
716 -> To give you an idea of what this looks like, for those of you who want to understand how it's put together: within your VPC, the Airflow scheduler exists along with one or more Airflow workers, depending on how many you need based on your demand. Then within the service's VPC we have the metadata database and the Airflow web server, which can be connected either to that aforementioned VPC via a VPC endpoint, or made accessible via the internet through an application load balancer internal to the service. Either way, those are secured by IAM.
752.079 -> I'd like to now walk you through what this looks like, and this is live on the console today; you can try it out right after this session if you'd like, or you can play along as we go. Here's the landing page for the service, and you can see how we describe the environment and what it looks like. Clicking on the big orange Create environment button, we'll start out with just a brief description of what this looks like.
779.44 -> I'm going to go through and give my environment a name so that we can designate what it's actually called; I'll just call it "mwaa-demo". You can see here also where we can specify the specific Airflow version we want to use. In this case this is actually the latest version as of release; when we verify the next release, which is 1.10.13, and have validated that it's safe to use, you'll be able to designate that as the new version, and we'll be able to upgrade to it in the maintenance window, as I mentioned.
817.36 -> From here, if we scroll down, we get to where we designate the S3 bucket. There are actually a couple of requirements around the bucket. One is that it has to be prefixed with "airflow-". The reason for this is that the Airflow environment has full access to this particular bucket, so this is a way of acknowledging that this bucket is used for this environment and not for general Airflow or general S3 storage. The other requirement is around versioning, which I'll mention in a second. We provide a link right here, so if you don't have a bucket in the right format, you can click there, it'll open up the S3 console, and you can create one. In this case I do have one that has the proper "airflow-" prefix, so I'm going to go ahead and choose that. One thing to note: if you're pasting in your S3 bucket, make sure you don't have a trailing slash, just the bucket name with the s3:// prefix.
881.76 -> After this, we're going to select a DAG folder; this is where those Python workflows actually live. You can have a single S3 bucket that has many different folders for different environments, or you can have one environment per bucket. In addition to this, we also support plugins and requirements. The plugins are just the same Airflow plugins that you can use today; you simply zip those up, and the reason for that is twofold. One is that we don't want to get into a race condition, so we want to make sure all those plugins get loaded at once, and the other is that the bucket supports versioning, so we can make sure we can roll back, and the same goes for any of those Python requirements.
924.48 -> The next thing, as I mentioned, is that the scheduler and the workers live within your virtual private cloud, so of course you're going to pick that. Now, there are a couple of requirements around this. One is that you must have private subnets within your VPC that are able to connect out to the internet on the public side. If you don't have a VPC that supports that, we do have a quick link to quick-create one in CloudFormation, but you can see here I have one set up, and it automatically selected the multi-Availability Zone subnets. You then need to pick whether you want the web server to terminate in a private endpoint within your VPC, in which case you'll need to do some additional setup, like a NAT or a Linux bastion, or you can select a public network, in which case access is provided via a URL. Either way, you authorize that connectivity via IAM.
981.519 -> It also requires a security group, which we can create for you, or if you already have one configured, it can be selected here as well. You then need to pick an environment size. Airflow 1.x has a limitation of only one scheduler per environment, and that really defines the number of workflows you can concurrently load into memory. A small environment, in our testing, handles roughly about 50 workflows; now, if you have a single workflow that loads a thousand sub-workflows, of course that would throw that off, but we do have the ability to change this after the fact too, and some of the metrics I'll show you in a second will help with that.
1021.92 -> The other thing you define here is the maximum number of workers, and this really is sort of a cost-control measure, because we'll scale out the workers to meet the parallel executions you need, up to whatever maximum you define here. After that, Airflow's typical queuing mechanism kicks in. When there are no more tasks in the queue and no more tasks executing, we dispose of all those extra workers and you'll be back down to that original worker you had.
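Both of those knobs can also be changed after creation; here is a hedged sketch of adjusting them through the API, where the environment name, class value, and worker count are placeholders and the parameter names should be checked against your boto3 version:

    import boto3

    mwaa = boto3.client("mwaa", region_name="us-east-1")  # assumed region

    # Bump the environment size and the worker ceiling on an existing environment.
    mwaa.update_environment(
        Name="mwaa-demo",               # hypothetical environment name
        EnvironmentClass="mw1.medium",  # assumed class identifier (small/medium/large tiers)
        MaxWorkers=10,                  # ceiling for worker autoscaling
    )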
1049.52 -> As I mentioned, we have integration with CloudWatch. This is for all the Airflow metrics plus the task logs, the actual logs from those executions, and you can optionally add the rest of the logs as well. We also support things like the Airflow configuration options and other elements like that. Finally, we can either create or select an execution role, and this is the role that actually determines what this environment is allowed to connect to within the AWS ecosystem.
1079.679 -> After that, we hit Create environment, the big orange button, give it a few moments, and it'll kick us over to the list environments page, where it'll show us which environments are available. This process does take a few minutes; it's building the database, it's building the containers, and you can see it says it takes about 20 minutes or so. So rather than wait for this one to build, we're going to go ahead and connect to one that I already had pre-existing.
1106.799 -> You can see it just provides a list; this is everything you'll see for your Airflow environment. When I wish to access it, I just go over to the right-hand side, where it says Open Airflow UI, and that will connect me directly to the Airflow instance we've just set up.
1126.32 -> Once we've done this, we're just in Airflow; there's nothing unique about it other than the integration with IAM for authorization. So now we need to provide an actual workflow, and I've created a very simple DAG here. All it's going to do is use a Python operator, an S3 sensor, and an S3 hook to see if a file shows up in S3, and when it does show up, do a very simple transform on it with Python and return it back to that same S3 bucket.
1155.2 -> Here's the transform I built, super simple. I'm using the S3 hook, which is one of the built-in Airflow capabilities: I'm going to pull the input file out of the input key, remove all the double quotes from that file, and then return it back to S3. You can see here I've defined the bucket, the input key, and the output key as part of this.
1178.24 -> To define how these work together, you can see I've defined a DAG I'm just calling "copy_s3_file". There's the Python operator that's going to call the function we just wrote, and right above it is the S3 key sensor that's going to wait until that key is available for the input file. After that, we simply connect them via the two greater-than signs (>>), and away it goes. So I'm going to copy that up to the S3 bucket.
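The exact file from the demo isn't reproduced here, but a hedged sketch along the same lines, using Airflow 1.10-era imports, with the bucket name and keys as placeholders:

    from datetime import datetime

    from airflow import DAG
    from airflow.hooks.S3_hook import S3Hook
    from airflow.operators.python_operator import PythonOperator
    from airflow.sensors.s3_key_sensor import S3KeySensor

    BUCKET = "airflow-my-demo-bucket"   # placeholder bucket (note the "airflow-" prefix)
    INPUT_KEY = "data/input.csv"        # placeholder keys
    OUTPUT_KEY = "data/output.csv"

    def transform(**kwargs):
        # Read the input object, strip all double quotes, write the result back.
        hook = S3Hook()  # default AWS connection; on MWAA this resolves to the worker's execution role
        body = hook.read_key(key=INPUT_KEY, bucket_name=BUCKET)
        hook.load_string(
            string_data=body.replace('"', ""),
            key=OUTPUT_KEY,
            bucket_name=BUCKET,
            replace=True,
        )

    with DAG(
        dag_id="copy_s3_file",
        start_date=datetime(2020, 12, 1),
        schedule_interval=None,   # triggered manually in the demo; normally on a schedule
        catchup=False,
    ) as dag:
        wait_for_file = S3KeySensor(
            task_id="wait_for_input",
            bucket_name=BUCKET,
            bucket_key=INPUT_KEY,
            poke_interval=60,
        )

        run_transform = PythonOperator(
            task_id="transform_file",
            python_callable=transform,
            provide_context=True,
        )

        # The sensor must succeed before the transform runs.
        wait_for_file >> run_transform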
1204.48 -> But there is one more thing I need to do: when we create the environment, it only has the ability to read these DAG files, the Python files, and things like that. You can see it refreshed and the DAG came up, but I do need to give our worker the ability to write to this S3 bucket. So if I go look at the environment details page and scroll down to the bottom, it gives me a link in IAM to the execution role, so I'm going to click on that.
1232.72 -> What I'm going to do here is keep it very simple: I'm just going to go ahead and give this particular role access to all S3 buckets. Of course, in production you would limit it to just the scope needed, but this is simply for demonstration purposes, so I'll just give it full S3 access.
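Doing the same thing outside the console might look like the following sketch, attaching the broad AWS-managed S3 policy to a placeholder role name; as noted above, in production you would attach a policy scoped to just the buckets and actions the workflow needs:

    import boto3

    iam = boto3.client("iam")

    # Broad grant for demo purposes only; scope this down for real workloads.
    iam.attach_role_policy(
        RoleName="mwaa-execution-role",  # hypothetical execution role name
        PolicyArn="arn:aws:iam::aws:policy/AmazonS3FullAccess",
    )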
1250.08 -> Going back to the user interface, I can now click on this workflow, and it has everything it needs to run. You can see here the two steps we defined in the DAG file. I'm going to go to Trigger DAG and launch it; of course, this could also be on a schedule, which is the typical way to run these. Then, going through, you can see the different states, so I can click on each of these tasks and click on View Log, and it will show me the current output. If I scroll to the bottom, we can see it's waiting for that input key we defined in the DAG file. So let's go ahead and provide that key so it can actually perform the task; it's waiting for this key to show up.
1291.6 -> I'm going to go ahead and copy this file to my S3 bucket as well. I'm just going to use the CLI command: I'll copy the CSV file name, go back and copy the target key, which happens to be right there in the log, so I'll grab it from there, and hit go. Now, once this S3 sensor sees that the file is there, when I refresh it, we're going to see a success on that task.
1316.32 -> So now I can go see how the rest of that workflow is doing, because we've got the key, so it should launch that Python function that's going to transform it. You can see it's already completed, so again I can click on the actual task, click on the View Log button, and we can see at the bottom that it shows success; it was able to do what we asked it to do.
1341.76 -> Let's go back to that S3 bucket and see if the target file is actually there. Since I used the same bucket as I was using for the test, it's in the same place, so I can click on the CSV file and open it up. And here's the file we just created: you can see it's the exact same CSV file I loaded, except now the double quotes are gone. Obviously this is a super simple example, but these are the basic building blocks to be able to generate whatever data pipeline you actually need.
1374.64 -> Going back to the console, one other thing we can do here is look at how these logs are actually sent to CloudWatch. As I mentioned, all the metrics and logs that you choose are sent to CloudWatch, and by name you can see here's that demo environment. If I click on the task logs, we can see exactly what it reported. We can see all the information and create alerts; say I want to create an alarm if there's a ton of exceptions or anything like that.
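One hedged way to wire up that kind of alarm with boto3 is to create a metric filter that counts ERROR lines in the task log group, then alarm on it. The log group name below follows the airflow-<environment>-Task naming convention I believe MWAA uses, and the environment name, metric names, and thresholds are placeholders:

    import boto3

    logs = boto3.client("logs")
    cloudwatch = boto3.client("cloudwatch")

    # Assumed task-log group name for an environment called "mwaa-demo".
    LOG_GROUP = "airflow-mwaa-demo-Task"

    # Count log events containing "ERROR" as a custom metric.
    logs.put_metric_filter(
        logGroupName=LOG_GROUP,
        filterName="task-errors",
        filterPattern="ERROR",
        metricTransformations=[{
            "metricName": "TaskErrorCount",   # hypothetical metric name
            "metricNamespace": "MWAADemo",    # hypothetical namespace
            "metricValue": "1",
        }],
    )

    # Alarm if more than 10 errors show up in a 5-minute window.
    cloudwatch.put_metric_alarm(
        AlarmName="mwaa-demo-task-errors",
        Namespace="MWAADemo",
        MetricName="TaskErrorCount",
        Statistic="Sum",
        Period=300,
        EvaluationPeriods=1,
        Threshold=10,
        ComparisonOperator="GreaterThanThreshold",
        TreatMissingData="notBreaching",
    )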
1408.48 -> And those are the basic building blocks of our managed Airflow service.
1413.28 -> So let's talk about integrations. Integrating with AWS services is very easy, as you saw. Since we're running the open-source version of Airflow, that comes with 16 existing integrations, everything from DynamoDB to EMR, EKS, SageMaker, you name it. There's also a ton of container support built in: the ECS operator, the Kubernetes pod operator, and the Docker operator are examples for integrating with containerized workloads. And then of course there are hundreds of built-in and community-created operators, sensors, and plugins to handle a bunch of other needs.
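As a rough, hedged illustration of one of those container integrations, here is a task built on the ECS operator as it shipped in the Airflow 1.10 contrib package; the cluster, task definition, subnet, and security group values are all placeholders, and the exact argument list is worth checking against your Airflow version:

    from datetime import datetime

    from airflow import DAG
    from airflow.contrib.operators.ecs_operator import ECSOperator

    with DAG(
        dag_id="ecs_example",                 # hypothetical DAG
        start_date=datetime(2020, 12, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        # Runs a containerized step on ECS Fargate as one task in the DAG.
        run_container = ECSOperator(
            task_id="run_fargate_task",
            cluster="my-ecs-cluster",              # placeholder cluster name
            task_definition="my-task-def:1",       # placeholder task definition
            launch_type="FARGATE",
            overrides={"containerOverrides": []},  # no per-run overrides in this sketch
            network_configuration={
                "awsvpcConfiguration": {
                    "subnets": ["subnet-aaaa1111"],              # placeholder subnet
                    "securityGroups": ["sg-0123456789abcdef0"],  # placeholder security group
                }
            },
        )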
1453.679 -> One thing I want to stress is that we are committed to supporting solely the open-source version of Airflow, so any improvements we make will go first through the open-source submission process, and we'll make sure they are accepted by the community before we adopt them in the service. We're not going to branch or come up with our own version. We're dependent on Airflow being healthy, and whether you run our managed service or not, we want it to be great for everyone that uses it.
1479.52 -> Talking a little bit about when to choose Amazon MWAA versus some other offerings on AWS: a natural comparison would be Step Functions, which is another orchestration platform. Really, MWAA is about when you are either committed to open source, or you have a lot of non-AWS resources, or you have a lot of customization you want to do for things that are perhaps on premises. Step Functions, on the other hand, is fantastic for high-volume loads; it really provides high-volume orchestration at a low cost.
1515.84 -> So why would you choose Amazon MWAA for your Airflow needs? Well, first and foremost, it's easy: as you saw, the setup is very straightforward, it's all done through the console, the CLI, the API, or CloudFormation, and when you create these environments, they're secure by default. But it's still familiar: it's still that Airflow interface you know and love, and you can still use it the way you want to, now with a great AWS experience. And finally, it's robust: we handle the scaling for you without any configuration, we'll patch it, we'll provide the monitoring integration, and we make it easy to handle those upgrades.
1553.52 -> Thank you very much for your time in this session. If you want to reach me, my information is here on the screen; you can reach me on Twitter or send me an email, and I'm happy to hear from all of you. And please don't forget to complete the session survey; that's how we know if we've done well and will be able to give you the information you're looking for. So once again, thank you very much, and have a great day.

Source: https://www.youtube.com/watch?v=xiOsbCs7T9k