AWS re:Invent 2022 - What’s new in Amazon Athena (ANT208)

Amazon Athena is a highly scalable analytics service that makes it easy to analyze all your data across Amazon S3, in on-premises stores, and on other cloud platforms. Amazon Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. This session offers a deep dive into the service, customer use cases, best practices, newly launched features, and what’s next for Amazon Athena.

Learn more about AWS re:Invent at https://go.aws/3ikK4dD.

Subscribe:
More AWS videos http://bit.ly/2O3zS75
More AWS events videos http://bit.ly/316g9t4

ABOUT AWS
Amazon Web Services (AWS) hosts events, both online and in-person, bringing the cloud computing community together to connect, collaborate, and learn from AWS experts.

AWS is the world’s most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster.

#reInvent2022 #AWSreInvent2022 #AWSEvents


Content

0 -> - Hello everybody, great to see you all here.
1.65 -> re:Invent 2022,
3.99 -> hope you're having a great time so far at the session.
6.3 -> We know you've got the late night session tonight,
8.55 -> so appreciate you coming out to spend some time with us.
11.88 -> This is what's new in Amazon Athena.
14.49 -> My name is Scott Rigney and I'm one of the product managers
16.83 -> on the Athena team.
18.28 -> This session has a bunch of great announcements and updates
21.57 -> on features that myself and my colleagues have been
24.6 -> hard at work building for you over the course of this year.
27.3 -> And we're joined by a great guest speaker.
30.24 -> In a little bit,
31.073 -> we'll hear from Ofer Eliassaf
32.82 -> who joins us today from Mobileye.
34.24 -> Ofer is gonna tell us about how he and his colleagues are
38.64 -> using Athena on a really cool use case.
42.18 -> So before we get into it,
43.56 -> thanks again for joining today's session.
45.93 -> Hopefully we've got a lot of great updates for you.
49.32 -> So today we're gonna go on a bit of a journey through Athena
52.68 -> and data lakes.
54.09 -> We'll start with the core of Athena,
55.95 -> which is our engine and we want to start there because we
58.59 -> have a lot of great announcements to share with you about
61.29 -> all of that.
62.7 -> Next we'll hear from Mobileye on how they're using Athena as
66.57 -> part of their machine vision systems
68.4 -> for autonomous vehicles.
69.83 -> It's a really fascinating use case and we're really happy
72.75 -> that Ofer could join us here in person and tell us more.
76.02 -> And last but not least,
77.28 -> we'll cover some of the announcements and updates on how you
79.86 -> can bring Athena to your data.
81.93 -> Apply it to all of your data sources, you know,
84.15 -> spanning data lakes, external sources and more.
89.07 -> But before we do that,
89.903 -> let's do a quick poll by show of hands,
92.16 -> is anybody here new to Athena?
96.3 -> All right, couple of you out there.
97.71 -> So thanks for joining.
99.38 -> For those of us who are new,
100.83 -> we thought it'd be good to start out with a little bit of a
102.93 -> refresher on Athena.
105.09 -> So let's start off with that.
107.22 -> Athena is an interactive query service that's designed to
111.33 -> make it easy to query and analyze data in your data lakes.
115.2 -> Athena is serverless,
116.43 -> which means there is no infrastructure to set up and manage,
119.3 -> and you pay only for the queries that you run.
122.35 -> What customers love about Athena is
124.8 -> how easy it is to get started. The product
127.71 -> being serverless and having no infrastructure to set up
131.07 -> means you can simply
132.99 -> bring Athena to your data and start analyzing it.
136.03 -> Athena is interactive and built for speed.
139.68 -> Everything we do in the product is designed to help you
142.08 -> answer business questions quickly and do that in a cost
145.53 -> effective manner.
147.03 -> Athena is built on open standards and that starts with our
150.3 -> processing engine,
151.23 -> which is based on open source technology like Presto.
154.44 -> And with that comes support of multiple data formats like
157.83 -> Parquet and Apache Iceberg.
159.63 -> So you can get started with the data that you have today.
163.23 -> Athena is also cost effective.
165.06 -> You pay only for the queries that you run and you can save
167.82 -> up to 90% using compression,
170.01 -> partitioning and converting your data into
172.53 -> optimized column formats.
176.25 -> We launched a few years ago here at re:Invent 2016 and today
179.79 -> we have thousands of customers spanning industries, of
184.26 -> all sizes, ranging from startups to large enterprises, who
188.79 -> have chosen Athena for a variety of use cases including
191.64 -> securities analysis in financial services,
194.02 -> information security where Athena's used to provide rapid
198.45 -> response to and insights on security events.
202.8 -> As well as analytics in highly regulated industries,
205.44 -> especially those dealing with
206.76 -> sensitive data like healthcare.
210.15 -> So let's go a little bit deeper to understand how customers
212.97 -> are using Athena today with some of the common patterns and
215.64 -> use cases that we see.
216.88 -> First off should be a no brainer
218.79 -> and that's interactive analytics.
220.503 -> And with this, what it means is analysts,
223.5 -> data scientists and data engineers send SQL queries to
226.83 -> Athena and often get responses back in seconds.
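That interactive flow can be sketched programmatically. The snippet below only assembles the parameters you would hand to Athena's StartQueryExecution API (for example via an SDK such as boto3); the workgroup name, S3 output location, and the flights table are placeholders, not anything from the session.

```python
def build_query_request(sql, workgroup="primary",
                        output_location="s3://my-athena-results/"):
    """Assemble StartQueryExecution parameters for the Athena API.

    The workgroup and S3 output location are placeholders; with boto3
    you would call athena.start_query_execution(**params), then poll
    get_query_execution until the state reaches SUCCEEDED and fetch
    rows with get_query_results.
    """
    return {
        "QueryString": sql,
        "WorkGroup": workgroup,
        "ResultConfiguration": {"OutputLocation": output_location},
    }

params = build_query_request(
    "SELECT carrier, COUNT(*) AS cancellations "
    "FROM flights WHERE cancelled = 1 GROUP BY carrier"
)
```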
230.05 -> Another is business intelligence.
232.38 -> It's a very common and popular one for Athena and with our
235.38 -> drivers and SDK you can plug Athena into your preferred
239.46 -> business intelligence applications or SQL IDEs like
243.39 -> Power BI and Tableau.
246.15 -> Data workflows are an interesting area for Athena.
249.33 -> Here what customers do is build
250.98 -> what we like to call self-service
252.23 -> data workflows, using SQL that makes data available to other
255.96 -> applications, teammates or processes.
259.74 -> For example,
260.573 -> our integration with step functions
261.93 -> which we released last year,
264.06 -> gives you a really easy-to-use, no-code experience for
267.24 -> building these drag and drop data workflows.
270.63 -> Many of our customers use Athena for a query layer on custom
275.58 -> multi-tenant user facing applications often coming with a
278.76 -> custom UI.
280.14 -> And last but not least, machine learning.
282.75 -> Machine learning algorithms, as you know,
284.94 -> tend to benefit from diverse input data, and to
289.17 -> bring all that diverse data together,
290.83 -> what data scientists often do today is build ETL jobs that
295.47 -> move raw data from one source to another one so they could
298.95 -> ultimately join it together before feeding it into systems
301.89 -> like Amazon SageMaker to run machine learning training and
305.1 -> inference workloads.
306.19 -> Athena provides close to 30 data sources so you can
310.26 -> use Athena as a SQL layer to pull all that data together
313.74 -> to provide a kind of common experience for creating those
316.95 -> base tables for machine learning workflows.
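As a hedged sketch of that pattern: once an external source has been registered through an Athena data source connector, a single SQL statement can join it with a data lake table to build an ML base table. Every catalog, database, and table name below is hypothetical.

```python
# Hypothetical names throughout: "mysql_orders" stands for a federated
# catalog registered via an Athena data source connector, and
# "lake"."customers" for an S3-backed table in the Glue data catalog.
ML_BASE_TABLE_SQL = """
SELECT c.customer_id,
       c.segment,
       o.order_total,
       o.order_date
FROM "mysql_orders"."sales"."orders" AS o
JOIN "AwsDataCatalog"."lake"."customers" AS c
  ON o.customer_id = c.customer_id
"""

def columns_selected(sql):
    """Tiny helper: list the column names pulled into the base table."""
    select_clause = sql.split("FROM")[0].replace("SELECT", "")
    return [part.strip().split(".")[-1] for part in select_clause.split(",")]
```

The resulting statement is what you would submit to Athena; the helper just inspects which feature columns the base table will carry.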
319.63 -> What all of these use cases have in common is that they're
322.74 -> based on SQL.
324.48 -> SQL is great,
325.65 -> it's often one of the first things that analysts learn and
329.01 -> it's widely understood across both
330.66 -> business and technical domains.
336.18 -> One of the challenges with SQL is that many describe it as
340.08 -> being not expressive enough to answer some of the more
342.75 -> complex business questions that often come up.
346.32 -> To get around that,
347.52 -> a lot of folks turn to open source frameworks like
350.67 -> Python and Spark to quickly find
353.76 -> insights in that data.
354.7 -> But what they quickly discover in scaling those products and
358.14 -> frameworks is that it's really hard
359.82 -> to do that in enterprise settings.
362.85 -> Not only is it hard to get those frameworks to work but it
366.06 -> also requires heavy investment upfront to optimize your
370.35 -> infrastructure so that it is right sized to serve your
373.71 -> business needs while not breaking the bank.
376.86 -> And all of those are reasons why we were really excited
379.44 -> earlier today in Swami's keynote to unveil a brand new
382.98 -> capability in Athena,
384.03 -> which is Apache Spark. With Amazon Athena for Apache Spark,
388.4 -> you can run interactive analytics quicker than
391.74 -> ever before, and with that you get the ease of use and speed
396.27 -> that you've come to expect out of Athena.
399.07 -> With this brand new experience you can start interactive Spark
402.33 -> applications in just under one second,
404.55 -> which is pretty amazing.
406.8 -> The product uses our AWS-optimized Spark runtime,
410.4 -> which is up to three times faster than open source Spark,
413.36 -> which means you spend a lot less time waiting to provision
416.64 -> clusters and wait for them to come online and produce the
419.85 -> insights that you're looking for, and get to spend more time
423.12 -> discovering insights from your data.
427.89 -> To allow you to run those workloads on Athena,
431.7 -> we've added a new experience in our console, it's a notebook
435.84 -> experience, and that allows you to write Python code, run
439.35 -> calculations in Spark, visualize your data, and a lot more.
443.35 -> This is a really exciting new capability for Athena and
446.76 -> there's a ton of awesome capability there, unfortunately,
449.56 -> too much to go into in detail in
452.7 -> today's session.
454.02 -> So we recommend that you check out ANT209,
457.2 -> which is a deep dive on the Spark announcement and that's
460.77 -> happening tomorrow.
461.82 -> So be sure to check the session guide and sign up for that.
466.23 -> Now if you're sitting there wondering how does this impact
469.14 -> the SQL part of Athena that I probably
470.97 -> came to learn about today?
472.27 -> Well don't worry because we've got a lot of really great
474.9 -> news for you on that front,
476.52 -> and that all starts with our SQL engine,
478.8 -> the latest version of which is version three that went GA
481.89 -> just a handful of weeks ago.
484.26 -> And as you've come to expect from Athena engine releases,
487.26 -> V3 is providing faster queries
489.48 -> and is more efficient at scans,
491.43 -> which means you get better performance and lower cost for
494.31 -> your workloads.
496 -> One of the interesting things with this release is that
498.06 -> we've rebuilt the way in which we pull components
501.18 -> from open source.
503.19 -> What that means is version three
504.6 -> will provide you more current,
506.97 -> more numerous and more interesting features coming from the
509.7 -> open source community with greater regularity.
513.06 -> The other big piece of news is that engine version three
515.72 -> also incorporates Trino.
518.22 -> Trino, if you're not familiar, is a fork of Presto DB,
521.4 -> which Athena engine version two is based on.
526.29 -> It provides a lot of similar functionality, but there are some key
528.18 -> differences across the products, and
531.39 -> customers are asking to leverage some of the capabilities
534.51 -> from the Trino variant of Presto DB.
537.24 -> So this version of Athena includes both Trino and Presto and
541.53 -> comes bundled with optimizations built by our team to scale
545.76 -> those frameworks for use in AWS.
549.9 -> So let's take a look at some of the benefits that are coming
552.18 -> with version three. Right outta the gate you get 50 new
555.75 -> functions and that's gonna help you expand the types of
558.96 -> analytics you can apply to your data.
561.93 -> We've made over 90 enhancements to existing functions,
565.95 -> spanning query execution,
567.27 -> memory usage and how we process data all to give you queries
572.01 -> that run faster.
573.9 -> What that means in our benchmarks is we're seeing about a
575.88 -> 20% performance speed up compared to version two and some
579.69 -> queries seeing up to 10 times faster execution in specific
584.04 -> stages of query execution like planning.
585.87 -> So really awesome benefits kind of in a performance domain.
588.99 -> Best yet you get all of that for the same price and with
592.12 -> 100% functional parity with the version two API, and that
595.62 -> should make it very easy to upgrade to version three today.
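If you manage workgroups through the API rather than the console, the upgrade can be expressed as an UpdateWorkGroup call. The sketch below only builds the request parameters; the field names follow our reading of the Athena API, and the workgroup name is a placeholder.

```python
def engine_upgrade_params(workgroup="primary"):
    """Parameters for Athena's UpdateWorkGroup API selecting engine v3.

    With boto3 you would pass these as
    athena.update_work_group(**engine_upgrade_params(...)); Athena
    reports the resulting EffectiveEngineVersion on its side.
    """
    return {
        "WorkGroup": workgroup,
        "ConfigurationUpdates": {
            "EngineVersion": {
                "SelectedEngineVersion": "Athena engine version 3",
            },
        },
    }

params = engine_upgrade_params("analytics-team")
```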
600.12 -> As I mentioned,
600.953 -> we rolled out V3 a few weeks ago and are happy to report
603.51 -> that customers are seeing some of the benefits in their
606.63 -> applications.
607.95 -> Orca Security is one such customer who is using Athena
611.01 -> within their machine learning powered security event
614.4 -> detection product. Arie Teter who leads R&D at Orca
618.15 -> Security shared this quote with us and Arie's reported that
621.78 -> Orca's already feeling the scale and performance benefits
624.51 -> that come with version three and are starting to tap into
627.66 -> that expanded feature set that comes with the Trino
631.08 -> variant of the SQL engine.
634.95 -> So to speak more on that, as I mentioned,
637.35 -> one of the highlights is the expanded set of analytics
640.53 -> capabilities that come with version three.
643.56 -> For example, there are a handful of new
645.15 -> aggregation functions like T-Digest, which
647.91 -> allows you to run approximate rank-based statistics with
651.99 -> high accuracy.
653.09 -> Another one if you're doing geospatial analytics are new and
656.91 -> improved geospatial functions that help you bring location
660.36 -> based insights to your analytics programs.
663.99 -> And also available are new operators like MATCH_RECOGNIZE,
666.96 -> which bring more performant pattern matching to use cases
669.87 -> like fraud detection and sensor data analysis.
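To make the pattern-matching point concrete, here is a hedged sketch of a MATCH_RECOGNIZE query in Trino-style syntax, held as a Python string. The sensor_readings table and its columns are hypothetical, so treat this as illustrative rather than a tested query.

```python
# Hypothetical table: sensor_readings(device_id, ts, temperature).
# Finds, per device, runs of strictly rising temperature readings.
RISING_TEMP_SQL = """
SELECT device_id, start_ts, peak_temp
FROM sensor_readings
MATCH_RECOGNIZE (
    PARTITION BY device_id
    ORDER BY ts
    MEASURES
        FIRST(A.ts) AS start_ts,
        LAST(UP.temperature) AS peak_temp
    ONE ROW PER MATCH
    PATTERN (A UP+)
    DEFINE UP AS temperature > PREV(temperature)
)
"""
```

The DEFINE clause gives the pattern variable its meaning (each UP row is hotter than the previous row), and PATTERN describes the row sequence to match, which is what makes this more performant than self-join approaches to the same question.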
672.82 -> Now there's a special one at the top left of the screen
675.36 -> which we wanted to dive into a little bit
678.39 -> further, cause that's a brand new one for Athena, and that
681.06 -> is query result caching.
683.64 -> Before we get into that, some background.
686.28 -> Many of the customers we talk with often describe using
689.13 -> Athena in what we'll call multi-user applications.
692.82 -> These are often applications like business intelligence
694.92 -> where multiple users are accessing Athena from the context
698.34 -> of another application.
700.08 -> Now in this model,
701.67 -> users send queries to Athena using those applications.
705.15 -> Athena runs the queries and returns results via our API and
708.33 -> drivers, and that's sort of the standard flow of
712.53 -> running queries on Athena.
714.3 -> The challenge with this model is that as the number of
718.62 -> users expands,
719.48 -> what we tend to see is users sending very similar or
722.73 -> oftentimes identical queries to Athena.
726.33 -> With that comes added latency,
727.64 -> which can increase the time to insight on your data, and
733.77 -> those repeat query executions can drive your costs higher.
737.94 -> So that's why we were excited to release just a few weeks
740.04 -> ago a caching feature for Athena
743.1 -> which we call query result reuse.
745.68 -> And that was a release just a few weeks ago as I mentioned.
748.35 -> With query result reuse,
749.73 -> Athena automatically accelerates queries by returning the
753.78 -> cache results of previous executions
756.57 -> and when enabled queries
757.83 -> using this query result reuse feature run up to five times
761.1 -> faster and don't scan any data.
763.71 -> So we're getting a lot of opportunity for performance and
765.9 -> cost savings benefits out of that one feature.
768.73 -> So to give you a sense for how that works and what it looks
771.12 -> like, we'll take a look at a sample query here.
773.43 -> Here we're doing a simple count of canceled flights grouped
776.85 -> by day of the week, and in this first run, highlighted in the
780.09 -> orange box, what you can see is that our query took about
783.15 -> three seconds to process and scanned around 50 megabytes
786.45 -> of data.
787.53 -> So to turn on query result reuse,
789.36 -> we just toggle the slider switch there and then we can rerun
793.53 -> our query, and what we see is that Athena is applying the
797.01 -> cached executions to this query, returning the result in
800.46 -> around 250 milliseconds.
802.08 -> So huge speed up in terms of latency that users are gonna
805.95 -> feel the benefits of, and best yet you see no data scanned
808.92 -> with that query.
810.93 -> So to make it easy to see when
813.21 -> caching is happening in the system,
814.86 -> we've added a few new views to the product including the
817.35 -> query history which is shown here, and that's giving you a
821.22 -> better sense, when inspecting your query history, of how
824.91 -> frequently cached results are being returned.
827.37 -> So that should make it easy to quickly diagnose and see
830.46 -> what's happening in the system.
832.17 -> So stepping back query result reuse is available today and
836.25 -> it's built for engine version three.
838.56 -> It's easy to use requiring no changes
840.66 -> in SQL queries to get started.
843 -> It's automated meaning Athena automatically identifies the
845.76 -> queries that can be accelerated and automatically returns
848.91 -> the cached result for those queries without scanning any data.
851.94 -> And then thinking back on those
853.59 -> multi-user applications or contexts,
856.17 -> it's really easy to bring all of these benefits to those
858.06 -> users as well.
859.65 -> For driver based clients it's a simple configuration change
862.74 -> where you just enable the query result reuse behavior within
865.77 -> the latest version of our driver.
867.72 -> And for API based clients, you simply specify a binary
872.16 -> toggle for whether or not to use query result reuse when
875.31 -> submitting new queries.
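For API-based clients, that toggle is part of the StartQueryExecution request. The helper below assembles such a request with result reuse enabled; the field names follow our reading of the Athena API, and the workgroup and max-age value are placeholder choices.

```python
def reuse_query_params(sql, max_age_minutes=60, workgroup="primary"):
    """StartQueryExecution parameters with query result reuse enabled.

    max_age_minutes bounds how stale a cached result may be before
    Athena re-runs the query; 60 here is an arbitrary example value.
    With boto3: athena.start_query_execution(**reuse_query_params(...)).
    """
    return {
        "QueryString": sql,
        "WorkGroup": workgroup,
        "ResultReuseConfiguration": {
            "ResultReuseByAgeConfiguration": {
                "Enabled": True,
                "MaxAgeInMinutes": max_age_minutes,
            },
        },
    }

params = reuse_query_params("SELECT day_of_week, COUNT(*) FROM flights GROUP BY 1")
```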
877.95 -> So between our new engine and features like query result
880.95 -> reuse, queries are running faster out of the box with less
885.12 -> need for the heavy tuning of SQL queries
888.66 -> that's otherwise needed up front.
890.52 -> But for those of us who,
891.66 -> and I know a few of you are out there,
893.94 -> like to go a lot deeper into our
895.86 -> queries and squeeze all of the performance out of them
898.62 -> as possible,
899.53 -> earlier this year we released a handful of new features in
902.94 -> our console to make it easy to inspect queries and really
906.3 -> dive into their performance.
908.46 -> That starts with visual query plans.
910.08 -> So previously before you ran a query you could inspect a
912.84 -> query using the EXPLAIN SQL syntax, and what we heard from
916.41 -> customers is that it wasn't easy enough for SQL analysts who
919.44 -> were using the console and wanted a simpler
921.78 -> experience for inspecting those query plans.
923.81 -> So we've added a single click visual experience for this in
927.36 -> the Athena console and to access the query plans for a query
930.63 -> that you've got loaded into the console,
932.49 -> you simply click the explain button and that's gonna take
936.09 -> you to the query plan.
937.98 -> Up next you can choose between the distributed and logical
941.34 -> plan for your query, and that's gonna help you inspect the joins
944.7 -> and other complex operations
946.08 -> that are happening in your query.
947.91 -> What's really cool is the experience is interactive so you
951.33 -> can pinch and zoom and sort of inspect individual stages of
954.84 -> your query to learn more about what's happening at each of
957.24 -> those stages.
959.24 -> After you run your query,
960.81 -> we now display some really useful runtime statistics and
963.81 -> other query performance metadata.
965.87 -> Here we see some key data like the number of rows returned
969.69 -> by your query, and that's really helpful for validating that your
972.42 -> query is working as expected.
974.9 -> You also see summarized performance data shown in the bar
978.87 -> graph at the bottom,
980.19 -> which encompasses all of the key stages of query execution
982.8 -> including query planning, queuing and query execution.
986.85 -> So that's all great,
988.02 -> but if you want to go even deeper into how your query
990.72 -> executed, you can click the execution details button,
993.63 -> which is shown at the very bottom right and that's gonna
996.57 -> bring you into the deep dive view of
998.67 -> how your query executed.
1001.88 -> So here, just to orient you around how the
1004.91 -> information is displayed,
1006.08 -> the top node represents the last stage of your query while
1009.8 -> the bottommost nodes represent the earliest stages.
1012.98 -> So you kind of think about it in terms of bottom up.
1015.68 -> The green bar in each of these nodes shows the duration in
1019.22 -> relative terms of the run time for that stage.
1022.86 -> And what that allows you to do is zoom out and quickly see
1025.8 -> where your query is spending the most amount of time and
1028.34 -> that should give you good insight as to where you can dig in
1031.85 -> to identify optimizations that'll make your queries run
1034.43 -> faster.
1035.55 -> Clicking a node shows, off to the very right,
1038 -> really interesting and useful stage-level operator data that
1042.17 -> allows you to inspect the operations at
1044.33 -> each stage of your query.
1047.62 -> Now, if you're not using the console today and still want to
1050.18 -> do analysis on your queries and how they're running,
1053.78 -> potentially as a bulk use case,
1056.75 -> we also released an API, the query runtime statistics
1061.1 -> API, to support that experience as well.
1063.71 -> And that's the API that's sitting behind all of the rich
1066.38 -> data that we're surfacing for the queries after they've run.
1068.64 -> So it's a really great API to check out.
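As a sketch of consuming that API, the helper below pulls a few headline numbers out of a runtime-statistics response. The response shape, and the fabricated example values, reflect our reading of the API rather than output from a real query.

```python
def summarize_runtime_stats(response):
    """Extract headline numbers from a query runtime statistics response.

    The field names below follow our reading of Athena's
    GetQueryRuntimeStatistics API; treat the exact shape as an
    assumption to verify against the API reference.
    """
    stats = response["QueryRuntimeStatistics"]
    timeline = stats.get("Timeline", {})
    rows = stats.get("Rows", {})
    return {
        "output_rows": rows.get("OutputRows"),
        "planning_ms": timeline.get("QueryPlanningTimeInMillis"),
        "queue_ms": timeline.get("QueryQueueTimeInMillis"),
        "execution_ms": timeline.get("EngineExecutionTimeInMillis"),
    }

# Minimal fabricated response, for illustration only.
fake_response = {"QueryRuntimeStatistics": {
    "Timeline": {"QueryPlanningTimeInMillis": 120,
                 "QueryQueueTimeInMillis": 35,
                 "EngineExecutionTimeInMillis": 2840},
    "Rows": {"InputRows": 1_000_000, "OutputRows": 7},
}}
summary = summarize_runtime_stats(fake_response)
```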
1073 -> Query tuning is certainly a great way to get more
1075.98 -> performance at lower cost for your workloads.
1078.93 -> Another strategy we often recommend actually deals with the
1082.16 -> data structure itself and that's really important because
1085.52 -> when dealing with big data,
1086.42 -> how you structure your data can have a huge impact on query
1089.93 -> performance as well as cost.
1092.45 -> To deal with this,
1093.53 -> we typically recommend customers use columnar
1096.04 -> data formats, partitioning and compression.
1099.5 -> And when you use all of those together you can save up to
1102.23 -> 90% on per query execution costs.
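One hedged way to apply all three at once is a CREATE TABLE AS SELECT (CTAS) statement that rewrites raw data into compressed, partitioned Parquet. The table, columns, and S3 location below are hypothetical.

```python
# Hypothetical source table raw_events(event_id, payload, dt); for
# Athena CTAS the partition column must come last in the SELECT list.
CTAS_SQL = """
CREATE TABLE events_optimized
WITH (
    format = 'PARQUET',
    parquet_compression = 'SNAPPY',
    partitioned_by = ARRAY['dt'],
    external_location = 's3://my-bucket/events_optimized/'
)
AS SELECT event_id, payload, dt
FROM raw_events
"""
```

After this runs once, queries that filter on dt read only the matching partitions, and Parquet's columnar layout plus Snappy compression further shrinks the bytes scanned per query.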
1105.67 -> Optimizing your data lake is a big investment,
1109.01 -> it's a really important part of the journey as we'll learn
1111.56 -> in a little bit.
1112.88 -> So to hear about that, I want to invite our guest speaker to the
1115.16 -> stage to tell us about the journey he and his colleagues are
1117.83 -> on at Mobileye to not only optimize their analytics
1121.07 -> workloads but to bring about a really exciting future
1123.64 -> involving autonomous driving that'll benefit all of us.
1126.05 -> So, Ofer.
1137.3 -> - So hello everyone,
1139.16 -> I hope you're having a great time here in the conference.
1141.95 -> Thank you Scott for introducing me. Like Scott said,
1144.83 -> my name is Ofer Eliassaf
1146.18 -> and I'm a director of Mobileye's REM cloud infrastructure.
1149.96 -> I'm going to provide an overview of Mobileye's autonomous
1153.2 -> vehicles or AV mapping technology called REM.
1157.43 -> We will discuss REM's data ingestion challenge and our need
1161.06 -> for a data lake,
1162.44 -> and we will then talk about our journey with
1164.71 -> Amazon Athena as our ingestion data lake.
1170.42 -> So why do autonomous vehicles
1172.1 -> need high definition maps or HD maps?
1174.68 -> Sometimes they're called AV maps or autonomous vehicle maps.
1178.46 -> As you all know,
1179.293 -> Mobileye is one of the world's leaders in the autonomous
1182.15 -> vehicles industry. Mobileye discovered
1184.73 -> that accurate maps are
1185.81 -> necessary in order for autonomous vehicles
1188.3 -> to operate better.
1190.07 -> The reason is that the vehicle needs to plan things like
1192.83 -> lane transitions and do routing at distances where its
1195.89 -> sensors are not effective enough or there is no line of
1199.25 -> sight for the vehicle.
1201.02 -> You can think about it using the following illustration.
1203.4 -> Imagine a human being trying to drive
1206.69 -> in a place they know well,
1208.16 -> as opposed to a place they have never been before.
1213.99 -> Regular standard-definition maps, as you all know, will not do.
1218.12 -> An autonomous car needs a map that includes
1221.72 -> much more detail,
1222.553 -> such as semantic information, road curvature,
1225.77 -> traffic signs, and everything needs to be
1227.87 -> at centimeter accuracy.
1230.39 -> It requires a map with semantic information such as which
1234.26 -> traffic sign or traffic light is relevant to which lane,
1238.07 -> who gives way to who,
1239.36 -> the relations between the lanes and much more.
1242.69 -> The map must be updated continuously so that when changes
1245.69 -> occur in the world,
1247.16 -> they're immediately visible on the map.
1250.52 -> It is very helpful to include in the map the actual driving
1253.58 -> patterns of the vehicles in given lanes such as average speed
1257.42 -> or lane centering,
1258.72 -> which is not always according to the traffic rules and in
1261.56 -> some cases the lanes are worn out and cannot be observed.
1265.32 -> Let's talk a bit about terminology.
1267.98 -> REM, which stands for Road Experience Management,
1270.59 -> is Mobileye's solution for generating such maps on a global
1273.77 -> scale, and road book is Mobileye's AV map product;
1277.67 -> it's the actual map itself.
1280.84 -> Let's look together
1282.44 -> at such a road book visualization at Europe scale.
1284.68 -> We start by zooming in on a relatively small area,
1288.17 -> a small junction.
1289.43 -> Please look at the richness of the visualization.
1291.98 -> We can see the lanes, the landmarks, the traffic lights,
1295.28 -> crosswalks, roundabouts, et cetera.
1298.28 -> And it then grows to a magnitude of all of Europe.
1301.49 -> This is what we are dealing here with.
1305.39 -> Okay.
1307.28 -> Let's talk about REM's data ingestion scale.
1310.55 -> So in order to generate a road book,
1313.34 -> we are using a crowdsourcing technology where vehicles upload
1316.85 -> payloads containing a model of the road, driving
1319.91 -> behavior and the surroundings of the vehicles.
1323.15 -> REM is working with many car companies, which are called OEMs,
1327.5 -> such as VW, BMW,
1329.69 -> Nissan, Ford, Geely, in production
1333.47 -> for several years now. This is a growing business.
1336.8 -> We are going to collaborate with
1338.33 -> many more OEMs in years to come.
1341.27 -> We are collecting information from tens of millions of
1345.14 -> kilometers per day, and we operate with global coverage:
1348.08 -> United States, Europe, parts of Asia, China, Israel,
1352.73 -> and much more.
1354.47 -> In this visualization we see the ingestion coverage.
1357.29 -> The video starts with a single day and it then shows the
1360.47 -> coverage after one week, one month and eventually 10 months.
1366.14 -> It basically illustrates that we have enough coverage to
1369.35 -> continuously map the entirety of Europe in a relatively
1373.82 -> small amount of time.
1378.44 -> Let's have a look at how REM works in a high-level overview.
1382.07 -> Our relatively cheap EyeQ chips are spread across millions of
1385.16 -> consumer cars around the world and they process images on
1388.7 -> the edge device itself.
1390.38 -> By using state-of-the-art algorithms,
1392.36 -> we create a model of the scene
1393.65 -> surrounding the location of the car.
1395.9 -> A multidimensional model of the road is created alongside
1399.5 -> all the signs, traffic lights, road marks, et cetera.
1402.91 -> This information is then packed into a very dense
1406.57 -> payload which we call an RSD,
1408.77 -> which stands for road segment data.
1413.36 -> An RSD usually contains information from 10 kilometers and its
1417.23 -> density is up to 10 kilobytes per kilometer,
1419.57 -> and so it means that every payload is roughly
1422.69 -> 100 kilobytes in size.
1426.4 -> These RSDs are anonymized, encrypted and uploaded to the cloud.
1430.7 -> And while each such RSD is a bit noisy,
1433.4 -> aggregating many of them using the REM technology from the
1436.76 -> same lane around the same time
1438.44 -> generates centimeter accuracy for the lane.
1441.91 -> Mobileye invested a lot to make this process automatic at a
1445.31 -> click of a button on a global scale, and this road book is
1449.66 -> generated from crowdsourcing, so its time to reflect reality
1452.45 -> after a change in the world is very small.
1458.03 -> the road book is then sent to the vehicle
1460.19 -> which runs Mobileye localization technology
1462.53 -> that compares what the vehicle
1464.21 -> detects to the elements of the map. By doing so,
1466.97 -> the vehicle locates itself on the map with centimeter accuracy
1471.38 -> and this enables autonomous vehicle driving features.
1475.15 -> Everything needs to be scalable: the harvesting technology,
1478.46 -> the APIs, the aggregation computation,
1481.1 -> which happens on tens or even
1482.81 -> hundreds of concurrent CPUs.
1485 -> And our approach
1486.32 -> allows us to generate the detailed semantic information I
1489.89 -> mentioned earlier.
1493.1 -> Let's talk about why we need an ingestion data lake.
1496.71 -> Each ingestion payload that arrives is kept in a data lake.
1500.6 -> After intensive computation,
1502.19 -> each record contains more than 150 attributes containing
1506.93 -> things like events, geometries, time measurements,
1510.2 -> lengths of drives, specific events, metadata, and more.
1515.12 -> This data is later being queried for many use cases.
1519.1 -> The main use case is roadbook creation.
1522.16 -> Every map creation begins with a query to the ingestion data
1526.13 -> lake and we will drill into this use case a bit later in the
1529.58 -> next slide.
1530.69 -> But we also have analytics queries, so we can answer
1533.48 -> questions like how much time
1536.27 -> it'll take us to get coverage of the United States highways.
1540.13 -> We also have a UI so that our customers can see their RSD
1543.89 -> coverage, and we try to leverage this data lake to build new
1547.91 -> business models around it, such as smart cities and more.
1552.59 -> And this means that we have many internal customers inside
1557.57 -> Mobileye that need access to this data.
1560.21 -> And new types of usage
1562.22 -> come along every now and then.
1568.56 -> Let's focus on the roadbook creation usage
1570.95 -> of the ingestion data lake.
1572.51 -> Our mapping process is done on geographical cells of about
1576.02 -> 10 by 10 kilometers.
1577.56 -> Every cell's processing starts with a query to the data lake,
1581.69 -> scanning dozens of gigabytes
1583.22 -> and returning millions of records.
1585.56 -> This query happens as you might guess by now
1588.5 -> on Amazon Athena.
1590 -> Most of the queries are scanning three to five months of
1592.82 -> data, but some scan more.
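The per-cell query described here can be pictured with a small query builder. This is an illustrative Python sketch only; the table and column names (`rsd_records`, `lon`, `lat`, `ingestion_date`) are assumptions, not Mobileye's actual schema.

```python
# Sketch: build one Athena SQL string per 10x10 km cell, filtering the
# ingestion data lake by the cell's bounding box and a recent time window.
from datetime import date, timedelta


def build_cell_query(min_lon, min_lat, max_lon, max_lat,
                     months_back=5, table="ingestion_lake.rsd_records"):
    """Return an Athena SQL string scanning one geographic cell."""
    since = date.today() - timedelta(days=30 * months_back)
    return (
        f"SELECT * FROM {table} "
        f"WHERE lon BETWEEN {min_lon} AND {max_lon} "
        f"AND lat BETWEEN {min_lat} AND {max_lat} "
        f"AND ingestion_date >= DATE '{since.isoformat()}'"
    )


query = build_cell_query(34.7, 31.9, 34.8, 32.0, months_back=3)
print(query)
```

Each such string would then be submitted to Athena; running one query per cell is what makes the thousands of concurrent mapping processes possible.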
1595.22 -> We sometimes generate multiple maps in parallel and I'm not
1598.28 -> talking about mapping multiple cells in parallel.
1600.36 -> This goes without saying.
1602.19 -> What I mean by that is that we sometimes run multiple huge
1606.02 -> maps such as mapping of Europe and another mapping of
1609.41 -> the United States, in parallel.
1611.21 -> Each containing multiple cells. We are running up
1614.84 -> to tens of thousands of such queries per day. Again,
1617.78 -> all of these activities happening on Amazon Athena.
1620.17 -> In this visualization,
1623.03 -> we can see a single mapping job of Europe running multiple
1626.57 -> cells in parallel.
1627.65 -> The small squares that you see are cells and the map is
1630.98 -> generated by mapping multiple cells separately in parallel
1635.6 -> and stitching them together into a coherent map.
1643.43 -> If you're wondering how we got to this level of analytics,
1646.52 -> I want to take a few minutes to explain how our journey with
1649.58 -> Amazon Athena started with our ingestion data lake.
1653.79 -> This is a simplified version,
1655.31 -> very simplified version of the system we had back then.
1658.49 -> It begins with a vehicle that uploads payloads to a REST API
1662.69 -> interface used to receive the payloads.
1665.55 -> When a payload arrives,
1666.92 -> it passes some simple sanity checks and a computation plan
1670.1 -> is built for it.
1672.17 -> It is then passed to a worker queue
1674.57 -> that executes the plan.
1676.51 -> This plan is a very compute-intensive process.
1678.69 -> It takes dozens of seconds per single payload.
1682.58 -> At the end of the execution we extract the 150 metadata
1686.48 -> attributes we mentioned that we want to keep in the
1689.39 -> ingestion data lake.
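The flow just described, sanity check, plan construction, worker execution, and attribute extraction at the end, can be sketched roughly as below. All function names, fields, and plan steps are illustrative assumptions, not the real system.

```python
# Minimal sketch of the ingestion flow: payload -> sanity check ->
# computation plan -> execution -> extracted metadata attributes.
def sanity_check(payload):
    # Hypothetical check: the payload names a vehicle and is non-empty.
    return "vehicle_id" in payload and payload.get("size_bytes", 0) > 0


def build_plan(payload):
    # In the real system this plan drives a compute-intensive job taking
    # dozens of seconds; here it is just a list of made-up step names.
    return ["decode", "align_to_map", "extract_events"]


def execute_plan(payload, plan):
    # Pretend each step enriches the payload; return extracted metadata
    # (the real system extracts 150+ attributes).
    return {"vehicle_id": payload["vehicle_id"], "steps_run": len(plan)}


def ingest(payload):
    if not sanity_check(payload):
        raise ValueError("payload failed sanity check")
    plan = build_plan(payload)
    return execute_plan(payload, plan)


record = ingest({"vehicle_id": "v-42", "size_bytes": 1024})
print(record)
```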
1690.92 -> The question that we struggled with back then was which data
1694.73 -> lake engine could function properly and provide us query
1697.82 -> services for our use cases.
1701.21 -> So what were our requirements?
1704.03 -> We had to support joins on different tables, such as
1707.33 -> geometries of roads and things like that,
1709.91 -> so our data is relational by nature.
1712.91 -> We need the data to be fresh, so every new payload that
1715.61 -> arrives is available to queries soon after.
1720.14 -> We had the need to support geographical queries
1722.51 -> and we wanted something
1723.65 -> that will be reliable and easy to maintain because this is a
1726.41 -> mission critical functionality.
1729.05 -> We needed reasonable query speed.
1731.21 -> It doesn't have to be subsecond but it should run fast
1734 -> enough for our needs.
1736.04 -> Like I already mentioned,
1737.84 -> we need to support a high concurrency of up to thousands of
1740.87 -> mapping processes in parallel, and we want the storage to be
1745.13 -> relatively cheap.
1747.02 -> Back then our usage pattern was unknown, so we needed
1749.99 -> something that we could
1751.79 -> count on to keep evolving together with us.
1756.89 -> So what was the process that led us to choose Amazon Athena?
1761.45 -> It started with the design phase.
1763.38 -> Our initial idea was to use two types of systems.
1766.67 -> One for the short term storage,
1768.2 -> which was supposed to be more expensive but very efficient
1772.88 -> and one for the long term storage that was supposed to be
1775.37 -> cheap but not very efficient.
1778.52 -> And we had the need to start with the long term storage
1781.58 -> and our research led us to conclude that Amazon Athena
1784.7 -> together with Parquet files on S3
1786.65 -> would work great for us. We consulted our AWS
1790.58 -> solution architects and they suggested that we go to a
1793.07 -> data lab in Seattle. We went to the data lab in
1796.73 -> Seattle and implemented a POC, then got back to Israel
1801.02 -> and we
1803.39 -> finalized the implementation and went into production.
1807.41 -> As we were in production, the usage grew,
1810.32 -> both in usage patterns, amount of data, and cost, and we
1814.91 -> had to do a few iterations of optimization.
1818.34 -> The good news is that we were able to scale with Athena
1821.99 -> with our current storage strategy of using
1823.67 -> S3 with the Parquet data format,
1826.43 -> and we don't foresee a need for a second system for
1828.77 -> short term storage in analytics.
1831.92 -> And putting things in perspective,
1834.77 -> the design phase took us two months and the implementation
1837.59 -> phase took us two months, and we have been in production for
1840.38 -> several years now with very small development effort,
1843.92 -> which is nice.
1847.94 -> This is our current architecture, again at a very
1851.24 -> high level, without many of the small details.
1853.01 -> As you can see,
1854.12 -> the diagram is the same as the previous slide I showed, up
1856.94 -> until this point.
1857.773 -> We have the 150 metadata attributes and we use Kinesis
1861.89 -> Data Streams, Kinesis Data Firehose,
1863.3 -> and a Lambda function to take the RSD metadata and
1866.87 -> convert it into Parquet files on S3.
1869.81 -> We run a daily coalesce job to
1871.19 -> reduce the number of files so we
1872.96 -> can increase the speed and reduce the S3 cost.
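The coalesce idea can be shown with a sketch of the grouping logic alone, leaving S3 and Parquet out entirely. The 128 MB target and file sizes are illustrative assumptions.

```python
# Sketch: pack many small files into fewer output files near a target size,
# which is what a daily coalesce job does to speed up queries and cut cost.
def coalesce(file_sizes_mb, target_mb=128):
    """Greedily group small files into batches of roughly target_mb."""
    batches, current, current_size = [], [], 0
    for size in file_sizes_mb:
        if current and current_size + size > target_mb:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches


small_files = [16] * 20          # twenty 16 MB files
batches = coalesce(small_files)  # grouped toward 128 MB outputs
print(len(batches))
```

Fewer, larger files mean fewer S3 GET requests and less per-file overhead for the query engine.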
1876.18 -> And of course Amazon Athena is the query engine
1879.44 -> that we use for these use cases.
1884.15 -> So this slide is all about the optimization we have done
1887.03 -> during the time we are in production.
1888.62 -> Optimization is quite an iterative process.
1891.94 -> You are okay until something is not scaling correctly and
1894.86 -> then you need to fix things.
1896.99 -> And we did a few types of optimization.
1899.45 -> So the first type was query level optimization where you
1902.54 -> look for better SQL statements for your queries.
1905.51 -> We also partnered with AWS to reach massive concurrency.
1909.92 -> We had to refactor our data models according to Amazon
1912.53 -> Athena best practices and we started using new capabilities of
1917.72 -> Amazon Athena when they showed up. For example,
1920.06 -> our typical queries usually filter by geography and time.
1924.38 -> So by using a new feature of Amazon Athena back then called
1927.5 -> partition projection,
1929.3 -> we were able to add a geospatial partition,
1931.96 -> which uses tens of thousands of partitions
1935.45 -> per day, allowing us to get amazing optimization
1937.97 -> by reducing the amount of data scanned.
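The effect of the geospatial plus time partitioning can be pictured with a pruning sketch: if data lives under one S3 prefix per cell and day, a filtered query only touches the matching prefixes instead of the whole table. The prefix layout, bucket, and cell ids below are illustrative assumptions.

```python
# Sketch: enumerate only the partition prefixes a filtered query would scan
# when the table is partitioned by a geohash-like cell id and a date.
from datetime import date, timedelta


def partition_prefixes(cell_ids, start, end, base="s3://my-lake/rsd/"):
    """List the partition prefixes covered by a cell + date-range filter."""
    prefixes = []
    day = start
    while day <= end:
        for cell in cell_ids:
            prefixes.append(f"{base}cell={cell}/dt={day.isoformat()}/")
        day += timedelta(days=1)
    return prefixes


scanned = partition_prefixes(["u4pr", "u4px"],
                             date(2022, 11, 1), date(2022, 11, 3))
print(len(scanned))  # 2 cells x 3 days = 6 prefixes
```

With Athena's partition projection, partition values like these are computed from configuration rather than stored in the catalog, which is what makes tens of thousands of partitions per day manageable.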
1941.27 -> In this optimization, we reduced the query time by 90%,
1945.59 -> we also reduced the cost by 90%, and it got us to zero
1949.49 -> concurrency issues, and we can now run hundreds of
1953.39 -> concurrent queries instead of dozens.
1955.7 -> And we have zero bottlenecks
1956.96 -> between different map activities,
1959.24 -> which was a very big pain point back then. And this is it.
1963.74 -> So we are really happy to work with Athena.
1966.65 -> And getting back to you, Scott.
1974.74 -> - All right, thanks Ofer.
1979.16 -> Really fascinating use case.
1981.2 -> Every time I see the zoom out I'm just kind of like,
1983.31 -> it's kind of mind blowing.
1985.61 -> So yeah,
1986.443 -> really impressive to see the work you
1987.276 -> and the team have put into Athena.
1988.37 -> So, and also the impact that it's had at Mobileye,
1990.06 -> and being able to scale that really cool technology to,
1993.24 -> you know, many,
1994.25 -> many machines and devices out there in the real world.
1997.4 -> And that's something that I find really cool about this
1999.2 -> use case, you know,
2000.37 -> having been in analytics for a long time,
2002.35 -> it's kind of rare that we find these use cases where
2004.42 -> analytics goes full circle and ends up out there in the real
2008.26 -> world powering sort of the
2009.88 -> decisions we make on a daily basis. So really cool.
2013.24 -> Switching gears now,
2014.073 -> we'll talk about how Athena is helping you bring all of your
2018.34 -> data together to provide you analytics on all those sources.
2022.39 -> So when talking with customers about that topic,
2025.27 -> we often describe our thinking in terms of the modern data
2028.6 -> architecture for AWS and what it should do for you.
2033.1 -> A modern data strategy, as you heard this morning in Swami's
2036.61 -> keynote and throughout
2038.38 -> the AWS re:Invent sessions this week,
2041.8 -> the modern data strategy should give you sort of the best of
2044.77 -> both data lakes and purpose-built data stores. Again,
2048.28 -> thinking about all of the best in class database products
2051.55 -> that AWS supports.
2053.56 -> Modern data strategy should enable you to store any amount
2056.14 -> of data at low cost using open,
2060.4 -> standards-based formats.
2062.98 -> Modern data architecture should allow
2064.51 -> you to break down data silos,
2066.43 -> empowering teams to run analytics and machine learning
2069.64 -> workloads using their preferred tools and giving you the
2073.57 -> capability to manage who has access to the data with the
2077.11 -> proper security and data governance controls in mind.
2081.22 -> Data lakes are often a great starting point because
2083.74 -> they provide flexible storage of data with high durability
2087.22 -> and low cost. And by storing data in open formats, you can
2091.54 -> decouple storage from compute and that makes it easy when
2094.87 -> the time comes to analyze data by allowing you to choose the
2099.52 -> right tool for the job and bring a variety or choose from a
2103.27 -> variety of machine learning and analytics platforms and
2106.51 -> products supported by AWS.
2109.06 -> One of the benefits of data lakes is the flexibility to
2112 -> embrace new formats and paradigms for analyzing data.
2116.26 -> A recent shift in data lakes has been the emergence of table
2119.38 -> formats. Table formats if you're not familiar,
2122.38 -> are gaining traction
2124.48 -> mostly because they're really easy to understand.
2126.67 -> They allow interaction with data lakes through familiar
2129.88 -> database-like constructs and semantics that allow us to
2133.69 -> abstract data from where it came and bring data into a
2138.16 -> singular data set represented intuitively as a table.
2142.87 -> One of the areas where Athena is leading the way is on its
2145.9 -> support of table formats.
2148.66 -> And last year at re:Invent you may recall our announcement on
2151.48 -> Athena's support of one of those table formats,
2153.68 -> which is Apache Iceberg.
2156.27 -> Apache Iceberg is an open table format designed for
2160.6 -> very large analytic data sets.
2162.91 -> It has many properties making it a great solution
2165.46 -> for data lakes. For example,
2167.68 -> Iceberg supports writing to data stored on S3 and that is
2172 -> something that many customers need as part of the
2174.19 -> operational activities that
2175.6 -> support their analytics programs.
2177.88 -> Iceberg also supports schema evolution,
2180.88 -> giving you add column, drop column, and rename column
2184 -> semantics that look very similar to running a database.
2189.01 -> So those are very familiar to folks who are used to
2191.77 -> those paradigms.
2193.33 -> So we've not stopped innovating on Iceberg since last year,
2196.69 -> and in 2022 we've pushed the boundaries even further on what
2200.14 -> you can do with Iceberg and Amazon Athena.
2203.26 -> So to kind of recap some of those updates,
2206.05 -> the first of which is create table as select support for
2209.08 -> Apache Iceberg.
2210.52 -> So with CTAS you get an easy and fast way to create
2214.64 -> new Iceberg tables from the results of another select query.
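A CTAS statement for an Iceberg table might be assembled like the sketch below. The table and location names are made up, and the exact set of table properties may vary, so treat this as a hedged illustration rather than a definitive template.

```python
# Sketch: assemble a CREATE TABLE AS SELECT (CTAS) statement that creates
# a new Iceberg table in Athena from the results of a SELECT query.
def iceberg_ctas(new_table, select_sql, location):
    return (
        f"CREATE TABLE {new_table} "
        f"WITH (table_type = 'ICEBERG', location = '{location}', "
        f"is_external = false) AS {select_sql}"
    )


sql = iceberg_ctas(
    "analytics.daily_events",
    "SELECT event_id, event_date FROM raw.events "
    "WHERE event_date >= DATE '2022-01-01'",
    "s3://my-lake/iceberg/daily_events/",
)
print(sql)
```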
2219.4 -> We've also added view support so you can now hide complex
2223.39 -> joins and other business logic, surfacing simpler-to-query
2228.01 -> analytic data sets
2230.17 -> for users to run SQL queries on.
2233.02 -> And we've also worked to optimize SQL queries running on
2237.28 -> Iceberg with engine version three, and we're happy to report
2240.1 -> that queries on Iceberg using engine version three are
2244.48 -> running up to 10 times faster.
2246.58 -> And that's really exciting for those of us who are doubling
2249.19 -> down on the Apache Iceberg format.
2252 -> We've also extended ACID transactions in Athena so you can
2256.06 -> now use Iceberg's merge operator to synchronize your tables
2259.57 -> as they're modified by
2260.59 -> other processes and business users.
2263.86 -> So that's gonna make it a lot easier and efficient to keep
2266.29 -> your Iceberg tables up to date.
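Conceptually, MERGE gives you upsert semantics: update rows that match on a key, insert rows that don't. This small Python sketch simulates that behavior on plain dictionaries; it is a model of the idea, not Athena or Iceberg itself.

```python
# Sketch: simulate MERGE-style upsert semantics on lists of row dicts,
# matching rows on a primary key column.
def merge(target, updates, key="id"):
    """Upsert rows from `updates` into `target`, matching on `key`."""
    merged = {row[key]: dict(row) for row in target}
    for row in updates:
        merged.setdefault(row[key], {}).update(row)
    return sorted(merged.values(), key=lambda r: r[key])


table = [{"id": 1, "status": "old"}, {"id": 2, "status": "old"}]
changes = [{"id": 2, "status": "new"}, {"id": 3, "status": "new"}]
result = merge(table, changes)
print(result)
```

In Athena, the same outcome is expressed declaratively with a MERGE INTO statement against the Iceberg table, with matched rows updated and unmatched rows inserted.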
2268.96 -> Now if you want to delete records to meet regulatory
2271.87 -> requirements like GDPR or to manage your storage footprint,
2275.78 -> you can now use the vacuum operator to do just that.
2279.9 -> And last but not least,
2281.08 -> we've also added support for Avro and ORC, giving you
2285.34 -> more flexibility to choose the format that works best for
2288.67 -> your use case and allowing you
2290.29 -> to bring that to Iceberg as well.
2292.87 -> So when scaling data lakes,
2294.28 -> it's really important to take into account not only the ease
2298 -> of use and flexibility benefits that we're describing here,
2300.97 -> but also the security and data governance needs that others
2305.23 -> in your organization most likely have.
2308.11 -> So we have some really great news on that front as well.
2310.66 -> And the news is that we've expanded our support through AWS
2314.5 -> Lake Formation to include all file and table formats
2318.16 -> currently supported by Athena.
2320.38 -> If you're not familiar with Lake Formation,
2322.95 -> Lake Formation allows you to essentially define column-, row-,
2327.04 -> and table-level data governance policies,
2329.6 -> which are respected when queried by engines
2332.02 -> like Athena and EMR.
2334.96 -> So users are only able to access
2336.94 -> the data that they're entitled to.
2339.4 -> So with this launch you can now define all of those fine
2342.28 -> grained data access and governance controls using Lake
2345.97 -> Formation, and have those work on any file or table format that
2350.2 -> Athena supports today.
2352.9 -> Best yet, all of the filtering logic that
2356.83 -> typically happens when these
2359.59 -> governance policies are applied
2361.06 -> during a user's query execution
2362.8 -> is now implemented natively in Athena's engine.
2365.95 -> So you're getting more optimized performance when users are
2368.86 -> querying their data lake files when Lake Formation
2373.3 -> policies are applied to them.
2376.91 -> Cool. So we're gonna revisit the
2378.88 -> modern data architecture slide
2380.35 -> for a moment, as there's an important part of the story that
2382.51 -> we wanted to kind of build on. And that's actually the
2385.54 -> prevalence of data sitting adjacent to the data lake.
2388.87 -> Oftentimes in databases,
2390.55 -> warehouses or other object stores often running in AWS but
2395.23 -> sometimes on-prem or potentially
2396.88 -> even in another cloud provider.
2399.28 -> And oftentimes analysts,
2400.71 -> data engineers and other users need access to that data just
2404.08 -> as they do their data lake.
2405.66 -> But too often those users are having to deal with the
2408.07 -> friction and frustration of having to learn new languages or
2411.4 -> build pipelines that extract that data and bring it
2415 -> somewhere else where they can then analyze it.
2418.48 -> Athena addresses this problem with
2420.13 -> what we call federated query.
2422.2 -> Federated query allows you to run SQL queries on
2425.65 -> data stored in relational, non-relational, object, and even
2429.43 -> custom data sources.
2431.23 -> Analysts can run federated queries using the same ANSI SQL
2435.49 -> syntax that we support for data lake queries, and use that
2439.45 -> same language or a single query to join data spanning their
2443.44 -> federated sources with their data lake in a single query.
2447.07 -> With federated query you query data where it lives so
2450.73 -> there's no data movement. However,
2452.8 -> you can also use it to ingest external data into your data
2456.31 -> lake and use that to drive business intelligence and other
2459.31 -> use cases from your data lake without having to query all
2462.58 -> the way down to your underlying database each time.
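A federated query of the kind described might be assembled like this sketch: one SQL statement that joins a data lake table with a table behind a connector. The catalog name `postgres_catalog` and the table names are illustrative assumptions about a deployment, not fixed values.

```python
# Sketch: build a single Athena SQL statement that joins a data lake table
# (default awsdatacatalog) with a federated source behind a connector.
def federated_join(lake_table, federated_table, join_key):
    return (
        f"SELECT l.*, f.* FROM awsdatacatalog.{lake_table} l "
        f"JOIN postgres_catalog.{federated_table} f "
        f"ON l.{join_key} = f.{join_key}"
    )


sql = federated_join("sales.orders", "public.customers", "customer_id")
print(sql)
```

The point is that the analyst writes ordinary ANSI SQL; Athena's connector handles reaching into the external database at query time, so no pipeline is needed to move the data first.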
2467.62 -> We have over 25 of these connectors available today and
2471.4 -> earlier this year we released a bunch of new ones spanning
2474.43 -> cloud object stores, relational databases and more.
2477.82 -> And in Athena these connectors are
2479.32 -> really easy to set up and use.
2481.45 -> Starting with our console you can click data sources and
2484.69 -> you'll see a list of all the sources that we support.
2487.21 -> And after selecting a source,
2488.83 -> you can follow our guided workflow to plug in the values
2491.12 -> that help you get connected.
2493.5 -> Our connectors work as applications
2495.52 -> running on AWS Lambda and
2497.86 -> with that comes support for
2499.51 -> cross-account access and IAM policies.
2502.63 -> That makes it easy for one person in your
2504.82 -> organization to set up a connection and then grant access
2508.72 -> to other teammates
2509.673 -> so that they can then query that source
2511.75 -> using their own AWS account.
2514.22 -> All of our connectors are built on our open source SDK and
2517.857 -> all of our code is out there on GitHub,
2520.39 -> so we hope you can take a look at that and use it as
2523.03 -> boilerplate code for any custom connectors that you're
2525.76 -> thinking about developing as well.
2528.47 -> This year was really big for Athena on that front and the
2531.76 -> data sources that we support altogether, as I mentioned,
2534.79 -> there's over 25 connectors available to some of the
2538.78 -> most widely used databases and storage platforms on the
2541.78 -> market today.
2543.07 -> Often that spans not only AWS sources
2545.71 -> but third party ones as well.
2548.36 -> And so another thing we want to introduce here is
2550.98 -> the fact that many organizations today are using software as
2555.01 -> a service applications to help drive their businesses
2557.71 -> in specific functions or use cases.
2561.86 -> Unfortunately many of those SaaS data providers don't give
2565.66 -> you direct access to the underlying databases,
2568.88 -> which is a problem when you need access to that data to
2571.8 -> understand how your business is operating.
2575.44 -> One of our partner services is Amazon AppFlow.
2578.95 -> Amazon AppFlow is a fully managed integration service that
2582.49 -> enables you to securely transfer data between SaaS
2585.82 -> applications like Salesforce, SAP, Zendesk, Slack,
2589.92 -> and a bunch more, and bring that data to AWS, ingesting it
2594.34 -> into services like Redshift and S3 where you can then use it for
2598.18 -> a variety of use cases.
2600.25 -> AppFlow supports over 50 SaaS sources like the ones shown
2603.34 -> here on the slide that help you ingest all that data and
2606.19 -> bring it to S3 for, again, a variety of use cases.
2609.78 -> The big news from AppFlow at this re:Invent is their
2613.45 -> recently announced support of the AWS Glue Data Catalog for data
2617.8 -> flows between SaaS sources and S3.
2620.63 -> What you can do now is basically select a SaaS source,
2623.8 -> build a data flow and register that flow with AWS Glue.
2628.27 -> Once the data is registered with AWS Glue,
2630.4 -> you can run queries on it using Athena and a host of
2633.49 -> additional AWS analytics services as well.
2636.58 -> The AppFlow team went a step further and added a really
2639.58 -> cool feature,
2640.413 -> which is essentially partition setup as part of the flow
2643.72 -> design workflow.
2645.1 -> And what that lets you do is, as you're building your flows,
2648.58 -> select the fields using a simple GUI that allows you to
2653.83 -> choose which fields in the response from those sources
2657.61 -> are good candidates for partitions to use in the data model
2661.33 -> once you're setting it up on S3.
2664.27 -> What that means is AppFlow takes that partition input into
2667.54 -> account when ingesting the data and automatically writes
2670.09 -> data to those partitions,
2671.32 -> which means Athena queries running on those sources are
2674.68 -> really fast.
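The partition-aware ingestion described here comes down to the chosen fields becoming directory components of the S3 key, so Athena can prune partitions when querying. The bucket prefix and field names below are illustrative assumptions.

```python
# Sketch: derive a partitioned S3 object key from a record's chosen
# partition fields, the layout a partition-aware flow would write.
def partitioned_key(record, partition_fields, base="raw/zendesk/"):
    parts = "/".join(f"{f}={record[f]}" for f in partition_fields)
    return f"{base}{parts}/record-{record['id']}.json"


rec = {"id": 7, "region": "eu", "created_date": "2022-11-29"}
key = partitioned_key(rec, ["region", "created_date"])
print(key)
```

A query filtering on `region` and `created_date` then only reads the matching prefixes, which is why those queries are fast.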
2678.43 -> So we covered a lot of ground today.
2680.59 -> We figured we should kind of recap some of those things
2683.23 -> before we wrap up the session.
2685.27 -> So as you know, Athena is easy to use,
2687.24 -> giving you instant startup for SQL and now Apache Spark
2693.04 -> applications as well.
2694.27 -> That's a brand new experience in Athena and we're really
2696.58 -> excited to see what you can do with that as well as what
2699.04 -> feedback you have for us on where you want us to take that.
2701.95 -> So we encourage you to check out ANT209,
2704.68 -> which is happening tomorrow to go really deep on that topic
2707.803 -> and learn more.
2709.65 -> We also covered SQL engine version three and how it helps
2713.11 -> you with the expanded analytics
2716.29 -> functionality that it provides, as well as faster queries.
2719.58 -> And again, on top of engine version three,
2721.57 -> we have query result reuse,
2722.92 -> which is available today, giving you faster queries
2727.33 -> at lower cost. Caching is
2729.56 -> a concept we're gonna be investing heavily in.
2732.22 -> So I hope you can keep your eyes and ears open for
2733.99 -> additional announcements on that front in the weeks and
2736.75 -> quarters ahead.
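The query result reuse idea mentioned here is essentially a cache keyed by the query text with a maximum result age: a repeated query is served from its recent result instead of rescanning data. This Python sketch models only that concept; the class, timing values, and interface are illustrative, not Athena's implementation.

```python
# Sketch: a tiny result cache that serves a repeated query from its cached
# result while the result is younger than a configured maximum age.
import time


class ResultCache:
    def __init__(self, max_age_seconds=60):
        self.max_age = max_age_seconds
        self._cache = {}

    def run(self, query, execute):
        """Return (result, served_from_cache) for the given query text."""
        hit = self._cache.get(query)
        if hit and time.time() - hit[0] <= self.max_age:
            return hit[1], True
        result = execute(query)
        self._cache[query] = (time.time(), result)
        return result, False


cache = ResultCache(max_age_seconds=60)
first, cached1 = cache.run("SELECT 1", lambda q: [1])
second, cached2 = cache.run("SELECT 1", lambda q: [1])
print(cached1, cached2)  # False True
```

The second identical query within the age window skips execution entirely, which is where the speed and cost savings come from.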
2738.67 -> We also touched on expanded support for Apache Iceberg and
2742.39 -> how you can now bring transactionality to your data lake and
2745.15 -> analytics workflows as well.
2746.73 -> So we're really excited to see what comes next on that.
2750.34 -> We also
2752.17 -> touched on expanded support for row-, column-, and table-level
2756.25 -> security controls powered by Lake Formation and how we can
2758.65 -> now apply those policies to any table or file format that
2762.07 -> Athena supports.
2763.46 -> And last but not least,
2764.65 -> as you build analytics around your data lake,
2767.33 -> we encourage you to consider Athena's data source connectors
2770.82 -> and other services like Amazon AppFlow to help you bring
2773.98 -> all that data together.
2777.16 -> Wanted to thank you all for your time today and thank our
2780.46 -> guest speaker Ofer for the insights and inspiration on
2783.24 -> their use case.
2784.38 -> And hope you got a great sense for the broad set of features
2787.03 -> that we've been rolling out this year and are excited to get
2790 -> back home and try 'em all out. So thank you.

Source: https://www.youtube.com/watch?v=vhO8Qst5Vhc