AWS re:Invent 2022 - What’s new in Amazon Athena (ANT208)

Amazon Athena is a highly scalable analytics service that makes it easy to analyze all your data across Amazon S3, in on-premises stores, and on other cloud platforms. Amazon Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. This session offers a deep dive into the service, customer use cases, best practices, newly launched features, and what’s next for Amazon Athena.

Learn more about AWS re:Invent at https://go.aws/3ikK4dD.

Subscribe:
More AWS videos http://bit.ly/2O3zS75
More AWS events videos http://bit.ly/316g9t4

ABOUT AWS
Amazon Web Services (AWS) hosts events, both online and in-person, bringing the cloud computing community together to connect, collaborate, and learn from AWS experts.

AWS is the world’s most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster.

#reInvent2022 #AWSreInvent2022 #AWSEvents


Content

0 -> - Hello everybody, great to see you all here.
1.65 -> re:Invent 2022,
3.99 -> hope you're having a great time so far at the session.
6.3 -> We know you've got the late night session tonight,
8.55 -> so appreciate you coming out to spend some time with us.
11.88 -> This is what's new in Amazon Athena.
14.49 -> My name is Scott Rigney and I'm one of the product managers
16.83 -> on the Athena team.
18.28 -> This session has a bunch of great announcements and updates
21.57 -> on features that myself and my colleagues have been
24.6 -> hard at work building for you over the course of this year.
27.3 -> And we're joined by a great guest speaker.
30.24 -> In a little bit,
31.073 -> we'll hear from Ofer Eliassaf
32.82 -> who joins us today from Mobileye.
34.24 -> Ofer is gonna tell us about how he and his colleagues are
38.64 -> using Athena on a really cool use case.
42.18 -> So before we get into it,
43.56 -> thanks again for joining today's session.
45.93 -> Hopefully we've got a lot of great updates for you.
49.32 -> So today we're gonna go on a bit of a journey through Athena
52.68 -> and data lakes.
54.09 -> We'll start with the core of Athena,
55.95 -> which is our engine and we want to start there because we
58.59 -> have a lot of great announcements to share with you about
61.29 -> all of that.
62.7 -> Next we'll hear from Mobileye on how they're using Athena as
66.57 -> part of their machine vision systems
68.4 -> for autonomous vehicles.
69.83 -> It's a really fascinating use case and we're really happy
72.75 -> that Ofer could join us here in person and tell us more.
76.02 -> And last but not least,
77.28 -> we'll cover some of the announcements and updates on how you
79.86 -> can bring Athena to your data.
81.93 -> Apply it to all of your data sources, you know,
84.15 -> spanning data lakes, external sources and more.
89.07 -> But before we do that,
89.903 -> let's do a quick poll by show of hands,
92.16 -> is anybody here new to Athena?
96.3 -> All right, couple of you out there.
97.71 -> So thanks for joining.
99.38 -> For those of us who are new,
100.83 -> we thought it'd be good to start out with a little bit of a
102.93 -> refresher on Athena.
105.09 -> So let's start off with that.
107.22 -> Athena is an interactive query service that's designed to
111.33 -> make it easy to query and analyze data in your data lakes.
115.2 -> Athena is serverless,
116.43 -> which means there is no infrastructure to set up and manage,
119.3 -> and you pay only for the queries that you run.
122.35 -> What customers love about Athena is
124.8 -> how easy it is to get started. The product
127.71 -> being serverless and having no infrastructure to set up
131.07 -> means you can simply
132.99 -> bring Athena to your data and start analyzing it.
136.03 -> Athena is interactive and built for speed.
139.68 -> Everything we do in the product is designed to help you
142.08 -> answer business questions quickly and do that in a cost
145.53 -> effective manner.
147.03 -> Athena is built on open standards and that starts with our
150.3 -> processing engine,
151.23 -> which is based on open source technology like Presto.
154.44 -> And with that comes support of multiple data formats like
157.83 -> Parquet and Apache Iceberg.
159.63 -> So you can get started with the data that you have today.
163.23 -> Athena is also cost effective.
165.06 -> You pay only for the queries that you run and you can save
167.82 -> up to 90% using compression,
170.01 -> partitioning and converting your data into
172.53 -> optimized column formats.
176.25 -> We launched a few years ago here at re:Invent 2016 and today
179.79 -> we have thousands of customers spanning industries, of
184.26 -> all sizes, ranging from startups to large enterprises, who
188.79 -> have chosen Athena for a variety of use cases including
191.64 -> securities analysis in financial services,
194.02 -> information security where Athena's used to provide rapid
198.45 -> response to and insights on security events.
202.8 -> As well as analytics in highly regulated industries,
205.44 -> especially those dealing with
206.76 -> sensitive data like healthcare.
210.15 -> So let's go a little bit deeper to understand how customers
212.97 -> are using Athena today with some of the common patterns and
215.64 -> use cases that we see.
216.88 -> First off should be a no brainer
218.79 -> and that's interactive analytics.
220.503 -> And with this, what it means is analysts,
223.5 -> data scientists and data engineers send SQL queries to
226.83 -> Athena and often get responses back in seconds.
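That interactive flow can be sketched programmatically. The snippet below only assembles the parameters you would hand to Athena's StartQueryExecution API (for example via an SDK such as boto3); the workgroup name, S3 output location, and the flights table are placeholders, not anything from the session.

```python
def build_query_request(sql, workgroup="primary",
                        output_location="s3://my-athena-results/"):
    """Assemble StartQueryExecution parameters for the Athena API.

    The workgroup and S3 output location are placeholders; with boto3
    you would call athena.start_query_execution(**params), then poll
    get_query_execution until the state reaches SUCCEEDED and fetch
    rows with get_query_results.
    """
    return {
        "QueryString": sql,
        "WorkGroup": workgroup,
        "ResultConfiguration": {"OutputLocation": output_location},
    }

params = build_query_request(
    "SELECT carrier, COUNT(*) AS cancellations "
    "FROM flights WHERE cancelled = 1 GROUP BY carrier"
)
```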
230.05 -> Another is business intelligence.
232.38 -> It's a very common and popular one for Athena and with our
235.38 -> drivers and SDK you can plug Athena into your preferred
239.46 -> business intelligence applications or SQL IDEs like
243.39 -> Power BI and Tableau.
246.15 -> Data workflows are an interesting area for Athena.
249.33 -> Here what customers do is build
250.98 -> what we like to call self-service
252.23 -> data workflows, using SQL that makes data available to other
255.96 -> applications, teammates or processes.
259.74 -> For example,
260.573 -> our integration with step functions
261.93 -> which we released last year,
264.06 -> gives you a really easy-to-use, no-code experience for
267.24 -> building these drag and drop data workflows.
270.63 -> Many of our customers use Athena for a query layer on custom
275.58 -> multi-tenant user facing applications often coming with a
278.76 -> custom UI.
280.14 -> And last but not least, machine learning.
282.75 -> Machine learning algorithms, as you know,
284.94 -> tend to benefit from diverse input data, and to
289.17 -> bring all that diverse data together,
290.83 -> what data scientists often do today is build ETL jobs that
295.47 -> move raw data from one source to another one so they could
298.95 -> ultimately join it together before feeding it into systems
301.89 -> like Amazon SageMaker to run machine learning training and
305.1 -> inference workloads.
306.19 -> Athena provides close to 30 data sources so you can
310.26 -> use Athena as a SQL layer to pull all that data together
313.74 -> to provide a kind of common experience for creating those
316.95 -> base tables for machine learning workflows.
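As a hedged sketch of that pattern: once an external source has been registered through an Athena data source connector, a single SQL statement can join it with a data lake table to build an ML base table. Every catalog, database, and table name below is hypothetical.

```python
# Hypothetical names throughout: "mysql_orders" stands for a federated
# catalog registered via an Athena data source connector, and
# "lake"."customers" for an S3-backed table in the Glue data catalog.
ML_BASE_TABLE_SQL = """
SELECT c.customer_id,
       c.segment,
       o.order_total,
       o.order_date
FROM "mysql_orders"."sales"."orders" AS o
JOIN "AwsDataCatalog"."lake"."customers" AS c
  ON o.customer_id = c.customer_id
"""

def columns_selected(sql):
    """Tiny helper: list the column names pulled into the base table."""
    select_clause = sql.split("FROM")[0].replace("SELECT", "")
    return [part.strip().split(".")[-1] for part in select_clause.split(",")]
```

The resulting statement is what you would submit to Athena; the helper just inspects which feature columns the base table will carry.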
319.63 -> What all of these use cases have in common is that they're
322.74 -> based on SQL.
324.48 -> SQL is great,
325.65 -> it's often one of the first things that analysts learn and
329.01 -> it's widely understood across both
330.66 -> business and technical domains.
336.18 -> One of the challenges with SQL is that many describe it as
340.08 -> being not expressive enough to answer some of the more
342.75 -> complex business questions that often come up.
346.32 -> To get around that,
347.52 -> a lot of folks turn to open source frameworks like
350.67 -> Python and Spark to quickly find
353.76 -> insights in that data.
354.7 -> But what they quickly discover in scaling those products and
358.14 -> frameworks is that it's really hard
359.82 -> to do that in enterprise settings.
362.85 -> Not only is it hard to get those frameworks to work but it
366.06 -> also requires heavy investment upfront to optimize your
370.35 -> infrastructure so that it is right sized to serve your
373.71 -> business needs while not breaking the bank.
376.86 -> And all of those are reasons why we were really excited
379.44 -> earlier today in Swami's keynote to unveil a brand new
382.98 -> capability in Athena,
384.03 -> which is Apache Spark. With Amazon Athena for Apache Spark,
388.4 -> you can run interactive analytics quicker than
391.74 -> ever before, and with that you get the ease of use and speed
396.27 -> that you've come to expect out of Athena.
399.07 -> With this brand new experience you can start interactive Spark
402.33 -> applications in just under one second,
404.55 -> which is pretty amazing.
406.8 -> The product uses our AWS-optimized Spark runtime,
410.4 -> which is up to three times faster than open source Spark,
413.36 -> which means you spend a lot less time waiting to provision
416.64 -> clusters and wait for them to come online and produce the
419.85 -> insights that you're looking for, and get to spend more time
423.12 -> discovering insights from your data.
427.89 -> To allow you to run those workloads on Athena,
431.7 -> we've added a new experience in our console, it's a notebook
435.84 -> experience, and that allows you to write Python code, run
439.35 -> calculations in Spark, visualize your data, and a lot more.
443.35 -> This is a really exciting new capability for Athena and
446.76 -> there's a ton of awesome capability there, unfortunately,
449.56 -> too much to go into in detail in
452.7 -> today's session.
454.02 -> So we recommend that you check out ANT209,
457.2 -> which is a deep dive on the Spark announcement and that's
460.77 -> happening tomorrow.
461.82 -> So be sure to check the session guide and sign up for that.
466.23 -> Now if you're sitting there wondering how does this impact
469.14 -> the SQL part of Athena that I probably
470.97 -> came to learn about today?
472.27 -> Well don't worry because we've got a lot of really great
474.9 -> news for you on that front,
476.52 -> and that all starts with our SQL engine,
478.8 -> the latest version of which is version three that went GA
481.89 -> just a handful of weeks ago.
484.26 -> And as you've come to expect from Athena engine releases,
487.26 -> V3 is providing faster queries
489.48 -> and is more efficient at scans,
491.43 -> which means you get better performance and lower cost for
494.31 -> your workloads.
496 -> One of the interesting things with this release is that
498.06 -> we've rebuilt the way in which we pull components
501.18 -> from open source.
503.19 -> What that means is version three
504.6 -> will provide you more current,
506.97 -> more numerous and more interesting features coming from the
509.7 -> open source community with greater regularity.
513.06 -> The other big piece of news is that engine version three
515.72 -> also incorporates Trino.
518.22 -> Trino, if you're not familiar, is a fork of Presto DB,
521.4 -> which Athena engine version two is based on.
526.29 -> It provides a lot of similar functionality, but there are some key
528.18 -> differences across the products, and
531.39 -> customers are asking to leverage some of the capabilities
534.51 -> from the Trino variant of Presto DB.
537.24 -> So this version of Athena includes both Trino and Presto and
541.53 -> comes bundled with optimizations built by our team to scale
545.76 -> those frameworks for use in AWS.
549.9 -> So let's take a look at some of the benefits that are coming
552.18 -> with version three. Right outta the gate you get 50 new
555.75 -> functions and that's gonna help you expand the types of
558.96 -> analytics you can apply to your data.
561.93 -> We've made over 90 enhancements to existing functions,
565.95 -> spanning query execution,
567.27 -> memory usage and how we process data all to give you queries
572.01 -> that run faster.
573.9 -> What that means in our benchmarks is we're seeing about a
575.88 -> 20% performance speed up compared to version two and some
579.69 -> queries seeing up to 10 times faster execution in specific
584.04 -> stages of query execution like planning.
585.87 -> So really awesome benefits kind of in a performance domain.
588.99 -> Best yet you get all of that for the same price and with
592.12 -> 100% functional parity with the version two API, and that
595.62 -> should make it very easy to upgrade to version three today.
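If you manage workgroups through the API rather than the console, the upgrade can be expressed as an UpdateWorkGroup call. The sketch below only builds the request parameters; the field names follow our reading of the Athena API, and the workgroup name is a placeholder.

```python
def engine_upgrade_params(workgroup="primary"):
    """Parameters for Athena's UpdateWorkGroup API selecting engine v3.

    With boto3 you would pass these as
    athena.update_work_group(**engine_upgrade_params(...)); Athena
    reports the resulting EffectiveEngineVersion on its side.
    """
    return {
        "WorkGroup": workgroup,
        "ConfigurationUpdates": {
            "EngineVersion": {
                "SelectedEngineVersion": "Athena engine version 3",
            },
        },
    }

params = engine_upgrade_params("analytics-team")
```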
600.12 -> As I mentioned,
600.953 -> we rolled out V3 a few weeks ago and are happy to report
603.51 -> that customers are seeing some of the benefits in their
606.63 -> applications.
607.95 -> Orca Security is one such customer who is using Athena
611.01 -> within their machine learning powered security event
614.4 -> detection product. Arie Teter who leads R&D at Orca
618.15 -> Security shared this quote with us and Arie's reported that
621.78 -> Orca's already feeling the scale and performance benefits
624.51 -> that come with version three and are starting to tap into
627.66 -> that expanded feature set that comes with the Trino
631.08 -> variant of the SQL engine.
634.95 -> So to speak more on that, as I mentioned,
637.35 -> one of the highlights is the expanded set of analytics
640.53 -> capabilities that come with version three.
643.56 -> For example, there are a handful of new
645.15 -> aggregation functions like T-Digest, which
647.91 -> allows you to run approximate rank-based statistics with
651.99 -> high accuracy.
653.09 -> Another one if you're doing geospatial analytics are new and
656.91 -> improved geospatial functions that help you bring location
660.36 -> based insights to your analytics programs.
663.99 -> And also available are new operators like MATCH_RECOGNIZE,
666.96 -> which bring more performant pattern matching to use cases
669.87 -> like fraud detection and sensor data analysis.
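To make the pattern-matching point concrete, here is a hedged sketch of a MATCH_RECOGNIZE query in Trino-style syntax, held as a Python string. The sensor_readings table and its columns are hypothetical, so treat this as illustrative rather than a tested query.

```python
# Hypothetical table: sensor_readings(device_id, ts, temperature).
# Finds, per device, runs of strictly rising temperature readings.
RISING_TEMP_SQL = """
SELECT device_id, start_ts, peak_temp
FROM sensor_readings
MATCH_RECOGNIZE (
    PARTITION BY device_id
    ORDER BY ts
    MEASURES
        FIRST(A.ts) AS start_ts,
        LAST(UP.temperature) AS peak_temp
    ONE ROW PER MATCH
    PATTERN (A UP+)
    DEFINE UP AS temperature > PREV(temperature)
)
"""
```

The DEFINE clause gives the pattern variable its meaning (each UP row is hotter than the previous row), and PATTERN describes the row sequence to match, which is what makes this more performant than self-join approaches to the same question.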
672.82 -> Now there's a special one at the top left of the screen
675.36 -> which we wanted to dive into a little bit
678.39 -> further, cause that's a brand new one for Athena, and that
681.06 -> is query result caching.
683.64 -> Before we get into that, some background.
686.28 -> Many of the customers we talk with often describe using
689.13 -> Athena in what we'll call multi-user applications.
692.82 -> These are often applications like business intelligence
694.92 -> where multiple users are accessing Athena from the context
698.34 -> of another application.
700.08 -> Now in this model,
701.67 -> users send queries to Athena using those applications.
705.15 -> Athena runs the queries and returns results via our API and
708.33 -> drivers, and that's sort of the standard flow of
712.53 -> running queries on Athena.
714.3 -> The challenge with this model is that as the number of
718.62 -> users expands,
719.48 -> what we tend to see is users sending very similar or
722.73 -> oftentimes identical queries to Athena.
726.33 -> With that comes added latency,
727.64 -> which can increase the time to insight on your data, and
733.77 -> those repeat query executions can drive your costs higher.
737.94 -> So that's why we were excited to release just a few weeks
740.04 -> ago a caching feature for Athena
743.1 -> which we call query result reuse.
745.68 -> And that was a release just a few weeks ago as I mentioned.
748.35 -> With query result reuse,
749.73 -> Athena automatically accelerates queries by returning the
753.78 -> cache results of previous executions
756.57 -> and when enabled queries
757.83 -> using this query result reuse feature run up to five times
761.1 -> faster and don't scan any data.
763.71 -> So we're getting a lot of opportunity for performance and
765.9 -> cost savings benefits out of that one feature.
768.73 -> So to give you a sense for how that works and what it looks
771.12 -> like, we'll take a look at a sample query here.
773.43 -> Here we're doing a simple count of canceled flights grouped
776.85 -> by day of the week, and in this first run, highlighted in the
780.09 -> orange box, what you can see is that our query took about
783.15 -> three seconds to process and scanned around 50 megabytes
786.45 -> of data.
787.53 -> So to turn on query result reuse,
789.36 -> we just toggle the slider switch there and then we can rerun
793.53 -> our query, and what we see is that Athena is applying the
797.01 -> cached executions to this query, returning the result in
800.46 -> around 250 milliseconds.
802.08 -> So huge speed up in terms of latency that users are gonna
805.95 -> feel the benefits of, and best yet you see no data scanned
808.92 -> with that query.
810.93 -> So to make it easy to see when
813.21 -> caching is happening in the system,
814.86 -> we've added a few new views to the product including the
817.35 -> query history which is shown here, and that's giving you a
821.22 -> better sense, when inspecting your query history, of how
824.91 -> frequently cached results are being returned.
827.37 -> So that should make it easy to quickly diagnose and see
830.46 -> what's happening in the system.
832.17 -> So stepping back query result reuse is available today and
836.25 -> it's built for engine version three.
838.56 -> It's easy to use requiring no changes
840.66 -> in SQL queries to get started.
843 -> It's automated meaning Athena automatically identifies the
845.76 -> queries that can be accelerated and automatically returns
848.91 -> the cached result for those queries without scanning any data.
851.94 -> And then thinking back on those
853.59 -> multi-user applications or contexts,
856.17 -> it's really easy to bring all of these benefits to those
858.06 -> users as well.
859.65 -> For driver based clients it's a simple configuration change
862.74 -> where you just enable the query result reuse behavior within
865.77 -> the latest version of our driver.
867.72 -> And for API based clients, you simply specify a binary
872.16 -> toggle for whether or not to use query result reuse when
875.31 -> submitting new queries.
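For API-based clients, that toggle is part of the StartQueryExecution request. The helper below assembles such a request with result reuse enabled; the field names follow our reading of the Athena API, and the workgroup and max-age value are placeholder choices.

```python
def reuse_query_params(sql, max_age_minutes=60, workgroup="primary"):
    """StartQueryExecution parameters with query result reuse enabled.

    max_age_minutes bounds how stale a cached result may be before
    Athena re-runs the query; 60 here is an arbitrary example value.
    With boto3: athena.start_query_execution(**reuse_query_params(...)).
    """
    return {
        "QueryString": sql,
        "WorkGroup": workgroup,
        "ResultReuseConfiguration": {
            "ResultReuseByAgeConfiguration": {
                "Enabled": True,
                "MaxAgeInMinutes": max_age_minutes,
            },
        },
    }

params = reuse_query_params("SELECT day_of_week, COUNT(*) FROM flights GROUP BY 1")
```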
877.95 -> So between our new engine and features like query result
880.95 -> reuse, queries are running faster out of the box with less
885.12 -> need for the heavy tuning of SQL queries
888.66 -> that's otherwise needed up front.
890.52 -> But for those of us who,
891.66 -> and I know a few of you are out there,
893.94 -> like to go a lot deeper into our
895.86 -> queries and squeeze all of the performance out of them
898.62 -> as possible,
899.53 -> earlier this year we released a handful of new features in
902.94 -> our console to make it easy to inspect queries and really
906.3 -> dive into their performance.
908.46 -> That starts with visual query plans.
910.08 -> So previously before you ran a query you could inspect a
912.84 -> query using the EXPLAIN SQL syntax, and what we heard from
916.41 -> customers is that it wasn't easy enough for SQL analysts who
919.44 -> were using the console and wanted a simpler
921.78 -> experience for inspecting those query plans.
923.81 -> So we've added a single click visual experience for this in
927.36 -> the Athena console and to access the query plans for a query
930.63 -> that you've got loaded into the console,
932.49 -> you simply click the explain button and that's gonna take
936.09 -> you to the query plan.
937.98 -> Up next you can choose between the distributed and logical
941.34 -> plan for your query, and that's gonna help you inspect the joins
944.7 -> and other complex operations
946.08 -> that are happening in your query.
947.91 -> What's really cool is the experience is interactive so you
951.33 -> can pinch and zoom and sort of inspect individual stages of
954.84 -> your query to learn more about what's happening at each of
957.24 -> those stages.
959.24 -> After you run your query,
960.81 -> we now display some really useful runtime statistics and
963.81 -> other query performance metadata.
965.87 -> Here we see some key data like the number of rows returned
969.69 -> by your query, and that's really helpful for validating that your
972.42 -> query is working as expected.
974.9 -> You also see summarized performance data shown in the bar
978.87 -> graph at the bottom,
980.19 -> which encompasses all of the key stages of query execution
982.8 -> including query planning, queuing and query execution.
986.85 -> So that's all great,
988.02 -> but if you want to go even deeper into how your query
990.72 -> executed, you can click the execution details button,
993.63 -> which is shown at the very bottom right and that's gonna
996.57 -> bring you into the deep dive view of
998.67 -> how your query executed.
1001.88 -> So here, just to orient you around how the
1004.91 -> information is displayed,
1006.08 -> the top node represents the last stage of your query while
1009.8 -> the bottommost nodes represent the earliest stages.
1012.98 -> So you kind of think about it in terms of bottom up.
1015.68 -> The green bar in each of these nodes shows the duration in
1019.22 -> relative terms of the run time for that stage.
1022.86 -> And what that allows you to do is zoom out and quickly see
1025.8 -> where your query is spending the most amount of time and
1028.34 -> that should give you good insight as to where you can dig in
1031.85 -> to identify optimizations that'll make your queries run
1034.43 -> faster.
1035.55 -> Clicking a node shows, off to the very right,
1038 -> really interesting and useful stage-level operator data that
1042.17 -> allows you to inspect the operations at
1044.33 -> each stage of your query.
1047.62 -> Now, if you're not using the console today and still want to
1050.18 -> do analysis on your queries and how they're running,
1053.78 -> potentially as a bulk use case,
1056.75 -> we also released an API, the query runtime statistics
1061.1 -> API, to support that experience as well.
1063.71 -> And that's the API that's sitting behind all of the rich
1066.38 -> data that we're surfacing for the queries after they've run.
1068.64 -> So it's a really great API to check out.
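As a sketch of consuming that API, the helper below pulls a few headline numbers out of a runtime-statistics response. The response shape, and the fabricated example values, reflect our reading of the API rather than output from a real query.

```python
def summarize_runtime_stats(response):
    """Extract headline numbers from a query runtime statistics response.

    The field names below follow our reading of Athena's
    GetQueryRuntimeStatistics API; treat the exact shape as an
    assumption to verify against the API reference.
    """
    stats = response["QueryRuntimeStatistics"]
    timeline = stats.get("Timeline", {})
    rows = stats.get("Rows", {})
    return {
        "output_rows": rows.get("OutputRows"),
        "planning_ms": timeline.get("QueryPlanningTimeInMillis"),
        "queue_ms": timeline.get("QueryQueueTimeInMillis"),
        "execution_ms": timeline.get("EngineExecutionTimeInMillis"),
    }

# Minimal fabricated response, for illustration only.
fake_response = {"QueryRuntimeStatistics": {
    "Timeline": {"QueryPlanningTimeInMillis": 120,
                 "QueryQueueTimeInMillis": 35,
                 "EngineExecutionTimeInMillis": 2840},
    "Rows": {"InputRows": 1_000_000, "OutputRows": 7},
}}
summary = summarize_runtime_stats(fake_response)
```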
1073 -> Query tuning is certainly a great way to get more
1075.98 -> performance at lower cost for your workloads.
1078.93 -> Another strategy we often recommend actually deals with the
1082.16 -> data structure itself and that's really important because
1085.52 -> when dealing with big data,
1086.42 -> how you structure your data can have a huge impact on query
1089.93 -> performance as well as cost.
1092.45 -> To deal with this,
1093.53 -> we typically recommend customers use columnar
1096.04 -> data formats, partitioning and compression.
1099.5 -> And when you use all of those together you can save up to
1102.23 -> 90% on per query execution costs.
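One hedged way to apply all three at once is a CREATE TABLE AS SELECT (CTAS) statement that rewrites raw data into compressed, partitioned Parquet. The table, columns, and S3 location below are hypothetical.

```python
# Hypothetical source table raw_events(event_id, payload, dt); for
# Athena CTAS the partition column must come last in the SELECT list.
CTAS_SQL = """
CREATE TABLE events_optimized
WITH (
    format = 'PARQUET',
    parquet_compression = 'SNAPPY',
    partitioned_by = ARRAY['dt'],
    external_location = 's3://my-bucket/events_optimized/'
)
AS SELECT event_id, payload, dt
FROM raw_events
"""
```

After this runs once, queries that filter on dt read only the matching partitions, and Parquet's columnar layout plus Snappy compression further shrinks the bytes scanned per query.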
1105.67 -> Optimizing your data lake is a big investment,
1109.01 -> it's a really important part of the journey as we'll learn
1111.56 -> in a little bit.
1112.88 -> So to hear about that, I want to invite our guest speaker to the
1115.16 -> stage to tell us about the journey he and his colleagues are
1117.83 -> on at Mobileye to not only optimize their analytics
1121.07 -> workloads but to bring about a really exciting future
1123.64 -> involving autonomous driving that'll benefit all of us.
1126.05 -> So, Ofer.
1137.3 -> - So hello everyone,
1139.16 -> I hope you're having a great time here in the conference.
1141.95 -> Thank you Scott for introducing me. Like Scott said,
1144.83 -> my name is Ofer Eliassaf
1146.18 -> and I'm a director of Mobileye's REM cloud infrastructure.
1149.96 -> I'm going to provide an overview of Mobileye's autonomous
1153.2 -> vehicles or AV mapping technology called REM.
1157.43 -> We will discuss REM's data ingestion challenge and our need
1161.06 -> for a data lake,
1162.44 -> and we will then talk about our journey with
1164.71 -> Amazon Athena as our ingestion data lake.
1170.42 -> So why do autonomous vehicles
1172.1 -> need high definition maps or HD maps?
1174.68 -> Sometimes they're called AV maps or autonomous vehicle maps.
1178.46 -> As you all know,
1179.293 -> Mobileye is one of the world's leaders in the autonomous
1182.15 -> vehicles industry. Mobileye discovered
1184.73 -> that accurate maps are
1185.81 -> necessary in order for autonomous vehicles
1188.3 -> to operate better.
1190.07 -> The reason is that the vehicle needs to plan things like
1192.83 -> lane transitions and do routing at distances where its
1195.89 -> sensors are not effective enough or there is no line of
1199.25 -> sight for the vehicle.
1201.02 -> You can think about it using the following illustration.
1203.4 -> Imagine a human being trying to drive
1206.69 -> in a place they know well,
1208.16 -> as opposed to a place they have never been before.
1213.99 -> Regular standard-definition maps, as you all know, will not do.
1218.12 -> An autonomous car needs a map that includes
1221.72 -> much more detail,
1222.553 -> such as semantic information, road curvature,
1225.77 -> traffic signs, and everything needs to be
1227.87 -> at centimeter accuracy.
1230.39 -> It requires a map with semantic information such as which
1234.26 -> traffic sign or traffic light is relevant to which lane,
1238.07 -> who gives way to who,
1239.36 -> the relations between the lanes and much more.
1242.69 -> The map must be updated continuously so that when changes
1245.69 -> occur in the world,
1247.16 -> they're immediately visible on the map.
1250.52 -> It is very helpful to include in the map the actual driving
1253.58 -> patterns of the vehicles in given lanes such as average speed
1257.42 -> or lane centering,
1258.72 -> which is not always according to the traffic rules and in
1261.56 -> some cases the lanes are worn out and cannot be observed.
1265.32 -> Let's talk a bit about terminology.
1267.98 -> REM, which stands for Road Experience Management,
1270.59 -> is Mobileye's solution for generating such maps on a global
1273.77 -> scale, and road book is Mobileye's AV map product;
1277.67 -> it's the actual map itself.
1280.84 -> Let's look together
1282.44 -> at such a road book visualization at Europe scale.
1284.68 -> We start by zooming in on a relatively small area,
1288.17 -> a small junction.
1289.43 -> Please look at the richness of the visualization.
1291.98 -> We can see the lanes, the landmarks, the traffic lights,
1295.28 -> crosswalks, roundabouts, et cetera.
1298.28 -> And it then grows to a magnitude of all of Europe.
1301.49 -> This is what we are dealing here with.
1305.39 -> Okay.
1307.28 -> Let's talk about REM's data ingestion scale.
1310.55 -> So in order to generate a road book,
1313.34 -> we are using a crowdsourcing technology where vehicles upload
1316.85 -> payloads containing a model of the road, driving
1319.91 -> behavior and the surroundings of the vehicles.
1323.15 -> REM is working with many car companies, which are called OEMs,
1327.5 -> such as VW, BMW,
1329.69 -> Nissan, Ford, Geely, in production
1333.47 -> for several years now. This is a growing business.
1336.8 -> We are going to collaborate with
1338.33 -> many more OEMs in years to come.
1341.27 -> We are collecting information from tens of millions of
1345.14 -> kilometers per day, and we operate with global coverage:
1348.08 -> United States, Europe, parts of Asia, China, Israel,
1352.73 -> and much more.
1354.47 -> In this visualization we see the ingestion coverage.
1357.29 -> The video starts with a single day and it then shows the
1360.47 -> coverage after one week, one month and eventually 10 months.
1366.14 -> It basically illustrates that we have enough coverage to
1369.35 -> continuously map the entirety of Europe in a relatively
1373.82 -> small amount of time.
1378.44 -> Let's have a look at how REM works in a high-level overview.
1382.07 -> Our relatively cheap EyeQ chips are spread across millions of
1385.16 -> consumer cars around the world and they process images on
1388.7 -> the edge device itself.
1390.38 -> By using state-of-the-art algorithms,
1392.36 -> we create a model of the scene
1393.65 -> surrounding the location of the car.
1395.9 -> A multidimensional model of the road is created alongside
1399.5 -> all the signs, traffic lights, road marks, et cetera.
1402.91 -> This information is then packed into a very dense
1406.57 -> payload which we call an RSD,
1408.77 -> which stands for road segment data.
1413.36 -> An RSD usually contains information from 10 kilometers and its
1417.23 -> density is up to 10 kilobytes per kilometer,
1419.57 -> and so it means that every payload is roughly
1422.69 -> 100 kilobytes in size.
1426.4 -> These RSDs are anonymized, encrypted and uploaded to the cloud.
1430.7 -> And while each such RSD is a bit noisy,
1433.4 -> aggregating many of them using the REM technology from the
1436.76 -> same lane around the same time
1438.44 -> generates centimeter accuracy for the lane.
1441.91 -> Mobileye invested a lot to make this process automatic at a
1445.31 -> click of a button on a global scale, and this road book is
1449.66 -> generated from crowdsourcing, so its time to reflect reality
1452.45 -> after a change in the world is very small.
1458.03 -> the road book is then sent to the vehicle
1460.19 -> which runs Mobileye localization technology
1462.53 -> that compares what the vehicle
1464.21 -> detects to the elements of the map. By doing so,
1466.97 -> the vehicle locates itself on the map with centimeter accuracy
1471.38 -> and this enables autonomous vehicle driving features.
1475.15 -> Everything needs to be scalable: the harvesting technology,
1478.46 -> the APIs, the aggregation computation,
1481.1 -> which happens on tens or even
1482.81 -> hundreds of concurrent CPUs.
1485 -> And our approach
1486.32 -> allows us to generate the detailed semantic information I
1489.89 -> mentioned earlier.
1493.1 -> Let's talk about why we need an ingestion data lake.
1496.71 -> Each ingestion payload that arrives is kept in a data lake.
1500.6 -> After intensive computation,
1502.19 -> each record contains more than 150 attributes containing
1506.93 -> things like events, geometries, time measurements,
1510.2 -> lengths of drives, specific events, metadata, and more.
1515.12 -> This data is later being queried for many use cases.
1519.1 -> The main use case is roadbook creation.
1522.16 -> Every map creation begins with a query to the ingestion data
1526.13 -> lake and we will drill into this use case a bit later in the
1529.58 -> next slide.
1530.69 -> But we also have analytics queries, so we can answer
1533.48 -> questions like how much time
1536.27 -> it'll take us to get coverage of the United States highways.
1540.13 -> We also have a UI so that our customers can see their RSD
1543.89 -> coverage, and we try to leverage this data lake to build new
1547.91 -> business models around it, such as smart cities and more.
1552.59 -> And this means that we have many internal customers inside
1557.57 -> Mobileye that need access to this data.
1560.21 -> And new types of usage
1562.22 -> come along every now and then.
1568.56 -> Let's focus on the roadbook creation usage
1570.95 -> of the ingestion data lake.
1572.51 -> Our mapping process is done on geographical cells of about
1576.02 -> 10 by 10 kilometers.
1577.56 -> Every cell's processing starts with a query to the data lake,
1581.69 -> scanning dozens of gigabytes
1583.22 -> and returning millions of records.
1585.56 -> This query happens as you might guess by now
1588.5 -> on Amazon Athena.
1590 -> Most of the queries are scanning three to five months of
1592.82 -> data, but some scan more.
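The per-cell query described here can be pictured with a small query builder. This is an illustrative Python sketch only; the table and column names (`rsd_records`, `lon`, `lat`, `ingestion_date`) are assumptions, not Mobileye's actual schema.

```python
# Sketch: build one Athena SQL string per 10x10 km cell, filtering the
# ingestion data lake by the cell's bounding box and a recent time window.
from datetime import date, timedelta


def build_cell_query(min_lon, min_lat, max_lon, max_lat,
                     months_back=5, table="ingestion_lake.rsd_records"):
    """Return an Athena SQL string scanning one geographic cell."""
    since = date.today() - timedelta(days=30 * months_back)
    return (
        f"SELECT * FROM {table} "
        f"WHERE lon BETWEEN {min_lon} AND {max_lon} "
        f"AND lat BETWEEN {min_lat} AND {max_lat} "
        f"AND ingestion_date >= DATE '{since.isoformat()}'"
    )


query = build_cell_query(34.7, 31.9, 34.8, 32.0, months_back=3)
print(query)
```

Each such string would then be submitted to Athena; running one query per cell is what makes the thousands of concurrent mapping processes possible.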
1595.22 -> We sometimes generate multiple maps in parallel and I'm not
1598.28 -> talking about mapping multiple cells in parallel.
1600.36 -> This goes without saying.
1602.19 -> What I mean by that is that we sometimes run multiple huge
1606.02 -> maps such as mapping of Europe and another mapping of
1609.41 -> the United States, in parallel.
1611.21 -> Each containing multiple cells. We are running up
1614.84 -> to tens of thousands of such queries per day. Again,
1617.78 -> all of these activities happening on Amazon Athena.
1620.17 -> In this visualization,
1623.03 -> we can see a single mapping job of Europe running multiple
1626.57 -> cells in parallel.
1627.65 -> The small squares that you see are cells and the map is
1630.98 -> generated by mapping multiple cells separately in parallel
1635.6 -> and stitching them together into a coherent map.
1643.43 -> If you're wondering how we got to this level of analytics,
1646.52 -> I want to take a few minutes to explain how our journey with
1649.58 -> Amazon Athena started with our ingestion data lake.
1653.79 -> This is a simplified version,
1655.31 -> very simplified version of the system we had back then.
1658.49 -> It begins with a vehicle that uploads payloads to a REST API
1662.69 -> interface used to receive the payloads.
1665.55 -> When a payload arrives,
1666.92 -> it passes some simple sanity checks and a computation plan
1670.1 -> is built for it.
1672.17 -> It is then passed to a worker queue
1674.57 -> that executes the plan.
1676.51 -> This plan is a very compute-intensive process.
1678.69 -> It takes dozens of seconds per single payload.
1682.58 -> At the end of the execution we extract the 150 metadata
1686.48 -> attributes we mentioned that we want to keep in the
1689.39 -> ingestion data lake.
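The flow just described, sanity check, plan construction, worker execution, and attribute extraction at the end, can be sketched roughly as below. All function names, fields, and plan steps are illustrative assumptions, not the real system.

```python
# Minimal sketch of the ingestion flow: payload -> sanity check ->
# computation plan -> execution -> extracted metadata attributes.
def sanity_check(payload):
    # Hypothetical check: the payload names a vehicle and is non-empty.
    return "vehicle_id" in payload and payload.get("size_bytes", 0) > 0


def build_plan(payload):
    # In the real system this plan drives a compute-intensive job taking
    # dozens of seconds; here it is just a list of made-up step names.
    return ["decode", "align_to_map", "extract_events"]


def execute_plan(payload, plan):
    # Pretend each step enriches the payload; return extracted metadata
    # (the real system extracts 150+ attributes).
    return {"vehicle_id": payload["vehicle_id"], "steps_run": len(plan)}


def ingest(payload):
    if not sanity_check(payload):
        raise ValueError("payload failed sanity check")
    plan = build_plan(payload)
    return execute_plan(payload, plan)


record = ingest({"vehicle_id": "v-42", "size_bytes": 1024})
print(record)
```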
1690.92 -> The question that we struggled with back then was which data
1694.73 -> lake engine could function properly and provide us query
1697.82 -> services for our use cases.
1701.21 -> So what were our requirements?
1704.03 -> We had to support joins on different tables, such as
1707.33 -> geometries of roads and things like that,
1709.91 -> so our data is relational by nature.
1712.91 -> We need the data to be fresh, so every new payload that
1715.61 -> arrives is available to queries soon after.
1720.14 -> We had the need to support geographical queries
1722.51 -> and we wanted something
1723.65 -> that will be reliable and easy to maintain because this is a
1726.41 -> mission critical functionality.
1729.05 -> We needed reasonable query speed.
1731.21 -> It doesn't have to be subsecond but it should run fast
1734 -> enough for our needs.
1736.04 -> Like I already mentioned,
1737.84 -> we need to support a high concurrency of up to thousands of
1740.87 -> mapping processes in parallel, and we want the storage to be
1745.13 -> relatively cheap.
1747.02 -> Back then our usage pattern was unknown, so we needed
1749.99 -> something that we could
1751.79 -> count on to keep evolving together with us.
1756.89 -> So what was the process that led us to choose Amazon Athena?
1761.45 -> It started with the design phase.
1763.38 -> Our initial idea was to use two types of systems.
1766.67 -> One for the short term storage,
1768.2 -> which was supposed to be more expensive but very efficient
1772.88 -> and one for the long term storage that was supposed to be
1775.37 -> cheap but not very efficient.
1778.52 -> And we had the need to start with the long term storage
1781.58 -> and our research led us to conclude that Amazon Athena
1784.7 -> together with Parquet files on S3
1786.65 -> would work great for us. We consulted our AWS
1790.58 -> solution architects and they suggested that we go to a
1793.07 -> data lab in Seattle. We went to the data lab in
1796.73 -> Seattle and implemented a POC, then got back to Israel
1801.02 -> and we
1803.39 -> finalized the implementation and went into production.
1807.41 -> As we were in production, the usage grew,
1810.32 -> both in usage patterns, amount of data, and cost, and we
1814.91 -> had to do a few iterations of optimization.
1818.34 -> The good news is that we were able to scale with Athena
1821.99 -> with our current storage strategy of using
1823.67 -> S3 with the Parquet data format,
1826.43 -> and we don't foresee a need for a second system for
1828.77 -> short term storage in analytics.
1831.92 -> And putting things in perspective,
1834.77 -> the design phase took us two months and the implementation
1837.59 -> phase took us two months, and we have been in production for
1840.38 -> several years now with very small development effort,
1843.92 -> which is nice.
1847.94 -> This is our current architecture, again at a very
1851.24 -> high level, without many of the small details.
1853.01 -> As you can see,
1854.12 -> the diagram is the same as the previous slide I showed, up
1856.94 -> until this point.
1857.773 -> We have the 150 metadata attributes and we use Kinesis
1861.89 -> Data Streams, Kinesis Data Firehose,
1863.3 -> and a Lambda function to take the RSD metadata and
1866.87 -> convert it into Parquet files on S3.
1869.81 -> We run a daily coalesce job to
1871.19 -> reduce the number of files so we
1872.96 -> can increase the speed and reduce the S3 cost.
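The coalesce idea can be shown with a sketch of the grouping logic alone, leaving S3 and Parquet out entirely. The 128 MB target and file sizes are illustrative assumptions.

```python
# Sketch: pack many small files into fewer output files near a target size,
# which is what a daily coalesce job does to speed up queries and cut cost.
def coalesce(file_sizes_mb, target_mb=128):
    """Greedily group small files into batches of roughly target_mb."""
    batches, current, current_size = [], [], 0
    for size in file_sizes_mb:
        if current and current_size + size > target_mb:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches


small_files = [16] * 20          # twenty 16 MB files
batches = coalesce(small_files)  # grouped toward 128 MB outputs
print(len(batches))
```

Fewer, larger files mean fewer S3 GET requests and less per-file overhead for the query engine.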
1876.18 -> And of course Amazon Athena is the query engine
1879.44 -> that we use for these use cases.
1884.15 -> So this slide is all about the optimization we have done
1887.03 -> during the time we are in production.
1888.62 -> Optimization is quite an iterative process.
1891.94 -> You are okay until something is not scaling correctly and
1894.86 -> then you need to fix things.
1896.99 -> And we did a few types of optimization.
1899.45 -> So the first type was query level optimization where you
1902.54 -> look for better SQL statements for your queries.
1905.51 -> We also partnered with AWS to reach massive concurrency.
1909.92 -> We had to refactor our data models according to Amazon
1912.53 -> Athena best practices and we started using new capabilities of
1917.72 -> Amazon Athena when they showed up. For example,
1920.06 -> our typical queries usually filter by geography and time.
1924.38 -> So by using a new feature of Amazon Athena back then called
1927.5 -> partition projection,
1929.3 -> we were able to add a geospatial partition,
1931.96 -> which uses tens of thousands of partitions
1935.45 -> per day, allowing us to get amazing optimization
1937.97 -> by reducing the amount of data scanned.
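The effect of the geospatial plus time partitioning can be pictured with a pruning sketch: if data lives under one S3 prefix per cell and day, a filtered query only touches the matching prefixes instead of the whole table. The prefix layout, bucket, and cell ids below are illustrative assumptions.

```python
# Sketch: enumerate only the partition prefixes a filtered query would scan
# when the table is partitioned by a geohash-like cell id and a date.
from datetime import date, timedelta


def partition_prefixes(cell_ids, start, end, base="s3://my-lake/rsd/"):
    """List the partition prefixes covered by a cell + date-range filter."""
    prefixes = []
    day = start
    while day <= end:
        for cell in cell_ids:
            prefixes.append(f"{base}cell={cell}/dt={day.isoformat()}/")
        day += timedelta(days=1)
    return prefixes


scanned = partition_prefixes(["u4pr", "u4px"],
                             date(2022, 11, 1), date(2022, 11, 3))
print(len(scanned))  # 2 cells x 3 days = 6 prefixes
```

With Athena's partition projection, partition values like these are computed from configuration rather than stored in the catalog, which is what makes tens of thousands of partitions per day manageable.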
1941.27 -> In this optimization, we reduced the query time by 90%,
1945.59 -> we also reduced the cost by 90%, and it got us to zero
1949.49 -> concurrency issues, and we can now run hundreds of
1953.39 -> concurrent queries instead of dozens.
1955.7 -> And we have zero bottlenecks
1956.96 -> between different map activities,
1959.24 -> which was a very big pain point back then. And this is it.
1963.74 -> So we are really happy to work with Athena.
1966.65 -> And getting back to you, Scott.
1974.74 -> - All right, thanks Ofer.
1979.16 -> Really fascinating use case.
1981.2 -> Every time I see the zoom out I'm just kind of like,
1983.31 -> it's kind of mind blowing.
1985.61 -> So yeah,
1986.443 -> really impressive to see the work you
1987.276 -> and the team have put into Athena.
1988.37 -> So, and also the impact that it's had at Mobileye,
1990.06 -> and being able to scale that really cool technology to,
1993.24 -> you know, many,
1994.25 -> many machines and devices out there in the real world.
1997.4 -> And that's something that I find really cool about this
1999.2 -> use case, you know,
2000.37 -> having been in analytics for a long time,
2002.35 -> it's kind of rare that we find these use cases where
2004.42 -> analytics goes full circle and ends up out there in the real
2008.26 -> world powering sort of the
2009.88 -> decisions we make on a daily basis. So really cool.
2013.24 -> Switching gears now,
2014.073 -> we'll talk about how Athena is helping you bring all of your
2018.34 -> data together to provide you analytics on all those sources.
2022.39 -> So when talking with customers about that topic,
2025.27 -> we often describe our thinking in terms of the modern data
2028.6 -> architecture for AWS and what it should do for you.
2033.1 -> A modern data strategy, as you heard this morning in Swami's
2036.61 -> keynote and throughout
2038.38 -> the AWS re:Invent sessions this week,
2041.8 -> the modern data strategy should give you sort of the best of
2044.77 -> both data lakes and purpose-built data stores. Again,
2048.28 -> thinking about all of the best in class database products
2051.55 -> that AWS supports.
2053.56 -> Modern data strategy should enable you to store any amount
2056.14 -> of data at low cost using open,
2060.4 -> standards-based formats.
2062.98 -> Modern data architecture should allow
2064.51 -> you to break down data silos,
2066.43 -> empowering teams to run analytics and machine learning
2069.64 -> workloads using their preferred tools and giving you the
2073.57 -> capability to manage who has access to the data with the
2077.11 -> proper security and data governance controls in mind.
2081.22 -> Data lakes are often a great starting point because
2083.74 -> they provide flexible storage of data with high durability
2087.22 -> and low cost. And by storing data in open formats, you can
2091.54 -> decouple storage from compute and that makes it easy when
2094.87 -> the time comes to analyze data by allowing you to choose the
2099.52 -> right tool for the job and bring a variety or choose from a
2103.27 -> variety of machine learning and analytics platforms and
2106.51 -> products supported by AWS.
2109.06 -> One of the benefits of data lakes is the flexibility to
2112 -> embrace new formats and paradigms for analyzing data.
2116.26 -> A recent shift in data lakes has been the emergence of table
2119.38 -> formats. Table formats if you're not familiar,
2122.38 -> are gaining traction
2124.48 -> mostly because they're really easy to understand.
2126.67 -> They allow interaction with data lakes through familiar
2129.88 -> database-like constructs and semantics that allow us to
2133.69 -> abstract data from where it came and bring data into a
2138.16 -> singular data set represented intuitively as a table.
2142.87 -> One of the areas where Athena is leading the way is on its
2145.9 -> support of table formats.
2148.66 -> And last year at re:Invent you may recall our announcement on
2151.48 -> Athena's support of one of those table formats,
2153.68 -> which is Apache Iceberg.
2156.27 -> Apache Iceberg is an open table format designed for
2160.6 -> very large analytic data sets.
2162.91 -> It has many properties making it a great solution
2165.46 -> for data lakes. For example,
2167.68 -> Iceberg supports writing to data stored on S3 and that is
2172 -> something that many customers need as part of the
2174.19 -> operational activities that
2175.6 -> support their analytics programs.
2177.88 -> Iceberg also supports schema evolution,
2180.88 -> giving you add column, drop column, and rename column
2184 -> semantics that look very similar to running a database.
2189.01 -> So those are very familiar to folks who are used to
2191.77 -> those paradigms.
2193.33 -> So we've not stopped innovating on Iceberg since last year,
2196.69 -> and in 2022 we've pushed the boundaries even further on what
2200.14 -> you can do with Iceberg and Amazon Athena.
2203.26 -> So to kind of recap some of those updates,
2206.05 -> the first of which is create table as select support for
2209.08 -> Apache Iceberg.
2210.52 -> So with CTAS you get an easy and fast way to create
2214.64 -> new Iceberg tables from the results of another select query.
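A CTAS statement for an Iceberg table might be assembled like the sketch below. The table and location names are made up, and the exact set of table properties may vary, so treat this as a hedged illustration rather than a definitive template.

```python
# Sketch: assemble a CREATE TABLE AS SELECT (CTAS) statement that creates
# a new Iceberg table in Athena from the results of a SELECT query.
def iceberg_ctas(new_table, select_sql, location):
    return (
        f"CREATE TABLE {new_table} "
        f"WITH (table_type = 'ICEBERG', location = '{location}', "
        f"is_external = false) AS {select_sql}"
    )


sql = iceberg_ctas(
    "analytics.daily_events",
    "SELECT event_id, event_date FROM raw.events "
    "WHERE event_date >= DATE '2022-01-01'",
    "s3://my-lake/iceberg/daily_events/",
)
print(sql)
```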
2219.4 -> We've also added view support so you can now hide complex
2223.39 -> joins and other business logic, surfacing simpler-to-query
2228.01 -> analytic data sets
2230.17 -> for users to run SQL queries on.
2233.02 -> And we've also worked to optimize SQL queries running on
2237.28 -> Iceberg with engine version three, and we're happy to report
2240.1 -> that queries on Iceberg using engine version three are
2244.48 -> running up to 10 times faster.
2246.58 -> And that's really exciting for those of us who are doubling
2249.19 -> down on the Apache Iceberg format.
2252 -> We've also extended ACID transactions in Athena so you can
2256.06 -> now use Iceberg's merge operator to synchronize your tables
2259.57 -> as they're modified by
2260.59 -> other processes and business users.
2263.86 -> So that's gonna make it a lot easier and efficient to keep
2266.29 -> your Iceberg tables up to date.
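Conceptually, MERGE gives you upsert semantics: update rows that match on a key, insert rows that don't. This small Python sketch simulates that behavior on plain dictionaries; it is a model of the idea, not Athena or Iceberg itself.

```python
# Sketch: simulate MERGE-style upsert semantics on lists of row dicts,
# matching rows on a primary key column.
def merge(target, updates, key="id"):
    """Upsert rows from `updates` into `target`, matching on `key`."""
    merged = {row[key]: dict(row) for row in target}
    for row in updates:
        merged.setdefault(row[key], {}).update(row)
    return sorted(merged.values(), key=lambda r: r[key])


table = [{"id": 1, "status": "old"}, {"id": 2, "status": "old"}]
changes = [{"id": 2, "status": "new"}, {"id": 3, "status": "new"}]
result = merge(table, changes)
print(result)
```

In Athena, the same outcome is expressed declaratively with a MERGE INTO statement against the Iceberg table, with matched rows updated and unmatched rows inserted.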
2268.96 -> Now if you want to delete records to meet regulatory
2271.87 -> requirements like GDPR or to manage your storage footprint,
2275.78 -> you can now use the vacuum operator to do just that.
2279.9 -> And last but not least,
2281.08 -> we've also added support for Avro and ORC, giving you
2285.34 -> more flexibility to choose the format that works best for
2288.67 -> your use case and allowing you
2290.29 -> to bring that to Iceberg as well.
2292.87 -> So when scaling data lakes,
2294.28 -> it's really important to take into account not only the ease
2298 -> of use and flexibility benefits that we're describing here,
2300.97 -> but also the security and data governance needs that others
2305.23 -> in your organization most likely have.
2308.11 -> So we have some really great news on that front as well.
2310.66 -> And the news is that we've expanded our support through AWS
2314.5 -> Lake Formation to include all file and table formats
2318.16 -> currently supported by Athena.
2320.38 -> If you're not familiar with Lake Formation,
2322.95 -> Lake Formation allows you to essentially define column-, row-,
2327.04 -> and table-level data governance policies,
2329.6 -> which are respected when queried by engines
2332.02 -> like Athena and EMR.
2334.96 -> So users are only able to access
2336.94 -> the data that they're entitled to.
2339.4 -> So with this launch you can now define all of those fine
2342.28 -> grained data access and governance controls using Lake
2345.97 -> Formation, and have those work on any file or table format that
2350.2 -> Athena supports today.
2352.9 -> Best yet, all of the filtering logic that
2356.83 -> typically happens when these
2359.59 -> governance policies are applied
2361.06 -> during a user's query execution
2362.8 -> is now implemented natively in Athena's engine.
2365.95 -> So you're getting more optimized performance when users are
2368.86 -> querying their data lake files when Lake Formation
2373.3 -> policies are applied to them.
2376.91 -> Cool. So we're gonna revisit the
2378.88 -> modern data architecture slide
2380.35 -> for a moment, as there's an important part of the story that
2382.51 -> we wanted to kind of build on. And that's actually the
2385.54 -> prevalence of data sitting adjacent to the data lake.
2388.87 -> Oftentimes in databases,
2390.55 -> warehouses or other object stores often running in AWS but
2395.23 -> sometimes on-prem or potentially
2396.88 -> even in another cloud provider.
2399.28 -> And oftentimes analysts,
2400.71 -> data engineers and other users need access to that data just
2404.08 -> as they do their data lake.
2405.66 -> But too often those users are having to deal with the
2408.07 -> friction and frustration of having to learn new languages or
2411.4 -> build pipelines that extract that data and bring it
2415 -> somewhere else where they can then analyze it.
2418.48 -> Athena addresses this problem with
2420.13 -> what we call federated query.
2422.2 -> Federated query allows you to run SQL queries on
2425.65 -> data stored in relational, non-relational, object, and even
2429.43 -> custom data sources.
2431.23 -> Analysts can run federated queries using the same ANSI SQL
2435.49 -> syntax that we support for data lake queries, and use that
2439.45 -> same language or a single query to join data spanning their
2443.44 -> federated sources with their data lake in a single query.
2447.07 -> With federated query you query data where it lives so
2450.73 -> there's no data movement. However,
2452.8 -> you can also use it to ingest external data into your data
2456.31 -> lake and use that to drive business intelligence and other
2459.31 -> use cases from your data lake without having to query all
2462.58 -> the way down to your underlying database each time.
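A federated query of the kind described might be assembled like this sketch: one SQL statement that joins a data lake table with a table behind a connector. The catalog name `postgres_catalog` and the table names are illustrative assumptions about a deployment, not fixed values.

```python
# Sketch: build a single Athena SQL statement that joins a data lake table
# (default awsdatacatalog) with a federated source behind a connector.
def federated_join(lake_table, federated_table, join_key):
    return (
        f"SELECT l.*, f.* FROM awsdatacatalog.{lake_table} l "
        f"JOIN postgres_catalog.{federated_table} f "
        f"ON l.{join_key} = f.{join_key}"
    )


sql = federated_join("sales.orders", "public.customers", "customer_id")
print(sql)
```

The point is that the analyst writes ordinary ANSI SQL; Athena's connector handles reaching into the external database at query time, so no pipeline is needed to move the data first.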
2467.62 -> We have over 25 of these connectors available today and
2471.4 -> earlier this year we released a bunch of new ones spanning
2474.43 -> cloud object stores, relational databases and more.
2477.82 -> And in Athena these connectors are
2479.32 -> really easy to set up and use.
2481.45 -> Starting with our console you can click data sources and
2484.69 -> you'll see a list of all the sources that we support.
2487.21 -> And after selecting a source,
2488.83 -> you can follow our guided workflow to plug in the values
2491.12 -> that help you get connected.
2493.5 -> Our connectors work as applications
2495.52 -> running on AWS Lambda and
2497.86 -> with that comes support for
2499.51 -> cross-account access and IAM policies.
2502.63 -> That makes it easy for one person in your
2504.82 -> organization to set up a connection and then grant access
2508.72 -> to other teammates
2509.673 -> so that they can then query that source
2511.75 -> using their own AWS account.
2514.22 -> All of our connectors are built on our open source SDK and
2517.857 -> all of our code is out there on GitHub,
2520.39 -> so we hope you can take a look at that and use it as
2523.03 -> boilerplate code for any custom connectors that you're
2525.76 -> thinking about developing as well.
2528.47 -> This year was really big for Athena on that front and the
2531.76 -> data sources that we support altogether, as I mentioned,
2534.79 -> there's over 25 connectors available to some of the
2538.78 -> most widely used databases and storage platforms on the
2541.78 -> market today.
2543.07 -> Often that spans not only AWS sources
2545.71 -> but third party ones as well.
2548.36 -> And so another thing we want to introduce here is
2550.98 -> the fact that many organizations today are using software as
2555.01 -> a service applications to help drive their businesses
2557.71 -> in specific functions or use cases.
2561.86 -> Unfortunately many of those SaaS data providers don't give
2565.66 -> you direct access to the underlying databases,
2568.88 -> which is a problem when you need access to that data to
2571.8 -> understand how your business is operating.
2575.44 -> One of our partner services is Amazon AppFlow.
2578.95 -> Amazon AppFlow is a fully managed integration service that
2582.49 -> enables you to securely transfer data between SaaS
2585.82 -> applications like Salesforce, SAP, Zendesk, Slack,
2589.92 -> and a bunch more, and bring that data to AWS, ingesting it
2594.34 -> into services like Redshift and S3 where you can then use it for
2598.18 -> a variety of use cases.
2600.25 -> AppFlow supports over 50 SaaS sources like the ones shown
2603.34 -> here on the slide that help you ingest all that data and
2606.19 -> bring it to S3 for, again, a variety of use cases.
2609.78 -> The big news from AppFlow at this re:Invent is their
2613.45 -> recently announced support of the AWS Glue Data Catalog for data
2617.8 -> flows between SaaS sources and S3.
2620.63 -> What you can do now is basically select a SaaS source,
2623.8 -> build a data flow and register that flow with AWS Glue.
2628.27 -> Once the data is registered with AWS Glue,
2630.4 -> you can run queries on it using Athena and a host of
2633.49 -> additional AWS analytics services as well.
2636.58 -> The AppFlow team went a step further and added a really
2639.58 -> cool feature,
2640.413 -> which is essentially partition setup as part of the flow
2643.72 -> design workflow.
2645.1 -> And what that lets you do is, as you're building your flows,
2648.58 -> select the fields using a simple GUI that allows you to
2653.83 -> choose which fields in the response from those sources
2657.61 -> are good candidates for partitions to use in the data model
2661.33 -> once you're setting it up on S3.
2664.27 -> What that means is AppFlow takes that partition input into
2667.54 -> account when ingesting the data and automatically writes
2670.09 -> data to those partitions,
2671.32 -> which means Athena queries running on those sources are
2674.68 -> really fast.
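The partition-aware ingestion described here comes down to the chosen fields becoming directory components of the S3 key, so Athena can prune partitions when querying. The bucket prefix and field names below are illustrative assumptions.

```python
# Sketch: derive a partitioned S3 object key from a record's chosen
# partition fields, the layout a partition-aware flow would write.
def partitioned_key(record, partition_fields, base="raw/zendesk/"):
    parts = "/".join(f"{f}={record[f]}" for f in partition_fields)
    return f"{base}{parts}/record-{record['id']}.json"


rec = {"id": 7, "region": "eu", "created_date": "2022-11-29"}
key = partitioned_key(rec, ["region", "created_date"])
print(key)
```

A query filtering on `region` and `created_date` then only reads the matching prefixes, which is why those queries are fast.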
2678.43 -> So we covered a lot of ground today.
2680.59 -> We figured we should kind of recap some of those things
2683.23 -> before we wrap up the session.
2685.27 -> So as you know, Athena is easy to use,
2687.24 -> giving you instant startup for SQL and now Apache Spark
2693.04 -> applications as well.
2694.27 -> That's a brand new experience in Athena and we're really
2696.58 -> excited to see what you can do with that as well as what
2699.04 -> feedback you have for us on where you want us to take that.
2701.95 -> So we encourage you to check out ANT209,
2704.68 -> which is happening tomorrow to go really deep on that topic
2707.803 -> and learn more.
2709.65 -> We also covered SQL engine version three and how it helps
2713.11 -> you with the expanded analytics
2716.29 -> functionality that it provides, as well as faster queries.
2719.58 -> And again, on top of engine version three,
2721.57 -> we have query result reuse,
2722.92 -> which is available today, giving you faster queries
2727.33 -> at lower cost. Caching is
2729.56 -> a concept we're gonna be investing heavily in.
2732.22 -> So I hope you can keep your eyes and ears open for
2733.99 -> additional announcements on that front in the weeks and
2736.75 -> quarters ahead.
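The query result reuse idea mentioned here is essentially a cache keyed by the query text with a maximum result age: a repeated query is served from its recent result instead of rescanning data. This Python sketch models only that concept; the class, timing values, and interface are illustrative, not Athena's implementation.

```python
# Sketch: a tiny result cache that serves a repeated query from its cached
# result while the result is younger than a configured maximum age.
import time


class ResultCache:
    def __init__(self, max_age_seconds=60):
        self.max_age = max_age_seconds
        self._cache = {}

    def run(self, query, execute):
        """Return (result, served_from_cache) for the given query text."""
        hit = self._cache.get(query)
        if hit and time.time() - hit[0] <= self.max_age:
            return hit[1], True
        result = execute(query)
        self._cache[query] = (time.time(), result)
        return result, False


cache = ResultCache(max_age_seconds=60)
first, cached1 = cache.run("SELECT 1", lambda q: [1])
second, cached2 = cache.run("SELECT 1", lambda q: [1])
print(cached1, cached2)  # False True
```

The second identical query within the age window skips execution entirely, which is where the speed and cost savings come from.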
2738.67 -> We also touched on expanded support for Apache Iceberg and
2742.39 -> how you can now bring transactionality to your data lake and
2745.15 -> analytics workflows as well.
2746.73 -> So we're really excited to see what comes next on that.
2750.34 -> We also
2752.17 -> touched on expanded support for row-, column-, and table-level
2756.25 -> security controls powered by Lake Formation and how we can
2758.65 -> now apply those policies to any table or file format that
2762.07 -> Athena supports.
2763.46 -> And last but not least,
2764.65 -> as you build analytics around your data lake,
2767.33 -> we encourage you to consider Athena's data source connectors
2770.82 -> and other services like Amazon AppFlow to help you bring
2773.98 -> all that data together.
2777.16 -> Wanted to thank you all for your time today and thank our
2780.46 -> guest speaker Ofer for the insights and inspiration on
2783.24 -> their use case.
2784.38 -> And hope you got a great sense for the broad set of features
2787.03 -> that we've been rolling out this year and are excited to get
2790 -> back home and try 'em all out. So thank you.

Source: https://www.youtube.com/watch?v=vhO8Qst5Vhc