Processing Streaming Data with AWS Lambda - AWS Online Tech Talks
Processing streaming data can be complex in traditional, server-based architectures, especially if you need to react in real-time. Amazon Kinesis makes it possible to collect, process, and analyze this data at scale, and AWS Lambda can make it easier to develop highly scalable, custom workloads to turn the data into useful insights. This tech talk explains common streaming data scenarios, when to use Kinesis or Kinesis Data Firehose, and how to use Lambda in a streaming architecture. Learn about the extract, transform, load (ETL) process using Lambda and how you can implement AWS services to build data analytics. This talk also discusses best practices to help you build efficient and effective streaming solutions.
Learning Objectives:
* Understand how to manage streaming data workloads
* Learn how to use AWS Lambda to process streaming data
* Use best practices in your Lambda architectures to reduce cost and improve scale
☁️ AWS Online Tech Talks cover a wide range of topics and expertise levels through technical deep dives, demos, customer examples, and live Q&A with AWS experts. Builders can choose from bite-sized 15-minute sessions, insightful fireside chats, immersive virtual workshops, interactive office hours, or watch on-demand tech talks at your own pace. Join us to fuel your learning journey with AWS.
#AWS
Content
0.88 -> thanks for joining me today for this
2.399 -> session which is about processing
4.16 -> streaming data
5.279 -> with aws lambda my name is james beswick
9.04 -> and i'm a principal developer advocate
11.2 -> here in the aws serverless team i'm a
14.4 -> self-confessed serverless geek and i've
16 -> built quite a few production systems
17.76 -> using serverless infrastructure
20 -> prior to being a da i was a software
21.92 -> developer for many years
23.519 -> and also a product manager for a long
25.199 -> time the most important thing on my
27.119 -> slide here is my twitter handle
29.279 -> and my email address so if you have any
30.88 -> questions about kinesis
32.399 -> or serverless in general feel free to
34.559 -> reach out to me and i'll do my very best
36.48 -> to help
38.8 -> today we're talking about processing
40.719 -> streaming data with aws lambda
43.6 -> in this session i'll cover five topics
46.879 -> first i'll talk about streaming use
48.719 -> cases and briefly explain
50.559 -> the different kinesis services then i'll
53.76 -> show how you can build zero
55.039 -> administration stream processing
57.199 -> with kinesis data firehose lambda is a
60.399 -> really important service for processing
62.239 -> streaming data
63.28 -> so i'll talk about how to use it for
65.04 -> on-demand compute
66.32 -> with kinesis data streams in production
69.36 -> systems it's important to know how to
70.799 -> scale
71.52 -> monitor and troubleshoot kinesis so i'll
73.6 -> provide some important guidance here
75.6 -> and finally i'll discuss how to optimize
78.08 -> your kinesis based application
80.159 -> and highlight some best practices
84.72 -> amazon kinesis makes it easier to
86.799 -> collect process
88.08 -> and analyze real-time streaming data for
91.28 -> applications
92.24 -> this can help you get timely insights
94.4 -> and react quickly to new information
97.28 -> kinesis provides cost-effective
99.04 -> streaming data processing at any scale
101.439 -> with the flexibility to choose the tools
103.36 -> that best suit the needs of your
104.72 -> workload
106.24 -> moving from traditional batch
107.759 -> architectures to streaming architectures
110.399 -> creates new capabilities you go from
113.439 -> hourly
114 -> server logs to real-time metrics or from
116.799 -> weekly and monthly bills
118.64 -> created by a job to a system that
120.56 -> monitors real-time spending alerts
122.719 -> and implements spending caps in
125.439 -> analyzing click streams
126.96 -> it can move from daily reports to a
128.959 -> real-time analysis
130.879 -> and for customers like financial
132.319 -> institutions who have fraud detection
134.239 -> systems
135.2 -> critically this information can drive
137.44 -> real-time detection
139.599 -> bringing real-time capabilities to
141.36 -> workloads fundamentally changes the
143.52 -> capabilities
144.56 -> and the problems that you can solve
148.8 -> there are a wide variety of use cases
150.8 -> for kinesis
151.92 -> here are just a few of the common ones
153.599 -> that we see from our customers
156 -> in video processing kinesis allows you
158.16 -> to stream video from connected devices
160.4 -> to aws
161.76 -> for analytics or machine learning the
164.16 -> video can then be processed by other
166.08 -> services
167.04 -> depending upon the needs of the workload
170.239 -> in industrial automation sensors collect
172.879 -> large amounts of data from thousands of
174.8 -> devices
176.16 -> this data can be ingested by kinesis and
178.8 -> analyzed by kinesis data analytics
181.44 -> in real time to power dashboards for
183.68 -> real-time monitoring
184.959 -> the data can also be stored and analyzed
187.36 -> historically
188.4 -> or used to train machine learning models
191.84 -> you can use kinesis to process streaming
193.84 -> data from iot devices such as consumer
196.48 -> appliances
197.599 -> embedded sensors and tv set-top boxes
201.12 -> you can then use the data to send
203.04 -> real-time alerts
204.239 -> or take actions programmatically when a
206.4 -> sensor exceeds certain operating
208.239 -> thresholds
210.56 -> data collected via kinesis can also be
213.2 -> filtered and
214 -> aggregated with kinesis data firehose
216.959 -> and then stored durably in s3 buckets to
219.28 -> create data lakes
220.799 -> this enables analytics and machine
222.56 -> learning for use cases like production
224.4 -> optimization
225.44 -> and predictive maintenance kinesis data
228.799 -> streams can be used to
230 -> collect log and event data from sources
232.48 -> such as servers
233.68 -> desktops and mobile devices you can then
236.879 -> build
237.28 -> kinesis applications to continuously
239.36 -> process the data
240.64 -> generate metrics power live dashboards
243.68 -> and emit aggregated data into stores
245.84 -> such as amazon s3
249.76 -> under the kinesis umbrella there are
251.36 -> four distinct services with different
253.439 -> capabilities
255.12 -> kinesis data streams is a highly
257.04 -> scalable and durable
258.959 -> real-time data streaming service that
261.04 -> can continuously capture gigabytes of
263.199 -> data per
263.84 -> second from hundreds of thousands of
266.08 -> sources
267.68 -> amazon kinesis video streams makes it
269.68 -> easy to securely stream video from
271.6 -> connected devices to aws
273.84 -> for analytics machine learning and other
276 -> processing i'm not covering this service
278.08 -> today
278.639 -> but it's an important part of the
279.919 -> kinesis suite
282 -> kinesis data firehose is a managed
284.08 -> streaming service
285.199 -> with minimal administration and it's the
287.199 -> easiest way to capture
288.88 -> transform and load data streams into aws
292 -> data stores
293.12 -> you can use the data collected here for
294.88 -> real-time analytics with existing
296.56 -> business intelligence tools
300 -> kinesis data analytics enables you to
301.919 -> process data streams
303.28 -> in real time with sql or apache flink
306.639 -> you can use familiar programming
308 -> languages like sql to build complex
310.08 -> queries that run continuously on
312.24 -> streaming data
315.68 -> generally this is the pattern for
317.28 -> implementing a streaming data solution
319.919 -> data producers continuously generate
322 -> data and write it to a stream
324.4 -> a data producer could be a web server
326.639 -> sending logs
327.84 -> or it could be an application server
329.6 -> sending metrics or it could be an
331.36 -> internet of things device
333.039 -> sending telemetry the streaming service
336.16 -> then durably stores the data
338.479 -> once it's been received it provides a
340.32 -> temporary buffer
341.52 -> to prepare the data and it's capable of
343.6 -> handling high throughput
346.56 -> the streaming service delivers the
348.08 -> records to data consumers
350.16 -> the consumer continuously processes the
352.16 -> data
353.36 -> in many cases this means cleaning
355.039 -> preparing and aggregating the records
357.759 -> transforming the raw data into
359.36 -> information
362.4 -> producers create records and records
364.72 -> contain three pieces of information
367.52 -> the partition key logically separates
369.919 -> sets of data
371.12 -> and is hashed to route the data to a
373.199 -> shard
374.56 -> the sequence number is unique per
377.84 -> partition key within the shard it's
380.08 -> assigned by kinesis
381.6 -> after the producer writes to the stream
384.479 -> and the data block
385.759 -> is a base64 encoded payload up to one
388.88 -> megabyte in size
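As a minimal sketch (not from the talk), here's how a producer might write one such record with the AWS SDK for Python (boto3); the stream name and payload are hypothetical:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

payload = {"device_id": "sensor-42", "temperature": 21.7}

response = kinesis.put_record(
    StreamName="my-stream",                    # hypothetical stream name
    Data=json.dumps(payload).encode("utf-8"),  # the data blob (base64-encoded on the wire by the SDK)
    PartitionKey="sensor-42",                  # hashed to route the record to a shard
)

# the sequence number is assigned by kinesis after the write
print(response["ShardId"], response["SequenceNumber"])
```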
392.8 -> using kinesis enables you to decouple
395.199 -> the data producer and consumer
397.36 -> which can help promote the development
399.12 -> of a microservices model
401.039 -> and make it easier to manage large
402.88 -> workloads
404.24 -> producers put records into streams and
407.039 -> consumers get
408.08 -> records from streams and then process
410.08 -> them
411.68 -> the delay between the time a record is
413.759 -> put into the stream
414.88 -> and the time it can be retrieved which is
416.8 -> called the put to get delay
418.639 -> is typically less than one second so
421.12 -> a kinesis data streams application can
423.28 -> start consuming the data from the stream
425.599 -> almost immediately after the data is
427.68 -> added
429.28 -> a kinesis data stream is a set of one or
432 -> more shards
433.039 -> each shard contains a sequence of data
435.12 -> records
436.08 -> and each data record has a sequence
438 -> number that's assigned by the service
441.039 -> if your data rate increases you can
443.039 -> increase or decrease
444.56 -> the number of shards allocated to your
446.4 -> stream
447.759 -> kinesis can automatically encrypt
449.52 -> sensitive information as a producer
451.52 -> enters it into the stream
453.039 -> it uses aws kms for encryption
457.199 -> record ordering is maintained within
458.96 -> each shard and you can configure record
461.36 -> retention
462.08 -> for up to one year
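A minimal sketch (stream name, shard counts, and KMS key alias are placeholders) of the stream-level settings just described, using boto3:

```python
import boto3

kinesis = boto3.client("kinesis")
wait_active = kinesis.get_waiter("stream_exists")  # polls until the stream status is ACTIVE

# create a stream with an initial shard count
kinesis.create_stream(StreamName="my-stream", ShardCount=2)
wait_active.wait(StreamName="my-stream")

# scale the stream up or down by changing the shard count
kinesis.update_shard_count(
    StreamName="my-stream",
    TargetShardCount=4,
    ScalingType="UNIFORM_SCALING",
)
wait_active.wait(StreamName="my-stream")

# encrypt incoming records server-side with aws kms
kinesis.start_stream_encryption(
    StreamName="my-stream",
    EncryptionType="KMS",
    KeyId="alias/aws/kinesis",
)
wait_active.wait(StreamName="my-stream")

# extend record retention (default is 24 hours, configurable up to one year)
kinesis.increase_stream_retention_period(
    StreamName="my-stream",
    RetentionPeriodHours=168,  # e.g. seven days
)
```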
465.599 -> a shard is a uniquely identified
467.599 -> sequence of data records in a stream
470.08 -> it provides a fixed unit of capacity
473.12 -> each shard can support up to 1 000
475.68 -> records
476.56 -> per second for writes up to a maximum
479.28 -> total data write rate
480.72 -> of one megabyte per second including
482.96 -> partition keys
484.879 -> each shard can support up to five
486.72 -> transactions per second for reads
489.199 -> up to a maximum total data read rate of
491.759 -> 2 megabytes per second
494.479 -> the data capacity of your entire stream
496.8 -> is a function of the number of shards
498.639 -> that you specify for the stream
500.8 -> and the total capacity of the stream is
502.879 -> the sum of the capacities of the shards
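As a quick worked example with hypothetical throughput numbers, you can apply the per-shard limits above to estimate how many shards a stream needs:

```python
import math

records_per_second = 4_000   # expected aggregate write rate
avg_record_size_kb = 0.5     # average record size in KB
read_mb_per_second = 3.0     # aggregate consumer read rate

write_shards_by_records = math.ceil(records_per_second / 1_000)                      # 1,000 records/s per shard
write_shards_by_bytes = math.ceil(records_per_second * avg_record_size_kb / 1_024)   # 1 MB/s write per shard
read_shards = math.ceil(read_mb_per_second / 2)                                      # 2 MB/s read per shard

shards_needed = max(write_shards_by_records, write_shards_by_bytes, read_shards)
print(shards_needed)  # -> 4 shards for this example workload
```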
506.479 -> each shard contains records ordered by
508.72 -> arrival time
510.319 -> in a shard records from the trim horizon
513.2 -> are all the available records since the
516.399 -> beginning whereas records from the tip
518.56 -> or the latest
519.519 -> are the most current records essentially
522.24 -> the difference is whether you want to
523.68 -> start from the oldest record the trim
525.519 -> horizon
526.399 -> or from right now the latest and skip
529.04 -> data between the latest checkpoint and
532.839 -> now
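A minimal sketch (stream name and shard ID are placeholders) showing the two starting positions when reading a shard directly with boto3:

```python
import boto3

kinesis = boto3.client("kinesis")

# TRIM_HORIZON: start from the oldest record still retained in the shard
oldest = kinesis.get_shard_iterator(
    StreamName="my-stream",
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON",
)

# LATEST: start from records written after this call (the tip of the shard)
newest = kinesis.get_shard_iterator(
    StreamName="my-stream",
    ShardId="shardId-000000000000",
    ShardIteratorType="LATEST",
)

# read a batch of records starting from the oldest available position
records = kinesis.get_records(ShardIterator=oldest["ShardIterator"], Limit=100)
```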
534.56 -> lambda is the on-demand compute
536.88 -> service
537.6 -> at the heart of the aws serverless
539.6 -> portfolio
540.64 -> and it can consume and process data from
542.72 -> kinesis streams
544.56 -> the lambda service polls kinesis
546.56 -> automatically and invokes your lambda
548.48 -> functions when records are available
551.2 -> the benefit of using lambda for
552.72 -> streaming data processing
554.399 -> is that lambda manages scaling and that
556.48 -> allows you to focus on the custom
558.16 -> business logic
559.2 -> to process data instead of the
561.04 -> underlying infrastructure
563.44 -> records are delivered as a payload to
565.44 -> the lambda function
566.48 -> and you can configure how many records
568.32 -> are in each batch
569.68 -> up to ten thousand or six megabytes as a
572.72 -> total payload limit
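A minimal sketch of a Lambda handler for a Kinesis event source; the event shape (Records with a base64-encoded data blob and partition key) is the standard Kinesis event format, while the processing logic is hypothetical:

```python
import base64
import json

def handler(event, context):
    for record in event["Records"]:
        # kinesis delivers the data blob base64-encoded inside the event payload
        payload = base64.b64decode(record["kinesis"]["data"])
        data = json.loads(payload)
        partition_key = record["kinesis"]["partitionKey"]

        # custom business logic goes here
        print(partition_key, data)

    return {"batchSize": len(event["Records"])}
```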
574.88 -> lambda supports a variety of
577.04 -> runtimes natively such as python node.js and .net
580.48 -> and you can also bring your own runtime
582.08 -> too
582.399 -> we have customers running even erlang
584.72 -> php
585.68 -> or cobol lambda is agnostic to your
588.16 -> choice of runtime
590.24 -> in the event source mapping you can also
592.08 -> configure a starting position for
593.68 -> records
594.48 -> you can process only new records all
596.72 -> existing records
598 -> or records created after a certain date
601.04 -> once the lambda function is finished
602.56 -> processing it returns the result to
604.32 -> kinesis
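A minimal sketch (function name, stream ARN, and values are placeholders) of creating the event source mapping described above, including batch size and starting position, with boto3:

```python
import boto3

lambda_client = boto3.client("lambda")

lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/my-stream",
    FunctionName="process-stream-records",
    BatchSize=1000,                    # up to 10,000 records per invocation
    MaximumBatchingWindowInSeconds=5,  # wait up to 5 seconds to fill a batch
    StartingPosition="TRIM_HORIZON",   # or "LATEST", or "AT_TIMESTAMP"
    # StartingPositionTimestamp=...    # only when StartingPosition="AT_TIMESTAMP"
)
```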
607.6 -> for each shard lambda configures a poller
609.92 -> that polls the shard every second
612 -> and invokes your lambda function when
614.079 -> records are available
615.6 -> the poller is managed internally by the
617.68 -> lambda service
620.32 -> internally a record processor polls the
622.88 -> kinesis shard
624.56 -> the batcher creates the batches to be
626.399 -> processed by the function
628.079 -> and the invoker gets batches and invokes
630.959 -> your lambda function
635.04 -> kinesis data firehose is a fully managed
637.76 -> service that automatically provisions
639.839 -> manages and scales the resources
641.6 -> required to process and load your
643.36 -> streaming data
646 -> configure data firehose within minutes
648.16 -> in the console
649.36 -> cli or cloud formation and it's then
651.68 -> immediately available to receive data
653.68 -> from thousands of data sources
656.399 -> firehose can optionally invoke a lambda
658.48 -> function to transform data before it
660.32 -> stores results
662.079 -> it's integrated with some aws services
664.959 -> like
665.279 -> s3 redshift and the elasticsearch
667.839 -> service
668.56 -> it can also deliver data to generic http
671.68 -> endpoints
672.64 -> and directly to service providers from
675.519 -> the console you point data firehose to
677.839 -> the destinations you need
679.519 -> and use your existing applications and
681.44 -> tools to analyze streaming data
684.64 -> once set up data firehose loads
686.8 -> streaming data
687.76 -> into your destination continuously as
690.399 -> it arrives
691.2 -> and there's no ongoing administration
693.2 -> and you don't need to manage shards
695.6 -> it's also a pay as you go service where
697.44 -> the cost is based upon usage
699.36 -> with no minimum fees
703.279 -> there are various sources for data
705.279 -> firehose the first is a direct put from the
707.519 -> data producer
709.04 -> you can send data using the kinesis
711.12 -> agent the data firehose api
714 -> or the aws sdk you can also configure
717.76 -> your data
718.72 -> firehose delivery stream to
720.24 -> automatically read data from an existing
722.32 -> kinesis data stream
724.24 -> and you can use cloudwatch logs
726.24 -> eventbridge or iot as a source
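A minimal sketch (delivery stream name and payload are placeholders) of a direct PUT to a Firehose delivery stream using the AWS SDK for Python:

```python
import json
import boto3

firehose = boto3.client("firehose")

# send one record directly to the delivery stream; a trailing newline keeps
# records separated once they are batched into files in the destination
firehose.put_record(
    DeliveryStreamName="my-delivery-stream",
    Record={"Data": (json.dumps({"event": "page_view", "user": "abc"}) + "\n").encode("utf-8")},
)
```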
729.76 -> apart from s3 redshift and elasticsearch
732.8 -> there's also a range of partners that
734.48 -> you can use as destinations for data
736.8 -> firehose
737.92 -> these include datadog dynatrace
740.959 -> logic monitor mongodb cloud new relic
744.72 -> splunk and sumo logic for http
748.72 -> endpoint delivery there is a defined
750.8 -> request and response format
753.12 -> endpoints have three minutes to respond
754.959 -> to a request before a timeout occurs
758.56 -> the frequency of data delivery to s3 is
761.279 -> determined by the buffer size
763.12 -> and buffer interval values that you
765.04 -> configure for your delivery stream
767.44 -> the service buffers incoming data before
769.6 -> delivering it to s3
772.16 -> you can configure the buffer size
774.48 -> between 1 and 128 megabytes
777.04 -> or buffer interval between 60 and 900
780.399 -> seconds
781.519 -> the condition that is satisfied first
783.519 -> will then trigger data delivery to s3
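A minimal sketch (names, ARNs, and values are placeholders) showing how those S3 buffering hints can be set when creating a delivery stream with boto3:

```python
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="my-delivery-stream",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::my-data-lake-bucket",
        "BufferingHints": {
            "SizeInMBs": 64,           # configurable between 1 and 128 MB
            "IntervalInSeconds": 300,  # configurable between 60 and 900 seconds
        },
    },
)
```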
787.2 -> when data delivery falls behind data
789.44 -> writing to a stream
790.8 -> the service raises the buffer size
792.56 -> dynamically this allows it
794.8 -> to catch up
795.6 -> and ensure that the data is delivered to
797.68 -> the destination
799.519 -> firehose also buffers incoming data
801.839 -> before delivering it to splunk
803.92 -> the buffer size is 5 megabytes and the
806.32 -> buffer interval
807.279 -> is 60 seconds these aren't configurable
810.079 -> because they're optimized specifically
812 -> for the splunk integration
816 -> the overall operation of a data firehose
818.72 -> stream looks like this
820.72 -> the data source puts records onto the
823.12 -> stream
824.16 -> firehose invokes the data
825.68 -> transformation lambda function to
827.36 -> process the records
828.639 -> it uses the batch size value in the
830.48 -> configuration to determine the number
832.56 -> of records sent per invocation the
835.68 -> transformed records are returned from
837.44 -> lambda
838.24 -> back to the firehose stream those
841.12 -> records are then forwarded to the
842.48 -> destination
843.36 -> in this case the amazon elasticsearch
845.839 -> service
848 -> if your lambda function invocation fails
850.72 -> because of a network timeout or because
852.639 -> you've reached the lambda invocation
854.399 -> limit
855.199 -> firehose retries the invocation three
857.68 -> times by default
859.36 -> if the invocation does not succeed
861.68 -> firehose then skips that batch of
863.519 -> records
864.24 -> any skipped records are treated as
865.92 -> unsuccessfully processed records
868.72 -> if the status of a data transformation
870.56 -> of a record is processing failed
873.04 -> then firehose also treats the record as
875.76 -> unsuccessfully processed
878.079 -> any unsuccessfully processed records
880.16 -> are delivered to your s3 bucket
882.48 -> in a folder called processing-failed
885.839 -> this will include metadata indicating
888 -> the number of attempts made
889.68 -> the timestamp for the last attempt and
892.079 -> the lambda function's arn
894.88 -> firehose can back up all
897.04 -> untransformed records
898.32 -> to your s3 bucket concurrently while
900.72 -> delivering transformed records to your
902.56 -> destination
903.839 -> you can enable source record backup when
906.56 -> you create or update a delivery stream
911.04 -> when you enable data transformation the
913.44 -> service buffers
914.48 -> incoming data up to three megabytes
917.12 -> by default
918.16 -> you can adjust this buffering size by
920.079 -> using the processing configuration
922.24 -> api firehose then invokes the specified
926.32 -> lambda function
928.32 -> with each buffered batch using the
930.399 -> lambda synchronous invocation
932.48 -> mode the lambda function processes and
935.519 -> returns
936.24 -> a list of transformed records with a
938.56 -> status of each record
941.04 -> the status is an attribute called result
943.519 -> with the possible values of ok
945.92 -> dropped or processing failed in the
948.959 -> returned payload the data
951.12 -> attribute
951.92 -> must be base64 encoded
955.279 -> the transformed data is then sent back
957.199 -> from lambda to firehose and from there
959.519 -> it's sent to the destination once the
961.519 -> specified buffering size or buffering
963.6 -> interval is reached
964.72 -> whichever one happens first
968.639 -> the data transformation lambda function
970.72 -> can run for up to five minutes
972.48 -> these functions are commonly used to
974.24 -> filter enrich and convert data
977.44 -> in filtering the lambda function can
979.519 -> remove attributes from records
981.519 -> or remove entire records based upon your
984.079 -> business logic
985.6 -> some customers use filtering to remove
987.839 -> personally identifiable information from
989.759 -> records and streams for example
992.079 -> you can also enrich records by fetching
994.24 -> data from other aws services
996.88 -> or from external data sources one common
1000.079 -> process is to use a
1001.759 -> geoip service to look up the geographical
1004.639 -> location of ip addresses
1006.399 -> and append this data to records just
1008.88 -> remember that the response
1010.079 -> size in synchronously invoked lambda
1012.56 -> functions
1013.279 -> is six megabytes in converting data
1016.48 -> you have complete flexibility in
1018.32 -> modifying the record layout to match the
1020.24 -> needs of your data consumer
1022.16 -> there are lambda blueprints available as
1024.16 -> examples of data conversion
1026.079 -> using this process firehose passes a
1029.439 -> record id
1030.559 -> along with each record to lambda during
1032.48 -> the invocation
1033.6 -> each transformed record must be returned
1036.559 -> with the exact same record
1038 -> id
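A minimal sketch of a transformation function that follows the record contract just described (recordId, result, data); the filtering and enrichment rules are hypothetical:

```python
import base64
import json

def handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))

        # hypothetical filter: drop records flagged as internal traffic
        if payload.get("internal"):
            output.append({"recordId": record["recordId"], "result": "Dropped"})
            continue

        # hypothetical enrichment: add a field before re-encoding the record
        payload["processed"] = True
        data = base64.b64encode((json.dumps(payload) + "\n").encode("utf-8")).decode("utf-8")

        # each record must be returned with the exact same recordId
        output.append({"recordId": record["recordId"], "result": "Ok", "data": data})

    return {"records": output}
```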
1041.039 -> firehose invokes the data transformation
1043.36 -> lambda function
1044.48 -> and scales up the function if the number
1046.319 -> of records in the stream grows
1048.96 -> when the destination is s3 redshift or
1051.919 -> the elasticsearch service
1053.76 -> firehose allows up to five outstanding
1055.919 -> lambda invocations per
1057.36 -> shard when the destination is splunk the
1060.32 -> quota is 10 outstanding lambda
1062.24 -> invocations per shard
1065.679 -> you can monitor a data firehose stream