
AWS re:Invent 2020: How Stitch Fix is delivering personalized experiences
How can organizations deliver dynamic personalized customer interactions? Hear how the platform engineering team at Stitch Fix is transforming customer interactions by delivering a highly concurrent and scalable solution for real-time product recommendations. Come away with an understanding of the architectural requirements for leveraging Amazon DynamoDB to optimize near-real-time machine learning workloads to deliver the right user experience.
Learn more about re:Invent 2020 at http://bit.ly/3c4NSdY
Content
Hello, and welcome to all the virtual audiences from around the world. In this session we are going to talk about personalized experiences and how Stitch Fix is delivering these experiences to their customers around the globe. My name is Madhuna, and I'm a Solutions Architect here at AWS. With me today is Visual Serene, a data platform engineer at Stitch Fix.
Before I talk about personalization and specific solutions, I want to spend a minute on how AWS is helping the retail industry today. We help retail customers increase customer insights with services such as Amazon Personalize and Amazon Forecast, and we help optimize operations across multiple business domains with services such as IoT and cloud migration. By doing this, we help transform the way these customers engage with their end customers and their supply-chain partners across multiple business domains, be it fulfillment or e-commerce. These are just a few of the several ways AWS is helping customers realize and accelerate their business goals.
So why should you personalize? Because everybody's tastes, preferences, and needs are different, personalization is crucial to make sure you are optimizing customers' lifetime value.

How does personalization help? It helps customers choose the products and services they need. This becomes obvious when you go to amazon.com, or to Stitch Fix, which we'll take a closer look at later in this presentation. By doing this, you can improve customer engagement with your online content, on your websites, on your mobile apps, or in any other dynamic content you're trying to deliver to your customers. If your customers have a sense of being taken care of, that directly translates to increased purchases, increased subscriptions, more application downloads, and more engagement with any other content you deliver. All of these things directly contribute to an increase in your top line.
While the benefits of personalization seem obvious, what is not obvious is the best path to unlock those benefits. So let's take a look at the personalization journey at amazon.com and the lessons we learned from it.
Before you look at specific personalization solutions, it is important to take a look at your current state. Where are your data sources for personalization? What is your data lifecycle? What process do you use for data transformation? How do you implement data integrity? And what is the timeliness, or the latency, of that personalization? Can you deliver maximum value to your customers by personalizing their online, real-time experiences, their offline experiences, or both? How about continuous learning from your customer behavior? How do you ensure that the personalization you are building is dynamic and keeps up with the changing behavior of your customers, and can you leverage technology such as machine learning to accomplish this?

To summarize: if you understand your current state, it becomes easy to select the right approach to personalization, one that satisfies all your business needs.
At AWS, broadly speaking, we see two different approaches to personalization. Looking at the left side here, organizations with little to no machine learning experience make use of services such as Amazon Personalize, Amazon Neptune, and Amazon Pinpoint. Amazon Personalize is a service with which developers can build applications using the same machine learning technology used at amazon.com to build personalized recommendations. Amazon Pinpoint is a highly scalable and flexible inbound and outbound marketing communications service. Amazon Neptune is a managed graph database; using Amazon Neptune you can build identity graphs for personalization and recommendations. Using a highly scalable graph database like Amazon Neptune for identity graphs makes it easy to link identifiers with user profiles and to query them at very low latency. With all of these pieces you can do ad attribution, ad targeting, analytics, and personalization.
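To make that concrete, here is a minimal sketch of requesting recommendations from an Amazon Personalize campaign with boto3. The campaign ARN, user ID, and region are placeholders, and it assumes a campaign has already been trained and deployed:

import boto3

# Runtime client for querying a deployed Amazon Personalize campaign.
personalize_runtime = boto3.client("personalize-runtime", region_name="us-east-1")

# Placeholder ARN and user ID; a real campaign must already exist.
response = personalize_runtime.get_recommendations(
    campaignArn="arn:aws:personalize:us-east-1:123456789012:campaign/example-campaign",
    userId="user-42",
    numResults=10,
)

# Each recommended item carries an itemId (and a score for some recipes).
for item in response["itemList"]:
    print(item["itemId"], item.get("score"))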
Now looking at the right side of this slide, organizations with mature machine learning expertise focus on building custom personalization, and these end up becoming IP-backed core differentiators for them. Stitch Fix is one such organization, which we'll look at later on. AWS has the broadest and deepest set of machine learning capabilities and the supporting cloud infrastructure, thereby putting machine learning in the hands of every developer, every data scientist, and other expert practitioners.
Amazon SageMaker is a fully managed service that helps developers and data scientists build, train, and deploy machine learning models at scale. It simplifies every step of the machine learning workflow, making it easy to build machine learning use cases, whether for predictive maintenance, computer vision, or predicting customer behavior for personalization and recommendations.
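As a hedged illustration of the serving side, here is a minimal sketch of calling an already-deployed SageMaker inference endpoint with boto3 to score a customer for recommendations. The endpoint name and payload shape are assumptions for illustration, not anything described in the talk:

import json
import boto3

# Runtime client for invoking a deployed SageMaker endpoint.
runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")

# Hypothetical endpoint name and feature payload for a recommendation model.
payload = {"client_id": "user-42", "recent_clicks": ["sku-1", "sku-9"]}

response = runtime.invoke_endpoint(
    EndpointName="personalization-model-endpoint",   # placeholder
    ContentType="application/json",
    Body=json.dumps(payload),
)

predictions = json.loads(response["Body"].read())
print(predictions)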
Amazon DynamoDB is a key-value and document database. It provides an ultra-low-latency, serverless data store for your user profiles, user events, clicks, and any other customer-behavior data.
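For example, here is a minimal sketch of that kind of key-value access pattern with boto3. The table name and attribute names are illustrative, and the table is assumed to already exist with client_id as its partition key:

import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
profiles = dynamodb.Table("user_profiles")   # hypothetical table, partition key: client_id

# Write (or overwrite) a small profile document for a client.
profiles.put_item(Item={
    "client_id": "user-42",
    "style_preferences": ["casual", "athleisure"],
    "last_click": "sku-123",
})

# Low-latency read of that profile by key.
item = profiles.get_item(Key={"client_id": "user-42"}).get("Item")
print(item)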
Similarly, Amazon Aurora Serverless is a relational database option for a similar data store that can address some of your other use cases. You can also use Amazon ElastiCache, an in-memory cache, for low-latency data reads, and if you need microsecond latency for your DynamoDB access, you can use DynamoDB Accelerator (DAX). You can also use Amazon CloudFront, a fast content delivery network, to securely deliver your personalized experiences to users around the globe with low latency and high transfer rates. Optionally, you can use AWS Global Accelerator, a networking service with which you can direct your users' traffic over the AWS global network infrastructure, improving performance for your customers by up to 60 percent.
Amazon Route 53 is a DNS service that helps you direct user traffic around the globe through various flexible routing policies, enabling you to build low-latency, highly fault-tolerant architectures for personalization. The key takeaway here is that, as an organization, no matter where you are in your machine learning journey, AWS provides you multiple ways to personalize, just like amazon.com.
Now that you have chosen the tools for personalization and the right cloud infrastructure, how do you ensure your design and architecture is optimal and in line with your business goals? The AWS Well-Architected Framework will help you do that. Using this framework, you can learn the best architectural practices to build and operate highly secure, highly reliable, and cost-efficient architectures and solutions on top of AWS. It also provides the best ways to continuously improve your solutions and designs. Not only that, AWS also provides a self-service tool, the AWS Well-Architected Tool, available in the console, that you can use to accomplish the same tasks.
With that, I hand over to Visual Serene, who is going to talk about Stitch Fix and the personalization they are building there.

Thank you. Hi, I'm Urge from Stitch Fix, and I'm a data platform engineer working on our algorithms side. At Stitch Fix we are using data to deliver personalized styling and clothing recommendations for our users. The goal of the service is to provide our users the best possible clothing that fits their life and their preferences.
So how does data work at Stitch Fix? We have a few different ways of collecting data from our users. In some cases we take direct input from our users; for instance, they could be liking or disliking particular styles or types of clothes, and we feed that into a style profile. Other aspects of our business use preferences derived from the pieces of clothing that have been kept or returned to the service. For instance, when we ship a box of clothes, which we call a Fix, the items that have been kept become part of your style profile, and the items that have been returned are used as a signal for the things a user may not want, whether from a fit perspective, a style perspective, or local preferences in terms of where they live or what their lifestyle or profession is. We use those features to build a unified style profile for our users.
Here is some business context around it. What we want to do is take the ratings the user is generating, collect them both in real time and in a historical context, and use them to generate and serve recommendations for our users.

On the more practical side, we want to be able to serve our historical and recent data with a p99 of about 80 milliseconds. Our traffic has been growing, and we want our backend infrastructure to be able to scale horizontally as our data grows. Since we operate in a few different geographies, we also have to deal with data-compliance requirements such as GDPR and local privacy regulations; for instance, a user may ask us to delete their data, and we want to be able to delete the entire style profile, or any data we have been collecting from that user. At the same time, we want low operational and cost overhead. As you can see, these are a lot of different requirements on how we respond and serve recommendations while still serving our business needs.
What we had before, for storing these ratings and preferences from our users, was roughly a 120-node Redis cluster. We were collecting these ratings and preferences and storing them, essentially, in a hot memory tier. The problem we faced with that was that, as our data grew, we had to continuously keep adding more nodes, and there was operational complexity in constantly rebalancing and growing the cluster. Also, Redis stores data internally as strings, so the cost of capturing and storing this data was pretty high. And since ElastiCache for Redis is memory-backed, our disaster recovery story wasn't as robust as we would have liked: if we had a catastrophic failure in our Redis cluster, we would lose this very valuable user data, which we were using to serve recommendations in both a batch and a real-time context.
To sum up, what we wanted to do was serve roughly 3,000 of these ratings per client with a p99 of about 80 milliseconds. All the data is keyed by client ID. As far as data recency goes, we want to be able to serve recommendations within about a second of the user taking an action; for instance, if a user liked a particular piece of clothing, we want to be able to use that rating about a second after it was generated. We also want fault tolerance and recoverability: in case there are node failures or a catastrophic event, we want to be able to rehydrate this data with very little effort. And of course there's the notion of scalability and cost; running our Redis cluster was costing about a thousand dollars a day, and we wanted to reduce that cost, because traffic has been growing steadily and that cost was growing with it.
956.48 -> and that cost was also growing
960.16 -> so to implement um a different backend
964 -> to store and serve these recommendations
966.639 -> we came up with like a high level
968.56 -> methodology
969.68 -> on how we would be testing different
971.759 -> amazon services
974.16 -> to benchmark against what we are trying
977.199 -> to do
978.48 -> so what we did was we built a little
980.48 -> harness which would simulate the live
982.88 -> traffic that would
986.24 -> have some level of seasonality of how
988.48 -> our users are
989.759 -> accessing and writing to this data store
993.519 -> it would have basically a test harness
996.32 -> which was
997.04 -> also measuring the latency the error
1001.279 -> rates
1001.92 -> and throughput so with that we could
1004.639 -> basically
1005.68 -> swap these data back-ends and then
1008.8 -> have an ability to benchmark
1011.839 -> different strategies or implementations
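Here is a stripped-down sketch of what such a harness could look like. The backend interface, request mix, and seasonality model are assumptions made for illustration, not Stitch Fix's actual tooling:

import random
import statistics
import time

def run_benchmark(backend, duration_s=60, base_rps=50):
    """Replay synthetic read/write traffic against a swappable backend
    and report latency percentiles, error rate, and throughput."""
    latencies, errors, requests = [], 0, 0
    start = time.time()
    while time.time() - start < duration_s:
        # Crude "seasonality": the request rate oscillates over the run.
        rps = base_rps * (1 + 0.5 * random.random())
        client_id = f"client-{random.randint(1, 10_000)}"
        t0 = time.perf_counter()
        try:
            if random.random() < 0.8:          # 80% reads, 20% writes
                backend.get_ratings(client_id)
            else:
                backend.put_rating(client_id, {"ts": time.time(), "item": "sku-1"})
        except Exception:
            errors += 1
        latencies.append((time.perf_counter() - t0) * 1000)
        requests += 1
        time.sleep(1.0 / rps)

    p99 = statistics.quantiles(latencies, n=100)[98]
    return {"p99_ms": p99, "error_rate": errors / requests, "rps": requests / duration_s}

# Any object exposing get_ratings/put_rating can be benchmarked,
# e.g. a Redis-backed store versus a DynamoDB-backed one.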
We came up with some key metrics, or characteristics, against which we wanted to evaluate our solutions. We wanted the data backend to be able to scale as our traffic grows; we wanted high read and write performance; we wanted it to be cost effective; we wanted data recoverability against disasters; and we also wanted good developer semantics, or ergonomics, for using the data store.
We had the existing implementation, Redis on ElastiCache using sorted sets. Because the data was in memory, performance was very high, and we were getting idempotence because we were using the sorted-set semantics. Now, as I mentioned earlier, Redis stores objects as strings, so it has a high overhead. We tried using compressed protocol buffers, which reduced the memory footprint and let us shrink our cluster size, giving us some cost and operational gains. However, we still had the recoverability issue, and we lost the sorted-set semantics. So with Redis we had high performance and we were able to make it cost effective, yet we didn't have the scaling and recoverability characteristics that we wanted.
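For reference, here is a minimal sketch of the kind of sorted-set pattern described above, using redis-py. The key layout and field names are illustrative:

import json
import time
import redis

r = redis.Redis(host="localhost", port=6379)

def record_rating(client_id, item_id, rating):
    # One sorted set per client, scored by timestamp. Re-adding the same
    # member simply updates its score, which is where the idempotence came from.
    member = json.dumps({"item": item_id, "rating": rating}, sort_keys=True)
    r.zadd(f"ratings:{client_id}", {member: time.time()})

def recent_ratings(client_id, n=3000):
    # Most recent n ratings for a client, newest first.
    return [json.loads(m) for m in r.zrevrange(f"ratings:{client_id}", 0, n - 1)]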
We also experimented with Aurora Serverless using the JSONB column type. Since Aurora Serverless is disk-backed, it had nice scaling characteristics; however, its performance lagged because we were often hot-spotting. So yes, it was cost effective, and there is recoverability because it is disk-backed, but the performance suffered. I'm going to talk a little bit more about the final solution we ended up with, a tiered DynamoDB implementation, which met our scaling, operational, performance, and cost requirements.
This is a high-level diagram of the architecture we ended up with. We built a multi-table data strategy: there is a table for our historical data, an intraday table, and a real-time table. Let me talk a little bit about why we chose three tables and what the write and read paths to them are. The historical table holds data that is older than a day and goes back to the start of time, which for us is about six years ago. The write path to it is a Spark job that runs on an EMR cluster every night. The strategy there is that we take a client's historical data, compress it into a unified blob, and write one key per client. The goal is to use the PutItem semantics to write, and the GetItem semantics to fetch, the history for a client. As I mentioned earlier, we store about 3,000 responses per client, which roughly translates to a compressed JSON payload of about 20 kilobytes per client.
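A minimal sketch of that pattern, with placeholder table and attribute names (the real job runs in Spark on EMR; this only shows the PutItem/GetItem shape):

import gzip
import json
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
historical = dynamodb.Table("ratings_historical")   # hypothetical name, partition key: client_id

def write_history(client_id, responses):
    # Compress the client's full history (roughly 3,000 responses, ~20 KB
    # compressed) into a single binary blob stored under one key per client.
    blob = gzip.compress(json.dumps(responses).encode("utf-8"))
    historical.put_item(Item={"client_id": client_id, "history": blob})

def read_history(client_id):
    item = historical.get_item(Key={"client_id": client_id}).get("Item")
    if not item:
        return []
    return json.loads(gzip.decompress(item["history"].value))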
The intraday and the real-time tables have a row per event: every time an event is generated, it is keyed by client ID and timestamp. For the intraday table, the goal is to have a stream from our message bus, which happens to be Kafka, read the events off the bus and write rows into the intraday table. The recency of that table is from about a minute ago to about a day. The real-time table has a similar schema; however, its writes come from a direct HTTP endpoint, generated from a client action, and the recency of its data is from now to about two minutes ago.
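A minimal sketch of the intraday write path under those assumptions (kafka-python for the consumer; the topic, table, and attribute names are placeholders):

import json
import boto3
from kafka import KafkaConsumer

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
intraday = dynamodb.Table("ratings_intraday")   # partition key: client_id, sort key: ts

# Consume rating events from the message bus and write one row per event.
consumer = KafkaConsumer(
    "client-ratings",                            # hypothetical topic
    bootstrap_servers=["kafka:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    event = message.value
    intraday.put_item(Item={
        "client_id": event["client_id"],
        "ts": event["timestamp"],                # e.g. ISO-8601 string or epoch millis
        "item_id": event["item_id"],
        "rating": event["rating"],               # e.g. "liked" / "disliked"
    })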
So how does the read path tie these three tables together? When we request a user's data, a client's history, we make three concurrent requests that go to these tables independently, with circuit breakers running for each of those requests. Once those requests return, we merge the results, dedupe them, and return a unified result back. So we have three separate read paths coming together in a single request. To repeat: the writes happen from Kafka, Spark, or direct HTTP, and on the read side we take the data from these tables and stitch it together through GetItem calls or direct queries on those tables.
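A minimal sketch of that fan-out read, using per-request timeouts as a simple stand-in for the circuit breakers (table names and the merge key are assumptions carried over from the sketches above):

import gzip
import json
from concurrent.futures import ThreadPoolExecutor
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")

def read_historical(client_id):
    item = dynamodb.Table("ratings_historical").get_item(
        Key={"client_id": client_id}).get("Item")
    return json.loads(gzip.decompress(item["history"].value)) if item else []

def read_rows(table_name, client_id):
    resp = dynamodb.Table(table_name).query(
        KeyConditionExpression=Key("client_id").eq(client_id))
    return resp["Items"]

def client_history(client_id, timeout_s=0.08):
    # Fan out to the three tables concurrently; a slow or failing leg is
    # dropped rather than failing the whole request (crude circuit breaker).
    pool = ThreadPoolExecutor(max_workers=3)
    futures = [
        pool.submit(read_historical, client_id),
        pool.submit(read_rows, "ratings_intraday", client_id),
        pool.submit(read_rows, "ratings_realtime", client_id),
    ]
    results = []
    for f in futures:
        try:
            results.extend(f.result(timeout=timeout_s))
        except Exception:
            pass  # serve whatever data did come back
    pool.shutdown(wait=False)  # don't block on a slow leg

    # Dedupe on (item_id, ts) and return a unified, newest-first list.
    seen, merged = set(), []
    for r in sorted(results, key=lambda r: str(r.get("ts", "")), reverse=True):
        key = (r.get("item_id"), str(r.get("ts", "")))
        if key not in seen:
            seen.add(key)
            merged.append(r)
    return merged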
The end result was that our p99 latency landed in about the 60-to-80-millisecond range. And because we were using individual circuit breakers, if any one of those queries, say, fetching data from the historical table, fell outside that limit, we would still have some data to serve, even if we didn't have all of it. It is also possible that a user has no historical data and only recent data; in that case we just serve data out of the real-time and the intraday tables.
The key takeaways for us were that we really tried to first understand our data read and write patterns; that is what led us to build the simulation and to be able to test a few different data backends. We swapped our data backend from a hot ElastiCache Redis cluster to a tiered DynamoDB solution; however, for our users and internal APIs we did not change any of the API characteristics. The API and the protocol stayed exactly the same, and only the underlying implementation changed. It was also very useful for us to constantly leverage the depth of knowledge that the AWS support team and the DynamoDB team brought to the table; they were really instrumental in helping us carve out this horizontal and vertical table schema and bring it all together.
Thank you, Urge. It was a pleasure working with you while you designed and built this solution. And thank you very much to all the virtual audiences from around the world. If you want to explore more about what AWS is doing for retail and consumer-goods customers, you can go to the URL we are showing on the slide here. Also, if you want to dive deeper into some of the services like Amazon SageMaker, Amazon Personalize, DynamoDB, or any of the other services we touched upon in this presentation, please go and explore those tracks at this re:Invent. With that, I thank you all for attending today, and have a great day.
Source: https://www.youtube.com/watch?v=mcM1y2tNAAQ