
AWS re:Invent 2020: How Stitch Fix is delivering personalized experiences
How can organizations deliver dynamic personalized customer interactions? Hear how the platform engineering team at Stitch Fix is transforming customer interactions by delivering a highly concurrent and scalable solution for real-time product recommendations. Come away with an understanding of the architectural requirements for leveraging Amazon DynamoDB to optimize near-real-time machine learning workloads to deliver the right user experience.
Learn more about re:Invent 2020 at http://bit.ly/3c4NSdY
Content
Hello, and welcome to all the virtual audiences from around the world. In this session we are going to talk about personalized experiences and how Stitch Fix is delivering these experiences to their customers around the globe. My name is Madhuna, and I'm a Solutions Architect here at AWS. With me today is Visual Serene, a data platform engineer at Stitch Fix.
Before I talk about personalization and specific solutions, I want to spend a minute on how AWS is helping the retail industry today. We help retail customers increase customer insights with services such as Amazon Personalize and Amazon Forecast, and we help optimize operations across multiple business domains with services such as IoT and cloud migration. By doing this, we help transform the way these customers engage with their end customers and their supply-chain partners across multiple business domains, be it fulfillment or e-commerce. These are just a few of the several ways AWS is helping customers realize and accelerate their business goals.
So why should you personalize? Because everybody's tastes, preferences, and needs are different, personalization is crucial to make sure you are optimizing customers' lifetime value.

How does personalization help? It helps customers choose the products and services they need. This becomes obvious when you go to amazon.com, or to Stitch Fix, which we'll take a closer look at later in this presentation. By doing this, you can improve customer engagement with your online content, on your websites, on your mobile apps, or in any other dynamic content you're trying to deliver to your customers. If your customers have a sense of being taken care of, that directly translates to increased purchases, increased subscriptions, more application downloads, and more engagement with any other content you deliver. All of these things directly contribute to an increase in your top line.
While the benefits of personalization seem obvious, what is not obvious is the best path to unlock those benefits. So let's take a look at the personalization journey at amazon.com and the lessons we learned from it.
Before you look at specific personalization solutions, it is important to take a look at your current state. Where are your data sources for personalization? What is your data lifecycle? What process do you use for data transformation? How do you implement data integrity? And what is the timeliness, or the latency, of that personalization? Can you deliver maximum value to your customers by personalizing their online, real-time experiences, their offline experiences, or both? How about continuous learning from your customer behavior? How do you ensure that the personalization you are building is dynamic and keeps up with the changing behavior of your customers, and can you leverage technology such as machine learning to accomplish this?

To summarize: if you understand your current state, it becomes easy to select the right approach to personalization, one that satisfies all your business needs.
At AWS, broadly speaking, we see two different approaches to personalization. Looking at the left side here, organizations with little to no machine learning experience make use of services such as Amazon Personalize, Amazon Neptune, and Amazon Pinpoint. Amazon Personalize is a service with which developers can build applications using the same machine learning technology used at amazon.com to build personalized recommendations. Amazon Pinpoint is a highly scalable and flexible inbound and outbound marketing communications service. Amazon Neptune is a managed graph database; using Amazon Neptune you can build identity graphs for personalization and recommendations. Using a highly scalable graph database like Amazon Neptune for identity graphs makes it easy to link identifiers with user profiles and to query them at very low latency. With all of these pieces you can do ad attribution, ad targeting, analytics, and personalization.
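To make that concrete, here is a minimal sketch of requesting recommendations from an Amazon Personalize campaign with boto3. The campaign ARN, user ID, and region are placeholders, and it assumes a campaign has already been trained and deployed:

import boto3

# Runtime client for querying a deployed Amazon Personalize campaign.
personalize_runtime = boto3.client("personalize-runtime", region_name="us-east-1")

# Placeholder ARN and user ID; a real campaign must already exist.
response = personalize_runtime.get_recommendations(
    campaignArn="arn:aws:personalize:us-east-1:123456789012:campaign/example-campaign",
    userId="user-42",
    numResults=10,
)

# Each recommended item carries an itemId (and a score for some recipes).
for item in response["itemList"]:
    print(item["itemId"], item.get("score"))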
Now looking at the right side of this slide, organizations with mature machine learning expertise focus on building custom personalization, and these end up becoming IP-backed core differentiators for them. Stitch Fix is one such organization, which we'll look at later on. AWS has the broadest and deepest set of machine learning capabilities and the supporting cloud infrastructure, thereby putting machine learning in the hands of every developer, every data scientist, and other expert practitioners.
Amazon SageMaker is a fully managed service that helps developers and data scientists build, train, and deploy machine learning models at scale. It simplifies every step of the machine learning workflow, making it easy to build machine learning use cases, whether for predictive maintenance, computer vision, or predicting customer behavior for personalization and recommendations.
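As a hedged illustration of the serving side, here is a minimal sketch of calling an already-deployed SageMaker inference endpoint with boto3 to score a customer for recommendations. The endpoint name and payload shape are assumptions for illustration, not anything described in the talk:

import json
import boto3

# Runtime client for invoking a deployed SageMaker endpoint.
runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")

# Hypothetical endpoint name and feature payload for a recommendation model.
payload = {"client_id": "user-42", "recent_clicks": ["sku-1", "sku-9"]}

response = runtime.invoke_endpoint(
    EndpointName="personalization-model-endpoint",   # placeholder
    ContentType="application/json",
    Body=json.dumps(payload),
)

predictions = json.loads(response["Body"].read())
print(predictions)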
Amazon DynamoDB is a key-value and document database. It provides an ultra-low-latency, serverless data store for your user profiles, user events, clicks, and any other customer-behavior data.
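For example, here is a minimal sketch of that kind of key-value access pattern with boto3. The table name and attribute names are illustrative, and the table is assumed to already exist with client_id as its partition key:

import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
profiles = dynamodb.Table("user_profiles")   # hypothetical table, partition key: client_id

# Write (or overwrite) a small profile document for a client.
profiles.put_item(Item={
    "client_id": "user-42",
    "style_preferences": ["casual", "athleisure"],
    "last_click": "sku-123",
})

# Low-latency read of that profile by key.
item = profiles.get_item(Key={"client_id": "user-42"}).get("Item")
print(item)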
Similarly, Amazon Aurora Serverless is a relational database option for a similar data store that can address some of your other use cases. You can also use Amazon ElastiCache, an in-memory cache, for low-latency data reads, and if you need microsecond latency for your DynamoDB access, you can use DynamoDB Accelerator (DAX). You can also use Amazon CloudFront, a fast content delivery network, to securely deliver your personalized experiences to users around the globe with low latency and high transfer rates. Optionally, you can use AWS Global Accelerator, a networking service with which you can direct your users' traffic over the AWS global network infrastructure, improving performance for your customers by up to 60 percent.
Amazon Route 53 is a DNS service that helps you direct user traffic around the globe through various flexible routing policies, enabling you to build low-latency, highly fault-tolerant architectures for personalization. The key takeaway here is that, as an organization, no matter where you are in your machine learning journey, AWS provides you multiple ways to personalize, just like amazon.com.
Now that you have chosen the tools for personalization and the right cloud infrastructure, how do you ensure your design and architecture is optimal and in line with your business goals? The AWS Well-Architected Framework will help you do that. Using this framework, you can learn the best architectural practices to build and operate highly secure, highly reliable, and cost-efficient architectures and solutions on top of AWS. It also provides the best ways to continuously improve your solutions and designs. Not only that, AWS also provides a self-service tool, the AWS Well-Architected Tool, available in the console, that you can use to accomplish the same tasks.
With that, I hand over to Visual Serene, who is going to talk about Stitch Fix and the personalization they are building there.

Thank you. Hi, I'm Urge from Stitch Fix, and I'm a data platform engineer working on our algorithms side. At Stitch Fix we are using data to deliver personalized styling and clothing recommendations for our users. The goal of the service is to provide our users the best possible clothing that fits their life and their preferences.
So how does data work at Stitch Fix? We have a few different ways of collecting data from our users. In some cases we take direct input from our users; for instance, they could be liking or disliking particular styles or types of clothes, and we feed that into a style profile. Other aspects of our business use preferences derived from the pieces of clothing that have been kept or returned to the service. For instance, when we ship a box of clothes, which we call a Fix, the items that have been kept become part of your style profile, and the items that have been returned are used as a signal for the things a user may not want, whether from a fit perspective, a style perspective, or local preferences in terms of where they live or what their lifestyle or profession is. We use those features to build a unified style profile for our users.
Here is some business context around it. What we want to do is take the ratings the user is generating, collect them both in real time and in a historical context, and use them to generate and serve recommendations for our users.

On the more practical side, we want to be able to serve our historical and recent data with a p99 of about 80 milliseconds. Our traffic has been growing, and we want our backend infrastructure to be able to scale horizontally as our data grows. Since we operate in a few different geographies, we also have to deal with data-compliance requirements such as GDPR and local privacy regulations; for instance, a user may ask us to delete their data, and we want to be able to delete the entire style profile, or any data we have been collecting from that user. At the same time, we want low operational and cost overhead. As you can see, these are a lot of different requirements on how we respond and serve recommendations while still serving our business needs.
What we had before, for storing these ratings and preferences from our users, was roughly a 120-node Redis cluster. We were collecting these ratings and preferences and storing them, essentially, in a hot memory tier. The problem we faced with that was that, as our data grew, we had to continuously keep adding more nodes, and there was operational complexity in constantly rebalancing and growing the cluster. Also, Redis stores data internally as strings, so the cost of capturing and storing this data was pretty high. And since ElastiCache for Redis is memory-backed, our disaster recovery story wasn't as robust as we would have liked: if we had a catastrophic failure in our Redis cluster, we would lose this very valuable user data, which we were using to serve recommendations in both a batch and a real-time context.
To sum up, what we wanted to do was serve roughly 3,000 of these ratings per client with a p99 of about 80 milliseconds. All the data is keyed by client ID. As far as data recency goes, we want to be able to serve recommendations within about a second of the user taking an action; for instance, if a user liked a particular piece of clothing, we want to be able to use that rating about a second after it was generated. We also want fault tolerance and recoverability: in case there are node failures or a catastrophic event, we want to be able to rehydrate this data with very little effort. And of course there's the notion of scalability and cost; running our Redis cluster was costing about a thousand dollars a day, and we wanted to reduce that cost, because traffic has been growing steadily and that cost was growing with it.
956.48 -> and that cost was also growing
960.16 -> so to implement um a different backend
964 -> to store and serve these recommendations
966.639 -> we came up with like a high level
968.56 -> methodology
969.68 -> on how we would be testing different
971.759 -> amazon services
974.16 -> to benchmark against what we are trying
977.199 -> to do
978.48 -> so what we did was we built a little
980.48 -> harness which would simulate the live
982.88 -> traffic that would
986.24 -> have some level of seasonality of how
988.48 -> our users are
989.759 -> accessing and writing to this data store
993.519 -> it would have basically a test harness
996.32 -> which was
997.04 -> also measuring the latency the error
1001.279 -> rates
1001.92 -> and throughput so with that we could
1004.639 -> basically
1005.68 -> swap these data back-ends and then
1008.8 -> have an ability to benchmark
1011.839 -> different strategies or implementations
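Here is a stripped-down sketch of what such a harness could look like. The backend interface, request mix, and seasonality model are assumptions made for illustration, not Stitch Fix's actual tooling:

import random
import statistics
import time

def run_benchmark(backend, duration_s=60, base_rps=50):
    """Replay synthetic read/write traffic against a swappable backend
    and report latency percentiles, error rate, and throughput."""
    latencies, errors, requests = [], 0, 0
    start = time.time()
    while time.time() - start < duration_s:
        # Crude "seasonality": the request rate oscillates over the run.
        rps = base_rps * (1 + 0.5 * random.random())
        client_id = f"client-{random.randint(1, 10_000)}"
        t0 = time.perf_counter()
        try:
            if random.random() < 0.8:          # 80% reads, 20% writes
                backend.get_ratings(client_id)
            else:
                backend.put_rating(client_id, {"ts": time.time(), "item": "sku-1"})
        except Exception:
            errors += 1
        latencies.append((time.perf_counter() - t0) * 1000)
        requests += 1
        time.sleep(1.0 / rps)

    p99 = statistics.quantiles(latencies, n=100)[98]
    return {"p99_ms": p99, "error_rate": errors / requests, "rps": requests / duration_s}

# Any object exposing get_ratings/put_rating can be benchmarked,
# e.g. a Redis-backed store versus a DynamoDB-backed one.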
We came up with some key metrics, or characteristics, against which we wanted to evaluate our solutions. We wanted the data backend to be able to scale as our traffic grows; we wanted high read and write performance; we wanted it to be cost effective; we wanted data recoverability against disasters; and we also wanted good developer semantics, or ergonomics, for using the data store.
We had the existing implementation, Redis on ElastiCache using sorted sets. Because the data was in memory, performance was very high, and we were getting idempotence because we were using the sorted-set semantics. Now, as I mentioned earlier, Redis stores objects as strings, so it has a high overhead. We tried using compressed protocol buffers, which reduced the memory footprint and let us shrink our cluster size, giving us some cost and operational gains. However, we still had the recoverability issue, and we lost the sorted-set semantics. So with Redis we had high performance and we were able to make it cost effective, yet we didn't have the scaling and recoverability characteristics that we wanted.
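For reference, here is a minimal sketch of the kind of sorted-set pattern described above, using redis-py. The key layout and field names are illustrative:

import json
import time
import redis

r = redis.Redis(host="localhost", port=6379)

def record_rating(client_id, item_id, rating):
    # One sorted set per client, scored by timestamp. Re-adding the same
    # member simply updates its score, which is where the idempotence came from.
    member = json.dumps({"item": item_id, "rating": rating}, sort_keys=True)
    r.zadd(f"ratings:{client_id}", {member: time.time()})

def recent_ratings(client_id, n=3000):
    # Most recent n ratings for a client, newest first.
    return [json.loads(m) for m in r.zrevrange(f"ratings:{client_id}", 0, n - 1)]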
We also experimented with Aurora Serverless using the JSONB column type. Since Aurora Serverless is disk-backed, it had nice scaling characteristics; however, its performance lagged because we were often hot-spotting. So yes, it was cost effective, and there is recoverability because it is disk-backed, but the performance suffered. I'm going to talk a little bit more about the final solution we ended up with, a tiered DynamoDB implementation, which met our scaling, operational, performance, and cost requirements.
This is a high-level diagram of the architecture we ended up with. We built a multi-table data strategy: there is a table for our historical data, an intraday table, and a real-time table. Let me talk a little bit about why we chose three tables and what the write and read paths to them are. The historical table holds data that is older than a day and goes back to the start of time, which for us is about six years ago. The write path to it is a Spark job that runs on an EMR cluster every night. The strategy there is that we take a client's historical data, compress it into a unified blob, and write one key per client. The goal is to use the PutItem semantics to write, and the GetItem semantics to fetch, the history for a client. As I mentioned earlier, we store about 3,000 responses per client, which roughly translates to a compressed JSON payload of about 20 kilobytes per client.
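A minimal sketch of that pattern, with placeholder table and attribute names (the real job runs in Spark on EMR; this only shows the PutItem/GetItem shape):

import gzip
import json
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
historical = dynamodb.Table("ratings_historical")   # hypothetical name, partition key: client_id

def write_history(client_id, responses):
    # Compress the client's full history (roughly 3,000 responses, ~20 KB
    # compressed) into a single binary blob stored under one key per client.
    blob = gzip.compress(json.dumps(responses).encode("utf-8"))
    historical.put_item(Item={"client_id": client_id, "history": blob})

def read_history(client_id):
    item = historical.get_item(Key={"client_id": client_id}).get("Item")
    if not item:
        return []
    return json.loads(gzip.decompress(item["history"].value))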
The intraday and the real-time tables have a row per event: every time an event is generated, it is keyed by client ID and timestamp. For the intraday table, the goal is to have a stream from our message bus, which happens to be Kafka, read the events off the bus and write rows into the intraday table. The recency of that table is from about a minute ago to about a day. The real-time table has a similar schema; however, its writes come from a direct HTTP endpoint, generated from a client action, and the recency of its data is from now to about two minutes ago.
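A minimal sketch of the intraday write path under those assumptions (kafka-python for the consumer; the topic, table, and attribute names are placeholders):

import json
import boto3
from kafka import KafkaConsumer

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
intraday = dynamodb.Table("ratings_intraday")   # partition key: client_id, sort key: ts

# Consume rating events from the message bus and write one row per event.
consumer = KafkaConsumer(
    "client-ratings",                            # hypothetical topic
    bootstrap_servers=["kafka:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    event = message.value
    intraday.put_item(Item={
        "client_id": event["client_id"],
        "ts": event["timestamp"],                # e.g. ISO-8601 string or epoch millis
        "item_id": event["item_id"],
        "rating": event["rating"],               # e.g. "liked" / "disliked"
    })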
So how does the read path tie these three tables together? When we request a user's data, a client's history, we make three concurrent requests that go to these tables independently, with circuit breakers running for each of those requests. Once those requests return, we merge the results, dedupe them, and return a unified result back. So we have three separate read paths coming together in a single request. To repeat: the writes happen from Kafka, Spark, or direct HTTP, and on the read side we take the data from these tables and stitch it together through GetItem calls or direct queries on those tables.
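A minimal sketch of that fan-out read, using per-request timeouts as a simple stand-in for the circuit breakers (table names and the merge key are assumptions carried over from the sketches above):

import gzip
import json
from concurrent.futures import ThreadPoolExecutor
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")

def read_historical(client_id):
    item = dynamodb.Table("ratings_historical").get_item(
        Key={"client_id": client_id}).get("Item")
    return json.loads(gzip.decompress(item["history"].value)) if item else []

def read_rows(table_name, client_id):
    resp = dynamodb.Table(table_name).query(
        KeyConditionExpression=Key("client_id").eq(client_id))
    return resp["Items"]

def client_history(client_id, timeout_s=0.08):
    # Fan out to the three tables concurrently; a slow or failing leg is
    # dropped rather than failing the whole request (crude circuit breaker).
    pool = ThreadPoolExecutor(max_workers=3)
    futures = [
        pool.submit(read_historical, client_id),
        pool.submit(read_rows, "ratings_intraday", client_id),
        pool.submit(read_rows, "ratings_realtime", client_id),
    ]
    results = []
    for f in futures:
        try:
            results.extend(f.result(timeout=timeout_s))
        except Exception:
            pass  # serve whatever data did come back
    pool.shutdown(wait=False)  # don't block on a slow leg

    # Dedupe on (item_id, ts) and return a unified, newest-first list.
    seen, merged = set(), []
    for r in sorted(results, key=lambda r: str(r.get("ts", "")), reverse=True):
        key = (r.get("item_id"), str(r.get("ts", "")))
        if key not in seen:
            seen.add(key)
            merged.append(r)
    return merged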
The end result was that our p99 latency landed in about the 60-to-80-millisecond range. And because we were using individual circuit breakers, if any one of those queries, say, fetching data from the historical table, fell outside that limit, we would still have some data to serve, even if we didn't have all of it. It is also possible that a user has no historical data and only recent data; in that case we just serve data out of the real-time and the intraday tables.
The key takeaways for us were that we really tried to first understand our data read and write patterns; that is what led us to build the simulation and to be able to test a few different data backends. We swapped our data backend from a hot ElastiCache Redis cluster to a tiered DynamoDB solution; however, for our users and internal APIs we did not change any of the API characteristics. The API and the protocol stayed exactly the same, and only the underlying implementation changed. It was also very useful for us to constantly leverage the depth of knowledge that the AWS support team and the DynamoDB team brought to the table; they were really instrumental in helping us carve out this horizontal and vertical table schema and bring it all together.
Thank you, Urge. It was a pleasure working with you while you designed and built this solution. And thank you very much to all the virtual audiences from around the world. If you want to explore more about what AWS is doing for retail and consumer-goods customers, you can go to the URL we are showing on the slide here. Also, if you want to dive deeper into some of the services like Amazon SageMaker, Amazon Personalize, DynamoDB, or any of the other services we touched upon in this presentation, please go and explore those tracks at this re:Invent. With that, I thank you all for attending today, and have a great day.
Source: https://www.youtube.com/watch?v=mcM1y2tNAAQ