AWS re:Invent 2022 - How Intuit migrated Apache Cassandra workloads to Amazon Keyspaces (DAT327)
Intuit delivers global technology solutions and consolidates data from thousands of financial institutions to help power its products. The Intuit Data Exchange platform team wanted to simplify the management of its Apache Cassandra-based workloads to improve operational efficiency. Learn about their experience migrating more than 120 TiB of data to Amazon Keyspaces, while delivering high availability and reliability to their users, using a dual-write approach. Discover Amazon Keyspaces best practices and migration guidance.
ABOUT AWS Amazon Web Services (AWS) hosts events, both online and in-person, bringing the cloud computing community together to connect, collaborate, and learn from AWS experts.
AWS is the world’s most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster.
#reInvent2022 #AWSreInvent2022 #AWSEvents
Content
0.9 -> - All right, let's get rolling.
3.21 -> Welcome, everyone, to our session.
4.74 -> Firstly, a big thank you for
coming to this evening session.
7.71 -> It's day three. It's late.
10.29 -> For those of you who made
it, we really appreciate it.
12.6 -> So, you know, a big, big thank you
15.09 -> for coming this late in the day.
17.37 -> re:Invent is in person.
18.9 -> It's been virtual over the last two years.
20.64 -> It's definitely more fun doing this live
23.85 -> versus doing it sitting
in the corner of my room
25.98 -> in front of a camera.
27.36 -> So it's exciting to see some of you live,
29.46 -> and I think, hope to keep doing this
32.1 -> in the next few years as well.
33.99 -> This is session DAT327.
36.39 -> We're gonna talk about how Intuit migrated
39.03 -> Apache Cassandra workload
to Amazon Keyspaces.
42.54 -> We're also gonna give you a quick overview
44.25 -> of what Keyspaces is,
45.3 -> especially if you've never used it before.
47.91 -> With that, my name is Meet Bhagdev.
49.86 -> I am a principal product manager at AWS.
53.07 -> I lead product for Amazon Keyspaces.
55.44 -> I'm joined by some really
talented folks from Intuit,
58.68 -> Manoj Mohan, who is an
engineering leader at Intuit,
62.13 -> and Jason Lashmet, who is a
principal software engineer
64.95 -> and a pioneer of some
of the migration things
67.17 -> that you're gonna learn
about on the technical side.
69.09 -> With that, let's get started.
72.12 -> We have an action-packed agenda for you.
75.57 -> We're gonna talk about what
Cassandra is, what Keyspaces is.
79.32 -> We're gonna talk about our architecture,
81.6 -> the question that we hear a
lot from a lot of our customers
84.09 -> is, "Okay, how are you really built?"
86.64 -> So we're gonna cover
that just a little bit.
89.34 -> We're also gonna talk about how
Intuit migrated to Keyspaces
93.42 -> and some of the benefits
that they have experienced
96.45 -> from their migration to Keyspaces.
100.71 -> Before we talk about Keyspaces,
102.053 -> I wanna talk a little bit about Cassandra.
104.7 -> So Cassandra has been available,
107.1 -> I wanna say, for about 14 years.
109.41 -> Some of you may be using
it in production today.
111.81 -> Some of you may have just heard about it.
114.63 -> It's maintained by the Apache Foundation.
116.82 -> It's one of the original
projects by Apache.
120.15 -> It's used for a variety of workloads.
122.19 -> Customers use it for
any kind of applications
125.01 -> that need massive scale with
low latency performance.
129.69 -> That's where Cassandra really comes in.
131.49 -> It also integrates really
well with other projects
134.43 -> in the Apache community,
such as Kafka and Spark.
139.08 -> And last but not the least,
140.1 -> it actually gives you a
SQL-like query language
142.8 -> called the Cassandra Query Language.
144.84 -> It's a word play on SQL.
146.19 -> They call it CQL, which a
lot of customers like to use,
149.49 -> especially given it's a NoSQL database.
151.32 -> If you've done relational before,
153.03 -> it gives you the best of both worlds
154.71 -> with the scale from NoSQL
and the query language
158.34 -> from your maybe a relational background.
161.58 -> With that, we do hear from customers
163.89 -> that managing Cassandra is still painful.
166.23 -> You know, it scales well,
but comes with a cost, right?
169.41 -> You need to deal with backups,
171.21 -> you need to deal with dynamic scaling,
173.31 -> which means adding
nodes, removing nodes.
176.25 -> This is painful.
177.42 -> Patching could cause downtime sometimes.
180.24 -> Version upgrades. Again,
you gotta plan for that.
183.66 -> Some of these database operations
185.07 -> are really, really painful on Cassandra,
187.23 -> and that's where Keyspaces comes in.
189.9 -> Before I get into Keyspaces,
192.66 -> of all the folks here,
can I get a show of hands
195.06 -> if you've used or heard
of Keyspaces before?
198.99 -> All right, it's a decent
number, and by the end of it,
201.24 -> for the folks who did
not raise your hands,
202.74 -> hopefully, you could turn
those to yeses as well.
206.25 -> But Keyspaces is AWS's
fully-managed, Cassandra-compatible,
211.53 -> highly available serverless
database service.
214.74 -> Now, I threw a lot of words at you.
216.63 -> You're probably wondering,
217.567 -> "What is this salesy pitch
that you're talking about?"
220.77 -> We're gonna break it down into
four kind of segments here.
224.37 -> The first one is Cassandra compatibility.
227.73 -> This is really important to us.
229.5 -> So if you're already
using Cassandra today,
232.77 -> you can use Keyspaces with
the same application tools
236.64 -> and drivers that you're using today.
239.49 -> So you don't have to learn a
completely new set of tools,
242.49 -> libraries, SDKs, command line
tools, Go tools, you know.
246.99 -> Pick your favorite tool
that you use today.
249.24 -> Our goal is to make those
work with Keyspaces.
252.33 -> Now, I have a full slide on
how we define compatibility
255.6 -> and what it really means to be compatible,
257.31 -> and I'll get to that in just a bit.
259.95 -> The second key pillar
that we like to talk about
262.71 -> is serverless.
264.63 -> And with serverless, you
actually get true scale to zero.
268.89 -> And I call that out
270.27 -> because we hear from customers
again and again and again
274.41 -> around what does serverless
really mean, right?
276.72 -> So I'm gonna take a second and share
278.1 -> what we at Keyspaces
think about serverless.
280.833 -> With serverless, there's
obviously no instances.
283.05 -> You're not going to the console
284.4 -> and choosing R5 or C5 or any of that.
288.03 -> But in addition to that,
289.32 -> you don't even have to
worry about capacity.
292.38 -> You just create a table,
294.21 -> you put reads and writes on that table,
296.49 -> we figure out all the underlying compute,
299.16 -> infrastructure, storage,
et cetera for you.
303.15 -> Let's say you're not putting
304.17 -> any reads or writes on the
system, you don't pay for that.
307.17 -> So you actually pay for what you use.
309 -> If you have zero reads and zero writes,
310.56 -> you pay zero dollars on Keyspaces,
312.3 -> giving you the true serverless
promise of scale to zero.
317.55 -> Now, you might be wondering,
"Okay, what about performance?"
320.4 -> Cassandra is really good at performance.
322.89 -> What about Keyspaces?
324.42 -> Well, we give you
325.253 -> the same single-digit-millisecond
performance latency
328.44 -> on pretty much most of
your queries at any scale.
332.43 -> So let's say you're going from gigabytes
334.86 -> to terabytes to petabytes,
337.44 -> your database is not slowing down.
339.57 -> Keyspaces is still horizontally
distributed like Cassandra,
343.02 -> and you keep getting that
single-digit-millisecond at scale
346.41 -> as your workload grows
on Amazon Keyspaces.
350.58 -> Last but not the least,
availability and security.
354.24 -> This is 'job zero' for us at AWS.
357.15 -> We provide four nines of SLA.
359.313 -> What this really means
360.45 -> is if you have an availability requirement
364.08 -> for any tier zero apps,
that's kind of a checkbox.
367.11 -> We provide that.
367.943 -> You get four nines of SLA
within a single region.
371.46 -> We keep three copies of your data
373.74 -> across three availability zones.
376.89 -> You only pay for one,
377.88 -> but we make three copies for durability.
380.4 -> And we offer features
such as encryption at rest
383.31 -> and encryption in transit by default.
385.92 -> So you don't have to go
and set any of that up.
388.83 -> You just create your first table
390.51 -> and you get encryption at
rest and encryption in transit
393.84 -> right out of the box.
397.23 -> Now, your next question might
be, "Who is using Keyspaces?
401.22 -> And what are the use cases?"
403.83 -> We'll start with Amazon.
404.97 -> We use Keyspaces for internal workloads
406.86 -> for a variety of use cases.
408.84 -> Externally, you're gonna
hear from Intuit very shortly
412.23 -> why they use Keyspaces.
413.94 -> We also have customers
such as Experian and PWC
417.93 -> who use it for financial
tech applications.
420.78 -> We are starting to see
additional workloads
422.85 -> in different verticals such
as Time Series and Graph,
426.6 -> which is definitely newer
for us at Keyspaces,
429.48 -> being in the market for
about three years now.
431.88 -> So the point I'm trying to make here
434.1 -> that it's general purpose.
435.69 -> We've seen customers across
verticals and segments
439.35 -> adopt Keyspaces.
441 -> I've shown some logos here,
but that's not the entire list.
443.52 -> You can click on that link below,
444.99 -> you'll see some more examples.
446.85 -> We also have some customers
447.99 -> that publicly aren't referenced yet,
449.79 -> but soon will be added
over the next few months.
452.07 -> So keep an eye on that page
that I've linked below.
457.44 -> Now, let's talk about a very
popular and a favorite topic
460.86 -> from a lot of our customers
around Cassandra compatibility.
465.69 -> So what do we mean by
Cassandra-compatible?
467.49 -> And I'm gonna raise my
hand and say upfront,
469.297 -> "We're not 100% Cassandra-compatible."
472.08 -> There are some features
474.18 -> that we don't support
that Cassandra supports.
476.37 -> But our goal is to be 100% compatible
478.59 -> with what our customers are using.
480.72 -> What that really means
481.62 -> is we work backwards
from your requirements.
484.26 -> Some examples of features that
we've added in the last year
487.41 -> include the Spark connector
support, TTL support.
491.91 -> Those were some feature asks
493.32 -> from customers who use Cassandra heavily,
495.39 -> wanted to use Keyspaces but couldn't
497.43 -> due to these missing gaps.
499.14 -> So we're always working
backwards from those customers.
501.63 -> We have an action-packed roadmap
503.19 -> for the rest of Q4 in
December and Q1 and Q2
506.85 -> where we're gonna launch
additional compatibility,
508.77 -> especially if you're using query APIs
511.65 -> that we don't support today.
513.69 -> We've also built a tool that you can run
516.72 -> against your source Cassandra cluster,
519.24 -> and it'll spit out what features
that you're using today
522.81 -> that may not be compatible with Keyspaces.
525.99 -> So you don't actually
have to go and try a POC,
528.48 -> spend a lot of time, find a surprise.
531 -> We tell you this upfront.
532.74 -> If you're interested in using this tool,
534.24 -> come talk to us after,
535.41 -> and I'll be happy to connect
you with the right folks.
538.26 -> Versions, so we are compatible
with version 3.11.2,
542.91 -> backward compatible up to that version.
545.25 -> I know there have been additional versions
547.05 -> out there in Cassandra.
548.25 -> 4.0 is available. 4.1 is in
preview slash release candidate.
552.9 -> So we're committed to
supporting additional versions.
555.69 -> Where we see critical mass today
557.28 -> is up to that 3.11 kind of version range,
560.97 -> So we're compatible up to that version.
563.28 -> All existing tools and drivers,
pick your favorite language,
566.49 -> Python, Node.js, Java, Perl, Ruby,
570.12 -> existing Cassandra drivers
for these languages
or DataStax drivers
573.66 -> continue to work with Keyspaces
with little to no change.
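As a rough illustration of that point, here is a minimal Python sketch of pointing the open-source cassandra-driver at a Keyspaces endpoint over TLS. The region, keyspace name, credentials, and certificate path below are placeholders, not anyone's real configuration; check the Keyspaces documentation for the exact connection requirements in your account.

```python
from ssl import SSLContext, PROTOCOL_TLSv1_2, CERT_REQUIRED

from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

# Keyspaces requires TLS; the root CA bundle is downloadable per the AWS docs.
ssl_context = SSLContext(PROTOCOL_TLSv1_2)
ssl_context.load_verify_locations("sf-class2-root.crt")  # placeholder path
ssl_context.verify_mode = CERT_REQUIRED

# Service-specific credentials generated for an IAM user (placeholders).
auth_provider = PlainTextAuthProvider(
    username="my-keyspaces-user-at-111122223333",
    password="generated-service-password",
)

cluster = Cluster(
    ["cassandra.us-east-1.amazonaws.com"],  # regional Keyspaces endpoint
    port=9142,
    ssl_context=ssl_context,
    auth_provider=auth_provider,
)
session = cluster.connect("my_keyspace")  # same Session API existing Cassandra code uses
```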
579.27 -> Another very popular question we get
581.28 -> is "What about tombstones?
What about repair?
585.6 -> How does garbage collection work?
587.79 -> Or how does compaction work?"
589.92 -> So the good news is, as a developer,
592.65 -> you actually don't have to
worry about this on Keyspaces.
595.8 -> What I was talking about earlier
on the architecture side,
598.5 -> and I'll cover that just in a bit,
601.17 -> we don't completely use Apache
Cassandra under the hood.
604.53 -> We use parts of it,
605.94 -> but we've also built our
differentiated storage architecture
609.63 -> that makes some of
these concepts redundant
612 -> or not applicable.
613.32 -> And if they do apply, for
instance, garbage collection,
615.6 -> obviously, there's some version of it
617.28 -> that also applies to Keyspaces,
619.95 -> something you don't have to worry about.
621.42 -> We take care of it for you completely
624.24 -> without any application impact,
626.7 -> or a setting that you need to
tune, adjust, or deal with.
630.72 -> So next time a developer
at your company asks you,
632.887 -> "Hey, tombstones, how do I deal with it?"
636.09 -> And if you're using Keyspaces,
637.35 -> the answer is, "Hey, you
don't have to deal with it.
639.33 -> It's done for you."
640.62 -> You keep driving traffic,
terabytes, petabytes,
643.38 -> millions of reads, millions of writes.
646.235 -> It's just taken care of right off the bat.
650.97 -> And now, let's talk
about the architecture.
653.01 -> How does this really work?
654.18 -> How have you gotten rid of tombstones?
655.8 -> How can we get serverless?
How do we get petabyte scale?
660.96 -> The key point to talk about here
663.39 -> is that compute and storage and Keyspaces
667.26 -> are completely decoupled.
670.14 -> So your nodes on Cassandra
may be data bearing.
673.74 -> They're heavy, they're data heavy.
675.63 -> On Keyspaces, that's not the case.
678.24 -> On Keyspaces, your storage partitions
681.3 -> continue to grow in 10-gig increments.
684.3 -> So let's say you have 10
gigabytes of data today,
687.42 -> your workload grows to 20, 30, 40,
689.58 -> your company's wildly successful,
691.23 -> you're getting into terabytes,
692.43 -> hundreds of terabytes into petabytes,
695.31 -> on your partition key, we
use sharding under the hood.
697.95 -> We keep scaling your storage partitions
699.96 -> as your data size grows.
702.09 -> This is automatic.
703.95 -> You don't have to go to the console,
705.72 -> click a button, or take
any application downtime.
710.55 -> Now, let's talk about the
query processing layer.
713.07 -> That's the compute layer
714.3 -> or what we like to call the compute layer.
716.61 -> This is where your reads and writes go.
718.56 -> This layer also scales
independently of storage.
721.5 -> So let's say you have very thin storage,
723.87 -> maybe you have just a
couple of gigs of data,
725.88 -> but you have thousands of
reads and writes per second
728.1 -> or millions of reads
and writes per second,
730.14 -> you're able to achieve
that with our separation.
733.59 -> Your compute layer also
scales automatically
735.93 -> on your partition key,
737.37 -> and you don't have to worry
738.6 -> about scaling your reads or writes either.
741.24 -> You can let Keyspaces
take care of that for you
743.82 -> without any application impact.
746.79 -> So that's our architecture in a nutshell.
748.29 -> And if you wanna go deeper,
750 -> we'll be happy to, like I said,
stick around or take Q and A
752.82 -> towards the end of the session.
756.03 -> Now, let's talk about serverless.
759.21 -> The key point I wanna call
out here on serverless
761.73 -> is what it really means.
764.79 -> And we talked about scale to zero,
766.23 -> but how do you achieve that?
768.24 -> So in Keyspaces, there
are two pricing modes,
771.54 -> and this often trips up a lot of customers
773.31 -> because if you're coming
from a Cassandra world,
775.89 -> you're so used to vCPUs and memory,
778.71 -> that it's a little bit
different on Keyspaces.
781.756 -> So in Keyspaces, by default,
783.87 -> you get started on on-demand capacity.
787.35 -> What this means is that you
actually don't provision
789.99 -> any amount of compute.
792.3 -> You just go create a table,
you put workload on your table,
797.28 -> and over time, Keyspaces learns
how much capacity you need.
800.67 -> And it's not magic.
801.96 -> You know, we learn about your workload,
803.58 -> we see your average reads and
write, peak reads and write,
806.01 -> and we provision the underlying
infrastructure for you
808.98 -> as your workload grows.
810.93 -> Now, you may be a customer
812.13 -> who may want some more predictability.
815.52 -> We also give you an option
817.32 -> to provision a certain
amount of throughput,
819.27 -> and this is not instances.
820.95 -> You're provisioning a
certain number of reads
823.02 -> and a certain number of writes.
825.06 -> That's the second pricing mode,
826.38 -> and that's what we call
provisioned capacity.
828.81 -> The good news with provisioned capacity
830.61 -> is that you can still use auto scaling,
832.95 -> so you can set up a minimum
bound and a maximum bound,
836.1 -> and we will scale within those
bounds of reads and writes.
839.97 -> Everything you're doing on Keyspaces
841.8 -> is in terms of reads and writes.
843.63 -> There are no instances,
there's no RAM, there's no CPU.
847.44 -> You directly get billed for
your database operations
850.68 -> in reads and writes.
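To make the two modes concrete, here is a hedged sketch of expressing a table's capacity mode through the CQL CUSTOM_PROPERTIES extension that Keyspaces supports, executed with the Python driver. The table name and throughput numbers are made up, and the exact property names should be confirmed against the current Keyspaces documentation.

```python
# Assumes `session` is a connected cassandra-driver Session (see the earlier connection sketch).

# On-demand capacity: nothing to provision, you pay per request.
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.events (
        pk text, sk text, payload text,
        PRIMARY KEY ((pk), sk)
    ) WITH CUSTOM_PROPERTIES = {
        'capacity_mode': {'throughput_mode': 'PAY_PER_REQUEST'}
    }
""")

# Provisioned capacity: declare reads and writes per second up front.
session.execute("""
    ALTER TABLE demo.events WITH CUSTOM_PROPERTIES = {
        'capacity_mode': {
            'throughput_mode': 'PROVISIONED',
            'read_capacity_units': 1000,
            'write_capacity_units': 500
        }
    }
""")
```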
853.14 -> That's enough about Keyspaces
and enough from AWS.
855.51 -> I'd like to now bring on Manoj and Jason,
858.72 -> and they'll kind of talk to you guys
860.37 -> about how Intuit migrated
862.59 -> from Apache Cassandra to Keyspaces.
865.098 -> Pass it on to you, Manoj.
867.93 -> - Hello.
873.15 -> Thanks, Meet.
875.13 -> Hello, everyone. I'm Manoj.
877.41 -> I lead a platform
engineering team at Intuit.
880.65 -> So myself and Jason,
we are excited to share
883.98 -> our Keyspaces journey with all of you.
890.34 -> So this is a quote we
had published earlier
893.73 -> in an AWS case study.
895.74 -> Our prior state used to be
a custom Cassandra cluster
899.52 -> running on AWS.
901.83 -> In this prior state, we would
need to do extensive planning
905.97 -> every time we have to
add additional capacity
909.27 -> for organic growth in our cluster.
911.97 -> And this extensive planning
913.56 -> would translate to at least
a few weeks of lead time.
917.49 -> Now, with Keyspaces as a managed service,
920.52 -> we are able to accomplish
all of this in a single day
923.91 -> and save on all of this
energy and bandwidth
926.85 -> and repurpose that towards
building business capabilities
930.42 -> in our platform for our stakeholders.
935.28 -> Quick introduction about Intuit
for folks who don't know.
938.79 -> Intuit is a global fintech company
941.94 -> and our mission is to
drive financial prosperity
945.45 -> around the world.
946.86 -> We do this by building
AI-driven expert platforms
951.27 -> that enable our customers to
better manage their finances,
956.58 -> be it tax solutions using TurboTax,
959.91 -> or be it personal finance solutions
962.76 -> using Credit Karma or Mint,
965.43 -> or accounting automation
for small businesses
968.58 -> using QuickBooks,
970.05 -> or even marketing automation,
leveraging Mailchimp.
974.34 -> We continue to innovate
976.26 -> building best-of-breed financial solutions
979.14 -> and currently serve a cumulative
982.95 -> install base of over 100 million
customers globally.
989.04 -> Quick introduction of our team.
991.02 -> Our team is known as the Intuit
Data Exchange platform team.
996.06 -> Our charter is to acquire information
999.6 -> working with several different
financial institutions
1002.87 -> and data providers globally
1005.54 -> as a one stop shop for all of Intuit.
1008.75 -> So we go ahead and build data agents
1013.79 -> that acquire this information
on a continuous basis,
1017.06 -> and then we pass it through
1018.38 -> several different cleansing routines
1020.42 -> and then make this
information readily available
1023.99 -> in the form of services for
all of the Intuit products
1027.56 -> in the specific use
cases of those products
1030.47 -> for our customers.
1034.52 -> Now, this is a very high level overview
1038.36 -> of the landscape of our platform.
1040.82 -> Now, as I mentioned,
1042.53 -> what you see on the
right side of the diagram
1044.93 -> are all of the different integrations
1047.66 -> and capabilities that we have built
1050.03 -> in the form of custom agents
to extract this information.
1054.26 -> Once we extract this information,
1056.42 -> this passes through
several different routines
1059.3 -> like cleansing routines,
curation routines,
1062.06 -> enrichment routines with ML and so on.
1064.94 -> And finally, all of
that data gets persisted
1068.96 -> in what used to be our Cassandra cluster.
1072.29 -> To the left side of the diagram
1074.27 -> are all of the Intuit products
1076.37 -> that are requesting
specific information
1079.34 -> in specific use cases in the
form of several different APIs
1083.12 -> and services from our one-stop
platform across the board.
1088.43 -> Now, all of the different
logical application components
1092.51 -> are containerized and
deployed on Kubernetes pods
1096.32 -> for seamless scalability.
1098.93 -> Please note that this is just
a very high level forest view
1102.89 -> of our landscape of our platform
1105.71 -> without delving into the finer details
1108.11 -> in the interest of time.
1112.58 -> Let me pass it on to Jason
1114.11 -> to walk through the next few slides.
1118.4 -> - Thanks, Manoj.
1120.95 -> So our group has been running
a Cassandra cluster now
1123.47 -> for going on eight years.
1126.05 -> Our use cases primarily involve
1127.85 -> storing and retrieving large
volumes of financial data.
1131.45 -> And since we don't require
complex queries or table joins,
1135.38 -> we chose Cassandra primarily
for its horizontal scalability
1139.04 -> and its performance.
1141.17 -> And I think that for the most part,
1143.18 -> Cassandra has delivered on these aspects,
1145.64 -> but it's certainly not
without its downsides.
1150.83 -> Operationally, it requires
some significant effort.
1154.01 -> There are administrative
and monitoring tools
1156.17 -> to install and maintain,
1158 -> a backup strategy needs
to be put in place,
1160.61 -> and things like version upgrades,
scaling out the cluster,
1164.39 -> these all need significant
planning and execution
1167.78 -> ideally by someone with
that specialized skill set.
1172.61 -> In our group in particular,
1174.17 -> we had the rather unfortunate situation
1177.14 -> of having multiple periods of time
1179.36 -> where we didn't have a DBA
assigned to these tasks.
1183.05 -> We ended up going so long without running
1185.54 -> some of the important
maintenance jobs like repair
1188.72 -> that we were actually told by DataStax
1190.97 -> that starting them up again
could impact the entire cluster.
1196.31 -> So towards the middle of 2019,
1198.74 -> one of our team members was
starting to look into DynamoDB
1202.1 -> as a possible fit for our use cases.
1205.25 -> And while it looked
promising, it would've taken
1208.28 -> some significant application
level changes on our side
1211.7 -> to really integrate
with it and test it out.
1215.48 -> So later that year, when
Keyspaces was announced
1218.27 -> as more of a drop in
replacement for Cassandra,
1221.33 -> we were very eager to try that out.
1224.54 -> We ended up taking one of
our services in particular
1227.18 -> called financial transaction,
1229.43 -> which had by far our
largest volume of data
1232.88 -> and highest processing requirements.
1235.61 -> Around that time,
1236.57 -> I think it had around 60
billion rows in the database
1240.92 -> and was serving anywhere from
2,000 to 8,000 API calls a second.
1246.53 -> So we chose this service
to do a proof of concept
1249.68 -> with the idea that if
Keyspaces performed well
1253.01 -> with its volume,
1254.33 -> then it would also be suitable
1255.8 -> for any of our other services.
1260.553 -> - Excuse me.
1264.59 -> So in summary, our platform,
1267.35 -> specifically our persistence layer
1269.72 -> was doing all of the
essential things expected
1272.84 -> with some additional
maintenance overheads.
1277.07 -> We were able to take all
of our existing workloads
1279.8 -> up until that point in time.
1281.6 -> However, the Intuit product landscape
1284.69 -> was changing with newer acquisitions
1287.18 -> like Credit Karma and Mailchimp,
1290.3 -> and this meant our own platform
needs continued to grow.
1294.98 -> Our workloads were quadrupling over time
1299.03 -> and there was a significant increase
1301.67 -> in the variance of traffic
1303.38 -> that we were seeing on our platform.
1305.66 -> And amid all of this, we had our database
1309.11 -> being the single biggest choke point
1311.63 -> in the platform ecosystem.
1315.74 -> So all of our existing workloads
were more along the lines
1318.83 -> of the small aeroplane
that's depicted out here.
1321.53 -> But we were now looking to
upgrade our persistence layer
1326.21 -> from being this small aeroplane
1328.34 -> that could take only that
minuscule amount of traffic
1331.52 -> to being more like a Boeing 777
1335.15 -> that could really push
the boundaries of scale.
1341.63 -> So now that we have the
problem statement defined
1345.08 -> and called out clearly,
we wanted to strategize
1348.2 -> on how we roadmap towards a solution.
1351.29 -> In order to do that, we
prioritized four key aspects
1355.97 -> as we were working through this.
1357.83 -> One, we wanted to leverage
persistence as a managed service.
1362 -> This way, we could do away
1363.98 -> with all the operational
1365.9 -> and administrative overheads.
1368 -> Two, we wanted to stay focused
1370.19 -> on delivering business
agility for our customers.
1374.09 -> We did not want our database layer
1377.12 -> to be the slowest moving block
in our platform ecosystem.
1381.53 -> Three, we wanted to dynamically
1384.05 -> scale up or scale down all of the capacity
1388.19 -> so that we could keep our AWS costs
1391.22 -> optimal in the longer term.
1393.5 -> And four, we were also
keen not to overhaul
1397.58 -> any of our application
or data service contracts
1401.27 -> because this way, we could focus
1403.61 -> on this being a seamless upgrade
1406.46 -> without having any sort of adverse impact
1409.55 -> for our stakeholders of the platform.
1412.13 -> So for all of you who are thinking,
1414.147 -> "Did this team really evaluate DynamoDB
1417.47 -> as a potential option,"
there lies the answer.
1420.65 -> This was the reason why we were inclined
1423.14 -> to move ahead with Keyspaces
as our first priority.
1430.19 -> Looking back through our journey,
1432.44 -> there were three key
decisions that enabled us
1436.46 -> to get to this successful state
1439.31 -> of launching Keyspaces.
1441.62 -> What do I mean? Let me talk through.
1444.2 -> One, starting all the
way from the beginning,
1448.1 -> working with AWS, we knew that Keyspaces
1452.57 -> was not Apache Cassandra under the hood.
1455.9 -> What this also meant is we had to ensure
1459.65 -> that there was a high level of parity
1462.38 -> functionally as well as non-functionally
1465.08 -> to ensure that there is zero
to minimal adverse impact
1469.34 -> on our application layer.
1471.32 -> Two, we wanted to
simulate this entire setup
1475.61 -> in production in an iterative way.
1478.76 -> Our platform is extremely mission
critical to all of Intuit,
1483.95 -> and we did not want to sign
up for a higher risk quotient
1488.51 -> given the critical role we
play in the Intuit ecosystem.
1492.5 -> So in terms of a risk mitigation strategy,
1495.65 -> we wanted to continuously iterate
1497.87 -> and figure out and get to our final state
1500.69 -> without it being a zero to one transition.
1504.62 -> The third one, we wanted to build out
1507.14 -> a strong partnership with AWS engineering
1510.35 -> and all our cross-functional teams.
1512.87 -> Now, in my opinion,
1514.22 -> this was probably one of the
most important investments
1519.11 -> we made as we embarked on this journey.
1522.14 -> This helped us set the right expectations
1524.72 -> all the way from the beginning,
enabled us to lean in on AWS
1530.51 -> as we ran into different
sets of roadblocks
1533.15 -> and get help right away.
1535.43 -> It seemed as if both our teams
were connected at the hip
1540.05 -> all the way from the
beginning 'til the end.
1546.47 -> Okay, now that we have
the problem statement,
1549.41 -> followed by the strategy and priority
1551.99 -> around what we are solving for,
1553.91 -> we wanted to now make sure
1556.13 -> that we break down the entire goal
1558.8 -> into a sequential logical
set of milestones or phases
1563.33 -> so that we are dividing and
conquering it appropriately.
1566.96 -> So we came up with four different phases
1569.93 -> in terms of getting to our end state.
1572.39 -> Phase one is all about dual writes.
1575.66 -> What do I mean by dual writes?
1577.67 -> We wanted to simulate
all of our write traffic,
1581.66 -> the actual production write
traffic onto Keyspaces
1586.34 -> in a simulated manner actually
running in production.
1590.36 -> This enabled us to stay focused on solving
1593.06 -> for all of the write-related challenges,
1595.19 -> be it write throughput, write
performance, optimizations,
1599.3 -> et cetera, et cetera.
1600.98 -> So that was our focus of phase one.
1603.62 -> Phase two, we did the exact
same thing with reads.
1607.85 -> We wanted to make sure
1609.17 -> that we are simulating
reads from Keyspaces
1612.26 -> while our production is
still the Cassandra cluster,
1615.77 -> and ensure that we are
comparing apples to apples
1619.19 -> between our current state of production
1621.32 -> and the new simulated
reads from Keyspaces.
1625.49 -> Phase three was focused on data migration.
1629.15 -> So data migration
1630.77 -> was about the one time
historical data backfill
1634.85 -> from Cassandra to Keyspaces.
1638.45 -> Dual writes
1639.283 -> was for all of the ongoing
writes that's happening,
1642.2 -> and the data migration was
only the one time migration.
1646.79 -> Now, given that we were one
1648.47 -> of the early adopters of Keyspaces,
1650.78 -> there was no data migration utility.
1653.12 -> So working with AWS,
we chose to build out
1656.9 -> a custom migration utility.
1659.78 -> And phase four was all about data parity
1663.29 -> and then, cut over of traffic.
1665.78 -> After phase one, two and three,
1668.15 -> we had all of the writes,
we had all of the reads,
1670.91 -> and now, it was time for us to ensure
1672.92 -> there was 100% data parity
1675.5 -> at the row level, at the column level,
1677.51 -> between the old system and the new system.
1680.36 -> And then once we had
guarantees on the data parity,
1684.29 -> we slowly, gradually
moved our users in batches
1689.21 -> from the old system to the new system.
1692.06 -> So this was the approach we took.
1695.3 -> Now, to deep dive into
each of these phases
1698.15 -> and explain all of the challenges,
1700.07 -> let me pass it on to Jason.
1715.4 -> - So the first step that we took
1716.57 -> was to implement dual
writes to Amazon Keyspaces
1720.4 -> and release into production.
1723.112 -> We had validated in our
non-prod environment
1725.242 -> that the basic functionality works,
1727.357 -> but we really wanted to expose Keyspaces
1729.2 -> to our production workload
with all of its edge cases
1734.347 -> and all of its (speaks faintly)
1736.321 -> to really make sure that
under those conditions,
1740.291 -> the latency and error rate
was comparable to Cassandra.
1744.71 -> Keeping with the principle
of zero customer impact
1747.47 -> during this evaluation,
1749.51 -> we made the writes to
Keyspaces fully asynchronous
1752.9 -> and added the appropriate safeguards,
1755.18 -> so that any issues with that write path
1757.58 -> would not affect any of our user requests.
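The write path they describe might look roughly like this in Python with the cassandra-driver; the session and statement names are illustrative sketches, not Intuit's actual code.

```python
import logging

log = logging.getLogger("dual_write")

def dual_write(cassandra_session, keyspaces_session, stmt, params):
    # Primary write: Cassandra remains synchronous and authoritative.
    cassandra_session.execute(stmt, params)

    # Shadow write: fire-and-forget to Keyspaces. Failures are logged and
    # monitored, but never bubble up to the user request.
    try:
        future = keyspaces_session.execute_async(stmt, params)
        future.add_errback(
            lambda exc: log.warning("Keyspaces shadow write failed: %s", exc)
        )
    except Exception as exc:  # safeguard: even submission errors stay contained
        log.warning("Keyspaces shadow write not submitted: %s", exc)
```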
1761.63 -> When we first rolled this out,
1763.46 -> we actually saw a very high error rate.
1766.49 -> Around 40% of our writes were failing.
1769.91 -> And even after implementing
1771.47 -> some of the recommended best practices
1773.54 -> like retries with exponential backoff,
1776.15 -> we still had a significant error rate
1778.61 -> as well as some high latency
now because of those retries.
1783.26 -> At this time, we were meeting regularly
1785.54 -> with the Keyspaces team,
1787.34 -> and we worked with them
closely to dig into the issue.
1791.69 -> What we found was that we
were hitting a rate limit.
1796.79 -> So Keyspaces has a few different
levels of rate limiting.
1800.15 -> There's an overall limit
for the AWS account.
1803.78 -> There is a limit per table.
1805.64 -> And we were aware of these
and our monitoring showed
1808.79 -> that we weren't anywhere
near hitting those limits,
1812.06 -> but there is another limit
when reading and writing
1815 -> to an individual partition
within a Keyspaces table.
1819.65 -> You can see above, our
most heavily used table
1822.83 -> was storing all financial transactions
1825.08 -> for a given user and account
inside a single partition.
1829.46 -> This meant that if we had a user
1831.77 -> with a particularly active account,
1834.14 -> we could easily get into situations
1836.21 -> where a single request
coming into our service
1839.42 -> would result in thousands of rows
1841.52 -> being inserted into this
single partition in Keyspaces.
1845.75 -> And when we do this,
1846.83 -> we do make multiple parallel
requests to Keyspaces
1850.4 -> to try and do this as quickly as possible.
1853.22 -> And while doing so,
1854.053 -> we were just repeatedly
hitting this rate limit.
1858.35 -> So working with one of
the solutions architects,
1860.66 -> we did some ballpark math
around how long it would take
1864.53 -> to insert different amounts
of data given this limit,
1868.13 -> and quickly came to the realization
1870.08 -> that given the way that
our data was partitioned,
1872.78 -> we just weren't going to be able to meet
1874.31 -> our latency objectives.
1876.32 -> Again, the core issue was
that we were trying to insert
1879.38 -> too much data too quickly
into a single partition.
1884.93 -> So the recommended solution
was fairly straightforward.
1887.9 -> We added another column
to our partition key,
1890.99 -> which essentially spread the transactions
1893.18 -> for a given account over
multiple partitions.
1896.99 -> Based on our volume, we found
that using 10 partitions
1900.11 -> should give us throughput
1901.28 -> comparable to what we had in Cassandra.
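A minimal sketch of that bucketing idea, with hypothetical column names (user_id, account_id, bucket_id) rather than Intuit's real schema:

```python
import zlib

NUM_BUCKETS = 10  # the talk found 10 buckets gave Cassandra-comparable throughput

# Hypothetical schema after the change:
#   PRIMARY KEY ((user_id, account_id, bucket_id), transaction_id)

def bucket_for(transaction_id: str) -> int:
    # Deterministic hash so the same transaction always lands in the same bucket,
    # spreading an account's rows across 10 storage partitions.
    return zlib.crc32(transaction_id.encode("utf-8")) % NUM_BUCKETS

def insert_transaction(session, insert_stmt, user_id, account_id, txn):
    session.execute(
        insert_stmt,
        (user_id, account_id, bucket_for(txn["id"]), txn["id"], txn["payload"]),
    )
```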
1903.92 -> And when we implemented
and rolled this out,
1905.87 -> we saw very good results.
1908.27 -> The latency was now within
a few percent of Cassandra
1912.68 -> in all the percentiles
that we were measuring,
1915.32 -> and the error rate had
dropped to almost zero.
1922.34 -> I say almost zero because
even after making this change,
1926.45 -> there were a small number of updates
1928.52 -> that were still continuously failing.
1931.7 -> When we looked at the error messages,
1934.43 -> it looked like they were
all the same message,
1936.38 -> which was we were exceeding
the maximum row size
1939.74 -> of one megabyte.
1942.32 -> When we dug in to why this was happening,
1944.33 -> we found it was all due
to one particular use case
1947.63 -> that we have around pending transactions.
1950.99 -> Because of the way that these
are processed in our system,
1954.32 -> we store pending transactions differently
1956.69 -> than we do our regular
posted transactions.
1959.6 -> Rather than putting each
one in its own database row,
1962.78 -> we take the whole list
of pending transactions
1965.36 -> for a given account,
serialize it to a JSON string,
1968.9 -> and stick that into a
text field inside one row.
1973.13 -> Normally, this list is very small,
1975.8 -> but for a few of our larger customers,
1978.44 -> there were several hundred of these.
1980.36 -> And in this case,
1982.22 -> the resulting JSON string
exceeded the one megabyte.
1987.29 -> So we did check at this
time with the Keyspaces team
1990.38 -> to see if we could just
increase this limit
1992.45 -> because some of these
things are adjustable,
1995.09 -> but this is a hard limit,
1996.83 -> so definitely something that
needs to be kept in mind
1999.41 -> when designing your database schema.
2003.61 -> After discussing a few potential options,
2005.95 -> we ended up implementing
client-side compression.
2008.92 -> So compressing the data
2010.33 -> right before inserting into Keyspaces,
2012.97 -> and then decompressing it
again when it's read back out.
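A simple sketch of that kind of client-side compression, assuming the column was switched to (or wrapped into) a blob-friendly format; zlib stands in here for whatever codec was actually used.

```python
import zlib

def compress_pending(pending_json: str) -> bytes:
    # Compress the serialized pending-transactions list before writing it,
    # so the stored row stays under the 1 MB row-size limit.
    return zlib.compress(pending_json.encode("utf-8"))

def decompress_pending(blob: bytes) -> str:
    # Reverse the compression when the row is read back out.
    return zlib.decompress(blob).decode("utf-8")
```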
2016.9 -> After implementing that change,
2018.67 -> the write errors dropped to zero.
2024.55 -> So now that we had
confidence in the write path,
2027.01 -> the next phase was essentially
2029.2 -> doing the same thing for reads.
2031.81 -> Here, we modified the read path
2033.55 -> such that every time we
performed a read in Cassandra,
2037.15 -> we would perform the
same read in Keyspaces
2039.7 -> and compare the data.
2042.1 -> The data comparison was
especially important here
2045.01 -> because even though we had
confidence in the write path,
2049.09 -> we still didn't know for sure
if this dual write strategy
2052.96 -> would really be good enough
to keep the databases in sync.
2057.01 -> One challenge with the approach
2058.6 -> is that the writes between databases
2060.61 -> are not an atomic operation.
2063.1 -> So the way we were handling this
2064.81 -> was we would write to Cassandra first,
2067.48 -> and if that succeeded, we
were relying on our retries
2070.6 -> to make sure the data was
also written to Keyspaces.
2073.96 -> From our monitoring in the first phase,
2075.94 -> it looked like this was working,
2077.83 -> but we needed to validate
the data to be sure.
2085 -> So when we got the initial
results of this phase,
2088.33 -> things looked really good
actually right off the bat.
2092.29 -> The error rate was very low
2094.39 -> and similar to what we had in Cassandra,
2097.39 -> and the latency across the
90th and 99th percentiles
2101.14 -> was also very similar.
2103.9 -> When we started looking
2104.86 -> at the higher end latencies, however,
2106.69 -> the performance started to diverge
2109 -> with some very high latency
requests in Keyspaces.
2113.2 -> Looking at the max, you
can see that Cassandra
2116.17 -> was about at two and a half minutes,
2118.57 -> whereas Keyspaces was almost at two hours.
2123.37 -> Now, to be clear what
we were measuring here,
2125.47 -> this wasn't a single request
and response to Keyspaces.
2129.13 -> Rather, it was a series of requests
2131.71 -> iterating through all
of the different pages
2133.42 -> of a select query.
2135.4 -> We were monitoring it at this level
2137.14 -> because due to the schema change
2138.97 -> that we made in the first phase,
2140.74 -> the select queries between
databases were now different.
2144.13 -> So in order to get
2145.24 -> an apples to apples
comparison on the latency,
2147.37 -> we had to measure it at this level.
2152.89 -> Here's a quick example
of how the query changed.
2156.85 -> In Cassandra, we were
asking for all transactions
2160.18 -> for a list of accounts.
2162.07 -> And now, in Keyspaces, we
were doing the same thing,
2164.65 -> but with the added list of partition IDs.
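In CQL terms, the change was roughly the following; the table and column names are illustrative, not the production schema.

```python
# Cassandra-era query: one partition per account.
cassandra_cql = """
    SELECT * FROM financial_transaction
    WHERE user_id = ? AND account_id IN ?
"""

# Keyspaces query after adding bucket_id to the partition key: every account
# now has to be paired with all 10 bucket values.
keyspaces_cql = """
    SELECT * FROM financial_transaction
    WHERE user_id = ? AND account_id IN ? AND bucket_id IN ?
"""
bucket_ids = list(range(10))  # 20,000 accounts x 10 buckets = 200,000 partitions to scan
```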
2168.43 -> When we looked at the specific cases
2170.26 -> where this query was
taking a long time to run,
2172.96 -> we were actually very surprised to find
2175.21 -> that in certain cases, we were
asking for all transactions
2179.02 -> for up to 20,000 accounts at one time.
2183.76 -> What this meant was for
the Cassandra query,
2186.34 -> it would have to scan
2187.21 -> through 20,000 partitions on the backend,
2189.85 -> whereas Keyspaces,
2190.99 -> with the additional list of partition IDs,
2193.27 -> would have to scan through 200,000.
2196.72 -> To make matters worse, we
found that, at the time,
2200.32 -> the backend processing on Keyspaces
2202.69 -> was doing all of this work sequentially.
2207.34 -> So from this point, the Keyspaces team
2209.62 -> started working on an enhancement
2211.45 -> to do some of that backend
processing in parallel
2214.27 -> while we continued
investigation on our side.
2218.08 -> One strange thing that we had noticed
2219.97 -> was that although we were
asking for transactions
2222.49 -> for so many different accounts at once,
2224.92 -> the actual number returned
was only a few hundred,
2228.04 -> meaning that most of the
accounts had no data at all.
2232.15 -> When we dug a bit further,
2233.47 -> we confirmed that this
was actually the case.
2236.2 -> One of our other services
called Financial Account
2239.47 -> had a defect that was sometimes
inserting duplicate rows
2242.71 -> into its database.
2245.53 -> Because of the application
logic in that service,
2248.14 -> these extra rows
2249.1 -> were never getting
returned to our customers,
2251.5 -> but when they were used to
generate this transactions query,
2254.47 -> they weren't getting filtered out.
2258.01 -> So the end result of this
2259.33 -> was that both Keyspaces and Cassandra
2262.66 -> essentially were spinning
their wheels on the backend,
2265.12 -> looking through thousands of partitions
2267.01 -> that would never actually
contain any data,
2269.23 -> and this is what was
causing the high latency.
2272.83 -> After the Keyspaces team
2274.33 -> rolled out the enhancement on their end
2276.19 -> and after we fixed the defect on ours,
2279.1 -> that original two-hour query
dropped to about 30 seconds,
2283.21 -> putting it almost exactly
on par with Cassandra.
2290.17 -> So now that we had verified the latency
2292.96 -> and the error rate of the reads,
2294.85 -> the other important part of this phase
2296.77 -> was the data validation.
2299.32 -> The way that we implemented this
2300.82 -> was that every read that
we did in Cassandra,
2303.37 -> we would perform the
same read in Keyspaces
2305.77 -> and compare the data.
2307.96 -> The results of that were
logged, fed into a dashboard,
2311.26 -> and we reviewed that daily
2312.61 -> to make sure that the data
was being kept in sync.
2317.11 -> When we first rolled this to production,
2319.3 -> the report showed that
we had data mismatches
2322.18 -> in about 1% of the transactions.
2325.63 -> However, when we
spot-checked a few of these
2327.97 -> looking directly at the database data,
2330.16 -> we didn't see any mismatches there.
2332.98 -> What was happening was because
we were leveraging reads
2336.4 -> that were happening
within the application,
2338.83 -> the chances were fairly high
2340.54 -> that that same data was being
updated around the same time.
2344.35 -> And because we were doing
2345.73 -> asynchronous writes over to Keyspaces,
2348.07 -> sometimes the comparison was happening
2350.47 -> before Keyspaces was actually updated.
2354.55 -> So this was fairly easy to work around.
2357.43 -> Instead of leveraging the reads
2360.04 -> that were happening
organically in the application,
2362.53 -> we just set up a process
that ran off to the side
2365.68 -> that compared the data in the background.
2368.23 -> After making that change,
2370.27 -> the original 1% of
mismatches dropped to 0.01%.
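A stripped-down sketch of what such a background comparison can look like; the primary-key column names here are hypothetical.

```python
def compare_result_sets(cassandra_rows, keyspaces_rows):
    # Index both result sets by primary key, then diff row by row.
    def by_key(rows):
        return {(r.account_id, r.transaction_id): r for r in rows}

    left, right = by_key(cassandra_rows), by_key(keyspaces_rows)
    mismatches = [key for key, row in left.items() if right.get(key) != row]
    mismatches += [key for key in right if key not in left]
    return mismatches  # logged and fed into the daily dashboard
```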
2378.73 -> Looking at the remaining discrepancies,
2380.74 -> we found that those were actually due
2383.02 -> to data inconsistencies in
our Cassandra cluster itself.
2387.73 -> Cassandra's model of eventual consistency
2389.92 -> does allow data to remain inconsistent
2392.44 -> between replicas for some time.
2395.08 -> And in our case, this was exacerbated
2397.69 -> by the fact that we couldn't run repair.
2403.21 -> And because we were using
local quorum consistency
2406.15 -> when selecting the data out
of Cassandra for comparison,
2409.57 -> it was possible that these inconsistencies
2411.7 -> would show up on our mismatch report.
2415.06 -> The fix here was to use
consistency level 'ALL'
2418.54 -> when selecting the data out of Cassandra,
2421.36 -> which guaranteed that all three replicas
2423.49 -> would always be consulted,
2425.05 -> leading to consistent
and up-to-date results.
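With the Python driver, pinning that comparison read to ALL looks roughly like this; the table and column names are again illustrative.

```python
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

def read_for_comparison(cassandra_session, user_id, account_id):
    # Consistency ALL forces every replica to be consulted, so a stale replica
    # can no longer produce a false mismatch in the comparison report.
    stmt = SimpleStatement(
        "SELECT * FROM financial_transaction "
        "WHERE user_id = %s AND account_id = %s",
        consistency_level=ConsistencyLevel.ALL,
    )
    return list(cassandra_session.execute(stmt, (user_id, account_id)))
```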
2428.86 -> After making that change,
2430.48 -> the reported mismatches dropped to zero.
2434.65 -> So at this point, we had validated
2436.84 -> both the read and the write paths,
2439.33 -> and additionally, we had
created a point in time
2441.88 -> when we turned on the dual writes
2444.43 -> such that after that point in time,
2445.99 -> the databases were in sync.
2448.27 -> So the next phase was backfilling
all of the historical data
2452.05 -> that still resided only in Cassandra.
2454.78 -> And now, I'll turn it back to Manoj
2456.49 -> to talk through some of
the details of that phase.
2460.75 -> - Okay.
2462.46 -> Thank you, Jason.
2467.56 -> So now that we have all of
the dual writes and reads
2471.07 -> working in our production setup,
2473.08 -> it was time to focus on data migration.
2476.11 -> Now, for all the Cassandra
experts out here,
2479.08 -> the immediate question
that comes to your mind
2481.96 -> is "Why do we need a
data migration utility?
2486.04 -> Why not just stream the
data via replication
2489.37 -> from the current Cassandra cluster
2491.17 -> to the Keyspaces instance?"
2493.36 -> Well, there are a couple
of different reasons
2495.97 -> why we could not pursue that route.
2498.64 -> One, as we mentioned before,
2500.737 -> AWS Keyspaces is not Apache
Cassandra under the hood,
2504.82 -> so that was not a viable route.
2507.19 -> Two, our own production
cluster, Cassandra cluster,
2510.73 -> was maxed out on I/O operations bandwidth.
2514.21 -> So we were less than keen
2516.79 -> to let replication, which is
a heavy I/O operation,
2521.74 -> run on our production cluster
2524.35 -> and hog all of our production bandwidth,
2527.14 -> potentially disrupting
work for our customers.
2530.59 -> We would never let that happen.
2532.21 -> So with these reasons and
with these constraints,
2535.15 -> we chose to build out a
custom migration utility.
2539.83 -> And we used this utility one time
2542.02 -> to historically backfill
all the information.
2545.08 -> Now, even with that utility,
2547.51 -> we built in the right level of throttles
2550.06 -> such that we would maximize
the migration workloads
2555.49 -> only when our production workload
2558.46 -> or production traffic was low,
2560.77 -> and vice versa.
2565.75 -> And this way, we were able
to leverage the max capacity
2570.13 -> based on our production Cassandra cluster
2573.85 -> without creating any
problems for our customers.
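In spirit, that throttle is a rate limiter around the copy loop, something like the sketch below; the rates, statements, and bind order are made-up placeholders rather than the actual utility.

```python
import time

QUIET_HOURS_RATE = 5000  # rows/second when production traffic is low (placeholder)
PEAK_HOURS_RATE = 500    # rows/second during business hours (placeholder)

def backfill(cassandra_session, keyspaces_session, scan_stmt, write_stmt, is_peak):
    copied = 0
    start = time.monotonic()
    for row in cassandra_session.execute(scan_stmt):  # paged scan of the source table
        # Assumes write_stmt's bind markers line up with the source row's columns.
        keyspaces_session.execute(write_stmt, row)
        copied += 1
        # Throttle: never exceed the allowed copy rate for the current window.
        rate = PEAK_HOURS_RATE if is_peak() else QUIET_HOURS_RATE
        min_elapsed = copied / rate
        elapsed = time.monotonic() - start
        if elapsed < min_elapsed:
            time.sleep(min_elapsed - elapsed)
```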
2581.17 -> So this is a quick glimpse
2583.24 -> of how auto scaling works in Keyspaces.
2589.3 -> The blue line indicates
the provisioned capacity
2592.99 -> and the green line shows
the consumed capacity.
2597.19 -> Around 3:28 PM, in this
particular instance,
2601.78 -> the total write consumption starts to peak
2605.68 -> and the provisioned capacity
automatically scales up
2612.01 -> with just a few minutes of delay.
2612.01 -> So what this ensures
2613.24 -> is if you have your auto
scaling configuration set right,
2617.47 -> your system can seamlessly scale
2619.99 -> without creating any
kind of adverse impact
2623.08 -> for your customers or
production workloads.
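For reference, provisioned-capacity auto scaling on a Keyspaces table is configured through Application Auto Scaling. A boto3 sketch might look like the following, where the keyspace and table names and the capacity numbers are hypothetical, and the namespace, dimension, and metric strings should be verified against the current AWS documentation.

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Register the table's write capacity as a scalable target.
autoscaling.register_scalable_target(
    ServiceNamespace="cassandra",
    ResourceId="keyspace/ide_prod/table/financial_transaction",  # hypothetical names
    ScalableDimension="cassandra:table:WriteCapacityUnits",
    MinCapacity=5000,
    MaxCapacity=20000,
)

# Target-tracking policy: scale so consumed capacity stays near 70% of provisioned.
autoscaling.put_scaling_policy(
    PolicyName="financial-transaction-write-scaling",
    ServiceNamespace="cassandra",
    ResourceId="keyspace/ide_prod/table/financial_transaction",
    ScalableDimension="cassandra:table:WriteCapacityUnits",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "CassandraWriteCapacityUtilization"
        },
    },
)
```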
2630.22 -> So up until this point in time,
2632.89 -> our Cassandra cluster is primary,
2635.98 -> and everything that
we're doing on Keyspaces,
2638.77 -> be it reads, be it write,
2640.48 -> it's all in a simulated manner,
in an asynchronous fashion.
2645.97 -> And the asynchronous fashion ensured
2648.82 -> that if there were any
glitches on Keyspaces,
2652.09 -> that would not result in any
kind of production impact
2656.02 -> for our actual production system.
2659.14 -> We continue to do this,
2660.85 -> we continue to monitor
our system in this setup
2663.58 -> for an extended few sprints
2665.86 -> until we ensure there
is, one, 100% data parity
2670.03 -> between both the systems,
2671.5 -> and two, that we were able
to build a high confidence
2676.12 -> that the new system is up and running
2678.61 -> and in a steady, stable, mature state.
2681.64 -> And once we got to that
high confidence phase,
2684.4 -> then we started switching
over the reads and writes
2688.33 -> to be primary and
synchronous on Keyspaces,
2691.96 -> and then gradually we started
terminating or tapering off
2696.13 -> all of the reads and writes to Cassandra.
2698.77 -> So this is how we made the switch
2701.2 -> over an extended period of time
2703.57 -> iteratively working
through all these phases.
2708.88 -> To give you a glimpse
of the high level stats
2711.37 -> with respect to our
Keyspaces implementation,
2714.22 -> our peak-time writes are about 15,000 WCUs.
2719.98 -> Let me take a moment
to explain what WCU is.
2722.923 -> It's a write capacity unit in Keyspaces.
2726.07 -> One WCU represents one write per second
2730.36 -> for a row of up to one kilobyte in size
2733.84 -> using local quorum consistency.
2736.66 -> So if your average payload
size is three kilobytes,
2740.65 -> then each of those payload requests
2743.11 -> would translate to
three WCUs in Keyspaces.
2748.03 -> Peak reads were 140,000 RCUs.
2751.6 -> Similar to WCU, Read Capacity Unit
2754.99 -> represents one local
quorum read per second,
2759.46 -> or two LOCAL_ONE reads per second
2763.24 -> for a row of up to four kilobytes.
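Put as a small worked example, using only the definitions just given:

```python
import math

def wcus_per_write(row_size_kb: float) -> int:
    # 1 WCU = one LOCAL_QUORUM write per second for a row of up to 1 KB.
    return math.ceil(row_size_kb / 1.0)

def rcus_per_read(row_size_kb: float, consistency: str = "LOCAL_QUORUM") -> float:
    # 1 RCU = one LOCAL_QUORUM read per second for a row of up to 4 KB,
    # or two LOCAL_ONE reads per second for the same row size.
    units = math.ceil(row_size_kb / 4.0)
    return units if consistency == "LOCAL_QUORUM" else units / 2.0

print(wcus_per_write(3.0))              # 3 WCUs for a 3 KB payload, as in the talk
print(rcus_per_read(4.0, "LOCAL_ONE"))  # 0.5 RCU per LOCAL_ONE read of a 4 KB row
```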
2767.14 -> Our total data footprint
was about 150 terabytes,
which comprised more
than 100 billion records.
2779.41 -> Okay, where we are today.
2781.57 -> We have been fully live on
Keyspaces implementation
2785.2 -> for the last one year.
2788.08 -> We have had zero production incidents
2791.53 -> post migration to Keyspaces
2793.96 -> specifically around our persistence layer.
2797.41 -> Now, this was a huge feather
in the cap for our team,
2802.12 -> primarily because the database layer
2805.15 -> used to be our single biggest contributor
2808.27 -> towards all of our production incidents.
2812.5 -> We have also significantly improved
2814.57 -> on our operational metrics
2816.52 -> like reliability and
availability of our application.
2819.82 -> We've been able to avoid
sporadic read timeouts
2822.37 -> that we used to see on
Cassandra and so on.
2825.22 -> And we were also able to deliver
on all of our agility goals
2829.72 -> to enable better business capabilities,
2832.33 -> leveraging our platform
for our stakeholders.
2836.17 -> So this is all the content
that we wanted to share
2840.85 -> from Intuit.
2842.05 -> So let me invite Meet
and Jason back on stage
2847.668 -> to wrap the session up.
2850.36 -> - All right.
2852.19 -> Thank you, Manoj. Thank you, Jason.
2856.3 -> I think that wraps up what
we have for you all today.
2858.67 -> And again, appreciate you spending time
2860.41 -> and sticking around later in the evening.
2863.8 -> You'll see a link to
the survey on your app.
2865.96 -> If you came to the session,
you registered for the session,
2868.33 -> we really encourage
you to take the survey.
2869.95 -> That's how we learn,
2870.783 -> that's how we know what
to do better next year.
2872.92 -> So please take the survey,
let us know what you thought
2875.17 -> so we could keep getting
even better every year.