AWS re:Invent 2022 - How Intuit migrated Apache Cassandra workloads to Amazon Keyspaces (DAT327)

Intuit delivers global technology solutions and consolidates data from thousands of financial institutions to help power its products. The Intuit Data Exchange platform team wanted to simplify the management of its Apache Cassandra-based workloads to improve operational efficiency. Learn about their experience migrating more than 120 TiB of data to Amazon Keyspaces, while delivering high availability and reliability to their users, using a dual-write approach. Discover Amazon Keyspaces best practices and migration guidance.

Learn more about AWS re:Invent at https://go.aws/3ikK4dD.



Content

0.9 -> - All right, let's get rolling.
3.21 -> Welcome, everyone, to our session.
4.74 -> Firstly, a big thank you for coming to this evening session.
7.71 -> It's day three. It's late.
10.29 -> For those of you who made it, we really appreciate it.
12.6 -> So, you know, a big, big thank you
15.09 -> for coming this late in the day.
17.37 -> re:Invent is in person.
18.9 -> It's been virtual over the last two years.
20.64 -> It's definitely more fun doing this live
23.85 -> versus doing it sitting in the corner of my room
25.98 -> in front of a camera.
27.36 -> So it's exciting to see some of you live,
29.46 -> and I think, hope to keep doing this
32.1 -> in the next few years as well.
33.99 -> This is session DAT327.
36.39 -> We're gonna talk about how Intuit migrated
39.03 -> Apache Cassandra workload to Amazon Keyspaces.
42.54 -> We're also gonna give you a quick overview
44.25 -> of what Keyspaces is,
45.3 -> especially if you've never used it before.
47.91 -> With that, my name is Meet Bhagdev.
49.86 -> I am a principal product manager at AWS.
53.07 -> I lead product for Amazon Keyspaces.
55.44 -> I'm joined by some really talented folks from Intuit,
58.68 -> Manoj Mohan, who is an engineering leader at Intuit,
62.13 -> and Jason Lashmet, who is a principal software engineer
64.95 -> and a pioneer of some of the migration things
67.17 -> that you're gonna learn about on the technical side.
69.09 -> With that, let's get started.
72.12 -> We have an action-packed agenda for you.
75.57 -> We're gonna talk about what Cassandra is, what Keyspaces is.
79.32 -> We're gonna talk about our architecture,
81.6 -> the question that we hear a lot from a lot of our customers
84.09 -> is, "Okay, how are you really built?"
86.64 -> So we're gonna cover that just a little bit.
89.34 -> We're also gonna talk about how Intuit migrated to Keyspaces
93.42 -> and some of the benefits that they have experienced
96.45 -> from their migration to Keyspaces.
100.71 -> Before we talk about Keyspaces,
102.053 -> I wanna talk a little bit about Cassandra.
104.7 -> So Cassandra has been available,
107.1 -> I wanna say, for about 14 years.
109.41 -> Some of you may be using it in production today.
111.81 -> Some of you may have just heard about it.
114.63 -> It's a top-level project of the Apache Software Foundation.
116.82 -> It's one of the most widely used projects in the Apache ecosystem.
120.15 -> It's used for a variety of workloads.
122.19 -> Customers use it for any kind of applications
125.01 -> that need massive scale with low latency performance.
129.69 -> That's where Cassandra really comes in.
131.49 -> It also integrates really well with other projects
134.43 -> in the Apache community, such as Kafka and Spark.
139.08 -> And last but not the least,
140.1 -> it actually gives you a SQL-like query language
142.8 -> called the Cassandra Query Language.
144.84 -> It's a word play on SQL.
146.19 -> They call it CQL, which a lot of customers like to use,
149.49 -> especially given it's a NoSQL database.
151.32 -> If you've done relational before,
153.03 -> it gives you the best of both worlds
154.71 -> with the scale from NoSQL and the query language
158.34 -> from your maybe a relational background.
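
To make that point about CQL concrete, here is a small, hypothetical example of what a table definition and a query look like in CQL, held as Python string constants. The keyspace, table, and column names are invented for illustration and are not from the talk.

```python
# Hypothetical CQL, shown as Python string constants, to illustrate how
# SQL-like the Cassandra Query Language looks. The schema below is invented.
CREATE_TABLE_CQL = """
CREATE TABLE IF NOT EXISTS demo_ks.transactions (
    account_id text,
    txn_id     timeuuid,
    amount     decimal,
    memo       text,
    PRIMARY KEY ((account_id), txn_id)
)
"""

SELECT_CQL = """
SELECT txn_id, amount, memo
FROM demo_ks.transactions
WHERE account_id = ?
"""
```
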
161.58 -> With that, we do hear from customers
163.89 -> that managing Cassandra is still painful.
166.23 -> You know, it scales well, but comes with a cost, right?
169.41 -> You need to deal with backups,
171.21 -> you need to deal with dynamic scaling,
173.31 -> which talk about adding nodes, removing nodes.
176.25 -> This is painful.
177.42 -> Patching can sometimes cause downtime.
180.24 -> Version upgrades. Again, you gotta plan for that.
183.66 -> Some of these database operations
185.07 -> are really, really painful on Cassandra,
187.23 -> and that's where Keyspaces comes in.
189.9 -> Before I get into Keyspaces,
192.66 -> of all the folks here, can I get a show of hands
195.06 -> if you've used or heard of Keyspaces before?
198.99 -> All right, it's a decent number, and by the end of it,
201.24 -> for the folks who did not raise your hands,
202.74 -> hopefully, you could turn those to yeses as well.
206.25 -> But Keyspaces is AWS's fully-managed, Cassandra-compatible,
211.53 -> highly available serverless database service.
214.74 -> Now, I threw a lot of words at you.
216.63 -> You're probably wondering,
217.567 -> "What is this salesy pitch that you're talking about?"
220.77 -> We're gonna break it down into four kind of segments here.
224.37 -> The first one is Cassandra compatibility.
227.73 -> This is really important to us.
229.5 -> So if you're already using Cassandra today,
232.77 -> you can use Keyspaces with the same application tools
236.64 -> and drivers that you're using today.
239.49 -> So you don't have to learn a completely new set of tools,
242.49 -> libraries, SDKs, command line tools, Go tools, you know.
246.99 -> Pick your favorite tool that you use today.
249.24 -> Our goal is to make those work with Keyspaces.
252.33 -> Now, I have a full slide on how we define compatibility
255.6 -> and what it really means to be compatible,
257.31 -> and I'll get to that in just a bit.
259.95 -> The second key pillar that we like to talk about
262.71 -> is serverless.
264.63 -> And with serverless, you actually get true scale to zero.
268.89 -> And I call that out
270.27 -> because we hear from customers again and again and again
274.41 -> around what does serverless really mean, right?
276.72 -> So I'm gonna take a second and share
278.1 -> what we at Keyspaces think about serverless.
280.833 -> With serverless, there's obviously no instances.
283.05 -> You're not going to the console
284.4 -> and choosing R5 or C5 or any of that.
288.03 -> But in addition to that,
289.32 -> you don't even have to worry about capacity.
292.38 -> You just create a table,
294.21 -> you put reads and writes on that table,
296.49 -> we figure out all the underlying compute,
299.16 -> infrastructure, storage, et cetera for you.
303.15 -> Let's say you're not putting
304.17 -> any reads or writes on the system, you don't pay for that.
307.17 -> So you actually pay for what you use.
309 -> If you have zero reads and zero writes,
310.56 -> you pay zero dollars on Keyspaces,
312.3 -> giving you the true serverless promise of scale to zero.
317.55 -> Now, you might be wondering, "Okay, what about performance?"
320.4 -> Cassandra is really good at performance.
322.89 -> What about Keyspaces?
324.42 -> Well, we give you
325.253 -> the same single-digit-millisecond latency
328.44 -> on pretty much most of your queries at any scale.
332.43 -> So let's say you're going from gigabytes
334.86 -> to terabytes to petabytes,
337.44 -> your database is not slowing down.
339.57 -> Keyspaces is still horizontally distributed like Cassandra,
343.02 -> and you keep getting that single-digit-millisecond at scale
346.41 -> as your workload grows on Amazon Keyspaces.
350.58 -> Last but not the least, availability and security.
354.24 -> This is 'job zero' for us at AWS.
357.15 -> We provide four nines of SLA.
359.313 -> What this really means
360.45 -> is if you have an availability requirement
364.08 -> for any tier zero apps, that's kind of a checkbox.
367.11 -> We provide that.
367.943 -> You get four nines of SLA within a single region.
371.46 -> We keep three copies of your data
373.74 -> across three availability zones.
376.89 -> You only pay for one,
377.88 -> but we make three copies for durability.
380.4 -> And we offer features such as encryption at rest
383.31 -> and encryption in transit by default.
385.92 -> So you don't have to go and set any of that up.
388.83 -> You just create your first table
390.51 -> and you get encryption at rest and encryption in transit
393.84 -> right out of the box.
397.23 -> Now, your next question might be, "Who is using Keyspaces?
401.22 -> And what are the use cases?"
403.83 -> We'll start with Amazon.
404.97 -> We use Keyspaces for internal workloads
406.86 -> for a variety of use cases.
408.84 -> Externally, you're gonna hear from Intuit very shortly
412.23 -> why they use Keyspaces.
413.94 -> We also have customers such as Experian and PWC
417.93 -> who use it for financial tech applications.
420.78 -> We are starting to see additional workloads
422.85 -> in different verticals such as Time Series and Graph,
426.6 -> which is definitely newer for us at Keyspaces,
429.48 -> being in the market for about three years now.
431.88 -> So the point I'm trying to make here
434.1 -> is that it's general purpose.
435.69 -> We've seen customers across verticals and segments
439.35 -> adopt Keyspaces.
441 -> I've shown some logos here, but that's not the entire list.
443.52 -> You can click on that link below,
444.99 -> you'll see some more examples.
446.85 -> We also have some customers
447.99 -> that publicly aren't referenced yet,
449.79 -> but soon will be added over the next few months.
452.07 -> So keep an eye on that page that I've linked below.
457.44 -> Now, let's talk about a very popular and a favorite topic
460.86 -> from a lot of our customers around Cassandra compatibility.
465.69 -> So what do we mean by Cassandra-compatible?
467.49 -> And I'm gonna raise my hand and say upfront,
469.297 -> "We're not 100% Cassandra-compatible."
472.08 -> There are some features
474.18 -> that we don't support that Cassandra supports.
476.37 -> But our goal is to be 100% compatible
478.59 -> with what our customers are using.
480.72 -> What that really means
481.62 -> is we work backwards from your requirements.
484.26 -> Some examples of features that we've added in the last year
487.41 -> include the Spark connector support, TTL support.
491.91 -> Those were some feature asks
493.32 -> from customers who use Cassandra heavily,
495.39 -> wanted to use Keyspaces but couldn't
497.43 -> due to these missing gaps.
499.14 -> So we're always working backwards from those customers.
501.63 -> We have an action-packed roadmap
503.19 -> for the rest of Q4 in December and Q1 and Q2
506.85 -> where we're gonna launch additional compatibility,
508.77 -> especially if you're using query APIs
511.65 -> that we don't support today.
513.69 -> We've also built a tool that you can run
516.72 -> against your source Cassandra cluster,
519.24 -> and it'll spit out which features you're using today
522.81 -> that may not be compatible with Keyspaces.
525.99 -> So you don't actually have to go and try a POC,
528.48 -> spend a lot of time, find a surprise.
531 -> We tell you this upfront.
532.74 -> If you're interested in using this tool,
534.24 -> come talk to us after,
535.41 -> and I'll be happy to connect you with the right folks.
538.26 -> Versions, so we are compatible with version 3.11.2,
542.91 -> backward compatible up to that version.
545.25 -> I know there have been additional versions
547.05 -> out there in Cassandra.
548.25 -> 4.0 is available. 4.1 is in preview slash release candidate.
552.9 -> So we're committed to supporting additional versions.
555.69 -> Where we see critical mass today
557.28 -> is up to that 3.11 kind of version range,
560.97 -> So we're compatible up to that version.
563.28 -> All existing tools and drivers, pick your favorite language,
566.49 -> Python, Node.js, Java, Perl, Ruby,
570.12 -> existing Cassandra drivers for these languages
572.64 -> or DataStax drivers
573.66 -> continue to work with Keyspaces with little to no change.
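
As a rough sketch of what "little to no change" can look like in practice, here is how the open-source Python driver (cassandra-driver) is typically pointed at Keyspaces: a TLS connection to the regional endpoint on port 9142 with service-specific credentials. The region, certificate path, and credentials below are placeholders, and this is illustrative rather than Intuit's setup.

```python
import ssl

from cassandra.auth import PlainTextAuthProvider
from cassandra.cluster import Cluster

# TLS is required by Keyspaces; AWS publishes the root certificate to trust.
ssl_context = ssl.SSLContext(ssl.PROTOCOL_TLSv1_2)
ssl_context.load_verify_locations("sf-class2-root.crt")  # placeholder path
ssl_context.verify_mode = ssl.CERT_REQUIRED

# Service-specific credentials generated for an IAM user (placeholders here);
# a SigV4 authentication plugin is another common option.
auth_provider = PlainTextAuthProvider(username="alice-at-111122223333",
                                      password="example-password")

cluster = Cluster(["cassandra.us-east-1.amazonaws.com"], port=9142,
                  ssl_context=ssl_context, auth_provider=auth_provider)
session = cluster.connect()
print(session.execute("SELECT release_version FROM system.local").one())
```
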
579.27 -> Another very popular question we get
581.28 -> is "What about tombstones? What about repair?
585.6 -> How does garbage collection work?
587.79 -> Or how does compaction work?"
589.92 -> So the good news is, as a developer,
592.65 -> you actually don't have to worry about this on Keyspaces.
595.8 -> What I was talking about earlier on the architecture side,
598.5 -> and I'll cover that just in a bit,
601.17 -> we don't completely use Apache Cassandra under the hood.
604.53 -> We use parts of it,
605.94 -> but we've also built our differentiated storage architecture
609.63 -> that makes some of these concepts redundant
612 -> or not applicable.
613.32 -> And if they do apply, for instance, garbage collection,
615.6 -> obviously, there's some version of it
617.28 -> that also applies to Keyspaces,
619.95 -> something you don't have to worry about.
621.42 -> We take care of it for you completely
624.24 -> without any application impact,
626.7 -> or a setting that you need to tune, adjust, or deal with.
630.72 -> So next time a developer at your company asks you,
632.887 -> "Hey, tombstones, how do I deal with it?"
636.09 -> And if you're using Keyspaces,
637.35 -> the answer is, "Hey, you don't have to deal with it.
639.33 -> It's done for you."
640.62 -> You keep driving traffic, terabytes, petabytes,
643.38 -> millions of reads, millions of writes.
646.235 -> It's just taken care of right off the bat.
650.97 -> And now, let's talk about the architecture.
653.01 -> How does this really work?
654.18 -> How have you gotten rid of tombstones?
655.8 -> How can we get serverless? How do we get petabyte scale?
660.96 -> The key point to talk about here
663.39 -> is that compute and storage in Keyspaces
667.26 -> are completely decoupled.
670.14 -> So your nodes on Cassandra may be data bearing.
673.74 -> They're heavy, they're data heavy.
675.63 -> On Keyspaces, that's not the case.
678.24 -> On Keyspaces, your storage partitions
681.3 -> continue to grow in 10-gig increments.
684.3 -> So let's say you have 10 gigabytes of data today,
687.42 -> your workload grows to 20, 30, 40,
689.58 -> your company's wildly successful,
691.23 -> you're getting into terabytes,
692.43 -> hundreds of terabytes into petabytes,
695.31 -> on your partition key, we use sharding under the hood.
697.95 -> We keep scaling your storage partitions
699.96 -> as your data size grows.
702.09 -> This is automatic.
703.95 -> You don't have to go to the console,
705.72 -> click a button, or take any application downtime.
710.55 -> Now, let's talk about the query processing layer.
713.07 -> That's the compute layer
714.3 -> or what we like to call the compute layer.
716.61 -> This is where your reads and writes go.
718.56 -> This layer also scales independently of storage.
721.5 -> So let's say you have very thin storage,
723.87 -> maybe you have just a couple of gigs of data,
725.88 -> but you have thousands of reads and writes per second
728.1 -> or millions of reads and writes per second,
730.14 -> you're able to achieve that with our separation.
733.59 -> Your compute layer also scales automatically
735.93 -> on your partition key,
737.37 -> and you don't have to worry
738.6 -> about scaling your reads or writes either.
741.24 -> You can let Keyspaces take care of that for you
743.82 -> without any application impact.
746.79 -> So that's our architecture in a nutshell.
748.29 -> And if you wanna go deeper,
750 -> we'll be happy to, like I said, stick around or take Q and A
752.82 -> towards the end of the session.
756.03 -> Now, let's talk about serverless.
759.21 -> The key point I wanna call out here on serverless
761.73 -> is what it really means.
764.79 -> And we talked about scale to zero,
766.23 -> but how do you achieve that?
768.24 -> So in Keyspaces, there are two pricing modes,
771.54 -> and this often trips up a lot of customers
773.31 -> because if you're coming from a Cassandra world,
775.89 -> you're so used to vCPUs and memory,
778.71 -> that it's a little bit different on Keyspaces.
781.756 -> So in Keyspaces, by default,
783.87 -> you get started on on-demand capacity.
787.35 -> What this means is that you actually don't provision
789.99 -> any amount of compute.
792.3 -> You just go create a table, you put workload on your table,
797.28 -> and over time, Keyspaces learns how much capacity you need.
800.67 -> And it's not magic.
801.96 -> You know, we learn about your workload,
803.58 -> we see your average reads and writes, peak reads and writes,
806.01 -> and we provision the underlying infrastructure for you
808.98 -> as your workload grows.
810.93 -> Now, you may be a customer
812.13 -> who may want some more predictability.
815.52 -> We also give you an option
817.32 -> to provision a certain amount of throughput,
819.27 -> and this is not instances.
820.95 -> You're provisioning a certain number of reads
823.02 -> and a certain number of writes.
825.06 -> That's the second pricing mode,
826.38 -> and that's what we call provisioned capacity.
828.81 -> The good news with provisioned capacity
830.61 -> is that you can still use auto scaling,
832.95 -> so you can set up a minimum bound and a maximum bound,
836.1 -> and we will scale within those bounds of reads and writes.
839.97 -> Everything you're doing on Keyspaces
841.8 -> is in terms of reads and writes.
843.63 -> There are no instances, there's no RAM, there's no CPU.
847.44 -> You directly get billed for your database operations
850.68 -> in reads and writes.
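
For reference, capacity modes are set per table through the Keyspaces CUSTOM_PROPERTIES extension to CQL. The statements below are a sketch, with made-up keyspace/table names and numbers, and the exact property names are worth verifying against the current Keyspaces documentation.

```python
# On-demand (pay-per-request) table: no throughput numbers to manage.
CREATE_ON_DEMAND = """
CREATE TABLE demo_ks.events (id text PRIMARY KEY, payload text)
WITH CUSTOM_PROPERTIES = {
    'capacity_mode': {'throughput_mode': 'PAY_PER_REQUEST'}
}
"""

# Provisioned table: explicit read/write capacity units, which auto scaling
# can then adjust between a minimum and maximum bound.
ALTER_TO_PROVISIONED = """
ALTER TABLE demo_ks.events
WITH CUSTOM_PROPERTIES = {
    'capacity_mode': {
        'throughput_mode': 'PROVISIONED',
        'read_capacity_units': 3000,
        'write_capacity_units': 1000
    }
}
"""
```
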
853.14 -> That's enough about Keyspaces and enough from AWS.
855.51 -> I'd like to now bring on Manoj and Jason,
858.72 -> and they'll kind of talk to you guys
860.37 -> about how Intuit migrated
862.59 -> from Apache Cassandra to Keyspaces.
865.098 -> Pass it on to you, Manoj.
867.93 -> - Hello.
873.15 -> Thanks, Meet.
875.13 -> Hello, everyone. I'm Manoj.
877.41 -> I lead a platform engineering team at Intuit.
880.65 -> So myself and Jason, we are excited to share
883.98 -> our Keyspaces journey with all of you.
890.34 -> So this is a quote we had published earlier
893.73 -> in an AWS case study.
895.74 -> Our prior state used to be a custom Cassandra cluster
899.52 -> running on AWS.
901.83 -> In this prior state, we would need to do extensive planning
905.97 -> every time we have to add additional capacity
909.27 -> for organic growth in our cluster.
911.97 -> And this extensive planning
913.56 -> would translate to at least a few weeks of lead time.
917.49 -> Now, with Keyspaces as a managed service,
920.52 -> we are able to accomplish all of this in a single day
923.91 -> and save on all of this energy and bandwidth
926.85 -> and repurpose that towards building business capabilities
930.42 -> in our platform for our stakeholders.
935.28 -> Quick introduction about Intuit for folks who don't know.
938.79 -> Intuit is a global fintech company
941.94 -> and our mission is to drive financial prosperity
945.45 -> around the world.
946.86 -> We do this by building AI-driven expert platforms
951.27 -> that enable our customers to better manage their finances,
956.58 -> be it tax solutions using TurboTax,
959.91 -> or be it personal finance solutions
962.76 -> using Credit Karma or Mint,
965.43 -> or accounting automation for small businesses
968.58 -> using QuickBooks,
970.05 -> or even marketing automation, leveraging Mailchimp.
974.34 -> We continue to innovate
976.26 -> building best-of-breed financial solutions
979.14 -> and are currently serving a cumulative
982.95 -> installed base of over 100 million customers globally.
989.04 -> Quick introduction of our team.
991.02 -> Our team is known as the Intuit Data Exchange platform team.
996.06 -> Our charter is to acquire information
999.6 -> working with several different financial institutions
1002.87 -> and data providers globally
1005.54 -> as a one stop shop for all of Intuit.
1008.75 -> So we go ahead and build data agents
1013.79 -> that acquire this information on a continuous basis,
1017.06 -> and then we pass it through
1018.38 -> several different cleansing routines
1020.42 -> and then make this information readily available
1023.99 -> in the form of services for all of the Intuit products
1027.56 -> in the specific use cases of those products
1030.47 -> for our customers.
1034.52 -> Now, this is a very high level overview
1038.36 -> of the landscape of our platform.
1040.82 -> Now, as I mentioned,
1042.53 -> what you see on the right side of the diagram
1044.93 -> are all of the different integrations
1047.66 -> and capabilities that we have built
1050.03 -> in the form of custom agents to extract this information.
1054.26 -> Once we extract this information,
1056.42 -> this passes through several different routines
1059.3 -> like cleansing routines, curation routines,
1062.06 -> enrichment routines with ML and so on.
1064.94 -> And finally, all of that data gets persisted
1068.96 -> in what used to be our Cassandra cluster.
1072.29 -> To the left side of the diagram
1074.27 -> are all of the Intuit products
1076.37 -> that are requesting specific information
1079.34 -> in specific use cases in the form of several different APIs
1083.12 -> and services from our one-stop platform across the board.
1088.43 -> Now, all of the different logical application components
1092.51 -> are containerized and deployed on Kubernetes pods
1096.32 -> for seamless scalability.
1098.93 -> Please note that this is just a very high level forest view
1102.89 -> of our landscape of our platform
1105.71 -> without delving into the finer details
1108.11 -> in the interest of time.
1112.58 -> Let me pass it on to Jason
1114.11 -> to walk through the next few slides.
1118.4 -> - Thanks, Manoj.
1120.95 -> So our group has been running a Cassandra cluster now
1123.47 -> for going on eight years.
1126.05 -> Our use cases primarily involve
1127.85 -> storing and retrieving large volumes of financial data.
1131.45 -> And since we don't require complex queries or table joins,
1135.38 -> we chose Cassandra primarily for its horizontal scalability
1139.04 -> and its performance.
1141.17 -> And I think that for the most part,
1143.18 -> Cassandra has delivered on these aspects,
1145.64 -> but it's certainly not without its downsides.
1150.83 -> Operationally, it requires some significant effort.
1154.01 -> There are administrative and monitoring tools
1156.17 -> to install and maintain,
1158 -> a backup strategy needs to be put in place,
1160.61 -> and things like version upgrades, scaling out the cluster,
1164.39 -> these all need significant planning and execution
1167.78 -> ideally by someone with that specialized skill set.
1172.61 -> In our group in particular,
1174.17 -> we had the rather unfortunate situation
1177.14 -> of having multiple periods of time
1179.36 -> where we didn't have a DBA assigned to these tasks.
1183.05 -> We ended up going so long without running
1185.54 -> some of the important maintenance jobs like repair
1188.72 -> that we were actually told by DataStax
1190.97 -> that starting them up again could impact the entire cluster.
1196.31 -> So towards the middle of 2019,
1198.74 -> one of our team members was starting to look into DynamoDB
1202.1 -> as a possible fit for our use cases.
1205.25 -> And while it looked promising, it would've taken
1208.28 -> some significant application level changes on our side
1211.7 -> to really integrate with it and test it out.
1215.48 -> So later that year, when Keyspaces was announced
1218.27 -> as more of a drop in replacement for Cassandra,
1221.33 -> we were very eager to try that out.
1224.54 -> We ended up taking one of our services in particular
1227.18 -> called financial transaction,
1229.43 -> which had by far our largest volume of data
1232.88 -> and highest processing requirements.
1235.61 -> Around that time,
1236.57 -> I think it had around 60 billion rows in the database
1240.92 -> and was serving anywhere from 2,000 to 8,000 API calls a second.
1246.53 -> So we chose this service to do a proof of concept
1249.68 -> with the idea that if Keyspaces performed well
1253.01 -> with its volume,
1254.33 -> then it would also be suitable
1255.8 -> for any of our other services.
1260.553 -> - Excuse me.
1264.59 -> So in summary, our platform,
1267.35 -> specifically our persistence layer
1269.72 -> was doing all of the essential things expected
1272.84 -> with some additional maintenance overheads.
1277.07 -> We were able to take all of our existing workloads
1279.8 -> up until that point in time.
1281.6 -> However, the Intuit product landscape
1284.69 -> was changing with newer acquisitions
1287.18 -> like Credit Karma and Mailchimp,
1290.3 -> and this meant our own platform needs continued to grow.
1294.98 -> Our workloads were quadrupling over time
1299.03 -> and there was a significant increase
1301.67 -> in the variance of traffic
1303.38 -> that we were seeing on our platform.
1305.66 -> And amid all of this, our database was
1309.11 -> the single biggest choke point
1311.63 -> in the platform ecosystem.
1315.74 -> So all of our existing workloads were more along the lines
1318.83 -> of the small aeroplane that's depicted out here.
1321.53 -> But we were now looking to upgrade our persistence layer
1326.21 -> from being this small aeroplane
1328.34 -> that could take only that minuscule amount of traffic
1331.52 -> to being more like a Boeing 777
1335.15 -> that could really push the boundaries of scale.
1341.63 -> So now that we have the problem statement defined
1345.08 -> and called out clearly, we wanted to strategize
1348.2 -> on how we roadmap towards a solution.
1351.29 -> In order to do that, we prioritized four key aspects
1355.97 -> as we were working through this.
1357.83 -> One, we wanted to leverage persistence as a managed service.
1362 -> This way, we could do away
1363.98 -> with all of the operational
1365.9 -> and administrative overheads.
1368 -> Two, we wanted to stay focused
1370.19 -> on delivering business agility for our customers.
1374.09 -> We did not want our database layer
1377.12 -> to be the slowest moving block in our platform ecosystem.
1381.53 -> Three, we wanted to dynamically
1384.05 -> scale up or scale down all of the capacity
1388.19 -> so that we could keep our AWS costs
1391.22 -> optimal in the longer term.
1393.5 -> And four, we were also keen not to overhaul
1397.58 -> any of our application or data service contracts
1401.27 -> because this way, we could focus
1403.61 -> on this being a seamless upgrade
1406.46 -> without having any sort of adverse impact
1409.55 -> for our stakeholders of the platform.
1412.13 -> So for all of you who are thinking,
1414.147 -> "Did this team really evaluate DynamoDB
1417.47 -> as a potential option," there lies the answer.
1420.65 -> This was the reason why we were inclined
1423.14 -> to move ahead with Keyspaces as our first priority.
1430.19 -> Looking back through our journey,
1432.44 -> there were three key decisions that enabled us
1436.46 -> to get to this successful state
1439.31 -> of launching Keyspaces.
1441.62 -> What do I mean? Let me talk through.
1444.2 -> One, starting all the way from the beginning,
1448.1 -> working with AWS, we knew that Keyspaces
1452.57 -> was not Apache Cassandra under the hood.
1455.9 -> What this also meant is we had to ensure
1459.65 -> that there was a high level of parity
1462.38 -> functionally as well as non-functionally
1465.08 -> to ensure that there is zero to minimal adverse impact
1469.34 -> on our application layer.
1471.32 -> Two, we wanted to simulate this entire setup
1475.61 -> in production in an iterative way.
1478.76 -> Our platform is extremely mission critical to all of Intuit,
1483.95 -> and we did not want to sign up for a higher risk quotient
1488.51 -> given the critical role we play in the Intuit ecosystem.
1492.5 -> So in terms of a risk mitigation strategy,
1495.65 -> we wanted to continuously iterate
1497.87 -> and figure out and get to our final state
1500.69 -> without it being a zero to one transition.
1504.62 -> The third one, we wanted to build out
1507.14 -> a strong partnership with AWS engineering
1510.35 -> and all our cross-functional teams.
1512.87 -> Now, in my opinion,
1514.22 -> this was probably one of the most important investments
1519.11 -> we made as we embarked on this journey.
1522.14 -> This helped us set the right expectations
1524.72 -> all the way from the beginning, enabled us to lean in on AWS
1530.51 -> as we ran into different sets of roadblocks
1533.15 -> and get help right away.
1535.43 -> It seemed as if both our teams were connected at the hip
1540.05 -> all the way from the beginning 'til the end.
1546.47 -> Okay, now that we have the problem statement,
1549.41 -> followed by the strategy and priority
1551.99 -> around what we are solving for,
1553.91 -> we wanted to now make sure
1556.13 -> that we break down the entire goal
1558.8 -> into a sequential logical set of milestones or phases
1563.33 -> so that we are dividing and conquering it appropriately.
1566.96 -> So we came up with four different phases
1569.93 -> in terms of getting to our end state.
1572.39 -> Phase one is all about dual writes.
1575.66 -> What do I mean by dual writes?
1577.67 -> We wanted to simulate all of our write traffic,
1581.66 -> the actual production write traffic onto Keyspaces
1586.34 -> in a simulated manner actually running in production.
1590.36 -> This enabled us to stay focused on solving
1593.06 -> for all of the write-related challenges,
1595.19 -> be it write throughput, write performance, optimizations,
1599.3 -> et cetera, et cetera.
1600.98 -> So that was our focus of phase one.
1603.62 -> Phase two, we did the exact same thing with reads.
1607.85 -> We wanted to make sure
1609.17 -> that we are simulating reads from Keyspaces
1612.26 -> while our production is still the Cassandra cluster,
1615.77 -> and ensure that we are comparing apples to apples
1619.19 -> between our current state of production
1621.32 -> and the new simulated reads from Keyspaces.
1625.49 -> Phase three was focused on data migration.
1629.15 -> So data migration
1630.77 -> was about the one time historical data backfill
1634.85 -> from Cassandra to Keyspaces.
1638.45 -> Dual writes
1639.283 -> was for all of the ongoing writes that's happening,
1642.2 -> and the data migration was only the one time migration.
1646.79 -> Now, given that we were one
1648.47 -> of the early adopters of Keyspaces,
1650.78 -> there was no data migration utility.
1653.12 -> So working with AWS, we chose to build out
1656.9 -> a custom migration utility.
1659.78 -> And phase four was all about data parity
1663.29 -> and then, cut over of traffic.
1665.78 -> After phase one, two and three,
1668.15 -> we had all of the writes, we had all of the reads,
1670.91 -> and now, it was time for us to ensure
1672.92 -> there was 100% data parity
1675.5 -> at the row level, at the column level,
1677.51 -> between the old system and the new system.
1680.36 -> And then once we had guarantees on the data parity,
1684.29 -> we slowly, gradually moved our users in batches
1689.21 -> from the old system to the new system.
1692.06 -> So that was the approach we took.
1695.3 -> Now, to deep dive into each of these phases
1698.15 -> and explain all of the challenges,
1700.07 -> let me pass it on to Jason.
1715.4 -> - So the first step that we took
1716.57 -> was to implement dual writes to Amazon Keyspaces
1720.4 -> and release into production.
1723.112 -> We had validated in our non-prod environment
1725.242 -> that the basic functionality works,
1727.357 -> but we really wanted to expose Keyspaces
1729.2 -> to our production workload with all of its edge cases
1734.347 -> and all of its (speaks faintly)
1736.321 -> to really make sure that under those conditions,
1740.291 -> the latency and error rate was comparable to Cassandra.
1744.71 -> Keeping with the principle of zero customer impact
1747.47 -> during this evaluation,
1749.51 -> we made the writes to Keyspaces fully asynchronous
1752.9 -> and added the appropriate safeguards,
1755.18 -> so that any issues with that write path
1757.58 -> would not affect any of our user requests.
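
A minimal sketch of what an asynchronous "shadow" write with safeguards can look like with the Python driver: the primary Cassandra write stays on the request path, while the Keyspaces write is fired asynchronously and any failure is only logged. The session and statement objects are placeholders, and this is illustrative rather than Intuit's actual code.

```python
import logging

log = logging.getLogger("dual_write")

def dual_write(cassandra_session, keyspaces_session, insert_stmt, params):
    # Primary write: synchronous, and failures surface to the caller exactly as before.
    cassandra_session.execute(insert_stmt, params)

    # Shadow write: fire-and-forget, so problems on the Keyspaces path never fail
    # a user request; errors are only recorded for the evaluation dashboards.
    future = keyspaces_session.execute_async(insert_stmt, params)
    future.add_errback(lambda exc: log.warning("Keyspaces shadow write failed: %s", exc))
```
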
1761.63 -> When we first rolled this out,
1763.46 -> we actually saw a very high error rate.
1766.49 -> Around 40% of our writes were failing.
1769.91 -> And even after implementing
1771.47 -> some of the recommended best practices
1773.54 -> like retries with exponential backoff,
1776.15 -> we still had a significant error rate
1778.61 -> as well as some high latency now because of those retries.
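
For completeness, the "retries with exponential backoff" best practice mentioned here is roughly the following pattern. This is a generic, illustrative sketch, not the team's implementation; a production version would catch the driver's throttling and timeout exceptions specifically rather than a bare Exception.

```python
import random
import time

def execute_with_backoff(session, statement, params, max_attempts=5):
    """Retry a request with capped exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return session.execute(statement, params)
        except Exception:  # illustrative; narrow this to throttling/timeout errors
            if attempt == max_attempts - 1:
                raise
            # 50 ms, 100 ms, 200 ms, ... capped at 1 s, with random jitter.
            delay = min(0.05 * (2 ** attempt), 1.0) * (0.5 + random.random() / 2)
            time.sleep(delay)
```
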
1783.26 -> At this time, we were meeting regularly
1785.54 -> with the Keyspaces team,
1787.34 -> and we worked with them closely to dig into the issue.
1791.69 -> What we found was that we were hitting a rate limit.
1796.79 -> So Keyspaces has a few different levels of rate limiting.
1800.15 -> There's an overall limit for the AWS account.
1803.78 -> There is a limit per table.
1805.64 -> And we were aware of these and our monitoring showed
1808.79 -> that we weren't anywhere near hitting those limits,
1812.06 -> but there is another limit when reading and writing
1815 -> to an individual partition within a Keyspaces table.
1819.65 -> You can see above, our most heavily used table
1822.83 -> was storing all financial transactions
1825.08 -> for a given user and account inside a single partition.
1829.46 -> This meant that if we had a user
1831.77 -> with a particularly active account,
1834.14 -> we could easily get into situations
1836.21 -> where a single request coming into our service
1839.42 -> would result in thousands of rows
1841.52 -> being inserted into this single partition in Keyspaces.
1845.75 -> And when we do this,
1846.83 -> we do make multiple parallel requests to Keyspaces
1850.4 -> to try and do this as quickly as possible.
1853.22 -> And while doing so,
1854.053 -> we were just repeatedly hitting this rate limit.
1858.35 -> So working with one of the solutions architects,
1860.66 -> we did some ballpark math around how long it would take
1864.53 -> to insert different amounts of data given this limit,
1868.13 -> and quickly came to the realization
1870.08 -> that given the way that our data was partitioned,
1872.78 -> we just weren't going to be able to meet
1874.31 -> our latency objectives.
1876.32 -> Again, the core issue was that we were trying to insert
1879.38 -> too much data too quickly into a single partition.
1884.93 -> So the recommended solution was fairly straightforward.
1887.9 -> We added another column to our partition key,
1890.99 -> which essentially spread the transactions
1893.18 -> for a given account over multiple partitions.
1896.99 -> Based on our volume, we found that using 10 partitions
1900.11 -> should give us throughput
1901.28 -> comparable to what we had in Cassandra.
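
In code, spreading one account's rows over a fixed number of partitions usually comes down to deriving a small, stable bucket value and making it part of the partition key. The hashing choice, bucket count, and the table/column names below are illustrative assumptions, not the actual Intuit schema.

```python
import hashlib

NUM_BUCKETS = 10  # the talk mentions roughly 10 partitions per account

def partition_bucket(txn_id: str) -> int:
    """Stable bucket for a transaction so one account's writes fan out over NUM_BUCKETS partitions."""
    digest = hashlib.md5(txn_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_BUCKETS

# Hypothetical insert: partition_bucket is now part of the partition key.
INSERT_CQL = """
INSERT INTO exchange.financial_transaction
    (user_id, account_id, partition_bucket, txn_id, payload)
VALUES (?, ?, ?, ?, ?)
"""
```
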
1903.92 -> And when we implemented and rolled this out,
1905.87 -> we saw very good results.
1908.27 -> The latency was now within a few percent of Cassandra
1912.68 -> in all the percentiles that we were measuring,
1915.32 -> and the error rate had dropped to almost zero.
1922.34 -> I say almost zero because even after making this change,
1926.45 -> there were a small number of updates
1928.52 -> that were still continuously failing.
1931.7 -> When we looked at the error messages,
1934.43 -> it looked like they were all the same message,
1936.38 -> which was we were exceeding the maximum row size
1939.74 -> of one megabyte.
1942.32 -> When we dug in to why this was happening,
1944.33 -> we found it was all due to one particular use case
1947.63 -> that we have around pending transactions.
1950.99 -> Because of the way that these are processed in our system,
1954.32 -> we store pending transactions differently
1956.69 -> than we do our regular posted transactions.
1959.6 -> Rather than putting each one in its own database row,
1962.78 -> we take the whole list of pending transactions
1965.36 -> for a given account, serialize it to a JSON string,
1968.9 -> and stick that into a text field inside one row.
1973.13 -> Normally, this list is very small,
1975.8 -> but for a few of our larger customers,
1978.44 -> there were several hundred of these.
1980.36 -> And in this case,
1982.22 -> the resulting JSON string exceeded the one megabyte.
1987.29 -> So we did check at this time with the Keyspaces team
1990.38 -> to see if we could just increase this limit
1992.45 -> because some of these things are adjustable,
1995.09 -> but this is a hard limit,
1996.83 -> so definitely something that needs to be kept in mind
1999.41 -> when designing your database schema.
2003.61 -> After discussing a few potential options,
2005.95 -> we ended up implementing client-side compression.
2008.92 -> So compressing the data
2010.33 -> right before inserting into Keyspaces,
2012.97 -> and then decompressing it again when it's read back out.
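
Client-side compression of that pending-transactions list is conceptually just a pair of transforms around the write and the read. This sketch assumes zlib-compressed JSON stored in a blob-style column; the actual format and column type used by the team aren't covered in the talk.

```python
import json
import zlib

def compress_pending(transactions):
    """Serialize the pending-transactions list and compress it before writing one row."""
    return zlib.compress(json.dumps(transactions).encode("utf-8"))

def decompress_pending(blob):
    """Inverse transform applied when the row is read back out."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))
```
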
2016.9 -> After implementing that change,
2018.67 -> the write errors dropped to zero.
2024.55 -> So now that we had confidence in the write path,
2027.01 -> the next phase was essentially
2029.2 -> doing the same thing for reads.
2031.81 -> Here, we modified the read path
2033.55 -> such that every time we performed a read in Cassandra,
2037.15 -> we would perform the same read in Keyspaces
2039.7 -> and compare the data.
2042.1 -> The data comparison was especially important here
2045.01 -> because even though we had confidence in the write path,
2049.09 -> we still didn't know for sure if this dual write strategy
2052.96 -> would really be good enough to keep the databases in sync.
2057.01 -> One challenge with the approach
2058.6 -> is that the writes between databases
2060.61 -> are not an atomic operation.
2063.1 -> So the way we were handling this
2064.81 -> was we would write to Cassandra first,
2067.48 -> and if that succeeded, we were relying on our retries
2070.6 -> to make sure the data was also written to Keyspaces.
2073.96 -> From our monitoring in the first phase,
2075.94 -> it looked like this was working,
2077.83 -> but we needed to validate the data to be sure.
2085 -> So when we got the initial results of this phase,
2088.33 -> things looked really good actually right off the bat.
2092.29 -> The error rate was very low
2094.39 -> and similar to what we had in Cassandra,
2097.39 -> and the latency across the 90th and 99th percentiles
2101.14 -> was also very similar.
2103.9 -> When we started looking
2104.86 -> at the higher end latencies, however,
2106.69 -> the performance started to diverge
2109 -> with some very high latency requests in Keyspaces.
2113.2 -> Looking at the max, you can see that Cassandra
2116.17 -> was at about two and a half minutes,
2118.57 -> whereas Keyspaces was almost at two hours.
2123.37 -> Now, to be clear what we were measuring here,
2125.47 -> this wasn't a single request in response to Keyspaces.
2129.13 -> Rather, it was a series of requests
2131.71 -> iterating through all of the different pages
2133.42 -> of a select query.
2135.4 -> We were monitoring it at this level
2137.14 -> because due to the schema change
2138.97 -> that we made in the first phase,
2140.74 -> the select queries between databases were now different.
2144.13 -> So in order to get
2145.24 -> an apples to apples comparison on the latency,
2147.37 -> we had to measure it at this level.
2152.89 -> Here's a quick example of how the query changed.
2156.85 -> In Cassandra, we were asking for all transactions
2160.18 -> for a list of accounts.
2162.07 -> And now, in Keyspaces, we were doing the same thing,
2164.65 -> but with the added list of partition IDs.
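
Roughly, the change looks like the hypothetical queries below, reusing the invented schema from the earlier write-path sketch: the Keyspaces version also has to enumerate the synthetic bucket values, which multiplies the number of partitions touched. Exact IN-clause support depends on the schema and server, so treat this purely as a sketch.

```python
# Before (Cassandra): all transactions for a list of accounts.
CASSANDRA_SELECT = """
SELECT * FROM exchange.financial_transaction
WHERE user_id = ? AND account_id IN ?
"""

# After (Keyspaces): same intent, but every (account, bucket) combination is a
# separate partition, so the bucket values 0..9 must be enumerated as well.
KEYSPACES_SELECT = """
SELECT * FROM exchange.financial_transaction
WHERE user_id = ? AND account_id IN ? AND partition_bucket IN ?
"""
```
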
2168.43 -> When we looked at the specific cases
2170.26 -> where this query was taking a long time to run,
2172.96 -> we were actually very surprised to find
2175.21 -> that in certain cases, we were asking for all transactions
2179.02 -> for up to 20,000 accounts at one time.
2183.76 -> What this meant was for the Cassandra query,
2186.34 -> it would have to scan
2187.21 -> through 20,000 partitions on the backend,
2189.85 -> whereas Keyspaces,
2190.99 -> with the additional list of partition IDs,
2193.27 -> would have to scan through 200,000.
2196.72 -> To make matters worse, we found that, at the time,
2200.32 -> the backend processing on Keyspaces
2202.69 -> was doing all of this work sequentially.
2207.34 -> So from this point, the Keyspaces team
2209.62 -> started working on an enhancement
2211.45 -> to do some of that backend processing in parallel
2214.27 -> while we continued investigation on our side.
2218.08 -> One strange thing that we had noticed
2219.97 -> was that although we were asking for transactions
2222.49 -> for so many different accounts at once,
2224.92 -> the actual number returned was only a few hundred,
2228.04 -> meaning that most of the accounts had no data at all.
2232.15 -> When we dug a bit further,
2233.47 -> we confirmed that this was actually the case.
2236.2 -> One of our other services called Financial Account
2239.47 -> had a defect that was sometimes inserting duplicate rows
2242.71 -> into its database.
2245.53 -> Because of the application logic in that service,
2248.14 -> these extra rows
2249.1 -> were never getting returned to our customers,
2251.5 -> but when they were used to generate this transactions query,
2254.47 -> they weren't getting filtered out.
2258.01 -> So the end result of this
2259.33 -> was that both Keyspaces and Cassandra
2262.66 -> essentially were spinning their wheels on the backend,
2265.12 -> looking through thousands of partitions
2267.01 -> that would never actually contain any data,
2269.23 -> and this is what was causing the high latency.
2272.83 -> After the Keyspaces team
2274.33 -> rolled out the enhancement on their end
2276.19 -> and after we fixed the defect on ours,
2279.1 -> that original two-hour query dropped to about 30 seconds,
2283.21 -> putting it almost exactly on par with Cassandra.
2290.17 -> So now that we had verified the latency
2292.96 -> and the error rate of the reads,
2294.85 -> the other important part of this phase
2296.77 -> was the data validation.
2299.32 -> The way that we implemented this
2300.82 -> was that every read that we did in Cassandra,
2303.37 -> we would perform the same read in Keyspaces
2305.77 -> and compare the data.
2307.96 -> The results of that were logged, fed into a dashboard,
2311.26 -> and we reviewed that daily
2312.61 -> to make sure that the data was being kept in sync.
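
A comparison job of this kind boils down to indexing both result sets by key and diffing them. This is a simplified sketch with hypothetical key fields; the real job also has to log its results in a form the dashboard can consume.

```python
def compare_result_sets(cassandra_rows, keyspaces_rows):
    """Diff two result sets keyed by (account_id, txn_id) for the parity dashboard."""
    left = {(r.account_id, r.txn_id): r for r in cassandra_rows}
    right = {(r.account_id, r.txn_id): r for r in keyspaces_rows}

    missing_in_keyspaces = left.keys() - right.keys()
    missing_in_cassandra = right.keys() - left.keys()
    mismatched_values = [k for k in left.keys() & right.keys() if left[k] != right[k]]
    return missing_in_keyspaces, missing_in_cassandra, mismatched_values
```
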
2317.11 -> When we first rolled this to production,
2319.3 -> the report showed that we had data mismatches
2322.18 -> in about 1% of the transactions.
2325.63 -> However, when we spot-checked a few of these
2327.97 -> looking directly at the database data,
2330.16 -> we didn't see any mismatches there.
2332.98 -> What was happening was because we were leveraging reads
2336.4 -> that were happening within the application,
2338.83 -> the chances were fairly high
2340.54 -> that that same data was being updated around the same time.
2344.35 -> And because we were doing
2345.73 -> asynchronous writes over to Keyspaces,
2348.07 -> sometimes the comparison was happening
2350.47 -> before Keyspaces was actually updated.
2354.55 -> So this was fairly easy to work around.
2357.43 -> Instead of leveraging the reads
2360.04 -> that were happening organically in the application,
2362.53 -> we set up a process that ran off to the side
2365.68 -> that compared the data in the background.
2368.23 -> After making that change,
2370.27 -> the original 1% of mismatches dropped to 0.01%.
2378.73 -> Looking at the remaining discrepancies,
2380.74 -> we found that those were actually due
2383.02 -> to data inconsistencies in our Cassandra cluster itself.
2387.73 -> Cassandra's model of eventual consistency
2389.92 -> does allow data to remain inconsistent
2392.44 -> between replicas for some time.
2395.08 -> And in our case, this was exacerbated
2397.69 -> by the fact that we couldn't run repair.
2403.21 -> And because we were using local quorum consistency
2406.15 -> when selecting the data out of Cassandra for comparison,
2409.57 -> it was possible that these inconsistencies
2411.7 -> would show up on our mismatch report.
2415.06 -> The fix here was to use consistency level 'ALL'
2418.54 -> when selecting the data out of Cassandra,
2421.36 -> which guaranteed that all three replicas
2423.49 -> would always be consulted,
2425.05 -> leading to consistent and up-to-date results.
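
With the Python driver, pinning the validation read to consistency level ALL is a one-line change on the statement. The query, session, and column names below are placeholders reusing the earlier hypothetical schema.

```python
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

def read_for_validation(cassandra_session, user_id, account_id):
    """Read from the legacy cluster at ALL so stale, unrepaired replicas can't skew the diff."""
    stmt = SimpleStatement(
        "SELECT * FROM exchange.financial_transaction "
        "WHERE user_id = %s AND account_id = %s",
        consistency_level=ConsistencyLevel.ALL,
    )
    return list(cassandra_session.execute(stmt, (user_id, account_id)))
```
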
2428.86 -> After making that change,
2430.48 -> the reported mismatches dropped to zero.
2434.65 -> So at this point, we had validated
2436.84 -> both the read and the write paths,
2439.33 -> and additionally, we had created a point in time
2441.88 -> when we turned on the dual writes
2444.43 -> such that after that point in time,
2445.99 -> the databases were in sync.
2448.27 -> So the next phase was backfilling all of the historical data
2452.05 -> that still resided only in Cassandra.
2454.78 -> And now, I'll turn it back to Manoj
2456.49 -> to talk through some of the details of that phase.
2460.75 -> - Okay.
2462.46 -> Thank you, Jason.
2467.56 -> So now that we have all of the dual writes and reads
2471.07 -> working in our production setup,
2473.08 -> it was time to focus on data migration.
2476.11 -> Now, for all the Cassandra experts out here,
2479.08 -> the immediate question that comes to your mind
2481.96 -> is "Why do we need a data migration utility?
2486.04 -> Why not just stream the data while replication
2489.37 -> from the current Cassandra cluster
2491.17 -> to the Keyspaces instance?"
2493.36 -> Well, there are a couple of different reasons
2495.97 -> why we could not pursue that route.
2498.64 -> One, as we mentioned before,
2500.737 -> AWS Keyspaces is not Apache Cassandra under the hood,
2504.82 -> so that was not a viable route.
2507.19 -> Two, our own production cluster, Cassandra cluster,
2510.73 -> was maxed out on I/O operations bandwidth.
2514.21 -> So we were less than keen
2516.79 -> to let replication run on our production cluster
2521.74 -> which is a heavy I/O operation
2524.35 -> and would hog all of our production bandwidth,
2527.14 -> potentially disrupting work for our customers.
2530.59 -> We would never let that happen.
2532.21 -> So with these reasons and with these constraints,
2535.15 -> we chose to build out a custom migration utility.
2539.83 -> And we used this utility one time
2542.02 -> to historically backfill all the information.
2545.08 -> Now, even with that utility,
2547.51 -> we built in the right level of throttles
2550.06 -> such that we would maximize the migration workload
2555.49 -> only when our production traffic was low,
2558.46 -> and dial it back
2560.77 -> when production traffic was high.
2565.75 -> And this way, we were able to leverage the maximum capacity
2570.13 -> of our production Cassandra cluster
2573.85 -> without creating any problems for our customers.
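
The shape of such a throttled, one-time backfill is sketched below: page through the source table and replay rows into Keyspaces under a rate cap that operators can lower while production traffic is high. It is a deliberately simplified, single-threaded sketch, not the actual utility, and the statements and rate are placeholders.

```python
import time

def backfill(cassandra_session, keyspaces_session, select_stmt, insert_stmt,
             max_rows_per_second):
    """One-time historical copy with a crude per-second rate cap."""
    budget = max_rows_per_second
    window_start = time.monotonic()

    for row in cassandra_session.execute(select_stmt):  # the driver pages transparently
        keyspaces_session.execute(insert_stmt, tuple(row))
        budget -= 1
        if budget <= 0:
            elapsed = time.monotonic() - window_start
            if elapsed < 1.0:
                time.sleep(1.0 - elapsed)  # don't exceed the per-second cap
            budget = max_rows_per_second
            window_start = time.monotonic()
```
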
2581.17 -> So this is a quick glimpse
2583.24 -> of how auto scaling works in Keyspaces.
2589.3 -> The blue line indicates the provision capacity
2592.99 -> and the green line shows the consumed capacity.
2597.19 -> Around 3:28 PM, in this particular instance,
2601.78 -> the total write consumption starts to peak
2605.68 -> and the provision capacity automatically scales up
2609.46 -> with just few minutes of delay.
2612.01 -> So what this ensures
2613.24 -> is if you have your auto scaling configuration set right,
2617.47 -> your system can seamlessly scale
2619.99 -> without creating any kind of adverse impact
2623.08 -> for your customers or production workloads.
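
Keyspaces auto scaling for provisioned tables rides on Application Auto Scaling; registering a scalable target with a minimum and maximum bound looks roughly like the boto3 call below. The service namespace, dimension string, and resource-id format are written from memory and should be verified against the documentation, and the keyspace/table names and bounds are placeholders.

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Register write-capacity bounds for one table; Keyspaces then scales the
# provisioned WCUs between these bounds based on consumed capacity.
autoscaling.register_scalable_target(
    ServiceNamespace="cassandra",
    ResourceId="keyspace/exchange/table/financial_transaction",
    ScalableDimension="cassandra:table:WriteCapacityUnits",
    MinCapacity=5000,
    MaxCapacity=20000,
)
```
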
2630.22 -> So up until this point in time,
2632.89 -> our Cassandra cluster is primary,
2635.98 -> and everything that we're doing on Keyspaces,
2638.77 -> be it reads, be it writes,
2640.48 -> it's all in a simulated manner, in an asynchronous fashion.
2645.97 -> And the asynchronous fashion ensured
2648.82 -> that if there were any glitches on Keyspaces,
2652.09 -> that would not result in any kind of production impact
2656.02 -> for our actual production system.
2659.14 -> We continued to do this,
2660.85 -> we continued to monitor our system in this setup
2663.58 -> for an extended few sprints
2665.86 -> until we ensured, one, 100% data parity
2670.03 -> between both systems,
2671.5 -> and two, that we were able to build high confidence
2676.12 -> that the new system was up and running
2678.61 -> and in a steady, stable, mature state.
2681.64 -> And once we got to that high confidence phase,
2684.4 -> then we started switching over the reads and writes
2688.33 -> to be primary and synchronous on Keyspaces,
2691.96 -> and then gradually we started terminating or tapering off
2696.13 -> all of the reads and writes to Cassandra.
2698.77 -> So this is how we made the switch
2701.2 -> over an extended period of time
2703.57 -> iteratively working through all these phases.
2708.88 -> To give you a glimpse of the high level stats
2711.37 -> with respect to our Keyspaces implementation,
2714.22 -> our peak-time writes are about 15,000 WCUs.
2719.98 -> Let me take a moment to explain what WCU is.
2722.923 -> It's write capacity unit in Keyspaces.
2726.07 -> One WCU represents one write per second
2730.36 -> for a row of up to one kilobyte in size
2733.84 -> using local quorum consistency.
2736.66 -> So if your average payload size is three kilobytes,
2740.65 -> then each of those payload requests
2743.11 -> would translate to three WCUs in Keyspaces.
2748.03 -> Peak reads were 140,000 RCUs.
2751.6 -> Similar to WCU, Read Capacity Unit
2754.99 -> represents one local quorum read per second,
2759.46 -> or two local one reads per second
2763.24 -> for a row of up to four kilobytes.
2767.14 -> Our total data footprint was about 150 terabytes,
2771.13 -> which comprised more than 100 billion records.
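
As a quick worked version of the capacity-unit arithmetic just described, the helpers below round a payload size up to WCUs and RCUs. They only encode the rules stated in the talk (1 KB per WCU, 4 KB per LOCAL_QUORUM RCU, two LOCAL_ONE reads per RCU) and are purely illustrative.

```python
import math

def wcus_per_write(payload_kb):
    """One WCU covers a write of up to 1 KB, so a 3 KB payload costs 3 WCUs."""
    return max(1, math.ceil(payload_kb))

def rcus_per_read(payload_kb, consistency="LOCAL_QUORUM"):
    """One RCU covers a LOCAL_QUORUM read of up to 4 KB, or two LOCAL_ONE reads."""
    units = max(1, math.ceil(payload_kb / 4.0))
    return units if consistency == "LOCAL_QUORUM" else units / 2.0

assert wcus_per_write(3.0) == 3          # the example given in the talk
assert rcus_per_read(4.0) == 1
assert rcus_per_read(4.0, "LOCAL_ONE") == 0.5
```
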
2779.41 -> Okay, where we are today.
2781.57 -> We have been fully live on Keyspaces implementation
2785.2 -> for the last one year.
2788.08 -> We have had zero production incidents
2791.53 -> post migration to Keyspaces
2793.96 -> specifically around our persistence layer.
2797.41 -> Now, this was a huge feather in the cap for our team,
2802.12 -> primarily because the database layer
2805.15 -> used to be our single biggest contributor
2808.27 -> towards all of our production incidents.
2812.5 -> We have also significantly improved
2814.57 -> on our operational metrics
2816.52 -> like reliability and availability of our application.
2819.82 -> We've been able to avoid sporadic read timeouts
2822.37 -> that we used to see on Cassandra and so on.
2825.22 -> And we were also able to deliver on all of our agility goals
2829.72 -> to enable better business capabilities,
2832.33 -> leveraging our platform for our stakeholders.
2836.17 -> So this is all the content that we wanted to share
2840.85 -> from Intuit.
2842.05 -> So let me invite Meet and Jason back on stage
2847.668 -> to wrap the session up.
2850.36 -> - All right.
2852.19 -> Thank you, Manoj. Thank you, Jason.
2856.3 -> I think that wraps up what we have for you all today.
2858.67 -> And again, appreciate you spending time
2860.41 -> and sticking around later in the evening.
2863.8 -> You'll see a link to the survey on your app.
2865.96 -> If you came to the session, you registered for the session,
2868.33 -> we really encourage you to take the survey.
2869.95 -> That's how we learn,
2870.783 -> that's how we know what to do better next year.
2872.92 -> So please take the survey, let us know what you thought
2875.17 -> so we could keep getting even better every year.

Source: https://www.youtube.com/watch?v=AjsKP0Key6U