AWS re:Invent 2022 - Beyond five 9s: Lessons from our highest available data planes (ARC310)
Updated with recent learnings, this session dives deep into building and improving resilience in AWS services. Every AWS service is designed to be highly available, but a small number of what are called Tier 0 services get extra-special attention. In this session, hear lessons from how AWS has built and architected Amazon Route 53 and the AWS authentication system to help them survive cataclysmic failures, enormous load increases, and more. Learn about the AWS approach to redundancy and resilience at the infrastructure, software, and team levels and how the teams tasked with keeping the internet running manage themselves and keep up with the pace of change that AWS customers demand.
ABOUT AWS Amazon Web Services (AWS) hosts events, both online and in-person, bringing the cloud computing community together to connect, collaborate, and learn from AWS experts.
AWS is the world’s most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster.
#reInvent2022 #AWSreInvent2022 #AWSEvents
Content
0.13 -> - [Colm] Well, my name's Colm.
1.92 -> I'm a VP and distinguished engineer
4.86 -> at Amazon Web Services.
7.08 -> I joined AWS in 2008,
10.38 -> been here a while,
13.11 -> helped build a bunch of AWS services
16.26 -> that hopefully you're familiar with.
18.21 -> And what I'm gonna talk about today
20.64 -> are lessons I've learned
23.16 -> from working with teams
and observing teams
25.32 -> who build some of our
highest available services,
29.43 -> and the most highly available
parts of those services.
35.37 -> All the lessons today come from
38.73 -> the data plane parts of our services,
41.7 -> so, sometimes
44.01 -> we can debate what's a data plane?
46.08 -> what's a control plane?
what other kinds of planes are there?
48.54 -> propagation planes? and so on,
49.89 -> when it comes to the systems and services
53.22 -> that we have.
54.48 -> But what I mean by data plane
56.94 -> is the part of our systems
58.38 -> that is used the most, right?
60.6 -> So if I take a service like EC2,
64.29 -> the data plane for EC2 are the instances
68.07 -> and making sure that instance stays up,
70.8 -> and that networking packets
can get to and from it,
74.22 -> and that EBS data storage is working,
76.677 -> and so on, in realtime.
78.36 -> Those are the really,
really critical parts
80.28 -> versus something like the control plane,
81.96 -> which is maybe launching
a new EC2 instance
85.26 -> or tearing down an EC2
instance, and so on.
88.47 -> And the first category tends to be
90.66 -> much, much more critical,
92.52 -> and we tend to pour, or have to pour,
95.46 -> a lot of attention to detail and scrutiny
98.309 -> into how those systems are built
101.88 -> to get the reliability levels
that our customers need.
105.96 -> A lot of what I'm gonna talk about today,
107.82 -> we've got articles that
go into even greater depth
111.84 -> at the Amazon Builders' Library.
114.48 -> These are articles, really
deep technical articles
118.41 -> written by firsthand experts,
121.26 -> folks who actually built
Amazon Web Services
124.53 -> write these articles,
126.18 -> and they go through a
review and editorial process
129.03 -> and we get them as readable
and helpful as we can,
131.987 -> and we dive into all
sorts of nuance details
134.76 -> that are important to building
highly available systems.
138 -> We've got articles on things like
caching, retries, fairness,
141.81 -> safety, and more.
143.4 -> And if you want even
more detail on things,
145.77 -> I'm not gonna cover it all today,
148.5 -> I encourage you to go there.
149.73 -> It's a really, really great resource.
151.17 -> I love it.
152.43 -> But what I am gonna go into today is,
155.58 -> it's kinda my mission,
that everybody who came,
158.28 -> or everybody who watches this talk
160.26 -> can leave with some tips, or mental model,
164.07 -> or something that hopefully
will stick in your mind
166.89 -> that is really genuinely useful
169.02 -> for building highly available systems.
173.04 -> Now, every system is different
176.22 -> and while there are some
techniques, and tips,
179.37 -> and tricks that are in common,
181.32 -> and can apply to almost any system,
183.87 -> a lot of the nuance detail
is very, very specific.
187.14 -> So to make sure we get
some good takeaways,
190.23 -> I'm gonna focus, in part,
192.84 -> on some of the higher level lessons
194.787 -> and the kind of forms
of insight that apply,
197.67 -> genuinely apply to kind
of any development process
200.79 -> or any highly available system.
203.82 -> But we're also gonna cover
some super practical,
206.7 -> like nuts and bolts technical techniques,
209.49 -> that we do try to use in our systems
213.57 -> and show the benefits that they bring.
216.3 -> And just to keep things interesting,
217.44 -> we're gonna interleave these,
218.55 -> so we're gonna go back
and forth between the two.
221.4 -> So whether you're a hands-on builder,
223.71 -> writing code, shipping systems,
225.9 -> or whether you're an engineering leader,
228.21 -> or a principal engineer,
229.65 -> or even a business leader,
230.64 -> hopefully there's things in here
232.71 -> that will be useful to take away.
235.83 -> And so we're gonna start
237.27 -> by very literally going beyond five 9s.
240.78 -> And what I mean by that is
just thinking a bit more deeply
245.34 -> than the traditional
nines model itself allows,
249.33 -> because when we're
reasoning about building
253.5 -> or designing highly available systems,
255.63 -> we have to be careful about,
what components do we use?
259.26 -> or what reliability level do we insist on
262.2 -> for certain subsystems?
263.76 -> because that all adds up to
our overall availability.
266.94 -> And in general, the
traditional five nines model,
270.36 -> and nines model, it's kinda crude
272.97 -> and doesn't really give us enough.
275.46 -> So you're probably familiar
with the very traditional
278.58 -> pattern of you measure
the uptime or availability
281.46 -> of a system maybe once
a minute or something,
284.76 -> and determine is it up? is it down?
287.64 -> is it fully available? is
it degraded? or whatever,
290.25 -> and you can translate that
into the number of nines
293.43 -> that that system was available for.
295.77 -> A system that has two
nines of availability
297.63 -> can have up to three and
a half days of downtime,
302.16 -> or impact, in a year,
which is quite a lot.
305.19 -> All the way up to five nines or even
307.62 -> more-nines systems, and for a five nines system
310.29 -> you're talking five and
a half minutes per year,
312.48 -> which is, you know,
it's pretty aggressive.
314.55 -> That's a very high level of
availability at that point.
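As a quick back-of-envelope illustration of those downtime figures (my own Python sketch, not something from the talk), converting a nines figure into allowed downtime per year is simple arithmetic:

```python
# Hypothetical helper: allowed downtime per year for a given number of nines.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def allowed_downtime_minutes(nines: int) -> float:
    """e.g. nines=3 means 99.9% availability."""
    unavailability = 10 ** (-nines)
    return MINUTES_PER_YEAR * unavailability

for n in range(2, 6):
    print(f"{n} nines: {allowed_downtime_minutes(n):,.1f} minutes/year")
# 2 nines -> ~5,256 minutes (~3.65 days); 5 nines -> ~5.3 minutes.
```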
318.33 -> And this model is useful.
321.18 -> It's great for measuring overall
business impact and value.
324.86 -> It does very succinctly summarize,
327.93 -> well, how available was
that system in a year?
330.36 -> how often did I have to work around it?
332.43 -> All of those things.
333.6 -> You can see these baked
into business relationships,
336.51 -> and SLAs, and so on, often are expressed
339.93 -> in terms of numbers of nines.
341.94 -> But it's also a very
low-fidelity measurement.
344.1 -> It doesn't really give us any insight
345.99 -> into how the system is going to perform,
348.18 -> and it's not so great for systems design.
352.56 -> Some of the challenges
of the nines model are,
355.56 -> well, what about partitioned
and cellular systems, right?
359.61 -> Nines can easily capture a binary state,
363.63 -> uptime versus downtime,
365.64 -> but it's harder for it to measure
367.32 -> partial availability, right?
369.87 -> And at AWS, we've been
partitioning all our systems
373.32 -> since we launched a second region.
376.02 -> We have independent, isolated
copies of each service
380.31 -> in every region, and as far as I know,
383.28 -> we've never experienced a
global outage of any service
386.43 -> across all regions.
387.78 -> So every time we're talking
about an issue or impact,
391.35 -> it's already partial by that measure
393.33 -> because it's only impacting
one region out of many.
396.99 -> And so some customers are
having an absolutely fine day
400.5 -> because they're in different
regions, different locations,
405.54 -> but some are impacted,
406.56 -> but the nines model doesn't
capture that as easily.
409.95 -> Another challenge with the nines model
412.32 -> is it's hard to pick a good
time duration to measure over.
417.42 -> Modern services, like the
kind of cloud services
419.82 -> that we operate, can now
typically go many, many years
424.47 -> in a region without impact,
426.51 -> without, you know,
427.98 -> having a single incident.
429.69 -> There're extreme examples of that.
431.42 -> So the Route 53 data plane,
433.89 -> so that's the part of
the Route 53 DNS service
436.32 -> that actually hosts
domains, answers DNS queries
439.26 -> so that websites stay online,
441.45 -> I was part of the team that built that.
443.547 -> And when we were building
it, and launching it,
446.52 -> 12 years ago, very early
in the beta program,
449.67 -> I put my own personal
domain right on that thing.
453.18 -> It was the very first
domain hosted on Route 53,
456 -> it was my own personal website.
457.77 -> Not a very high traffic website,
459.27 -> but I owned a domain and
I figured I might as well
462.54 -> start somewhere, drink my own champagne,
464.76 -> and put it on the service.
467.25 -> And in over 12 years that's
been running like that,
470.4 -> I've never even once experienced
472.71 -> not being able to have my domain resolved.
475.59 -> Throughout that whole time it's worked.
477.66 -> And, the five nines model,
480.27 -> I don't even know how
many nines that is, right?
482.73 -> It's really hard to capture.
485.76 -> And part of the challenge
here is that the nines model
489.54 -> is really just a simple
summary statistic, right?
492.87 -> It's just a single number
that's trying to describe
495.69 -> a complex data set.
497.28 -> One of my favorite papers is a statistical paper
500.85 -> where the authors produced
502.35 -> these data sets that
are radically different.
505.98 -> One looks like a T-Rex,
507.9 -> and the rest look like these
other interesting shapes,
510 -> but they all have the same
mean, the same median,
513.15 -> same standard deviation,
and a bunch of other
515.82 -> summary statistics where
they're all the same thing.
517.673 -> And the point of the
paper was to get across
519.84 -> something that is commonly taught
521.97 -> in the very first lecture
of a statistics module,
525.33 -> which is you should plot your data,
526.86 -> you should look at how it looks.
529.08 -> Don't just focus on the means
and averages, and so on,
531.72 -> 'cause it won't give you much insight.
533.91 -> So we can apply the same technique, right?
536.25 -> But what really matters,
537.81 -> what is it the thing that we
really, really care about?
540.84 -> 'Cause personally, I don't
think nines tells you that much.
543.78 -> And I think the thing that
really matters to most customers,
548.19 -> engineering leaders, and systems designers
551.28 -> is how long can an interruption last?
553.98 -> and how often can that happen?
556.56 -> And something I've done as a practice
559.8 -> and seen work really, really well
562.05 -> is to approach every system's design
564.99 -> with this frame of mind, and
this frame of conversation.
568.26 -> And so, by talking to a
team and just asking them,
571.41 -> don't ask them, "Well
how many nines are you?"
573.6 -> Ask them, "Well, what do
you think the longest event
578.327 -> you could have would be?
579.57 -> And how often do you
think that could happen?"
581.7 -> 'Cause that's just much
more useful actionable data.
585.54 -> As a business owner,
that tells you whether
587.69 -> it might be an acceptable system or not.
590.16 -> As an engineering designer, it tells you,
591.93 -> well now I know how long
I might have to work
594.42 -> around that system if
it has an issue, right?
596.66 -> It really, really focuses the mind.
598.95 -> The durability industry
has used this approach
601.23 -> for a long time.
602.063 -> Recovery time objective is
kind of the standard number
604.5 -> that that industry uses
for, I've got durable data,
607.86 -> what happens if it goes down
609.09 -> and I have to restore the
whole system from backups?
611.55 -> How long could that take, right?
612.87 -> That's recovery time objective.
615.57 -> And that's what they focus on,
616.97 -> and I think it's a
better thing to focus on.
620.88 -> So put all this together and focus on,
622.92 -> well, what's the rate
and expected duration of
627.57 -> an incident, or some kind of downtime,
629.67 -> or something like that?
630.96 -> It's really easy to plot.
634.05 -> I've made up these data sets,
636.33 -> but it's easy to plot them
with almost any plotting tool,
640.83 -> I used Excel for these.
641.963 -> But the idea is to just
pick some time ranges
644.88 -> that you care about, time durations,
648.15 -> let's say how often in a year
650.45 -> am I gonna have one minute interruptions?
652.68 -> And this graph says for
this fictitious system,
655.62 -> 10 one minute interruptions in a year.
658.17 -> But then how often am I
gonna have an interruption
661.35 -> that might last an hour, right?
663.33 -> And on this graph, it's a log graph,
666.06 -> it's about 0.5, and that's telling you,
669.33 -> well, maybe every two years, right,
671.25 -> you're gonna have an interruption
that lasts an hour.
673.56 -> Straight away, you're
getting kinds of insights
675.87 -> that the nines model just
cannot give you, right?
678.69 -> It can't expose that kind
of information to you.
682.167 -> And this is still giving you binary data.
684.99 -> It's just telling you something's
up, or something's down.
688.77 -> If we wanna capture partial impact,
690.51 -> it's really simple,
because this is a graph,
692.43 -> we can just add more lines, right?
694.23 -> And so we can look at the, you know,
696.42 -> what are the expected rates
of a 5% impacting event?
701.64 -> How long could an event that
impacts 5% of my resources,
705.21 -> or customers last, and so on?
708.3 -> And just plotting the data like
this gives you this insight.
711.9 -> And then the next thing I do,
715.17 -> I just take this data and I multiply it
717.24 -> by the number of minutes involved.
718.74 -> Nothing fancy at all.
719.73 -> So we're just weighting it by minutes.
722.22 -> And so now I get minutes per
year by expected duration.
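One rough way to produce plots like these (an illustrative sketch with made-up numbers, just as the speaker made up his data sets; this is not AWS tooling) is:

```python
# Sketch: plot expected event rate by interruption duration, then weight by minutes.
import matplotlib.pyplot as plt

durations_min = [1, 5, 15, 60, 240]         # interruption lengths we care about
events_per_year = [10, 4, 1.5, 0.5, 0.05]   # guessed or measured rate for each length

# Weight each rate by its duration to get expected minutes of impact per year.
minutes_per_year = [r * d for r, d in zip(events_per_year, durations_min)]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(durations_min, events_per_year, marker="o")
ax1.set(xscale="log", yscale="log", xlabel="Interruption duration (minutes)",
        ylabel="Expected events per year", title="Rate by duration")
ax2.bar(range(len(durations_min)), minutes_per_year,
        tick_label=[str(d) for d in durations_min])
ax2.set(xlabel="Interruption duration (minutes)",
        ylabel="Expected impact (minutes per year)", title="Rate weighted by duration")
plt.tight_layout()
plt.show()
```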
726.06 -> And this is the kind of mental
model and frame of reference
729.51 -> I use when evaluating
components and systems, right?
732.87 -> Because this tells me so much, right?
735.63 -> It tells me, okay, well how
reliable is that system gonna be
739.08 -> in this high-fidelity way?
741.63 -> But it also teaches me
what to focus on, right?
744.66 -> Like if I saw a graph like this
for a system, I would know,
748.26 -> well I should probably
spend more time focusing
750.15 -> on the right hand side of this graph,
751.77 -> the longer duration events,
753.75 -> 'cause those are gonna
accumulate to more impact.
755.94 -> And so that means refining
operational processes,
758.73 -> putting better failover
mechanisms in place,
761.85 -> and so on, and so forth, right?
763.86 -> Whereas if I saw if the left was higher,
765.81 -> there were a different
system, and I'm going,
767.88 -> you know, you're gonna have
a lot of small interruptions.
770.31 -> That's just probably telling
me I need more components,
773.46 -> I need more resiliency,
774.87 -> but those are very different conclusions.
776.91 -> But the number of nines
could be exactly the same
779.28 -> for those two radically different graphs.
782.97 -> It's just simpler to talk
about and reason about systems,
785.997 -> and I find conversations like
this enormous simplifiers.
789.69 -> When I'm working with customers
790.86 -> who have the highest
availability requirements,
793.53 -> where we're talking about
live market trading,
797.43 -> or safety critical systems, and so on,
800.49 -> we pivot into these conversations quickly
802.77 -> and it really helps us get
to the meat of matters.
806.34 -> And the gold
standard for this technique
809.97 -> is to use real historical data
812.25 -> that accounts for the
operational performance
814.56 -> of a service and its team,
816.21 -> and just use that to plot the graphs.
818.49 -> Services typically
come with metrics,
822.3 -> and canaries, and so on,
824.517 -> and there are ways to get
this data and build it.
827.1 -> But even if you're building
something from scratch,
829.41 -> or you don't have that data,
this even works with guesswork.
833.1 -> I've done exercises
where I've gone to teams
835.167 -> and I've just said,
"Well, tell me your guess,
837.66 -> how often do you think you're
gonna have a one hour event?
841.59 -> What's your guess?
842.52 -> Do you think you could have
one of those in a year?
844.5 -> Do you think it's like a 25% chance?"
846.6 -> And just that conversation,
that style of conversation,
850.5 -> right away gets you to
actionable interesting data
853.02 -> that it's really hard to talk
about with the nines model.
857.37 -> And by the way, engineers'
guesses, they tend to be right.
865.56 -> No matter what you ask them about,
867.09 -> they have an intuitive sense for this,
868.74 -> at least the experienced ones.
870.54 -> When you plot out these graphs,
872.16 -> they're very different
for a cloud service,
875.64 -> or an on-premises database,
or an on-premises application,
878.97 -> or low-level physical infrastructure,
881.25 -> even when they might have
the same number of nines,
883.77 -> they have very different
shapes to these graphs.
886.14 -> And so you have to accommodate that
887.61 -> in your resiliency plans
around these systems.
891.21 -> I just find that endlessly fascinating.
895.35 -> Richard Feynman famously
applied this technique,
898.05 -> this kind of like, "Just
go ask the engineers,"
900.99 -> in his appendix F that he wrote
902.79 -> to the Challenger disaster report,
904.74 -> where he literally just asked
all the engineers involved
907.29 -> in building the components
for the Challenger rockets,
911.297 -> you know, "What do you think
the failure rates are like?"
912.84 -> All this kinda stuff, and
he added up the numbers
915.15 -> and he got to an overall failure rate
916.83 -> of about one in 100 for the space shuttle,
919.53 -> which tragically turned out
to be pretty much on the nose.
924 -> Two space shuttles were lost
926.22 -> in just around 200-ish missions.
931.05 -> That stuff works.
932.88 -> But there's other challenges with nines,
934.89 -> there's other reasons why we try to avoid
937.47 -> the kinda naive nines model
938.91 -> when we're doing systems design.
941.934 -> It's that it comes from
mechanical and civil engineering
946.17 -> and it's kinda based around
failure mode analysis
949.68 -> that doesn't easily apply to our field.
953.73 -> Physical components, you
can test a bolt or a screw
958.02 -> over, and over, and over again
959.1 -> under all sorts of different conditions
960.48 -> and see how often it fails,
961.8 -> and that will tell you the
failure rate of that component,
963.9 -> and now you know how many bolts you need
965.7 -> to put in something to
make sure that it's safe.
969.03 -> But our field's a little different.
970.8 -> Firstly, our failures are recoverable.
974.55 -> As an industry, pretty much every failure
977.91 -> has always been recovered from.
980.34 -> I don't know of any
real outage or downtime
983.04 -> at any company, anywhere,
really in this whole field
987 -> where they've had a permanently
disabling event, right?
990.66 -> So that's not at all
like physical components.
993.36 -> So recovery time really matters
in a very different way.
996.12 -> And another is that software systems
998.04 -> and distributed systems,
999.45 -> they tend to have non-linear interactions
1002.09 -> and even positive feedback
loops in different places,
1005.48 -> and much more state space than
simple physical components,
1010.01 -> and all sorts of ways that makes them fail
1012.62 -> in kind of dynamic ways.
1015.35 -> You know, a simple example of that,
1018.02 -> classic example, you
wanna suspend a weight
1020.81 -> with a
chain, or a braid, right?
1023.18 -> Everyone knows, famously
a chain is only as strong
1026.03 -> as its weakest link, right?
1027.14 -> It's an idiom.
1028.22 -> And so when you do the
nines equations for these,
1031.22 -> it's a slightly different formula, right?
1032.87 -> The availability can
go down for the chain,
1037.19 -> but go up for the braid, right,
1038.51 -> because one piece of
rope can save the day
1041.39 -> as far as a braid is concerned.
1043.85 -> And if you do the math,
you add the numbers,
1046.25 -> you'll get lower
availability for the chain
1048.5 -> than for the braid.
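To make that concrete (a back-of-envelope sketch with illustrative component numbers, not real measurements), the usual formulas are: a chain multiplies availabilities, while a braid only fails if every strand fails:

```python
# Chain vs. braid availability, with made-up component availabilities.
import math

def chain_availability(components):
    """Chain: every link must hold, so availabilities multiply."""
    return math.prod(components)

def braid_availability(components):
    """Braid: any one strand is enough, so only simultaneous failure hurts."""
    return 1 - math.prod(1 - a for a in components)

parts = [0.999, 0.999, 0.999]
print(f"chain: {chain_availability(parts):.9f}")  # ~0.997003, worse than one part
print(f"braid: {braid_availability(parts):.9f}")  # ~0.999999999, better than one part
```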
1050 -> And I've seen people apply
this to distributed systems
1053.12 -> and systems design,
1055.22 -> and they end up getting disappointed.
1058.52 -> Like it doesn't really work.
1060.5 -> A simple example of this,
1061.88 -> I'll go back to my DNS example.
1064.58 -> DNS
1065.93 -> is a relatively simple
1069.26 -> service,
1070.34 -> it's been around over 40 years
1074.51 -> and its design hasn't
changed much in that time,
1077.63 -> and you have a service,
it answers DNS queries,
1079.88 -> websites stay online, right?
1081.8 -> It was very natural to think,
1083 -> well, I really care
that my website stays up
1086.12 -> so I'm gonna use two or
more DNS providers, right?
1089.27 -> I've gotta put my domain on Route 53
1091.55 -> and another DNS provider.
1093.17 -> Great.
1094.003 -> And you go, well I'm
getting some biodiversity,
1096.2 -> they probably have different software,
1097.61 -> different networks, different people.
1099.32 -> Very unlikely they would
break at the same time, right?
1102.38 -> So it feels like the braid, right?
1104.717 -> But the problem is, in practice
1106.37 -> it turns out it's like the chain,
1108.14 -> because DNS resolvers
can't easily work around
1113.18 -> even one of your providers,
out of many, being down.
1117.53 -> And so if one of your DNS
providers has a bad day,
1119.75 -> then you have a bad day.
1121.19 -> So you end up actually
doubling your fault rate,
1124.79 -> not halving it, the opposite
of what you intended.
1128.36 -> Or another simple example
1129.41 -> is if you rely on your DNS provider
1131.03 -> to be able to make timely changes,
1132.457 -> 'cause that's what you use
for your failover process,
1136.34 -> again, you're gonna get
half the availability rate
1138.26 -> if you're relying on two providers,
1139.43 -> 'cause now you need those
changes reflected in two places.
1143.33 -> That's just not really obvious
1144.98 -> unless you've gone and done
it, and experienced it,
1147.32 -> and seen what's going on.
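As a rough numeric illustration of that effect (my own sketch with assumed availability numbers, not figures from the talk), two providers that each behave like a link in a chain roughly add their downtime together:

```python
# Assume each DNS provider is independently available 99.99% of the time.
provider_availability = 0.9999  # roughly 53 minutes of impact per year each

# Serving: if resolvers can't route around a down provider,
# a bad day at either provider is a bad day for you (approximately additive).
serving = 1 - 2 * (1 - provider_availability)

# Failover changes: a change must take effect at both providers to be useful.
changes = provider_availability ** 2

print(f"one provider:             {provider_availability:.4%}")
print(f"two providers (serving):  {serving:.4%}")   # ~99.98%, roughly double the downtime
print(f"two providers (changes):  {changes:.4%}")   # ~99.98% as well
```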
1149.78 -> My take is that the kinda
systems we all build,
1152.78 -> they're more like cantilevers
than chains and braids.
1156.71 -> And what I mean by that
is every single one,
1159.05 -> it's kinda specific and has to be tested.
1161.87 -> Like you've gotta build
maybe a scale model
1163.91 -> and really exercise it, and
learn everything you can
1166.55 -> about how it might fail,
1168.05 -> and make sure that you've accounted
1169.43 -> for all the safety mechanisms, and so on.
1172.25 -> And that's the kind of approach to take.
1175.97 -> We'll see a bit more of that later.
1177.23 -> But this kind of end-to-end testing
1179.96 -> is what's really critical
1181.37 -> to figuring out whether
these systems are safe
1184.16 -> and reliable enough for
whatever purpose we've intended.
1188.63 -> All right.
1189.47 -> Now there are some techniques
that are so general purpose,
1193.55 -> they're pretty much always a good idea.
1196.13 -> And I think the biggest one that we apply
1198.53 -> is to compartmentalize our systems, right?
1200.93 -> So I've already hinted at this a little.
1203.93 -> We run independent copies of every service
1207.29 -> in every region, right?
1208.58 -> They're not magically linked,
1210.71 -> there's no synchronous
dependencies between them.
1213.38 -> And the idea is if you know
a service is having a bad day
1217.46 -> in region A that shouldn't
have a knock-on impact
1220.28 -> and cause region B to have a bad day too.
1223.34 -> That would also defeat the utility
1225.38 -> of having multiple regions for failover.
1228.62 -> And we take this to the extreme,
1230.6 -> like this is our default design paradigm
1233.57 -> this is how we build
every service that we do,
1236.27 -> it's baked into our tools,
into our design processes,
1240.11 -> into our principal engineer
tenets, everything.
1243.17 -> You know, thou shalt not
build a multi-region service
1246.92 -> unless it's extremely intentional,
1248.78 -> and we're very, very careful
about how we do build those.
1252.59 -> We wanna keep everything
as separate as we can.
1255.41 -> But even inside the region
we wanna do the same again.
1259.04 -> Inside the region we've
got availability zones
1261.41 -> and where we can, we run
isolated separate stacks
1264.44 -> inside those availability zones.
1266.15 -> So if there's a problem
with an availability zone,
1268.97 -> us-west-2a here, we can just turn it off,
1272.42 -> failover to the other
availability zones for a while.
1275.54 -> Fix it, fail back.
1278.42 -> It's a really, really powerful technique.
1282.47 -> We take it even further,
1283.73 -> we build cells inside availability zones.
1285.89 -> So when a service is big
enough, when it makes sense,
1289.97 -> we can start to have multiple
cells of that service,
1293.3 -> more independent copies,
1295.13 -> and then partition the
customers across those cells.
1298.288 -> Some customers are on cell
one, some are on cell three,
1301.37 -> and so on.
1302.63 -> And so, you know, maybe you make a change,
1304.43 -> it gets to one cell,
that change has an issue
1307.25 -> or something like that,
1309.05 -> you're only impacting a
much smaller percentage
of your overall customers, rather than everybody.
1316.31 -> We take this to the absolute
extreme with shuffle sharding
1320.03 -> where you can do better again, right?
1321.56 -> So let's say we've got 10
cells, instead of just saying,
1325.617 -> "Well, we're gonna take
10% of our customers
1327.53 -> and put them on cell one, and
10% on cell two, and so on,"
1332.03 -> what we do is we take
each customer, and we say,
1335.127 -> "We're gonna put you on two random cells."
1338.66 -> So one customer, they're gonna
be on cell one and cell six,
1342.53 -> and another customer might be
on cell one and cell eight.
1346.34 -> So those two customers happen
to share one cell, right,
1349.28 -> cell one, but otherwise,
one's on cell six,
1352.79 -> one's on cell eight.
1354.74 -> And unless both of your
cells that you're allocated
1360.32 -> happen to be having a bad day,
1362.12 -> which should be very unlikely,
1364.31 -> you should be okay,
1365.143 -> because there should be some resilience
1366.17 -> built into that system
1367.1 -> and you should be able to
survive on the remaining cell.
1370.76 -> Firstly, because you've got
that built in resilience,
1373.73 -> because we're really careful
about how we do things
like deploy changes, it just
naturally makes faults
1379.7 -> way less likely in terms of
actually impacting a customer.
1382.85 -> While it might happen, the
idea is they won't notice.
1386.81 -> But even if they do,
even if somehow an issue
1389 -> does get to impacting two cells,
1392.45 -> 'cause maybe this issue was
triggered by the customer,
1395.93 -> so maybe that's why those
two cells became impacted.
1400.04 -> Maybe it wasn't change related,
1401.9 -> even in that event, with just
10 cells and this method,
1406.37 -> one 100th of your customers
are impacted and share a fate,
1411.38 -> and those numbers get better,
and better, and better
1413.06 -> the more cells you have,
1414.14 -> 'cause this scales exponentially.
1416.09 -> It's based on combinatorial math.
1418.43 -> It's pretty cool.
1419.263 -> And we try to use that in
many, many places we can.
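A minimal sketch of that idea in code (illustrative only; the cell counts, shard size, and function names are mine, not AWS's implementation):

```python
# Shuffle sharding sketch: each customer gets a small, stable subset of cells.
import hashlib
import random
from math import comb

NUM_CELLS = 10
SHARD_SIZE = 2

def shard_for(customer_id: str, num_cells: int = NUM_CELLS, k: int = SHARD_SIZE):
    """Pick k cells for a customer, seeded by a hash so the choice is stable."""
    seed = int(hashlib.sha256(customer_id.encode()).hexdigest(), 16)
    return sorted(random.Random(seed).sample(range(num_cells), k))

print(shard_for("customer-a"))  # e.g. [1, 6]
print(shard_for("customer-b"))  # e.g. [1, 8]

# With 10 cells and 2-cell shards there are C(10, 2) = 45 possible shards,
# so only a small fraction of customers share the exact same pair of cells,
# and the combinatorics improve rapidly as the number of cells grows.
print(comb(10, 2))    # 45
print(comb(100, 5))   # 75,287,520
```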
1423.74 -> This is something we learned,
1427.04 -> this wasn't something that was baked
1428.54 -> into the very first systems that we built.
1431.45 -> We retrofitted it into many,
1433.49 -> but it's something we
learned to do over time.
1436.46 -> And I think of learning
to do things over time,
1439.76 -> as actually the primary way
that we really design systems.
1444.92 -> I think often folks can think
1447.14 -> there's some mythical
reference manual somewhere,
1450.92 -> that maybe someone, someday, will write,
1452.9 -> that just has like all the best practices
1455.03 -> you're supposed to apply
for systems design.
1457.88 -> And if you read that golden manual
1459.65 -> and you learn all those things,
1461.6 -> that you will just come out knowing
1463.22 -> how to build the perfect system, right?