AWS re:Invent 2022 - Beyond five 9s: Lessons from our highest available data planes (ARC310)

Updated with recent learning, this session dives deep into building and improving resilience in AWS services. Every AWS service is designed to be highly available, but a small number of what are called Tier 0 services get extra-special attention. In this session, hear lessons from how AWS has built and architected Amazon Route 53 and the AWS authentication system to help them survive cataclysmic failures, enormous load increases, and more. Learn about the AWS approach to redundancy and resilience at the infrastructure, software, and team levels and how the teams tasked with keeping the internet running manage themselves and keep up with the pace of change that AWS customers demand.

Learn more about AWS re:Invent at https://go.aws/3ikK4dD.



Content

0.13 -> - [Colm] Well, my name's Colm.
1.92 -> I'm a VP and distinguished engineer
4.86 -> at Amazon Web Services.
7.08 -> I joined AWS in 2008,
10.38 -> been here a while,
13.11 -> helped build a bunch of AWS services
16.26 -> that hopefully you're familiar with.
18.21 -> And what I'm gonna talk about today
20.64 -> are lessons I've learned
23.16 -> from working with teams and observing teams
25.32 -> who build some of our highest available services,
29.43 -> and the most highly available parts of those services.
35.37 -> All the lessons today come from
38.73 -> the data plane parts of our services,
41.7 -> so, sometimes
44.01 -> we can debate what's a data plane?
46.08 -> what's a control plane? what are kinds of planes?
48.54 -> propagation planes? and so on,
49.89 -> when it comes to the systems and services
53.22 -> that we have.
54.48 -> But what I mean by data plane
56.94 -> is the part of our systems
58.38 -> that is used the most, right?
60.6 -> So if I take a service like EC2,
64.29 -> the data plane for EC2 are the instances
68.07 -> and making sure that instance stays up,
70.8 -> and that networking packets can get to and from it,
74.22 -> and that EBS data storage is working,
76.677 -> and so on, in realtime.
78.36 -> Those are the really, really critical parts
80.28 -> versus something like the control plane,
81.96 -> which is maybe launching a new EC2 instance
85.26 -> or tearing down an EC2 instance, and so on.
88.47 -> And the first category tend to be
90.66 -> much, much more critical,
92.52 -> and we tend to pour, or have to pour,
95.46 -> a lot of attention to detail and scrutiny
98.309 -> into how those systems are built
101.88 -> to get the reliability levels that our customers need.
105.96 -> A lot of what I'm gonna talk about today,
107.82 -> we've got articles that go into even greater depth
111.84 -> at the Amazon Builders' Library.
114.48 -> These are articles, really deep technical articles
118.41 -> written by firsthand experts,
121.26 -> folks who actually built Amazon Web Services
124.53 -> write these articles,
126.18 -> and they go through a review and editorial process
129.03 -> and we get them as readable and helpful as we can,
131.987 -> and we dive into all sorts of nuanced details
134.76 -> that are important to building highly available systems.
138 -> We've articles on things like caching, retries, fairness,
141.81 -> safety, and more.
143.4 -> And if you want even more detail on things,
145.77 -> I'm not gonna cover it all today,
148.5 -> I encourage you to go there.
149.73 -> It's a really, really great resource.
151.17 -> I love it.
152.43 -> But what I am gonna go in today is,
155.58 -> it's kinda my mission, that everybody who came,
158.28 -> or everybody who watches this talk
160.26 -> can leave with some tips, or mental model,
164.07 -> or something that hopefully will stick in your mind
166.89 -> that is really genuinely useful
169.02 -> for building highly available systems.
173.04 -> Now, every system is different
176.22 -> and while there are some techniques, and tips,
179.37 -> and tricks that are in common,
181.32 -> and can apply to almost any system,
183.87 -> a lot of the nuanced detail is very, very specific.
187.14 -> So to make sure we get some good takeaways,
190.23 -> I'm gonna focus, in part,
192.84 -> on some of the higher level lessons
194.787 -> and the kind of forms of insight that apply,
197.67 -> genuinely apply to kind of any development process
200.79 -> or any highly available system.
203.82 -> But we're also gonna cover some super practical,
206.7 -> like nuts and bolts technical techniques,
209.49 -> that we do try to use in our systems
213.57 -> and show the benefits that they bring.
216.3 -> And just to keep things interesting,
217.44 -> we're gonna interleave these,
218.55 -> so we're gonna go back and forth between the two.
221.4 -> So whether you're a hands-on builder,
223.71 -> writing code, shipping systems,
225.9 -> or whether you're an engineering leader,
228.21 -> or a principal engineer,
229.65 -> or even a business leader,
230.64 -> hopefully there's things in here
232.71 -> that will be useful to take away.
235.83 -> And so we're gonna start
237.27 -> by very literally going beyond five 9s.
240.78 -> And what I mean by that is just thinking a bit more deeply
245.34 -> than the traditional nines model itself allows,
249.33 -> because when we're reasoning about building
253.5 -> or designing highly available systems,
255.63 -> we have to be careful about, what components do we use?
259.26 -> or what reliability level do we insist on
262.2 -> for certain subsystems?
263.76 -> because that all adds up to our overall availability.
266.94 -> And in general, the traditional five nines model,
270.36 -> and nines model, it's kinda crude
272.97 -> and doesn't really give us enough.
275.46 -> So you're probably familiar with the very traditional
278.58 -> pattern of you measure the uptime or availability
281.46 -> of a system maybe once a minute or something,
284.76 -> and determine is it up? is it down?
287.64 -> is it fully available? is it degraded? or whatever,
290.25 -> and you can translate that into the number of nines
293.43 -> that that system was available for.
295.77 -> A system that has two nines of availability
297.63 -> can have up to three and a half days of downtime,
302.16 -> or impact, in a year, which is quite a lot.
305.19 -> All the way up to five nines or even,
307.62 -> more nines systems, and with a five nines system
310.29 -> you're talking five and a half minutes per year,
312.48 -> which is, you know, it's pretty aggressive.
314.55 -> That's a very high level of availability at that point.
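The arithmetic behind those figures is simple enough to sketch. The snippet below is an illustration only, assuming a 365-day year and treating N nines as an unavailability of 10^-N; the function name is made up for this example.

```python
# Downtime budget per year implied by a given number of nines.
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime, in minutes per year, for an availability of `nines` nines."""
    unavailability = 10.0 ** (-nines)  # 2 nines -> 0.01, 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


for n in range(2, 6):
    minutes = downtime_minutes_per_year(n)
    print(f"{n} nines: {minutes:10.2f} min/year ({minutes / (24 * 60):.3f} days)")
```

Two nines works out to roughly 3.65 days a year, and five nines to a little over five minutes a year, in line with the figures quoted above.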
318.33 -> And this model is useful.
321.18 -> It's great for measuring overall business impact and value.
324.86 -> It does very succinctly summarize,
327.93 -> well, how available was that system in a year?
330.36 -> how often did I have to work around it?
332.43 -> All of those things.
333.6 -> You can see these baked into business relationships,
336.51 -> and SLAs, and so on, often are expressed
339.93 -> in terms of numbers of nines.
341.94 -> But it's also a very low-fidelity measurement.
344.1 -> It doesn't really give us any insight
345.99 -> into how the system is going to perform,
348.18 -> and it's not so great for systems design.
352.56 -> Some of the challenges of the nines model are,
355.56 -> well, what about partitioned and cellular systems, right?
359.61 -> Nines can easily capture a binary state,
363.63 -> uptime versus downtime,
365.64 -> but it's harder for it to measure
367.32 -> partial availability, right?
369.87 -> And at AWS, we've been partitioning all our systems
373.32 -> since we launched a second region.
376.02 -> We have independent, isolated copies of each service
380.31 -> in every region, and as far as I know,
383.28 -> we've never experienced a global outage of any service
386.43 -> across all regions.
387.78 -> So every time we're talking about an issue or impact,
391.35 -> it's already partial by that measure
393.33 -> because it's only impacting one region out of many.
396.99 -> And so some customers are having an absolutely fine day
400.5 -> because they're in different regions, different locations,
405.54 -> but some are impacted,
406.56 -> but the nines model doesn't capture that as easily.
409.95 -> Another challenge with the nines model
412.32 -> is it's hard to pick a good time duration to measure over.
417.42 -> Modern services, like the kind of cloud services
419.82 -> that we operate, can now typically go many, many years
424.47 -> in a region without impact,
426.51 -> without, you know,
427.98 -> having a single incident.
429.69 -> There're extreme examples of that.
431.42 -> So the Route 53 data plane,
433.89 -> so that's the part of the Route 53 DNS service
436.32 -> that actually hosts domains, answers DNS queries
439.26 -> so that websites stay online,
441.45 -> I was part of the team that built that.
443.547 -> And when we were building it, and launching it,
446.52 -> 12 years ago, very early in the beta program,
449.67 -> I put my own personal domain right on that thing.
453.18 -> It was the very first domain hosted on Route 53,
456 -> it was my own personal website.
457.77 -> Not a very high traffic website,
459.27 -> but I owned a domain and I figured I might as well
462.54 -> start somewhere, drink my own champagne,
464.76 -> and put it on the service.
467.25 -> And in over 12 years that's been running like that,
470.4 -> I've never even once experienced
472.71 -> not being able to have my domain resolved.
475.59 -> Out of that whole time, it's worked.
477.66 -> And, the five nines model,
480.27 -> I don't even know how many nines that is, right?
482.73 -> It's really hard to capture.
485.76 -> And part of the challenge here is that the nines model
489.54 -> is really just a simple summary statistic, right?
492.87 -> It's just a single number that's trying to describe
495.69 -> a complex data set.
497.28 -> One of my favorite papers is a statistical paper
500.85 -> where the authors produced
502.35 -> these data sets that are radically different.
505.98 -> One looks like a T-Rex,
507.9 -> and the rest look like these other interesting shapes,
510 -> but they all have the same mean, the same median,
513.15 -> same standard deviation, and a bunch of other
515.82 -> summary statistics where they're all the same thing.
517.673 -> And the point of the paper was to get across
519.84 -> something that is commonly taught
521.97 -> in the very first lecture of a statistics module,
525.33 -> which is you should plot your data,
526.86 -> you should look at how it looks.
529.08 -> Don't just focus on the means and averages, and so on,
531.72 -> 'cause it won't give you much insight.
533.91 -> So we can apply the same technique, right?
536.25 -> But what really matters,
537.81 -> what is the thing that we really, really care about?
540.84 -> 'Cause personally, I don't think nines tells you that much.
543.78 -> And I think the thing that really matters to most customers,
548.19 -> engineering leaders, and systems designers
551.28 -> is how long can an interruption last?
553.98 -> and how often can that happen?
556.56 -> And something I've done as a practice
559.8 -> and seen work really, really well
562.05 -> is to approach every system's design
564.99 -> with this frame of mind, and this frame of conversation.
568.26 -> And so, by talking to a team and just asking them,
571.41 -> don't ask them, "Well how many nines are you?"
573.6 -> Ask them, "Well, what do you think the longest event
578.327 -> you could have would be?
579.57 -> And how often do you think that could happen?"
581.7 -> 'Cause that's just much more useful actionable data.
585.54 -> As a business owner, that tells you whether
587.69 -> it might be an acceptable system or not.
590.16 -> As an engineering designer, it tells you,
591.93 -> well now I know how long I might have to work
594.42 -> around that system if it has an issue, right?
596.66 -> It really, really focuses the mind.
598.95 -> The durability industry has used this approach
601.23 -> for a long time.
602.063 -> Recovery time objective is kind of the standard number
604.5 -> that that industry uses for, I've got durable data,
607.86 -> what happens if it goes down
609.09 -> and I have to restore the whole system from backups?
611.55 -> How long could that take, right?
612.87 -> That's recovery time objective.
615.57 -> And that's what they focus on,
616.97 -> and I think it's a better thing to focus on.
620.88 -> So put all this together and focus on,
622.92 -> well, what's the rate and expected duration of
627.57 -> an incident, or some kind of downtime,
629.67 -> or something like that?
630.96 -> It's really easy to plot.
634.05 -> I've made up these data sets,
636.33 -> but it's easy to plot them with almost any plotting tool,
640.83 -> I used Excel for these.
641.963 -> But the idea is to just pick some time ranges
644.88 -> that you care about, time durations,
648.15 -> let's say how often in a year
650.45 -> am I gonna have one minute interruptions?
652.68 -> And this graph says for this fictitious system,
655.62 -> 10 one minute interruptions in a year.
658.17 -> But then how often am I gonna have an interruption
661.35 -> that might last an hour, right?
663.33 -> And on this graph, it's a log graph,
666.06 -> it's about 0.5, and that's telling you,
669.33 -> well, maybe every two years, right,
671.25 -> you're gonna have an interruption that lasts an hour.
673.56 -> Straight away, you're getting kinds of insights
675.87 -> that the nines model just cannot give you, right?
678.69 -> It can't expose that kind of information to you.
682.167 -> And this is still giving you binary data.
684.99 -> It's just telling you something's up, or something's down.
688.77 -> If we wanna capture partial impact,
690.51 -> it's really simple, because this is a graph,
692.43 -> we can just add more lines, right?
694.23 -> And so we can look at the, you know,
696.42 -> what are the expected rates of a 5% impacting event?
701.64 -> How long could an event that impacts 5% of my resources,
705.21 -> or customers last, and so on?
708.3 -> And just plotting the data like this gives you this insight.
711.9 -> And then the next thing I do,
715.17 -> I just take this data and I multiply it
717.24 -> by the number of minutes involved.
718.74 -> Nothing fancy at all.
719.73 -> So we're just weighting it by minutes.
722.22 -> And so now I get minutes per year by expected duration.
726.06 -> And this is the kind of mental model and frame of reference
729.51 -> I use when evaluating components and systems, right?
732.87 -> Because this tells me so much, right?
735.63 -> It tells me, okay, well how reliable is that system gonna be
739.08 -> in this high-fidelity way?
741.63 -> But it also teaches me what to focus on, right?
744.66 -> Like if I saw a graph like this for a system, I would know,
748.26 -> well I should probably spend more time focusing
750.15 -> on the right hand side of this graph,
751.77 -> the longer duration events,
753.75 -> 'cause those are gonna accumulate to more impact.
755.94 -> And so that means refining operational processes,
758.73 -> putting better failover mechanisms in place,
761.85 -> and so on, and so forth, right?
763.86 -> Whereas if I saw that the left was higher,
765.81 -> if it were a different system, and I'm going,
767.88 -> you know, you're gonna have a lot of small interruptions.
770.31 -> That's just probably telling me I need more components,
773.46 -> I need more resiliency,
774.87 -> but those are very different conclusions.
776.91 -> But the number of nines could be exactly the same
779.28 -> for those two radically different graphs.
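A small sketch of that weighting step, using made-up numbers in the same spirit as the made-up data sets in the talk. System A reuses the 10 one-minute and 0.5 one-hour figures mentioned above; everything else, including the function names, is a hypothetical chosen so that both systems come out with the same total downtime.

```python
MINUTES_PER_YEAR = 365 * 24 * 60

# Made-up data: expected events per year, keyed by event duration in minutes.
system_a = {1: 10.0, 10: 2.0, 60: 0.5, 240: 0.1}
system_b = {1: 60.0, 10: 1.8, 60: 0.1, 240: 0.0}


def weighted_minutes(rates):
    """Expected minutes of impact per year contributed by each event duration."""
    return {duration: duration * rate for duration, rate in rates.items()}


def availability(rates):
    """The single summary number: fraction of the year without impact."""
    return 1 - sum(weighted_minutes(rates).values()) / MINUTES_PER_YEAR


def dominant_duration(rates):
    """The event duration that accumulates the most minutes of impact per year."""
    weighted = weighted_minutes(rates)
    return max(weighted, key=weighted.get)


for name, rates in (("A", system_a), ("B", system_b)):
    print(
        f"system {name}: availability={availability(rates):.6f}, "
        f"most impact comes from {dominant_duration(rates)}-minute events"
    )
```

Both hypothetical systems accumulate about 84 minutes of impact a year, so they report the same number of nines, yet A is dominated by hour-long events and B by one-minute events.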
782.97 -> It's just simpler to talk about and reason about systems,
785.997 -> and I find conversations like this enormous simplifiers.
789.69 -> When I'm working with customers
790.86 -> who have the highest availability requirements,
793.53 -> where we're talking about live market trading,
797.43 -> or safety critical systems, and so on,
800.49 -> we pivot into these conversations quickly
802.77 -> and it really helps us get to the meat of matters.
806.34 -> And the gold standard for this technique
809.97 -> is to use real historical data
812.25 -> that accounts for the operational performance
814.56 -> of a service and its team,
816.21 -> and just use that to plot the graphs.
818.49 -> Typically, services usually come with metrics,
822.3 -> and canaries, and so on,
824.517 -> and there are ways to get this data and build it.
827.1 -> But even if you're building something from scratch,
829.41 -> or you don't have that data, this even works with guesswork.
833.1 -> I've done exercises where I've gone to teams
835.167 -> and I've just said, "Well, tell me your guess,
837.66 -> how often do you think you're gonna have a one hour event?
841.59 -> What's your guess?
842.52 -> Do you think you could have one of those in a year?
844.5 -> Do you think it's like a 25% chance?"
846.6 -> And just that conversation, that style of conversation,
850.5 -> right away gets you to actionable interesting data
853.02 -> that it's really hard to talk about with the nines model.
857.37 -> And by the way, engineers' guesses, they tend to be right.
865.56 -> No matter what you ask them about,
867.09 -> they have an intuitive sense for this,
868.74 -> at least the experienced ones.
870.54 -> When you plot out these graphs,
872.16 -> they're very different for a cloud service,
875.64 -> or an on-premises database, or an on-premises application,
878.97 -> or low-level physical infrastructure,
881.25 -> even when they might have the same number of nines,
883.77 -> they have very different shapes to these graphs.
886.14 -> And so you have to accommodate that
887.61 -> in your resiliency plans around these systems.
891.21 -> I just find that endlessly fascinating.
895.35 -> Richard Feynman famously applied this technique,
898.05 -> this kind of like, "Just go ask the engineers,"
900.99 -> in his Appendix F that he wrote
902.79 -> to the Challenger disaster report,
904.74 -> where he literally just asked all the engineers involved
907.29 -> in building the components for the Challenger rockets,
911.297 -> you know, "What do you think the failure rates are like?"
912.84 -> All this kinda stuff, and he added up the numbers
915.15 -> and he got to an overall failure rate
916.83 -> of about one in 100 for the space shuttle,
919.53 -> which tragically turned out to be pretty much on the nose.
924 -> Two space shuttles were lost
926.22 -> in just around 200-ish missions.
931.05 -> That stuff works.
932.88 -> But there's other challenges with nines,
934.89 -> there's other reasons why we try to avoid
937.47 -> the kinda naive nines model
938.91 -> when we're doing systems design.
941.934 -> It's that it comes from mechanical and civil engineering
946.17 -> and it's kinda based around failure mode analysis
949.68 -> that doesn't easily apply to our field.
953.73 -> Physical components, you can test a bolt or a screw
958.02 -> over, and over, and over again
959.1 -> under all sorts of different conditions
960.48 -> and see how often it fails,
961.8 -> and that will tell you the failure rate of that component,
963.9 -> and now you know how many bolts you need
965.7 -> to put in something to make sure that it's safe.
969.03 -> But our field's a little different.
970.8 -> Firstly, our failures are recoverable.
974.55 -> As an industry, pretty much every failure
977.91 -> has always been recovered from.
980.34 -> I don't know of any real outage or downtime
983.04 -> at any company, anywhere, really in this whole field
987 -> where they've had a permanently disabling event, right?
990.66 -> So that's not at all like physical components.
993.36 -> So recovery time really matters in a very different way.
996.12 -> And another is that software systems
998.04 -> and distributed systems,
999.45 -> they tend to have non-linear interactions
1002.09 -> and even positive feedback loops in different places,
1005.48 -> and much more state space than simple physical components,
1010.01 -> and all sorts of ways that makes them fail
1012.62 -> in kind of dynamic ways.
1015.35 -> You know, a simple example of that,
1018.02 -> classic example, you wanna suspend a weight
1020.81 -> well, you've got a chain, or a braid, right?
1023.18 -> Everyone knows, famously a chain is only as strong
1026.03 -> as its weakest link, right?
1027.14 -> It's an idiom.
1028.22 -> And so when you do the nines equations for these,
1031.22 -> it's a slightly different formula, right?
1032.87 -> The availability can go down for the chain,
1037.19 -> but go up for the braid, right,
1038.51 -> because one piece of rope can save the day
1041.39 -> as far as a braid is concerned.
1043.85 -> And if you do the math, you add the numbers,
1046.25 -> you'll get lower availability for the chain
1048.5 -> than for the braid.
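For reference, the textbook version of those two formulas can be sketched as follows. This assumes every link or strand fails independently, which is the assumption the classic nines math makes; the function names and the 0.99 figures are illustrative only.

```python
from math import prod


def chain_availability(links):
    """Series: the chain holds only if every link holds."""
    return prod(links)


def braid_availability(strands):
    """Parallel: the braid fails only if every strand fails."""
    return 1 - prod(1 - a for a in strands)


parts = [0.99, 0.99, 0.99]  # hypothetical per-part availability
print(f"chain of three: {chain_availability(parts):.6f}")
print(f"braid of three: {braid_availability(parts):.6f}")
```

With three parts at two nines each, the chain drops below two nines while the braid climbs to six.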
1050 -> And I've seen people apply this to distributed systems
1053.12 -> and systems design,
1055.22 -> and they end up getting disappointed.
1058.52 -> Like it doesn't really work.
1060.5 -> A simple example of this,
1061.88 -> I'll go back to my DNS example.
1064.58 -> DNS
1065.93 -> is a relatively simple
1069.26 -> service,
1070.34 -> it's been around over 40 years
1074.51 -> and its design hasn't changed much in that time,
1077.63 -> and you have a service, it answers DNS queries,
1079.88 -> websites stay online, right?
1081.8 -> It's very natural to think,
1083 -> well, I really care that my website stays up
1086.12 -> so I'm gonna use two or more DNS providers, right?
1089.27 -> I'm gonna put my domain on Route 53
1091.55 -> and another DNS provider.
1093.17 -> Great.
1094.003 -> And you go, well I'm getting some biodiversity,
1096.2 -> they probably have different software,
1097.61 -> different networks, different people.
1099.32 -> Very unlikely they would break at the same time, right?
1102.38 -> So it feels like the braid, right?
1104.717 -> But the problem is in practice
1106.37 -> turns out it's like the chain,
1108.14 -> because in practice DNS resolvers can't easily work around
1113.18 -> even one of your providers out of any being down.
1117.53 -> And so if one of your DNS providers has a bad day,
1119.75 -> then you have a bad day.
1121.19 -> So you end up actually doubling your fault rate,
1124.79 -> not halving it, the opposite of what you intended.
1128.36 -> Or another simple example
1129.41 -> is if you rely on your DNS provider
1131.03 -> to be able to make timely changes,
1132.457 -> 'cause that's what you use for your failover process,
1136.34 -> again, you're gonna get half the availability rate
1138.26 -> if you're relying on two providers,
1139.43 -> 'cause now you need those changes reflected in two places.
1143.33 -> That's just not really obvious
1144.98 -> unless you've gone and done it, and experienced it,
1147.32 -> and seen what's going on.
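A back-of-the-envelope version of that doubling, as a sketch only. It assumes each provider has the same hypothetical fault probability, that faults are independent, and, as described above, that any one provider being down is enough to give you a bad day.

```python
def bad_day_probability(provider_fault_rate: float, providers: int) -> float:
    """Probability that at least one of `providers` independent providers is faulty."""
    return 1 - (1 - provider_fault_rate) ** providers


p = 0.001  # hypothetical chance that a single provider is having a bad day
one = bad_day_probability(p, 1)
two = bad_day_probability(p, 2)
print(f"one provider:  {one:.6f}")
print(f"two providers: {two:.6f} ({two / one:.3f}x)")
```

For small fault rates the result is very close to twice the single-provider figure, which is the chain behaving like a chain.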
1149.78 -> My take is that the kinda systems we all build,
1152.78 -> they're more like cantilevers than chains and braids.
1156.71 -> And what I mean by that is every single one,
1159.05 -> it's kinda specific and has to be tested.
1161.87 -> Like you've gotta build maybe a scale model
1163.91 -> and really exercise it, and learn everything you can
1166.55 -> about how it might fail,
1168.05 -> and make sure that you've accounted
1169.43 -> for all the safety mechanisms, and so on.
1172.25 -> And that's the kind of approach to take.
1175.97 -> We'll see a bit more of that later.
1177.23 -> But this kind of end-to-end testing
1179.96 -> is what's really critical
1181.37 -> to figuring out whether these systems are safe
1184.16 -> and reliable enough for whatever purpose we've intended.
1188.63 -> All right.
1189.47 -> Now there are some techniques that are so general purpose,
1193.55 -> they're pretty much always a good idea.
1196.13 -> And I think the biggest one that we apply
1198.53 -> is to compartmentalize our systems, right?
1200.93 -> So I've already hinted at this a little.
1203.93 -> We run independent copies of every service
1207.29 -> in every region, right?
1208.58 -> They're not magically linked,
1210.71 -> there's no synchronous dependencies between them.
1213.38 -> And the idea is if you know a service is having a bad day
1217.46 -> in region A that shouldn't have a knock-on impact
1220.28 -> and cause region B to have a bad day too.
1223.34 -> That would also defeat the utility
1225.38 -> of having multiple regions for failover.
1228.62 -> And we take this to the extreme,
1230.6 -> like this is our default design paradigm
1233.57 -> and this is how we build every service that we do,
1236.27 -> it's baked into our tools, into our design processes,
1240.11 -> into our principal engineer tenets, everything.
1243.17 -> You know, thou shalt not build a multi-region service
1246.92 -> unless it's extremely intentional,
1248.78 -> and we're very, very careful about how we do build those.
1252.59 -> We wanna keep everything as separate as we can.
1255.41 -> But even inside the region we wanna do the same again.
1259.04 -> Inside the region we've got availability zones
1261.41 -> and where we can, we run isolated separate stacks
1264.44 -> inside those availability zones.
1266.15 -> So if there's a problem with an availability zone,
1268.97 -> us-west-2a here, we can just turn it off,
1272.42 -> failover to the other availability zones for a while.
1275.54 -> Fix it, fail back.
1278.42 -> It's a really, really powerful technique.
1282.47 -> We take it even further,
1283.73 -> we build cells inside availability zones.
1285.89 -> So when a service is big enough, when it makes sense,
1289.97 -> we can start to have multiple cells of that service,
1293.3 -> more independent copies,
1295.13 -> and then partition the customers across those cells.
1298.288 -> Some customers are on cell one, some are on cell three,
1301.37 -> and so on.
1302.63 -> And so, you know, maybe you make a change,
1304.43 -> it gets to one cell, that change has an issue
1307.25 -> or something like that,
1309.05 -> you're only impacting a much smaller percentage
1313.19 -> of your overall customers, than everybody.
1316.31 -> We take this to the absolute extreme with shuffle sharding
1320.03 -> where you can do better again, right?
1321.56 -> So let's say we've got 10 cells, instead of just saying,
1325.617 -> "Well, we're gonna take 10% of our customers
1327.53 -> and put them on cell one, and 10% on cell two, and so on,"
1332.03 -> what we do is we take each customer, and we say,
1335.127 -> "We're gonna put you on two random cells."
1338.66 -> So one customer, they're gonna be on cell one and cell six,
1342.53 -> and another customer might be on cell one and cell eight.
1346.34 -> So those two customers happen to share one cell, right,
1349.28 -> cell one, but otherwise, one's on cell six,
1352.79 -> one's on cell eight.
1354.74 -> And unless both of your cells that you're allocated
1360.32 -> happen to be having a bad day,
1362.12 -> which should be very unlikely,
1364.31 -> you should be okay,
1365.143 -> because there should be some resilience
1366.17 -> built into that system
1367.1 -> and you should be able to survive on the remaining cell.
1370.76 -> Firstly, because you've got that built in resilience,
1373.73 -> because we're really careful about how we do things
1375.59 -> like deploy changes, just naturally makes faults
1379.7 -> way less likely in terms of actually impacting a customer.
1382.85 -> While it might happen, the idea is they won't notice.
1386.81 -> But even if they do, even if somehow an issue
1389 -> does get to impacting two cells,
1392.45 -> 'cause maybe this issue was triggered by the customer,
1395.93 -> so maybe that's why those two cells became impacted.
1400.04 -> Maybe it wasn't change related,
1401.9 -> even in that event, with just 10 cells and this method,
1406.37 -> one 100th of your customers are impacted and share a fate,
1411.38 -> and those numbers get better, and better, and better
1413.06 -> the more cells you have,
1414.14 -> 'cause this scales exponentially.
1416.09 -> It's based on combinatorial math.
1418.43 -> It's pretty cool.
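A minimal sketch of the assignment and the combinatorics behind it. The hash-based assignment is an assumption made here so the example is deterministic; the talk only says each customer gets two random cells, and the names `shard_for` and `fraction_sharing_full_shard` are made up for illustration.

```python
import hashlib
from itertools import combinations
from math import comb

CELLS = 10       # number of cells, as in the example above
SHARD_SIZE = 2   # cells assigned to each customer

# Every possible pair of distinct cells: C(10, 2) = 45 shards.
ALL_SHARDS = list(combinations(range(1, CELLS + 1), SHARD_SIZE))


def shard_for(customer_id: str):
    """Deterministically map a customer to one pair of cells (hypothetical scheme)."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(ALL_SHARDS)
    return ALL_SHARDS[index]


def fraction_sharing_full_shard(cells: int, shard_size: int) -> float:
    """Fraction of customers expected to share *every* cell with a given customer."""
    return 1 / comb(cells, shard_size)


print(shard_for("customer-a"), shard_for("customer-b"))
print(f"10 cells, pairs:  1 in {comb(10, 2)}")
print(f"100 cells, pairs: 1 in {comb(100, 2)}")
```

Counting unordered pairs of distinct cells, 10 cells give 45 possible shards, so about 1 in 45 customers would share both cells with a given customer. That is in the same ballpark as the one-hundredth figure quoted above, which corresponds to counting 10 × 10 ordered picks. With 100 cells the same calculation gives 1 in 4,950.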
1419.263 -> And we try to use that in as many places as we can.
1423.74 -> This is something we learned,
1427.04 -> this wasn't something that was baked
1428.54 -> into the very first systems that we built.
1431.45 -> We retrofitted it into many,
1433.49 -> but it's something we learned to do over time.
1436.46 -> And I think of learning to do things over time,
1439.76 -> as actually the primary way that we really design systems.
1444.92 -> I think often folks can think
1447.14 -> there's some mythical reference manual somewhere,
1450.92 -> that maybe someone, someday, will write,
1452.9 -> that just has like all the best practices
1455.03 -> you're supposed to apply for systems design.
1457.88 -> And if you read that golden manual
1459.65 -> and you learn all those things,
1461.6 -> that you will just come out knowing
1463.22 -> how to build the perfect system, right?
1465.35 -> And now every time you'll just,
1466.617 -> "Oh, I've gotta build this system,
1467.96 -> I'm gonna use technique blah, blah, blah, blah, blah."
1470.9 -> It does not work like that.
1472.22 -> It doesn't work like that in civil engineering,
1474.02 -> mechanical engineering,
1474.89 -> and it definitely doesn't work like that
1476.06 -> in distributed systems.
1479 -> And so, our designs have to evolve
1482.03 -> and we have to figure out how to learn lessons,
1485.15 -> figure out what we might need to iterate on in a service
1487.61 -> and push that iteration.
1490.16 -> Now if you accept that,
1492.32 -> you can still push through and you can ask most people,
1494.697 -> "What do you think is gonna be most key
1498.08 -> to a highly available system?"
1500.09 -> And they'll almost always tell you,
1501.657 -> "Well it has to be simple."
1503.433 -> 'Cause if it's simple, if it's got fewer moving parts,
1506.39 -> there's just less things that can go wrong,
1508.61 -> it's gonna be a more reliable system, right?
1510.74 -> Very intuitive, makes a kind of sense.
1514.16 -> But the problem is like what does simple mean, right?
1518.72 -> A unicycle has radically fewer moving parts than a bicycle,
1523.79 -> but it is not simpler than a bicycle.
1526.25 -> It's way harder to ride a unicycle.
1528.65 -> And while a unicycle might have 50% the chance
1532.07 -> of getting a puncture, no one's commuting to work
1535.37 -> on a unicycle, except in Seattle, where I live,
1538.07 -> where I do see a few people who do it,
1540.89 -> but in general most people prefer bikes.
1542.96 -> Bikes are way, way more usable.
1545.15 -> They're a much better solution to human locomotion
1548.84 -> than a unicycle is, right?
1551 -> And what's going on there,
1552.2 -> even though it has more components,
1554.18 -> I think that's because solid designs,
1556.19 -> things that will survive, and we can really use,
1559.88 -> it's not about just having the fewest moving parts,
1562.34 -> but having the fewest parts that are truly needed.
1565.79 -> And having a design that takes into account
1569.15 -> all of the trade offs that have to be considered, right?
1572.21 -> So if you started, if unicycles, bikes,
1574.94 -> and nothing like that existed at all,
1576.56 -> and you were given that design challenge,
1578.84 -> you might end up with a tricycle
1580.49 -> because you're like, well that's intrinsically stable.
1584.06 -> You can learn to ride a tricycle in five minutes,
1586.97 -> whereas a bike probably takes a few hours, right?
1589.79 -> But tricycles are bigger, and heavier,
1591.53 -> and they're harder to park,
1593.24 -> so no one really commutes on tricycles.
1598.129 -> A bike turns out to be this nice middle ground, right?
1601.73 -> It's very usable,
1602.75 -> within a few hours, even a child can learn to ride a bike.
1605.66 -> And it is easy to carry places, and park,
1607.67 -> and has all those nice tradeoffs.
1610.25 -> But all of those things I think would have to be learned.
1613.25 -> If you went back to pre the 1800s
1615.71 -> and had never seen either of those things,
1618.11 -> I don't think there's a book you could read
1619.52 -> or any system you could come up with
1620.96 -> that you would pop out with a bike as the perfect design,
1624.47 -> right?
1625.303 -> These systems really, really do have to evolve over time
1628.97 -> and it's that process of iteration that matters.
1631.79 -> And so the best thing we can do as designers
1633.53 -> is try to use the best, most well worn patterns
1637.34 -> that we actually know of, and reuse them,
1639.47 -> 'cause that's what's gonna be most reliable.
1642.063 -> AWS services are really strong examples of this.
1644.99 -> Every service I know of still has code
1647.84 -> from its day one launch, certainly the ones I worked on,
1651.29 -> and I'm pretty sure the rest.
1653 -> And we actually have it
1653.833 -> as one of our principal engineer tenets,
1655.917 -> "Respect what came before."
1657.53 -> We don't wanna be a "not invented here" culture,
1661.07 -> or just rebuilding, re-architecting things for its own sake.
1665.27 -> Like, that way lies madness.
1667.34 -> Instead we try to leave things alone as much as we can.
1670.52 -> We add whatever business functionality we need to,
1672.73 -> to support new features, and improve performance,
1675.62 -> and do all of the innovation that we love announcing
1677.72 -> here at re:Invent.
1679.28 -> But other than that, we wanna use
1681.77 -> our root cause analysis process, which is our COE process,
1686.15 -> to learn any lessons we can
1688.01 -> from any interruptions or events,
1690.14 -> and reintegrate that at the lowest level possible
1693.35 -> into our systems.
1695.45 -> When we have something go wrong,
1697.73 -> we take those lessons and we bake it back into code,
1700.58 -> in our tools, and our systems,
1702.47 -> so that we're not just relying on good intentions,
1704.78 -> and so that we can keep those antibodies around
1707.6 -> and kinda spread those antibodies
1709.7 -> into the next system that we build.
1712.37 -> So an example of that, something else that we learned
1717.05 -> is that
1720.05 -> there's all sorts of change and dynamism
1723.23 -> that can be happening in systems,
1725.33 -> and a real key to building highly available systems
1729.86 -> is to minimize the amount of change and dynamism
1733.76 -> because risk is proportionate to change.
1738.11 -> I mean every event is a change, right?
1740.99 -> Something changed.
1741.91 -> That's why it went from up to down, right?
1744.23 -> By definition.
1745.7 -> Now I'm gonna talk later about how we ring-fence
1748.07 -> operational changes, configuration, software updates,
1750.62 -> stuff like that.
1751.55 -> But another kind of change that's much more challenging
1754.19 -> is load,
1755.24 -> right?
1756.77 -> A customer can have a busy day
1758.75 -> without being able to predict it,
1760.16 -> or know that it's coming,
1761.63 -> and you can have big load spikes, right?
1763.46 -> And those load spikes can exercise systems
1766.73 -> to points they've never been tested at,
1769.58 -> or slow down different parts of those systems
1773.27 -> and maybe that slowness isn't tolerated, and so on,
1776.33 -> and cause issues, right?
1779.06 -> So we wanna be able to wrangle that dynamism
1781.79 -> and we have a very counter-intuitive solution for this,
1786.38 -> which is to try and run some systems
1788.39 -> that are in key areas,
1789.98 -> just at the maximum load all the time,
1792.32 -> no matter what the customers are doing, right?
1795.86 -> That sounds counter-intuitive 'cause it sounds wasteful.
1798.11 -> I mean these are all software systems, it's not real waste,
1800.72 -> it's all just electronic bits flying around.
1803.18 -> But,
1805.49 -> it just feels wrong as a systems designer.
1809.72 -> It's not our human intuition for how things should work
1812.3 -> but actually it turns out to be much more reliable,
1813.98 -> because you're reducing the amount of change in the system.
1816.38 -> It's just always operating in a known, tested,
1819.95 -> kind of reliable state.
1821.9 -> A simple example of this is,
1824.27 -> let's say we've got a really high availability
1827.12 -> critical control plane, or sorry, data plane,
1830.81 -> we're in the data plane talk,
1832.43 -> something like a health check, right?
1835.46 -> Let's say, Route 53's health checks.
1837.2 -> So they're health checking your endpoints to see,
1839.51 -> well, is that health check currently healthy?
1842.54 -> Is the web server up?
1844.4 -> Is it not up?
1846.11 -> And report that status back
1847.85 -> and integrate that into Route 53
1849.65 -> so we can do some kind of failover, right?
1851.54 -> That's kinda how it works.
1853.58 -> And if you were to design this,
1856.58 -> I think a natural, organic way
1858.5 -> most folks would design this,
1859.82 -> is well, we'll build a health checker
1862.1 -> and we'll keep health checking,
1863.66 -> and if the health check changes,
1865.52 -> you know, if it goes from healthy to unhealthy,
1867.41 -> if we've got some state transition,
1869.39 -> then we're gonna push a change into Route 53, right?
1871.91 -> We're gonna kick off a workflow and edit the DNS record.
1876.68 -> If this, then that, right?
1878.6 -> Very, very simple.
1880.25 -> The problem with that is when you've got a busy day,
1884.03 -> when lots of health check statuses change,
1886.64 -> things get slower, right,
1888.11 -> 'cause you just asked the system to do more work.
1890.72 -> Now it's trying to push more transitions,
1893.15 -> ask Route 53 to make more changes.
1895.79 -> So that's not how it works.
1898.64 -> What it's actually doing is using constant work,
1901.82 -> which is every health check,
1903.62 -> it's happening once every few seconds, no matter what.
1907.07 -> And then the status, whether it's healthy or unhealthy,
1910.79 -> goes through this summarization process,
1912.71 -> actually temporarily goes through an S3 bucket,
1914.66 -> that's why I got that here.
1917.57 -> And then Route 53 is ingesting those health check statuses
1921.53 -> no matter what, right?
1923.54 -> Every few seconds, Route 53 is getting
1925.52 -> all of those health check statuses, all of them,
1928.7 -> it doesn't know whether they changed, or not,
1931.49 -> every single time.
1933.47 -> And so it wouldn't matter if one web server
1936.08 -> suddenly went from healthy to unhealthy,
1937.73 -> or if a million web servers went from healthy to unhealthy,
1941.12 -> this entire part of the system wouldn't care,
1943.52 -> wouldn't operate any differently at all,
1945.5 -> wouldn't get any slower.
1947.78 -> That's counter-intuitive,
1949.13 -> but it just makes things incredibly reliable.
1952.13 -> It's just, I love that pattern.
1954.172 -> (Colm laughs)
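The contrast between the change-driven design and the constant work design can be sketched in a few lines. This is only an illustration under stated assumptions, not Route 53's actual implementation: the function names, the endpoint names, and the idea of counting "records pushed per cycle" as the unit of work are all hypothetical.

```python
from typing import Dict, Tuple

Status = Dict[str, bool]  # hypothetical: endpoint name -> currently healthy?


def change_driven_work(previous: Status, current: Status) -> int:
    """'If this, then that' design: push one update per state transition.

    The amount of work grows with the number of endpoints that changed.
    """
    return sum(1 for name, healthy in current.items()
               if previous.get(name) != healthy)


def constant_work(current: Status) -> int:
    """Constant work design: push the full status table every cycle.

    The amount of work depends only on the table size, never on how
    many statuses changed since the last cycle.
    """
    return len(current)


def simulate(num_endpoints: int, num_failures: int) -> Tuple[int, int]:
    """Return (change-driven work, constant work) for one cycle."""
    previous = {f"endpoint-{i}": True for i in range(num_endpoints)}
    current = dict(previous)
    for i in range(num_failures):
        current[f"endpoint-{i}"] = False
    return change_driven_work(previous, current), constant_work(current)


if __name__ == "__main__":
    for failures in (0, 1, 1000):
        print(failures, simulate(1000, failures))
```

With 1,000 endpoints, the change-driven count moves from 0 to 1 to 1,000 as failures increase, while the constant work count stays at 1,000 in every case, which is the property the talk is describing.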
1956.99 -> So the next thing I wanna talk about is testing.
1961.76 -> Actually the next two topics
1963.02 -> I'm gonna talk about, testing and operational safety,
1966.17 -> are the number one things that any team can do
1969.77 -> to actually improve the availability of their systems.
1972.68 -> In practical terms, it is those things
1974.54 -> that make the difference.
1975.83 -> Testing our systems so that we don't have defects
1979.1 -> and having very rigorous operational safety processes,
1982.01 -> so we avoid mistakes impacting customers.
1985.76 -> That is ultimately what the game is about.
1988.94 -> Now, Andy Jassy, here at re:Invent, in 2016,
1995.24 -> gave us a great saying, which is that,
1996.807 -> "There's no compression algorithm for experience."
1999.71 -> And I am a super pedantic engineer,
2002.29 -> I'm a really annoying guy to be around sometimes,
2005.65 -> and I remember thinking to myself, wow, is that right?
2008.89 -> I mean, I can read a book, right,
2010.87 -> and then I've got the author's experience,
2013 -> or I can go to a lecture
2015.76 -> and then I've got the professor's experience, and so on,
2018.37 -> but that's not what this is about.
2020.92 -> I really love this saying because what it's really conveying
2024.19 -> is the kind of experience
2026.86 -> that comes from having been through issues and events
2030.94 -> and learning lessons the hard way,
2033.19 -> turns into habits, right?
2035.08 -> Turns into personal habits, and organizational habits,
2039.04 -> and things that we all insist on,
2041.44 -> and that's ultimately what shows up and drives quality.
2044.95 -> And I genuinely think there is no compression algorithm
2047.05 -> for that, kind of as a group, as a team, as people.
2049.81 -> Like there's a certain amount of,
2050.92 -> like, you just have to have seen some things
2054.19 -> to get that.
2055.023 -> I think this applies
2057.7 -> to testing
2060.37 -> because I think of testing
2062.92 -> as a little bit of a different kind of compression,
2065.53 -> a little bit of a time travel trick.
2068.86 -> If we know that we're gonna have to evolve our systems,
2072.22 -> right, to make them the most reliable patterns,
2075.46 -> then we wanna speed up that process of evolution
2077.74 -> and run it as fast as we can.
2079.69 -> And the secret to that is lots and lots of tests,
2083.59 -> and lots of testing, and lots of observability.
2087.25 -> And if you were to work at AWS
2089.41 -> and see inside what's going on in an AWS service,
2091.81 -> you would see just an enormous amount of tests.
2095.2 -> You know, typically an AWS service
2096.61 -> has thousands and thousands of unit tests,
2099.52 -> hundreds of integration tests.
2101.53 -> We've got pre-production environments
2103.03 -> that we run and simulate everything in,
2105.49 -> and then we've got all sorts of fancy testing.
2107.29 -> If you want a flavor of this,
2109.6 -> we've got an open source project, s2n,
2111.79 -> which is our SSL/TLS library,
2114.61 -> and that is a data plane component,
2118.325 -> that is a data plane team, at AWS, shipping that code,
2121.3 -> but it's all public on GitHub.
2122.857 -> And you can go look for yourself at just how many tests
2126.04 -> they have, and how many tests they're constantly adding,
2129.04 -> and how tests are in every single code review,
2131.92 -> and it'll just give you a sense of the scale
2134.14 -> of what's typical, it's kinda like a view inside.
2137.23 -> It is an enormous amount,
2139.51 -> way more than I see in typical open source projects.
2145.03 -> I don't have too broad a view into elsewhere in industry
2147.97 -> to benchmark us against that,
2150.52 -> but all of these tests are integrated and automated.
2154.39 -> Every time I submit a code review,
2157.21 -> well, obviously it had to pass my unit tests
2159.07 -> before I'd even bother submitting it,
2161.08 -> 'cause those run locally.
2162.49 -> But even after that, you know,
2163.63 -> we're kicking off integration tests
2166.06 -> so that we're not gonna waste the code reviewer's time
2169.06 -> on something that doesn't pass the integration tests.
2171.76 -> And if it gets through code review, you know,
2173.86 -> we're gonna do beta, and gamma,
2176.05 -> and other pre-production testing, and more besides.
2180.88 -> And that's just what the team is doing directly.
2183.61 -> As an organization, every time we build a new region,
2186.55 -> which is quite frequently these days,
2189.1 -> we, several times before that region has launched,
2192.16 -> we'll use that region as a full region-scale wind tunnel.
2197.11 -> It is a simulation environment
2199.39 -> and we will go deliberately break some things.
2201.49 -> We'll say, "Turn off power in this availability zone,"
2204.37 -> and then measure ourselves against that,
2206.71 -> and learn any lessons we can.
2210.07 -> You know, it's the cheap way,
2211.99 -> the way without impacting customers at all,
2216.49 -> by far the better way to learn those lessons,
2219.01 -> and then reintegrate those lessons
2220.48 -> into how our systems are doing.
2222.34 -> We also have teams that work on automated reasoning,
2224.62 -> building mathematical models of our system.
2226.84 -> Then you can use techniques like symbolic execution
2229.84 -> to run that code through basically every possible input,
2234.34 -> and confirm that invariants are held, and so on.
2238.42 -> We also do fuzz testing, which is pretty similar.
2241.27 -> All of this is like time travel, right?
2242.95 -> You're basically simulating the code through all these code paths
2245.62 -> that you could never really see in production quickly,
2247.93 -> right, until some day where you suddenly do.
2250.96 -> Way, way better to find that out using these mechanisms.
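As a rough illustration of the idea of pushing code through many inputs and checking that invariants hold, here is a small randomized test loop. It is a sketch only: the length-prefixed `encode`/`decode` pair is a made-up toy, not s2n or any AWS code, and a real fuzzer or symbolic execution tool explores inputs far more cleverly than a seeded random generator does.

```python
import random


def encode(payload: bytes) -> bytes:
    """Toy framing: 2-byte big-endian length, then the payload."""
    if len(payload) > 0xFFFF:
        raise ValueError("payload too large")
    return len(payload).to_bytes(2, "big") + payload


def decode(frame: bytes) -> bytes:
    """Inverse of encode; rejects malformed frames with ValueError."""
    if len(frame) < 2:
        raise ValueError("truncated header")
    length = int.from_bytes(frame[:2], "big")
    body = frame[2:]
    if len(body) != length:
        raise ValueError("length mismatch")
    return body


def random_bytes(rng: random.Random, max_len: int) -> bytes:
    return bytes(rng.getrandbits(8) for _ in range(rng.randint(0, max_len)))


def fuzz(iterations: int = 10_000, seed: int = 0) -> int:
    """Check two invariants on random inputs; return iterations run."""
    rng = random.Random(seed)
    for _ in range(iterations):
        # Invariant 1: decoding an encoded payload returns the payload.
        payload = random_bytes(rng, 64)
        assert decode(encode(payload)) == payload
        # Invariant 2: arbitrary bytes either decode or raise ValueError,
        # never any other exception type.
        try:
            decode(random_bytes(rng, 8))
        except ValueError:
            pass
    return iterations


if __name__ == "__main__":
    print("iterations run:", fuzz())
```

The fixed seed keeps a failing run reproducible, so a failure found this way can be turned into a permanent regression test.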
2255.67 -> And then we also have a lot of instrumentation
2258.64 -> and observability built into our systems.
2261.07 -> Typically an AWS service
2263.02 -> has tens of thousands of metrics,
2266.68 -> and alarms configured on those metrics,
2269.32 -> and all of those sensors mean
2271.36 -> as we deploy a change, or an update to our system,
2274.24 -> if anything regresses across that whole thing,
2277.12 -> it's like this embedded sensor network
2279.46 -> that can tell us if there are any early warning signs
2281.86 -> that we just did something bad, or regressed performance,
2284.71 -> which could turn into availability issues, and so on.
2287.789 -> And we get this data all the time, right?
2290.35 -> So we can constantly run this iteration loop
2293.32 -> where we can get the data we need to reintegrate,
2296.29 -> and hopefully build a bike, and not a unicycle, right?
2300.91 -> That's what this is about.
2303.19 -> Now all this testing, it's great for building
2306.25 -> software that has as few defects as possible,
2310.36 -> but it would mean nothing if we were then cavalier
2313.3 -> or haphazard about how we just get that software out there.
2316.83 -> It turns out, at least in our environment,
2319.75 -> I think many of our customers' environments,
2322.57 -> most incidents, most issues,
2324.79 -> are actually just triggered by changes.
2326.83 -> You know, operational changes that needed to be made,
2329.44 -> configuration updates, whatever it was.
2332.35 -> Those are what can be hard to account for
2336.018 -> and cause problems.
2337.96 -> And, you know, it's in the name,
2339.16 -> web services are services, right?
2340.99 -> They're live running operational concerns.
2343.9 -> They're not just code in an editor,
2345.97 -> not just a compiled binary sitting there, right?
2349.03 -> It's a live running thing and it's interactions with people
2352.42 -> that are gonna change its behavior.
2353.95 -> Sometimes customers, like we saw with load,
2356.35 -> but most times, operators, right?
2359.32 -> So that's what we have to pay attention to.
2362.53 -> Now we've made enormous investments
2364.24 -> in centralized deployment safety.
2366.16 -> We have an incredibly paranoid,
2369.55 -> incredibly rigorous process to deploying changes,
2374.11 -> because we know that new code means new risk.
2376.39 -> As rigorously as we test it,
2378.31 -> we know there still might be code paths
2380.38 -> that a customer is gonna be first to exercise.
2384.34 -> And so we only wanna promote those,
2386.5 -> that software,
2388.42 -> very cautiously.
2389.74 -> We wanna be very gradual
2391.434 -> in how we expose that to customers,
2394.81 -> and we want always to have the ability
2396.49 -> to quickly roll it back, right?
2398.41 -> So if there's even a hint that something's wrong,
2401.14 -> we'll just roll back, figure it out,
2402.49 -> do some investigation, before we proceed again.
2406.66 -> And it's really, really typical that, you know,
2410.26 -> if I check in some code, goes through code review,
2412.6 -> and so on, and does all those normal levels of testing
2416.53 -> that I mentioned, all the way up to pre-production,
2418.84 -> but let's say it passes all of that.
2420.01 -> So we think it's good to go to production,
2422.38 -> we'll typically promote it only just to one box
2424.93 -> where we'll run it for a while,
2426.43 -> observe all the metrics we can,
2427.78 -> and confirm everything looks healthy,
2430.06 -> and then progress to one availability zone,
2433.03 -> and then one region, zone by zone.
2434.92 -> We never deploy to multiple zones at a time
2437.097 -> 'cause that would defeat the point
2438.58 -> of having availability zones.
2441.22 -> And only as we get more confidence
2443.77 -> in that change in software, do we expose it,
2446.98 -> to a bigger and bigger radius of infrastructure.
2450.07 -> And that has turned out to be enormously beneficial.
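The progression described above (one box, then one availability zone, then the rest zone by zone, with rollback at any hint of trouble) can be sketched as follows. This is an assumption-laden illustration, not AWS's deployment tooling: `plan_stages`, `rollout`, the stage naming, and the `deploy`, `is_healthy`, and `rollback` callbacks are all hypothetical stand-ins.

```python
from typing import Callable, Dict, List


def plan_stages(canary_box: str, regions: Dict[str, List[str]]) -> List[str]:
    """Order the rollout: a single box first, then one zone at a time.

    No stage ever covers more than one availability zone.
    """
    stages = [f"box:{canary_box}"]
    for region, zones in regions.items():
        for zone in zones:
            stages.append(f"zone:{region}/{zone}")
    return stages


def rollout(
    stages: List[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Deploy stage by stage; on any unhealthy signal, undo everything.

    Returns True if every stage was deployed and stayed healthy.
    """
    done: List[str] = []
    for stage in stages:
        deploy(stage)
        done.append(stage)
        if not is_healthy(stage):
            # Roll back in reverse order, most recent stage first.
            for finished in reversed(done):
                rollback(finished)
            return False
    return True


if __name__ == "__main__":
    plan = plan_stages("host-1", {"region-a": ["az1", "az2"]})
    ok = rollout(plan, print, lambda stage: True, print)
    print("completed:", ok)
```

In practice the health check would watch metrics and alarms for a bake period before promoting to the next stage; here it is reduced to a single callback to keep the sketch short.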
2454.96 -> We're also at the scale, at AWS,
2456.91 -> where we just can't manage things by hand.
2460.75 -> I can no longer remember every AWS region.
2464.05 -> I just can't.
2465.34 -> And we also have regions I don't even have access to,
2470.11 -> just because of the way we work.
2472.51 -> And so, it's just not really possible to operate
2475.437 -> an AWS service in a way where you're like,
2477.49 -> imagine you're logging into a box,
2478.99 -> and old school SSH commands, and restart this,
2481.57 -> and all that kinda stuff.
2482.403 -> It just doesn't work like that.
2483.236 -> It's all done through these centralized deployment systems,
2486.61 -> and safety systems, and tools, and automation.
2492.34 -> Some of that's for security reasons.
2494.59 -> We wanna have important safeguards around systems
2497.02 -> that make sure people can't just be accessing them.
2499.36 -> But a really great benefit of this
2501.49 -> is operational safety too.
2504.46 -> You know, not having access to things
2507.22 -> means it just makes it harder and harder to break them.
2511.18 -> And we've been ratcheting this up,
2512.53 -> we've been building systems that are hermetic,
2515.23 -> like the AWS Nitro system, for example,
2517.09 -> which just really has no operator capability at all.
2520.09 -> Like there's no login capability,
2522.94 -> or general purpose administrator capability,
2524.95 -> or anything like that.
2525.85 -> The only way to administer the system
2529.09 -> is through software deployments.
2531.4 -> And this automation first approach
2533.77 -> just means every lesson we learn, right,
2536.68 -> can be fed back into code,
2539.23 -> which is the best place to put it, right?
2542.05 -> You have an outage, you have an incident,
2543.76 -> you learn the lessons you can,
2545.5 -> well one approach would be to go,
2546.58 -> well, we'll try harder next time,
2548.14 -> we won't make that mistake.
2549.16 -> Let's tell everybody we know.
2551.29 -> That's a disaster.
2552.16 -> That's just good intentions.
2553.87 -> Teams change over time,
2555.28 -> maybe somebody forgets, whatever, right?
2557.62 -> So we really wanna be able to put that lesson
2559.6 -> into a tool that will enforce it rigorously,
2562.69 -> do it the right way next time,
2565.57 -> because that's what's gonna produce the most enduring value.
2567.85 -> And that's how it becomes an antibody, right,
2570.25 -> that's baked into the system.
2571.81 -> We're never gonna repeat that mistake
2573.85 -> and we're gonna get the value of that lesson applied.
2576.43 -> And that's our philosophy too.
2579.43 -> Operational safety.
2581.38 -> We never wanna see teams being haphazard or cowboy-ish
2586.09 -> about operations.
2588.97 -> We like to see things being meticulous and fastidious,
2592 -> and that comes with overheads.
2593.26 -> You know, that's a pain to be honest,
2595.69 -> an inconvenience to do things like that.
2597.67 -> So it's another reason why you want automation first, right?
2600.55 -> You wanna be able to have the tools do all that.
2604.93 -> And so the last thing that I wanna talk about
2607.562 -> is a property that I see
2611.29 -> in every team I've worked with
2613.42 -> that manages very high-stakes systems.
2617.89 -> So the systems I'm talking about,
2619.66 -> remember the slide at the beginning
2620.8 -> with all the service logos.
2623.44 -> I think it's fair to say at this point,
2626.02 -> those are internet scale services, right?
2628.63 -> If they have issues, it's an internet scale event, right?
2632.29 -> It is literally making news, and so on.
2635.08 -> The stakes are pretty high,
2637.81 -> and it's natural and appropriate
2640.66 -> to be like, appropriately fearful of that, right?
2643.72 -> Like, you know, take that very, very seriously.
2646.93 -> But we have to get things done,
2649 -> and fear leads to the dark side.
2654.249 -> It's not a good emotion,
2655.21 -> it's not what you want,
2656.11 -> and it's not gonna produce the most conducive thing.
2660.31 -> And team culture,
2662.32 -> culture eats strategy for breakfast, right?
2665.14 -> If you want a system that's been well designed,
2667.84 -> hitting a high testing bar,
2669.73 -> and hitting great operational safety,
2672.49 -> what's gonna do that isn't gonna be
2676.69 -> a written document, or a vision statement,
2679.48 -> or some rules written somewhere
2681.25 -> that this is how you have to do it.
2682.87 -> You know, what's gonna rule the day,
2685 -> is how the leaders on that team,
2687.1 -> the folks that people look up to,
2689.23 -> how they convey themselves,
2690.97 -> the tough decisions and calls they make, right?
2693.76 -> The time where they decide not to launch something because
2696.49 -> it actually needs some more attention
2697.75 -> and some more safety controls.
2700.51 -> That's what really feeds into all of this.
2702.91 -> And that's what I see, right?
2706.21 -> These spaces have very high stakes
2708.52 -> and that demands very high standards,
2711.49 -> and it takes elite teams to deliver on that.
2715.51 -> You know, typically on these teams,
2717.91 -> when I see the folks writing code,
2721.36 -> like testing,
2723.1 -> and quality assurance,
2724.75 -> and all of these things
2725.77 -> that are the mark of a high quality system,
2728.14 -> it's not something that they're budgeting exactly, right?
2731.98 -> It's not like central planning decided,
2734.02 -> well, this team needs to spend 50% more time writing tests
2737.53 -> 'cause they're critical,
2738.58 -> so every sprint they're gonna spend
2740.56 -> this percentage of their time
2741.97 -> writing tests, and so on, right?
2743.8 -> It's not quite like that.
2744.88 -> It's not the way that it works, or is budgeted.
2748.78 -> Instead, it's more like
2751.66 -> they know they've learned over time
2754.3 -> that to operate in this space,
2756.1 -> they just have to have these good habits
2758.53 -> and every time they write code, they write a test.
2761.05 -> They just couldn't imagine doing it any other way, right?
2763.96 -> Or every time they do an operational change,
2766.48 -> they do it the safe way because they just couldn't
2768.79 -> even imagine doing it any other way.
2771.31 -> Even with a gun to their head,
2772.45 -> you couldn't make 'em do it the other way,
2773.59 -> it's just not how they do it, right?
2775.57 -> And it's like the master craftsperson,
2777.37 -> who's, you know, perfectly shaving a piece of wood
2780.67 -> and paying real fine detail each time,
2782.157 -> and if you ask them to do it in a hurry,
2783.94 -> they just look at you like,
2785.35 -> well, that doesn't make any sense.
2787.81 -> And, these folks know that if you cut corners,
2791.83 -> and if you don't have these high standards,
2793.96 -> you're gonna pay for it.
2795.31 -> You might ship a feature faster now,
2797.71 -> but you know,
2798.94 -> over a year or whatever, you're gonna pay for it
2801.13 -> because you're gonna have issues
2802.57 -> and you're gonna have to come back and fix things,
2804.37 -> and that's not fun.
2807.46 -> And so what you tend to see in these teams
2810.31 -> is that they have really fearless environments
2813.97 -> and they're very supportive of each other.
2818.23 -> And when I say that these teams are elite,
2820.24 -> I don't mean, it's not about,
2821.657 -> "Oh, we only hire the top 1% of the top 1%,"
2824.56 -> or anything like that, that's nonsense.
2828.34 -> Any person can be elite in the right environment,
2831.49 -> with the right support infrastructure,
2833.11 -> and the right culture around them, right?
2834.94 -> It's just about what the group insists on.
2837.64 -> If you went and worked for the fanciest hotel in the world,
2842.11 -> within days they'd be telling you,
2843.557 -> "Yeah, you gotta polish every speck of dust,
2846.1 -> you gotta make the bed perfectly.
2848.11 -> Like that is our level of standard.
2849.67 -> That's what it demands."
2850.84 -> And they would just get that into you really quickly, right?
2854.264 -> And it's kind of like that, you know,
2855.4 -> these teams, just how they do it,
2857.35 -> and they don't tolerate anything less,
2860.53 -> but in a very supportive, collegial way.
2864.52 -> And this is really, really important
2866.14 -> because when people are afraid
2867.73 -> and then they don't support each other,
2869.59 -> and when they just compete with each other,
2872.44 -> critiques and criticism gets buried, right?
2876.4 -> But you don't want that.
2877.84 -> These are high stakes systems.
2879.37 -> You want every risk, every possible criticism raised,
2883.03 -> openly and vocally, so that we can learn about it
2885.49 -> and potentially do something about it, you know,
2887.77 -> investigate it, and all the rest.
2889 -> And that's really, really key.
2891.16 -> So, every single one of these teams,
2894.52 -> I've been to their happy hours, and parties,
2897.58 -> and they're all very supportive of each other
2901.54 -> and it's just a fun, nice environment to be in.
2904.51 -> And I don't think that's a coincidence.
2906.82 -> I don't think it's a coincidence that it takes that.
2909.397 -> And so something I pay a lot of attention to,
2913.09 -> when I'm trying to keep an eye on a broad set of teams,
2916.3 -> is kind of where are they in a virtuous cycle
2918.94 -> that looks a bit like this?
2920.8 -> Which is that, a good technically fearless team,
2923.83 -> they insist on high standards,
2925.9 -> and that means they produce things
2927.4 -> that are maintainable and operable,
2929.74 -> and that in turn,
2932.11 -> makes that space attractive
2934.51 -> because the folks who value quality,
2936.43 -> they can spot this stuff a mile away.
2938.2 -> They can see, oh that team over there,
2939.55 -> they know their stuff,
2940.63 -> or they really insist on high standards,
2942.19 -> I wanna be part of that.
2943.93 -> And by the way, if you're early in your career,
2945.393 -> I think the number one thing you should always do
2947.2 -> is find a team like that,
2948.94 -> a team that will insist on higher standards.
2951.7 -> And early in your career, in the short term,
2953.56 -> it can be harder to get promoted on a team like that
2955.87 -> because they have higher standards, right?
2958.51 -> You might not make it to the next level as quickly,
2962.05 -> but by God, over 10, 15 years,
2964.48 -> like the duration of your career,
2965.95 -> that early experience of getting all those high standards
2969.19 -> kind of, into you, pays off enormously, right,
2972.37 -> because then you can bring that everywhere else.
2975.91 -> And there's a vicious cycle version of this too, right?
2978.22 -> If the opposite happens, right?
2979.93 -> If you've got poor standards,
2982 -> well you're gonna build a system that's unmaintainable,
2984.43 -> right, and that's rapidly gonna backfire.
2986.32 -> People are gonna get paged, they're gonna be overloaded,
2988.78 -> it's not gonna be fun, people might even leave.
2991.63 -> That's gonna make it even less maintainable,
2993.49 -> 'cause now there's fewer people
2994.57 -> to handle all those pages, and so on,
2996.367 -> and this can spiral.
2997.66 -> It's not fun, right?
3000.24 -> And there's unhappiness.
3002.4 -> And now, maybe when people are unhappy,
3004.339 -> they're not gonna be as supportive.
3006.12 -> So you really, really, really
3007.5 -> wanna be in the positive cycle.
3009.33 -> And I always tell people,
3011.257 -> "If a team ever does get into the unhappy state
3013.92 -> for some reason, and as a senior engineer
3016.83 -> or an engineering leader,
3017.73 -> you're asked to go in and help them,
3019.41 -> job number one, insist on high standards."
3021.15 -> Start there, 'cause you can always do that.
3023.97 -> You can always come in and say,
3025.777 -> "Okay, general amnesty,
3027.39 -> everything we're doing until today,
3028.98 -> like it's in the past, it's over with,
3030.93 -> but starting now we're gonna have high standards,"
3033.36 -> and that'll get the wheel turning
3036.39 -> because folks'll pick up on that,
3038.16 -> and it helps people care more,
3039.99 -> and generate more happiness, and so on.
3042.48 -> And I really look out for that
3044.22 -> because I actually think it's at that cultural level
3047.424 -> that we can have some of the broadest impact,
3049.41 -> that ultimately leads to higher availability systems.
3052.38 -> Hopefully that doesn't sound, you know,
3054.36 -> too hippie-ish, or something, from me,
3055.83 -> but
3057.217 -> I think it's very real.
3059.94 -> So key takeaways,
3062.16 -> there's a few hopefully speckled throughout the talk.
3064.86 -> Hopefully something's gonna stick in your mind
3066.99 -> that you'll remember someday when it's helpful.
3069.69 -> But the ones I wanna highlight
3070.98 -> that I think are particularly useful,
3072.75 -> is if you're working on highly available systems
3074.82 -> or systems design in general,
3077.31 -> and just try to push past the traditional nines model.
3081.39 -> Try to have a mental model that gets a little bit deeper
3085.71 -> and really tries to understand
3087.69 -> the potential failure characteristics,
3089.79 -> including duration, right?
3092.31 -> I didn't mention it earlier,
3093.21 -> but this is how we measure extreme weather events,
3095.49 -> for example.
3096.323 -> We talk about a one-in-100-year flood, right?
3099.36 -> Which is sadly becoming more common
3101.37 -> because of climate change,
3103.11 -> but this is actually a really common dominant approach
3106.41 -> to safety engineering, that for some reason or another
3109.38 -> is kind of overlooked in our field,
3111.69 -> but has a lot of value.
3113.94 -> The next takeaway is I want every builder
3117.87 -> to have enough humility to realize
3120.24 -> that we don't really get to design systems.
3122.91 -> You know, systems outlast us,
3124.38 -> they go through many iterations in many teams, and so on,
3127.44 -> and really, it's our job to shepherd its evolution.
3130.53 -> And that when we approach it with that philosophy,
3132.36 -> we should try to figure out how can we speed up
3134.28 -> that evolution as much as possible,
3136.08 -> and we can get as much data,
3137.85 -> and have the most rigorous ways to integrate that
3140.31 -> into our innovation and iteration processes,
3143.46 -> so that we collect all these antibodies
3145.62 -> and build them in,
3146.7 -> 'cause that's what's gonna have enduring value.
3149.16 -> And the last takeaway is
3151.71 -> it's really elite, happy, supportive teams
3156.18 -> that are gonna be the key to anything.
3158.58 -> You know, if you got a team that is burying risks,
3162.24 -> or is
3165.96 -> unhappy for whatever reason,
3169.05 -> you're just not gonna have enough kind of human slack
3173.13 -> to really be able to operate and build systems safely,
3177.12 -> and that I think just ends up
3179.1 -> being really, really key for this field.
3182.16 -> Hopefully everyone took some takeaway there.
3185.25 -> Thank you all for coming.
3186.09 -> It's been a real privilege to speak to you,
3187.5 -> especially here in such a beautiful location.
3189.337 -> (audience applauds)

Source: https://www.youtube.com/watch?v=es9527rA_8I