AWS re:Invent 2022 - A day in the life of a billion requests (SEC404)

Every day, sites around the world authenticate their callers. That is, they verify cryptographically that the requests are actually coming from who they claim to come from. In this session, learn about unique AWS requirements for scale and security that have led to some interesting and innovative solutions to this need. How did solutions evolve as AWS scaled multiple orders of magnitude and spread into many AWS Regions around the globe? Hear about some of the recent enhancements that have been launched to support new AWS features, and walk through some of the mechanisms that help ensure that AWS systems operate with minimal privileges.


Content

0.36 -> - My name's Eric Brandwine and I'm a distinguished engineer
2.88 -> with the Amazon Security team.
5.22 -> Today we're gonna talk about some of the systems
7.08 -> and processes we use when handling inbound API requests
10.83 -> from customers.
12.099 -> The title is a reference to a talk that I gave
14.72 -> at this conference nine years ago.
17.01 -> It was entitled 'A day in the life of a billion packets'.
20.25 -> It was talking about the internals of VPC.
22.95 -> Even back then, we handled way more
24.81 -> than a billion packets in a day.
26.94 -> So it was a fun title, but it was numerically inaccurate.
30.6 -> This title is also inaccurate.
32.61 -> The most recent public statistic is
34.41 -> that AWS IAM handles half a billion requests per second.
38.88 -> A quick bit of math leads you to the fact
41.04 -> that we must handle trillions of requests per day.
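The quick bit of math can be written out directly. This is only a back-of-the-envelope sketch that assumes the quoted rate of half a billion requests per second holds steadily for a full day; the variable names are illustrative.

```python
# Back-of-the-envelope check of the "trillions of requests per day" claim.
REQUESTS_PER_SECOND = 500_000_000   # "half a billion requests per second" (from the talk)
SECONDS_PER_DAY = 24 * 60 * 60      # 86,400 seconds

requests_per_day = REQUESTS_PER_SECOND * SECONDS_PER_DAY
print(f"{requests_per_day:,} requests per day")  # 43,200,000,000,000
```

That is roughly 43 trillion requests per day at a constant rate.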
44.1 -> This is one of the most scaled systems on earth
47.16 -> and one of the unsung heroes of AWS.
49.8 -> And today we're gonna take a closer look at it.
53.97 -> Anytime you're dealing with anything over a network,
56.628 -> anything that any security person has touched,
58.96 -> these two topics come up.
61.56 -> Authentication is making sure that the party is actually
63.84 -> who they say they are.
65.19 -> Do I have Eric Brandwine's driver's license?
67.89 -> Authorization is determining whether or not the person
70.32 -> that has that driver's license is cool enough
72.54 -> to do the thing.
73.53 -> I'm usually not that cool.
76.62 -> Most talks in this area gloss over authentication.
80.31 -> They assume that authentication is handled,
82.41 -> it's a solved problem,
83.94 -> and then they spend a lot of time on authorization.
86.49 -> Which is fine, authorization is a fascinating topic.
89.16 -> It's worth being dug into,
90.45 -> and there's a lot that can and should be said about it.
92.7 -> However, in this talk,
94.05 -> we're gonna flip that script on its head
96.09 -> and we're gonna spend most of our time talking
97.74 -> about authentication.
99.42 -> It turns out this can be fascinating as well,
102 -> and the way that we do it at AWS is unique and interesting.
105.78 -> In order to fit in the time allotted,
107.7 -> this talk is necessarily gonna gloss over a lot of details.
111.27 -> In particular, pretty much everywhere
113.4 -> that you see our system propagating keys,
115.23 -> we're also propagating policies.
117.39 -> Even though we're gonna focus on authentication
119.09 -> in this talk, clearly, we also perform authorization.
124.08 -> In order to authenticate people,
125.97 -> we're gonna need to build a cryptographic protocol.
128.88 -> You've probably heard of cryptographic primitives like AES
131.54 -> or elliptic-curve cryptography.
134.22 -> Those are the fundamental building blocks,
136.44 -> but they're not generally useful.
137.64 -> They tend to be very narrow in their application.
140.34 -> So these primitives are built up
141.69 -> into higher and higher level constructions
144.24 -> that perform ever more useful things.
147.54 -> Rather than something simple like, given a key
149.57 -> and a fixed-size message,
151.08 -> produce a fixed-size output that's encrypted,
153.42 -> you get: given a key and an arbitrary-size message,
156.54 -> produce an arbitrary-size output
158.22 -> that provides encryption and integrity protection.
161.73 -> However, even these building blocks don't fully solve
164.58 -> most real world problems.
166.23 -> The top layer of this stack is a protocol,
168.54 -> a cryptographic protocol.
170.16 -> This is the complete specification of every operation
173.1 -> that you need to perform in order to complete whatever task
175.645 -> that you're trying to complete,
177.42 -> ideally in a way that protects you against all
179.66 -> of the threats that you've specified.
182.16 -> Every layer of this stack is hard.
184.71 -> If you read the cryptographic papers, it's pretty rare
187.32 -> that one of the underlying building blocks is broken.
190.02 -> Much more often it's the gaps between the building blocks
193.05 -> where people find problems and things start to fall apart.
196.53 -> Making your own cryptographic protocol is hard,
199.8 -> it's expensive and it's not something
201.75 -> that we undertook casually.
203.73 -> We did a ton of design reviews, consultations,
206.19 -> penetration testing and more to maximize the odds
209.61 -> that we were gonna get this right.
212.452 -> It's traditional when discussing a cryptographic protocol
216.133 -> to model it as a conversation between two parties,
218.79 -> Alice and Bob, A and B.
221.19 -> Randall Munroe published this comic
222.72 -> at exactly the right time.
223.98 -> So please enjoy.
225.63 -> Anyway, we're gonna break from tradition in this talk.
229.71 -> We've got Alice, but Alice doesn't want to talk to Bob.
233.1 -> She wants to talk to the cloud.
235.05 -> So here's the cloud.
236.43 -> Alice needs to talk to the cloud, of course,
238.2 -> and we need to make sure that it's actually Alice
240.36 -> that's talking to us.
242.82 -> Now you're thinking, "Eric, this is a solved problem."
246.21 -> And indeed it is, for some definitions of solved.
249.3 -> The most common case is logging into a website.
252.15 -> So we've got a website and Alice has some sort of client,
255.6 -> most likely a web browser.
257.79 -> She browses to some website and is told, who are you?
261.87 -> And her browser's automatically gonna go redirect her
264.21 -> to some sign in endpoint.
265.5 -> So a login endpoint.
267.48 -> Her browser is gonna render a username
269.19 -> and password dialogue for her.
270.45 -> She's gonna answer these challenges
272.49 -> and that login site is gonna return to her a cookie.
275.25 -> A tiny little bit of data.
276.66 -> Her browser's gonna squirrel this cookie away,
279.42 -> and then she gets redirected back to the website.
282.06 -> But this time she includes the freshly minted cookie
284.94 -> in her request to the website.
286.5 -> The website can validate the cookie,
288.39 -> and go like, "Hey, welcome Alice."
290.49 -> We've successfully authenticated Alice.
294.3 -> Now, an important bit of context here is
296.82 -> that we initially designed this system in 2006.
300.15 -> That's a long time ago in the real world,
302.49 -> but it's like centuries in computer years.
305.88 -> Most of you probably weren't even born yet.
310.11 -> Netflix wasn't a streaming company.
312.9 -> Mobile phones were incredibly different.
315.81 -> The internet was much safer.
320.16 -> But TLS was not widely used.
325.303 -> TLS is transport layer security,
327.03 -> that's what we call it today.
327.9 -> We used to call it SSL, secure sockets layer.
330.09 -> That's what was in use at the time.
332.79 -> It was expensive.
334.23 -> Cryptographically, computationally expensive
336.09 -> on both the server and the client's side.
338.61 -> And so because we have a lot of clients that just pop up,
342.81 -> make an API call and disappear,
345.21 -> they can't amortize that TLS handshake across many requests.
348.69 -> They'd have to make that TLS connection
350.22 -> for every single call.
351.93 -> And when you consider that we have a lot
353.19 -> of embedded clients, things that have constrained CPU,
355.74 -> constrained power, constrained battery life.
358.17 -> That computational overhead could potentially
360.51 -> be completely unacceptable to our customers.
363.03 -> So in the internet of 2006 mandating
365.73 -> that every API call use TLS was a non-starter.
368.73 -> So because we couldn't make all of our customers use TLS,
372.09 -> we had to offer clear text endpoints for some of our APIs,
375.24 -> and that deeply informed our design.
379.59 -> So maybe, maybe we don't wanna use that cookie protocol,
383.73 -> but surely this is already a solved problem, right?
386.76 -> So we'll just give Alice an asymmetric key pair.
390.09 -> In most implementations,
391.38 -> the public key is gonna be encoded in a certificate,
393.96 -> an X.509 certificate, the kind that you use
396.18 -> to secure a TLS endpoint, and a private key.
400.38 -> The private key is exactly that.
402.75 -> It's a secret, it's private.
404.13 -> Alice must not share it with anyone.
406.08 -> The public key can be shared freely.
409.29 -> And what you can do with these is you can sign,
412.17 -> you can use the private key to sign a request
414.78 -> that generates a signature.
416.85 -> And so Alice can make a call to her provider asking
420.36 -> for a cloud, please.
421.83 -> And she's gonna include the certificate.
423.75 -> This is an assertion of her identity.
425.79 -> Again, this is public data. And this signature.
429.57 -> She's gonna send it to the company.
432.54 -> They're going to verify this API request,
434.55 -> given the certificate.
435.69 -> They're gonna check that the certificate was issued
437.31 -> by someone they trust, that it's still valid,
438.9 -> all of that stuff, that it's associated with Alice.
441.701 -> Then they check the signature,
443.49 -> which they can do given only the public key.
445.68 -> That checks out.
446.94 -> So they're gonna tell Alice, "Okay, this is great."
450.72 -> So we're good, right?
453.45 -> I mean, this is a great protocol.
457.17 -> One very nice property of this protocol
459.51 -> is that we've got a signature over the entire request.
462.24 -> So if for any reason Alice's request is tampered with,
465.93 -> whether it's malicious or a fat finger or a cosmic ray,
469.14 -> Alice is now asking for two clouds,
471.18 -> but she's signed for one cloud.
473.91 -> So she's gonna send this to the provider.
476.67 -> The signature validation is going to fail,
480.06 -> and the request is gonna be denied.
482.1 -> This protocol with a signature over the entire request
485.55 -> protects us against an eavesdropper looking
487.56 -> to steal the keys.
488.88 -> In the cookie protocol, if I eavesdropped on a request,
492.33 -> I got your cookie,
493.163 -> I could then use that cookie in future requests.
495.45 -> In this protocol, I eavesdrop on the request,
497.97 -> I get the signature, the signature is bound to the request.
500.79 -> I can't use it on different requests.
503.31 -> But it's not a panacea.
505.98 -> In this case, everything's working correctly.
507.9 -> Alice signed for one cloud.
509.13 -> Alice is requesting one cloud.
511.02 -> The cloud provider has said, "Yep, that's cool."
513.81 -> But we have an eavesdropper.
515.34 -> The eavesdropper pulls this off the wire.
518.58 -> Now, they can just send exactly that same request.
524.138 -> It has a valid signature.
526.32 -> It's going to validate at the provider
528.51 -> and it's going to succeed.
530.34 -> Now, in this case, Alice is paying for two clouds
533.01 -> rather than just one.
534.3 -> That may be nothing more than a nuisance,
536.28 -> but if this is something
537.113 -> like resetting an administrator password
539.58 -> and someone has taken the request and replayed it,
542.58 -> then that could be a real serious security problem.
545.04 -> And this is called exactly that, a replay attack,
547.92 -> because the adversary is just replaying a valid request
552.09 -> that they obtained.
554.76 -> So in reality, the string to sign
557.4 -> in the protocol isn't just the request itself,
559.59 -> it's gonna contain a timestamp.
561.66 -> So Alice is gonna sign this new string.
564.51 -> There's actually supposed to be a full timestamp there,
566.49 -> but I'm not very good at PowerPoint.
568.74 -> So to verify it, she's gonna send this request,
572.4 -> which includes the timestamp and the new signature.
575.49 -> We're gonna verify it.
577.14 -> This signature checks out.
578.55 -> This was actually signed by Alice,
579.99 -> or at least someone in possession of Alice's private key.
582.9 -> But then we're gonna make sure that that timestamp is good.
586.68 -> And since the timestamp is good,
589.2 -> then we respond success to Alice.
591.607 -> The reason that I'm saying good here
593.67 -> and not exactly the same is that you can't assume
596.1 -> that two arbitrary machines across the internet
598.32 -> are gonna have closely synchronized clocks.
600.87 -> And so you've gotta have a little bit of splay time
603.3 -> in there in order for things to work in the real world.
606 -> And in practice, we found that a value
607.77 -> of single digit minutes is typically reasonable.
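The "good, not exactly the same" check can be sketched as a tolerance window around the server's clock. This is a minimal illustration, not the production check: the function name and the five-minute tolerance are assumptions, since the talk only says "single digit minutes".

```python
from datetime import datetime, timedelta, timezone

# Assumed tolerance; the talk only says "single digit minutes".
ALLOWED_SKEW = timedelta(minutes=5)

def timestamp_is_good(request_time: datetime, now: datetime,
                      allowed_skew: timedelta = ALLOWED_SKEW) -> bool:
    """Accept a request only if its timestamp is within the skew window of `now`."""
    return abs(now - request_time) <= allowed_skew

now = datetime(2022, 12, 1, 12, 0, 0, tzinfo=timezone.utc)
print(timestamp_is_good(now - timedelta(minutes=2), now))   # True: slightly old
print(timestamp_is_good(now - timedelta(minutes=30), now))  # False: stale, likely a replay
print(timestamp_is_good(now + timedelta(minutes=1), now))   # True: client clock runs fast
```

A wider window tolerates worse clocks but gives an eavesdropper more time to replay a captured request.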
611.61 -> I don't actually wanna derive
613.23 -> a full-fledged signing protocol here live on stage.
616.47 -> I just included this bit about replay attacks and timestamps
619.89 -> to illustrate how the simple obvious things often fail
623.31 -> to work in practice.
624.96 -> Like all cryptographic protocols,
627.03 -> these signing protocols are really, really difficult
629.28 -> to get right.
630.18 -> And anyone that wants to dip their toe in these waters
632.7 -> would be well served by reading the papers
635.01 -> about the flaws that have been discovered
636.6 -> in various protocols.
639 -> Some of you, for example, may be thinking like,
641.197 -> "Hey, he used a semicolon to separate the request
645.037 -> "from the timestamp.
646.387 -> "What if the request itself includes a semicolon?
649.267 -> "What's the parser gonna do then?"
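The semicolon question can be made concrete. In this sketch, two different (request, timestamp) pairs collapse to the same string to sign when the fields are simply joined with a semicolon. Length-prefixing each field is one common way to remove the ambiguity; it is shown here as an assumption for illustration, not as what AWS does, and both function names are hypothetical.

```python
def naive_string_to_sign(request: str, timestamp: str) -> str:
    """Join the fields with a semicolon, as on the slide."""
    return request + ";" + timestamp

def length_prefixed_string_to_sign(request: str, timestamp: str) -> str:
    """Prefix each field with its length so field boundaries are unambiguous."""
    return "".join(f"{len(field)}:{field}" for field in (request, timestamp))

# Two different inputs that a parser could confuse.
pair_a = ("create;cloud", "2022")
pair_b = ("create", "cloud;2022")

# The naive encoding produces the same bytes for both, so one signature covers both.
print(naive_string_to_sign(*pair_a) == naive_string_to_sign(*pair_b))  # True

# The length-prefixed encoding keeps them distinct.
print(length_prefixed_string_to_sign(*pair_a)
      == length_prefixed_string_to_sign(*pair_b))  # False
```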
651.48 -> Regardless, this looks pretty cool, right?
653.85 -> Like Alice can authenticate herself.
655.56 -> We've got a signature over the entire request,
657.3 -> so it's tamper evident.
658.65 -> We've got replay protection and we can keep layering on
661.35 -> from here.
662.97 -> But remember, it's 2006.
666.36 -> The overhead of TLS is the reason that it's not widespread,
669.27 -> or at least one of the reasons.
670.74 -> A 2006 computer is slow by today's standards,
674.52 -> and it's specifically the asymmetric cryptography,
677.43 -> the signing operation, that's expensive.
680.43 -> The protocol that we just described is built
682.35 -> on asymmetric cryptography.
684.12 -> We could eat that on our end.
685.62 -> Maybe we'd pass some of the costs along to our customers,
688.38 -> but our clients may not be able to sustain that load.
692.91 -> If we require every request to be signed,
694.89 -> that is a signing operation
696.3 -> and RSA signing is more expensive than RSA validation.
699.51 -> And the laptop, I have an M1 MacBook Pro,
703.05 -> and that laptop can do about 600 signing operations
706.44 -> per second single threaded.
708.66 -> So that's an entire core of my laptop.
710.64 -> It's less than a thousand TPS.
713.61 -> This is a tough trade off.
716.01 -> So to sum it all up,
718.59 -> every request is signed: a per-request signature.
723.39 -> That means that the requests are tamper evident.
725.25 -> If someone has changed a request,
726.63 -> we will be able to detect that
727.86 -> when we're validating the signature.
730.38 -> And requests can't be replayed outside of the replay window.
734.19 -> And the protocol is stateless.
735.48 -> There's no need to go to a sign in endpoint
737.4 -> or some sort of token exchange.
739.47 -> You pop up, sign an API call, send it off, shut down.
742.8 -> So we've achieved all of those goals,
745.2 -> but the cryptography that we've chosen is expensive.
747.6 -> That's the fly in the ointment.
749.67 -> We're so close, but this isn't gonna work for our use case.
753.09 -> So what now?
756 -> In order to solve our carefully constructed conundrum,
758.94 -> we need to introduce a new cryptographic primitive,
761.28 -> the hash.
762.99 -> A hash algorithm is also known as a digest.
766.835 -> All it does is it takes an input of any size
770.46 -> and it maps it to a fixed length output.
774.33 -> There are many algorithms that do this
775.98 -> that serve many purposes,
777.12 -> including error detection and correction.
779.31 -> But in this case, we're specifically concerned
781.29 -> with cryptographic hash algorithms.
784.47 -> In a good cryptographic hash algorithm,
786.81 -> the output gives you no information about the input.
789.9 -> One way of saying this is that if you flip one bit
792.24 -> in the input, approximately half
794.22 -> of the output bits will flip.
796.71 -> And it has to be hard to reverse.
798.78 -> And hard here has a formal definition.
800.97 -> It's not impossible to reverse because you could just keep
803.52 -> on guessing inputs until you got the same output.
806.1 -> You could brute force it.
807.57 -> So hard in this case means that no one has discovered a way
810.93 -> to reverse the hash that is faster than brute force.
814.44 -> And they are quick.
816.69 -> They are incredibly fast.
818.37 -> That same M1 MacBook Pro can run SHA-256,
822.18 -> which is a modern hash algorithm, bright and shiny,
825 -> at more than two gigabytes per second.
827.61 -> They're wicked fast.
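The fixed-length output and the "flip one bit, about half the output bits flip" property are easy to observe with Python's standard hashlib. This is a small sketch; the message text and the helper name are made up for illustration.

```python
import hashlib

def bits_different(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

message = b"Alice would like one cloud, please"
# Flip the least significant bit of the first byte: a one-bit change in the input.
flipped = bytes([message[0] ^ 0x01]) + message[1:]

digest = hashlib.sha256(message).digest()
digest_flipped = hashlib.sha256(flipped).digest()

# The output is always 256 bits, whatever the input size.
print(len(digest) * 8, len(hashlib.sha256(b"x" * 1_000_000).digest()) * 8)

# Expect a count somewhere near 128, that is, about half of the 256 output bits.
print(bits_different(digest, digest_flipped))
```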
829.95 -> So the reality is that we're already hashing.
834 -> So we talked about Alice generating a signature
836.58 -> over this long message,
838.23 -> but that's not what you do in practice.
840.93 -> The RSA signing algorithm is slow,
843 -> and if you're trying to sign an arbitrary length message,
845.43 -> it greatly complicates things.
847.11 -> So in practice, what you do is you use a hash algorithm
851.04 -> to turn the arbitrary length message that you're signing
853.38 -> into a digest and then you sign the digest.
856.53 -> And so in that protocol that we just described,
858.72 -> where we're using asymmetric keys, we are already hashing.
861.45 -> We're already paying the overhead of the hash.
864.36 -> And so it's basically free.
867.33 -> But a hash just takes an arbitrary length input
870.21 -> and maps it to a fixed length output.
872.34 -> It's not a signature.
874.14 -> How do we turn this into a signature?
877.56 -> The answer is that we need another cryptographic
880.35 -> construction that goes by the name of HMAC
882.93 -> or hash based message authentication code.
888.12 -> Here's a simplified definition of HMAC.
891.03 -> In reality, the two uses of the key on the right side
893.67 -> are XORed with some constants,
895.5 -> and there's an extra key derivation step that I skipped.
898.47 -> You should not implement cryptographic primitives based
901.44 -> on my slides.
905.01 -> It looks complicated, but it's actually really simple.
910.8 -> We take the key, we concatenate it with the message.
915.03 -> Just string concatenation.
916.29 -> We take the hash of that.
918.48 -> This can be a data intensive operation.
920.22 -> The message can be of arbitrary length.
921.99 -> This is approximately the hash that we would've had to do
924.27 -> in the asymmetric case.
925.86 -> And then we perform a second hash with the key concatenated
929.31 -> with the output of the first hash.
931.02 -> This is gonna be incredibly quick.
932.7 -> It's two fixed length inputs,
934.32 -> the key and the result of that first hash.
936.69 -> And so this is an incredibly fast operation.
939.51 -> You need this construction because if you just did
942.06 -> that first hash, the key plus the message,
944.43 -> you're vulnerable to things like extension attacks,
946.59 -> where if I've got the message and the HMAC,
948.78 -> I can add to the message and I can add to the HMAC.
952.44 -> And so this construction is about 26 years old.
957.27 -> It has been deeply reviewed by cryptographic experts all
960.36 -> around the world.
961.193 -> This is not an Amazon thing,
963.18 -> and to date, no one has found any issues with it.
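For reference, here is the full construction as specified in RFC 2104, including the XOR constants and the key preparation step that the slide skips. It is written out only to show the two nested hashes, and it is checked against Python's standard hmac module. In real code, use the library rather than a hand-rolled version.

```python
import hashlib
import hmac

BLOCK_SIZE = 64  # SHA-256 processes input in 64-byte blocks

def hmac_sha256(key: bytes, message: bytes) -> bytes:
    """HMAC-SHA256 per RFC 2104, written out for illustration only."""
    # Key preparation: hash over-long keys, then zero-pad to the block size.
    if len(key) > BLOCK_SIZE:
        key = hashlib.sha256(key).digest()
    key = key.ljust(BLOCK_SIZE, b"\x00")

    # The two uses of the key are XORed with fixed constants (ipad and opad).
    inner_key = bytes(b ^ 0x36 for b in key)
    outer_key = bytes(b ^ 0x5C for b in key)

    # First hash: key material plus the arbitrary-length message.
    inner = hashlib.sha256(inner_key + message).digest()
    # Second hash: key material plus the fixed-length result of the first hash.
    return hashlib.sha256(outer_key + inner).digest()

key, message = b"secret", b"Shhh! Don't tell"
matches_library = hmac_sha256(key, message) == hmac.new(key, message, hashlib.sha256).digest()
print(matches_library)  # True
```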
966.904 -> MD5 is an old and broken cryptographic hash algorithm.
971.1 -> You should not be using MD5 for anything.
974.64 -> As far as we know now, HMAC-MD5 is still secure.
979.26 -> So for 26 years, this has been working well.
983.73 -> Here's an example.
984.75 -> I'm using secret as the key and "Shhh! Don't tell"
987.99 -> as the message, and that's the output.
991.59 -> I've added an H to the message.
993.81 -> You can see that the next hash that we generate
996.12 -> is completely unrelated.
997.86 -> I've changed the C to a K.
999.84 -> Again, completely unrelated output.
1002.93 -> And then here I've changed the capital S to a capital R.
1006.41 -> And this is interesting because the only difference between
1009.44 -> the ASCII representation of a capital S and a capital R
1013.07 -> is that the least significant bit in R is zero
1015.68 -> and it's set in S.
1017.27 -> So this is a one bit change in the input,
1020.03 -> and you can see that the output is completely unrelated
1023.12 -> to the previous output.
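The slide's experiment can be repeated with the standard library. This sketch makes several assumptions: the hash is taken to be SHA-256, and the exact strings are a guess at what the slides show (an extra "h" in the message, "sekret" for the C-to-K change in the key, and "Rhhh!" for the S-to-R change). The outputs are printed rather than listed here.

```python
import hashlib
import hmac

def sign(key: str, message: str) -> str:
    """Hex-encoded HMAC-SHA256 of the message under the key."""
    return hmac.new(key.encode(), message.encode(), hashlib.sha256).hexdigest()

# Assumed reconstruction of the four slide variants: (key, message).
variants = {
    "original":         ("secret", "Shhh! Don't tell"),
    "extra h":          ("secret", "Shhhh! Don't tell"),
    "c -> k in key":    ("sekret", "Shhh! Don't tell"),
    "S -> R (one bit)": ("secret", "Rhhh! Don't tell"),
}

signatures = {name: sign(key, msg) for name, (key, msg) in variants.items()}
for name, signature in signatures.items():
    print(f"{name:18} {signature}")

# 'S' is 0x53 and 'R' is 0x52: they differ only in the least significant bit.
print(ord("S") ^ ord("R"))  # 1
```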
1025.16 -> And the task that an attacker has to handle
1028.97 -> is given the message and the signature, figure out the key.
1034.58 -> And to date, again, no one has found a way to do this
1037.16 -> that's faster than brute force.
1040.34 -> So let's put this into practice.
1043.85 -> Rather than a certificate and private key,
1045.77 -> Alice is gonna have a single secret that's gonna be shared
1048.29 -> with her provider.
1049.85 -> She uses this to sign the request via HMAC.
1053.39 -> And once again, she includes the signature
1055.28 -> in the message that she sends.
1057.23 -> She does not include the signing secret.
1060.35 -> On receipt the cloud provider has
1062.39 -> to do exactly the same signature generation.
1064.34 -> They check to make sure
1065.173 -> that it's exactly the same signature output.
1067.94 -> Checks that the timestamp is good.
1070.31 -> And now since everything lines up, okay, we're in.
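Putting the pieces together, the shared-secret flow looks roughly like this. It is a toy sketch under stated assumptions, not the AWS signing protocol: the string-to-sign layout, the function names, the example secret, and the 300-second window are all invented for illustration.

```python
import hashlib
import hmac

def string_to_sign(request: str, timestamp: int) -> bytes:
    """Length-prefix each field so a delimiter inside the request cannot cause ambiguity."""
    fields = (request, str(timestamp))
    return "".join(f"{len(f)}:{f}" for f in fields).encode()

def sign(secret: bytes, request: str, timestamp: int) -> str:
    """Client side: HMAC over the request and its timestamp."""
    return hmac.new(secret, string_to_sign(request, timestamp), hashlib.sha256).hexdigest()

def verify(secret: bytes, request: str, timestamp: int, signature: str,
           now: int, max_skew_seconds: int = 300) -> bool:
    """Provider side: recompute the signature, compare, then check the timestamp."""
    expected = sign(secret, request, timestamp)
    # compare_digest avoids leaking how many leading characters matched.
    if not hmac.compare_digest(expected, signature):
        return False
    return abs(now - timestamp) <= max_skew_seconds

secret = b"example-shared-secret"   # known to Alice and to the provider
sent_at = 1_669_900_000             # seconds since the epoch
signature = sign(secret, "one cloud, please", sent_at)

print(verify(secret, "one cloud, please", sent_at, signature, now=sent_at + 2))     # True
print(verify(secret, "two clouds, please", sent_at, signature, now=sent_at + 2))    # False: tampered
print(verify(secret, "one cloud, please", sent_at, signature, now=sent_at + 3600))  # False: replayed late
```

Note that the secret itself never travels with the request. Only the signature does.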
1075.56 -> So where are we?
1078.17 -> We swapped out asymmetric keys for an HMAC
1080.36 -> with a shared secret.
1081.68 -> Where's that leave us?
1082.85 -> We've maintained all of these properties that we liked
1085.205 -> about the previous protocol.
1088.64 -> And now the crypto is cheap.
1090.08 -> This is wicked fast.
1091.52 -> Even on the 2006 computer HMAC is fast.
1095.06 -> And one of the nice things about using HMAC is
1097.43 -> that while it is affected by quantum computers,
1100.4 -> it is way less serious than it is
1102.26 -> with asymmetric algorithms.
1103.85 -> If you're using sufficiently long keys for HMAC,
1106.85 -> as AWS does, then based on current research,
1110.66 -> HMAC-SHA256 remains quantum safe.
1115.19 -> However, we've got symmetric keys.
1117.65 -> The key has to be shared with the provider.
1120.14 -> And if the key is shared with the provider
1121.88 -> that service could then sign any request as Alice.
1125.48 -> So we've fixed a bunch of stuff,
1127.67 -> but we've introduced one more problem.
1129.74 -> So now that the groundwork is laid,
1131.39 -> let's dig into the actual system that we built.
1135.17 -> We've still got Alice and she's gonna be starting out
1137.21 -> with her web browser on her laptop.
1139.25 -> And let's say that she just created her AWS account.
1141.83 -> So she's got a username and password, right?
1144.68 -> And an MFA token.
1146.72 -> You all have the good taste to come here.
1148.4 -> So I don't need to say this,
1149.51 -> I'm sure you're all already using MFA,
1151.97 -> but if you're not, if you go down to the AWS booth
1154.43 -> at the security and identity desk,
1156.17 -> they're handing out free tokens.
1157.46 -> You should do that if you don't already have one.
1161.63 -> Alice wants to talk to the cloud.
1163.04 -> But now we're gonna have to zoom in and look
1165.14 -> at the innards of the cloud a little bit.
1167.57 -> The first thing that Alice wants to do
1169.1 -> is create an S3 bucket.
1171.05 -> So to do that, she's gonna have to sign an API
1173.99 -> call and send it to the S3 endpoint.
1175.76 -> Of course, she doesn't have API keys yet.
1178.28 -> The way that you get these API keys
1180.23 -> is through the AWS management console.
1181.91 -> This is the web based interface to AWS.
1184.7 -> So that's the first place Alice goes.
1186.8 -> But she shows up there, she's unauthenticated.
1189.02 -> So she gets punted over to sign in
1190.91 -> where she answers the authentication challenge
1192.77 -> with her MFA token and that gives her back a cookie.
1196.28 -> This is exactly the protocol we started out with.
1199.37 -> Now Alice can authenticate to the management console.
1203.36 -> She includes her cookie, so the console knows who she is.
1206.27 -> And this is always, always, always conducted over TLS.
1209.69 -> All access to IAM and the management console
1212.12 -> has always been TLS protected.
1215.06 -> And to create a user, we have to interact with IAM.
1219.65 -> This is the identity and access management service.
1222.23 -> This is where you configure users and roles,
1224.36 -> policies and API keys.
1226.85 -> So on her behalf, the console is gonna call IAM
1229.61 -> and it's gonna create a new IAM user.
1232.58 -> She's requested that this user have API keys.
1235.01 -> So IAM needs to create those keys,
1237.89 -> return them to the console,
1239.42 -> and the console's gonna return them back to Alice.
1243.23 -> So let's look at what those keys look like.
1246.08 -> The first is the access key ID.
1248.39 -> This is a key in the database sense.
1250.07 -> The primary key for this credential pair
1252.101 -> into our IAM database.
1254.36 -> It doesn't mean anything.
1255.62 -> It's not used in any cryptographic operations.
1258.62 -> It's just passed back to the service,
1260.18 -> so we know which key you're using to sign.
1262.46 -> They're not secrets, they don't mean anything,
1264.71 -> they can't be used to access anything
1266.06 -> and they have no internal structure.
1268.31 -> The second there is the secret access key.
1270.56 -> This is of course a secret.
1271.88 -> It's used as an HMAC key and it needs to be kept secure.
1275.45 -> However, it too has no internal structure.
1277.46 -> It's just a blob of entropy.
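The split between the two values can be sketched as a lookup table: the access key ID is only the row identifier, and the secret access key is random bytes used as the HMAC key. Everything below is hypothetical, including the function name, the ID format, and the key length. Real AWS credentials are formatted differently.

```python
import secrets

def create_access_key(credential_store: dict[str, bytes]) -> tuple[str, bytes]:
    """Mint a credential pair and record it in the store (a stand-in for a database)."""
    # The ID is a database primary key: not a secret, and never used in any cryptography.
    access_key_id = secrets.token_hex(10).upper()
    # The secret is a blob of entropy with no internal structure, used as the HMAC key.
    secret_access_key = secrets.token_bytes(32)
    credential_store[access_key_id] = secret_access_key
    return access_key_id, secret_access_key

store: dict[str, bytes] = {}
key_id, secret_key = create_access_key(store)

# A verifier later uses the ID from the request to find which secret to check against.
print(store[key_id] == secret_key)  # True
```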
1280.13 -> This is the only time that you can get these keys.
1283.1 -> If you lose them, if you forget them,
1284.81 -> we will never let you retrieve them from our APIs
1287.09 -> or from the web console.
1288.47 -> Go create new keys.
1290.3 -> This is a real set of security credentials
1292.37 -> if you wanna screenshot it.
1293.75 -> I created an IAM user to write this talk
1296.21 -> and then immediately deleted it.
1299.57 -> So Alice can grab her SDK
1301.97 -> in her favorite programming language
1303.89 -> and she can configure it with these new keys.
1305.99 -> She's got everything she needs to make API calls.
1308.6 -> So she crafts the API call to S3, she signs it,
1312.98 -> she sends it to S3 and now we've got a problem.
1317.06 -> S3 has to validate this request.
1319.25 -> One option is we could give S3 Alice's key.
1322.31 -> S3 could validate that request, but then again,
1324.68 -> S3 could sign any request for Alice for any service.
1327.77 -> That's unacceptable.
1329.57 -> And so another option is that S3 could call,
1332.6 -> IAM and IAM could verify this API call.
1335.987 -> And in theory this would work,
1337.82 -> but it's not really what IAM should spend its time doing.
1341.15 -> Request verification sounds like a high scale
1344.09 -> and highly separable task.
1345.98 -> Something that's gonna scale in its own unique way.
1348.47 -> And so this makes it a candidate for being its own service,
1351.32 -> which is exactly what we did.
1353.72 -> So I'd like to introduce you
1355.01 -> to the Auth Runtime Service, ARS.
1357.62 -> This is an internal service.
1359 -> It's not directly called by customers,
1361.07 -> but it is one of the most heavily used services in AWS.
1364.09 -> It is used for authentication and authorization
1366.8 -> of inbound API calls.
1370.82 -> So every change that's made in IAM is de-normalized
1374.69 -> and propagated to ARS.
1378.8 -> So now S3 can call ARS and say, "Is this API call okay?"
1383.9 -> ARS can check the signature on it,
1384.852 -> say, "Yep, this really was signed by Alice,"
1388.49 -> respond back to S3 and S3 can respond back to Alice.
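The division of labor can be sketched with two small classes. The names AuthRuntime and StorageService, the method signatures, and the response strings are invented for illustration and say nothing about how the real services are built. The point is only that the service handling the request never holds Alice's secret.

```python
import hashlib
import hmac

class AuthRuntime:
    """Stand-in for the verification service: the only component holding secrets."""

    def __init__(self, secrets_by_key_id: dict[str, bytes]) -> None:
        self._secrets = dict(secrets_by_key_id)

    def verify(self, access_key_id: str, string_to_sign: bytes, signature: str) -> bool:
        secret = self._secrets.get(access_key_id)
        if secret is None:
            return False  # unknown key ID
        expected = hmac.new(secret, string_to_sign, hashlib.sha256).hexdigest()
        return hmac.compare_digest(expected, signature)

class StorageService:
    """Stand-in for a service such as S3: it asks for a verdict and never sees a secret."""

    def __init__(self, auth: AuthRuntime) -> None:
        self._auth = auth

    def handle(self, access_key_id: str, string_to_sign: bytes, signature: str) -> str:
        if self._auth.verify(access_key_id, string_to_sign, signature):
            return "200 OK"
        return "403 Forbidden"

alice_secret = b"alice-example-secret"
service = StorageService(AuthRuntime({"ALICE-KEY-ID": alice_secret}))

request = b"CreateBucket name=alices-bucket"
signature = hmac.new(alice_secret, request, hashlib.sha256).hexdigest()

print(service.handle("ALICE-KEY-ID", request, signature))                    # 200 OK
print(service.handle("ALICE-KEY-ID", b"DeleteBucket name=other", signature)) # 403 Forbidden
```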
1393.23 -> This theme of restricting access to credentials,
1396.02 -> restricting privileges that our services have
1398.42 -> is a common one and it'll keep coming up in this talk.
1400.61 -> It keeps coming up in the design of our services.
1403.52 -> One possible concern that you may have is
1405.2 -> that this database of keys that we're creating in IAM
1407.72 -> and ARS is incredibly sensitive.
1410.36 -> And that's true, but it's true
1412.07 -> whether these keys are symmetric or asymmetric.
1414.77 -> A confidentiality problem with this database is gonna mean
1418.01 -> that you get access to customer policies,
1420.17 -> you get to map out where they potentially have weak spots.
1422.93 -> An integrity problem with this database means
1424.67 -> that you can change passwords or API keys or policies.
1428.54 -> This database in any system, not just AWS,
1431.3 -> but any system that has an IAM service needs
1434.33 -> to be protected incredibly carefully at the highest level.
1437.51 -> And so we do that.
1438.83 -> This is the most locked-down system at AWS
1442.034 -> and we treat it with due care.
1445.19 -> Anyway, this is it at a high level,
1447.38 -> this is how AWS worked until late 2011 or early 2012.
1453.26 -> If we zoom out a bit,
1454.4 -> this is earth obviously,
1456.56 -> you can see all of our regions in various places.
1459.53 -> And we've been looking inside us-east-1
1461.9 -> which has IAM and ARS.
1464.03 -> However, if we look at something like eu-west-1,
1466.67 -> this is in Dublin, Ireland,
1469.04 -> you can see that only ARS is there.
1471.71 -> Likewise, ap-southeast-1 in Singapore, only ARS is there.
1475.52 -> IAM isn't deployed there.
1477.98 -> All IAM changes, user and role creation, policy updates,
1481.34 -> things like that, go through the control plane
1484.76 -> in us-east-1 and then they get propagated out globally.
1487.82 -> This separation between control plane and data plane,
1490.7 -> where all of the request validation happens,
1493.76 -> is another reason that we separated IAM and ARS
1496.43 -> into two separate services.
1498.35 -> And so just like IAM in us-east-1 is propagating
1501.95 -> to ARS in us-east-1, it's also propagating to all
1505 -> of the other regions in the AWS standard partition.
1508.16 -> And there should be many more arrows here,
1509.54 -> but again, I'm not very good at PowerPoint.
1512.81 -> So there's a bit of history here.
1515.75 -> Signature version zero was largely internal.
1518.811 -> There's no sign of it anymore.
1521.54 -> It was phased out very early in AWS' history.
1524.99 -> Version one was used only for our first couple of years,
1527.99 -> but we had a canonicalization issue back in 2008
1531.05 -> and we had to fix it.
1532.64 -> Now canonicalization is a big word,
1535.25 -> but it just means making the one true representation
1538.22 -> of something.
1539.21 -> Remember, in order to validate a request,
1541.85 -> the service has to generate a signature
1543.65 -> that exactly matches the signature that was passed in,
1546.2 -> which means that they need to have the exact same request
1548.45 -> that was signed.
1549.56 -> And it's acceptable for intermediates
1551.21 -> to reorder HTTP headers or add spaces or things like that.
1555.08 -> So you have to be able
1555.913 -> to generate the one true representation.
1558.41 -> This is hard to get right and it's an example of the mortar
1561.5 -> between the building blocks that you can have problems in.
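A minimal sketch of what "the one true representation" can mean for HTTP headers, assuming a simplified rule set (lowercase names, collapsed whitespace, sorted order). The function name `canonical_headers` is hypothetical, and this is an illustration of the idea rather than the exact scheme used by any AWS signature version.

```python
from typing import Iterable, Tuple


def canonical_headers(headers: Iterable[Tuple[str, str]]) -> str:
    """Return one fixed byte-for-byte form for a set of headers.

    Simplified rules (an assumption for illustration):
    - header names are trimmed and lowercased
    - runs of whitespace in values collapse to one space, ends are trimmed
    - headers are sorted by name so intermediaries may reorder them freely
    """
    normalized = []
    for name, value in headers:
        normalized.append((name.strip().lower(), " ".join(value.split())))
    normalized.sort()
    return "".join(f"{name}:{value}\n" for name, value in normalized)


# The same request as seen by the client and after an intermediary
# reordered the headers and added spaces.
a = [("Host", "s3.amazonaws.com"), ("X-Amz-Date", "20221129T000000Z")]
b = [("x-amz-date", "  20221129T000000Z "), ("HOST", "s3.amazonaws.com")]
print(canonical_headers(a) == canonical_headers(b))
```

Both sides sign the canonical form, so harmless changes in transit do not change the signature.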
1565.1 -> So remember back a few slides where I teased like,
1568.01 -> you know, what if there's a semicolon in the request?
1571.31 -> This is an example of a low quality serialization standard
1576.35 -> that can lead to canonicalization problems
1578.6 -> that can lead to request verification issues.
1581.81 -> Anyway, what I've described so far
1584.258 -> is AWS signature version two.
1586.76 -> And we know of no security issues with it.
1590.87 -> However, let's pop back into us-east-1
1593.24 -> and we'll take a slightly different look at the region.
1595.73 -> S3 isn't our only service.
1597.26 -> We've got lots of them.
1598.22 -> I've only drawn four in here.
1600.38 -> And every single API call to any of these services
1604.19 -> is gonna involve a round trip to ARS.
1606.23 -> Every API call is signed,
1607.55 -> ARS is the only place you can check request signatures.
1610.1 -> This places ARS in an incredibly critical position.
1614 -> An outage in ARS is a region-wide outage.
1617.48 -> A scaling problem in any one of our services places load
1620.78 -> on ARS.
1621.71 -> And if ARS then goes into stress, again,
1623.619 -> it's coupled back to all of our other services.
1626.75 -> And so this was an awesome design.
1628.97 -> It survived about five years,
1630.71 -> multiple orders of magnitude and scale,
1633.14 -> but this coupling between our services was not okay.
1636.02 -> We had to solve this.
1637.94 -> And again, this was a great design.
1639.74 -> I'm respecting what came before,
1642.11 -> but it was time to do something new.
1645.44 -> So we've got a system, it's working.
1649.4 -> There's literally millions of customer keys out in the world
1652.76 -> and we can't require that customers get new keys.
1656.69 -> At this point asymmetric cryptography is still too slow
1659.87 -> and it would require customers to get new keys,
1661.79 -> so it's not acceptable to the business.
1664.04 -> But HMAC is symmetric, gotta figure something out there,
1668.57 -> but it's fast.
1670.34 -> And so if HMAC isn't the answer,
1672.32 -> maybe lots of HMAC is the answer.
1675.89 -> And so that brings us to AWS signature version four.
1680.06 -> This is our answer to the problem that I just posed, SigV4.
1684.74 -> Now, signature version three was never broadly used.
1688.46 -> It was very early in its rollout
1690.62 -> and we had the core idea of signature version four.
1693.08 -> We're like, this is so much better than version three.
1695.6 -> And so the only sign that's left of version three
1697.7 -> is a couple of wiki pages internally.
1699.65 -> There's no existence of it anywhere in our services
1702.2 -> and we shall not speak of it again.
1705.74 -> So one HMAC is good, what if we have lots of HMACs?
1711.44 -> What we realized is we could use HMAC
1713.6 -> to perform key specialization.
1716.03 -> And this sounds like crypto gobbledygook
1718.28 -> that engineers use when they want you
1719.66 -> to stop asking questions and go away.
1721.61 -> But the underlying concept is really quite simple.
1724.28 -> We start off with the long term customer key.
1726.08 -> These are the millions of keys
1727.19 -> that our customers already have.
1730.28 -> We do an HMAC with the string AWS4
1734.18 -> and the key against the date,
1736.73 -> and that generates the daily key for that principal.
1740.33 -> Then we take the daily key and we use that
1743 -> to HMAC the region name.
1745.43 -> And that gives us the region key.
1748.64 -> We take that region key and we HMAC it
1750.56 -> against the service name to get the service key.
1752.84 -> And then we have a terminal HMAC for again,
1755.6 -> cryptographic reasons,
1756.95 -> which is the literal string aws4_request.
1759.26 -> So we've got this string of HMAC derivations resulting in
1762.41 -> that circled blue key at the bottom.
1764.87 -> And it can look pretty complicated,
1766.76 -> but you can do this in literally one line of code.
1769.67 -> And it looks like a lot of HMACs,
1771.02 -> but again, HMAC is incredibly fast.
1774.08 -> And so the important thing here is
1776.03 -> that your SDK does this automatically.
1778.82 -> If you've used AWS in the past year,
1781.16 -> your SDK is doing this, you had no idea.
1783.981 -> The rollout here was incredibly slick.
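The chain of HMACs described above can be sketched with the Python standard library. This assumes HMAC-SHA256 and the usual `YYYYMMDD` date format; the function names and the secret are placeholders, not real credentials or SDK internals.

```python
import hashlib
import hmac


def _hmac_sha256(key: bytes, msg: str) -> bytes:
    """One HMAC-SHA256 step: the key comes from the previous step."""
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def derive_signing_key(secret_key: str, date: str, region: str, service: str) -> bytes:
    """Specialize a long-term secret down to one day, region, and service."""
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date)  # daily key
    k_region = _hmac_sha256(k_date, region)                             # region key
    k_service = _hmac_sha256(k_region, service)                         # service key
    return _hmac_sha256(k_service, "aws4_request")                      # terminal step


# Placeholder secret for illustration only.
key = derive_signing_key("EXAMPLE-SECRET-NOT-REAL", "20221129", "us-east-1", "s3")
print(key.hex())
```

Because each step is one-way, holding the final key does not reveal the region key, the daily key, or the long-term secret.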
1786.59 -> So let's go back to that service diagram
1788.63 -> and understand why we did this.
1789.98 -> What does it get us?
1791.36 -> So we've got Alice,
1794.18 -> she's going to do her key derivation.
1796.4 -> Again, her SDK does this for her automatically.
1799.22 -> She's gonna use that final blue key
1801.44 -> to sign the create bucket request.
1803.33 -> She's gonna send it to S3 in us-east-1.
1806.21 -> S3 is gonna send it to ARS.
1809.03 -> ARS is gonna say, okay,
1811.07 -> but they're also going to include that fully derived key
1814.91 -> which S3 can now cache.
1816.86 -> Now S3 can do all of the work associated
1818.72 -> with creating a bucket and they can return success to Alice.
1822.92 -> We refuse to allow S3 to cache Alice's long-term key,
1826.64 -> because that gave S3 privileges beyond S3.
1829.67 -> However, this fully derived key can only be used
1831.62 -> for 24 hours in this region for this service.
1835.64 -> And so this is safe for them to cache.
1837.98 -> It can only be used at S3.
1839.99 -> This is one of the ways that we're driving privileges
1842.63 -> out of the system.
1843.86 -> Now this doesn't look like any savings,
1845.78 -> because we still had to do the full round trip to ARS.
1848.72 -> But if Alice makes another API call,
1851.24 -> S3 already has the key cached.
1854.36 -> To this level of detail, this is how AWS has worked
1857.57 -> for the past 10 and a half years.
1859.46 -> There's a trade off inherent in the cache that S3 has.
1862.64 -> The longer it lives,
1863.87 -> the more value it provides us in relieving load.
1867.02 -> The longer it lives, however, the longer it takes
1870.86 -> for configuration values to propagate.
1872.51 -> Policy updates, things like that.
1874.28 -> And again, in practice, a value
1875.78 -> of single digit minutes seems to work very well.
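A rough sketch of the service-side cache and its time-to-live trade-off. The class `DerivedKeyCache`, the `fake_ars` callback, and the five-minute TTL are all assumptions made for illustration; the talk only says that single-digit minutes works well in practice.

```python
import time
from typing import Callable, Dict, Tuple

CacheKey = Tuple[str, str, str, str]  # (access key ID, date, region, service)


class DerivedKeyCache:
    """Caches fully derived keys so repeat callers skip the round trip."""

    def __init__(self, fetch: Callable[..., bytes], ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # stand-in for the call to the auth service
        self._ttl = ttl_seconds      # longer TTL: less load, slower propagation
        self._clock = clock
        self._entries: Dict[CacheKey, Tuple[bytes, float]] = {}

    def get(self, access_key_id: str, date: str, region: str, service: str) -> bytes:
        cache_key = (access_key_id, date, region, service)
        now = self._clock()
        entry = self._entries.get(cache_key)
        if entry is not None and entry[1] > now:
            return entry[0]                      # cache hit, no round trip
        derived = self._fetch(*cache_key)        # cache miss, ask the auth service
        self._entries[cache_key] = (derived, now + self._ttl)
        return derived


calls = []


def fake_ars(access_key_id: str, date: str, region: str, service: str) -> bytes:
    """Hypothetical stand-in that records how often it is asked."""
    calls.append(access_key_id)
    return b"derived-key-bytes"


now = [0.0]  # controllable clock so the example is deterministic
cache = DerivedKeyCache(fake_ars, ttl_seconds=300.0, clock=lambda: now[0])
cache.get("AKIDEXAMPLE", "20221129", "us-east-1", "s3")
cache.get("AKIDEXAMPLE", "20221129", "us-east-1", "s3")
print(len(calls))
```

A steady production workload with a handful of keys hits the cache almost every time, which is where the load reduction comes from.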
1878.78 -> And the reason that this is effective is
1881 -> that our customer workloads tend to fall
1882.8 -> into two very different buckets.
1884.63 -> The first is interactive clients
1887.18 -> or things that just pop up, make a couple of API calls
1889.64 -> and go away, and a couple of extra milliseconds
1892.19 -> of latency there isn't a problem.
1893.84 -> They also don't present that much load to the services.
1896.48 -> The other is typically large production workloads,
1899.36 -> where they're just sending a stream of API calls
1901.34 -> against our services constantly.
1902.99 -> And in that case, the cache hit ratio is spectacular.
1906.71 -> Even if someone moved a gigantic new workload into AWS,
1911.372 -> it would represent it was what?
1913.46 -> Hundreds, maybe a thousand keys for millions of TPS.
1917.72 -> So we're talking about on the order
1919.16 -> of three orders of magnitude
1920.69 -> reduction in the load presented to ARS
1922.7 -> because of this caching.
1924.65 -> This is a more complex protocol than we had before
1927.29 -> and it's not something that we did lightly.
1929.51 -> We did multiple rounds of security reviews,
1931.58 -> penetration testing,
1932.6 -> we consulted external cryptographic experts.
1935.33 -> We really wanted to make sure that we have this right.
1938.09 -> And so if you're gonna do this,
1939.98 -> if you're gonna create your own cryptographic protocol,
1942.26 -> you need to be really convinced that it's the right thing
1944.54 -> for you to do.
1945.62 -> And then you need to dive super deep and own it.
1949.7 -> So as mentioned,
1950.861 -> version three never saw widespread adoption.
1953.75 -> In June, 2012 is when we launched
1956.15 -> our first SigV4 enabled service.
1958.1 -> So this is something just over a decade ago.
1965.01 -> SigV2 is still supported in any region
1967.52 -> and with any service where it was ever supported.
1970.49 -> We care a lot about backwards compatibility.
1973.28 -> If you're gonna change a website,
1974.81 -> you just change the website.
1975.98 -> The colors change, the buttons are in different locations.
1978.14 -> The human at the browser just figures it out.
1980.5 -> It is so much harder to change APIs.
1983.9 -> You've gotta ingest a new SDK.
1985.76 -> Maybe there's breaking changes.
1987.47 -> And so because we know of no security problems
1990.53 -> in signature version two,
1991.97 -> it's still available to our customers.
1994.13 -> However, usage has dropped to a trickle.
1996.35 -> All new SDKs default to SigV4.
1999.23 -> And by 2014, we had enough confidence in SigV4
2003.22 -> that when we launched our region in Frankfurt, Germany,
2005.89 -> it was a hundred percent SigV4 only.
2007.81 -> We never supported SigV2 there.
2010.06 -> Every region since then has been SigV4 only.
2013.09 -> At the time this was a hotly debated topic,
2015.64 -> but in hindsight it was absolutely the right call to make.
2019.6 -> And in April of 2019, we launched a region in Hong Kong,
2023.83 -> which is our first opt-in region.
2026.95 -> The reason we did that is for some of our customers,
2028.59 -> it was very important that they be able to run in Hong Kong.
2031.12 -> That was the entire business case for launching the region.
2033.85 -> But for other of our customers,
2035.62 -> it was very important to them that they not be able
2037.48 -> to run in Hong Kong.
2038.313 -> They did not want their data or their workloads
2040 -> to be present in Hong Kong.
2041.557 -> And so to solve this, the owner of an account has
2044.71 -> to explicitly tell us via API call
2047.05 -> that they want to enable the Hong Kong region.
2050.2 -> This is a standard AWS API call,
2052.21 -> so it can be restricted using IAM policies,
2054.79 -> AWS Organizations and SCPs.
2057.52 -> Just as with SigV4 only regions, every region launched
2060.94 -> since Hong Kong has been opt-in as well.
2063.97 -> So let's dig in and see what this means under the covers.
2066.94 -> So we've got our map of AWS regions.
2069.67 -> And if you look inside us-east-1, Northern Virginia,
2072.55 -> you can see that we've got IAM and ARS
2074.92 -> and it's propagating locally.
2076.57 -> And when Alice creates her API keys in us-east-1,
2079.96 -> they get propagated to ARS in us-east-1.
2082.63 -> Now Alice can make API calls in us-east-1.
2086.08 -> If you look inside eu-west-1 in Ireland,
2088.75 -> this is our second region,
2090.31 -> SigV2 is still gonna be supported here.
2092.74 -> So as we mentioned earlier, there's no IAM in eu-west-1,
2096.79 -> but ARS is there.
2098.59 -> And so the IAM propagator
2100.09 -> is gonna automatically propagate keys
2102.28 -> and Alice's key will be made available to ARS in eu-west-1.
2106.42 -> So now Alice can make SigV2 or SigV4 calls to eu-west-1.
2112.45 -> Now we're gonna zoom in to Hong Kong, ap-east-1.
2116.35 -> Once again, IAM is not in this region, ARS is.
2119.98 -> However, the IAM propagator's pushing keys
2122.84 -> to Dublin, to Ireland, for Alice.
2125.08 -> It's not pushing anything for her account to ap-east-1.
2128.26 -> As far as Alice's account is concerned,
2131.23 -> this region does not exist.
2132.67 -> It's as if we'd never launched it.
2134.68 -> However, Alice wants to create a bucket in ap-east-1.
2138.43 -> She actually wants to take advantage of the region.
2140.47 -> So she has to make an API call to us to enable the region.
2144.97 -> And so she turns on Hong Kong.
2148.36 -> Now we start propagating her keys,
2150.43 -> but things are a little bit more complicated.
2152.92 -> Rather than sending the red key, Alice's long term key,
2156.22 -> IAM is gonna do a partial key derivation here.
2159.01 -> We're gonna do the daily and region specializations
2161.56 -> to the key, and that's what we're gonna propagate
2164.14 -> to Hong Kong.
2165.4 -> So the key that gets placed in Hong Kong is scoped
2168.16 -> to Hong Kong.
2168.993 -> It can be further specialized by ARS in Hong Kong
2171.73 -> for calls to individual services,
2174.34 -> but it has no trust anywhere outside of Hong Kong.
2177.94 -> And again, this is completely transparent to Alice
2181.394 -> and to our services.
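A sketch of why the partial derivation is transparent to callers, assuming the same HMAC-SHA256 chain as SigV4. The split into `derive_region_key` and `finish_derivation` is a hypothetical illustration of the division of work between the control plane and the in-region service, not a description of internal AWS code.

```python
import hashlib
import hmac


def _h(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def derive_region_key(secret_key: str, date: str, region: str) -> bytes:
    """Done centrally: the date and region steps. Only this leaves the home region."""
    k_date = _h(("AWS4" + secret_key).encode("utf-8"), date)
    return _h(k_date, region)


def finish_derivation(region_key: bytes, service: str) -> bytes:
    """Done inside the opt-in region: the service and terminal steps."""
    return _h(_h(region_key, service), "aws4_request")


def derive_full(secret_key: str, date: str, region: str, service: str) -> bytes:
    """What the caller's SDK computes from the long-term secret."""
    return finish_derivation(derive_region_key(secret_key, date, region), service)


secret = "EXAMPLE-SECRET-NOT-REAL"  # placeholder, not a real credential
hk_key = derive_region_key(secret, "20221129", "ap-east-1")
server_side = finish_derivation(hk_key, "s3")
client_side = derive_full(secret, "20221129", "ap-east-1", "s3")
print(server_side == client_side)
```

The region-scoped key is enough to validate any request for that day in that region, and it is useless for signing requests anywhere else.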
2184.09 -> This is another example of us driving trust
2187.24 -> out of our systems, lowering the sensitivity
2189.88 -> of the keys such that the system is easier to operate
2193.78 -> and we're more confident
2194.86 -> that we've got a delightful configuration.
2199.69 -> And this is an example that I lifted
2202.09 -> from our IAM documentation.
2204.34 -> There's two statements in here.
2205.9 -> The first statement allows the principal
2208.15 -> to which this policy is applied to call EnableRegion
2210.7 -> or DisableRegion on ap-east-1.
2212.83 -> So you can opt-in or opt-out of the Hong Kong region.
2215.56 -> And then the second statement is just the permissions
2217.33 -> that you would need if you wanted to do this
2218.65 -> through the console versus via the SDKs.
2221.59 -> You could clearly change this from allow to deny
2224.26 -> and then whichever principal it was
2226.36 -> could not enable this region.
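The slide itself is not captured in this transcript. The sketch below reconstructs a policy of the shape described, written as a Python dictionary and printed as JSON. The action names (`account:EnableRegion`, `account:DisableRegion`, `account:ListRegions`), the `account:TargetRegion` condition key, and the statement IDs are recalled from the public IAM documentation and should be checked against it before use.

```python
import json

# Assumed reconstruction of the two-statement policy described in the talk.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Statement 1: opt in to or out of the Hong Kong region only.
            "Sid": "EnableDisableHongKong",
            "Effect": "Allow",  # switch to "Deny" to block enabling the region
            "Action": ["account:EnableRegion", "account:DisableRegion"],
            "Resource": "*",
            "Condition": {
                "StringEquals": {"account:TargetRegion": "ap-east-1"}
            },
        },
        {
            # Statement 2: extra permission needed when using the console.
            "Sid": "ViewConsole",
            "Effect": "Allow",
            "Action": ["account:ListRegions"],
            "Resource": "*",
        },
    ],
}

print(json.dumps(policy, indent=4))
```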
2231.28 -> So let's hit one last topic today.
2234.58 -> Short term keys.
2236.44 -> Throughout this talk, I've been referring
2237.85 -> to that red key as Alice's long-term key,
2240.13 -> which implies the existence of short-term keys.
2242.35 -> And indeed they do exist.
2244.39 -> We'll refer to them often as sessions.
2246.22 -> So an AWS session or a short-term key,
2248.23 -> the terms are interchangeable.
2251.514 -> And this isn't a colloquialism,
2252.94 -> it's not something that we say informally.
2254.53 -> We have a very precise meaning for it.
2256.96 -> A short-term key or a long-term key, sorry,
2259.75 -> is valid until someone takes action to make it invalid.
2263.794 -> The security best practice is
2266.53 -> that you rotate your keys often.
2268.3 -> So hopefully your long-term keys don't actually live
2270.64 -> that long, but if no one takes action,
2272.775 -> our systems are gonna consider these keys
2275.26 -> to be valid indefinitely.
2277.18 -> A short-term key is born with an expiration date.
2280.15 -> They're like replicants.
2282.13 -> The lifespan can be set when they're created,
2284.05 -> but it is capped by the system to 36 hours.
2287.17 -> So compared to the typical lifespan of a long term key,
2289.99 -> these are indeed much shorter lived, hence the name.
2292.84 -> The real difference though is this automatic expiration.
2297.13 -> And so why would you have this?
2298.75 -> Why would you introduce this new complexity?
2301.03 -> One of the most common use cases is federated logins.
2303.88 -> It's common for large organizations to have some sort
2306.22 -> of identity broker federated login points.
2309.85 -> So you can do single sign on.
2311.05 -> You authenticate with your corporate credentials
2313.3 -> and you can go to the travel site
2314.68 -> or the customer management portal
2316.27 -> or the AWS management console.
2318.52 -> When you authenticate to us that way
2319.87 -> we need some representation of you inside of the account.
2323.38 -> And that is an AWS session.
2325.87 -> Another common use case is assuming an IAM role.
2329.17 -> We all know that you shouldn't log in as root
2331.15 -> or administrator, but sometimes you have
2333.13 -> to put on your admin hat.
2334.78 -> And you can do that by assuming a role
2336.58 -> with the appropriate privileges.
2338.38 -> In both of these cases,
2339.43 -> the access is meant to be short term.
2341.26 -> It's gonna time out at some point,
2343.06 -> and you want your privileges to trend back towards zero.
2349.06 -> One of my favorite features for AWS is the ability
2352.75 -> to pass a role to various AWS resources like EC2 instances
2357.04 -> or databases to give those resources the ability
2359.92 -> to make API calls on your behalf.
2362.38 -> And so let's take a closer look at that use case.
2366.13 -> I launched an EC2 instance and I specified
2369.28 -> that a role named test role should be mapped
2371.2 -> to this instance.
2372.033 -> EC2 takes it from there.
2373.72 -> Once the instance is running,
2375.25 -> I can contact the instance metadata service
2377.62 -> and pull the credentials for this instance.
2380.17 -> The instance metadata service is a web service listening
2382.54 -> on 169.254.169.254 on every single instance.
2387.28 -> It's serviced locally.
2388.99 -> And I hit this particularly long URL,
2391.66 -> that orange URL is the important bit here,
2394.18 -> and the result is something like this.
2397.33 -> Some of these fields are self explanatory.
2399.52 -> You can see from the last updated
2401.14 -> and the expiration fields that these keys are good
2403.54 -> for about six hours.
2405.4 -> And access key and secret key are familiar,
2408.34 -> like we know what those do.
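The response shown on the slide is not in this transcript, so here is a sketch that parses a made-up response with the field names the instance metadata service documents (`LastUpdated`, `Expiration`, `AccessKeyId`, `SecretAccessKey`, `Token`). All values are placeholders, and the code parses a sample string rather than contacting 169.254.169.254.

```python
import json
from datetime import datetime, timezone

# Placeholder response, assumed to resemble what is returned for a path like
# /latest/meta-data/iam/security-credentials/<role-name>. Values are invented.
SAMPLE_RESPONSE = """
{
  "Code": "Success",
  "LastUpdated": "2022-11-29T10:00:00Z",
  "Type": "AWS-HMAC",
  "AccessKeyId": "ASIAEXAMPLEEXAMPLE",
  "SecretAccessKey": "example-secret-not-real",
  "Token": "example-token-snip",
  "Expiration": "2022-11-29T16:05:00Z"
}
"""


def parse_time(value: str) -> datetime:
    """Parse the UTC timestamp format used in the sample above."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


creds = json.loads(SAMPLE_RESPONSE)
lifetime = parse_time(creds["Expiration"]) - parse_time(creds["LastUpdated"])
print(lifetime)
```

The access key and secret key are used for signing exactly as before; the token travels with each request.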
2410.95 -> So let's take a look at that world map again.
2413.95 -> Alice wants to launch an EC2 instance
2415.9 -> with an instance role in eu-west-1.
2418.63 -> Let's walk through the services we'd have to touch
2421.45 -> if we were to do this with long term keys.
2424.18 -> So first we'd have to call IAM in us-east-1
2428.41 -> and create that key.
2429.58 -> Then propagation would kick off
2431.02 -> and this new key would be pushed to other regions,
2433.33 -> including eu-west-1.
2435.22 -> It's only after this happens that the keys
2437.47 -> on the instance would be useful in eu-west-1.
2440.59 -> So now we have to have the IAM service
2442.96 -> in us-east-1 online and reachable.
2445.12 -> And these keys are good for about six hours.
2447.34 -> So at a minimum, we're looking at four new keys
2450.19 -> for every single EC2 instance in the world.
2452.53 -> And since they expire, we'd have to go through
2454.3 -> and clean out those keys.
2455.68 -> So that's at least eight transactions per day
2458.77 -> per EC2 instance on IAM in us-east-1.
2462.49 -> Plus we're pushing this key to a whole bunch of places
2465.16 -> that it doesn't have to be.
2467.23 -> And so this is coupling our regions together.
2470.47 -> It's adding latency, it's adding cost.
2473.08 -> We could choose not to propagate this key
2475.15 -> to anywhere except the places it needs to be,
2477.52 -> but that makes the propagator more complicated.
2479.53 -> It introduces the chance for subtle bugs
2482.56 -> and it makes the system harder to operate.
2484.69 -> So it's not something that I'd wanna build,
2486.1 -> it's absolutely not something I'd wanna operate.
2488.95 -> We need a different answer.
2490.36 -> We need to be able to create
2491.62 -> and automatically expire sessions at very high scale.
2494.68 -> And we need to be able to do it local to a region.
2497.38 -> And so the answer is one of my favorite parts
2500.38 -> of our authentication system.
2502.21 -> It's STS, the Security Token Service.
2505.03 -> Unlike ARS, the Auth Runtime Service,
2506.86 -> this is a public facing service and it's virtually certain
2510.28 -> that even if you've never heard of it, you use STS.
2515.02 -> So STS is deployed in every region, everywhere on earth.
2519.67 -> This is the building block
2520.78 -> that lets us issue short term sessions at scale.
2526.15 -> And going back to that response that we got
2527.92 -> from the instance metadata service,
2529.24 -> there's one field that we haven't talked about yet,
2531.13 -> that token.
2532.18 -> And as you may have guessed,
2533.08 -> the Security Token Service is all about tokens.
2535.72 -> You can see that this token ends with snip.
2538.06 -> That's because the actual token was about 1200 bytes long.
2541.21 -> And the slides were ugly enough as they were.
2543.58 -> 1200 bytes is a lot.
2544.96 -> You can cram a lot in there.
2546.88 -> So let's pop one of these tokens open.
2551.92 -> This first set of lines is just internal bookkeeping.
2554.5 -> There's not a lot to be said about that.
2557.2 -> Next is the automatic expiration.
2559.93 -> You can see that unlike the instance metadata response,
2563.05 -> this is expressed as a creation time and a time to live.
2567.28 -> But again, it's about six hours.
2570.85 -> And then we have the access key and secret key.
2573.46 -> These are the exact same credentials that we saw before,
2576.37 -> just encoded inside of the token.
2578.74 -> We know exactly what to do with these.
2580.697 -> One of the cool things that you can do with an AWS session
2583.24 -> is you can constrain it based on policy.
2585.73 -> So if there were a policy associated with this session,
2588.37 -> it would be here in the token.
2591.01 -> And then the last thing is an asymmetric signature,
2594.46 -> cryptographic signature, from the STS service saying,
2597.76 -> I STS do hereby swear that this token is cool and valid.
2602.05 -> That whole thing is then encrypted and it's passed back
2604.84 -> to our customer to treat as an opaque blob.
2607.39 -> So let's put this all together.
2609.19 -> Alice needs to call STS, she wants to create an S3 bucket,
2612.25 -> but of course she doesn't have S3 admin privileges.
2615.04 -> So she's gonna do a key derivation for STS.
2618.85 -> She's going to send her request to STS.
2622.27 -> STS is gonna talk to ARS,
2623.74 -> ARS is gonna do a key derivation, validation, et cetera.
2626.59 -> Return the specialized key back to STS for future API calls.
2632.17 -> And we're going to get back this, a session token.
2636.277 -> And it's going to have an access key ID and a secret key.
2641.35 -> Now the interesting thing here is
2643.87 -> that STS is gonna take this token,
2645.58 -> it's gonna encrypt it up and it's gonna send it back
2648.34 -> to Alice where she can then configure it in her SDK.
2651.1 -> You can see the little clock logo
2652.78 -> in the lower right corner indicating
2654.16 -> that this is ticking away towards expiry.
2657.28 -> Really important thing to note here is
2659.44 -> that STS did not record this session.
2662.14 -> It'll log the AssumeRole call in CloudTrail,
2664.81 -> but it didn't store the session anywhere in AWS.
2667.99 -> There is no system anywhere in Amazon
2671.05 -> where the access key ID or the secret key is stored.
2674.41 -> We create the session, we bundle it up,
2676.33 -> we return it to the customer and then we forget about it.
2679.75 -> And so now Alice has these temporary keys configured.
2683.41 -> She can make that create bucket call.
2685.51 -> She starts with the access key ID and the secret key
2689.02 -> from the session that she just got,
2690.73 -> does the key derivation for S3,
2692.81 -> generates the signature for the request to S3,
2695.74 -> sends it to S3.
2697.15 -> S3 calls ARS.
2698.62 -> So far this looks entirely familiar.
2701.02 -> However, now things diverge.
2704.74 -> That access key ID doesn't exist in any database.
2707.35 -> We forgot about it, right?
2708.439 -> ARS has the mirror to the keys that STS has.
2712.63 -> So ARS can validate, can decrypt the token,
2715.6 -> validate the signature on the token,
2717.58 -> extract the access key ID and secret key,
2720.07 -> verify that this session is still active,
2721.9 -> all of those things.
2723.67 -> And if everything is good, it can use that access key ID
2726.97 -> and that secret key to validate the request
2728.957 -> that Alice just sent in.
2731.41 -> And so we validate that request.
2733.66 -> We send the fully derived key back to S3.
2737.616 -> S3 can go through all of the work of creating a bucket
2740.05 -> and return success to Alice.
2742.36 -> From a high level, what we just did seems really silly.
2745.3 -> We created a session,
2746.53 -> we gave it to Alice and then we forgot about it.
2749.26 -> When Alice wanted to use that session,
2750.97 -> she had to pass the token back to us.
2753.19 -> We're using the client for state tracking.
2755.92 -> By doing so, we can issue sessions at incredibly high rates.
2759.88 -> When we first built this,
2760.87 -> we told the pen testers that they would be unable
2763.21 -> to break us on scale.
2764.68 -> And the pen testers laughed
2766.36 -> 'cause they don't believe anyone.
2767.89 -> And they created a billion sessions and it didn't break us.
2770.77 -> I mean it generated a lot of logs,
2772.63 -> but it did not break us,
2773.59 -> 'cause there is no database of sessions.
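A toy sketch of the "issue it, then forget it" pattern: the token carries everything needed to validate it, so the issuer keeps no session table. This is not the real token format. As a swapped-in simplification, an HMAC tag stands in for the asymmetric signature and the payload is left unencrypted, whereas the talk describes tokens that are both signed and encrypted. All names here are hypothetical.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional

# Demo key shared by issuer and validator (stand-in for the real key material).
SERVICE_KEY = b"demo-key-shared-by-issuer-and-validator"


def issue_session(access_key_id: str, secret_key: str,
                  created: int, ttl_seconds: int) -> str:
    """Build a self-describing token and store nothing on the server."""
    payload = json.dumps(
        {"akid": access_key_id, "secret": secret_key,
         "created": created, "ttl": ttl_seconds},
        sort_keys=True,
    ).encode("utf-8")
    tag = hmac.new(SERVICE_KEY, payload, hashlib.sha256).digest()
    return base64.urlsafe_b64encode(tag + payload).decode("ascii")


def validate_session(token: str, now: int) -> Optional[dict]:
    """Recover the session from the token alone; None if forged or expired."""
    raw = base64.urlsafe_b64decode(token.encode("ascii"))
    tag, payload = raw[:32], raw[32:]
    expected = hmac.new(SERVICE_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return None                      # integrity check failed
    session = json.loads(payload)
    if now >= session["created"] + session["ttl"]:
        return None                      # automatic expiration
    return session


token = issue_session("ASIAEXAMPLE", "example-secret-not-real",
                      created=1000, ttl_seconds=6 * 3600)
print(validate_session(token, now=1000 + 60) is not None)
print(validate_session(token, now=1000 + 7 * 3600))
```

Issuing a billion of these costs CPU and log volume but no storage, which is why there is nothing for the validator to look up.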
2777.34 -> So you can issue a session in one region
2779.92 -> and immediately use it in a different region.
2782.23 -> Our control plane doesn't have to race you
2783.94 -> to get the keys there in time,
2785.68 -> because naturally you're gonna get the session
2787.54 -> from one region,
2788.38 -> you're gonna make an API call to the other region,
2790.09 -> you're gonna include all of the state necessary
2791.71 -> to validate that API call.
2795.76 -> Now again, you know Eric, you've been ranting
2799.3 -> that RSA asymmetric crypto is expensive.
2802.06 -> Like isn't it still a concern?
2803.92 -> Yes, but one, our cache ratios here are incredible.
2809.32 -> Not only are we caching the derived keys in the services,
2812.56 -> but ARS also has a cache of tokens that it's decrypted.
2815.89 -> And so if we see the same token presented again
2820.064 -> in a short window, we're already gonna have it decrypted.
2823.69 -> And it's safe to cache these tokens
2826.57 -> until their expiration point.
2828.28 -> And so the result is
2830.26 -> that we're not cracking tokens very often.
2832.75 -> And second, our customers only see these as opaque blobs.
2837.4 -> The only overhead of using an AWS session
2839.74 -> is a tiny little additional bit.
2841.27 -> In this case about 1200 bytes
2843.16 -> of additional network overhead.
2844.99 -> That's it.
2845.92 -> And so we're doing all of the asymmetric cryptographic
2848.14 -> operations on our side and we can build our services
2850.78 -> to handle this load.
2853.81 -> Was it all worth it?
2855.76 -> The counterfactual is hard.
2857.11 -> We don't get to split the universe and play what if games.
2860.424 -> But we have a few data points.
2863.05 -> This is an older paper, but it's a great paper.
2865.81 -> It's totally worth reading even today.
2867.7 -> And I love that serious cryptographic research
2869.904 -> was published under the title 'Here Come The XOR Ninjas'.
2873.37 -> Like when I first went looking for Thai
2875.53 -> and Juliano's paper, I was trying
2877.63 -> to find like the official paper,
2879.04 -> and I kept finding, 'Here Come The XOR Ninjas'.
2881.38 -> Like this is the actual paper.
2884.05 -> The authors disclose a technique that takes advantage
2886.9 -> of the padding scheme used by CBC, Cipher block chaining.
2890.05 -> The way that block ciphers were built up into being able
2893.47 -> to encrypt larger messages in SSLv3 and TLS version 1.0.
2898.63 -> It's awesome here, but the reason I'm calling it out is
2903.13 -> that you can extract fixed values.
2905.89 -> If there's a value that occurs at the same place in a series
2909.16 -> of HTTP requests, you can use this padding oracle attack
2913.12 -> to extract that value.
2915.674 -> AWS API signatures are unique per request.
2919.24 -> Even if you're sending the same request over and over again,
2921.7 -> the timestamp's going to change,
2923.95 -> the signature's going to change.
2925.3 -> There is no fixed value to extract.
2927.4 -> So this was the first paper where our public response
2930.43 -> to our customers was, AWS APIs are not affected by this.
2934.333 -> That was a very satisfying result.
2936.49 -> There have been more of them in the intervening decade.
2940.27 -> And so what have we done?
2942.64 -> What just happened?
2944.53 -> At first glance, authenticating requests
2946.69 -> over HTTP sounds like a deeply studied and solved problem.
2950.082 -> But the nature of our clients and the unique scale
2953.47 -> of our business made our use case very different
2957.82 -> from humans sitting at web browsers.
2959.38 -> And that led us to a unique and novel design.
2963.55 -> And as with all designs,
2964.78 -> we finally saw the end of the runway.
2967.36 -> But a bunch of smart people figured out
2969.22 -> how to evolve the system in place
2971.5 -> with minimal customer upheaval.
2973.96 -> Literally all customers had to do was ingest a new SDK
2976.971 -> and things just worked.
2980.29 -> And some of the things that we have in our design
2983.92 -> look needlessly complicated,
2985.9 -> but they've actually been really useful
2987.7 -> for supporting use cases like opt-in regions.
2990.91 -> They've given us the flexibility to say yes to the business
2994.291 -> rather than having to say no
2995.92 -> or having to make some tough trade-offs.
2998.59 -> And so AWS gets to continue to innovate
3000.72 -> on behalf of our customers.
3003.33 -> And even in these slides,
3005.76 -> the system that I've described isn't simple
3007.68 -> and the real system beneath these slides
3009.57 -> is even more complicated.
3011.52 -> Despite that, I consider it one of the most elegant things
3015.24 -> that I've ever had the pleasure of working on,
3017.67 -> because it's been able to evolve
3019.8 -> and we've been able to grow it
3021.39 -> with little to no customer visible changes.
3024 -> The majority of our customers aren't even aware
3026.64 -> that these things are happening despite the fact
3029.04 -> that together they drive over a half a billion requests
3031.95 -> per second.
3033.99 -> Anyway, thank you very much for coming.
3036.72 -> It is a delight to be at re:Invent in person
3039.42 -> and enjoy the rest of the conference.
3040.982 -> (applause)

Source: https://www.youtube.com/watch?v=tPr1AgGkvc4