AWS re:Invent 2022 - A day in the life of a billion requests (SEC404)
Aug 16, 2023
Every day, sites around the world authenticate their callers. That is, they verify cryptographically that the requests are actually coming from who they claim to come from. In this session, learn about unique AWS requirements for scale and security that have led to some interesting and innovative solutions to this need. How did solutions evolve as AWS scaled multiple orders of magnitude and spread into many AWS Regions around the globe? Hear about some of the recent enhancements that have been launched to support new AWS features, and walk through some of the mechanisms that help ensure that AWS systems operate with minimal privileges.
Content
0.36 -> - My name's Eric Brandwine and
I'm a distinguished engineer
2.88 -> with the Amazon Security team.
5.22 -> Today we're gonna talk
about some of the systems
7.08 -> and processes we use when
handling inbound API requests
10.83 -> from customers.
12.099 -> The title is a reference
to a talk that I gave
14.72 -> at this conference nine years ago.
17.01 -> It was entitled 'A day in the
life of a billion packets'.
20.25 -> It was talking about the internals of VPC.
22.95 -> Even back then, we handled way more
24.81 -> than a billion packets in a day.
26.94 -> So it was a fun title, but it
was numerically inaccurate.
30.6 -> This title is also inaccurate.
32.61 -> The most recent public statistic is
34.41 -> that AWS IAM handles half a
billion requests per second.
38.88 -> A quick bit of math leads you to the fact
41.04 -> that we must handle trillions
of requests per day.
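The quick bit of math can be checked directly. This is a minimal sketch that treats the quoted half a billion requests per second as a flat rate across the whole day, which is an assumption; real traffic will vary.

```python
# Back-of-the-envelope check of the "trillions of requests per day" claim.
REQUESTS_PER_SECOND = 500_000_000  # "half a billion requests per second"
SECONDS_PER_DAY = 24 * 60 * 60     # 86,400

requests_per_day = REQUESTS_PER_SECOND * SECONDS_PER_DAY
print(f"{requests_per_day:,} requests per day")        # 43,200,000,000,000
print(f"about {requests_per_day / 1e12:.1f} trillion")  # about 43.2 trillion
```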
44.1 -> This is one of the most
scaled systems on earth
47.16 -> and one of the unsung heroes of AWS.
49.8 -> And today we're gonna
take a closer look at it.
53.97 -> Anytime you're dealing with
anything over a network,
56.628 -> anything that any security
person has touched,
58.96 -> these two topics come up.
61.56 -> Authentication is making sure
that the party is actually
63.84 -> who they say they are.
65.19 -> Do I have Eric Brandwine's
driver's license.
67.89 -> Authorization is determining
whether or not the person
70.32 -> that has that driver's
license is cool enough
72.54 -> to do the thing.
73.53 -> I'm usually not that cool.
76.62 -> Most talks in this area
gloss over authentication.
80.31 -> They assume that
authentication is handled,
82.41 -> it's a solved problem,
83.94 -> and then they spend a lot
of time on authorization.
86.49 -> Which is fine, authorization
is a fascinating topic.
89.16 -> It's worth digging into,
90.45 -> and there's a lot that can
and should be said about it.
92.7 -> However, in this talk,
94.05 -> we're gonna flip that script on its head
96.09 -> and we're gonna spend
most of our time talking
97.74 -> about authentication.
99.42 -> It turns out this can
be fascinating as well,
102 -> and the way that we do it at
AWS is unique and interesting.
105.78 -> In order to fit in the time allotted,
107.7 -> this talk is necessarily gonna
gloss over a lot of details.
111.27 -> In particular, pretty much everywhere
113.4 -> that you see our system propagating keys,
115.23 -> we're also propagating policies.
117.39 -> Even though we're gonna
focus on authentication
119.09 -> in this talk, clearly, we
also perform authorization.
124.08 -> In order to authenticate people,
125.97 -> we're gonna need to build
a cryptographic protocol.
128.88 -> You've probably heard of
cryptographic primitives like AES
131.54 -> or elliptic-curve cryptography.
134.22 -> Those are the fundamental building blocks,
136.44 -> but they're not generally useful.
137.64 -> They tend to be very narrow
in their application.
140.34 -> So these primitives are built up
141.69 -> into higher and higher level constructions
144.24 -> that perform ever more useful things.
147.54 -> Rather than something
simple like: given a key
149.57 -> and a fixed-size message,
151.08 -> produce a fixed-size
output that's encrypted,
153.42 -> you get: given a key and an
arbitrary-size message,
156.54 -> produce an arbitrary-size output
158.22 -> that provides encryption
and integrity protection.
161.73 -> However, even these building
blocks don't fully solve
164.58 -> most real world problems.
166.23 -> The top layer of this stack is a protocol,
168.54 -> a cryptographic protocol.
170.16 -> This is the complete
specification of every operation
173.1 -> that you need to perform in
order to complete whatever task
175.645 -> that you're trying to complete,
177.42 -> ideally in a way that
protects you against all
179.66 -> of the threats that you've specified.
182.16 -> Every layer of this stack is hard.
184.71 -> If you read the cryptographic
papers, it's pretty rare
187.32 -> that one of the underlying
building blocks is broken.
190.02 -> Much more often it's the gaps
between the building blocks
193.05 -> where people find problems and
things start to fall apart.
196.53 -> Making your own cryptographic
protocol is hard,
199.8 -> it's expensive and it's not something
201.75 -> that we undertook casually.
203.73 -> We did a ton of design
reviews, consultations,
206.19 -> penetration testing and
more to maximize the odds
209.61 -> that we were gonna get this right.
212.452 -> It's traditional when discussing
a cryptographic protocol
216.133 -> to model it as a conversation
between two parties,
218.79 -> Alice and Bob, A and B.
221.19 -> Randall Munroe published this comic
222.72 -> at exactly the right time.
223.98 -> So please enjoy.
225.63 -> Anyway, we're gonna break
from tradition in this talk.
229.71 -> We've got Alice, but Alice
doesn't want to talk to Bob.
233.1 -> She wants to talk to the cloud.
235.05 -> So here's the cloud.
236.43 -> Alice needs to talk to
the cloud, of course,
238.2 -> and we need to make sure
that it's actually Alice
240.36 -> that's talking to us.
242.82 -> Now you're thinking, "Eric,
this is a solved problem."
246.21 -> And indeed it is, for some
definitions of solved.
249.3 -> The most common case is
logging into a website.
252.15 -> So we've got a website and
Alice has some sort of client,
255.6 -> most likely a web browser.
257.79 -> She browses to some website
and is told, who are you?
261.87 -> And her browser's automatically
gonna go redirect her
264.21 -> to some sign in endpoint.
265.5 -> So a login endpoint.
267.48 -> Her browser is gonna render a username
269.19 -> and password dialogue for her.
270.45 -> She's gonna answer these challenges
272.49 -> and that login site is gonna
return to her a cookie.
275.25 -> A tiny little bit of data.
276.66 -> Her browser's gonna
squirrel this cookie away,
279.42 -> and then she gets redirected
back to the website.
282.06 -> But this time she includes
the freshly minted cookie
284.94 -> in her request to the website.
286.5 -> The website can validate the cookie,
288.39 -> and go like, "Hey, welcome Alice."
290.49 -> We've successfully authenticated Alice.
294.3 -> Now, an important bit of context here is
296.82 -> that when we initially
designed this system, it was 2006.
300.15 -> That's a long time ago in the real world,
302.49 -> but it's like centuries in computer years.
305.88 -> Most of you probably
weren't even born yet.
310.11 -> Netflix wasn't a streaming company.
312.9 -> Mobile phones were incredibly different.
315.81 -> The internet was much safer.
320.16 -> But TLS was not widely used.
325.303 -> TLS is transport layer security,
327.03 -> that's what we call it today.
327.9 -> We used to call it SSL,
secure sockets layer.
330.09 -> That's what was in use at the time.
332.79 -> It was expensive.
334.23 -> Cryptographically
computationally expensive
336.09 -> on both the server and the client's side.
338.61 -> And so because we have a lot
of clients that just pop up,
342.81 -> make an API call and disappear,
345.21 -> they can't amortize that TLS
handshake across many requests.
348.69 -> They'd have to make that TLS connection
350.22 -> for every single call.
351.93 -> And when you consider that we have a lot
353.19 -> of embedded clients, things
that have constrained CPU,
355.74 -> constrained power,
constrained battery life.
358.17 -> That computational
overhead could potentially
360.51 -> be completely unacceptable
to our customers.
363.03 -> So in the internet of 2006 mandating
365.73 -> that every API call used
TLS was a non-starter.
368.73 -> So because we couldn't make
all of our customers use TLS,
372.09 -> we had to offer clear text
endpoints for some of our APIs,
375.24 -> and that deeply informed our design.
379.59 -> So maybe, maybe we don't wanna
use that cookie protocol,
383.73 -> but surely this is already
a solved problem, right?
386.76 -> So we'll just give Alice
an asymmetric key pair.
390.09 -> In most implementations,
391.38 -> the public key is gonna be
encoded in a certificate,
393.96 -> an X.509 certificate,
the kind that you use
396.18 -> to secure a TLS endpoint,
and a private key.
400.38 -> The private key is exactly that.
402.75 -> It's a secret, it's private.
404.13 -> Alice must not share it with anyone.
406.08 -> The public key can be shared freely.
409.29 -> And what you can do with
these is you can sign,
412.17 -> you can use the private
key to sign a request
414.78 -> that generates a signature.
416.85 -> And so Alice can make a
call to her provider asking
420.36 -> for a cloud, please.
421.83 -> And she's gonna include the certificate.
423.75 -> This is an assertion of her identity.
425.79 -> Again, this is public
data and this signature.
429.57 -> She's gonna send it to the company.
432.54 -> They're going to verify this API request,
434.55 -> given the certificate.
435.69 -> They're gonna check that
the certificate was issued
437.31 -> by someone they trust,
that it's still valid,
438.9 -> all of that stuff, that
it's associated with Alice.
441.701 -> Then they check the signature,
443.49 -> which they can do given
only the public key.
445.68 -> That checks out.
446.94 -> So they're gonna tell Alice,
"Okay, this is great."
450.72 -> So we're good, right?
453.45 -> I mean, this is a great protocol.
457.17 -> One very nice property of this protocol
459.51 -> is that we've got a signature
over the entire request.
462.24 -> So if for any reason Alice's
request is tampered with,
465.93 -> whether it's malicious or a
fat finger or a cosmic ray,
469.14 -> Alice is now asking for two clouds,
471.18 -> but she's signed for one cloud.
473.91 -> So she's gonna send this to the provider.
476.67 -> The signature validation is going to fail,
480.06 -> and the request is gonna be denied.
482.1 -> This protocol with a signature
over the entire request
485.55 -> protects us against an
eavesdropper looking
487.56 -> to steal the keys.
488.88 -> In the cookie protocol, if I
eavesdropped on a request,
492.33 -> I got your cookie,
493.163 -> I could then use that
cookie in future requests.
495.45 -> In this protocol, I
eavesdrop on the request,
497.97 -> I get the signature, the
signature is bound to the request.
500.79 -> I can't use it on different requests.
503.31 -> But it's not a panacea.
505.98 -> In this case, everything's
working correctly.
507.9 -> Alice signed for one cloud.
509.13 -> Alice is requesting one cloud.
511.02 -> The cloud provider has
said, "Yep, that's cool."
513.81 -> But we have an eavesdropper.
515.34 -> The eavesdropper pulls this off the wire.
518.58 -> Now, they can just send
exactly that same request.
524.138 -> It has a valid signature.
526.32 -> It's going to validate at the provider
528.51 -> and it's going to succeed.
530.34 -> Now, in this case, Alice
is paying for two clouds
533.01 -> rather than just one.
534.3 -> That may be nothing more than a nuisance,
536.28 -> but if this is something
537.113 -> like resetting an administrator password
539.58 -> that someone has taken the
request and replayed it,
542.58 -> then that could be a real
serious security problem.
545.04 -> And this is called exactly
that, a replay attack,
547.92 -> because the adversary is just
replaying a valid request
552.09 -> that they obtained.
554.76 -> So in reality, the string to sign
557.4 -> in the protocol isn't
just the request itself,
559.59 -> it's gonna contain a timestamp.
561.66 -> So Alice is gonna sign this new string.
564.51 -> There's actually supposed to
be a full timestamp there,
566.49 -> but I'm not very good at PowerPoint.
568.74 -> So to verify it, she's
gonna send this request,
572.4 -> which includes the timestamp
and the new signature.
575.49 -> We're gonna verify it.
577.14 -> This signature checks out.
578.55 -> This was actually signed by Alice,
579.99 -> or at least someone in possession
of Alice's private key.
582.9 -> But then we're gonna make sure
that that timestamp is good.
586.68 -> And since the timestamp is good,
589.2 -> then we respond success to Alice.
591.607 -> The reason that I'm saying good here
593.67 -> and not exactly the same
is that you can't assume
596.1 -> that two arbitrary machines
across the internet
598.32 -> are gonna have closely
synchronized clocks.
600.87 -> And so you've gotta have
a little bit of splay time
603.3 -> in there in order for things
to work in the real world.
606 -> And in practice, we found that a value
607.77 -> of single digit minutes
is typically reasonable.
611.61 -> I don't actually wanna derive
613.23 -> a full-fledged signing
protocol here live on stage.
616.47 -> I just included this bit about
replay attacks and timestamps
619.89 -> to illustrate how the simple
obvious things often fail
623.31 -> to work in practice.
624.96 -> Like all cryptographic protocols,
627.03 -> these signing protocols are
really, really difficult
629.28 -> to get right.
630.18 -> And anyone that wants to dip
their toe in these waters
632.7 -> would be well served by reading the papers
635.01 -> about the flaws that have been discovered
636.6 -> in various protocols.
639 -> Some of you, for example,
may be thinking like,
641.197 -> "Hey, he used a semicolon
to separate the request
645.037 -> "from the timestamp.
646.387 -> "What if the request itself
includes a semicolon?
649.267 -> "What's the parser gonna do then?"
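The semicolon worry can be made concrete. In this sketch, two different (request, timestamp) pairs collapse to the same string to sign when joined with a bare semicolon. The length-prefixed variant is one possible fix chosen for illustration; it is not the format AWS uses, and the function names are made up.

```python
def naive_string_to_sign(request: str, timestamp: str) -> str:
    # Joins the two fields with a bare semicolon, as on the slide.
    return request + ";" + timestamp

def length_prefixed_string_to_sign(request: str, timestamp: str) -> str:
    # Illustrative fix: prefix every field with its length so field
    # boundaries cannot be shifted by characters inside the data.
    return "".join(f"{len(field)}:{field}" for field in (request, timestamp))

a = ("one cloud;please", "2022-11-30")
b = ("one cloud", "please;2022-11-30")

# Different inputs, identical string to sign: one signature covers both.
print(naive_string_to_sign(*a) == naive_string_to_sign(*b))  # True

# With length prefixes the two inputs stay distinct.
print(length_prefixed_string_to_sign(*a) == length_prefixed_string_to_sign(*b))  # False
```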
651.48 -> Regardless, this looks pretty cool, right?
653.85 -> Like Alice can authenticate herself.
655.56 -> We've got a signature
over the entire request,
657.3 -> so it's tamper evident.
658.65 -> We've got replay protection
and we can keep layering on
661.35 -> from here.
662.97 -> But remember, it's 2006.
666.36 -> The overhead of TLS is the
reason that it's not widespread,
669.27 -> or at least one of the reasons.
670.74 -> A 2006 computer is slow
by today's standards,
674.52 -> and it's specifically the
asymmetric cryptography,
677.43 -> the signing operation, that's expensive.
680.43 -> The protocol that we
just described is built
682.35 -> on asymmetric cryptography.
684.12 -> We could eat that on our end.
685.62 -> Maybe we'd pass some of the
costs along to our customers,
688.38 -> but our clients may not be
able to sustain that load.
692.91 -> If we require every request to be signed,
694.89 -> that is a signing operation
696.3 -> and RSA signing is more
expensive than RSA validation.
699.51 -> And the laptop, I have an M1 MacBook Pro,
703.05 -> and that laptop can do
about 600 signing operations
706.44 -> per second single threaded.
708.66 -> So that's an entire core of my laptop.
710.64 -> It's less than a thousand TPS.
713.61 -> This is a tough trade off.
716.01 -> So to sum it all up,
718.59 -> every request is signed, with a
per-request signature.
723.39 -> That means that the
requests are tamper evident.
725.25 -> If someone has changed a request,
726.63 -> we will be able to detect that
727.86 -> when we're validating the signature.
730.38 -> And requests can't be replayed
outside of the replay window.
734.19 -> And the protocol is stateless.
735.48 -> There's no need to go
to a sign in endpoint
737.4 -> or some sort of token exchange.
739.47 -> You pop up, sign an API
call, send it off, shut down.
742.8 -> So we've achieved all of those goals,
745.2 -> but the cryptography that
we've chosen is expensive.
747.6 -> That's the fly in the ointment.
749.67 -> We're so close, but this isn't
gonna work for our use case.
753.09 -> So what now?
756 -> In order to solve our
carefully constructed conundrum,
758.94 -> we need to introduce a new
cryptographic primitive,
761.28 -> the hash.
762.99 -> A hash algorithm is
also known as a digest.
766.835 -> All it does is it takes
an input of any size
770.46 -> and it maps it to a fixed length output.
774.33 -> There are many algorithms that do this
775.98 -> that serve many purposes,
777.12 -> including error detection and correction.
779.31 -> But in this case, we're
specifically concerned
781.29 -> with cryptographic hash algorithms.
784.47 -> In a good cryptographic hash algorithm,
786.81 -> the output gives you no
information about the input.
789.9 -> One way of saying this is
that if you flip one bit
792.24 -> in the input, approximately half
794.22 -> of the output bits will flip.
796.71 -> And it has to be hard to reverse.
798.78 -> And hard here has a formal definition.
800.97 -> It's not impossible to reverse
because you could just keep
803.52 -> on guessing inputs until
you got the same output.
806.1 -> You could brute force it.
807.57 -> So hard in this case means that
no one has discovered a way
810.93 -> to reverse the hash that
is faster than brute force.
814.44 -> And they are quick.
816.69 -> They are incredibly fast.
818.37 -> That same M1 MacBook Pro can run SHA-256,
822.18 -> which is a modern hash
algorithm, bright and shiny,
825 -> at more than two gigabytes per second.
827.61 -> They're wicked fast.
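Two of the properties just described, fixed-length output and roughly half the output bits flipping for a one-bit input change, can be observed with Python's standard `hashlib`. This is a small demonstration, not a proof; the message and the helper name are made up.

```python
import hashlib

def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

message = b"one cloud, please"
flipped = bytes([message[0] ^ 0x01]) + message[1:]  # flip one bit of the input

digest_a = hashlib.sha256(message).digest()
digest_b = hashlib.sha256(flipped).digest()

# Any input size maps to the same 32-byte (256-bit) output.
print(len(hashlib.sha256(b"").digest()), len(hashlib.sha256(b"x" * 1_000_000).digest()))  # 32 32

# Expect a count somewhere near 128 of the 256 output bits.
print(differing_bits(digest_a, digest_b), "of", len(digest_a) * 8, "output bits differ")
```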
829.95 -> So the reality is that
we're already hashing.
834 -> So we talked about Alice
generating a signature
836.58 -> over this long message,
838.23 -> but that's not what you do in practice.
840.93 -> The RSA signing algorithm is slow,
843 -> and if you're trying to sign
an arbitrary length message,
845.43 -> it greatly complicates things.
847.11 -> So in practice, what you do
is you use a hash algorithm
851.04 -> to turn the arbitrary length
message that you're signing
853.38 -> into a digest and then
you sign the digest.
856.53 -> And so in that protocol
that we just described,
858.72 -> where we're using asymmetric
keys, we are already hashing.
861.45 -> We're already paying the
overhead of the hash.
864.36 -> And so it's basically free.
867.33 -> But a hash just takes an
arbitrary length input
870.21 -> and maps it to a fixed length output.
872.34 -> It's not a signature.
874.14 -> How do we turn this into a signature?
877.56 -> The answer is that we
need another cryptographic
880.35 -> construction that goes by the name of HMAC
882.93 -> or hash based message authentication code.
888.12 -> Here's a simplified definition of HMAC.
891.03 -> In reality, the two uses of
the key on the right side
893.67 -> are XORed with some constants,
895.5 -> and there's an extra key
derivation step that I skipped.
898.47 -> You should not implement
cryptographic primitives based
901.44 -> on my slides.
905.01 -> It looks complicated, but
it's actually really simple.
910.8 -> We take the key, we
concatenate it with the message.
915.03 -> Just string concatenation.
916.29 -> We take the hash of that.
918.48 -> This can be a data intensive operation.
920.22 -> The message can be of arbitrary length.
921.99 -> This is approximately the hash
that we would've had to do
924.27 -> in the asymmetric case.
925.86 -> And then we perform a second
hash with the key concatenated
929.31 -> with the output of the first hash.
931.02 -> This is gonna be incredibly quick.
932.7 -> It's two fixed length inputs,
934.32 -> the key and the result of that first hash.
936.69 -> And so this is an
incredibly fast operation.
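For reference, here is the full construction written out by hand, including the key preparation and the XOR constants that the simplified slide leaves out. It follows the published HMAC definition (RFC 2104) with SHA-256 as the hash, and it is checked against Python's standard `hmac` module. It is for reading, not for production use; the function name is made up.

```python
import hashlib
import hmac

BLOCK_SIZE = 64  # SHA-256 processes input in 64-byte blocks

def hmac_sha256(key: bytes, message: bytes) -> bytes:
    # Key preparation: overly long keys are hashed first,
    # then the key is zero-padded to exactly one block.
    if len(key) > BLOCK_SIZE:
        key = hashlib.sha256(key).digest()
    key = key.ljust(BLOCK_SIZE, b"\x00")

    # The two uses of the key are XORed with different constants.
    inner_key = bytes(b ^ 0x36 for b in key)
    outer_key = bytes(b ^ 0x5C for b in key)

    # First hash: key material plus the arbitrary-length message.
    inner = hashlib.sha256(inner_key + message).digest()
    # Second hash: key material plus the fixed-length first result.
    return hashlib.sha256(outer_key + inner).digest()

key, message = b"secret", b"Shhh! Don't tell"
print(hmac_sha256(key, message) == hmac.new(key, message, hashlib.sha256).digest())  # True
```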
939.51 -> You need this construction
because if you just did
942.06 -> that first hash, the key plus the message,
944.43 -> you're vulnerable to things
like extension attacks,
946.59 -> where if I've got the message and the HMAC,
948.78 -> I can add to the message
and I can add to the HMAC.
952.44 -> And so this construction
is about 26 years old.
957.27 -> It has been deeply reviewed
by cryptographic experts all
960.36 -> around the world.
961.193 -> This is not an Amazon thing,
963.18 -> and to date, no one has
found any issues with it.
966.904 -> MD5 is an old and broken
cryptographic hash algorithm.
971.1 -> You should
not be using MD5 for anything.
974.64 -> Even so, as far as we know now,
HMAC-MD5 is still secure.
979.26 -> So for 26 years, this
has been working well.
983.73 -> Here's an example.
984.75 -> I'm using secret as the
key and "Shhh! Don't tell"
987.99 -> as the message, and that's the output.
991.59 -> I've added an H to the message.
993.81 -> You can see that the next
hash that we generate
996.12 -> is completely unrelated.
997.86 -> I've changed the C to a K.
999.84 -> Again, completely unrelated output.
1002.93 -> And then here I've changed
the capital S to a capital R.
1006.41 -> And this is interesting because
the only difference between
1009.44 -> the ASCII representation of
a capital S and a capital R
1013.07 -> is that the least
significant bit in R is zero
1015.68 -> and it's set in S.
1017.27 -> So this is a one bit change in the input,
1020.03 -> and you can see that the
output is completely unrelated
1023.12 -> to the previous output.
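The one-bit example can be reproduced with the key and message from the slide. The hash algorithm used on the slide is not stated, so SHA-256 here is an assumption, and the digests printed will not necessarily match the ones shown in the talk.

```python
import hashlib
import hmac

def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

key = b"secret"
original = b"Shhh! Don't tell"
changed = b"Rhhh! Don't tell"  # capital S replaced by capital R

# 'S' and 'R' differ only in the least significant bit.
print(f"{ord('S'):08b} {ord('R'):08b}")  # 01010011 01010010

tag_original = hmac.new(key, original, hashlib.sha256).digest()
tag_changed = hmac.new(key, changed, hashlib.sha256).digest()

print(tag_original.hex())
print(tag_changed.hex())
print(differing_bits(tag_original, tag_changed), "of 256 output bits differ")
```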
1025.16 -> And the task that an
attacker has to handle
1028.97 -> is given the message and the
signature, figure out the key.
1034.58 -> And to date, again, no one
has found a way to do this
1037.16 -> that's faster than brute force.
1040.34 -> So let's put this into practice.
1043.85 -> Rather than a certificate and private key,
1045.77 -> Alice is gonna have a single
secret that's gonna be shared
1048.29 -> with her provider.
1049.85 -> She uses this to sign
the request via HMAC.
1053.39 -> And once again, she includes the signature
1055.28 -> in the message that she sends.
1057.23 -> She does not include the signing secret.
1060.35 -> On receipt the cloud provider has
1062.39 -> to do exactly the same
signature generation.
1064.34 -> They check to make sure
1065.173 -> that it's exactly the
same signature output.
1067.94 -> Checks that the timestamp is good.
1070.31 -> And now since everything
lines up, okay, we're in.
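The shared-secret flow can be sketched in a few lines. The verifier does not "check" a signature in the asymmetric sense; it recomputes the same HMAC and compares. The string-to-sign layout, the secret, and the function names are all made up for illustration, and the separate timestamp freshness check is not shown.

```python
import hashlib
import hmac

def string_to_sign(request: str, timestamp: str) -> bytes:
    # Length-prefixed fields (an illustrative choice) so that the
    # request and the timestamp cannot run into each other.
    return f"{len(request)}:{request}{len(timestamp)}:{timestamp}".encode("utf-8")

def sign(secret: bytes, request: str, timestamp: str) -> str:
    return hmac.new(secret, string_to_sign(request, timestamp), hashlib.sha256).hexdigest()

def verify(secret: bytes, request: str, timestamp: str, signature: str) -> bool:
    # The provider regenerates the signature from its copy of the secret
    # and compares in constant time.
    expected = sign(secret, request, timestamp)
    return hmac.compare_digest(expected, signature)

shared_secret = b"example-shared-secret"  # made-up value, known to Alice and the provider
timestamp = "2022-11-30T12:00:00Z"
signature = sign(shared_secret, "one cloud, please", timestamp)

print(verify(shared_secret, "one cloud, please", timestamp, signature))   # True
print(verify(shared_secret, "two clouds, please", timestamp, signature))  # False
```

Only the request, the timestamp, and the signature travel over the wire; the secret itself is never sent.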
1075.56 -> So where are we?
1078.17 -> We swapped out asymmetric keys for an HMAC
1080.36 -> with a shared secret.
1081.68 -> Where's that leave us?
1082.85 -> We've maintained all of these
properties that we liked
1085.205 -> about the previous protocol.
1088.64 -> And now the crypto is cheap.
1090.08 -> This is wicked fast.
1091.52 -> Even on the 2006 computer HMAC is fast.
1095.06 -> And one of the nice
things about using HMAC is
1097.43 -> that while it is affected
by quantum computers,
1100.4 -> it is way less serious than it is
1102.26 -> with asymmetric algorithms.
1103.85 -> If you're using sufficiently
long keys for HMAC,
1106.85 -> as AWS does, then based
on current research,
1110.66 -> HMAC-SHA256 remains quantum safe.
1115.19 -> However, we've got symmetric keys.
1117.65 -> The key has to be shared
with the provider.
1120.14 -> And if the key is shared with the provider
1121.88 -> that service could then
sign any request as Alice.
1125.48 -> So we've fixed a bunch of stuff,
1127.67 -> but we've introduced one more problem.
1129.74 -> So now that the groundwork is laid,
1131.39 -> let's dig into the actual
system that we built.
1135.17 -> We've still got Alice and
she's gonna be starting out
1137.21 -> with her web browser on her laptop.
1139.25 -> And let's say that she just
created her AWS account.
1141.83 -> So she's got a username
and password, right?
1144.68 -> And an MFA token.
1146.72 -> You all have the good taste to come here.
1148.4 -> So I don't need to say this,
1149.51 -> I'm sure you're all already using MFA,
1151.97 -> but if you're not, if you
go down to the AWS booth
1154.43 -> at the security and identity desk,
1156.17 -> they're handing out free tokens.
1157.46 -> You should do that if you
don't already have one.
1161.63 -> Alice wants to talk to the cloud.
1163.04 -> But now we're gonna
have to zoom in and look
1165.14 -> at the innards of the cloud a little bit.
1167.57 -> The first thing that Alice wants to do
1169.1 -> is create an S3 bucket.
1171.05 -> So to do that, she's
gonna have to sign an API call
1173.99 -> and send it to the S3 endpoint.
1175.76 -> Of course, she doesn't have API keys yet.
1178.28 -> The way that you get these API keys
1180.23 -> is through the AWS management console.
1181.91 -> This is the web based interface to AWS.
1184.7 -> So that's the first place Alice goes.
1186.8 -> But she shows up there,
she's unauthenticated.
1189.02 -> So she gets punted over to sign in
1190.91 -> where she answers the
authentication challenge
1192.77 -> with her MFA token and that
gives her back a cookie.
1196.28 -> This is exactly the protocol
we started out with.
1199.37 -> Now Alice can authenticate
to the management console.
1203.36 -> She includes her cookie, so
the console knows who she is.
1206.27 -> And this is always, always,
always conducted over TLS.
1209.69 -> All access to IAM and
the management console
1212.12 -> has always been TLS protected.
1215.06 -> And to create a user, we
have to interact with IAM.
1219.65 -> This is the identity and
access management service.
1222.23 -> This is where you
configure users and roles,
1224.36 -> policies and API keys.
1226.85 -> So on her behalf, the
console is gonna call IAM
1229.61 -> and it's gonna create a new IAM user.
1232.58 -> She's requested that
this user have API keys.
1235.01 -> So IAM needs to create those keys,
1237.89 -> returns it to the console,
1239.42 -> and the console's gonna
return it back to Alice.
1243.23 -> So let's look at what
those keys look like.
1246.08 -> The first is the access key ID.
1248.39 -> This is a key in the database sense.
1250.07 -> The primary key for this credential pair
1252.101 -> into our IAM database.
1254.36 -> It doesn't mean anything.
1255.62 -> It's not used in any
cryptographic operations.
1258.62 -> It's just passed back to the service,
1260.18 -> so we know which key you're using to sign.
1262.46 -> They're not secrets,
they don't mean anything,
1264.71 -> they can't be used to access anything
1266.06 -> and they have no internal structure.
1268.31 -> The second there is the secret access key.
1270.56 -> This is of course a secret.
1271.88 -> It's used as an HMAC key and
it needs to be kept secure.
1275.45 -> However, it too has no internal structure.
1277.46 -> It's just a blob of entropy.
1280.13 -> This is the only time that
you can get these keys.
1283.1 -> If you lose them, if you forget them,
1284.81 -> we will never let you
retrieve them from our APIs
1287.09 -> or from the web console.
1288.47 -> Go create new keys.
1290.3 -> This is a real set of security credentials
1292.37 -> if you wanna screenshot it.
1293.75 -> I created an IAM user to write this talk
1296.21 -> and then immediately deleted it.
1299.57 -> So Alice can grab her SDK
1301.97 -> in her favorite programming language
1303.89 -> and she can configure
it with these new keys.
1305.99 -> She's got everything she
needs to make API calls.
1308.6 -> So she crafts the API
call to S3, she signs it,
1312.98 -> she sends it to S3 and
now we've got a problem.
1317.06 -> S3 has to validate this request.
1319.25 -> One option is we could
give S3 Alice's key.
1322.31 -> S3 could validate that
request, but then again,
1324.68 -> S3 could sign any request
for Alice for any service.
1327.77 -> That's unacceptable.
1329.57 -> And so another option
is that S3 could call
1332.6 -> IAM, and IAM could verify this API call.
1335.987 -> And in theory this would work,
1337.82 -> but it's not really what IAM
should spend its time doing.
1341.15 -> Request verification
sounds like a high scale
1344.09 -> and highly separable task.
1345.98 -> Something that's gonna
scale in its own unique way.
1348.47 -> And so this makes it a candidate
for being its own service,
1351.32 -> which is exactly what we did.
1353.72 -> So I'd like to introduce you
1355.01 -> to the Auth Runtime Service, ARS.
1357.62 -> This is an internal service.
1359 -> It's not directly called by customers,
1361.07 -> but it is one of the most
heavily used services in AWS.
1364.09 -> It is used for authentication
and authorization
1366.8 -> of inbound API calls.
1370.82 -> So every change that's made
in IAM is de-normalized
1374.69 -> and propagated to ARS.
1378.8 -> So now S3 can call ARS and
say, "Is this API call okay?"
1383.9 -> ARS can check the signature on it,
1384.852 -> say, "Yep, this really
was signed by Alice,"
1388.49 -> respond back to S3 and S3
can respond back to Alice.
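The division of labor described here can be sketched as two small classes: one that holds the secrets and answers yes or no, and one that serves requests without ever seeing a secret. The class names, method names, and key values are hypothetical and say nothing about how ARS is actually implemented.

```python
import hashlib
import hmac
from typing import Optional

class AuthRuntime:
    """Hypothetical stand-in for ARS: the only component that holds signing secrets."""

    def __init__(self) -> None:
        # access key ID -> (principal, secret access key)
        self._keys: dict[str, tuple[str, bytes]] = {}

    def propagate_key(self, access_key_id: str, principal: str, secret: bytes) -> None:
        # In the talk, changes made in IAM are propagated to ARS.
        self._keys[access_key_id] = (principal, secret)

    def authenticate(self, access_key_id: str, string_to_sign: bytes,
                     signature: str) -> Optional[str]:
        entry = self._keys.get(access_key_id)  # plain database-style lookup
        if entry is None:
            return None
        principal, secret = entry
        expected = hmac.new(secret, string_to_sign, hashlib.sha256).hexdigest()
        return principal if hmac.compare_digest(expected, signature) else None

class StorageService:
    """Hypothetical front end (think S3): sees requests and signatures, never secrets."""

    def __init__(self, auth: AuthRuntime) -> None:
        self._auth = auth

    def handle(self, access_key_id: str, string_to_sign: bytes, signature: str) -> str:
        principal = self._auth.authenticate(access_key_id, string_to_sign, signature)
        return f"200 OK, hello {principal}" if principal else "403 Forbidden"

auth = AuthRuntime()
auth.propagate_key("AKIDEXAMPLE", "alice", b"made-up-secret")
service = StorageService(auth)

request = b"CreateBucket;2022-11-30T12:00:00Z"
good_signature = hmac.new(b"made-up-secret", request, hashlib.sha256).hexdigest()

print(service.handle("AKIDEXAMPLE", request, good_signature))  # 200 OK, hello alice
print(service.handle("AKIDEXAMPLE", request, "0" * 64))        # 403 Forbidden
```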
1393.23 -> This theme of restricting
access to credentials,
1396.02 -> restricting privileges
that our services have
1398.42 -> is a common one and it'll
keep coming up in this talk.
1400.61 -> It keeps coming up in the
design of our services.
1403.52 -> One possible concern that you may have is
1405.2 -> that this database of keys
that we're creating in IAM
1407.72 -> and ARS is incredibly sensitive.
1410.36 -> And that's true, but it's true
1412.07 -> whether these keys are
symmetric or asymmetric.
1414.77 -> A confidentiality problem with
this database is gonna mean
1418.01 -> that you get access to customer policies,
1420.17 -> you get to map out where they
potentially have weak spots.
1422.93 -> An integrity problem
with this database means
1424.67 -> that you can change passwords
or API keys or policies.
1428.54 -> This database in any system, not just AWS,
1431.3 -> but any system that has
an IAM service needs
1434.33 -> to be protected incredibly
carefully at the highest level.
1437.51 -> And so we do that.
1438.83 -> This is the most locked-down system at AWS
1442.034 -> and we treat it with due care.
1445.19 -> Anyway, this is it at a high level,
1447.38 -> this is how AWS worked until
late 2011 or early 2012.
1453.26 -> If we zoom out a bit,
1454.4 -> this is earth obviously,
1456.56 -> you can see all of our
regions in various places.
1459.53 -> And we've been looking inside us-east-1
1461.9 -> which has IAM and ARS.
1464.03 -> However, if we look at
something like eu-west-1,
1466.67 -> this is in Dublin, Ireland,
1469.04 -> you can see that only ARS is there.
1471.71 -> Likewise, ap-southeast-1 in
Singapore, only ARS is there.
1475.52 -> IAM isn't deployed there.
1477.98 -> All IAM changes, user and role
creation, policy updates,
1481.34 -> things like that, go
through the control plane
1484.76 -> in us-east-1 and then they
get propagated out globally.
1487.82 -> This separation between
control plane and data plane,
1490.7 -> where all of the request
validation happens,
1493.76 -> is another reason that
we separated IAM and ARS
1496.43 -> into two separate services.
1498.35 -> And so just like IAM in
us-east-1 is propagating
1501.95 -> to ARS in us-east-1, it's
also propagating to all
1505 -> of the other regions in
the AWS standard partition.
1508.16 -> And there should be many more arrows here,
1509.54 -> but again, I'm not very
good at PowerPoint.
1512.81 -> So there's a bit of history here.
1515.75 -> Signature version zero
was largely internal.
1518.811 -> There's no sign of it anymore.
1521.54 -> It was phased out very
early in AWS' history.
1524.99 -> Version one was used only for
our first couple of years,
1527.99 -> but we had a canonicalization
issue back in 2008
1531.05 -> and we had to fix it.
1532.64 -> Now canonicalization is a big word,
1535.25 -> but it just means making
the one true representation
1538.22 -> of something.
1539.21 -> Remember, in order to validate a request,
1541.85 -> the service has to generate a signature
1543.65 -> that exactly matches the
signature that was passed in,
1546.2 -> which means that they need to
have the exact same request
1548.45 -> that was signed.
1549.56 -> And it's acceptable for intermediates
1551.21 -> to reorder HTTP headers or add
spaces or things like that.
1555.08 -> So you have to be able
1555.913 -> to generate the one true representation.
1558.41 -> This is hard to get right and
it's an example of the mortar
1561.5 -> between the building blocks
that you can have problems in.
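To make the canonicalization idea concrete, here is a minimal sketch in Python. The rules it applies (lowercase the header names, trim and collapse whitespace in values, sort by name) are assumptions chosen for illustration, not the actual signature version two canonical form, and `canonicalize_headers` is a hypothetical helper name.

```python
def canonicalize_headers(headers):
    """Return one deterministic string for a list of (name, value) HTTP headers.

    Illustrative rules only (assumed, not taken from the talk):
    lowercase names, trim and collapse whitespace in values, sort by name.
    """
    normalized = []
    for name, value in headers:
        clean_value = " ".join(value.split())  # trims and collapses runs of whitespace
        normalized.append((name.strip().lower(), clean_value))
    normalized.sort()
    return "".join(f"{name}:{value}\n" for name, value in normalized)


# The same request as the client signed it, and as it might arrive after an
# intermediary reordered the headers and added whitespace.
as_sent = [("Host", "s3.amazonaws.com"), ("X-Amz-Date", "20120215T000000Z")]
as_received = [("x-amz-date", "  20120215T000000Z "), ("HOST", "s3.amazonaws.com")]

same = canonicalize_headers(as_sent) == canonicalize_headers(as_received)
```

Both sides sign the canonical string rather than the raw bytes, so harmless changes in transit do not change the signature.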
1565.1 -> So remember back a few
slides where I teased like,
1568.01 -> you know, what if there's
a semicolon in the request?
1571.31 -> This is an example of a low
quality serialization standard
1576.35 -> that can lead to canonicalization problems
1578.6 -> that can lead to request
verification issues.
1581.81 -> Anyway, what I've described so far
1584.258 -> is AWS signature version two.
1586.76 -> And we know of no security issues with it.
1590.87 -> However, let's pop back into us-east-1
1593.24 -> and we'll take a slightly
different look at the region.
1595.73 -> S3 isn't our only service.
1597.26 -> We've got lots of them.
1598.22 -> I've only drawn four in here.
1600.38 -> And every single API call
to any of these services
1604.19 -> is gonna involve a round trip to ARS.
1606.23 -> Every API call is signed,
1607.55 -> ARS is the only place you
can check request signatures.
1610.1 -> This places ARS in an
incredibly critical position.
1614 -> An outage in ARS is a region-wide outage.
1617.48 -> A scaling problem in any one
of our services, places load
1620.78 -> on ARS.
1621.71 -> And if ARS then goes into stress, again,
1623.619 -> it's coupled back to all
of our other services.
1626.75 -> And so this was an awesome design.
1628.97 -> It survived about five years,
1630.71 -> multiple orders of magnitude in scale,
1633.14 -> but this coupling between
our services was not okay.
1636.02 -> We had to solve this.
1637.94 -> And again, this was a great design.
1639.74 -> I'm respecting what came before,
1642.11 -> but it was time to do something new.
1645.44 -> So we've got a system, it's working.
1649.4 -> There's literally millions of
customer keys out in the world
1652.76 -> and we can't require that
customers get new keys.
1656.69 -> At this point asymmetric
cryptography is still too slow
1659.87 -> and it would require
customers to get new keys,
1661.79 -> so it's not acceptable to the business.
1664.04 -> But HMAC is symmetric, gotta
figure something out there,
1668.57 -> but it's fast.
1670.34 -> And so if HMAC isn't the answer,
1672.32 -> maybe lots of HMAC is the answer.
1675.89 -> And so that brings us to
AWS signature version four.
1680.06 -> This is our answer to the
problem that I just posed, SigV4.
1684.74 -> Now, signature version three
was never broadly used.
1688.46 -> It was very early in its rollout
1690.62 -> and we had the core idea
of signature version four.
1693.08 -> We're like, this is so much
better than version three.
1695.6 -> And so the only sign that's
left of version three
1697.7 -> is a couple of wiki pages internally.
1699.65 -> There's no existence of it
anywhere in our services
1702.2 -> and we shall not speak of it again.
1705.74 -> So one HMAC is good, what
if we have lots of HMACs?
1711.44 -> What we realized is we could use HMAC
1713.6 -> to perform key specialization.
1716.03 -> And this sounds like crypto gobbledygook
1718.28 -> that engineers use when they want you
1719.66 -> to stop asking questions and go away.
1721.61 -> But the underlying concept
is really quite simple.
1724.28 -> We start off with the
long term customer key.
1726.08 -> These are the millions of keys
1727.19 -> that our customers already have.
1730.28 -> We do a HMAC with a string AWS4,
1734.18 -> and the key against the date,
1736.73 -> and that generates the daily
key for that principal.
1740.33 -> Then we take the daily key and we use that
1743 -> to HMAC the region name.
1745.43 -> And we take that service key.
1748.64 -> We take that region key and we HMAC it
1750.56 -> against the service name
to get the service key.
1752.84 -> And then we have a
terminal HMAC for again,
1755.6 -> cryptographic reasons,
1756.95 -> which is the literal string aws4_request.
1759.26 -> So we've got this string of
HMAC derivations resulting in
1762.41 -> that circled blue key at the bottom.
1764.87 -> And it can look pretty complicated,
1766.76 -> but you can do this in
literally one line of code.
1769.67 -> And it looks like a lot of HMACs,
1771.02 -> but again, HMAC is incredibly fast.
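The chain of HMACs just described can be sketched in a few lines of Python. This follows the publicly documented signature version four derivation (HMAC-SHA256, with the literal prefix `AWS4` on the secret and the terminal string `aws4_request`); the function names and the placeholder secret are illustrative, not taken from the slides.

```python
import hashlib
import hmac


def _hmac_sha256(key: bytes, message: str) -> bytes:
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()


def derive_signing_key(secret_key: str, date: str, region: str, service: str) -> bytes:
    """Derive a signing key scoped to one day, one region and one service."""
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date)  # daily key
    k_region = _hmac_sha256(k_date, region)                             # region key
    k_service = _hmac_sha256(k_region, service)                         # service key
    return _hmac_sha256(k_service, "aws4_request")                      # terminal HMAC


# Placeholder secret for illustration only, not a real credential.
EXAMPLE_SECRET = "wJalrXUtnFEMI/K7MDENG+bPxRfiCYEXAMPLEKEY"
s3_key = derive_signing_key(EXAMPLE_SECRET, "20221130", "us-east-1", "s3")
```

Changing the date, the region or the service name gives an unrelated key, which is what makes the derived key safe to hand to a single service.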
1774.08 -> And so the important thing here is
1776.03 -> that your SDK does this automatically.
1778.82 -> If you've used AWS in the past year,
1781.16 -> your SDK is doing this, you had no idea.
1783.981 -> The rollout here was incredibly slick.
1786.59 -> So let's go back to that service diagram
1788.63 -> and understand why we did this.
1789.98 -> What does it get us?
1791.36 -> So we've got Alice,
1794.18 -> she's going to do her key derivation.
1796.4 -> Again, her SDK does this
for her automatically.
1799.22 -> She's gonna use that final blue key
1801.44 -> to sign the create bucket request.
1803.33 -> She's gonna send it to S3 in us-east-1.
1806.21 -> S3 is gonna send it to ARS.
1809.03 -> ARS is gonna say, okay,
1811.07 -> but they're also going to
include that fully derived key
1814.91 -> which S3 can now cache.
1816.86 -> Now S3 can do all of the work associated
1818.72 -> with creating a bucket and they
can return success to Alice.
1822.92 -> We refuse to allow S3 to
cache Alice's long-term key,
1826.64 -> because that gave S3 privileges beyond S3.
1829.67 -> However, this fully derived
key can only be used
1831.62 -> for 24 hours in this
region for this service.
1835.64 -> And so this is safe for them to cache.
1837.98 -> It can only be used at S3.
1839.99 -> This is one of the ways that
we're driving privileges
1842.63 -> out of the system.
1843.86 -> Now this doesn't look like any savings,
1845.78 -> because we still had to do
the full round trip to ARS.
1848.72 -> But if Alice makes another API call,
1851.24 -> S3 already has the key cached.
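A rough sketch of what that cache could look like on the service side. Everything here is hypothetical: the class names, the five-minute lifetime and the `FakeARS` stand-in are assumptions for illustration, since the talk does not show how the real cache or the ARS interface are built.

```python
import time


class DerivedKeyCache:
    """Cache of fully derived signing keys held by one service (hypothetical)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # scope tuple -> (derived_key, stored_at)

    def get(self, access_key_id, date, region, service):
        scope = (access_key_id, date, region, service)
        entry = self._entries.get(scope)
        if entry is None:
            return None
        derived_key, stored_at = entry
        if self._clock() - stored_at > self._ttl:
            del self._entries[scope]  # stale entry: force a fresh round trip
            return None
        return derived_key

    def put(self, access_key_id, date, region, service, derived_key):
        scope = (access_key_id, date, region, service)
        self._entries[scope] = (derived_key, self._clock())


class FakeARS:
    """Hypothetical stand-in for the round trip to ARS; counts its calls."""

    def __init__(self):
        self.calls = 0

    def validate(self, access_key_id, date, region, service):
        self.calls += 1
        return b"derived-key-for-" + service.encode("ascii")


def signing_key_for(cache, ars, access_key_id, date, region, service):
    key = cache.get(access_key_id, date, region, service)
    if key is None:
        key = ars.validate(access_key_id, date, region, service)
        cache.put(access_key_id, date, region, service, key)
    return key


fake_now = [0.0]  # controllable clock so the example does not have to sleep
cache = DerivedKeyCache(ttl_seconds=300, clock=lambda: fake_now[0])
ars = FakeARS()
scope = ("AKIDEXAMPLE", "20221130", "us-east-1", "s3")

signing_key_for(cache, ars, *scope)  # first call: cache miss, goes to ARS
signing_key_for(cache, ars, *scope)  # second call: served from the cache
fake_now[0] = 301.0
signing_key_for(cache, ars, *scope)  # entry is older than the TTL, goes to ARS again
```

Three requests cost two round trips here; a steady stream of calls from one principal would cost one round trip per cache lifetime.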
1854.36 -> To this level of detail,
this is how AWS has worked
1857.57 -> for the past 10 and a half years.
1859.46 -> There's a trade off inherent
in the cache that S3 has.
1862.64 -> The longer it lives,
1863.87 -> the more value it provides
us in relieving load.
1867.02 -> The longer it lives,
however, the longer it takes
1870.86 -> for configuration values to propagate.
1872.51 -> Policy updates, things like that.
1874.28 -> And again, in practice, a value
1875.78 -> of single digit minutes
seems to work very well.
1878.78 -> And the reason that this is effective is
1881 -> that our customer workloads tend to fall
1882.8 -> into two very different buckets.
1884.63 -> The first is interactive clients
1887.18 -> or things that just pop up,
make a couple of API calls
1889.64 -> and go away, and a couple
of extra milliseconds
1892.19 -> of latency there isn't a problem.
1893.84 -> They also don't present that
much load to the services.
1896.48 -> The other is typically
large production workloads,
1899.36 -> where they're just sending
a stream of API calls
1901.34 -> against our services constantly.
1902.99 -> And in that case, the cache
hit ratio is spectacular.
1906.71 -> Even if someone moved a
gigantic new workload into AWS,
1911.372 -> it would represent, what?
1913.46 -> Hundreds, maybe a thousand
keys for millions of TPS.
1917.72 -> So we're talking about on the order
1919.16 -> of three orders of magnitude
1920.69 -> reduction in the load presented to ARS
1922.7 -> because of this caching.
1924.65 -> This is a more complex
protocol than we had before
1927.29 -> and it's not something
that we did lightly.
1929.51 -> We did multiple rounds
of security reviews,
1931.58 -> penetration testing,
1932.6 -> we consulted external
cryptographic experts.
1935.33 -> We really wanted to make
sure that we have this right.
1938.09 -> And so if you're gonna do this,
1939.98 -> if you're gonna create your
own cryptographic protocol,
1942.26 -> you need to be really convinced
that it's the right thing
1944.54 -> for you to do.
1945.62 -> And then you need to dive
super deep and own it.
1949.7 -> So as mentioned,
1950.861 -> version three never saw
widespread adoption.
1953.75 -> June 2012 is when we launched
1956.15 -> our first SigV4 enabled service.
1958.1 -> So this is something
just over a decade ago.
1965.01 -> SigV2 is still supported in any region
1967.52 -> and with any service where
it was ever supported.
1970.49 -> We care a lot about
backwards compatibility.
1973.28 -> If you're gonna change a website,
1974.81 -> you just change the website.
1975.98 -> The colors change, the buttons
are in different locations.
1978.14 -> The human at the browser
just figures it out.
1980.5 -> It is so much harder to change APIs.
1983.9 -> You've gotta ingest a new SDK.
1985.76 -> Maybe there's breaking changes.
1987.47 -> And so because we know
of no security problems
1990.53 -> in signature version two,
1991.97 -> it's still available to our customers.
1994.13 -> However, usage has dropped to a trickle.
1996.35 -> All new SDKs default to SigV4.
1999.23 -> And by 2014, we had
enough confidence in SigV4
2003.22 -> that when we launched our
region in Frankfurt, Germany,
2005.89 -> it was a hundred percent SigV4 only.
2007.81 -> We never supported SigV2 there.
2010.06 -> Every region since then
has been SigV4 only.
2013.09 -> At the time this was
a hotly debated topic,
2015.64 -> but in hindsight it was
absolutely the right call to make.
2019.6 -> And in April of 2019, we
launched a region in Hong Kong,
2023.83 -> which is our first opt-in region.
2026.95 -> The reason we did that is
for some of our customers,
2028.59 -> it was very important that they
be able to run in Hong Kong.
2031.12 -> That was the entire business
case for launching the region.
2033.85 -> But for other of our customers,
2035.62 -> it was very important to
them that they not be able
2037.48 -> to run in Hong Kong.
2038.313 -> They did not want their
data or their workloads
2040 -> to be present in Hong Kong.
2041.557 -> And so to solve this, the
owner of an account has
2044.71 -> to explicitly tell us via API call
2047.05 -> that they want to enable
the Hong Kong region.
2050.2 -> This is a standard AWS API call,
2052.21 -> so it can be restricted
using IAM policies,
2054.79 -> AWS Organizations and SCPs.
2057.52 -> Just as with SigV4 only
regions, every region launched
2060.94 -> since Hong Kong has been opt-in as well.
2063.97 -> So let's dig in and see what
this means under the covers.
2066.94 -> So we've got our map of AWS regions.
2069.67 -> And if you look inside
us-east-1, Northern Virginia,
2072.55 -> you can see that we've got IAM and ARS
2074.92 -> and it's propagating locally.
2076.57 -> And when Alice creates
her API keys in us-east-1,
2079.96 -> they get propagated to ARS in us-east-1.
2082.63 -> Now Alice can make API calls in us-east-1.
2086.08 -> If you look inside eu-west-1 in Ireland,
2088.75 -> this is our second region,
2090.31 -> SigV2 is still gonna be supported here.
2092.74 -> So as we mentioned earlier,
there's no IAM in eu-west-1,
2096.79 -> but ARS is there.
2098.59 -> And so the IAM propagator
2100.09 -> is gonna automatically propagate keys
2102.28 -> and Alice's key will be made
available to ARS in eu-west-1.
2106.42 -> So now Alice can make SigV2
or SigV4 calls to eu-west-1.
2112.45 -> Now we're gonna zoom in
to Hong Kong, ap-east-1.
2116.35 -> Once again, IAM is not
in this region, ARS is.
2119.98 -> However, the IAM propagator's pushing keys
2122.84 -> to Dublin, to Ireland, for Alice.
2125.08 -> It's not pushing anything
for her account to ap-east-1.
2128.26 -> As far as Alice's account is concerned,
2131.23 -> this region does not exist.
2132.67 -> It's as if we'd never launched it.
2134.68 -> However, Alice wants to
create a bucket in ap-east-1.
2138.43 -> She actually wants to take
advantage of the region.
2140.47 -> So she has to make an API call
to us to enable the region.
2144.97 -> And so she turns on Hong Kong.
2148.36 -> Now we start propagating her keys,
2150.43 -> but things are a little
bit more complicated.
2152.92 -> Rather than sending the red
key, Alice's long term key,
2156.22 -> IAM is gonna do a partial
key derivation here.
2159.01 -> We're gonna do the daily
and region specializations
2161.56 -> to the key, and that's
what we're gonna propagate
2164.14 -> to Hong Kong.
2165.4 -> So the key that gets placed
in Hong Kong is scoped
2168.16 -> to Hong Kong.
2168.993 -> It can be further specialized
by ARS in Hong Kong
2171.73 -> for calls to individual services,
2174.34 -> but it has no trust anywhere
outside of Hong Kong.
2177.94 -> And again, this is completely
transparent to Alice
2181.394 -> and to our services.
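A small sketch of why this partial derivation is transparent to the caller. It assumes the publicly documented signature version four chain; the split between "control plane" and "in-region" functions is an illustration of the idea in the talk, and all function names are hypothetical.

```python
import hashlib
import hmac


def _hmac_sha256(key: bytes, message: str) -> bytes:
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()


def client_signing_key(secret_key, date, region, service):
    """Full derivation, as the caller's SDK would do it from the long-term key."""
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")


def region_scoped_key(secret_key, date, region):
    """Partial derivation done where the long-term key lives (assumed split)."""
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date)
    return _hmac_sha256(k_date, region)


def finish_in_region(region_key, service):
    """What the in-region side can compute from the region-scoped key alone."""
    k_service = _hmac_sha256(region_key, service)
    return _hmac_sha256(k_service, "aws4_request")


SECRET = "EXAMPLE-LONG-TERM-SECRET"  # placeholder, not a real credential
DATE = "20221130"

hong_kong_key = region_scoped_key(SECRET, DATE, "ap-east-1")

# The region can validate a request the client signed for S3 in ap-east-1 ...
matches_in_region = (
    finish_in_region(hong_kong_key, "s3")
    == client_signing_key(SECRET, DATE, "ap-east-1", "s3")
)
# ... but the same propagated key does not produce the key for any other region.
matches_elsewhere = (
    finish_in_region(hong_kong_key, "s3")
    == client_signing_key(SECRET, DATE, "eu-west-1", "s3")
)
```

The long-term secret never has to leave the place that performs `region_scoped_key`, and the propagated value expires with the date it was derived for.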
2184.09 -> This is another example
of us driving trust
2187.24 -> out of our systems,
lowering the sensitivity
2189.88 -> of the keys such that the
system is easier to operate
2193.78 -> and we're more confident
2194.86 -> that we've got a delightful configuration.
2199.69 -> And this is an example that I lifted
2202.09 -> from our IAM documentation.
2204.34 -> There's two statements in here.
2205.9 -> The first statement allows the principal
2208.15 -> to which this policy is
applied to call EnableRegion
2210.7 -> or DisableRegion on ap-east-1.
2212.83 -> So you can opt-in or opt-out
of the Hong Kong region.
2215.56 -> And then the second statement
is just the permissions
2217.33 -> that you would need if
you wanted to do this
2218.65 -> through the console versus via the SDKs.
2221.59 -> You could clearly change
this from Allow to Deny
2224.26 -> and then whichever principal it was
2226.36 -> could not enable this region.
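The slide itself is not part of this transcript, so the following is a reconstruction from the public IAM documentation example for enabling and disabling a Region, written as a Python dictionary. Treat the exact action names, the `account:TargetRegion` condition key and the `Sid` values as assumptions to check against the current documentation; `with_effect` is a hypothetical helper.

```python
import copy
import json

# Reconstructed policy (assumption: mirrors the documented example, not the slide).
REGION_OPT_IN_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "EnableDisableHongKong",
            "Effect": "Allow",
            "Action": ["account:EnableRegion", "account:DisableRegion"],
            "Resource": "*",
            "Condition": {"StringEquals": {"account:TargetRegion": "ap-east-1"}},
        },
        {
            # Extra permission needed when doing this through the console.
            "Sid": "ViewConsole",
            "Effect": "Allow",
            "Action": ["account:ListRegions"],
            "Resource": "*",
        },
    ],
}


def with_effect(policy, sid, effect):
    """Return a copy of the policy with one statement's Effect replaced."""
    updated = copy.deepcopy(policy)
    for statement in updated["Statement"]:
        if statement["Sid"] == sid:
            statement["Effect"] = effect
    return updated


# Flip the first statement so the principal cannot opt in to the Region.
deny_policy = with_effect(REGION_OPT_IN_POLICY, "EnableDisableHongKong", "Deny")
policy_json = json.dumps(deny_policy, indent=2)
```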
2231.28 -> So let's hit one last topic today.
2234.58 -> Short term keys.
2236.44 -> Throughout this talk, I've been referring
2237.85 -> to that red key as Alice's long-term key,
2240.13 -> which implies the existence
of short-term keys.
2242.35 -> And indeed they do exist.
2244.39 -> We'll refer to them often as sessions.
2246.22 -> So an AWS session or a short-term key,
2248.23 -> the terms are interchangeable.
2251.514 -> And this isn't a colloquialism,
2252.94 -> it's not something that we say informally.
2254.53 -> We have a very precise meaning for it.
2256.96 -> A short-term key or a
long-term key, sorry,
2259.75 -> is valid until someone takes
action to make it invalid.
2263.794 -> The security best practice is
2266.53 -> that you rotate your keys often.
2268.3 -> So hopefully your long-term
keys don't actually live
2270.64 -> that long, but if no one takes action,
2272.775 -> our systems are gonna consider these keys
2275.26 -> to be valid indefinitely.
2277.18 -> A short-term key is born
with an expiration date.
2280.15 -> They're like replicants.
2282.13 -> The lifespan can be set
when they're created,
2284.05 -> but it is capped by
the system to 36 hours.
2287.17 -> So compared to the typical
lifespan of a long term key,
2289.99 -> these are indeed much shorter
lived, hence the name.
2292.84 -> The real difference though
is this automatic expiration.
2297.13 -> And so why would you have this?
2298.75 -> Why would you introduce
this new complexity?
2301.03 -> One of the most common use
cases is federated logins.
2303.88 -> It's common for large
organizations to have some sort
2306.22 -> of identity broker federated login points.
2309.85 -> So you can do single sign on.
2311.05 -> You authenticate with
your corporate credentials
2313.3 -> and you can go to the travel site
2314.68 -> or the customer management portal
2316.27 -> or the AWS management console.
2318.52 -> When you authenticate to us that way
2319.87 -> we need some representation
of you inside of the account.
2323.38 -> And that is an AWS session.
2325.87 -> Another common use case
is assuming an IAM role.
2329.17 -> We all know that you
shouldn't log in as root
2331.15 -> or administrator, but sometimes you have
2333.13 -> to put on your admin hat.
2334.78 -> And you can do that by assuming a role
2336.58 -> with the appropriate privileges.
2338.38 -> In both of these cases,
2339.43 -> the access is meant to be short term.
2341.26 -> It's gonna time out at some point,
2343.06 -> and you want your privileges
to trend back towards zero.
2349.06 -> One of my favorite features
for AWS is the ability
2352.75 -> to pass a role to various AWS
resources like EC2 instances
2357.04 -> or databases to give those
resources the ability
2359.92 -> to make API calls on your behalf.
2362.38 -> And so let's take a closer
look at that use case.
2366.13 -> I launched an EC2 instance and I specified
2369.28 -> that a role named test
role should be mapped
2371.2 -> to this instance.
2372.033 -> EC2 takes it from there.
2373.72 -> Once the instance is running,
2375.25 -> I can contact the
instance metadata service
2377.62 -> and pull the credentials
for this instance.
2380.17 -> The instance metadata service
is a web service listening
2382.54 -> on 169.254.169.254 on
every single instance.
2387.28 -> It's serviced locally.
2388.99 -> And I hit this particularly long URL,
2391.66 -> that orange URL is the important bit here,
2394.18 -> and the result is something like this.
2397.33 -> Some of these fields are self explanatory.
2399.52 -> You can see from the last update
2401.14 -> in the expiration header
that these keys are good
2403.54 -> for about six hours.
2405.4 -> And access key and
secret key are familiar,
2408.34 -> like we know what those do.
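For readers without the slide, here is a sketch of the request and of reading the response. The URL path and field names follow the public instance metadata documentation; the role name `test-role` and every value in the sample response are made-up placeholders. The fetch function only works from inside an EC2 instance, and instances that require IMDSv2 also need a session token header, which this sketch omits.

```python
import json
import urllib.request
from datetime import datetime

IMDS_BASE = "http://169.254.169.254/latest/meta-data/iam/security-credentials/"


def credentials_url(role_name):
    return IMDS_BASE + role_name


def fetch_credentials(role_name):
    """Fetch role credentials; only reachable from inside an EC2 instance."""
    with urllib.request.urlopen(credentials_url(role_name), timeout=2) as response:
        return json.loads(response.read())


def lifetime_hours(credentials):
    """Hours between the LastUpdated and Expiration fields of a response."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    issued = datetime.strptime(credentials["LastUpdated"], fmt)
    expires = datetime.strptime(credentials["Expiration"], fmt)
    return (expires - issued).total_seconds() / 3600


# Placeholder response with the documented field names and made-up values.
SAMPLE_RESPONSE = """{
  "Code": "Success",
  "LastUpdated": "2022-11-30T18:00:00Z",
  "Type": "AWS-HMAC",
  "AccessKeyId": "ASIAEXAMPLEKEYID",
  "SecretAccessKey": "EXAMPLE-SECRET-KEY",
  "Token": "EXAMPLE-TOKEN-snip",
  "Expiration": "2022-12-01T00:10:00Z"
}"""

sample = json.loads(SAMPLE_RESPONSE)
hours = lifetime_hours(sample)
```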
2410.95 -> So let's take a look at
that world map again.
2413.95 -> Alice wants to launch an EC2 instance
2415.9 -> with an instance role in eu-west-1.
2418.63 -> Let's walk through the
services we'd have to touch
2421.45 -> if we were to do this with long term keys.
2424.18 -> So first we'd have to
call IAM in us-east-1
2428.41 -> and create that key.
2429.58 -> Then propagation would kick off
2431.02 -> and this new key would be
pushed to other regions,
2433.33 -> including eu-west-1.
2435.22 -> It's only after this happens that the keys
2437.47 -> on the instance would
be useful in eu-west-1.
2440.59 -> So now we have to have the IAM service
2442.96 -> in us-east-1 online and reachable.
2445.12 -> And these keys are good
for about six hours.
2447.34 -> So at a minimum, we're
looking at four new keys
2450.19 -> for every single EC2
instance in the world.
2452.53 -> And since they expire,
we'd have to go through
2454.3 -> and clean out those keys.
2455.68 -> So that's at least eight
transactions per day
2458.77 -> per EC2 instance on IAM in us-east-1.
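The arithmetic behind those numbers, spelled out. The fleet size at the end is a made-up figure purely to show how the load scales.

```python
HOURS_PER_DAY = 24
KEY_LIFETIME_HOURS = 6  # "good for about six hours"

# One new key every six hours means four keys per instance per day.
keys_per_instance_per_day = HOURS_PER_DAY // KEY_LIFETIME_HOURS

# Each key has to be created and later cleaned up: two transactions per key.
transactions_per_instance_per_day = keys_per_instance_per_day * 2


def daily_iam_transactions(instance_count):
    """Transactions per day landing on IAM in us-east-1 for a fleet."""
    return instance_count * transactions_per_instance_per_day


# Hypothetical fleet size, for illustration only.
example_load = daily_iam_transactions(1_000_000)
```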
2462.49 -> Plus we're pushing this key
to a whole bunch of places
2465.16 -> that it doesn't have to be.
2467.23 -> And so this is coupling
our regions together.
2470.47 -> It's adding latency, it's adding cost.
2473.08 -> We could choose not to propagate this key
2475.15 -> to anywhere except the
places it needs to be,
2477.52 -> but that makes the
propagator more complicated.
2479.53 -> It introduces the chance for subtle bugs
2482.56 -> and it makes the system harder to operate.
2484.69 -> So it's not something
that I'd wanna build,
2486.1 -> it's absolutely not
something I'd wanna operate.
2488.95 -> We need a different answer.
2490.36 -> We need to be able to create
2491.62 -> and automatically expire
sessions at very high scale.
2494.68 -> And we need to be able to
do it local to a region.
2497.38 -> And so the answer is
one of my favorite parts
2500.38 -> of our authentication system.
2502.21 -> It's STS, the Security Token Service.
2505.03 -> Unlike ARS, the Auth Runtime Service,
2506.86 -> this is a public facing service
and it's virtually certain
2510.28 -> that even if you've never
heard of it, you use STS.
2515.02 -> So STS is deployed in every
region, everywhere on earth.
2519.67 -> This is the building block
2520.78 -> that lets us issue short
term sessions at scale.
2526.15 -> And going back to that
response that we got
2527.92 -> from the instance metadata service,
2529.24 -> there's one field that we
haven't talked about yet,
2531.13 -> that token.
2532.18 -> And as you may have guessed,
2533.08 -> the Security Token Service
is all about tokens.
2535.72 -> You can see that this
token ends with snip.
2538.06 -> That's because the actual token
was about 1200 bytes long.
2541.21 -> And the slides were ugly
enough as they were.
2543.58 -> 1200 bytes is a lot.
2544.96 -> You can cram a lot in there.
2546.88 -> So let's pop one of these tokens open.
2551.92 -> This first set of lines is
just internal bookkeeping.
2554.5 -> There's not a lot to be said about that.
2557.2 -> Next is the automatic expiration.
2559.93 -> You can see that unlike the
instance metadata response,
2563.05 -> this is expressed as a creation
time and a time to live.
2567.28 -> But again, it's about six hours.
2570.85 -> And then we have the
access key and secret key.
2573.46 -> These are the exact same
credentials that we saw before,
2576.37 -> just encoded inside of the token.
2578.74 -> We know exactly what to do with these.
2580.697 -> One of the cool things that
you can do with an AWS session
2583.24 -> is you can constrain it based on policy.
2585.73 -> So if there were a policy
associated with this session,
2588.37 -> it would be here in the token.
2591.01 -> And then the last thing is
an asymmetric signature,
2594.46 -> cryptographic signature,
from the STS service saying,
2597.76 -> I STS do hereby swear that
this token is cool and valid.
2602.05 -> That whole thing is then
encrypted and it's passed back
2604.84 -> to our customer to
treat as an opaque blob.
2607.39 -> So let's put this all together.
2609.19 -> Alice needs to call STS, she
wants to create an S3 bucket,
2612.25 -> but of course she doesn't
have S3 admin privileges.
2615.04 -> So she's gonna do a
key derivation for STS.
2618.85 -> She's going to send her request to STS.
2622.27 -> STS is gonna talk to ARS,
2623.74 -> ARS is gonna do a key derivation,
validation, et cetera.
2626.59 -> Return the specialized key back
to STS for future API calls.
2632.17 -> And we're going to get
back this, a session token.
2636.277 -> And it's going to have an
access key ID and a secret key.
2641.35 -> Now the interesting thing here is
2643.87 -> that STS is gonna take this token,
2645.58 -> it's gonna encrypt it up
and it's gonna send it back
2648.34 -> to Alice where she can then
configure it in her SDK.
2651.1 -> You can see the little clock logo
2652.78 -> in the lower right corner indicating
2654.16 -> that this is ticking away towards expiry.
2657.28 -> Really important thing to note here is
2659.44 -> that STS did not record this session.
2662.14 -> It'll log the AssumeRole
call in CloudTrail,
2664.81 -> but it didn't store the
session anywhere in AWS.
2667.99 -> There is no system anywhere in Amazon
2671.05 -> where the access key ID or
the secret key is stored.
2674.41 -> We create the session, we bundle it up,
2676.33 -> we return it to the customer
and then we forget about it.
2679.75 -> And so now Alice has these
temporary keys configured.
2683.41 -> She can make that create bucket call.
2685.51 -> She starts with the access
key ID and the secret key
2689.02 -> from the session that she just got,
2690.73 -> does the key derivation for S3,
2692.81 -> generates the signature
for the request to S3,
2695.74 -> sends it to S3.
2697.15 -> S3 calls ARS.
2698.62 -> So far this looks entirely familiar.
2701.02 -> However, now things diverge.
2704.74 -> That access key ID doesn't
exist in any database.
2707.35 -> We forgot about it, right?
2708.439 -> ARS has the mirror to
the keys that STS has.
2712.63 -> So ARS can validate,
can decrypt the token,
2715.6 -> validate the signature on the token,
2717.58 -> extract the access key ID and secret key,
2720.07 -> verify that this session is still active,
2721.9 -> all of those things.
2723.67 -> And if everything is good,
it can use that access key ID
2726.97 -> and that secret key to
validate the request
2728.957 -> that Alice just sent in.
2731.41 -> And so we validate that request.
2733.66 -> We send the fully derived key back to S3.
2737.616 -> S3 can go through all of the
work of creating a bucket
2740.05 -> and return success to Alice.
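The "issue it, forget it, validate it when it comes back" pattern can be sketched as follows. This is deliberately simplified and is not how STS builds its tokens: the talk describes an asymmetric signature plus encryption, while this sketch uses a shared-key HMAC tag as a stand-in and leaves the body unencrypted, so it only illustrates the stateless validation idea. All names and the key material are hypothetical.

```python
import base64
import hashlib
import hmac
import json

# Stand-in for key material shared by the issuer and the validator (assumption).
ISSUER_KEY = b"demo-issuer-key"


def issue_token(access_key_id, secret_key, created_at, ttl_seconds, policy=None):
    """Bundle a session into a self-contained token; nothing is stored server side."""
    body = json.dumps(
        {
            "access_key_id": access_key_id,
            "secret_key": secret_key,
            "created_at": created_at,
            "ttl_seconds": ttl_seconds,
            "policy": policy,
        },
        sort_keys=True,
    ).encode("utf-8")
    tag = hmac.new(ISSUER_KEY, body, hashlib.sha256).digest()
    return base64.urlsafe_b64encode(tag + body).decode("ascii")


def open_token(token, now):
    """Validate a presented token using only the token itself and the shared key."""
    raw = base64.urlsafe_b64decode(token.encode("ascii"))
    tag, body = raw[:32], raw[32:]
    expected = hmac.new(ISSUER_KEY, body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("token failed integrity check")
    session = json.loads(body)
    if now >= session["created_at"] + session["ttl_seconds"]:
        raise ValueError("session expired")
    return session


# Issue a six-hour session at time 0, then validate it one hour later.
token = issue_token("ASIAEXAMPLE", "example-secret", created_at=0, ttl_seconds=6 * 3600)
session = open_token(token, now=3600)
```

Because the validator needs no lookup, a session issued in one place can be checked anywhere that holds the matching key material.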
2742.36 -> From a high level, what we
just did seems really silly.
2745.3 -> We created a session,
2746.53 -> we gave it to Alice and
then we forgot about it.
2749.26 -> When Alice wanted to use that session,
2750.97 -> she had to pass the token back to us.
2753.19 -> We're using the client for state tracking.
2755.92 -> By doing so, we can issue
sessions at incredibly high rates.
2759.88 -> When we first built this,
2760.87 -> we told the pen testers
that they would be unable
2763.21 -> to break us on scale.
2764.68 -> And the pen testers laughed
2766.36 -> 'cause they don't believe anyone.
2767.89 -> And they created a billion
sessions and it didn't break us.
2770.77 -> I mean it generated a lot of logs,
2772.63 -> but it did not break us,
2773.59 -> 'cause there is no database of sessions.
2777.34 -> So you can issue a session in one region
2779.92 -> and immediately use it
in a different region.
2782.23 -> Our control plane doesn't have to race you
2783.94 -> to get the keys there in time,
2785.68 -> because naturally you're
gonna get the session
2787.54 -> from one region,
2788.38 -> you're gonna make an API
call to the other region,
2790.09 -> you're gonna include all
of the state necessary
2791.71 -> to validate that API call.
2795.76 -> Now again, you know
Eric, you've been ranting
2799.3 -> that RSA asymmetric crypto is expensive.
2802.06 -> Like isn't it still a concern?
2803.92 -> Yes, but one, our cache
ratios here are incredible.
2809.32 -> Not only are we caching the
derived keys in the services,
2812.56 -> but ARS also has a cache of
tokens that it's decrypted.
2815.89 -> And so if we see the same
token presented again
2820.064 -> in a short window, we're
already gonna have it decrypted.
2823.69 -> And it's safe to cache these tokens
2826.57 -> until their expiration point.
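One way such a cache of opened tokens could be shaped, as a sketch only. The class, the digest-based cache key and the fake opener are assumptions for illustration; the talk does not describe how the real cache is keyed or sized.

```python
import hashlib


class OpenedTokenCache:
    """Remember already-opened tokens until they expire (hypothetical design)."""

    def __init__(self, open_token):
        self._open_token = open_token  # the expensive decrypt-and-verify step
        self._entries = {}             # token digest -> (session, expires_at)
        self.expensive_opens = 0

    def lookup(self, token, now):
        digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
        entry = self._entries.get(digest)
        if entry is not None:
            session, expires_at = entry
            if now < expires_at:
                return session         # cache hit: no cryptography needed
            del self._entries[digest]  # never serve a session past its expiry
            raise ValueError("session expired")
        session = self._open_token(token)
        self.expensive_opens += 1
        expires_at = session["created_at"] + session["ttl_seconds"]
        if now >= expires_at:
            raise ValueError("session expired")
        self._entries[digest] = (session, expires_at)
        return session


def fake_open(token):
    """Stand-in for decrypting and verifying a token; returns a six-hour session."""
    return {"created_at": 0, "ttl_seconds": 6 * 3600, "access_key_id": token}


token_cache = OpenedTokenCache(fake_open)
first = token_cache.lookup("EXAMPLE-TOKEN", now=10)
second = token_cache.lookup("EXAMPLE-TOKEN", now=20)
```

The second lookup returns the cached session without calling the expensive opener again.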
2828.28 -> And so the result is
2830.26 -> that we're not cracking tokens very often.
2832.75 -> And second, our customers only
see these as opaque blobs.
2837.4 -> The only overhead of using an AWS session
2839.74 -> is a tiny little additional bit.
2841.27 -> In this case about 1200 bytes
2843.16 -> of additional network overhead.
2844.99 -> That's it.
2845.92 -> And so we're doing all of
the asymmetric cryptographic
2848.14 -> operations on our side and
we can build our services
2850.78 -> to handle this load.
2853.81 -> Was it all worth it?
2855.76 -> The counterfactual is hard.
2857.11 -> We don't get to split the
universe and play what if games.
2860.424 -> But we have a few data points.
2863.05 -> This is an older paper,
but it's a great paper.
2865.81 -> It's totally worth reading even today.
2867.7 -> And I love that serious
cryptographic research
2869.904 -> was published under the title
'Here Come The XOR Ninjas'.
2873.37 -> Like when I first went looking for Thai
2875.53 -> and Juliano's paper, I was trying
2877.63 -> to find like the official paper,
2879.04 -> and I kept finding, 'Here
Come The XOR Ninjas'.
2881.38 -> Like this is the actual paper.
2884.05 -> The authors disclose a
technique that takes advantage
2886.9 -> of the padding scheme used by
CBC, cipher block chaining,
2890.05 -> the way that block ciphers
were built up into being able
2893.47 -> to encrypt larger messages in
SSLv3 and TLS 1.0.
2898.63 -> It's awesome here, but the
reason I'm calling it out is
2903.13 -> that you can extract fixed values.
2905.89 -> If there's a value that occurs
at the same place in a series
2909.16 -> of HTTP requests, you can use
this padding oracle attack
2913.12 -> to extract that value.
2915.674 -> AWS API signatures are unique per request.
2919.24 -> Even if you're sending the same
request over and over again,
2921.7 -> the timestamp's going to change,
2923.95 -> the signature's going to change.
2925.3 -> There is no fixed value to extract.
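A tiny demonstration of that property. The `sign` function is a simplified stand-in for the real string-to-sign construction (it just HMACs the timestamp and the request text), and the key and request are placeholders.

```python
import hashlib
import hmac


def sign(signing_key: bytes, timestamp: str, request_text: str) -> str:
    """HMAC over timestamp plus request; simplified, not the real string to sign."""
    message = (timestamp + "\n" + request_text).encode("utf-8")
    return hmac.new(signing_key, message, hashlib.sha256).hexdigest()


demo_key = b"demo-signing-key"        # placeholder key material
request_text = "PUT /example-bucket"  # the identical request, sent twice

sig_a = sign(demo_key, "20221130T120000Z", request_text)
sig_b = sign(demo_key, "20221130T120001Z", request_text)
signatures_differ = sig_a != sig_b
```

One second apart, the same request carries a completely different signature, so there is no repeated secret value sitting at a fixed offset for an attacker to recover.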
2927.4 -> So this was the first paper
where our public response
2930.43 -> to our customers was, AWS
APIs are not affected by this.
2934.333 -> That was a very satisfying result.
2936.49 -> There have been more of them
in the intervening decade.
2940.27 -> And so what have we done?
2942.64 -> What just happened?
2944.53 -> At first glance, authenticating requests
2946.69 -> over HTTP sounds like a deeply
studied and solved problem.
2950.082 -> But the nature of our
clients and the unique scale
2953.47 -> of our business made our
use case very different
2957.82 -> from humans sitting at web browsers.
2959.38 -> And that led us to a
unique and novel design.
2963.55 -> And as with all designs,
2964.78 -> we finally saw the end of the runway.
2967.36 -> But a bunch of smart people figured out
2969.22 -> how to evolve the system in place
2971.5 -> with minimal customer upheaval.
2973.96 -> Literally all customers had
to do was ingest a new SDK
2976.971 -> and things just worked.
2980.29 -> And some of the things
that we have in our design
2983.92 -> look needlessly complicated,
2985.9 -> but they've actually been really useful
2987.7 -> for supporting use cases
like opt-in regions.
2990.91 -> They've given us the flexibility
to say yes to the business
2994.291 -> rather than having to say no
2995.92 -> or having to make some tough trade-offs.
2998.59 -> And so AWS gets to continue to innovate
3000.72 -> on behalf of our customers.
3003.33 -> And even in these slides,
3005.76 -> the system that I've
described isn't simple
3007.68 -> and the real system beneath these slides
3009.57 -> is even more complicated.
3011.52 -> Despite that, I consider it
one of the most elegant things
3015.24 -> that I've ever had the
pleasure of working on,
3017.67 -> because it's been able to evolve
3019.8 -> and we've been able to grow it
3021.39 -> with little to no
customer visible changes.
3024 -> The majority of our
customers aren't even aware
3026.64 -> that these things are
happening despite the fact
3029.04 -> that together they drive over
a half a billion requests
3031.95 -> per second.
3033.99 -> Anyway, thank you very much for coming.
3036.72 -> It is a delight to be
at re:Invent in person
3039.42 -> and enjoy the rest of the conference.
3040.982 -> (applause)
Source: https://www.youtube.com/watch?v=tPr1AgGkvc4