AWS re:Invent 2022 - Securing Kubernetes: How to address Kubernetes attack vectors (CON318)

Aug 16, 2023

This session reviews fundamentals of Kubernetes architecture and common attack vectors, security controls provided by Amazon EKS to address them, strategies customers can implement to reduce risk, and opportunities that open-source Kubernetes can improve upon.

Content

0.21 -> - All right, good morning.

1.8 -> Welcome to Las Vegas, everyone.

5.819 -> I'm Micah Hausler.

7.26 -> Today, we're gonna be talking about securing Kubernetes.

11.01 -> How to secure attack vectors to Kubernetes.

15.87 -> A little bit about me, I'm Micah Hausler.

17.76 -> I'm a Principal Engineer at Amazon Web Services.

20.7 -> I've been at AWS for four and a half years.

24.15 -> Working on security, availability, and identity.

28.56 -> I've been focused on EKS, and now on security,

32.91 -> and I also contribute to Kubernetes open source,

37.56 -> to the upstream project,

39.99 -> and specifically to the Kubernetes security committee.

43.56 -> So we field security reports to Kubernetes

46.14 -> through the bug bounty program,

47.25 -> through external reporters,

49.2 -> issue CVEs, work with teams to get fixes.

52.35 -> So I've authored CVEs, I have fixed CVEs,

55.95 -> and I have reported them.

59.22 -> Today, we're gonna talk through, for our agenda,

63.24 -> an incident that you might have encountered, hopefully not.

68.22 -> We're gonna talk,

69.06 -> just generically, about threats.

71.61 -> And then we'll talk through attack vectors and mitigations.

76.23 -> We've got a story here.

78.39 -> So this story is inspired from real events.

80.4 -> I've actually worked with customers

82.17 -> who've had this event, unfortunately.

85.26 -> So you can sort of imagine,

88.38 -> I'm gonna be in your shoes for a second.

90.36 -> So you might be administering Kubernetes,

92.43 -> you might be using Kubernetes,

94.86 -> and you have an alert that you get.

99.48 -> You get a page.

100.313 -> You may have been out having dinner, you're coming home,

103.83 -> and you get a page through your company security program

109.773 -> and the report looks something like this.

111.42 -> Some security researcher reported to your team,

114.21 -> look, I can curl this endpoint

117.63 -> and I'm getting this data back.

120.3 -> And your first tier support person

123.18 -> who works at your company said,

124.537 -> "I don't know what this means.

126.18 -> You're the Kubernetes person so you gotta figure this out.

128.4 -> What does this mean?

129.84 -> Is this real?"

131.28 -> And as soon as you see this,

133.05 -> if you know anything about Kubernetes,

134.37 -> a chill runs down your spine.

136.11 -> Because if you're using EKS,

139.2 -> you know what this means, hopefully.

141.63 -> So, what's going on here?

144.24 -> So someone ran curl,

146.37 -> made this web request to this endpoint

149.49 -> and they hit this domain.

151.44 -> What's this domain?

152.273 -> This is an EKS Kubernetes cluster endpoint.

157.14 -> You can see the region that it's in, it's got a cluster,

159.6 -> a unique identifier there.

161.43 -> And the URL path is /api/v1/secrets.

164.55 -> So what's that?

165.621 -> Okay, Kubernetes has secrets.

166.86 -> Kubernetes has internal identity

169.14 -> and then you can store your own secrets there,

171 -> whether that's, you know,

173.01 -> credentials to external services like GitHub or SendGrid

176.61 -> or whatever.

179.76 -> But you can see in this request, there's no authentication.

182.58 -> They're not making any authentication in this request,

185.97 -> it's just a simple web curl,

187.38 -> and they're getting back all the secrets in your cluster.

190.05 -> This is really bad.

193.23 -> So, what do you do?

195.87 -> Well, okay, you might have a little bit of bash-fu

198.3 -> and you know the AWS CLI a little bit.

201.42 -> And so you list the clusters.

202.95 -> You knew the region already

203.85 -> because that was in the endpoint.

205.56 -> And you are dumping out, okay,

209.28 -> maybe they knew this was your cluster

210.93 -> and you're trying to verify,

211.77 -> is this really one of my clusters?

213.15 -> And so you describe all your Kubernetes,

216.24 -> your EKS clusters and get the endpoint name

219.33 -> and the corresponding name of cluster.

221.49 -> And this dumps out a list of cluster names and endpoints.

224.79 -> And so if we look through, you can see that

228.6 -> it's this timeline code executor prod three.

233.1 -> So something that is your cluster has this issue.
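
The matching step described here can be sketched in a few lines. This is a minimal illustration, not the command shown in the talk: the cluster names, endpoint IDs, and the `CLUSTERS` inventory are hypothetical, and the hostname layout `<id>.<suffix>.<region>.eks.amazonaws.com` is an assumption about how EKS endpoints are shaped.

```python
from urllib.parse import urlparse


def endpoint_host(endpoint):
    """Return the lowercase hostname of an endpoint, with or without a scheme."""
    if "://" not in endpoint:
        endpoint = "https://" + endpoint
    return urlparse(endpoint).hostname or ""


def region_from_endpoint(endpoint):
    """Read the region label from <id>.<suffix>.<region>.eks.amazonaws.com (assumed layout)."""
    labels = endpoint_host(endpoint).split(".")
    if len(labels) < 6 or labels[-3:] != ["eks", "amazonaws", "com"]:
        raise ValueError("not an EKS-style endpoint: " + endpoint)
    return labels[-4]


def find_cluster(reported_url, clusters):
    """Return the name of the cluster whose endpoint matches the reported URL, or None."""
    wanted = endpoint_host(reported_url)
    for cluster in clusters:
        if endpoint_host(cluster["endpoint"]) == wanted:
            return cluster["name"]
    return None


# Hypothetical inventory: name/endpoint pairs collected by describing each cluster.
CLUSTERS = [
    {"name": "example-dev-1", "endpoint": "https://AAAA1111.gr7.us-west-2.eks.amazonaws.com"},
    {"name": "example-prod-3", "endpoint": "https://BBBB2222.yl4.us-west-2.eks.amazonaws.com"},
]

reported = "https://BBBB2222.yl4.us-west-2.eks.amazonaws.com/api/v1/secrets"
print(region_from_endpoint(reported), find_cluster(reported, CLUSTERS))
```

Comparing hostnames rather than full URLs means the path in the researcher's report does not get in the way of the match.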

237.45 -> So, well, what's going on?

239.79 -> All right, we confirmed that as ours, so now what?

243.33 -> We need to figure this out and how to stop it.

245.64 -> This is an operational event,

247.41 -> just like any other operational event,

250.05 -> except it happens to be a security event.

251.34 -> So we wanna focus on mitigation here.

254.61 -> Over root cause.

255.6 -> We'll do root cause later.

256.898 -> So, we look at the Kubernetes cluster role bindings.

259.86 -> So this is the Kubernetes authorization system,

263.88 -> using role-based access control, RBAC,

267.04 -> to bind users to permission sets.

270.21 -> And we see that there's this suspicious one

273 -> called cluster system anonymous and we just dump it out.

276.09 -> What does that mean?

276.96 -> Okay, someone created this, it looks like,

279.27 -> where they bound the system:anonymous user

284.79 -> to the cluster-admin role.

287.67 -> So if you're familiar with Kubernetes,

289.26 -> this cluster-admin role is built into Kubernetes,

291.35 -> it comes up with your cluster

293.28 -> and any unauthenticated user automatically gets assigned

296.85 -> the username system:anonymous.

299.1 -> So if someone just allowed any anonymous user

302.58 -> to do anything in the Kubernetes API on your cluster,

306.87 -> that's really bad, right?

309.54 -> This is bad news.

311.34 -> Hopefully this isn't a production cluster, right?
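
A check for this kind of binding can be sketched as a scan over cluster role binding objects. This is a rough illustration under stated assumptions: the binding names and the `BINDINGS` sample are hypothetical, the objects are assumed to be already parsed from the API's JSON output, and the scan also covers the `system:unauthenticated` group, which anonymous requests belong to.

```python
ANONYMOUS_SUBJECTS = {"system:anonymous", "system:unauthenticated"}


def anonymous_bindings(bindings):
    """Return (binding name, subject, role) for every binding to an anonymous subject."""
    findings = []
    for binding in bindings:
        role = binding.get("roleRef", {}).get("name", "")
        for subject in binding.get("subjects") or []:
            if subject.get("name") in ANONYMOUS_SUBJECTS:
                findings.append((binding["metadata"]["name"], subject["name"], role))
    return findings


# Hypothetical ClusterRoleBinding objects, as parsed from JSON.
BINDINGS = [
    {
        "metadata": {"name": "example-anonymous-admin"},
        "roleRef": {"kind": "ClusterRole", "name": "cluster-admin"},
        "subjects": [{"kind": "User", "name": "system:anonymous"}],
    },
    {
        "metadata": {"name": "example-ops-admin"},
        "roleRef": {"kind": "ClusterRole", "name": "cluster-admin"},
        "subjects": [{"kind": "Group", "name": "example-ops"}],
    },
]

for name, subject, role in anonymous_bindings(BINDINGS):
    print(f"{name}: {subject} -> {role}")
```

The role is reported alongside the subject because a cluster may carry default bindings that give unauthenticated callers a few harmless read-only endpoints; the finding that matters is an anonymous subject paired with a powerful role.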

315.15 -> And so we wanna figure out,

316.29 -> okay, we might have deleted that.

318 -> Look, okay, we maybe stopped the bleeding, right?

320.76 -> Those secrets are all toast.

322.65 -> We gotta treat it like they are.

324.81 -> But, okay, we now have to sort of figure out what happened.

327.48 -> We deleted that bad cluster role binding.

330.12 -> So we go into CloudWatch Logs,

334.89 -> and we pull up CloudWatch Logs Insights,

339.06 -> and we query with this query for, you know,

342.42 -> okay, we're looking at kube-apiserver audit logs

344.94 -> for our EKS cluster and we're looking for RBAC,

348.84 -> RBAC events where the API group is RBAC.

352.44 -> And we're looking for a status code of less than 300.

354.48 -> So, any creates or updates.
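
The same filter can be sketched over audit events that have already been exported and parsed. This is not the Logs Insights query from the talk; it is a small illustration assuming each event is a dictionary with the usual audit fields (`verb`, `user`, `objectRef`, `responseStatus`), and the sample events and user names are hypothetical.

```python
RBAC_GROUP = "rbac.authorization.k8s.io"
WRITE_VERBS = {"create", "update", "patch"}


def successful_rbac_writes(events):
    """Keep successful (status < 300) create/update/patch events on RBAC objects."""
    matches = []
    for event in events:
        ref = event.get("objectRef") or {}
        code = (event.get("responseStatus") or {}).get("code")
        if (
            ref.get("apiGroup") == RBAC_GROUP
            and event.get("verb") in WRITE_VERBS
            and code is not None
            and code < 300
        ):
            matches.append(
                {
                    "who": (event.get("user") or {}).get("username"),
                    "verb": event["verb"],
                    "resource": ref.get("resource"),
                    "name": ref.get("name"),
                }
            )
    return matches


# Hypothetical audit events.
EVENTS = [
    {
        "verb": "create",
        "user": {"username": "example-admin"},
        "objectRef": {"apiGroup": RBAC_GROUP, "resource": "clusterrolebindings",
                      "name": "example-anonymous-admin"},
        "responseStatus": {"code": 201},
    },
    {
        "verb": "create",
        "user": {"username": "example-dev"},
        "objectRef": {"apiGroup": RBAC_GROUP, "resource": "clusterrolebindings",
                      "name": "example-denied"},
        "responseStatus": {"code": 403},
    },
    {
        "verb": "get",
        "user": {"username": "example-dev"},
        "objectRef": {"apiGroup": "", "resource": "pods", "name": "example-pod"},
        "responseStatus": {"code": 200},
    },
]

for hit in successful_rbac_writes(EVENTS):
    print(hit)
```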

358.11 -> And we come across this event.

360.81 -> So this is the actual, you know,

362.97 -> event that we were looking for.

364.14 -> Someone created this binding of the system:anonymous user

367.02 -> to the cluster-admin cluster role.

368.7 -> And down at the bottom,

370.2 -> you can see this excerpt where it shows the UID,

377.22 -> which for EKS maps to the IAM user.

382.47 -> So you see the access key ID down there at the bottom.

385.05 -> And from that we can query CloudTrail

387.66 -> to see who actually assumed this role

390.39 -> and what that real user was.

391.8 -> So from there, that's on the AWS side.

393.99 -> So we've done our AWS work to identify the issue,

398.4 -> to remediate it and then to root cause it.

401.28 -> And there's gonna be additional remediation too, right?

404.52 -> You're gonna look for, okay,

405.48 -> who else, besides the researcher,

407.34 -> did anyone else access those secrets?

409.41 -> Do we need to delete the cluster?

410.67 -> Do we need to delete the nodes?

411.9 -> Do we need to,

412.95 -> we probably wanna rotate all those secrets

414.45 -> just for good hygiene.

416.79 -> But we can get an idea of what happened through the logging.

420.51 -> So, this is kind of a worst case scenario,

422.73 -> but just to kind of illustrate,

424.98 -> this is something that can happen.

426.78 -> And I hope it never happens to you,

427.89 -> I hope it hasn't happened to you.

429.87 -> But we're gonna kind of use this as just a framework for,

432.18 -> okay, what's going on?

435.45 -> Let's talk about threats.

438.39 -> So if you're running Kubernetes and if you're running EKS,

443.67 -> you probably have an architecture pretty similar to this.

447.63 -> You might have developers

448.62 -> who contribute code in one or more,

451.59 -> Git repositories, and that goes to your build and CI system,

456.03 -> and you might push some artifacts up to container registry

459.81 -> like ECR.

461.52 -> You might pull in some artifacts

463.74 -> like from AWS CodeArtifact,

466.68 -> or some external one, npm, or somewhere else.

471.3 -> And then those get deployed into your Kubernetes cluster.

474.57 -> You might have operations engineers

475.86 -> who need to access that Kubernetes

477.36 -> or platform engineers

478.193 -> need to access that Kubernetes cluster.

480.18 -> And that Kubernetes cluster manages EC2 instances

482.79 -> which take in traffic from a load balancer.

485.73 -> And they have applications on them

487.92 -> that read and write the objects into S3.

491.73 -> And you might be using Amazon ElastiCache for caching

495.99 -> or RDS for storage.

498.06 -> And you might be emitting logs out to CloudWatch.

500.94 -> And you might also have a CDN in front of your S3 objects.

503.91 -> So this is probably a pretty familiar scenario

506.91 -> for most people.

507.743 -> So we're gonna kind of use this as our base

510.21 -> for what are threats to a cluster.

514.71 -> Drilling in a little bit to that EC2 instance,

518.13 -> you might have something that looks pretty similar to this,

520.05 -> conceptually, right?

521.67 -> You might have, again,

522.63 -> we have a load balancer sending traffic to one pod

525.507 -> and that pod might also access a database,

527.85 -> have another pod that might talk

529.23 -> to another service in the same cluster

531.87 -> on a different EC2 instance.

533.43 -> And these pods are writing,

536.34 -> reading and writing files from disk.

537.66 -> They also might write to log files.

539.64 -> You have the Kubernetes agent, the kubelet,

541.44 -> and the container runtime which is pulling from ECR.

544.86 -> And you might have a DaemonSet,

547.02 -> it's a Kubernetes concept to run one pod on every node

550.74 -> that is consuming these logs and uploading it to CloudWatch.

555.84 -> So when we talk about threats,

557.97 -> I want to use just the CIA model.

560.809 -> That's an information security concept about confidentiality,

567.75 -> integrity and availability.

568.8 -> These are kind of three pillars of security

571.56 -> and information security.

573.36 -> And we want to use these

574.89 -> to kind of come up with some threat models

576.69 -> and just threat modeling questions

579 -> to think about how do we wanna secure

581.22 -> our Kubernetes cluster?

582.81 -> So just some, this is not comprehensive,

587.01 -> but just to kind of get you thinking, okay,

588.81 -> if we're thinking about availability

591.75 -> and from a security angle,

593.76 -> what networks need access to your applications

597.18 -> or the load balancers to those applications, right?

599.61 -> And what networks need access to the Kubernetes API?

603.93 -> Is it just in your VPC or is it operators from the internet?

607.8 -> If we're thinking about integrity,

609.96 -> what actors or processes need access to your data?

613.08 -> Maybe have that data in S3 or RDS.

614.94 -> Is it just pods?

616.35 -> Is it other applications in your VPC?

619.77 -> And the same for actors and processes

622.44 -> to your software supply chain.

624.06 -> So that might be a little bit outside the scope of

626.49 -> securing Kubernetes itself.

627.81 -> But it goes into your application,

629.31 -> it's your container registry

631.14 -> and your data that gets in your container registry.

634.59 -> And then on confidentiality,

636 -> what actors or processes, again, need access to that data?

640.26 -> Are those only things that you trust?

642.84 -> Hopefully they are.

644.01 -> And then same for your compute runtime.

646.68 -> Are you running trusted code?

648.72 -> Are you running arbitrary code?

651.09 -> What are your definitions of that?

655.32 -> So one of the problems that we face often with Kubernetes

659.61 -> and thinking about threats and the security of it,

661.89 -> is that everything in Kubernetes is networked.

665.25 -> Speaking from working on the Kubernetes security committee,

668.19 -> this is one of the problems we have is

669.24 -> whenever we do have a CVE that we allocate an ID for

674.67 -> from a report, it always scores pretty high

677.01 -> because everything in Kubernetes,

678 -> almost everything in Kubernetes is networked.

681.69 -> So kind of continuing down this line

683.82 -> of threat modeling questions.

685.41 -> So, do you run arbitrary code in your containers?

688.5 -> Hope, probably not,

689.58 -> but maybe you are a service that does that, right?

691.32 -> Maybe you have a CI system, maybe you have something else,

696.18 -> you're running functions for your customers.

697.56 -> Do you trust the container as a security boundary?

700.62 -> Really?

701.64 -> And that's, by saying that, we're really saying,

704.28 -> do you trust the Linux kernel as a security boundary?

707.04 -> 'Cause containers are really just Linux kernel primitives.

711.3 -> Do your applications make outbound requests

713.7 -> to arbitrary networks?

715.8 -> Are you just making calls that you codify

719.4 -> or can customers specify, you know,

722.01 -> a URL in your application

723.48 -> that your application reaches out to?

726.21 -> 'Cause if they can do that,

727.5 -> they can enter the Kubernetes API,

729.63 -> they can enter some other domain within your cluster

732.87 -> that's an internal trusted thing

734.67 -> that maybe you don't want your customer accessing.
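
One way to guard against this is to check a customer-supplied URL before the application fetches it. The sketch below is only a partial illustration and makes its limits explicit: it classifies URLs whose host is a literal IP address, and returns `None` for hostnames, which would first need to be resolved and then checked the same way. The function name and the sample URLs are hypothetical.

```python
import ipaddress
from typing import Optional
from urllib.parse import urlparse


def ip_literal_is_internal(url: str) -> Optional[bool]:
    """True if the URL targets a private, loopback, or link-local IP literal.

    Returns None when the host is a name rather than an IP literal; a real
    check would resolve the name and test every address it resolves to.
    """
    host = urlparse(url).hostname
    if not host:
        return None
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        return None  # a hostname, not an IP literal
    return address.is_private or address.is_loopback or address.is_link_local


for candidate in (
    "http://169.254.169.254/latest/meta-data/",  # link-local address
    "https://10.100.0.1/api/v1/secrets",         # private range
    "https://8.8.8.8/",                          # public address
    "https://example.com/",                      # hostname: undecided here
):
    print(candidate, ip_literal_is_internal(candidate))
```

An allow list of known destinations is generally easier to reason about than a block list like this one; the check above only shows what "internal" might mean for the deny side.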

736.89 -> And then are all those external calls known or unknown?

741.06 -> Do you have that absolute list

742.38 -> or is it open-ended arbitrary?

745.83 -> What networks, users, processes,

747.99 -> need access to the Kubernetes API?

750.271 -> If you think about Kubernetes API

752.22 -> in terms of your application,

755.61 -> hopefully you're not exposing that to customers, right?

757.86 -> Like that's probably something

758.79 -> that's like an internal, inside the fence thing.

761.46 -> So how can we, if you're thinking about securing that,

764.16 -> how can we restrict and reduce access to that?

768 -> And then what applications in Kubernetes

770.19 -> need access to outside systems?

771.54 -> Do you even need to?

773.01 -> Are you just storing data within your closed loop system

775.89 -> and reporting it back through a customer request?

781.47 -> So, back to this architecture overview,

785.64 -> as we're thinking about securing the cluster,

787.32 -> we're gonna start thinking of the cluster

789.39 -> as the Kubernetes API and the things that it manages.

792.75 -> So the applications that run on pods, on instances.

798.75 -> And when we think about those, we're gonna start looking at,

801.24 -> okay, what are the lines that go in and out of those,

804.36 -> of this boundary?

807.06 -> Yes, for a full application security posture,

810.15 -> you need to look at all the arrows, right?

811.86 -> You need to look at developer committing code

814.05 -> and code going to CI, and CI pushing to ECR

816.54 -> and all that, right?

817.74 -> For this conversation,

819.45 -> we're gonna mostly focus on the threats to and from

823.05 -> Kubernetes itself.

826.14 -> And then the same for the data plane.

829.5 -> We've got the control plane of Kubernetes API

832.167 -> and then we've got the cluster data plane

834.36 -> of the EC2 instances.

835.86 -> And so, this is a lot more arrows, right?

838.68 -> This is like all the things happening inside the cluster.

841.5 -> So cluster to, you know, pod to pod, pod to host,

845.88 -> and then also pod to and from external services.

849.39 -> So whether that's a load balancer or another AWS service

852.84 -> or just another pod in the cluster.

856.47 -> So next, let's talk about attack vectors

858.57 -> and some of the mitigations around them.

863.04 -> So we've got the OWASP top 10.

867.96 -> So if you're familiar with this,

869.4 -> this is the Open Web Application Security Project.

874.95 -> It's a nonprofit that works to improve software security

878.76 -> for web applications.

879.9 -> And every few years they publish

881.94 -> sort of a research based list

883.77 -> on 10 categories of security issues

886.41 -> that they've seen prevalence in.

889.11 -> So these are security vulnerability,

891.99 -> sort of types that appear in applications.

894.51 -> So today we're not gonna talk about every one,

896.25 -> we're gonna talk about just a few of them.

898.29 -> I think they all can apply to Kubernetes,

901.38 -> administration of Kubernetes,

902.52 -> and certainly to applications that run on Kubernetes.

906.33 -> But for this purpose,

907.86 -> we're gonna really just drill into some of these

909.78 -> that specifically affect Kubernetes.

914.64 -> So first, so access control.

917.43 -> So this is a fun one.

919.62 -> A lot of the time for when you're thinking

922.23 -> about access control and broken access control,

926.7 -> we're gonna think about, in terms of OWASP,

929.85 -> we're gonna look at violation of least privilege.

932.82 -> So, what actors and entities access Kubernetes

936.54 -> or run in Kubernetes?

939.03 -> What privileges do they have that they don't need?

941.1 -> That's typically a source of an attack vector.

945.06 -> So this could be Kubernetes API permissions

947.4 -> for users and or pods.

949.74 -> If you have users that access the API,

952.144 -> that's an entry point, right?

955.89 -> When you have a human user that accesses an API,

959.1 -> you might give them typically more broad permission

960.87 -> than you would a process.

962.19 -> A process, say a Kubernetes pod

964.92 -> that needs to talk to the Kubernetes API,

967.08 -> you typically have a pretty good idea

968.46 -> of what that's gonna do.

969.293 -> A human user might be a little bit more exploratory.

971.13 -> So that's kind of one area

972.48 -> where you might grant expanded privileges to.

975.78 -> Then service metadata to pods.

978.99 -> Does every pod need to know about every other pod?

983.43 -> Kubernetes has some just built in service discovery

989.04 -> through whether that's environment variables or DNS,

991.95 -> where you can query

993.24 -> what are the other services running in the cluster.

995.4 -> As well as the Kubernetes API,

997.32 -> do you actually need that turned on?

998.49 -> That can just be an excessive privilege that's not needed.

1002.54 -> Same with Linux permissions for pods.

1005.63 -> If you're trying to get something to work

1006.83 -> and okay, I need this host volume

1008.51 -> and I wanna allow it to be readable and writeable

1011.48 -> when maybe it only needs read permission.

1014.72 -> So that can be a violation of least privilege.

1017.03 -> Another area for broken access control is privilege escalation.

1020.78 -> So this can take in a lot of different forms.

1022.58 -> This can be in the Kubernetes API,

1025.76 -> this can be in the data plane of the cluster itself.

1029.54 -> So a pod escalating privilege on a node

1032.72 -> to have more permissions than you intended it to have.

1036.08 -> And then broken access control can also just come

1038.24 -> from Kubernetes vulnerabilities themselves.

1039.92 -> So as you're using Kubernetes

1041.75 -> and thinking about your use of it,

1045.41 -> you wanna plan for, there will be bugs in Kubernetes.

1047.84 -> So how do we restrict and contain the access

1053 -> so that we are more insulated from those bugs?

1055.22 -> We might not be fully insulated,

1056.54 -> but that when they happen we're more insulated.

1062.48 -> So let's look at one example here.

1065.72 -> For least privilege.

1067.04 -> Okay, it's pretty obvious like I mentioned,

1068.54 -> there's some very obvious examples, right?

1070.22 -> Like you give a user full permissions to read everything

1073.31 -> in the cluster, right?

1074.33 -> That's like maybe too much permission.

1077.03 -> But I wanna talk about kind of a subtler case

1079.28 -> that we've seen

1080.27 -> and where we've seen some applications have more privileges

1084.65 -> than they have to in a sort of unexpected way.

1087.05 -> So if you're taking our example from earlier,

1091.49 -> you're running a pod and it needs to talk to AWS services.

1094.01 -> If you're doing that,

1095.03 -> hopefully you're using, for Amazon EKS,

1100.01 -> IAM roles for service accounts.

1102.86 -> So this allows service accounts to assume an IAM role

1106.46 -> and have their own identity

1107.72 -> rather than use the EC2 instance identity, right?

1111.41 -> So this way you can have multiple pods on the same node

1113.24 -> that have different IAM roles.

1115.55 -> And the way that that works is

1117.35 -> Kubernetes issues a JSON Web Token, a JWT,

1121.22 -> and the kubelet on the node actually requests that

1124.34 -> from the API server.

1125.99 -> The kubelet then mounts that JWT as a file in the pod

1130.7 -> and the pod, when it comes up,

1133.52 -> the AWS SDK typically will read that file from disk,

1138.74 -> call out to STS and assume the role, right?
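
To see what identity such a token carries, its payload can be decoded for inspection. This is a minimal sketch and not what the SDK does internally: it does not verify the signature, the `fake_token` helper only builds an unsigned stand-in for demonstration, and the service account name and audience value are hypothetical examples.

```python
import base64
import json


def jwt_claims(token):
    """Decode the payload of a JWT for inspection, without verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected header.payload.signature")
    payload = parts[1] + "=" * (-len(parts[1]) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))


def service_account(subject):
    """Split a subject like system:serviceaccount:<namespace>:<name>."""
    parts = subject.split(":")
    if len(parts) != 4 or parts[:2] != ["system", "serviceaccount"]:
        raise ValueError("not a service account subject: " + subject)
    return parts[2], parts[3]


def fake_token(claims):
    """Build an unsigned stand-in token, for demonstration only."""
    def encode(obj):
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()
    return ".".join([encode({"alg": "none"}), encode(claims), "signature"])


token = fake_token({
    "sub": "system:serviceaccount:default:example-app",
    "aud": ["sts.amazonaws.com"],
})
claims = jwt_claims(token)
print(service_account(claims["sub"]), claims["aud"])
```

The subject names exactly one service account in one namespace, which is why a token issued for one pod's service account should not be usable to act as another.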

1143.54 -> So that's a very common pattern.

1146.57 -> And this can be, you know,

1149.078 -> in the base case this can be very, very common.

1153.23 -> But if you have the same role

1155.45 -> or the same type of action that needs to be performed

1158.99 -> over a lot of different pods.

1161.18 -> Say you want a log upload

1165.65 -> or you want a volume to be mounted to a pod.

1169.28 -> Rather than have, say EFS, mount permission on every pod

1173.3 -> or something like that,

1175.719 -> and having to have every pod do this token dance,

1179.42 -> you want to have a CSI sidecar do this, right?

1183.17 -> This could be a lot of different things

1185.69 -> but we're taking the CSI example.

1187.55 -> So rather than, again,

1188.6 -> rather than have every pod have a different IAM role,

1190.7 -> have to add permission to every one of those IAM roles,

1193.01 -> we wanna have a central pod on host

1195.38 -> that can do this assumption for pods, right?

1197.6 -> Seems great.

1199.55 -> So in this case,

1200.48 -> the CSI DaemonSet might do something

1202.07 -> like get the token from the API server

1204.77 -> rather than kubelet fetching it

1206.72 -> and mounting it as a file on disk.

1208.49 -> And then this CSI DaemonSet would do the exchange

1212.09 -> and say, set up some volume mount, right?

1217.07 -> Well what's going on here?

1218.24 -> Well, we had to give this CSI driver

1222.47 -> permissions to create service account tokens.

1224.9 -> And the only way to do that right now in Kubernetes

1227.42 -> is through Kubernetes RBAC, the role-based access control.

1230.42 -> So this is not attribute-based access,

1232.37 -> this is role-based access.

1234.2 -> So the role we're giving it, or cluster role in this case,

1236.99 -> is to create service account tokens.

1241.07 -> Notice that that's for every pod,

1245.15 -> every DaemonSet pod on the cluster.

1247.7 -> So what this means is any CSI, in this case, example case,

1253.7 -> helper DaemonSet can get service account tokens

1256.76 -> for any service account in the cluster,

1260.18 -> not just for the pod it's trying to create a mount for.

1265.16 -> Why is that bad?

1266.12 -> Well, there are a lot of built in service accounts

1269.284 -> in the Kubernetes API,

1271.22 -> things used by the scheduler, by the controller manager

1273.29 -> that have a lot of permissions to create pods.

1275.81 -> So when you do this,

1278.75 -> and any custom service accounts you might have,

1281.03 -> what this does is this reduces the isolation boundary

1283.52 -> of your node to the whole cluster

1285.68 -> or expands the isolation boundary to the whole cluster.

1288.29 -> Instead of just being able to get service accounts

1291.95 -> for the pods on the same node as a sidecar,

1294.29 -> I can get things for service accounts

1295.94 -> for any pod in the cluster.

1298.76 -> That's bad. (chuckling)
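
A review for this pattern can be sketched as a check on individual RBAC rules. The sketch assumes rules are dictionaries shaped like the `rules` entries of a role or cluster role, and it treats a rule as cluster-wide token creation when it allows `create` on the `serviceaccounts/token` subresource without naming specific service accounts. The sample rules are hypothetical.

```python
def grants_any_token_creation(rule):
    """True if the rule lets its holder create tokens for any service account."""
    groups = rule.get("apiGroups", [])
    resources = rule.get("resources", [])
    verbs = rule.get("verbs", [])

    in_core_group = "" in groups or "*" in groups
    covers_tokens = any(r in ("serviceaccounts/token", "*/token", "*") for r in resources)
    can_create = "create" in verbs or "*" in verbs
    unrestricted = not rule.get("resourceNames")  # no specific accounts named
    return in_core_group and covers_tokens and can_create and unrestricted


# Hypothetical rule for a node-level helper that mints tokens itself.
BROAD_RULE = {"apiGroups": [""], "resources": ["serviceaccounts/token"], "verbs": ["create"]}

# Hypothetical rule restricted to one named service account.
NARROW_RULE = {
    "apiGroups": [""],
    "resources": ["serviceaccounts/token"],
    "verbs": ["create"],
    "resourceNames": ["example-app"],
}

print(grants_any_token_creation(BROAD_RULE), grants_any_token_creation(NARROW_RULE))
```

Bound as a cluster role, the first rule applies in every namespace, which is the expansion of the isolation boundary described above.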

1300.92 -> Instead, what we want to do, and a mitigation for this,

1304.46 -> is that a new feature came out in Kubernetes 1.20 in alpha,

1309.17 -> it went GA in 1.22, and it's the token request CSI feature.

1313.79 -> So what this does is,

1315.62 -> the kubelet actually issues a service account token

1319.82 -> scoped for the pod to the CSI driver,

1322.79 -> rather than have the CSI driver

1324.71 -> request any service account token in the cluster.

1327.8 -> When a volume mount comes in,

1329.36 -> kubelet already calls out to the CSI driver to say,

1331.55 -> hey, set up this volume.

1332.63 -> And along with it, here's the token you need for that pod.
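
From the driver's side, receiving the token might look roughly like the sketch below. It assumes the kubelet passes the pod's tokens in the volume context of the publish call under the key `csi.storage.k8s.io/serviceAccount.tokens`, as a JSON object keyed by audience; the audience value, the token string, and the sample volume context are hypothetical, and a real driver would read these from its gRPC request.

```python
import json

# Volume context key assumed to carry the pod-scoped tokens.
TOKENS_KEY = "csi.storage.k8s.io/serviceAccount.tokens"


def token_for_audience(volume_context, audience):
    """Return the pod-scoped token for an audience, or None if it was not provided."""
    raw = volume_context.get(TOKENS_KEY)
    if raw is None:
        return None
    entry = json.loads(raw).get(audience)
    return entry["token"] if entry else None


# Hypothetical volume context for one pod's mount request.
VOLUME_CONTEXT = {
    TOKENS_KEY: json.dumps({
        "sts.amazonaws.com": {
            "token": "example-token-for-this-pod",
            "expirationTimestamp": "2022-12-01T00:00:00Z",
        }
    })
}

print(token_for_audience(VOLUME_CONTEXT, "sts.amazonaws.com"))
print(token_for_audience(VOLUME_CONTEXT, "some-other-audience"))
```

The driver never asks the API server for a token here, so it needs no RBAC permission to create service account tokens at all.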

1336.47 -> And the reason why this is more secure

1338.36 -> is that kubelet does not actually use

1340.4 -> role-based access control for its permissions.

1343.07 -> There's another authorizer baked into Kubernetes.

1345.35 -> It's not user configurable, but it restricts.

1348.08 -> And there's an admission plugin for this too,

1350 -> the NodeRestriction admission plugin,

1351.5 -> that restricts the kubelet

1353.57 -> from accessing resources not assigned to that node.

1358.46 -> So, say we've got two different nodes

1360.56 -> and this kubelet can only access secrets, config maps,

1364.13 -> et cetera, mounted to pods on its node.

1367.7 -> And so the same way for this CSI scenario,

1373.31 -> this can only get service account tokens

1374.81 -> for pods on the same node.

1376.34 -> So now if you have two different trust domains,

1378.23 -> you have application one and application two

1380.42 -> on two different nodes,

1383.03 -> this kubelet can only get pods,

1385.34 -> service accounts for pods assigned to it.

1389.63 -> So another example that we see, that I've seen,

1393.08 -> is where, okay, for privilege escalation,

1396.71 -> is you've got two different types of access

1399.35 -> you want to give users to the cluster, right?

1400.88 -> You might have developers that you want to give permission

1403.79 -> to be able to create pods and you might have an operator

1406.43 -> that you want to be able to create custom resources.

1410.24 -> Kubernetes has the idea of custom resource

1413 -> definitions and custom resources.

1414.83 -> This allows you to define, dynamically,

1417.11 -> new types or records that can be created

1419.96 -> in the Kubernetes API.

1422.3 -> And you can have controllers that act on those types.

1425.99 -> So just as an example,

1429.23 -> at AWS we have the ACK

1431.12 -> or AWS Controllers for Kubernetes project.

1433.76 -> This is a set of controllers, it's not a single project,

1437.48 -> it's a bunch of different projects

1440.06 -> that you can define AWS resources through a custom resource.

1445.43 -> So there's a bunch that are already GA of controllers,

1449.18 -> there's 13 controllers for GA.

1450.62 -> And then for things like RDS, Lambda, SageMaker,

1455.27 -> and then I think there's currently eight in preview

1457.7 -> and there's a bunch more on the way.

1458.75 -> You can go to the ACK website,

1461.211 -> aws-controllers-k8s.github.io

1465.38 -> to see what's currently available.

1468.86 -> But just as a concept,

1471.62 -> you've got a controller that looks for a CRD

1473.72 -> to go create AWS resource, right?

1477.23 -> So again, back to our scenario here,

1480.8 -> maybe we have the operator and we want to give them,

1483.29 -> again, give them ability to create CRDs.

1486.26 -> These controllers observe the CRDs

1488.75 -> and then they go create S3 buckets,

1491.99 -> relational databases or even load balancers,

1494.48 -> all different kinds of things, right?

1497.48 -> And we give our operator this cluster role.

1502.79 -> They can, hypothetically, they can create or do everything.

1506.33 -> Resource star, verb star on S3 buckets.

1510.77 -> And we bind this to the operator group.

1514.91 -> And then we have our developer

1516.95 -> and they have permission to do anything

1519.59 -> in the core Kubernetes group.

1520.88 -> So that's pods, nodes, a bunch of other things, right?

1523.26 -> So it seems good, right?

1524.57 -> Like this is sort of segregated

1526.25 -> where operators can access AWS resources,

1529.07 -> developers can just create pods in the cluster.

1533.18 -> The problem here comes from the permission that we just saw.

1538.25 -> So if you go back, you saw how

1543.14 -> resource for this developer is star and verb is star.

1547.22 -> So we let them access any resource and verb

1550.16 -> in the core Kubernetes API group.

1554.69 -> Why is that a problem?

1555.74 -> Kubernetes has an impersonation feature

1560.33 -> and it's governed by role-based access control.

1563.6 -> And if you don't specify the verbs and the resources,

1568.61 -> that includes everything,

1569.75 -> and everything includes the impersonate verb

1573.02 -> on the user and group resources.

1575.09 -> So now we've just given our developer who has,

1578.99 -> we were trying to segregate them

1580.04 -> to just Kubernetes resources,

1581.96 -> the ability to impersonate the operator.

1585.98 -> So it's as easy as kubectl get <resource type> --as-group.

1592.22 -> That's how easy it is.

1593.45 -> There's headers as well,

1594.53 -> you can look up in the Kubernetes documentation

1596.45 -> if you want to have an actual client do impersonation.

1600.17 -> But this is another area

1601.85 -> where we've seen customers get in a bad situation

1605.9 -> through privilege escalation to the cluster.
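
Why the wildcard rule reaches impersonation can be shown with a simplified model of rule matching. This is a sketch of the idea, not the API server's authorizer: it ignores subresources, resource names, and namespaces, and assumes impersonation is authorized as the `impersonate` verb on `users` and `groups` in the core API group. The two sample rules are hypothetical.

```python
def _matches(allowed, wanted):
    """A list of allowed values matches if it names the value or contains the wildcard."""
    return "*" in allowed or wanted in allowed


def rule_allows(rule, api_group, resource, verb):
    """Simplified RBAC check: group, resource, and verb must all match."""
    return (
        _matches(rule.get("apiGroups", []), api_group)
        and _matches(rule.get("resources", []), resource)
        and _matches(rule.get("verbs", []), verb)
    )


# Hypothetical developer rule: everything in the core API group.
DEVELOPER_WILDCARD = {"apiGroups": [""], "resources": ["*"], "verbs": ["*"]}

# Hypothetical explicit alternative: only what developers need for pods.
DEVELOPER_EXPLICIT = {
    "apiGroups": [""],
    "resources": ["pods"],
    "verbs": ["get", "list", "watch", "create", "delete"],
}

for label, rule in (("wildcard", DEVELOPER_WILDCARD), ("explicit", DEVELOPER_EXPLICIT)):
    print(label, "can impersonate groups:", rule_allows(rule, "", "groups", "impersonate"))
```

Both rules let a developer create pods, but only the wildcard one also matches a request to act as another group.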

1609.08 -> So mitigations for this.

1611.03 -> Use least privilege RBAC rules.

1615.02 -> You can actually generate RBAC policies

1618.23 -> from your audit logs.

1619.063 -> There's an open source project.

1620.51 -> Jordan Liggitt is a Kubernetes contributor

1622.88 -> and he has an open source project

1624.26 -> to consume your Kubernetes audit logs

1626.75 -> and actually generate RBAC policies.

1628.46 -> It's a great thing, so go give that a spin.

1631.07 -> Limit cluster wide permissions in DaemonSets.

1633.5 -> With that CSI example,

1635.03 -> you saw how it's easy to give cluster wide resources

1639.68 -> to something that lives on every node.

1641.57 -> That can be okay in certain cases,

1643.25 -> but oftentimes it's very easy

1645.38 -> to just grant too many permissions.

1648.71 -> And when you're using CSI drivers,

1651.47 -> if you're using a driver,

1653.15 -> ask if it supports TokenRequest.

1654.95 -> That's another way to limit this cluster wide permission

1657.92 -> for CSI drivers.

1660.95 -> And then when you do write RBAC policies,

1664.07 -> explicitly enumerate the verbs and resources,

1666.47 -> and ideally resource names if you can.

1669.08 -> This will prevent you

1670.46 -> from accidentally granting impersonate, right?

1674.12 -> And there's a couple other Kubernetes special verbs

1676.1 -> beyond just the, you know,

1677.24 -> standard read and write verbs: get, delete, patch,

1683.06 -> list, and watch.

1685.28 -> In the RBAC system

1686.18 -> there's some other privilege escalation type preventions

1691.67 -> that have special verbs.
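
One way to act on this advice is a small lint over RBAC rules, sketched below. It assumes rules are already parsed into dictionaries with `verbs` and `resources` lists, as in a Role manifest. The function name is hypothetical, and the set of special verbs (`impersonate`, `escalate`, `bind`) is this sketch's choice of escalation-related verbs rather than a list from the talk.

```python
# Verbs with privilege-escalation implications in Kubernetes RBAC.
SPECIAL_VERBS = {"impersonate", "escalate", "bind"}


def risky_rule_reasons(rule):
    """Return the reasons a single RBAC rule deserves a closer look."""
    reasons = []
    verbs = set(rule.get("verbs", []))
    resources = set(rule.get("resources", []))
    if "*" in verbs:
        reasons.append("wildcard verbs")
    if "*" in resources:
        reasons.append("wildcard resources")
    for verb in sorted(verbs & SPECIAL_VERBS):
        reasons.append(f"special verb: {verb}")
    return reasons


# The wildcard developer rule versus an explicitly enumerated one.
developer_rule = {"apiGroups": [""], "resources": ["*"], "verbs": ["*"]}
scoped_rule = {"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "list", "watch"]}
```

Here `risky_rule_reasons(developer_rule)` reports both wildcards, while the enumerated rule comes back clean.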

1694.25 -> So, next we're gonna jump to security misconfigurations,

1699.02 -> OWASP number five.

1702.02 -> We've got, this can come in the form

1705.29 -> of a couple different things.

1708.02 -> We're gonna focus specifically

1709.25 -> on authorization misconfigurations.

1711.2 -> So we kind of looked at that,

1712.19 -> there's kind of some overlap here with number one,

1715.22 -> but we'll look more specifically at what this means.

1717.65 -> Unnecessary features enabled.

1719.48 -> This is really common.

1721.25 -> Kubernetes tries to be as easy to use out of the box

1723.92 -> as possible.

1724.97 -> And some features

1727.19 -> can accidentally be just left on by default.

1729.77 -> And then same with insecure defaults.

1733.55 -> So, security misconfigurations.

1736.1 -> We looked at this one already, right?

1737.87 -> You accidentally bound system:anonymous to cluster-admin.

1741.41 -> That's obviously a problem.

1744.59 -> But another one that we've seen a lot,

1747.65 -> and this one we'll be seeing less of,

1751.7 -> is where you try to mount the Docker socket

1754.19 -> into the pod.

1755.69 -> This can feel really like,

1757.43 -> well why wouldn't I wanna do this?

1759.23 -> Maybe you have a Jenkins pod

1761.27 -> or something that needs to build containers

1763.67 -> that needs access to the container runtime socket.

1765.748 -> This is not good.

1767.78 -> One, because Kubernetes has deprecated dockershim.

1771.41 -> It was deprecated in 1.20 and is completely gone in 1.24.

1777.11 -> And then also this bypasses all of the security controls

1780.41 -> that you have in Kubernetes around container creation.

1783.35 -> So you've just granted basically full runtime permission

1787.19 -> to run any process with any level of permission on the box

1790.85 -> by mounting this socket into a pod.

1793.7 -> So this is another thing that if you're doing this,

1795.92 -> this is a security misconfiguration, please don't do this.
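
A minimal sketch of how you might spot this in a pod spec, assuming the spec has already been loaded into a dictionary. The function name, the list of socket paths, and the example spec are assumptions for illustration.

```python
# Common Docker socket locations; an assumption, extend for your hosts.
RUNTIME_SOCKETS = ("/var/run/docker.sock", "/run/docker.sock")


def docker_socket_volumes(pod_spec):
    """Return the names of volumes that mount the Docker socket via hostPath."""
    found = []
    for volume in pod_spec.get("volumes", []):
        host_path = volume.get("hostPath") or {}
        if host_path.get("path") in RUNTIME_SOCKETS:
            found.append(volume.get("name"))
    return found


# Hypothetical build pod that mounts the socket.
jenkins_spec = {
    "volumes": [
        {"name": "docker-sock", "hostPath": {"path": "/var/run/docker.sock"}},
        {"name": "workspace", "emptyDir": {}},
    ]
}
```

This only catches exact socket paths. Mounting a parent directory such as `/var/run` exposes the socket too, so a real policy would be broader.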

1801.05 -> Another, so insecure defaults.

1803.75 -> If you look at the Kubernetes docs, you look at,

1806.166 -> and I did double check.

1806.999 -> When I read this, I was like, that can't be right.

1810.53 -> I looked at the code, yes, this is actually the case.

1813.2 -> If you just launch a kubelet

1815.54 -> with the default configuration values,

1818.06 -> anonymous auth is turned on

1820.28 -> and authorization mode AlwaysAllow is the default.

1823.82 -> What does this mean?

1825.449 -> So kubelet is the Kubernetes agent.

1826.73 -> It creates pods,

1827.6 -> it talks to the API server to look for pods,

1830.75 -> talks to the runtime to actually launch the pods,

1833.06 -> and it also hosts a server.

1834.65 -> So if you use kubectl exec, kubectl logs,

1839.24 -> what you're doing is actually hitting the Kubernetes API server,

1842.06 -> which is hitting the kubelet server.

1844.28 -> And the kubelet server has these auth parameters.

1847.85 -> And anonymous auth basically says,

1850.55 -> you don't need authentication to talk to the kubelet server.

1853.31 -> And then authorization mode AlwaysAllow says,

1855.92 -> always allow requests to the kubelet server.

1858.59 -> That's not good

1859.423 -> because that means anyone who can access the kubelet server

1861.53 -> can get your logs, can exec, drop into a shell

1865.73 -> into your containers.

1869.9 -> If you spin up an EKS cluster

1871.37 -> and you look at the kubelet configuration

1872.87 -> that comes in your node,

1874.61 -> if you're using the EKS optimized AMI,

1877.49 -> and you cat out the kubelet configuration file,

1881.96 -> you'll see an authorization and authentication section

1886.25 -> that don't use the Kubernetes defaults.

1888.17 -> We turn off anonymous auth,

1890.9 -> we set authorization mode to Webhook.

1893.21 -> So the kubelet will accept a token

1895.97 -> and it will send that token back to the API server to say,

1898.67 -> is this a legit token, should I allow this request?
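
The two settings discussed here can be checked with a sketch like the following. It assumes the kubelet configuration has been parsed into a dictionary using the `authentication.anonymous.enabled` and `authorization.mode` fields of the KubeletConfiguration format. Treating missing values as the insecure defaults described in the talk is a deliberately conservative assumption of this sketch, and the function name is hypothetical.

```python
def kubelet_auth_findings(config):
    """Flag the two insecure kubelet server settings discussed in the talk."""
    findings = []
    anonymous = (config.get("authentication") or {}).get("anonymous") or {}
    # Assumption: an unset value is treated as enabled, to be conservative.
    if anonymous.get("enabled", True):
        findings.append("anonymous auth enabled")
    mode = (config.get("authorization") or {}).get("mode", "AlwaysAllow")
    if mode == "AlwaysAllow":
        findings.append("authorization mode AlwaysAllow")
    return findings


# Shape of a hardened configuration: anonymous auth off, Webhook authorization.
hardened = {
    "authentication": {"anonymous": {"enabled": False}},
    "authorization": {"mode": "Webhook"},
}
```

With these inputs, `kubelet_auth_findings(hardened)` is empty, and an empty configuration reports both findings.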

1903.26 -> For mitigations,

1904.94 -> security misconfiguration mitigations:

1908.93 -> don't add users to system:masters

1911.93 -> in the aws-auth ConfigMap.

1914.15 -> This is kind of a,

1916.49 -> this is an anti-pattern.

1917.93 -> system:masters in Kubernetes is a break-glass group

1922.51 -> that actually bypasses all other authorization,

1926.18 -> it bypasses RBAC.

1927.59 -> And it also bypasses admission,

1932.76 -> or rather, it can delete admission controls.

1935.21 -> So if you have an admission webhook

1936.86 -> that you're using for policy,

1938.39 -> system:masters can actually delete the policy webhook.

1941.18 -> It's, again, it's meant as a break-glass role.

1943.76 -> So if something goes horribly wrong,

1945.47 -> you can actually get yourself out of a pinch.

1948.74 -> But do not, as much as you can,

1950.81 -> do not use the system:masters group in Kubernetes.
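
As a sketch of how you might audit for this, the function below scans identity mappings of the kind stored in the aws-auth ConfigMap (entries with `rolearn` or `userarn`, `username`, and `groups`). It assumes the YAML has already been parsed into a list of dictionaries. The function name and the ARNs are hypothetical.

```python
def system_masters_mappings(entries):
    """Return the ARNs of mappings that grant the system:masters group."""
    return [
        entry.get("rolearn") or entry.get("userarn")
        for entry in entries
        if "system:masters" in entry.get("groups", [])
    ]


# Hypothetical mappings, for illustration only.
mappings = [
    {"rolearn": "arn:aws:iam::111122223333:role/dev", "username": "dev", "groups": ["developers"]},
    {"rolearn": "arn:aws:iam::111122223333:role/admin", "username": "admin", "groups": ["system:masters"]},
]
```

Anything this returns is a candidate for a narrower, RBAC-governed group.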

1953.57 -> Limit and restrict host access to and from pods.

1957.77 -> So the pods are containers

1960.08 -> and they have an isolation boundary

1961.43 -> that's in the Linux Kernel.

1962.99 -> The more holes you poke in that container,

1964.97 -> whether that's host PID namespace,

1968.06 -> or allowing privileged access,

1971.42 -> or adding additional Linux capabilities,

1973.85 -> or mounting the host file system,

1977.24 -> what's outside of, you know,

1978.98 -> that's a different code base, with that file system.

1982.28 -> You're poking holes in the container,

1983.51 -> so restrict those holes.

1984.92 -> And ideally restrict access to the host.

1987.68 -> Some host access is necessary,

1989.15 -> like we saw for logging or for other things,

1991.55 -> but especially for your generic application pod,

1994.1 -> like limit that as much as possible.

1996.32 -> And then just for secure defaults,

1998 -> like use the defaults from your provider,

2000.88 -> in this case in EKS, use those same defaults.
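
The holes listed above map to concrete pod spec fields, so they can be enumerated with a sketch like this one. It assumes a pod spec parsed into a dictionary. The function name and example spec are hypothetical, and init containers are not covered here.

```python
def host_access_holes(pod_spec):
    """List the ways a pod spec reaches through the container boundary."""
    holes = [f for f in ("hostPID", "hostNetwork", "hostIPC") if pod_spec.get(f)]
    for volume in pod_spec.get("volumes", []):
        host_path = volume.get("hostPath")
        if host_path:
            holes.append(f"hostPath volume: {host_path.get('path')}")
    for container in pod_spec.get("containers", []):
        ctx = container.get("securityContext") or {}
        name = container.get("name")
        if ctx.get("privileged"):
            holes.append(f"privileged container: {name}")
        added = (ctx.get("capabilities") or {}).get("add") or []
        if added:
            holes.append(f"added capabilities in {name}: {', '.join(added)}")
    return holes


# Hypothetical application pod with several holes poked in it.
app_spec = {
    "hostPID": True,
    "volumes": [{"name": "logs", "hostPath": {"path": "/var/log"}}],
    "containers": [
        {
            "name": "app",
            "securityContext": {"privileged": True, "capabilities": {"add": ["NET_ADMIN"]}},
        }
    ],
}
```

For a generic application pod the goal is an empty list. Logging agents and similar components may legitimately return a few entries.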

2006.7 -> Next, we'll look at vulnerable and outdated components.

2012.22 -> So this one we'll hit really quick.

2017.02 -> If we look over the last five years,

2019.96 -> including the year to date,

2022.42 -> the orange box here represents Amazon Linux 2

2025.75 -> kernel CVE updates.

2029.44 -> And then the blue box represents Kubernetes CVEs.

2033.25 -> These aren't all Kubernetes proper,

2034.72 -> some of these are additional Kubernetes components

2037 -> like Nginx Ingress

2038.65 -> or some additional Kubernetes side projects.

2041.89 -> But just to show as an example,

2043.57 -> here's the number of CVEs over the last five years

2045.58 -> and you can see a pretty obvious trend and it's going up.

2049.66 -> So if we look at just even this year, 2022,

2053.26 -> you can see that there's,

2054.82 -> if you combine Kubernetes and Linux Kernel issues,

2059.56 -> there's over a hundred.

2061.69 -> That means we're averaging almost two a week, right?

2066.88 -> So if you have a cluster and you have,

2070.93 -> or say you have not even a cluster,

2072.22 -> say you have a just a node in your cluster

2075.07 -> and that node has been running for a month,

2079.48 -> unless you're doing kernel live patching,

2081.67 -> it's probably out of date,

2082.63 -> you probably have vulnerable components.

2085.6 -> So this is one

2089.77 -> that I would really encourage you to balance also

2092.53 -> with your need for operational safety.

2096.04 -> So as the number of CVEs increases,

2099.61 -> and you want to keep these container

2100.93 -> and machine images up to date, absolutely.

2103.15 -> But you're also introducing change

2104.74 -> and change is where operational things can happen.

2107.35 -> So the goal is not necessarily zero CVEs

2112.84 -> being reported on all your applications.

2114.58 -> You need to actually look at, does the CVE affect me?

2117.37 -> Because if you're having two today, if it's two a week,

2121.15 -> and say it's next year, it's three

2123.34 -> and in a few years, it's more.

2125.29 -> Just say it gets to one a day,

2126.55 -> are you gonna be deploying every day with a security fix?

2129.16 -> Maybe, but not everyone can do that.

2132.97 -> But as much as you can, keep your applications up to date.

2136.03 -> Also, another one,

2137.35 -> really keep your Kubernetes cluster on a supported version.

2141.28 -> This link here goes to the Kubernetes versions

2143.89 -> supported by Amazon EKS.

2145.913 -> Kubernetes upstream supports versions

2148.66 -> for one year from release.

2150.46 -> EKS generally releases a few months after that

2153.85 -> and we support versions for 14 months.

2156.7 -> So we provide additional CVE patching,

2160.21 -> language patching for the build of Kubernetes.

2165.34 -> Kubernetes is written in Go

2166.42 -> and Go has a different release cycle

2168.1 -> that maintains a one year release.

2170.14 -> So keep your Kubernetes up to date, that's a huge one.

2175.51 -> As we saw Kubernetes CVEs come out semi-frequently

2178.36 -> and EKS keeps on top of that for you.

2181.75 -> Especially for control plane patches,

2183.52 -> there's no action required, we just patch that.

2187.21 -> So this link shows the calendar

2190.09 -> of when certain Kubernetes versions are supported by EKS

2193.12 -> and when you need to actually update.

2195.04 -> So that's a huge, huge thing

2197.41 -> that I would encourage everyone to do

2198.49 -> is make sure you have a plan to keep up to date

2200.56 -> on those Kubernetes versions.
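
To make "have a plan" a little more concrete, here is a sketch that estimates when a version leaves support and whether it is time to start planning. The 14-month window comes from the talk. The 90-day warning threshold, the function names, and the example dates are assumptions for illustration, so check the EKS version calendar for real dates.

```python
import calendar
from datetime import date


def add_months(start, months):
    """Return the date `months` after `start`, clamping the day if needed."""
    index = start.month - 1 + months
    year = start.year + index // 12
    month = index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def needs_upgrade_plan(release, today, support_months=14, warn_days=90):
    """True when the end of the support window is within `warn_days`."""
    end_of_support = add_months(release, support_months)
    return (end_of_support - today).days <= warn_days


# Hypothetical release date; not an actual EKS release date.
release_date = date(2022, 11, 15)
```

A check like this could run on a schedule and open a ticket well before the forced-update date.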

2204.67 -> Next one.

2205.503 -> So, security logging and monitoring failures.

2209.233 -> A lot of us are really good at operational alerting, right?

2214.33 -> There's a lot of focus on that, we don't want downtime.

2217.36 -> But, logging and monitoring for security events

2221.56 -> is often an afterthought.

2224.62 -> So this is just a quick one for Kubernetes.

2228.46 -> But what I really would encourage you to do,

2231.01 -> if you're using EKS, turn on control plane logging.

2235.21 -> EKS provides this automatically to CloudWatch.

2237.538 -> You can get Kubernetes audit log,

2238.878 -> Kubernetes API server, scheduler,

2240.85 -> controller manager, authenticator logs

2242.5 -> to your CloudWatch account.
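
The log types just listed correspond to the EKS control plane log type names `api`, `audit`, `authenticator`, `controllerManager`, and `scheduler`. The sketch below only builds the logging payload in the shape the EKS UpdateClusterConfig API accepts. It makes no AWS call, and the function name is hypothetical.

```python
CONTROL_PLANE_LOG_TYPES = ["api", "audit", "authenticator", "controllerManager", "scheduler"]


def enable_all_logging_payload():
    """Build a logging configuration that enables every control plane log type."""
    return {
        "clusterLogging": [
            {"types": list(CONTROL_PLANE_LOG_TYPES), "enabled": True},
        ]
    }


payload = enable_all_logging_payload()
```

With an AWS SDK, a payload like this would typically be passed as the `logging` argument of the cluster configuration update call, alongside the cluster name.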

2244.18 -> And then same for your Kubernetes data plane.

2246.55 -> So if you're running,

2248.23 -> whether you're self-managing nodes

2249.97 -> or using EKS managed node groups,

2253.99 -> set up logging to get your system logs off host.

2257.26 -> This is gonna be really important,

2258.64 -> so that if you need to do any forensic analysis

2261.16 -> or if you wanna do, again, alerting on,

2265.03 -> security alerting on these logs, you can set that up.

2268.36 -> And then one bullet point not added here is,

2271.6 -> for logging and monitoring is,

2274.03 -> if you're using EKS, you can actually use GuardDuty

2276.76 -> to monitor your Kubernetes audit logs.

2279.07 -> It'll look for known bad actor patterns

2282.79 -> and misconfigurations applied to your cluster

2285.67 -> and alert you about those.

2291.28 -> So for the next one, the last one,

2294.16 -> this might be my favorite.

2296.05 -> So server side request forgery.

2298.78 -> So, what is this?

2300.4 -> If you're not familiar with this, so Kubernetes is,

2307.12 -> I sometimes like to say it's server-side request forgery

2309.64 -> as a service.

2312.34 -> If you've ever created a deployment,

2314.17 -> you might use something like kubectl.

2315.94 -> So you kubectl apply this deployment

2318.46 -> and you create that in the API

2320.65 -> and the controller manager creates a ReplicaSet,

2324.25 -> and for the ReplicaSet, it also creates a pod.

2326.83 -> And then the scheduler assigns that pod to a node.

2331.6 -> That pod gets created by the Kubelet,

2334.78 -> which invokes the CNI, container network interface driver.

2339.76 -> And that CNI driver replies with an IP address

2343.21 -> and the Kubelet sends that back to the API server and says,

2346.36 -> here's the IP for this pod.

2347.92 -> So the Kubelet is the one doing the reporting

2351.31 -> to the API server, the pod IP.

2356.05 -> And it might look something like this, right?

2357.79 -> So this pod IP ends in 56.145.

2363.22 -> Now, that's a string field in the API.

2368.8 -> So if you want to port forward to that pod,

2372.16 -> Kubernetes will let you do that.

2374.2 -> It's really easy with kubectl,

2375.31 -> just port-forward to pod nginx, port 80.

2378.34 -> And you don't even have to tell it the IP,

2379.81 -> the Kubernetes API already knows it.

2382 -> And so what you're actually doing here

2383.83 -> is sending a request to the API server,

2386.416 -> The API server is reaching out

2388.18 -> with a network connection directly to that IP

2390.52 -> that's been reported by the Kubelet

2392.74 -> and sending you a TCP proxy.

2395.41 -> So you can, locally, in your web browser hit your pod.

2398.47 -> It's really great for debugging, right?

2401.86 -> But if you give someone,

2403.48 -> or if someone gains permission to modify pod statuses,

2409.99 -> they all of a sudden can do something like this.

2412.03 -> They can run kubectl proxy,

2414.22 -> so they're just opening an authenticated connection

2418.18 -> to the API server locally in the background.

2420.67 -> And then calling curl and hitting this pod's endpoint,

2426.19 -> the status endpoint

2427.48 -> and patching the status to set a different IP, right?

2433.45 -> And in this case, it's something like 169.254.169.254,

2436.63 -> that's the AWS instance metadata service,

2441.16 -> which would return instance credentials.

2443.44 -> Well, in Kubernetes, this was a CVE in the past,

2447.37 -> this has since been fixed.

2449.14 -> So you cannot actually get

2451.06 -> Kubernetes control plane metadata credentials with this method.

2454.87 -> But an attacker could.

2459.88 -> If I go back one.

2461.47 -> They can update this pod IP to something else in your VPC.

2466.21 -> Say it's not your actual pod IP,

2468.04 -> but say it's an RDS database,

2471.49 -> say it's ElastiCache, or something like Redis

2475.69 -> that doesn't have authentication, maybe RDS does.

2479.53 -> All of a sudden they can now probe around your VPC

2483.49 -> if they have sufficient permissions to both patch

2485.77 -> and port forward against your Kubernetes API server.

2488.35 -> Maybe you don't even have an internet gateway,

2489.76 -> but this is another method where you're using Kubernetes

2492.61 -> to poke holes to get access to things that you shouldn't.
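
Because the pod IP is just a reported string, one defensive idea is to sanity-check it against the address ranges your pods are supposed to use. The sketch below is an illustration under assumptions: the function name, the CIDR, and the example addresses are hypothetical, and it is not a control that Kubernetes or EKS applies for you.

```python
import ipaddress


def suspicious_pod_ip(pod_ip, pod_cidrs):
    """Return a reason string if a reported pod IP looks wrong, else None."""
    try:
        ip = ipaddress.ip_address(pod_ip)
    except ValueError:
        return "not an IP address"
    # 169.254.0.0/16 is link-local and includes the instance metadata service.
    if ip.is_link_local:
        return "link-local address"
    if not any(ip in ipaddress.ip_network(cidr) for cidr in pod_cidrs):
        return "outside expected pod CIDRs"
    return None


# Hypothetical pod subnet.
expected_cidrs = ["10.0.0.0/16"]
```

For example, `suspicious_pod_ip("169.254.169.254", expected_cidrs)` flags the metadata address, and an address belonging to a database elsewhere in the VPC would be flagged as outside the expected ranges.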

2495.82 -> So, what can we do about this?

2502.39 -> Go forward.

2504.19 -> We can enable Kubernetes audit logging,

2506.14 -> One, we need to know what's going on.

2508.81 -> And you can, on top of that,

2510.34 -> alert on non-node patching of pod status.

2514.93 -> So yes, someone could impersonate the node,

2518.23 -> but at least if you're looking

2520.48 -> for someone who's not impersonating the node,

2522.76 -> you can still see something bad is going on.
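
The alert described here could be sketched as a predicate over Kubernetes audit events. It assumes each event is a parsed audit record with the `verb`, `user.username`, and `objectRef` fields of the audit event format, and it relies on node identities using the `system:node:` username prefix. The function name and sample events are hypothetical.

```python
def is_non_node_pod_status_write(event):
    """True when something other than a node identity writes a pod's status."""
    ref = event.get("objectRef") or {}
    if ref.get("resource") != "pods" or ref.get("subresource") != "status":
        return False
    if event.get("verb") not in ("patch", "update"):
        return False
    username = (event.get("user") or {}).get("username", "")
    return not username.startswith("system:node:")


# Hypothetical audit events.
node_event = {
    "verb": "patch",
    "user": {"username": "system:node:ip-10-0-1-23.ec2.internal"},
    "objectRef": {"resource": "pods", "subresource": "status", "name": "nginx"},
}
developer_event = {
    "verb": "patch",
    "user": {"username": "developer"},
    "objectRef": {"resource": "pods", "subresource": "status", "name": "nginx"},
}
```

Audit events also record impersonation separately from the authenticated user, so a fuller rule could inspect that field as well.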

2524.95 -> Two, limit Kubernetes API outbound access.

2527.44 -> So when you create an EKS cluster,

2529.18 -> you give it a couple different resources.

2531.52 -> You say, I want this cluster to have these security groups

2535.42 -> and I want these subnets.

2537.13 -> Well, what are those?

2537.963 -> Those are where the EKS control plane,

2541.38 -> or the Kubernetes control plane managed by EKS,

2543.31 -> creates network interfaces in your VPC

2545.56 -> to facilitate this outbound access.

2548.14 -> So if you actually need outbound access to your pods,

2553.27 -> that's probably, you want to enable that.

2555.76 -> But if you don't,

2556.81 -> if the Kubernetes API server shouldn't directly reach out

2559.42 -> to your Redis, to your ElastiCache, to your RDS,

2563.05 -> to something else in your VPC,

2564.88 -> control that security group and don't let it.

2566.74 -> Again, this isn't necessarily a trust of EKS problem,

2570.31 -> it's a trust of what can get through Kubernetes.

2574.42 -> And then secondarily, just keep clusters up to date.

2577.78 -> I'm gonna preach this all day.

2580.99 -> When Kubernetes SSRF type issues are found, we patch them.

2585.1 -> Those are typically control plane issues

2587.68 -> and we get embargoed notifications ahead of time

2590.35 -> and you don't have to even think about it.

2591.82 -> So when it hits the public and that report goes public,

2596.53 -> your EKS cluster's already patched.

2600.49 -> And then just some final thoughts on hardening clusters.

2605.92 -> So I'm gonna say it again.

2608.41 -> Keep on top of your cluster updates.

2611.35 -> I can't stress this enough.

2613.51 -> This also goes for containers, and your container images,

2617.56 -> and your node groups.

2620.26 -> Another, just really great practice here

2622 -> is use KMS encryption of Kubernetes secrets.

2624.28 -> So you can actually give EKS your KMS key

2629.227 -> and Kubernetes secrets can be encrypted.

2634.72 -> Two, we talked about this a little bit,

2636.1 -> but Kubernetes API server access.

2638.23 -> If you don't need the internet facing endpoint

2641.11 -> for the Kubernetes API server, you can disable that.

2644.2 -> Just make it accessible only through your VPC.

2647.77 -> Talked about this already but I'm gonna say it again.

2649.54 -> Enable audit logs with EKS, this is so important.

2652.69 -> And then if you want to, you can set up GuardDuty.

2655.5 -> We work really closely with the GuardDuty team

2658.24 -> to add new findings all the time to say,

2662.62 -> here's a new issue, let's alert customers about it.

2665.77 -> And then also use IAM roles for service accounts

2668.53 -> to give pods access to AWS APIs.

2671.02 -> Again, this kind of goes back to least privilege.

2672.67 -> Don't use the EC2 IMDS instance metadata role.
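
With IAM roles for service accounts, the link between a pod and an IAM role is an annotation on the Kubernetes service account, `eks.amazonaws.com/role-arn`. The sketch below builds such a manifest as a dictionary. The function name, service account name, namespace, and role ARN are hypothetical.

```python
def service_account_with_role(name, namespace, role_arn):
    """Build a ServiceAccount manifest annotated with an IAM role ARN."""
    return {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "annotations": {"eks.amazonaws.com/role-arn": role_arn},
        },
    }


# Hypothetical names and ARN, for illustration only.
manifest = service_account_with_role(
    "s3-reader", "default", "arn:aws:iam::111122223333:role/s3-reader"
)
```

Pods that run under this service account get credentials scoped to that one role instead of inheriting whatever the node's instance role can do.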

2678.49 -> How can you secure your pods?

2679.78 -> Not just your cluster and the API.

2682.36 -> Use a policy enforcement agent or engine.

2685.21 -> So Kubernetes had the pod security policy

2689.5 -> and that's gonna be gone in 1.25.

2692.8 -> And so you need to migrate to something else.

2695.32 -> Kubernetes has a built-in pod security admission.

2699.01 -> That's not full PSP, it's a different scope

2702.97 -> and you can configure that in EKS.

2705.37 -> But you can also,

2706.203 -> if you want finer grained access to specific fields,

2708.82 -> say in your pod and allowing or disallowing those,

2711.31 -> use something like Open Policy Agent or Gatekeeper.
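
The built-in Pod Security Admission is configured per namespace through labels under the `pod-security.kubernetes.io/` prefix, with the levels `privileged`, `baseline`, and `restricted`. The sketch below builds those labels. The function name is hypothetical, and the level choice in the example is only an illustration.

```python
LEVELS = ("privileged", "baseline", "restricted")
PREFIX = "pod-security.kubernetes.io/"


def pod_security_labels(enforce, audit=None, warn=None):
    """Build namespace labels for Pod Security Admission modes."""
    labels = {}
    for mode, level in (("enforce", enforce), ("audit", audit), ("warn", warn)):
        if level is None:
            continue
        if level not in LEVELS:
            raise ValueError(f"unknown level for {mode}: {level}")
        labels[PREFIX + mode] = level
    return labels


# Enforce baseline while warning about anything short of restricted.
labels = pod_security_labels("baseline", warn="restricted")
```

Because these are coarse namespace-wide levels, field-by-field rules still call for a policy engine, as the talk notes.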

2714.64 -> And then finally,

2716.44 -> the EKS Security Best Practices guide here,

2719.458 -> the QR code is just the link here.

2722.35 -> Go read that.

2723.183 -> I really encourage you to read that.

2724.75 -> There's great standards out there

2725.89 -> like CIS benchmark for Kubernetes

2728.71 -> that have some really great recommendations

2731.02 -> on configurations.

2732.82 -> But in this EKS Security Best Practices Guide,

2735.94 -> we kind of go above and beyond that to add just,

2738.82 -> and we update it constantly with findings

2740.8 -> that we get from customers

2743.35 -> or when new Kubernetes features get released,

2746.59 -> on what to allow or disallow.

2749.11 -> So I highly, highly recommend giving that a read through.

2753.73 -> And with that, I think that's the end of our talk today.

2757.24 -> Thanks everyone.

2758.073 -> I think that's all we got time for today.

2759.396 -> (audience applauding)

Source: https://www.youtube.com/watch?v=vmZgHqYhLSU