AWS re:Invent 2022 - Deploying egress traffic controls in production environments (SEC312)

Private workloads that require access to resources outside of the VPC should be well monitored and managed. There are solutions that can make this easier, but selecting one requires evaluation of your security, reliability, and cost requirements. In this session, learn how Robinhood evaluated, selected, and implemented AWS Network Firewall to shape network traffic, block threats, and detect anomalous activity on workloads that process sensitive financial data. Robinhood engineers share how they selected a deployment model and executed a global network change that brought their traffic fully in line with firewall endpoints.

Learn more about AWS re:Invent at https://go.aws/3ikK4dD.

Subscribe:
More AWS videos http://bit.ly/2O3zS75
More AWS events videos http://bit.ly/316g9t4

ABOUT AWS
Amazon Web Services (AWS) hosts events, both online and in-person, bringing the cloud computing community together to connect, collaborate, and learn from AWS experts.

AWS is the world’s most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster.

#reInvent2022 #AWSreInvent2022 #AWSEvents


Content

0.33 -> - Hey, welcome everyone to
1.68 -> Deploying Egress Traffic Controls in Production Environments.
5.784 -> It's a bit of a mouthful, the title,
7.95 -> but I promise you we've got a valuable presentation ahead.
12.024 -> I'm Graham Zulauf.
13.65 -> I'm a principal solutions architect with AWS
15.832 -> and I'm joined here with the real all stars,
19.2 -> Houston and Kevin.
22.32 -> And we'll kick things off here.
25.62 -> We're gonna have a quick overview here of our agenda.
28.673 -> We're gonna talk about why do we even need egress controls.
31.95 -> Also a primer on AWS network firewall.
34.23 -> It's gonna play a key part in our discussion today.
36.5 -> So if you're not familiar with the service itself,
38.43 -> don't worry.
39.263 -> We'll get you up to speed real quick.
41.61 -> And then we're gonna hand off to Houston.
43.74 -> He's gonna talk about Robinhood's journey
46.17 -> in adopting network firewall,
48.63 -> but you wanna pay close attention to how they phased in
52.02 -> that adoption to their production environment.
55.11 -> And then Kevin is gonna dive deep
57.45 -> into the real meat of the presentation.
60.63 -> The level 300 details that you're probably here for.
63.03 -> That's the deployment and the implementation of the service.
67.83 -> So why do we need egress controls?
69.84 -> Well,
72.481 -> take the example of the Log4j exploit that was recently
76.44 -> in the wild.
77.64 -> A difficult one to mitigate,
81.06 -> but part of that Log4j exploit,
83.352 -> after a host is attacked,
86.76 -> is that it has to reach back out
88.56 -> to download a payload and then execute that code.
91.89 -> Egress controls can mitigate
93.656 -> and protect against that sort of attack.
97.29 -> Also, many command and control frameworks depend on
101.04 -> phoning home, that is, outbound communication, in their attack.
105.63 -> So egress controls help mitigate and protect
108.515 -> against that exploit.
110.61 -> A recent report from Cisco Talos
114.9 -> said that 66% of ransomware attacks
117.57 -> used command and control,
120.09 -> specifically Cobalt Strike.
123.27 -> And so egress controls are an important way to protect
126.96 -> against that attack.
129.249 -> AWS network firewall, what is it?
131.13 -> It's AWS's managed service that makes it easy to deploy
135.32 -> network protection for your VPCs.
139.117 -> It's easy to set up, it's just a few clicks.
141.87 -> It has fine-grained access control where you can create
144.36 -> rules or import rules in open source formats.
147.81 -> And also it allows you to set policy and then apply that
152.01 -> across all of your VPCs and accounts.
155.37 -> It's built for the cloud.
156.54 -> What does that mean?
157.373 -> It scales.
158.206 -> So as your traffic increases,
159.596 -> the performance and the ability
162 -> of network firewall keeps up with that.
164.61 -> Supports hundreds of thousands of connections,
166.8 -> has HA built in.
168.36 -> You don't need to size instances or do any patching or
171.57 -> anything like that.
172.65 -> Again, flexible rule engine.
174.496 -> And it has compatibility with a number of formats
178.32 -> that are common.
181.23 -> So how do you use network firewall?
183.54 -> Simple, log into the VPC portion of the AWS console.
187.89 -> You create a new firewall there.
190.47 -> You can then create rules,
192.3 -> associate those with policies,
194.296 -> deploy an endpoint,
196.86 -> and then you're protected.
198.275 -> Then you can also use
201.447 -> AWS Firewall Manager to control multiple endpoints
204.63 -> across your entire organization
206.19 -> so you can have consistent policy.
207.99 -> Remember, AWS takes care of the patching,
210.27 -> they take care of the maintenance.
212.25 -> You don't need to worry about any of that.
214.71 -> It's all part of the service.
218.07 -> A closer look at some of the more fine-grain features here
221.16 -> in a modern firewall that AWS network firewall is,
224.22 -> you get things like domain filtering,
227.04 -> Suricata-compatible rule sets,
230.16 -> IPS/IDS
231.54 -> functionality,
233.047 -> but also you're gonna get visibility,
235.14 -> meaning you're gonna see all of the flow logs if you want.
239.19 -> All of your rules and alerts can get piped into CloudWatch,
242.699 -> Kinesis, S3, and then the possibilities are endless there
246.42 -> in how you want to visualize or report or do threat hunting
249.9 -> inside of that data.
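As a rough illustration of the domain filtering and Suricata-compatible rules mentioned above, the following Python sketch builds a rule string that drops TLS connections to one server name. The layout follows the general Suricata rule format. The helper name, the example domain, the message text, and the sid value are assumptions for illustration, not rules shown in the talk.

```python
def build_sni_drop_rule(domain: str, sid: int) -> str:
    """Return a Suricata-style rule that drops TLS traffic whose SNI equals `domain`."""
    # Reject characters that would break out of the quoted content option.
    if '"' in domain or ";" in domain:
        raise ValueError("domain must not contain quotes or semicolons")
    options = [
        "tls.sni",                             # inspect the TLS server name
        f'content:"{domain}"',                 # value to match
        "startswith",                          # anchor at the start...
        "nocase",                              # ...case-insensitively...
        "endswith",                            # ...and at the end (exact match)
        f'msg:"Deny egress TLS to {domain}"',
        "flow:to_server, established",
        f"sid:{sid}",                          # hypothetical rule id
        "rev:1",
    ]
    return "drop tls $HOME_NET any -> $EXTERNAL_NET any (" + "; ".join(options) + ";)"


print(build_sni_drop_rule("example.com", 1000001))
```

Generating rules from a list of domains like this keeps the rule text consistent, which matters once the same rule group is applied to many firewalls.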
254.07 -> What are some of the use cases we see?
255.48 -> Well, the first swim lane you see there
257.58 -> is the one you're probably here to listen to.
261.09 -> That is egress security,
262.8 -> and Kevin and...
265.956 -> - [Houston] Houston.
267.254 -> - Our folks here from Robinhood are gonna
270.9 -> go into that deeper.
273.191 -> Also, we have East West protection,
275.49 -> so VPC to VPC protection.
278.662 -> Network firewall helps you there.
280.8 -> And also again, the IDS and IPS functionality
283.943 -> is built right in.
287.22 -> So I'm gonna hand things off now to Houston.
289.35 -> - [Houston] Alright.
291.12 -> - Thank you, Kevin.
291.953 -> - [Graham] Sorry, Kevin.
293.22 -> - Hi, I'm Houston Hopkins.
294.6 -> I work on the Robinhood security team,
296.28 -> along with my co-presenter here, Kevin.
298.289 -> We're delighted to be here today to talk to you a little bit
300.69 -> about our network firewall journey,
303.51 -> which actually started a little less than a year ago.
306.81 -> We were looking for ways to really aggregate and dig deep
310.11 -> into our egress controls
311.61 -> as well as get better visibility on aggregate
313.74 -> of everything that was happening inside of Robinhood.
316.08 -> So...
318.9 -> to speak about the problem.
320.04 -> So,
322.83 -> generally, our egress controls were tightly coupled
326.1 -> with the applications that we run.
328.027 -> That is, we use security groups,
329.73 -> we use NACLs,
332.273 -> and we love those things.
334.83 -> However, we needed an aggregate view.
336.75 -> So if perhaps there was a control that was missed
339.84 -> or we wanted to see what egress traffic
341.67 -> was still being allowed,
342.54 -> we needed something deeper.
344.273 -> We needed something that would complement our current
346.254 -> cloud capabilities, so something we're familiar with.
350.01 -> And we like the guardrail approach.
352.05 -> So ways to get a lot of value early on
355.08 -> and then get really deep later.
356.645 -> Most importantly, we didn't want to break production
359.4 -> and we wanted to keep our engineers happy.
364.8 -> So what we were looking for,
367.71 -> most importantly, it's egress control.
369.03 -> So we were looking at north south traffic
370.89 -> going up and out to the internet.
372.24 -> That was the primary thing we were looking for.
375.376 -> We needed a flexible rules engine.
377.28 -> I think that's mentioned several times
379.23 -> in what Graham just said,
380.063 -> but the flexible rules engine made a big difference for us.
382.02 -> We needed something we could use,
383.04 -> something we could implement.
386.16 -> We were interested in deep packet inspection and packet
388.29 -> inspection in general,
389.76 -> what we could do there to inspect TLS SNI, et cetera,
393.63 -> because we needed something to see domains.
398.67 -> We needed it to be reliable.
399.93 -> We trust AWS to be reliable.
401.989 -> We're excited about the cellular architecture.
404.22 -> We were familiar with it from other
405.584 -> AWS products that we used.
407.568 -> And we enjoyed the fact that there was a managed portion
410.55 -> of this infrastructure to keep it up and running,
412.74 -> but also managed rule sets that we could learn from,
415.17 -> use, and adopt.
416.944 -> Finally, it needed to be scalable.
419.1 -> We mentioned pay as you go several times,
422.286 -> we like to build things up as fast as we tear them down
424.5 -> if we need to,
425.7 -> with very little,
429.3 -> very little reason, very little overhead to not do that.
432.45 -> And we use infrastructure as code.
434.34 -> So we needed something that worked in Terraform,
436.23 -> which is what we use.
440.4 -> So you may be asking, do I need a network firewall?
442.95 -> I can't exactly answer that question for you,
444.9 -> but I can give you a little bit
446.04 -> about how we went about this.
448.045 -> So if you look on the right,
449.55 -> there's traditional VPC controls and they're great.
452.336 -> We don't plan to replace them, they're not going anywhere.
455.52 -> But we did need something that was
456.78 -> more of an aggregate view.
457.86 -> So the thing to really focus on here
459.33 -> (there's a lot of text on this one) is domain-level filtering.
461.97 -> We need something to see domain names, DNS names.
464.603 -> We also liked that the
467.855 -> quotas and service limits were clearly documented.
470.52 -> We had a great idea on how we wanted to build it,
472.32 -> where it would scale,
476.13 -> and we liked the managed capabilities, of course.
481.62 -> So this is our approach
482.627 -> and it's quite simple in the beginning.
484.869 -> Let's get those firewalls deployed,
488.01 -> then we're gonna implement some telemetry
489.45 -> to see what we're learning and analyze that data,
492.63 -> put 'em into dashboards and see what we find.
494.58 -> And, surprise, we found things very early on,
497.25 -> very, very nice areas for improvement.
499.95 -> Secondly, we really had to focus on blocking and tackling
503.34 -> as we,
506.07 -> now that we have this capability,
507.51 -> we need to know how and when to use it effectively.
509.82 -> We needed people to know how to react
512.19 -> when we had an emergency
514.23 -> and who was gonna be in those approval chains.
516.84 -> I say do this early before you need it.
519.353 -> Finally, as we progress on and we're still on this journey,
522.917 -> you know, denying known bad is great,
525.39 -> it's a great first place to start,
527.07 -> but mine your data to learn how to build an allow list and
529.89 -> look more towards a positive security model.
532.825 -> And there will be things that don't fit
535.08 -> in your positive security model.
536.55 -> This helps us identify better ways to isolate that traffic.
540.57 -> In short, it's really helping us keep tabs on our quest
543.78 -> to have a really great data perimeter.
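The step from "deny known bad" to an allow list can be sketched as simple counting over the domains seen in firewall logs. This is a minimal Python sketch under stated assumptions: the input is a plain list of observed domain names, and the `min_hits` threshold and function names are hypothetical, not part of Robinhood's tooling.

```python
from collections import Counter


def normalize(domain):
    """Lower-case a domain and drop a trailing dot so variants compare equal."""
    return domain.lower().rstrip(".")


def candidate_allow_list(observed_domains, min_hits=2):
    """Domains seen at least `min_hits` times become allow-list candidates."""
    counts = Counter(normalize(d) for d in observed_domains)
    return sorted(d for d, hits in counts.items() if hits >= min_hits)


def outside_allow_list(observed_domains, allow_list):
    """Domains that were observed but do not fit the positive security model."""
    allowed = {normalize(d) for d in allow_list}
    return sorted({normalize(d) for d in observed_domains} - allowed)


observed = [
    "api.example.com", "API.example.com",
    "pkg.example.org", "pkg.example.org",
    "rare.example.net",
]
allow = candidate_allow_list(observed)
print(allow)
print(outside_allow_list(observed, allow))
```

The second list is the interesting one for review: it holds the traffic that needs either a justification or a better way to be isolated.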
547.119 -> At this point I'm gonna hand it off to Kevin.
549.27 -> - [Kevin] Thank you.
552.3 -> Hello, once again.
553.435 -> My name is Kevin Park and I'm a security software engineer
556.5 -> at Robinhood.
557.73 -> I had the opportunity to implement the data plane
560.43 -> of AWS Network Firewall at Robinhood
563.01 -> across various environments
564.69 -> for the purpose of egress control.
567.66 -> And I'm here to provide some guidance,
569.349 -> share technical details
572.473 -> and include easy-to-make mistakes
574.26 -> I have learned about throughout the process of implementing
576.219 -> AWS network firewall.
577.92 -> And the best part is that I was able to
580.14 -> insert the firewall into the live production environment
582.84 -> with zero downtime.
586.89 -> Before diving in,
588.03 -> I want to share some of the implementation goals
590.46 -> that we set out to achieve with the network firewall.
593.52 -> First, we want to capture and monitor all egress traffic.
597.75 -> By egress traffic we mean any network traffic that
600.93 -> originates within Robinhood's internal environment
603.33 -> and traverses out to the public internet.
606.265 -> We not only want to see which IP address
609.09 -> is reaching out to which port,
611.248 -> but also be provided with useful information that can aid us
615.72 -> in implementing proactive controls and hopefully help us in
619.47 -> the investigations down the road if necessary.
622.56 -> And second, if we have bad traffic,
625.05 -> we want to be able to block it immediately.
627.781 -> But most importantly,
629.28 -> the firewall must not introduce any new bottlenecks
632.634 -> in the production environment that the engineers
635.76 -> now have to deal with.
638.192 -> Now we want the firewall to be, like, completely transparent
642.96 -> for those engineers who build, test,
645.45 -> and deploy the services.
647.205 -> But I also do wanna explicitly mention that we are here to
651.109 -> capture and monitor all the egress traffic,
653.7 -> but not necessarily all the ingress traffic.
655.92 -> So if you see any diagrams down the road
658.521 -> where you go, hmm,
660.81 -> that does not capture all the
662.361 -> ingress coming into the Robinhood system,
666.12 -> this is the reason why.
670.47 -> Let's go over a couple of high-level options
673.11 -> that we explored.
676.89 -> First is the centralized model.
679.53 -> In here we have AWS transit gateway
682.437 -> and a security VPC on the right.
685.633 -> The security VPC is where we host
689.43 -> one network firewall,
690.81 -> and it has the NAT gateway going to
694.23 -> the internet gateway.
696.33 -> So if we have one or more VPC instances,
700.17 -> all of 'em would
701.64 -> connect through the transit gateway, through the TGW ENI,
706.35 -> and route to the firewall and to the public internet.
711.3 -> Now the benefit of this is that it's
713.31 -> one single firewall deployment.
715.59 -> Now this comes with the benefit that it's a
717.593 -> single point of management,
720.99 -> and since it's one firewall,
723.12 -> you only have to deploy firewall rules to the single one.
726.39 -> And for smaller companies, that's a bonus
730.075 -> if you don't have high traffic going through
731.97 -> and you can route multiple environments,
734.16 -> VPCs to this one single firewall.
738 -> But at the same time, this can be a downside
741.03 -> if you want a little more reliability,
743.82 -> this becomes a single choke point.
745.74 -> So if you have a misconfiguration in the transit gateway
749.52 -> while maintaining it,
750.57 -> this can become a single point of failure where
755.024 -> it can impact all the VPCs that are routing through
758.19 -> this one single network firewall.
764.58 -> Now, the second option is the distributed model.
767.76 -> In this model,
770.52 -> the firewall is deployed in place
772.29 -> where the infrastructure already exists.
775.483 -> In the example here, we have
779.58 -> two Kubernetes clusters in a VPC
782.267 -> spanning across two availability zones.
785.19 -> In here, there are four total NAT gateways,
789.27 -> two for each cluster,
791.85 -> one for each availability zone.
794.43 -> And for each NAT gateway there is a
797.1 -> network firewall endpoint paired with it,
800.229 -> giving us one-to-one scaling.
803.4 -> And with this distributed model,
807.18 -> you'll have multiple firewalls, with which you can do
811.62 -> a phased rollout of the firewall rules.
814.41 -> And so if you want to test the firewall rule in the
817.83 -> development environment first,
819.36 -> you can roll it out, and then roll it out to production.
822.24 -> Or if you have multiple clusters
823.74 -> in the production environment,
824.85 -> you can test it out in one smaller production cluster first
827.76 -> and then gradually roll it out to the next ones,
830.629 -> the bigger ones.
832.38 -> Now the downside is that,
834.57 -> obviously, this has n number of firewalls,
838.05 -> that is, n number of firewalls
840.72 -> corresponding to n number of environments.
843.48 -> So if you have, like, five Kubernetes clusters
846.63 -> in a VPC and two in another,
848.67 -> that means that you'll have to deal with and manage
850.68 -> seven total firewalls.
853.23 -> So if you're more of a smaller company,
854.905 -> this might not be the approach,
858.01 -> as you might need playbook automation
860.76 -> to make sure that you have all the firewall rules
863.37 -> that you want in place and you're not missing them in one
865.86 -> or the other, because
868.29 -> you can lose track of it
870.03 -> as the number of firewalls grows.
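One way to picture the automation described above is a drift check that compares the rules each firewall actually has against the set every firewall is expected to have. This is a minimal Python sketch: the firewall names, rule identifiers, and the `find_rule_drift` helper are all hypothetical, and in practice the deployed sets would come from your own inventory or infrastructure-as-code state.

```python
def find_rule_drift(expected_rules, deployed):
    """Report, per firewall, which expected rules are missing or unexpected.

    `deployed` maps a firewall name to the collection of rule identifiers
    currently attached to it. Firewalls with no drift are left out.
    """
    expected = set(expected_rules)
    drift = {}
    for firewall, rules in deployed.items():
        missing = expected - set(rules)
        unexpected = set(rules) - expected
        if missing or unexpected:
            drift[firewall] = {
                "missing": sorted(missing),
                "unexpected": sorted(unexpected),
            }
    return drift


expected = ["deny-known-bad", "alert-new-domain"]
deployed = {
    "prod-cluster-a": ["deny-known-bad", "alert-new-domain"],
    "prod-cluster-b": ["deny-known-bad"],
    "dev-cluster": ["deny-known-bad", "alert-new-domain", "temp-test-rule"],
}
for name, report in find_rule_drift(expected, deployed).items():
    print(name, report)
```

With seven firewalls instead of three the check is the same, which is the point: the cost of the distributed model is paid in automation rather than in manual review.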
873.45 -> But for Robinhood,
874.71 -> we decided to go with the distributed model.
877.487 -> The reasoning behind it is primarily scaling needs.
882.09 -> NAT gateway has a max throughput of 45 gigabits per second
886.32 -> and each network firewall endpoint has a max throughput of
889.65 -> a hundred gigabits per second.
892.664 -> Here at Robinhood,
895.35 -> we almost maxed out the 45 gigabits per second
898.02 -> NAT gateway limit when we had
900.66 -> multiple production clusters sharing a single NAT gateway
904.14 -> per AZ, of course.
905.91 -> Now because of that we have split out the NAT gateways
909.69 -> to each Kubernetes cluster,
912.3 -> and with the firewall deployment,
914.13 -> we don't want to introduce that kind of similar bottleneck.
917.19 -> So it was obvious that
919.05 -> we wanted to go with this distributed model.
921.39 -> But, obviously, as I said, for smaller companies,
924.57 -> the centralized model might be a better fit.
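The scaling argument is simple arithmetic, sketched here in Python. The two capacity figures are the ones quoted in the talk. The traffic volume and cluster count are hypothetical numbers chosen only to show the shape of the calculation.

```python
# Limits quoted in the talk.
NAT_GATEWAY_GBPS = 45.0
FIREWALL_ENDPOINT_GBPS = 100.0


def utilization(traffic_gbps, capacity_gbps):
    """Fraction of a component's capacity that the traffic consumes."""
    return traffic_gbps / capacity_gbps


def split_evenly(total_gbps, cluster_count):
    """Traffic per cluster if each cluster gets its own NAT gateway."""
    return total_gbps / cluster_count


shared_traffic_gbps = 42.0  # hypothetical peak through one shared NAT gateway
cluster_count = 3           # hypothetical number of production clusters

per_cluster = split_evenly(shared_traffic_gbps, cluster_count)
print(f"shared NAT gateway:   {utilization(shared_traffic_gbps, NAT_GATEWAY_GBPS):.0%}")
print(f"per-cluster NAT:      {utilization(per_cluster, NAT_GATEWAY_GBPS):.0%}")
print(f"per-cluster firewall: {utilization(per_cluster, FIREWALL_ENDPOINT_GBPS):.0%}")
```

Pairing one firewall endpoint with each NAT gateway means the firewall is never the tighter limit, since its quoted ceiling is more than double the NAT gateway's.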
931.791 -> Before going into how we deployed the network firewall,
936.591 -> I have to show you guys the before picture.
939.12 -> So this is an example
942.893 -> of a Kubernetes cluster
944.7 -> spanning across two availability zones,
947.43 -> us-east-1a and us-east-1b.
949.617 -> And at the bottom, those are the route tables
952.23 -> that are associated with the public and the private subnets.
955.843 -> And the green ones are the public subnets.
959.619 -> It consists of a NAT gateway per AZ, of course,
962.94 -> and public load balancers.
965.85 -> And on the right side,
967.56 -> the blue ones are the private subnets
970.733 -> which consists of Kubernetes nodes,
973.688 -> which are just made out of Amazon EC2 instances.
977.88 -> There's a key detail I want you guys to remember,
980.61 -> which is that on the public subnets,
982.92 -> the NAT gateway and public load balancer
985.2 -> share the same subnet.
987.992 -> The reason I'm telling you to remember this is that
990.87 -> I'll show you how you can make a simple,
993.091 -> easy routing mistake
994.68 -> because of this.
998.1 -> And just going over the simple route table
1000.365 -> in the public subnet,
1003.32 -> it's routing directly to the IGW,
1006.08 -> whereas in the private subnets
1008.6 -> it's going through the NAT gateway,
1010.49 -> in the public subnet then to the internet gateway.
1017.06 -> Now, from that,
1018.35 -> here's how we deployed the network firewall:
1021.59 -> we inserted the network firewall endpoint
1024.53 -> into each AZ by creating
1027.71 -> a dedicated subnet for the network firewall
1030.115 -> and having it sandwiched between
1033.38 -> the public and private subnets.
1036.35 -> Now the reasoning
1038.96 -> we put it in there is that we want to be able to
1041.75 -> see all the egress traffic, since that's our goal.
1044.51 -> So we want to be able to capture it,
1046.656 -> going from the Kubernetes nodes
1049.85 -> to the NAT gateway and to the internet.
1052.55 -> Now, the NAT gateways and the load balancer
1054.8 -> share a subnet, and the firewall sits
1058.16 -> between the public and private subnets,
1059.78 -> which means that we will occasionally capture some
1062.54 -> internal-to-internal traffic,
1064.363 -> primarily because public load balancers
1067.222 -> reaching into the Kubernetes nodes
1071.489 -> will be internal-to-internal traffic.
1075.47 -> But that's okay, because the firewall endpoint has a much
1079.4 -> higher throughput up to a hundred gigabits per second.
1082.43 -> So we have the buffer room we need.
1085.835 -> But if our goal is to primarily just capture
1088.28 -> all the egress traffic,
1089.39 -> why not place it after the NAT gateway
1091.681 -> before the internet gateway?
1094.43 -> Now this would capture all the egress traffic that we need,
1098.48 -> and it would make the
1100.385 -> routing table much simpler.
1103.25 -> However, placing it after the NAT gateway
1106.82 -> would mask all the Kubernetes
1108.29 -> source IP addresses.
1110.262 -> And for the purpose of firewall,
1111.98 -> we want to be able to see, like, where it's coming from
1115.76 -> internally, not just the NAT gateway that it's coming from.
1122.678 -> But one thing that you need to be very careful about when
1126.32 -> designing the route table is that
1129.86 -> you have to preserve the network symmetry.
1133.034 -> So if you have traffic
1135.89 -> going through the firewall
1137.99 -> in one direction,
1138.98 -> then you have to make sure that the return traffic
1141.26 -> is also going through the network firewall.
1144.59 -> If you have traffic going through the firewall
1147.23 -> in one direction, but not in the other direction,
1149.78 -> going around it,
1151.37 -> firewall will notice it and simply, silently,
1153.83 -> drop the packet.
1154.76 -> And if you deploy it as it is, you'll have a bad day,
1157.37 -> you'll have a production outage and
1160.447 -> yeah, you don't want that.
1161.96 -> So,
1164.12 -> how did we fix it?
1165.53 -> Oh, actually let me first mention this.
1167.54 -> So,
1168.56 -> this example right here is actually a mistake
1170.63 -> I made while designing it,
1172.447 -> because I was so focused on
1175.55 -> capturing the egress traffic
1177.83 -> going through the NAT gateway.
1179.03 -> I forgot to consider the public load balancer
1181.37 -> in the equation of the public subnets.
1183.47 -> So, in this route table
1186.05 -> I have the VPC CIDR, in the public subnet,
1189.74 -> pointing to the network firewall endpoint.
1193.04 -> So, that's in the public subnet route table.
1195.68 -> And in the private subnet route table
1197.72 -> we have 0.0.0.0/0 pointing
1203.14 -> to the network firewall endpoint.
1205.13 -> Obviously this works for the NAT gateway
1206.99 -> and the private subnets,
1208.1 -> but this creates the network asymmetry
1211.73 -> between the public load balancers
1213.5 -> and the Kubernetes nodes
1215.84 -> and you can have a partial outage.
1218.45 -> Now, how do you fix it?
1220.31 -> The solution is quite simple.
1223.31 -> You introduce more granular routes
1225.41 -> rather than the VPC CIDR,
1228.14 -> the /16, pointing to the network firewall endpoint.
1231.05 -> Instead, you just define
1235.187 -> the exact subnet CIDR
1236.63 -> that you want to route through the network firewall.
1239.78 -> So in this case,
1241.64 -> the public subnets'
1243.29 -> route table
1244.67 -> will have three additional routes,
1247.689 -> and these are the private subnets' CIDR ranges
1252.77 -> in the same
1254 -> availability zone, pointing to the network firewall.
1256.88 -> And in the same way, the private subnets
1259.52 -> will have the public subnets' CIDRs in the route table
1263.3 -> pointing to the network firewall.
1265.1 -> Now this will achieve network symmetry
1268.58 -> between the private subnets and both the NAT gateway
1273.38 -> and the load balancers.
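The asymmetry and its fix can be checked with a small longest-prefix-match model. This is a minimal Python sketch, not the actual Robinhood routes: the CIDR ranges, IP addresses, and target names are assumptions, and only one private subnet is modeled where the talk mentions three.

```python
import ipaddress


def next_hop(route_table, destination):
    """Return the target of the most specific route that contains `destination`."""
    address = ipaddress.ip_address(destination)
    candidates = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in route_table.items()
        if address in ipaddress.ip_network(cidr)
    ]
    if not candidates:
        raise LookupError(f"no route to {destination}")
    return max(candidates, key=lambda route: route[0].prefixlen)[1]


def is_symmetric(public_table, private_table, public_ip, private_ip):
    """True if both directions of a flow agree on whether they cross the firewall."""
    forward_inspected = next_hop(public_table, private_ip) == "firewall"
    return_inspected = next_hop(private_table, public_ip) == "firewall"
    return forward_inspected == return_inspected


# Hypothetical layout: VPC 10.0.0.0/16, public subnet 10.0.0.0/24,
# private subnet 10.0.2.0/24.
LOAD_BALANCER_IP = "10.0.0.10"  # lives in the public subnet
NODE_IP = "10.0.2.20"           # Kubernetes node in the private subnet

# The mistake: the whole VPC CIDR points at the firewall on the public side only.
BROKEN_PUBLIC = {"10.0.0.0/16": "firewall", "0.0.0.0/0": "igw"}
BROKEN_PRIVATE = {"10.0.0.0/16": "local", "0.0.0.0/0": "firewall"}

# The fix: each side routes the other side's exact subnet CIDR to the firewall.
FIXED_PUBLIC = {"10.0.0.0/16": "local", "10.0.2.0/24": "firewall", "0.0.0.0/0": "igw"}
FIXED_PRIVATE = {"10.0.0.0/16": "local", "10.0.0.0/24": "firewall", "0.0.0.0/0": "firewall"}

print("broken:", is_symmetric(BROKEN_PUBLIC, BROKEN_PRIVATE, LOAD_BALANCER_IP, NODE_IP))
print("fixed: ", is_symmetric(FIXED_PUBLIC, FIXED_PRIVATE, LOAD_BALANCER_IP, NODE_IP))
```

In the broken tables the load balancer reaches the node through the firewall, but the node's reply matches the local route and goes around it. A check like this can run against planned route tables before they are associated with any subnet.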
1279.02 -> Now, now we have the architecture defined.
1281.99 -> Now how do we deploy this systematically?
1284.72 -> Robinhood has, obviously, like, a lot of environments
1287.42 -> and we want to be able to do this same thing
1290.69 -> over and over again without mistakes.
1293.982 -> To do that we used Terraform.
1296.93 -> We use Terraform widely across Robinhood already.
1299.75 -> So it was a no-brainer just to use this
1304.91 -> to deploy the firewall, and the way I did it,
1307.61 -> I just created a single overarching network firewall
1311.06 -> Terraform module that will do this for me.
1313.88 -> So this firewall module will automatically create
1316.7 -> the subnets for the firewall endpoints,
1320.51 -> the firewall, obviously,
1322.91 -> and the new route tables for the public, private,
1325.34 -> and the firewall subnets,
1327.02 -> and populate them with the proper routes
1331.37 -> routing through the NAT gateway
1332.72 -> and through the firewall and also configure logging
1336.71 -> and where to send those logs to.
1339.98 -> In addition, I also added a bonus of just
1342.26 -> a firewall enabled Boolean flag, just true and false.
1346.37 -> So I can just set that to true or false to decide whether to
1349.7 -> create it or not in the VPC environment
1351.68 -> or the Kubernetes cluster environment.
1353.21 -> That's because network firewall is pay for what you use,
1357.38 -> so if you don't want to create a firewall in, say,
1359.81 -> a small test environment
1361.4 -> that you're gonna keep internal,
1362.9 -> you don't need a firewall,
1364.25 -> and so you can just use the flag
1368.09 -> to have it or not.
1370.697 -> But one thing to note is that,
1373.55 -> why am I creating a new set of route tables
1375.56 -> for both public and private subnets?
1377.45 -> As I mentioned, this environment already exists,
1379.91 -> but I'm creating a new set of route tables
1383.06 -> for the public and private subnet
1385.34 -> rather than just modifying the existing ones.
1388.58 -> Why do I duplicate the route tables?
1393.77 -> Well, it's because, first, inserting a firewall
1396.83 -> into a live production system is tricky.
1399.44 -> As I showed in the example,
1402.17 -> one mistake in the route can bring production down,
1405.56 -> with a partial outage or a full one.
1408.86 -> How I solved this is that I create all the
1411.155 -> necessary firewall related resources in advance.
1413.95 -> This includes the route tables:
1416.126 -> I duplicate all the routes
1419.72 -> in the route tables and add the additional
1422.66 -> firewall integration routes.
1425.81 -> This allows us to verify the routes once more,
1429.32 -> and all the necessary components,
1432.62 -> before taking the system live.
1435.92 -> And then once we verify it
1438.44 -> and everything's correct,
1440.27 -> either by eye or through a script programmatically,
1443.411 -> I have another Boolean variable called
1446.51 -> firewall route enable,
1448.76 -> which you can set to true
1450.56 -> and just do a single terraform apply.
1453.32 -> This will switch the route table associations
1456.361 -> for the public and private
1459.401 -> subnets
1460.234 -> to the newly duplicated route tables in an instant.
1465.38 -> Now this is the trick that I used to
1468.68 -> bring the firewall live with zero downtime.
1474.05 -> And we also replicated it today.
1477.246 -> (Kevin laughs)
1478.402 -> - [Houston] Quick, quick note on that is
1480.924 -> when you're troubleshooting and you want to
1482.18 -> remove the firewall from the equation,
1483.77 -> it's a quick way to turn it off and turn it back on when,
1486.59 -> when you've made everybody happy that it's not your fault.
1489.47 -> So. (laughs)
1492.11 -> - Yeah, and yeah, on that note, this Boolean variable,
1497.06 -> the firewall route enable also doubles as a fail safe flag.
1502.16 -> So imagine a rare outage scenario
1505.34 -> where the AWS shared cell hosting the network firewall
1507.89 -> endpoint has a disruption, goes down.
1512.06 -> Now we can use that same firewall
1515.21 -> route enable flag, set it to false,
1517.79 -> and do a single Terraform apply again.
1519.56 -> And this will flip the route table association
1522.47 -> back to the original,
1524.51 -> obviously, in a matter of seconds.
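The cut-over and fail-safe behavior described above comes down to one flag choosing which pre-built route table each subnet tier is associated with. The talk implements this in Terraform. The following is only a Python sketch of the selection logic, and the route table names are hypothetical.

```python
# Both variants exist ahead of time. Only the association changes.
ROUTE_TABLES = {
    "public": {"original": "public-rt", "firewall": "public-rt-fw"},
    "private": {"original": "private-rt", "firewall": "private-rt-fw"},
}


def associations(firewall_route_enable):
    """Map each subnet tier to the route table it should be associated with."""
    variant = "firewall" if firewall_route_enable else "original"
    return {tier: tables[variant] for tier, tables in ROUTE_TABLES.items()}


print("go live:  ", associations(True))
print("fail safe:", associations(False))
```

Because neither route table is edited during the switch, going live and rolling back are the same operation with a different flag value.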
1528.341 -> Now, you might also be wondering, like,
1532.074 -> how do we make sure that
1535.25 -> the route tables that are not being used
1537.44 -> have their routes maintained?
1540.32 -> The way I did it in Terraform,
1543.65 -> since it's the same infrastructure as code,
1545.365 -> is I just have it set so that
1548.57 -> if you try to add a route to one of 'em,
1551.21 -> then it'll automatically add it to the other.
1553.04 -> So it'll always maintain its congruence even
1555.59 -> if it's not being used. It's just that one has
1558.336 -> additional firewall integration routes added to it.
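Keeping the unused table congruent can be modeled by deriving both tables from one shared set of routes, with the firewall routes layered on top. This Python sketch is an illustration of that idea rather than the Terraform used at Robinhood. The class name, CIDRs, and targets are assumptions.

```python
class MirroredRouteTables:
    """Two route tables built from one shared source of routes."""

    def __init__(self, firewall_overrides):
        self.shared = {}                                  # routes common to both tables
        self.firewall_overrides = dict(firewall_overrides)

    def add_route(self, cidr, target):
        """Adding a route once makes it appear in both derived tables."""
        self.shared[cidr] = target

    @property
    def original(self):
        return dict(self.shared)

    @property
    def with_firewall(self):
        # Firewall routes win where both define the same CIDR.
        return {**self.shared, **self.firewall_overrides}


private = MirroredRouteTables({
    "0.0.0.0/0": "firewall-endpoint",
    "10.0.0.0/24": "firewall-endpoint",
})
private.add_route("0.0.0.0/0", "nat-gateway")
private.add_route("10.8.0.0/16", "peering-connection")  # hypothetical later addition

print(private.original)
print(private.with_firewall)
```

Whichever table is currently associated, the other already contains every route added since, so flipping the flag never resurrects a stale table.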
1564.62 -> Now moving on to monitoring and alerting.
1566.24 -> Deployment is the first part of it, but
1568.234 -> keeping it up and running is the second part of it
1572.81 -> and the continuous part of it.
1574.85 -> Now, Amazon CloudWatch provides all the necessary
1579.552 -> operational metrics from the network firewall,
1583.01 -> and from that we have
1585.53 -> two
1586.73 -> very important alerts
1589.7 -> generated from it.
1590.54 -> And the first one is just a basic
1592.25 -> low firewall traffic alert.
1594.59 -> This fires if there's close to zero
1599.39 -> network traffic going through the firewall.
1601.43 -> This can happen if the firewall is just simply disabled,
1604.452 -> or an engineer accidentally
1606.53 -> or deliberately disabled
1609.32 -> the firewall to bypass the egress control mechanism.
1612.65 -> If that happens, this alert fires.
1616.25 -> And the second one, which is even more important,
1619.04 -> is the firewall dropped packets warning.
1623.15 -> This triggers when there are
1626.87 -> packets being actively dropped by the firewall.
1629.48 -> This can occur if the firewall endpoint
1633.53 -> is hitting the bandwidth limits,
1634.82 -> which is pretty unlikely because
1636.77 -> it has a hundred gigabits per second throughput,
1640.25 -> which is way more than double the NAT gateway.
1643.85 -> So this is unlikely.
1645.5 -> And the second scenario is the,
1648.5 -> there's some firewall rule that is being triggered to,
1652.76 -> to drop the packets.
1654.05 -> Usually
1655.01 -> it will be that scenario, because usually
1657.83 -> you have the firewall rule to drop packets
1659.96 -> if it's malicious traffic.
1661.97 -> So if this fires, we would get an alert on Slack
1666.44 -> just as above, but also be paged on our phones directly
1669.735 -> 'cause we would want to hop in and investigate right away.
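
The two alerts described above can be sketched as simple threshold checks. This is a minimal illustration only, not Robinhood's actual alerting: the function name, the thresholds, and the per-minute packet counts are all assumptions standing in for whatever CloudWatch metrics and alarm settings are really in use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Alert:
    name: str
    page_on_call: bool  # True: page phones in addition to the Slack message


def evaluate_firewall_alerts(
    received_per_min: List[int],
    dropped_per_min: List[int],
    low_traffic_threshold: int = 10,  # hypothetical value for "close to zero"
    dropped_threshold: int = 0,       # hypothetical: any dropped packet counts
) -> List[Alert]:
    """Return the alerts that would fire for one evaluation window."""
    alerts: List[Alert] = []

    # Low traffic: every sample in the window is near zero, which suggests
    # the firewall was disabled or is being bypassed.
    if received_per_min and max(received_per_min) <= low_traffic_threshold:
        alerts.append(Alert("low-firewall-traffic", page_on_call=False))

    # Dropped packets: a drop rule matched, or (less likely) the endpoint
    # hit its bandwidth limit. This one also pages the on-call engineer.
    if sum(dropped_per_min) > dropped_threshold:
        alerts.append(Alert("firewall-dropped-packets", page_on_call=True))

    return alerts
```

In this sketch only the dropped-packets alert pages, mirroring the split between Slack-only and Slack-plus-page described in the talk.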
1676.49 -> And then onto logging and visibility.
1680.679 -> So the example here is
1683.75 -> just an event that I pulled from a
1687.86 -> network firewall.
1689.27 -> And here obviously you can see
1692.15 -> the fields that would be present
1694.31 -> in the average VPC flow logs.
1697.021 -> You see source IP, source port, destination IP,
1701.244 -> and destination port.
1703.58 -> This is all useful.
1704.99 -> This already comes from flow logs,
1706.55 -> but what makes the network firewall logs different
1709.981 -> is that they come with additional information
1712.46 -> like the TLS fingerprint,
1714.32 -> the SNI domain, and the TLS version.
1718.661 -> And in other cases it also comes with JA3
1722.15 -> fingerprint hashes, which you can
1725.53 -> reverse stack rank to find the outliers,
1730.73 -> treat as a threat hunting exercise,
1732.62 -> and start an investigation from there.
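
As a rough illustration of the reverse stack ranking mentioned above, the sketch below counts JA3 hashes and lists the least common ones first. The event layout (a `tls` object holding `ja3.hash`) is an assumption about the log shape, and `rarest_ja3` is a hypothetical helper, not part of any AWS SDK.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def rarest_ja3(events: Iterable[Dict], limit: int = 5) -> List[Tuple[str, int]]:
    """Return up to `limit` (ja3_hash, count) pairs, least frequent first."""
    counts: Counter = Counter()
    for event in events:
        # Assumed layout: {"tls": {"ja3": {"hash": "..."}}}; events without
        # a JA3 hash (non-TLS traffic, for example) are skipped.
        ja3_hash = event.get("tls", {}).get("ja3", {}).get("hash")
        if ja3_hash:
            counts[ja3_hash] += 1
    # Reverse stack rank: ascending by count, ties broken by hash so the
    # output is deterministic.
    return sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))[:limit]
```

The fingerprints at the head of this list are the outliers worth looking at first in a threat hunt.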
1738.407 -> And I'll be handing off to Houston to provide more insight.
1741.8 -> - Yep, sure.
1743.12 -> Thanks Kevin.
1744.465 -> Just a quick note,
1745.52 -> I know I mentioned dashboards a few slides ago.
1747.38 -> Kevin's overwhelmed us with awesome material since then.
1750.53 -> So this is just a quick,
1753.08 -> it's really a mock-up of one of our early dashboards
1756.32 -> showing what we got immediate visibility on
1759.5 -> by looking at domains,
1760.94 -> which was a nice added feature of using the network firewall
1763.601 -> and a couple of things that we looked at right away.
1766.31 -> So first of all, how often was Robinhood talking to AWS
1771.12 -> through the internet?
1772.37 -> That's pretty interesting data.
1773.99 -> When you're building data perimeters you like to use things
1775.88 -> like VPC endpoints.
1777.2 -> So did we have gaps where we needed better coverage?
1779.248 -> Yes, we did.
1781.46 -> So the other thing is how often is Robinhood going out
1784.82 -> to the internet to talk to itself?
1786.53 -> So do we have a lot of opportunity,
1788.976 -> or just some opportunities, to build in better routing there?
1793.25 -> Maybe route to the internal load balancer
1794.72 -> instead of going back out to the internet
1796.122 -> through all the various layers to come right back in.
1799.226 -> But more typically,
1801.29 -> we did actually look at all of the different domains
1803.06 -> that we were reaching out to, right?
1804.44 -> And so
1805.273 -> there's various egress controls all the way up the stack.
1807.17 -> So there's certain domains that were showing up repeatedly,
1810.072 -> and were those interesting?
1812.261 -> That helped us build our allow list,
1814.67 -> in our quest for positive security.
1816.88 -> Likewise, so you'll see, you know,
1819.5 -> maybe tons of hits to certain things.
1821.06 -> You're like, oh, those look like valid services.
1822.92 -> I hope we use those.
1823.82 -> If those look suspicious,
1825.35 -> we hope we would investigate those immediately as well.
1828.74 -> But then we also go down to the bottom, what is rare?
1831.02 -> What are these random sites,
1832.64 -> these random domains that we hit on occasion?
1834.682 -> What ports are being used on those domains?
1836.87 -> Are they interesting?
1839.24 -> And what we found here is that you'll find an application,
1842.862 -> I think this is pretty common across many companies.
1845.45 -> Someone built a scraper, right?
1846.92 -> Let's scrape the news.
1847.91 -> We're Robinhood.
1848.743 -> We're interested in what's going on in the world.
1849.86 -> So a news scraper or maybe a threat intelligence scraper.
1854 -> Those types of things are really hard to wrap
1855.86 -> egress controls around because you want them to see
1857.78 -> the internet.
1858.86 -> Those are great opportunities for
1863.03 -> moving that infrastructure to its own unique place
1865.25 -> and not applying the same exact rules
1866.78 -> that you do for the rest of the environments.
1868.76 -> I'll touch on this here.
1875.63 -> Maybe, all right.
1877.37 -> So those discoveries, I just accidentally let the cat
1880.13 -> outta the bag before I got here.
1881.09 -> But we quickly identified opportunities where VPC endpoints
1885.59 -> were not in the right flows.
1887.927 -> So we had a lot of opportunity to increase that
1891.362 -> and reduce our internet traffic.
1894.071 -> It's always good when your security control makes friends
1897.53 -> with your cost controls,
1898.754 -> it's nice to have them hand in hand.
1901.207 -> And then likewise with the Robinhood traffic
1904.25 -> that was routing directly to the internet
1905.75 -> and then coming right back in the door,
1907.4 -> lots of opportunity to save some money there
1909.35 -> and, as security people,
1911.12 -> if it doesn't have to go to the internet,
1912.26 -> we don't want it to go to the internet.
1914.732 -> And then I say measure the top and the bottom.
1916.97 -> So again, if you have a lot of domains you're hitting
1918.95 -> all the time,
1920.18 -> you should know what those are.
1921.17 -> That should look fairly stable over time.
1923.36 -> If you have domains that are just really random and rare,
1925.91 -> those should be interesting to you as well because
1927.5 -> you know, if you're building an allow list,
1929.93 -> why would someone randomly hit these domains?
1931.91 -> There's good reasons, but you need to know them.
1934.209 -> This gave us the visibility to really investigate that.
1938.007 -> And also, I mention here: is there a middle?
1941.452 -> Sometimes you get so excited by what you see at, you know,
1946.28 -> what are we hitting all the time,
1947.39 -> what are we rarely hitting,
1948.5 -> that you may miss some things in the middle.
1949.88 -> It's worth it to look all the way up and down.
1953.51 -> Those domains can easily be evaluated, you know,
1956.215 -> against various threat intelligence services, feeds,
1959.45 -> domain reputation sites, you name it.
1962 -> And if there's anything interesting there,
1963.53 -> it's worth investigating.
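
The "measure the top and the bottom, and check the middle" idea can be sketched as a simple frequency ranking over the SNI domains seen in the firewall logs. This is an illustrative sketch only: `rank_domains` and the choice of `n` are assumptions, not something shown in the talk.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Ranked = List[Tuple[str, int]]


def rank_domains(snis: Iterable[str], n: int = 10) -> Dict[str, Ranked]:
    """Split observed domains into top, middle and bottom by hit count."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Normalise case and any trailing dot so variants count as one domain.
    counts = Counter(s.lower().rstrip(".") for s in snis if s)
    # Most frequent first; ties broken alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    top = ranked[:n]
    bottom = ranked[n:][-n:]  # rarest domains, never overlapping with top
    middle = ranked[n:len(ranked) - len(bottom)]
    return {"top": top, "middle": middle, "bottom": bottom}
```

The top list should look stable over time, the bottom list is where unexplained one-off destinations show up, and the middle is kept separate so it does not get ignored. Any of the three lists could then be checked against a domain reputation feed.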
1966.53 -> And again, on the deeper segmentation opportunities,
1968.51 -> if there's things that don't fit,
1970.134 -> it's best to just try to find a place to move
1972.38 -> those snowflakes to.
1974.27 -> Back over to Kevin.
1975.103 -> - [Kevin] Yeah.
1976.16 -> Yeah, actually, I want to add a thing.
1979.16 -> One of the interesting discoveries that we didn't expect
1982.4 -> was in the cost savings area.
1985.94 -> We were expecting AWS Network Firewall
1989.03 -> to add cost to our bills,
1991.01 -> but it actually provided more cost-saving
1994.7 -> opportunities, because network firewall provides
1997.67 -> the TLS SNI domain.
1999.38 -> So even if it's reaching out to the AWS services,
2003.7 -> since we can see the domain names
2005.77 -> that it is reaching out to,
2007.27 -> we were able to see that there's this huge S3 traffic
2011.116 -> that is reaching out through the NAT gateway,
2015.49 -> where, instead of the NAT gateway,
2018.25 -> we can use an S3 VPC endpoint
2020.47 -> to save a significant amount of money.
2022.96 -> And through this we were able to save
2025.366 -> quite a significant amount of money there.
2028.236 -> This is not only just S3 but also DynamoDB,
2032.38 -> which also has its own VPC endpoint.
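
One way to surface this kind of finding is to group the SNI domains under `amazonaws.com` by service and count them. The sketch below is a rough heuristic under stated assumptions: the hostname parsing is best-effort and will not cover every AWS hostname form, and the helper names are hypothetical. S3 and DynamoDB are marked separately because those are the two services that offer gateway endpoints.

```python
import re
from collections import Counter
from typing import Iterable, List, Optional, Tuple

# Matches region-like labels such as "us-east-1" or "ap-southeast-2".
REGION = re.compile(r"^[a-z]{2}(?:-[a-z]+)+-\d+$")
GATEWAY_ENDPOINT_SERVICES = {"s3", "dynamodb"}
AWS_SUFFIX = ".amazonaws.com"


def aws_service_from_sni(sni: str) -> Optional[str]:
    """Best-effort guess of the AWS service behind a hostname, else None."""
    host = sni.lower().rstrip(".")
    if not host.endswith(AWS_SUFFIX):
        return None
    labels = [
        label
        for label in host[: -len(AWS_SUFFIX)].split(".")
        if label and label != "dualstack" and not REGION.match(label)
    ]
    # After dropping the region, the right-most label is usually the service
    # (for example "my-bucket.s3.us-east-1" leaves "s3").
    return labels[-1] if labels else None


def endpoint_candidates(snis: Iterable[str]) -> List[Tuple[str, int, str]]:
    """Return (service, hits, endpoint hint), busiest service first."""
    counts: Counter = Counter()
    for sni in snis:
        service = aws_service_from_sni(sni)
        if service:
            counts[service] += 1
    return [
        (svc, hits, "gateway" if svc in GATEWAY_ENDPOINT_SERVICES else "check-interface")
        for svc, hits in counts.most_common()
    ]
```

Services at the top of this list are the ones where moving traffic from the NAT gateway to a VPC endpoint is most likely to pay off.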
2039.82 -> And,
2041.98 -> on the right,
2043.09 -> to wrap it up,
2044.38 -> I want to add further improvements we can make
2048.67 -> in the architecture of the firewall.
2051.16 -> Now,
2052.27 -> I said that to have
2055.188 -> network symmetry
2056.65 -> between the public and the private subnets,
2059.41 -> you need to have more granular routes.
2061.24 -> Now if you're one of those people who like having
2063.82 -> nice, simple route tables, like me,
2066.398 -> and you don't want to have bloat in the route tables
2069.97 -> if you have a lot of subnets, this is the way to go.
2073.352 -> You want to separate out the NAT gateway
2076.6 -> into its own subnet.
2080.32 -> Then you can have more flexibility
2083.59 -> over the route table, where you only
2086.59 -> route traffic between the NAT gateway
2088.51 -> and the private subnets through the
2089.95 -> network firewall endpoint,
2091.48 -> and all the public load balancer traffic goes
2095.39 -> outside the firewall, around the firewall,
2098.41 -> directly to the private subnets,
2102.04 -> to the K nodes.
2103.6 -> Now this has benefits. Obviously I mentioned
2108.25 -> simpler route tables, as shown here.
2111.7 -> You no longer have that bloat,
2113.89 -> but also,
2118.58 -> you will no longer capture the
2121.84 -> internal-to-internal traffic that was
2124.63 -> adding noise to our logging capabilities,
2127.55 -> and can simply focus on the egress control.
2130.11 -> Now also, eliminating that internal-to-internal traffic
2134.74 -> additionally saves some of the cost, because
2137.577 -> network firewall is billed on the amount you use.
2143.77 -> Now you can also do this
2145.33 -> with zero production downtime
2148.51 -> if you're spanning across multiple availability zones
2152.5 -> and you can temporarily introduce cross-AZ traffic.
2155.74 -> The trick is that, if you have another AZ,
2159.7 -> you just simply route the internet traffic
2163 -> to another NAT gateway in that AZ.
2165.22 -> So introduce that cross-AZ traffic,
2169.03 -> then
2170.32 -> delete the NAT gateway that you want to move out,
2172.969 -> but preserve the elastic IP,
2175.84 -> and create a new subnet and
2178.57 -> create a new NAT gateway in it
2181.982 -> with the same elastic IP
2184.21 -> that you preserved.
2185.043 -> Now this way,
2188.08 -> you are preserving the egress IP,
2190.57 -> the public IP that is egressing out,
2192.67 -> so, let's say, if you have some IP address
2194.83 -> based filtering for other external services,
2197.83 -> then you can still do that.
2201.1 -> And finally, obviously, just move the
2204.76 -> internet traffic back to the new NAT gateway.
2207.88 -> This way,
2209.23 -> if you already have a NAT gateway and
2211.63 -> other services within the public subnet, like Robinhood did,
2215.2 -> you can simply move it out in this scenario.
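
The migration steps above can be sketched as a single function. This is a minimal sketch under assumptions, not the procedure Robinhood ran: it assumes `ec2` is a boto3 EC2 client (or anything with the same methods), that one route table holds the default route to the NAT gateway, and it leaves out the route table and firewall routing that the new subnet itself needs. All parameter names are hypothetical.

```python
def move_nat_gateway(
    ec2,
    *,
    route_table_id: str,   # table whose default route targets the NAT gateway
    old_nat_id: str,       # NAT gateway being moved out of the public subnet
    other_az_nat_id: str,  # NAT gateway in another AZ, used temporarily
    allocation_id: str,    # Elastic IP allocation to preserve
    vpc_id: str,
    new_subnet_cidr: str,
    az: str,
) -> str:
    """Recreate a NAT gateway in its own subnet, keeping its Elastic IP."""
    default_route = "0.0.0.0/0"

    # 1. Temporarily send internet-bound traffic to the other AZ's NAT gateway.
    ec2.replace_route(
        RouteTableId=route_table_id,
        DestinationCidrBlock=default_route,
        NatGatewayId=other_az_nat_id,
    )

    # 2. Delete the old NAT gateway. Its Elastic IP is disassociated but stays
    #    allocated to the account, so it can be reused below.
    ec2.delete_nat_gateway(NatGatewayId=old_nat_id)
    ec2.get_waiter("nat_gateway_deleted").wait(NatGatewayIds=[old_nat_id])

    # 3. Create the dedicated subnet and a new NAT gateway with the same EIP.
    subnet_id = ec2.create_subnet(
        VpcId=vpc_id, CidrBlock=new_subnet_cidr, AvailabilityZone=az
    )["Subnet"]["SubnetId"]
    new_nat_id = ec2.create_nat_gateway(
        SubnetId=subnet_id, AllocationId=allocation_id
    )["NatGateway"]["NatGatewayId"]
    ec2.get_waiter("nat_gateway_available").wait(NatGatewayIds=[new_nat_id])

    # 4. Move internet-bound traffic back to the new NAT gateway.
    ec2.replace_route(
        RouteTableId=route_table_id,
        DestinationCidrBlock=default_route,
        NatGatewayId=new_nat_id,
    )
    return new_nat_id
```

Because the same Elastic IP allocation is reused, external services that filter on the egress IP keep seeing the same address once traffic is moved back.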
2222.52 -> But yeah, thank you.

Source: https://www.youtube.com/watch?v=z8MSmK1p4Tw