AWS re:Invent 2022 - Deploying egress traffic controls in production environments (SEC312)
Private workloads that require access to resources outside of the VPC should be well monitored and managed. There are solutions that can make this easier, but selecting one requires evaluation of your security, reliability, and cost requirements. In this session, learn how Robinhood evaluated, selected, and implemented AWS Network Firewall to shape network traffic, block threats, and detect anomalous activity on workloads that process sensitive financial data. Robinhood engineers share how they selected a deployment model and executed a global network change that brought their traffic fully in line with firewall endpoints.
Content
0.33 -> - Hey, welcome everyone to
1.68 -> Deploying Egress Traffic Controls
in Production Environment.
5.784 -> It's a bit of a mouthful, the title,
7.95 -> but I promise you we've got a
valuable presentation ahead.
12.024 -> I'm Graham Zulauf.
13.65 -> I'm a principal solutions
architect with AWS
15.832 -> and I'm joined here
with the real all stars,
19.2 -> Houston and Kevin.
22.32 -> And we'll kick things off here.
25.62 -> We're gonna have a quick
overview here of our agenda.
28.673 -> We're gonna talk about why do
we even need egress controls.
31.95 -> Also a primer on AWS network firewall.
34.23 -> It's gonna play a key part
in our discussion today.
36.5 -> So if you're not familiar
with the service itself,
38.43 -> don't worry.
39.263 -> We'll get you up to speed real quick.
41.61 -> And then we're gonna hand off to Houston.
43.74 -> He's gonna talk about RobinHood's journey
46.17 -> in adopting network firewall,
48.63 -> but you wanna pay close
attention to how they phased in
52.02 -> that adoption to their
production environment.
55.11 -> And then Kevin is gonna dive deep
57.45 -> into the real meat of the presentation.
60.63 -> The level 300 details that
you're probably here for.
63.03 -> That's the deployment and the
implementation of the service.
67.83 -> So why do we need egress controls?
69.84 -> Well,
72.481 -> take the example of the Log4j
exploit that was recently
76.44 -> in the wild.
77.64 -> A difficult one to mitigate,
81.06 -> but part of that exploit Log4j
83.352 -> after a host is attacked
86.76 -> is it has to reach back out
88.56 -> to download a payload and
then execute that code.
91.89 -> Egress controls can mitigate
93.656 -> and protect against that sort of attack.
97.29 -> Also, many command and
control frameworks depend on
101.04 -> phoning home, outbound
communication in their attack.
105.63 -> So egress controls help
mitigate and protect
108.515 -> against that exploit.
110.61 -> A recent report from Cisco Talos
114.9 -> said that 66% of ransomware attacks
117.57 -> used command and control,
120.09 -> specifically Cobalt Strike.
123.27 -> And so egress controls are
an important way to protect
126.96 -> against that attack.
129.249 -> AWS network firewall, what is it?
131.13 -> It's AWS's managed service
that makes it easy to deploy
135.32 -> network protection for your VPCs.
139.117 -> It's easy to set up,
it's just a few clicks.
141.87 -> It has fine-grained access
control where you can create
144.36 -> rules or import rules
in open source formats.
147.81 -> And also it allows you to set
policy and then apply that
152.01 -> across all of your VPCs and accounts.
155.37 -> It's built for the cloud.
156.54 -> What does that mean?
157.373 -> It scales.
158.206 -> So as your traffic increases,
159.596 -> the performance and the ability
162 -> of network firewall keeps up with that.
164.61 -> Supports hundreds of
thousands of connections,
166.8 -> has HA built in.
168.36 -> You don't need to size
instances or do any patching or
171.57 -> anything like that.
172.65 -> Again, flexible rule engine.
174.496 -> And it has compatibility
with a number of formats
178.32 -> that are common.
181.23 -> So how do you use network firewall?
183.54 -> Simple, log into the VPC
portion of the AWS console.
187.89 -> You create a new firewall there.
190.47 -> You can then create rules,
192.3 -> associate those with policies,
194.296 -> deploy an endpoint,
196.86 -> and then you're protected.
198.275 -> Then you can also use
201.447 -> AWS Firewall Manager to then
control multiple endpoints
204.63 -> across your entire organization
206.19 -> so you can have consistent policy.
207.99 -> Remember, AWS takes care of the patching,
210.27 -> they take care of the maintenance.
212.25 -> You don't need to worry about any of that.
214.71 -> It's all part of the service.
218.07 -> A closer look at some of the
more fine-grain features here
221.16 -> in a modern firewall that
AWS network firewall is,
224.22 -> you get things like domain filtering,
227.04 -> Suricata-compatible rule sets,
230.16 -> IPS/IDS
231.54 -> functionality,
233.047 -> but also you're gonna get visibility,
235.14 -> meaning you're gonna see all
of the flow logs if you want.
239.19 -> All of your rules and alerts
can get piped into CloudWatch,
242.699 -> Kinesis, S3, and then the
possibilities are endless there.
246.42 -> How you want to visualize or
report or do threat hunting
249.9 -> inside of that data.
254.07 -> What are some of the use cases we see?
255.48 -> Well, the first swim lane you see there
257.58 -> is the one you're probably here to listen to.
261.09 -> That is egress security,
262.8 -> and Kevin and...
265.956 -> - [Houston] Houston.
267.254 -> - Our folks here from RobinHood are gonna
270.9 -> go into that deeper.
273.191 -> Also, we have East West protection,
275.49 -> so VPC to VPC protection.
278.662 -> Network firewall helps you there.
280.8 -> And also again, the IDS
and IPS functionality
283.943 -> is built right in.
287.22 -> So I'm gonna hand things
off now to Houston.
289.35 -> - [Houston] Alright.
291.12 -> - Thank you, Kevin.
291.953 -> - [Graham] Sorry, Kevin.
293.22 -> - Hi, I'm Houston Hopkins.
294.6 -> I work on the RobinHood security team,
296.28 -> along with my co-presenter here, Kevin.
298.289 -> We're delighted to be here today
to talk to you a little bit
300.69 -> about our network firewall journey,
303.51 -> which actually started a
little less than a year ago.
306.81 -> We were looking for ways to
really aggregate and dig deep
310.11 -> into our egress controls
311.61 -> as well as get better
visibility on aggregate
313.74 -> of everything that was
happening inside of RobinHood.
316.08 -> So...
318.9 -> to speak about the problem.
320.04 -> So,
322.83 -> generally, our egress
controls were tightly coupled
326.1 -> with the applications that we run.
328.027 -> That's like, we use security groups,
329.73 -> we use NACLs and it's,
332.273 -> and we love those things.
334.83 -> However, we needed an aggregate view.
336.75 -> So if perhaps there was
a control that was missed
339.84 -> or we wanted to see what egress traffic
341.67 -> was still being allowed,
342.54 -> we needed something deeper.
344.273 -> We needed something that
would complement our current
346.254 -> cloud capabilities, so
something we're familiar with.
350.01 -> And we like the guardrail approach.
352.05 -> So ways to, to get a lot of value early on
355.08 -> and then get really deep later.
356.645 -> Most importantly, we didn't
want to break production
359.4 -> and we wanted to keep our engineers happy.
364.8 -> So what we were looking for,
367.71 -> most importantly, it's egress control.
369.03 -> So we were looking at north south traffic
370.89 -> going up and out to the internet.
372.24 -> That was the primary thing
we were, we were looking for.
375.376 -> We needed a flexible rules engine.
377.28 -> I think that's mentioned several times
379.23 -> in what Graham just said,
380.063 -> but the flexible rules engine
made a big difference for us.
382.02 -> We needed something we could use,
383.04 -> something we could implement.
386.16 -> We were interested in deep
packet inspection and packet
388.29 -> inspection in general,
389.76 -> what we could do there to
inspect TLS, SNI, et cetera,
393.63 -> which actually we needed
something to see domains.
398.67 -> We needed it to be reliable.
399.93 -> We trust AWS to be reliable.
401.989 -> We're excited about the
cellular architecture.
404.22 -> We were familiar with it from other
405.584 -> AWS products that we used.
407.568 -> And we enjoyed the fact that
there was a managed portion
410.55 -> of this infrastructure to
keep it up and running,
412.74 -> but also managed rule sets
that we could learn from,
415.17 -> use, and adopt.
416.944 -> Finally, it needed to be scalable.
419.1 -> We mentioned pay as you go several times,
422.286 -> we like to build things up
as fast as we tear them down
424.5 -> if we need to,
425.7 -> with very little,
429.3 -> very little reason, very
little overhead to not do that.
432.45 -> And we use infrastructure as code.
434.34 -> So we needed something
that worked in Terraform,
436.23 -> which is what we use.
440.4 -> So you may be asking, do
I need a network firewall?
442.95 -> I can't exactly answer
that question for you,
444.9 -> but I can give you a little bit
446.04 -> about how we went about this.
448.045 -> So if you look on the right,
449.55 -> there's traditional VPC
controls and they're great.
452.336 -> We don't plan to replace them,
they're not going anywhere.
455.52 -> But we did need something that was
456.78 -> more of an aggregate view.
457.86 -> So the things to really focus in here,
459.33 -> there's a lot of text on this
one is domain level filtering.
461.97 -> We need something to see
domain names, DNS names.
464.603 -> We also liked that the
467.855 -> quotas and service limits
were clearly documented.
470.52 -> We had a great idea on
how we wanted to build it,
472.32 -> where it would scale,
476.13 -> and we liked the managed
capabilities, of course.
481.62 -> So this is our approach
482.627 -> and it's quite simple in the beginning.
484.869 -> Let's get those firewalls deployed,
488.01 -> then we're gonna implement some telemetry
489.45 -> to see what we're learning
and analyze that data,
492.63 -> put 'em into dashboards
and see what we find.
494.58 -> And, surprise, we found
things very early on,
497.25 -> very, very nice areas for improvement.
499.95 -> Secondly, we really had to
focus on blocking and tackling
503.34 -> as we,
506.07 -> now that we have this capability,
507.51 -> we need to know how and
when to use it effectively.
509.82 -> We needed people to know how to react
512.19 -> when we had an emergency
514.23 -> and who was gonna be
in those approval chains.
516.84 -> I say do this early before you need it.
519.353 -> Finally, as we progress on and
we're still on this journey,
522.917 -> you know, denying known bad is great,
525.39 -> it's a great first place to start,
527.07 -> but mine your data to learn
how to build an allow list and
529.89 -> look more towards a
positive security model.
532.825 -> And there will be things that don't fit
535.08 -> in your positive security model.
536.55 -> This helps us identify better
ways to isolate that traffic.
540.57 -> In short, it's really helping
us keep tabs on our quest
543.78 -> to have a really great data perimeter.
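The move Houston describes, from denying known bad toward an allow list mined from observed traffic, can be sketched roughly as follows. This is an illustrative Python sketch and not RobinHood's tooling: the event shape (a dict with an `sni` key), the `min_hits` threshold, the function names, and the sample domains are all assumptions made for the example.

```python
from collections import Counter


def propose_allow_list(events, min_hits=3):
    """Return domains seen at least `min_hits` times, as allow-list candidates.

    `events` is a hypothetical list of dicts, each with an "sni" key holding
    the domain observed in the TLS handshake. Candidates still need review.
    """
    counts = Counter(e["sni"].lower() for e in events if e.get("sni"))
    return sorted(d for d, n in counts.items() if n >= min_hits)


def is_allowed(domain, allow_list):
    """True if `domain` equals an allowed entry or is a subdomain of one."""
    domain = domain.lower().rstrip(".")
    return any(domain == a or domain.endswith("." + a) for a in allow_list)


# Made-up sample traffic, for illustration only.
sample_events = (
    [{"sni": "api.example.com"}] * 5
    + [{"sni": "updates.example.org"}] * 3
    + [{"sni": "rare.example.net"}]
)
allow_list = propose_allow_list(sample_events)
```

Domains that fall below the threshold are the ones that "don't fit" the positive model and would be candidates for review or isolation rather than automatic blocking.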
547.119 -> At this point I'm gonna
hand it off to Kevin.
549.27 -> - [Kevin] Thank you.
552.3 -> Hello, once again.
553.435 -> My name is Kevin Park and I'm
a security software engineer
556.5 -> at RobinHood.
557.73 -> I had the opportunity to
implement the data plane
560.43 -> of the AWS Network Firewall at RobinHood
563.01 -> throughout various environments
564.69 -> for the purpose of egress control.
567.66 -> And I'm here to provide some guidance,
569.349 -> share technical details
572.473 -> and include easy mistakes
574.26 -> I have learned throughout
the process of implementing
576.219 -> AWS network firewall.
577.92 -> And the best part is that I was able to
580.14 -> insert the firewall into the
live production environment
582.84 -> with zero downtime.
586.89 -> Before diving in,
588.03 -> I want to share some of
the implementation goals
590.46 -> that we set out to achieve
with the network firewall.
593.52 -> First, we want to capture and
monitor all egress traffic.
597.75 -> By egress traffic we mean
any network traffic that
600.93 -> originates within the
RobinHood's internal environment
603.33 -> and traverses out to the public internet.
606.265 -> We not only want to see which IP address
609.09 -> is reaching out to which port,
611.248 -> but also be provided with useful
information that can aid us
615.72 -> in implementing proactive
controls and hopefully help us in
619.47 -> the investigations down
the road if necessary.
622.56 -> And second, if we have bad traffic,
625.05 -> we want to be able to block immediately.
627.781 -> But most importantly,
629.28 -> the firewall must not
introduce any new bottlenecks
632.634 -> in the production environment
that the engineers
635.76 -> now have to deal with.
638.192 -> Now we want the firewall to be,
like, completely transparent
642.96 -> for those engineers who build, test,
645.45 -> and deploy the services.
647.205 -> But I also do wanna explicitly
mention that we are here to
651.109 -> capture and monitor
all the egress traffic,
653.7 -> but not necessarily all
the ingress traffic.
655.92 -> So if you see any diagrams down the road
658.521 -> where you go, hmm,
660.81 -> that does not capture all the
662.361 -> ingress coming into the RobinHood system.
666.12 -> This is the reason why.
670.47 -> Let's go over a couple
of high-level options
673.11 -> that we explored.
676.89 -> First is the centralized model.
679.53 -> In here we have AWS transit gateway
682.437 -> and a security VPC on the right.
685.633 -> The security VPC is where we host
689.43 -> one network firewall,
690.81 -> and it has the NAT gateway going to
694.23 -> the internet gateway.
696.33 -> So if we have one or more VPC instances,
700.17 -> all of 'em would
701.64 -> connect through the transit
gateway through the TGW ENI
706.35 -> and route to the firewall
and to the public internet.
711.3 -> Now the benefit of this is that it's
713.31 -> one single firewall deployment.
715.59 -> This comes with
the benefit that it's a
717.593 -> single point of management,
720.99 -> and since it's one firewall,
723.12 -> you only have to deploy firewall
rules to the single one.
726.39 -> And for smaller companies,
that's a bonus
730.075 -> if you don't have high
traffic going through,
731.97 -> and you can route multiple environments,
734.16 -> VPCs to this one single firewall.
738 -> But at the same time,
this can be a downside
741.03 -> if you want a little more reliability,
743.82 -> this becomes a single choke point.
745.74 -> So if you have a misconfiguration
in the transit gateway
749.52 -> while maintaining it,
750.57 -> this can become a single
point of failure where
755.024 -> it can impact all the VPCs
that are routing through
758.19 -> this one single network firewall.
764.58 -> Now, the second option
is the distributed model.
767.76 -> In this model,
770.52 -> the firewall is deployed in place
772.29 -> where the infrastructure already exists.
775.483 -> In the example here, we have
779.58 -> two Kubernetes clusters in a VPC
782.267 -> spanning across two availability zones.
785.19 -> In here, there are four
total NAT gateways,
789.27 -> two for each cluster,
791.85 -> one for each availability zone.
794.43 -> And for each NAT gateway there is a
797.1 -> network firewall endpoint paired with it
800.229 -> giving us a one-to-one scaling.
803.4 -> And with this, with
this distributed model,
807.18 -> you'll have multiple
firewalls, with which you can do
811.62 -> a phased rollout of the firewall rules.
814.41 -> And so if you want to test
the firewall rule in the
817.83 -> development environment first,
819.36 -> you can roll it out, and then
roll it out to production.
822.24 -> Or if you have multiple clusters
823.74 -> in the production environment,
824.85 -> you can test it out in one
smaller production cluster first
827.76 -> and then gradually roll
it out to the next ones,
830.629 -> the bigger ones.
832.38 -> Now the downside is that,
834.57 -> obviously, this has n number of firewalls.
838.05 -> That is, n number of firewalls
840.72 -> corresponding to n number of environments.
843.48 -> So if you have, like,
five Kubernetes clusters
846.63 -> in one VPC and two in another,
848.67 -> that means that you'll
have to deal and manage
850.68 -> seven total firewalls.
853.23 -> So if you're more of a smaller company,
854.905 -> this might not be the approach
858.01 -> as you might need a playbook automation
860.76 -> to make sure that you have
all the firewall rules
863.37 -> that you want in place and
you're not missing in one
865.86 -> or the other, because
868.29 -> you can lose track of it
870.03 -> as the number of firewalls grows.
873.45 -> But for RobinHood,
874.71 -> we decided to go with
the distributed model.
877.487 -> The reasoning behind it is
primarily scaling needs.
882.09 -> NAT gateway has a max throughput
of 45 gigabits per second
886.32 -> and each network firewall
endpoint has a max throughput of
889.65 -> a hundred gigabits per second.
892.664 -> Here at RobinHood,
895.35 -> we almost maxed out that 45
gigabits per second
898.02 -> NAT gateway limit when we had
900.66 -> multiple production clusters
sharing a single NAT gateway
904.14 -> per AZ, of course.
905.91 -> Now because of that we have
split out the NAT gateways
909.69 -> to each Kubernetes cluster
912.3 -> and with the firewall deployment,
914.13 -> we don't want to introduce that
kind of similar bottleneck.
917.19 -> So it was obvious that
919.05 -> we wanted to go with
this distributed model.
921.39 -> But, as I
said, for smaller companies,
924.57 -> the centralized model
might be a better fit.
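The scaling argument above comes down to simple arithmetic, which can be sketched in a few lines of Python. The two ceilings are the figures quoted in the talk; the `headroom` fraction, the function names, and the per-cluster peak values are hypothetical and only there to make the example concrete.

```python
NAT_GATEWAY_MAX_GBPS = 45          # figure quoted in the talk
FIREWALL_ENDPOINT_MAX_GBPS = 100   # figure quoted in the talk


def path_bottleneck_gbps(hop_ceilings):
    """The throughput ceiling of a path is the smallest ceiling of any hop."""
    return min(hop_ceilings)


def clusters_per_nat_gateway(per_cluster_peak_gbps, headroom=0.8):
    """How many clusters can share one NAT gateway while staying under
    `headroom` (an assumed safety fraction) of its ceiling."""
    usable = NAT_GATEWAY_MAX_GBPS * headroom
    return int(usable // per_cluster_peak_gbps)


# With one firewall endpoint per NAT gateway, the NAT gateway stays the limit.
bottleneck = path_bottleneck_gbps(
    [FIREWALL_ENDPOINT_MAX_GBPS, NAT_GATEWAY_MAX_GBPS]
)
```

Under these assumed numbers, a cluster peaking at 20 Gbps leaves room for only one cluster per NAT gateway, which is the situation that pushes toward one NAT gateway, and so one firewall endpoint, per cluster per AZ.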
931.791 -> Before going into how we
deployed the network firewall,
936.591 -> I have to show you guys
the before picture.
939.12 -> So this is an example
942.893 -> of a Kubernetes cluster
944.7 -> spanning across two availability zones,
947.43 -> us-east-1a and us-east-1b.
949.617 -> And at the bottom those
are the route tables
952.23 -> that are associated with the
public and the private subnets.
955.843 -> And the green ones are the public subnets.
959.619 -> It consists of a NAT
gateway per AZ, of course,
962.94 -> and public load balancers.
965.85 -> And on the blue on the right side,
967.56 -> the blue ones are the private subnets
970.733 -> which consists of Kubernetes nodes,
973.688 -> which are just made out
of Amazon EC2 instances.
977.88 -> There's some key details I
want you guys to remember
980.61 -> is that on the public subnets,
982.92 -> the NAT gateway and public load balancer
985.2 -> share the same subnet.
987.992 -> The reasoning I'm telling
you to remember is that
990.87 -> I'll show you how you can make a simple
993.091 -> easy routing mistake.
994.68 -> Because, because of this.
998.1 -> And just going over simple route table
1000.365 -> in the public subnet,
1003.32 -> it's routing directly to the IGW,
1006.08 -> whereas in the private subnets
1008.6 -> it's going through the NAT gateway,
1010.49 -> in the public subnet then
to the internet gateway.
1017.06 -> Now, from that,
1018.35 -> here is how we
deployed the network firewall:
1021.59 -> we inserted the network firewall endpoint
1024.53 -> into each AZ by creating
1027.71 -> a dedicated subnet for
the network firewall
1030.115 -> and having it sandwiched between
1033.38 -> the public and private subnets.
1036.35 -> Now the reasoning
1038.96 -> we put it in there is
that we want to be able to
1041.75 -> see all the egress traffic,
since that's our goal.
1044.51 -> So we want to be able to capture it,
1046.656 -> going from the Kubernetes nodes
1049.85 -> to the NAT gateway and to the internet.
1052.55 -> Now, due to the NAT gateways
and the load balancers
1054.8 -> sharing a subnet, and the firewall sitting
1058.16 -> between the public and private subnets,
1059.78 -> we will
occasionally capture some
1062.54 -> internal-to-internal traffic,
1064.363 -> primarily because public load balancers
1067.222 -> reaching
into the Kubernetes nodes
1071.489 -> will be internal-to-internal traffic.
1075.47 -> But that's okay, because the
firewall endpoint has a much
1079.4 -> higher throughput up to a
hundred gigabits per second.
1082.43 -> So we have the buffer room we need.
1085.835 -> But if our goal is to
primarily just capture
1088.28 -> all the egress traffic,
1089.39 -> why not place it after the NAT gateway
1091.681 -> before the internet gateway?
1094.43 -> Now this would capture all the
egress traffic that we need,
1098.48 -> and it would make the
1100.385 -> routing table much simpler.
1103.25 -> However, placing it after the NAT gateway
1106.82 -> would mask all the Kubernetes
1108.29 -> source IP addresses.
1110.262 -> And for the purpose of firewall,
1111.98 -> we want to be able to see,
like, where it's coming from
1115.76 -> internally, not just the NAT
gateway that it's coming from.
1122.678 -> But one thing that you need
to be very careful about when
1126.32 -> designing the route table is that
1129.86 -> you have to preserve the network symmetry.
1133.034 -> So if you have traffic
1135.89 -> going through the path,
going through the firewall
1137.99 -> in one direction,
1138.98 -> then you have to make sure
that the return traffic
1141.26 -> is also going through
the network firewall.
1144.59 -> If you have traffic
going through the firewall
1147.23 -> in one direction, but not
in the other direction,
1149.78 -> going around it,
1151.37 -> firewall will notice it
and simply, silently,
1153.83 -> drop the packet.
1154.76 -> And if you deploy it as it
is, you'll have a bad day,
1157.37 -> you'll have a production outage and
1160.447 -> yeah, you don't want that.
1161.96 -> So,
1164.12 -> how did we fix it?
1165.53 -> Oh, actually let me first mention this.
1167.54 -> So this is,
1168.56 -> this example right here
is actually a mistake
1170.63 -> I made while designing it
1172.447 -> because I was so focused on making sure
1175.55 -> I was capturing the egress traffic
1177.83 -> going through the NAT gateway
1179.03 -> that I forgot to consider
the public load balancer
1181.37 -> in the equation of the public subnets.
1183.47 -> So, in this route table
1186.05 -> I have the VPC CIDR, in the public subnet,
1189.74 -> pointing to the
network firewall endpoint.
1193.04 -> So, that's in the public subnet route table.
1195.68 -> And in the private subnet route table
1197.72 -> we have 0.0.0.0/0
pointing
1203.14 -> to the network firewall endpoint.
1205.13 -> Obviously this works for the NAT gateway
1206.99 -> and the private subnets,
1208.1 -> but this creates the network asymmetry
1211.73 -> between the public load balancers
1213.5 -> and the Kubernetes nodes
1215.84 -> and you can have a partial outage.
1218.45 -> Now, how do you fix it?
1220.31 -> The solution is as simple as it sounds.
1223.31 -> You introduce more granular routes
1225.41 -> rather than the VPC CIDR,
1228.14 -> the /16, pointing to the
network firewall endpoint.
1231.05 -> Instead, you just define
1235.187 -> the exact subnet CIDR
1236.63 -> that you want to route
through the network firewall.
1239.78 -> So in this case,
1241.64 -> the public subnet
1243.29 -> route table
1244.67 -> will have three additional routes.
1247.689 -> These are the
private subnet CIDR ranges
1252.77 -> in the same
1254 -> availability zone, pointing
to the network firewall.
1256.88 -> And the same thing for the private subnets:
1259.52 -> they will have the public subnet
CIDR in the route table
1263.3 -> pointing to the network firewall.
1265.1 -> Now this
will achieve network symmetry
1268.58 -> between the private subnets
and both the NAT gateway
1273.38 -> and the public load balancers.
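The asymmetry mistake and its fix can be modeled with a few lines of Python using only the standard `ipaddress` module. This is a simplified sketch, not a model of every AWS routing rule: it only looks at the first hop chosen by longest-prefix match in each subnet's route table, and the CIDRs, IP addresses, target names, and function names are all hypothetical. The talk mentions three private subnet routes per AZ; one is used here to keep the example short.

```python
import ipaddress

FIREWALL = "firewall-endpoint"


def next_hop(route_table, dest_ip):
    """Pick the target of the most specific (longest-prefix) matching route."""
    ip = ipaddress.ip_address(dest_ip)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in route_table
        if ip in ipaddress.ip_network(cidr)
    ]
    return max(matches, key=lambda m: m[0].prefixlen)[1]


def is_symmetric(table_a, ip_a, table_b, ip_b):
    """True when A->B and B->A either both use the firewall or both skip it."""
    forward = next_hop(table_a, ip_b) == FIREWALL
    reverse = next_hop(table_b, ip_a) == FIREWALL
    return forward == reverse


# Hypothetical addressing: one VPC /16, one public and one private subnet.
VPC, PUBLIC, PRIVATE = "10.0.0.0/16", "10.0.0.0/24", "10.0.1.0/24"
LOAD_BALANCER_IP, NODE_IP = "10.0.0.10", "10.0.1.20"

# The mistake: the whole VPC CIDR in the public table points at the firewall,
# but the private table only sends 0.0.0.0/0 there.
broken_public = [(VPC, FIREWALL), ("0.0.0.0/0", "igw")]
broken_private = [(VPC, "local"), ("0.0.0.0/0", FIREWALL)]

# The fix: each side names the other side's subnet CIDR explicitly.
fixed_public = [(VPC, "local"), (PRIVATE, FIREWALL), ("0.0.0.0/0", "igw")]
fixed_private = [(VPC, "local"), (PUBLIC, FIREWALL), ("0.0.0.0/0", FIREWALL)]

broken_ok = is_symmetric(broken_public, LOAD_BALANCER_IP, broken_private, NODE_IP)
fixed_ok = is_symmetric(fixed_public, LOAD_BALANCER_IP, fixed_private, NODE_IP)
```

In the broken tables, the load balancer reaches the node through the firewall, but the node's reply matches the more specific local route and goes around it. A check like this could run against planned route tables before any association is switched.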
1279.02 -> Now we have the architecture defined.
1281.99 -> Now how do we deploy this systematically?
1284.72 -> RobinHood has, obviously,
like, a lot of environments
1287.42 -> and we want to be able
to do this same thing
1290.69 -> over and over again without mistakes.
1293.982 -> To do that we used Terraform.
1296.93 -> We use Terraform widely
across RobinHood already.
1299.75 -> So it was a no-brainer
just to use this and
1304.91 -> to deploy the firewall
and the way I did it,
1307.61 -> I just created a single
overarching network firewall
1311.06 -> terraform module that will do this for me.
1313.88 -> So this firewall module
will automatically create
1316.7 -> the subnets for the firewall endpoints
1320.51 -> and the firewall obviously and,
1322.91 -> and the new route tables
for the public, private,
1325.34 -> and the firewall subnets
1327.02 -> and populate them with the proper routes
1331.37 -> routing through the NAT gateway
1332.72 -> and through the firewall
and also configure logging
1336.71 -> and where to send those logs to.
1339.98 -> In addition, I also added a bonus of just
1342.26 -> the firewall enabled Boolean
flag, just true and false.
1346.37 -> So I can just set that to
true or false for whether to
1349.7 -> create it or not in the VPC environment
1351.68 -> or the Kubernetes cluster environment.
1353.21 -> Network firewall
is pay for what you use,
1357.38 -> so if you don't want to
create a firewall in, say,
1359.81 -> a small test environment
1361.4 -> that you're gonna keep internal,
1362.9 -> you don't need a firewall
1364.25 -> and so you can just have the flag
1368.09 -> to have it or not.
1370.697 -> But one thing to note is that,
1373.55 -> why am I creating a
new set of route tables
1375.56 -> for both public and private subnets?
1377.45 -> As I mentioned, this
environment already exists,
1379.91 -> but I'm creating a new set of route tables
1383.06 -> for the public and private subnet
1385.34 -> rather than just modifying
the existing ones.
1388.58 -> Why did I duplicate the route tables?
1393.77 -> Well, it's because, first,
inserting a firewall
1396.83 -> into a live production system is tricky.
1399.44 -> As I showed in the example,
1402.17 -> one mistake in the route can
bring production down,
1405.56 -> with a partial outage or a full one.
1408.86 -> How I solved this
is that I create all the
1411.155 -> necessary firewall-related
resources in advance.
1413.95 -> This includes the route tables:
1416.126 -> I duplicate all the
routes in the route tables
1419.72 -> and add the additional
1422.66 -> firewall integration routes.
1425.81 -> This allows us to verify
the routes, like, once more
1429.32 -> and all the necessary components
1432.62 -> before taking the system live.
1435.92 -> And then once we verify it
1438.44 -> and everything's correct,
1440.27 -> through a
script programmatically,
1443.411 -> I have another Boolean variable called
1446.51 -> firewall route enable,
1448.76 -> which you can set to true
1450.56 -> and just do a single terraform apply.
1453.32 -> This will switch the
route table associations
1456.361 -> for the public and private
1459.401 -> subnets
1460.234 -> to the newly duplicated
route tables in an instant.
1465.38 -> Now this is the trick that I used to
1468.68 -> bring the firewall live
with zero downtime.
1474.05 -> And we also replicated it today.
1477.246 -> (Kevin laughs)
1478.402 -> - [Houston] Quick, quick note on that is
1480.924 -> when you're troubleshooting
and you want to
1482.18 -> remove the firewall from the equation,
1483.77 -> it's a quick way to turn it
off and turn it back on when,
1486.59 -> when you've made everybody
happy that it's not your fault.
1489.47 -> So. (laughs)
1492.11 -> - Yeah, and yeah, on that
note, this Boolean variable,
1497.06 -> the firewall route enable also
doubles as a fail safe flag.
1502.16 -> So imagine a rare outage scenario
1505.34 -> where the AWS shared cell
hosting the network firewall
1507.89 -> endpoint has a disruption, goes down.
1512.06 -> Now we can use that same firewall
1515.21 -> route enable flag, set it to false,
1517.79 -> and do a single Terraform apply again.
1519.56 -> And this will flip the
route table association
1522.47 -> back to the original
1524.51 -> obviously, in a matter of seconds.
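The cutover and fail-safe behavior of that flag can be sketched in Python as a pure function from the flag to the desired subnet associations. This is only a model of the idea, not the Terraform module from the talk: the subnet and route table names are hypothetical, and `plan_associations` is an invented helper.

```python
def plan_associations(subnets, firewall_route_enable):
    """Map each subnet to the route table it should be associated with.

    `subnets` maps a subnet name to a pair (original_table, firewall_table).
    With the flag off, every subnet uses its original table; with it on, every
    subnet uses the pre-built duplicate that carries the firewall routes.
    """
    index = 1 if firewall_route_enable else 0
    return {name: tables[index] for name, tables in subnets.items()}


# Hypothetical names for one availability zone.
subnets = {
    "public-a": ("rtb-public-a", "rtb-public-a-fw"),
    "private-a": ("rtb-private-a", "rtb-private-a-fw"),
}

cutover = plan_associations(subnets, firewall_route_enable=True)
rollback = plan_associations(subnets, firewall_route_enable=False)
```

Because both sets of route tables already exist, flipping the flag only changes associations, which is why the same switch works for both bringing the firewall live and taking it out of the path.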
1528.341 -> Now, but you might also
be wondering, like,
1532.074 -> how do we make sure that
1535.25 -> the route table that's not being used
1537.44 -> has its routes maintained?
1540.32 -> The way I did it is through Terraform:
1543.65 -> since it's the same infrastructure as code,
1545.365 -> I just have it set so that
1548.57 -> if you try to add a
route to one, one of 'em,
1551.21 -> then it'll automatically
add it to the other.
1553.04 -> So it'll always maintain
its congruence even
1555.59 -> if it's not being used. It's
just that one has
1558.336 -> additional firewall
integration routes added to it.
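One way to picture that congruence is to build both tables from a single list of shared routes and then check for drift. The following Python sketch assumes route tables can be modeled as dicts from destination CIDR to target; the route values and both helper names are hypothetical and do not come from the talk.

```python
def build_tables(shared_routes, firewall_routes):
    """Build the original table and its firewall duplicate from one source.

    The duplicate is the shared routes overlaid with the firewall routes, so a
    route added to `shared_routes` shows up in both tables.
    """
    original = dict(shared_routes)
    duplicate = {**shared_routes, **firewall_routes}
    return original, duplicate


def drift(original, duplicate, firewall_routes):
    """List destinations where the two tables disagree for a reason other
    than the firewall integration routes."""
    problems = []
    for cidr, target in original.items():
        if cidr not in firewall_routes and duplicate.get(cidr) != target:
            problems.append(cidr)
    for cidr in duplicate:
        if cidr not in original and cidr not in firewall_routes:
            problems.append(cidr)
    return sorted(problems)


# Hypothetical private-subnet routes.
shared = {"10.0.0.0/16": "local", "0.0.0.0/0": "nat-a", "10.50.0.0/16": "pcx-peer"}
fw_routes = {"0.0.0.0/0": "firewall-endpoint-a", "10.0.0.0/24": "firewall-endpoint-a"}

original, duplicate = build_tables(shared, fw_routes)
clean = drift(original, duplicate, fw_routes)

# Simulate a route added by hand to only one of the tables.
original["10.60.0.0/16"] = "tgw"
after_manual_edit = drift(original, duplicate, fw_routes)
```

A drift check like this is most useful right before flipping the association flag, since the unused table is the one nobody is watching.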
1564.62 -> Now moving on to monitoring and alerting:
1566.24 -> deployment is the
first part of it but
1568.234 -> keeping it up and running
is the second part of it
1572.81 -> and continuous part of it.
1574.85 -> Now, AWS CloudWatch
provides all the necessary
1579.552 -> operational metrics from
the network firewall
1583.01 -> and from that we have
1585.53 -> two
1586.73 -> very important alerts
1589.7 -> generated from it.
1590.54 -> And the first one is just a basic
1592.25 -> low firewall traffic alert.
1594.59 -> This fires if
there's close to zero
1599.39 -> network traffic going
through the firewall.
1601.43 -> This can happen if the firewall
is just simply disabled
1604.452 -> or an engineer accidentally
1606.53 -> or deliberately disabled
1609.32 -> the firewall to bypass the
egress control mechanism.
1612.65 -> If that happens,
this alert fires.
1616.25 -> And the second one, which
is even more important,
1619.04 -> is the firewall dropped packets warning.
1623.15 -> This
triggers when there are
1626.87 -> packets being actively
dropped by the firewall.
1629.48 -> This can occur
if the firewall endpoint
1633.53 -> is hitting the bandwidth limits,
1634.82 -> which is pretty unlikely because
1636.77 -> it has a hundred gigabits
per second throughput,
1640.25 -> which is more than
double the NAT gateway.
1643.85 -> So this is unlikely.
1645.5 -> And the second scenario is that
1648.5 -> there's some firewall rule
that is being triggered
1652.76 -> to drop the packets.
1654.05 -> Usually this will mean
1655.01 -> that second
scenario, because usually
1657.83 -> you have the firewall rule to drop packets
1659.96 -> if it's malicious traffic.
1661.97 -> So if this fires, we
would get an alert on Slack
1666.44 -> just as above but also be
paged on our phones directly
1669.735 -> 'cause we would want to hop
in and investigate right away.
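The two alerts and their different urgency can be sketched as a small evaluation function. This Python sketch assumes the packet counts for an evaluation window have already been fetched from the metrics backend; the alert names, the `low_traffic_floor` value, and the channel labels are hypothetical.

```python
def evaluate_alerts(received_packets, dropped_packets, low_traffic_floor=100):
    """Return (alert_name, channels) pairs for one evaluation window.

    Low traffic suggests the firewall was bypassed or disabled and goes to
    chat only. Any dropped packets go to chat and also page the on-call.
    """
    alerts = []
    if received_packets < low_traffic_floor:
        alerts.append(("low-firewall-traffic", ["slack"]))
    if dropped_packets > 0:
        alerts.append(("firewall-dropped-packets", ["slack", "pager"]))
    return alerts


healthy = evaluate_alerts(received_packets=50_000, dropped_packets=0)
bypassed = evaluate_alerts(received_packets=3, dropped_packets=0)
blocking = evaluate_alerts(received_packets=50_000, dropped_packets=12)
```

The floor would need tuning per environment, since a quiet test cluster and a busy production cluster have very different ideas of "close to zero".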
1676.49 -> And then onto logging and visibility.
1680.679 -> So the example here is
1683.75 -> just an event that I pulled from a
1687.86 -> network firewall.
1689.27 -> And here obviously you can see
1692.15 -> the fields that would be present
1694.31 -> in the average VPC flow logs.
1697.021 -> You see source IP, source
port, destination IP,
1701.244 -> and destination ports.
1703.58 -> Like, this is all useful.
1704.99 -> This already comes from flow log,
1706.55 -> but what makes the network
firewall logs different
1709.981 -> is that it comes with
additional information
1712.46 -> like the TLS fingerprint,
1714.32 -> the SNI
domain, and the TLS version.
1718.661 -> And in other cases it
also comes with JA3
1722.15 -> fingerprint hashes
where you can reverse
1725.53 -> stack rank to find the outliers
1730.73 -> as a threat hunting exercise
1732.62 -> and start an investigation from there.
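Reverse stack ranking simply means counting how often each value appears and reading the list from the rarest end. A minimal Python sketch follows; the event shape and the `ja3_hash` field name are assumptions for illustration (real log events nest these fields differently), and the hash values are made up.

```python
from collections import Counter


def reverse_stack_rank(events, field="ja3_hash", top=3):
    """Return the `top` least common values of `field` as (value, count) pairs."""
    counts = Counter(e[field] for e in events if e.get(field))
    # Sort by ascending count, then by value so ties come out in a stable order.
    return sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))[:top]


# Made-up sample: two common fingerprints and two rare ones.
sample = (
    [{"ja3_hash": "aaa111"}] * 40
    + [{"ja3_hash": "bbb222"}] * 25
    + [{"ja3_hash": "ccc333"}]
    + [{"ja3_hash": "ddd444"}] * 2
)
rarest = reverse_stack_rank(sample, top=2)
```

The rare fingerprints are not necessarily malicious; they are just the cheapest place to start looking, since common client libraries tend to dominate the top of the ranking.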
1738.407 -> And I'll be handing off to
Houston to provide more insight.
1741.8 -> - Yep, sure.
1743.12 -> Thanks Kevin.
1744.465 -> Just a quick note,
1745.52 -> I know I mentioned
dashboards a few slides ago.
1747.38 -> Kevin's overwhelmed us with
awesome material since then.
1750.53 -> So this is just a quick,
1753.08 -> it's really a mock-up of
one of our early dashboards
1756.32 -> showing what we got
immediate visibility on
1759.5 -> by looking at domains,
1760.94 -> which was a nice added feature
of using the network firewall
1763.601 -> and a couple of things that
we looked at right away.
1766.31 -> So first of all, how often
was RobinHood talking to AWS