AWS re:Inforce 2022 - Deploying AWS Network Firewall at scale: athenahealth's journey (NIS308)
Aug 16, 2023
When the Log4j vulnerability became known in December 2021, athenahealth made the decision to increase their cloud security posture by adding AWS Network Firewall to over 100 accounts and 230 VPCs. Join this session to learn about their initial deployment of a distributed architecture and how they were able to reduce their costs by approximately two-thirds by moving to a centralized model. The session also covers firewall policy creation, optimization, and management at scale. The session is aimed at architects and leaders focused on network and perimeter security that are interested in deploying AWS Network Firewall.
Content
0.63 -> - All right, hello,
everybody, and welcome.
2.97 -> Sorry if I startled you.
5.07 -> Thanks for coming to our session
at the end of the day here.
9.06 -> We're gonna be talking about
10.47 -> deploying AWS Network Firewall at scale,
13.59 -> and we're using a real world example.
15.18 -> So my name is Dave Desroches.
16.62 -> I am an AWS solutions architect,
19.2 -> and I'm joined by Aaron
Baer and Mike McGinnis
22.68 -> who are coming from Athenahealth,
24.15 -> which is a customer that I support.
27.06 -> The origin of this came from a
29.85 -> deployment that they put in
place as a direct result of,
34.65 -> well, some things that they needed to do.
37.38 -> And I don't wanna steal their thunder,
38.85 -> so I'm gonna leave it to
them to tell the story,
40.83 -> but I thought that this would
make a really good story
43.23 -> for other customers to hear,
45.09 -> because it talks about some of the kind of
47.22 -> trials and tribulations
that they went through
49.65 -> as they went from basically
not having anything in place
52.56 -> to having this up in production
55.32 -> over a very short period of time.
57.18 -> So I think it's a good story.
60.27 -> I'm gonna start us out by
just doing a quick level set.
62.88 -> So I'm gonna go through
just a really quick
67.32 -> high level overview of network firewall,
70.11 -> and we'll talk a little bit about
71.25 -> a couple of the deployment
options that can go into place.
73.77 -> This is a 300 level session,
76.23 -> so we'll be getting into the weeds
77.73 -> of how these deployments are done.
80.43 -> Then I'm gonna hand it off to Aaron.
81.66 -> We'll go through the movement
from a centralized deployment
86.43 -> to a...
87.78 -> Sorry, from a distributed deployment
89.31 -> to a centralized deployment
in their environment.
92.07 -> And then finally, we'll go
through the policy and rules
94.65 -> that were put into place and
how that evolved over time.
98.37 -> So a little bit about
network firewall itself.
101.76 -> This is a managed firewall.
103.83 -> It kind of does what it says on the tin.
106.14 -> It's something where AWS
manages the firewall for you,
109.44 -> and then you leverage that firewall
111.54 -> by putting things like
policies, rules in place
114.23 -> in your environment.
115.95 -> This works for both north-south
117.66 -> and east-west kinds of deployments.
119.97 -> And you can choose if it's gonna be
121.2 -> something that is symmetric
122.91 -> or egress-only kind of deployment as well.
127.17 -> It scales automatically,
128.49 -> so it has an auto scaling
mechanism under the covers.
131.88 -> It's very reliable.
133.38 -> It has a very flexible
rules engine as you'll see
136.59 -> when Mike goes through his
section of the presentation.
139.92 -> It, like most things at AWS, is a
143.4 -> no upfront commitment kind of thing.
145.23 -> You can put it in place, try it out,
146.7 -> and use whatever you need to for it.
153.09 -> So from your perspective as a customer,
155.55 -> you are responsible for doing things
158.01 -> that make sense for your business.
159.54 -> You're gonna be putting
the policy in place,
161.58 -> be establishing the
rule sets and so forth,
164.07 -> and you'll be defining what
the topology looks like,
166.29 -> what the architecture looks like,
168.09 -> where you have this thing
in your environment,
169.83 -> what kinds of traffic it's
going to look at and so forth.
173.55 -> Optionally, you can either
deploy it on its own,
175.5 -> or you can manage it
using Firewall Manager.
178.47 -> And then from an AWS side, we
are basically responsible for
182.34 -> handling the scaling of
things under the covers,
185.55 -> making sure that you've got
throughput and performance
187.89 -> that is appropriate for what you're doing
191.01 -> and handling things like zonal affinity.
193.38 -> So if you are deploying in a model
196.02 -> where you're in a bunch of AZs,
198.06 -> making sure that the traffic goes
199.22 -> to the right place for the right things.
203.85 -> There are two main
categories of deployments.
209.04 -> There's distributed and centralized.
211.41 -> In a distributed deployment,
212.79 -> you basically have firewall endpoints
214.89 -> that are placed into each of the VPCs,
217.5 -> in fact, into each of
the AZs in those VPCs.
222 -> And those firewall endpoints are
223.65 -> something that you
point routing tables to.
225.57 -> So the traffic gets directed
through that network firewall,
229.95 -> and it comes back and then goes to
231.42 -> whatever destination you're going to,
233.49 -> be it another VPC or an
internet gateway or whatever.
238.77 -> So in a distributed model,
you've got a lot of endpoints.
241.5 -> They all kind of live
all over your environment
243.54 -> based on what you're trying to do.
245.46 -> In a centralized deployment,
246.87 -> you're folding those things
together into an inspection VPC,
251.04 -> and that VPC hangs off
of a transit gateway.
254.49 -> So all of the other VPCs
that you want to send
257.04 -> through that inspection
VPC for inspection,
260.04 -> kind of does what it says,
262.47 -> gets sent off the transit
gateway to that VPC,
265.2 -> goes to the firewall,
266.22 -> runs through whichever rules make sense,
267.9 -> and then it comes back out
269.31 -> and goes either out to the internet
270.63 -> or to another VPC in that environment.
273.06 -> So those are the kind
of ground rules here.
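As a rough CloudFormation sketch of that routing difference (parameter names and values here are illustrative, not from the session), the same default route targets either an in-VPC firewall endpoint or the transit gateway:

AWSTemplateFormatVersion: "2010-09-09"
Parameters:
  PrivateRouteTableId:
    Type: String   # route table of a workload private subnet
  FirewallEndpointId:
    Type: String   # vpce-... firewall endpoint in the same AZ (distributed model)
Resources:
  # Distributed model: internet-bound traffic goes to the in-VPC firewall endpoint.
  DistributedDefaultRoute:
    Type: AWS::EC2::Route
    Properties:
      RouteTableId: !Ref PrivateRouteTableId
      DestinationCidrBlock: 0.0.0.0/0
      VpcEndpointId: !Ref FirewallEndpointId
  # Centralized model: the same route would instead set
  #   TransitGatewayId: tgw-...
  # so traffic is forwarded to the inspection VPC hanging off the transit gateway.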
275.88 -> So I'm gonna hand it off now to Aaron,
277.68 -> and Aaron is gonna talk a little bit about
279.06 -> an actual deployment.
286.11 -> - Hi, welcome, everyone.
288 -> We are from Athenahealth.
290.7 -> And at Athenahealth,
we are helping to shape
294.45 -> the future of healthcare.
297.6 -> And we're doing that.
299.1 -> We partner with healthcare
organizations of all sizes
303.09 -> and provide them with technology,
305.46 -> and insights, and expertise
309.57 -> to help them drive better
clinical and financial results.
316.11 -> I'm Aaron Baer as mentioned.
319.08 -> I am a principal member of our staff
322.98 -> and team lead for our
AWS infrastructure team.
328.17 -> And my goal here is to get into some
331.62 -> highly technical details of our journey
334.05 -> from a distributed deployment of
338.82 -> AWS Firewall to our now centralized
343.71 -> deployment.
344.543 -> So how can...
347.61 -> And a big part of this is
how a process like this
350.91 -> can drive your costs lower,
in fact, in many aspects.
356.64 -> So we're deploying a distributed firewall
359.43 -> to each VPC initially
362.13 -> and move to a deployment model
365.25 -> where our inspection is centralized.
370.17 -> So where did this start?
371.25 -> So this is about a
four-month journey for us.
375.36 -> And it started in December with Log4j.
378.113 -> I think a lot of people
had a lot of things
380.4 -> that started in December because of Log4j.
385.53 -> And middle of December,
387.87 -> I think I was just
about to go on vacation.
390.78 -> And on a Thursday,
394.41 -> I'm pulled into our calls.
396.48 -> We're discussing all of these things.
398.577 -> And it became quickly apparent
400.38 -> that we needed a really rapid solution
403.44 -> to get some mitigation in place.
406.47 -> Up until this event,
411.27 -> all of our VPCs were deployed
413.58 -> in a fairly standard model
where we used NAT egress
417.33 -> out of the internet gateway of a VPC,
420.72 -> and there was no need for
inspection at the time,
424.5 -> and everything was basically
in and out of a VPC
427.92 -> within an account.
430.92 -> So it was also definitely
discussed and stated
435.33 -> that we didn't really have
any egress security in place.
437.67 -> Well, we had some.
438.87 -> Our VPCs have network ACLs.
441.69 -> Our public subnet network ACLs
445.38 -> are much more tight
448.08 -> even for egress than our
private subnet network ACLs,
453.42 -> but we didn't have any inspection
455.88 -> on egress to the internet.
459.33 -> A little bit about our
environment overall,
462.21 -> we definitely operate a
multi account environment.
467.73 -> We manage over well over 200 VPCs
470.55 -> across all of those accounts.
474.99 -> we only operate in US healthcare
and that sort of thing,
479.01 -> so we actually only
operate within two regions.
483.72 -> But within those two regions,
485.01 -> we deploy subnets across seven
488.76 -> or more availability zones,
491.43 -> but at least seven.
494.4 -> We also have multiple
on-premises data centers,
501.12 -> and we essentially
operate two network zones.
504.15 -> We have production and
we have not production.
510.84 -> Much of this has already been in place.
512.7 -> We've kind of been operating in this space
514.65 -> for six years at least.
518.13 -> And so, our network
infrastructure also already had
521.64 -> some things that are common,
523.56 -> some things that are really important
525.09 -> to operating a large
scale network like this.
529.14 -> All of our VPC addresses,
network address space
532.59 -> is non-overlapping between accounts.
535.41 -> Every account is assigned a common /19.
539.88 -> We break that /19 into /20s per region.
544.86 -> We already use transit gateway
547.02 -> for communication between accounts,
550.02 -> for
552.27 -> using our VPN and Direct Connect.
554.7 -> We terminate those and
propagate those routes
557.91 -> into our transit gateway
route tables already.
562.95 -> And then a couple of key components that
565.35 -> assisted this as a rapid transformation.
569.46 -> All of our VPC infrastructure is in parity
571.59 -> across all of our accounts.
573.81 -> So we use CloudFormation,
576.21 -> and we use CloudFormation parameters
579.51 -> for the uniqueness of a VPC's deployment,
583.05 -> but every single VPC,
and every single account,
585.87 -> and every single region is
deployed from the same template.
591.27 -> So when we did this,
595.17 -> it was pretty straightforward.
596.58 -> But my hook to you is
600.15 -> we saved a lot of money
as our project went on.
605.88 -> And
607.71 -> so we started with nothing.
609.69 -> You can see from our graphs
here of AWS Firewall costs.
614.22 -> In December, we didn't have anything.
615.87 -> Log4j hit, 100% increase.
618.63 -> We're now having to spend more money.
623.67 -> And over the course of our
journey and our story here,
628.14 -> we were able to reduce that
initial increase of cost
632.73 -> by 98.72%.
636.063 -> And in addition to that,
638.7 -> the infrastructure changes
641.67 -> allowed us to also reduce
our NAT gateway resources
646.71 -> by 96.66%
649.65 -> because of networking dependencies
653.97 -> that we'll discuss shortly.
656.88 -> So overall, it was a big success.
658.59 -> We were able to really put
in place something rapidly,
662.94 -> and, through changes, make big impact
667.59 -> on how we're using AWS Firewall
670.53 -> and the amount of money
we're spending to do that.
675.18 -> So let's get started on
setting some ground level here
679.95 -> about how our networks are configured.
684.24 -> At the beginning,
685.073 -> we have a fairly standard
VPC configuration.
687.57 -> As I mentioned, they're all the same.
689.82 -> Also, we're using infrastructure
as code practices.
694.29 -> And you can see in the before scene,
695.61 -> it was a fairly standard VPC.
697.02 -> We had a public subnet with a NAT gateway.
700.26 -> We had a private subnet
703.5 -> that then routed through the NAT gateway
705.6 -> to the internet gateway of the VPC.
708.99 -> Common practice, we deploy public-facing
711.66 -> load balancers to the public subnets.
713.55 -> Everything else goes
in the private subnets.
717.06 -> And after our first round of
change, our distributed model,
720.81 -> you can see a change here.
721.83 -> We had to add another subnet
724.08 -> to every availability zone within our VPC
728.04 -> to facilitate adding the
gateway load balancers
731.1 -> for the network firewall.
734.82 -> And when you do that, it's obvious.
738.12 -> It is the way it is, right?
739.35 -> The cost of network firewall
is directly proportional
742.08 -> to the number of gateway load
balancers that you deploy.
745.56 -> Similar to a NAT gateway.
746.94 -> As soon as you deploy one,
748.59 -> you're gonna be paying for it per hour.
752.91 -> In this distributed model though,
756.27 -> the one kind of benefit that
we had at the time is that
760.14 -> if you deploy gateway load balancers
762.21 -> to the same account that has NAT gateways,
765.96 -> that then offsets the
costs of the NAT gateways,
770.28 -> but you're still paying for
your gateway load balancers.
773.7 -> But since we had the automation in place,
775.56 -> since we had the code ready,
777.9 -> I was actually able to
781.29 -> make the infrastructure changes
783.6 -> to add the subnet in the
gateway load balancers for this
787.05 -> in about 2 1/2 days of
development over the weekend.
790.44 -> The word was let's go.
792.06 -> We need to do this.
793.02 -> No matter what, something has to happen.
796.05 -> And fortunately, that was able to be
799.83 -> completed in a rapid fashion.
802.02 -> The development was in a couple of days,
804.57 -> deployment of non-prod
was in the next day.
807.69 -> Successfully deployed to non-prod.
809.64 -> Awesome, let's go to prod
the next day after that.
812.43 -> And we already have
814.89 -> supporting deployment infrastructure
817.2 -> that allows us to go through each account
820.83 -> running the same commands
to provide those updates
823.8 -> based on our CloudFormation
template changes.
829.83 -> So, as you can see,
831.96 -> there are a couple of additions
833.19 -> to our VPC infrastructure at this time.
837.849 -> In the previous, it was
just a private subnet,
841.41 -> a public subnet with a NAT
gateway, internet load balancer.
844.5 -> Pretty simple.
846.36 -> In this iteration,
848.43 -> we now have a private subnet,
850.17 -> a public subnet with a NAT gateway,
852.45 -> and a new firewall security subnet
857.34 -> with the gateway load balancer.
859.8 -> And if you notice at the top,
860.94 -> we have an addition there
862.38 -> where we have to attach
864.12 -> a route table to the internet gateway.
867.36 -> And that was a bit of a new thought
870.6 -> compared to the way I had
originally designed our VPCs.
877.23 -> And
879.66 -> I think the primary reason of this is that
883.89 -> the internet gateway itself
has to know return routes
886.89 -> into your VPC.
889.95 -> And to do that, you can
associate a route table now
892.71 -> to an internet gateway
that's attached to your VPC.
898.5 -> A bit of a trick
that you see here though:
903.36 -> our NAT gateway isn't the next
hop to our internet gateway.
908.7 -> We now have a gateway
load balancer in between.
912.63 -> And so you actually have
to manage the route table
915 -> attached to your internet gateway,
916.86 -> 'cause if you have
multiple availability zones
920.19 -> and then the route of the public subnet
923.61 -> with the NAT gateway is
the gateway load balancer
927.27 -> and not the internet gateway,
929.55 -> the traffic has to return
to the same NAT gateway
933.57 -> that it left your network from.
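A sketch of that internet gateway route table in CloudFormation, for a single AZ (all names and values are illustrative): the edge route table sends return traffic for each public subnet's CIDR back through the firewall endpoint in the same AZ, so it reaches the NAT gateway it left from.

AWSTemplateFormatVersion: "2010-09-09"
Parameters:
  VpcId: { Type: String }
  InternetGatewayId: { Type: String }
  PublicSubnetACidr: { Type: String }       # e.g. the /24 holding NAT gateway A
  FirewallEndpointAzAId: { Type: String }   # vpce-... in the same AZ
Resources:
  IgwRouteTable:
    Type: AWS::EC2::RouteTable
    Properties:
      VpcId: !Ref VpcId
  # The new piece: a route table associated with the internet gateway itself.
  IgwEdgeAssociation:
    Type: AWS::EC2::GatewayRouteTableAssociation
    Properties:
      GatewayId: !Ref InternetGatewayId
      RouteTableId: !Ref IgwRouteTable
  # Return traffic for public subnet A re-enters via the AZ-local firewall endpoint.
  ReturnRouteAzA:
    Type: AWS::EC2::Route
    Properties:
      RouteTableId: !Ref IgwRouteTable
      DestinationCidrBlock: !Ref PublicSubnetACidr
      VpcEndpointId: !Ref FirewallEndpointAzAId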
938.52 -> So with those changes,
we had some challenges
942.9 -> and some interesting things that came up
945.66 -> as I was going through this.
948.57 -> As I mentioned, all of our
VPCs are deployed with subnets,
953.76 -> basically /24 subnets
of a /20 for the region,
958.47 -> which leaves some space
959.88 -> most of the time in the
original private network
964.89 -> that's assigned to the account.
967.62 -> But it didn't really give the correct
970.89 -> networking address space to
add three or four more subnets
978.48 -> for the gateway load balancers.
979.86 -> So I used an additional
CIDR overlay to the VPC
986.28 -> in this 100.9x range: 100.90,
100.91, 100.92, 100.93.
993.81 -> So that we didn't just
consume that last bit
996.84 -> of address space that
we already had available
999.69 -> for growth in our existing VPCs,
1002.12 -> but we could still put these subnets in
1004.67 -> to get our load balancers,
gateway load balancers available.
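In CloudFormation terms, that overlay is a secondary CIDR association plus subnets carved out of it; a minimal sketch with illustrative values:

AWSTemplateFormatVersion: "2010-09-09"
Parameters:
  VpcId: { Type: String }
Resources:
  FirewallOverlayCidr:
    Type: AWS::EC2::VPCCidrBlock
    Properties:
      VpcId: !Ref VpcId
      CidrBlock: 100.90.0.0/16        # overlay outside the account's assigned /19
  FirewallSubnetAzA:
    Type: AWS::EC2::Subnet
    DependsOn: FirewallOverlayCidr    # the subnet must wait for the association
    Properties:
      VpcId: !Ref VpcId
      AvailabilityZone: !Select [0, !GetAZs ""]
      CidrBlock: 100.90.0.0/28        # a firewall endpoint only needs a tiny subnet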
1011.06 -> But we have hundreds of VPCs.
1013.34 -> We have seven availability zones.
1015.83 -> We have now additional
gateway load balancers
1018.25 -> in each one of these.
1020.03 -> So in 2 1/2 days, we
essentially deployed 1,680
1025.46 -> firewall endpoints across
all of our infrastructure.
1029.3 -> And the cost of over a thousand
gateway firewall endpoints
1034.16 -> is pretty high.
1035.96 -> So, Log4j, the word was do this, right?
1040.67 -> We don't care necessarily
how much this costs.
1042.77 -> We have to put this in place.
1045.086 -> It's a security priority for us.
1050.99 -> But that comes with some
1053.6 -> discussion that we have
in a minute as well.
1057.86 -> Another gotcha was the internet gateway
1061.73 -> route table that I discussed,
1065.36 -> and the return routes to
map to those NAT gateways.
1072.98 -> And the conundrum there is,
1075.14 -> in CloudFormation,
1077.27 -> the stack can give you resource IDs
1081.29 -> for the NAT gateways that have been
1084.44 -> deployed into that instance
of the CloudFormation stack,
1088.7 -> but CloudFormation won't
actually give them back to you
1091.97 -> in any consistent manner.
1094.94 -> So you can't easily map
1097.94 -> "this NAT gateway ID is in this AZ"
1101.69 -> so that my route table
1105.29 -> entry can say,
for this, I get this...
1109.88 -> The other portion of that,
you need your NAT gateway ID,
1114.05 -> you need your private subnet's CIDR range,
1117.86 -> so that little /24 that
is the return route.
1122.36 -> And so
1124.61 -> a complication of this distributed model
1126.98 -> and CloudFormation is
that we had to write
1128.66 -> some external tooling that
then queried the account,
1133.07 -> built a hash of that
information, looped through,
1136.25 -> and assigned those routes into the route table.
1139.55 -> It worked in automation because
you can do this one thing,
1141.86 -> and then you can run this next thing,
1143.93 -> but it was definitely something
1148.282 -> that was more than just being able to do
1151.4 -> one single CloudFormation
update to make all of it happen.
1157.73 -> And then also in a distributed model,
1161.45 -> you're effectively also
deploying an AWS Network Firewall
1166.64 -> and a firewall policy to
attach to that firewall
1170.24 -> in every single account.
1171.74 -> So how do you manage that
across all of the accounts?
1176.03 -> Well, one way we were
able to do that is we
1179.93 -> centralized our rules
in a central account
1184.46 -> and shared those rules with
all of our organization
1187.91 -> via Resource Access Manager.
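A minimal sketch of that sharing, assuming an organization-wide share (the ARNs are placeholders):

AWSTemplateFormatVersion: "2010-09-09"
Parameters:
  CentralRuleGroupArn: { Type: String }   # stateful rule group in the central account
Resources:
  RuleGroupShare:
    Type: AWS::RAM::ResourceShare
    Properties:
      Name: network-firewall-rule-groups
      ResourceArns:
        - !Ref CentralRuleGroupArn
      # Share with the whole organization rather than listing each account.
      Principals:
        - arn:aws:organizations::111122223333:organization/o-exampleorgid
      AllowExternalPrincipals: false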
1195.56 -> So that was great.
1197.6 -> We put it in place,
1199.61 -> but we quickly realized
it's very expensive.
1202.85 -> And we quickly also had discussion that,
1208.1 -> awesome, we did it, we mitigated
1209.96 -> what we needed to for the moment,
1211.49 -> but is it the best solution?
1214.31 -> So a few things that
helped us transition into
1218.3 -> a centralized model was
that we were already using
1222.86 -> transit gateway route table
among all of our accounts.
1225.41 -> So we had extensive routing.
1227.81 -> We had ways that the network
1229.91 -> was able to pass traffic around
1232.16 -> amongst our entire network.
1235.58 -> And through our discussions with InfoSec
1238.314 -> and the rest of the organization,
1240.89 -> we did come to the realization that,
1243.05 -> ultimately, all we need to inspect,
1246.23 -> to provide the service that we need,
1249.17 -> is the egress
1250.97 -> that's actually leaving to the internet.
1254.9 -> And then we are able to put ourselves
1257.45 -> in a place of a phased migration
1259.91 -> to move through these new discoveries
1263.03 -> and move to what will end
up being our final solution
1267.44 -> as we go through.
1269.87 -> As we were going through this,
resources could be removed
1273.32 -> that were previously deployed,
1275.93 -> and all of our migration
steps could be automated
1278.15 -> because we were already actually deploying
1280.22 -> using infrastructure as code
1284.03 -> and automated updates
and deployment practices.
1290.36 -> So where did we start to go?
1293.6 -> We started to go to what is
the centralized model here.
1296.75 -> So you can see on the right
1299.69 -> that we have a set of AWS accounts
1303.02 -> and each of those AWS
accounts has their own VPC.
1306.32 -> And this example, we're in a region,
1309.77 -> and we're in a specific environment.
1312.02 -> So this is then essentially replicated
1314.93 -> for each region that we're in
1317.78 -> and for each networking environment
1319.58 -> that we're talking about.
1322.22 -> And we are already using transit gateways.
1323.69 -> So traffic from one account to another
1328.07 -> was already just routing right
through the transit gateway.
1330.59 -> Nothing had to change there.
1331.61 -> All those routes were already propagated
1333.86 -> into our transit gateway route table.
1337.07 -> We knew that we didn't need any inspection
1339.8 -> between traffic that goes
between our own private accounts.
1343.52 -> We were already propagating and connecting
1346.1 -> to our number of data centers
1348.83 -> using VPN attached to the transit gateway.
1351.89 -> Those routes were already in place.
1354.08 -> And so during this
migration, this transition
1356.75 -> from distributed to centralized,
1361.07 -> we could put in place the inspection VPC
1364.55 -> and then affect the network
by just changing routes.
1368.09 -> So traffic that has a destination
1370.4 -> of 0.0.0.0/0, the default destination,
1374.21 -> coming out of a private subnet,
1375.89 -> instead of sending it to a
NAT gateway as we used to do,
1379.73 -> we just update the route, send
it to the transit gateway.
1383.93 -> You add an additional default route
1385.64 -> to your transit gateway route table
1388.22 -> and you associate your inspection VPC
1393.47 -> to your transit gateway as well,
1396.56 -> and your default route to 0.0.0.0/0 out of the
1400.4 -> transit gateway route table
is your inspection VPC.
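Those two route changes might look like this in CloudFormation (all IDs are illustrative parameters):

AWSTemplateFormatVersion: "2010-09-09"
Parameters:
  PrivateRouteTableId: { Type: String }
  TransitGatewayId: { Type: String }
  TgwRouteTableId: { Type: String }
  InspectionVpcAttachmentId: { Type: String }
Resources:
  # In each workload VPC: private subnets now default-route to the transit gateway.
  PrivateDefaultRoute:
    Type: AWS::EC2::Route
    Properties:
      RouteTableId: !Ref PrivateRouteTableId
      DestinationCidrBlock: 0.0.0.0/0
      TransitGatewayId: !Ref TransitGatewayId
  # In the TGW route table: the default route points at the inspection VPC attachment.
  TgwDefaultRoute:
    Type: AWS::EC2::TransitGatewayRoute
    Properties:
      TransitGatewayRouteTableId: !Ref TgwRouteTableId
      DestinationCidrBlock: 0.0.0.0/0
      TransitGatewayAttachmentId: !Ref InspectionVpcAttachmentId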
1406.97 -> And in this model as well,
1410.96 -> we already have inspection of ingress
1415.61 -> into a public-facing endpoint
1419.6 -> by requiring application load balancers
1422.36 -> deployed to public subnets
to have WAF attached.
1426.59 -> So ingress traffic to a service
1429.89 -> can still come through the
internet gateway of the VPC
1433.7 -> in the account where
the service is deployed.
1436.25 -> WAF is in place to provide security
1439.01 -> for that ingress traffic.
1440.78 -> We are only needing to be worried
1443.54 -> about the traffic that
goes to the internet.
1448.79 -> So let's talk a little bit more about
1450.08 -> the inspection VPC itself.
1452.18 -> So in the distributed model,
1459.53 -> I had to use that additional
CIDR range overlay
1462.29 -> that was kind of outside
of our private network
1464.45 -> in the first place.
1465.98 -> It worked very well,
1467.57 -> but it wasn't entirely great
1469.73 -> because, all of a sudden,
we've got this 100.90 subnet
1473.54 -> or CIDR range attached to subnets
1475.64 -> in these VPCs that have
a private network space
1478.4 -> that's different than that.
1481.37 -> So we have a net new VPC
that we're able to build.
1484.64 -> And with that,
1485.93 -> we can do the same thing
we do in every account.
1487.73 -> We assign a /19 of our private network space
1492.47 -> that's designated
specifically for this account.
1494.96 -> We keep record of where that should live,
1498.14 -> what account it should be associated to,
1501.47 -> and then gave more flexibility
1503.6 -> to deploy the subnets out of that
1507.32 -> private subnet range.
1508.28 -> So at the end of that,
1510.2 -> each of our subnets for the
inspection VPC are now
1516.2 -> sharing the same non-overlapping
private network space
1520.04 -> that the rest of our entire network uses.
1522.44 -> And that's a big benefit
1524.6 -> if at some point in the future
1526.37 -> we wanna add some other
inspection component
1528.71 -> to our network.
1530.36 -> We can go into this VPC.
1533.09 -> But for our use case here,
1536.48 -> we just need to implement
transit, sorry, network firewall.
1541.82 -> And so we have three layers of subnets.
1544.16 -> So now you notice a difference
in this subnet layout
1548 -> than the previous subnet layout
1549.62 -> since we're able to
define this one net new.
1555.23 -> Our private subnets are essentially
1558.05 -> attached with an ENI that's
attached to the transit gateway.
1562.88 -> Those routes are then propagated
into the transit gateway,
1567.47 -> into the transit gateway route tables.
1569.87 -> The next layer, we actually then do
1572.84 -> network firewall gateway load balancers.
1576.26 -> And
1578.75 -> so the route table for the
first subnet has a default route
1583.4 -> to the gateway load
balancer of our firewall.
1586.22 -> Inspection can happen.
1588.08 -> And then the default route for
1592.67 -> the network firewall subnets
1595.97 -> is the public subnet
of the inspection VPC,
1598.43 -> which holds the NAT gateway.
1600.74 -> And the benefit of this is
1602.63 -> now that the NAT gateway next
hop is the internet gateway,
1606.2 -> you don't have to manage
the routes like you used to.
1608.48 -> You don't have to manage
the routes the same way.
1609.95 -> You actually don't have to
manage any routes at all anymore.
1613.4 -> With the route table attached
to the internet gateway
1616.43 -> and the NAT gateways being that hop out,
1621.86 -> the traffic routes
back in as you'd expect.
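Sketching one AZ of that three-layer route chain in CloudFormation (IDs illustrative):

AWSTemplateFormatVersion: "2010-09-09"
Parameters:
  TgwSubnetRouteTableId: { Type: String }
  FirewallSubnetRouteTableId: { Type: String }
  PublicSubnetRouteTableId: { Type: String }
  FirewallEndpointId: { Type: String }
  NatGatewayId: { Type: String }
  InternetGatewayId: { Type: String }
Resources:
  # Layer 1: TGW attachment subnets send everything to the AZ-local firewall endpoint.
  TgwSubnetDefaultRoute:
    Type: AWS::EC2::Route
    Properties:
      RouteTableId: !Ref TgwSubnetRouteTableId
      DestinationCidrBlock: 0.0.0.0/0
      VpcEndpointId: !Ref FirewallEndpointId
  # Layer 2: firewall subnets send inspected traffic to the NAT gateway.
  FirewallSubnetDefaultRoute:
    Type: AWS::EC2::Route
    Properties:
      RouteTableId: !Ref FirewallSubnetRouteTableId
      DestinationCidrBlock: 0.0.0.0/0
      NatGatewayId: !Ref NatGatewayId
  # Layer 3: public subnets go straight to the internet gateway. With the NAT
  # gateway directly in front of the IGW, no edge route table is needed anymore.
  PublicSubnetDefaultRoute:
    Type: AWS::EC2::Route
    Properties:
      RouteTableId: !Ref PublicSubnetRouteTableId
      DestinationCidrBlock: 0.0.0.0/0
      GatewayId: !Ref InternetGatewayId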
1630.83 -> So some of the challenges and some of the
1634.37 -> key points I wanted to
kinda point out here
1637.46 -> is that that simplification
1639.02 -> of the internet gateway route table
1640.58 -> was effective for us.
1643.76 -> And having the NAT
gateways on the other side
1647.12 -> of the gateway load balancers
1649.7 -> was what drove that simplification.
1653.03 -> And then we were able to drop
out the external scripting
1656.45 -> outside of CloudFormation
to manage those things
1659.3 -> and we're back to a single
template, a single update
1663.14 -> to manage the entire network
for that inspection VPC.
1668 -> And as we're going through this,
1669.53 -> one of the gotchas that we found
1671 -> is that network firewall manager
1673.01 -> doesn't really work for us in this way.
1676.37 -> And a couple of the
main reasons why is that
1680.81 -> network firewall manager kinda wants
1682.91 -> to build its own subnets,
1684.29 -> kinda wants to manage its own resources,
1686.18 -> and you don't have much control over that.
1688.64 -> So if you're already really controlled
1690.38 -> when you're in your network
1691.94 -> and you wanna just kinda
insert firewall into it,
1696.86 -> network firewall manager can do it,
1698.48 -> but it kind of gets a little clunky.
1700.13 -> It didn't really work for us.
1703.204 -> And then we couldn't just inherit subnets
1705.59 -> and say we want our
firewall gateways here.
1708.29 -> Like I said, it creates its own.
1710.45 -> It's expecting to be able to do that.
1712.07 -> So we didn't use network firewall manager
1717.383 -> to manage our network firewall.
1719.9 -> But a benefit is
1722.06 -> now that all private traffic
out of our private subnets
1725.6 -> that's destined for the internet
1727.43 -> no longer had to go through
1730.64 -> a NAT gateway within the same VPC
1734.48 -> and the NAT-ing for that traffic
1736.25 -> was handled by NAT gateways
in the inspection VPC.
1740.51 -> It gave us the opportunity to actually
1742.88 -> remove all NAT gateways
from all of our VPCs.
1746.9 -> So when we deployed
1749.9 -> over 1,500 firewall
gateways into every account,
1754.82 -> we still also had NAT gateways.
1758.66 -> And when we transitioned
to the centralized,
1760.85 -> we were able to remove all of
those network firewall
1764.6 -> gateway load balancers.
1766.61 -> But then in addition to that,
1768.77 -> we're also able to
1771.74 -> remove all NAT gateways.
1773.33 -> And previously, we had already
1774.71 -> built into our CloudFormation
1776.12 -> that manages our VPC networks
1780.71 -> a conditional parameter of
enable NAT gateways true, false,
1787.22 -> and then a
1790.58 -> dependency within the resource creation
1793.25 -> of our CloudFormation that said
1795.2 -> if this parameter's set to false,
1797.06 -> then don't build the NAT gateways
1798.8 -> or remove the resources.
1801.95 -> So in our automation, in
our infrastructure as code,
1804.83 -> we were able to just say, "Hey,
turn off the NAT gateways,"
1807.05 -> boom, and run an update against the VPC,
1809.6 -> and removed over a thousand NAT gateways
1812.81 -> that we had run for years.
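A minimal sketch of that conditional pattern (parameter and resource names illustrative):

AWSTemplateFormatVersion: "2010-09-09"
Parameters:
  EnableNatGateways:
    Type: String
    AllowedValues: ["true", "false"]
    Default: "true"
  PublicSubnetAId: { Type: String }
Conditions:
  NatGatewaysEnabled: !Equals [!Ref EnableNatGateways, "true"]
Resources:
  NatEipA:
    Type: AWS::EC2::EIP
    Condition: NatGatewaysEnabled
    Properties:
      Domain: vpc
  # Flipping the parameter to "false" and running a stack update deletes these.
  NatGatewayA:
    Type: AWS::EC2::NatGateway
    Condition: NatGatewaysEnabled
    Properties:
      SubnetId: !Ref PublicSubnetAId
      AllocationId: !GetAtt NatEipA.AllocationId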
1816.68 -> I did mention our internet gateway,
1819.41 -> I'm sorry, internet-facing load balancers.
1822.26 -> We're already using WAF.
1823.49 -> So that was a complication
that we didn't have to solve
1825.38 -> with the network firewall.
1828.02 -> And
1830.12 -> a driving factor of us being
able to move to centralized
1833.69 -> was the fact that we
only needed to inspect
1838.25 -> internet-bound traffic.
1839.87 -> Traffic was already being
inspected between network zones,
1843.11 -> on-premise network firewall
or on-premise firewalling
1846.29 -> that had already been in place.
1849.32 -> We didn't need any inspection between
1851.87 -> AWS accounts within the
same network zone as well.
1858.62 -> So all of these things allowed us to drive
1861.71 -> a phased change approach.
1864.89 -> We were able to quickly
mitigate our security issues
1870.02 -> that came up in December for Log4j.
1874.19 -> We were able to deploy
very quickly to do that.
1878.9 -> It was that moment of
this is more important
1882.08 -> than how much it costs
1884.63 -> until we deployed it.
1886.79 -> And the cost showed up, and we were like,
1889.227 -> "Hey, great, but maybe that's
not the right solution."
1893.57 -> It allowed us though to
step back and identify
1896.72 -> all the pieces that we do really need
1898.97 -> in a more pragmatic and calm discussion
1903.29 -> and ask questions of ourselves like,
1907.1 -> what is the best firewall
that we need to use?
1909.83 -> Is it from a vendor?
1911.87 -> Is it from something else?
1913.85 -> Is it for this network traffic?
1915.68 -> Is it not for this network traffic?
1918.65 -> We were able to have discussions
with external vendors.
1921.5 -> And one of the things we found at the time
1924.44 -> was they may have an excellent solution
1928.73 -> for network firewalling,
1931.04 -> but most of the time it was,
1933.83 -> well, you have to deploy
your own EC2 instances
1936.32 -> and manage your own infrastructure,
1938.12 -> and then you run our software
1939.89 -> to provide the firewalling services.
1942.38 -> And in this critical moment,
1945.47 -> our teams were kind of constrained.
1947.57 -> Honestly, I was the
person who was managing
1950.78 -> all of the infrastructure at the time,
1952.31 -> so we didn't have a lot
of bandwidth to say,
1956.187 -> "Okay, well, we can start managing
1957.65 -> a new fleet of EC2 instances
1959.3 -> and have the time to
write all the automation
1961.67 -> to install the third party vendor software
1964.43 -> and configure it with
automation, all these things."
1967.34 -> It didn't really work for us,
1968.87 -> but we were able to take that moment
1971.51 -> to have those discussions.
1975.74 -> Additionally, we didn't have to
1978.8 -> change everything all in one swoop.
1981.32 -> We were able to deploy the inspection VPCs
1983.18 -> and make sure that their
infrastructure was correct.
1986.63 -> And then we're able to take
the same firewall rules
1990.74 -> that we put in place for
1994.01 -> the centralized model
1995.87 -> and apply them to the inspection VPC,
1999.38 -> and then we were able to
migrate just our networking
2003.352 -> as we needed to in phased changes.
2006.758 -> So we're able to just change the route
2008.71 -> for the destination of the internet
2011.47 -> to the transit gateway in
all of our private subnets,
2015.01 -> and see that traffic move,
2017.29 -> but we already knew that
our firewall rules were in place:
2019.84 -> our security team's custom
2025.42 -> "definitely block this
outbound traffic" rules,
2028.99 -> plus additional AWS managed
rules that we had set in
2034.3 -> evaluation mode.
2036.7 -> We just used the same
rules for our first phase.
2038.77 -> So we knew when we pushed
the traffic over there,
2042.01 -> it wasn't gonna affect the
existing network traffic
2044.44 -> 'cause it was the same rule set.
2046.81 -> And then
2049.57 -> once we moved all of the networking,
2053.26 -> then we could focus on
2055.42 -> the next iteration of our firewalling
2058.57 -> and not only were we able to continue
2062.62 -> to mitigate the security
vulnerability issue with Log4j,
2066.903 -> we were able to also decide, well, hey,
2069.34 -> here's our chance to also
put in place best practices
2073.24 -> of actual default deny outbound egress
2076.66 -> for our entire network
2079.54 -> in a more controlled way.
2082.03 -> And so, we were able to then
2085.27 -> work on the firewall and the policy
2088.99 -> and start building a more strict policy
2091.87 -> for these new rules that are in place.
2095.2 -> Then we were able to update that firewall
2097.87 -> and attach our strict policy
2100.66 -> using alert established at first.
2104.92 -> And so we were essentially
passing all traffic through
2108.13 -> and logging all of that data
2110.53 -> so that we could then
inspect our network traffic
2113.26 -> without affecting actual information.
2118.09 -> And then we were able to communicate
2119.56 -> with our developer teams
2121.12 -> for them to start
looking at their own data
2123.31 -> because we were becoming aware
2125.32 -> and starting to state that,
2126.857 -> "Hey, we're gonna do default deny."
2129.46 -> And by default, if it's
not our own domain,
2132.64 -> if it's not our own known destination,
2134.89 -> untrusted places on the internet,
2137.77 -> default is gonna be...
2140.44 -> Deny is gonna be default.
2143.41 -> And we were able to do that using
2145.66 -> alert established on the policy first,
2148.75 -> logging and evaluating that.
2151.06 -> And then when we knew we were pretty good,
2153.7 -> when we had talked to all
of our developer teams
2156.16 -> and we're ready to go,
2157.69 -> we switch our policy just by
flipping alert established
2161.44 -> and adding drop established to our policy
2164.92 -> and everything was in place.
2167.41 -> So that is kind of the technical details
2170.53 -> of the network journey,
2171.97 -> and I'm gonna pass it on
to Mike now at this point,
2174.91 -> and he's gonna talk more
about the firewall rules.
2180.25 -> - Thanks, Aaron.
2184.9 -> Hey, everybody.
2185.733 -> My name is Mike McGinnis.
2186.7 -> I'm the principal security
engineer at Athena
2190 -> and I lead the public
cloud security group.
2193.87 -> So as Aaron mentioned,
we've gone on this journey
2197.71 -> from decentralized to centralized.
2199.96 -> So what we're looking
at, pre-network firewall.
2204.94 -> Really we're looking at the NACLs
2206.65 -> and security groups as the
primary filtering for traffic.
2210.61 -> It's all IP based, right?
2212.26 -> What's the IP?
2213.28 -> What's the port?
2214.113 -> What's the protocol?
2215.47 -> And then it's allowed through.
2217.75 -> We didn't have any way
to actually implement
2220.99 -> filtering based on HTTP methods,
2224.56 -> user agents, domains, anything like that.
2228.55 -> Log4j happened,
2230.11 -> which was the impetus to
get us into network firewall.
2237.88 -> As we started through the journey,
2239.14 -> basically, what we were looking at
2240.58 -> is we had to define
the policy requirements
2243.07 -> and we broke it down into
define the policy scope,
2246.61 -> have minimal impact and minimal outages,
2249.19 -> and enable the developers.
2251.2 -> So Aaron touched on this,
2252.31 -> but, basically, the defining the scope is,
2254.59 -> are we going to do inbound filtering only,
2256.6 -> outbound filtering?
2257.89 -> Do we care about the east to west?
2259.72 -> Do we care about internal traffic
2262.99 -> or what's the decision?
2265.18 -> Also, how granular do we wanna get?
2267.82 -> Is just IP good enough
2269.47 -> or do we actually want those domains?
2271.54 -> And all of that sort of gone into
2273.82 -> the scoping of the policy set.
2276.34 -> We wanted to have minimal
impact and minimal outages.
2279.52 -> As Aaron said, we were
six years into this.
2282.07 -> So just dropping a firewall in
2284.05 -> and putting default deny in place
2285.67 -> really wasn't the best idea.
2288.19 -> 'Cause if we broke a lot of stuff,
2289.51 -> the first thing to go would be firewall.
2293.02 -> As part of that, we also wanted to have
2294.46 -> a defined rollback strategy.
2296.86 -> As we were chunking through
each of those phases
2299.62 -> that Aaron was describing,
2301.51 -> we actually had a very well
defined rollback strategy
2305.11 -> for each and every one of those.
2306.91 -> So in case the change created an issue,
2310.9 -> we would be able to roll back
2312.55 -> while we identify what that
issue was and work to fix it.
2316.78 -> Lastly, on the requirement
was enabling the developers.
2320.86 -> We need the developers just
as much as they need us.
2325.06 -> We have to provide them the logs.
2326.65 -> We have to teach them how to use the logs,
2329.38 -> and then also bring them
along on the journey.
2332.02 -> Keep them updated where we...
2333.43 -> What's our progress?
2334.84 -> Where are we in the path?
2336.34 -> What's our intention?
2337.6 -> What's the end goal?
2341.83 -> So very similar to the phases
that Aaron was talking about,
2345.31 -> we also created three policy phases.
2347.77 -> So we created the alert policy default,
2351.19 -> then the alert policy strict,
2353.77 -> and then eventually default deny.
2356.65 -> And when I say alert
policy default and strict,
2359.92 -> those are the actual firewall modes.
2361.69 -> So the firewall has default mode
2364.09 -> in which it basically does
rule processing in groups
2367.03 -> based on the action.
2369.1 -> So it will evaluate your
passes first, then your denies,
2372.85 -> and then your alerts.
2375.04 -> In strict mode, it basically
will evaluate in strict order.
2378.55 -> So the way that the policy set is written
2381.79 -> is exactly how it's being processed.
2384.43 -> So rule group one.
2385.72 -> Rule one, two, three, that gets hit first.
2387.97 -> Then rule group two.
2389.14 -> One, two, three gets hit second,
2390.64 -> and so on and so forth.
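In the firewall policy resource, that mode is a single setting; a sketch with illustrative names, where strict order also requires an explicit priority per rule group:

AWSTemplateFormatVersion: "2010-09-09"
Parameters:
  AllowListRuleGroupArn: { Type: String }
  BlockListRuleGroupArn: { Type: String }
Resources:
  EgressPolicy:
    Type: AWS::NetworkFirewall::FirewallPolicy
    Properties:
      FirewallPolicyName: egress-inspection
      FirewallPolicy:
        StatelessDefaultActions: ["aws:forward_to_sfe"]
        StatelessFragmentDefaultActions: ["aws:forward_to_sfe"]
        StatefulEngineOptions:
          RuleOrder: STRICT_ORDER   # DEFAULT_ACTION_ORDER groups by pass/drop/alert instead
        StatefulRuleGroupReferences:
          - ResourceArn: !Ref AllowListRuleGroupArn
            Priority: 100           # evaluated first
          - ResourceArn: !Ref BlockListRuleGroupArn
            Priority: 200           # evaluated second
        StatefulDefaultActions: ["aws:alert_established"]   # strict order only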
2395.47 -> So with this one, alerting default,
2399.85 -> this was the one that was
pushed to all the default
2402.04 -> or the distributed policies.
2405.25 -> We initially set it up
in the default rule order
2407.77 -> 'cause that was more simplistic.
2409.99 -> We had the same firewall being
deployed via CloudFormation
2413.68 -> and the RAM to share the one set of rules
2417.16 -> with all of the firewalls.
2420.46 -> What we did also is we
actually, at this point in time,
2424.3 -> we put the firewall logs to CloudWatch,
2426.64 -> and we set a retention on the log group.
2430.78 -> We created a subsequent alarm to check
2434.44 -> if a block list rule was hit.
2437.08 -> We would send an SNS notification over to our IR team
2440.26 -> that generated a ticket in their queue,
2442.54 -> and we were able to
actually block and respond
2446.35 -> to the Log4j as part of
the initial policy set.
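A sketch of that first-phase logging and alerting wiring (names are illustrative, and the filter pattern on the alert log's action field is an assumption about the log shape):

AWSTemplateFormatVersion: "2010-09-09"
Parameters:
  FirewallArn: { Type: String }
  IrTeamSnsTopicArn: { Type: String }
Resources:
  AlertLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: /network-firewall/alert
      RetentionInDays: 90
  FirewallLogging:
    Type: AWS::NetworkFirewall::LoggingConfiguration
    Properties:
      FirewallArn: !Ref FirewallArn
      LoggingConfiguration:
        LogDestinationConfigs:
          - LogType: ALERT
            LogDestinationType: CloudWatchLogs
            LogDestination:
              logGroup: !Ref AlertLogGroup
  # Count alert events whose action is "blocked" (i.e. a block list rule was hit).
  BlockListHitFilter:
    Type: AWS::Logs::MetricFilter
    Properties:
      LogGroupName: !Ref AlertLogGroup
      FilterPattern: '{ $.event.alert.action = "blocked" }'
      MetricTransformations:
        - MetricNamespace: NetworkFirewall
          MetricName: BlockListHits
          MetricValue: "1"
  # The alarm feeds SNS, which generated a ticket in the IR team's queue.
  BlockListHitAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      Namespace: NetworkFirewall
      MetricName: BlockListHits
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 1
      Threshold: 1
      ComparisonOperator: GreaterThanOrEqualToThreshold
      TreatMissingData: notBreaching
      AlarmActions:
        - !Ref IrTeamSnsTopicArn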
2449.86 -> It's a little misleading
to say alerting default
2452.56 -> because we actually did have
2453.91 -> a handful of block rules in place.
2456.04 -> But for the most part,
2457.09 -> it was built to be an
alert in our policy set.
2461.26 -> Like Aaron mentioned,
2462.16 -> this was cut over to the
centralized firewalls
2465.37 -> so we could keep consistency of the policy
2468.34 -> while the entire network plumbing
2470.5 -> was actually being converted.
2474.73 -> This is what the rule
set actually looked like.
2477.43 -> Very, very simplistic, right?
2480.16 -> The stateless default action
2482.2 -> is to forward to the stateful rule engine.
2487.45 -> Then we use the AWS managed
rule sets in alert mode.
2490.99 -> We had an allow list and that was sort of
2492.64 -> just to clean up the logs a little bit.
2495.07 -> We had a block list,
2496.39 -> and then there actually
isn't a default action
2499.51 -> in default mode.
2502.42 -> And really this is all it looks like.
2504.07 -> This is the sort of the default
actions of the firewall.
2507.25 -> What you basically see in stateless
2508.63 -> is just gonna forward
it down to the stateful,
2510.678 -> and then the stateful is
just rule order default.
2514.69 -> Below this in the console
is your policy set.
2520.15 -> So moving through the journey, right?
2522.82 -> We've come into a centralized model.
2525.07 -> Now we're rebuilding the policy set
2527.65 -> to what we want it to be.
2530.59 -> So
2532.42 -> we moved this over to the
centralized firewalls.
2535.84 -> As part of that,
2538.12 -> we had to remove this Suricata priority.
2542.71 -> And what that basically means is that,
2545.77 -> under the hood, the firewall is
using Suricata IPS/IDS,
2550.482 -> and Suricata is an open source IPS.
2555.07 -> So all of the rules are in Suricata.
2557.5 -> One of the keywords is a priority.
2559.96 -> In default mode, the priority will set
2562.9 -> the priority of the rule within the group.
2565.6 -> So even though it's a pass,
2567.52 -> you can have different rules within pass
2569.98 -> hitting at different times.
2571.6 -> But because we're in strict ordering now,
2573.67 -> that's no longer needed,
2575.41 -> and it actually throws a policy error
2578.29 -> if you try to push the rules with it.
2581.44 -> We updated the policy set.
2583.87 -> Every firewall, again,
had the same policy set.
2587.14 -> But one of the key changes in
centralized versus distributed
2590.56 -> is we actually consolidated
the logging to S3.
2594.07 -> And then from there, we
used event notifications
2596.65 -> to push it to our SIEM,
2598.9 -> and then start to leverage our SIEM
2600.61 -> for the continuous monitoring
2602.44 -> and take out that CloudWatch alarm.
2605.68 -> And that really was to synthesize
2609.58 -> the alerting and logging workflows
2612.34 -> with what we were already using
2614.29 -> and what the IR team
had already built out.
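Switching destinations is just a different log destination config; a sketch (bucket and prefix illustrative), with the bucket's s3:ObjectCreated event notifications then feeding the SIEM:

AWSTemplateFormatVersion: "2010-09-09"
Parameters:
  FirewallArn: { Type: String }
  LogBucketName: { Type: String }
Resources:
  FirewallLoggingToS3:
    Type: AWS::NetworkFirewall::LoggingConfiguration
    Properties:
      FirewallArn: !Ref FirewallArn
      LoggingConfiguration:
        LogDestinationConfigs:
          - LogType: ALERT
            LogDestinationType: S3
            LogDestination:
              bucketName: !Ref LogBucketName
              prefix: network-firewall/alert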
2619.6 -> The one thing to mention on that one
2621.22 -> is we also did provide the
developers access to those logs.
2625.39 -> We provided them very granular access.
2629.02 -> So this is what the new policy looks like.
2631.93 -> Within the stateless rules,
2633.1 -> we actually did add a stateless rule set
2636.73 -> to allow trusted IPs to
bypass the inspection engine.
2641.65 -> And when I say trusted,
I mean truly trusted IPs.
2646.54 -> Below that, we're gonna forward everything
2648.85 -> to the rule groups,
2651.1 -> to the stateful rule groups.
2652.84 -> So we have our allow list.
2654.4 -> We have our block list.
2656.2 -> Now we have the AWS managed block list.
2659.65 -> Below that is the filtered domains.
2661.75 -> So the idea here is that
2664.96 -> we really want to look at...
2668.02 -> Maybe this is traffic that
looks a little suspicious
2671.26 -> or it's traffic that
we've not fully vetted.
2674.35 -> So if we drop 'em below the block list
2676.42 -> and the managed block list from AWS,
2679.27 -> if the domain, or if the IP,
2681.67 -> or whatever else rules are there,
2685.99 -> show up in either of the two lists above,
2688.63 -> they will actually still get
blocked by the rules above
2691.93 -> and not be allowed through.
2695.23 -> If they don't show up in those,
2696.58 -> then they'll be allowed through.
2698.59 -> It's sort of a proving ground in a sense.
2701.14 -> Below that is the
developer requested rules.
2703.09 -> We'll talk about the developer
workflow in a little bit.
2706.21 -> And then we have the
stateful default action,
2708.94 -> like Aaron mentioned,
of alert established.
2713.62 -> So here again, stateless is the same.
2715.99 -> Rule order is now set from default to strict,
2718.39 -> and now we have a default
action of alert established.
2724.63 -> On the default deny,
2728.02 -> a lot of work went into
2730.81 -> moving from the alert
strict to default deny
2734.89 -> was mostly in the background.
2737.11 -> To be 100% honest,
2738.52 -> the change was super easy
2740.59 -> because all we had to do was
update the CloudFormation
2744.19 -> to add the action of drop established
2746.8 -> and do a CloudFormation stack update.
2748.39 -> That's literally it.
2749.223 -> It was three minutes and you're done.
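That change amounts to one extra entry in the policy's stateful default actions; a sketch of the after state (names illustrative):

AWSTemplateFormatVersion: "2010-09-09"
Resources:
  EgressPolicy:
    Type: AWS::NetworkFirewall::FirewallPolicy
    Properties:
      FirewallPolicyName: egress-inspection
      FirewallPolicy:
        StatelessDefaultActions: ["aws:forward_to_sfe"]
        StatelessFragmentDefaultActions: ["aws:forward_to_sfe"]
        StatefulEngineOptions:
          RuleOrder: STRICT_ORDER
        StatefulDefaultActions:
          - aws:alert_established
          - aws:drop_established   # the addition that turns on default deny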
2752.83 -> All of the effort was prior
to getting to that point.
2758.32 -> What we did is we worked
with the development teams
2763.06 -> very, very, very closely.
2765.52 -> We implemented a
developer request process.
2768.52 -> We added a hundred developer
rules to the rule set,
2773.11 -> and this was purely based
on analysis of traffic,
2776.44 -> evaluation of traffic,
2777.76 -> and discussion with the development teams
2779.77 -> to say, "Do you actually need
to go to this health.gov site
2784.84 -> or do you need to do this?"
2786.07 -> And then we would have that discussion.
2788.41 -> We blasted them with
notification and messaging.
2792.13 -> We didn't bury them to the point where
2794.89 -> they just drowned us out.
2796.72 -> But it was...
2798.91 -> We want it to be very explicit
2800.23 -> as to when what change was happening
2802 -> so they could expect and plan for it.
2804.49 -> We held daily office hour calls.
2807.07 -> We do have...
2809.47 -> We do have offices in India.
2811.33 -> So the office hours would alternate days
2813.82 -> so we could accommodate all
of our development groups.
2818.05 -> And then again, we just
modified the firewall.
2822.16 -> When we did it, we did it
from a least used, least risk
2825.58 -> to most used, most risk method.
2828.94 -> Meaning we decided which of the firewalls
2831.67 -> was the least used.
2832.75 -> And if it went down because
of a poor policy change,
2835.81 -> that was okay or that
was the most acceptable.
2838.9 -> We started there for
the initial deployment.
2841.18 -> And as we worked through
it, we moved to the final,
2844.15 -> which was our main production site.
2849.64 -> Again, the rules did
not change whatsoever.
2852.79 -> While the rule groups did not change,
2854.53 -> the rules just got more defined
and more applicable to us.
2859.33 -> Here you can see the default action
2861.43 -> now includes drop established.
2865.21 -> So where are we now?
2866.53 -> Default deny is in place across everything
2868.66 -> in our AWS ecosystem.
2872.834 -> The whole change, including
Aaron's and the policy work,
2875.86 -> resulted in less than 10 medium issues.
2879.76 -> So it was not a very impactful change.
2884.86 -> We have, to date, over
120 developer requests
2890.59 -> through our process.
2893.35 -> Now we're really focusing
on operational efficiencies.
2895.96 -> Literally a couple days ago,
VPC prefix lists were announced,
2899.83 -> which basically allows us to group CIDRs,
2902.98 -> which I think might make our policies
2904.72 -> a little more flexible.
2906.37 -> We're also looking at
refining policy sets.
2908.95 -> So as the business grows
and new workflows come in,
2912.37 -> we're able to look at
those more holistically.
2916.81 -> So real quick, I just wanna
2917.89 -> jump into the developer workflow.
2919.96 -> So basically, what we
wanted to do was make it
2922.3 -> super, super easy for developers
2924.43 -> to have a say in what traffic
is allowed in their account,
2929.17 -> granted it falls below our block list.
2933.1 -> So we do have a say.
2935.35 -> The workflow also has gating
and guardrails encoded into it
2939.4 -> so developer can't just say,
2941.027 -> "Hey, I want my account to go
2943.18 -> all the way to full internet.
2945.19 -> Give me 000 outbound."
2948.25 -> And we just say, "Yeah, sure, whatever."
2950.74 -> We actually have a lot of guardrails
2952.09 -> that we built into the process
to alleviate some of that.
2955.6 -> It's based in Git,
2956.53 -> so signoffs, and auditing, and governance
2958.75 -> are all a part of it.
2960.01 -> It's built through a pipeline
so it's quick and efficient.
2963.7 -> And the approval process,
2966.22 -> we do approve it during business hours.
2968.5 -> If there's an incident, so
if there's an actual outage,
2971.731 -> the NOC can approve in our place,
2973.517 -> and then we have a process where
we're alerted and notified.
2977.11 -> So when the security team
comes in the next day,
2979.63 -> they can perform that assessment.
2982.6 -> So on the predefined rule variable,
2985.06 -> the policy or the rule group
2987.4 -> has what's called a rule variable,
2989.74 -> and it's basically just IP sets.
2991.54 -> We, the security, define
those and manage those
2993.91 -> as new accounts are created.
2996.7 -> The JSON that you see
under developer submission
3000.36 -> is literally the only
thing they have to submit
3003.51 -> as part of their PR.
3006.66 -> So when they create their pull request,
3008.07 -> they have to tell us the account ID
3010.83 -> and what domain, port, and protocol,
3013.74 -> domain/IP, port, protocol that they want.
3018.84 -> We review that.
3020.16 -> We merge it.
3021.84 -> And out pops our Suricata rules.
3025.02 -> So here, what we're doing
3026.37 -> is we're basically using their input
3029.64 -> to create the Suricata rules that we then
3032.76 -> deploy into that specific
developer rule group.
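The session doesn't show the actual submission schema, so everything below is a hypothetical sketch of the idea: a small JSON request is transformed into a Suricata pass rule scoped to the requesting account's address space through a predefined rule variable.

AWSTemplateFormatVersion: "2010-09-09"
Resources:
  DeveloperRequestedRules:
    Type: AWS::NetworkFirewall::RuleGroup
    Properties:
      RuleGroupName: developer-requested
      Type: STATEFUL
      Capacity: 500
      RuleGroup:
        StatefulRuleOptions:
          RuleOrder: STRICT_ORDER
        RuleVariables:
          IPSets:
            ACCOUNT_NET:             # predefined per account by the security team
              Definition:
                - 10.42.0.0/19       # the requesting account's /19 (illustrative)
        RulesSource:
          # Generated from a submission shaped something like (hypothetical):
          #   {"account": "123456789012", "domain": "api.example.com",
          #    "port": 443, "protocol": "tls"}
          RulesString: |
            pass tls $ACCOUNT_NET any -> $EXTERNAL_NET 443 (tls.sni; content:"api.example.com"; startswith; endswith; nocase; msg:"developer request 123456789012"; sid:2000001; rev:1;)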
3038.55 -> I could go deep into this
because it's actually super...
3043.41 -> In my opinion, it's super well-designed.
3044.76 -> There's a lot of catches that
we wanted to account for.
3048.99 -> But the end result is this
is what gets transformed
3052.35 -> in the firewall and allows
their traffic to get through.
3057.33 -> So policy challenges.
3061.35 -> We did a lot of work,
3062.7 -> a lot of movement throughout the journey.
3068.16 -> What we found is that
default ordering is nice,
3071.43 -> but it's limiting in advanced use cases.
3075.06 -> So we were really trying to get
3076.65 -> that alert kind of policy set to say,
3079.86 -> what are we seeing?
3081.66 -> We couldn't really get it easily.
3085.35 -> We were seeing a lot
of the denies happening
3088.5 -> before the alerts would,
so we were actually blocking more traffic
3091.98 -> than we had wanted to
3094.05 -> without putting in pass rules above it.
3096.81 -> Domain list is a really cool
feature of network firewall.
3100.05 -> The only gotcha is that they
put a deny any at the bottom
3104.07 -> so it's literally a white list.
3105.66 -> You put in your domains, you hit save,
3107.97 -> it lists out the domains in Suricata form,
3110.73 -> and then it'll put a
default deny at the bottom.
3114.69 -> So then that basically made us convert
3116.79 -> everything to Suricata,
3118.35 -> and we only use Suricata at this point.
3120.87 -> If you want an IP, a 5-tuple,
so IP port protocol rule,
3125.25 -> that's Suricata.
3126.39 -> Everything now is Suricata.
3129.36 -> We didn't change anything else.
3132.9 -> Home Net is not RFC1918.
3136.14 -> This one, in my opinion, is pretty funny.
3138.48 -> It's an oversight on our
part, but it's a good gotcha.
3141.99 -> So when we were deployed
into the distributed model,
3145.02 -> the firewall actually sat in the VPC.
3148.83 -> So
3150.42 -> it took the CIDR of the VPC
3153.32 -> so it was working fine.
3155.67 -> Because the default value of Home Net
is the CIDRs, or the VPC CIDR,
3160.41 -> when we moved to a consolidated model,
3163.65 -> the only traffic that initially passed
3166.05 -> was the traffic that originated
3168.09 -> within the VPC of the firewall.
3171.51 -> All the other traffic was not
being passed appropriately
3174.54 -> so we're like, "What's going on?"
3176.16 -> And then we did a little light reading,
3179.04 -> and we actually realized our mistake.
3181.5 -> So then we updated CloudFormation
3183.63 -> and put Home Net as a defined CIDR
3186.93 -> rule variable into
all of the policy sets,
3189.66 -> or into all of the rule groups.
3191.58 -> And it fixed our issues
almost instantaneously.
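The fix is the HOME_NET rule variable on the rule groups; a sketch, where the /8 stands in for "the whole internal network" rather than athenahealth's actual space:

AWSTemplateFormatVersion: "2010-09-09"
Resources:
  ExampleRuleGroup:
    Type: AWS::NetworkFirewall::RuleGroup
    Properties:
      RuleGroupName: example-home-net
      Type: STATEFUL
      Capacity: 100
      RuleGroup:
        RuleVariables:
          IPSets:
            HOME_NET:
              Definition:
                - 10.0.0.0/8   # all internal CIDRs, not just the inspection VPC's
        RulesSource:
          RulesString: |
            alert tcp $HOME_NET any -> $EXTERNAL_NET any (msg:"example"; sid:100; rev:1;)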
3196.56 -> Capacity limits is another one.
3198.39 -> So in the network firewall policy,
3202.65 -> you have what's called a capacity limit,
3204.51 -> and this is a hard set limit of 30,000.
3208.56 -> I equate 'em to rules,
3210 -> but it's basically 30,000
rules for your entire firewall.
3214.02 -> And you allocate these to each rule group.
3218.49 -> So if you just sort of
haphazardly assign them
3223.23 -> and you don't think about it,
3224.91 -> you end up either having to
3226.08 -> blow away the rule group, recreating it,
3228.72 -> or you can sort of max out.
3230.79 -> And when you get to that max
out page or the max out time,
3234.09 -> or, sorry, max out capacity,
3237.18 -> you're sort of in a conundrum.
3239.85 -> So what we typically will do is we'll
3241.86 -> look at what the policy or
the rule set's gonna be,
3244.68 -> and then estimate what we
would see the top end being,
3248.94 -> padded another five to 10%,
3251.34 -> and then that's sort of the
capacity for that rule group.
3256.65 -> On the other end, we always
leave some unallocated
3260.55 -> so that way we're never
consuming the full 30,000.
3265.32 -> We're always leaving float.
3265.32 -> So that way if we need to add
more down the road, we can.
3267.9 -> If we need to add a new rule group
3269.79 -> only to transfer the policy set around,
3273.57 -> we have that ability and that flexibility.
3276.18 -> Versus if we're stuck at that top 30,000,
3278.73 -> you sort of lose that flexibility.
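Capacity is set per rule group at creation time and can't be changed afterward, which is why the padding and the unallocated float matter; a sketch with illustrative numbers:

AWSTemplateFormatVersion: "2010-09-09"
Resources:
  AllowListRuleGroup:
    Type: AWS::NetworkFirewall::RuleGroup
    Properties:
      RuleGroupName: allow-list
      Type: STATEFUL
      # Immutable after creation: estimated top end plus 5-10%, while keeping the
      # total across all rule groups comfortably under the firewall's 30,000 cap.
      Capacity: 2200
      RuleGroup:
        RulesSource:
          RulesString: |
            pass tls $HOME_NET any -> $EXTERNAL_NET 443 (tls.sni; content:"example.com"; startswith; endswith; nocase; msg:"allow example.com"; sid:1; rev:1;)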
3283.32 -> So the next one, the network layer
3285.15 -> versus application layer rule processing.
3288.06 -> This one is fickle.
3290.43 -> So when you're doing your rule set,
3292.47 -> if you have allow this
source to this destination
3298.209 -> for all TCP traffic,
3300.12 -> you're never actually
going to be able to do
3302.16 -> the analysis at the app level
3304.5 -> because it's just gonna see the TCP
3306.78 -> and let it through.
3308.16 -> So it's never going to sort of
filter down to the app layer
3312 -> and get to looking at the
domains or the methods.
3317.22 -> Any of the more IPS kind of rules
3321.18 -> typically won't get
processed if you're doing,
3323.61 -> if you're passing it more at the L3/L4 level.
3326.52 -> So just be cognizant of your rules.
3330.78 -> Because the firewall does not
3334.717 -> do TLS decryption,
3337.23 -> it leverages the TLS server
name indication extension,
3342.63 -> and...
3345.03 -> Almost everything supports it.
3346.47 -> However, less than 1% of our traffic
3349.11 -> does not support it in one way or another,
3351.72 -> or it hasn't throughout
the course of our journey.
3355.14 -> And most of the time,
3357.03 -> this is resolved by
updating the libraries.
3359.49 -> Sometimes we were finding
teams that were using
3362.04 -> really old code bases
3364.47 -> that they just needed to update,
3367.41 -> sometimes forcing it to use 1.2,
3370.05 -> even though their code is using 1.2,
3373.29 -> for some reason forcing
it to use it fixed it,
3377.55 -> or adding a parameter
as part of the web call.
3380.4 -> So when you're doing the HTTP request,
3384.54 -> actually putting in the parameter
3386.61 -> for the tls.sni and
defining the server name
3391.68 -> that you wanna go to
3394.145 -> will have the firewall recognize it.
3396.18 -> Because if your client request
doesn't include the SNI data,
3400.17 -> then the firewall will never see it,
3402.48 -> and it'll never match it
against a HTTPS TLS rule.
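Two Suricata rules can illustrate both gotchas at once (a sketch; the domain is illustrative): a broad L3/L4 pass short-circuits application-layer inspection, and an SNI match only works when the client actually sends SNI.

AWSTemplateFormatVersion: "2010-09-09"
Resources:
  SniExampleRuleGroup:
    Type: AWS::NetworkFirewall::RuleGroup
    Properties:
      RuleGroupName: sni-example
      Type: STATEFUL
      Capacity: 10
      RuleGroup:
        RulesSource:
          RulesString: |
            # Too broad: this passes the flow at L4, so tls.sni rules below it
            # would never get to evaluate the application layer:
            # pass tcp $HOME_NET any -> $EXTERNAL_NET 443 (flow:to_server; sid:1; rev:1;)
            # SNI-based allow; it can only match if the client sends SNI:
            pass tls $HOME_NET any -> $EXTERNAL_NET 443 (tls.sni; content:"api.example.com"; startswith; endswith; nocase; msg:"allow api.example.com"; sid:2; rev:1;)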
3408.33 -> And the last one is not
necessarily a challenge,
3411.09 -> but it's just sort of a caveat, right?
3412.77 -> So debugging on a prod firewall,
3416.22 -> we ran into this,
3417.63 -> where teams would say, "Hey,
it's working fine in non-prod,
3422.37 -> but it's not in prod."
3424.68 -> So how do we sort of handle
this or how did we handle this?
3428.88 -> We don't wanna just go
into the prod firewall
3430.59 -> and start typing around and
making some rule changes
3433.23 -> and see if it fixes it, right?
3436.38 -> Also, the firewall is priced on
3439.65 -> a fixed monthly cost plus additional usage.
3442.59 -> So ballpark figure, it's like $36,000
3445.71 -> just to have this test
firewall floating around
3448.32 -> that you use once or twice a month.
3451.02 -> So that might not be a
cost-viable option either.
3454.89 -> And then also, some things
can't be replicated, right?
3458.7 -> For some unknown reason, some
developer patched a system
3464.04 -> and didn't tell anyone.
3465.45 -> There's just oddities that can come up.
3467.88 -> We don't like them.
3468.713 -> We don't want them to happen,
3470.07 -> but sometimes these really
can't be replicated.
3475.14 -> So what we end up doing is
we actually put one rule.
3478.2 -> We basically have a top level rule group
3480.3 -> that we use with a /32.
3483.96 -> We put it through the
code, through our SCM.
3488.61 -> And that way, what
we're able to do is say,
3490.867 -> "Okay, what's the broken system?"
3492.69 -> We put that IP as a /32 in the rule,
3495.93 -> and then we have the flexibility to change
3498.15 -> just that specific rule
for that specific host.
3501.54 -> And because it's all done through code
3503.67 -> and through our pipeline,
it's all auditable.
3506.58 -> It's all maintainable,
3507.99 -> and it's a way for us
to easily perform that,
3513.81 -> that one-off testing when needed.
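A sketch of that top-priority debug rule group (host IP and names illustrative); because it ships through the same pipeline, every temporary /32 change stays reviewed and auditable:

AWSTemplateFormatVersion: "2010-09-09"
Resources:
  DebugOverridesRuleGroup:
    Type: AWS::NetworkFirewall::RuleGroup
    Properties:
      RuleGroupName: debug-overrides
      Type: STATEFUL
      Capacity: 10
      RuleGroup:
        StatefulRuleOptions:
          RuleOrder: STRICT_ORDER
        RulesSource:
          RulesString: |
            # Temporarily open egress for one broken host while debugging in prod.
            pass ip 10.42.1.57/32 any -> $EXTERNAL_NET any (msg:"temporary debug rule for a single host"; sid:9000001; rev:1;)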
3519.78 -> So that's the end of our presentation.
3521.88 -> So these are the links
if you wanna learn more.
3525.36 -> Thank you for coming and
have a great evening.
Source: https://www.youtube.com/watch?v=VMVeTvX4OLw