AWS re:Inforce 2022 - Deploying AWS Network Firewall at scale: athenahealth's journey (NIS308)

When the Log4j vulnerability became known in December 2021, athenahealth made the decision to strengthen their cloud security posture by adding AWS Network Firewall to over 100 accounts and 230 VPCs. Join this session to learn about their initial deployment of a distributed architecture and how they were able to reduce their costs by approximately two-thirds by moving to a centralized model. The session also covers firewall policy creation, optimization, and management at scale. The session is aimed at architects and leaders focused on network and perimeter security who are interested in deploying AWS Network Firewall.

Learn more about AWS re:Inforce at https://bit.ly/3baitIT.



Content

0.63 -> - All right, hello, everybody, and welcome.
2.97 -> Sorry if I startled you.
5.07 -> Thanks for coming to our session at the end of the day here.
9.06 -> We're gonna be talking about
10.47 -> deploying AWS Network Firewall at scale,
13.59 -> and we're using a real world example.
15.18 -> So my name is Dave Desroches.
16.62 -> I am an AWS solutions architect,
19.2 -> and I'm joined by Aaron Baer and Mike McGinnis
22.68 -> who are coming from Athenahealth,
24.15 -> which is a customer that I support.
27.06 -> The origin of this came from a
29.85 -> deployment that they put in place as a direct result of,
34.65 -> well, some things that they needed to do.
37.38 -> And I don't wanna steal from their thunder,
38.85 -> so I'm gonna leave it to them to tell the story,
40.83 -> but I thought that this would make a really good story
43.23 -> for other customers to hear,
45.09 -> because it talks about some of the kind of
47.22 -> trials and tribulations that they went through
49.65 -> as they went from basically not having anything in place
52.56 -> to having this up in production
55.32 -> over a very short period of time.
57.18 -> So I think it's a good story.
60.27 -> I'm gonna start us out by just doing a quick level set.
62.88 -> So I'm gonna go through just a really quick
67.32 -> high level overview of network firewall,
70.11 -> and we'll talk a little bit about
71.25 -> a couple of the deployment options that can go into place.
73.77 -> This is a 300-level session,
76.23 -> so we'll be getting into the weeds
77.73 -> of how these deployments are done.
80.43 -> Then I'm gonna hand it off to Aaron.
81.66 -> We'll go through the movement
87.78 -> from a distributed deployment
89.31 -> to a centralized deployment in their environment.
92.07 -> And then finally, we'll go through the policy and rules
94.65 -> that were put into place and how that evolved over time.
98.37 -> So a little bit about network firewall itself.
101.76 -> This is a managed firewall.
103.83 -> It kind of does what it says on the tin.
106.14 -> It's something where AWS manages the firewall for you,
109.44 -> and then you leverage that firewall
111.54 -> by putting things like policies and rules in place
114.23 -> in your environment.
115.95 -> This works for both north-south
117.66 -> and east-west kinds of deployments.
119.97 -> And you can choose if it's gonna be
121.2 -> something that is symmetric
122.91 -> or egress-only kind of deployment as well.
127.17 -> It scales automatically,
128.49 -> so it has an auto scaling mechanism under the covers.
131.88 -> It's very reliable.
133.38 -> It has a very flexible rules engine as you'll see
136.59 -> when Mike goes through his section of the presentation.
139.92 -> It, like most things at AWS, is a
143.4 -> no upfront commitment kind of thing.
145.23 -> You can put it in place, try it out,
146.7 -> and use whatever you need to for it.
153.09 -> So from your perspective as a customer,
155.55 -> you are responsible for doing things
158.01 -> that make sense for your business.
159.54 -> You're gonna be putting the policy in place,
161.58 -> be establishing the rule sets and so forth,
164.07 -> and you'll be defining what the topology looks like,
166.29 -> what the architecture looks like,
168.09 -> where you have this thing in your environment,
169.83 -> what kinds of traffic it's going to look at and so forth.
173.55 -> Optionally, you can either deploy it on its own,
175.5 -> or you can manage it using Firewall Manager.
178.47 -> And then from an AWS side, we are basically responsible for
182.34 -> handling the scaling of things under the covers,
185.55 -> making sure that you've got throughput and performance
187.89 -> that is appropriate for what you're doing
191.01 -> and handling things like zonal affinity.
193.38 -> So if you are deploying in a model
196.02 -> where you're in a bunch of AZs,
198.06 -> making sure that the traffic goes
199.22 -> to the right place for the right things.
203.85 -> There are two kind of main categories of deployments.
209.04 -> There's distributed and centralized.
211.41 -> In a distributed deployment,
212.79 -> you basically have firewall endpoints
214.89 -> that are placed into each of the VPCs,
217.5 -> in fact, into each of the AZs in those VPCs.
222 -> And those firewall endpoints are
223.65 -> something that you point routing tables to.
225.57 -> So the traffic gets directed through that network firewall,
229.95 -> and it comes back and then goes to
231.42 -> whatever destination you're going to,
233.49 -> be it another VPC or an internet gateway or whatever.
238.77 -> So in a distributed model, you've got a lot of endpoints.
241.5 -> They all kind of live all over your environment
243.54 -> based on what you're trying to do.
245.46 -> In a centralized deployment,
246.87 -> you're folding those things together into an inspection VPC,
251.04 -> and that VPC hangs off of a transit gateway.
254.49 -> So all of the traffic from the other VPCs that you want to send
257.04 -> through that inspection VPC for inspection,
260.04 -> which kind of does what it says,
262.47 -> gets sent over the transit gateway to that VPC,
265.2 -> goes to the firewall,
266.22 -> runs through whichever rules make sense,
267.9 -> and then it comes back out
269.31 -> and goes either out to the internet
270.63 -> or to another VPC in that environment.
273.06 -> So those are the kind of ground rules here.
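In the CloudFormation terms used later in this talk, steering a subnet through a firewall endpoint is a single route whose target is the endpoint. A minimal sketch, with hypothetical IDs:

```yaml
# Send a private subnet's default route through the Network Firewall
# endpoint for its AZ (a gateway load balancer endpoint under the hood).
PrivateDefaultRoute:
  Type: AWS::EC2::Route
  Properties:
    RouteTableId: rtb-0123456789abcdef0    # hypothetical private route table
    DestinationCidrBlock: 0.0.0.0/0
    VpcEndpointId: vpce-0123456789abcdef0  # hypothetical firewall endpoint
```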
275.88 -> So I'm gonna hand it off now to Aaron,
277.68 -> and Aaron is gonna talk a little bit about
279.06 -> an actual deployment.
286.11 -> - Hi, welcome, everyone.
288 -> We are from Athenahealth.
290.7 -> And at Athenahealth, we are helping to shape
294.45 -> the future of healthcare.
297.6 -> And we're doing that by partnering
299.1 -> with healthcare organizations of all sizes
303.09 -> and providing them with technology,
305.46 -> insights, and expertise
309.57 -> to help them drive better clinical and financial results.
316.11 -> I'm Aaron Baer as mentioned.
319.08 -> I am a principal member of our staff
322.98 -> and team lead for our AWS infrastructure team.
328.17 -> And my goal here is to get into some
331.62 -> highly technical details of our journey
334.05 -> from a distributed deployment of
338.82 -> AWS Network Firewall to our now centralized
343.71 -> deployment.
347.61 -> And a big part of this is how a process like this
350.91 -> can drive your costs lower, in fact, in many aspects.
356.64 -> So we deployed a distributed firewall
359.43 -> to each VPC initially
362.13 -> and moved to a deployment model
365.25 -> where our inspection is centralized.
370.17 -> So where did this start?
371.25 -> So this is about a four-month journey for us.
375.36 -> And it started in December with Log4j.
378.113 -> I think a lot of people had a lot of things
380.4 -> that started in December because of Log4j.
385.53 -> And middle of December,
387.87 -> I think I was just about to go on vacation.
390.78 -> And on a Thursday,
394.41 -> I'm pulled into our calls.
396.48 -> We're discussing all of these things.
398.577 -> And it became quickly apparent
400.38 -> that we needed a really rapid solution
403.44 -> to get some mitigation in place.
406.47 -> Up until this event,
411.27 -> all of our VPCs were deployed
413.58 -> in a fairly standard model where we used NAT egress
417.33 -> out of the internet gateway of a VPC,
420.72 -> and there was no need for inspection at the time,
424.5 -> and everything was basically in and out of a VPC
427.92 -> within an account.
430.92 -> So it was also definitely discussed and stated
435.33 -> that we didn't really have any egress security in place.
437.67 -> Well, we had some.
438.87 -> Our VPCs have network ACLs.
441.69 -> Our public subnet network ACLs
445.38 -> are much tighter
448.08 -> even for egress than our private subnet network ACLs,
453.42 -> but we didn't have any inspection
455.88 -> on egress to the internet.
459.33 -> A little bit about our environment overall,
462.21 -> we definitely operate a multi account environment.
467.73 -> We manage well over 200 VPCs
470.55 -> across all of those accounts.
474.99 -> We only operate in US healthcare and that sort of thing,
479.01 -> so we actually only operate within two regions.
483.72 -> But within those two regions,
485.01 -> we deploy subnets across seven
488.76 -> or more availability zones.
494.4 -> We also have multiple on-premises data centers,
501.12 -> and we essentially operate two network zones.
504.15 -> We have production and we have not production.
510.84 -> Much of this has already been in place.
512.7 -> We've kind of been operating in this space
514.65 -> for six years at least.
518.13 -> And so, our network infrastructure also already had
521.64 -> some things that are common,
523.56 -> some things that are really important
525.09 -> to operating a large scale network like this.
529.14 -> All of our VPC network address space
532.59 -> is non-overlapping between accounts.
535.41 -> Every account is assigned a common /19.
539.88 -> We break that /19 into /20s per region.
544.86 -> We already use transit gateway
547.02 -> for communication between accounts,
550.02 -> and for
552.27 -> our VPN and Direct Connect.
554.7 -> We terminate those and propagate those routes
557.91 -> into our transit gateway route tables already.
562.95 -> And then a couple of key components that
565.35 -> assisted this as a rapid transformation.
569.46 -> All of our VPC infrastructure is in parity
571.59 -> across all of our accounts.
573.81 -> So we use CloudFormation,
576.21 -> and we use CloudFormation parameters
579.51 -> for the uniqueness of a VPC's deployment,
583.05 -> but every single VPC, and every single account,
585.87 -> and every single region is deployed from the same template.
591.27 -> So when we did this,
595.17 -> it was pretty straightforward.
596.58 -> But my hook to you is
600.15 -> we saved a lot of money as our project went on.
605.88 -> And
607.71 -> so we started with nothing.
609.69 -> You can see from our graphs here for AWS Network Firewall.
614.22 -> In December, we didn't have anything.
615.87 -> Log4j hit, 100% increase.
618.63 -> We're now having to spend more money.
623.67 -> And over the course of our journey and our story here,
628.14 -> we were able to reduce that initial increase of cost
632.73 -> by 98.72%.
636.063 -> And in addition to that,
638.7 -> the infrastructure changes
641.67 -> allowed us to also reduce our NAT gateway resources
646.71 -> by 96.66%
649.65 -> because of networking dependencies
653.97 -> that we'll discuss shortly.
656.88 -> So overall, it was a big success.
658.59 -> We were able to really put in place something rapidly,
662.94 -> and, through changes, make big impact
667.59 -> on how we're using AWS Network Firewall
670.53 -> and the amount of money we're spending to do that.
675.18 -> So let's get started on setting some ground level here
679.95 -> about how our networks are configured.
684.24 -> At the beginning,
685.073 -> we have a fairly standard VPC configuration.
687.57 -> As I mentioned, they're all the same.
689.82 -> Also, we're using infrastructure as code practices.
694.29 -> And you can see in the before scene,
695.61 -> it was a fairly standard VPC.
697.02 -> We had a public subnet with a NAT gateway.
700.26 -> We had a private subnet
703.5 -> that then routed through the NAT gateway
705.6 -> to the internet gateway of the VPC.
708.99 -> Common practice, we deploy public-facing
711.66 -> load balancers to the public subnets.
713.55 -> Everything else goes in the private subnets.
717.06 -> And after our first round of change, our distributed model,
720.81 -> you can see a change here.
721.83 -> We had to add another subnet
724.08 -> to every availability zone within our VPC
728.04 -> to facilitate adding the gateway load balancers
731.1 -> for the network firewall.
734.82 -> And when you do that, it's obvious.
738.12 -> It is the way it is, right?
739.35 -> The cost of network firewall is directly proportional
742.08 -> to the number of gateway load balancers that you deploy.
745.56 -> Similar to a NAT gateway.
746.94 -> As soon as you deploy one,
748.59 -> you're gonna be paying for it per hour.
752.91 -> In this distributed model though,
756.27 -> the one kind of benefit that we had at the time is that
760.14 -> if you deploy gateway load balancers
762.21 -> to the same account that has NAT gateways,
765.96 -> that then offsets the costs of the NAT gateways,
770.28 -> but you're still paying for your gateway load balancers.
773.7 -> But since we had the automation in place,
775.56 -> since we had the code ready,
777.9 -> I was actually able to
781.29 -> make the infrastructure changes
783.6 -> to add the subnets and the gateway load balancers for this
787.05 -> in about 2 1/2 days of development over the weekend.
790.44 -> The word was let's go.
792.06 -> We need to do this.
793.02 -> No matter what, something has to happen.
796.05 -> And fortunately, that was able to be
799.83 -> completed in a rapid fashion.
802.02 -> The development was in a couple of days,
804.57 -> deployment to non-prod was the next day.
807.69 -> Successfully deployed to non-prod.
809.64 -> Awesome, let's go to prod the next day after that.
812.43 -> And we already have
814.89 -> supporting deployment infrastructure
817.2 -> that allows us to go through each account
820.83 -> running the same commands to provide those updates
823.8 -> based on our CloudFormation template changes.
829.83 -> So, as you can see,
831.96 -> there are a couple of additions
833.19 -> to our VPC infrastructure at this time.
837.849 -> In the previous iteration, it was just a private subnet,
841.41 -> a public subnet with a NAT gateway, an internet-facing load balancer.
844.5 -> Pretty simple.
846.36 -> In this iteration,
848.43 -> we now have a private subnet,
850.17 -> a public subnet with a NAT gateway,
852.45 -> and a new firewall security subnet
857.34 -> with the gateway load balancer.
859.8 -> And if you notice at the top,
860.94 -> we have an addition there
862.38 -> where we have to attach
864.12 -> a route table to the internet gateway.
867.36 -> And that was a bit of a new thought
870.6 -> compared to the way I had originally designed our VPCs.
877.23 -> And
879.66 -> I think the primary reason for this is that
883.89 -> the internet gateway itself has to know return routes
886.89 -> into your VPC.
889.95 -> And to do that, you can associate a route table now
892.71 -> to an internet gateway that's attached to your VPC.
898.5 -> A bit of a trick though that you see here:
903.36 -> our NAT gateway isn't the next hop to our internet gateway.
908.7 -> We now have a gateway load balancer in between.
912.63 -> And so you actually have to manage the route table
915 -> attached to your internet gateway,
916.86 -> 'cause if you have multiple availability zones
920.19 -> and then the route of the public subnet
923.61 -> with the NAT gateway is the gateway load balancer
927.27 -> and not the internet gateway,
929.55 -> the traffic has to return to the same NAT gateway
933.57 -> that it left your network from.
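A sketch of that edge association in CloudFormation, with hypothetical names and CIDRs: the route table is attached to the internet gateway, and each public subnet's CIDR is routed back through the firewall endpoint in the same AZ, so return traffic reaches the NAT gateway it left from.

```yaml
IgwRouteTable:
  Type: AWS::EC2::RouteTable
  Properties:
    VpcId: !Ref Vpc

# Edge association: this route table governs traffic entering via the IGW.
IgwEdgeAssociation:
  Type: AWS::EC2::GatewayRouteTableAssociation
  Properties:
    GatewayId: !Ref InternetGateway
    RouteTableId: !Ref IgwRouteTable

# Return traffic for AZ1's public (NAT) subnet re-enters through AZ1's
# firewall endpoint rather than going straight to the NAT gateway.
ReturnRouteAz1:
  Type: AWS::EC2::Route
  Properties:
    RouteTableId: !Ref IgwRouteTable
    DestinationCidrBlock: 10.0.0.0/24        # hypothetical public subnet CIDR
    VpcEndpointId: vpce-0aaaabbbbccccdddd    # hypothetical firewall endpoint, AZ1
```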
938.52 -> So with those changes, we had some challenges
942.9 -> and some interesting things that came up
945.66 -> as I was going through this.
948.57 -> As I mentioned, all of our VPCs are deployed with subnets,
953.76 -> basically /24 subnets of a /20 for the region,
958.47 -> which leaves some space
959.88 -> most of the time in the original private network
964.89 -> that's assigned to the account,
967.62 -> but it didn't really leave the right
970.89 -> networking address space to add three or four more subnets
978.48 -> for the gateway load balancers.
979.86 -> So I used an additional CIDR overlay to the VPC
986.28 -> in this 100.90 range: 100.90, 100.91, 100.92, 100.93.
993.81 -> So that we didn't just consume that last bit
996.84 -> of address space that we already had available
999.69 -> for growth in our existing VPCs,
1002.12 -> but we could still put these subnets in
1004.67 -> to get our gateway load balancers available.
1011.06 -> But we have hundreds of VPCs.
1013.34 -> We have seven availability zones.
1015.83 -> We have now additional gateway load balancers
1018.25 -> in each one of these.
1020.03 -> So in 2 1/2 days, we essentially deployed 1,680
1025.46 -> firewall endpoints across all of our infrastructure.
1029.3 -> And the cost of over a thousand firewall endpoints
1034.16 -> is pretty high.
1035.96 -> So, Log4j, the word was do this, right?
1040.67 -> We don't care necessarily how much this costs.
1042.77 -> We have to put this in place.
1045.086 -> It's a security priority for us.
1050.99 -> But that comes with some
1053.6 -> discussion that we have in a minute as well.
1057.86 -> Another gotcha was the internet gateway
1061.73 -> route table that I discussed,
1065.36 -> and the return routes that map to those NAT gateways.
1072.98 -> And the conundrum there is,
1075.14 -> in CloudFormation,
1077.27 -> the stack can give you resource IDs
1081.29 -> for the NAT gateways that have been
1084.44 -> deployed into that instance of the CloudFormation stack,
1088.7 -> but CloudFormation won't actually give them back to you
1091.97 -> in any consistent manner.
1094.94 -> So you can't easily map that
1097.94 -> this NAT gateway ID is in this AZ,
1101.69 -> so that my route table
1105.29 -> entry can say, for this destination, use this gateway.
1109.88 -> The other portion of that is you need your NAT gateway ID,
1114.05 -> you need your private subnet's CIDR range.
1117.86 -> So that little /24 is the return route.
1122.36 -> And so
1124.61 -> a complication of this distributed model
1126.98 -> and CloudFormation is that we had to write
1128.66 -> some external tooling that then queried the account,
1133.07 -> build a hash of that information, loop through,
1136.25 -> assign those routes into the route table.
1139.55 -> It worked in automation because you can do this one thing,
1141.86 -> and then you can run this next thing,
1143.93 -> but it was definitely something
1148.282 -> that was more than just being able to do
1151.4 -> one single CloudFormation update to make all of it happen.
1157.73 -> And then also in a distributed model,
1161.45 -> you're effectively also deploying an AWS Network Firewall
1166.64 -> and a firewall policy to attach to that firewall
1170.24 -> in every single account.
1171.74 -> So how do you manage that across all of the accounts?
1176.03 -> Well, one way we were able to do that is we
1179.93 -> centralized our rules in a centralized account
1184.46 -> and share those rules to all of our organization
1187.91 -> via Resource Access Manager.
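A sketch of that sharing pattern, assuming a rule group defined in the same central-account template; the share name and organization ARN here are hypothetical:

```yaml
FirewallRuleShare:
  Type: AWS::RAM::ResourceShare
  Properties:
    Name: network-firewall-rule-groups
    ResourceArns:
      - !GetAtt BlockListRuleGroup.RuleGroupArn  # stateful rule group in this template
    Principals:
      # Sharing to the whole organization makes the rule group visible
      # to every member account's firewall policy.
      - arn:aws:organizations::111111111111:organization/o-exampleorgid
```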
1195.56 -> So that was great.
1197.6 -> We put it in place,
1199.61 -> but we quickly realized it's very expensive.
1202.85 -> And we quickly also had discussion that,
1208.1 -> awesome, we did it, we mitigated
1209.96 -> what we needed to for the moment,
1211.49 -> but is it the best solution?
1214.31 -> So a few things that helped us transition into
1218.3 -> a centralized approach was that we were already using
1222.86 -> transit gateway route table among all of our accounts.
1225.41 -> So we had extensive routing.
1227.81 -> We had ways that the network
1229.91 -> was able to pass traffic around
1232.16 -> amongst our entire network.
1235.58 -> And through our discussions with InfoSec
1238.314 -> and the rest of the organization,
1240.89 -> we did come to the realization that,
1243.05 -> ultimately, to provide the service that we need,
1246.23 -> all we need to be inspecting
1249.17 -> is the egress
1250.97 -> that's actually leaving to the internet.
1254.9 -> And then we are able to put ourselves
1257.45 -> in a place of a phased migration
1259.91 -> to move through these new discoveries
1263.03 -> and move to what will end up being our final solution
1267.44 -> as we go through.
1269.87 -> As we were going through this, resources could be removed
1273.32 -> that were previously deployed,
1275.93 -> and all of our migration steps could be automated
1278.15 -> because we were already actually deploying
1280.22 -> using infrastructure as code
1284.03 -> and automated updates and deployment practices.
1290.36 -> So where did we start to go?
1293.6 -> We started to go to what is the centralized model here.
1296.75 -> So you can see on the right
1299.69 -> that we have a set of AWS accounts
1303.02 -> and each of those AWS accounts has their own VPC.
1306.32 -> And in this example, we're in a region,
1309.77 -> and we're in a specific environment.
1312.02 -> So this is then essentially replicated
1314.93 -> for each region that we're in
1317.78 -> and for each networking environment
1319.58 -> that we're talking about.
1322.22 -> And we were already using transit gateways.
1323.69 -> So traffic from one account to another
1328.07 -> was already just routing right through the transit gateway.
1330.59 -> Nothing had to change there.
1331.61 -> All those routes were already propagated
1333.86 -> into our transit gateway route table.
1337.07 -> We know that we didn't need any inspection
1339.8 -> between traffic that goes between our own private accounts.
1343.52 -> We were already propagating and connecting
1346.1 -> to our number of data centers
1348.83 -> using VPN attached to the transit gateway.
1351.89 -> Those routes were already in place.
1354.08 -> And so during this migration, this transition
1356.75 -> from distributed to centralized,
1361.07 -> we could put in place the inspection VPC
1364.55 -> and then affect the network by just changing routes.
1368.09 -> So traffic that has a default
1370.4 -> destination of 0.0.0.0/0
1374.21 -> coming out of a private subnet,
1375.89 -> instead of sending it to a NAT gateway as we used to do,
1379.73 -> we just update the route, send it to the transit gateway.
1383.93 -> You add an additional default route
1385.64 -> to your transit gateway route table
1388.22 -> and you associate your inspection VPC
1393.47 -> to your transit gateway as well,
1396.56 -> and your default route to 0.0.0.0/0 out of the
1400.4 -> transit gateway route table is your inspection VPC.
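Reduced to CloudFormation, that cutover is two routes (IDs hypothetical): the spoke VPC's default route moves to the transit gateway, and the TGW route table's default route points at the inspection VPC attachment.

```yaml
# In each spoke VPC: default route now targets the transit gateway.
SpokeDefaultRoute:
  Type: AWS::EC2::Route
  Properties:
    RouteTableId: !Ref PrivateRouteTable
    DestinationCidrBlock: 0.0.0.0/0
    TransitGatewayId: tgw-0123456789abcdef0   # hypothetical

# In the TGW route table: default route targets the inspection VPC attachment.
TgwDefaultRoute:
  Type: AWS::EC2::TransitGatewayRoute
  Properties:
    TransitGatewayRouteTableId: tgw-rtb-0123456789abcdef0     # hypothetical
    DestinationCidrBlock: 0.0.0.0/0
    TransitGatewayAttachmentId: tgw-attach-0123456789abcdef0  # inspection VPC
```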
1406.97 -> And in this model as well,
1410.96 -> we already have inspection of ingress
1415.61 -> into a public-facing endpoint
1419.6 -> by requiring application load balancers
1422.36 -> deployed to public subnets to have WAF attached.
1426.59 -> So ingress traffic to a service
1429.89 -> can still come through the internet gateway of the VPC
1433.7 -> in the account where the service is deployed.
1436.25 -> WAF is in place to provide security
1439.01 -> for that ingress traffic.
1440.78 -> We are only needing to be worried
1443.54 -> about the traffic that goes to the internet.
1448.79 -> So let's talk a little bit more about
1450.08 -> the inspection VPC itself.
1452.18 -> So in the distributed model,
1459.53 -> I had to use that additional CIDR range overlay
1462.29 -> that was kind of outside of our private network
1464.45 -> in the first place.
1465.98 -> It worked very well,
1467.57 -> but it wasn't entirely great
1469.73 -> because, all of a sudden, we've got this 100.90 subnet
1473.54 -> or CIDR range attached to subnets
1475.64 -> and these VPCs that have a private network space
1478.4 -> that's different than that.
1481.37 -> So we have a net new VPC that we're able to build.
1484.64 -> And with that,
1485.93 -> we can do the same thing we do in every account.
1487.73 -> We assign a /19 of our private network space
1492.47 -> that's designated specifically for this account.
1494.96 -> We keep record of where that should live,
1498.14 -> what account it should be associated to,
1501.47 -> and that then gave more flexibility
1503.6 -> to deploy the subnets out of that
1507.32 -> private subnet range.
1508.28 -> So at the end of that,
1510.2 -> each of our subnets for the inspection VPC is now
1516.2 -> sharing the same non-overlapping private network space
1520.04 -> that the rest of our entire network uses.
1522.44 -> And that's a big benefit
1524.6 -> if at some point in the future
1526.37 -> we wanna add some other inspection component
1528.71 -> to our network.
1530.36 -> We can go into this VPC.
1533.09 -> But for our use case here,
1536.48 -> we just need to implement network firewall.
1541.82 -> And so we have three layers of subnets.
1544.16 -> So now you notice a difference in this subnet layout
1548 -> than the previous subnet layout
1549.62 -> since we're able to define this one net new.
1555.23 -> Our private subnets essentially
1558.05 -> hold an ENI that's attached to the transit gateway.
1562.88 -> Those routes are then propagated
1567.47 -> into the transit gateway route tables.
1569.87 -> The next layer is where we actually put the
1572.84 -> network firewall gateway load balancers.
1576.26 -> And
1578.75 -> so the route table for the first subnet has a destination
1583.4 -> of the gateway load balancer of our firewall.
1586.22 -> Inspection can happen.
1588.08 -> And then the default route for
1592.67 -> the network firewall subnets
1595.97 -> is the public subnet of the inspection VPC,
1598.43 -> which holds the NAT gateway.
1600.74 -> And the benefit of this is
1602.63 -> now that the NAT gateway next hop is the internet gateway,
1606.2 -> you don't have to manage the routes like you used to.
1608.48 -> You don't have to manage the routes the same way.
1609.95 -> You actually don't have to manage any routes at all anymore.
1613.4 -> The route table attached to the internet gateway,
1616.43 -> with the NAT gateways being that hop out,
1621.86 -> routes the traffic back in as you'd expect.
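The three-layer hop chain inside the inspection VPC comes down to one default route per layer. A sketch for a single AZ, with hypothetical logical names:

```yaml
# Layer 1: TGW attachment subnet -> firewall endpoint in the same AZ.
TgwSubnetDefaultRoute:
  Type: AWS::EC2::Route
  Properties:
    RouteTableId: !Ref TgwSubnetRouteTableAz1
    DestinationCidrBlock: 0.0.0.0/0
    VpcEndpointId: vpce-0123456789abcdef0   # hypothetical firewall endpoint

# Layer 2: firewall subnet -> NAT gateway in the public subnet.
FirewallSubnetDefaultRoute:
  Type: AWS::EC2::Route
  Properties:
    RouteTableId: !Ref FirewallSubnetRouteTableAz1
    DestinationCidrBlock: 0.0.0.0/0
    NatGatewayId: !Ref NatGatewayAz1

# Layer 3: public subnet -> internet gateway; the IGW's own route table
# (edge association) handles the symmetric return path.
PublicSubnetDefaultRoute:
  Type: AWS::EC2::Route
  Properties:
    RouteTableId: !Ref PublicSubnetRouteTableAz1
    DestinationCidrBlock: 0.0.0.0/0
    GatewayId: !Ref InternetGateway
```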
1630.83 -> So some of the challenges and some of the
1634.37 -> key points I wanted to kinda point out here
1637.46 -> is that that simplification
1639.02 -> of the internet gateway route table
1640.58 -> was effective for us.
1643.76 -> And having the NAT gateways on the other side
1647.12 -> of the gateway load balancers
1649.7 -> was what drove that simplification.
1653.03 -> And then we were able to drop out the external scripting
1656.45 -> outside of CloudFormation to manage those things
1659.3 -> and we're back to a single template, a single update
1663.14 -> to manage the entire network for that inspection VPC.
1668 -> And as we're going through this,
1669.53 -> one of the gotchas that we found
1671 -> is that AWS Firewall Manager
1673.01 -> doesn't really work for us in this way.
1676.37 -> And a couple of the main reasons why is that
1680.81 -> Firewall Manager kinda wants
1682.91 -> to build its own subnets,
1684.29 -> kinda wants to manage its own resources,
1686.18 -> and you don't have much control over that.
1688.64 -> So if you're already really controlled
1690.38 -> when you're in your network
1691.94 -> and you wanna just kinda insert firewall into it,
1696.86 -> Firewall Manager can do it,
1698.48 -> but it kind of gets a little chunky.
1700.13 -> It didn't really work for us.
1703.204 -> And then we couldn't just inherit subnets
1705.59 -> and say we want our firewall gateways here.
1708.29 -> Like I said, it creates its own.
1710.45 -> It's expecting to be able to do that.
1712.07 -> So we didn't use Firewall Manager
1717.383 -> to manage our network firewall.
1719.9 -> But a benefit is that
1722.06 -> now all traffic out of our private subnets
1725.6 -> that's destined for the internet
1727.43 -> no longer had to go through
1730.64 -> a NAT gateway within the same VPC,
1734.48 -> and the NAT-ing for that traffic
1736.25 -> was handled by NAT gateways in the inspection VPC.
1740.51 -> It gave us the opportunity to actually
1742.88 -> remove all NAT gateways from all of our VPCs.
1746.9 -> So when we deployed
1749.9 -> over 1,500 firewall endpoints into every account,
1754.82 -> we still also had NAT gateways.
1758.66 -> And when we transitioned to the centralized model,
1760.85 -> we were able to reduce all of those network firewall
1764.6 -> gateway load balancers.
1766.61 -> But then in addition to that,
1768.77 -> we're also able to
1771.74 -> remove all NAT gateways.
1773.33 -> And previously, we had already
1774.71 -> built into our CloudFormation
1776.12 -> that manages our VPC networks
1780.71 -> a conditional parameter of enable NAT gateways true, false,
1787.22 -> and then a
1790.58 -> dependency within the resource creation
1793.25 -> of our CloudFormation that said
1795.2 -> if this parameter's set to false,
1797.06 -> then don't build the NAT gateways
1798.8 -> or remove the resources.
1801.95 -> So in our automation, in our infrastructure as code,
1804.83 -> we were able to just say, "Hey, turn off the NAT gateways,"
1807.05 -> boom, and run an update against the VPC,
1809.6 -> and removed over a thousand NAT gateways
1812.81 -> that we had run for years.
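The conditional-parameter pattern being described looks roughly like this in a template (parameter and resource names are hypothetical); flipping the parameter to 'false' on a stack update deletes the NAT gateway:

```yaml
Parameters:
  EnableNatGateways:
    Type: String
    AllowedValues: ['true', 'false']
    Default: 'true'

Conditions:
  NatGatewaysEnabled: !Equals [!Ref EnableNatGateways, 'true']

Resources:
  NatGatewayAz1:
    Type: AWS::EC2::NatGateway
    Condition: NatGatewaysEnabled   # resource is removed when the condition is false
    Properties:
      AllocationId: !GetAtt NatEipAz1.AllocationId
      SubnetId: !Ref PublicSubnetAz1
```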
1816.68 -> I did mention
1819.41 -> our internet-facing load balancers.
1822.26 -> We're already using WAF.
1823.49 -> So that was a complication that we didn't have to solve
1825.38 -> with the network firewall.
1828.02 -> And
1830.12 -> a driving factor of us being able to move to centralized
1833.69 -> was the fact that we only needed to inspect
1838.25 -> internet-bound traffic.
1839.87 -> Traffic was already being inspected between network zones,
1843.11 -> by on-premises firewalling
1846.29 -> that had already been in place.
1849.32 -> We didn't need any inspection between
1851.87 -> AWS accounts within the same network zone as well.
1858.62 -> So all of these things allowed us to drive
1861.71 -> a phased change of approach.
1864.89 -> We were able to quickly mitigate our security issues
1870.02 -> that came up in December for Log4j.
1874.19 -> We were able to deploy very quickly to do that.
1878.9 -> It was that moment of this is more important
1882.08 -> than how much it costs
1884.63 -> until we deployed it.
1886.79 -> And the cost showed up, and we were like,
1889.227 -> "Hey, great, but maybe that's not the right solution."
1893.57 -> It allowed us though to step back and identify
1896.72 -> all the pieces that we do really need
1898.97 -> in a more pragmatic and calm discussion
1903.29 -> and ask questions of ourselves like,
1907.1 -> what is the best firewall that we need to use?
1909.83 -> Is it from a vendor?
1911.87 -> Is it from something else?
1913.85 -> Is it for this network traffic?
1915.68 -> Is it not for this network traffic?
1918.65 -> We were able to have discussions with external vendors.
1921.5 -> And one of the things we found at the time
1924.44 -> was they may have an excellent solution
1928.73 -> for network firewalling,
1931.04 -> but most of the time it was,
1933.83 -> well, you have to deploy your own EC2 instances
1936.32 -> and manage your own infrastructure,
1938.12 -> and then you run our software
1939.89 -> to provide the firewalling services.
1942.38 -> And in this critical moment,
1945.47 -> our teams were kind of constrained.
1947.57 -> Honestly, I was the person who was managing
1950.78 -> all of the infrastructure at the time,
1952.31 -> so we didn't have a lot of bandwidth to say,
1956.187 -> "Okay, well, we can start managing
1957.65 -> a new fleet of EC2 instances
1959.3 -> and have the time to write all the automation
1961.67 -> to install the third party vendor software
1964.43 -> and configure it with automation, all these things."
1967.34 -> It didn't really work for us,
1968.87 -> but we were able to take that moment
1971.51 -> to have those discussions.
1975.74 -> Additionally, we didn't have to
1978.8 -> change everything all in one swoop.
1981.32 -> We were able to deploy the inspection VPCs
1983.18 -> and make sure that their infrastructure was correct.
1986.63 -> And then we're able to take the same firewall rules
1990.74 -> that we put in place for
1994.01 -> the distributed model
1995.87 -> and apply them to the inspection VPC,
1999.38 -> and then we were able to migrate just our networking
2003.352 -> as we needed to in phased changes.
2006.758 -> So we're able to just change the route
2008.71 -> for the destination of the internet
2011.47 -> to the transit gateway in all of our private subnets,
2015.01 -> and see that traffic move,
2017.29 -> but we already knew that our firewall rules in place,
2019.84 -> our security-assisted custom
2025.42 -> "definitely block this outbound traffic" rules,
2028.99 -> plus additional AWS managed rules that we had set in
2034.3 -> evaluation mode.
2036.7 -> We just used the same rules for our first phase.
2038.77 -> So we knew when we pushed the traffic over there,
2042.01 -> it wasn't gonna affect the existing network traffic
2044.44 -> 'cause it was the same rule set.
2046.81 -> And then
2049.57 -> once we moved all of the networking,
2053.26 -> then we could focus on
2055.42 -> the next iteration of our firewalling
2058.57 -> and not only were we able to continue
2062.62 -> to mitigate the security vulnerability issue with Log4j,
2066.903 -> we were able to also decide, well, hey,
2069.34 -> here's our chance to also put in place best practices
2073.24 -> of actual default deny outbound egress
2076.66 -> for our entire network
2079.54 -> in a more controlled way.
2082.03 -> And so, we were able to then
2085.27 -> work on the firewall and the policy
2088.99 -> and start building a more strict policy
2091.87 -> for these new rules that are in place.
2095.2 -> Then we were able to update that firewall
2097.87 -> and attach our strict policy
2100.66 -> using alert established at first.
2104.92 -> And so we were essentially passing all traffic through
2108.13 -> and logging all of that data
2110.53 -> so that we could then inspect our network traffic
2113.26 -> without affecting the actual traffic.
2118.09 -> And then we were able to communicate
2119.56 -> with our developer teams
2121.12 -> for them to start looking at their own data
2123.31 -> because we were becoming aware
2125.32 -> and starting to state that,
2126.857 -> "Hey, we're gonna do default deny."
2129.46 -> And by default, if it's not our own domain,
2132.64 -> if it's not our own known destination,
2134.89 -> untrusted places on the internet,
2140.44 -> deny is gonna be the default.
2143.41 -> And we were able to do that using
2145.66 -> alert established on the policy first,
2148.75 -> logging and evaluating that.
2151.06 -> And then when we knew we were pretty good,
2153.7 -> when we had talked to all of our developer teams
2156.16 -> and we're ready to go,
2157.69 -> we switch our policy just by flipping alert established
2161.44 -> and adding drop established to our policy
2164.92 -> and everything was in place.
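In the firewall policy resource, that cutover is one list edit. A sketch under the assumption of a strict-order policy (names are hypothetical):

```yaml
InspectionFirewallPolicy:
  Type: AWS::NetworkFirewall::FirewallPolicy
  Properties:
    FirewallPolicyName: inspection-policy   # hypothetical
    FirewallPolicy:
      StatelessDefaultActions: ['aws:forward_to_sfe']
      StatelessFragmentDefaultActions: ['aws:forward_to_sfe']
      StatefulEngineOptions:
        RuleOrder: STRICT_ORDER
      # Evaluation phase: log whatever no rule matched, but let it pass.
      StatefulDefaultActions: ['aws:alert_established']
      # Default-deny cutover: keep the alert and add the drop.
      # StatefulDefaultActions: ['aws:alert_established', 'aws:drop_established']
```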
2167.41 -> So that is kind of the technical details
2170.53 -> of the network journey,
2171.97 -> and I'm gonna pass it on to Mike now at this point,
2174.91 -> and he's gonna talk more about the firewall rules.
2180.25 -> - Thanks, Aaron.
2184.9 -> Hey, everybody.
2185.733 -> My name is Mike McGinnis.
2186.7 -> I'm the principal security engineer at Athena
2190 -> and I lead the public cloud security group.
2193.87 -> So as Aaron mentioned, we've gone on this journey
2197.71 -> from decentralized to centralized.
2199.96 -> So what we're looking at, pre-network firewall,
2204.94 -> really is the NACLs
2206.65 -> and security groups as the primary filtering for traffic.
2210.61 -> It's all IP based, right?
2212.26 -> What's the IP?
2213.28 -> What's the port?
2214.113 -> What's the protocol?
2215.47 -> And then it's allowed through.
2217.75 -> We didn't have any way to actually implement
2220.99 -> filtering based on HTTP methods,
2224.56 -> user agents, domains, anything like that.
2228.55 -> Log4j happened,
2230.11 -> which was the impetus to get us into network firewall.
2237.88 -> As we started through the journey,
2239.14 -> basically, what we were looking at
2240.58 -> is we had to define the policy requirements
2243.07 -> and we broke it down into define the policy scope,
2246.61 -> have minimal impact and minimal outages,
2249.19 -> and enable the developers.
2251.2 -> So Aaron touched on this,
2252.31 -> but, basically, the defining the scope is,
2254.59 -> are we going to do inbound filtering only,
2256.6 -> outbound filtering?
2257.89 -> Do we care about the east to west?
2259.72 -> Do we care about internal traffic
2262.99 -> or what's the decision?
2265.18 -> Also, how granular do we wanna get?
2267.82 -> Is just IP good enough
2269.47 -> or do we actually want those domains?
2271.54 -> And all of that sort of went into
2273.82 -> the scoping of the policy set.
2276.34 -> We wanted to have minimal impact and minimal outages.
2279.52 -> As Aaron said, we were six years into this.
2282.07 -> So just dropping a firewall in
2284.05 -> and putting default deny in place
2285.67 -> really wasn't the best idea.
2288.19 -> 'Cause if we broke a lot of stuff,
2289.51 -> the first thing to go would be firewall.
2293.02 -> As part of that, we also wanted to have
2294.46 -> a defined rollback strategy.
2296.86 -> As we were chunking through each of those phases
2299.62 -> that Aaron was describing,
2301.51 -> we actually had a very well defined rollback strategy
2305.11 -> for each and every one of those.
2306.91 -> So in case the change created an issue,
2310.9 -> we would be able to roll back
2312.55 -> while we identify what that issue was and work to fix it.
2316.78 -> The last requirement was enabling the developers.
2320.86 -> We need the developers just as much as they need us.
2325.06 -> We have to provide them the logs.
2326.65 -> We have to teach them how to use the logs,
2329.38 -> and then also bring them along on the journey.
2332.02 -> Keep them updated where we...
2333.43 -> What's our progress?
2334.84 -> Where are we in the path?
2336.34 -> What's our intention?
2337.6 -> What's the end goal?
2341.83 -> So very similar to the phases that Aaron was talking about,
2345.31 -> we also created three policy phases.
2347.77 -> So we created the alert policy default,
2351.19 -> then the alert policy strict,
2353.77 -> and then eventually default deny.
2356.65 -> And when I say alert policy default and strict,
2359.92 -> those are the actual firewall modes.
2361.69 -> So the firewall has default mode
2364.09 -> in which it basically does rule processing in groups
2367.03 -> based on the action.
2369.1 -> So it will evaluate your passes first, then your denies,
2372.85 -> and then your alerts.
2375.04 -> In strict mode, it basically will evaluate in strict order.
2378.55 -> So the way that the policy set is written
2381.79 -> is exactly how it's being processed.
2384.43 -> So rule group one.
2385.72 -> Rule one, two, three, that gets hit first.
2387.97 -> Then rule group two.
2389.14 -> One, two, three gets hit second,
2390.64 -> and so on and so forth.
2395.47 -> So with this one, alerting default,
2399.85 -> this was the one that was pushed to all
2402.04 -> of the distributed policies.
2405.25 -> We initially set it up in the default rule order
2407.77 -> 'cause that was more simplistic.
2409.99 -> We had the same firewall being deployed via CloudFormation
2413.68 -> and RAM to share the one set of rules
2417.16 -> with all of the firewalls.
2420.46 -> What we did also is we actually, at this point in time,
2424.3 -> we put the firewall logs to CloudWatch,
2426.64 -> and we set a retention on the log group.
2430.78 -> We created a subsequent alarm to check
2434.44 -> if a block list rule was hit.
2437.08 -> We would send an SNS over to our IR team
2440.26 -> that generated a ticket in their queue,
2442.54 -> and we were able to actually block and respond
2446.35 -> to the Log4j as part of the initial policy set.
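One plausible CloudFormation shape for that alarm path, assuming the alert logs land in a CloudWatch log group and that blocked events carry Suricata's alert action field (all names here are hypothetical):

```yaml
BlockHitMetricFilter:
  Type: AWS::Logs::MetricFilter
  Properties:
    LogGroupName: !Ref FirewallAlertLogGroup
    FilterPattern: '{ $.event.alert.action = "blocked" }'  # assumed log shape
    MetricTransformations:
      - MetricName: BlockListHits
        MetricNamespace: NetworkFirewall/Custom            # hypothetical namespace
        MetricValue: '1'

BlockHitAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    Namespace: NetworkFirewall/Custom
    MetricName: BlockListHits
    Statistic: Sum
    Period: 300
    EvaluationPeriods: 1
    Threshold: 1
    ComparisonOperator: GreaterThanOrEqualToThreshold
    AlarmActions:
      - !Ref IrTeamSnsTopic   # SNS topic feeding the IR ticket queue
```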
2449.86 -> It's a little misleading to say alerting default
2452.56 -> because we actually did have
2453.91 -> a handful of block rules in place.
2456.04 -> But for the most part,
2457.09 -> it was built to be an alerting policy set.
2461.26 -> Like Aaron mentioned,
2462.16 -> this was cut over to the centralized firewalls
2465.37 -> so we could keep consistency of the policy
2468.34 -> while the entire network plumbing
2470.5 -> was actually being converted.
2474.73 -> This is what the rule set actually looked like.
2477.43 -> Very, very simplistic, right?
2480.16 -> The stateless default action
2482.2 -> is to forward to the stateful rule engine.
2487.45 -> Then we use the AWS managed rule sets in alert mode.
2490.99 -> We had an allow list and that was sort of
2492.64 -> just to clean up the logs a little bit.
2495.07 -> We had a block list,
2496.39 -> and then there actually isn't a default action
2499.51 -> in default mode.
2502.42 -> And really this is all it looks like.
2504.07 -> This is sort of the default actions of the firewall.
2507.25 -> What you basically see in stateless
2508.63 -> is just gonna forward it down to the stateful,
2510.678 -> and then the stateful is just rule order default.
2514.69 -> Below this in the console is your policy set.
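A sketch of that initial default-order policy (names hypothetical). Note there are no StatefulEngineOptions, which leaves the engine in action-order mode, and no StatefulDefaultActions, which aren't available in that mode:

```yaml
DistributedFirewallPolicy:
  Type: AWS::NetworkFirewall::FirewallPolicy
  Properties:
    FirewallPolicyName: distributed-default-policy   # hypothetical
    FirewallPolicy:
      StatelessDefaultActions: ['aws:forward_to_sfe']
      StatelessFragmentDefaultActions: ['aws:forward_to_sfe']
      # Default (action-order) evaluation: passes, then drops, then alerts.
      StatefulRuleGroupReferences:
        - ResourceArn: !GetAtt AllowListRuleGroup.RuleGroupArn
        - ResourceArn: !GetAtt BlockListRuleGroup.RuleGroupArn
        # (AWS managed rule group ARNs would be referenced here as well.)
```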
2520.15 -> So moving through the journey, right?
2522.82 -> We've come into a centralized model.
2525.07 -> Now we're rebuilding the policy set
2527.65 -> to what we want it to be.
2530.59 -> So
2532.42 -> we moved this over to the centralized firewalls.
2535.84 -> As part of that,
2538.12 -> we had to remove this Suricata priority.
2542.71 -> And what that basically means is that,
2545.77 -> under the hood, the firewall is using Suricata IPS/IDS,
2550.482 -> and Suricata is an open source IPS.
2555.07 -> So all of the rules are in Suricata.
2557.5 -> One of the keywords is a priority.
2559.96 -> In default mode, the priority will set
2562.9 -> the priority of the rule within the group.
2565.6 -> So even though it's a pass,
2567.52 -> you can have different rules within pass
2569.98 -> hitting at different times.
2571.6 -> But because we're in strict ordering now,
2573.67 -> that's no longer needed,
2575.41 -> and it actually throws a policy error
2581.44 -> if you try to push the rules with it.
2581.44 -> We updated the policy set.
2583.87 -> Every firewall, again, had the same policy set.
2587.14 -> But one of the key changes in centralized versus distributed
2590.56 -> is we actually consolidated the logging to S3.
2594.07 -> And then from there, we used event notifications
2596.65 -> to push it to our SIEM,
2598.9 -> and then start to leverage our SIEM
2600.61 -> for the continuous monitoring
2602.44 -> and take out that CloudWatch alarm.
2605.68 -> And that really was to synthesize
2609.58 -> the alerting and logging workflows
2612.34 -> with what we were already using
2614.29 -> and what the IR team had already built out.
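The S3 consolidation is one logging-configuration resource per firewall. A sketch with a hypothetical bucket; S3 event notifications on that bucket then feed the SIEM:

```yaml
FirewallLogging:
  Type: AWS::NetworkFirewall::LoggingConfiguration
  Properties:
    FirewallArn: !Ref InspectionFirewall   # Ref on the firewall returns its ARN
    LoggingConfiguration:
      LogDestinationConfigs:
        - LogType: ALERT
          LogDestinationType: S3
          LogDestination:
            bucketName: example-firewall-logs   # hypothetical bucket
            prefix: alert
```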
2619.6 -> The one thing to mention on that one
2621.22 -> is we also did provide the developers access to those logs.
2625.39 -> We provided them very granular access.
2629.02 -> So this is what the new policy looks like.
2631.93 -> Within the stateless rules,
2633.1 -> we actually did add a stateless rule set
2636.73 -> to allow trusted IPs to bypass the inspection engine.
2641.65 -> And when I say trusted, I mean truly trusted IPs.
2646.54 -> Below that, we're gonna forward everything
2648.85 -> to the rule groups,
2651.1 -> to the stateful rule groups.
2652.84 -> So we have our allow list.
2654.4 -> We have our block list.
2656.2 -> Now we have the AWS managed block list.
2659.65 -> Below that is the filtered domains.
2661.75 -> So the idea here is that
2664.96 -> we really want to look at...
2668.02 -> Maybe this is traffic that looks a little suspicious
2671.26 -> or it's traffic that we haven't fully vetted.
2674.35 -> So if we drop 'em below the block list
2676.42 -> and the managed block list from AWS,
2679.27 -> if the domain, or if the IP,
2681.67 -> or whatever else rules are there,
2685.99 -> show up in either of the two lists above,
2688.63 -> they will actually still get blocked by the rules above
2691.93 -> and not be allowed through.
2695.23 -> If they don't show up in those,
2696.58 -> then they'll be allowed through.
2698.59 -> It's sort of a proving ground in a sense.
2701.14 -> Below that is the developer requested rules.
2703.09 -> We'll talk about the developer workflow in a little bit.
2706.21 -> And then we have the stateful default action,
2708.94 -> like Aaron mentioned, of alert established.
2713.62 -> So here again, stateless is the same.
2718.39 -> Rule order is now strict instead of default,
2718.39 -> and now we have a default action of alert established.
2724.63 -> On the default deny,
2728.02 -> a lot of the work that went into
2730.81 -> moving from alert strict to default deny
2734.89 -> was mostly in the background.
2737.11 -> To be 100% honest,
2738.52 -> the change was super easy
2740.59 -> because all we had to do was update the CloudFormation
2744.19 -> to add the action of drop established
2746.8 -> and do a CloudFormation stack update.
2748.39 -> That's literally it.
2749.223 -> It was three minutes and you're done.
2752.83 -> All of the effort was in prior to getting to that point.
2758.32 -> What we did is we worked with the development teams
2763.06 -> very, very, very closely.
2765.52 -> We implemented a developer request process.
2768.52 -> We added a hundred developer rules to the rule set,
2773.11 -> and this was purely based on analysis of traffic,
2776.44 -> evaluation of traffic,
2777.76 -> and discussion with the development teams
2779.77 -> to say, "Do you actually need to go to this health.gov site
2784.84 -> or do you need to do this?"
2786.07 -> And then we would have that discussion.
2788.41 -> We blasted them with notification and messaging.
2792.13 -> We didn't bury them to the point where
2794.89 -> they just drowned us out.
2796.72 -> But it was...
2798.91 -> We want it to be very explicit
2800.23 -> as to when what change was happening
2802 -> so they could expect and plan for it.
2804.49 -> We held daily office hour calls.
2807.07 -> We do have...
2809.47 -> We do have offices in India.
2811.33 -> So the office hours would alternate days
2813.82 -> so we could accommodate all of our development groups.
2818.05 -> And then again, we just modified the firewall.
2822.16 -> When we did it, we did it from a least used, least risk
2825.58 -> to most used, most risk method.
2828.94 -> Meaning we decided which of the firewalls
2831.67 -> was the least used.
2832.75 -> And if it went down because of a poor policy change,
2835.81 -> that was okay or that was the most acceptable.
2838.9 -> We started there for the initial deployment.
2841.18 -> And as we worked through it, we moved to the final,
2844.15 -> which was our main production site.
2849.64 -> Again, the rules did not change whatsoever.
2852.79 -> While the rule groups did not change,
2854.53 -> the rules just got more defined and more applicable to us.
2859.33 -> Here you can see the default action
2861.43 -> now includes drop established.
2865.21 -> So where are we now?
2866.53 -> Default deny is in place across everything
2868.66 -> in our AWS ecosystem.
2872.834 -> The whole change, including Aaron's work and the policy work,
2875.86 -> resulted in less than 10 medium issues.
2879.76 -> So it was not a very impactful change.
2884.86 -> We have, to date, over 120 developer requests
2890.59 -> through our process.
2893.35 -> Now we're really focusing on operational efficiencies.
2895.96 -> Literally a couple days ago, VPC prefix lists were announced,
2899.83 -> which basically allows us to group CIDRs,
2902.98 -> which I think might make our policies
2904.72 -> a little more flexible.
2906.37 -> We're also looking at refining policy sets.
2908.95 -> So as the business grows and new workflows come in,
2912.37 -> we're able to look at those more holistically.
2916.81 -> So real quick, I just wanna
2917.89 -> jump into the developer workflow.
2919.96 -> So basically, what we wanted to do was make it
2922.3 -> super, super easy for developers
2924.43 -> to have a say in what traffic is allowed in their account,
2929.17 -> granted it falls below our block list.
2933.1 -> So we do have a say.
2935.35 -> The workflow also has gating and guardrails encoded into it
2939.4 -> so developer can't just say,
2941.027 -> "Hey, I want my account to go
2943.18 -> all the way to full internet.
2945.19 -> Give me 0.0.0.0/0 outbound."
2948.25 -> And we just say, "Yeah, sure, whatever."
2950.74 -> We actually have a lot of guardrails
2952.09 -> that we built into the process to alleviate some of that.
2955.6 -> It's based in Git,
2956.53 -> so signoffs, and auditing, and governance
2958.75 -> are all a part of it.
2960.01 -> It's built through a pipeline so it's quick and efficient.
2963.7 -> And the approval process,
2966.22 -> we do approve it during business hours.
2968.5 -> If there's an incident, so if there's an actual outage,
2971.731 -> the NOC can approve in our place,
2973.517 -> and then we have a process where we're alerted and notified.
2977.11 -> So when the security team comes in the next day,
2979.63 -> they can perform that assessment.
2982.6 -> So on the predefined rule variable,
2985.06 -> the policy or the rule group
2987.4 -> has what's called a rule variable,
2989.74 -> and it's basically just IP sets.
2991.54 -> We, the security team, define those and manage those
2993.91 -> as new accounts are created.
2996.7 -> The JSON that you see under developer submission
3000.36 -> is literally the only thing they have to submit
3003.51 -> as part of their PR.
3006.66 -> So when they create their pull request,
3008.07 -> they have to tell us the account ID
3010.83 -> and what domain/IP, port,
3013.74 -> and protocol they want.
3018.84 -> We review that.
3020.16 -> We merge it.
3021.84 -> And out pops our Suricata rules.
3025.02 -> So here, what we're doing
3026.37 -> is we're basically using their input
3029.64 -> to create the Suricata rules that we then
3032.76 -> deploy into that specific developer rule group.
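A sketch of what that transformation might produce, with an invented submission and invented names; the IP set variable scopes the rule to the requesting account's /19:

```yaml
# Hypothetical developer submission in the PR:
#   { "account": "111122223333", "domain": "api.example.com",
#     "port": 443, "protocol": "tls" }
DeveloperRuleGroup:
  Type: AWS::NetworkFirewall::RuleGroup
  Properties:
    RuleGroupName: developer-requested-rules   # hypothetical
    Type: STATEFUL
    Capacity: 500
    RuleGroup:
      RuleVariables:
        IPSets:
          ACCT_111122223333_NET:               # the account's /19, security-managed
            Definition: ['10.42.0.0/19']
      RulesSource:
        RulesString: |
          pass tls $ACCT_111122223333_NET any -> any 443 (tls.sni; content:"api.example.com"; startswith; endswith; nocase; flow:to_server; msg:"dev request 111122223333"; sid:1000001; rev:1;)
```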
3038.55 -> I could go deep into this because it's actually super...
3043.41 -> In my opinion, it's super well-designed.
3044.76 -> There's a lot of catches that we wanted to account for.
3048.99 -> But the end result is this is what gets transformed
3052.35 -> into the firewall rule that allows their traffic to get through.
3057.33 -> So policy challenges.
3061.35 -> We did a lot of work,
3062.7 -> a lot of movement throughout the journey.
3068.16 -> What we found is that default ordering is nice,
3071.43 -> but it's limiting in advanced use cases.
3075.06 -> So we were really trying to get
3076.65 -> that alert kind of policy set to say,
3079.86 -> what are we seeing?
3081.66 -> We couldn't really get it easily.
3085.35 -> We were seeing a lot of the denies happening
3088.5 -> before the alerts would fire,
3089.73 -> so we were actually blocking more traffic
3091.98 -> than we had wanted to,
3094.05 -> without putting in pass rules above it.
3096.81 -> Domain list is a really cool feature of network firewall.
3100.05 -> The only gotcha is that they put a deny any at the bottom
3104.07 -> so it's literally a white list.
3105.66 -> You put in your domains, you hit save,
3107.97 -> it lists out the domains in Suricata form,
3110.73 -> and then it'll put a default deny at the bottom.
3114.69 -> So then that basically made us convert
3116.79 -> everything to Suricata,
3118.35 -> and we only use Suricata at this point.
3120.87 -> If you want an IP, a 5-tuple, so IP port protocol rule,
3125.25 -> that's Suricata.
3126.39 -> Everything now is Suricata.
3129.36 -> We didn't change anything else.
3132.9 -> HOME_NET is not RFC 1918.
3136.14 -> This one, in my opinion, is pretty funny.
3138.48 -> It's an oversight on our part, but it's a good gotcha.
3141.99 -> So when we were deployed in the distributed model,
3145.02 -> the firewall actually sat in the VPC.
3148.83 -> So
3150.42 -> it took the CIDR of the VPC
3153.32 -> and it was working fine.
3155.67 -> Because the default value of HOME_NET
3157.5 -> is the VPC's CIDRs,
3160.41 -> when we moved to a consolidated model,
3163.65 -> the only traffic that initially passed
3166.05 -> was the traffic that originated
3168.09 -> within the VPC of the firewall.
3171.51 -> All the other traffic was not being passed appropriately
3174.54 -> so we're like, "What's going on?"
3176.16 -> And then we did a little light reading,
3179.04 -> and we actually realized our mistake.
3181.5 -> So then we updated CloudFormation
3183.63 -> and put HOME_NET as a defined CIDR,
3186.93 -> as a rule variable, into all of the policy sets,
3189.66 -> or rather into all of the rule groups.
3191.58 -> And it fixed our issues almost instantaneously.
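The fix amounts to overriding the HOME_NET IP set variable in each rule group. A sketch, with a hypothetical supernet standing in for the address space that's actually routed through the firewall:

```yaml
AllowListRuleGroup:
  Type: AWS::NetworkFirewall::RuleGroup
  Properties:
    RuleGroupName: allow-list   # hypothetical
    Type: STATEFUL
    Capacity: 1000
    RuleGroup:
      RuleVariables:
        IPSets:
          HOME_NET:
            # Without this, HOME_NET defaults to the inspection VPC's CIDR
            # and traffic from the spoke VPCs never matches $HOME_NET rules.
            Definition: ['10.0.0.0/8']   # hypothetical org-wide supernet
      RulesSource:
        RulesString: |
          pass tls $HOME_NET any -> any 443 (tls.sni; content:"example.com"; startswith; endswith; nocase; flow:to_server; msg:"allow example.com"; sid:1100001; rev:1;)
```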
3196.56 -> Capacity limits is another one.
3198.39 -> So in the network firewall policy,
3202.65 -> you have what's called a capacity limit,
3204.51 -> and this is a hard set limit of 30,000.
3208.56 -> I equate 'em to rules,
3210 -> but it's basically 30,000 rules for your entire firewall.
3214.02 -> And you allocate these to each rule group.
3218.49 -> So if you just sort of haphazardly assign them
3223.23 -> and you don't think about it,
3224.91 -> you end up either having to
3226.08 -> blow away the rule group and recreate it,
3228.72 -> or you can sort of max out.
3230.79 -> And when you get to that
3234.09 -> max out capacity,
3237.18 -> you're sort of in a conundrum.
3239.85 -> So what we typically will do is we'll
3241.86 -> look at what the policy or the rule set's gonna be,
3244.68 -> and then estimate what we would see the top end being,
3248.94 -> pad it another five to 10%,
3251.34 -> and then that's sort of the capacity for that rule group.
3256.65 -> On the other end, we always leave some unallocated
3260.55 -> so that way we're never consuming the full 30,000.
3263.82 -> We're always leaving float.
3265.32 -> So that way if we need to add more down the road, we can.
3267.9 -> If we need to add a new rule group
3269.79 -> or to move the policy set around,
3273.57 -> we have that ability and that flexibility.
3276.18 -> Versus if we're stuck at that top 30,000,
3278.73 -> you sort of lose that flexibility.
3283.32 -> So the next one, the network layer
3285.15 -> versus application layer rule processing.
3288.06 -> This one is fickle.
3290.43 -> So when you're doing your rule set,
3292.47 -> if you have allow this source to this destination
3298.209 -> for all TCP traffic,
3300.12 -> you're never actually going to be able to do
3302.16 -> the analysis at the app level
3304.5 -> because it's just gonna see the TCP
3306.78 -> and let it through.
3308.16 -> So it's never going to sort of filter down to the app layer
3312 -> and get to looking at the domains or the methods.
3317.22 -> Any of the more IPS kind of rules
3321.18 -> typically won't get processed
3323.61 -> if you're passing it more at the L3/L4.
3326.52 -> So just be cognizant of your rules.
3330.78 -> Because the firewall does not
3334.717 -> do TLS decryption,
3337.23 -> it leverages the TLS server name indication extension,
3345.03 -> and almost everything supports it.
3346.47 -> However, less than 1% of our traffic
3349.11 -> does not support it in one way or another,
3351.72 -> or it hasn't throughout the course of our journey.
3355.14 -> And most of the time,
3357.03 -> this is resolved by updating the libraries.
3359.49 -> Sometimes we were finding teams that were using
3362.04 -> really old code bases
3364.47 -> that they just needed to update,
3367.41 -> sometimes forcing it to use TLS 1.2,
3370.05 -> even though their code was already using 1.2,
3373.29 -> for some reason forcing it explicitly fixed it,
3377.55 -> or adding a parameter as part of the web call.
3380.4 -> So when you're doing the HTTP request,
3384.54 -> actually putting in the parameter
3386.61 -> for the tls.sni and defining the server name
3391.68 -> that you wanna go to
3394.145 -> will have the firewall recognize it.
3396.18 -> Because if your client request doesn't include the SNI data,
3400.17 -> then the firewall will never see it,
3402.48 -> and it'll never match it against an HTTPS TLS rule.
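For illustration, this is roughly what an SNI-keyed rule looks like in Suricata form (domain and sid invented); if the ClientHello carries no SNI, the tls.sni buffer is empty and the rule can never match:

```yaml
# Fragment of a stateful rule group's RulesSource (names and sid invented).
RulesSource:
  RulesString: |
    # Matches example.com and any subdomain of it via the SNI.
    pass tls $HOME_NET any -> any 443 (tls.sni; dotprefix; content:".example.com"; endswith; nocase; flow:to_server; msg:"allow *.example.com by SNI"; sid:1200001; rev:1;)
```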
3408.33 -> And the last one is not necessarily a challenge,
3411.09 -> but it's just sort of a caveat, right?
3412.77 -> So debugging on a prod firewall,
3416.22 -> we ran into this,
3417.63 -> where teams would say, "Hey, it's working fine in non-prod,
3422.37 -> but it's not in prod."
3424.68 -> So how do we sort of handle this or how did we handle this?
3428.88 -> We don't wanna just go into the prod firewall
3430.59 -> and start typing around and making some rule changes
3433.23 -> and see if it fixes it, right?
3436.38 -> Also, firewall is priced on
3439.65 -> a fixed monthly cost plus additional usage.
3442.59 -> So ballpark figure, it's like $36,000
3445.71 -> just to have this test firewall floating around
3448.32 -> that you use once or twice a month.
3451.02 -> So cost might not be a viable option either.
3454.89 -> And then also, some things can't be replicated, right?
3458.7 -> For some unknown reason, some developer patched a system
3465.45 -> and didn't tell anyone.
3465.45 -> There's just oddities that can come up.
3467.88 -> We don't like them.
3468.713 -> We don't want them to happen,
3470.07 -> but sometimes these really can't be replicated.
3475.14 -> So what we end up doing is we actually put in one rule.
3478.2 -> We basically have a top level rule group
3480.3 -> that we use with a /32.
3483.96 -> We put it through the code, through our SCM.
3488.61 -> And that way, what we're able to do is say,
3490.867 -> "Okay, what's the broken system?"
3492.69 -> We put that IP as a /32 in the rule,
3495.93 -> and then we have the flexibility to change
3498.15 -> just that specific rule for that specific host.
3501.54 -> And because it's all done through code
3503.67 -> and through our pipeline, it's all auditable.
3506.58 -> It's all maintainable,
3507.99 -> and it's a way for us to easily perform
3513.81 -> that one-off testing when needed.
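A sketch of that one-off debug rule group, with an invented host IP and sid; being a separate top-priority group, it can be edited through the pipeline without touching the production rule sets:

```yaml
DebugRuleGroup:
  Type: AWS::NetworkFirewall::RuleGroup
  Properties:
    RuleGroupName: prod-debug   # hypothetical
    Type: STATEFUL
    Capacity: 10                # deliberately tiny allocation
    RuleGroup:
      RulesSource:
        RulesString: |
          # Scoped to the single host under investigation.
          pass tcp 10.42.7.15/32 any -> any 443 (msg:"temporary debug pass"; sid:1300001; rev:1;)
```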
3519.78 -> So that's the end of our presentation.
3521.88 -> So these are the links if you wanna learn more.
3525.36 -> Thank you for coming and have a great evening.

Source: https://www.youtube.com/watch?v=VMVeTvX4OLw