AWS re:Inforce 2022 - Deploying AWS Network Firewall at scale: athenahealth's journey (NIS308)
Aug 16, 2023
When the Log4j vulnerability became known in December 2021, athenahealth made the decision to increase their cloud security posture by adding AWS Network Firewall to over 100 accounts and 230 VPCs. Join this session to learn about their initial deployment of a distributed architecture and how they were able to reduce their costs by approximately two-thirds by moving to a centralized model. The session also covers firewall policy creation, optimization, and management at scale. The session is aimed at architects and leaders focused on network and perimeter security that are interested in deploying AWS Network Firewall.
Content
0.63 -> - All right, hello,
everybody, and welcome.
2.97 -> Sorry if I startled you.
5.07 -> Thanks for coming to our session
at the end of the day here.
9.06 -> We're gonna be talking about
10.47 -> deploying AWS Network Firewall at scale,
13.59 -> and we're using a real world example.
15.18 -> So my name is Dave Desroches.
16.62 -> I am an AWS solutions architect,
19.2 -> and I'm joined by Aaron
Baer and Mike McGinnis
22.68 -> who are coming from Athenahealth,
24.15 -> which is a customer that I support.
27.06 -> The origin of this came from a
29.85 -> deployment that they put in
place as a direct result of,
34.65 -> well, some things that they needed to do.
37.38 -> And I don't wanna steal their thunder,
38.85 -> so I'm gonna leave it to
them to tell the story,
40.83 -> but I thought that this would
make a really good story
43.23 -> for other customers to hear,
45.09 -> because it talks about some of the kind of
47.22 -> trials and tribulations
that they went through
49.65 -> as they went from basically
not having anything in place
52.56 -> to having this up in production
55.32 -> over a very short period of time.
57.18 -> So I think it's a good story.
60.27 -> I'm gonna start us out by
just doing a quick level set.
62.88 -> So I'm gonna go through
just a really quick
67.32 -> high level overview of network firewall,
70.11 -> and we'll talk a little bit about
71.25 -> a couple of the deployment
options that can go into place.
73.77 -> This is a 300 level session,
76.23 -> so we'll be getting into the weeds
77.73 -> of how these deployments are done.
80.43 -> Then I'm gonna hand it off to Aaron.
81.66 -> We'll go through the movement
from a centralized deployment
86.43 -> to a...
87.78 -> Sorry, from a distributed deployment
89.31 -> to a centralized deployment
in their environment.
92.07 -> And then finally, we'll go
through the policy and rules
94.65 -> that were put into place and
how that evolved over time.
98.37 -> So a little bit about
network firewall itself.
101.76 -> This is a managed firewall.
103.83 -> It kind of does what it says on the tin.
106.14 -> It's something where AWS
manages the firewall for you,
109.44 -> and then you leverage that firewall
111.54 -> by putting things like
policies, rules in place
114.23 -> in your environment.
115.95 -> This works for both north-south
117.66 -> and east-west kinds of deployments.
119.97 -> And you can choose if it's gonna be
121.2 -> something that is symmetric
122.91 -> or egress-only kind of deployment as well.
127.17 -> It scales automatically,
128.49 -> so it has an auto scaling
mechanism under the covers.
131.88 -> It's very reliable.
133.38 -> It has a very flexible
rules engine as you'll see
136.59 -> when Mike goes through his
section of the presentation.
139.92 -> It, like most things at AWS, is a
143.4 -> no upfront commitment kind of thing.
145.23 -> You can put it in place, try it out,
146.7 -> and use whatever you need to for it.
153.09 -> So from your perspective as a customer,
155.55 -> you are responsible for doing things
158.01 -> that make sense for your business.
159.54 -> You're gonna be putting
the policy in place,
161.58 -> be establishing the
rule sets and so forth,
164.07 -> and you'll be defining what
the topology looks like,
166.29 -> what the architecture looks like,
168.09 -> where you have this thing
in your environment,
169.83 -> what kinds of traffic it's
going to look at and so forth.
173.55 -> Optionally, you can either
deploy it on its own,
175.5 -> or you can manage it
using Firewall Manager.
178.47 -> And then from an AWS side, we
are basically responsible for
182.34 -> handling the scaling of
things under the covers,
185.55 -> making sure that you've got
throughput and performance
187.89 -> that is appropriate for what you're doing
191.01 -> and handling things like zonal affinity.
193.38 -> So if you are deploying in a model
196.02 -> where you're in a bunch of AZs,
198.06 -> making sure that the traffic goes
199.22 -> to the right place for the right things.
203.85 -> There are two main
categories of deployments.
209.04 -> There's distributed and centralized.
211.41 -> In a distributed deployment,
212.79 -> you basically have firewall endpoints
214.89 -> that are placed into each of the VPCs,
217.5 -> in fact, into each of
the AZs in those VPCs.
222 -> And those firewall endpoints are
223.65 -> something that you
point routing tables to.
225.57 -> So the traffic gets directed
through that network firewall,
229.95 -> and it comes back and then goes to
231.42 -> whatever destination you're going to,
233.49 -> be it another VPC or an
internet gateway or whatever.
238.77 -> So in a distributed model,
you've got a lot of endpoints.
241.5 -> They all kind of live
all over your environment
243.54 -> based on what you're trying to do.
245.46 -> In a centralized deployment,
246.87 -> you're folding those things
together into an inspection VPC,
251.04 -> and that VPC hangs off
of a transit gateway.
254.49 -> So all of the other VPCs
that you want to send
257.04 -> through that inspection
VPC for inspection,
260.04 -> kind of does what it says,
262.47 -> gets sent off the transit
gateway to that VPC,
265.2 -> goes to the firewall,
266.22 -> runs through whichever rules make sense,
267.9 -> and then it comes back out
269.31 -> and goes either out to the internet
270.63 -> or to another VPC in that environment.
273.06 -> So those are the kind
of ground rules here.
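As a rough CloudFormation sketch of that routing difference (parameter names and values here are illustrative, not from the session), the same default route targets either an in-VPC firewall endpoint or the transit gateway:

AWSTemplateFormatVersion: "2010-09-09"
Parameters:
  PrivateRouteTableId:
    Type: String   # route table of a workload private subnet
  FirewallEndpointId:
    Type: String   # vpce-... firewall endpoint in the same AZ (distributed model)
Resources:
  # Distributed model: internet-bound traffic goes to the in-VPC firewall endpoint.
  DistributedDefaultRoute:
    Type: AWS::EC2::Route
    Properties:
      RouteTableId: !Ref PrivateRouteTableId
      DestinationCidrBlock: 0.0.0.0/0
      VpcEndpointId: !Ref FirewallEndpointId
  # Centralized model: the same route would instead set
  #   TransitGatewayId: tgw-...
  # so traffic is forwarded to the inspection VPC hanging off the transit gateway.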
275.88 -> So I'm gonna hand it off now to Aaron,
277.68 -> and Aaron is gonna talk a little bit about
279.06 -> an actual deployment.
286.11 -> - Hi, welcome, everyone.
288 -> We are from Athenahealth.
290.7 -> And at Athenahealth,
we are helping to shape
294.45 -> the future of healthcare.
297.6 -> And we're doing that.
299.1 -> We partner with healthcare
organizations of all sizes
303.09 -> and provide them with technology,
305.46 -> and insights, and expertise
309.57 -> to help them drive better
clinical and financial results.
316.11 -> I'm Aaron Baer as mentioned.
319.08 -> I am a principal member of our staff
322.98 -> and team lead for our
AWS infrastructure team.
328.17 -> And my goal here is to get into some
331.62 -> highly technical details of our journey
334.05 -> from a distributed deployment of
338.82 -> AWS Firewall to our now centralized
343.71 -> deployment.
344.543 -> So how can...
347.61 -> And a big part of this is
how a process like this
350.91 -> can drive your costs lower,
in fact, in many aspects.
356.64 -> So we're deploying a distributed firewall
359.43 -> to each VPC initially
362.13 -> and move to a deployment model
365.25 -> where our inspection is centralized.
370.17 -> So where did this start?
371.25 -> So this is about a
four-month journey for us.
375.36 -> And it started in December with Log4j.
378.113 -> I think a lot of people
had a lot of things
380.4 -> that started in December because of Log4j.
385.53 -> And middle of December,
387.87 -> I think I was just
about to go on vacation.
390.78 -> And on a Thursday,
394.41 -> I'm pulled into our calls.
396.48 -> We're discussing all of these things.
398.577 -> And it became quickly apparent
400.38 -> that we needed a really rapid solution
403.44 -> to get some mitigation in place.
406.47 -> Up until this event,
411.27 -> all of our VPCs were deployed
413.58 -> in a fairly standard model
where we used NAT egress
417.33 -> out of the internet gateway of a VPC,
420.72 -> and there was no need for
inspection at the time,
424.5 -> and everything was basically
in and out of a VPC
427.92 -> within an account.
430.92 -> So it was also definitely
discussed and stated
435.33 -> that we didn't really have
any egress security in place.
437.67 -> Well, we had some.
438.87 -> Our VPCs have network ACLs.
441.69 -> Our public subnet network ACLs
445.38 -> are much more tight
448.08 -> even for egress than our
private subnet network ACLs,
453.42 -> but we didn't have any inspection
455.88 -> on egress to the internet.
459.33 -> A little bit about our
environment overall,
462.21 -> we definitely operate a
multi account environment.
467.73 -> We manage over well over 200 VPCs
470.55 -> across all of those accounts.
474.99 -> we only operate in US healthcare
and that sort of thing,
479.01 -> so we actually only
operate within two regions.
483.72 -> But within those two regions,
485.01 -> we deploy subnets across seven
488.76 -> or more availability zones,
491.43 -> but at least seven.
494.4 -> We also have multiple
on-premises data centers,
501.12 -> and we essentially
operate two network zones.
504.15 -> We have production and
we have not production.
510.84 -> Much of this has already been in place.
512.7 -> We've kind of been operating in this space
514.65 -> for six years at least.
518.13 -> And so, our network
infrastructure also already had
521.64 -> some things that are common,
523.56 -> some things that are really important
525.09 -> to operating a large
scale network like this.
529.14 -> All of our VPC addresses,
network address space
532.59 -> is non-overlapping between accounts.
535.41 -> Every account is assigned a common /19.
539.88 -> We break that /19 into /20s per region.
544.86 -> We already use transit gateway
547.02 -> for communication between accounts,
550.02 -> for
552.27 -> using our VPN and Direct Connect.
554.7 -> We terminate those and
propagate those routes
557.91 -> into our transit gateway
route tables already.
562.95 -> And then a couple of key components that
565.35 -> assisted this as a rapid transformation.
569.46 -> All of our VPC infrastructure is in parity
571.59 -> across all of our accounts.
573.81 -> So we use CloudFormation,
576.21 -> and we use CloudFormation parameters
579.51 -> for the uniqueness of a VPC's deployment,
583.05 -> but every single VPC,
and every single account,
585.87 -> and every single region is
deployed from the same template.
591.27 -> So when we did this,
595.17 -> it was pretty straightforward.
596.58 -> But my hook to you is
600.15 -> we saved a lot of money
as our project went on.
605.88 -> And
607.71 -> so we started with nothing.
609.69 -> You can see from our graphs
here of AWS Firewall costs.
614.22 -> In December, we didn't have anything.
615.87 -> Log4j hit, 100% increase.
618.63 -> We're now having to spend more money.
623.67 -> And over the course of our
journey and our story here,
628.14 -> we were able to reduce that
initial increase of cost
632.73 -> by 98.72%.
636.063 -> And in addition to that,
638.7 -> the infrastructure changes
641.67 -> allowed us to also reduce
our NAT gateway resources
646.71 -> by 96.66%
649.65 -> because of networking dependencies
653.97 -> that we'll discuss shortly.
656.88 -> So overall, it was a big success.
658.59 -> We were able to really put
in place something rapidly,
662.94 -> and, through changes, make big impact
667.59 -> on how we're using AWS Firewall
670.53 -> and the amount of money
we're spending to do that.
675.18 -> So let's get started on
setting some ground level here
679.95 -> about how our networks are configured.
684.24 -> At the beginning,
685.073 -> we have a fairly standard
VPC configuration.
687.57 -> As I mentioned, they're all the same.
689.82 -> Also, we're using infrastructure
as code practices.
694.29 -> And you can see in the before scene,
695.61 -> it was a fairly standard VPC.
697.02 -> We had a public subnet with a NAT gateway.
700.26 -> We had a private subnet
703.5 -> that then routed through the NAT gateway
705.6 -> to the internet gateway of the VPC.
708.99 -> Common practice, we deploy public-facing
711.66 -> load balancers to the public subnets.
713.55 -> Everything else goes
in the private subnets.
717.06 -> And after our first round of
change, our distributed model,
720.81 -> you can see a change here.
721.83 -> We had to add another subnet
724.08 -> to every availability zone within our VPC
728.04 -> to facilitate adding the
gateway load balancers
731.1 -> for the network firewall.
734.82 -> And when you do that, it's obvious.
738.12 -> It is the way it is, right?
739.35 -> The cost of network firewall
is directly proportional
742.08 -> to the number of gateway load
balancers that you deploy.
745.56 -> Similar to a NAT gateway.
746.94 -> As soon as you deploy one,
748.59 -> you're gonna be paying for it per hour.
752.91 -> In this distributed model though,
756.27 -> the one kind of benefit that
we had at the time is that
760.14 -> if you deploy gateway load balancers
762.21 -> to the same account that has NAT gateways,
765.96 -> that then offsets the
costs of the NAT gateways,
770.28 -> but you're still paying for
your gateway load balancers.
773.7 -> But since we had the automation in place,
775.56 -> since we had the code ready,
777.9 -> I was actually able to
781.29 -> make the infrastructure changes
783.6 -> to add the subnet in the
gateway load balancers for this
787.05 -> in about 2 1/2 days of
development over the weekend.
790.44 -> The word was let's go.
792.06 -> We need to do this.
793.02 -> No matter what, something has to happen.
796.05 -> And fortunately, that was able to be
799.83 -> completed in a rapid fashion.
802.02 -> The development was in a couple of days,
804.57 -> deployment of non-prod
was in the next day.
807.69 -> Successfully deployed to non-prod.
809.64 -> Awesome, let's go to prod
the next day after that.
812.43 -> And we already have
814.89 -> supporting deployment infrastructure
817.2 -> that allows us to go through each account
820.83 -> running the same commands
to provide those updates
823.8 -> based on our CloudFormation
template changes.
829.83 -> So, as you can see,
831.96 -> there are a couple of additions
833.19 -> to our VPC infrastructure at this time.
837.849 -> In the previous, it was
just a private subnet,
841.41 -> a public subnet with a NAT
gateway, internet load balancer.
844.5 -> Pretty simple.
846.36 -> In this iteration,
848.43 -> we now have a private subnet,
850.17 -> a public subnet with a NAT gateway,
852.45 -> and a new firewall security subnet
857.34 -> with the gateway load balancer.
859.8 -> And if you notice at the top,
860.94 -> we have an addition there
862.38 -> where we have to attach
864.12 -> a route table to the internet gateway.
867.36 -> And that was a bit of a new thought
870.6 -> compared to the way I had
originally designed our VPCs.
877.23 -> And
879.66 -> I think the primary reason of this is that
883.89 -> the internet gateway itself
has to know return routes
886.89 -> into your VPC.
889.95 -> And to do that, you can
associate a route table now
892.71 -> to an internet gateway
that's attached to your VPC.
898.5 -> A bit of a trick
that you see here though:
903.36 -> our NAT gateway isn't the next
hop to our internet gateway.
908.7 -> We now have a gateway
load balancer in between.
912.63 -> And so you actually have
to manage the route table
915 -> attached to your internet gateway,
916.86 -> 'cause if you have
multiple availability zones
920.19 -> and then the route of the public subnet
923.61 -> with the NAT gateway is
the gateway load balancer
927.27 -> and not the internet gateway,
929.55 -> the traffic has to return
to the same NAT gateway
933.57 -> that it left your network from.
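A sketch of that internet gateway route table in CloudFormation, for a single AZ (all names and values are illustrative): the edge route table sends return traffic for each public subnet's CIDR back through the firewall endpoint in the same AZ, so it reaches the NAT gateway it left from.

AWSTemplateFormatVersion: "2010-09-09"
Parameters:
  VpcId: { Type: String }
  InternetGatewayId: { Type: String }
  PublicSubnetACidr: { Type: String }       # e.g. the /24 holding NAT gateway A
  FirewallEndpointAzAId: { Type: String }   # vpce-... in the same AZ
Resources:
  IgwRouteTable:
    Type: AWS::EC2::RouteTable
    Properties:
      VpcId: !Ref VpcId
  # The new piece: a route table associated with the internet gateway itself.
  IgwEdgeAssociation:
    Type: AWS::EC2::GatewayRouteTableAssociation
    Properties:
      GatewayId: !Ref InternetGatewayId
      RouteTableId: !Ref IgwRouteTable
  # Return traffic for public subnet A re-enters via the AZ-local firewall endpoint.
  ReturnRouteAzA:
    Type: AWS::EC2::Route
    Properties:
      RouteTableId: !Ref IgwRouteTable
      DestinationCidrBlock: !Ref PublicSubnetACidr
      VpcEndpointId: !Ref FirewallEndpointAzAId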
938.52 -> So with those changes,
we had some challenges
942.9 -> and some interesting things that came up
945.66 -> as I was going through this.
948.57 -> As I mentioned, all of our
VPCs are deployed with subnets,
953.76 -> basically /24 subnets
of a /20 for the region,
958.47 -> which leaves some space
959.88 -> most of the time in the
original private network
964.89 -> that's assigned to the account.
967.62 -> But it didn't really give the correct
970.89 -> networking address space to
add three or four more subnets
978.48 -> for the gateway load balancers.
979.86 -> So I used an additional
CIDR overlay to the VPC
986.28 -> in this 100.9x range: 100.90,
100.91, 100.92, 100.93.
993.81 -> So that we didn't just
consume that last bit
996.84 -> of address space that
we already had available
999.69 -> for growth in our existing VPCs,
1002.12 -> but we could still put these subnets in
1004.67 -> to get our load balancers,
gateway load balancers available.
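In CloudFormation terms, that overlay is a secondary CIDR association plus subnets carved out of it; a minimal sketch with illustrative values:

AWSTemplateFormatVersion: "2010-09-09"
Parameters:
  VpcId: { Type: String }
Resources:
  FirewallOverlayCidr:
    Type: AWS::EC2::VPCCidrBlock
    Properties:
      VpcId: !Ref VpcId
      CidrBlock: 100.90.0.0/16        # overlay outside the account's assigned /19
  FirewallSubnetAzA:
    Type: AWS::EC2::Subnet
    DependsOn: FirewallOverlayCidr    # the subnet must wait for the association
    Properties:
      VpcId: !Ref VpcId
      AvailabilityZone: !Select [0, !GetAZs ""]
      CidrBlock: 100.90.0.0/28        # a firewall endpoint only needs a tiny subnet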
1011.06 -> But we have hundreds of VPCs.
1013.34 -> We have seven availability zones.
1015.83 -> We have now additional
gateway load balancers
1018.25 -> in each one of these.
1020.03 -> So in 2 1/2 days, we
essentially deployed 1,680
1025.46 -> firewall endpoints across
all of our infrastructure.
1029.3 -> And the cost of over a thousand
gateway firewall endpoints
1034.16 -> is pretty high.
1035.96 -> So, Log4j, the word was do this, right?
1040.67 -> We don't care necessarily
how much this costs.
1042.77 -> We have to put this in place.
1045.086 -> It's a security priority for us.
1050.99 -> But that comes with some
1053.6 -> discussion that we have
in a minute as well.
1057.86 -> Another gotcha was the internet gateway
1061.73 -> route table that I discussed,
1065.36 -> and the return routes to
map to those NAT gateways.
1072.98 -> And the conundrum there is,
1075.14 -> in CloudFormation,
1077.27 -> the stack can give you resource IDs
1081.29 -> for the NAT gateways that have been
1084.44 -> deployed into that instance
of the CloudFormation stack,
1088.7 -> but CloudFormation won't
actually give them back to you
1091.97 -> in any consistent manner.
1094.94 -> So you can't easily map
1097.94 -> "this NAT gateway ID is in this AZ"
1101.69 -> so that my route table
1105.29 -> entry can say,
for this, I get this...
1109.88 -> The other portion of that,
you need your NAT gateway ID,
1114.05 -> you need your private subnet's CIDR range,
1117.86 -> so that little /24 that
is the return route.
1122.36 -> And so
1124.61 -> a complication of this distributed model
1126.98 -> and CloudFormation is
that we had to write
1128.66 -> some external tooling that
then queried the account,
1133.07 -> built a hash of that
information, looped through,
1136.25 -> and assigned those routes into the route table.
1139.55 -> It worked in automation because
you can do this one thing,
1141.86 -> and then you can run this next thing,
1143.93 -> but it was definitely something
1148.282 -> that was more than just being able to do
1151.4 -> one single CloudFormation
update to make all of it happen.
1157.73 -> And then also in a distributed model,
1161.45 -> you're effectively also
deploying an AWS Network Firewall
1166.64 -> and a firewall policy to
attach to that firewall
1170.24 -> in every single account.
1171.74 -> So how do you manage that
across all of the accounts?
1176.03 -> Well, one way we were
able to do that is we
1179.93 -> centralized our rules
in a central account
1184.46 -> and shared those rules with
all of our organization
1187.91 -> via Resource Access Manager.
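A minimal sketch of that sharing, assuming an organization-wide share (the ARNs are placeholders):

AWSTemplateFormatVersion: "2010-09-09"
Parameters:
  CentralRuleGroupArn: { Type: String }   # stateful rule group in the central account
Resources:
  RuleGroupShare:
    Type: AWS::RAM::ResourceShare
    Properties:
      Name: network-firewall-rule-groups
      ResourceArns:
        - !Ref CentralRuleGroupArn
      # Share with the whole organization rather than listing each account.
      Principals:
        - arn:aws:organizations::111122223333:organization/o-exampleorgid
      AllowExternalPrincipals: false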
1195.56 -> So that was great.
1197.6 -> We put it in place,
1199.61 -> but we quickly realized
it's very expensive.
1202.85 -> And we quickly also had discussion that,
1208.1 -> awesome, we did it, we mitigated
1209.96 -> what we needed to for the moment,
1211.49 -> but is it the best solution?
1214.31 -> So a few things that
helped us transition into
1218.3 -> a centralized model was
that we were already using
1222.86 -> transit gateway route table
among all of our accounts.
1225.41 -> So we had extensive routing.
1227.81 -> We had ways that the network
1229.91 -> was able to pass traffic around
1232.16 -> amongst our entire network.
1235.58 -> And through our discussions with InfoSec
1238.314 -> and the rest of the organization,
1240.89 -> we did come to the realization that,
1243.05 -> ultimately, all we need to inspect,
1246.23 -> to provide the service that we need,
1249.17 -> is the egress
1250.97 -> that's actually leaving to the internet.
1254.9 -> And then we are able to put ourselves
1257.45 -> in a place of a phased migration
1259.91 -> to move through these new discoveries
1263.03 -> and move to what will end
up being our final solution
1267.44 -> as we go through.
1269.87 -> As we were going through this,
resources could be removed
1273.32 -> that were previously deployed,
1275.93 -> and all of our migration
steps could be automated
1278.15 -> because we were already actually deploying
1280.22 -> using infrastructure as code
1284.03 -> and automated updates
and deployment practices.
1290.36 -> So where did we start to go?
1293.6 -> We started to go to what is
the centralized model here.
1296.75 -> So you can see on the right
1299.69 -> that we have a set of AWS accounts
1303.02 -> and each of those AWS
accounts has their own VPC.
1306.32 -> And this example, we're in a region,
1309.77 -> and we're in a specific environment.
1312.02 -> So this is then essentially replicated
1314.93 -> for each region that we're in
1317.78 -> and for each networking environment
1319.58 -> that we're talking about.
1322.22 -> And we are already using transit gateways.
1323.69 -> So traffic from one account to another
1328.07 -> was already just routing right
through the transit gateway.
1330.59 -> Nothing had to change there.
1331.61 -> All those routes were already propagated
1333.86 -> into our transit gateway route table.
1337.07 -> We knew that we didn't need any inspection
1339.8 -> between traffic that goes
between our own private accounts.
1343.52 -> We were already propagating and connecting
1346.1 -> to our number of data centers
1348.83 -> using VPN attached to the transit gateway.
1351.89 -> Those routes were already in place.
1354.08 -> And so during this
migration, this transition
1356.75 -> from distributed to centralized,
1361.07 -> we could put in place the inspection VPC
1364.55 -> and then affect the network
by just changing routes.
1368.09 -> So traffic that has a destination
1370.4 -> of 0.0.0.0/0, the default destination,
1374.21 -> coming out of a private subnet,
1375.89 -> instead of sending it to a
NAT gateway as we used to do,
1379.73 -> we just update the route, send
it to the transit gateway.
1383.93 -> You add an additional default route
1385.64 -> to your transit gateway route table
1388.22 -> and you associate your inspection VPC
1393.47 -> to your transit gateway as well,
1396.56 -> and your default route to 0.0.0.0/0 out of the
1400.4 -> transit gateway route table
is your inspection VPC.
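Those two route changes might look like this in CloudFormation (all IDs are illustrative parameters):

AWSTemplateFormatVersion: "2010-09-09"
Parameters:
  PrivateRouteTableId: { Type: String }
  TransitGatewayId: { Type: String }
  TgwRouteTableId: { Type: String }
  InspectionVpcAttachmentId: { Type: String }
Resources:
  # In each workload VPC: private subnets now default-route to the transit gateway.
  PrivateDefaultRoute:
    Type: AWS::EC2::Route
    Properties:
      RouteTableId: !Ref PrivateRouteTableId
      DestinationCidrBlock: 0.0.0.0/0
      TransitGatewayId: !Ref TransitGatewayId
  # In the TGW route table: the default route points at the inspection VPC attachment.
  TgwDefaultRoute:
    Type: AWS::EC2::TransitGatewayRoute
    Properties:
      TransitGatewayRouteTableId: !Ref TgwRouteTableId
      DestinationCidrBlock: 0.0.0.0/0
      TransitGatewayAttachmentId: !Ref InspectionVpcAttachmentId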
1406.97 -> And in this model as well,
1410.96 -> we already have inspection of ingress
1415.61 -> into a public-facing endpoint
1419.6 -> by requiring application load balancers
1422.36 -> deployed to public subnets
to have WAF attached.
1426.59 -> So ingress traffic to a service
1429.89 -> can still come through the
internet gateway of the VPC
1433.7 -> in the account where
the service is deployed.
1436.25 -> WAF is in place to provide security
1439.01 -> for that ingress traffic.
1440.78 -> We are only needing to be worried
1443.54 -> about the traffic that
goes to the internet.
1448.79 -> So let's talk a little bit more about
1450.08 -> the inspection VPC itself.
1452.18 -> So in the distributed model,
1459.53 -> I had to use that additional
CIDR range overlay
1462.29 -> that was kind of outside
of our private network
1464.45 -> in the first place.
1465.98 -> It worked very well,
1467.57 -> but it wasn't entirely great
1469.73 -> because, all of a sudden,
we've got this 100.90 subnet
1473.54 -> or CIDR range attached to subnets
1475.64 -> in these VPCs that have
a private network space
1478.4 -> that's different than that.
1481.37 -> So we have a net new VPC
that we're able to build.
1484.64 -> And with that,
1485.93 -> we can do the same thing
we do in every account.
1487.73 -> We assign a /19 of our private network space
1492.47 -> that's designated
specifically for this account.
1494.96 -> We keep record of where that should live,
1498.14 -> what account it should be associated to,
1501.47 -> and then gave more flexibility
1503.6 -> to deploy the subnets out of that
1507.32 -> private subnet range.
1508.28 -> So at the end of that,
1510.2 -> each of our subnets for the
inspection VPC are now
1516.2 -> sharing the same non-overlapping
private network space
1520.04 -> that the rest of our entire network uses.
1522.44 -> And that's a big benefit
1524.6 -> if at some point in the future
1526.37 -> we wanna add some other
inspection component
1528.71 -> to our network.
1530.36 -> We can go into this VPC.
1533.09 -> But for our use case here,
1536.48 -> we just need to implement
transit, sorry, network firewall.
1541.82 -> And so we have three layers of subnets.
1544.16 -> So now you notice a difference
in this subnet layout
1548 -> than the previous subnet layout
1549.62 -> since we're able to
define this one net new.
1555.23 -> Our private subnets are essentially
1558.05 -> attached with an ENI that's
attached to the transit gateway.
1562.88 -> Those routes are then propagated
into the transit gateway,
1567.47 -> into the transit gateway route tables.
1569.87 -> The next layer, we actually then do
1572.84 -> network firewall gateway load balancers.
1576.26 -> And
1578.75 -> so the route table for the
first subnet has a default route
1583.4 -> to the gateway load
balancer of our firewall.
1586.22 -> Inspection can happen.
1588.08 -> And then the default route for
1592.67 -> the network firewall subnets
1595.97 -> is the public subnet
of the inspection VPC,
1598.43 -> which holds the NAT gateway.
1600.74 -> And the benefit of this is
1602.63 -> now that the NAT gateway next
hop is the internet gateway,
1606.2 -> you don't have to manage
the routes like you used to.
1608.48 -> You don't have to manage
the routes the same way.
1609.95 -> You actually don't have to
manage any routes at all anymore.
1613.4 -> With the route table attached
to the internet gateway
1616.43 -> and the NAT gateways being that hop out,
1621.86 -> the traffic routes
back in as you'd expect.
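Sketching one AZ of that three-layer route chain in CloudFormation (IDs illustrative):

AWSTemplateFormatVersion: "2010-09-09"
Parameters:
  TgwSubnetRouteTableId: { Type: String }
  FirewallSubnetRouteTableId: { Type: String }
  PublicSubnetRouteTableId: { Type: String }
  FirewallEndpointId: { Type: String }
  NatGatewayId: { Type: String }
  InternetGatewayId: { Type: String }
Resources:
  # Layer 1: TGW attachment subnets send everything to the AZ-local firewall endpoint.
  TgwSubnetDefaultRoute:
    Type: AWS::EC2::Route
    Properties:
      RouteTableId: !Ref TgwSubnetRouteTableId
      DestinationCidrBlock: 0.0.0.0/0
      VpcEndpointId: !Ref FirewallEndpointId
  # Layer 2: firewall subnets send inspected traffic to the NAT gateway.
  FirewallSubnetDefaultRoute:
    Type: AWS::EC2::Route
    Properties:
      RouteTableId: !Ref FirewallSubnetRouteTableId
      DestinationCidrBlock: 0.0.0.0/0
      NatGatewayId: !Ref NatGatewayId
  # Layer 3: public subnets go straight to the internet gateway. With the NAT
  # gateway directly in front of the IGW, no edge route table is needed anymore.
  PublicSubnetDefaultRoute:
    Type: AWS::EC2::Route
    Properties:
      RouteTableId: !Ref PublicSubnetRouteTableId
      DestinationCidrBlock: 0.0.0.0/0
      GatewayId: !Ref InternetGatewayId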
1630.83 -> So some of the challenges and some of the
1634.37 -> key points I wanted to
kinda point out here
1637.46 -> is that that simplification
1639.02 -> of the internet gateway route table
1640.58 -> was effective for us.
1643.76 -> And having the NAT
gateways on the other side
1647.12 -> of the gateway load balancers
1649.7 -> was what drove that simplification.
1653.03 -> And then we were able to drop
out the external scripting
1656.45 -> outside of CloudFormation
to manage those things
1659.3 -> and we're back to a single
template, a single update
1663.14 -> to manage the entire network
for that inspection VPC.
1668 -> And as we're going through this,
1669.53 -> one of the gotchas that we found
1671 -> is that network firewall manager
1673.01 -> doesn't really work for us in this way.
1676.37 -> And a couple of the
main reasons why is that
1680.81 -> network firewall manager kinda wants
1682.91 -> to build its own subnets,
1684.29 -> kinda wants to manage its own resources,
1686.18 -> and you don't have much control over that.
1688.64 -> So if you're already really controlled
1690.38 -> when you're in your network
1691.94 -> and you wanna just kinda
insert firewall into it,
1696.86 -> network firewall manager can do it,
1698.48 -> but it kind of gets a little clunky.
1700.13 -> It didn't really work for us.
1703.204 -> And then we couldn't just inherit subnets
1705.59 -> and say we want our
firewall gateways here.
1708.29 -> Like I said, it creates its own.
1710.45 -> It's expecting to be able to do that.
1712.07 -> So we didn't use network firewall manager
1717.383 -> to manage our network firewall.
1719.9 -> But a benefit is
1722.06 -> now that all private traffic
out of our private subnets
1725.6 -> that's destined for the internet
1727.43 -> no longer had to go through
1730.64 -> a NAT gateway within the same VPC
1734.48 -> and the NAT-ing for that traffic
1736.25 -> was handled by NAT gateways
in the inspection VPC.
1740.51 -> It gave us the opportunity to actually
1742.88 -> remove all NAT gateways
from all of our VPCs.
1746.9 -> So when we deployed
1749.9 -> over 1,500 firewall
gateways into every account,
1754.82 -> we still also had NAT gateways.
1758.66 -> And when we transitioned
to the centralized,
1760.85 -> we were able to remove all of
those network firewall
1764.6 -> gateway load balancers.
1766.61 -> But then in addition to that,
1768.77 -> we're also able to
1771.74 -> remove all NAT gateways.
1773.33 -> And previously, we had already
1774.71 -> built into our CloudFormation
1776.12 -> that manages our VPC networks
1780.71 -> a conditional parameter of
enable NAT gateways true, false,
1787.22 -> and then a
1790.58 -> dependency within the resource creation
1793.25 -> of our CloudFormation that said
1795.2 -> if this parameter's set to false,
1797.06 -> then don't build the NAT gateways
1798.8 -> or remove the resources.
1801.95 -> So in our automation, in
our infrastructure as code,
1804.83 -> we were able to just say, "Hey,
turn off the NAT gateways,"
1807.05 -> boom, and run an update against the VPC,
1809.6 -> and removed over a thousand NAT gateways
1812.81 -> that we had run for years.
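A minimal sketch of that conditional pattern (parameter and resource names illustrative):

AWSTemplateFormatVersion: "2010-09-09"
Parameters:
  EnableNatGateways:
    Type: String
    AllowedValues: ["true", "false"]
    Default: "true"
  PublicSubnetAId: { Type: String }
Conditions:
  NatGatewaysEnabled: !Equals [!Ref EnableNatGateways, "true"]
Resources:
  NatEipA:
    Type: AWS::EC2::EIP
    Condition: NatGatewaysEnabled
    Properties:
      Domain: vpc
  # Flipping the parameter to "false" and running a stack update deletes these.
  NatGatewayA:
    Type: AWS::EC2::NatGateway
    Condition: NatGatewaysEnabled
    Properties:
      SubnetId: !Ref PublicSubnetAId
      AllocationId: !GetAtt NatEipA.AllocationId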
1816.68 -> I did mention our internet gateway,
1819.41 -> I'm sorry, internet-facing load balancers.
1822.26 -> We're already using WAF.
1823.49 -> So that was a complication
that we didn't have to solve
1825.38 -> with the network firewall.
1828.02 -> And
1830.12 -> a driving factor of us being
able to move to centralized
1833.69 -> was the fact that we
only needed to inspect
1838.25 -> internet-bound traffic.
1839.87 -> Traffic was already being
inspected between network zones,
1843.11 -> on-premise network firewall
or on-premise firewalling
1846.29 -> that had already been in place.
1849.32 -> We didn't need any inspection between
1851.87 -> AWS accounts within the
same network zone as well.
1858.62 -> So all of these things allowed us to drive
1861.71 -> a phased change approach.
1864.89 -> We were able to quickly
mitigate our security issues
1870.02 -> that came up in December for Log4j.
1874.19 -> We were able to deploy
very quickly to do that.
1878.9 -> It was that moment of
this is more important
1882.08 -> than how much it costs
1884.63 -> until we deployed it.
1886.79 -> And the cost showed up, and we were like,
1889.227 -> "Hey, great, but maybe that's
not the right solution."
1893.57 -> It allowed us though to
step back and identify
1896.72 -> all the pieces that we do really need
1898.97 -> in a more pragmatic and calm discussion
1903.29 -> and ask questions of ourselves like,
1907.1 -> what is the best firewall
that we need to use?
1909.83 -> Is it from a vendor?
1911.87 -> Is it from something else?
1913.85 -> Is it for this network traffic?
1915.68 -> Is it not for this network traffic?
1918.65 -> We were able to have discussions
with external vendors.
1921.5 -> And one of the things we found at the time
1924.44 -> was they may have an excellent solution
1928.73 -> for network firewalling,
1931.04 -> but most of the time it was,
1933.83 -> well, you have to deploy
your own EC2 instances
1936.32 -> and manage your own infrastructure,
1938.12 -> and then you run our software
1939.89 -> to provide the firewalling services.
1942.38 -> And in this critical moment,
1945.47 -> our teams were kind of constrained.
1947.57 -> Honestly, I was the
person who was managing
1950.78 -> all of the infrastructure at the time,
1952.31 -> so we didn't have a lot
of bandwidth to say,
1956.187 -> "Okay, well, we can start managing
1957.65 -> a new fleet of EC2 instances
1959.3 -> and have the time to
write all the automation
1961.67 -> to install the third party vendor software
1964.43 -> and configure it with
automation, all these things."
1967.34 -> It didn't really work for us,
1968.87 -> but we were able to take that moment
1971.51 -> to have those discussions.
1975.74 -> Additionally, we didn't have to
1978.8 -> change everything all in one swoop.
1981.32 -> We were able to deploy the inspection VPCs
1983.18 -> and make sure that their
infrastructure was correct.
1986.63 -> And then we're able to take
the same firewall rules
1990.74 -> that we put in place for
1994.01 -> the centralized model
1995.87 -> and apply them to the inspection VPC,
1999.38 -> and then we were able to
migrate just our networking
2003.352 -> as we needed to in phased changes.
2006.758 -> So we're able to just change the route
2008.71 -> for the destination of the internet
2011.47 -> to the transit gateway in
all of our private subnets,
2015.01 -> and see that traffic move,
2017.29 -> but we already knew that
our firewall rules were in place:
2019.84 -> our security team's custom
2025.42 -> "definitely block this
outbound traffic" rules,
2028.99 -> plus additional AWS managed
rules that we had set in
2034.3 -> evaluation mode.
2036.7 -> We just used the same
rules for our first phase.
2038.77 -> So we knew when we pushed
the traffic over there,
2042.01 -> it wasn't gonna affect the
existing network traffic
2044.44 -> 'cause it was the same rule set.
2046.81 -> And then
2049.57 -> once we moved all of the networking,
2053.26 -> then we could focus on
2055.42 -> the next iteration of our firewalling
2058.57 -> and not only were we able to continue
2062.62 -> to mitigate the security
vulnerability issue with Log4j,
2066.903 -> we were able to also decide, well, hey,
2069.34 -> here's our chance to also
put in place best practices
2073.24 -> of actual default deny outbound egress
2076.66 -> for our entire network
2079.54 -> in a more controlled way.
2082.03 -> And so, we were able to then
2085.27 -> work on the firewall and the policy
2088.99 -> and start building a more strict policy
2091.87 -> for these new rules that are in place.
2095.2 -> Then we were able to update that firewall
2097.87 -> and attach our strict policy
2100.66 -> using alert established at first.
2104.92 -> And so we were essentially
passing all traffic through
2108.13 -> and logging all of that data
2110.53 -> so that we could then
inspect our network traffic
2113.26 -> without affecting actual information.
2118.09 -> And then we were able to communicate
2119.56 -> with our developer teams
2121.12 -> for them to start
looking at their own data
2123.31 -> because we were becoming aware
2125.32 -> and starting to state that,
2126.857 -> "Hey, we're gonna do default deny."
2129.46 -> And by default, if it's
not our own domain,
2132.64 -> if it's not our own known destination,
2134.89 -> untrusted places on the internet,
2137.77 -> default is gonna be...
2140.44 -> Deny is gonna be default.
2143.41 -> And we were able to do that using
2145.66 -> alert established on the policy first,
2148.75 -> logging and evaluating that.
2151.06 -> And then when we knew we were pretty good,
2153.7 -> when we had talked to all
of our developer teams
2156.16 -> and we're ready to go,
2157.69 -> we switch our policy just by
flipping alert established
2161.44 -> and adding drop established to our policy
2164.92 -> and everything was in place.
2167.41 -> So that is kind of the technical details
2170.53 -> of the network journey,
2171.97 -> and I'm gonna pass it on
to Mike now at this point,
2174.91 -> and he's gonna talk more
about the firewall rules.
2180.25 -> - Thanks, Aaron.
2184.9 -> Hey, everybody.
2185.733 -> My name is Mike McGinnis.
2186.7 -> I'm the principal security
engineer at Athena
2190 -> and I lead the public
cloud security group.
2193.87 -> So as Aaron mentioned,
we've gone on this journey
2197.71 -> from decentralized to centralized.
2199.96 -> So what we're looking
at, pre-network firewall.
2204.94 -> Really we're looking at the NACLs
2206.65 -> and security groups as the
primary filtering for traffic.
2210.61 -> It's all IP based, right?
2212.26 -> What's the IP?
2213.28 -> What's the port?
2214.113 -> What's the protocol?
2215.47 -> And then it's allowed through.
2217.75 -> We didn't have any way
to actually implement
2220.99 -> filtering based on HTTP methods,
2224.56 -> user agents, domains, anything like that.
2228.55 -> Log4j happened,
2230.11 -> which was the impetus to
get us into network firewall.
2237.88 -> As we started through the journey,
2239.14 -> basically, what we were looking at
2240.58 -> is we had to define
the policy requirements
2243.07 -> and we broke it down into
define the policy scope,
2246.61 -> have minimal impact and minimal outages,
2249.19 -> and enable the developers.
2251.2 -> So Aaron touched on this,
2252.31 -> but, basically, the defining the scope is,
2254.59 -> are we going to do inbound filtering only,
2256.6 -> outbound filtering?
2257.89 -> Do we care about the east to west?
2259.72 -> Do we care about internal traffic
2262.99 -> or what's the decision?
2265.18 -> Also, how granular do we wanna get?
2267.82 -> Is just IP good enough
2269.47 -> or do we actually want those domains?
2271.54 -> And all of that sort of gone into
2273.82 -> the scoping of the policy set.
2276.34 -> We wanted to have minimal
impact and minimal outages.
2279.52 -> As Aaron said, we were
six years into this.
2282.07 -> So just dropping a firewall in
2284.05 -> and putting default deny in place
2285.67 -> really wasn't the best idea.
2288.19 -> 'Cause if we broke a lot of stuff,
2289.51 -> the first thing to go would be firewall.
2293.02 -> As part of that, we also wanted to have
2294.46 -> a defined rollback strategy.
2296.86 -> As we were chunking through
each of those phases
2299.62 -> that Aaron was describing,
2301.51 -> we actually had a very well
defined rollback strategy
2305.11 -> for each and every one of those.
2306.91 -> So in case the change created an issue,
2310.9 -> we would be able to roll back
2312.55 -> while we identify what that
issue was and work to fix it.
2316.78 -> Lastly, on the requirement
was enabling the developers.
2320.86 -> We need the developers just
as much as they need us.
2325.06 -> We have to provide them the logs.
2326.65 -> We have to teach them how to use the logs,
2329.38 -> and then also bring them
along on the journey.
2332.02 -> Keep them updated where we...
2333.43 -> What's our progress?
2334.84 -> Where are we in the path?
2336.34 -> What's our intention?
2337.6 -> What's the end goal?
2341.83 -> So very similar to the phases
that Aaron was talking about,
2345.31 -> we also created three policy phases.
2347.77 -> So we created the alert policy default,
2351.19 -> then the alert policy strict,
2353.77 -> and then eventually default deny.
2356.65 -> And when I say alert
policy default and strict,
2359.92 -> those are the actual firewall modes.
2361.69 -> So the firewall has default mode
2364.09 -> in which it basically does
rule processing in groups
2367.03 -> based on the action.
2369.1 -> So it will evaluate your
passes first, then your denies,
2372.85 -> and then your alerts.
2375.04 -> In strict mode, it basically
will evaluate in strict order.
2378.55 -> So the way that the policy set is written
2381.79 -> is exactly how it's being processed.
2384.43 -> So rule group one.
2385.72 -> Rule one, two, three, that gets hit first.
2387.97 -> Then rule group two.
2389.14 -> One, two, three gets hit second,
2390.64 -> and so on and so forth.
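In the firewall policy resource, that mode is a single setting; a sketch with illustrative names, where strict order also requires an explicit priority per rule group:

AWSTemplateFormatVersion: "2010-09-09"
Parameters:
  AllowListRuleGroupArn: { Type: String }
  BlockListRuleGroupArn: { Type: String }
Resources:
  EgressPolicy:
    Type: AWS::NetworkFirewall::FirewallPolicy
    Properties:
      FirewallPolicyName: egress-inspection
      FirewallPolicy:
        StatelessDefaultActions: ["aws:forward_to_sfe"]
        StatelessFragmentDefaultActions: ["aws:forward_to_sfe"]
        StatefulEngineOptions:
          RuleOrder: STRICT_ORDER   # DEFAULT_ACTION_ORDER groups by pass/drop/alert instead
        StatefulRuleGroupReferences:
          - ResourceArn: !Ref AllowListRuleGroupArn
            Priority: 100           # evaluated first
          - ResourceArn: !Ref BlockListRuleGroupArn
            Priority: 200           # evaluated second
        StatefulDefaultActions: ["aws:alert_established"]   # strict order only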
2395.47 -> So with this one, alerting default,
2399.85 -> this was the one that was
pushed to all the default
2402.04 -> or the distributed policies.
2405.25 -> We initially set it up
in the default rule order
2407.77 -> 'cause that was more simplistic.
2409.99 -> We had the same firewall being
deployed via CloudFormation
2413.68 -> and the RAM to share the one set of rules
2417.16 -> with all of the firewalls.
2420.46 -> What we did also is we
actually, at this point in time,
2424.3 -> we put the firewall logs to CloudWatch,
2426.64 -> and we set a retention on the log group.
2430.78 -> We created a subsequent alarm to check
2434.44 -> if a block list rule was hit.
2437.08 -> We would send an SNS notification over to our IR team
2440.26 -> that generated a ticket in their queue,
2442.54 -> and we were able to
actually block and respond
2446.35 -> to the Log4j as part of
the initial policy set.
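A sketch of that first-phase logging and alerting wiring (names are illustrative, and the filter pattern on the alert log's action field is an assumption about the log shape):

AWSTemplateFormatVersion: "2010-09-09"
Parameters:
  FirewallArn: { Type: String }
  IrTeamSnsTopicArn: { Type: String }
Resources:
  AlertLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: /network-firewall/alert
      RetentionInDays: 90
  FirewallLogging:
    Type: AWS::NetworkFirewall::LoggingConfiguration
    Properties:
      FirewallArn: !Ref FirewallArn
      LoggingConfiguration:
        LogDestinationConfigs:
          - LogType: ALERT
            LogDestinationType: CloudWatchLogs
            LogDestination:
              logGroup: !Ref AlertLogGroup
  # Count alert events whose action is "blocked" (i.e. a block list rule was hit).
  BlockListHitFilter:
    Type: AWS::Logs::MetricFilter
    Properties:
      LogGroupName: !Ref AlertLogGroup
      FilterPattern: '{ $.event.alert.action = "blocked" }'
      MetricTransformations:
        - MetricNamespace: NetworkFirewall
          MetricName: BlockListHits
          MetricValue: "1"
  # The alarm feeds SNS, which generated a ticket in the IR team's queue.
  BlockListHitAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      Namespace: NetworkFirewall
      MetricName: BlockListHits
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 1
      Threshold: 1
      ComparisonOperator: GreaterThanOrEqualToThreshold
      TreatMissingData: notBreaching
      AlarmActions:
        - !Ref IrTeamSnsTopicArn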
2449.86 -> It's a little misleading
to say alerting default
2452.56 -> because we actually did have
2453.91 -> a handful of block rules in place.
2456.04 -> But for the most part,
2457.09 -> it was built to be an
alert in our policy set.
2461.26 -> Like Aaron mentioned,
2462.16 -> this was cut over to the
centralized firewalls
2465.37 -> so we could keep consistency of the policy
2468.34 -> while the entire network plumbing
2470.5 -> was actually being converted.
2474.73 -> This is what the rule
set actually looked like.
2477.43 -> Very, very simplistic, right?
2480.16 -> The stateless default action
2482.2 -> is to forward to the stateful rule engine.
2487.45 -> Then we use the AWS managed
rule sets in alert mode.
2490.99 -> We had an allow list and that was sort of
2492.64 -> just to clean up the logs a little bit.
2495.07 -> We had a block list,
2496.39 -> and then there actually
isn't a default action
2499.51 -> in default mode.
2502.42 -> And really this is all it looks like.
2504.07 -> This is the sort of the default
actions of the firewall.
2507.25 -> What you basically see in stateless
2508.63 -> is just gonna forward
it down to the stateful,
2510.678 -> and then the stateful is
just rule order default.
2514.69 -> Below this in the console
is your policy set.
2520.15 -> So moving through the journey, right?
2522.82 -> We've come into a centralized model.
2525.07 -> Now we're rebuilding the policy set
2527.65 -> to what we want it to be.
2530.59 -> So
2532.42 -> we moved this over to the
centralized firewalls.
2535.84 -> As part of that,
2538.12 -> we had to remove this Suricata priority.
2542.71 -> And what that basically means is that,
2545.77 -> under the hood, the firewall is
using Suricata IPS/IDS,
2550.482 -> and Suricata is an open source IPS.
2555.07 -> So all of the rules are in Suricata.
2557.5 -> One of the keywords is a priority.
2559.96 -> In default mode, the priority will set
2562.9 -> the priority of the rule within the group.
2565.6 -> So even though it's a pass,
2567.52 -> you can have different rules within pass
2569.98 -> hitting at different times.
2571.6 -> But because we're in strict ordering now,
2573.67 -> that's no longer needed,
2575.41 -> and it actually throws a policy error
2578.29 -> if you try to push the rules with it.
2581.44 -> We updated the policy set.
2583.87 -> Every firewall, again,
had the same policy set.
2587.14 -> But one of the key changes in
centralized versus distributed
2590.56 -> is we actually consolidated
the logging to S3.
2594.07 -> And then from there, we
used event notifications
2596.65 -> to push it to our SIEM,
2598.9 -> and then start to leverage our SIEM
2600.61 -> for the continuous monitoring
2602.44 -> and take out that CloudWatch alarm.
2605.68 -> And that really was to synthesize
2609.58 -> the alerting and logging workflows
2612.34 -> with what we were already using
2614.29 -> and what the IR team
had already built out.
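Switching destinations is just a different log destination config; a sketch (bucket and prefix illustrative), with the bucket's s3:ObjectCreated event notifications then feeding the SIEM:

AWSTemplateFormatVersion: "2010-09-09"
Parameters:
  FirewallArn: { Type: String }
  LogBucketName: { Type: String }
Resources:
  FirewallLoggingToS3:
    Type: AWS::NetworkFirewall::LoggingConfiguration
    Properties:
      FirewallArn: !Ref FirewallArn
      LoggingConfiguration:
        LogDestinationConfigs:
          - LogType: ALERT
            LogDestinationType: S3
            LogDestination:
              bucketName: !Ref LogBucketName
              prefix: network-firewall/alert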
2619.6 -> The one thing to mention on that one
2621.22 -> is we also did provide the
developers access to those logs.
2625.39 -> We provided them very granular access.
2629.02 -> So this is what the new policy looks like.
2631.93 -> Within the stateless rules,
2633.1 -> we actually did add a stateless rule set
2636.73 -> to allow trusted IPs to
bypass the inspection engine.
2641.65 -> And when I say trusted,
I mean truly trusted IPs.
2646.54 -> Below that, we're gonna forward everything
2648.85 -> to the rule groups,
2651.1 -> to the stateful rule groups.
2652.84 -> So we have our allow list.
2654.4 -> We have our block list.
2656.2 -> Now we have the AWS managed block list.
2659.65 -> Below that is the filtered domains.
2661.75 -> So the idea here is that
2664.96 -> we really want to look at...
2668.02 -> Maybe this is traffic that
looks a little suspicious
2671.26 -> or it's traffic that
we've not fully vetted.
2674.35 -> So if we drop 'em below the block list
2676.42 -> and the managed block list from AWS,
2679.27 -> if the domain, or if the IP,
2681.67 -> or whatever else rules are there,
2685.99 -> show up in either of the two lists above,
2688.63 -> they will actually still get
blocked by the rules above
2691.93 -> and not be allowed through.
2695.23 -> If they don't show up in those,
2696.58 -> then they'll be allowed through.
2698.59 -> It's sort of a proving ground in a sense.
2701.14 -> Below that is the
developer requested rules.
2703.09 -> We'll talk about the developer
workflow in a little bit.
2706.21 -> And then we have the
stateful default action,
2708.94 -> like Aaron mentioned,
of alert established.
2713.62 -> So here again, stateless is the same.
2715.99 -> Rule order is now set from default to strict,
2718.39 -> and now we have a default
action of alert established.
2724.63 -> On the default deny,
2728.02 -> a lot of work went into
2730.81 -> moving from the alert
strict to default deny
2734.89 -> was mostly in the background.
2737.11 -> To be 100% honest,
2738.52 -> the change was super easy
2740.59 -> because all we had to do was
update the CloudFormation
2744.19 -> to add the action of drop established
2746.8 -> and do a CloudFormation stack update.
2748.39 -> That's literally it.
2749.223 -> It was three minutes and you're done.
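That change amounts to one extra entry in the policy's stateful default actions; a sketch of the after state (names illustrative):

AWSTemplateFormatVersion: "2010-09-09"
Resources:
  EgressPolicy:
    Type: AWS::NetworkFirewall::FirewallPolicy
    Properties:
      FirewallPolicyName: egress-inspection
      FirewallPolicy:
        StatelessDefaultActions: ["aws:forward_to_sfe"]
        StatelessFragmentDefaultActions: ["aws:forward_to_sfe"]
        StatefulEngineOptions:
          RuleOrder: STRICT_ORDER
        StatefulDefaultActions:
          - aws:alert_established
          - aws:drop_established   # the addition that turns on default deny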
2752.83 -> All of the effort was prior
to getting to that point.
2758.32 -> What we did is we worked
with the development teams
2763.06 -> very, very, very closely.
2765.52 -> We implemented a
developer request process.
2768.52 -> We added a hundred developer
rules to the rule set,
2773.11 -> and this was purely based
on analysis of traffic,
2776.44 -> evaluation of traffic,
2777.76 -> and discussion with the development teams
2779.77 -> to say, "Do you actually need
to go to this health.gov site
2784.84 -> or do you need to do this?"
2786.07 -> And then we would have that discussion.
2788.41 -> We blasted them with
notification and messaging.
2792.13 -> We didn't bury them to the point where
2794.89 -> they just drowned us out.
2796.72 -> But it was...
2798.91 -> We want it to be very explicit
2800.23 -> as to when what change was happening
2802 -> so they could expect and plan for it.
2804.49 -> We held daily office hour calls.
2807.07 -> We do have...
2809.47 -> We do have offices in India.
2811.33 -> So the office hours would alternate days
2813.82 -> so we could accommodate all
of our development groups.
2818.05 -> And then again, we just
modified the firewall.
2822.16 -> When we did it, we did it
from a least used, least risk
2825.58 -> to most used, most risk method.
2828.94 -> Meaning we decided which of the firewalls
2831.67 -> was the least used.
2832.75 -> And if it went down because
of a poor policy change,
2835.81 -> that was okay or that
was the most acceptable.
2838.9 -> We started there for
the initial deployment.
2841.18 -> And as we worked through
it, we moved to the final,
2844.15 -> which was our main production site.
2849.64 -> Again, the rules did
not change whatsoever.
2852.79 -> While the rule groups did not change,
2854.53 -> the rules just got more defined
and more applicable to us.
2859.33 -> Here you can see the default action
2861.43 -> now includes drop established.
2865.21 -> So where are we now?
2866.53 -> Default deny is in place across everything
2868.66 -> in our AWS ecosystem.
2872.834 -> The whole change, including
Aaron's and the policy work,
2875.86 -> resulted in less than 10 medium issues.
2879.76 -> So it was not a very impactful change.
2884.86 -> We have, to date, over
120 developer requests
2890.59 -> through our process.
2893.35 -> Now we're really focusing
on operational efficiencies.
2895.96 -> Literally a couple days ago,
VPC prefix lists were announced,
2899.83 -> which basically allows us to group CIDRs,
2902.98 -> which I think might make our policies
2904.72 -> a little more flexible.
2906.37 -> We're also looking at
refining policy sets.
2908.95 -> So as the business grows
and new workflows come in,
2912.37 -> we're able to look at
those more holistically.
2916.81 -> So real quick, I just wanna
2917.89 -> jump into the developer workflow.
2919.96 -> So basically, what we
wanted to do was make it
2922.3 -> super, super easy for developers
2924.43 -> to have a say in what traffic
is allowed in their account,
2929.17 -> granted it falls below our block list.
2933.1 -> So we do have a say.
2935.35 -> The workflow also has gating
and guardrails encoded into it
2939.4 -> so developer can't just say,
2941.027 -> "Hey, I want my account to go
2943.18 -> all the way to full internet.
2945.19 -> Give me 000 outbound."
2948.25 -> And we just say, "Yeah, sure, whatever."
2950.74 -> We actually have a lot of guardrails
2952.09 -> that we built into the process
to alleviate some of that.
2955.6 -> It's based in Git,
2956.53 -> so signoffs, and auditing, and governance
2958.75 -> are all a part of it.
2960.01 -> It's built through a pipeline
so it's quick and efficient.
2963.7 -> And the approval process,
2966.22 -> we do approve it during business hours.
2968.5 -> If there's an incident, so
if there's an actual outage,
2971.731 -> the NOC can approve in our place,
2973.517 -> and then we have a process where
we're alerted and notified.
2977.11 -> So when the security team
comes in the next day,
2979.63 -> they can perform that assessment.
2982.6 -> So on the predefined rule variable,
2985.06 -> the policy or the rule group
2987.4 -> has what's called a rule variable,
2989.74 -> and it's basically just IP sets.
2991.54 -> We, the security, define
those and manage those
2993.91 -> as new accounts are created.
2996.7 -> The JSON that you see
under developer submission
3000.36 -> is literally the only
thing they have to submit
3003.51 -> as part of their PR.
3006.66 -> So when they create their pull request,
3008.07 -> they have to tell us the account ID
3010.83 -> and what domain, port, and protocol,
3013.74 -> domain/IP, port, protocol that they want.
3018.84 -> We review that.
3020.16 -> We merge it.
3021.84 -> And out pops our Suricata rules.
3025.02 -> So here, what we're doing
3026.37 -> is we're basically using their input
3029.64 -> to create the Suricata rules that we then
3032.76 -> deploy into that specific
developer rule group.
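The session doesn't show the actual submission schema, so everything below is a hypothetical sketch of the idea: a small JSON request is transformed into a Suricata pass rule scoped to the requesting account's address space through a predefined rule variable.

AWSTemplateFormatVersion: "2010-09-09"
Resources:
  DeveloperRequestedRules:
    Type: AWS::NetworkFirewall::RuleGroup
    Properties:
      RuleGroupName: developer-requested
      Type: STATEFUL
      Capacity: 500
      RuleGroup:
        StatefulRuleOptions:
          RuleOrder: STRICT_ORDER
        RuleVariables:
          IPSets:
            ACCOUNT_NET:             # predefined per account by the security team
              Definition:
                - 10.42.0.0/19       # the requesting account's /19 (illustrative)
        RulesSource:
          # Generated from a submission shaped something like (hypothetical):
          #   {"account": "123456789012", "domain": "api.example.com",
          #    "port": 443, "protocol": "tls"}
          RulesString: |
            pass tls $ACCOUNT_NET any -> $EXTERNAL_NET 443 (tls.sni; content:"api.example.com"; startswith; endswith; nocase; msg:"developer request 123456789012"; sid:2000001; rev:1;)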
3038.55 -> I could go deep into this
because it's actually super...
3043.41 -> In my opinion, it's super well-designed.
3044.76 -> There's a lot of catches that
we wanted to account for.
3048.99 -> But the end result is this
is what gets transformed
3052.35 -> in the firewall and allows
their traffic to get through.
3057.33 -> So policy challenges.
3061.35 -> We did a lot of work,
3062.7 -> a lot of movement throughout the journey.
3068.16 -> What we found is that
default ordering is nice,
3071.43 -> but it's limiting in advanced use cases.
3075.06 -> So we were really trying to get
3076.65 -> that alert kind of policy set to say,
3079.86 -> what are we seeing?
3081.66 -> We couldn't really get it easily.
3085.35 -> We were seeing a lot
of the denies happening
3088.5 -> before the alerts would,
so we were actually blocking more traffic
3091.98 -> than we had wanted to
3094.05 -> without putting in pass rules above it.
3096.81 -> Domain list is a really cool
feature of network firewall.
3100.05 -> The only gotcha is that they
put a deny any at the bottom
3104.07 -> so it's literally a white list.
3105.66 -> You put in your domains, you hit save,
3107.97 -> it lists out the domains in Suricata form,
3110.73 -> and then it'll put a
default deny at the bottom.
3114.69 -> So then that basically made us convert
3116.79 -> everything to Suricata,
3118.35 -> and we only use Suricata at this point.
3120.87 -> If you want an IP, a 5-tuple,
so IP port protocol rule,
3125.25 -> that's Suricata.
3126.39 -> Everything now is Suricata.
3129.36 -> We didn't change anything else.
3132.9 -> Home Net is not RFC1918.
3136.14 -> This one, in my opinion, is pretty funny.
3138.48 -> It's an oversight on our
part, but it's a good gotcha.
3141.99 -> So when we were deployed
into the distributed model,
3145.02 -> the firewall actually sat in the VPC.
3148.83 -> So
3150.42 -> it took the CIDR of the VPC
3153.32 -> so it was working fine.
3155.67 -> Because the default value of Home Net
is the CIDRs, or the VPC CIDR,
3160.41 -> when we moved to a consolidated model,
3163.65 -> the only traffic that initially passed
3166.05 -> was the traffic that originated
3168.09 -> within the VPC of the firewall.
3171.51 -> All the other traffic was not
being passed appropriately
3174.54 -> so we're like, "What's going on?"
3176.16 -> And then we did a little light reading,
3179.04 -> and we actually realized our mistake.
3181.5 -> So then we updated CloudFormation
3183.63 -> and put Home Net as a defined CIDR
3186.93 -> rule variable into
all of the policy sets,
3189.66 -> or into all of the rule groups.
3191.58 -> And it fixed our issues
almost instantaneously.
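The fix is the HOME_NET rule variable on the rule groups; a sketch, where the /8 stands in for "the whole internal network" rather than athenahealth's actual space:

AWSTemplateFormatVersion: "2010-09-09"
Resources:
  ExampleRuleGroup:
    Type: AWS::NetworkFirewall::RuleGroup
    Properties:
      RuleGroupName: example-home-net
      Type: STATEFUL
      Capacity: 100
      RuleGroup:
        RuleVariables:
          IPSets:
            HOME_NET:
              Definition:
                - 10.0.0.0/8   # all internal CIDRs, not just the inspection VPC's
        RulesSource:
          RulesString: |
            alert tcp $HOME_NET any -> $EXTERNAL_NET any (msg:"example"; sid:100; rev:1;)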
3196.56 -> Capacity limits is another one.
3198.39 -> So in the network firewall policy,
3202.65 -> you have what's called a capacity limit,
3204.51 -> and this is a hard set limit of 30,000.
3208.56 -> I equate 'em to rules,
3210 -> but it's basically 30,000
rules for your entire firewall.
3214.02 -> And you allocate these to each rule group.
3218.49 -> So if you just sort of
haphazardly assign them
3223.23 -> and you don't think about it,
3224.91 -> you end up either having to
3226.08 -> blow away the rule group, recreating it,
3228.72 -> or you can sort of max out.
3230.79 -> And when you get to that max
out page or the max out time,
3234.09 -> or, sorry, max out capacity,
3237.18 -> you're sort of in a conundrum.
3239.85 -> So what we typically will do is we'll
3241.86 -> look at what the policy or
the rule set's gonna be,
3244.68 -> and then estimate what we
would see the top end being,
3248.94 -> padded another five to 10%,
3251.34 -> and then that's sort of the
capacity for that rule group.
3256.65 -> On the other end, we always
leave some unallocated
3260.55 -> so that way we're never
consuming the full 30,000.
3265.32 -> We're always leaving float.
3265.32 -> So that way if we need to add
more down the road, we can.
3267.9 -> If we need to add a new rule group
3269.79 -> only to transfer the policy set around,
3273.57 -> we have that ability and that flexibility.
3276.18 -> Versus if we're stuck at that top 30,000,
3278.73 -> you sort of lose that flexibility.
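Capacity is set per rule group at creation time and can't be changed afterward, which is why the padding and the unallocated float matter; a sketch with illustrative numbers:

AWSTemplateFormatVersion: "2010-09-09"
Resources:
  AllowListRuleGroup:
    Type: AWS::NetworkFirewall::RuleGroup
    Properties:
      RuleGroupName: allow-list
      Type: STATEFUL
      # Immutable after creation: estimated top end plus 5-10%, while keeping the
      # total across all rule groups comfortably under the firewall's 30,000 cap.
      Capacity: 2200
      RuleGroup:
        RulesSource:
          RulesString: |
            pass tls $HOME_NET any -> $EXTERNAL_NET 443 (tls.sni; content:"example.com"; startswith; endswith; nocase; msg:"allow example.com"; sid:1; rev:1;)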
3283.32 -> So the next one, the network layer
3285.15 -> versus application layer rule processing.
3288.06 -> This one is fickle.
3290.43 -> So when you're doing your rule set,
3292.47 -> if you have allow this
source to this destination
3298.209 -> for all TCP traffic,
3300.12 -> you're never actually
going to be able to do
3302.16 -> the analysis at the app level
3304.5 -> because it's just gonna see the TCP
3306.78 -> and let it through.
3308.16 -> So it's never going to sort of
filter down to the app layer
3312 -> and get to looking at the
domains or the methods.
3317.22 -> Any of the more IPS kind of rules
3321.18 -> typically won't get
processed if you're doing,
3323.61 -> if you're passing it more at the L3/L4 level.
3326.52 -> So just be cognizant of your rules.
3330.78 -> Because the firewall does not
3334.717 -> do TLS decryption,
3337.23 -> it leverages the TLS server
name indication extension,
3342.63 -> and...
3345.03 -> Almost everything supports it.
3346.47 -> However, less than 1% of our traffic
3349.11 -> does not support it in one way or another,
3351.72 -> or it hasn't throughout
the course of our journey.
3355.14 -> And most of the time,
3357.03 -> this is resolved by
updating the libraries.
3359.49 -> Sometimes we were finding
teams that were using
3362.04 -> really old code bases
3364.47 -> that they just needed to update,
3367.41 -> sometimes forcing it to use 1.2,
3370.05 -> even though their code is using 1.2,
3373.29 -> for some reason forcing
it to use it fixed it,
3377.55 -> or adding a parameter
as part of the web call.
3380.4 -> So when you're doing the HTTP request,
3384.54 -> actually putting in the parameter
3386.61 -> for the tls.sni and
defining the server name
3391.68 -> that you wanna go to
3394.145 -> will have the firewall recognize it.
3396.18 -> Because if your client request
doesn't include the SNI data,
3400.17 -> then the firewall will never see it,
3402.48 -> and it'll never match it
against a HTTPS TLS rule.
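Two Suricata rules can illustrate both gotchas at once (a sketch; the domain is illustrative): a broad L3/L4 pass short-circuits application-layer inspection, and an SNI match only works when the client actually sends SNI.

AWSTemplateFormatVersion: "2010-09-09"
Resources:
  SniExampleRuleGroup:
    Type: AWS::NetworkFirewall::RuleGroup
    Properties:
      RuleGroupName: sni-example
      Type: STATEFUL
      Capacity: 10
      RuleGroup:
        RulesSource:
          RulesString: |
            # Too broad: this passes the flow at L4, so tls.sni rules below it
            # would never get to evaluate the application layer:
            # pass tcp $HOME_NET any -> $EXTERNAL_NET 443 (flow:to_server; sid:1; rev:1;)
            # SNI-based allow; it can only match if the client sends SNI:
            pass tls $HOME_NET any -> $EXTERNAL_NET 443 (tls.sni; content:"api.example.com"; startswith; endswith; nocase; msg:"allow api.example.com"; sid:2; rev:1;)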
3408.33 -> And the last one is not
necessarily a challenge,
3411.09 -> but it's just sort of a caveat, right?
3412.77 -> So debugging on a prod firewall,
3416.22 -> we ran into this,
3417.63 -> where teams would say, "Hey,
it's working fine in non-prod,
3422.37 -> but it's not in prod."
3424.68 -> So how do we sort of handle
this or how did we handle this?
3428.88 -> We don't wanna just go
into the prod firewall
3430.59 -> and start typing around and
making some rule changes
3433.23 -> and see if it fixes it, right?
3436.38 -> Also, the firewall is priced on
3439.65 -> a fixed monthly cost plus additional usage.
3442.59 -> So ballpark figure, it's like $36,000
3445.71 -> just to have this test
firewall floating around
3448.32 -> that you use once or twice a month.
3451.02 -> So that might not be a
cost-viable option either.
3454.89 -> And then also, some things
can't be replicated, right?
3458.7 -> For some unknown reason, some
developer patched a system
3464.04 -> and didn't tell anyone.
3465.45 -> There's just oddities that can come up.
3467.88 -> We don't like them.
3468.713 -> We don't want them to happen,
3470.07 -> but sometimes these really
can't be replicated.
3475.14 -> So what we end up doing is
we actually put one rule.
3478.2 -> We basically have a top level rule group
3480.3 -> that we use with a /32.
3483.96 -> We put it through the
code, through our SCM.
3488.61 -> And that way, what
we're able to do is say,
3490.867 -> "Okay, what's the broken system?"
3492.69 -> We put that IP as a /32 in the rule,
3495.93 -> and then we have the flexibility to change
3498.15 -> just that specific rule
for that specific host.
3501.54 -> And because it's all done through code
3503.67 -> and through our pipeline,
it's all auditable.
3506.58 -> It's all maintainable,
3507.99 -> and it's a way for us
to easily perform that,
3513.81 -> that one-off testing when needed.
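A sketch of that top-priority debug rule group (host IP and names illustrative); because it ships through the same pipeline, every temporary /32 change stays reviewed and auditable:

AWSTemplateFormatVersion: "2010-09-09"
Resources:
  DebugOverridesRuleGroup:
    Type: AWS::NetworkFirewall::RuleGroup
    Properties:
      RuleGroupName: debug-overrides
      Type: STATEFUL
      Capacity: 10
      RuleGroup:
        StatefulRuleOptions:
          RuleOrder: STRICT_ORDER
        RulesSource:
          RulesString: |
            # Temporarily open egress for one broken host while debugging in prod.
            pass ip 10.42.1.57/32 any -> $EXTERNAL_NET any (msg:"temporary debug rule for a single host"; sid:9000001; rev:1;)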
3519.78 -> So that's the end of our presentation.
3521.88 -> So these are the links
if you wanna learn more.
3525.36 -> Thank you for coming and
have a great evening.
Source: https://www.youtube.com/watch?v=VMVeTvX4OLw