AWS re:Invent 2022 - Trading up: Fidelity Investments takes trading to the cloud (FSI317)
The mainframe has been a cornerstone of the technical capabilities of Fidelity Investments for years. Recently, Fidelity instituted a strategic program to modernize its core brokerage platform and recast the mainframe’s capabilities into a cloud-based platform using elastic scalability and cost structures and more modern technologies while shedding technical debt. Join this session to learn more about Fidelity’s migration journey, which focuses on not only Fidelity’s technology stack (including Amazon EKS, Amazon MSK, Amazon DynamoDB, and AWS Lambda) but also its corporate culture and lessons learned from building one of the largest trading platforms on AWS.
ABOUT AWS Amazon Web Services (AWS) hosts events, both online and in-person, bringing the cloud computing community together to connect, collaborate, and learn from AWS experts.
AWS is the world’s most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster.
Content
0.12 -> - Hello everybody.
2.25 -> Thank you for coming here today.
3.6 -> I know that it's 5:30,
5.4 -> the first day of re:Invent.
6.54 -> All your colleagues are probably
at happy hour right now,
8.55 -> so we appreciate you being
here with us instead.
11.25 -> This is FSI317,
13.02 -> Trading up: Fidelity Investments
takes trading to the cloud.
16.86 -> My name is Jeremiah O'Connor.
18.18 -> I'm a Principal Solutions
Architect for AWS.
20.94 -> I've been working with the
Fidelity folks on stage
22.74 -> for about the last four years now,
24.33 -> helping them in their AWS journey.
26.82 -> So I'm gonna be joined
today by Louis Mancini,
29.46 -> who is Head of Equity & Options Trading Technology
32.13 -> for Fidelity Investments.
33.84 -> I'm also joined on stage
with Amr Abdelhalem,
36.81 -> who is the Head of Cloud Platforms
38.94 -> at Fidelity Investments.
42.06 -> So, we've got a very
useful agenda here today.
44.46 -> So, the first topic we're gonna get into
46.86 -> is we're gonna talk about
48.06 -> Fidelity's order management system.
50.46 -> Then, we're gonna segue into
the roadmap and challenges
53.64 -> that they had migrating
this order management system
56.22 -> to AWS, and really getting
this very latency-sensitive
61.32 -> application running on AWS.
63.45 -> Then, we're gonna dive into
the actual architecture itself,
65.73 -> so we're gonna dive into
what AWS components comprise
68.64 -> this architecture for this application.
70.92 -> Then, we're gonna kick it over to Amr,
72.12 -> who's gonna talk a little bit
about platform resiliency,
74.82 -> so the stuff under the hood
75.93 -> that this application actually runs on.
78.09 -> And then, finally, we're
gonna wrap it all up
79.71 -> with sort of an overview of
Fidelity's cloud platforms
82.335 -> and how they work today.
84.36 -> So with that, I'll pass it over to Louis,
86.07 -> who's gonna talk a little bit about
87.21 -> the order management system.
89.58 -> - Thanks Jeremiah.
94.05 -> So, my name's Louis Mancini.
95.561 -> I run Equity & Options Trading Technology
98.46 -> at Fidelity Investments,
99.87 -> and the last few years,
101.072 -> we have begun the modernization
103.02 -> of our trading stack at Fidelity.
105.3 -> The trading landscape
in the last few years
107.67 -> has heavily changed, as I'm
sure many of you are aware.
110.88 -> We've gone from building out older systems
116.189 -> to having a need to be able to
scale to a much larger extent
120.87 -> and be able to process much larger volumes
124.11 -> of data in our systems
to be able to handle
126.39 -> the ever increasing amount of volume
128.22 -> that's coming in from
both our retail traders
130.32 -> and our institutional traders.
132.54 -> This project began in 2019,
134.88 -> and today we have it running in production
137.46 -> and I'm gonna take you through
138.81 -> how we went about that journey,
140.43 -> what the reasons were
that we built that system,
143.58 -> and what we plan on doing
145.92 -> in the future
147.3 -> to get to where we are.
149.73 -> So, let's talk about the reasons
150.93 -> that we had to come up with to build this.
153.27 -> We started seeing capacity constraints.
154.89 -> As we said, up until about 2019,
156.901 -> volumes were relatively consistent.
159.36 -> We could kind of know
about the max volumes
161.88 -> that we were gonna have,
based upon our number
163.77 -> of clients and the amount of accounts
165.09 -> that we were seeing.
167.07 -> Into 2019, things drastically changed,
169.44 -> which I'm gonna show
in a couple of slides.
171.15 -> There were also some other reasons though,
172.71 -> that we needed to actually
go down this path, right?
174.99 -> We needed to modernize our technology.
176.64 -> Our technology was starting
to get a little old.
178.5 -> We've been building
on top of mainframes
180.3 -> and on top of x86 server architectures
182.198 -> that have been in existence since the 90s.
184.92 -> We also realized that we needed to have
188.67 -> a large amount of savings
in terms of our costs, right?
192.33 -> When we look at the cloud,
193.59 -> we believe that we could
have some significant savings
195.69 -> in both our hardware and licensing fees,
197.944 -> real estate costs, and the need
200.04 -> to staff our data centers.
201.99 -> Another key component that we realized
203.73 -> as we started to begin this
journey was speed to market.
206.28 -> Speed to market was
important in our ability
208.08 -> to change things, and getting
new products to market quickly
213.81 -> was becoming
ever more important,
216.45 -> especially as people were trading more
218.85 -> and we started to see much larger volumes.
222.66 -> So, let's talk about the
impetus for this, right?
224.79 -> Back in 2019, this is a graph,
226.977 -> and this is one of my favorite graphs,
228.48 -> probably of my career.
229.8 -> You can see that this
is the trading volumes
231.84 -> that we saw at Fidelity Investments.
233.49 -> In 2019, we had our historic high.
235.83 -> And you could see from the
little part on the graph
238.17 -> what we saw up until then.
240.113 -> And at that time, we thought
that was a very large number,
243.15 -> but then three things drastically changed
245.73 -> in the trading space.
247.77 -> We had fractional and notional trading
249.54 -> that came about, in terms
of being able to trade
252.12 -> portions of shares, which
ended up allowing people
254.76 -> to trade much more frequently.
257.37 -> We had zero commissions,
259.17 -> which also changed the
trading landscape, right?
261.42 -> You saw that all of a sudden
you could trade for $0,
264.27 -> and cost no longer was
a barrier to trading,
266.43 -> for you to be able to
execute and route an order.
268.38 -> So instead of maybe trading
one large lot of a 100,
270.69 -> you might have traded
three lots of 33, 33, 34.
275.61 -> And then, the last piece, of course,
276.96 -> was the meme stocks, right?
278.94 -> The meme stocks hit in 2021.
281.16 -> So to give an example,
282.15 -> which you could see from
our historic peak in 2019,
285.249 -> we are doing 4X our historic 2019 peak
289.095 -> just in our average daily volumes today,
292.71 -> and we are doing well over five to 6X
297.81 -> in our peak that we
saw in the meme stocks,
300.06 -> it was actually over
500% of our original peak.
302.85 -> And as most of you that
are involved in trading would know,
304.56 -> that's actually not even fully
306.51 -> the whole story, right?
307.71 -> What you really see is that
in the first half hour,
310.32 -> you see the vast majority of trades
312.57 -> versus the actual entire day.
315.24 -> So, if you were to look at this,
316.135 -> our ability to handle that volume
318.54 -> needed to increase very significantly,
321.27 -> and that's how this came about.
322.53 -> We had taken this system live
324.12 -> that we're gonna talk about
324.953 -> right before that meme stock crisis.
329.295 -> So, let's talk about how we actually went
331.26 -> about building these new systems.
333.66 -> The first thing we had to do,
334.65 -> is we had to build out a development team
336.33 -> capable of operating in the cloud, right?
339.36 -> That's a very different paradigm
341.28 -> than actually building out
systems today running on-prem.
344.1 -> We had to take our developers
345.87 -> that were very familiar
with trading systems
347.85 -> and retrain them on how
to operate in a cloud
351.09 -> to be prepared for anything and
everything to possibly fail.
354.48 -> To understand availability
zones, and regions,
357.6 -> and make sure that we could
transmit orders in a Tier Zero
361.11 -> system at any time under
any failure scenario.
364.17 -> So what that means in Fidelity terms,
366.21 -> is that even if we were to
lose an availability zone
368.391 -> or were to lose a region,
370.17 -> we can seamlessly work on orders
373.38 -> that were submitted in prior regions
375.09 -> or new orders without a customer
377.271 -> knowing about the outage.
379.98 -> Some of the key things that we had to do
381.66 -> is we started shortening release cycles,
384.36 -> we did smaller builds,
386.28 -> we did CI/CD pipelines, and
we did the standard stuff.
389.4 -> But some of the non-standard
stuff that we had to do
391.59 -> that's special to trading
392.79 -> is we had to build a custom chaos
396.63 -> and performance testing framework
398.55 -> that allowed us to actually build out
400.8 -> and simulate market-on-open loads,
404.151 -> 10X market loads from the original peak
407.04 -> that you would see over there,
408.45 -> and be able to simulate failures
410.37 -> during any and all of these times
412.38 -> in real time, in an environment
414.263 -> that mimics production
416.04 -> so that we can make sure
417.54 -> that we're actually able
to handle these trades,
419.46 -> and get these trades to the market
421.23 -> in a very fast period of time.
423.27 -> The reason for that is
because, unlike most systems,
425.259 -> if a system fails partway
through a transaction,
428.79 -> it can be completed later,
429.75 -> say like a credit card system, right?
431.28 -> It could always be recharged five
minutes or 10 minutes later
433.41 -> once the system comes back. A trade can't.
435.57 -> Once we've accepted a trade,
437.04 -> we need to get that trade to the market
438.982 -> or else the price might move
440.337 -> and the customer may be owed money
442.26 -> to fix that trade: the difference between the price
443.67 -> they should have gotten and
the price they got.
445.83 -> So, building out these chaos tools
448.14 -> and these performance tools
449.34 -> was very key to us being able to deliver
452.43 -> a trading system on AWS with
our tier zero requirements
456.84 -> of 24 by seven with 100% reliability.
462.39 -> So, let's talk about
some of the challenges
464.79 -> that we went through to get through,
466.47 -> and build what I just talked about.
468.36 -> Some of the challenges in trading
469.98 -> are that you have to
interact with other parties
473.61 -> and you have to interact
with older systems
475.74 -> that give you data.
477 -> Trading is not simply taking a trade,
478.65 -> verifying you have enough money
480.15 -> and then sending it to a broker
or an exchange to execute.
484.77 -> That's the most simple form of it.
486.45 -> But when we look at Fidelity
and how we trade,
488.43 -> we service an enormous
amount of business lines,
491.76 -> stock plan services for people
that have blackout calendars.
496.05 -> WI, which is our Workplace Investing,
498.45 -> we have to make sure
that we can do all sorts
500.22 -> of different types of trading,
501.09 -> and we also need to make sure
502.11 -> that we can handle complex order types.
504.45 -> So, we have to be able to integrate
506.04 -> with legacy systems that are both on-prem,
509.55 -> new systems that are in AWS for teams
512.1 -> that have already made
the progress to AWS,
514.144 -> and we have to be able to integrate
515.91 -> with all of the existing trading framework
518.31 -> around Wall Street.
519.45 -> And most of the trades
today, when you do a trade,
521.94 -> are sent via what many of you
are aware is called the FIX network.
525.18 -> FIX is a protocol that was
built 20-something years ago,
528.99 -> in the 90s.
529.92 -> It was designed around
servers that have, you know,
532.44 -> a disk attached to them.
533.79 -> It has heartbeats, and is
not designed for a cloud
536.79 -> where you need to have, you know,
537.96 -> storage that doesn't exactly
exist on your server.
540.93 -> So, we had to solve that by
building custom FIX engines
544.68 -> that can actually operate
in a Kubernetes environment,
547.86 -> and we had to build custom frameworks
549.63 -> to make sure that we
could actually operate
551.4 -> at multiple thousands of
transactions per second,
554.43 -> as it would look
on a normal FIX engine
556.71 -> that you would see,
557.543 -> and be able to transmit those orders
559.5 -> to the market in a timely fashion.
561.57 -> We also had to make sure
that those FIX engines
565.139 -> could operate multi-region,
567.672 -> which was another hard concept
that we had to get through,
570.33 -> which I'll explain a little bit more later
571.83 -> in some of the architecture diagrams.
574.32 -> Some of the other pieces
575.28 -> that we had to take into
account is, as I said before,
577.89 -> we had to do it in a multi-region setup.
580.98 -> What does that mean?
581.836 -> We need to make sure that
583.08 -> if a customer's submitting an order,
584.76 -> it could go to either region one
586.74 -> or region two, and if anything
was to happen to region one,
590.85 -> a customer could act upon
that order in region two
594.941 -> and be able to cancel
or replace that order
597.84 -> without actually knowing
that the primary site
600.45 -> that they sent that
order to has failed.
603.33 -> We also need to make sure
604.53 -> that if we were actually
down to a single region,
607.74 -> that we would be able to survive
609.57 -> multiple availability
zone failures as well.
612.81 -> And we did this in
multiple different ways.
614.813 -> We did this by utilizing
616.909 -> many of the services from Amazon
619.02 -> such as MSK and DynamoDB,
621.39 -> which we'll show later in
our architecture diagrams,
623.79 -> as well as having to build
a lot of custom toolkits
626.22 -> to allow us to circuit break,
628.77 -> be able to tell when some things go wrong,
630.6 -> reroute orders that are in flight
632.55 -> in case of an issue,
633.78 -> detect failures, say
in underlying storage,
637.05 -> and be able to actually
work on orders in real time,
640.26 -> triage them automatically via the system
642.87 -> and be able to make sure
643.74 -> that they get to market in real time.
645.69 -> And if for some reason they don't,
647.07 -> we can actually notify
and have a corrected price
649.71 -> for the customer.
650.543 -> So this should all be
invisible to the customer.
653.97 -> Some other pieces that we had to deal with
655.655 -> is the transition between
public cloud and legacy.
658.755 -> We have an enormous amount of data
660.93 -> that's going through our legacy systems.
662.55 -> It was impossible for us
664.62 -> to actually just build a new system,
666.42 -> which would actually be many systems
669.03 -> and just reroute everything
to the new system in one day.
672.24 -> So, we've had to stand up
a new system inside of AWS,
675.96 -> as well as link it back to our old system
678.6 -> that's actually currently today on-prem.
681.9 -> And the way we've had to do that,
683.01 -> is we've had to forward bridge
and backwards bridge data
686.4 -> and we'll go through that in
the architectural diagram.
688.71 -> And some of the key components to that
690.15 -> in the trading system here
692.04 -> is, unlike the traditional
low-latency trading systems
694.74 -> that you would see in any of the other
697.47 -> margin broker-dealers,
698.76 -> we also have to support
order inquiry in real time.
702.33 -> Customers utilizing our
Active Trader Pro platform,
705.33 -> or fidelity.com, need
to know what the status
708.09 -> of their order is immediately
after it's submitted,
710.46 -> and immediately after it's been executed.
712.74 -> That requires us to have the ability
714.87 -> to serve those customers from
either our old system out,
718.14 -> or our new system independent
720.653 -> of where we've actually sent
that order to the marketplace.
725.94 -> And of course, the last
piece of the challenges
727.59 -> that we've had to deal with
coming from legacy systems,
730.92 -> we were able to control
our hardware changes.
733.62 -> In the cloud, we cannot always
control our hardware changes,
736.23 -> and as a 24 by seven system,
738 -> we need to be able to
route away from changes
740.34 -> if there are major changes happening,
741.9 -> or be able to just absorb those
changes in the trading day.
745.71 -> That required a large
amount of understanding
748.26 -> from our developers to make sure
749.73 -> that they can code and
be able to absorb changes
754.41 -> that happen intraday on
these cloud based systems.
759.48 -> So, let's talk a little bit
about some of the technology
761.074 -> we use to overcome some of the challenges
765.24 -> that I just spoke about.
767.97 -> One of the key pieces
was the data stores we had to use.
771.33 -> Some of the data stores
we used are DynamoDB,
773.46 -> as well as some
custom in-house cache systems,
776.85 -> as well as a traditional RDBMS to satisfy
779.43 -> some of the needs that
couldn't be satisfied
781.74 -> in a standard key-value
pair system such as Dynamo.
785.76 -> Another really big key component,
787.44 -> is the logging and visibility.
789.54 -> We had to build an entire framework
792.03 -> around logging and visibility
794.55 -> so that we could make sure
that we could track a trade
797.82 -> from when that trade begins,
799.86 -> and hits the system through
every single component
802.59 -> of the system, in real time,
804.81 -> to make sure that trade
is actually executing,
807.57 -> and a problem hasn't occurred.
809.28 -> So, we have systems today
810.96 -> that are actually going to listen
812.16 -> to all the different pieces.
813.15 -> So, we will accept the trade,
814.8 -> we will validate it, we will
then begin the routing process,
817.95 -> we will send it to an exchange,
819.33 -> we will receive an acknowledgement,
820.62 -> we will receive executions,
822 -> and we will validate in real time
824.04 -> to make sure that all of those processes
825.99 -> are performing as they're supposed to.
827.97 -> And if they're not performing
as they're supposed to,
830.49 -> we have circuit breakers that
we've built into the system
833.07 -> that will actually take
pieces of the system,
835.2 -> whether it's a pod in Kubernetes,
836.49 -> an availability zone,
842.7 -> or a whole region in AWS,
844.59 -> and knock them out in real time
846.36 -> in an automated fashion
847.92 -> so that the customer is not impacted.
849.87 -> So, we can triage, or the
system could automatically try
852.54 -> to correct itself in any event
855.07 -> where, all of a sudden,
857.49 -> our trades are not going to the market
858.99 -> in as timely a fashion as necessary
860.88 -> to provide the customer an execution
862.539 -> that will give them the best price.
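To make the circuit-breaker idea concrete, here is a minimal, hypothetical sketch of a per-target failure counter; the class name, threshold, and region labels are ours, not Fidelity's actual implementation:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Minimal circuit-breaker sketch: track failures per routing target
 * (a pod, availability zone, or region) and take the target out of
 * rotation once failures cross a threshold.
 */
public class CircuitBreaker {
    private final int maxFailures;
    private final Map<String, Integer> failures = new ConcurrentHashMap<>();

    public CircuitBreaker(int maxFailures) {
        this.maxFailures = maxFailures;
    }

    /** Record a failed health probe or late acknowledgement for a target. */
    public void recordFailure(String target) {
        failures.merge(target, 1, Integer::sum);
    }

    /** Record a successful round trip; resets the target's failure count. */
    public void recordSuccess(String target) {
        failures.remove(target);
    }

    /** A target stays routable until it trips the breaker. */
    public boolean isRoutable(String target) {
        return failures.getOrDefault(target, 0) < maxFailures;
    }

    public static void main(String[] args) {
        CircuitBreaker breaker = new CircuitBreaker(3);
        breaker.recordFailure("us-east-1");
        breaker.recordFailure("us-east-1");
        System.out.println(breaker.isRoutable("us-east-1")); // still under threshold
        breaker.recordFailure("us-east-1");
        System.out.println(breaker.isRoutable("us-east-1")); // tripped: route away
    }
}
```

A production breaker would of course add time windows and probing before re-admitting a target; this only illustrates the knock-out decision the talk describes.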
865.71 -> Like I mentioned before,
866.91 -> our enhanced test tools
868.203 -> on top of just the standard test tools
870.39 -> that we had to build
for like unit testing,
872.891 -> we had to build an entire custom chaos
876.24 -> and performance framework for that.
878.28 -> I mentioned that a little bit before,
879.87 -> but let me go into a little
bit more detail on that, right?
882.84 -> We've built an entire replica
884.19 -> that we can stand up
and stand down at will
886.16 -> of our production environment,
887.82 -> and have built replicas
of our entire data sets
890.58 -> at 1X, 2X, and up to 10X
892.219 -> or even more of our maximum day,
894.96 -> like I showed on that green graph
896.4 -> at the beginning of the presentation.
898.32 -> We are then able to
replicate that environment,
901.44 -> send data in, have an expectation
904.38 -> of what we would see at our
99th, 99.9th, etc. percentiles,
908.31 -> and then we can inject automated faults
910.65 -> in every part of the system
911.76 -> that we can at least come
up with that could fail
914.13 -> to make sure that as those failures occur,
917.168 -> we are able to actually
respond in real time,
920.37 -> and be able to correct the system
921.96 -> so that the customer will not be aware.
923.97 -> And that goes from everything
925.23 -> from the smallest Kubernetes pod,
927.48 -> to a large scale database failure,
929.85 -> to a failure of an availability zone,
934.05 -> to an entire region failure.
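As an illustration of the fault-injection side of such a framework, a toy version might wrap each component call and fail it with a configured probability; the component name and failure rate below are invented for the example, not part of Fidelity's tooling:

```java
import java.util.Random;
import java.util.function.Supplier;

/**
 * Toy fault-injection sketch: wrap a component call and replace the
 * result with a failure at a configured rate, so a load test at 1x-10x
 * volume also exercises the recovery paths.
 */
public class FaultInjector {
    private final Random random;
    private final double failureRate;

    public FaultInjector(double failureRate, long seed) {
        this.failureRate = failureRate;
        this.random = new Random(seed); // seeded for reproducible test runs
    }

    /** Call a component, sometimes throwing an injected failure instead. */
    public <T> T call(String component, Supplier<T> action) {
        if (random.nextDouble() < failureRate) {
            throw new RuntimeException("injected fault in " + component);
        }
        return action.get();
    }

    public static void main(String[] args) {
        FaultInjector injector = new FaultInjector(0.25, 42L);
        int ok = 0, failed = 0;
        for (int i = 0; i < 1_000; i++) {
            try {
                injector.call("order-validator", () -> "ACCEPTED");
                ok++;
            } catch (RuntimeException e) {
                failed++; // the system under test must recover from these
            }
        }
        System.out.println(ok + " accepted, " + failed + " injected failures");
    }
}
```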
936.18 -> So, we can fail in regions.
938.52 -> And we are set up in
a way that's hot, hot,
940.74 -> which I'll explain a little
bit later in the diagram.
943.26 -> But what that means is we
can send order flow today
946.14 -> to both regions at all times.
947.79 -> So today,
948.93 -> if you were to be sending
orders on fidelity.com
951.21 -> and you were to be utilizing it,
952.65 -> there is a roughly coin flip probability
955.98 -> that you'll be sending orders
to either our new system,
958.98 -> or to our old system.
960.57 -> And there's also another
roughly coin flip probability,
962.79 -> if you went to the new system,
964.98 -> of which region
you'd be going to:
967.26 -> the first region or the second region.
968.76 -> There is no primary region.
970.35 -> There's two regions that
run in a hot, hot fashion
972.96 -> replicating in real time to each other.
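The hot-hot, coin-flip routing just described could be sketched like this; the region names and the simple health flags are illustrative assumptions, not the real mechanism:

```java
import java.util.concurrent.ThreadLocalRandom;

/**
 * Sketch of hot-hot routing between two regions: a roughly coin-flip
 * split while both are healthy, and all flow to the surviving region
 * once one is marked down. There is no primary region.
 */
public class HotHotRouter {
    private volatile boolean region1Up = true;
    private volatile boolean region2Up = true;

    /** Mark a region unhealthy (e.g. after its circuit breaker trips). */
    public void markDown(String region) {
        if (region.equals("region-1")) region1Up = false;
        else region2Up = false;
    }

    /** Pick a destination region for the next order. */
    public String pickRegion() {
        if (region1Up && region2Up) {
            return ThreadLocalRandom.current().nextBoolean() ? "region-1" : "region-2";
        }
        if (region1Up) return "region-1";
        if (region2Up) return "region-2";
        throw new IllegalStateException("no healthy region");
    }

    public static void main(String[] args) {
        HotHotRouter router = new HotHotRouter();
        System.out.println(router.pickRegion()); // coin flip between the two
        router.markDown("region-1");
        System.out.println(router.pickRegion()); // always region-2 now
    }
}
```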
976.71 -> Some other pieces that we did
977.76 -> that were a little bit more standard,
979.248 -> we utilized mostly standard middleware
981.93 -> and messaging packages.
983.52 -> We used a lot of MSK,
the Amazon managed service for Kafka,
987.6 -> and we also did a little bit
of custom middleware messaging
990.48 -> like I mentioned before,
991.65 -> where we had to build
some custom FIX engines,
993.54 -> and some custom underlying technology
995.37 -> to make those work inside
of a Kubernetes environment.
999.15 -> And in terms of language,
1000.05 -> we use mostly standard Java,
1001.73 -> but we use a little
smattering of everything else.
1006.56 -> So, let's talk a little
bit about our order,
1008.12 -> our high level order architecture.
1010.58 -> The way we've structured this,
1011.66 -> is you can see we have the gray diagram,
1014.42 -> which is our legacy systems,
1016.34 -> which will both take orders in
1018.38 -> and send orders to the market.
1021.68 -> Our green systems are our new AWS systems,
1023.87 -> which will also take orders
in and send orders to the market.
1027.11 -> Both systems will also process
our customer inquiry traffic,
1032.54 -> which can be very large.
1034.25 -> We serve and operate as a
standard order management system
1036.86 -> for a lot of different customers.
1038.96 -> We have our own system,
1040.13 -> such as Active Trader Pro and fidelity.com,
1044.36 -> but there are many channels
and other business lines
1046.7 -> inside Fidelity and outside of Fidelity
1048.98 -> in our clearing business
1050.99 -> that utilize Fidelity's
trading infrastructure.
1053.51 -> And all of them require the ability
1055.55 -> to know exactly where an order is,
1057.223 -> at what state that order is,
1059.21 -> and be able to act upon
that order at any time.
1062.24 -> The key to what we did here
1063.46 -> is we wanted to make this
invisible to the customer.
1066.89 -> So as you can see in the upper left,
1068.45 -> there's a box that's called
director that runs on site.
1071.399 -> All trades and all inquiry statuses
1075.71 -> will go through director.
1077.12 -> Today, trades do; inquiry and statuses
1079.43 -> are almost complete.
1080.9 -> Director has the ability based upon rules
1083.026 -> and circuit breaker knowledge
1084.92 -> of where to route an order.
1087.59 -> Should it go to the new system,
1089.18 -> or should it go to the old system?
1091.25 -> What this does is it makes
it invisible to the customer.
1093.8 -> We gave one API that's in front.
1095.63 -> So if someone's building
out a new trading system,
1097.757 -> and they need to connect to us,
1099.2 -> someone has an old trading system,
1100.519 -> they don't know if they're
going to the new system,
1102.83 -> because that allows us to migrate
1105.23 -> our flow piece by piece by piece,
1106.881 -> so as we build out new
pieces of the system,
1110.03 -> we can continue to add functionality
1112.85 -> without customers needing to be tied to us
1115.25 -> to make sure that they make
the appropriate changes
1117.65 -> and they are tied to us in releases.
1119.96 -> We became independent of their releases.
1123.59 -> The other key component
here, which I'll stress,
1125.415 -> is our ability to back bridge
1127.43 -> and forward bridge our data.
1129.11 -> We are in a hybrid mode right now,
1130.79 -> and we've been in a
hybrid mode for two years,
1133.07 -> and we will be in a hybrid
mode in multiple years
1135.47 -> coming until we complete this project.
1138.05 -> The ability for us to back bridge
1140.36 -> and forward bridge our data
1141.322 -> is another piece of the puzzle
1143.21 -> that allows it so that customers
1145.34 -> don't have to worry
about where their inquiry
1147.62 -> or where their order goes.
1149.09 -> And if we were to move
customers, it would be invisible.
1152.24 -> So, our AWS system has a copy
of all of our trading data
1156.68 -> that's going on in real
time on our legacy systems.
1160.61 -> Our legacy systems have a copy
1162.71 -> of all of our trading data
1164.3 -> that's going on inside of our AWS systems,
1168.41 -> so that in the event
of a customer inquiry,
1170.93 -> they can go to either system
and get served their data.
1174.26 -> It also allows us a lot of freedom
1176.93 -> to release at a much faster rate.
1178.73 -> We don't have to necessarily,
1180.5 -> if we want to do a large release,
1181.82 -> we can scale volume down
and scale volume up.
1184.16 -> If we want to add new functionality,
1185.772 -> we can add that new functionality
1187.28 -> and slowly move customers onto it
1190.1 -> to make sure that we won't cause an outage
1191.78 -> or cause customer dissatisfaction.
1193.97 -> That's the key to how
we've been building this
1196.22 -> and that migration has
allowed us to continue
1199.13 -> to go forward.
1201.2 -> So, as you see in the picture,
1202.64 -> there's a forward bridge,
and a backwards bridge.
1204.98 -> We're real time asynchronously
1206.39 -> replicating across both of them.
1208.55 -> - [Audience Member] Quick question.
1209.383 -> - Sure.
1210.216 -> - [Audience Member] So,
are you all doing something
like a strangler fig
pattern to basically do
1215.63 -> the traffic shift as you
keep adding new services?
1220.61 -> - It's very similar to a
strangler pattern, yes.
1222.65 -> It's not exactly, but it's
basically a rule-based engine
1226.229 -> that underlying it has the 50
different, or whatever X amount,
1230.36 -> of rules that there are
that would comprise
1232.16 -> 100% of the order set, and then slowly
1234.26 -> we go click down one by one by one,
1236 -> until eventually hopefully there is none.
1237.5 -> - [Audience Member] Thank you.
1239.12 -> - No problem.
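A rule-based director of the kind described in that answer might look roughly like this; the Order fields, rule shapes, and system labels are hypothetical, chosen only to show how flow migrates rule by rule:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/**
 * Sketch of a "director"-style rule engine: each rule claims a slice
 * of the order flow for the new cloud system; anything unmatched falls
 * through to the legacy system. Migration proceeds by adding rules
 * until nothing falls through.
 */
public class Director {
    public record Order(String symbol, String orderType, String channel) {}

    private final List<Predicate<Order>> newSystemRules = new ArrayList<>();

    /** Add a rule that moves one more slice of flow to the new system. */
    public void migrate(Predicate<Order> rule) {
        newSystemRules.add(rule);
    }

    /** Decide, per order, which system it should be routed to. */
    public String route(Order order) {
        for (Predicate<Order> rule : newSystemRules) {
            if (rule.test(order)) {
                return "NEW";
            }
        }
        return "LEGACY";
    }

    public static void main(String[] args) {
        Director director = new Director();
        // Start by migrating plain market orders from the retail channel.
        director.migrate(o -> o.orderType().equals("MARKET")
                && o.channel().equals("RETAIL"));
        System.out.println(director.route(new Order("AAPL", "MARKET", "RETAIL"))); // NEW
        System.out.println(director.route(new Order("AAPL", "GTC", "RETAIL")));    // LEGACY
    }
}
```

Because callers only ever see the director's single API, rules can be added or reordered without any client release, which matches the release-independence point made above.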
1241.49 -> So, let's go forward on
our platform resiliency.
1243.35 -> This is one of the hardest
pieces of the puzzle
1244.853 -> that we needed to solve when
building something on Amazon,
1248.12 -> especially something that
needed to be tier zero.
1252.77 -> We needed to make sure
that we could operate,
1254.27 -> as I said before,
1255.103 -> a multi-region multi
availability zone pattern
1258.71 -> and be able to operate
on orders that existed
1260.78 -> in either region at any time.
1263.54 -> So, couple of key decisions
that we made at the beginning.
1266.66 -> All of our application
logic is in Kubernetes,
1270.082 -> it is not in any sort of EC2 instances,
1273.41 -> there is no application logic,
including our FIX engines,
1276.17 -> operating outside of Kubernetes.
1277.94 -> That allows us to be able to scale
1281 -> whatever we need to scale.
1282.68 -> It also allows us to add processes
1284.42 -> so that in the event of, say,
1285.41 -> we have another very, very large spike
1287.54 -> that is unforeseen due to
some sort of market event,
1290.3 -> we can simply click a button,
1291.95 -> and change our pod count from 20 to 50
1293.987 -> based upon our built up
orders that we see overnight.
1297.44 -> One of the ways in trading
that we can determine that,
1299.27 -> especially in a more
customer-facing system,
1301.94 -> is our overnight orders give us a guess
1303.74 -> as to what
1304.94 -> we're gonna see
in the first 30 minutes.
1306.92 -> So, unlike a traditional on-prem system
1308.9 -> where someone's gonna have to run
1310.28 -> and hopefully there'll be extra hardware
1311.81 -> lying around that we could set up,
1313.1 -> we never have to worry
about that problem again.
1315.11 -> We literally just spin
up a bunch more pods,
1317.373 -> and we're ready to go with
that particular point.
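The overnight-orders heuristic could, for example, translate a backlog into a morning pod count along these lines; every constant here is an invented placeholder, not Fidelity's actual tuning:

```java
/**
 * Sketch of the capacity heuristic described above: size the morning
 * pod count from the overnight order backlog, clamped to a floor and
 * ceiling. All numbers are illustrative assumptions.
 */
public class MorningCapacity {
    /** Orders one pod can absorb in the first 30 minutes (assumed). */
    static final int ORDERS_PER_POD = 50_000;
    static final int MIN_PODS = 20;
    static final int MAX_PODS = 200;

    /** Overnight backlog is used as a proxy for the market-open burst. */
    static int targetPods(long overnightOrders) {
        long needed = (overnightOrders + ORDERS_PER_POD - 1) / ORDERS_PER_POD;
        return (int) Math.max(MIN_PODS, Math.min(MAX_PODS, needed));
    }

    public static void main(String[] args) {
        System.out.println(targetPods(400_000));   // quiet night: floor of 20 applies
        System.out.println(targetPods(2_500_000)); // heavy night: scales up to 50
    }
}
```

The computed count would then be applied to the deployment, e.g. via `kubectl scale` or an autoscaler target, rather than by procuring hardware.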
1320.324 -> Some of the other keys
that we needed to do
1323.24 -> to make sure that this would work
1324.38 -> is we need to be able to fail over,
1325.79 -> and this was something that
was very, very difficult
1327.44 -> for us to build.
1328.55 -> We need to be able to
seamlessly, within seconds,
1331.34 -> fail over our entire FIX infrastructure
1333.92 -> from one region to another region.
1336.86 -> So, if you trade a market order today,
1338.841 -> it's very quick, it's gonna execute
1341.36 -> or it's not gonna execute immediately.
1343.46 -> Half of the orders though
are not market orders.
1345.44 -> They're much more complex order types:
1347.3 -> limit orders, GTCs (good
'til canceled), trailing stops,
1351.74 -> and they could exist for minutes, hours,
1354.2 -> days, weeks, months.
1355.823 -> Some could even go into the multi-month
1358.16 -> to even year timeframe.
1360.02 -> We need to make sure that if those orders
1361.7 -> exist on exchanges, then a
customer could actually operate
1364.79 -> on those orders even in
the event of a full failure
1368.27 -> of our entire region.
1370.43 -> So, to be able to accomplish that task,
1372.74 -> we've built the functionality
1374.03 -> that we can with the click of a button,
1375.95 -> be able to move our FIX engines
1377.63 -> from the affected bad
region to the good region,
1380.78 -> and be able to take all
of our customer flow
1382.7 -> from the director and point
it only to a single region.
1385.46 -> We have actually done this, and it
has been a very large benefit
1388.88 -> to us in production,
1389.713 -> in that we can actually
go to a single region
1392.57 -> with minimal to no
customer impact, if any,
1395.87 -> and be able to actually trade
1397.61 -> in that other region.
1399.23 -> That was a very difficult problem,
1400.063 -> and one of the key pieces
to how we built this system.
1405.17 -> So, let me finish up with the
last slide on where we are.
1412.278 -> So, we go to the next slide.
1413.111 -> So where we are today,
1414.29 -> we began this journey about
four years ago in 2019.
1417.62 -> We took it live on AWS somewhere
1420.98 -> around two years ago from
the original POC that we did.
1424.219 -> You could see: these
are real volumes here,
1426.051 -> as you could actually
see we've graphed them.
1429.32 -> You could actually see
in the red line here,
1431.75 -> what our legacy system is processing,
1433.49 -> and it kind of matches a
little to that green chart
1435.17 -> I showed at the beginning
of the presentation.
1437.09 -> And in the blue line,
1438.71 -> what our new system is processing,
1441.14 -> and you could see that there
are actually even days recently
1443.21 -> where we have actually
processed more trades
1444.814 -> on our new system versus our old system.
1448.88 -> Over the next couple
of years, coming years,
1451.04 -> we look to continue
doing this for equities
1453.2 -> and then we eventually look to scale
1454.55 -> this pattern to options
1455.93 -> as well as many of our
other different type
1457.58 -> of order types and business lines
1459.32 -> so that we can complete the migration
1461.33 -> from our older on premise legacy systems
1465.62 -> to our new cloud based systems.
1468.38 -> Thank you everybody for your time.
1470.561 -> (Louis chuckles)
1471.89 -> Thanks.
1472.723 -> Amr's gonna talk about
the cloud component of it,
1474.95 -> and what Fidelity did to build out
1476.33 -> their cloud architecture next,
1477.65 -> and then we'll take some questions.
1479.84 -> - Thank you, Louis.
1480.737 -> (audience applauds)
1489.56 -> All right, so let me take a
different spin for the story.
1494.3 -> This is actually the first slide
1495.53 -> that we had in the presentation.
1496.82 -> - [Louis] For sure.
1497.653 -> - I'm going just backward a little bit,
1498.486 -> like, wanna talk about two
dates here or two numbers here.
1502.04 -> The first one was 2019,
1504.316 -> and that was when Fidelity
basically made the announcement,
1506.87 -> at KubeCon in San Diego,
1511.16 -> of our strategy to move to public cloud,
1513.47 -> multi public cloud first.
1515.96 -> And then like three years
later, which is today,
1518.72 -> we're at 5,700 applications
running in the public cloud.
1522.95 -> So, they're all running in this platform,
1525.26 -> and in the next 20 minutes,
1526.34 -> I'm gonna do my best
to tell you the story,
1528.47 -> how we build it, where is it today,
1531.17 -> what's our vision for tomorrow,
1532.91 -> and how we are hosting many applications
1536.15 -> that you know, actually,
1537.2 -> I would love to interact with you guys.
1539.72 -> Can you guys raise your
hand if you are using today
1541.79 -> fidelity.com, Fidelity Mobile, NetBenefits,
1545.489 -> all of our Fidelity products,
1548.18 -> and if you look around, it's awesome.
1550.88 -> Thank you for your
business, thank you so much.
1553.34 -> And literally you are interacting
with this platform today.
So, this is a very
interesting slide, actually.
1561.59 -> It does show like you
know, what was our vision,
1563.42 -> how we started that journey,
1565.07 -> and how we focus.
1565.903 -> When you scale 5,700 applications
1568.64 -> and more to come in that platform,
1570.53 -> you have to start first of all
by building the foundation.
1573.74 -> This foundation has to
be your security foundation,
1576.56 -> your compliance foundation,
your infrastructure foundations,
1580.43 -> where you are gonna host your application,
1582.14 -> your data, your event streaming.
1584.3 -> All these components need to be in place.
1586.64 -> And I'm gonna go a little bit deep
1587.96 -> about that in a few slides.
1590 -> Also, you have to reimagine
these applications.
1592.43 -> So Louis and his platform,
1593.72 -> here is one of these.
1595.01 -> If you imagine since we
talk about Kubernetes here,
1597.65 -> and about containers, imagine
like all of these platforms
1600.65 -> are running as clusters of containers,
1603.2 -> they're all running like ships or boats
1605.3 -> or carriers that carry these containers.
1607.19 -> We have massive numbers of them today.
Some of them are criticality
level tier zero
1612.35 -> to tier one, to tier three and so on.
1614.6 -> But that reimagining of
the application itself
1617.09 -> to run encapsulated with
your API, with your data,
1620.66 -> with event streaming, with your
observability and security,
1623.93 -> all encapsulated, that's one of the values
1626.57 -> that was added to this platform.
1628.79 -> And obviously, working in investments,
1631.22 -> we care about our numbers.
1633.02 -> The FinOps model, I'm gonna
show a few slides about that,
1635.3 -> but the FinOps for us is critical.
1637.85 -> It can get really expensive
1639.14 -> when you move this amount of
applications to the cloud.
1642.11 -> How we manage that today,
1643.16 -> and how we are continuing to manage that,
1645.038 -> that's gonna be one of the discussions
1647.18 -> we're gonna have here.
1648.53 -> And last but not least, we
manage thousands of developers,
1652.34 -> hundreds of developer
teams, many business units
1655.31 -> or business partners
working with us around that.
1658.46 -> So, we're definitely
like focusing this year
1661.55 -> and next few years on
the developer experience
1663.92 -> about having, you know,
1665.27 -> and building our Fidelity
open source projects,
1668.09 -> we're gonna discuss that,
show it to you guys as well,
1670.61 -> and how we are attracting
the talent in the company.
1674.629 -> So, just full disclaimer
before I go on this slide,
1678.92 -> this is one way to build the platform.
1681.56 -> There's many other ways that
we can build this platform,
1683.6 -> but this way is bulletproof.
1685.67 -> We tried it, it did work,
1687.59 -> and in this slide I
wanna just share with you
1690.02 -> like you know, what are the rules?
1691.73 -> Rule number one that we used,
1693.11 -> we use open source technology.
1695.03 -> We focus in containers,
1696.2 -> we focus in Kubernetes,
1697.37 -> we focus in many of the CNCF
products that we use today.
1701.72 -> We're using Envoy.
We are actually part of the
Envoy open source project itself.
1705.56 -> We're big in the telemetry side,
1708.02 -> and the OpenTelemetry project as well.
1710.03 -> So, that was one of the key strategies
1711.86 -> that we announced in 2019 at KubeCon.
1716.12 -> The second part was to use managed services.
1719.3 -> So, while Kubernetes is awesome,
1721.49 -> we don't wanna get busy managing Kube.
1724.46 -> As a matter of fact, we
have a major private cloud
1726.529 -> in our data center today running Kube,
1729.17 -> and it's a big job and big task
1732.11 -> operating that platform.
1734.27 -> So, definitely one of the
recommendations I would say,
1736.64 -> is to start using, you know,
1737.99 -> managed services like EKS,
1739.82 -> and I'm gonna go a little bit in depth
1741.08 -> on how we're using that today,
1742.79 -> and managing like hundreds
and hundreds of clusters today,
1746.15 -> or just like container ships as well.
1749.124 -> Number three, definitely,
1750.86 -> you need to focus on building
your network strategy.
1754.7 -> So we literally, we have inside Fidelity
1756.35 -> similar like a map like
this, you know subway map,
1759.08 -> this is New York subway map,
1760.82 -> I couldn't share the right one
that we have inside Fidelity,
1763.79 -> but we have a map that
shows all the regions
1767.15 -> in all the cloud provider,
it shows all the colors,
1770.06 -> it shows all of our data centers,
1772.13 -> the exchange and the latency
between all these areas.
1776.15 -> So Louis and his team and other product
1778.64 -> teams can literally see where they should
place their applications.
1781.97 -> What is a sunny day versus a rainy day,
1784.67 -> what is my, like, you know,
1786.02 -> like, in New York, if you
guys are from the New York area,
1787.64 -> you know there is an express subway
1789.02 -> and a local subway; when you're
taking the express subway,
1792.17 -> what happens when you
have, like, this disaster
1794.09 -> and you have to go through the local subway.
1796.58 -> So the network is definitely
one of the investment
1798.95 -> that we did, and since we
start using managed services
1802.4 -> like EKS and MSK,
1804.5 -> we start focusing in building
1806.166 -> our Fidelity Kubernetes program
1809.03 -> or our container program on top of that.
1811.52 -> So, we focus in the fleet management,
1813.5 -> how we manage these clusters,
1816.2 -> how we manage the multi-tenancy,
1817.88 -> how we can host multiple
applications in these clusters.
1821.57 -> We also focus in
application management side,
1824.36 -> how we integrate our
platform or applications
1827.36 -> with the security in the
backend with observability,
1831.95 -> with other components,
1833.282 -> And last but not least,
1835.07 -> integrating that with the
cloud services itself.
1837.68 -> Like, you wanna manage your clusters
1841.04 -> and you manage your fleet
from FinOps perspective
1843.98 -> for resource management,
1845.48 -> you wanna do that optimization.
1847.25 -> And our program focuses on that.
1849.117 -> On top of that,
1850.13 -> we start building all
of our core components.
1852.17 -> So today, literally we are now running
1854.376 -> our event streaming programs there,
1856.37 -> we're running our API program
on top of that platform.
1859.25 -> We're running many other programs
1861.29 -> including our future data programs itself
1863.36 -> is running on top of that platforms.
1865.52 -> And last but not least, obviously,
1867.56 -> trade and fidelity.com and others
1869.33 -> are running on top of that.
1874.34 -> There is multiple ways you can manage
1875.978 -> fleet of containers in Kubernetes.
1878.39 -> One way, you can buy a product.
1880.46 -> Second way, you can do like
what we did three years ago,
1883.49 -> go and assemble multiple
open source projects
1885.74 -> and build your own project
1887.27 -> or build your own open source program,
1889.37 -> or you can use ours.
1890.36 -> Ours is available, it's
free, it's open source.
1893.27 -> We'll be very happy to
collaborate with you,
1895.31 -> as a matter of fact, like, you know,
1896.683 -> there's a couple of banks
already collaborating with us
1899.33 -> around that, and we'd love it if you guys
1901.4 -> wanna partner with us on it.
1904.28 -> Just to give you, like, and
highlight the program itself.
1907.58 -> The first piece is like how
you can connect your fleet.
1909.95 -> So imagine you have a fleet of ships
1911.99 -> that are running your containers,
and you wanna have your
thousands of developers
1916.07 -> access these platforms or
these clusters safely,
1920.93 -> and understand what role
1922.1 -> and what authentication and
authorization they can get in with.
1924.68 -> That's what our KConnect tool does.
1927.541 -> Second one, is our Kraan,
1930.5 -> and Kraan is our framework that we built
1933.013 -> to manage these clusters,
1936.14 -> to build all the operators
1937.55 -> and the integration that
we have in these clusters
1939.8 -> and how we can safely
upgrade this cluster.
As a matter of fact,
1945.86 -> the Kubernetes program itself, and CNCF,
1948.223 -> and AWS require you to
upgrade every three months.
1951.26 -> So, you have to upgrade your
environment every three months.
1953.48 -> We have a rehydration requirement as well
1955.34 -> that goes almost on a monthly basis.
1958.28 -> So, with that program, and with Kraan,
1960.14 -> we've managed over 12,000
upgrades in the last few years.
1964.31 -> And that's all happened seamlessly
1966.17 -> without you know,
interference for the business.
1968.99 -> And you will need definitely
a kind of, like, framework
1972.11 -> that will manage this infrastructure
1973.58 -> for you or on your behalf.
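The roughly quarterly upgrade cadence just described can be sketched as simple fleet bookkeeping. This is an illustrative sketch only — the 90-day window, cluster names, and dates are assumptions, not Kraan's actual logic:

```python
# Illustrative sketch (not Kraan's actual logic): pick which clusters in a
# fleet are due for their roughly-quarterly Kubernetes upgrade.
from datetime import date, timedelta

UPGRADE_CADENCE = timedelta(days=90)  # "upgrade every three months"

def clusters_due(last_upgraded: dict[str, date], today: date) -> list[str]:
    """Return cluster names whose last upgrade is older than the cadence."""
    return sorted(name for name, when in last_upgraded.items()
                  if today - when >= UPGRADE_CADENCE)

fleet = {
    "trading-east": date(2022, 7, 1),
    "trading-west": date(2022, 10, 1),
    "batch": date(2022, 5, 15),
}
print(clusters_due(fleet, date(2022, 11, 28)))  # ['batch', 'trading-east']
```

At fleet scale — hundreds of clusters, plus a monthly rehydration pass — this kind of tracking is what lets thousands of upgrades happen without the application teams noticing.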
1975.86 -> Last but not least,
1976.82 -> we're very focused today in
resiliency and operation.
1979.82 -> So, we're actually releasing
our Theliv program,
1982.34 -> and Theliv is on GitHub,
1986.101 -> and it's a way of integrating
our fleet of clusters
1990.23 -> or containers or Kubernetes
1991.481 -> with our Prometheus infrastructure
1993.674 -> that we're gonna be launching in future,
1996.53 -> and collecting all of this data,
1998.06 -> and all of our data analytics
1999.53 -> for all our operational
data in the backend.
2002.65 -> And we're building a framework
2003.76 -> where it actually can
programmatically diagnose issues
2007.87 -> in your application or event issues.
2010.03 -> For instance, if you have
an auto scaling event
2012.43 -> or you have a deployment event,
it will, on your behalf,
2015.43 -> it'll do the checkup that you ask it to do
2017.188 -> and it'll figure out where the issues
2019.27 -> and the challenges are, and will
provide you with some, you know,
2024.22 -> solutions, and hopefully in the
future be intelligent enough
2027.97 -> by adding some machine learning
Ops model on top of that.
2031.72 -> But this is a future for us.
2036.91 -> Now, the real foundation under all of that
2040.36 -> is an EKS cluster.
2042.28 -> Our EKS clusters are very systematic,
2045.82 -> meaning we provide like
one single template
2050.991 -> for how the cluster runs
for all of our applications
2053.92 -> and all our systems.
2055.939 -> They are deployed, as Louis was saying,
2058.21 -> in multiple regions, multiple
availability zones.
2061.6 -> So they come prebuilt
2063.13 -> for all of our application teams to use.
2065.74 -> We also provide policies,
like policies about
2069.55 -> how our routing is happening,
2070.93 -> how our DNS service in
the backend is being set up,
2073.42 -> how our LBs are all configured.
2075.749 -> All of that becomes prebuilt.
2078.55 -> For all of our application team
2080.23 -> to host their application in that.
Beside that, we actually
do the cluster management side,
2090.85 -> hosting all of what we
call the Kraan program itself.
2094.12 -> That has all become, as well, prebuilt.
2096.1 -> So when you deploy your
application to our platform,
2099.82 -> you actually literally
just deploy your application,
2102.49 -> preconfigured for observability,
preconfigured for routing,
2106.33 -> preconfigured for security,
preconfigured for FinOps,
2110.073 -> preconfigured for east-
west communication,
2114.61 -> and preconfigured for, you
know, additional tasks
2118.3 -> like, you know, how you do things
2120.07 -> like service discovery and others,
2121.63 -> and, futuristically, how we're
gonna do service mesh
2124.48 -> overall across all these clusters.
2131.11 -> I mentioned developer experience,
2132.67 -> that's something actually
we started this year.
2135.574 -> What we found after releasing
all these containers today,
we have over a quarter
million containers running
2141.31 -> critical workloads in production.
2144.19 -> And what we found is that we
need to start building this
2146.446 -> convergence, or consolidation,
2148.75 -> and unify our developer experience.
2151.558 -> Today, we have multiple
projects working on that.
2154.72 -> This is one of them, called
the Starling project.
2157.48 -> And the Starling project
2158.41 -> is our application management platform.
2161.77 -> It's very focused around, like,
how you have a unified experience
2166.09 -> when you onboard applications,
2167.89 -> how you can manage the
small things in the backend,
2170.62 -> how you can integrate the
teams to onboard applications,
2173.71 -> onboard multiple applications,
2175.51 -> how you can start, like, you know,
2177.1 -> building a prescriptive
model around deployment,
2181.09 -> using some frameworks like, you know,
2183.55 -> Argo CD and other frameworks.
2185.59 -> How you actually manage your cluster
2187.54 -> so you can manage your upgrade,
2188.71 -> you can manage your hydration,
2190.39 -> and provide all of that
through single portal
2193.18 -> that can be self-service for all the teams
2195.55 -> and all the application teams,
2197.17 -> and all of the Ops teams as well.
2199.78 -> You can manage that.
2204.25 -> Behind that, we actually have,
2206.03 -> and I have to recognize that
and I have to mention this,
2208.66 -> this is Bombayer, our sister
group, very much focused
2212.32 -> on building this modern development cycle.
2215.68 -> So, before your application actually
2218.05 -> gets onboarded to our platform,
2219.7 -> and before your application
can get inside the platform,
2222.34 -> you have to go through that pipeline.
2224.38 -> This is our like, you know,
2227.26 -> I would say like one of
the most like you know,
awesome programs that I
saw for innersourcing,
2233.29 -> because it gets all of our
2234.86 -> thousands of developers
collaborating
2239.65 -> and that model system, actually,
2240.91 -> it does focus on the governance side,
2243.31 -> it does focus in building,
like, consistency
2245.29 -> around the CI processes,
2246.97 -> how this application is being built
2248.92 -> around the test frameworks,
2250.75 -> around the security aspect of that,
2252.67 -> and it lands into our production,
2254.62 -> where our cloud platform is,
2255.97 -> and where our applications
are being deployed.
2261.876 -> I wanna focus a little bit in the FinOps,
2263.89 -> there is actually one
presentation there, it's awesome
2266.5 -> you know, the presentation
information is there
2267.97 -> around the FinOps side,
2269.83 -> but I wanna go through how
we started the FinOps model
2272.74 -> in the cloud platform.
2274.687 -> On day one, it's like
when you buy a house,
2278.82 -> so you go there and you are
very excited about the new house
2282.31 -> but then you get hit by
the first mortgage bill,
2284.301 -> And be like, "Oh my god,
2286.18 -> I have to worry about that
and I have to manage as well."
2288.88 -> So the first thing you do,
2290.02 -> you're kind of looking at
refinancing information,
2292.54 -> you know, possibilities,
and that's what we did.
2295.39 -> So, definitely, like, you know,
2297.28 -> you wanna go through a discussion
2299.56 -> about using reserved instances, in a sense,
2301.9 -> and how doing that
2302.989 -> will provide you, kind of, like, you know,
2306.386 -> definitely cost management in this case.
2309.55 -> But the next thing, the next
one that you do after that,
2312.01 -> you start looking in your rooms,
2314.05 -> and you start seeing lights on,
2315.97 -> and you start like
shutting down the lights
2317.95 -> behind your kids and everyone
in the families, right?
2320.347 -> And we do that as well.
2321.91 -> We go like every weekend and every night,
2323.95 -> and we see which systems are not being used
2326.2 -> or underutilized, and we
do shut down these systems
2328.96 -> in the backend.
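That "turn off the lights" pass can be sketched as a small utilization filter. This is a hedged illustration — the 5% CPU threshold, system names, and off-hours flag are assumptions for the sketch, not Fidelity's actual policy:

```python
# Hypothetical sketch of the nightly/weekend FinOps pass described above:
# flag systems whose utilization is below a threshold during off-hours.
IDLE_CPU_THRESHOLD = 5.0  # percent average CPU over the sample window (assumed)

def systems_to_shut_down(utilization: dict[str, float], is_off_hours: bool) -> list[str]:
    """Return systems safe to stop: off-hours and effectively unused."""
    if not is_off_hours:
        return []  # never shut anything down during business hours
    return sorted(name for name, cpu in utilization.items()
                  if cpu < IDLE_CPU_THRESHOLD)

weekend_sample = {"dev-sandbox": 1.2, "perf-test": 0.4, "trading-prod": 38.0}
print(systems_to_shut_down(weekend_sample, is_off_hours=True))  # ['dev-sandbox', 'perf-test']
```

The point of the analogy holds: production stays up, but idle non-production environments stop burning money overnight.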
2330.7 -> And the next one, you
start thinking about like,
2332.717 -> "Why don't I put, like, intelligence
2334.48 -> in our power system?
2336.52 -> Why don't start using solar,
2337.87 -> using some of this smart system
2340.45 -> and smart devices in the houses."
2342.52 -> And that's what we start
doing as well in our side.
So we start, like, utilizing
the Spot Instances,
2347.83 -> that was one of the things
2349.03 -> that we released about like two years ago,
2350.68 -> and in our management infrastructure,
2353.17 -> we were able to get up to like
40% of savings by using Spot.
2357.612 -> About two months ago, we
released Graviton,
2361.48 -> and that's additional saving
that we're gonna experience,
2365.17 -> and we're still actively
evaluating that right now,
2367.45 -> but we're expecting that
might get to an additional 30%
2371.29 -> of the, you know, of cost
saving as well in that.
2374.98 -> But we feel like what
is the most critical thing
2377.38 -> is really, like, you know,
applications being developed
2380.56 -> toward a financial aspect,
or how you can drive
2386.38 -> the culture inside your development teams,
2388.45 -> and your development community,
2390.1 -> to start thinking about cost saving
2392.59 -> and start thinking about
2394.66 -> how we can optimize
the application itself.
2396.91 -> How we can use a smart auto scaler,
2399.31 -> where the auto scaler will understand
more than memory and CPUs.
2402.67 -> It will understand when this
application is being utilized
2405.28 -> and how it can, likely, you know,
2407.08 -> in a sense, be reduced
2408.07 -> when you use these kinds of things.
2409.72 -> And that's actually a futuristic thing
2411.19 -> that we're trying to do right now as well.
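The "smart auto scaler" idea — sizing on a business signal rather than only CPU and memory — could look something like this. Purely speculative sketch: the orders-per-second metric, per-replica capacity, and bounds are invented for illustration:

```python
# Speculative sketch of scaling on a business signal (hypothetical
# orders-per-second rate) instead of only CPU/memory utilization.
import math

def desired_replicas(orders_per_sec: float, per_replica_capacity: float = 100.0,
                     min_replicas: int = 1, max_replicas: int = 50) -> int:
    """Replicas needed to absorb current load, clamped to a safe range."""
    needed = math.ceil(orders_per_sec / per_replica_capacity)
    return max(min_replicas, min(max_replicas, needed))

print(desired_replicas(0.0))    # 1  (overnight: scale down to the floor)
print(desired_replicas(950.0))  # 10 (market-open burst)
```

An app that only trades during market hours scales to its floor overnight instead of idling at a CPU-derived replica count — which is exactly the cost-aware developer culture the talk is describing.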
2416.59 -> I did speak about
observability a little bit
2419.05 -> when I mentioned Theliv,
2420.303 -> but this is one of the things
2421.87 -> that we're working on right now in our lab
2423.49 -> and we're gonna be
releasing that massively
2425.08 -> across of all of our
Fidelity cloud platforms,
2429.04 -> and what we found around
the observability side,
2431.41 -> it's very interesting,
2432.58 -> because when you are in your data center,
2435.61 -> you are literally fine
2437.32 -> with having traditional monitoring tools.
2439.87 -> But when you start
building a hybrid model,
2442.615 -> and hosting your application,
2444.64 -> part of your application on premise
2445.99 -> and other part is moving to the cloud
2447.61 -> and moving from one
region to another region,
2449.83 -> and moving from one
cloud to another cloud,
2451.477 -> and you start to decompose
your monolith application
2454.84 -> toward, like, you know, microservices.
2457.12 -> So one single app becomes
like 30 or 40 microservices,
2460.39 -> and you wanna manage
all this communication,
2462.73 -> what you find is that the
observability tools
2464.672 -> have more noise, they're more expensive,
2467.587 -> and they don't provide the end-to-end view
2471.43 -> that you're looking for.
2473.17 -> So, one of the things
that we're doing right now
2475 -> is start investing in building
our observability pipeline.
2477.963 -> It's based on CNCF OpenTelemetry.
2481.75 -> It is GA right now
for metrics and for traces.
2486.047 -> We're still working with them today
2488.157 -> around the log side as well.
2490.21 -> And I think this is one of the areas
2491.157 -> that's gonna be our
future for observability.
2494.65 -> Once we turn the pipeline on,
2498.16 -> this means we can remove
most of this noise
2500.26 -> out of observability.
2501.45 -> We will be able also to drive
2505.073 -> like the actual data,
or the critical data,
2508.21 -> to our premium solution of observability,
2510.7 -> and move the non-critical data
2512.44 -> to S3 storage or other solutions
2515.8 -> that can be used in the backend,
2517.69 -> and using Theliv and
using the CNCF technology
2521.078 -> and Kubernetes, I think,
kube-prometheus and others,
2523.63 -> we'll be able to collect this data.
2525.79 -> You know, for example,
2527.073 -> when you connect to an API server in Kube,
2530.53 -> you might be able to extract
a thousand metrics per second
2533.562 -> out of your API server.
2536.251 -> This by itself is very expensive
2539.26 -> if you're using like traditional
cloud observability tools.
2543.31 -> But using that method, you'll
be able to filter that.
2545.92 -> And you'll be able to see which area,
2547.75 -> which metrics that you care about
2549.7 -> at certain times and you
can program the other one
2552.31 -> to use them or not use them as well.
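The pipeline idea described here — keep the metrics you care about for the premium backend, route the rest to cheap storage — can be sketched as a name-based splitter. A minimal sketch under stated assumptions: the metric names, prefixes, and the premium/archive split are invented, and a real implementation would sit in an OpenTelemetry Collector processor rather than application code:

```python
# Minimal sketch of metric filtering/routing in an observability pipeline.
# Prefixes and the "premium"/"archive" buckets are assumptions for illustration.
PREMIUM_PREFIXES = ("apiserver_request_duration", "order_latency")

def route_metrics(metrics: dict[str, float]) -> tuple[dict, dict]:
    """Split a scrape into (premium, archive) buckets by metric-name prefix."""
    premium, archive = {}, {}
    for name, value in metrics.items():
        bucket = premium if name.startswith(PREMIUM_PREFIXES) else archive
        bucket[name] = value
    return premium, archive

scrape = {
    "apiserver_request_duration_seconds": 0.12,
    "order_latency_ms": 4.2,
    "go_gc_cycles_total": 1800.0,  # noisy, rarely needed: send to cheap storage
}
premium, archive = route_metrics(scrape)
print(sorted(premium), sorted(archive))
```

With thousands of metrics per second coming off each API server, the filter is what keeps the premium observability bill proportional to the data you actually look at.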
2562.21 -> Now, the data pattern
is an interesting topic
2567.67 -> because (laughs) we
started the data journey
2570.093 -> using our traditional RDBMS,
2573.34 -> and, you know, SQL databases,
2575.091 -> and Louis mentioned that as well
2577.9 -> in the first section of the presentation.
2580.044 -> What we found is that while they work well
2586.18 -> inside our data center,
2587.32 -> when you move to the cloud side,
2588.79 -> you have to worry about failures
2591.19 -> and you have to worry
about synchronization
2593.02 -> between multi regions,
2594.64 -> and between six availability
zones for tier zero
2597.49 -> applications, like Louis' app.
2599.62 -> And with that, we have to start investing
2601.45 -> in, like, a newer pattern.
2602.77 -> This is one of the patterns
that we invested in today
2605.56 -> using DynamoDB.
2607.03 -> So, it's used as a caching layer
2609.4 -> in front of our RDBMS
database in the backend,
2613.48 -> and it does hot-hot synchronization
2615.31 -> between the two regions.
2617.77 -> That's how we guarantee that
the orders, on the failure
2620.92 -> of one of the regions or the failure of
2622.42 -> one of the availability zones,
2625.93 -> can be recovered in the right SLAs,
2629.2 -> when we have the data synchronization
2631.36 -> happening almost in near real time.
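The DynamoDB-in-front-of-RDBMS pattern described above is essentially cache-aside. Here is a minimal sketch under stated assumptions: a plain dict stands in for the DynamoDB global table, a stub function stands in for the relational read, and the names (`get_order`, `load_order_from_rdbms`) are invented for illustration, not Fidelity's code:

```python
# Cache-aside sketch: a dict stands in for the DynamoDB (global table) layer,
# a stub stands in for the authoritative RDBMS read. Illustrative only.
cache: dict[str, dict] = {}  # stand-in for DynamoDB, replicated across regions

def load_order_from_rdbms(order_id: str) -> dict:
    """Stand-in for the authoritative relational read in the backend."""
    return {"order_id": order_id, "status": "open"}

def get_order(order_id: str) -> dict:
    """Read through the cache; populate it from the RDBMS on a miss."""
    if order_id not in cache:
        cache[order_id] = load_order_from_rdbms(order_id)  # fill cache on miss
    return cache[order_id]

first = get_order("A-1001")   # miss: hits the RDBMS, fills the cache
second = get_order("A-1001")  # hit: served from the DynamoDB layer
print(second["status"], len(cache))  # open 1
```

In the real pattern, the dict would be a DynamoDB global table, so the cached order state is replicated hot-hot across regions and survives a regional or availability-zone failure within the stated SLAs.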
2636.174 -> And I wanna end with this:
2639.28 -> it's been a great journey
in the last three years.
2643 -> I think, you know, we used
multiple, like, newer technologies,
2646.45 -> we had a lot of players,
2647.44 -> but what really mattered
was the Fidelity culture.
2652.03 -> Having these four
pillars between security,
2655.42 -> between the platform, between the SRE,
2657.945 -> between the applications,
2660.34 -> having harmony between the four,
2661.99 -> collaboration between the four pillars.
2665.71 -> That's actually what made
our platform successful.
2669.7 -> We chat, we argue, (laughs) we discuss,
2673.87 -> we change plans, but
at the end of the day,
2676.42 -> having these four pillars integrated
2678.67 -> and collaborating together,
2680.83 -> understanding that it's not
like a traditional data center,
2684.01 -> not traditional practices that
can solve the problem.
2687.49 -> And instead, like, every team is worried
2689.554 -> and every team is focused on
what the other team is doing,
2692.62 -> security team is helping
the platform team,
2695.41 -> our platform team is, you
know, focused on SRE.
2698.41 -> Our SRE team is helping in engineering,
2700.9 -> our application team is
everywhere helping us with that.
2703.48 -> That's what matters, that's
what Fidelity culture,
2706.637 -> the cloud platform culture is about.
2708.97 -> And I think that's why
we're successful today,
2710.71 -> moving 5,700 applications to the cloud
2713.74 -> and thank you so much for that.
2715.052 -> (audience applauds)
2721.45 -> - All right, thank you Amr.
2723.19 -> So, I promised we will
leave room, I should say,
2726.38 -> for questions from the audience.
2728.59 -> I do have one question, though, for Louis.
2729.88 -> I'm gonna flip back to this
architecture slide real quick,
2732.55 -> 'cause I think there's some
interesting data points
2735.43 -> here that I wanna discuss
real quick. Let me find it.
2739.24 -> So Louis, earlier on in your presentation,
2741.7 -> you were talking about
orders queuing up, right?