AWS re:Invent 2022 - Graph feature engineering with Neo4j and Amazon SageMaker (PRT022)

Aug 16, 2023

Featurization is one of the most difficult problems in machine learning, just behind data wrangling in terms of the time it consumes. For many problems, featurization plays the largest role in determining model performance, greater even than choice of machine learning method. In this talk, walk through how graph features engineered in Neo4j can be used in a supervised learning model trained with Amazon SageMaker. These novel graph features can improve model performance beyond what is possible with more traditional approaches. This presentation is brought to you by Neo4j, an AWS Partner.

Content

0.18 -> - We're gonna talk about graph feature engineering

2.1 -> with Neo4j and Amazon SageMaker.

4.74 -> I'm here with my partner in crime, Ben Lackey

7.08 -> who's gonna be showing you a demo in a bit.

11.61 -> All right, so we'll get right to it.

13.32 -> So the agenda we have for you guys today

15 -> is gonna be, you know, concise

16.65 -> since we have 15 minutes on the clock.

18.66 -> We're gonna talk about some of the graph features

21.27 -> that Neo4j offers to improve

23.82 -> your machine learning model, right?

26.13 -> We're gonna go through the architecture

27.78 -> and how we establish Neo4j with the AWS ecosystem.

32.64 -> And Ben's gonna walk you guys through a quick demo

35.64 -> and we're gonna talk about how you guys

37.56 -> can actually try it out yourself as well.

43.32 -> So, first of all, at a high level,

45.42 -> let's talk about, you know,

46.68 -> what Neo4j essentially does, right?

48.81 -> So you have three components in the middle.

51.81 -> So you have a Neo4j Graph Database

53.82 -> which kind of stores, you know,

56.07 -> and manipulates data within the graph database itself.

59.1 -> And then you have the data science component

61.5 -> that's gonna work with SageMaker

63.33 -> to give you the embedded inferences.

66.36 -> And then Neo4j Bloom is gonna help you

69 -> visualize that graph.

71.01 -> And what I also wanna say is that we have connectors

73.65 -> to all these Amazon native services

75.99 -> that you just see on the board

77.34 -> and we're also going to provide

79.47 -> more services in the future.

81.45 -> We have a lot of exciting connectors

83.1 -> that we're gonna be building going forward

85.14 -> and that will probably be at next re:Invent.

87.6 -> So you'll catch more on that later on.

91.05 -> Now, let's dive a little deeper

92.79 -> into the architecture itself, right?

94.56 -> So, first off, you have your data

97.26 -> in an Amazon S3 bucket, right?

99.66 -> So what you wanna do is-

100.77 -> Let's say you want to provide

102.3 -> an inference to a customer.

103.86 -> It could be any use case.

105.33 -> For example, patient data.

107.25 -> It's really important to understand the graph

109.89 -> or the logical representation

111.87 -> of what does this patient record attest to?

115.53 -> What are the drugs that this patient is on, right?

118.38 -> That's an example.

119.25 -> But we can diversify that to multiple use cases.

121.95 -> So what we're gonna do here is

123.54 -> we're gonna use a SageMaker Studio notebook

126.51 -> to actually pull data from S3

129.36 -> and send it to Neo4j's Graph Data Science.

132.57 -> And what that's gonna do is run your model

134.85 -> and send it to Neo4j Graph Database.

137.52 -> Now, what you wanna do is, using an API call,

140.13 -> it's gonna pull it back through

141.87 -> an automated training job

144.03 -> with the SageMaker Autopilot that you can configure.

147.18 -> That's actually gonna provide the output

149.16 -> back to the end user, right?

150.84 -> So, in a nutshell, it looks pretty simple.

152.85 -> In fact, we're on AWS Marketplace.

155.76 -> So if you wanna get started,

157.5 -> you can get started with Neo4j Enterprise

159.72 -> with a click of a button.

161.28 -> That's it.

162.113 -> And then you deploy all these instances

164.13 -> within your VPC.

167.58 -> And just what I described,

169.5 -> Ben is actually gonna take it forward

171.3 -> with a demo that's gonna highlight

173.61 -> how we actually do it.

175.32 -> With that, I'll hand it over to you.

176.67 -> - Okay.

177.503 -> Thanks Antony.

179.31 -> So I'm gonna go through a demo

181.2 -> of what Antony just described.

183.63 -> What we're gonna show off is deploying

185.88 -> Neo4j Enterprise Edition

188.16 -> through the AWS Marketplace.

190.17 -> And then we're gonna use an example

192.66 -> that's actually in the Amazon SageMaker examples repo

196.77 -> to load some data.

198.27 -> And we're probably gonna run outta time there.

200.31 -> But the rest of the example shows

201.96 -> how to kick off a SageMaker Autopilot job

206.28 -> that uses data from the database as a new feature

211.21 -> in training a machine learning method.

213.57 -> So, this is all public.

215.58 -> You can always check it out later.

217.38 -> I actually just submitted

218.85 -> a new pull request this morning,

220.59 -> bumping it from Neo4j four

222.353 -> to Neo4j version five, which recently came out.

225.45 -> So, excited about that.

228.27 -> But yeah.

229.53 -> So, if you check out this repo under Autopilot,

235.71 -> down here, there's a Neo4j example.

239.16 -> And the Neo4j example talks through

242.4 -> how to deploy Neo4j and so on.

245.4 -> I have already done a little pre-work.

249.51 -> So this is SageMaker Studio.

253.68 -> And I've already cloned

255.33 -> the repo we were just looking at.

257.16 -> And I have a copy of the notebook open here.

259.8 -> So, if I read through the notebook,

263.1 -> it sort of says, hey,

264.45 -> use AWS Marketplace to do a deployment.

267.39 -> And I can click this handy-dandy link

270 -> and it sends me over to the marketplace.

272.58 -> And look here, there's Neo4j Enterprise Edition.

276.27 -> And it's just one of these kind of

277.8 -> next, next, next experiences.

280.23 -> So, you know, you click through to subscribe.

283.41 -> Click through to configure.

286.05 -> Select which one you would like to use

288.3 -> and say launch.

289.56 -> And you can just kind of take all the defaults here.

292.77 -> Launch CloudFormation.

295.14 -> Thinks for a moment.

296.19 -> And I say launch.

297.33 -> And now it's gonna kick me into the Amazon Console.

301.38 -> And I don't know if any of you guys noticed,

305.4 -> but US-East-1, the most venerable region,

309.57 -> was having some difficulties this morning.

311.79 -> So we're gonna do stuff in Oregon to be safe.

317.22 -> But here's our CFT.

319.59 -> And to configure Neo4j,

321.96 -> you just enter in whatever parameters you want here.

325.26 -> So I'm gonna say, like,

326.91 -> reinvent demo for my stack name.

330.66 -> And I can take most of the defaults.

332.85 -> I mentioned this recently bumped to version five.

335.88 -> Version five is a major release for us,

337.92 -> includes a bunch of neat stuff.

340.32 -> We're gonna install Graph Data Science,

343.23 -> which is a collection of 50 plus

345.96 -> different graph machine learning algorithms.

348.99 -> We're gonna be using one specific algorithm

351.39 -> which is called FastRP.

353.46 -> And that's gonna take our graph

355.23 -> and turn it into column vectors

357.78 -> that can be fed into SageMaker.

360.09 -> So it's a nice way to take one data model

362.64 -> and use it in a more traditional machine learning context.
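
[Editor's sketch] To make the idea of turning a graph into vectors concrete, here is a toy random-projection embedding in plain Python. It is a simplified illustration in the spirit of FastRP, not the Neo4j Graph Data Science implementation: the function name, the weights, and the sparsity pattern are all assumptions chosen for readability.

```python
import math
import random


def _l2_normalize(vector):
    """Scale a vector to unit length; leave all-zero vectors unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return vector if norm == 0 else [x / norm for x in vector]


def random_projection_embeddings(edges, dim=8, iteration_weights=(1.0, 1.0), seed=42):
    """Toy node embedding: random vectors averaged over neighborhoods."""
    # Build an undirected adjacency map from (source, target) pairs.
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Start with one sparse random vector per node (entries -1, 0 or +1).
    rng = random.Random(seed)
    current = {
        node: [rng.choice((-1.0, 0.0, 0.0, 1.0)) for _ in range(dim)]
        for node in sorted(neighbors)
    }

    embedding = {node: [0.0] * dim for node in neighbors}
    for weight in iteration_weights:
        # Each pass replaces a node's vector with the normalized mean of its
        # neighbors' vectors, so pass k summarizes the k-hop neighborhood.
        averaged = {}
        for node, nbrs in neighbors.items():
            mean = [sum(current[n][i] for n in nbrs) / len(nbrs) for i in range(dim)]
            averaged[node] = _l2_normalize(mean)
        current = averaged
        # The final embedding is a weighted sum of the per-pass vectors.
        for node in neighbors:
            embedding[node] = [e + weight * c for e, c in zip(embedding[node], current[node])]
    return embedding


# Hypothetical ownership edges, using company names mentioned in the talk.
example_edges = [("Vanguard", "Ford"), ("BlackRock", "Ford"), ("Fidelity", "Amazon")]
example_vectors = random_projection_embeddings(example_edges, dim=4)
```

In this toy version, nodes whose neighborhoods are identical end up with identical vectors, which is the property that lets a downstream tabular model pick up on graph structure.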

367.56 -> Need to type a password in there.

370.29 -> We're just gonna do one server instead of three.

373.26 -> And then the last thing is this sort of oddity.

376.41 -> You need to say, hey, it's open.

378.96 -> So, you do that.

380.43 -> We don't need to save that.

382.53 -> We go next.

384.15 -> And I acknowledge that's okay.

387.54 -> And I hit submit.

389.07 -> So that submits the stack to be deployed.

392.49 -> These deployments usually take between five and 10 minutes.

395.67 -> We clearly don't have that much time.

397.74 -> So I already made one

400.47 -> and you can check that out here.

402.57 -> Looks like I made it at 9:51 this morning.

405.69 -> And one of the cool things about this

407.4 -> is, if I click on outputs,

409.35 -> it gives me the address of the ELB

412.23 -> where my Neo4j's running.

414.18 -> So I can just click on that guy

416.61 -> and it's gonna pop that open.

420.09 -> Come on, come on.

421.02 -> Okay, so now we're logged into this Neo4j instance.

426.03 -> And if I click on it, nothing in the database.

429.51 -> So, let's fix that.

431.76 -> And yeah, we still got eight minutes,

433.5 -> so we're doing well.

435.63 -> So, if I go back to the notebook

437.82 -> I was describing earlier,

439.38 -> we have some examples of how to load data.

442.02 -> And this particular data set

444.63 -> we're going through in here

446.43 -> comes from the SEC's EDGAR system.

449.73 -> So it's filings of asset managers

452.37 -> who manage over a hundred million dollars.

454.77 -> And the machine learning problem

456.72 -> we've characterized in this data set

458.76 -> is we're trying to predict

460.83 -> how their portfolio churns.

463.08 -> So, they're required to file quarterly.

466.59 -> So, for a given quarter,

469.98 -> for a given holding, like, say,

472.47 -> Vanguard owns some Ford stock.

474.75 -> We try to predict true or false

477.42 -> if that position is going to increase or decrease.

480.18 -> Well, binary classification problem.
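
[Editor's sketch] The prediction target described here can be derived from consecutive quarterly filings. The snippet below is a minimal illustration, assuming hypothetical field names (`manager`, `cusip`, `quarter`, `shares`) and assuming that `True` means the position increased; the talk does not say how the notebook encodes the label.

```python
def label_position_changes(filings):
    """Return one training example per (manager, holding, quarter) with a boolean target."""
    # Index share counts by position (manager + security), then by quarter.
    by_position = {}
    for row in filings:
        key = (row["manager"], row["cusip"])
        by_position.setdefault(key, {})[row["quarter"]] = row["shares"]

    examples = []
    for (manager, cusip), shares_by_quarter in by_position.items():
        # Quarter strings such as "2021Q1" sort chronologically as text.
        quarters = sorted(shares_by_quarter)
        # Pair each filed quarter with the next filed quarter for that position.
        for this_q, next_q in zip(quarters, quarters[1:]):
            examples.append({
                "manager": manager,
                "cusip": cusip,
                "quarter": this_q,
                # Assumed encoding: True if the position grew, False otherwise.
                "target": shares_by_quarter[next_q] > shares_by_quarter[this_q],
            })
    return examples
```

The last filed quarter of each position produces no example, because there is no following quarter to compare against.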

482.76 -> And I like it because, you know,

484.02 -> it's real data, it has nice structure,

485.91 -> it's, like, fun to look at.

488.01 -> So, very first thing that we need to do

491.55 -> is we install the Graph Data Science library.

494.7 -> Just pip install Graph Data Science.

496.65 -> Piece of cake.

498.09 -> And it's doing that.

500.88 -> And then I did something before this.

504.51 -> I edited this cell slightly

507.06 -> to insert the connection string

508.89 -> for the instance I already spun up.

512.58 -> So, I can run that cell.

515.22 -> And then I'm gonna create a connection to Neo4j.

518.55 -> And hooray, it connected,

521.37 -> which is actually a minor miracle

522.75 -> given how the Wi-Fi's been.

525.18 -> So the next thing I'm gonna do is

526.5 -> I'm gonna create some constraints on my database.

529.83 -> This can basically be thought of as indices.

533.61 -> So I'm saying, like,

537.18 -> filing manager's the key for nodes of type Manager.

542.01 -> So, you know, there can be only one Vanguard.

544.23 -> There can be only one BlackRock.

545.73 -> That kind of thing.

547.02 -> Sort of makes sense.
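
[Editor's sketch] Uniqueness constraints of this kind might look like the following in Neo4j 5 Cypher. The labels `Manager` and `Company`, the property names `filingManager` and `cusip`, and the constraint names are assumptions based on the spoken description, not copied from the notebook. The helper takes any callable that executes a Cypher string, such as `gds.run_cypher` from the `graphdatascience` client.

```python
# Neo4j 5 syntax: one uniqueness constraint per node type (names are assumed).
CONSTRAINTS = [
    "CREATE CONSTRAINT manager_key IF NOT EXISTS "
    "FOR (m:Manager) REQUIRE m.filingManager IS UNIQUE",
    "CREATE CONSTRAINT company_key IF NOT EXISTS "
    "FOR (c:Company) REQUIRE c.cusip IS UNIQUE",
]


def create_constraints(run_cypher):
    """Run each constraint statement through the supplied Cypher runner."""
    for statement in CONSTRAINTS:
        run_cypher(statement)
    return len(CONSTRAINTS)
```

With a live connection this would be called as `create_constraints(gds.run_cypher)`. A uniqueness constraint also creates a backing index, which is why it speeds up the lookups done during loading.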

548.76 -> Right, so let's load some data.

551.82 -> Data is sitting in a public S3 bucket.

555 -> It's just plain old CSV files.

557.94 -> And we're gonna run through it a couple times.

561.24 -> So, first off, we're gonna create companies.

564 -> So these are like public companies.

565.77 -> Think like Ford, ExxonMobil,

567.9 -> Amazon is in the data set, stuff like that.

570.77 -> So that ran.

572.67 -> Same deal.

574.53 -> In this case we're gonna create the asset managers.

576.57 -> So these are companies like Fidelity, say,

580.71 -> that you might have your retirement account with.

583.83 -> That runs.

586.35 -> Next up...

589.68 -> Yeah, okay.

591.54 -> So, we're gonna create that.

593.25 -> And then this is sort of the graphiest bit of this.

597.78 -> We are going to create all the connections

600.6 -> between these different nodes we're loading up.

603.21 -> So we're creating relationships

605.61 -> like this asset manager owns this company.

611.22 -> And then we have other relationships

613.11 -> like these holdings are part of this fund

616.29 -> and so forth.
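
[Editor's sketch] A relationship load from a CSV file could be expressed roughly as below. The relationship type `OWNS`, the column names, and the URL are hypothetical placeholders; the real notebook's data model may differ. As with the constraints, the helper accepts any callable that takes a Cypher string and a parameter dictionary, for example `gds.run_cypher`.

```python
# Cypher that matches existing nodes by key and links them.
# MERGE only creates the relationship if it does not exist yet,
# so re-running the load does not duplicate edges.
LOAD_OWNERSHIP = """
LOAD CSV WITH HEADERS FROM $url AS row
MATCH (m:Manager {filingManager: row.filingManager})
MATCH (c:Company {cusip: row.cusip})
MERGE (m)-[:OWNS]->(c)
"""


def load_ownership(run_cypher, csv_url):
    """Execute the load query, passing the CSV location as a parameter."""
    return run_cypher(LOAD_OWNERSHIP, {"url": csv_url})


# Hypothetical location; substitute the bucket used in the notebook.
EXAMPLE_URL = "https://example-bucket.s3.amazonaws.com/holdings.csv"
```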

617.82 -> So, that takes a little bit to run.

620.94 -> We've got a year of all filings in the US

624.33 -> in this data set.

625.163 -> So, like-

627.141 -> I forget, I think it's a hundred meg or something.

629.07 -> It's not enormous, but it's something.

631.71 -> And it looks like that all ran.

635.52 -> So let's go back to our database.

637.62 -> And you can see in here now, there's stuff.

641.19 -> So it did in fact load up some data.

643.74 -> Looks like about a half a million things.

646.62 -> So, like, I can click on company here

649.23 -> and I can expand this guy

650.97 -> and I've run a query where I'm grabbing companies.

655.17 -> I'm limiting them to 25.

657.48 -> And you can see some different companies.

662.01 -> And I'm gonna-

663.75 -> Right now, the labels we're seeing in here

665.67 -> are sort of indecipherable numbers.

667.65 -> There's something called a CUSIP,

669.57 -> which in the finance industry

671.25 -> is a unique identifier,

672.63 -> basically a GUID for companies.

674.76 -> But if I click name of issuer here,

678 -> I can make it a little more human readable.

680.55 -> So, like, there's Becton Dickinson,

682.68 -> there's BlackRock, there's 3M.

685.32 -> And I can click on these guys and expand them.

688.41 -> And, like, Becton Dickinson,

691.71 -> a bunch of people own shares of that, it looks like.

696.99 -> And once again, I can change the label and so forth.

700.5 -> So, this is kind of fun to just

702.09 -> explore around the dataset

703.35 -> and do that kind of thing.

705.15 -> But I promised you featurization

708.57 -> and graph machine learning.

709.74 -> So I think we can squeeze that in real quick

712.29 -> before we're forced to wrap.

715.59 -> So, I'm going to take my database

719.49 -> and I'm gonna create an in-memory projection

722.13 -> on top of the database with this command.

725.01 -> And the nice part about this in-memory projection

727.32 -> is it allows us to, like, mutate things

730.65 -> and do all sorts of transformations

732.54 -> without destroying the underlying data.

736.56 -> And I can run this

737.67 -> and sort of make sure that ran okay.

740.34 -> And then this cell here is sort of

742.65 -> what you all came here for, I think.

745.62 -> This is, we're taking the whole graph

748.29 -> and we're using an algorithm called FastRP.

751.5 -> And we're gonna create a vector out of that graph

754.95 -> that represents each of these holdings.

758.16 -> And this can then be fed into SageMaker,

761.19 -> used in a supervised learning context

763.35 -> as a new feature that has predictive power.
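
[Editor's sketch] With the `graphdatascience` Python client, the projection and embedding steps might look like this. The graph name, node labels, relationship type, and parameter values are assumptions. This sketch streams the embeddings back directly, whereas the notebook may use a different execution mode such as mutate.

```python
def embed_graph(gds, graph_name="holdings", dimension=16, seed=1):
    """Project the stored graph into memory and compute FastRP embeddings.

    `gds` is expected to be a connected graphdatascience.GraphDataScience
    object; labels and relationship type below are assumed names.
    """
    # Create the in-memory projection; the stored data is left untouched.
    G, _ = gds.graph.project(
        graph_name,
        ["Manager", "Company"],
        {"OWNS": {"orientation": "UNDIRECTED"}},
    )
    # Stream results back as a DataFrame with nodeId and embedding columns.
    return gds.fastRP.stream(G, embeddingDimension=dimension, randomSeed=seed)
```

Fixing `randomSeed` makes repeated runs comparable, which helps when tuning the embedding dimension against model performance.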

766.56 -> And part of what's so cool about this

768.93 -> is it's not always the very best thing to do,

772.95 -> but it's a very much like one-size-fits-all thing.

776.13 -> Like, you can always compute an embedding

777.93 -> and try tuning the embedding and different things.

781.8 -> So it's often an interesting approach to explore

785.67 -> if you have connected data.

787.38 -> And sometimes it works really well

789 -> and sometimes other approaches work better.

791.34 -> But it's a great starting point.

793.02 -> So we run that and basically it spits out a vector.

798.48 -> This is where we will slightly run outta time

801.12 -> because there's a little pandas magic in here

804.63 -> that sort of massages this into a CSV file

807.48 -> that SageMaker is gonna accept.
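
[Editor's sketch] The reshaping step amounts to spreading each embedding vector across one column per dimension and appending the target. The notebook does this with pandas; the version below uses only the standard library to stay dependency-free, and the column names are assumptions.

```python
import csv
import io


def embeddings_to_csv(rows, target_column="target"):
    """Flatten rows of {'embedding': [...], target_column: ...} into CSV text."""
    dim = len(rows[0]["embedding"])
    # One column per embedding dimension, followed by the label column.
    header = [f"embedding_{i}" for i in range(dim)] + [target_column]

    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    for row in rows:
        writer.writerow(list(row["embedding"]) + [row[target_column]])
    return buffer.getvalue()
```

The resulting text has a header row and a single target column, so it can be uploaded to S3 as tabular training data.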

809.58 -> And then towards the end of this notebook,

812.67 -> we kick off the SageMaker job.

815.46 -> SageMaker Autopilot jobs,

818.16 -> even if you tune them down to be really fast,

821.4 -> take about 90 minutes to run.

823.98 -> And that's because it runs multiple machine learning jobs,

827.52 -> creating a bunch of different candidates,

830.31 -> different attempts at training

832.08 -> and then you try to select

834.06 -> one of the candidates from that.
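
[Editor's sketch] An Autopilot job can be started through the SageMaker `CreateAutoMLJob` API. The helper below only builds the request; the job name, S3 locations, role ARN, objective metric, and candidate limit are placeholder assumptions rather than the notebook's actual settings.

```python
def autopilot_request(job_name, train_s3_uri, output_s3_uri, role_arn,
                      target="target", max_candidates=10):
    """Build keyword arguments for the SageMaker create_auto_ml_job call."""
    return {
        "AutoMLJobName": job_name,
        "InputDataConfig": [{
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": train_s3_uri}
            },
            # Column in the CSV that holds the label to predict.
            "TargetAttributeName": target,
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        # Increase-or-decrease is a two-class problem.
        "ProblemType": "BinaryClassification",
        "AutoMLJobObjective": {"MetricName": "F1"},
        # Capping the number of candidates is one way to shorten the run.
        "AutoMLJobConfig": {"CompletionCriteria": {"MaxCandidates": max_candidates}},
        "RoleArn": role_arn,
    }


# With boto3 installed and credentials configured, the job would be started with:
#   boto3.client("sagemaker").create_auto_ml_job(**autopilot_request(...))
```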

835.8 -> It spits out all these

836.76 -> really interesting reports on how it did

839.61 -> and so on.

841.56 -> So, with that, I am-

844.62 -> Well, actually I'm gonna do one more thing.

847.86 -> So I am on Twitter, @benofben.

853.02 -> And if you are curious,

856.2 -> the slides for this

859.08 -> are going to be available there in a second.

861.72 -> Here are our slides.

863.97 -> This is one of my pet peeves

865.53 -> about any presentation I ever go to.

867.63 -> Everyone always says they'll share the slides.

869.19 -> They never do.

870.12 -> So hopefully you enjoy that.

873.81 -> Let me try to switch this thing back

876.51 -> so we can conclude.

878.04 -> - [Antony] Yeah.

880.29 -> - And yeah, next steps.

882.75 -> Try it on AWS Marketplace.

884.46 -> It's really fun.

886.08 -> SageMaker Notebook was what we just stepped through there.

890.94 -> There's a Neo4j and Amazon landing page.

893.82 -> Real simple to remember.

894.93 -> Just Neo4j.com/amazon.

897.81 -> Has a bunch of this information on it.

899.97 -> And we are, of course,

901.2 -> in the AWS Partner finder as well.

903.72 -> We have various qualifications, you know,

905.34 -> Marketplace Seller, Data and Analytics,

907.59 -> the things you would expect.

909.9 -> With that, I think we're off.

911.28 -> Please come up and ask questions after.

914.04 -> - Thank you. Thank you everyone.

Source: https://www.youtube.com/watch?v=pgy-KF83XJ4