AWS re:Invent 2022 - Graph feature engineering with Neo4j and Amazon SageMaker (PRT022)

Featurization is one of the most difficult problems in machine learning, just behind data wrangling in terms of the time it consumes. For many problems, featurization plays the largest role in determining model performance, greater even than choice of machine learning method. In this talk, walk through how graph features engineered in Neo4j can be used in a supervised learning model trained with Amazon SageMaker. These novel graph features can improve model performance beyond what is possible with more traditional approaches. This presentation is brought to you by Neo4j, an AWS Partner.


Content

0.18 -> - We're gonna talk about graph feature engineering
2.1 -> with Neo4j and Amazon SageMaker.
4.74 -> I have here with me my partner in crime, Ben Lackey,
7.08 -> who's gonna be showing you a demo in a bit.
11.61 -> All right, so we'll get right to it.
13.32 -> So the agenda we have for you guys today
15 -> is gonna be, you know, concise
16.65 -> since we have 15 minutes on the clock.
18.66 -> We're gonna talk about some of the graph features
21.27 -> that Neo4j offers to improve
23.82 -> your machine learning model, right?
26.13 -> We're gonna go through the architecture
27.78 -> and how we establish Neo4j with the AWS ecosystem.
32.64 -> And Ben's gonna walk you guys through a quick demo
35.64 -> and we're gonna talk about how you guys
37.56 -> can actually try it out yourself as well.
43.32 -> So, first of all, at a high level,
45.42 -> let's talk about, you know,
46.68 -> what Neo4j essentially does, right?
48.81 -> So you have three components in the middle.
51.81 -> So you have a Neo4j Graph Database
53.82 -> which kind of stores, you know,
56.07 -> and manipulates data within the graph database itself.
59.1 -> And then you have the data science component
61.5 -> that's gonna work with SageMaker
63.33 -> to give you the embedded inferences.
66.36 -> And then Neo4j Bloom is gonna help you
69 -> visualize that graph.
71.01 -> And what I also wanna say is that we have connectors
73.65 -> to all these Amazon native services
75.99 -> that you just see on the board
77.34 -> and we're also going to provide
79.47 -> more services in the future.
81.45 -> We have a lot of exciting connectors
83.1 -> that we're gonna be building going forward
85.14 -> and that will probably be at next re:Invent.
87.6 -> So you'll catch more on that later on.
91.05 -> Now, let's dive a little deeper
92.79 -> into the architecture itself, right?
94.56 -> So, first off, you have your data
97.26 -> in an Amazon S3 bucket, right?
99.66 -> So what you wanna do is-
100.77 -> Let's say you want to provide
102.3 -> an inference to a customer.
103.86 -> It could be any use case.
105.33 -> For example, patient data.
107.25 -> It's really important to understand the graph
109.89 -> or the logical representation
111.87 -> of what does this patient record attest to?
115.53 -> What are the drugs that this patient is on, right?
118.38 -> That's an example.
119.25 -> But we can diversify that to multiple use cases.
121.95 -> So what we're gonna do here is
123.54 -> we're gonna use a SageMaker Studio notebook
126.51 -> to actually pull data from S3
129.36 -> and send it to Neo4j's Graph Data Science.
132.57 -> And what that's gonna do is run your model
134.85 -> and send it to Neo4j Graph Database.
137.52 -> Now, what you wanna do is, using an API call,
140.13 -> it's gonna pull it back through
141.87 -> an automated training job
144.03 -> with the SageMaker Autopilot that you can configure.
147.18 -> That's actually gonna provide the output
149.16 -> back to the end user, right?
150.84 -> So, in a nutshell, it looks pretty simple.
152.85 -> In fact, we're on AWS Marketplace.
155.76 -> So if you wanna get started,
157.5 -> you can get started with Neo4j Enterprise
159.72 -> with a click of a button.
161.28 -> That's it.
162.113 -> And then you deploy all these instances
164.13 -> within your VPC.
167.58 -> And just what I described,
169.5 -> Ben is actually gonna take it forward
171.3 -> with a demo that's gonna highlight
173.61 -> how we actually do it.
175.32 -> With that, I'll hand it over to you.
176.67 -> - Okay.
177.503 -> Thanks, Antony.
179.31 -> So I'm gonna go through a demo
181.2 -> of what Antony just described.
183.63 -> What we're gonna show off is deploying
185.88 -> Neo4j Enterprise Edition
188.16 -> through the AWS Marketplace.
190.17 -> And then we're gonna use an example
192.66 -> that's actually in the Amazon SageMaker examples repo
196.77 -> to load some data.
198.27 -> And we're probably gonna run outta time there.
200.31 -> But the rest of the example shows
201.96 -> how to kick off a SageMaker Autopilot job
206.28 -> that uses data from the database as a new feature
211.21 -> in training a machine learning method.
213.57 -> So, this is all public.
215.58 -> You can always check it out later.
217.38 -> I actually just submitted
218.85 -> a new pull request this morning,
220.59 -> bumping it from Neo4j four
222.353 -> to Neo4j version five, which recently came out.
225.45 -> So, excited about that.
228.27 -> But yeah.
229.53 -> So, if you check out this repo under Autopilot,
235.71 -> down here, there's a Neo4j example.
239.16 -> And the Neo4j example talks through
242.4 -> how to deploy Neo4j and so on.
245.4 -> I have already done a little pre-work.
249.51 -> So this is SageMaker Studio.
253.68 -> And I've already cloned
255.33 -> the repo we were just looking at.
257.16 -> And I have a copy of the notebook open here.
259.8 -> So, if I read through the notebook,
263.1 -> it sort of says, hey,
264.45 -> use AWS Marketplace to do a deployment.
267.39 -> And I can click this handy-dandy link
270 -> and it sends me over to the marketplace.
272.58 -> And look here, there's Neo4j Enterprise Edition.
276.27 -> And it's just one of these kind of
277.8 -> next, next, next experiences.
280.23 -> So, you know, you click through to subscribe.
283.41 -> Click through to configure.
286.05 -> Select which one you would like to use
288.3 -> and say launch.
289.56 -> And you can just kind of take all the defaults here.
292.77 -> Launch CloudFormation.
295.14 -> Thinks for a moment.
296.19 -> And I say launch.
297.33 -> And now it's gonna kick me into the Amazon Console.
301.38 -> And I don't know if any of you guys noticed,
305.4 -> but us-east-1, the most venerable region,
309.57 -> was having some difficulties this morning.
311.79 -> So we're gonna do stuff in Oregon to be safe.
317.22 -> But here's our CFT.
319.59 -> And to configure Neo4j,
321.96 -> you just enter in whatever parameters you want here.
325.26 -> So I'm gonna say, like,
326.91 -> reinvent demo for my stack name.
330.66 -> And I can take most of the defaults.
332.85 -> I mentioned this recently bumped to version five.
335.88 -> Version five is a major release for us,
337.92 -> includes a bunch of neat stuff.
340.32 -> We're gonna install Graph Data Science,
343.23 -> which is a collection of 50 plus
345.96 -> different graph machine learning algorithms.
348.99 -> We're gonna be using one specific algorithm
351.39 -> which is called FastRP.
353.46 -> And that's gonna take our graph
355.23 -> and turn it into column vectors
357.78 -> that can be fed into SageMaker.
360.09 -> So it's a nice way to take one data model
362.64 -> and use it in a more traditional machine learning context.
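
To make the idea of turning a graph into vectors concrete, here is a toy, pure-Python sketch in the spirit of FastRP: random starting vectors, followed by repeated neighbor averaging. It is a simplification for illustration and not Neo4j's implementation; the function name, the edge list, and the parameter values are all assumptions.

```python
import math
import random


def toy_random_projection_embeddings(edges, dim=4, iterations=2, seed=42):
    """Toy node embedding: random initial vectors, then neighbor averaging."""
    rng = random.Random(seed)

    # Build an undirected adjacency list from (source, target) pairs.
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Step 1: every node starts with a random vector with entries in {-1, 0, +1}.
    current = {
        node: [rng.choice((-1.0, 0.0, 1.0)) for _ in range(dim)]
        for node in sorted(neighbors)
    }
    embedding = {node: [0.0] * dim for node in neighbors}

    for _ in range(iterations):
        # Step 2: each node takes the average of its neighbors' vectors.
        current = {
            node: [sum(current[m][i] for m in nbrs) / len(nbrs) for i in range(dim)]
            for node, nbrs in neighbors.items()
        }
        # Step 3: add the L2-normalized result of this iteration to the embedding.
        for node, vec in current.items():
            norm = math.sqrt(sum(x * x for x in vec)) or 1.0
            for i in range(dim):
                embedding[node][i] += vec[i] / norm
    return embedding


# Hypothetical "manager owns company" edges.
edges = [("Vanguard", "Ford"), ("Vanguard", "3M"), ("BlackRock", "Ford")]
embeddings = toy_random_projection_embeddings(edges)
```

Each node ends up with a fixed-length list of floats, which is the shape a tabular learner expects as input columns.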
367.56 -> Need to type a password in there.
370.29 -> We're just gonna do one server instead of three.
373.26 -> And then the last thing is this sort of oddity.
376.41 -> You need to say, hey, it's open.
378.96 -> So, you do that.
380.43 -> We don't need to save that.
382.53 -> We go next.
384.15 -> And I acknowledge that's okay.
387.54 -> And I hit submit.
389.07 -> So that submits the stack to be deployed.
392.49 -> These deployments usually take between five and 10 minutes.
395.67 -> We clearly don't have that much time.
397.74 -> So I already made one
400.47 -> and you can check that out here.
402.57 -> Looks like I made it at 9:51 this morning.
405.69 -> And one of the cool things about this
407.4 -> is, if I click on outputs,
409.35 -> it gives me the address of the ELB
412.23 -> where my Neo4j's running.
414.18 -> So I can just click on that guy
416.61 -> and it's gonna pop that open.
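
The same stack outputs can be read programmatically. The sketch below parses a CloudFormation `describe_stacks` response into a dictionary; the stack name, the output key `Neo4jBrowserURL`, and the sample URL are hypothetical placeholders, since the transcript does not name the template's actual outputs.

```python
def stack_outputs(describe_stacks_response):
    """Map OutputKey -> OutputValue for the first stack in the response."""
    stack = describe_stacks_response["Stacks"][0]
    return {o["OutputKey"]: o["OutputValue"] for o in stack.get("Outputs", [])}


# With AWS credentials configured, the response would come from boto3:
#   import boto3
#   cfn = boto3.client("cloudformation", region_name="us-west-2")  # Oregon
#   response = cfn.describe_stacks(StackName="reinvent-demo")
# A hand-written response with the same shape is used here instead.
sample_response = {
    "Stacks": [
        {
            "StackName": "reinvent-demo",
            "Outputs": [
                {
                    "OutputKey": "Neo4jBrowserURL",  # hypothetical key
                    "OutputValue": "http://example-elb.us-west-2.elb.amazonaws.com:7474",
                }
            ],
        }
    ]
}
outputs = stack_outputs(sample_response)
```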
420.09 -> Come on, come on.
421.02 -> Okay, so now we're logged into this Neo4j instance.
426.03 -> And if I click on it, nothing in the database.
429.51 -> So, let's fix that.
431.76 -> And yeah, we still got eight minutes,
433.5 -> so we're doing well.
435.63 -> So, if I go back to the notebook
437.82 -> I was describing earlier,
439.38 -> we have some examples of how to load data.
442.02 -> And this particular data set
444.63 -> we're going through in here
446.43 -> comes from the SEC's EDGAR system.
449.73 -> So it's filings of asset managers
452.37 -> who manage over a hundred million dollars.
454.77 -> And the machine learning problem
456.72 -> we've characterized in this data set
458.76 -> is we're trying to predict
460.83 -> how their portfolio churns.
463.08 -> So, they're required to file quarterly.
466.59 -> So, for a given quarter,
469.98 -> for a given holding, like, say,
472.47 -> Vanguard owns some Ford stock.
474.75 -> We try to predict true or false
477.42 -> if that position is going to increase or decrease.
480.18 -> So, a binary classification problem.
482.76 -> And I like it because, you know,
484.02 -> it's real data, it has nice structure,
485.91 -> it's, like, fun to look at.
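
As a sketch of how that true-or-false target could be derived, the function below compares each quarter's share count with the next quarter's for the same manager and company. The data layout and field names are assumptions for illustration, and an unchanged position is labeled False here, which is a choice the talk does not specify.

```python
def label_position_changes(holdings):
    """holdings maps (manager, company) -> list of (quarter, shares) tuples.

    Returns one row per quarter that has a following quarter, with
    target=True when the position grew in that following quarter.
    """
    rows = []
    for (manager, company), series in holdings.items():
        ordered = sorted(series)  # "2021Q1" < "2021Q2" sorts chronologically
        for (quarter, shares), (_, next_shares) in zip(ordered, ordered[1:]):
            rows.append(
                {
                    "manager": manager,
                    "company": company,
                    "quarter": quarter,
                    "shares": shares,
                    "target": next_shares > shares,
                }
            )
    return rows


# Hypothetical share counts, not taken from the EDGAR data set.
example = {("Vanguard", "Ford"): [("2021Q1", 100), ("2021Q2", 150), ("2021Q3", 120)]}
labeled = label_position_changes(example)
```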
488.01 -> So, very first thing that we need to do
491.55 -> is we install the Graph Data Science library.
494.7 -> Just pip install graphdatascience.
496.65 -> Piece of cake.
498.09 -> And it's doing that.
500.88 -> And then I did something before this.
504.51 -> I edited this cell slightly
507.06 -> to insert the connection string
508.89 -> for the instance I already spun up.
512.58 -> So, I can run that cell.
515.22 -> And then I'm gonna create a connection to Neo4j.
518.55 -> And hooray, it connected,
521.37 -> which is actually a minor miracle
522.75 -> given how the Wi-Fi's been.
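
A minimal sketch of the connection step with the `graphdatascience` Python client. The host name, user, and password are placeholders, and the `neo4j://` scheme on the default Bolt port 7687 is an assumption about how the demo instance is reached.

```python
# Install the client first:  pip install graphdatascience


def neo4j_uri(host, port=7687, scheme="neo4j"):
    """Build a connection URI; 7687 is Neo4j's default Bolt port."""
    return f"{scheme}://{host}:{port}"


def connect_to_neo4j(uri, user, password):
    """Return a GraphDataScience client connected to the given instance."""
    # Imported lazily so this file can be loaded without the package installed.
    from graphdatascience import GraphDataScience

    return GraphDataScience(uri, auth=(user, password))


# Placeholder host; the real one would come from the stack outputs.
uri = neo4j_uri("example-elb.us-west-2.elb.amazonaws.com")
# gds = connect_to_neo4j(uri, "neo4j", "<password>")
# print(gds.version())  # prints the Graph Data Science library version
```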
525.18 -> So the next thing I'm gonna do is
526.5 -> I'm gonna create some constraints on my database.
529.83 -> This can basically be thought of as indices.
533.61 -> So I'm saying, like,
537.18 -> filing manager is the key for nodes of type manager.
542.01 -> So, you know, there can be only one Vanguard.
544.23 -> There can be only one BlackRock.
545.73 -> That kind of thing.
547.02 -> Sort of makes sense.
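
A sketch of what such a constraint could look like in Neo4j 5 Cypher, built as a string from Python. The label `Manager`, the property `filingManager`, and the constraint name are guesses based on what is said in the talk, and a uniqueness constraint is used here as one plausible reading of "the key".

```python
def unique_constraint(label, prop):
    """Build a Neo4j 5 uniqueness constraint statement for label.prop."""
    name = f"unique_{label.lower()}_{prop.lower()}"
    return (
        f"CREATE CONSTRAINT {name} IF NOT EXISTS "
        f"FOR (n:{label}) REQUIRE n.{prop} IS UNIQUE"
    )


statement = unique_constraint("Manager", "filingManager")
# With a connected client:  gds.run_cypher(statement)
```

A uniqueness constraint is backed by an index, which is why it can be thought of as an index on the key property.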
548.76 -> Right, so let's load some data.
551.82 -> Data is sitting in a public S3 bucket.
555 -> It's just plain old CSV files.
557.94 -> And we're gonna run through it a couple times.
561.24 -> So, first off, we're gonna create companies.
564 -> So these are like public companies.
565.77 -> Think like Ford, ExxonMobil,
567.9 -> Amazon is in the data set, stuff like that.
570.77 -> So that ran.
572.67 -> Same deal.
574.53 -> In this case we're gonna create the asset managers.
576.57 -> So these are companies like Fidelity, say,
580.71 -> that you might have your retirement account with.
583.83 -> That runs.
586.35 -> Next up...
589.68 -> Yeah, okay.
591.54 -> So, we're gonna create that.
593.25 -> And then this is sort of the graphiest bit of this.
597.78 -> We are going to create all the connections
600.6 -> between these different nodes we're loading up.
603.21 -> So we're creating relationships
605.61 -> like this asset manager owns this company.
611.22 -> And then we have other relationships
613.11 -> like these holdings are part of this fund
616.29 -> and so forth.
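
One common way to load rows like these is to send them to the database in batches and let Cypher `UNWIND` each batch. The sketch below shows that pattern; the `OWNS` relationship type, the labels, the property names, and the CSV columns are assumptions, not the notebook's actual schema.

```python
import csv
import io

# Hypothetical Cypher: match existing nodes, then create the relationship once.
OWNS_QUERY = """
UNWIND $rows AS row
MATCH (m:Manager {filingManager: row.filingManager})
MATCH (c:Company {cusip: row.cusip})
MERGE (m)-[o:OWNS {quarter: row.quarter}]->(c)
SET o.shares = toInteger(row.shares)
"""


def read_rows(csv_text):
    """Parse CSV text with a header line into a list of dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def batches(rows, size):
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


# Made-up rows standing in for a file downloaded from S3.
sample_csv = (
    "filingManager,cusip,quarter,shares\n"
    "Vanguard,000000001,2021Q1,100\n"
    "Vanguard,000000002,2021Q1,250\n"
    "BlackRock,000000001,2021Q1,75\n"
)
rows = read_rows(sample_csv)
# With a connected client:
#   for batch in batches(rows, 1000):
#       gds.run_cypher(OWNS_QUERY, params={"rows": batch})
```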
617.82 -> So, that takes a little bit to run.
620.94 -> We've got a year of all filings in the US
624.33 -> in this data set.
625.163 -> So, like-
627.141 -> I forget, I think it's a hundred meg or something.
629.07 -> It's not enormous, but it's something.
631.71 -> And it looks like that all ran.
635.52 -> So let's go back to our database.
637.62 -> And you can see in here now, there's stuff.
641.19 -> So it did in fact load up some data.
643.74 -> Looks like about a half a million things.
646.62 -> So, like, I can click on company here
649.23 -> and I can expand this guy
650.97 -> and I've run a query where I'm grabbing companies.
655.17 -> I'm limiting them to 25.
657.48 -> And you can see some different companies.
662.01 -> And I'm gonna-
663.75 -> Right now, the labels we're seeing in here
665.67 -> are sort of indecipherable numbers.
667.65 -> There's something called a CUSIP,
669.57 -> which in the finance industry
671.25 -> is a unique identifier,
672.63 -> basically a GUID for companies.
674.76 -> But if I click name of issuer here,
678 -> I can make it a little more human readable.
680.55 -> So, like, there's Becton Dickinson,
682.68 -> there's BlackRock, there's 3M.
685.32 -> And I can click on these guys and expand them.
688.41 -> And, like, Becton Dickinson,
691.71 -> a bunch of people own shares of that, it looks like.
696.99 -> And once again, I can change the label and so forth.
700.5 -> So, this is kind of fun to just
702.09 -> explore around the dataset
703.35 -> and do that kind of thing.
705.15 -> But I promised you featurization
708.57 -> and graph machine learning.
709.74 -> So I think we can squeeze that in real quick
712.29 -> before we're forced to wrap.
715.59 -> So, I'm going to take my database
719.49 -> and I'm gonna create an in-memory projection
722.13 -> on top of the database with this command.
725.01 -> And the nice part about this in-memory projection
727.32 -> is it allows us to, like, mutate things
730.65 -> and do all sorts of transformations
732.54 -> without destroying the underlying data.
736.56 -> And I can run this
737.67 -> and sort of make sure that ran okay.
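
A sketch of the projection step using the Python client's `gds.graph.project`. The graph name, the node labels, and the relationship type are assumptions; the undirected orientation is a common choice for embeddings rather than something stated in the talk.

```python
GRAPH_NAME = "holdings"               # hypothetical projection name
NODE_SPEC = ["Manager", "Company"]    # hypothetical labels
RELATIONSHIP_SPEC = {"OWNS": {"orientation": "UNDIRECTED"}}  # hypothetical type


def project_graph(gds):
    """Create the in-memory projection and return (graph object, summary)."""
    G, result = gds.graph.project(GRAPH_NAME, NODE_SPEC, RELATIONSHIP_SPEC)
    return G, result


# With a connected client:
#   G, result = project_graph(gds)
#   print(G.node_count(), G.relationship_count())
```

Algorithms run in mutate mode change only this in-memory copy, so the stored graph stays untouched.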
740.34 -> And then this cell here is sort of
742.65 -> what you all came here for, I think.
745.62 -> This is, we're taking the whole graph
748.29 -> and we're using an algorithm called FastRP.
751.5 -> And we're gonna create a vector out of that graph
754.95 -> that represents each of these holdings.
758.16 -> And this can then be fed into SageMaker,
761.19 -> used in a supervised learning context
763.35 -> as a new feature that has predictive power.
766.56 -> And part of what's so cool about this
768.93 -> is it's not always the very best thing to do,
772.95 -> but it's a very much like one-size-fits-all thing.
776.13 -> Like, you can always compute an embedding
777.93 -> and try tuning the embedding and different things.
781.8 -> So it's often an interesting approach to explore
785.67 -> if you have connected data.
787.38 -> And sometimes it works really well
789 -> and sometimes other approaches work better.
791.34 -> But it's a great starting point.
793.02 -> So we run that and basically it spits out a vector.
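
A sketch of the FastRP call through the Python client, assuming a projection object `G` already exists. The embedding dimension and random seed are illustrative values, not the ones used in the demo notebook.

```python
# Illustrative settings; embeddingDimension is required by FastRP.
FASTRP_CONFIG = {"embeddingDimension": 16, "randomSeed": 1}


def stream_fastrp(gds, G):
    """Run FastRP on projection G and return the streamed result.

    The client returns a pandas DataFrame with `nodeId` and `embedding`
    columns, one row per node.
    """
    return gds.fastRP.stream(G, **FASTRP_CONFIG)


# With a connected client and a projection:
#   embeddings = stream_fastrp(gds, G)
#   print(embeddings.head())
```

Fixing `randomSeed` makes repeated runs easier to compare while tuning the dimension.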
798.48 -> This is where we will slightly run outta time
801.12 -> because there's a little pandas magic in here
804.63 -> that sort of massages this into a CSV file
807.48 -> that SageMaker is gonna accept.
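
The reshaping step amounts to spreading each embedding vector across one column per dimension and writing the result as CSV with a header. The notebook uses pandas for this; the sketch below uses only the standard library to stay dependency-free, and the column names and sample values are made up.

```python
import csv
import io


def embeddings_to_csv(rows, target_key="target"):
    """Write rows of {'embedding': [...], target_key: ...} as CSV text."""
    dim = len(rows[0]["embedding"])
    header = [f"embedding_{i}" for i in range(dim)] + [target_key]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    for row in rows:
        # One column per embedding dimension, followed by the label.
        writer.writerow(list(row["embedding"]) + [row[target_key]])
    return buffer.getvalue()


sample = [
    {"embedding": [0.1, -0.2], "target": True},
    {"embedding": [0.0, 0.5], "target": False},
]
csv_text = embeddings_to_csv(sample)
```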
809.58 -> And then towards the end of this notebook,
812.67 -> we kick off the SageMaker job.
815.46 -> SageMaker Autopilot jobs,
818.16 -> even if you tune them down to be really fast,
821.4 -> take about 90 minutes to run.
823.98 -> And that's because it runs multiple machine learning jobs,
827.52 -> creating a bunch of different candidates,
830.31 -> different attempts at training
832.08 -> and then you try to select
834.06 -> one of the candidates from that.
835.8 -> It spits out all these
836.76 -> really interesting reports on how it did
839.61 -> and so on.
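
A sketch of the request that starts an Autopilot job through boto3's `create_auto_ml_job`. The job name, S3 URIs, role ARN, target column, objective metric, and candidate limit are all placeholders; the notebook may configure the job differently.

```python
def autopilot_request(job_name, train_s3_uri, output_s3_uri, role_arn,
                      target="target", max_candidates=10):
    """Build keyword arguments for SageMaker's create_auto_ml_job call."""
    return {
        "AutoMLJobName": job_name,
        "InputDataConfig": [
            {
                "DataSource": {
                    "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": train_s3_uri}
                },
                "TargetAttributeName": target,
            }
        ],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ProblemType": "BinaryClassification",
        # An objective must be given whenever ProblemType is set.
        "AutoMLJobObjective": {"MetricName": "F1"},
        # Capping candidates is one way to shorten the run.
        "AutoMLJobConfig": {"CompletionCriteria": {"MaxCandidates": max_candidates}},
        "RoleArn": role_arn,
    }


request = autopilot_request(
    "graph-features-demo",
    "s3://example-bucket/train/",
    "s3://example-bucket/output/",
    "arn:aws:iam::123456789012:role/ExampleSageMakerRole",
)
# With AWS credentials configured:
#   import boto3
#   boto3.client("sagemaker").create_auto_ml_job(**request)
```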
841.56 -> So, with that, I am-
844.62 -> Well, actually I'm gonna do one more thing.
847.86 -> So I am on Twitter, @benofben.
853.02 -> And if you are curious,
856.2 -> the slides for this
859.08 -> are going to be available there in a second.
861.72 -> Here are our slides.
863.97 -> This is one of my pet peeves
865.53 -> about any presentation I ever go to.
867.63 -> Everyone always says they'll share the slides.
869.19 -> They never do.
870.12 -> So hopefully you enjoy that.
873.81 -> Let me try to switch this thing back
876.51 -> so we can conclude.
878.04 -> - [Antony] Yeah.
880.29 -> - And yeah, next steps.
882.75 -> Try it on AWS Marketplace.
884.46 -> It's really fun.
886.08 -> SageMaker Notebook was what we just stepped through there.
890.94 -> There's a Neo4j and Amazon landing page.
893.82 -> Real simple to remember.
894.93 -> Just neo4j.com/amazon.
897.81 -> Has a bunch of this information on it.
899.97 -> And we are, of course,
901.2 -> in the AWS Partner finder as well.
903.72 -> We have various qualifications, you know,
905.34 -> marketplace seller, data and analytics,
907.59 -> the things you would expect.
909.9 -> With that, I think we're off.
911.28 -> Please come up and ask questions after.
914.04 -> - Thank you. Thank you everyone.

Source: https://www.youtube.com/watch?v=pgy-KF83XJ4