AWS re:Invent 2022 - Graph feature engineering with Neo4j and Amazon SageMaker (PRT022)
Featurization is one of the most difficult problems in machine learning, just behind data wrangling in terms of the time it consumes. For many problems, featurization plays the largest role in determining model performance, greater even than choice of machine learning method. In this talk, walk through how graph features engineered in Neo4j can be used in a supervised learning model trained with Amazon SageMaker. These novel graph features can improve model performance beyond what is possible with more traditional approaches. This presentation is brought to you by Neo4j, an AWS Partner.
Content
0.18 -> - We're gonna talk about
graph feature engineering
2.1 -> with Neo4j and Amazon SageMaker.
4.74 -> I have here with me my
partner in crime, Ben Lackey,
7.08 -> who's gonna be showing
you a demo in a bit.
11.61 -> All right, so we'll get right to it.
13.32 -> So the agenda we have for you guys today
15 -> is gonna be, you know, concise
16.65 -> since we have 15 minutes on the clock.
18.66 -> We're gonna talk about
some of the graph features
21.27 -> that Neo4j offers to improve
23.82 -> your machine learning model, right?
26.13 -> We're gonna go through the architecture
27.78 -> and how we establish Neo4j
with the AWS ecosystem.
32.64 -> And Ben's gonna walk you
guys through a quick demo
35.64 -> and we're gonna talk about how you guys
37.56 -> can actually try it out yourself as well.
43.32 -> So, first of all, at a high level,
45.42 -> let's talk about, you know,
46.68 -> what Neo4j essentially does, right?
48.81 -> So you have three
components in the middle.
51.81 -> So you have a Neo4j Graph Database
53.82 -> which kind of stores, you know,
56.07 -> and manipulates data within
the graph database itself.
59.1 -> And then you have the
data science component
61.5 -> that's gonna work with SageMaker
63.33 -> to give you the embedded inferences.
66.36 -> And then Neo4j Bloom is gonna help you
69 -> visualize that graph.
71.01 -> And what I also wanna say
is that we have connectors
73.65 -> to all these Amazon native services
75.99 -> that you just see on the board
77.34 -> and we're also going to provide
79.47 -> more services in the future.
81.45 -> We have a lot of exciting connectors
83.1 -> that we're gonna be building going forward
85.14 -> and that will probably
be at next re:Invent.
87.6 -> So you'll catch more on that later on.
91.05 -> Now, let's dive a little deeper
92.79 -> into the architecture itself, right?
94.56 -> So, first off, you have your data
97.26 -> in an Amazon S3 bucket, right?
99.66 -> So what you wanna do is-
100.77 -> Let's say you want to provide
102.3 -> an inference to a customer.
103.86 -> It could be any use case.
105.33 -> For example, patient data.
107.25 -> It's really important
to understand the graph
109.89 -> or the logical representation
111.87 -> of what does this
patient record attest to?
115.53 -> What are the drugs that
this patient is on, right?
118.38 -> That's an example.
119.25 -> But we can diversify that
to multiple use cases.
121.95 -> So what we're gonna do here is
123.54 -> we're gonna use a
SageMaker Studio notebook
126.51 -> to actually pull data from S3
129.36 -> and send it to Neo4j's Graph Data Science.
132.57 -> And what that's gonna do is run your model
134.85 -> and send it to Neo4j Graph Database.
137.52 -> Now, what you wanna do
is, using an API call,
140.13 -> it's gonna pull it back through
141.87 -> an automated training job
144.03 -> with the SageMaker Autopilot
that you can configure.
147.18 -> That's actually gonna provide the output
149.16 -> back to the end user, right?
150.84 -> So, in a nutshell, it looks pretty simple.
152.85 -> In fact, we're on AWS Marketplace.
155.76 -> So if you wanna get started,
157.5 -> you can get started with Neo4j Enterprise
159.72 -> with a click of a button.
161.28 -> That's it.
162.113 -> And then you deploy all these instances
164.13 -> within your VPC.
167.58 -> And just what I described,
169.5 -> Ben is actually gonna take it forward
171.3 -> with a demo that's gonna highlight
173.61 -> how we actually do it.
175.32 -> With that, I'll hand it over to you.
176.67 -> - Okay.
177.503 -> Thanks, Antony.
179.31 -> So I'm gonna go through a demo
181.2 -> of what Antony just described.
183.63 -> What we're gonna show off is deploying
185.88 -> Neo4j Enterprise Edition
188.16 -> through the AWS Marketplace.
190.17 -> And then we're gonna use an example
192.66 -> that's actually in the Amazon
SageMaker examples repo
196.77 -> to load some data.
198.27 -> And we're probably gonna
run outta time there.
200.31 -> But the rest of the example shows
201.96 -> how to kick off a SageMaker Autopilot job
206.28 -> that uses data from the
database as a new feature
211.21 -> in training a machine learning method.
213.57 -> So, this is all public.
215.58 -> You can always check it out later.
217.38 -> I actually just submitted
218.85 -> a new pull request this morning,
220.59 -> bumping it from Neo4j four
222.353 -> to Neo4j version five,
which recently came out.
225.45 -> So, excited about that.
228.27 -> But yeah.
229.53 -> So, if you check out this
repo under Autopilot,
235.71 -> down here, there's a Neo4j example.
239.16 -> And the Neo4j example talks through
242.4 -> how to deploy Neo4j and so on.
245.4 -> I have already done a little pre-work.
249.51 -> So this is SageMaker Studio.
253.68 -> And I've already cloned
255.33 -> the repo we were just looking at.
257.16 -> And I have a copy of
the notebook open here.
259.8 -> So, if I read through the notebook,
263.1 -> it sort of says, hey,
264.45 -> use AWS Marketplace to do a deployment.
267.39 -> And I can click this handy-dandy link
270 -> and it sends me over to the marketplace.
272.58 -> And look here, there's
Neo4j Enterprise Edition.
276.27 -> And it's just one of these kind of
277.8 -> next, next, next experiences.
280.23 -> So, you know, you click
through to subscribe.
283.41 -> Click through to configure.
286.05 -> Select which one you would like to use
288.3 -> and say launch.
289.56 -> And you can just kind of
take all the defaults here.
292.77 -> Launch CloudFormation.
295.14 -> Thinks for a moment.
296.19 -> And I say launch.
297.33 -> And now it's gonna kick me
into the Amazon Console.
301.38 -> And I don't know if any
of you guys noticed,
305.4 -> but us-east-1, the most venerable region,
309.57 -> was having some difficulties this morning.
311.79 -> So we're gonna do stuff
in Oregon to be safe.
317.22 -> But here's our CFT.
319.59 -> And to configure Neo4j,
321.96 -> you just enter in whatever
parameters you want here.
325.26 -> So I'm gonna say, like,
326.91 -> reinvent demo for my stack name.
330.66 -> And I can take most of the defaults.
332.85 -> I mentioned this recently
bumped to version five.
335.88 -> Version five is a major release for us,
337.92 -> includes a bunch of neat stuff.
340.32 -> We're gonna install Graph Data Science,
343.23 -> which is a collection of 50 plus
345.96 -> different graph machine
learning algorithms.
348.99 -> We're gonna be using
one specific algorithm
351.39 -> which is called FastRP.
353.46 -> And that's gonna take our graph
355.23 -> and turn it into column vectors
357.78 -> that can be fed into SageMaker.
360.09 -> So it's a nice way to take one data model
362.64 -> and use it in a more traditional
machine learning context.
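To make "turn the graph into column vectors" concrete, here is a toy sketch of the idea behind FastRP-style embeddings, written in plain Python. It is not the Graph Data Science implementation: the real algorithm adds degree normalization and per-iteration weights. The function names, the tiny edge list, and the default settings are all made up for illustration.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def random_projection_embeddings(edges, dim=4, iterations=2, seed=42):
    """Toy FastRP-style embedding: one fixed-length vector per node."""
    # Build an undirected neighbor list from (node, node) pairs.
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)
    nodes = sorted(neighbors)

    # Start every node with a sparse random vector (entries are +s, 0 or -s).
    rng = random.Random(seed)
    scale = math.sqrt(3)
    current = {
        n: [rng.choice([scale, 0.0, 0.0, 0.0, 0.0, -scale]) for _ in range(dim)]
        for n in nodes
    }

    # Repeatedly average neighbor vectors and accumulate the normalized result.
    embedding = {n: [0.0] * dim for n in nodes}
    for _ in range(iterations):
        averaged = {}
        for n in nodes:
            nbrs = neighbors[n]
            mean = [sum(current[m][i] for m in nbrs) / len(nbrs) for i in range(dim)]
            averaged[n] = l2_normalize(mean)
        for n in nodes:
            embedding[n] = [e + v for e, v in zip(embedding[n], averaged[n])]
        current = averaged
    return embedding


# Hypothetical toy graph: two managers holding overlapping companies.
toy_edges = [("Vanguard", "Ford"), ("Vanguard", "3M"), ("Fidelity", "Ford")]
vectors = random_projection_embeddings(toy_edges)
```

Each node ends up with a list of `dim` numbers, which is the shape a tabular learner such as SageMaker Autopilot can consume as ordinary feature columns.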
367.56 -> Need to type a password in there.
370.29 -> We're just gonna do one
server instead of three.
373.26 -> And then the last thing
is this sort of oddity.
376.41 -> You need to say, hey, it's open.
378.96 -> So, you do that.
380.43 -> We don't need to save that.
382.53 -> We go next.
384.15 -> And I acknowledge that's okay.
387.54 -> And I hit submit.
389.07 -> So that submits the stack to be deployed.
392.49 -> These deployments usually take
between five and 10 minutes.
395.67 -> We clearly don't have that much time.
397.74 -> So I already made one
400.47 -> and you can check that out here.
402.57 -> Looks like I made it at 9:51 this morning.
405.69 -> And one of the cool things about this
407.4 -> is, if I click on outputs,
409.35 -> it gives me the address of the ELB
412.23 -> where my Neo4j's running.
414.18 -> So I can just click on that guy
416.61 -> and it's gonna pop that open.
420.09 -> Come on, come on.
421.02 -> Okay, so now we're logged
into this Neo4j instance.
426.03 -> And if I click on it,
nothing in the database.
429.51 -> So, let's fix that.
431.76 -> And yeah, we still got eight minutes,
433.5 -> so we're doing well.
435.63 -> So, if I go back to the notebook
437.82 -> I was describing earlier,
439.38 -> we have some examples of how to load data.
442.02 -> And this particular data set
444.63 -> we're going through in here
446.43 -> comes from the SEC's EDGAR system.
449.73 -> So it's filings of asset managers
452.37 -> who manage over a hundred million dollars.
454.77 -> And the machine learning problem
456.72 -> we've characterized in this data set
458.76 -> is we're trying to predict
460.83 -> how their portfolio churns.
463.08 -> So, they're required to file quarterly.
466.59 -> So, for a given quarter,
469.98 -> for a given holding, like, say,
472.47 -> Vanguard owns some Ford stock.
474.75 -> We try to predict true or false
477.42 -> if that position is going
to increase or decrease.
480.18 -> Well, binary classification problem.
482.76 -> And I like it because, you know,
484.02 -> it's real data, it has nice structure,
485.91 -> it's, like, fun to look at.
488.01 -> So, very first thing that we need to do
491.55 -> is we install the Graph
Data Science library.
494.7 -> Just pip install graphdatascience.
496.65 -> Piece of cake.
498.09 -> And it's doing that.
500.88 -> And then I did something before this.
504.51 -> I edited this cell slightly
507.06 -> to insert the connection string
508.89 -> for the instance I already spun up.
512.58 -> So, I can run that cell.
515.22 -> And then I'm gonna create
a connection to Neo4j.
518.55 -> And hooray, it connected,
521.37 -> which is actually a minor miracle
522.75 -> given how the Wi-Fi's been.
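A minimal sketch of what the install-and-connect cells might look like with the `graphdatascience` Python client. The host name and password are placeholders, and the helper names `neo4j_uri` and `connect` are invented here. The notebook's own cell may differ.

```python
# Assumes the client was installed first with: pip install graphdatascience


def neo4j_uri(host, port=7687, scheme="neo4j"):
    """Build a connection string; 7687 is the default Bolt port."""
    return f"{scheme}://{host}:{port}"


def connect(host, password, user="neo4j"):
    """Open a Graph Data Science client session against a running Neo4j."""
    # Imported lazily so this module loads even where the client is absent.
    from graphdatascience import GraphDataScience

    gds = GraphDataScience(neo4j_uri(host), auth=(user, password))
    print("GDS version:", gds.version())  # quick check that we connected
    return gds


# Hypothetical usage, with the address taken from the stack outputs:
# gds = connect("my-neo4j-elb.example.com", "my-password")
```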
525.18 -> So the next thing I'm gonna do is
526.5 -> I'm gonna create some
constraints on my database.
529.83 -> This can basically be
thought of as indices.
533.61 -> So I'm saying, like,
537.18 -> filing manager's the key
for nodes of type Manager.
542.01 -> So, you know, there can
be only one Vanguard.
544.23 -> There can be only one BlackRock.
545.73 -> That kind of thing.
547.02 -> Sort of makes sense.
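As a rough illustration, constraints like these could be generated and applied as below. The Cypher follows Neo4j 5 syntax, and node key constraints need Enterprise Edition. The constraint names and the property names `filingManager` and `cusip` are assumptions based on what is said in the talk, not copied from the notebook.

```python
def node_key_constraint(name, label, properties):
    """Return a Neo4j 5 statement making `properties` the key for `label`."""
    props = ", ".join(f"n.{p}" for p in properties)
    return (
        f"CREATE CONSTRAINT {name} IF NOT EXISTS "
        f"FOR (n:{label}) REQUIRE ({props}) IS NODE KEY"
    )


# Assumed labels and key properties, for illustration only.
CONSTRAINTS = [
    node_key_constraint("unique_manager", "Manager", ["filingManager"]),
    node_key_constraint("unique_company", "Company", ["cusip"]),
]


def apply_constraints(gds, statements=CONSTRAINTS):
    """Run each statement through the client's run_cypher method."""
    for statement in statements:
        gds.run_cypher(statement)
```

With a key on `filingManager`, a second attempt to create a `Manager` node for the same filer fails instead of producing a duplicate.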
548.76 -> Right, so let's load some data.
551.82 -> Data is sitting in a public S3 bucket.
555 -> It's just plain old CSV files.
557.94 -> And we're gonna run
through it a couple times.
561.24 -> So, first off, we're
gonna create companies.
564 -> So these are like public companies.
565.77 -> Think like Ford, ExxonMobil,
567.9 -> Amazon is in the data
set, stuff like that.
570.77 -> So that ran.
572.67 -> Same deal.
574.53 -> In this case we're gonna
create the asset manager.
576.57 -> So these are companies like Fidelity, say,
580.71 -> that you might have your
retirement account with.
583.83 -> That runs.
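The node-loading cells could be sketched roughly as follows: the same CSV is read once per node type, and `MERGE` keeps the nodes unique. The column names (`cusip`, `nameOfIssuer`, `filingManager`) and the example URL are assumptions, not taken from the notebook.

```python
# One pass over the CSV per node type; MERGE creates a node only if missing.
LOAD_COMPANIES = """
LOAD CSV WITH HEADERS FROM $url AS row
MERGE (c:Company {cusip: row.cusip})
SET c.nameOfIssuer = row.nameOfIssuer
"""

LOAD_MANAGERS = """
LOAD CSV WITH HEADERS FROM $url AS row
MERGE (m:Manager {filingManager: row.filingManager})
"""


def load_nodes(gds, url):
    """Run both node-loading passes against the CSV at `url`."""
    for query in (LOAD_COMPANIES, LOAD_MANAGERS):
        gds.run_cypher(query, params={"url": url})


# Hypothetical usage; the bucket and file name are placeholders:
# load_nodes(gds, "https://example-bucket.s3.amazonaws.com/form13.csv")
```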
586.35 -> Next up...
589.68 -> Yeah, okay.
591.54 -> So, we're gonna create that.
593.25 -> And then this is sort of
the graphiest bit of this.
597.78 -> We are going to create all the connections
600.6 -> between these different
nodes we're loading up.
603.21 -> So we're creating relationships
605.61 -> like this asset manager owns this company.
611.22 -> And then we have other relationships
613.11 -> like these holdings are part of this fund
616.29 -> and so forth.
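One way the relationship pass might look, as a sketch: match the two endpoint nodes for each CSV row, then `MERGE` the relationship between them. The relationship type `OWNS` and the property names are assumptions, and the holding and fund relationships mentioned in the talk would presumably follow the same pattern.

```python
# Match both endpoints of each row, then create the edge if it is missing.
LOAD_OWNS = """
LOAD CSV WITH HEADERS FROM $url AS row
MATCH (m:Manager {filingManager: row.filingManager})
MATCH (c:Company {cusip: row.cusip})
MERGE (m)-[:OWNS]->(c)
"""


def load_ownership(gds, url):
    """Create Manager-OWNS->Company relationships from the CSV at `url`."""
    gds.run_cypher(LOAD_OWNS, params={"url": url})
```

Because the endpoints are found with `MATCH`, this pass assumes the company and manager nodes were already loaded.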
617.82 -> So, that takes a little bit to run.
620.94 -> We've got a year of all filings in the US
624.33 -> in this data set.
625.163 -> So, like-
627.141 -> I forget, I think it's a
hundred meg or something.
629.07 -> It's not enormous, but it's something.
631.71 -> And it looks like that all ran.
635.52 -> So let's go back to our database.
637.62 -> And you can see in here
now, there's stuff.
641.19 -> So it did in fact load up some data.
643.74 -> Looks like about a half a million things.
646.62 -> So, like, I can click on company here
649.23 -> and I can expand this guy
650.97 -> and I've run a query where
I'm grabbing companies.
655.17 -> I'm limiting them to 25.
657.48 -> And you can see some different companies.
662.01 -> And I'm gonna-
663.75 -> Right now, the labels we're seeing in here
665.67 -> are sort of indecipherable numbers.
667.65 -> There's something called a CUSIP,
669.57 -> which in the finance industry
671.25 -> is a unique identifier,
672.63 -> basically a GUID for companies.
674.76 -> But if I click name of issuer here,
678 -> I can make it a little
more human readable.
680.55 -> So, like, there's Becton Dickinson,
682.68 -> there's BlackRock, there's 3M.
685.32 -> And I can click on these
guys and expand them.
688.41 -> And, like, Becton Dickinson,
691.71 -> a bunch of people own shares
of that, it looks like.
696.99 -> And once again, I can change
the label and so forth.
700.5 -> So, this is kind of fun to just
702.09 -> explore around the dataset
703.35 -> and do that kind of thing.
705.15 -> But I promised you featurization
708.57 -> and graph machine learning.
709.74 -> So I think we can squeeze
that in real quick
712.29 -> before we're forced to wrap.
715.59 -> So, I'm going to take my database
719.49 -> and I'm gonna create
an in-memory projection
722.13 -> on top of the database with this command.
725.01 -> And the nice part about
this in-memory projection
727.32 -> is it allows us to, like, mutate things
730.65 -> and do all sorts of transformations
732.54 -> without destroying the underlying data.
736.56 -> And I can run this
737.67 -> and sort of make sure that ran okay.
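A sketch of an in-memory projection using the `graphdatascience` client's `gds.graph.project` call. The graph name, node labels, and relationship type are assumptions. The undirected orientation is a common choice for embedding algorithms, not something stated in the talk.

```python
def project_graph(gds, name="holdings-graph"):
    """Project part of the stored graph into GDS memory and return it."""
    node_labels = ["Company", "Manager"]  # assumed labels
    relationships = {"OWNS": {"orientation": "UNDIRECTED"}}  # assumed type
    G, result = gds.graph.project(name, node_labels, relationships)
    return G, result


# Hypothetical usage, followed by a sanity check on the projection:
# G, result = project_graph(gds)
# print(G.node_count(), G.relationship_count())
```

Algorithms then run against `G`, so anything they mutate lives in the projection and the stored database is left as it was.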
740.34 -> And then this cell here is sort of
742.65 -> what you all came here for, I think.
745.62 -> This is, we're taking the whole graph
748.29 -> and we're using an
algorithm called FastRP.
751.5 -> And we're gonna create a
vector out of that graph
754.95 -> that represents each of these holdings.
758.16 -> And this can then be fed into SageMaker,
761.19 -> used in a supervised learning context
763.35 -> as a new feature that
has predictive power.
766.56 -> And part of what's so cool about this
768.93 -> is it's not always the
very best thing to do,
772.95 -> but it's very much, like, a
one-size-fits-all thing.
776.13 -> Like, you can always compute an embedding
777.93 -> and try tuning the embedding
and different things.
781.8 -> So it's often an interesting
approach to explore
785.67 -> if you have connected data.
787.38 -> And sometimes it works really well
789 -> and sometimes other
approaches work better.
791.34 -> But it's a great starting point.
793.02 -> So we run that and basically
it spits out a vector.
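A hedged sketch of the FastRP call through the `graphdatascience` client, in stream mode so the vectors come straight back. The dimension and seed are illustrative values, and the notebook may use a different execution mode, such as mutate.

```python
def compute_embeddings(gds, G, dimension=16, seed=1):
    """Run FastRP on a projected graph and return one embedding per node.

    With the real client this is a pandas DataFrame with `nodeId` and
    `embedding` columns.
    """
    return gds.fastRP.stream(
        G,
        embeddingDimension=dimension,  # length of each output vector
        randomSeed=seed,  # fixed seed for repeatable runs
    )


# Hypothetical usage, where G comes from an earlier gds.graph.project call:
# embeddings = compute_embeddings(gds, G)
# print(embeddings.head())
```

The embedding dimension is one of the settings worth tuning when judging how much predictive power the vectors add.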
798.48 -> This is where we will
slightly run outta time
801.12 -> because there's a little
pandas magic in here
804.63 -> that sort of massages this into a CSV file
807.48 -> that SageMaker is gonna accept.
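The notebook does this reshaping with pandas. To stay dependency-free, the sketch below uses the standard library `csv` module instead, which is a substitution. It shows the basic move: spread each embedding across numbered columns and append the label that Autopilot will predict. The column names and the `target` label are assumptions.

```python
import csv
import io


def embeddings_to_csv(rows, target="target"):
    """Flatten (embedding, label) pairs into CSV text with one header row."""
    rows = list(rows)
    dim = len(rows[0][0])
    header = [f"embedding_{i}" for i in range(dim)] + [target]

    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    for embedding, label in rows:
        # One column per vector component, then the label column.
        writer.writerow(list(embedding) + [label])
    return buffer.getvalue()


# Made-up values: two holdings with 2-dimensional embeddings.
sample = embeddings_to_csv([([0.1, 0.2], True), ([0.3, 0.4], False)])
```

The resulting text would then be uploaded to S3 as the training input.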
809.58 -> And then towards the end of this notebook,
812.67 -> we kick off the SageMaker job.
815.46 -> SageMaker Autopilot jobs,
818.16 -> even if you tune them
down to be really fast,
821.4 -> take about 90 minutes to run.
823.98 -> And that's because it runs
multiple machine learning jobs,
827.52 -> creating a bunch of different candidates,
830.31 -> different attempts at training
832.08 -> and then you try to select
834.06 -> one of the candidates from that.
835.8 -> It spits out all these
836.76 -> really interesting reports on how it did
839.61 -> and so on.
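For reference, kicking off an Autopilot job with boto3 might look roughly like this. The request fields are those of the SageMaker `create_auto_ml_job` API. The job name, S3 paths, role ARN, target column, objective metric, and candidate limit are placeholders or assumptions, not values from the notebook.

```python
def autopilot_request(job_name, train_s3_uri, output_s3_uri, role_arn,
                      target="target", max_candidates=10):
    """Build the request for a binary-classification Autopilot job."""
    return {
        "AutoMLJobName": job_name,
        "InputDataConfig": [{
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": train_s3_uri}
            },
            "TargetAttributeName": target,  # column to predict
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ProblemType": "BinaryClassification",
        "AutoMLJobObjective": {"MetricName": "F1"},
        # Capping candidates is one way to keep the run short.
        "AutoMLJobConfig": {"CompletionCriteria": {"MaxCandidates": max_candidates}},
        "RoleArn": role_arn,
    }


def start_autopilot(request):
    """Submit the job; needs AWS credentials and the boto3 package."""
    import boto3

    return boto3.client("sagemaker").create_auto_ml_job(**request)
```

A lower `max_candidates` trades search breadth for run time, which is the kind of tuning described in the talk.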
841.56 -> So, with that, I am-
844.62 -> Well, actually I'm
gonna do one more thing.
847.86 -> So I am on Twitter, @benofben.
853.02 -> And if you are curious,
856.2 -> the slides for this
859.08 -> are going to be available
there in a second.
861.72 -> Here are our slides.
863.97 -> This is one of my pet peeves
865.53 -> about any presentation I ever go to.
867.63 -> Everyone always says
they'll share the slides.
869.19 -> They never do.
870.12 -> So hopefully you enjoy that.
873.81 -> Let me try to switch this thing back
876.51 -> so we can conclude.
878.04 -> - [Antony] Yeah.
880.29 -> - And yeah, next steps.
882.75 -> Try it on AWS Marketplace.
884.46 -> It's really fun.
886.08 -> SageMaker Notebook was what
we just stepped through there.
890.94 -> There's a Neo4j and Amazon landing page.
893.82 -> Real simple to remember.
894.93 -> Just Neo4j.com/amazon.
897.81 -> Has a bunch of this information on it.
899.97 -> And we are, of course,
901.2 -> in the AWS Partner finder as well.
903.72 -> We have various qualifications, you know,