Graph Database introduction, deep-dive and demo with Amazon Neptune - AWS Virtual Workshop
With Amazon Neptune you can build and run identity, knowledge, fraud graph, and other applications with performance that scales to more than 100,000 queries per second. Neptune allows you to deploy graph applications using open-source APIs such as Gremlin, openCypher and SPARQL. Since Neptune is a fully managed database service, there is no need to worry about hardware provisioning, software patching, setup or backups. Many people are not familiar with graph databases which is why this workshop will introduce the cutting-edge use cases for graph databases that span fraud detection to personalization. This workshop will cover the architecture and all of the key features of Neptune. We will also use this time to do a demo of Neptune in action.
Learning Objectives:
* Objective 1: Understand the benefits of Amazon Neptune by going over use cases including fraud detection, personalization and advertising targeting.
* Objective 2: Dive deep into how open-source APIs can be used to deploy applications in Neptune.
* Objective 3: We will also use this time to do a demo of Neptune in action.
☁️ AWS Online Tech Talks cover a wide range of topics and expertise levels through technical deep dives, demos, customer examples, and live Q&A with AWS experts. Builders can choose from bite-sized 15-minute sessions, insightful fireside chats, immersive virtual workshops, interactive office hours, or watch on-demand tech talks at your own pace. Join us to fuel your learning journey with AWS.
#AWS
Content
3.06 -> [Music]
7.279 -> hello everyone
8.559 -> my name is dave bechberger i am a senior
10.8 -> graph architect on the amazon neptune
12.799 -> service team and i'm here today to talk
14.799 -> to you a little bit about graph
15.679 -> databases and then i'm going to go into
17.68 -> a little bit of a deep dive and a demo on
19.84 -> amazon neptune which is aws's
21.92 -> purpose-built graph database offering
24.48 -> so let's jump right into it
26.88 -> so today when we work with customers we
29.76 -> see graphs being used for all types of
31.599 -> applications why is this
34 -> because graphs are very good at modeling
35.84 -> relationships that are not necessarily
37.84 -> easily represented or retrieved with
40.399 -> other types of databases out there today
43.12 -> if we look at this simple graph
44.399 -> representation it's pretty easy without
46 -> me having to tell you any other
47.2 -> additional information that you can
48.559 -> figure out that alice lives in anytown
50.8 -> and that she works with bob
53.12 -> this is because this
55.039 -> represents how the
57.68 -> graph way of looking at problems is very
60 -> intuitive to people
61.6 -> because it does a very good job
63.28 -> of representing the natural way that we
65.519 -> think about data and connections in this
67.84 -> example you know
70 -> we're looking at these and we have what
71.76 -> we would consider from a graph
72.96 -> perspective a couple of nodes those
74.799 -> nodes representing the entities or the
76.56 -> real world objects here in this case
78.08 -> being alice bob and anytown
80.32 -> and we have these lines or these
82.32 -> connections between things which
83.759 -> represent the relationships between
85.439 -> these real world objects in this case
87.2 -> lives in and works with and when we look
89.52 -> at this and because of the way that
91.2 -> graphs and graph databases store this
93.28 -> data they really allow you to
96.079 -> explore these relationships and patterns
98.24 -> and this type of connected data in ways
100.479 -> that
102.72 -> other data stores and data structures
104.56 -> can't
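As a minimal sketch (plain Python, not Neptune code), the alice / bob / anytown picture described above can be written down as a set of nodes plus labeled edges. The `Edge` tuple, the edge labels, and the `neighbors` helper are hypothetical names chosen only for illustration.

```python
from collections import namedtuple

# Hypothetical in-memory model: every connection is its own record,
# just like the lines drawn between the entities on the slide.
Edge = namedtuple("Edge", ["source", "label", "target"])

nodes = {"alice", "bob", "anytown"}
edges = [
    Edge("alice", "lives_in", "anytown"),
    Edge("alice", "works_with", "bob"),
]

def neighbors(node, label):
    """Return the nodes reached from `node` by following edges with `label`."""
    return [e.target for e in edges if e.source == node and e.label == label]

print(neighbors("alice", "lives_in"))
print(neighbors("alice", "works_with"))
```

Reading the relationships back is just a matter of following edges out of a node, which is the intuition the speaker is pointing at.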
106.88 -> customers are also very excited about
108.479 -> graphs and they're really especially
109.92 -> excited about managed graph services
112 -> since neptune was released in may of
114.079 -> 2018 customers have built many types of
116.64 -> applications on top of neptune but when
118.88 -> we think about it when we're
120.88 -> working with customers we kind of
122.159 -> broadly generalize
124.88 -> these
127.119 -> applications into a couple of common use
129.44 -> cases we see
130.72 -> the first common use case we see with
132.48 -> neptune is fraud detection and this is
134.72 -> exactly what you think it is we're
136.08 -> trying to use graphs to help find the
137.84 -> bad guys graphs are uniquely helpful for
140.48 -> fraud detection as they really enable us
143.44 -> as developers and as users of the system
145.68 -> to find the deep links and deep
147.28 -> connections and patterns of connections
149.52 -> in the data that really aren't
152 -> easily found using other sorts
155.04 -> of systems out there
157.92 -> the second use case we generally see
159.92 -> is what we call identity graphs and
161.84 -> identity graphs are really based on the
164.16 -> concept of you know something like a
166.64 -> user is going to come to your website
168.4 -> from multiple
170.56 -> different areas maybe they are going to
172.08 -> connect to it from their phone and their
173.599 -> work computer and their home computer
175.599 -> and their tablet and we really want to
177.28 -> be able to connect these
179.599 -> disparate interactions with the
181.84 -> user together in such a way that we can
184.8 -> generate a golden
187.04 -> record the canonical form
189.68 -> of that user's data
192.72 -> so we can actually use that to help
194.159 -> provide things like personalized
195.599 -> recommendations or marketing ad
197.84 -> segmentation things like that
200.239 -> the third use case we often see with
202.239 -> customers is one of knowledge graphs or
204.56 -> knowledge organization
206.319 -> and this is really about connecting
208.159 -> disparate data silos inside of a company
210.64 -> together in such a way that you can
211.92 -> really get a holistic view of a
214.64 -> specific piece of information and things
217.2 -> that are connected to that piece of
218.4 -> information
219.92 -> let's say you're something like an
221.36 -> e-commerce website and you have a
223.36 -> database that contains all of your
225.04 -> information about the products you sell
226.72 -> and one that contains all this
228.08 -> information about your customers and one
230.239 -> that contains all the information about
231.599 -> your inventory inside your warehouse
233.92 -> maybe you have another one that's for
234.799 -> shipping and you want to be able to
235.76 -> connect all of these together in such a
237.439 -> way that you can look at all of the
239.76 -> information around a specific product or
241.76 -> customer or user something like that
244.159 -> and that would kind of fall into what we
245.439 -> call a knowledge graph type use case
248.64 -> the last common
250.4 -> use case we've seen is security graphs
252.239 -> and we've recently seen a real uptick in
254.239 -> customers interested in using graphs for
256.639 -> security based systems
258.32 -> security in general is sort of a graph
260.479 -> problem because it's really you know
262.24 -> security for anything be it physical
264.72 -> or logical or application security or
268 -> your cloud infrastructure security is
270.4 -> really about layers about multiple
272.4 -> different layers of security all being
274.08 -> connected together
275.68 -> in such a way that you want to be able
277.84 -> to look at that to be able to find
279.28 -> potential paths of
283.6 -> malfeasance or paths through
286.56 -> the graph and how things may or may not
289.199 -> be exposed to the internet when
291.52 -> they should or shouldn't be so that's
293.68 -> one
296.08 -> way we can look at security graphs and
298.4 -> this is really kind of a really
299.6 -> interesting and fast-growing
301.52 -> upcoming segment for graph based
304 -> solutions
305.28 -> so what are some common business
306.96 -> problems that we see when we
308.8 -> work with customers when we think about
310.32 -> graphs and graph type problems problems
312.16 -> really needing highly
313.44 -> connected data well as i kind of
315.68 -> mentioned already the first one is
316.96 -> people come to us saying they need to do
318.32 -> things like be better at detecting fraud
320.24 -> and fraudulent transactions inside their
322.08 -> system
324.16 -> maybe they want their customers to have
326.08 -> a better or more personalized
327.759 -> recommendation experience than they're
329.44 -> able to provide today
331.68 -> there's that knowledge graph use case where
333.199 -> we want to connect together those siloed
335.12 -> data sources inside our enterprise to
336.88 -> really kind of build out
338.4 -> an entire
339.44 -> platform that contains all of
341.84 -> the knowledge inside of our system
344.32 -> maybe you have those multiple websites
346.24 -> or you have multiple applications and
347.919 -> you need to link together disparate
349.919 -> customer identities in these systems to
351.759 -> kind of get that canonical or that
353.52 -> golden record
355.199 -> or
356.639 -> maybe you have machine learning
358.319 -> algorithms and you want to be able to
359.44 -> use the connections in your data to
361.44 -> improve those algorithms to give you
363.039 -> better sorts of answers
366.319 -> these sorts of questions are ones that
368 -> if you went out and you
370.08 -> looked online and did some searching
371.68 -> around you would probably come back with
373.039 -> these sorts of questions are really good
375.84 -> common business use cases for graph type
378.479 -> problems but there's also a wide array
381.12 -> of not so easily recognized graph
383.039 -> problems ones that
384.56 -> at first glance may not make
386.08 -> sense or you may not think of them as
387.44 -> graph problems but they really do lend
389.68 -> themselves very well to being solved
391.44 -> using graphs
392.8 -> one example is what are
394.72 -> the risks in your it infrastructure or your
396.639 -> supply chain any sort of it
399.28 -> infrastructure or supply chain tends
401.759 -> to get very complicated very quickly
403.6 -> in the case of
404.96 -> a supply chain
407.039 -> you have a lot of
408.4 -> products each of those products has its
409.919 -> own bill of materials each of the
411.759 -> items in those bills of materials
413.919 -> has one or more suppliers and those
415.759 -> suppliers have suppliers and those
417.52 -> suppliers have suppliers
419.199 -> and being able to kind of look at the
421.12 -> overall risk portfolio of your supply
423.759 -> chain or your it
425.199 -> infrastructure is a great use for a
427.039 -> graph
428.16 -> where did this data come from
430 -> this sort of question of being able to
431.599 -> track the lineage or the provenance of
433.84 -> data is one that we often talk with data
435.759 -> engineering teams about
438.4 -> this lends itself very well to a graph
440.4 -> because if you think
442.08 -> about it when you're working
443.759 -> with any sort of data
445.199 -> engineering problem it's really about
446.72 -> taking data from one source doing some
448.72 -> sort of extract transform and load
451.84 -> type process and usually loading it into
453.52 -> something else maybe you're combining or
455.599 -> aggregating that data
457.68 -> but when you aggregate that data you
459.199 -> still want to be able to track where did
460.639 -> this data come from maybe you need to be
463.039 -> able to track this to be able to comply
464.479 -> with privacy regulations like ccpa or
466.879 -> gdpr and you want to be able to track
470 -> not only where this data is currently
471.919 -> but all the places it was
473.36 -> used
475.12 -> intermediately to be able to
478.24 -> clean up those things to be able to
479.84 -> understand the accuracy and the
482.96 -> efficacy of the data that you're working
484.56 -> with
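A minimal sketch of the lineage question, assuming a hypothetical `derived_from` map in which each dataset points at the datasets it was built from. The dataset names are invented for illustration; the point is only that "where did this data come from" becomes a walk over the connections.

```python
# Hypothetical lineage graph: dataset -> datasets it was directly derived from.
derived_from = {
    "sales_report": ["orders_clean", "customers_clean"],
    "orders_clean": ["orders_raw"],
    "customers_clean": ["customers_raw"],
    "orders_raw": [],
    "customers_raw": [],
}

def upstream(dataset):
    """Collect every dataset that `dataset` was directly or indirectly built from."""
    seen = set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        for parent in derived_from.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

print(sorted(upstream("sales_report")))
```

The same walk run in the opposite direction would list every place a source dataset ended up, which is the kind of answer a privacy clean-up request needs.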
486.479 -> why don't your search results relate to
488.56 -> the specific question i was asking this is
490.4 -> a common use case we see with
492.16 -> customers is they have search results
493.759 -> but their search results are a little
495.12 -> bit lackluster and a little
497.12 -> less clear than they would like them to
498.96 -> be so being able to use a graph to be
501.12 -> able to build
503.759 -> those sorts of knowledge graphs where
505.039 -> you're connecting these data together to
506.4 -> be able to give you more relevant and
508.72 -> related answers to the types of
510.4 -> questions people are asking
512.959 -> how does person x have access to
515.36 -> information y
517.68 -> this is another security
519.919 -> graph type use case where you really
521.279 -> want to be able to you know maybe map
522.88 -> out the permissions that folders and
525.519 -> files have by active directory groups i
528.16 -> was working with one customer where
529.839 -> their specific use case was they
531.519 -> wanted to
533.2 -> specifically look for
534.88 -> people that
536.959 -> had access to certain files and folders
539.519 -> through a very very long set of
541.2 -> connections because
542.72 -> if
544.399 -> i was given direct access to this file
546.72 -> or folder it probably means i was
548.16 -> intended to have it but
549.76 -> if i had access to this file or
551.839 -> folder through multiple sets of groups
553.6 -> and permissions maybe it was an
555.2 -> unintended consequence of giving me a
557.519 -> permission that was giving me access to
559.36 -> this critical business information and
561.44 -> they want to be able to kind of track
562.64 -> that down and look for that sort of
564.399 -> thing
566.24 -> and think
568 -> about your cloud infrastructure your
569.44 -> cloud infrastructure is a very good
571.76 -> example of a security graph type use
573.6 -> case
574.8 -> being able to look at how different
576.48 -> things are used inside of it through a
578.64 -> wide array and a very large number of
581.36 -> variably connected data if you think
582.88 -> about your cloud infrastructure you're
584.16 -> going to have things like iam policies
586.399 -> which are connected to roles which are
587.92 -> then going to be connected to one of any
589.6 -> number of different types of entities be
591.44 -> they lambda functions or databases or
594.08 -> ec2 instances and you have a very
596.32 -> variably connected set of
598.8 -> entities that you're trying to look at
600.64 -> but when i really sit down and
602.16 -> boil down graph
603.76 -> questions the types of questions that
605.2 -> graphs are good at answering
607.12 -> really for me it comes down to the where
609.04 -> why and how questions you know where are
610.56 -> the risks where did the data come from
612.959 -> why don't the search results do
614.399 -> something how about this person how is
616.72 -> this role being used things like that
618.88 -> and these tend to be good uh graph
621.12 -> questions because they have a few things
622.56 -> in common first they tend to navigate
625.279 -> variably connected structures of data
627.44 -> you know especially if you wanted to
628.48 -> think about this
629.68 -> in terms of looking at your cloud
631.519 -> infrastructure
632.64 -> the cloud infrastructure as we kind of
635.2 -> discussed a minute ago is really it's a
637.6 -> highly variable set of different
639.04 -> entities you have you know vpcs you have
641.92 -> enis you have iam roles iam policies all
645.12 -> of these are connected together against
646.64 -> other ec2 instances or databases and
648.959 -> being able to look at that and easily
651.44 -> move through that sort of information is
653.68 -> an area where graphs tend to excel
657.279 -> they also tend to excel at questions
659.36 -> where you need to filter or compute a
661.279 -> result based on the strength weight or
663.2 -> quality of a relationship so in the case
664.959 -> of something like supply chain risk
666.88 -> management being able to look at not
668.8 -> only the fact that these two things are
670.32 -> connected but how important is this
672.32 -> supplier to this person what other sorts
675.279 -> of
676.32 -> backup
678.16 -> suppliers may they have for specific
679.839 -> things to be able to use that to help
681.12 -> calculate an overall risk of your supply
683.12 -> chain or to find the
684.959 -> riskiest parts of your supply chain
687.04 -> is an example where
689.279 -> using the connections in your data
691.76 -> is extremely important to be able to get
693.36 -> that answer
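To make the "strength or weight of a relationship" idea concrete, here is a small sketch under stated assumptions: each supply edge carries a hypothetical `share` weight (the fraction of a part's volume that a supplier provides), and a part whose largest share is 1.0 has no backup supplier. The suppliers, parts, and the `concentration_risk` scoring rule are all invented for illustration and are not a method described in the talk.

```python
# Hypothetical weighted edges: (supplier, part, share of that part's volume).
supplies = [
    ("acme", "battery", 1.0),
    ("bolt_co", "screw", 0.5),
    ("nut_co", "screw", 0.5),
    ("chip_a", "cpu", 0.8),
    ("chip_b", "cpu", 0.2),
]

def concentration_risk(part):
    """Largest share any single supplier holds for `part` (1.0 means no backup)."""
    return max(share for _, p, share in supplies if p == part)

parts = sorted({p for _, p, _ in supplies})
riskiest = max(parts, key=concentration_risk)
print(riskiest, concentration_risk(riskiest))
```

The answer depends on the weight stored on each connection and not just on whether a connection exists, which is the property the speaker is describing.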
695.44 -> and finally questions requiring recursing or
698 -> traversing an unknown number of
699.68 -> connections and this is really an area
701.6 -> where graphs really do excel
704.16 -> and this is where you have questions
706 -> that are a bit open-ended you know let's
708 -> take a look at the example of how does
709.519 -> person x have access to information y
711.839 -> you know they may have been given direct
713.279 -> access to this information or they may
715.36 -> have this information or access to this
717.44 -> information through a wide array of
719.68 -> different connections through maybe
721.04 -> different active directory groups and
722.48 -> things of that nature
724.72 -> but you won't know exactly the number of
726.72 -> connections or how they're connected at
728.72 -> the time you're
730.399 -> originally looking at the query
732.56 -> all you know is you want to find out how
733.839 -> two people or two entities are
736.56 -> related inside this so this is really
738.72 -> sort of when i think about graphs and
740.399 -> graph type problems these are the sorts
741.839 -> of problems i really look for
744.88 -> where graphs benefit
747.36 -> the end use case quite
749.44 -> significantly
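The open-ended "how does person x have access" question can be sketched as a breadth-first search that does not know in advance how many hops it will need. This is a plain-Python illustration under assumed data, not a Neptune query; the group names, `critical_folder`, and the `find_path` helper are hypothetical.

```python
from collections import deque

# Hypothetical membership and permission edges flattened into one adjacency map.
graph = {
    "person_x": ["group_a"],
    "group_a": ["group_b"],
    "group_b": ["group_c"],
    "group_c": ["critical_folder"],
    "person_z": ["critical_folder"],
}

def find_path(start, goal):
    """Breadth-first search: return the shortest chain from start to goal, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in graph.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None

print(find_path("person_x", "critical_folder"))
print(find_path("person_z", "critical_folder"))
```

A long returned path (person_x reaches the folder through three nested groups) is the kind of result the customer in the earlier example wanted to flag, while a one-hop path (person_z) suggests the access was granted directly.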
753.12 -> and why is this well there's a few
754.639 -> challenges around using
756.8 -> many other technologies with highly
758.959 -> connected data
760.32 -> first they tend to be a little
762.079 -> unnatural for querying that data
764.56 -> and this tends to lead to
766.48 -> an inefficient processing of that sort
769.279 -> of information
771.36 -> and most other databases out there or
773.2 -> other data technologies out there tend
774.56 -> to have a rigid schema that's really
776.32 -> inflexible for rapidly changing data
779.92 -> that most of the types of use
782.639 -> cases we've talked about today be they
784.32 -> fraud graphs or knowledge graphs or
786.24 -> security graphs or identity graphs
788.639 -> really tend to require
790.8 -> so let's dive a little bit into that
792.24 -> what is it about graphs that actually
794.8 -> makes them better at handling this sort of
796.88 -> highly connected data
798.639 -> well the first aspect here is the query
800.88 -> languages the query languages that we
802.639 -> use with graphs are really optimized to
805.04 -> use the connections
807.04 -> to move through the network of
809.2 -> data that you're looking at and this
810.8 -> comes down to the fact that graph
812.399 -> databases and managed graph services are
814.399 -> based on graph theory and one of the
815.839 -> kind of key pieces of graph theory is
817.68 -> this concept of traversing your data or
819.76 -> moving from point a to point b so if we
822 -> look at this specific example
824.32 -> we're looking at this and it says dave
826.079 -> works at amazon the little gremlin
828.639 -> guy
831.44 -> that's the logo for the apache tinkerpop
833.12 -> gremlin project and that's
834.56 -> representing kind of where we are in our
836.8 -> information today when i write queries
839.199 -> in graph query languages
841.36 -> the
844 -> way we work with them we'll see
845.6 -> an example of this later
847.76 -> they really work by
849.279 -> moving from point a to point b so
851.519 -> i'm moving through my graph or
853.519 -> my network from dave to amazon if we
856.16 -> want to contrast this with something
857.6 -> like a relational database relational
860 -> databases work on relational algebra set
862.72 -> set algebra and they work by
864.8 -> combining sets of data so if we wanted
866.399 -> to kind of look at this in the
868.56 -> same example
870.48 -> here i would probably have a table
872.56 -> called something like person and a table
874.399 -> called something like company i would
876.48 -> perform a join on this in order to get
878.24 -> the fact that dave is a person
880.16 -> and
881.92 -> that i work at that company through some
883.68 -> sort of foreign key between those tables
886.24 -> and
887.04 -> at its very core
888.72 -> that is the way that relational
891.04 -> databases work as opposed to moving from
892.88 -> point a to point b in a graph
895.12 -> when i move from point a to point b
896.639 -> unless i
897.839 -> explicitly ask for it i don't
899.199 -> necessarily maintain the history of
900.959 -> everywhere i've been in relational
902.88 -> databases as i'm joining these tables
904.8 -> together i'm building a bigger and
906.32 -> bigger table in memory
908.639 -> in theory in memory that basically is
911.04 -> containing all of the information and
912.56 -> all of the history of where i've been
915.279 -> this is why when you start running large
917.44 -> queries that have to traverse or move
919.6 -> through a lot of data in order to get
921.279 -> there
922.32 -> graph databases because i'm
924.56 -> moving from point a to point b as opposed to
926.48 -> building a bigger and bigger in-memory
928.16 -> table are more efficient from a memory
930.32 -> perspective and a speed perspective
932.32 -> we're gonna do it
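The contrast between combining sets and moving from point a to point b can be sketched in a few lines of plain Python. This is a deliberately simplified illustration, not how any real engine is implemented; the `person` and `company` rows, the `works_at` map, and the `position` variable are hypothetical.

```python
# Hypothetical relational tables represented as lists of rows.
person = [{"id": 1, "name": "dave", "company_id": 10}]
company = [{"id": 10, "name": "amazon"}]

# Join style: combine the two sets into one wider intermediate row set
# that carries along everything collected so far.
joined = [
    {**p, "company_name": c["name"]}
    for p in person
    for c in company
    if p["company_id"] == c["id"]
]

# Traversal style: start on one node and step across an edge to the next.
works_at = {"dave": ["amazon"]}
position = "dave"
position = works_at[position][0]  # now standing on "amazon"; no history is kept

print(joined[0]["company_name"], position)
```

Both approaches reach the same answer here. The difference the speaker is pointing at shows up as the number of hops grows: the joined rows keep widening, while the traversal only has to remember where it currently is unless the path is explicitly requested.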
933.68 -> the other aspect there is graph
935.6 -> databases are really optimized for
937.839 -> processing connected data at the kind of
940.48 -> engine level because
942.399 -> graph databases store not just the
944.959 -> entities but the connections
947.04 -> this really gives them
949.36 -> the advantage of the fact that the
950.56 -> connections that you're working on are
952 -> data itself this means that they're
953.6 -> physically saved to disk so when i need
956.24 -> to actually retrieve this data in order
958.24 -> to know i want to move from point a to
959.68 -> point b across the connection i'm really
962.079 -> just reading data again i'm reading data
964.639 -> off of disk to retrieve that information
967.12 -> if we contrast this again with something
968.48 -> like a relational database
970.32 -> the works at connection in this example
972.48 -> is really metadata it would be
973.68 -> represented through something like a
975.12 -> foreign key between that
976.639 -> person and company table
978.639 -> so when i want to find out what company
980.8 -> somebody works at i need to actually
982.48 -> calculate that at runtime
984.32 -> i need
984.959 -> to run that relational algebra to
986.88 -> calculate that as opposed to being able
988.72 -> to retrieve from disk so when i start
991.279 -> needing to process
992.639 -> hundreds or thousands or hundreds of
994.72 -> thousands or millions or billions of
996.16 -> these sorts of relationships the fact
998 -> that i can retrieve them from disk
999.279 -> versus calculate them at runtime really
1001.36 -> does lead to a much more efficient
1002.88 -> processing of that sort of information
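A rough sketch of "retrieve the relationship" versus "calculate it at query time", assuming invented data. It is intentionally simplified: the scan version assumes no index on the key column, whereas real relational engines usually do index foreign keys, so treat it only as an illustration of the idea. The `people` rows, `employs` map, and both helper functions are hypothetical.

```python
# Hypothetical data: 1,000 people, each working at one of 10 companies.
people = [{"name": f"p{i}", "company_id": i % 10} for i in range(1000)]

# Key-column style: the relationship is implied by matching key values,
# so it is worked out by comparing keys when the question is asked.
def employees_by_scan(company_id):
    return [p["name"] for p in people if p["company_id"] == company_id]

# Graph style: the edges are stored data, here a prebuilt adjacency list.
employs = {}
for p in people:
    employs.setdefault(p["company_id"], []).append(p["name"])

def employees_by_edge(company_id):
    return employs.get(company_id, [])

print(len(employees_by_scan(3)), len(employees_by_edge(3)))
```

Both functions return the same employees; the second one simply reads connections that were written down ahead of time.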
1007.6 -> and the last kind of uh item i wanted to
1009.44 -> touch on here is a little bit about
1010.72 -> schema flexibility
1012.8 -> when we're looking at
1015.199 -> these two examples up here we see that
1017.68 -> we have you know these are both
1019.04 -> representations of family trees
1021.68 -> with a graph and with amazon
1023.759 -> neptune and with most graphs
1025.6 -> they're what's
1026.72 -> known as a schema-less database
1028.48 -> personally not my favorite terminology
1030.4 -> because if you have data you have schema
1032.799 -> so i like to think of it more in terms
1034.4 -> of explicit versus implicit schema so in
1037.199 -> the case of a graph
1039.919 -> with its schema-less nature or its
1041.839 -> implicit schema i can just
1044.799 -> start writing information to my system i
1047.36 -> can write a person and i can start
1049.44 -> writing a property of a first name or a
1050.88 -> last name
1052.4 -> i don't have to declare these ahead of
1053.919 -> time i don't have to set up tables or
1056.16 -> keys or constraints around that i can
1058.4 -> just start writing that information to
1060.16 -> my system as the data that's coming
1063.039 -> in or maybe the attributes of that
1064.48 -> data change or
1066.64 -> maybe we add a new type of data a
1068.72 -> new set of entities to my graph i can
1070.559 -> just start writing those and they will
1071.76 -> be automatically included into the
1073.44 -> schema of my graph
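The implicit-schema idea can be sketched with a toy in-memory store: nothing is declared up front, and the "schema" is simply whatever labels and properties have been written so far. The `add_vertex` helper, the labels, and the property values are hypothetical and only stand in for a real graph database's write path.

```python
# Hypothetical schemaless store: vertices are plain dicts, no table definitions.
vertices = []

def add_vertex(label, **properties):
    """Append a vertex carrying whatever properties the caller supplies."""
    vertex = {"label": label, **properties}
    vertices.append(vertex)
    return vertex

add_vertex("person", first_name="dave")
add_vertex("person", first_name="alice", last_name="smith")  # new property, nothing to migrate
add_vertex("pet", name="rex")                                 # new entity type, nothing to declare

# The implicit schema is derived from the data that has been written.
schema = {}
for v in vertices:
    schema.setdefault(v["label"], set()).update(k for k in v if k != "label")
print(schema)
```

Adding the `last_name` property or the `pet` label required no change to any declaration, which is the flexibility being described.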
1075.919 -> this provides a lot of flexibility
1077.52 -> especially as data evolves over time if
1079.919 -> we yet again want to compare that