GRAPH DATABASE
WHAT IS A GRAPH DATABASE?
• A graph database uses graph structures for semantic queries. The data gets
stored as nodes, edges, and properties.
• Nodes represent entities such as people or products.
• Edges define relationships between nodes.
• Properties provide additional information about nodes and edges.
• This structure allows for efficient querying and visualization of complex
relationships.
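As a minimal illustration of this structure (hypothetical toy data, plain Python rather than any real database's API), nodes, edges, and properties can be sketched as:

```python
# Toy in-memory property graph: nodes and edges both carry property maps.
nodes = {
    "p1": {"label": "Person", "name": "Alice"},
    "m1": {"label": "Product", "name": "Laptop"},
}
edges = [
    # (source id, relationship type, target id, relationship properties)
    ("p1", "PURCHASED", "m1", {"date": "2024-01-15"}),
]

# Following a relationship is a direct hop over the edge list,
# not a join across tables.
purchases = [(nodes[s]["name"], nodes[t]["name"], props["date"])
             for s, rel, t, props in edges if rel == "PURCHASED"]
print(purchases)  # [('Alice', 'Laptop', '2024-01-15')]
```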
KEY FEATURES OF GRAPH
DATABASES
• Data Modeling Capabilities
• Graph databases offer flexible data modeling. Unlike relational databases,
graph databases do not require a predefined schema. This flexibility allows
for the easy addition of new types of relationships and nodes.
• Graph databases can model real-world scenarios more naturally. This
capability proves useful in dynamic environments like social networks and
supply chain management.
• Query Languages
• Graph databases use specialized query languages.
• Neo4j uses Cypher, a declarative graph query language.
• TigerGraph employs GSQL, which combines SQL-like syntax with graph
traversal capabilities.
• ArangoDB uses AQL, a versatile query language for its multi-model
database.
• These languages enable complex queries that would be cumbersome in
SQL.
• Performance and Scalability
• Graph databases outperform traditional databases in handling
connected data.
• Neo4j offers robust performance for read-heavy workloads.
• TigerGraph excels in data loading speed and storage efficiency.
• ArangoDB provides competitive performance with its multi-model
approach.
• Scalability remains a critical factor. Many graph databases scale
horizontally, accommodating growing datasets without sacrificing
performance. This scalability ensures that graph databases meet the
demands of modern applications.
| Feature | Neo4j | ArangoDB | TigerGraph |
|---|---|---|---|
| Type | Pure graph database (property graph model) | Multi-model DB (graph + document + key/value) | Native parallel graph database (high-performance) |
| Graph Model | Property graph (nodes, relationships, properties) | Property graph + document store + key-value | Property graph with a distributed, parallel architecture |
| Query Language | Cypher (easy, SQL-like for graphs); also supports openCypher and the emerging GQL | AQL (Arango Query Language), which handles graphs and documents together | GSQL (powerful but more complex; designed for analytics) |
| Scalability | Good for medium-to-large graphs; cluster support, but scaling requires effort | Scales decently due to its multi-model nature, but not as optimized for very large graphs | Excellent scalability (billions of edges, parallel queries); built for big data and enterprise scale |
| Strengths | Rich ecosystem (Bloom visualization, Graph Data Science Library); easy to learn (Cypher); strong community and tooling | Flexibility: one DB for documents and graphs; can avoid extra integrations (good for hybrid workloads); open-source friendly | Very fast on large-scale graph analytics (fraud detection, supply chain, social networks); parallel query execution |
| Weaknesses | Performance drops with very large graphs (hundreds of billions of edges); clustering setup can be complex | Graph features not as mature as Neo4j's; AQL is harder to master than Cypher | Smaller community; enterprise-oriented (can be costly); steep GSQL learning curve |
| Best Use Cases | Fraud detection, recommendation engines, supply chain visualization, knowledge graphs | Applications needing both a graph and a document DB (e.g., product catalogs, metadata + relationships) | Heavy-duty analytics on huge datasets (telecom, finance, cybersecurity, supply chain risk) |
WHAT IS A GRAPH QUERY LANGUAGE?
• A graph query language is a specialized tool for interacting with graph databases, which
store data as networks of nodes (representing entities) and edges (representing
relationships).
• Unlike traditional relational databases that use tables and SQL, graph databases prioritize
connections between data points, making graph query languages better suited for
navigating complex relationships.
• Common examples include Cypher (used in Neo4j), Gremlin (supported by Apache
TinkerPop), and SPARQL (for RDF data).
• These languages allow developers to express queries that traverse paths, filter nodes
based on properties, or analyze interconnected data patterns efficiently.
• Graph query languages are particularly useful in scenarios where relationships are
central to the problem. Social networks use them to find connections between users,
recommendation engines leverage them to identify related products, and fraud detection
systems analyze transaction patterns.
GRAPH QUERY LANGUAGE
• A graph query language is a specialized programming language designed
to interact with and extract information from graph databases.
• It serves as the bridge between users or applications and the underlying
graph database, enabling them to retrieve, update, and manipulate data stored
in a graph format.
• Graph query languages provide a way to express queries that navigate the
intricate network of nodes and edges to find specific patterns, relationships,
or insights within the data.
GRAPH QUERY LANGUAGES
1. Cypher
Cypher is a declarative query language designed specifically for
graph databases.
Its syntax is inspired by natural language, making it relatively easy to
read and understand. Cypher queries typically follow a pattern of
MATCH, WHERE, and RETURN, allowing you to express patterns, filter
results, and retrieve specific data.
Key features:
Pattern matching: Cypher excels at pattern matching, allowing you
to describe complex graph structures and relationships concisely.
Declarative nature: You focus on what data you want to retrieve,
and Cypher figures out the optimal way to traverse the graph.
Wide adoption: Cypher is the primary query language of Neo4j, one of the
most popular graph database systems, which has driven its broad adoption.
GQL (GRAPH QUERY LANGUAGE)
GQL is a new international standard for property graph database
languages, officially published as ISO/IEC 39075 in April 2024.
Developed by the same committee responsible for SQL, GQL represents a significant
milestone as the first new database query language standardized by ISO in over 35
years.
Key features:
Powerful graph pattern matching (GPM): GQL's GPM allows users to write
relatively simple queries for complex data analysis.
Rich data types: Includes support for various data types, including character and
byte strings, fixed-point and floating-point numerics, and native nested data.
ISO standard: GQL is the official ISO/IEC standard (ISO/IEC 39075) for property
graph database languages, providing a standardized approach across the industry.
GREMLIN
Gremlin is a more imperative and procedural language compared to Cypher. It
provides a flexible and powerful way to traverse and manipulate graph data.
Gremlin queries are often chained together using steps that filter, transform, and
aggregate data as it flows through the traversal.
Key features:
Traversal framework: Gremlin's core strength lies in its ability to express
complex graph traversals and transformations.
Imperative style: You have fine-grained control over how the graph is
traversed and how data is processed at each step.
Hybrid capability: Gremlin supports both OLTP and OLAP operations.
Multilingual integration: Gremlin can be embedded in multiple programming
languages.
SPARQL
SPARQL (pronounced "sparkle") is a query language primarily used for querying
RDF (Resource Description Framework) data, a standard way to represent
knowledge graphs.
RDF data is essentially a graph where nodes represent resources, and edges represent
relationships between them. SPARQL offers powerful capabilities for querying and
reasoning over RDF graphs.
Key features:
RDF compatibility: SPARQL is specifically designed to work with RDF data and its
underlying graph structure.
Semantic web focus: It aligns with the principles of the Semantic Web, enabling
querying and inference over linked data.
SQL-like: SPARQL supports SQL-like syntax for querying graph patterns.
WHAT IS GRAPH DATA
MODELING?
Data modeling is a practice that defines the logic of queries and the structure of the data in
storage. A well-designed model is the key to leveraging the strengths of a graph database as
it improves query performance, supports flexible queries, and optimizes storage.
In summary, the process of creating a data model includes the following:
• Understand the domain and define specific use cases (questions) for the application.
• Develop an initial graph data model by extracting entities and deciding how
they relate to each other.
• Test the use cases against the initial data model.
• Create the graph with test data using Cypher.
• Test the use cases, including performance against the graph.
• Refactor the graph data model due to changes in the key use cases or for performance
reasons.
CREATE A GRAPH DATA MODEL
Define the domain
In the Movies example dataset, the domain includes movies, people who acted in or
directed movies, and users who rated movies. It is in the connections (relationships)
between these entities that you find insights about your domain.
Define the use case
In other words, what questions are you trying to answer?
You can make a list of questions to help you identify the application use cases. The questions will help you
define what you need from the application, and what data must be included in the graph.
For this tutorial, your application should be able to answer these questions:
• Which people acted in a movie?
• Which person directed a movie?
• Which movies did a person act in?
• How many users rated a movie?
• Who was the youngest person to act in a movie?
• Which role did a person play in a movie?
• Which is the highest rated movie in a particular year according to IMDb?
• Which drama movies did an actor act in?
• Which users gave a movie a rating of 5?
Define the purpose
When designing a graph data model for an application, you may need both a
data model and an instance model.
Data model
The data model describes the nodes and relationships in the domain and
includes labels, types, and properties.
Instance model
An instance model is a representation of the data that is stored and processed
in the actual model. You can use an instance model to test against your use
cases.
INSTANCE MODEL
Define entities
An instance model helps you preview how the data will be stored as nodes, relationships, and
properties. The next step is to refine your model with more details.
Labels
The dominant nouns in your application use case are represented as nodes in your model and
can be used as node labels. For example:
Which person acted in a movie?
How many users rated a movie?
The nodes in your initial model are thus Person, Movie, and User.
Node properties
Properties such as Person.name let queries anchor on specific entities. For example:
MATCH (p:Person {name: 'Tom Hanks'})-[:ACTED_IN]-(m:Movie)
RETURN m
With these properties, it is easier to visualize what you need from the graph to answer
the use case questions.
Unique identifiers
In Cypher, it is possible to create two different nodes with the exact same data. However, from a data
management and model perspective, different nodes should contain different data. You can
use unique identifiers to make sure that every node is a separate and distinguished entity.
In the initial instance model, these are the properties set for the Movie nodes:
Movie.title (string)
Movie.tmdbID (integer)
Movie.released (date)
Movie.imdbRating (decimal between 0 and 10)
Movie.genres (list of strings)
And for the Person nodes:
Person.name (string)
Person.tmdbID (integer)
Person.born (date)
Relationships
Relationships are connections between nodes, and these connections are the verbs
in your use cases:
Which person acted in a movie?
Which person directed a movie?
At a glance, connections seem straightforward. To get started, thinking of
relationships from the perspective that “connections are verbs” works well, but
there are other important considerations that you will learn as you advance with
your model.
Naming
It is important to choose good names (types) for the relationships in the graph
and be as specific as possible in order to allow Neo4j to traverse only relevant
connections.
For example, instead of connecting two nodes with a generic relationship type
(e.g. CONNECTED_TO), prefer to be more specific and intuitive about the way
those entities connect.
For this sample, you could define relationships as:
ACTED_IN
DIRECTED
With these options, you can already plan the direction of the relationships.
Relationship direction
All relationships must have a direction. When created, relationships must
specify their direction explicitly or have it inferred from the left-to-right
order of the pattern.
In the example use cases, the ACTED_IN relationship must be created to go
from a Person node to a Movie node.
Relationship properties
Properties for a relationship are used to enrich how two nodes are related. When you need to
know how two nodes are related and not just that they are related, you can use relationship
properties to further define the relationship.
The example question "Which role did a person play in a movie?" can be answered
with the help of the roles property on the ACTED_IN relationship.
Add more data
Now that you have created the first connections between the nodes, it’s time to add more
information to the graph. This way, you can answer more questions, such as:
How many users rated a movie?
Which users gave a movie a rating of 5?
To answer these questions, you need information about users and their ratings in your graph,
which means a change in your data model. Note that, with the addition of new data such as the
property roles in the ACTED_IN relationship, your initial data model has already been updated
along the way:
GRAPH ALGORITHMS
Graph algorithms are computational methods designed to process and analyze
data structured as a graph.
A graph is a data structure consisting of vertices (also known as nodes) and
edges that connect these vertices.
These algorithms are crucial for understanding relationships, paths, and patterns
within interconnected data.
RANDOM WALKS
• The first and most basic of these concepts is the random walk.
• A random walk simply chooses a starting node from which to begin and
then randomly traverses the graph for some number of steps or “hops”.
• A random walk can be performed on any graph, whether directed or
undirected, weighted or unweighted, or even disconnected.
• As we will see, these random walks can be used to solve a number of problems
and are therefore the foundation for most graph algorithms.
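A random walk takes only a few lines of code. The sketch below uses a plain adjacency-list dictionary and a toy graph (both illustrative assumptions, not tied to any particular database):

```python
import random

def random_walk(adj, start, steps, seed=None):
    """Start at `start` and take up to `steps` random hops along edges.

    Works on any adjacency-list graph; stops early at a dead end,
    which can happen in directed or disconnected graphs.
    """
    rng = random.Random(seed)
    path = [start]
    node = start
    for _ in range(steps):
        neighbors = adj.get(node, [])
        if not neighbors:
            break  # dead end: nowhere left to hop
        node = rng.choice(neighbors)
        path.append(node)
    return path

adj = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A"]}
walk = random_walk(adj, "A", steps=5, seed=7)
print(walk)  # a path of 6 nodes starting at 'A'
```

Algorithms such as PageRank build on exactly this primitive: run many walks and observe which nodes are visited most often.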
PATHFINDING & SEARCH ALGORITHMS
Another foundational family of graph algorithms is shortest-path
algorithms. Shortest path algorithms typically come in two flavors depending
on the nature of the problem and how you want to explore the graph to
ultimately find the shortest path.
Depth First Search starts by traversing as deeply into the graph as possible
before returning to its starting point and pursuing another deep path traversal.
Breadth First Search keeps its traversals as close to the starting node as
possible and only ventures deeper into the graph when it has exhausted all
possible paths closest to it.
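Assuming an unweighted adjacency-list graph (toy data below), breadth-first search can be sketched as follows; because BFS explores the closest nodes first, the first time it reaches the goal the path found is a shortest one:

```python
from collections import deque

def bfs_shortest_path(adj, start, goal):
    """Return one shortest path from start to goal, or None if unreachable."""
    queue = deque([[start]])   # each queue entry is a full path so far
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in adj.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None

adj = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"]}
print(bfs_shortest_path(adj, "A", "E"))  # ['A', 'B', 'D', 'E']
```

Swapping the `deque` (queue) for a list used as a stack turns this into a depth-first search, though DFS no longer guarantees that the first path found is shortest.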
Pathfinding is used in many use cases, perhaps most notably in Google Maps,
which uses pathfinding on a road-network graph to calculate the fastest route
to a given destination. This is just one of many examples of graphs being used
to solve everyday problems for countless people.
CENTRALITY ALGORITHMS
Centrality Algorithms can be used to analyze a graph as a whole to
understand which nodes within that graph have the greatest impact on the
network.
However, to measure the influence of a node in the network with an
algorithm, we must first define what “impact” on a graph means.
This differs from algorithm to algorithm and is a great starting point when
trying to decide which centrality algorithm to choose.
Degree Centrality uses the number of relationships (the degree) of a node to measure how much
impact it has on the graph.
Closeness Centrality uses the inverse of a node's farness (the sum of its distances to all other
nodes) to understand how central the node is to the graph.
Betweenness Centrality uses shortest paths to determine which nodes serve as central ‘bridges’
across a graph, identifying key bottlenecks within a network.
PageRank uses a set of random walks to measure how influential a given node is in a network,
by measuring which nodes are more likely to be visited on a random walk. Note that PageRank
addresses the disconnected-graph problem that random walks face by occasionally jumping to a
random node in the graph rather than making a direct hop. This allows the algorithm to explore
even disconnected portions of the graph. Named for Google co-founder Larry Page, PageRank was
developed as the backbone of the Google search engine and allowed it to exceed the performance
of its competitors in the early days of the internet.
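PageRank is usually computed with power iteration, which converges to the same scores that counting random-walk visits would give. A minimal sketch on a toy directed graph (illustrative data and parameter choices):

```python
def pagerank(adj, damping=0.85, iters=50):
    """Power-iteration PageRank. The damping factor models the 'random jump'
    that lets the walk escape dead ends and disconnected regions."""
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1 - damping) / n for v in nodes}
        for v in nodes:
            out = adj[v]
            if out:
                share = damping * rank[v] / len(out)
                for w in out:
                    new[w] += share
            else:
                # dangling node: spread its rank over the whole graph
                for w in nodes:
                    new[w] += damping * rank[v] / n
        rank = new
    return rank

adj = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(adj)
print(max(ranks, key=ranks.get))  # 'C' attracts the most incoming links
```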
COMMUNITY DETECTION ALGORITHMS
Community detection is a common use case for a variety of graphs.
Typically it is used in any situation where understanding the distinct groups of
nodes within a graph offers some tangible value to the use case.
This could be anything from social networks to fleets of trucks making
deliveries to a network of accounts transacting with one another.
However, which algorithm you choose to discover these communities will
greatly impact how they’re grouped.
Triangle Count uses the principle that three nodes fully connected to one another (a triangle) form
the simplest community structure that can exist in a graph. It therefore finds every triangle within
a graph to determine how those nodes are grouped together.
Strongly Connected Components and Connected Components are excellent algorithms for determining the
shape of your graph. Both aim to measure how many separate subgraphs make up the entirety of the data.
While Connected Components returns the completely disconnected subgraphs within a set of nodes and
edges, Strongly Connected Components returns the subgraphs in which every node can reach every other
node by following directed edges. Because of this, they are typically used in combination as a form
of initial exploratory data analysis when first analyzing graph data.
Louvain Modularity finds communities by comparing the density of connections within clusters of
nodes to the average for the network. If a group of nodes is more densely connected than the
network average, those nodes can be considered a community.
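Connected components are the easiest of these to sketch: a flood fill that counts how many disjoint subgraphs the data contains (toy undirected graph assumed):

```python
def connected_components(adj):
    """Undirected connected components via iterative depth-first flood fill."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            comp.add(v)
            stack.extend(adj[v])
        components.append(comp)
    return components

# Two pairs plus one isolated node: three disconnected subgraphs in total.
adj = {"A": ["B"], "B": ["A"], "C": ["D"], "D": ["C"], "E": []}
print(len(connected_components(adj)))  # 3
```

The strongly-connected variant additionally requires directed reachability in both directions (e.g., via Tarjan's or Kosaraju's algorithm), which is why the two are useful together for exploratory analysis.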
SHORTEST PATH ALGORITHMS
The Shortest Path algorithms calculate the
shortest path between a pair of nodes. There
are different algorithms that achieve this goal:
Dijkstra's Algorithm
Purpose: Find the shortest path from a single source to all other nodes in
a graph with non-negative edge weights.
Method: Uses a greedy approach with a priority queue to iteratively select
the nearest unvisited node.
Example Use Cases: GPS navigation, network routing, and game pathfinding.
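The greedy, priority-queue approach can be sketched in a few lines with Python's `heapq` (toy weighted adjacency list assumed):

```python
import heapq

def dijkstra(adj, source):
    """Shortest distances from `source`; assumes non-negative edge weights."""
    dist = {source: 0}
    heap = [(0, source)]  # priority queue of (distance, node)
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float("inf")):
            continue  # stale queue entry, already found a shorter route
        for w, weight in adj.get(v, []):
            nd = d + weight
            if nd < dist.get(w, float("inf")):
                dist[w] = nd
                heapq.heappush(heap, (nd, w))
    return dist

adj = {"A": [("B", 4), ("C", 1)], "C": [("B", 2)], "B": [("D", 5)], "D": []}
print(dijkstra(adj, "A"))  # {'A': 0, 'B': 3, 'C': 1, 'D': 8}
```

Note that the direct edge A→B (weight 4) loses to the detour A→C→B (weight 3), which the greedy selection discovers automatically.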
Bellman-Ford Algorithm
Purpose: Compute shortest paths from a single source in graphs that may
contain negative edge weights.
Method: Iteratively relax all edges up to (V − 1) times, where V is the
number of vertices.
Example Use Cases: Currency arbitrage detection, routing in networks with
variable costs, and analyzing economic models.

Floyd-Warshall Algorithm
Purpose: Find shortest paths between all pairs of nodes in a graph.
Method: Uses dynamic programming to update path lengths by considering all
possible intermediate nodes.
Example Use Cases: Traffic analysis, social network distance metrics, and
transitive closure in databases.
GRAPH DATABASE INDEXING
Graph database indexing is the process of creating and maintaining
indexes in a graph database for faster querying and traversal. Indexes
can be of different types based on the implementation:
Full-text search (FTS) indexing
Spatial indexing
Relationship indexing
FULL-TEXT SEARCH (FTS)
INDEXING
Full-text search (FTS) indexing enables very fast queries
over textual data stored in the nodes.
Text-based properties (like names, descriptions) are
indexed for efficient searching.
This allows users to perform flexible queries using
keywords, partial matches, or phrases.
Examples: Searching product descriptions in an e-
commerce graph, locating users by name or bio in a social
network graph.
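At its core an FTS index is an inverted index: a map from each token to the nodes whose text contains it. A minimal sketch with hypothetical product-description data (real FTS engines add stemming, ranking, and phrase support):

```python
import re
from collections import defaultdict

docs = {
    "n1": "wireless noise cancelling headphones",
    "n2": "wired gaming headphones",
    "n3": "portable bluetooth speaker",
}

# Build the inverted index: token -> set of node ids containing it.
index = defaultdict(set)
for node_id, text in docs.items():
    for token in re.findall(r"\w+", text.lower()):
        index[token].add(node_id)

def search(*keywords):
    """Keyword AND-query: intersect posting lists, no full scan needed."""
    sets = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*sets) if sets else set()

print(sorted(search("headphones")))              # ['n1', 'n2']
print(sorted(search("wireless", "headphones")))  # ['n1']
```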
SPATIAL INDEXING
Spatial indexing enables fast querying of
location-based data stored in graph nodes or
relationships.
Spatial properties (like coordinates or geometries)
are indexed using tree-based data structures (like
R-trees) to support spatial operations.
This allows users to perform efficient queries
based on location data.
Examples: Finding nearby restaurants in a
location graph, locating nearby friends in a
social graph.
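To keep the sketch short, the example below uses a uniform grid rather than an R-tree (real systems use R-trees or geohashes); the point is the same: a "nearby" query inspects only a few buckets instead of every node. Place names and coordinates are made up:

```python
from collections import defaultdict

CELL = 1.0  # grid cell size in degrees (toy value)

def cell_of(lat, lon):
    return (int(lat // CELL), int(lon // CELL))

grid = defaultdict(list)
places = {"cafe": (48.85, 2.35), "museum": (48.86, 2.34), "harbor": (43.29, 5.37)}
for name, (lat, lon) in places.items():
    grid[cell_of(lat, lon)].append(name)

def nearby(lat, lon):
    """Candidates in the query cell and its 8 neighbors only."""
    ci, cj = cell_of(lat, lon)
    hits = []
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            hits.extend(grid.get((ci + di, cj + dj), []))
    return sorted(hits)

print(nearby(48.85, 2.35))  # ['cafe', 'museum'] -- harbor is in a distant cell
```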
RELATIONSHIP INDEXING
Relationship indexing speeds up queries that depend on specific types or
properties of relationships between nodes.
It indexes relationship types or attributes (like timestamps, weights, or
labels) to allow fast filtering and traversal of the graph.
This allows users to quickly find relevant connections based on
relationship properties.
Examples: Filtering recent interactions in social graphs, tracing
financial transactions in fraud detection.
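A minimal sketch of the idea, with hypothetical edge data: group relationships by type up front so a query over, say, SENT edges never touches FOLLOWS edges:

```python
from collections import defaultdict

# Relationships as (source, type, target, properties) tuples (toy data).
rels = [
    ("u1", "FOLLOWS", "u2", {"since": 2021}),
    ("u1", "SENT", "u3", {"amount": 250}),
    ("u2", "SENT", "u3", {"amount": 900}),
]

# The "index": bucket every relationship by its type.
by_type = defaultdict(list)
for rel in rels:
    by_type[rel[1]].append(rel)

# Fraud-style filter over large transfers, scanning only SENT edges.
large = [(s, t) for s, _, t, props in by_type["SENT"] if props["amount"] > 500]
print(large)  # [('u2', 'u3')]
```

A production graph database would additionally index the property values themselves (e.g., a sorted structure over `amount` or timestamps) rather than scanning within the type bucket.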
THANK YOU
Graph databases and its algorithms .pptx

  • 1.
  • 2.
    WHAT IS GRAPHDATABASE • A graph database uses graph structures for semantic queries. The data gets stored as nodes, edges, and properties. • Nodes represent entities such as people or products. • Edges define relationships between nodes. • Properties provide additional information about nodes and edges. • This structure allows for efficient querying and visualization of complex relationships. 2
  • 3.
    3 KEY FEATURES OFGRAPH DATABASES • Data Modeling Capabilities • Graph databases offer flexible data modeling. Unlike relational databases, graph databases do not require a predefined schema. This flexibility allows for the easy addition of new types of relationships and nodes. • Graph databases can model real-world scenarios more naturally. This capability proves useful in dynamic environments like social networks and supply chain management. • Query Languages • Graph databases use specialized query languages. • Neo4j uses Cypher, a declarative graph query language. • TigerGraph employs GSQL, which combines SQL-like syntax with graph traversal capabilities. • ArangoDB uses AQL, a versatile query language for its multi-model database. • These languages enable complex queries that would be cumbersome in SQL.
  • 4.
    4 • Performance andScalability • Graph databases outperform traditional databases in handling connected data. • Neo4j offers robust performance for read-heavy workloads. • TigerGraph excels in data loading speed and storage efficiency. • ArangoDB provides competitive performance with its multi-model approach. • Scalability remains a critical factor. Graph databases scale horizontally, accommodating growing datasets without sacrificing performance. This scalability ensures that graph databases meet the demands of modern applications.
  • 5.
    5 Feature Neo4j ArangoDBTigerGraph Type Pure graph database (property graph model). Multi-model DB (Graph + Document + Key/Value). Native parallel graph database (high-performance). Graph Model Property graph (nodes, relationships, properties). Supports property graph + document store + key-value. Property graph with distributed, parallel architecture. Query Language Cypher (easy, SQL-like for graphs). Also supports openCypher, GQL (emerging). AQL (Arango Query Language) – handles graphs + documents together. GSQL (powerful, but more complex; designed for analytics). Scalability Good for medium-to-large graphs. Cluster support, but scaling requires effort. Scales decently due to multi- model nature, but not as optimized for very large graphs. Excellent scalability (billions of edges, parallel queries). Built for big data + enterprise scale. Strengths - Rich ecosystem (Bloom visualization, Graph Data Science Library). - Easy to learn (Cypher). - Strong community & tooling. - Flexibility: one DB for documents + graphs. - Can avoid extra integrations (good for hybrid workloads). - Open-source friendly. - Very fast on large-scale graph analytics (fraud detection, supply chain, social networks). - Parallel query execution. Weaknesses - Performance drops with very large graphs (hundreds of billions of edges). - Clustering setup can be complex. - Graph features not as mature as Neo4j. - AQL is harder to master than Cypher. - Smaller community. - Enterprise- oriented (can be costly). - GSQL learning curve is steep. Best Use Cases - Fraud detection, recommendation engines, supply chain visualization, knowledge graphs. - Applications needing both graph + document DB (e.g., product catalogs, metadata + relationships). - Heavy-duty analytics on huge datasets (telecom, finance, cybersecurity, supply chain risk). Harder (GSQL is powerful but not
  • 6.
    6 WHAT IS GRAPHQUERY LANGUAGE • A graph query language is a specialized tool for interacting with graph databases, which store data as networks of nodes (representing entities) and edges (representing relationships). • Unlike traditional relational databases that use tables and SQL, graph databases prioritize connections between data points, making graph query languages better suited for navigating complex relationships. • Common examples include Cypher (used in Neo4j), Gremlin (supported by Apache TinkerPop), and SPARQL (for RDF data). • These languages allow developers to express queries that traverse paths, filter nodes based on properties, or analyze interconnected data patterns efficiently. • Graph query languages are particularly useful in scenarios where relationships are central to the problem. Social networks use them to find connections between users, recommendation engines leverage them to identify related products, and fraud detection systems analyze transaction patterns.
  • 7.
  • 8.
    8 GRAPH QUERY LANGUAGE •A graph query language is a specialized programming language designed to interact with and extract information from graph databases. • It serves as the bridge between users or applications and the underlying graph database, enabling them to retrieve, update, and manipulate data stored in a graph format. • Graph query languages provide a way to express queries that navigate the intricate network of nodes and edges to find specific patterns, relationships, or insights within the data.
  • 9.
    GRAPH QUERY LANGUAGES9 1. Cypher Cypher is a declarative query language designed specifically for graph databases. Its syntax is inspired by natural language, making it relatively easy to read and understand. Cypher queries typically follow a pattern of MATCH, WHERE, and RETURN, allowing you to express patterns, filter results, and retrieve specific data. Key features: Pattern matching: Cypher excels at pattern matching, allowing you to describe complex graph structures and relationships concisely. Declarative nature: You focus on what data you want to retrieve, and Cypher figures out the optimal way to traverse the graph. Wide adoption: Cypher is widely used with Neo4j, one of the most popular graph database systems, contributing to its popularity.
  • 10.
    GQL (GRAPH QUERYLANGUAGE) GQL (Graph Query Language) GQL is a new international standard for property graph database languages, officially published as ISO/IEC 39075 in April 2024. Developed by the same committee responsible for SQL, GQL represents a significant milestone as the first new database query language standardized by ISO in over 35 years. Key features: Powerful graph pattern matching (GPM): GQL's GPM allows users to write relatively simple queries for complex data analysis. Rich data types: Includes support for various data types, including character and byte strings, fixed-point and floating-point numerics, and native nested data. ISO standard: GQL is the official ISO/IEC standard (ISO/IEC 39075) for property graph database languages, providing a standardized approach across the industry 10
  • 11.
    11 GREMLIN Gremlin is amore imperative and procedural language compared to Cypher. It provides a flexible and powerful way to traverse and manipulate graph data. Gremlin queries are often chained together using steps that filter, transform, and aggregate data as it flows through the traversal. Key features: Traversal framework: Gremlin's core strength lies in its ability to express complex graph traversals and transformations. Imperative style: You have fine-grained control over how the graph is traversed and how data is processed at each step. Hybrid capability: Gremlin Supports both OLTP and OLAP operations. Multilingual integration: Gremlin Can be embedded in multiple programming languages.
  • 12.
    12 SPARQL SPARQL (pronounced "sparkle")is a query language primarily used for querying RDF (Resource Description Framework) data, a standard way to represent knowledge graphs. RDF data is essentially a graph where nodes represent resources, and edges represent relationships between them. SPARQL offers powerful capabilities for querying and reasoning over RDF graphs. Key features: RDF compatibility: SPARQL is specifically designed to work with RDF data and its underlying graph structure. Semantic web focus: It aligns with the principles of the Semantic Web, enabling querying and inference over linked data. SQL-like: SPARQL supports SQL-like syntax for querying graph patterns.
  • 13.
    13 WHAT IS GRAPHDATA MODELING? Data modeling is a practice that defines the logic of queries and the structure of the data in storage. A well-designed model is the key to leveraging the strengths of a graph database as it improves query performance, supports flexible queries, and optimizes storage. In summary, the process of creating a data model includes the following: • Understand the domain and define specific use cases (questions) for the application. • Develop an initial graph data model by extracting entities and decide how they relate to each other. • Test the use cases against the initial data model. • Create the graph with test data using Cypher. • Test the use cases, including performance against the graph. • Refactor the graph data model due to changes in the key use cases or for performance reasons.
  • 14.
    14 CREATE A GRAPHDATA MODEL Define the domain The Movies example dataset, the domain includes movies, people who acted or directed movies, and users who rated movies. It is in the connections (relationships) between these entities that you find insights about your domain. Define the use case In other words, what questions are you trying to answer? You can make a list of questions to help you identify the application use cases. The questions will help you define what you need from the application, and what data must be included in the graph. For this tutorial, your application should be able to answer these questions: • Which people acted in a movie? • Which person directed a movie? • Which movies did a person act in? • How many users rated a movie? • Who was the youngest person to act in a movie? • Which role did a person play in a movie? • Which is the highest rated movie in a particular year according to imDB? • Which drama movies did an actor act in? • Which users gave a movie a rating of 5?
  • 15.
    15 Define the purpose Whendesigning a graph data model for an application, you may need both a data model and an instance model. Data model The data model describes the nodes and relationships in the domain and includes labels, types, and properties. Instance model An instance model is a representation of the data that is stored and processed in the actual model. You can use an instance model to test against your use cases.
    17 Define entities An instance model helps you preview how the data will be stored as nodes, relationships, and properties. The next step is to refine your model with more details. Labels The dominant nouns in your application use cases are represented as nodes in your model and can be used as node labels. For example: Which person acted in a movie? How many users rated a movie? The nodes in your initial model are thus Person, Movie, and User. Node properties A query such as MATCH (p:Person {name: 'Tom Hanks'})-[:ACTED_IN]-(m:Movie) RETURN m shows which properties are involved. With these properties, it is easier to visualize what you need from the graph to answer the use case questions.
    19 Unique identifiers In Cypher, it is possible to create two different nodes with the exact same data. However, from a data management and model perspective, different nodes should contain different data. You can use unique identifiers to make sure that every node is a separate and distinct entity. In the initial instance model, these are the properties set for the Movie nodes: Movie.title (string) Movie.tmdbID (integer) Movie.released (date) Movie.imdbRating (decimal between 0 and 10) Movie.genres (list of strings) And for the Person nodes: Person.name (string) Person.tmdbID (integer) Person.born (date)
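One way to enforce these identifiers is with uniqueness constraints, sketched here in Neo4j 5 syntax (the constraint names are illustrative):

```cypher
// Reject a second Movie or Person node with the same tmdbID
CREATE CONSTRAINT movie_tmdbID IF NOT EXISTS
FOR (m:Movie) REQUIRE m.tmdbID IS UNIQUE;

CREATE CONSTRAINT person_tmdbID IF NOT EXISTS
FOR (p:Person) REQUIRE p.tmdbID IS UNIQUE;
```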
    20 Relationships Relationships are connections between nodes, and these connections are the verbs in your use cases: Which person acted in a movie? Which person directed a movie? At a glance, connections seem straightforward. To get started, thinking of relationships from the perspective that “connections are verbs” works well, but there are other important considerations that you will learn as you advance with your model.
    21 Naming It is important to choose good names (types) for the relationships in the graph and to be as specific as possible in order to allow Neo4j to traverse only relevant connections. For example, instead of connecting two nodes with a generic relationship type (e.g. CONNECTED_TO), prefer to be more specific and intuitive about the way those entities connect. For this sample, you could define the relationships as ACTED_IN and DIRECTED. With these options, you can already plan the direction of the relationships.
    22 Relationship direction All relationships must have a direction. When a relationship is created, its direction must either be specified explicitly or be inferred from the left-to-right order of the pattern. In the example use cases, the ACTED_IN relationship must be created to go from a Person node to a Movie node:
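Sketched in Cypher (the names are sample data), the arrow in the pattern fixes the direction when the relationship is created:

```cypher
// ACTED_IN points from the Person to the Movie
MATCH (p:Person {name: 'Tom Hanks'}), (m:Movie {title: 'Apollo 13'})
MERGE (p)-[:ACTED_IN]->(m)
```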
    23 Relationship properties Properties for a relationship are used to enrich how two nodes are related. When you need to know how two nodes are related, and not just that they are related, you can use relationship properties to further define the relationship. The example question "Which role did a person play in a movie?" can be answered with the help of the property roles in the ACTED_IN relationship:
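A sketch with sample data (the role value is illustrative): the roles property is set on the relationship, then read back in the query:

```cypher
// Store the role on the relationship ...
MATCH (p:Person {name: 'Tom Hanks'}), (m:Movie {title: 'Apollo 13'})
MERGE (p)-[r:ACTED_IN]->(m)
SET r.roles = ['Jim Lovell'];

// ... then ask: which role did this person play in each movie?
MATCH (p:Person {name: 'Tom Hanks'})-[r:ACTED_IN]->(m:Movie)
RETURN m.title, r.roles;
```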
    24 Add more data Now that you have created the first connections between the nodes, it’s time to add more information to the graph. This way, you can answer more questions, such as: How many users rated a movie? Which users gave a movie a rating of 5? To answer these questions, you need information about users and their ratings in your graph, which means a change in your data model. Note that, with the addition of new data such as the property roles in the ACTED_IN relationship, your initial data model has already been updated along the way:
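The new part of the model could be exercised like this (the user name and rating are sample data):

```cypher
// A User rates a Movie; the rating lives on the relationship
MERGE (u:User {name: 'Ana'})
MERGE (m:Movie {title: 'Apollo 13'})
MERGE (u)-[r:RATED]->(m)
SET r.rating = 5;

// How many users rated a movie?
MATCH (:User)-[r:RATED]->(m:Movie {title: 'Apollo 13'})
RETURN m.title, count(r) AS ratings;
```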
    25 GRAPH ALGORITHMS Graph algorithms are computational methods designed to process and analyze data structured as a graph. A graph is a data structure consisting of vertices (also known as nodes) and edges that connect these vertices. These algorithms are crucial for understanding relationships, paths, and patterns within interconnected data.
    26 RANDOM WALKS • The first and most basic of these concepts is the random walk. • A random walk simply chooses a starting node from which to begin its walk and then randomly traverses the graph for some number of steps or “hops”. • A random walk can be performed on any graph, whether it is directed or undirected, weighted or unweighted, or even disconnected. • As we will see, these random walks can be used to solve a number of problems and are therefore the foundation for most graph algorithms.
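To make the idea concrete, here is a minimal Python sketch of a random walk over a toy adjacency-list graph (the graph, step count, and seed are made up for illustration):

```python
import random

def random_walk(graph, start, steps, seed=0):
    """Randomly traverse `graph` (an adjacency dict) for `steps` hops.

    A node with no neighbors ends the walk early, which is how a walk
    behaves when it wanders into a dead end of a directed graph."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    path = [start]
    node = start
    for _ in range(steps):
        neighbors = graph.get(node, [])
        if not neighbors:
            break
        node = rng.choice(neighbors)  # random hop to a neighbor
        path.append(node)
    return path

# A small toy graph: keys are nodes, values are their neighbors.
toy = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A"], "D": []}
walk = random_walk(toy, "A", steps=5)
```

Counting how often each node is visited across many such walks is the basic ingredient that algorithms like PageRank, DeepWalk, and Node2Vec build on.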
    27 PATHFINDING & SEARCH ALGORITHMS Another foundational family of graph algorithms is the shortest path algorithms. Shortest path algorithms typically come in two flavors, depending on the nature of the problem and how you want to explore the graph to ultimately find the shortest path. Depth First Search starts by traversing as deeply into the graph as possible before returning to its starting point and pursuing another deep path traversal. Breadth First Search keeps its traversals as close to the starting node as possible and only ventures deeper into the graph when it has exhausted all possible paths closest to it.
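As a sketch, Breadth First Search fits in a few lines of Python over a toy adjacency list; because it explores level by level, the first path that reaches the goal is a shortest (fewest-hops) one:

```python
from collections import deque

def bfs_shortest_path(graph, start, goal):
    """Breadth-first search: expand paths level by level, so the first
    time we pop a path ending at `goal` it has the fewest hops."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # goal unreachable from start

# Toy graph: two routes from A to D, then on to E.
toy = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
path = bfs_shortest_path(toy, "A", "E")  # → ["A", "B", "D", "E"]
```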
    28 Pathfinding is used in many use cases, perhaps most notably in Google Maps. In the earliest days of GPS, Google Maps used pathfinding on a graph to calculate the fastest route to arrive at a given destination. This is just one of many examples of graphs being used to solve everyday problems for countless people.
    29 CENTRALITY ALGORITHMS Centrality Algorithms can be used to analyze a graph as a whole to understand which nodes within that graph have the greatest impact on the network. However, to measure the influence of a node in the network with an algorithm, we must first define what “impact” on a graph means. This differs from algorithm to algorithm and is a great starting point when trying to decide which centrality algorithm to choose.
    30 Degree Centrality uses the degree of a node (its number of connections) to measure how much of an impact it has on the graph. Closeness Centrality uses the inverse of the farness (the total distance between a given node and all other nodes) to understand how central the node is to the graph. Betweenness Centrality uses shortest paths to determine which nodes serve as central ‘bridges’ across a graph, identifying key bottlenecks within a network. PageRank uses a set of random walks to measure how influential a given node is to a network, by measuring which nodes are more likely to be visited on a random walk. Note that PageRank addresses the disconnected-graph problem which random walks face by occasionally jumping to a random point in the graph rather than making a direct hop. This allows the algorithm to explore even disconnected portions of the graph. Named for Google co-founder Larry Page, PageRank was developed as the backbone of the Google search engine and allowed it to exceed the performance of all its competitors in the early days of the internet.
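Degree centrality is the simplest of these to sketch. Assuming an undirected graph stored as an adjacency dict (the star graph below is made up for illustration), a node's score is its degree divided by the maximum possible degree, n − 1:

```python
def degree_centrality(graph):
    """Degree centrality for an undirected graph given as an adjacency
    dict: degree divided by (n - 1), so scores fall in [0, 1] and are
    comparable across graphs of different sizes."""
    n = len(graph)
    return {node: len(neighbors) / (n - 1)
            for node, neighbors in graph.items()}

# A star graph: "hub" touches everyone, the leaves touch only the hub.
star = {"hub": ["a", "b", "c"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
scores = degree_centrality(star)  # hub scores 1.0, each leaf 1/3
```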
    31 COMMUNITY DETECTION ALGORITHMS Community detection is a common use case for a variety of graphs. Typically it is used in any situation where understanding the distinct groups of nodes within a graph offers some tangible value to the use case. This could be anything from social networks to fleets of trucks making deliveries to networks of accounts transacting with one another. However, which algorithm you choose to discover these communities will greatly impact how they are grouped.
    32 Triangle Count simply uses the principle that three nodes fully connected to one another (a triangle) form the simplest community that can exist in a graph. It therefore finds every triangle within a graph to determine how those nodes are grouped together. Strongly Connected Components and Connected Components are excellent algorithms for determining the shape of your graph. Both aim to measure how many subgraphs make up the entirety of the data. While Connected Components simply returns the completely disconnected subgraphs within a set of nodes and edges, Strongly Connected Components returns those subgraphs in which every node can reach every other node when edge directions are followed. Because of this, they are typically used in combination as a form of initial exploratory data analysis when first analyzing graph data. Louvain Modularity finds communities by comparing the density of connections within clusters of nodes to the average density for the network. If a group of nodes is more densely connected than the network average, those nodes can be considered a community.
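A minimal Python sketch of Connected Components on an undirected toy graph: an iterative depth-first search floods outward from each unvisited node and collects one disconnected cluster at a time:

```python
def connected_components(graph):
    """Connected components of an undirected graph (adjacency dict),
    found with an iterative depth-first search; returns node sets."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(graph.get(node, []))  # flood to neighbors
        seen |= comp
        components.append(comp)
    return components

# Two disconnected clusters: {A, B, C} and {X, Y}.
toy = {"A": ["B"], "B": ["A", "C"], "C": ["B"], "X": ["Y"], "Y": ["X"]}
parts = connected_components(toy)  # → two components
```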
    33 SHORTEST PATH ALGORITHMS The Shortest Path algorithms calculate the shortest path between a pair of nodes. There are different algorithms that achieve this goal: Dijkstra's Algorithm Purpose: Find the shortest path from a single source to all other nodes in a graph with non-negative edge weights. Method: Uses a greedy approach with a priority queue to iteratively select the nearest unvisited node. Example Use Cases: GPS navigation, network routing, and game pathfinding.
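A compact Python sketch of Dijkstra's algorithm, using the standard-library heapq as the priority queue (the road graph and weights are illustrative):

```python
import heapq

def dijkstra(graph, source):
    """Dijkstra's algorithm: repeatedly settle the nearest unvisited
    node via a min-heap. Edge weights must be non-negative.
    `graph` maps node -> list of (neighbor, weight) pairs."""
    dist = {source: 0}
    heap = [(0, source)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue  # stale heap entry; node already settled
        visited.add(node)
        for nxt, w in graph.get(node, []):
            nd = d + w
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd  # found a shorter route to nxt
                heapq.heappush(heap, (nd, nxt))
    return dist

roads = {
    "A": [("B", 4), ("C", 1)],
    "C": [("B", 2), ("D", 5)],
    "B": [("D", 1)],
    "D": [],
}
dist = dijkstra(roads, "A")  # A→C→B→D beats the direct edges
```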
    34 Bellman-Ford Algorithm Purpose: Compute shortest paths from a single source in graphs that may contain negative edge weights. Method: Iteratively relax all edges up to (V − 1) times, where V is the number of vertices. Example Use Cases: Currency arbitrage detection, routing in networks with variable costs, and analyzing economic models. Floyd-Warshall Algorithm Purpose: Find shortest paths between all pairs of nodes in a graph. Method: Uses dynamic programming to update path lengths by considering all possible intermediate nodes. Example Use Cases: Traffic analysis, social network distance metrics, and transitive closure in databases.
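Bellman-Ford is short enough to sketch directly. The toy edge list below includes a negative weight, which Dijkstra could not handle; the extra pass at the end detects negative cycles:

```python
def bellman_ford(edges, nodes, source):
    """Bellman-Ford: relax every edge V-1 times. Handles negative edge
    weights and reports negative cycles. `edges` is (u, v, weight)."""
    dist = {n: float("inf") for n in nodes}
    dist[source] = 0
    for _ in range(len(nodes) - 1):
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w  # relax the edge
    # One extra pass: any further improvement means a negative cycle.
    for u, v, w in edges:
        if dist[u] + w < dist[v]:
            raise ValueError("negative cycle detected")
    return dist

edges = [("A", "B", 4), ("A", "C", 2), ("C", "B", -1), ("B", "D", 3)]
dist = bellman_ford(edges, {"A", "B", "C", "D"}, "A")
# A→C→B (cost 1) beats the direct A→B edge (cost 4)
```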
    35 GRAPH DATABASE INDEXING It is the process of creating and maintaining indexes in a graph database for faster querying and traversal. Indexes can be of different types based on the implementation: Full-text search (FTS) indexing Spatial indexing Relationship indexing
    36 FULL-TEXT SEARCH (FTS) INDEXING Full-text search (FTS) indexing enables very fast queries over textual data stored in the nodes. Text-based properties (like names and descriptions) are indexed for efficient searching. This allows users to perform flexible queries using keywords, partial matches, or phrases. Examples: Searching product descriptions in an e-commerce graph, locating users by name or bio in a social network graph.
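In Neo4j, for example, a full-text index might be created and queried like this (the index name and property list are illustrative; Neo4j 5 syntax):

```cypher
CREATE FULLTEXT INDEX movieText IF NOT EXISTS
FOR (m:Movie) ON EACH [m.title, m.description];

// Keyword search over titles and descriptions, best matches first
CALL db.index.fulltext.queryNodes('movieText', 'space rescue')
YIELD node, score
RETURN node.title, score ORDER BY score DESC;
```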
    37 SPATIAL INDEXING Spatial indexing enables fast querying of location-based data stored in graph nodes or relationships. Spatial properties (like coordinates or geometries) are indexed using tree-based data structures (like R-trees) to support spatial operations. This allows users to perform efficient queries based on location data. Examples: Finding nearby restaurants in a location graph, locating nearby friends in a social graph.
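Sketched in Neo4j 5 syntax (the names and the 500 m radius are illustrative; $here is a point parameter supplied by the application):

```cypher
CREATE POINT INDEX restaurantLocation IF NOT EXISTS
FOR (r:Restaurant) ON (r.location);

// Restaurants within 500 m of the given position
MATCH (r:Restaurant)
WHERE point.distance(r.location, $here) < 500
RETURN r.name;
```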
    38 RELATIONSHIP INDEXING It speeds up queries that depend on specific types or properties of relationships between nodes. It indexes relationship types or attributes (like timestamps, weights, or labels) to allow fast filtering and traversal of the graph. This allows users to quickly find relevant connections based on relationship properties. Examples: Filtering recent interactions in social graphs, tracing financial transactions in fraud detection.
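Sketched in Neo4j 5 syntax (the labels, relationship type, and properties are illustrative):

```cypher
CREATE INDEX transfer_time IF NOT EXISTS
FOR ()-[t:TRANSFER]-() ON (t.timestamp);

// Transfers from the last seven days, found via the index
MATCH (a:Account)-[t:TRANSFER]->(b:Account)
WHERE t.timestamp > datetime() - duration('P7D')
RETURN a.accountId, b.accountId, t.amount;
```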

Editor's Notes

  • #2 Graph databases excel in scenarios where relationships between data points hold significant value. Traditional relational databases struggle with complex relationships. Graph databases handle these efficiently. Social media platforms, recommendation engines, and fraud detection systems benefit greatly from graph databases. The ability to traverse relationships quickly makes graph databases indispensable in modern data management.
  • #5 1. Type Neo4j → A pure graph database. It only focuses on storing and querying graph data (nodes, relationships, and properties). ArangoDB → A multi-model database. It can work as a graph database, a document store (like MongoDB), and a key-value store—all in one system. TigerGraph → A native parallel graph database. Built specifically for graph data with high performance and parallel processing. 2. Graph Model Neo4j → Uses the property graph model (nodes, relationships, each can have properties). ArangoDB → Supports property graph + document store + key-value. So, more flexible than Neo4j. TigerGraph → Also property graph, but distributed and parallelized to handle huge datasets. 3. Query Language Neo4j → Uses Cypher (like SQL but for graphs; very user-friendly). Also moving toward openCypher and GQL (Graph Query Language). ArangoDB → Uses AQL (Arango Query Language), which can query both graphs and documents. TigerGraph → Uses GSQL, which is powerful but more complex. Designed for analytics on big graph data. 4. Scalability Neo4j → Works well for medium-to-large graphs, but when graphs get huge (billions of nodes/edges), performance can drop. Cluster scaling is possible but tricky. ArangoDB → Scales decently because of its multi-model nature but is not the best for extremely large graphs. TigerGraph → Excellent scalability. Designed for billions of edges and real-time queries at enterprise scale. 5. Strengths Neo4j → Rich ecosystem (visualization tools like Bloom, Graph Data Science library). Easy to learn (Cypher is simple and intuitive). Large community and support. ArangoDB → Very flexible (one DB for both graphs and documents). Avoids the need for multiple DBs in hybrid use cases. Open-source friendly. TigerGraph → Extremely fast for large-scale graph analytics (fraud detection, social networks, supply chain). Built for parallel queries (handles big workloads well). 6. 
Weaknesses Neo4j → Performance struggles with very large graphs (hundreds of billions of edges). Cluster setup is complex. ArangoDB → Graph functionality is not as advanced as Neo4j. AQL has a steeper learning curve than Cypher. TigerGraph → Smaller community compared to Neo4j. More enterprise-oriented, so licensing can be costly. GSQL is harder to learn. 7. Best Use Cases Neo4j → Fraud detection Recommendation engines Supply chain visualization Knowledge graphs ArangoDB → Scenarios needing both graph + document DB (e.g., product catalogs with relationships, metadata storage). TigerGraph → Heavy-duty analytics for very large datasets (telecom, finance, cybersecurity, supply chain risk). 8. Learning Curve Neo4j → Easy (Cypher is intuitive, similar to SQL). ArangoDB → Medium (AQL is flexible but harder than Cypher). TigerGraph → Harder (GSQL is powerful but not beginner-friendly). A multi-model database means the database supports more than one type of data model (instead of being limited to just one, like only graphs or only documents). When we say multi-model graph database, it usually means: It supports graph data (nodes + edges + properties). It also supports other models such as: Document model → JSON-like documents (like MongoDB). Key-value model → simple key → value storage (like Redis). Sometimes relational/tabular model too. TigerGraph takes the property graph model and scales it for very large datasets by: Distributed storage Data is split across multiple servers/nodes (instead of sitting in one machine). Example: if you have 1 billion users in a social network, the data is divided across servers to fit and manage better. Parallel processing Queries (like finding shortest paths, PageRank, fraud detection) are run in parallel across servers. This makes it much faster than traditional single-machine graph databases.
  • #9 A declarative query language is a type of query language where you describe what you want from the database, not how to get it. Cypher Overview Cypher is a declarative query language → meaning you tell the database what you want (the result), not how to compute it. It was created for graph databases (especially Neo4j). Its syntax is designed to look a bit like natural language and SQL, so it’s easier to read and write. Main Structure Cypher queries usually follow 3 main clauses: MATCH → Describe the graph pattern you want to find. Example: MATCH (p:Person)-[:FRIEND_OF]->(f:Person) (Find people p and their friends f). WHERE → Apply conditions/filters. Example: WHERE p.age > 25 (Only friends where the person is older than 25). RETURN → Choose what data to get back. Example: RETURN p.name, f.name (Return the names of both). Key Features Explained Pattern Matching Cypher lets you describe graph patterns in a visual, ASCII-art-like style. Example: (a)-[:LIKES]->(b) looks like a small diagram → “a likes b.” This makes it intuitive to represent complex networks (like social networks, fraud detection, etc.). Declarative Nature You don’t worry about how to traverse the graph (step-by-step navigation). You just say what pattern you want, and Neo4j figures out the best execution plan. This makes Cypher beginner-friendly compared to imperative query languages (like Gremlin). Wide Adoption Cypher is the standard language for Neo4j, the most popular graph DB. Because Neo4j has such a big ecosystem (tools, libraries, community), Cypher also became widely adopted and recognized in industry and academia.
  • #10 Graph Pattern Matching GQL (Graph Query Language) Overview GQL is a new database query language standard. It was officially published in April 2024 as ISO/IEC 39075. This is a big deal because it’s the first new international database query language standard in over 35 years (the last one was SQL). It’s developed by the same ISO committee that made SQL → so it’s meant to become the “SQL for Graphs.” Key Features Explained 1. Powerful Graph Pattern Matching (GPM) At the heart of GQL is graph pattern matching. This lets you easily describe and find complex graph structures (nodes, edges, relationships) using concise queries. Example idea: Instead of writing long traversal logic, you can just express “Find all friends of friends of Alice who live in Paris” in a single query. GPM makes it much easier for data analysis on networks like social graphs, fraud detection, recommendation systems, etc. 2. Rich Data Types GQL supports a wide variety of data types, similar to SQL: Character and byte strings → for storing text or binary data. Numeric types → both integers (fixed-point) and decimals/floats (floating-point). Nested data structures → supports arrays, lists, or nested objects directly inside graph nodes/edges. This flexibility means you can store and query real-world complex data more naturally in graph form. 3. ISO Standard GQL is now the official ISO/IEC standard for property graph query languages. Why it matters: Provides uniformity across vendors (Neo4j, TigerGraph, ArangoDB, etc. can align with one language). Prevents fragmentation → developers won’t have to learn a new graph query language for each database. Boosts industry adoption → just like SQL became universal for relational databases, GQL aims to become universal for graph databases. Why GQL is Important Before GQL, graph DBs used their own languages: Neo4j → Cypher ArangoDB → AQL TigerGraph → GSQL Gremlin (Apache TinkerPop) This caused fragmentation → developers had to learn multiple languages.
With GQL, the idea is to unify the industry around one standardized language for property graph databases, similar to how SQL unified relational databases. 👉 In short: GQL is the new international standard query language for property graphs, providing: Easy graph pattern matching (for powerful analytics). Support for rich, nested data types. An ISO-certified standard, ensuring consistency and adoption across graph databases.
  • #11 imperative = "telling the computer how to do something step by step." Gremlin Overview Gremlin is a graph traversal language (part of Apache TinkerPop framework). Unlike Cypher or GQL (which are declarative → you say what you want), Gremlin is imperative/procedural → you specify how to move through the graph step by step. Think of it like writing instructions for a robot explorer: “Start here, move along edges of this type, filter out some nodes, transform the data, then aggregate results.” Key Features Explained 1. Traversal Framework Gremlin’s power is in traversals → walking through a graph step by step. You can chain multiple steps together to filter, transform, and analyze data. Example (conceptually): “Start at Alice → move to all her friends → filter only those older than 30 → count them.” This step-by-step approach makes it very flexible for complex graph algorithms (like shortest path, centrality, recommendations). 2. Imperative Style With Gremlin, you control the traversal process. You write procedural steps (like in programming) instead of just describing the end result. Advantage → More control and flexibility. Disadvantage → Harder to learn than declarative languages like Cypher. 3. Hybrid Capability (OLTP + OLAP) OLTP (Online Transaction Processing): Real-time queries on graph data (like “Who are Alice’s friends?”). OLAP (Online Analytical Processing): Large-scale analytics (like PageRank, community detection, shortest path over billions of nodes). Gremlin supports both, making it good for both transactional and analytical workloads. 4. Multilingual Integration Gremlin is not limited to one query interface. It can be embedded in many programming languages → Java, Python, JavaScript, Groovy, Scala, etc. This makes it popular with developers who want graph queries integrated directly in their application code.
Example (simplified) Imagine you want to find names of Alice’s friends over 30: In Cypher (declarative): MATCH (a:Person {name: "Alice"})-[:FRIEND_OF]->(f:Person) WHERE f.age > 30 RETURN f.name; 👉 You describe the pattern, Neo4j figures out how. In Gremlin (imperative): g.V().has("Person","name","Alice").out("FRIEND_OF").has("age", gt(30)).values("name") 👉 You explicitly say: Find vertex with name = Alice → Traverse outgoing FRIEND_OF edges → Keep only those with age > 30 → Return their names. ✅ Summary: Gremlin = imperative, traversal-based, flexible, procedural, integrates with many languages, good for complex traversals + analytics. Cypher/GQL = declarative, easier to read, good for pattern matching and querying.
  • #12 🔹 Meaning of the parts: SELECT ?name → This tells SPARQL: I want to return the variable ?name as the result. WHERE { ... } → Defines the pattern we are searching for in the graph. ?person foaf:knows ?friend . ?person = some person (a variable node). foaf:knows = the FOAF (Friend of a Friend) property meaning "knows". ?friend = another person (a friend). 👉 Translation: Find all people and their friends. ?friend foaf:name ?name . For each ?friend found above, look up their foaf:name. Bind that name to the variable ?name. SPARQL Overview SPARQL (pronounced sparkle) is a query language for RDF data. RDF (Resource Description Framework) is a standard from the W3C for representing knowledge graphs and linked data. RDF organizes data into triples: Subject – Predicate – Object Example: (Alice) – [knows] → (Bob) So, RDF data is essentially a graph: Nodes = resources (people, places, things). Edges = relationships (knows, livesIn, worksAt, etc.). SPARQL is the main language for querying and reasoning over RDF graphs. Key Features Explained 1. RDF Compatibility SPARQL is purpose-built for RDF → it understands RDF triples and works directly with that structure. You can query not just direct connections but also data spread across the web (because RDF supports linked data across multiple sources). 2. Semantic Web Focus SPARQL is tightly connected to the Semantic Web vision (an extension of the web where data is machine-readable and connected). It allows inference: meaning it can use ontologies (like OWL vocabularies) to infer new facts from existing ones. Example: If RDF says: Alice worksAt CompanyX CompanyX locatedIn Paris SPARQL + reasoning could infer: Alice is located in Paris This makes SPARQL great for knowledge graphs, linked open data, and reasoning engines. 3. SQL-like Syntax SPARQL syntax looks similar to SQL → so relational database users can learn it quickly. Queries work with graph patterns (like Cypher does), but expressed in triple form. 
Example Suppose we want to find all people Alice knows: PREFIX foaf: <http://xmlns.com/foaf/0.1/> SELECT ?friend WHERE { ?person foaf:name "Alice" . ?person foaf:knows ?friend . } Explanation: PREFIX → defines shorthand (foaf namespace for Friend-of-a-Friend ontology). ?person → variable for the subject node. ?friend → variable for the object node. The WHERE block describes the graph pattern: Find a person with name “Alice.” Then find who she foaf:knows. The result will return a list of Alice’s friends. ✅ Summary SPARQL = standard query language for RDF knowledge graphs. Strengths → works with linked data, semantic reasoning, and the Semantic Web vision. Looks like SQL but designed for graph triples. What is RDF? RDF = Resource Description Framework Think of it as a way to store information as triples: Subject → Predicate → Object Example: Alice → knows → Bob So RDF is like a graph: Nodes = Alice, Bob Edge/Relationship = knows 2. What is SPARQL? SPARQL = a query language (like SQL for databases). But instead of querying tables (rows/columns), you query graphs made of triples. So if your data is stored as RDF triples, SPARQL is the tool to ask questions about it. 3. Why is SPARQL special? RDF compatibility → It works directly with triples. Semantic web focus → It can do reasoning (infer new facts, like "if Alice works at CompanyX and CompanyX is in Paris → then Alice is in Paris"). SQL-like syntax → It feels familiar if you know SQL.
  • #15 Differentiation between a person who acted in a movie, who directed a movie, and who rated a movie. What ratings were given, how many there are, and when they were submitted. Which role an actor played in a movie and what their age is. The genres of the movies.
  • #26 A graph database stores data as nodes (entities) and edges (relationships). Random walks are used to explore the graph probabilistically rather than exhaustively. A random walk starts at a node and moves to a neighboring node at random, continuing for a number of steps. The frequency of visiting nodes or the paths generated can reveal structural properties, similarities, and clusters. 2. Applications in Modern Graph Databases Node Embedding / Representation Learning Algorithms like DeepWalk and Node2Vec use random walks to learn vector representations of nodes in a graph.