WHAT IS A GRAPH DATABASE?
• A graph database uses graph structures for semantic queries. The data gets
stored as nodes, edges, and properties.
• Nodes represent entities such as people or products.
• Edges define relationships between nodes.
• Properties provide additional information about nodes and edges.
• This structure allows for efficient querying and visualization of complex
relationships.
KEY FEATURES OF GRAPH DATABASES
• Data Modeling Capabilities
• Graph databases offer flexible data modeling. Unlike relational databases,
graph databases do not require a predefined schema. This flexibility allows
for the easy addition of new types of relationships and nodes.
• Graph databases can model real-world scenarios more naturally. This
capability proves useful in dynamic environments like social networks and
supply chain management.
• Query Languages
• Graph databases use specialized query languages.
• Neo4j uses Cypher, a declarative graph query language.
• TigerGraph employs GSQL, which combines SQL-like syntax with graph
traversal capabilities.
• ArangoDB uses AQL, a versatile query language for its multi-model
database.
• These languages enable complex queries that would be cumbersome in
SQL.
• Performance and Scalability
• Graph databases outperform traditional databases in handling
connected data.
• Neo4j offers robust performance for read-heavy workloads.
• TigerGraph excels in data loading speed and storage efficiency.
• ArangoDB provides competitive performance with its multi-model
approach.
• Scalability remains a critical factor. Graph databases scale
horizontally, accommodating growing datasets without sacrificing
performance. This scalability ensures that graph databases meet the
demands of modern applications.
Comparison: Neo4j vs ArangoDB vs TigerGraph

Type
  Neo4j: Pure graph database (property graph model).
  ArangoDB: Multi-model DB (graph + document + key/value).
  TigerGraph: Native parallel graph database (high-performance).

Graph Model
  Neo4j: Property graph (nodes, relationships, properties).
  ArangoDB: Supports property graph + document store + key-value.
  TigerGraph: Property graph with a distributed, parallel architecture.

Query Language
  Neo4j: Cypher (easy, SQL-like for graphs). Also supports openCypher and the emerging GQL standard.
  ArangoDB: AQL (Arango Query Language), which handles graphs + documents together.
  TigerGraph: GSQL (powerful, but more complex; designed for analytics).

Scalability
  Neo4j: Good for medium-to-large graphs. Cluster support, but scaling requires effort.
  ArangoDB: Scales decently due to its multi-model nature, but not as optimized for very large graphs.
  TigerGraph: Excellent scalability (billions of edges, parallel queries). Built for big data + enterprise scale.

Strengths
  Neo4j: Rich ecosystem (Bloom visualization, Graph Data Science Library); easy to learn (Cypher); strong community and tooling.
  ArangoDB: Flexibility: one DB for documents + graphs; can avoid extra integrations (good for hybrid workloads); open-source friendly.
  TigerGraph: Very fast on large-scale graph analytics (fraud detection, supply chain, social networks); parallel query execution.

Weaknesses
  Neo4j: Performance drops with very large graphs (hundreds of billions of edges); clustering setup can be complex.
  ArangoDB: Graph features not as mature as Neo4j's; AQL is harder to master than Cypher.
  TigerGraph: Smaller community; enterprise-oriented (can be costly); GSQL learning curve is steep.

Best Use Cases
  Neo4j: Fraud detection, recommendation engines, supply chain visualization, knowledge graphs.
  ArangoDB: Applications needing both a graph + document DB (e.g., product catalogs, metadata + relationships).
  TigerGraph: Heavy-duty analytics on huge datasets (telecom, finance, cybersecurity, supply chain risk).

Learning Curve
  Neo4j: Easy (Cypher is intuitive).
  ArangoDB: Medium (AQL is flexible but harder than Cypher).
  TigerGraph: Harder (GSQL is powerful but not beginner-friendly).
WHAT IS A GRAPH QUERY LANGUAGE?
• A graph query language is a specialized tool for interacting with graph databases, which
store data as networks of nodes (representing entities) and edges (representing
relationships).
• Unlike traditional relational databases that use tables and SQL, graph databases prioritize
connections between data points, making graph query languages better suited for
navigating complex relationships.
• Common examples include Cypher (used in Neo4j), Gremlin (supported by Apache
TinkerPop), and SPARQL (for RDF data).
• These languages allow developers to express queries that traverse paths, filter nodes
based on properties, or analyze interconnected data patterns efficiently.
• Graph query languages are particularly useful in scenarios where relationships are
central to the problem. Social networks use them to find connections between users,
recommendation engines leverage them to identify related products, and fraud detection
systems analyze transaction patterns.
GRAPH QUERY LANGUAGE
• A graph query language is a specialized programming language designed
to interact with and extract information from graph databases.
• It serves as the bridge between users or applications and the underlying
graph database, enabling them to retrieve, update, and manipulate data stored
in a graph format.
• Graph query languages provide a way to express queries that navigate the
intricate network of nodes and edges to find specific patterns, relationships,
or insights within the data.
GRAPH QUERY LANGUAGES
1. Cypher
Cypher is a declarative query language designed specifically for
graph databases.
Its syntax is inspired by natural language, making it relatively easy to
read and understand. Cypher queries typically follow a pattern of
MATCH, WHERE, and RETURN, allowing you to express patterns, filter
results, and retrieve specific data.
Key features:
Pattern matching: Cypher excels at pattern matching, allowing you
to describe complex graph structures and relationships concisely.
Declarative nature: You focus on what data you want to retrieve,
and Cypher figures out the optimal way to traverse the graph.
Wide adoption: Cypher is widely used with Neo4j, one of the most
popular graph database systems, contributing to its popularity.
GQL (GRAPH QUERY LANGUAGE)
GQL is a new international standard for property graph database languages,
officially published as ISO/IEC 39075 in April 2024.
Developed by the same committee responsible for SQL, GQL represents a significant
milestone as the first new database query language standardized by ISO in over 35
years.
Key features:
Powerful graph pattern matching (GPM): GQL's GPM allows users to write
relatively simple queries for complex data analysis.
Rich data types: Includes support for various data types, including character and
byte strings, fixed-point and floating-point numerics, and native nested data.
ISO standard: GQL is the official ISO/IEC standard (ISO/IEC 39075) for property
graph database languages, providing a standardized approach across the industry
GREMLIN
Gremlin is a more imperative and procedural language compared to Cypher. It
provides a flexible and powerful way to traverse and manipulate graph data.
Gremlin queries are often chained together using steps that filter, transform, and
aggregate data as it flows through the traversal.
Key features:
Traversal framework: Gremlin's core strength lies in its ability to express
complex graph traversals and transformations.
Imperative style: You have fine-grained control over how the graph is
traversed and how data is processed at each step.
Hybrid capability: Gremlin supports both OLTP and OLAP operations.
Multilingual integration: Gremlin can be embedded in multiple programming
languages.
SPARQL
SPARQL (pronounced "sparkle") is a query language primarily used for querying
RDF (Resource Description Framework) data, a standard way to represent
knowledge graphs.
RDF data is essentially a graph where nodes represent resources, and edges represent
relationships between them. SPARQL offers powerful capabilities for querying and
reasoning over RDF graphs.
Key features:
RDF compatibility: SPARQL is specifically designed to work with RDF data and its
underlying graph structure.
Semantic web focus: It aligns with the principles of the Semantic Web, enabling
querying and inference over linked data.
SQL-like: SPARQL supports SQL-like syntax for querying graph patterns.
WHAT IS GRAPH DATA MODELING?
Data modeling is a practice that defines the logic of queries and the structure of the data in
storage. A well-designed model is the key to leveraging the strengths of a graph database as
it improves query performance, supports flexible queries, and optimizes storage.
In summary, the process of creating a data model includes the following:
• Understand the domain and define specific use cases (questions) for the application.
• Develop an initial graph data model by extracting entities and deciding how they relate to
each other.
• Test the use cases against the initial data model.
• Create the graph with test data using Cypher.
• Test the use cases, including performance against the graph.
• Refactor the graph data model due to changes in the key use cases or for performance
reasons.
CREATE A GRAPH DATA MODEL
Define the domain
In the Movies example dataset, the domain includes movies, people who acted in or directed
movies, and users who rated movies. It is in the connections (relationships) between these
entities that you find insights about your domain.
Define the use case
In other words, what questions are you trying to answer?
You can make a list of questions to help you identify the application use cases. The questions will help you
define what you need from the application, and what data must be included in the graph.
For this tutorial, your application should be able to answer these questions:
• Which people acted in a movie?
• Which person directed a movie?
• Which movies did a person act in?
• How many users rated a movie?
• Who was the youngest person to act in a movie?
• Which role did a person play in a movie?
• Which is the highest rated movie in a particular year according to IMDb?
• Which drama movies did an actor act in?
• Which users gave a movie a rating of 5?
Define the purpose
When designing a graph data model for an application, you may need both a
data model and an instance model.
Data model
The data model describes the nodes and relationships in the domain and
includes labels, types, and properties.
Instance model
An instance model is a representation of the data that is stored and processed
in the actual model. You can use an instance model to test against your use
cases.
Define entities
An instance model helps you preview how the data will be stored as nodes, relationships, and
properties. The next step is to refine your model with more details.
Labels
The dominant nouns in your application use case are represented as nodes in your model and
can be used as node labels. For example:
Which person acted in a movie?
How many users rated a movie?
The nodes in your initial model are thus Person, Movie, and User
Node properties
MATCH (p:Person {name: 'Tom Hanks'})-[:ACTED_IN]-(m:Movie)
RETURN m
With these properties, it is easier to visualize what you need from the graph to answer
the use case questions.
Unique identifiers
In Cypher, it is possible to create two different nodes with the exact same data. However, from a data
management and model perspective, different nodes should contain different data. You can
use unique identifiers to make sure that every node is a separate and distinguished entity.
In the initial instance model, these are the properties set for the Movies nodes:
Movie.title (string)
Movie.tmdbID (integer)
Movie.released (date)
Movie.imdbRating (decimal between 0-10)
Movie.genres (list of strings)
And for the Person nodes:
Person.name (string)
Person.tmdbID (integer)
Person.born (date)
Relationships
Relationships are connections between nodes, and these connections are the verbs
in your use cases:
Which person acted in a movie?
Which person directed a movie?
At a glance, connections seem straightforward. To get started, thinking of
relationships from the perspective that “connections are verbs” works well, but
there are other important considerations that you will learn as you advance with
your model.
Naming
It is important to choose good names (types) for the relationships in the graph
and be as specific as possible in order to allow Neo4j to traverse only relevant
connections.
For example, instead of connecting two nodes with a generic relationship type
(e.g. CONNECTED_TO), prefer to be more specific and intuitive about the way
those entities connect.
For this sample, you could define relationships as:
ACTED_IN
DIRECTED
With these options, you can already plan the direction of the relationships.
Relationship direction
All relationships must have a direction. When created, relationships need to
specify their direction explicitly or have it inferred from the left-to-right order
of the pattern.
In the example use cases, the ACTED_IN relationship must be created to go
from a Person node to a Movie node:
Relationship properties
Properties for a relationship are used to enrich how two nodes are related. When you need to
know how two nodes are related and not just that they are related, you can use relationship
properties to further define the relationship.
The example question "Which role did a person play in a movie?" can be asked with the help
of the property roles in the ACTED_IN relationship:
Add more data
Now that you have created the first connections between the nodes, it’s time to add more
information to the graph. This way, you can answer more questions, such as:
How many users rated a movie?
Which users gave a movie a rating of 5?
To answer these questions, you need information about users and their ratings in your graph,
which means a change in your data model. Note that, with the addition of new data such as the
property roles in the ACTED_IN relationship, your initial data model has already been updated
along the way:
GRAPH ALGORITHMS
Graph algorithms are computational methods designed to process and analyze
data structured as a graph.
A graph is a data structure consisting of vertices (also known as nodes) and
edges that connect these vertices.
These algorithms are crucial for understanding relationships, paths, and patterns
within interconnected data.
RANDOM WALKS
• The first and most basic of these concepts are random walks.
• A random walk simply chooses a starting node from which to begin its walk and
then randomly traverses the graph for some amount of steps or “hops”.
• A random walk can be performed on any graph: directed or undirected,
weighted or unweighted, and even disconnected.
• As we will see, these random walks can be used to solve a number of problems
and are therefore the foundation for most graph algorithms.
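The idea above fits in a few lines of Python. This is an illustrative sketch (not from the slides): the graph is a plain adjacency dict, and the `friends` data and `seed` parameter are invented for the example.

```python
import random

def random_walk(graph, start, steps, seed=None):
    """Start at `start` and take up to `steps` random hops through `graph`
    (a dict mapping each node to a list of its neighbors)."""
    rng = random.Random(seed)          # seeded only for reproducibility
    path = [start]
    node = start
    for _ in range(steps):
        neighbors = graph.get(node, [])
        if not neighbors:              # dead end, e.g. a disconnected node
            break
        node = rng.choice(neighbors)
        path.append(node)
    return path

# Hypothetical social graph: each key is a person, values are who they know.
friends = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A"], "D": []}
walk = random_walk(friends, "A", steps=5, seed=42)
```

Every hop follows an existing edge, so the walk is always a valid path through the graph; this simple primitive underlies algorithms such as PageRank and node2vec.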
PATHFINDING & SEARCH ALGORITHMS
Another foundational family of graph algorithms is shortest path algorithms.
They typically come in two flavors, depending on the nature of the problem and
how you want to explore the graph to ultimately find the shortest path.
Depth First Search starts by traversing as deeply into the graph as possible
before returning to its starting point and pursuing another deep path traversal.
Breadth First Search keeps its traversals as close to the starting node as
possible and only ventures deeper into the graph when it has exhausted all
possible paths closest to it.
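The contrast between the two strategies can be sketched in Python (illustrative only; the graph `g` is invented, and swapping the queue for a stack is the only structural difference between the two functions):

```python
from collections import deque

def bfs(graph, start):
    """Breadth First Search: visit nodes level by level, closest first."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()             # FIFO: oldest discovery first
        order.append(node)
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order

def dfs(graph, start):
    """Depth First Search: follow each path as deep as possible first."""
    seen, order, stack = set(), [], [start]
    while stack:
        node = stack.pop()                 # LIFO: newest discovery first
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        # Push neighbors in reverse so the first-listed neighbor is explored first.
        for nxt in reversed(graph.get(node, [])):
            stack.append(nxt)
    return order

g = {"A": ["B", "C"], "B": ["D"], "C": [], "D": []}
```

On this graph, `bfs(g, "A")` visits A, B, C, D (level by level), while `dfs(g, "A")` visits A, B, D, C (deep before wide).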
Pathfinding is used in many use cases, perhaps most notably in Google Maps. In the
earliest days of GPS, Google Maps used pathfinding on a graph to calculate the
fastest route to arrive at a given destination. This is just one of many examples of
graphs being used to solve everyday problems for countless people.
CENTRALITY ALGORITHMS
Centrality Algorithms can be used to analyze a graph as a whole to
understand which nodes within that graph have the greatest impact on the
network.
However, to measure the influence of a node in the network with an
algorithm, we must first define what “impact” on a graph means.
This differs from algorithm to algorithm and is a great starting point when
trying to decide which centrality algorithm to choose.
Degree Centrality uses the degree of a node (its number of relationships) to measure how much of
an impact it is having on the graph.
Closeness Centrality uses the inverse farness (the sum of distances between a given node and all
other nodes) to understand how central the node is to the graph.
Betweenness Centrality uses shortest paths to determine which nodes serve as central ‘bridges’
across a graph, identifying key bottlenecks within a network.
PageRank uses a set of random walks to measure how influential a given node is to a network,
by measuring which nodes are more likely to be visited on a random walk. Note that PageRank
addresses the disconnected graph problem that random walks face by occasionally jumping to a
random point in the graph rather than making a direct hop. This allows the algorithm to explore
even disconnected portions of the graph. Named for Google co-founder Larry Page, PageRank was
developed as the backbone of the Google search engine and allowed it to exceed the performance
of all its competitors in the early days of the internet.
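The random-jump idea can be sketched as a simple power iteration in Python. This is a simplified illustration with an invented three-node link graph, not a production implementation:

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Iterative PageRank: with probability `damping` a walker follows a
    random outgoing edge; otherwise it teleports to a random node, which
    lets it escape dead ends and disconnected regions."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new = {v: (1.0 - damping) / n for v in nodes}   # teleport share
        for v in nodes:
            targets = graph[v]
            if targets:
                share = damping * rank[v] / len(targets)
                for w in targets:
                    new[w] += share
            else:  # dead end: spread its rank over every node
                for w in nodes:
                    new[w] += damping * rank[v] / n
        rank = new
    return rank

# Hypothetical link graph: A and B point at each other, B also points at C.
g = {"A": ["B"], "B": ["A", "C"], "C": ["A"]}
ranks = pagerank(g)
```

The ranks always sum to 1, and A ends up most influential because both B and C link to it.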
COMMUNITY DETECTION ALGORITHMS
Community detection is a common use case for a variety of graphs.
Typically it is used in any situation where understanding the distinct groups of
nodes within a graph offers some tangible value to the use case.
This could be anything from social networks to fleets of trucks making
deliveries to a network of accounts transacting with one another.
However, which algorithm you choose to discover these communities will
greatly impact how they’re grouped.
Triangle Count simply uses the principle that three nodes fully connected to one another (a
triangle) form the simplest community dynamic that can exist in a graph. It therefore finds every
combination of triangles within a graph to determine how those nodes are grouped together.
Strongly Connected Components and Connected Components are excellent algorithms for determining
the shape of your graph. Both aim to measure how many subgraphs make up the entirety of the data.
While Connected Components simply returns the completely disconnected subgraphs within a set of
nodes and edges, Strongly Connected Components returns the subgraphs of a directed graph in which
every node can reach every other node. Because of this, they are typically used in combination as
a form of initial exploratory data analysis when first analyzing graph data.
Louvain Modularity finds communities by comparing the density of clusters of nodes and edges to
the average for the network. If a group of nodes is more densely connected than the network
average, those nodes can be considered a community.
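The triangle-count idea is simple enough to sketch directly in Python. This is an illustrative brute force over a small invented graph; real implementations avoid checking every triple:

```python
from itertools import combinations

def triangle_count(graph):
    """Count node triples that are all connected to one another.
    `graph` is an undirected adjacency dict mapping node -> set of neighbors."""
    count = 0
    for a, b, c in combinations(graph, 3):
        # A triangle needs all three pairwise edges to exist.
        if b in graph[a] and c in graph[a] and c in graph[b]:
            count += 1
    return count

g = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B"},
}
```

Here A, B, and C form the only triangle, so `triangle_count(g)` returns 1; D hangs off the group rather than belonging to it.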
SHORTEST PATH
ALGORITHMS
The Shortest Path algorithms calculate the shortest path between a pair of
nodes. There are different algorithms that achieve this goal:
Dijkstra's Algorithm
Purpose: Find the shortest path from a single source to all other nodes in a
graph with non-negative edge weights.
Method: Uses a greedy approach with a priority queue to iteratively select the
nearest unvisited node.
Example Use Cases: GPS navigation, network routing, and game pathfinding.
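A minimal Dijkstra sketch in Python, using the standard-library heapq module as the priority queue (the `roads` graph is invented for illustration):

```python
import heapq

def dijkstra(graph, source):
    """Shortest distances from `source` when all edge weights are non-negative.
    `graph` maps each node to a list of (neighbor, weight) pairs."""
    dist = {source: 0}
    heap = [(0, source)]                       # priority queue of (distance, node)
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):   # stale entry, already improved
            continue
        for nxt, weight in graph.get(node, []):
            nd = d + weight
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return dist

roads = {"A": [("B", 4), ("C", 1)], "C": [("B", 2)], "B": [("D", 5)]}
shortest = dijkstra(roads, "A")
```

The greedy step is the heappop: the nearest unvisited node is always settled next, which is only safe because no negative weight can later shorten an already settled path.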
Bellman-Ford Algorithm
Purpose: Compute shortest paths from a single source in graphs that may contain
negative edge weights.
Method: Iteratively relax all edges up to (V − 1) times, where V is the number
of vertices.
Example Use Cases: Currency arbitrage detection, routing in networks with
variable costs, and analyzing economic models.
Floyd-Warshall Algorithm
Purpose: Find shortest paths between all pairs of nodes in a graph.
Method: Uses dynamic programming to update path lengths by considering all
possible intermediate nodes.
Example Use Cases: Traffic analysis, social network distance metrics, and
transitive closure in databases.
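Bellman-Ford's relax-all-edges method also fits in a few lines of Python (the edge list is invented for illustration; one extra pass after the V − 1 rounds detects a reachable negative cycle):

```python
def bellman_ford(edges, nodes, source):
    """Single-source shortest paths that tolerate negative edge weights.
    `edges` is a list of (u, v, weight) triples; `nodes` is the node set."""
    dist = {v: float("inf") for v in nodes}
    dist[source] = 0
    for _ in range(len(nodes) - 1):            # relax all edges V - 1 times
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    for u, v, w in edges:                      # one extra pass: detect cycles
        if dist[u] + w < dist[v]:
            raise ValueError("negative cycle reachable from source")
    return dist

edges = [("A", "B", 4), ("A", "C", 2), ("C", "B", -1), ("B", "D", 3)]
dist = bellman_ford(edges, {"A", "B", "C", "D"}, "A")
```

The negative edge C→B makes the route A→C→B (cost 1) beat the direct A→B edge (cost 4), exactly the situation Dijkstra's greedy settling cannot handle.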
GRAPH DATABASE INDEXING
It is the process of creating and maintaining indexes in a graph database for
faster querying and traversal. Indexes can be of different types, based on the
implementation:
Full-text search (FTS) indexing
Spatial indexing
Relationship indexing
FULL-TEXT SEARCH (FTS)
INDEXING
Full-text search (FTS) indexing enables very fast queries
over textual data stored in the nodes.
Text-based properties (like names, descriptions) are
indexed for efficient searching.
This allows users to perform flexible queries using
keywords, partial matches, or phrases.
Examples: Searching product descriptions in an e-
commerce graph, locating users by name or bio in a social
network graph
SPATIAL INDEXING
Spatial indexing enables fast querying of
location-based data stored in graph nodes or
relationships.
Spatial properties (like coordinates or
geometries) are indexed using tree-based data
structures (like R-trees) to support spatial
operations.
This allows users to perform efficient queries
based on location data.
Examples: Finding nearby restaurants in a
location graph, locating nearby friends in a
social graph.
RELATIONSHIP INDEXING
It speeds up queries that depend on specific types or
properties of relationships between nodes.
Indexes relationship types or attributes (like
timestamps, weights, or labels) to allow fast filtering
and traversal of the graph.
This allows users to quickly find relevant connections
based on the relationship properties.
Examples: Filtering recent interactions in social graphs, tracing
financial transactions in fraud detection.
#2 Graph databases excel in scenarios where relationships between data points hold significant value. Traditional relational databases struggle with complex relationships. Graph databases handle these efficiently. Social media platforms, recommendation engines, and fraud detection systems benefit greatly from graph databases. The ability to traverse relationships quickly makes graph databases indispensable in modern data management.
#5 1. Type
Neo4j → A pure graph database. It only focuses on storing and querying graph data (nodes, relationships, and properties).
ArangoDB → A multi-model database. It can work as a graph database, a document store (like MongoDB), and a key-value store—all in one system.
TigerGraph → A native parallel graph database. Built specifically for graph data with high performance and parallel processing.
2. Graph Model
Neo4j → Uses the property graph model (nodes, relationships, each can have properties).
ArangoDB → Supports property graph + document store + key-value. So, more flexible than Neo4j.
TigerGraph → Also property graph, but distributed and parallelized to handle huge datasets.
3. Query Language
Neo4j → Uses Cypher (like SQL but for graphs; very user-friendly). Also moving toward openCypher and GQL (Graph Query Language).
ArangoDB → Uses AQL (Arango Query Language), which can query both graphs and documents.
TigerGraph → Uses GSQL, which is powerful but more complex. Designed for analytics on big graph data.
4. Scalability
Neo4j → Works well for medium-to-large graphs, but when graphs get huge (billions of nodes/edges), performance can drop. Cluster scaling is possible but tricky.
ArangoDB → Scales decently because of its multi-model nature but is not the best for extremely large graphs.
TigerGraph → Excellent scalability. Designed for billions of edges and real-time queries at enterprise scale.
5. Strengths
Neo4j →
Rich ecosystem (visualization tools like Bloom, Graph Data Science library).
Easy to learn (Cypher is simple and intuitive).
Large community and support.
ArangoDB →
Very flexible (one DB for both graphs and documents).
Avoids the need for multiple DBs in hybrid use cases.
Open-source friendly.
TigerGraph →
Extremely fast for large-scale graph analytics (fraud detection, social networks, supply chain).
Built for parallel queries (handles big workloads well).
6. Weaknesses
Neo4j →
Performance struggles with very large graphs (hundreds of billions of edges).
Cluster setup is complex.
ArangoDB →
Graph functionality is not as advanced as Neo4j.
AQL has a steeper learning curve than Cypher.
TigerGraph →
Smaller community compared to Neo4j.
More enterprise-oriented, so licensing can be costly.
GSQL is harder to learn.
7. Best Use Cases
Neo4j →
Fraud detection
Recommendation engines
Supply chain visualization
Knowledge graphs
ArangoDB →
Scenarios needing both graph + document DB (e.g., product catalogs with relationships, metadata storage).
TigerGraph →
Heavy-duty analytics for very large datasets (telecom, finance, cybersecurity, supply chain risk).
8. Learning Curve
Neo4j → Easy (Cypher is intuitive, similar to SQL).
ArangoDB → Medium (AQL is flexible but harder than Cypher).
TigerGraph → Harder (GSQL is powerful but not beginner-friendly).
A multi-model database means the database supports more than one type of data model (instead of being limited to just one, like only graphs or only documents).
When we say multi-model graph database, it usually means:
It supports graph data (nodes + edges + properties).
It also supports other models such as:
Document model → JSON-like documents (like MongoDB).
Key-value model → simple key → value storage (like Redis).
Sometimes relational/tabular model too.
TigerGraph takes the property graph model and scales it for very large datasets by:
Distributed storage
Data is split across multiple servers/nodes (instead of sitting in one machine).
Example: if you have 1 billion users in a social network, the data is divided across servers to fit and manage better.
Parallel processing
Queries (like finding shortest paths, PageRank, fraud detection) are run in parallel across servers.
This makes it much faster than traditional single-machine graph databases.
#9
A declarative query language is a type of query language where you describe what you want from the database, not how to get it.
Cypher Overview
Cypher is a declarative query language → meaning you tell the database what you want (the result), not how to compute it.
It was created for graph databases (especially Neo4j).
Its syntax is designed to look a bit like natural language and SQL, so it’s easier to read and write.
Main Structure
Cypher queries usually follow 3 main clauses:
MATCH → Describe the graph pattern you want to find.Example: MATCH (p:Person)-[:FRIEND_OF]->(f:Person)
(Find people p and their friends f).
WHERE → Apply conditions/filters.
Example: WHERE p.age > 25
(Only friends where the person is older than 25).
RETURN → Choose what data to get back.
Example: RETURN p.name, f.name
(Return the names of both).
Key Features Explained
Pattern Matching
Cypher lets you describe graph patterns in a visual, ASCII-art-like style.
Example: (a)-[:LIKES]->(b) looks like a small diagram → “a likes b.”
This makes it intuitive to represent complex networks (like social networks, fraud detection, etc.).
Declarative Nature
You don’t worry about how to traverse the graph (step-by-step navigation).
You just say what pattern you want, and Neo4j figures out the best execution plan.
This makes Cypher beginner-friendly compared to imperative query languages (like Gremlin).
Wide Adoption
Cypher is the standard language for Neo4j, the most popular graph DB.
Because Neo4j has such a big ecosystem (tools, libraries, community), Cypher also became widely adopted and recognized in industry and academia.
#10 Graph Pattern Matching
GQL (Graph Query Language) Overview
GQL is a new database query language standard.
It was officially published in April 2024 as ISO/IEC 39075.
This is a big deal because it’s the first new international database query language standard in over 35 years (the last one was SQL).
It’s developed by the same ISO committee that made SQL → so it’s meant to become the “SQL for Graphs.”
Key Features Explained
1. Powerful Graph Pattern Matching (GPM)
At the heart of GQL is graph pattern matching.
This lets you easily describe and find complex graph structures (nodes, edges, relationships) using concise queries.
Example idea: Instead of writing long traversal logic, you can just express “Find all friends of friends of Alice who live in Paris” in a single query.
GPM makes it much easier for data analysis on networks like social graphs, fraud detection, recommendation systems, etc.
2. Rich Data Types
GQL supports a wide variety of data types, similar to SQL:
Character and byte strings → for storing text or binary data.
Numeric types → both integers (fixed-point) and decimals/floats (floating-point).
Nested data structures → supports arrays, lists, or nested objects directly inside graph nodes/edges.
This flexibility means you can store and query real-world complex data more naturally in graph form.
3. ISO Standard
GQL is now the official ISO/IEC standard for property graph query languages.
Why it matters:
Provides uniformity across vendors (Neo4j, TigerGraph, ArangoDB, etc. can align with one language).
Prevents fragmentation → developers won’t have to learn a new graph query language for each database.
Boosts industry adoption → just like SQL became universal for relational databases, GQL aims to become universal for graph databases.
Why GQL is Important
Before GQL, graph DBs used their own languages:
Neo4j → Cypher
ArangoDB → AQL
TigerGraph → GSQL
Gremlin (Apache TinkerPop)
This caused fragmentation → developers had to learn multiple languages.
With GQL, the idea is to unify the industry around one standardized language for property graph databases, similar to how SQL unified relational databases.
👉 In short:
GQL is the new international standard query language for property graphs, providing:
Easy graph pattern matching (for powerful analytics).
Support for rich, nested data types.
An ISO-certified standard, ensuring consistency and adoption across graph databases.
#11 imperative = "telling the computer how to do something step by step."
Gremlin Overview
Gremlin is a graph traversal language (part of Apache TinkerPop framework).
Unlike Cypher or GQL (which are declarative → you say what you want),
Gremlin is imperative/procedural → you specify how to move through the graph step by step.
Think of it like writing instructions for a robot explorer:
“Start here, move along edges of this type, filter out some nodes, transform the data, then aggregate results.”
Key Features Explained
1. Traversal Framework
Gremlin’s power is in traversals → walking through a graph step by step.
You can chain multiple steps together to filter, transform, and analyze data.
Example (conceptually):
“Start at Alice → move to all her friends → filter only those older than 30 → count them.”
This step-by-step approach makes it very flexible for complex graph algorithms (like shortest path, centrality, recommendations).
2. Imperative Style
With Gremlin, you control the traversal process.
You write procedural steps (like in programming) instead of just describing the end result.
Advantage → More control and flexibility.
Disadvantage → Harder to learn than declarative languages like Cypher.
3. Hybrid Capability (OLTP + OLAP)
OLTP (Online Transaction Processing): Real-time queries on graph data (like “Who are Alice’s friends?”).
OLAP (Online Analytical Processing): Large-scale analytics (like PageRank, community detection, shortest path over billions of nodes).
Gremlin supports both, making it good for both transactional and analytical workloads.
4. Multilingual Integration
Gremlin is not limited to one query interface.
It can be embedded in many programming languages → Java, Python, JavaScript, Groovy, Scala, etc.
This makes it popular with developers who want graph queries integrated directly in their application code.
Example (simplified)
Imagine you want to find names of Alice’s friends over 30:
In Cypher (declarative):
MATCH (a:Person {name: "Alice"})-[:FRIEND_OF]->(f:Person)
WHERE f.age > 30
RETURN f.name;
👉 You describe the pattern, Neo4j figures out how.
In Gremlin (imperative):
g.V().has("Person","name","Alice").out("FRIEND_OF").has("age", gt(30)).values("name")
👉 You explicitly say:
Find vertex with name = Alice →
Traverse outgoing FRIEND_OF edges →
Keep only those with age > 30 →
Return their names.
✅ Summary:
Gremlin = imperative, traversal-based, flexible, procedural, integrates with many languages, good for complex traversals + analytics.
Cypher/GQL = declarative, easier to read, good for pattern matching and querying.
#12 🔹 Meaning of the parts:
SELECT ?name
→ This tells SPARQL: I want to return the variable ?name as the result.
WHERE { ... }
→ Defines the pattern we are searching for in the graph.
?person foaf:knows ?friend .
?person = some person (a variable node).
foaf:knows = the FOAF (Friend of a Friend) property meaning "knows".
?friend = another person (a friend).
👉 Translation: Find all people and their friends.
?friend foaf:name ?name .
For each ?friend found above, look up their foaf:name.
Bind that name to the variable ?name.
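Putting the parts above together, the full query the slide is walking through would read as follows (assuming the standard `foaf` prefix declaration, which the slide introduces later):

```
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name
WHERE {
  ?person foaf:knows ?friend .
  ?friend foaf:name ?name .
}
```

The result is the name of every friend of every person in the data.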
SPARQL Overview
SPARQL (pronounced sparkle) is a query language for RDF data.
RDF (Resource Description Framework) is a standard from the W3C for representing knowledge graphs and linked data.
RDF organizes data into triples:
Subject – Predicate – Object
Example: (Alice) – [knows] → (Bob)
So, RDF data is essentially a graph:
Nodes = resources (people, places, things).
Edges = relationships (knows, livesIn, worksAt, etc.).
SPARQL is the main language for querying and reasoning over RDF graphs.
Key Features Explained
1. RDF Compatibility
SPARQL is purpose-built for RDF → it understands RDF triples and works directly with that structure.
You can query not just direct connections but also data spread across the web (because RDF supports linked data across multiple sources).
2. Semantic Web Focus
SPARQL is tightly connected to the Semantic Web vision (an extension of the web where data is machine-readable and connected).
It supports inference: it can use ontologies (like OWL vocabularies) to derive new facts from existing ones.
Example: If RDF says:
Alice worksAt CompanyX
CompanyX locatedIn Paris
SPARQL + reasoning could infer: Alice is located in Paris
This makes SPARQL great for knowledge graphs, linked open data, and reasoning engines.
3. SQL-like Syntax
SPARQL syntax looks similar to SQL → so relational database users can learn it quickly.
Queries work with graph patterns (like Cypher does), but expressed in triple form.
Example
Suppose we want to find all people Alice knows:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?friend
WHERE {
  ?person foaf:name "Alice" .
  ?person foaf:knows ?friend .
}
Explanation:
PREFIX → defines shorthand (foaf namespace for Friend-of-a-Friend ontology).
?person → variable for the subject node.
?friend → variable for the object node.
The WHERE block describes the graph pattern:
Find a person with name “Alice.”
Then find who she foaf:knows.
The result will return a list of Alice’s friends.
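To make the pattern-matching semantics concrete, here is a tiny hand-rolled triple matcher in Python. It only illustrates how the WHERE patterns bind variables and join on shared bindings; real SPARQL engines do far more (indexes, optimization, federation). The data is invented for the example.

```python
# Tiny hand-rolled triple matcher: shows how WHERE patterns bind variables.
# Not a real SPARQL engine; the triples below are made up for illustration.
triples = [
    ("alice", "foaf:name",  "Alice"),
    ("alice", "foaf:knows", "bob"),
    ("alice", "foaf:knows", "carol"),
    ("bob",   "foaf:name",  "Bob"),
]

def match(pattern, bindings):
    """Return bindings extended by each triple matching one pattern ('?x' = variable)."""
    out = []
    for triple in triples:
        b = dict(bindings)
        ok = True
        for pat, val in zip(pattern, triple):
            if pat.startswith("?"):
                if b.get(pat, val) != val:   # already bound to something else?
                    ok = False
                    break
                b[pat] = val                 # bind the variable
            elif pat != val:                 # constant must match exactly
                ok = False
                break
        if ok:
            out.append(b)
    return out

# WHERE { ?person foaf:name "Alice" . ?person foaf:knows ?friend . }
results = []
for b1 in match(("?person", "foaf:name", "Alice"), {}):
    results.extend(match(("?person", "foaf:knows", "?friend"), b1))
friends = [b["?friend"] for b in results]
print(friends)  # ['bob', 'carol']
```

The second pattern reuses the `?person` binding from the first, which is the join that makes graph patterns expressive.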
✅ Summary
SPARQL = standard query language for RDF knowledge graphs.
Strengths → works with linked data, semantic reasoning, and the Semantic Web vision.
Looks like SQL but designed for graph triples.
1. What is RDF?
RDF = Resource Description Framework
Think of it as a way to store information as triples:
Subject → Predicate → Object
Example:
Alice → knows → Bob
So RDF is like a graph:
Nodes = Alice, Bob
Edge/Relationship = knows
2. What is SPARQL?
SPARQL = a query language (like SQL for databases).
But instead of querying tables (rows/columns), you query graphs made of triples.
So if your data is stored as RDF triples, SPARQL is the tool to ask questions about it.
3. Why is SPARQL special?
RDF compatibility → It works directly with triples.
Semantic web focus → It can do reasoning (infer new facts, like "if Alice works at CompanyX and CompanyX is in Paris → then Alice is in Paris").
SQL-like syntax → It feels familiar if you know SQL.
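The "Alice works at CompanyX, CompanyX is in Paris" inference above can be sketched as a single hard-coded rule over a set of triples. This is a much-simplified illustration of the idea; real reasoners apply general OWL/RDFS semantics rather than one hand-written rule, and the facts here are invented.

```python
# Much-simplified sketch of rule-based inference over triples.
# Real reasoners use OWL/RDFS semantics; this hard-codes one rule.
facts = {
    ("Alice", "worksAt", "CompanyX"),
    ("CompanyX", "locatedIn", "Paris"),
}

# Rule: if ?p worksAt ?c and ?c locatedIn ?city, infer ?p locatedIn ?city.
inferred = {
    (p, "locatedIn", city)
    for (p, rel1, c) in facts if rel1 == "worksAt"
    for (c2, rel2, city) in facts if rel2 == "locatedIn" and c2 == c
}
print(inferred)  # {('Alice', 'locatedIn', 'Paris')}
```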
#15 Differentiation between a person who acted in a movie, who directed a movie, and who rated a movie.
What ratings were given, how many there are, and when they were submitted.
Which role an actor played in a movie and what their age is.
The genres of the movies.
#26 A graph database stores data as nodes (entities) and edges (relationships). Random walks are used to explore the graph probabilistically rather than exhaustively.
A random walk starts at a node and moves to a neighboring node at random, continuing for a number of steps.
The frequency of visiting nodes or the paths generated can reveal structural properties, similarities, and clusters.
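A minimal random walk is easy to sketch in Python: pick a start node, then repeatedly hop to a uniformly random neighbor. The adjacency list below is a made-up toy graph; a fixed seed keeps the walk reproducible.

```python
import random

# Toy adjacency list (invented graph) for a minimal random-walk sketch.
graph = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B"],
    "D": ["B"],
}

def random_walk(start, steps, rng):
    """Walk `steps` edges from `start`, choosing a neighbor uniformly at random."""
    node, path = start, [start]
    for _ in range(steps):
        node = rng.choice(graph[node])
        path.append(node)
    return path

rng = random.Random(42)        # fixed seed so the walk is reproducible
walk = random_walk("A", 5, rng)
print(walk)                    # a path of 6 nodes starting at "A"
```

Running many such walks and counting node visits (or feeding the paths to a skip-gram model, as DeepWalk does) is how the structural properties mentioned above are estimated.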
2. Applications in Modern Graph Databases
Node Embedding / Representation Learning
Algorithms like DeepWalk and Node2Vec use random walks to learn vector representations of nodes in a graph.