By – AYUSH
Netaji Subhash Engineering College, Kolkata
Introduction
 The method of identifying similar groups of data in a
dataset is called clustering.
 It is one of the most popular techniques in data science.
 Entities within each group are more similar to one
another than to entities of other groups.
 In this presentation, I will take you through the
types of clustering, different clustering algorithms, and a
brief overview of two of the most commonly used clustering
methods:
Graph-Based Clustering and Density-Based Clustering.
Graph
Based
Clustering
Graph Theory :
 Graph Theory can be used to obtain detailed information
about the internal structure of a data set in terms of :
- cliques (subgraphs in which every pair of vertices is connected)
- clusters (highly connected groups of nodes)
- centrality (a measure of the importance of a node in the network)
- outliers (isolated or weakly connected nodes)
 Applications :
- Social Graphs (edges between people and the things they
interact with)
- Path Optimization Algorithms (Minimal Spanning Tree, Kruskal’s, Prim’s)
- GPS Navigation Systems (shortest path APIs)
GRAPH BASED CLUSTERING
 Graph-based clustering is a method for identifying
groups of similar cells or samples.
 It makes no prior assumptions about the clusters in the
data.
 This means the number, size, density, and shape of
clusters do not need to be known or assumed prior to
clustering.
 Consequently, graph-based clustering is useful for
identifying clustering in complex data sets such as
scRNA-seq.
IDEA :
• Graph-based clustering uses the proximity graph
– Start with the proximity matrix
– Consider each point as a node in a graph
– Each edge between two nodes has a weight which is the
proximity between the two points
– Initially the proximity graph is fully connected
– MIN (single-link) and MAX (complete-link) can be viewed as
starting with this graph
• In the simplest case, clusters are connected components in the graph.
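As a minimal Python sketch of this simplest case (the helper names are illustrative, and a user-supplied distance function is assumed): keep only edges whose weight is at most a threshold, then read the clusters off as connected components.

```python
from collections import deque

def threshold_clusters(points, dist, eps):
    """Cluster points as the connected components of the graph that
    keeps only edges whose distance is at most eps."""
    n = len(points)
    # Adjacency list: an edge exists wherever the pairwise distance <= eps.
    adj = [[j for j in range(n) if j != i and dist(points[i], points[j]) <= eps]
           for i in range(n)]
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        # Breadth-first search collects one connected component.
        comp, queue = [], deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        clusters.append(sorted(comp))
    return clusters
```

Raising the threshold merges components and lowering it splits them, which mirrors the behaviour of single-link (MIN) hierarchical clustering mentioned above.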
GRAPH CLUSTERING IDEA :
HIERARCHICAL METHOD :
1) Determine a minimal spanning tree (MST)
2) Delete branches iteratively
Each new connected component = a cluster
MINIMAL SPANNING TREE :
A minimal spanning tree of a connected graph G = (V,E) is a
connected subgraph with minimal weight that contains all nodes of
G and has no cycles.
Minimal Spanning Trees can be calculated with :-
 Prim’s Algorithm
- Prim's (also known as Jarník's) algorithm is a greedy algorithm that finds a
minimum spanning tree for a weighted undirected graph.
- This means it finds a subset of the edges that forms a tree that includes
every vertex, where the total weight of all the edges in the tree is
minimized.
 Kruskal’s Algorithm
- Kruskal's algorithm is a minimum-spanning-tree algorithm which finds an
edge of the least possible weight that connects any two trees in the forest.
- It is a greedy algorithm in graph theory: it finds a minimum spanning tree
for a connected weighted graph by adding edges in order of increasing cost at each step.
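A minimal Python sketch of Kruskal's algorithm using a union-find forest (the function names are illustrative, not from the slides):

```python
def kruskal_mst(n, edges):
    """Kruskal's algorithm: grow a forest by repeatedly adding the
    cheapest edge that joins two different trees.
    edges: list of (weight, u, v) tuples over vertices 0..n-1."""
    parent = list(range(n))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    for w, u, v in sorted(edges):      # consider edges in increasing-cost order
        ru, rv = find(u), find(v)
        if ru != rv:                   # edge connects two different trees: keep it
            parent[ru] = rv
            mst.append((w, u, v))
    return mst
```

For a connected graph with n vertices the result always contains exactly n - 1 edges.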
Branch Deletion
Delete Branches – Different Strategies :-
I. Delete the branch with maximum weight.
II. Delete inconsistent branches.
III. Delete by analysis of weights.
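Strategy I can be sketched directly (illustrative Python; assumes the MST is already given as weighted edges): deleting the k - 1 heaviest branches of the MST leaves k connected components, which are the clusters.

```python
def mst_clusters(n, mst_edges, k):
    """Cut an MST into k clusters by deleting its k-1 heaviest branches.
    mst_edges: list of (weight, u, v) forming a spanning tree of 0..n-1."""
    # Keep all but the k-1 heaviest edges.
    keep = sorted(mst_edges)[: len(mst_edges) - (k - 1)]

    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Union the endpoints of every surviving branch.
    for _, u, v in keep:
        parent[find(u)] = find(v)

    # Group the nodes by their component root.
    comps = {}
    for node in range(n):
        comps.setdefault(find(node), []).append(node)
    return sorted(comps.values())
```

Because an MST has no cycles, each deleted branch splits exactly one component in two, so k - 1 deletions yield exactly k clusters.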
SUMMARY :-
In graph-based clustering, objects are represented as
nodes in a complete or connected graph.
The distance between two objects is given by the weight
of the corresponding branch.
Hierarchical Method :
(1) Determine a minimal spanning tree (MST).
(2) Delete branches iteratively.
It also supports visualization of information in large datasets.
DENSITY
BASED
CLUSTERING
DBSCAN :
 Density-Based Spatial Clustering of Applications with Noise.
 It is one of the most cited clustering algorithms in the literature.
Features : -
• Spatial data
(geomarketing, tomography, satellite images)
• Discovery of clusters with arbitrary shape
(spherical, drawn out, linear, elongated)
• Good efficiency on large databases
(parallel programming)
• Only two parameters required.
• No prior knowledge of the number of clusters is required.
IDEA :
Clusters have a high density of points.
In areas of noise, the density is lower than in any of the
clusters.
Goal :
Formalize the notions of clusters and
noise.
Density-based cluster : definition
 Relies on a density-based notion of cluster: a cluster is defined as
a maximal set of density-connected points.
 A cluster C is a subset of D satisfying :
- For all p, q : if p is in C and q is density-reachable from p,
then q is also in C
- For all p, q in C : p is density-connected to q
DENSITY BASED CLUSTERING : DATA
● Two Parameters :
- Eps : maximum radius of the neighbourhood
- MinPts : minimum number of points in an Eps-neighbourhood of that point
● N_Eps(p) = { q ∈ D | dist(p, q) <= Eps }
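These two definitions translate into code directly. A small Python sketch (illustrative names; a user-supplied distance function is assumed) of the Eps-neighbourhood and of the core point test built on it:

```python
def eps_neighbourhood(D, p, eps, dist):
    """N_Eps(p) = {q in D | dist(p, q) <= eps}. Note that p belongs
    to its own neighbourhood, since dist(p, p) = 0."""
    return [q for q in range(len(D)) if dist(D[p], D[q]) <= eps]

def is_core(D, p, eps, min_pts, dist):
    """Core point condition: |N_Eps(p)| >= MinPts."""
    return len(eps_neighbourhood(D, p, eps, dist)) >= min_pts
```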
Problem :
 In each cluster there are two kinds of points :
- points inside the cluster (core points)
- points on the border (border points)
 An Eps-neighbourhood of a border point contains significantly fewer
points than an Eps-neighbourhood of a core point.
IDEA :
For every point p in a cluster C there is a point q ∈ C
such that
1) p is inside the Eps-neighbourhood of q and
2) N_Eps(q) contains at least MinPts points.
● Directly density-reachable : a point p is directly
density-reachable from a point q with regard to Eps and MinPts if
1) p ∈ N_Eps(q) (reachability)
2) |N_Eps(q)| >= MinPts (core point condition)
DEFINITION :
Density-reachable:
 A point p is density-reachable
from a point q wrt. Eps and
MinPts if there is a chain of
points p1, ..., pn with p1 = q and pn = p
such that each pi+1 is directly
density-reachable from pi.

Density-connected:
 A point p is density-connected
to a point q wrt. Eps and MinPts
if there is a point o such that
both p and q are density-reachable
from o wrt. Eps and MinPts.

DBSCAN (algorithm) :
Start with an arbitrary point p from the database and
retrieve all points density-reachable from p with regard to
Eps and MinPts.
If p is a core point, the procedure yields a cluster with
regard to Eps and MinPts, and the point is classified.
If p is a border point, no points are density-reachable
from p and DBSCAN visits the next unclassified point in
the database.
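The procedure above can be sketched in a few dozen lines of Python. This is an illustrative, unoptimised rendering of DBSCAN (the label conventions and helper names are my own; a user-supplied distance function is assumed):

```python
def dbscan(D, eps, min_pts, dist):
    """Minimal DBSCAN sketch. Returns labels[i] = cluster id (0, 1, ...)
    or -1 for noise. D is a list of points, dist a distance function."""
    NOISE, UNSEEN = -1, None
    labels = [UNSEEN] * len(D)

    def region(p):
        # N_Eps(p): all points within eps of p, including p itself.
        return [q for q in range(len(D)) if dist(D[p], D[q]) <= eps]

    cluster = 0
    for p in range(len(D)):
        if labels[p] is not UNSEEN:
            continue
        neighbours = region(p)
        if len(neighbours) < min_pts:
            labels[p] = NOISE          # border or noise for now
            continue
        labels[p] = cluster            # p is a core point: expand a cluster
        seeds = list(neighbours)
        while seeds:
            q = seeds.pop()
            if labels[q] == NOISE:     # noise reclaimed as a border point
                labels[q] = cluster
            if labels[q] is not UNSEEN:
                continue
            labels[q] = cluster
            q_neigh = region(q)
            if len(q_neigh) >= min_pts:   # q is also core: grow further
                seeds.extend(q_neigh)
        cluster += 1
    return labels
```

Border points are not expanded (only core points add new seeds), which is exactly why a border point on its own cannot start a cluster.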
Density-based clustering – application
CONCLUSION
Clustering is a descriptive technique.
The solution is not unique, and it depends strongly
upon the analyst's choices.
We described how different results can be combined
in order to obtain stable clusters that do not
depend too heavily on the criteria selected to
analyze the data.
Clustering always produces groups, even if there is no
group structure in the data.
REFERENCES :
 With thanks to Eric Kropat for his help.
 Wikipedia, Google searches

Graph and Density Based Clustering
