Nadar Saraswathi College of Arts and Science, Theni
Department of CS & IT

Density-Based Methods (Maximization / Outlier Analysis)

Presented by S. Vijayalakshmi, I M.Sc. (IT)
(Slides based on C. Eick, Topic 9: Clustering)
Density-Based Methods
 DBSCAN
 DENCLUE
Density-Based Clustering Methods
 Clustering based on density (a local cluster criterion), e.g. on density-connected points or on an explicitly constructed density function
 Major features:
 Discover clusters of arbitrary shape
 Handle noise
 One scan
 Need density parameters
 Several interesting studies:
 DBSCAN: Ester et al. (KDD’96)
 DENCLUE: Hinneburg & Keim (KDD’98/2006)
 OPTICS: Ankerst et al. (SIGMOD’99)
 CLIQUE: Agrawal et al. (SIGMOD’98)
DBSCAN
(http://www2.cs.uh.edu/~ceick/7363/Papers/dbscan.pdf )
 DBSCAN is a density-based algorithm.
 Density = number of points within a specified radius r (Eps)
 A point is a core point if it has more than a specified number of points (MinPts) within Eps; these are the points in the interior of a cluster
 A border point has fewer than MinPts within Eps, but lies in the neighborhood of a core point
 A noise point is any point that is neither a core point nor a border point (see the classification sketch below)
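
A minimal sketch of this three-way classification, assuming numpy points, Euclidean distance, and the common convention that a point's Eps-neighborhood includes the point itself (the function and parameter names are illustrative, not part of the original presentation):

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each row of X as 'core', 'border', or 'noise'."""
    n = len(X)
    # Pairwise Euclidean distances between all points.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Neighborhood counts; the point itself is included here.
    neighbor_counts = (dists <= eps).sum(axis=1)
    is_core = neighbor_counts >= min_pts
    labels = []
    for i in range(n):
        if is_core[i]:
            labels.append("core")
        elif is_core[dists[i] <= eps].any():  # some core point is within Eps
            labels.append("border")
        else:
            labels.append("noise")
    return labels
```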
DBSCAN: Core, Border, and Noise Points
DBSCAN Algorithm (simplified view for teaching)
1. Create a graph whose nodes are the points to be clustered
2. For each core point c create an edge from c to every point p in the Eps-neighborhood of c
3. Set N to the nodes of the graph
4. If N does not contain any core points, terminate
5. Pick a core point c in N
6. Let X be the set of nodes that can be reached from c by going forward;
   a. create a cluster containing X ∪ {c}
   b. N = N \ (X ∪ {c})
7. Continue with step 4
Remark: points that are not assigned to any cluster are outliers (see the sketch below).
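
A runnable sketch of this simplified view, assuming numpy points and Euclidean distance; the breadth-first expansion below plays the role of "going forward" from a core point in step 6:

```python
from collections import deque
import numpy as np

def dbscan_simplified(X, eps, min_pts):
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    is_core = (dists <= eps).sum(axis=1) >= min_pts
    labels = np.full(n, -1)              # -1 marks unassigned / outlier
    cluster_id = 0
    for c in range(n):
        if not is_core[c] or labels[c] != -1:
            continue                      # steps 4-5: pick an unassigned core point
        queue, labels[c] = deque([c]), cluster_id
        while queue:                      # step 6: reach everything from c
            p = queue.popleft()
            if not is_core[p]:
                continue                  # border points do not expand further
            for q in np.flatnonzero(dists[p] <= eps):
                if labels[q] == -1:
                    labels[q] = cluster_id
                    queue.append(q)
        cluster_id += 1
    return labels                         # points still labeled -1 are outliers
```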
DBSCAN: Core, Border and Noise Points
[Figure: original points (left) and point types: core, border, and noise (right); Eps = 10, MinPts = 4]
When DBSCAN Works Well
[Figure: original points (left) and the clusters DBSCAN finds (right)]
• Resistant to noise
• Can handle clusters of different shapes and sizes
When DBSCAN Does NOT Work Well
[Figure: original points and two parameter settings that both fail: (MinPts = 4, Eps = 9.75) and (MinPts = 4, Eps = 9.92)]
• Varying densities
• High-dimensional data
DBSCAN: Determining EPS and MinPts
 The idea is that, for points in a cluster, the kth nearest neighbors are at roughly the same distance
 Noise points have their kth nearest neighbor at a farther distance
 So, plot the sorted distance of every point to its kth nearest neighbor (a sketch of this plot follows below)
[Figure: sorted k-NN distance plot for k = MinPts = 4; the knee of the curve separates core points from non-core points and suggests a value for Eps]
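
A minimal sketch of how this plot can be produced, assuming scikit-learn and matplotlib are available and k = MinPts = 4 as in the slide:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

def k_distance_plot(X, k=4):
    # Distance of every point to its k-th nearest neighbor (excluding itself,
    # hence k + 1 neighbors requested: the first neighbor is the point itself).
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dists, _ = nn.kneighbors(X)
    kth = np.sort(dists[:, -1])
    plt.plot(kth)
    plt.xlabel("points sorted by k-NN distance")
    plt.ylabel(f"{k}-NN distance")
    plt.show()   # read Eps off at the 'knee' of this curve
```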
Complexity DBSCAN
 Time complexity: O(n²), since for each point it has to be determined whether it is a core point; this can be reduced to O(n log n) in lower-dimensional spaces by using efficient data structures (n is the number of objects to be clustered).
 Space complexity: O(n).
Summary DBSCAN
 Good: can detect clusters of arbitrary shape, is not very sensitive to noise, supports outlier detection, has acceptable complexity, and is, beside K-means, the second most widely used clustering algorithm.
 Bad: does not work well on high-dimensional datasets, parameter selection is tricky, it has problems identifying clusters of varying densities (the SNN algorithm addresses this), and its density estimation is rather simplistic (it does not create a real density function, but rather a graph of density-connected points).
DBSCAN Algorithm Revisited
 Eliminate noise points
 Perform clustering on the remaining points:
Skip!
DENCLUE
(http://www2.cs.uh.edu/~ceick/ML/Denclue2.pdf )
 DENsity-based CLUstEring by Hinneburg & Keim (KDD’98)
 Major features
 Solid mathematical foundation
 Good for data sets with large amounts of noise
 Allows a compact mathematical description of arbitrarily
shaped clusters in high-dimensional data sets
 Significantly faster than existing algorithms (reported to be faster than DBSCAN by a factor of up to 45, a claim the lecture itself flags with skepticism)
 But needs a large number of parameters
DENCLUE: Technical Essence
 Uses grid cells, but only keeps information about grid cells that actually contain data points, and manages these cells in a tree-based access structure (a minimal sketch of this bookkeeping follows below).
 Influence function: describes the impact of a data point within its neighborhood.
 Overall density of the data space can be calculated as the sum of the influence functions of all data points.
 Clusters can be determined using hill climbing by identifying density attractors; density attractors are local maxima of the overall density function.
 Objects that are associated with the same density attractor belong to the same cluster.
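
A minimal sketch of the occupied-grid-cell bookkeeping in the first bullet, assuming hypercube cells of side 2σ and a plain dictionary in place of the paper's tree-based access structure (the cell side and names are illustrative assumptions):

```python
from collections import defaultdict
import numpy as np

def build_occupied_cells(X, sigma):
    side = 2 * sigma
    cells = defaultdict(list)
    for i, x in enumerate(X):
        key = tuple(np.floor(x / side).astype(int))  # integer cell coordinates
        cells[key].append(i)                         # indices of points in the cell
    return cells                                     # empty cells never appear
```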
Gradient: the steepness of a slope

Example (Gaussian influence function, the resulting density function, and its gradient):

$$f_{\text{Gaussian}}(x,y) = e^{-\frac{d(x,y)^2}{2\sigma^2}}$$

$$f^{D}_{\text{Gaussian}}(x) = \sum_{i=1}^{N} e^{-\frac{d(x,x_i)^2}{2\sigma^2}}$$

$$\nabla f^{D}_{\text{Gaussian}}(x) = \sum_{i=1}^{N} (x_i - x)\, e^{-\frac{d(x,x_i)^2}{2\sigma^2}}$$
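
Below is a minimal Python/numpy transcription of these three formulas, assuming Euclidean distance for d; `sigma` is the smoothing parameter and the function names are illustrative:

```python
import numpy as np

def influence(x, y, sigma):
    # f_Gaussian(x, y): impact of point y on point x.
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def density(x, X, sigma):
    # f^D_Gaussian(x): sum of the influences of all data points on x.
    return sum(influence(x, xi, sigma) for xi in X)

def gradient(x, X, sigma):
    # Each data point pulls x toward itself, weighted by its influence.
    return sum((xi - x) * influence(x, xi, sigma) for xi in X)
```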
Example: Density Computation
D = {x1, x2, x3, x4}
$f^{D}_{\text{Gaussian}}(x) = \text{influence}(x,x_1) + \text{influence}(x,x_2) + \text{influence}(x,x_3) + \text{influence}(x,x_4) = 0.04 + 0.06 + 0.08 + 0.6 = 0.78$
[Figure: the four data points x1..x4 and a query point x, annotated with the four influence values above, plus a second query point y lying closer to the data]
Remark: the density value of y would be larger than the one for x
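
As a quick sanity check, the arithmetic in the example can be reproduced directly; the four influence values are taken from the figure above:

```python
# The density at x is just the sum of the four influence contributions.
influences = [0.04, 0.06, 0.08, 0.6]
print(round(sum(influences), 2))  # 0.78
```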
Density Attractor
[Figure: a density landscape with its density attractors and the data points drawn toward them]
Examples of DENCLUE Clusters
[Figure: examples of arbitrarily shaped clusters found by DENCLUE]
Basic Steps of the DENCLUE Algorithm
1. Determine density attractors
2. Associate data objects with density attractors using hill climbing
3. Possibly merge the initial clusters further, relying on a hierarchical clustering approach (optional; not covered in this lecture)
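
A minimal sketch of steps 1 and 2, reusing the `gradient` helper from the earlier sketch; the step size `delta`, the convergence tolerance, and the merge radius for deciding that two climbs reached the same attractor are illustrative choices, not the paper's exact procedure:

```python
import numpy as np

def denclue_cluster(X, sigma, delta=0.1, tol=1e-3, max_iter=100):
    attractors, labels = [], []
    for x in X:
        p = x.astype(float)
        for _ in range(max_iter):
            g = gradient(p, X, sigma)        # from the sketch above
            norm = np.linalg.norm(g)
            if norm < tol:                   # (near-)zero gradient: attractor
                break
            p = p + delta * g / norm         # normalized uphill step
        for j, a in enumerate(attractors):   # did we reach a known attractor?
            if np.linalg.norm(p - a) < 2 * delta:
                labels.append(j)
                break
        else:                                # new density attractor found
            attractors.append(p)
            labels.append(len(attractors) - 1)
    return np.array(labels), attractors
```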