Clustering
Lecture 4: Density-based Methods
Jing Gao
SUNY Buffalo
1
Outline
• Basics
– Motivation, definition, evaluation
• Methods
– Partitional
– Hierarchical
– Density-based
– Mixture model
– Spectral methods
• Advanced topics
– Clustering ensemble
– Clustering in MapReduce
– Semi-supervised clustering, subspace clustering, co-clustering,
etc.
2
Density-based Clustering
• Basic idea
– Clusters are dense regions in the data space,
separated by regions of lower object density
– A cluster is defined as a maximal set of density-
connected points
– Discovers clusters of arbitrary shape
• Method
– DBSCAN
3
Density Definition
• ε-Neighborhood – Objects within a radius of ε from
an object.
• “High density” – ε-Neighborhood of an object contains
at least MinPts objects.
[Figure: ε-neighborhood of p and ε-neighborhood of q]
Density of p is “high” (MinPts = 4)
Density of q is “low” (MinPts = 4)
$N_\varepsilon(p) = \{\, q \mid d(p, q) \le \varepsilon \,\}$
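The neighborhood and density definitions can be sketched in a few lines of Python. This is only an illustration: the function names and the toy points are made up here, Euclidean distance is assumed for d, and a point is counted as a member of its own neighborhood (since d(p, p) = 0 ≤ ε).

```python
import math

def eps_neighborhood(points, p, eps):
    """Indices of all points q with d(points[p], q) <= eps (p itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[p], q) <= eps]

def is_high_density(points, p, eps, min_pts):
    """True if the eps-neighborhood of point p holds at least min_pts objects."""
    return len(eps_neighborhood(points, p, eps)) >= min_pts

# Hypothetical toy data: four points packed together plus one far away.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5), (5.0, 5.0)]
dense = is_high_density(points, 0, eps=1.0, min_pts=4)   # 4 objects within eps
sparse = is_high_density(points, 4, eps=1.0, min_pts=4)  # only the point itself
```

With MinPts = 4, the first point has "high" density and the isolated point has "low" density, mirroring p and q on the slide.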
4
Core, Border & Outlier
Given ε and MinPts,
categorize the objects into
three exclusive groups.
ε = 1 unit, MinPts = 5
Core
Border
Outlier
A point is a core point if it has at least
MinPts points within Eps (ε). Core points
lie in the interior of a cluster.
A border point has fewer than MinPts points
within Eps, but is in the neighborhood
of a core point.
A noise point (outlier) is any point that is
neither a core point nor a border point.
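A minimal sketch of this three-way categorization, assuming Euclidean distance and counting each point in its own neighborhood. The function name and the toy data are hypothetical; the parameters follow the slide (ε = 1 unit, MinPts = 5).

```python
import math

def classify_points(points, eps, min_pts):
    """Label every point as 'core', 'border', or 'outlier'."""
    n = len(points)
    # Neighborhood of each point: indices within distance eps (self included).
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbors]
    labels = []
    for i in range(n):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in neighbors[i]):
            labels.append("border")   # not core, but near a core point
        else:
            labels.append("outlier")  # neither core nor border
    return labels

# Hypothetical toy data: a center with four neighbors, one fringe point, one far point.
points = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1), (2, 0), (6, 6)]
labels = classify_points(points, eps=1.0, min_pts=5)
# labels == ['core', 'border', 'border', 'border', 'border', 'outlier', 'outlier']
```

Note that (2, 0) lies within ε of the border point (1, 0) but of no core point, so it is an outlier.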
5
Example
Original Points Point types: core,
border and outliers
ε = 10, MinPts = 4
6
Density-reachability
• Directly density-reachable
• An object q is directly density-reachable from object p
if p is a core object and q is in p’s ε-neighborhood.
[Figure: ε-neighborhoods of p and q]
• q is directly density-reachable from p
• p is not directly density-reachable from
q
• Density-reachability is asymmetric
MinPts = 4
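The asymmetry can be checked directly. The sketch below assumes Euclidean distance and self-inclusive neighborhoods; the function names and points are illustrative only, with MinPts = 4 as on the slide.

```python
import math

def neighborhood(points, i, eps):
    """Indices of points within distance eps of points[i] (self included)."""
    return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

def directly_density_reachable(points, q, p, eps, min_pts):
    """True if q is directly density-reachable from p:
    p must be a core object and q must lie in p's eps-neighborhood."""
    nb = neighborhood(points, p, eps)
    return len(nb) >= min_pts and q in nb

# Hypothetical toy data: index 0 plays the role of p (core), index 4 the role of q.
points = [(0, 0), (0.5, 0), (0, 0.5), (-0.5, 0), (1, 0)]
q_from_p = directly_density_reachable(points, q=4, p=0, eps=1.0, min_pts=4)  # True
p_from_q = directly_density_reachable(points, q=0, p=4, eps=1.0, min_pts=4)  # False
```

Here q is in p's neighborhood and p is core, but q has only three objects in its own neighborhood, so nothing is directly density-reachable from q.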
7
Density-reachability
• Density-Reachable (directly and indirectly):
– A point p is directly density-reachable from p2
– p2 is directly density-reachable from p1
– p1 is directly density-reachable from q
– p ← p2 ← p1 ← q form a chain
• p is (indirectly) density-reachable
from q
• q is not density-reachable from p
[Figure: chain of points q, p1, p2, p; MinPts = 7]
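Density-reachability is the transitive closure of direct density-reachability, so it can be tested with a breadth-first search that only expands core objects. This is a sketch under assumptions: Euclidean distance, self-inclusive neighborhoods, hypothetical function name, and toy data with MinPts = 3 rather than the figure's MinPts = 7.

```python
import math
from collections import deque

def density_reachable(points, target, source, eps, min_pts):
    """True if `target` is density-reachable from `source` via a chain of core objects."""
    n = len(points)

    def neighborhood(i):
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    if len(neighborhood(source)) < min_pts:
        return False  # the chain must start at a core object
    seen = {source}
    queue = deque([source])
    while queue:
        i = queue.popleft()
        nb = neighborhood(i)
        if len(nb) < min_pts:
            continue  # non-core objects can end a chain but not extend it
        for j in nb:
            if j not in seen:
                seen.add(j)
                queue.append(j)
    return target in seen

# Hypothetical toy data: five points on a line; the two end points are not core.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
p_from_q = density_reachable(points, target=4, source=1, eps=1.0, min_pts=3)  # True
q_from_p = density_reachable(points, target=1, source=4, eps=1.0, min_pts=3)  # False
```

The chain 1 → 2 → 3 → 4 reaches the end point, but no chain can start from it because it is not a core object.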
8
DBSCAN Algorithm: Example
• Parameter
• ε = 2 cm
• MinPts = 3
for each o ∈ D do
    if o is not yet classified then
        if o is a core-object then
            collect all objects density-reachable from o
            and assign them to a new cluster.
        else
            assign o to NOISE
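The loop above can be sketched in plain Python as follows. This is an illustrative implementation, not a reference one: it assumes Euclidean distance, self-inclusive neighborhoods, and a brute-force neighbor search, and the names `dbscan` and `NOISE` are chosen here. One detail the pseudocode leaves implicit is made explicit: an object first marked NOISE is relabeled if it later turns out to be a border point of some cluster.

```python
import math
from collections import deque

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    n = len(points)

    def neighborhood(i):
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    labels = [None] * n  # None means "not yet classified"
    cluster_id = 0
    for o in range(n):
        if labels[o] is not None:
            continue
        seeds = neighborhood(o)
        if len(seeds) < min_pts:
            labels[o] = NOISE  # may be relabeled as a border point later
            continue
        # o is a core object: collect everything density-reachable from it.
        labels[o] = cluster_id
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core object
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            nb = neighborhood(j)
            if len(nb) >= min_pts:
                queue.extend(nb)  # only core objects extend the cluster
        cluster_id += 1
    return labels

# Hypothetical toy data with the slide's parameters (eps = 2, MinPts = 3).
points = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10), (11, 10), (10, 11), (5, 20)]
labels = dbscan(points, eps=2.0, min_pts=3)
# labels == [0, 0, 0, 0, 1, 1, 1, -1]
```

The brute-force neighborhood search makes this quadratic in the number of points; a spatial index would be the usual replacement on larger data.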
9
DBSCAN: Sensitive to Parameters
12
DBSCAN: Determining Eps and MinPts
• For points in a cluster, the kth nearest
neighbors are at roughly the same distance
• Noise points have their kth nearest neighbor at a farther
distance
• So, plot the sorted distance of every point to its kth nearest
neighbor; a sharp bend in the curve suggests a value for Eps
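The values for such a plot can be computed as below. The sketch assumes Euclidean distance and a brute-force search, and it excludes each point from its own neighbor list; the function name and data are hypothetical. Plotting is left out: the returned list is the y-axis of the curve.

```python
import math

def sorted_k_distances(points, k):
    """Distance from every point to its k-th nearest neighbor, sorted ascending."""
    k_dists = []
    for i, p in enumerate(points):
        # Distances to all other points, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        k_dists.append(dists[k - 1])
    return sorted(k_dists)

# Hypothetical toy data: a unit square of cluster points plus one noise point.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
curve = sorted_k_distances(points, k=2)
# The four cluster points share a 2nd-nearest-neighbor distance of 1.0;
# the noise point sits far above them at the end of the curve.
```

On this data the jump between the fourth and fifth values is the bend that separates cluster points from noise.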
13
When DBSCAN Works Well
Original Points Clusters
• Resistant to Noise
• Can handle clusters of different shapes and sizes
14
When DBSCAN Does NOT Work Well
Original Points
(MinPts=4, Eps=9.92)
(MinPts=4, Eps=9.75)
• Cannot handle varying densities
• Sensitive to parameters: hard to
determine the correct set of
parameters
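A small constructed case shows the varying-density problem. The data below is hypothetical (not the figure's data): two dense groups sit close together and a third group is sparse. The compact `dbscan` helper is an illustrative sketch with Euclidean distance and self-inclusive neighborhoods.

```python
import math
from collections import deque

def dbscan(points, eps, min_pts):
    """Compact DBSCAN sketch: returns cluster ids, with -1 for noise."""
    n = len(points)
    nbrs = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
            for i in range(n)]
    labels, cid = [None] * n, 0
    for o in range(n):
        if labels[o] is not None:
            continue
        if len(nbrs[o]) < min_pts:
            labels[o] = -1  # noise for now; may become a border point
            continue
        labels[o] = cid
        queue = deque(nbrs[o])
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cid  # border point
            if labels[j] is not None:
                continue
            labels[j] = cid
            if len(nbrs[j]) >= min_pts:
                queue.extend(nbrs[j])
        cid += 1
    return labels

# Two dense groups (spacing 0.5, gap 2.0) and one sparse group (spacing 3.0).
xs = [0, 0.5, 1, 1.5, 3.5, 4, 4.5, 5, 20, 23, 26, 29]
points = [(x,) for x in xs]
small_eps = dbscan(points, eps=1.0, min_pts=3)  # sparse group becomes noise
large_eps = dbscan(points, eps=3.0, min_pts=3)  # dense groups merge into one
```

With MinPts = 3, the sparse group needs Eps ≥ 3 to form a cluster, while the dense groups stay separate only for Eps < 2, so no single Eps recovers all three groups.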
15
Take-away Message
• The basic idea of density-based clustering
• The two important parameters and the definitions of
neighborhood and density in DBSCAN
• Core, border and outlier points
• DBSCAN algorithm
• DBSCAN’s pros and cons
16
