K-means Clustering Algorithm with Matlab Source Code
19.1 The K-means Clustering Algorithm
K-means is a method of clustering observations into a specified number of disjoint clusters. The "K" refers to the number of clusters specified. Various distance measures exist to determine which observation is to be appended to which cluster. The algorithm aims at minimizing the measure between the centroid of the cluster and the given observation by iteratively appending an observation to a cluster and terminating when the lowest distance measure is achieved.
19.1.1 Overview of Algorithm
1. The sample space is initially partitioned into K clusters and the observations are randomly assigned to the clusters.
2. For each sample:
• Calculate the distance from the observation to the centroid of each cluster.
• IF the sample is closest to its own cluster THEN leave it ELSE move it to the closest cluster.
3. Repeat step 2 until no observations are moved from one cluster to another.
When step 3 terminates, the clusters are stable and each sample is assigned to the cluster that results in the lowest possible distance to the centroid of the cluster.
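As a minimal sketch of these three steps in MATLAB (illustrative code only, not the toolbox kmeans; the squared Euclidean distance is assumed and empty clusters are not handled):

% Minimal K-means sketch. Assumes: X is an M-by-D data matrix, K the
% number of clusters, and that no cluster becomes empty during iteration.
K = 3;
X = randn(200, 2);                      % example data
idx = randi(K, size(X, 1), 1);          % step 1: random initial assignment
changed = true;
while changed                           % step 3: repeat until stable
    C = zeros(K, size(X, 2));
    for k = 1:K
        C(k, :) = mean(X(idx == k, :), 1);    % centroid of cluster k
    end
    D = zeros(size(X, 1), K);
    for k = 1:K
        D(:, k) = sum((X - C(k, :)).^2, 2);   % squared distance to centroid k
    end
    [~, newIdx] = min(D, [], 2);        % step 2: move samples to closest centroid
    changed = any(newIdx ~= idx);
    idx = newIdx;
end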
19.2 Distance measures

Common distance measures include the Euclidean distance, the squared Euclidean distance, and the Manhattan or city distance.
The Euclidean measure corresponds to the shortest geometric distance between two points:
$d = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2}$   (19.1)
A faster way of determining the distance is by use of the squared Euclidean distance, which calculates the above distance squared, i.e.
$d_{sq} = \sum_{i=1}^{N} (x_i - y_i)^2$   (19.2)
The Manhattan measure calculates a distance between points based on a grid and is illustrated in Figure 19.1.
[Figure 19.1: Comparison between the Euclidean and the Manhattan measure. Panel titles: Euclidean measure, Manhattan measure.]
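As a concrete illustration (the example point values are made up for this sketch), the three measures can be computed directly in MATLAB:

x = [1 2 5];                        % two example points as row vectors
y = [4 0 1];
d_euclidean = sqrt(sum((x - y).^2)) % Euclidean distance, Eq. (19.1)
d_squared   = sum((x - y).^2)       % squared Euclidean distance, Eq. (19.2)
d_manhattan = sum(abs(x - y))       % Manhattan (city) distance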
19.3 Application of K-means

For applications in speech processing the squared Euclidean distance is widely used. K-means can be used to cluster the features extracted from speech signals, for instance mel-frequency cepstral coefficients or line spectrum pairs. This allows speech signals with similar spectral characteristics to be mapped to the same position in the codebook. In this way similar narrow-band signals will be predicted alike, thereby limiting the size of the codebook.
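To make the codebook idea concrete, here is a minimal sketch; the matrix name mfcc_feats, its dimensions, and the codebook size of 64 are assumptions for illustration, not values from the original text.

% Build a codebook by clustering MFCC feature vectors (stand-in data).
mfcc_feats = randn(1000, 13);        % assumed: 1000 frames, 13 MFCCs each
M = 64;                              % assumed codebook size
[idx, codebook] = kmeans(mfcc_feats, M, 'Distance', 'sqeuclidean', ...
    'Replicates', 3);
% Each frame is now represented by the index of its nearest codeword.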
19.4 Example of K-means Clustering

The following figures illustrate the K-means algorithm on a 2-dimensional data set. The data set consists of Gaussian white noise, which is partitioned into seven clusters with the centroids marked.
[Figure 19.4: The Silhouette diagram shows how well the data are separated. Axes: Silhouette Value (x) against Cluster 1-7 (y).]
The Silhouette diagram in Figure 19.4 shows how well the data are separated into the seven clusters. If the distance from one point to two centroids is the same, the point could belong to either cluster. The result is a conflict, which gives a negative value in the Silhouette diagram. The positive part of the Silhouette diagram shows that there is a clear separation of the points between the clusters.
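For reference, the silhouette value used here has a standard definition (it is not stated in the original text): with $a(i)$ the mean distance from point $i$ to the other points in its own cluster and $b(i)$ the smallest mean distance from point $i$ to the points of any other cluster,

$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$

so $s(i)$ lies in $[-1, 1]$ and is negative exactly when the point is, on average, closer to another cluster than to its own.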
19.5 Matlab Source Code
close all
clear all
clc

Limit = 20;

% Generate 800 samples of 2-dimensional Gaussian white noise.
X = [10*randn(400,2); 10*randn(400,2)];
plot(X(:,1), X(:,2), 'k.')
length(X(:,1))                    % number of samples
figure

% Keep only the samples within a radius of Limit from the origin.
k = 1;
for i = 1:length(X(:,1))
    if sqrt(X(i,1)^2 + X(i,2)^2) > Limit
        X(i,1) = 0;
        X(i,2) = 0;
    else
        Y(k,1) = X(i,1);
        Y(k,2) = X(i,2);
        k = k + 1;
    end
end
plot(Y(:,1), Y(:,2), 'k.')
figure

% Cluster the data into 7 clusters using the squared Euclidean distance
% and 5 replicates; empty clusters are replaced by singletons.
[cidx, ctrs] = kmeans(Y, 7, 'dist', 'sqEuclidean', 'rep', 5, ...
    'disp', 'final', 'EmptyAction', 'singleton');

% Plot each cluster in its own colour; mark the centroids with black crosses.
plot(Y(cidx==1,1), Y(cidx==1,2), 'r.', ...
     Y(cidx==2,1), Y(cidx==2,2), 'b.', ctrs(:,1), ctrs(:,2), 'kx');
hold on
plot(Y(cidx==3,1), Y(cidx==3,2), 'y.', Y(cidx==4,1), Y(cidx==4,2), 'g.');
hold on
plot(Y(cidx==5,1), Y(cidx==5,2), 'c.', Y(cidx==6,1), Y(cidx==6,2), 'm.');
hold on
plot(Y(cidx==7,1), Y(cidx==7,2), 'k.');

figure
% Silhouette values for the clustering; the mean silhouette summarizes quality.
[silh, h] = silhouette(Y, cidx, 'sqEuclidean');
mean(silh)
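A natural extension of the listing (not part of the original code) is to use the mean silhouette to compare candidate values of K; the following sketch reuses Y from above.

% Sweep candidate cluster counts; keep the K with the largest mean silhouette.
s = zeros(1, 10);
for K = 2:10
    cidx_K = kmeans(Y, K, 'dist', 'sqEuclidean', 'rep', 5);
    s(K) = mean(silhouette(Y, cidx_K, 'sqEuclidean'));
end
[~, bestK] = max(s(2:10));
bestK = bestK + 1                 % offset: s(2:10) starts at K = 2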