The document discusses various unsupervised learning techniques including clustering algorithms like k-means, k-medoids, hierarchical clustering and density-based clustering. It explains how k-means clustering works by selecting initial random centroids and iteratively reassigning data points to the closest centroid. The elbow method is described as a way to determine the optimal number of clusters k. The document also discusses how k-medoids clustering is more robust to outliers than k-means because it uses actual data points as cluster representatives rather than centroids.
Introduction of Dr. M.Pyingkodi from MCA at Kongu Engineering College, Erode.
Explains unsupervised learning, focusing on analyzing unlabelled data to find hidden insights.
Defines clustering and association analysis in unsupervised learning, focusing on data relationships.
Defines clustering as a task in unsupervised learning for grouping similar items into clusters.
Introduces major clustering techniques: partitioning, hierarchical, and density-based methods.
Describes partitioning clustering, including k-means and k-medoids algorithms for data grouping.
Details hierarchical clustering, including agglomerative and divisive methods, forming dendrograms.
Defines density-based clustering, focusing on cluster formation based on data point density.
Explains K-means algorithm functionality, grouping unlabeled data based on similarities and distances.
Outlines the iterative process in K-means, including centroid assignment and reassignment steps.
Walkthrough of K-means clustering with chosen centroids to demonstrate data point assignment.
Explains iterative updating of centroids by calculating distances to determine cluster membership.
Describes the iterative stages of K-means, calculating distances and updating clusters until stable.
Discusses methods to determine the optimal number of clusters in K-means, focusing on the elbow method.
Explains WCSS, a key metric for evaluating cluster quality and determining the ideal number of clusters.
Introduces BCSS and distortion metrics for assessing the spread and quality of clustering results.
Describes K-medoids as an alternative to K-means, focusing on representative points for cluster centers.
Mentions the Partitioning around Medoids (PAM) algorithm as a practical implementation of K-medoids.
Unsupervised learning
• Unsupervised learning is a machine learning concept where the unlabelled and unclassified information is analysed to discover hidden knowledge.
• The algorithms work on the data without any prior training.
Example:
• Movie promotions to the correct group of people.
• Earlier times: the same set of movies was shown to all visitors of the page.
• Now: based on their interests, we understand what type of movie is liked by which segment of the people.
Clustering and Association Analysis
• Cluster analysis finds the commonalities between the data objects and categorizes them as per the presence and absence of those commonalities. Clustering helps in segmenting the set of objects into groups of similar objects.
• Association: An association rule is an unsupervised learning method which is used for finding the relationships between variables/objects in a large database (dataset).
Clustering
• Clustering is defined as an unsupervised machine learning task that automatically divides the data into clusters, or groups of similar items.
Different types of clustering techniques
The major clustering techniques are
• Partitioning methods,
• Hierarchical methods, and
• Density-based methods.
Partitioning methods
• Partitional clustering divides data objects into non-overlapping groups. In other words, no object can
be a member of more than one cluster, and every cluster must have at least one object.
Two of the most important algorithms for partitioning-based clustering are k-means and k-medoids.
• In the k-means algorithm, the centroid is used as the prototype of a cluster; it is normally the mean of a group of points.
• Similarly, the k-medoids algorithm identifies the medoid, which is the most representative point for a group of points.
• These algorithms are both nondeterministic, meaning they could produce different results from two
separate runs even if the runs were based on the same input.
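To illustrate this nondeterminism, here is a minimal sketch (assuming scikit-learn and a small made-up dataset; not part of the original slides) showing that two runs of k-means with different random initialisations can produce different results:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points, for illustration only
X = np.array([[1, 1], [1.5, 2], [3, 4], [5, 7], [3.5, 5], [4.5, 5], [3.5, 4.5]])

# Two runs with different random initialisations may yield different clusterings
for seed in (0, 1):
    km = KMeans(n_clusters=2, init="random", n_init=1, random_state=seed).fit(X)
    print("seed", seed, "labels", km.labels_, "inertia", round(km.inertia_, 2))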
Hierarchical Clustering
Hierarchical clustering determines cluster assignments by building a hierarchy. This is implemented by either a
bottom-up or a top-down approach:
• Agglomerative clustering is the bottom-up approach. It merges the two points that are the most similar
until all points have been merged into a single cluster.
• Divisive clustering is the top-down approach. It starts with all points as one cluster and splits the least
similar clusters at each step until only single data points remain.
• These methods produce a tree-based hierarchy of points called a dendrogram.
• Hierarchical clustering is a deterministic process, meaning cluster assignments won't change when you run the algorithm twice on the same input data.
• As we have seen, K-means clustering has some challenges: it requires a predetermined number of clusters, and it always tries to create clusters of the same size. To address these two challenges, we can opt for the hierarchical clustering algorithm, because it does not require prior knowledge of the number of clusters.
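A rough sketch of the agglomerative (bottom-up) approach, assuming SciPy and a few made-up points; the dendrogram is the tree-based hierarchy mentioned above:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

# Hypothetical 2-D points, for illustration only
X = np.array([[1, 1], [1.5, 2], [3, 4], [5, 7], [3.5, 5], [4.5, 5]])

# Build the hierarchy bottom-up, merging the most similar groups first
Z = linkage(X, method="ward")

# Cut the tree into, e.g., 2 flat clusters (no need to fix K in advance)
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

# Plot the dendrogram (the tree-based hierarchy of merges)
dendrogram(Z)
plt.show()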
Density-Based Clustering
• Density-based clustering determines cluster assignments
based on the density of data points in a region.
• Clusters are assigned where there are high densities of data
points separated by low-density regions.
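As a small sketch of this idea (assuming scikit-learn and made-up points; DBSCAN is one common density-based algorithm, though the slides do not name it):

import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups of hypothetical points plus one isolated point
X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.3],
              [5.0, 5.0], [5.1, 4.9], [4.8, 5.2],
              [9.0, 9.0]])  # the last point lies in a low-density region

# eps: neighbourhood radius; min_samples: points required for a dense region
db = DBSCAN(eps=0.5, min_samples=2).fit(X)
print(db.labels_)  # points in low-density regions are labelled -1 (noise)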
Partitioning (K-means: a centroid-based technique)
• K-means segregates the unlabeled data into various groups, called clusters, based on similar features and common patterns.
• The principle of the k-means algorithm is to assign each of the ‘n’ data points to one of the K clusters, where ‘K’ is a user-defined parameter for the number of clusters desired.
• The objective is to maximize the homogeneity within the clusters
and also to maximize the differences between the clusters.
• The homogeneity and differences are measured in terms of the
distance between the objects or points in the data set.
• The K-means algorithm is an iterative algorithm that divides a group of n data points into k subgroups/clusters based on their similarity and their mean distance from the centroid of that particular subgroup.
• K here is the pre-defined number of clusters to be formed by the algorithm.
• If K=3, it means the number of clusters to be formed from the dataset is 3.
Step-1: Select the value of K, to decide the number of clusters to be formed.
Step-2: Select K random points which will act as centroids.
Step-3: Assign each data point, based on its distance from the randomly selected points (centroids), to the nearest/closest centroid; this forms the predefined clusters.
Step-4: Place a new centroid for each cluster (the mean of the points assigned to it).
Step-5: Repeat Step-3, reassigning each data point to the new closest centroid of its cluster.
Step-6: If any reassignment occurs, go to Step-4; otherwise go to Step-7.
Step-7: FINISH
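A minimal NumPy sketch of these steps (illustrative only; it uses a fixed iteration cap and does not handle empty clusters):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick K random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign every data point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: place the new centroid of each cluster at the mean of its points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Steps 5-7: stop once no centroid moves any more
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical data points, for illustration only
X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0], [3.5, 5.0]])
print(kmeans(X, k=2))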
• Let's take the number of clusters K=2, to identify the dataset and to put the points into different clusters. It means that here we will try to group these data points into two different clusters.
• We need to choose some K random points or centroids to form the clusters. These points can be either points from the dataset or any other points. So, here we are selecting the two points below as the K points, which are not part of our dataset.
• Now we will assign each data point of the scatter plot to its closest K-point or centroid. We will compute this by applying the mathematics we have studied for calculating the distance between two points. To do so, we will draw a median line between the two centroids.
• From the image, it is clear that the points to the left of the line are nearer to the K1 or blue centroid, and the points to the right of the line are closer to the yellow centroid. Let's color them blue and yellow for clear visualization.
• As we need to find the closest cluster, we will repeat the process by choosing new centroids. To choose the new centroids, we compute the centre of gravity (mean) of the points currently assigned to each cluster.
• Consider the below data set, which has the values of the data points.
• We can randomly choose two initial points as the centroids, and from there we can start calculating the distance of each point.
• For now, we will consider that D2 and D4 are the centroids.
• To start with, we calculate the distance using the Euclidean distance formula: for two points (x1, y1) and (x2, y2), distance = √((x1 − x2)² + (y1 − y2)²).
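A tiny sketch of this distance calculation (the coordinates here are made up, since the actual D1–D5 values appear only in the slide image):

import math

def euclidean(p, q):
    # Distance between points p = (x1, y1) and q = (x2, y2)
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

# Hypothetical example: distance from a data point to a candidate centroid
print(euclidean((2, 1), (3, 5)))  # ≈ 4.12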
Iteration 1:
• Step 1: We need to calculate the distance between the initial centroid points and the other data points. Below is the calculation of the distance of data point D1 from the initial centroids D2 and D4.
• After calculating the distances of all the data points, we get the values as below.
Step 2: Next, we need to group the data points that are closer to each centroid. Observing the above table, we can notice that D1 is closer to D4 as the distance is smaller. Hence we can say that D1 belongs to D4. Similarly, D3 and D5 belong to D2. After grouping, we need to calculate the mean of the grouped values from Table 1.
Cluster 1: (D1, D4) Cluster 2: (D2, D3, D5)
Step 3: Now, we calculate the mean values of the clusters created; these mean values become the new centroid values, and the centroids are moved along the graph.
From the above table, we can say the new centroid for cluster 1 is (2.0, 1.0) and for cluster 2 it is (2.67, 4.67).
Iteration 2:
Step 4: Again, the Euclidean distances are calculated from the new centroids. Below is the table of distances between the data points and the new centroids.
• We can now notice that the clusters have changed their data points. Cluster 1 now has the data objects D1, D2 and D4. Similarly, cluster 2 has D3 and D5.
Step 5: Calculate the mean values of the newly clustered groups from Table 1, as we did in Step 3. The table below shows the mean values.
Now we have the new centroid values as follows:
cluster 1 (D1, D2, D4) – (1.67, 1.67) and cluster 2 (D3, D5) – (3.5, 5.5)
This process has to be repeated until the centroid values stop changing, and the latest clusters are considered the final cluster solution.
Choosing the value of K:
• For a small data set, a rule of thumb that is sometimes followed is K ≈ √(n/2), where n is the number of data points.
• But unfortunately, this thumb rule does not work well for large data sets.
There are several statistical methods to arrive at the suitable number of clusters.
• To find the number of clusters in the data, we need to run the K-Means
clustering algorithm for different values of K and compare the results.
• We should choose the optimal value of K that gives us the best performance. There are different techniques available to find the optimal value of K.
• The most common technique is the elbow method, which is described below.
• Another effective approach is to apply the hierarchical clustering technique to sample points from the data set and then use the result to arrive at a suitable number of clusters K.
How to choose the value of "K number of clusters" in K-means clustering?
• The performance of the K-means clustering algorithm depends upon the efficiency of the clusters that it forms.
• Choosing the optimal number of clusters is a big task.
Elbow Method:
• The Elbow method is one of the most popular ways to find the optimal
number of clusters.
• This method uses the concept of WCSS value.
• WCSS stands for Within Cluster Sum of Squares, which defines the total
variations within a cluster.
The formula to calculate the value of WCSS (for 3 clusters) is given below:
WCSS = ∑(Pi in Cluster1) distance(Pi, C1)² + ∑(Pi in Cluster2) distance(Pi, C2)² + ∑(Pi in Cluster3) distance(Pi, C3)²
In the above formula of WCSS,
• ∑(Pi in Cluster1) distance(Pi, C1)² is the sum of the squares of the distances between each data point in cluster 1 and its centroid, and the same holds for the other two terms.
• To measure the distance between data points and centroid, we can use any
method such as Euclidean distance or Manhattan distance.
To find the optimal value of clusters, the elbow method follows the below steps:
• It executes K-means clustering on a given dataset for different K values (ranging from 1 to 10).
• For each value of K, it calculates the WCSS value.
• It plots a curve between the calculated WCSS values and the number of clusters K.
• The sharp point of bend, where the plot looks like an arm (the "elbow"), is considered the best value of K.
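A rough sketch of the elbow method (assuming scikit-learn, whose inertia_ attribute is the WCSS of the fitted clustering):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)  # hypothetical data, for illustration only

ks = range(1, 11)
wcss = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)  # inertia_ = within-cluster sum of squares (WCSS)

# Plot WCSS against K and look for the sharp bend (the "elbow")
plt.plot(list(ks), wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.show()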
• Between-Cluster Sum of Squares (BCSS) measures the squared average distance between all centroids. To calculate BCSS, you find the Euclidean distance from a given cluster centroid to all other cluster centroids. You then repeat this process for all of the clusters and sum all of the values together. This value is the BCSS. You can divide by the number of clusters to calculate the average BCSS.
• Essentially, BCSS measures the variation between all clusters. A large value can
indicate clusters that are spread out, while a small value can indicate clusters that
are close to each other.
• We iterate the values of k from 1 to 9 and calculate the distortion and inertia for each value of k in the given range.
• Distortion is the average of the squared distances from the points to the centroids of their respective clusters. Typically, the Euclidean distance metric is used.
• Inertia is the sum of squared distances of samples to their
closest cluster centre.
• The quality of clustering is measured using the SSE (sum of squared errors) technique.
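A short sketch (assuming scikit-learn and SciPy; the data are made up) computing inertia, distortion, and BCSS as described on these slides:

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

X = np.random.rand(100, 2)  # hypothetical data, for illustration only
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Inertia: sum of squared distances of samples to their closest cluster centre
inertia = km.inertia_

# Distortion: average of the squared distances to the closest cluster centre
d = cdist(X, km.cluster_centers_, metric="euclidean").min(axis=1)
distortion = np.mean(d ** 2)

# BCSS as described above: sum of squared distances between cluster centroids
C = km.cluster_centers_
bcss = sum(np.sum((C[i] - C[j]) ** 2)
           for i in range(len(C)) for j in range(len(C)) if i != j)

print(inertia, distortion, bcss)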
K-Medoids: a representative object-based technique
• The k-means algorithm is sensitive to outliers in the data set.
• Consider the values 1, 2, 3, 5, 9, 10, 11, and 25.
• Point 25 is the outlier, and it affects cluster formation negatively when the means of the points are used as centroids.
• Because the SSE of the second clustering is lower, k-means tends to put point 9 in the same cluster as 1, 2, 3, and 5, though the point is logically nearer to points 10 and 11.
• This skewedness is introduced due to the outlier point 25, which
shifts the mean away from the centre of the cluster.
k-medoids provides a solution to this problem.
• Instead of considering the mean of the data points in the cluster, k-medoids considers k representative data points from the existing points in the data set as the centres of the clusters.
• Note that the medoids in this case are actual data points or objects from the data set, and not imaginary points as in the case when the mean of the data points within a cluster is used as the centroid in the k-means technique. The SSE is calculated as
SSE = ∑(i = 1..k) ∑(p in Ci) dist(p, oi)²
where oi is the representative point or object of cluster Ci.
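To see the effect of the outlier numerically, here is a small sketch (illustrative only) comparing the mean and the medoid of the group containing the outlier 25:

import numpy as np

# The logically "upper" group from the example, including the outlier 25
cluster = np.array([9, 10, 11, 25])

# The mean (k-means centroid) is dragged away from 9, 10, 11 by the outlier
mean = cluster.mean()  # 13.75, not close to any actual data point

# The medoid is the actual point with the smallest total distance to the others
medoid = cluster[np.argmin([np.sum(np.abs(cluster - v)) for v in cluster])]

print("mean =", mean, " medoid =", medoid)  # the medoid stays at an actual point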
• One of the practical implementations of the k-medoids principle is the Partitioning Around Medoids (PAM) algorithm.
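As a usage sketch only (assuming the third-party scikit-learn-extra package, which provides a KMedoids estimator with a PAM option; this is not part of the slides):

import numpy as np
from sklearn_extra.cluster import KMedoids

# The one-dimensional example values from the outlier discussion above
X = np.array([1, 2, 3, 5, 9, 10, 11, 25]).reshape(-1, 1)

km = KMedoids(n_clusters=2, method="pam", random_state=0).fit(X)
print(km.labels_)           # cluster membership of each value
print(km.cluster_centers_)  # the medoids: actual points from the data set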