Welcome To My Presentation
On
Clustering Analysis
Submitted By
Ruhul Amin
Department of Statistics
Pabna University of Science & Technology
Department of Statistics, Pabna University of Science & Technology
OUTLINE OF PRESENTATION
• Clustering: basic concept
• Types of clustering
• Clustering techniques
• K-means clustering
• K-means clustering algorithm
• Requirements
• Applications
• Advantages & Disadvantages
• Conclusion
CLUSTERING: BASIC CONCEPT
Clustering is traditionally viewed as an unsupervised method for data analysis. Clustering is the task of dividing
the population or data points into a number of groups such that data points in the same group are more similar
to other data points in the same group than to those in other groups. In simple words, the aim is to segregate
groups with similar traits and assign them to clusters. It is a main task of exploratory data mining and a
common technique for statistical data analysis, used in many fields, including machine learning, pattern
recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics.
TYPES OF CLUSTERING
Broadly speaking, clustering can be divided into two subgroups:
HARD CLUSTERING:
In hard clustering, each data point either belongs to a cluster completely or does not belong to it at all.
For instance, we may want the algorithm to read all of the tweets and determine whether each tweet is a positive or a negative
tweet.
SOFT CLUSTERING:
In soft clustering, a data point does not belong completely to one cluster. Instead, it can be a member of
more than one cluster, and it has a set of membership coefficients corresponding to the probability of being in each
cluster.
For instance, suppose you are attempting to forecast the rating changes for the counterparties you trade with. The
algorithm can create a cluster for each rating and indicate the likelihood that a counterparty belongs to each cluster.
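The difference between a hard label and a set of membership coefficients can be sketched in a few lines of Python. This is only an illustration: the slides do not name a soft clustering method, so the sketch borrows the membership formula of fuzzy c-means (with fuzzifier m = 2), and the function name `soft_memberships`, the two cluster centers, and the sample point are all hypothetical.

```python
import math

def soft_memberships(point, centers, m=2.0):
    """Fuzzy c-means style membership coefficients of one point (they sum to 1)."""
    dists = [math.dist(point, c) for c in centers]
    # A point lying exactly on a center belongs fully to that center.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]

# Hypothetical cluster centers and data point.
centers = [(0.0, 0.0), (4.0, 0.0)]
point = (1.0, 0.0)

# Hard clustering: the point gets exactly one label (the nearest center).
hard_label = min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))

# Soft clustering: the point gets one coefficient per cluster.
memberships = soft_memberships(point, centers)
print(hard_label, memberships)
```

Here the point sits three times closer to the first center than to the second, so it receives most, but not all, of its membership from the first cluster.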
TYPES OF CLUSTERING
Is clustering typically …?
A. Supervised
B. Unsupervised
CLUSTERING TECHNIQUES
A CATEGORIZATION OF MAJOR CLUSTERING METHODS
Partitioning Methods
Hierarchical Methods
Density-based Methods
Grid-based Methods
Model-based Methods
CLUSTERING TECHNIQUES
Partitional clustering decomposes a data set into a set of disjoint clusters.
Partitional (or partitioning) clustering methods
classify the observations in a data set into multiple groups based on their
similarity. The algorithms require the analyst to specify the number of clusters K to
be generated, which cannot exceed the number of observations N (N ≥ K). The most
commonly used partitional method is k-means clustering.
K-MEANS CLUSTERING
K-means clustering (MacQueen, 1967) is a method commonly used to automatically partition a data set
into k groups. It proceeds by selecting k initial cluster centers and then iteratively refining them.
K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e., data without defined
categories or groups). The goal of this algorithm is to find groups in the data, with the number of groups represented by the
variable K (N ≥ K). The algorithm works iteratively to assign each data point to one of K groups based on the features that are
provided. Data points are clustered based on feature similarity. The results of the k-means clustering algorithm are:
1. The centroids of the K clusters, which can be used to label new data.
2. Labels for the training data (each data point is assigned to a single cluster).
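How the centroids "can be used to label new data" can be sketched as follows: a new point simply receives the label of its nearest centroid. The function name `predict`, the centroid coordinates, and the new points are hypothetical values chosen for illustration, not results from the presentation.

```python
import math

def predict(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))

# Hypothetical centroids, as if produced by an earlier k-means run with K = 2.
centroids = [(1.5, 1.5), (8.0, 8.5)]

# New, previously unseen data points.
new_points = [(2.0, 1.0), (7.0, 9.0)]
labels = [predict(p, centroids) for p in new_points]
print(labels)
```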
K-MEANS CLUSTERING ALGORITHM
THE K-MEANS ALGORITHM IS COMPOSED OF 3 STEPS:
STEP 1: INITIALIZATION
The first thing k-means does is randomly choose K examples (data points) from the dataset as initial
centroids, simply because it does not yet know where the center of each cluster is. (A
centroid is the center of a cluster.)
STEP 2: CLUSTER ASSIGNMENT
Then, all the data points that are closest (most similar) to a given centroid form a cluster. If we are using
the Euclidean distance between data points and centroids, the boundary between two clusters is the
perpendicular bisector of the straight line joining their centroids.
STEP 3: MOVE THE CENTROID
Now we have new clusters, which need centers. A centroid's new value is the mean of all the
examples (data points) in its cluster.
We keep repeating steps 2 and 3 until the centroids stop moving; in other words, until the k-means algorithm has
converged.
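The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `kmeans`, the `max_iter` and `seed` parameters, the handling of empty clusters, and the six sample points are all choices made for this example.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Basic k-means; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Step 3: move each centroid to the mean of its assigned points.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members)))
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        # Stop when the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

Which cluster receives index 0 depends on the random initialization, so only the grouping of the points, not the label numbers, should be read as meaningful.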
K-MEANS CLUSTERING ALGORITHM
[Figure: four scatter-plot panels labelled Step 1, Step 2, Step 3, and Step 4; both axes run from 0 to 10.]
K-MEANS CLUSTERING ALGORITHM
CLUSTER ANALYSIS – EXAMPLE
We will work through a numerical example of the well-known k-means clustering algorithm.
We will try to find clusters in the dataset below, which consists of 5 points.
K-MEANS CLUSTERING ALGORITHM
STEP 1: SET CLUSTER QUANTITY
The k-means algorithm requires you to set the number of clusters k beforehand. Here, we take k = 2 (the data look like there
are two clusters: one at the bottom left and one at the top right).
STEP 2: ASSIGNMENT OF DATA POINTS
In the assignment step, each data point is assigned to the nearest cluster centroid. The cluster centroids can be seen as
centers of gravity within each cluster. To start with, we choose random points as centroids. Here, we take point A(1,1)
Instead of taking actual data points, we could have taken completely random points as well.
To find the nearest cluster centroid for each data point, you need a distance measure. A large number of
metrics are available for the job. We will work with the ordinary Euclidean distance.
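The assignment step with an explicit distance measure might look like the sketch below. The Euclidean distance is the one used in the text; the Manhattan distance is included only as one example of the other available metrics. The function names, the second centroid (5, 4), and the point (4, 3) are hypothetical, since the five points of the slide's dataset are not reproduced here.

```python
import math

def euclidean(p, q):
    """Ordinary Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Manhattan (city-block) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def nearest(point, centroids, metric=euclidean):
    """Index of the centroid closest to `point` under the chosen metric."""
    distances = [metric(point, c) for c in centroids]
    return distances.index(min(distances))

# A(1,1) comes from the text; the second centroid is a hypothetical choice.
centroids = [(1, 1), (5, 4)]
point = (4, 3)  # hypothetical data point

print(euclidean(point, centroids[0]), euclidean(point, centroids[1]))
print(nearest(point, centroids))             # assignment under Euclidean distance
print(nearest(point, centroids, manhattan))  # assignment under Manhattan distance
```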
K-MEANS CLUSTERING ALGORITHM
STEP 3: MOVE THE CENTROID
Now we have new clusters, which need centers. A centroid's new value is the mean of
all the examples in its cluster.
We keep repeating steps 2 and 3 until the centroids stop moving; in other words, until the k-means
algorithm has converged.
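The centroid update and the stopping test can be sketched as two small helpers. The names `centroid` and `has_converged`, the tolerance `tol`, and the sample cluster are hypothetical; the slide's own numbers are not available, so the values below were picked only to make the arithmetic easy to follow.

```python
import math

def centroid(cluster):
    """Mean of all points in a cluster, computed coordinate by coordinate."""
    n = len(cluster)
    return tuple(sum(coord) / n for coord in zip(*cluster))

def has_converged(old_centroids, new_centroids, tol=1e-9):
    """True when no centroid moved by more than `tol`."""
    return all(math.dist(o, n) <= tol
               for o, n in zip(old_centroids, new_centroids))

# Hypothetical cluster of three points.
cluster = [(1, 1), (2, 1), (3, 4)]
new_centroid = centroid(cluster)  # x: (1 + 2 + 3) / 3, y: (1 + 1 + 4) / 3
print(new_centroid)

# The centroid moved away from its old position (1, 1), so we would iterate again.
print(has_converged([(1, 1)], [new_centroid]))
```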
Requirements
Requirements of clustering in data mining:
1. Scalability - we need highly scalable clustering algorithms to deal with large databases.
2. Ability to deal with different kinds of attributes - algorithms should be applicable to any kind of data, such as
interval-based (numerical), categorical, and binary data.
3. Discovery of clusters with arbitrary shape - the clustering algorithm should be capable of detecting clusters of arbitrary
shape. It should not be limited to distance measures that tend to find spherical clusters of small size.
4. High dimensionality - the clustering algorithm should be able to handle not only low-dimensional data but also
high-dimensional spaces.
5. Ability to deal with noisy data - databases contain noisy, missing, or erroneous data. Some algorithms are sensitive to such
data and may produce poor-quality clusters.
6. Interpretability - the clustering results should be interpretable, comprehensible, and usable.
APPLICATIONS
HERE ARE 8 EXAMPLES OF CLUSTERING ALGORITHMS IN ACTION.
1. IDENTIFYING FAKE NEWS
How clustering works:
The algorithm takes in the content of the news articles (the corpus),
examines the words used, and then clusters them. These clusters help the algorithm
determine which pieces are genuine and which are fake news. Certain words are found more
commonly in sensationalized, click-bait articles. When an article contains a high percentage of such
terms, the probability that the material is fake news is higher.
2. SPAM FILTER
How clustering works:
K-means clustering techniques have proven to be an effective way of identifying spam. The
method looks at the different sections of the email (header, sender, and content). The
data is then grouped together.
These groups can then be classified to identify which are spam. Including clustering in the
classification process improves the accuracy of the filter to a reported 97%. This is excellent news for
people who want to be sure they are not missing out on their favorite newsletters and offers.
APPLICATIONS
3. ASTRONOMY
Clustering helps to find groups of similar stars and galaxies.
4. GENOMICS
Clustering can be used to derive plant and animal taxonomies, categorize genes with similar functionality, and gain insight into structures inherent in
populations.
5. CLASSIFYING NETWORK TRAFFIC
How clustering works:
K-means clustering is used to group together characteristics of the traffic sources. Once the clusters are created, you can classify the traffic
types. The process is faster and more accurate than the earlier AutoClass method. With precise information on traffic sources, you are able
to grow your site and plan capacity effectively.
6. IDENTIFYING FRAUDULENT OR CRIMINAL ACTIVITY
How clustering works:
By analysing GPS logs, the algorithm is able to group similar behaviors. Based on the characteristics of the groups, you are then able to
classify them into those that are genuine and those that are fraudulent.
7. DOCUMENT ANALYSIS
How clustering works:
Hierarchical clustering has been used to solve this problem. The algorithm looks at the text and groups it into different
themes. Using this technique, you can quickly cluster and organize similar documents using the characteristics identified in each
paragraph.
8. CALL DETAIL RECORD ANALYSIS
A call detail record (CDR) is the information captured by telecom companies during the call, SMS, and internet activity of a
customer.
K-means advantages and disadvantages
Advantages of k-means
Relatively simple to implement.
Scales to large data sets.
Guarantees convergence (to a local optimum, which is not necessarily the global one).
Can warm-start the positions of centroids.
Easily adapts to new examples (data points).
Can be generalized to clusters of different shapes and sizes, such as elliptical clusters.
Disadvantages of k-means
Choosing k manually.
Being dependent on initial values.
For a low k, you can mitigate this dependence by running k-means several times with different
initial values and picking the best result. As k increases, you need advanced versions of k-means to
pick better values of the initial centroids (called k-means seeding).
Clustering data of varying sizes and density.
Clustering outliers.
Scaling with number of dimensions.
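The mitigation described above for the dependence on initial values (run k-means several times and keep the best result) can be sketched as follows. "Best" is taken here to mean the lowest within-cluster sum of squared distances, which is an assumption since the slides do not define it. The function names `kmeans`, `inertia`, and `best_of`, the parameter `n_init`, and the sample points are hypothetical.

```python
import math
import random

def kmeans(points, k, rng, max_iter=100):
    """One k-means run from a random initialization drawn from `rng`."""
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Assumption: an empty cluster keeps its previous centroid.
            updated.append(tuple(sum(c) / len(members) for c in zip(*members))
                           if members else centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels

def inertia(points, centroids, labels):
    """Within-cluster sum of squared distances (lower is better)."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))

def best_of(points, k, n_init=10, seed=0):
    """Run k-means `n_init` times and keep the run with the lowest inertia."""
    rng = random.Random(seed)
    runs = [kmeans(points, k, rng) for _ in range(n_init)]
    return min(runs, key=lambda run: inertia(points, *run))

# Hypothetical data with three loose groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.9, 5.3),
          (9.0, 1.0), (9.1, 1.2), (8.8, 0.9)]
best_centroids, best_labels = best_of(points, k=3)
print(inertia(points, best_centroids, best_labels))
```

A single run can stop at a poor local optimum; taking the minimum over several runs can never do worse than the first run alone, at the cost of `n_init` times the computation.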
CONCLUSION
The k-means algorithm is useful for undirected knowledge discovery and is relatively simple.
K-means has found widespread use in many fields, ranging from unsupervised learning of neural
networks to pattern recognition, classification analysis, artificial intelligence, image processing, and many others.