This presentation introduces clustering analysis and the k-means clustering technique. It defines clustering as an unsupervised method to segment data into groups with similar traits. The presentation outlines different clustering types (hard vs soft), techniques (partitioning, hierarchical, etc.), and describes the k-means algorithm in detail through multiple steps. It discusses requirements for clustering, provides examples of applications, and reviews advantages and disadvantages of k-means clustering.
Welcome To My Presentation
On
Clustering Analysis
Submitted By
Ruhul Amin
Department of Statistics
Pabna University of Science & Technology
OUTLINE OF PRESENTATION
Clustering: basic concept
Types of clustering
Clustering techniques
K-means clustering
K-means clustering algorithm
Requirements
Applications
Advantages & Disadvantages
Conclusion
CLUSTERING: BASIC CONCEPT
CLUSTERING
Clustering is traditionally viewed as an unsupervised method for data analysis. Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to data points in other groups. In simple words, the aim is to segregate groups with similar traits and assign them into clusters. It is a main task of exploratory data mining and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics.
TYPES OF CLUSTERING
Broadly speaking, clustering can be divided into two subgroups:
HARD CLUSTERING:
In hard clustering, each data point either belongs to a cluster completely or does not belong at all.
For instance, we may want the algorithm to read all of the tweets and determine whether each tweet is positive or negative.
SOFT CLUSTERING:
In the soft clustering method, a data point does not belong completely to one cluster; instead, it can be a member of more than one cluster, with a set of membership coefficients corresponding to the probability of belonging to each cluster.
For instance, suppose you are attempting to forecast rating changes for the counterparties you trade with. The algorithm can create a cluster for each rating and indicate the likelihood that a counterparty belongs to each cluster.
TYPES OF CLUSTERING
Is clustering typically …?
A. Supervised
B. Unsupervised
CLUSTERING TECHNIQUES
A CATEGORIZATION OF MAJOR CLUSTERING METHODS
Partitioning Methods
Hierarchical Methods
Density-based Methods
Grid-based Methods
Model-based Methods
CLUSTERING TECHNIQUES
Partitional clustering decomposes a data set into a set of disjoint clusters.
Partitional (or partitioning) clustering methods are used to classify the observations within a data set into multiple groups based on their similarity. The algorithms require the analyst to specify the number of clusters K to be generated (N ≥ K). This presentation describes the most commonly used partitional method: k-means clustering.
K-MEANS CLUSTERING
K-means clustering (MacQueen, 1967) is a method commonly used to automatically partition a data set into k groups. It proceeds by selecting k initial cluster centroids.
K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K (N ≥ K). The algorithm works iteratively to assign each data point to one of K groups based on the features provided. Data points are clustered based on feature similarity. The results of the k-means clustering algorithm are:
1. the centroids of the K clusters, which can be used to label new data;
2. labels for the training data (each data point is assigned to a single cluster).
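Labeling new data with the learned centroids is a simple nearest-centroid lookup. A minimal sketch in Python; the centroid values and point coordinates here are hypothetical, not taken from the slides:

```python
def nearest_centroid(point, centroids):
    """Label a point by the index of its closest centroid (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centroids[i])),
    )

# Hypothetical centroids produced by a k-means run with K = 2:
centroids = [(1.0, 1.0), (4.0, 5.0)]
label = nearest_centroid((1.2, 0.8), centroids)  # → 0 (closer to (1.0, 1.0))
```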
K-MEANS CLUSTERING ALGORITHM
THE K-MEANS ALGORITHM IS COMPOSED OF 3 STEPS:
STEP 1: INITIALIZATION
The first thing k-means does is randomly choose K examples (data points) from the dataset as initial centroids, simply because it does not yet know where the center of each cluster is. (A centroid is the center of a cluster.)
STEP 2: CLUSTER ASSIGNMENT
Then, all the data points that are closest (most similar) to a centroid form a cluster. If we use the Euclidean distance between data points and centroids, the boundary between two clusters is the perpendicular bisector of the straight line drawn between the two centroids.
STEP 3: MOVE THE CENTROID
Now the new clusters need centers. A centroid's new value is the mean of all the examples (data points) in its cluster.
We keep repeating steps 2 and 3 until the centroids stop moving; in other words, until the k-means algorithm has converged.
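The three steps above can be sketched in a few lines of Python. This is an illustrative implementation assuming points stored as tuples; the function name and defaults are not from the slides:

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """A minimal k-means sketch: initialize, assign, move, repeat until convergence."""
    rng = random.Random(seed)
    # Step 1 (initialization): pick k data points at random as initial centroids.
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Step 2 (cluster assignment): each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Step 3 (move the centroid): each centroid becomes the mean of its cluster.
        new_centroids = [
            tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # centroids stopped moving: converged
            break
        centroids = new_centroids
    return centroids, clusters
```

For example, `kmeans([(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0)], 2)` returns the two final centroids and the corresponding clusters.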
K-MEANS CLUSTERING ALGORITHM
CLUSTER ANALYSIS – EXAMPLE
We will work through a real-number example of the well-known k-means clustering algorithm.
We will try to find clusters in the dataset below, consisting of 5 points.
K-MEANS CLUSTERING ALGORITHM
STEP 1: SET CLUSTER QUANTITY
The k-means algorithm requires you to set the number of clusters k beforehand. Here, we take k = 2 (the data look like there are two clusters – one on the bottom left and one on the top right).
STEP 2: ASSIGNMENT OF DATA POINTS
In the assignment step, each data point gets assigned to the nearest cluster centroid. The cluster centroids can be seen as centers of gravity within each cluster. To start with, we choose random points as centroids. Here, we take point A(1,1). Instead of taking actual data points, we could have taken completely random points as well.
To calculate the nearest cluster centroid for each data point, you need a distance measure. There is a large number of available metrics that do the job. We will work with the ordinary Euclidean distance.
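The ordinary Euclidean distance used for the assignment step can be computed as follows; the point (4, 5) is a hypothetical example, not one of the slide's data points:

```python
from math import sqrt

def euclidean(p, q):
    """Ordinary Euclidean distance between two points of equal dimension."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Distance from the initial centroid A(1, 1) to a hypothetical point (4, 5):
euclidean((1, 1), (4, 5))  # → 5.0 (sqrt(3² + 4²))
```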
K-MEANS CLUSTERING ALGORITHM
STEP 3: MOVE THE CENTROID
Now the new clusters need centers. A centroid's new value is the mean of all the examples in its cluster.
We keep repeating steps 2 and 3 until the centroids stop moving; in other words, until the k-means algorithm has converged.
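The slide's five-point dataset appears only as a figure, so the points below are hypothetical. With k = 2 and point A(1, 1) as one starting centroid, a single assignment-and-update round looks like this:

```python
# Hypothetical 5-point dataset (the original slide shows its data only as a figure).
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 9)]
centroids = [(1, 1), (9, 9)]  # starting centroids: point A(1, 1) and one other point

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

# Assignment step: each point joins the cluster of its nearest centroid.
clusters = [[], []]
for p in points:
    clusters[0 if sq_dist(p, centroids[0]) <= sq_dist(p, centroids[1]) else 1].append(p)

# Update step: each centroid moves to the mean of its cluster.
centroids = [tuple(sum(xs) / len(cl) for xs in zip(*cl)) for cl in clusters]
# clusters  → [[(1, 1), (1, 2), (2, 1)], [(8, 8), (9, 9)]]
# centroids → [(4/3, 4/3), (8.5, 8.5)]
```

Repeating the two steps with the new centroids changes no assignments here, so the algorithm has already converged.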
REQUIREMENTS
Requirements of clustering in data mining:
1. Scalability - we need highly scalable clustering algorithms to deal with large databases.
2. Ability to deal with different kinds of attributes - algorithms should be applicable to any kind of data, such as interval-based (numerical), categorical, and binary data.
3. Discovery of clusters with arbitrary shape - the clustering algorithm should be capable of detecting clusters of arbitrary shape. It should not be bound to distance measures that tend to find only spherical clusters of small size.
4. High dimensionality - the clustering algorithm should be able to handle not only low-dimensional data but also high-dimensional spaces.
5. Ability to deal with noisy data - databases contain noisy, missing, or erroneous data. Some algorithms are sensitive to such data and may produce poor-quality clusters.
6. Interpretability - the clustering results should be interpretable, comprehensible, and usable.
APPLICATIONS
HERE ARE 8 EXAMPLES OF CLUSTERING ALGORITHMS IN ACTION.
1. IDENTIFYING FAKE NEWS
How clustering works:
The algorithm works by taking in the content of the fake news article (the corpus), examining the words used, and then clustering them. These clusters are what help the algorithm determine which pieces are genuine and which are fake news. Certain words are found more commonly in sensationalized, click-bait articles. When you see a high percentage of specific terms in an article, there is a higher probability of the material being fake news.
2. SPAM FILTER
How clustering works:
k-means clustering techniques have proven to be an effective way of identifying spam. The filter works by looking at the different sections of the email (header, sender, and content). The data are then grouped together.
These groups can then be classified to identify which are spam. Including clustering in the classification process improves the accuracy of the filter to 97%. This is excellent news for people who want to be sure they're not missing out on their favorite newsletters and offers.
APPLICATIONS
3. ASTRONOMY:
It helps to find groups of similar stars and galaxies.
4. GENOMICS:
It can be used to derive plant and animal taxonomies, categorize genes with similar functionality and gain insight into structures inherent in
populations.
5. CLASSIFYING NETWORK TRAFFIC
How clustering works:
k-means clustering is used to group together characteristics of the traffic sources. When the clusters are created, you can then classify the traffic
types. The process is faster and more accurate than the previous autoclass method. By having precise information on traffic sources, you are able
to grow your site and plan capacity effectively.
6. IDENTIFYING FRAUDULENT OR CRIMINAL ACTIVITY
How clustering works:
By analysing the GPS logs, the algorithm is able to group similar behaviors. Based on the characteristics of the groups, you can then classify them into those that are legitimate and those that are fraudulent.
7. DOCUMENT ANALYSIS
How clustering works:
Hierarchical clustering has been used to solve this problem. The algorithm is able to look at the text and group it into different
themes. Using this technique, you can cluster and organize similar documents quickly using the characteristics identified in the
paragraph.
8. CALL DETAIL RECORD ANALYSIS
A call detail record (CDR) is the information captured by telecom companies during the call, SMS, and internet activity of a
customer.
K-means advantages and disadvantages
Advantages of k-means
Relatively simple to implement.
Scales to large data sets.
Guarantees convergence.
Can warm-start the positions of centroids.
Easily adapts to new examples (data points).
Generalizes to clusters of different shapes and sizes, such as elliptical clusters.
Disadvantages of k-means
Choosing k manually.
Being dependent on initial values. For a low k, you can mitigate this dependence by running k-means several times with different initial values and picking the best result. As k increases, you need advanced versions of k-means to pick better values for the initial centroids (called k-means seeding).
Clustering data of varying sizes and density.
Clustering outliers.
Scaling with the number of dimensions.
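The dependence on initial values noted above is commonly mitigated by restarting k-means several times and keeping the run with the lowest within-cluster sum of squares (inertia). A self-contained Python sketch of that restart strategy; all names and defaults are illustrative:

```python
import random

def kmeans_once(points, k, rng, iters=50):
    """One k-means run from a random initialization; returns (inertia, centroids)."""
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[i].append(p)
        centroids = [
            tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    # Inertia: total squared distance of each point to its nearest centroid.
    inertia = sum(
        min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids) for p in points
    )
    return inertia, centroids

def kmeans_restarts(points, k, n_init=10, seed=0):
    """Run k-means n_init times and keep the result with the lowest inertia."""
    rng = random.Random(seed)
    return min(kmeans_once(points, k, rng) for _ in range(n_init))
```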
CONCLUSION
The k-means algorithm is useful for undirected knowledge discovery and is relatively simple.
K-means has found widespread use in many fields, ranging from unsupervised learning of neural networks to pattern recognition, classification analysis, artificial intelligence, image processing, and many others.