Introduction
Clustering is a multivariate analysis used to group similar objects (close in terms of distance) together in the same group (cluster). Unlike supervised learning methods (for example, classification and regression), a clustering analysis does not use any label information, but simply uses the similarity between data features to group them into clusters.
Clustering does not refer to a specific algorithm; it is a process that creates groups based on a similarity measure. Cluster analysis uses unsupervised learning algorithms to create the clusters.
Clustering algorithms generally work on the simple principle of maximizing intracluster similarity and minimizing intercluster similarity. The measure of similarity determines how the clusters are formed.
Similarity is a characterization of the ratio of the number of attributes two objects share in common to the total number of attributes between them. Objects that have everything in common are identical and have a similarity of 1.0. Objects that have nothing in common have a similarity of 0.0.
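For instance, this idea can be expressed directly in R (an illustrative sketch with made-up attributes):
# Illustrative only: two objects described by the same four made-up attributes
a <- c(red = TRUE, round = TRUE, small = FALSE, sweet = TRUE)
b <- c(red = TRUE, round = FALSE, small = FALSE, sweet = TRUE)
sum(a == b) / length(a)  # shared attributes / total attributes = 0.75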
Clustering is widely applied in business analysis. For example, a marketing department can use clustering to segment customers by personal attributes. As a result, different marketing campaigns targeting various types of customers can be designed.
A clustering model is a notion used to signify what kind of clusters we are trying to identify. The four most common models of clustering methods are hierarchical clustering, k-means clustering, model-based clustering, and density-based clustering.
A good clustering algorithm can be evaluated based on two primary objectives: high intracluster similarity (cohesion within clusters) and low intercluster similarity (separation between clusters).
Hierarchical clustering adopts either an agglomerative or a divisive method to build a hierarchy of clusters. Regardless of which approach is adopted, both first use a distance (dissimilarity) measure to decide which clusters to combine or split. The recursive process continues until there is only one cluster left or until clusters can no longer be split. Eventually, we can use a dendrogram to represent the hierarchy of clusters.
Agglomerative methods use a recursive algorithm that follows these phases:
1. Assign each observation to its own cluster.
2. Compute the distances between all clusters.
3. Merge the two closest clusters into a single cluster.
4. Recompute the distances between the new cluster and the remaining clusters.
5. Repeat steps 3 and 4 until only one cluster remains.
In hierarchical clustering, there are two very important parameters: the distance metric and the linkage method.
Clustering distance metric. Defining closeness is a fundamental aspect: the dissimilarity measure determines which clusters are combined in the agglomerative method, or where a cluster is split in the divisive method. The main measures of distance are the Euclidean, Manhattan, maximum, Canberra, binary, and Minkowski distances (the measures supported by R's dist() function). You can read more about distance measures and their visualization here.
Linkage methods. The linkage methods determine how the distance between two clusters is defined; a linkage rule is necessary for calculating the inter-cluster distances. It is important to try several linkage methods and compare their results, since some methods may work better than others depending on the dataset. The most common linkage methods are (a short comparison sketch follows the list):
- Single linkage: the distance between two clusters is the minimum distance between their members.
- Complete linkage: the maximum distance between their members.
- Average linkage: the mean of all pairwise distances between members of the two clusters.
- Centroid linkage: the distance between the cluster centroids.
- Ward's method: the merge that minimizes the increase in total within-cluster variance.
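As a quick, illustrative comparison (the variable names below exist only in this sketch), the same Euclidean distance matrix can be fed to hclust() with different linkage methods to see how the resulting merge heights differ:
# Illustrative comparison of linkage methods on the iris measurements
d_iris <- dist(iris[, 1:4], method = "euclidean")
linkages <- c("single", "complete", "average", "centroid", "ward.D2")
fits <- lapply(linkages, function(m) hclust(d_iris, method = m))
names(fits) <- linkages
# The maximum merge height gives a rough feel for how each linkage scales
sapply(fits, function(h) max(h$height))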
Hierarchical clustering is thus based on the distance connectivity between observations of clusters, following the phases outlined above.
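To make these phases concrete before turning to the built-in tools, here is a minimal, illustrative single-linkage sketch on a small random matrix (everything in it is hypothetical; the hclust() function used in the next section does all of this, and much more, for you):
# Toy single-linkage agglomeration: merge the two closest clusters until one remains
set.seed(42)
x <- matrix(rnorm(10 * 2), ncol = 2)   # 10 observations with 2 features
d <- as.matrix(dist(x))                # pairwise Euclidean distances
clusters <- as.list(1:10)              # phase 1: every observation is its own cluster

# single linkage: distance between clusters = smallest pairwise member distance
cluster_dist <- function(a, b) min(d[a, b])

while (length(clusters) > 1) {
  pairs <- combn(length(clusters), 2)
  dists <- apply(pairs, 2, function(p) cluster_dist(clusters[[p[1]]], clusters[[p[2]]]))
  best  <- pairs[, which.min(dists)]   # the two closest clusters
  merged <- c(clusters[[best[1]]], clusters[[best[2]]])
  clusters <- c(clusters[-best], list(merged))
  cat(length(clusters), "clusters remain\n")
}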
Modeling in R
In R, we use the hclust() function for hierarchical cluster analysis; it is part of the stats package. To perform hierarchical clustering, the input data has to be in the form of a distance matrix.
Another important function used here is dist(). This function computes and returns the distance matrix obtained by applying the specified distance measure to the rows of a data matrix. By default, the Euclidean distance is used.
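For instance (purely illustrative), the distance matrix for the first five iris observations can be inspected directly, and another measure can be requested through the method argument:
# Distances between the first five iris observations
round(dist(iris[1:5, 1:4]), 2)                         # Euclidean by default
round(dist(iris[1:5, 1:4], method = "manhattan"), 2)   # another supported measure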
You can then examine the iris dataset structure:
str(iris)
Let's look at the parallel coordinates plot of the data.
Parallel coordinates are a common way of visualizing high-dimensional geometry and analyzing multivariate data. The lines show the relationship between the axes, much like scatterplots, and the patterns that the lines form indicate the relationship. You can also gather details about the relationships between the axes when you see the clustering of lines.
# get nice colors
library(colorspace)
iris2 <- iris[, -5]
species_labels <- iris[, 5]
species_col <- rev(rainbow_hcl(3))[as.numeric(species_labels)]
MASS::parcoord(iris2, col = species_col, var.label = TRUE, lwd = 2)
title("Parallel coordinates plot of the Iris data")
legend("top", cex = 1, legend = as.character(levels(species_labels)),
       fill = unique(species_col), horiz = TRUE)
We can see that the Setosa species are distinctly different from Versicolor and Virginica (they have lower petal length and width). But Versicolor and Virginica cannot easily be separated based on measurements of their sepal and petal width/length.
You can then use agglomerative hierarchical clustering to cluster the iris data:
iris_hc = hclust(dist(iris[,1:4], method="euclidean"), method="ward.D2")
We use the Euclidean distance as the distance metric and Ward's minimum variance method to perform the agglomerative clustering.
Lastly, you can use the plot function to plot the dendrogram:
plot(iris_hc, hang = -0.01, cex = 0.7)
We specify hang to display the labels at the bottom of the dendrogram and use cex to shrink the labels to 70 percent of their normal size.
In the dendrogram displayed above, each leaf corresponds to one observation. As we move up the tree, observations that are similar to each other are combined into branches. You can see more ways of visualizing dendrograms in R here or here.
In a dendrogram, we can see the hierarchy of clusters, but we have not grouped the data into different clusters yet. However, we can determine how many clusters the dendrogram suggests and cut it at a certain tree height to separate the data into different groups. We'll demonstrate how to use the cutree function to separate the data into a given number of clusters.
First, categorize the data into three groups:
fit = cutree(iris_hc, k = 3)
You can then examine the cluster labels for the data:
fit
Cross-tabulate the cluster assignments with the actual species to count the number of observations in each cluster:
table(fit, iris$Species)
Finally, you can visualize how the data is clustered by drawing red rectangle borders around the clusters on the dendrogram:
plot(iris_hc, hang = -0.01, cex = 0.7)
rect.hclust(iris_hc, k = 3, border = "red")
# rect.hclust(iris_hc, k = 3, border = 2:4)
An alternative way to produce dendrograms is to explicitly convert the hclust object into a dendrogram object.
# Convert hclust into a dendrogram and plot
hcd <- as.dendrogram(iris_hc)
# Default plot
plot(hcd, type = "rectangle", ylab = "Y")
# Define nodePar
nodePar <- list(lab.cex = 0.6, pch = c(NA, 19), cex = 0.7, col = "blue")
# Customized plot; remove labels
plot(hcd, ylab = "Y", nodePar = nodePar, leaflab = "none")
# Horizontal plot
plot(hcd, xlab = "Y", nodePar = nodePar, horiz = TRUE)
# Change edge color
plot(hcd, xlab = "Height", nodePar = nodePar, edgePar = list(col = 2:3, lwd = 2:1))
The R package ggplot2 has no functions to plot dendrograms. However, the add-on package ggdendro offers a decent solution.
# install.packages('ggdendro')
library(ggplot2)
library(ggdendro)
# basic option
ggdendrogram(iris_hc)
# another option
ggdendrogram(iris_hc, rotate = TRUE, size = 4, theme_dendro = FALSE, color = "tomato")
Using the function fviz_cluster() from the factoextra package, we can also visualize the result in a scatter plot. Observations are represented by points in the plot, using principal components, and a frame is drawn around each cluster.
# install.packages('factoextra')
library(factoextra)
fviz_cluster(list(data = iris[, 1:4], cluster = fit))
Comparing clustering methods
After fitting data into clusters using different clustering methods, you may wish to measure the quality of the clustering. In most cases, you can use either intracluster or intercluster metrics as measurements: the higher the intercluster distance and the lower the intracluster distance, the better the clustering.
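As a rough, illustrative cross-check of the intracluster idea, the within-cluster sum of squares for the earlier hierarchical fit can be computed by hand; assuming Euclidean distances, this should agree with the within.cluster.ss value reported by fpc below:
# Within-cluster sum of squares for the hierarchical clustering stored in fit
wss_by_cluster <- sapply(split(as.data.frame(iris[, 1:4]), fit),
                         function(g) sum(scale(g, scale = FALSE)^2))
sum(wss_by_cluster)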
We now introduce how to compare different clustering methods using cluster.stats() from the fpc package.
Perform the following steps to compare clustering methods.
First, install and load the fpc package:
# install.packages("fpc")
library("fpc")
You then need to use hierarchical clustering with the single linkage method to cluster the iris data and generate the object hc_single:
single_c = hclust(dist(iris[,1:4]), method="single")
hc_single = cutree(single_c, k=3)
Use hierarchical clustering with the complete linkage method to cluster the iris data and generate the object hc_complete:
complete_c = hclust(dist(iris[,1:4]), method="complete")
hc_complete = cutree(complete_c, k=3)
You can then use k-means clustering to cluster the iris data and generate the object km:
set.seed(1234)
km = kmeans(iris[1:4], 3, iter.max=1000, algorithm=c("Forgy"))
Next, retrieve the cluster validation statistics of a clustering method, for example the k-means result:
cs = cluster.stats(dist(iris[1:4]), km$cluster)
Most often, we focus on using within.cluster.ss and avg.silwidth to validate the clustering method. The within.cluster.ss measurement stands for the within-clusters sum of squares, and avg.silwidth represents the average silhouette width.
- within.cluster.ss shows how closely related objects are within clusters; the smaller the value, the more closely related the objects within a cluster are.
- avg.silwidth considers both how closely related objects are within a cluster and how well the clusters are separated from each other. The silhouette width ranges from -1 to 1; a value closer to 1 suggests the data is better clustered.
cs[c("within.cluster.ss","avg.silwidth")]
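As an optional cross-check (assuming the cluster package is available), the same average silhouette width can also be computed with silhouette():
# Cross-check avg.silwidth with the cluster package
library(cluster)
sil <- silhouette(km$cluster, dist(iris[1:4]))
mean(sil[, "sil_width"])   # should match cs$avg.silwidth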
Finally, we can generate the cluster statistics of each clustering method and list them in a table:
sapply(list(kmeans = km$cluster, hc_single = hc_single, hc_complete = hc_complete),
       function(c) cluster.stats(dist(iris[1:4]), c)[c("within.cluster.ss","avg.silwidth")])
Data preparation
Remove any missing values (i.e., NA values, where NA stands for "not available"):
df <- na.omit(iris[,1:4])
Next, we have to scale all the numeric variables. Scaling means each variable will now have a mean of zero and a standard deviation of one. Ideally, you want a unit in each coordinate to represent the same degree of difference; scaling makes the standard deviation the unit of measurement in each coordinate. This is done to prevent the clustering algorithm from depending on arbitrary variable units. You can scale/standardize the data using the R function scale:
data <- scale(df)
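A quick, illustrative check confirms that each scaled column now has a mean of (approximately) zero and a standard deviation of one:
# Each column should have mean ~0 and standard deviation 1 after scaling
round(colMeans(data), 6)
apply(data, 2, sd)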
Useful links