Introduction
Clustering is a multivariate analysis technique that groups similar objects (objects that are close in terms of distance) into the same group (cluster). Unlike supervised learning methods (for example, classification and regression), clustering does not use any label information; it simply uses the similarity between data features to group observations into clusters.
Clustering does not refer to a specific algorithm; rather, it is the process of creating groups based on a similarity measure. Clustering analysis uses unsupervised learning algorithms to create the clusters.
Clustering algorithms generally work on the simple principle of maximizing intracluster similarity and minimizing intercluster similarity. The measure of similarity determines how the clusters are formed.
Similarity characterizes the ratio of the number of attributes two objects share to the total number of attributes between them. Objects that have everything in common are identical and have a similarity of 1.0; objects that have nothing in common have a similarity of 0.0.
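As a tiny worked example of this ratio (the binary attributes below are invented purely for illustration), it can be computed directly in R:
# Similarity as the share of attributes two objects have in common (illustrative data).
a <- c(owns_car = 1, owns_home = 1, subscribed = 0, urban = 1)
b <- c(owns_car = 1, owns_home = 0, subscribed = 0, urban = 1)
shared <- sum(a == b)   # attributes with the same value: 3
total  <- length(a)     # total attributes compared: 4
shared / total          # similarity of 0.75; identical objects give 1.0, disjoint objects give 0.0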
Clustering is widely applied in business analysis. For example, a marketing department can use clustering to segment customers by personal attributes; as a result, different marketing campaigns targeting the various types of customers can be designed.
A clustering model is a notion used to signify what kind of clusters we are trying to identify. The four most common clustering models are hierarchical clustering, k-means clustering, model-based clustering, and density-based clustering.
A good clustering algorithm can be evaluated based on two primary objectives: high intracluster similarity and low intercluster similarity.
Density-based clustering uses the idea of density reachability and density connectivity (as an alternative to a plain distance measurement), which makes it very useful for discovering clusters of nonlinear shape. The method looks for areas with a higher density than the rest of the data. The most widely used method of this kind is Density-Based Spatial Clustering of Applications with Noise (DBSCAN).
DBSCAN is a fundamental data clustering technique for finding clusters of arbitrary shape as well as for detecting outliers.
Unlike K-Means, DBSCAN does not require the number of clusters as a parameter. Rather, it infers the number of clusters from the data, and it can discover clusters of arbitrary shape (for comparison, K-Means usually discovers spherical clusters). Partitioning methods (K-means, PAM clustering) and hierarchical clustering are suitable for finding spherical or convex clusters; in other words, they work well for compact, well-separated clusters. Moreover, they are severely affected by the presence of noise and outliers in the data.
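The following short sketch illustrates this difference on synthetic data (a ring surrounding a compact blob); the generated data and the eps and minPts values are chosen only for this toy example:
library(dbscan)
# Toy data: a ring around a blob -- a nonconvex structure that k-means splits incorrectly.
set.seed(42)
theta <- runif(300, 0, 2 * pi)
ring  <- cbind(3 * cos(theta), 3 * sin(theta)) + matrix(rnorm(600, sd = 0.15), ncol = 2)
blob  <- matrix(rnorm(200, sd = 0.3), ncol = 2)
x     <- rbind(ring, blob)
km_toy <- kmeans(x, centers = 2)            # k-means cuts the ring in half
db_toy <- dbscan(x, eps = 0.5, minPts = 5)  # DBSCAN recovers ring vs. blob
par(mfrow = c(1, 2))
plot(x, col = km_toy$cluster + 1, main = "k-means")
plot(x, col = db_toy$cluster + 1, main = "DBSCAN")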
The main advantages of density-based clustering are that it does not require the number of clusters to be specified in advance, it can discover clusters of arbitrary shape, and it is robust to noise and outliers.
The main disadvantages of density-based clustering are that it is sensitive to the choice of its parameters and that it struggles when clusters have widely differing densities.
This algorithm takes a parametric approach. The two parameters involved are e (eps), the radius of the neighborhood around a point, and MinPts, the minimum number of points that must lie within that radius for the point to be considered a core point.
One limitation of DBSCAN is that it is sensitive to the choice of e, in particular if clusters have different densities. If e is too small, sparser clusters will be defined as noise. If e is too large, denser clusters may be merged together.
Once these parameters are defined, the algorithm divides the data points into three types: core points (points with at least MinPts points within distance e), border points (points within distance e of a core point but with fewer than MinPts neighbors of their own), and noise points (points that are neither core nor border points), as sketched below.
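A minimal base-R sketch of this classification, assuming Euclidean distance (this only illustrates the definitions and is not the dbscan package's implementation; the eps and minPts values match the ones used for the iris data later on):
classify_points <- function(x, eps, minPts) {
  d <- as.matrix(dist(x))                     # pairwise Euclidean distances
  n_neighbors <- rowSums(d <= eps)            # each point counts itself, as DBSCAN does
  core   <- n_neighbors >= minPts             # core: at least minPts points within eps
  border <- !core & apply(d <= eps, 1, function(nb) any(core[nb]))  # non-core near a core point
  ifelse(core, "core", ifelse(border, "border", "noise"))
}
table(classify_points(iris[, -5], eps = 0.4, minPts = 4))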
Once the point types are defined, the steps in DBSCAN are simple: identify the core points, link core points that lie within distance e of one another into clusters, assign each border point to the cluster of a nearby core point, and label the remaining points as noise.
Modeling in R
In the following part, we demonstrate how to use DBSCAN to perform density-based clustering on the iris dataset. First, install and load the dbscan package:
# install.packages("dbscan") library(dbscan)
Determining the optimal value of e (eps)
First of all, we should define the optimal value of e. The method proposed here consists of computing the k-nearest neighbor distances in a matrix of points.
The idea is to calculate the average of the distances of every point to its k nearest neighbors. The value of k is specified by the user and corresponds to MinPts.
Next, these k-distances are plotted in ascending order. The aim is to determine the "knee", which corresponds to the optimal e parameter.
A knee corresponds to a threshold where a sharp change occurs along the k-distance curve.
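As a rough sketch of what this computation looks like in base R (assuming Euclidean distance; the kNNdistplot() function shown next does the equivalent work for you):
# For every point, average the distances to its k nearest neighbors,
# then sort the averages so the knee becomes visible when plotted.
k <- 4
d <- as.matrix(dist(iris[, -5]))
knn_avg <- apply(d, 1, function(row) mean(sort(row)[2:(k + 1)]))  # drop the self-distance of 0
plot(sort(knn_avg), type = "l", xlab = "Points sorted by distance", ylab = "Average 4-NN distance")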
The function kNNdistplot() can be used to draw the k-distance plot:
iris_matrix <- as.matrix(iris[, -5])   # drop the Species column
kNNdistplot(iris_matrix, k = 4)
abline(h = 0.4, col = "red")
It can be seen that the optimal e value is around a distance of 0.4.
set.seed(1234)
db <- dbscan(iris_matrix, eps = 0.4, minPts = 4)
db
The result shows DBSCAN has found four clusters and assigned 25 cases as noise/outliers.
Display the hull plot
hullplot(iris_matrix, db$cluster)
The hull plot shows that the separation is good; black points are noise/outliers. We can play around with the parameters to get more generalized or more specialized clusters: you can experiment with e (the neighborhood size) and MinPts (the minimum number of points needed for a core point), as sketched below.
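As a small sketch (the candidate eps values below are picked arbitrarily), you can loop over a few settings and see how the number of clusters and noise points changes:
for (eps in c(0.2, 0.4, 0.6, 0.8)) {
  res <- dbscan(iris_matrix, eps = eps, minPts = 4)
  cat("eps =", eps,
      "| clusters:", length(unique(res$cluster[res$cluster != 0])),
      "| noise points:", sum(res$cluster == 0), "\n")
}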
Compare the number of data points in each cluster against the species labels:
table(iris$Species, db$cluster)
Comparing clustering methods
After fitting data into clusters using different clustering methods, you may wish to measure the accuracy of the clustering. In most cases, you can use either intracluster or intercluster metrics as measurements: the higher the intercluster distance and the lower the intracluster distance, the better the clustering.
We now introduce how to compare different clustering methods using cluster.stats() from the fpc package.
Perform the following steps to compare clustering methods.
First, install and load the fpc package:
#install.packages("fpc") library("fpc")
Next, retrieve the cluster validation statistics of the clustering method:
cs = cluster.stats(dist(iris[1:4]), db$cluster)
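One caveat worth noting (our assumption, not part of the original recipe): DBSCAN labels noise points as cluster 0, and cluster.stats() treats every distinct label as a cluster, so the noise points effectively enter the statistics as a cluster of their own. If you prefer to exclude them, you can compute the statistics on the non-noise points only:
keep <- db$cluster != 0   # drop DBSCAN's noise points (labelled 0)
cs_no_noise <- cluster.stats(dist(iris[keep, 1:4]), db$cluster[keep])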
Most often, we focus on using within.cluster.ss and avg.silwidth to validate the clustering method. The within.cluster.ss measurement stands for the within-cluster sum of squares, and avg.silwidth represents the average silhouette width.
The within.cluster.ss measurement shows how closely related objects are within clusters; the smaller the value, the more closely related the objects within a cluster are. The avg.silwidth measurement considers both how closely related objects are within a cluster and how well the clusters are separated from each other. The silhouette value ranges from -1 to 1; a value closer to 1 suggests the data is better clustered.
cs[c("within.cluster.ss", "avg.silwidth")]
Finally, you can generate the cluster statistics of different clustering methods and list them in a table, as sketched below.
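For example, the following sketch fits a k-means solution (k = 3 is chosen here only because iris has three species) and arranges the two key statistics for both methods side by side; the sapply layout is just one way to build such a table:
set.seed(1234)
km <- kmeans(iris_matrix, centers = 3)
cs_km <- cluster.stats(dist(iris[1:4]), km$cluster)
# Arrange within.cluster.ss and avg.silwidth for both methods in one table.
sapply(list(kmeans = cs_km, dbscan = cs),
       function(s) unlist(s[c("within.cluster.ss", "avg.silwidth")]))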