Introduction: supervised and unsupervised learning
Machine learning is one of the disciplines most frequently used in data mining, and it can be subdivided into two main tasks: supervised learning and unsupervised learning.
Supervised learning. This is a machine learning task carried out by a set of methods that infer a function from training data. Normally, the training data consists of a set of observations. Each observation has a number of variables called predictors, plus one variable that we want to predict, also known as the label or class. These labels or classes act as the teacher, because the models learn from them.
The ultimate aim of the function built by the model is to extrapolate its behavior to new observations, that is, to make predictions. The prediction is the output value of a supervised learning model and can be numeric, as in a regression problem, or a class, as in a classification problem.
Some common examples of supervised learning models include linear regression, logistic regression, decision trees, and support vector machines.
Unsupervised learning is a machine learning task that aims to describe the associations and patterns among a set of input variables. The fundamental difference from supervised learning is that the input data has no class labels: there is no variable to predict, so the methods instead try to find structure in the data through the relationships between variables.
In unsupervised learning, we can speak of two stages: modeling and exploitation.
In the modeling phase, we take the input data and apply feature selection or feature extraction techniques. Once we have defined the most suitable variables, we choose the unsupervised learning method best suited to the problem at hand.
After choosing the method, we build the model and run an iterative tuning process until we are satisfied with the results. In contrast to supervised learning, where the model's value comes mostly from prediction, in unsupervised learning the findings obtained during the modeling phase may be enough to fulfill the purpose, in which case the process stops there. For example, if the objective is to segment customers, then once the modeling phase is done we already have a picture of the existing groups, and that may be the whole goal of the analysis.
If the model is to be used afterwards, there is a second stage in which we exploit it: we receive new data, run the model we built on it, and obtain results.
Some common examples of unsupervised learning models include k-means clustering, hierarchical clustering, and principal component analysis.
Preprocessing and standardization of data
Identifying groups (clustering analysis) can uncover and help to explain patterns hidden in data, and it is frequently the answer to problems in many industries and contexts. Finding clusters can reveal relationships between the study variables, but not the relationships that these variables may have with a target variable.
Typically, the application of clustering techniques involves five phases
When data is well prepared, building models that answer our questions becomes easier. Since clustering models work with distances, they are especially sensitive to the data we feed them.
Rescaling data. Whatever model we use will make certain assumptions about the data. For clustering models, it is very important that the different numerical variables are expressed on a similar scale. If the units differ widely, the calculated distances will not give proper results. For example, if we are analyzing a group of people and we have, among other data, the subjects' dollar income and age, a variable such as income can overshadow the age variable. For the analysis, 20 years of age may matter more than $10,000 of income, yet when both are compared through distances, a clustering model may underestimate the importance of age.
To mitigate this scale problem, we can use standardization methods. Standardization is the process of adjusting the data to a specific range, for example between 0 and 1 or between –1 and 1.
There are several techniques for normalization
The recenter method performs a standard z-score transformation: the variable's mean is subtracted from each value, and each value is then divided by the standard deviation. The resulting variable has a mean of 0 and a standard deviation of 1.
After the transformation, the variables are no longer expressed in their original scales; for example, age no longer represents a number of years. As a quick test, we can confirm that the transformation does not affect the relationships between the variables:
ages = sample(12:100, 20, replace=T)
incomes = sample(500:10000, 20, replace=T)
data = cbind(ages, incomes)
scaled = scale(data)
# comparing correlations
# data before transformation
cor(data)
# data after transformation
cor(scaled)
Another transformation method is Scale [0-1]: it simply rescales the original values into the 0-1 range. This is done by subtracting the variable's minimum value from each observation and then dividing by the difference between the maximum and the minimum values.
If we want to implement this transformation method, we can use the reshape package, in particular the rescaler function
# install.packages("reshape") library("reshape") scaled = rescaler(data, "range") head(scaled)
Median/MAD is a robust version of the standard recenter transform. The median value is subtracted from each value, and each is then divided by the median absolute deviation. The resulting variable will have a median of 0.
The function is included in the reshape package and its use is very similar to the previous example
# install.packages("reshape") library("reshape") scaled = rescaler(data, "robust") head(scaled)
Clustering analysis
Clustering is based on the concepts of similarity and distance, where proximity is determined by a distance function. It generates clusters in which each group consists of individuals that share common features.
Clustering is used for knowledge discovery rather than prediction. It provides an insight into the natural groupings found within data.
Cluster analysis is similar to classification models, with the difference that the groups are not preset by labels. The goal is to partition the data into clusters, which may or may not be disjoint.
A clustering model is the notion of what kind of clusters we are trying to identify. The four most common clustering models are hierarchical clustering, k-means clustering, model-based clustering, and density-based clustering:
A good clustering algorithm can be evaluated based on two primary objectives: high similarity within each cluster and low similarity between clusters.
Clustering tendency
Most clustering algorithms will always produce a clustering, even if the data contains no cluster structure. It is therefore good practice to check clustering tendency before attempting to cluster the data.
First, prepare a distance matrix for the iris dataset:
#install.packages("seriation") library(seriation) d_iris <- dist(iris[,-5])
Visual Analysis for Cluster Tendency Assessment (VAT) reorders the objects to show potential clustering tendency as a block structure (dark blocks along the main diagonal).
VAT(d_iris)
iVAT uses the largest distances in all possible paths between the objects, instead of the direct distances, to make the block structure more visible.
iVAT(d_iris)
Both plots show a strong cluster structure with 2 clusters.
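As a numeric complement to the VAT/iVAT plots, the Hopkins statistic quantifies clustering tendency; values far from 0.5 indicate a non-uniform, i.e. clusterable, structure. A minimal optional check, using get_clust_tendency() from the factoextra package (also used later in this section):

# install.packages("factoextra")
library(factoextra)
# Hopkins statistic for the numeric iris variables
get_clust_tendency(iris[,-5], n = 50, graph = FALSE)$hopkins_stat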
K-Means algorithm
There are many clustering methods; the most popular are based on hierarchical classification and on dynamic clouds, i.e., K-Means.
In very general terms, the K-Means algorithm aims to partition a set of observations into k clusters so that each observation belongs to the cluster with the nearest mean. In other words, it looks for k groups that are as dissimilar from each other as possible.
Since the mean is used to estimate the centroid, the algorithm is not immune to extreme observations or outliers. Hence, it is advisable to check for outliers before running k-means clustering. We can use the boxplot method to identify any outliers in our dataset.
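For example, a minimal outlier check on the numeric iris variables used throughout this section:

# boxplots of the four numeric iris variables; points beyond the whiskers are potential outliers
boxplot(iris[1:4], main="Outlier check for the iris measurements")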
You can watch the videos "How K-Means algorithm works" or "K-means Algorithm Demo".
Main features of K-Means algorithm:
In its simplest form, the algorithm has two steps: an assignment step, in which each observation is assigned to the cluster with the nearest center, and an update step, in which each center is recomputed as the mean of the observations assigned to it.
These two steps are executed iteratively until the assignments in two consecutive iterations do not change, meaning that a local or (not always guaranteed) global optimum has been reached.
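To make the two steps concrete, here is a minimal, illustrative sketch in R (not the implementation used by the kmeans() function; it assumes no cluster becomes empty during the iterations):

set.seed(42)
X = as.matrix(iris[1:4])
k = 3
# start from k randomly chosen observations (Forgy-style start)
centers = X[sample(nrow(X), k), ]
repeat {
  # assignment step: assign each observation to the nearest center
  d = as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]
  assignment = apply(d, 1, which.min)
  # update step: recompute each center as the mean of its assigned observations
  new_centers = apply(X, 2, function(col) tapply(col, assignment, mean))
  # stop when the centers (and therefore the assignments) no longer change
  if (all(new_centers == centers)) break
  centers = new_centers
}
table(assignment)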
All steps of K-Means algorithm:
Although it is a computationally difficult problem, there are very efficient implementations that quickly find a local optimum. In an optimization problem, the optimum is the value that maximizes or minimizes the condition we are looking for.
Given a set of observations $(x_1, x_2, \dots, x_N)$, K-Means clustering aims to partition the $N$ observations into $k$ sets $S = \{S_1, S_2, \dots, S_k\}$ so as to minimize the within-cluster sum of squares (WCSS):

$$\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2,$$

where $\mu_i$ is the mean of the points in $S_i$.
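To make the WCSS concrete, the following small check recomputes the criterion by hand and compares it with the value reported by kmeans(). This is an illustrative sketch using the iris measurements; fit is a throwaway model, not the km object built later in this section:

set.seed(42)
fit = kmeans(iris[1:4], 3)
# WCSS: for each cluster, sum the squared distances of its points to the cluster mean
wcss = sum(sapply(1:3, function(i) {
  pts = as.matrix(iris[fit$cluster == i, 1:4])
  sum(sweep(pts, 2, colMeans(pts))^2)
}))
c(manual = wcss, from_kmeans = fit$tot.withinss)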
The Forgy and Random Partition methods are the most common initialization approaches. Forgy chooses k observations from the data and uses these as the initial means. Random Partition first assigns a cluster to each observation at random and then proceeds to the update phase, so that the initial mean of each cluster is the centroid of its randomly assigned points. The assignment phase is also referred to as the expectation phase and the update phase as the maximization phase, making this algorithm a variant of the generalized expectation-maximization algorithm.
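As an illustration only, both initialization strategies can be mimicked in R by building starting centers by hand and passing them to kmeans() through its centers argument (the variable names here are made up for this sketch):

set.seed(42)
k = 3

# Forgy: pick k observations and use them directly as the initial means
forgy_init = iris[sample(nrow(iris), k), 1:4]
km_forgy = kmeans(iris[1:4], centers = forgy_init)

# Random Partition: assign every observation to a random cluster,
# then use the group means as the initial centers
rand_labels = sample(1:k, nrow(iris), replace = TRUE)
rand_init = aggregate(iris[1:4], by = list(rand_labels), FUN = mean)[, -1]
km_rand = kmeans(iris[1:4], centers = rand_init)

# compare the resulting cluster sizes
rbind(forgy = sort(km_forgy$size), random = sort(km_rand$size))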
In R, there are several packages that allow us to use K-Means. The following is an implementation using the Iris dataset
# build the K-Means standard model
set.seed(42)
k = 3
km = kmeans(iris[1:4], k, iter.max=1000, algorithm=c("Forgy"))
The kmeans function requires us to properly define two important parameters: the number of clusters to use and the algorithm used to find the optimum.
We can see some information about the outcome
# size of clusters
km$size
[1] 62 38 50

# centers of clusters, three clusters by variable
km$centers
Based on the numerical data of Iris, the K-Means model builds three groups, as we indicated, and then assigns each observation to one of those groups.
Let's plot the result. To better appreciate the information, we must reduce the dataset so it can be represented in two dimensions:
# translate into two dimensions using multidimensional scaling
dist = dist(iris[1:4])
mds = cmdscale(dist)
palette(c("#E41A1C", "#377EB8", "#4DAF4A", "#984EA3", "#FF7F00",
          "#FFFF33", "#A65628", "#F781BF", "#999999", "#000000"))
plot(mds, col=km$cluster, pch = 20, cex = 3)
# plot centroids
km1 = kmeans(mds, k, iter.max=1000, algorithm=c("Forgy"))
points(km1$centers, col = 10, pch = 8, cex = 2)
You can draw the centers of each cluster in a barplot, which provides more detail on how each attribute affects the clustering. To inspect the center of each cluster, input the following command:
barplot(t(km$centers), beside=TRUE, xlab="cluster", ylab="value")

We can also draw a cluster plot with the clusplot() function from the cluster package:

install.packages("cluster")
library("cluster")
# cluster plot
clusplot(mds, km$cluster, color=TRUE, shade=TRUE, labels=2, lines=1, main='Cluster Analysis for Iris')
The graphic representation of the clusters helps us evaluate whether the result makes sense. If it does, we can also attach the cluster results to the original dataset. Then we can continue working in R or export the results to other tools:
# add cluster to original dataset
iris.cluster = cbind(iris, km[1])
head(iris.cluster[3:6])
Let us compare the clusters with the species
table(km$cluster, iris$Species)
Using the aggregate() function, we can also look at the Sepal.Length characteristics of the clusters. The aggregate() function computes statistics for subgroups of data. Here, it calculates the mean Sepal.Length by cluster:
aggregate(data = iris.cluster, Sepal.Length ~ cluster, mean)
Other approaches for clustering
Partitioning Around Medoids (PAM) is similar to K-Means, but it is more robust to outliers. It can be implemented using pam() from the cluster package.
#install.packages("cluster") library("cluster") # implements PAM function cls = pam(iris[1:4], 3) # display cluster plot an silhouette plot plot(cls) # display cluster diagnostics summary(cls)
The summary(cls) calculated above from pam() gives the silhouette width information for each data point. The silhouette width measures how well each point fits its own cluster relative to the neighboring clusters, and a higher average silhouette width is preferred when determining the optimal number of clusters.
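The average silhouette width is also stored directly on the object returned by pam(); a quick way to extract it (assuming the cls object from the snippet above):

# overall average silhouette width of the PAM solution
cls$silinfo$avg.width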
Clustering Large Applications (CLARA) is useful for clustering large datasets. The clara() function from the cluster package uses a sampling approach to cluster large datasets. It provides a silhouette plot and information to determine the optimum number of clusters.
#install.packages("cluster") library("cluster") # cluster with clara cls = clara (iris[1:4], 3, samples=50) summary(cls) plot(cls)
Defining the number of clusters
The k-means algorithm is highly sensitive to the starting position of the cluster centers, so random chance may have a substantial impact on the final set of clusters. Choosing the number of clusters also requires a delicate balance.
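One standard way to reduce the impact of the starting position (a kmeans() option, shown here as an optional sketch) is to run several random starts with the nstart argument and keep the best solution, i.e. the one with the lowest total within-cluster sum of squares:

set.seed(42)
# run 25 random starts and keep the best result
km_multi = kmeans(iris[1:4], centers = 3, nstart = 25)
km_multi$tot.withinss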
There are different methods for determining the optimal number of clusters for k-means. These methods include direct methods and statistical testing methods. You can read more here.
There are several tips on how to choose the appropriate number of clusters:
NbClust package. The NbClust package provides 30 indices for determining the number of clusters and proposes to the user the best clustering scheme from the different results obtained by varying all combinations of number of clusters, distance measures, and clustering methods.
In the following snippet we find the suggested number of clusters:
install.packages("NbClust") library("NbClust") data = iris[,-5] nb = NbClust(data, diss=NULL, distance="euclidean", min.nc=2, max.nc=15, method="complete", index="alllong")
The package also has some graphical indices.
# set a 1x2 grid for plotting
par(mfrow=c(1, 2))
# plot how often each number of clusters is recommended
hist(nb$Best.nc[1,], breaks=max(na.omit(nb$Best.nc[1,])))
barplot(table(nb$Best.nc[1,]))
Sum of squares. We can also use the sum of squares to determine the optimal number of clusters for k-means.
We'll demonstrate how to find the optimum number of clusters by iteratively computing the within-cluster sum of squares and the average silhouette width. For the within sum of squares, lower values represent clusters of better quality. Perform the following steps to find the optimum number of clusters for k-means clustering.
First, calculate the within sum of squares (tot.withinss) for different numbers of clusters:
k = 2:10
set.seed(42)
WSS = sapply(k, function(k) {kmeans(iris[1:4], centers=k)$tot.withinss})
You can then use a line plot to plot the within sum of squares for different numbers of k:
plot(k, WSS, type="l", xlab= "Number of k", ylab="Within sum of squares")
Next, you can calculate the average silhouette width (avg.silwidth) for different numbers of clusters, using cluster.stats() from the fpc package:
# cluster.stats() is provided by the fpc package
# install.packages("fpc")
library("fpc")
SW = sapply(k, function(k) {cluster.stats(dist(iris[1:4]), kmeans(iris[1:4], centers=k)$cluster)$avg.silwidth})
You can then use a line plot to plot the average silhouette width for different numbers of k:
plot(k, SW, type="l", xlab="Number of clusters", ylab="Average silhouette width")
By plotting the within sum of squares for different numbers of k, we find that the elbow of the plot is at k=2.
We use which.max to find the value of k with the maximum average silhouette width, which gives us the suggested number of clusters:
k[which.max(SW)]
Gap statistic. An alternative approach is the gap statistic. It can be applied to any clustering method (k-means clustering, hierarchical clustering, etc.).
The gap statistic compares the total within-cluster variation for different values of k with their expected values under a null reference distribution of the data, i.e. a distribution with no obvious clustering.
library("factoextra") fviz_nbclust(iris[,-5], kmeans, method = "gap_stat")
You can see that the recommended number of clusters is 4.
Evaluation of clustering
Internal evaluation. When a clustering result is evaluated based on the data that was clustered, it is called internal evaluation. These methods usually assign the best score to the algorithm that produces clusters with high similarity within a cluster and low similarity between clusters. The following methods are used for internal evaluation:
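For example, silhouette analysis is a commonly used internal measure. A minimal sketch, assuming the km object built in the K-Means section and the silhouette() function from the cluster package:

library("cluster")
# silhouette widths of the k-means solution, based only on the clustered data
sil = silhouette(km$cluster, dist(iris[1:4]))
summary(sil)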
External evaluation. External evaluation is similar to evaluation on test data: the data used for testing is not used to train the model. The test data is labeled by experts or by some third-party benchmark, and the clustering results on these already labeled items give us a metric of how well the clusters grouped our data. Because the metric depends on external inputs, it is called external evaluation. The following methods are used for external evaluation:
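For example, the adjusted Rand index compares a clustering against externally provided labels. A minimal sketch, assuming the km object from the K-Means section and the adjustedRandIndex() function from the mclust package:

# install.packages("mclust")
library(mclust)
# agreement between the k-means clusters and the known species labels
adjustedRandIndex(km$cluster, iris$Species)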
Outcome
K-means clustering is an unsupervised learning algorithm that groups similar objects together so that the within-group similarity is maximized and the between-group similarity is minimized.
The algorithm's goal is to minimize the within-cluster variation as defined by the squared Euclidean distances.
Merits of K-means clustering:
Demerits of K-means clustering:
Useful book