Introduction
The k-NN algorithm is among the simplest of all machine learning algorithms. It also might surprise many to know that k-NN is one of the top 10 data mining algorithms.
k-Nearest Neighbors algorithm (or k-NN for short) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space.
Non-parametric means that it does not make any assumptions on the underlying data distribution.
Since the points are in feature space, they have a notion of distance. This need not necessarily be Euclidean distance, although that is the one most commonly used. The training data consists of a set of vectors, each with a class label associated with it. In the simplest case there are only two labels, + or – (for the positive and negative classes), but k-NN works equally well with an arbitrary number of classes.
The number k decides how many neighbors (where a neighbor is defined by the distance metric) influence the classification. For a given value of k we can draw the decision boundary of each class.
For any fixed point x we can calculate how close each observation Xi is to x using the Euclidean distance. This ranks the data by how close they are to x. Imagine drawing a small ball about x and slowly inflating it. As the ball hits the first observation Xi, this is the first nearest neighbor of x. As the ball further inflates and hits a second observation, this observation is the second nearest neighbor.
The performance of the k-NN algorithm largely depends on the value of k, i.e., the number of nearest neighbors considered. Choosing the optimal value for k is best done by first inspecting the data. In general, a larger k reduces the overall noise and often improves accuracy, but there is no guarantee. Cross-validation is another way to determine a good k value, by using held-out data to validate each candidate. Historically, the optimal k for most datasets has been between 3 and 10.
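As a rough illustration of the cross-validation idea, the class package provides knn.cv() for leave-one-out cross-validation. The sketch below (using the iris data that is introduced later, purely for illustration) tries a range of candidate k values and picks the one with the lowest leave-one-out error.
library(class)
candidate_k <- 1:15
loo_error <- sapply(candidate_k, function(k) {
  # leave-one-out predictions for the whole data set at this k
  pred <- knn.cv(train = iris[, 1:4], cl = iris$Species, k = k)
  mean(pred != iris$Species)          # proportion of misclassified points
})
candidate_k[which.min(loo_error)]     # k with the lowest leave-one-out error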
k-NN makes predictions using the training dataset directly. Predictions are made for a new instance x by searching through the entire training set for the k most similar instances (the neighbors) and summarizing the output variable for those k instances. For regression this might be the mean of the output variable; for classification it might be the mode (most common) class value.
To determine which of the k instances in the training dataset are most similar to a new input, a distance measure is used. You can choose the best distance metric based on the properties of your data. The most popular distance measures include Euclidean, Manhattan, and Minkowski distance (and Hamming distance for categorical features).
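As a small illustration (not part of the model built later), base R's dist() function supports several of these metrics. Here it computes the distance between the first two iris flowers under a few common choices:
# distances between the first two iris observations under common metrics
x <- iris[1:2, 1:4]
dist(x, method = "euclidean")            # straight-line distance
dist(x, method = "manhattan")            # sum of absolute differences
dist(x, method = "minkowski", p = 3)     # generalization of the two above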
The output of k-NN depends on whether it is used for classification or regression: for classification the output is a class membership, decided by a majority vote of the k nearest neighbors; for regression it is a value, typically the average of the values of the k nearest neighbors. In industry, however, k-NN is more widely used for classification because of its ease of interpretation and low computation time (for small datasets). The computational complexity of k-NN increases with the size of the training dataset.
k-NN is also a lazy algorithm. This means it does not use the training data points to build any generalized model. The lack of generalization means that k-NN keeps all the training data, and all of it is needed during the testing phase. As a result, the training phase is minimal and fast, while the testing phase is comparatively slow and memory-hungry, because the entire training set has to be stored and searched for every prediction.
In k-Nearest Neighbor classification, the training dataset is used to classify each member of a target dataset. The structure of the data is that there is a classification (categorical) variable of interest ("buyer" or "non-buyer," for example) and a number of additional predictor variables (age, income, location, etc.).
Before using k-NN we should do some preparation, in particular cleaning the data, handling missing values, and bringing all features onto a comparable scale (normalization is discussed below).
The algorithm itself is straightforward: compute the distance from the new point to every training point, pick the k closest ones, and assign the class by majority vote among them (or, for regression, average their values).
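To make these steps concrete, here is a bare-bones, hand-rolled sketch in plain R (the knn_predict name and the use of iris are purely illustrative; the knn() function used later does the real work):
# classify one new point with a hand-rolled k-NN (illustrative sketch only)
knn_predict <- function(train_x, train_labels, new_x, k = 3) {
  # 1. Euclidean distance from the new point to every training point
  d <- sqrt(rowSums(sweep(as.matrix(train_x), 2, unlist(new_x))^2))
  # 2. indices of the k closest training points
  nearest <- order(d)[1:k]
  # 3. classification: most common class among the neighbours
  #    (for regression, return mean(train_labels[nearest]) instead)
  votes <- table(train_labels[nearest])
  names(votes)[which.max(votes)]
}
# example: classify the last iris flower using the first 149 as training data
knn_predict(iris[1:149, 1:4], iris$Species[1:149], iris[150, 1:4], k = 3)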
Modeling in R
Let's see how to use k-NN for classification. In this case, we are given some labelled data points for training and new unlabelled data for testing. Our aim is to find the class label for the new points. The algorithm behaves differently depending on k.
Before starting with the k-NN implementation in R, I would like to discuss two data normalization/standardization techniques that keep the features on a comparable scale. You can easily see why this matters when you go through the results of the summary() function. Look at the minimum and maximum values of all the (numerical) attributes. If you see that one attribute has a much wider range of values than the others, you will need to normalize your dataset, because otherwise the distance will be dominated by this feature. The first technique, min-max normalization, rescales each attribute to the range 0 to 1; we will write a small function for it below. The second, Z-score standardization, rescales each attribute to zero mean and unit variance and can be applied with the built-in scale() function. To perform a k-nearest neighbour classification in R we will make use of the knn() function in the class package and the iris data set.
# familiarize with the data
str(iris)
table(iris$Species)
The table() function builds a contingency table of counts. Called with a single variable, as here, it simply counts how often each unique value occurs; called with two variables it performs a two-way cross-tabulation, counting the occurrences of each combination of values. It works for numeric, factor, and character variables.
A quick look at the Species attribute through this table tells you that the division of the species of flowers is 50-50-50.
If you want to check the percentage division of the Species attribute, you can ask for a table of proportions:
round(prop.table(table(iris$Species)) * 100, digits = 1)
Also, you can try to get an idea of your data by making some graphs, such as scatter plots. It can be interesting to see how much one variable is affected by another. In other words, you want to see if there is any correlation between two variables.
plot(iris$Sepal.Length, iris$Sepal.Width, col=iris$Species, pch=19)
legend("topright", legend=levels(iris$Species), bty="n", pch=19, col=palette())
You see that there is a high correlation between the Sepal.Length and the Sepal.Width of the Setosa iris flowers, while the correlation is somewhat lower for the Virginica and Versicolor flowers.
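If you want to back this visual impression up with numbers (a quick check, not needed for the model), you can compute the correlation within each species:
# correlation of Sepal.Length and Sepal.Width within each species
sapply(split(iris, iris$Species),
       function(d) cor(d$Sepal.Length, d$Sepal.Width))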
To normalize the data, use the following min-max function.
normMinMax = function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
iris_norm <- as.data.frame(lapply(iris[1:4], normMinMax))
Now create the training and test data sets. The following randomly splits the iris data into a 70% training set and a 30% test set (here we split the original iris measurements; you could split iris_norm in exactly the same way).
set.seed(1234)
indexes = sample(2, nrow(iris), replace=TRUE, prob=c(0.7, 0.3))
iris_train = iris[indexes==1, 1:4]
iris_test = iris[indexes==2, 1:4]
iris_train_labels = iris[indexes==1, 5]
iris_test_labels = iris[indexes==2, 5]
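Before fitting the model, it can be worth confirming that all three species are reasonably represented in both subsets:
# check the class distribution in the training and test sets
table(iris_train_labels)
table(iris_test_labels)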
Load the class package and call the knn() function to build the model.
#install.packages("class") library("class") iris_mdl = knn(train=iris_train, test=iris_test, cl=iris_train_labels, k=3)
The knn function takes as its arguments the training set, the test set, the vector of class labels for the training set, and the number of neighbours k.
The error rate typically changes as k varies, so it is worth checking several values of k.
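A simple way to see this on our split (a sketch reusing the objects created above) is to refit the model for a range of k values and record the test error:
# test error for k = 1..15 on the same train/test split
k_values <- 1:15
test_error <- sapply(k_values, function(k) {
  pred <- knn(train=iris_train, test=iris_test, cl=iris_train_labels, k=k)
  mean(pred != iris_test_labels)
})
plot(k_values, test_error, type="b", xlab="k", ylab="Test error rate")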
To evaluate the model, use the CrossTable() function in the gmodels package.
#install.packages("gmodels") library("gmodels") CrossTable(x=iris_test_labels, y=iris_mdl, prop.chisq = FALSE)
Output:
As you can see from the cross-tabulation, the model makes one mistake, predicting versicolor where the true class is virginica, while all the other test observations are predicted correctly.
Since the data set contains only 150 observations, the model makes only one mistake; on larger data sets there may be more. In that case, try a different approach: for instance, use Z-score standardization instead of min-max normalization, and vary the value of k to see its impact on the accuracy of the predictions.
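As a sketch of that alternative, Z-score standardization can be done with the built-in scale() function, after which the same split and model can be rebuilt (the *_z names below are just illustrative, and the resulting accuracy may differ from the one reported here):
# Z-score standardization: rescale each attribute to mean 0 and sd 1
iris_z <- as.data.frame(scale(iris[1:4]))
iris_train_z <- iris_z[indexes == 1, ]
iris_test_z  <- iris_z[indexes == 2, ]
iris_mdl_z <- knn(train=iris_train_z, test=iris_test_z, cl=iris_train_labels, k=3)
mean(iris_mdl_z == iris_test_labels)   # accuracy with standardized features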
We would like to determine the misclassification rate for a given value of k. First, build a confusion matrix. When the counts of predicted versus actual classes are arranged in an n x n matrix (2x2 in the case of binary classification), the matrix is called the confusion matrix. Each column of the matrix represents the number of predictions for each class, while each row represents the instances of the actual class.
CM = table(iris_test_labels, iris_mdl)
Second, compute the number of correct predictions from the diagonal of this matrix. Applying the diag function to the table selects the diagonal elements, i.e., the number of points where the k-nearest neighbour classification agrees with the true classification, and the sum command simply adds these values up.
accuracy = (sum(diag(CM)))/sum(CM)
The model has a high overall accuracy of 0.9736842; the corresponding misclassification rate is 1 - 0.9736842, about 0.026.
Pros and Cons of k-NN algorithm
Pros. The algorithm is highly unbiased in nature and makes no prior assumptions about the underlying data. Being simple and effective, it is easy to implement and has gained good popularity.
Cons. The k-NN algorithm has drawn a lot of flak for being extremely simple. Looking deeper, it does not really build a model, since there is no abstraction process involved. The training process is very fast because the data is simply stored verbatim (hence "lazy learner"), but prediction time is comparatively high, and the stored instances sometimes offer few useful insights. Therefore, applying this algorithm well requires investing time in data preparation (especially treating missing data and categorical features) to obtain a robust model.