Principal Component Analysis (PCA) is a dimensionality reduction technique that is widely used in data analysis. More concretely, PCA is used to reduce a large number of correlated variables into a smaller set of uncorrelated variables called principal components.
Introduction
In many datasets, particularly in the social sciences, you will see many variables that are highly correlated with each other. Such datasets may additionally suffer from high dimensionality, commonly known as the curse of dimensionality. The goal then is to use PCA to create a smaller set of variables that capture most of the information in the original set of variables, thus simplifying the dataset and often revealing hidden insights. These new variables (the principal components) are uncorrelated with each other.
Dimensionality reduction here is a process of feature extraction rather than feature selection. Feature extraction transforms the data from the high-dimensional space into a space of fewer dimensions.
Why do we need to reduce the dimension of the data? Sometimes it is simply difficult to analyse, and especially to interpret, high-dimensional data. This is because some dimensions carry little information and only add noise to the analysis. To deal with this, we remove those nuisance dimensions and work with the significant ones.
PCA is the process of finding these principal components. The first principal component is the linear combination of the original variables that captures the maximum variance in the data. The second component is another linear combination that maximizes the remaining variance, under the constraint that its direction is perpendicular to the first component. The subsequent components (up to the number of original variables) follow this same rule.
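As a minimal sketch of this idea (not part of the original walkthrough), the first component can be computed by hand from the eigen-decomposition of the covariance matrix; here I use R's built-in iris measurements, the same data used later in this post, and the names X, eig, w1 and pc1 are just illustrative:

# Minimal sketch: the first principal component is the eigenvector of the
# covariance matrix with the largest eigenvalue; its scores have maximal variance.
X = scale(iris[, 1:4], center = TRUE, scale = FALSE)  # centered data matrix
eig = eigen(cov(X))            # eigen-decomposition of the covariance matrix
w1 = eig$vectors[, 1]          # weights (loadings) of the first component
pc1 = X %*% w1                 # scores of the first component
var(as.vector(pc1))            # variance of the scores ...
eig$values[1]                  # ... equals the largest eigenvalue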
Modeling in R
There are many packages and functions that can apply PCA in R. In this post I will use the function prcomp() from the built-in stats package. The stats package actually provides two functions, prcomp() and princomp(): princomp() uses the spectral (eigen) decomposition of the covariance matrix, while prcomp() performs a singular value decomposition (SVD) directly on the data matrix. The SVD approach is slightly more accurate numerically than an eigenvalue decomposition of the covariance matrix, but the two are conceptually equivalent. Therefore, prcomp() is the preferred function.
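To see that the two approaches agree, here is a quick illustrative check on a small random matrix (not part of the original post; m is just a toy dataset). The loadings should coincide up to arbitrary sign flips:

# illustrative check: SVD-based prcomp() vs eigen-based princomp()
set.seed(1)
m = matrix(rnorm(100), nrow = 20, ncol = 5)   # toy data: 20 observations, 5 variables
round(prcomp(m)$rotation, 3)                  # loadings from prcomp()
round(unclass(princomp(m)$loadings), 3)       # loadings from princomp(); signs may differ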
I will use the classical iris dataset for the demonstration. The data contain four continuous variables, which correspond to physical measurements of flowers, and a categorical variable describing the flowers' species.
# load and explore data
data(iris)
head(iris, 3)
We will apply PCA to the four continuous variables and use the categorical variable to visualize the PCs later. PCA is sensitive to the scale of the variables, so we first apply a log() transformation to the continuous variables; this is advisable when the input variables have very different variances (the scale() function is another option). We also set center and scale. equal to TRUE in the call to prcomp() to standardize the variables prior to the application of PCA:
# log transform
log.ir = log(iris[, 1:4])
ir.species = iris[, 5]
With the transformation in place, you can carry out the principal component analysis using the prcomp() function in R, which handles the centering and scaling for us:
ir.pca = prcomp(log.ir, center = TRUE, scale. = TRUE)
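As an illustrative check (not from the original post; ir.pca2 is just a throwaway name), standardizing the data yourself with scale() and then calling prcomp() on the result gives the same loadings, possibly up to sign:

# equivalent route: standardize manually with scale(), then run prcomp()
ir.pca2 = prcomp(scale(log.ir))
all.equal(abs(ir.pca$rotation), abs(ir.pca2$rotation))  # compare loadings, ignoring sign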
The prcomp() function returns an object of class prcomp, which has several methods available. The print() method returns the standard deviation of each of the four PCs and their rotation (or loadings), which are the coefficients of the linear combinations of the continuous variables. The standard deviations here are the square roots of the eigenvalues.
print(ir.pca)
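A quick way to see this connection (an illustrative check, not part of the original walkthrough): since we standardized the variables, the squared standard deviations should match the eigenvalues of the correlation matrix of the log-transformed data:

# squared standard deviations of the PCs vs eigenvalues of the correlation matrix
round(ir.pca$sdev^2, 4)
round(eigen(cor(log.ir))$values, 4)   # should be the same values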
The plot() method returns a plot of the variances (y-axis) associated with the PCs (x-axis). The figure below is useful for deciding how many PCs to retain for further analysis. In this simple case with only 4 PCs this is not a hard task, and we can see that the first two PCs explain most of the variability in the data.
plot(ir.pca, type = "l")
Looking at the graph, we could say that the first two principal components explain a large part of the variance in our data. But is there a less subjective way to choose the number of principal components? Yes: two easy criteria are the Kaiser-Guttman criterion and the broken stick criterion. Both are based on the eigenvalues, which tell us how much of the original dataset each principal component is able to explain. To get the eigenvalues, just call:
ev = ir.pca$sdev^2
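Before plotting anything, the Kaiser-Guttman rule can already be applied numerically: keep the components whose eigenvalue exceeds the average eigenvalue (which is 1 when PCA is carried out on standardized variables). A small sketch:

# Kaiser-Guttman criterion applied directly to the eigenvalues
mean(ev)              # average eigenvalue (1 for standardized variables)
which(ev > mean(ev))  # components to retain according to the criterion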
Now we can use the eigenvalues to construct a very useful graph that can help us decide which principal components to keep. Copy and paste the following evplot() function into R; I have slightly modified it to look prettier (the original function can be found here). Then call evplot():
evplot = function(ev) {
  # Broken stick model (MacArthur 1957)
  n = length(ev)
  bsm = data.frame(j = seq(1:n), p = 0)
  bsm$p[1] = 1/n
  for (i in 2:n) bsm$p[i] = bsm$p[i-1] + (1/(n + 1 - i))
  bsm$p = 100 * bsm$p/n
  # Plot eigenvalues and % of variation for each axis
  op = par(mfrow = c(2, 1), omi = c(0.1, 0.3, 0.1, 0.1), mar = c(1, 1, 1, 1))
  barplot(ev, main = "Eigenvalues", col = "bisque", las = 2)
  abline(h = mean(ev), col = "red")
  legend("topright", "Average eigenvalue", lwd = 1, col = 2, bty = "n")
  barplot(t(cbind(100 * ev/sum(ev), bsm$p[n:1])), beside = TRUE,
          main = "% variation", col = c("bisque", 2), las = 2)
  legend("topright", c("% eigenvalue", "Broken stick model"),
         pch = 15, col = c("bisque", 2), bty = "n")
  par(op)
}

evplot(ev)
The first barplot tells us (according to the Kaiser-Guttman criterion) that only the first principal component should be retained, because its bar is the only one above the red line marking the average eigenvalue. The second barplot (broken stick model) leads to the same conclusion, because only the first component's % eigenvalue bar exceeds the corresponding broken stick bar. Simple as that.
The summary() method describes the importance of the PCs. The first row gives again the standard deviation associated with each PC, the second row shows the proportion of the variance in the data explained by each component, and the third row gives the cumulative proportion of explained variance. We can see there that the first two PCs account for more than 95% of the variance of the data.
summary(ir.pca)
We can use the predict() function if we observe new data and want to predict their PC scores. Just for illustration, pretend the last two rows of the iris data have just arrived and we want to see their PC scores:
# predict PCs
predict(ir.pca, newdata = tail(log.ir, 2))
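Under the hood, predict() for a prcomp object centers and scales the new data with the values stored in the fitted object and projects it onto the rotation matrix. The following sketch (newdat is just an illustrative name) reproduces the same result by hand:

# manual equivalent of predict(): center, scale, then project onto the loadings
newdat = tail(log.ir, 2)
scale(newdat, center = ir.pca$center, scale = ir.pca$scale) %*% ir.pca$rotation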
You might want to see how the initial variables contribute to two or three principal components, and how observations are explained by the components. We can use ggbiplot() for that. Note that choices is the argument defining which principal components will be shown on the ggbiplot() (we will use the first two components).
library(devtools)
install_github("vqv/ggbiplot")
library(ggbiplot)
ggbiplot(ir.pca, choices = c(1, 2), obs.scale = 1, var.scale = 1)
We can also use the fifth variable from the iris dataset (the categorical variable iris[,5]) to define groups on the ggbiplot(). In that way, observations that belong to the same iris species will be grouped.
ggbiplot(ir.pca, choices = c(1, 2), groups = iris[, 5], obs.scale = 1, var.scale = 1, ellipse = TRUE)
We can see that Petal.Length, Petal.Width and Sepal.Length load positively on the first principal component, while Sepal.Width loads negatively on it. The second principal component, in turn, only has sizeable (negative) loadings for Sepal.Width and Sepal.Length.
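These statements can be double-checked by inspecting the signs of the loadings directly:

# loadings of the first two components; their signs back up the interpretation above
round(ir.pca$rotation[, 1:2], 2)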
So, advantages of Principal Component Analysis include: