Introduction
In statistics, modeling is where we get down to business. Models quantify the relationships between our variables and let us make predictions.
Regression analysis focuses on building a thought process around how modeling techniques establish and quantify a relationship between response variables and predictors. Regression builds a function of independent variables (also known as features, attributes, predictors, or dimensions) to predict a dependent variable (also called the response, output, or target).
The concept of causation is important to keep in mind, because our intuition often deviates from how the relationships quantified by a model should be interpreted. For example, a statistical model can quantify a relationship between completely unrelated measures, say electricity generation and beer consumption; the model describes association, not cause and effect.
Any regression analysis involves three key sets of variables: the response (dependent variable), the predictors (independent variables), and the error term.
Another important concept to understand is the distinction between parametric and non-parametric methods.
A simple linear regression is the most basic model. It describes the relationship between a single dependent variable, known as Y (the response or output), and a single independent variable, known as X (the input or predictor). So it's just two variables, and the relationship is modeled as a straight line with an error term:

y = β₀ + β₁x + ε
We are given the data for x and y. Our mission is to fit the model, which will give us the best estimates for β₀ and β₁.
In other words, simple linear regression fits a straight line through the set of n points in such a way that makes the sum of squared residuals of the model (that is, vertical distances between the points of the data set and the fitted line) as small as possible.
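To make the least-squares idea concrete, here is a minimal sketch on simulated data (the variables and values below are illustrative, not from this article); the closed-form estimates match what lm computes:

set.seed(1)
x = runif(50)
y = 2 + 3 * x + rnorm(50, sd = 0.5)   # simulated data, true line y = 2 + 3x

b1 = cov(x, y) / var(x)               # closed-form OLS slope
b0 = mean(y) - b1 * mean(x)           # closed-form OLS intercept
c(b0, b1)

coef(lm(y ~ x))                       # lm returns the same estimates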
That generalizes naturally to multiple linear regression, where we have multiple variables on the righthand side of the relationship:

y = β₀ + β₁u + β₂v + β₃w + ε
Statisticians call u, v, and w the predictors and y the response. Obviously, the model is useful only if there is a fairly linear relationship between the predictors and the response, but that requirement is much less restrictive than you might think.
The beauty of R is that anyone can build these linear models. The models are built by the lm function, which returns a model object. From the model object, we get the coefficients (βᵢ) and regression statistics.
Regression creates a model, and ANOVA (Analysis of variance) is one method of evaluating such models. ANOVA is actually a family of techniques that are connected by a common mathematical analysis.
For a one-way ANOVA, if the data is normally distributed, use the oneway.test function; otherwise, use the nonparametric version, the kruskal.test function. The anova function compares two regression models and reports whether they are significantly different.
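A minimal sketch of those calls, using the built-in InsectSprays and iris data sets purely as illustrations:

oneway.test(count ~ spray, data = InsectSprays)    # parametric one-way ANOVA
kruskal.test(count ~ spray, data = InsectSprays)   # nonparametric alternative

# anova() compares two nested regression models
m1 = lm(Sepal.Length ~ Sepal.Width, data = iris)
m2 = lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)
anova(m1, m2)   # a small p-value indicates the added predictor significantly improves the fit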
Essentially, the linear regression model will help you quantify the relationship between the predictors and the response and predict new values of the response.
Simple linear regression
Imagine you have two vectors, x and y, that hold paired observations: (x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ). You believe there is a linear relationship between x and y, and you want to create a regression model of the relationship.
The regression uses the ordinary least-squares (OLS) algorithm to fit the linear model

yᵢ = β₀ + β₁xᵢ + εᵢ

where β₀ and β₁ are the regression coefficients and the εᵢ are the error terms.
The lm function can perform linear regression. The main argument is a model formula, such as y ~ x. The formula has the response variable on the left of the tilde character (~) and the predictor variable on the right. The function estimates the regression coefficients, β₀ and β₁, and reports them as the intercept and the coefficient of x, respectively:
lm(iris$Sepal.Length ~ iris$Sepal.Width)
In this case, the regression equation is:

iris$Sepal.Length = 6.5262 - 0.2234 * iris$Sepal.Width
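Those two numbers are just the fitted coefficients, so you can pull them out of the model object directly (values rounded as above):

m = lm(Sepal.Length ~ Sepal.Width, data = iris)
coef(m)   # (Intercept) ≈ 6.5262, Sepal.Width ≈ -0.2234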
Multiple linear regression
Suppose you have several predictor variables (e.g., u, v, and w) and a response variable y. You believe there is a linear relationship between the predictors and the response, and you want to perform a linear regression on the data.
Multiple linear regression is the obvious generalization of simple linear regression. It allows multiple predictor variables instead of one predictor variable and still uses OLS to compute the coefficients of a linear equation. The three-variable regression just given corresponds to this linear model:

yᵢ = β₀ + β₁uᵢ + β₂vᵢ + β₃wᵢ + εᵢ
R uses the lm function for both simple and multiple linear regression. You simply add more variables to the righthand side of the model formula. The output then shows the coefficients of the fitted model:
lm(y ~ u + v + w)
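A minimal, self-contained sketch (u, v, w, and y are simulated here, not taken from a real data set):

set.seed(2)
u = rnorm(100); v = rnorm(100); w = rnorm(100)
y = 1 + 2*u - 0.5*v + 3*w + rnorm(100)   # simulated response

m = lm(y ~ u + v + w)
coef(m)   # intercept plus one coefficient per predictor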
Let's look at the strengths and weaknesses of multiple linear regression.
Strengths of multiple linear regression: the model is fast to fit, and each coefficient has a direct interpretation as the effect of its predictor with the other predictors held fixed.
Weaknesses of multiple linear regression: it assumes the relationship between the predictors and the response is linear, it is sensitive to outliers, and highly correlated predictors (multicollinearity) make the coefficient estimates unstable.
Getting regression statistics
Before interpreting and using your model, you will need to determine whether it is a good fit to the data and includes a good combination of explanatory variables. You may also be considering several alternative models for your data and want to compare them.
The fit of a model is commonly measured in a few different ways. These include R² (the coefficient of determination), the F statistic, and the residual standard error.
Save the regression model in a variable, say m
m = lm(y ~ u + v + w)
Then use functions to extract regression statistics and information from the model:

anova(m): ANOVA table
coefficients(m): Model coefficients
coef(m): Same as coefficients(m)
confint(m): Confidence intervals for the regression coefficients
deviance(m): Residual sum of squares
effects(m): Vector of orthogonal effects
fitted(m): Vector of fitted y values
residuals(m): Model residuals
resid(m): Same as residuals(m)
summary(m): Key statistics, such as R², the F statistic, and the residual standard error (σ). You can read a good description of summary in the book "R Cookbook" by Paul Teetor (page 270).
vcov(m): Variance–covariance matrix of the main parameters
Plotting regression residuals
Residuals are central to the diagnostics of regression models. Normality of the residuals is an important condition for the model to be a valid linear regression model. In simple words, normality implies that the errors/residuals are random noise and that our model has captured all the signal in the data.
The linear regression model gives us the conditional expectation of Y for given values of X. However, the fitted equation leaves some residual for each observation. We need the residuals to be approximately normally distributed with a mean of 0. Normal residuals mean that the model inference (confidence intervals, the significance of the model predictors) is valid.
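One quick way to check this assumption, assuming m is a fitted lm object as above, is a normal Q-Q plot together with a formal test:

r = residuals(m)
qqnorm(r); qqline(r)   # points should lie close to the reference line
shapiro.test(r)        # Shapiro-Wilk test; a small p-value suggests the residuals are not normal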
If you want a visual display of your regression residuals, you can plot the model object and select the residuals plot from the available diagnostic plots:

m = lm(y ~ x)
plot(m, which=1)               # residuals vs. fitted values
# plot(x, y); abline(coef(m))  # draws the fitted line on the raw data instead

Normally, plotting a regression model object produces several diagnostic plots. You can select just the residuals plot by specifying which=1.
Predicting new values
If you want to predict new values from your regression model, you can use the predict function. Save the predictor data in a data frame, then call predict, setting the newdata parameter to the data frame:
m = lm(y ~ u + v + w)
preds = data.frame(u=3.1, v=4.0, w=5.5)
predict(m, newdata=preds)
Once you have a linear model, making predictions is quite easy because the predict function does all the heavy lifting. The only annoyance is arranging for a data frame to contain your data.
The predict function returns a vector of predicted values with one prediction for every row in the data. The example contains one row, so predict returns one value:
preds = data.frame(u=3.1, v=4.0, w=5.5)
predict(m, newdata=preds)
# 12.31374
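If you also want a measure of uncertainty around the prediction, predict can return interval estimates as well (a sketch using the same preds data frame):

predict(m, newdata=preds, interval="prediction", level=0.95)
# columns: fit (point prediction), lwr and upr (bounds of the 95% prediction interval)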
Example
Let's build a multiple regression model for the iris data set. Prepare the data:
irisn = iris
irisn[,5] = unclass(iris$Species)
Plot the multivariate data:
plot(irisn, col=irisn$Species)
Build a linear model for the iris dataset:
ml = lm(Species~Sepal.Length+Sepal.Width+Petal.Length+Petal.Width, data=irisn)
Extract the model coefficients:
coef(ml)
Inspect the residuals, i.e., the differences between observed values and fitted values:
residuals(ml)[1:10]
To check for homoscedasticity, plot the following chart. The scatter plot of residuals vs. fitted values should not look like a funnel:
plot(ml, which=1)
To check whether the residuals are approximately normally distributed, plot the following chart. The points of the Q-Q plot should lie close to the straight line:
plot(ml, which=2)
To predict Species from the linear model using new values for Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width:
new_data = data.frame(Sepal.Length=5.7, Sepal.Width=3.1, Petal.Length=5.0, Petal.Width=1.7)
predict(ml, new_data)
Calculate the root-mean-square error (RMSE); lower is better:

RMSE = function(predicted, true) mean((predicted - true)^2)^0.5
RMSE(predict(ml), irisn$Species)   # in-sample RMSE on the training data
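Evaluating the error on the same data used to fit the model is optimistic; a more honest estimate comes from a random train/test split (a sketch; the split size and seed are arbitrary):

set.seed(3)
idx   = sample(nrow(irisn), 100)        # 100 rows for training, the rest for testing
train = irisn[idx, ]
test  = irisn[-idx, ]
fit   = lm(Species ~ ., data = train)   # same model form as above
RMSE(predict(fit, test), test$Species)  # out-of-sample RMSE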
Types of regression
There are different types of regression: