In statistics, modeling is where we get down to business. Models quantify the relationships between our variables and let us make predictions.
Regression analysis focuses on how modeling techniques establish and quantify a relationship between a response variable and its predictors. Regression builds a function of independent variables (also known as features, attributes, predictors, dimensions) to predict a dependent variable (also called response, output, target).
The concept of causation is important to keep in mind, because a model quantifies association between variables, not cause and effect, and it is easy to read more into a fitted relationship than it supports. For example, a statistical model will happily quantify a relationship between completely unrelated measures, say electricity generation and beer consumption.
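As a rough illustration of that point, the sketch below simulates two series that share nothing except an upward trend over time. All numbers and variable names (year, electricity, beer) are made up for this example; it is not real data.

```r
# Simulated data: both series are hypothetical and share only a time trend.
set.seed(42)
year <- 1:40
electricity <- 100 + 2.0 * year + rnorm(40, sd = 5)  # made-up units
beer        <- 50  + 1.5 * year + rnorm(40, sd = 5)  # made-up units

# The two series are strongly correlated even though neither drives the other.
r <- cor(electricity, beer)
print(r)

# A regression quantifies the relationship anyway.
spurious <- lm(beer ~ electricity)
print(coef(spurious))
```

The fitted slope is real in the arithmetic sense, but it only reflects the shared trend, so it says nothing about one variable causing the other.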
Any regression analysis involves three key sets of variables:
An important concept to understand is the distinction between parametric and non-parametric methods.
Simple linear regression is the most basic model. Linear regression in general explains the relationship between a single dependent variable Y and one or more independent variables X (also called inputs or predictors); simple linear regression uses exactly one predictor. So it's just two variables, modeled as a linear relationship with an error term:
$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
We are given the data for x and y. Our mission is to fit the model, which will give us the best estimates for $\beta_0$ and $\beta_1$.
In other words, simple linear regression fits a straight line through the set of n points in such a way that makes the sum of squared residuals of the model (that is, vertical distances between the points of the data set and the fitted line) as small as possible.
That generalizes naturally to multiple linear regression, where we have multiple variables on the right-hand side of the relationship:
$y_i = \beta_0 + \beta_1 u_i + \beta_2 v_i + \beta_3 w_i + \varepsilon_i$
Statisticians call u, v, and w the predictors and y the response. Obviously, the model is useful only if there is a fairly linear relationship between the predictors and the response, but that requirement is much less restrictive than you might think.
The beauty of R is that anyone can build these linear models. The models are built by a function, lm, which returns a model object. From the model object, we get the coefficients ($\beta_i$) and regression statistics.
Regression creates a model, and ANOVA (Analysis of variance) is one method of evaluating such models. ANOVA is actually a family of techniques that are connected by a common mathematical analysis.
oneway.test function, otherwise, use the nonparametric version, the
The anova function compares two regression models and reports whether they are significantly different.
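A minimal sketch of such a model comparison, assuming the built-in iris data and two nested models chosen purely for illustration (the names m1, m2 and cmp are not from the text above):

```r
# Two nested models on the built-in iris data; the second adds one predictor.
m1 <- lm(Sepal.Length ~ Sepal.Width, data = iris)
m2 <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)

# F test of whether the extra predictor improves the fit significantly.
cmp <- anova(m1, m2)
print(cmp)

# The p-value sits in the second row of the "Pr(>F)" column.
p_value <- cmp[["Pr(>F)"]][2]
print(p_value)
```

A small p-value suggests that the larger model fits significantly better than the smaller one.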
Essentially, the linear regression model will help you with:
Simple linear regression
Imagine you have two vectors, x and y, that hold paired observations: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. You believe there is a linear relationship between x and y, and you want to create a regression model of the relationship.
The regression uses the ordinary least-squares (OLS) algorithm to fit the linear model
$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
where $\beta_0$ and $\beta_1$ are the regression coefficients and the $\varepsilon_i$ are the error terms.
The lm function performs linear regression. The main argument is a model formula, such as y ~ x. The formula has the response variable on the left of the tilde character ( ~ ) and the predictor variable on the right. The function estimates the regression coefficients, $\beta_0$ and $\beta_1$, and reports them as the intercept and the coefficient of x, respectively:
lm(iris$Sepal.Length ~ iris$Sepal.Width)
In this case, the regression equation is
iris$Sepal.Length = 6.5262 - 0.2234 iris$Sepal.Width
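To see where those two numbers come from, here is a small sketch that computes the least-squares estimates for a single predictor by hand and compares them with lm. The variable names (x, y, slope, intercept) are chosen for this example, and the data argument is used in place of the iris$ prefixes.

```r
x <- iris$Sepal.Width
y <- iris$Sepal.Length

# Closed-form least-squares estimates for a single predictor.
slope     <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

# The same fit through lm, using the data argument instead of iris$ prefixes.
m <- lm(Sepal.Length ~ Sepal.Width, data = iris)

print(c(intercept = intercept, slope = slope))
print(coef(m))
```

Both approaches should agree up to floating-point error, which is a handy sanity check when learning how lm works.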
Multiple linear regression
Suppose you have several predictor variables (e.g., u, v, and w) and a response variable y. You believe there is a linear relationship between the predictors and the response, and you want to perform a linear regression on the data.
Multiple linear regression is the obvious generalization of simple linear regression. It allows multiple predictor variables instead of one predictor variable and still uses OLS to compute the coefficients of a linear equation. The three-variable regression just given corresponds to this linear model:
$y_i = \beta_0 + \beta_1 u_i + \beta_2 v_i + \beta_3 w_i + \varepsilon_i$
R uses the lm function for both simple and multiple linear regression. You simply add more variables to the right-hand side of the model formula. The output then shows the coefficients of the fitted model:
lm(y ~ u + v + w)
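Since u, v, w and y are placeholders, the following sketch simulates them so the call can actually be run. The "true" coefficients (1, 2, -3, 0.5), the sample size and the noise level are assumptions made only for this illustration.

```r
# Simulated data; the "true" coefficients below are assumptions for illustration.
set.seed(1)
n <- 200
u <- rnorm(n)
v <- rnorm(n)
w <- rnorm(n)
y <- 1 + 2 * u - 3 * v + 0.5 * w + rnorm(n, sd = 1)

m <- lm(y ~ u + v + w)
print(coef(m))  # estimates should land near 1, 2, -3 and 0.5
```

With this much data and modest noise, the estimated coefficients should sit close to the values used to generate y.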
Let's look at strengths and weaknesses of multiple linear regression.
Strengths of multiple linear regression:
Weaknesses of multiple linear regression:
Getting regression statistics
Before interpreting and using your model, you will need to determine whether it is a good fit to the data and includes a good combination of explanatory variables. You may also be considering several alternative models for your data and want to compare them.
The fit of a model is commonly measured in a few different ways. These include:
Save the regression model in a variable, say m
m = lm(y ~ u + v + w)
Then use functions to extract regression statistics and information from the model:
anova(m): ANOVA table
coefficients(m): Model coefficients
coef(m): Same as coefficients(m)
confint(m): Confidence intervals for the regression coefficients
deviance(m): Residual sum of squares
effects(m): Vector of orthogonal effects
fitted(m): Vector of fitted y values
residuals(m): Model residuals
resid(m): Same as residuals(m)
summary(m): Key statistics, such as $R^2$, the F statistic, and the residual standard error ($\sigma$). You can read a good description of summary in the book "R Cookbook" by Paul Teetor (page 270).
vcov(m): Variance–covariance matrix of the main parameters
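As a short sketch of how a few of these extractors are used in practice, the example below fits a model on the built-in iris data (an assumption for illustration, as are the variable names s, r2, sigma, fstat and rss) and pulls individual statistics out of the summary object.

```r
m <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)
s <- summary(m)

r2    <- s$r.squared      # coefficient of determination
sigma <- s$sigma          # residual standard error
fstat <- s$fstatistic     # F statistic with its two degrees of freedom
rss   <- deviance(m)      # residual sum of squares

print(s$coefficients)     # estimates, standard errors, t values, p-values
print(confint(m, level = 0.95))
```

Storing summary(m) in a variable is convenient because its components can then be reused without refitting or reprinting the whole report.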
Plotting regression residuals
Residuals are central to diagnosing regression models. Normality of the residuals is an important condition for valid inference from a linear regression model. In simple words, normality suggests that the errors/residuals are random noise and that our model has captured the signal in the data.
The linear regression model gives us the conditional expectation of Y for given values of X. However, the fitted equation leaves some residual for every observation. We need the residuals to be approximately normally distributed with a mean of 0. Normal residuals mean that the model inference (confidence intervals, significance of the model predictors) is valid.
If you want a visual display of your regression residuals, you can plot the model object and select the residuals plot from the available plots:
m = lm(y ~ x)
plot(m, which=1)
Normally, plotting a regression model object produces several diagnostic plots. You can select just the residuals plot by specifying which=1.
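For readers who want to see what that plot contains, here is a rough hand-built equivalent. It assumes the iris data as a stand-in for x and y, and it is only an approximation of what plot draws for a model object (the built-in version also adds a smoothed trend line and labels extreme points).

```r
m <- lm(Sepal.Length ~ Sepal.Width, data = iris)

# Residuals against fitted values, roughly what plot(m, which = 1) shows.
plot(fitted(m), residuals(m),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)  # reference line at zero
```

A healthy plot shows points scattered evenly around the zero line with no visible curve or pattern.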
Predicting new values
If you want to predict new values from your regression model, you can use the predict function. Save the predictor data in a data frame, then call predict with the newdata parameter set to that data frame:
m = lm(y ~ u + v + w)
preds = data.frame(u=3.1, v=4.0, w=5.5)
predict(m, newdata=preds)
Once you have a linear model, making predictions is quite easy because the predict function does all the heavy lifting. The only annoyance is arranging for a data frame to contain your data.
The predict function returns a vector of predicted values with one prediction for every row in the data. The example contains one row, so predict returns one value:
preds = data.frame(u=3.1, v=4.0, w=5.5)
predict(m, newdata=preds)
12.31374
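The one-prediction-per-row behaviour is easier to see with several rows. The sketch below uses simulated u, v, w and y (all values are assumptions for illustration) and also shows the interval argument of predict for linear models, which returns lower and upper bounds alongside each fitted value.

```r
# Simulated training data; coefficients and noise are made up for this example.
set.seed(1)
u <- rnorm(50)
v <- rnorm(50)
w <- rnorm(50)
y <- 1 + 2 * u - 3 * v + 0.5 * w + rnorm(50)
m <- lm(y ~ u + v + w)

# Column names must match the predictor names used in the formula.
preds <- data.frame(u = c(3.1, 3.2, 3.3),
                    v = c(4.0, 4.1, 4.2),
                    w = c(5.5, 5.7, 5.9))

p  <- predict(m, newdata = preds)                            # one value per row
ci <- predict(m, newdata = preds, interval = "confidence")   # columns fit, lwr, upr
print(p)
print(ci)
```

If a column name in the data frame does not match a predictor in the formula, predict cannot find that variable, so matching names is the main thing to get right.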
Let's build a multiple regression model for the iris data set. First, prepare the data by converting Species to numeric codes:
irisn = iris
irisn[,5] = unclass(iris$Species)
Plot the multivariate data
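One possible way to draw such a plot is a scatter plot matrix. The choice of pairs here is an assumption, since the original chart is not shown, and as.integer is used in place of unclass simply to get plain integer codes.

```r
irisn <- iris
irisn[, 5] <- as.integer(iris$Species)   # species as integer codes 1, 2, 3

# Scatter plot matrix of the four measurements, coloured by species code.
pairs(irisn[, 1:4], col = irisn[, 5], main = "Iris measurements")
```

Each panel shows one pair of measurements, which makes it easy to spot which variables separate the species well.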
Build Linear Model for Iris dataset
ml = lm(Species~Sepal.Length+Sepal.Width+Petal.Length+Petal.Width, data=irisn)
Differences between observed values and fitted values
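A minimal sketch of how those differences can be obtained, refitting the model so the block runs on its own (as.integer stands in for unclass, and the names res and manual are chosen for this example):

```r
irisn <- iris
irisn[, 5] <- as.integer(iris$Species)
ml <- lm(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
         data = irisn)

res    <- residuals(ml)               # observed minus fitted
manual <- irisn$Species - fitted(ml)  # the same thing computed by hand
print(summary(res))
```

The two vectors should match, which confirms that a residual is simply the observed value minus the fitted value.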
To check for homoscedasticity, plot the residuals against the predicted values. The scatter plot should not look like a funnel.
To check whether the residuals are approximately normally distributed, draw a Q-Q plot of the residuals. The points should lie close to a straight line.
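One way to produce such a chart, sketched here with the model refitted so the block is self-contained, is to use the built-in diagnostic plots for model objects. Adding the scale-location plot (which = 3) is an extra suggestion, not something the text above asks for.

```r
irisn <- iris
irisn[, 5] <- as.integer(iris$Species)
ml <- lm(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
         data = irisn)

plot(ml, which = 1)   # residuals vs fitted values
plot(ml, which = 3)   # scale-location: spread of residuals vs fitted values
```

A spread that widens or narrows steadily from left to right is the funnel shape to watch for.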
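A Q-Q plot of the residuals can be sketched with qqnorm and qqline, again refitting the model so the example stands alone (the variable name qq is chosen for this example):

```r
irisn <- iris
irisn[, 5] <- as.integer(iris$Species)
ml <- lm(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
         data = irisn)

# Sample quantiles of the residuals against theoretical normal quantiles.
qq <- qqnorm(residuals(ml), main = "Normal Q-Q plot of residuals")
qqline(residuals(ml))   # reference line through the first and third quartiles
```

Calling plot(ml, which = 2) gives a similar chart based on standardized residuals.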
To predict Species from the linear model using new values for Sepal.Length, Sepal.Width, Petal.Length and Petal.Width:
new_data = data.frame(Sepal.Length=5.7, Sepal.Width=3.1, Petal.Length=5.0, Petal.Width=1.7)
predict(ml, new_data)
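Calculate the root-mean-square error (RMSE) of the model's predictions against the observed species codes; lower is better. The predictions must cover the same rows as the observed values, so predict on irisn rather than on the single-row new_data:
RMSE = function(predicted, true) mean((predicted - true)^2)^.5
RMSE(predict(ml, irisn), irisn$Species)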
Types of regression
There are different types of regression: