In statistics, modeling is where we get down to business. Models quantify the relationships between our variables and let us make predictions.
Regression analysis focuses on building a thought process around how modeling techniques establish and quantify the relationship between response variables and predictors.
The concept of causation is important to keep in mind, because our intuition often deviates from how the relationships quantified by a model should be interpreted. For example, a statistical model will readily quantify a relationship between completely unrelated measures, say electricity generation and beer consumption. A fitted relationship is evidence of association, not of cause and effect.
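The point about unrelated measures can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data are simulated, not real, and the names year, electricity, and beer, as well as the seed and all numeric values, are hypothetical choices made for illustration.

```r
set.seed(2024)                         # arbitrary seed, for reproducibility only
year <- 1:50

# The two series are generated independently of each other;
# the only thing they share is an upward trend over time.
electricity <- 100 + 1.0 * year + rnorm(50, sd = 5)
beer        <-  20 + 1.0 * year + rnorm(50, sd = 5)

m <- lm(beer ~ electricity)
r2 <- summary(m)$r.squared             # high, although neither variable drives the other
print(r2)
```

The regression reports a strong relationship because both series follow the same time trend, not because one causes the other.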
Any regression analysis involves three key sets of variables:
An important concept to understand is the distinction between parametric and non-parametric methods.
Simple linear regression is the most basic model. A linear regression model explains the relationship between a single dependent variable, Y, and one or more independent variables, X, also known as inputs or predictors. In the simple case there are just two variables, and the relationship is modeled as linear with an error term:
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$
We are given the data for x and y. Our mission is to fit the model, which will give us the best estimates for $\beta_0$ and $\beta_1$.
That generalizes naturally to multiple linear regression, where we have multiple variables on the righthand side of the relationship:
$$y_i = \beta_0 + \beta_1 u_i + \beta_2 v_i + \beta_3 w_i + \varepsilon_i$$
Statisticians call u, v, and w the predictors and y the response. Obviously, the model is useful only if there is a fairly linear relationship between the predictors and the response, but that requirement is much less restrictive than you might think.
The beauty of R is that anyone can build these linear models. The models are built by a function, lm, which returns a model object. From the model object, we get the coefficients ($\beta_i$) and regression statistics.
Regression creates a model, and ANOVA (Analysis of variance) is one method of evaluating such models. ANOVA is actually a family of techniques that are connected by a common mathematical analysis.
For a one-way ANOVA on normally distributed populations, use the oneway.test function; otherwise, use the nonparametric version, the kruskal.test function.
The anova function compares two regression models and reports whether they are significantly different.
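A minimal sketch of both uses, on simulated data. The variable names (g, y, u, v, z), the seed, and the group means are assumptions made purely for illustration.

```r
set.seed(42)                       # arbitrary seed, for reproducibility only

# One-way ANOVA: do three groups share a common mean?
g <- factor(rep(c("a", "b", "c"), each = 20))
y <- rnorm(60, mean = c(0, 0, 1)[as.integer(g)])
ow <- oneway.test(y ~ g)           # assumes normally distributed populations
kw <- kruskal.test(y ~ g)          # nonparametric alternative (Kruskal-Wallis)

# Model comparison: does adding v improve a model that already contains u?
u <- rnorm(60)
v <- rnorm(60)
z <- 1 + 2 * u + 0.5 * v + rnorm(60)
m1 <- lm(z ~ u)
m2 <- lm(z ~ u + v)
cmp <- anova(m1, m2)               # F test comparing the two nested models
print(cmp)
```

A small p-value in the second row of the anova table suggests that the larger model fits significantly better than the smaller one.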
Essentially, the linear regression model will help you with:
Simple linear regression
Imagine you have two vectors, x and y, that hold paired observations: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. You believe there is a linear relationship between x and y, and you want to create a regression model of the relationship.
The regression uses the ordinary least-squares (OLS) algorithm to fit the linear model
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$
where $\beta_0$ and $\beta_1$ are the regression coefficients and the $\varepsilon_i$ are the error terms.
The lm function performs linear regression. Its main argument is a model formula, such as y ~ x. The formula has the response variable on the left of the tilde character ( ~ ) and the predictor variable on the right. The function estimates the regression coefficients, $\beta_0$ and $\beta_1$, and reports them as the intercept and the coefficient of x, respectively:
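For a single predictor, the OLS estimates have a well-known closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. The sketch below checks this against lm on simulated data; the seed and the "true" coefficient values 3 and 2 are assumptions chosen for illustration.

```r
set.seed(1)                         # arbitrary seed, for reproducibility only
x <- runif(50, 0, 10)
y <- 3 + 2 * x + rnorm(50)          # assumed true intercept 3 and slope 2

# Closed-form OLS estimates for one predictor
b1 <- cov(x, y) / var(x)            # slope
b0 <- mean(y) - b1 * mean(x)        # intercept

m <- lm(y ~ x)
agree <- all.equal(unname(coef(m)), c(b0, b1))   # TRUE if both approaches match
print(agree)
```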
lm(iris$Sepal.Length ~ iris$Sepal.Width)
In this case, the regression equation is:
iris$Sepal.Length = 6.5262 - 0.2234 * iris$Sepal.Width
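As a sketch, the same model can be fitted with the data argument of lm, which keeps the coefficient names short, and the fitted equation can then be applied by hand. The value width = 3.0 is a hypothetical input chosen only for illustration.

```r
# Same regression as above, using the built-in iris data set
m <- lm(Sepal.Length ~ Sepal.Width, data = iris)
b <- coef(m)
print(round(b, 4))                  # intercept and slope of the fitted line

# Apply the regression equation by hand for a hypothetical sepal width
width <- 3.0
by_hand <- b[["(Intercept)"]] + b[["Sepal.Width"]] * width

# predict() evaluates the same equation
by_predict <- predict(m, newdata = data.frame(Sepal.Width = width))
same <- all.equal(by_hand, unname(by_predict))
print(same)
```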
Multiple linear regression
Suppose you have several predictor variables (e.g., u, v, and w) and a response variable y. You believe there is a linear relationship between the predictors and the response, and you want to perform a linear regression on the data.
Multiple linear regression is the obvious generalization of simple linear regression. It allows multiple predictor variables instead of one predictor variable and still uses OLS to compute the coefficients of a linear equation. The three-variable regression just given corresponds to this linear model:
$$y_i = \beta_0 + \beta_1 u_i + \beta_2 v_i + \beta_3 w_i + \varepsilon_i$$
R uses the lm function for both simple and multiple linear regression. You simply add more variables to the righthand side of the model formula. The output then shows the coefficients of the fitted model:
lm(y ~ u + v + w)
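A self-contained sketch of the call above, using simulated data so that it runs as written. The data frame dat, the seed, and the coefficient values used to generate y are all assumptions for illustration.

```r
set.seed(7)                          # arbitrary seed, for reproducibility only
n <- 100
dat <- data.frame(u = rnorm(n), v = rnorm(n), w = rnorm(n))

# Assumed true model: y = 1 + 2u - 3v + 0.5w + noise
dat$y <- 1 + 2 * dat$u - 3 * dat$v + 0.5 * dat$w + rnorm(n, sd = 0.5)

m <- lm(y ~ u + v + w, data = dat)
print(coef(m))                       # estimates should land near 1, 2, -3, 0.5
```

Passing the data frame through the data argument avoids having free-standing vectors named u, v, w, and y in the workspace.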
Let's look at strengths and weaknesses of multiple linear regression.
Strengths of multiple linear regression:
Weaknesses of multiple linear regression:
Getting regression statistics
Before interpreting and using your model, you will need to determine whether it is a good fit to the data and includes a good combination of explanatory variables. You may also be considering several alternative models for your data and want to compare them.
The fit of a model is commonly measured in a few different ways. These include:
Save the regression model in a variable, say m:
m = lm(y ~ u + v + w)
Then use functions to extract regression statistics and information from the model:
coef(m): Same as coefficients(m)
confint(m): Confidence intervals for the regression coefficients
deviance(m): Residual sum of squares
effects(m): Vector of orthogonal effects
fitted(m): Vector of fitted y values
resid(m): Same as residuals(m)
summary(m): Key statistics, such as $R^2$, the F statistic, and the residual standard error ($\sigma$). A good description of summary appears in "R Cookbook" by Paul Teetor (page 270).
vcov(m): Variance–covariance matrix of the main parameters
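The following sketch pulls a few of these statistics out of a fitted model. The data are simulated and the names dat, s, ci, and rss are assumptions for illustration; the components r.squared, adj.r.squared, sigma, and fstatistic are fields of the object returned by summary for an lm model.

```r
set.seed(3)                          # arbitrary seed, for reproducibility only
dat <- data.frame(u = rnorm(80), v = rnorm(80), w = rnorm(80))
dat$y <- 2 + dat$u + 0.5 * dat$v + rnorm(80)   # w has no real effect here

m <- lm(y ~ u + v + w, data = dat)

s <- summary(m)
print(s$r.squared)                   # R-squared
print(s$adj.r.squared)               # adjusted R-squared
print(s$sigma)                       # residual standard error
print(s$fstatistic)                  # F statistic with its degrees of freedom

ci  <- confint(m, level = 0.95)      # one row per coefficient
rss <- deviance(m)                   # residual sum of squares
rss_ok <- all.equal(rss, sum(resid(m)^2))
print(rss_ok)
```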
Plotting regression residuals
Residuals are central to the diagnostics of regression models. Normality of the residuals is an important condition for valid inference from a linear regression model. In simple terms, residuals that look like random noise suggest that the model has captured the systematic signal in the data.
The linear regression model gives us the conditional expectation of Y for given values of X. However, the fitted equation leaves some residual for each observation. We need the residuals to be approximately normally distributed with a mean of 0. When they are, the model inference (confidence intervals, significance of the model predictors) is valid.
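One way to check these conditions is sketched below, assuming simulated data and an arbitrary seed. It uses a normal Q-Q plot and the Shapiro-Wilk test, which are common choices rather than the only ones.

```r
set.seed(11)                         # arbitrary seed, for reproducibility only
x <- runif(100, 0, 10)
y <- 1 + 0.8 * x + rnorm(100)        # simulated data with normal errors

m <- lm(y ~ x)
r <- resid(m)

print(mean(r))                       # numerically zero when an intercept is fitted

# Visual check: points close to the line suggest approximate normality
qqnorm(r)
qqline(r)

# Formal check: a small p-value is evidence against normality
sw <- shapiro.test(r)
print(sw$p.value)
```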
If you want a visual display of your regression residuals, you can plot the model object and select the residuals plot from the available plots:
m = lm(y ~ x)
plot(m, which=1)
Normally, plotting a regression model object produces several diagnostic plots. You can select just the residuals plot by specifying which=1.
Predicting new values
If you want to predict new values from your regression model, use the predict function. Save the predictor data in a data frame, then call predict with the newdata parameter set to that data frame:
m = lm(y ~ u + v + w)
preds = data.frame(u=3.1, v=4.0, w=5.5)
predict(m, newdata=preds)
Once you have a linear model, making predictions is quite easy because the predict function does all the heavy lifting. The only annoyance is arranging for a data frame to contain your data.
The predict function returns a vector of predicted values, with one prediction for every row in the data. The example contains one row, so predict returns one value:
preds = data.frame(u=3.1, v=4.0, w=5.5)
predict(m, newdata=preds)
12.31374
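The same call works for several rows at once, and predict can also return interval bounds through its interval argument. A sketch on simulated data follows; the data frame dat, the seed, and the second row of predictor values are assumptions for illustration.

```r
set.seed(5)                          # arbitrary seed, for reproducibility only
dat <- data.frame(u = rnorm(60), v = rnorm(60), w = rnorm(60))
dat$y <- 1 + dat$u + 2 * dat$v - dat$w + rnorm(60)
m <- lm(y ~ u + v + w, data = dat)

# Two rows of new predictor values give two predictions
preds <- data.frame(u = c(3.1, 3.2), v = c(4.0, 4.1), w = c(5.5, 5.6))
p <- predict(m, newdata = preds)
print(p)

# Prediction intervals: columns fit, lwr, upr
pint <- predict(m, newdata = preds, interval = "prediction", level = 0.95)
print(pint)
```

The column names in the new data frame must match the predictor names used in the model formula.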