About statistical tests and hypotheses in R

Almost any serious application of R involves statistics, models, or graphics, and most recipes involve statistical tests or confidence intervals. A statistical test lets you choose between two competing hypotheses; a confidence interval, calculated from your data sample, reflects the likely range of a population parameter.

Hypothesis testing is useful because it allows you to use your data as evidence to support or refute a claim, and so helps you make decisions. Hypothesis tests are divided into two groups: parametric and nonparametric tests. Parametric tests make assumptions about the distribution of the data, usually that it follows a normal distribution. Nonparametric tests don't make this kind of assumption, so they are suitable in a wider range of situations.

A hypothesis test uses a sample to test a claim about the population from which the sample was drawn, typically a claim about some statistic arising from an experimental design. This helps you make decisions or draw conclusions about the population.

Null Hypotheses, Alternative Hypotheses, and p-Values

Most statistical tests take one or two data samples and weigh two competing hypotheses, either of which could reasonably be true.

One hypothesis, called the null hypothesis, is that nothing happened: the mean was unchanged; the treatment had no effect; you got the expected answer; the model did not improve; and so forth. It is usually a hypothesis about the value of an unknown parameter such as the population mean or variance, e.g.: the population mean is equal to five. The null hypothesis is retained unless the data provide strong evidence against it.

The other hypothesis, called the alternative hypothesis, is that something happened: the mean rose; the treatment improved the patients’ health; you got an unexpected answer; the model fit better; and so forth. This is generally the negation of the null hypothesis, e.g.: the population mean is not equal to five.

Abstracting away the details at this point, the data lead us either to reject H0 or to fail to reject it. In general, the null hypothesis is a statement of "no difference" and the alternative hypothesis challenges it. A more concise way of writing these two statements is:

  • H0 : the population mean is 0
  • Ha : the population mean is not 0

We want to determine which hypothesis is more likely in light of the data:

  1. To begin, we assume that the null hypothesis is true.
  2. We calculate a test statistic. It could be something simple, such as the mean of the sample, or it could be quite complex. The critical requirement is that we must know the statistic’s distribution. We might know the distribution of the sample mean, for example, by invoking the Central Limit Theorem.
  3. From the statistic and its distribution we can calculate a p-value, the probability of a test statistic value as extreme or more extreme than the one we observed, while assuming that the null hypothesis is true.
  4. If the p-value is too small, we have strong evidence against the null hypothesis. This is called rejecting the null hypothesis.
  5. If the p-value is not small, we have no such evidence. This is called failing to reject the null hypothesis.

The main question is: when is a p-value 'too small'? There is a common convention (the significance level) of rejecting the null hypothesis when p < 0.05 and failing to reject it when p ≥ 0.05. But the real answer is 'it depends': the appropriate significance level depends on your problem domain. The conventional limit of p < 0.05 works for many problems, but for someone working in high-risk areas, p < 0.01 or p < 0.001 might be necessary.
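
To make steps 2-4 concrete, here is a minimal sketch (the sample and its true mean are invented for illustration) that computes a two-sided p-value by hand and checks it against t.test:

x <- rnorm(30, mean=0.5)                       # a sample whose true mean is 0.5
t.stat <- mean(x) / (sd(x) / sqrt(length(x)))  # t statistic under H0: mean is 0
2 * pt(-abs(t.stat), df=length(x) - 1)         # two-sided p-value
t.test(x, mu=0)$p.value                        # the same value, from t.test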

Testing the mean of a sample (Student’s t-test)

The t test is a workhorse of statistics, and this is one of its basic uses: making inferences about a population mean from a sample.

The Student’s t-test is used to test hypotheses about the mean value of one population or two populations. A t-test is suitable if the data is believed to be drawn from a normal distribution. If the samples are large enough (i.e., at least 30 values per sample), the t-test can be used even if the data is not normally distributed. If your data satisfies neither of these criteria, use a nonparametric alternative instead: the Wilcoxon signed-rank test for one sample or paired samples, or the Wilcoxon rank-sum test for two independent samples.

There are three types of t-test:

  • One-sample t-test is used to compare the mean value of a sample with a constant value denoted mu. It has the null hypothesis that the population mean is equal to mu, and the alternative hypothesis that it is not.
  • Two-sample t-test is used to compare the mean values of two independent samples, to determine whether they are drawn from populations with equal means. It has the null hypothesis that the two means are equal, and the alternative hypothesis that they are not equal.
  • Paired t-test is used to compare the mean values for two samples, where each value in one sample corresponds to a particular value in the other sample. It has the null hypothesis that the two means are equal, and the alternative hypothesis that they are not equal.

You have a sample from a population. Given this sample, you want to know if the mean of the population could reasonably be a particular value m.

Apply the t.test function to the sample with the argument mu=m (a one-sample t-test). The mu argument gives the value with which you want to compare the sample mean. It is optional and has a default value of 0.

t.test(sample1, mu=m)

The output includes a p-value. Conventionally, if p < 0.05 then the population mean is unlikely to be m whereas p > 0.05 provides no such evidence.

If your sample size n is small, then the underlying population must be normally distributed in order to derive meaningful results from the t test. A good rule of thumb is that 'small' means n < 30.

The following example simulates sampling from a normal population with true mean 100. It uses the t-test to ask whether the population mean could be 95; in this run, t.test reported a p-value of 0.02378 (rnorm draws a random sample, so your exact value will differ):

sample1 = rnorm(50, mean=100, sd=15)   # draw 50 values from N(100, 15)
t.test(sample1, mu=95)                 # test H0: the population mean is 95

The p-value is small and so it’s unlikely (based on the sample data) that 95 could be the mean of the population.
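
If we instead test the simulation’s true mean by passing mu=100, t.test reports a large p-value:

t.test(sample1, mu=100)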

The large p-value indicates that the sample is consistent with a population mean of 100. In statistical terms, the data provide no evidence against the true mean being 100.

A common case is testing for a mean of zero. If you omit the mu argument, it defaults to zero.
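
For example, testing whether the deviations of sample1 from 100 have mean zero is equivalent to the test above (a small illustration, not a separate recipe):

t.test(sample1 - 100)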

Testing for normality

Sometimes you may wish to determine whether your sample is consistent with having been drawn from a particular type of distribution such as the normal distribution. This is useful because many statistical techniques (such as analysis of variance) are only suitable for normally distributed data.

Two methods that allow you to do this are the Shapiro-Wilk and Kolmogorov-Smirnov tests. To visually assess how well your data fits the normal distribution, use a histogram or normal probability plot.
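
For example, with base graphics (using sample1, the simulated sample from earlier):

hist(sample1)     # histogram: look for a roughly symmetric, bell-shaped outline
qqnorm(sample1)   # normal probability (Q-Q) plot: points should lie near a line
qqline(sample1)   # adds a reference line through the quartiles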

You can use the shapiro.test function (the Shapiro-Wilk test) to determine whether your data sample is from a normally distributed population. The null hypothesis for the test is that the sample is drawn from a normal distribution and the alternative hypothesis is that it is not.

shapiro.test(sample1)

The output includes a p-value. Conventionally, p < 0.05 indicates that the population is likely not normally distributed, whereas p > 0.05 provides no such evidence. For sample1, which was simulated from a normal population, we expect a large p-value, suggesting that the underlying population could indeed be normally distributed.

A one-sample Kolmogorov-Smirnov test helps to determine whether a sample is drawn from a particular theoretical distribution. It has the null hypothesis that the sample is drawn from the distribution and the alternative hypothesis that it is not.

You can perform a Kolmogorov-Smirnov test with the ks.test function. To perform a one-sample test with the null hypothesis that the sample is drawn from a normal distribution with a mean of 70 and a standard deviation of 7, use the command:

ks.test(sample1, "pnorm", 70, 7)

To test a sample against another theoretical distribution, replace pnorm with the relevant cumulative distribution function, and replace the mean and standard deviation with the parameters of that distribution.
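
For example, to test the null hypothesis that the sample is drawn from an exponential distribution with rate 0.1 (an illustrative parameter choice, not from the example above):

ks.test(sample1, "pexp", 0.1)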

Comparing the means of two samples

If you have one sample each from two populations, and you want to know whether the two populations could have the same mean, you can perform a t-test by calling the t.test function:

t.test(sample1, sample2)

It requires that the samples be large enough (both samples have 20 or more observations) or that the underlying populations be normally distributed.

By default, t.test assumes that your data are not paired. If the observations are paired (i.e., if each value in one sample corresponds to a value in the other sample), then specify paired=TRUE:

t.test(sample1, sample2, paired=TRUE)

In either case, t.test will compute a p-value. Conventionally, if p < 0.05 then the means are likely different whereas p > 0.05 provides no such evidence.

  • If either sample size is small, then the populations must be normally distributed. Here, 'small' means fewer than 20 data points.
  • If the two populations have the same variance, specify var.equal=TRUE to obtain a less conservative test, as shown below.
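
For example, the pooled-variance (classical Student’s) form of the two-sample test is:

t.test(sample1, sample2, var.equal=TRUE)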

Testing a correlation for significance

If you have calculated the correlation between two variables but you don’t know whether the correlation is statistically significant, use the cor.test function.

The cor.test function can calculate both the p-value and the confidence interval of the correlation. If the variables came from normally distributed populations then use the default measure of correlation, which is the Pearson method:

cor.test(sample1, sample2)

Pearson's correlation indicates the strength of the association between two normally distributed numeric attributes.

For nonnormal populations, use the Spearman method instead:

cor.test(sample1, sample2, method="spearman")

This correlation coefficient first ranks the observations of both variables. It then computes the correlation from the differences between the ranks of each observation on the two variables.
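
One way to see this (a sketch of the definition, not cor.test’s exact internals): Spearman’s coefficient is the Pearson correlation computed on the ranks:

cor(sample1, sample2, method="spearman")
cor(rank(sample1), rank(sample2))        # the same value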

The function returns several values, including the p-value from the test of significance. Conventionally, p < 0.05 indicates that the correlation is likely significant, whereas p > 0.05 provides no such evidence.

Suppose we have two vectors, sample1 and sample2, with values from normal populations. We might be very pleased that their correlation is greater than 0.83:

cor(sample1, sample2)
[1] 0.8352458

But that is naive. If we run cor.test, it reports a relatively large p-value of 0.1648; with only a handful of observations, even a correlation this large can easily arise by chance:

cor.test(sample1, sample2)

Testing two samples for the same distribution

A two-sample Kolmogorov-Smirnov test helps to determine whether two samples are drawn from the same distribution. It has the null hypothesis that they are drawn from the same distribution and the alternative hypothesis that they are not.

Suppose you have two samples and you are wondering: did they come from the same distribution? The Kolmogorov–Smirnov test compares the two samples and tests whether they were drawn from the same distribution. The ks.test function implements the test:

ks.test(x, y)

The output includes a p-value. Conventionally, a p-value of less than 0.05 indicates that the two samples (x and y) were drawn from different distributions whereas a p-value exceeding 0.05 provides no such evidence.

The Kolmogorov–Smirnov test is wonderful for two reasons. First, it is a nonparametric test, so you needn’t make any assumptions regarding the underlying distributions: it works for all continuous distributions. Second, it checks the location, dispersion, and shape of the populations, based on the samples. If these characteristics disagree, the test will detect that, allowing us to conclude that the underlying distributions are different. Suppose we suspect that the vectors sample1 and sample2 come from differing distributions; here, ks.test reports a p-value of 0.01297:

ks.test(sample1, sample2)

Comparing the variance of two samples

A hypothesis test for variance allows you to compare the variance of two or more samples to determine whether they are drawn from populations with equal variance. The tests have the null hypothesis that the variances are equal and the alternative hypothesis that they are not. These tests are useful for checking the assumptions of a t-test or analysis of variance.

There are two popular types of test for variance:

  • F-test allows you to compare the variance of two samples. It is suitable for normally distributed data.
  • Bartlett’s test allows you to compare the variance of two or more samples. It is suitable for normally distributed data.

You can perform an F-test with the var.test function. For data in unstacked form (with the samples in separate variables), use the command:

var.test(sample1, sample2)
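
Bartlett’s test, which extends to more than two samples, is available through the bartlett.test function; with unstacked data, one way to call it is:

bartlett.test(list(sample1, sample2))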

Non-parametric methods

When a dataset does not conform to a specific probability distribution, because the assumptions of that distribution do not hold, non-parametric methods are the appropriate way to analyze the data. Non-parametric methods make no assumptions about the underlying probability distribution, so you can draw inferences and perform hypothesis tests without relying on such assumptions.

Let's look at the Wilcoxon signed-rank test.

The mayor of a city wants to see whether pollution levels are reduced by closing the streets to car traffic. Pollution is measured every 60 minutes (from 8 am to 10 pm, 15 measurements in total) on a day when the streets are open to traffic and on a day when they are closed. Here are the air pollution values:

With traffic: 214, 159, 169, 202, 103, 119, 200, 109, 132, 142, 194, 104, 219, 119, 234

Without traffic: 159, 135, 141, 101, 102, 168, 62, 167, 174, 159, 66, 118, 181, 171, 112

The two groups are clearly paired, because the readings are linked: each pair was recorded at the same time of day in the same city (with its particular weather, ventilation, and so on), just on two different days. Since we cannot assume a Gaussian distribution for the recorded values, we proceed with a non-parametric test, the Wilcoxon signed-rank test.

day1 <- c(214, 159, 169, 202, 103, 119, 200, 109, 132, 142, 194, 104, 219, 119, 234)  # with traffic
day2 <- c(159, 135, 141, 101, 102, 168, 62, 167, 174, 159, 66, 118, 181, 171, 112)    # without traffic

wilcox.test(day1, day2, paired=TRUE)  # paired Wilcoxon signed-rank test

Wilcoxon signed rank test

data: day1 and day2
V = 80, p-value = 0.2769
alternative hypothesis: true location shift is not equal to 0

Since the p-value is greater than 0.05, we fail to reject the null hypothesis H0: the data provide no evidence of a shift in pollution levels. Blocking traffic for a single day, then, did not demonstrably improve the city's pollution.