About probability distributions and modeling in R

Science 10.12.2015

Random Variables

A random variable is a numerical characteristic that takes on different values due to chance.

Examples

Coin Flips. The number of heads in four flips of a coin is a random variable because the results will vary between trials.
Heights. Samples of 100 students are repeatedly drawn from the population of all students and their heights are measured. The mean height of a sample of 100 students is a random variable because the statistic will vary between samples.

Random variables are classified into two broad types: discrete and continuous. A discrete random variable has a countable set of distinct possible values. A continuous random variable is one for which any value (to any number of decimal places) within some interval is a possible value. So, if a variable can take on any value between two specified values, it is called a continuous variable; otherwise, it is called a discrete variable.

A binomial random variable is a specific type of discrete random variable that counts how often a particular event occurs in a fixed number of tries or trials.

For a variable to be a binomial random variable, all of the following conditions must be met:

There are a fixed number of trials (a fixed sample size)
On each trial, the event of interest either occurs or does not
The probability of occurrence (or not) is the same on each trial
Trials are independent of one another

Examples

Discrete Random Variables:

Number of heads in 4 flips of a coin (possible outcomes are 0, 1, 2, 3, 4)
Number of classes missed last week (possible outcomes are 0, 1, 2, 3, ..., up to the maximum number of classes)
Amount won or lost when betting $1 on the lottery

Continuous Random Variables:

Heights of individuals
Time to finish a test
Hours spent exercising last week

Probability Distribution

The probability of an event is estimated from the observed data by dividing the number of trials in which the event occurred by the total number of trials. For instance, if it rained 3 out of 10 days with similar conditions as today, the probability of rain today can be estimated as 3 / 10 = 0.30 or 30 percent. Similarly, if 10 out of 50 prior email messages were spam, then the probability of any incoming message being spam can be estimated as 10 / 50 = 0.20 or 20 percent.

To denote these probabilities, we use notation in the form P(x), which signifies the probability of event x. For example, P(rain) = 0.30 and P(spam) = 0.20.

A probability distribution is a table or an equation that links each outcome of a statistical experiment with its probability of occurrence. Consider a simple experiment in which we flip a coin two times. An outcome of the experiment might be the number of heads that we see in two coin flips.

If a random variable is a discrete variable, its probability distribution is called a discrete probability distribution.

An example will make this clear. Suppose you flip a coin two times. This simple statistical experiment can have four possible outcomes: HH, HT, TH, and TT. Now, let the random variable X represent the number of Heads that result from this experiment. The random variable X can only take on the values 0, 1, or 2, so it is a discrete random variable.

The probability distribution for this statistical experiment appears below.

Number of heads	Probability
0	0.25
1	0.50
2	0.25

The above table represents a discrete probability distribution because it relates each value of a discrete random variable with its probability of occurrence.
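As a quick check, the table above can be reproduced by enumerating the four equally likely outcomes and counting heads in each. This is only a sketch; the names `outcomes`, `heads`, and `two_flip_dist` are illustrative and not part of the original text.

```r
# Enumerate all outcomes of two coin flips (H = heads, T = tails).
outcomes <- expand.grid(flip1 = c("H", "T"), flip2 = c("H", "T"),
                        stringsAsFactors = FALSE)

# X = number of heads in each of the four outcomes.
heads <- rowSums(outcomes == "H")

# All outcomes are equally likely, so relative frequencies are probabilities.
two_flip_dist <- table(heads) / nrow(outcomes)
print(two_flip_dist)

# The same probabilities from the binomial pmf with 2 trials and p = 0.5.
print(dbinom(0:2, size = 2, prob = 0.5))
```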

There are any number of discrete probability distributions. A discrete random variable can take on only clearly separated values, such as heads or tails, or the number of spots on a six-sided die. The categories must be mutually exclusive and exhaustive. Every event belongs to one and only one category, and the sum of the probabilities is 1.

The mean of any discrete probability distribution is computed as follows:

$\mu=\sum[xP(x)]$

The variance of a discrete probability distribution is computed as follows:

$\sigma^2=\sum[(x-\mu)^2P(x)]$
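Applied to the two-coin-flip distribution, the two formulas can be evaluated directly. The variable names below are illustrative.

```r
# Values of X (number of heads in two flips) and their probabilities.
x <- 0:2
p <- c(0.25, 0.50, 0.25)

# Mean: sum of each value weighted by its probability.
mu <- sum(x * p)

# Variance: probability-weighted squared deviations from the mean.
sigma2 <- sum((x - mu)^2 * p)

print(mu)      # 1
print(sigma2)  # 0.5
```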

R has functions for all of the well-known probability distributions. For each distribution, there are the following four functions:

Probability density function or probability mass function (prefix d). For discrete distributions, you can use the probability mass function (pmf) to answer questions of the type "What is the probability that the outcome will be equal to x?".
Cumulative distribution function (prefix p). You can use the cumulative distribution function (cdf) to answer questions of the type "If we were to randomly select a member of a given population, what is the probability that it will have a value no greater than x, or a value between x and y?".
Inverse cumulative distribution function (prefix q). Use the inverse cdf to answer questions such as "Which value do x% of the population fall below?", or "What range of values do x% of the population fall within?".
Random number generator (prefix r). Use the random number generator to simulate a random sample from a given distribution.
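A small sketch of the four prefixes side by side, using the binomial distribution for 10 flips of a fair coin. The seed value and variable names are arbitrary choices made for illustration.

```r
n_flips <- 10
p_heads <- 0.5

# d: probability of exactly 5 heads.
d_val <- dbinom(5, n_flips, p_heads)

# p: probability of 5 heads or fewer.
p_val <- pbinom(5, n_flips, p_heads)

# q: smallest number of heads whose cumulative probability reaches 0.5.
q_val <- qbinom(0.5, n_flips, p_heads)

# r: three simulated experiments of 10 flips each (seed chosen arbitrarily).
set.seed(42)
r_val <- rbinom(3, n_flips, p_heads)

print(c(d = d_val, p = p_val, q = q_val))
print(r_val)
```

The same pattern holds for other distributions, for example dnorm, pnorm, qnorm, and rnorm for the normal distribution.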

The probability density function (pdf) and cumulative distribution function (cdf) are two ways of specifying the probability distribution of a random variable.

The pdf is denoted f(x) and gives the relative likelihood that the value of the random variable will be equal to x. The total area under the curve is equal to 1.

The cdf is denoted F(x) and gives the probability that the value of a random variable will be less than or equal to x.

The Binomial Distribution

The discrete binomial distribution is very useful for modeling processes in which each trial has a binary outcome: a success (1) or a failure (0). The random variable X is the number of successes in N independent trials, each with the same probability of success, p. The number of successes can range from 0 to N. The expected value of X is Np, the number of trials times the probability of success on a given trial, and the variance of the binomial distribution is Npq, where q = 1 − p. We calculate the probability of exactly k successes as follows:

$p(X=k|p,N)=\binom{N}{k}p^k(1-p)^{N-k}$
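The formula can be checked against dbinom by computing it by hand with choose(). The values N = 10, k = 3, and p = 0.5 are picked only for illustration; the sketch also confirms the mean Np and variance Npq by summing over the whole distribution.

```r
N <- 10
k <- 3
p <- 0.5

# Binomial probability computed directly from the formula.
by_hand <- choose(N, k) * p^k * (1 - p)^(N - k)

# The same probability from R's built-in pmf.
built_in <- dbinom(k, N, p)
print(c(by_hand = by_hand, built_in = built_in))

# Mean and variance computed from the full distribution.
x <- 0:N
probs <- dbinom(x, N, p)
binom_mean <- sum(x * probs)                 # should equal N * p
binom_var <- sum((x - binom_mean)^2 * probs) # should equal N * p * (1 - p)
print(c(mean = binom_mean, variance = binom_var))
```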

Here is a binomial distribution for the number of successes (heads) in 10 tosses of a fair coin, in which the probability of success for each independent trial is .50. We establish a vector of the number of successes (heads), which can range from 0 to 10, with 5 being the most likely value. The dbinom function produces a vector of values.

x = 0:10
binomDist = dbinom(x, 10, 0.50)
plot(x, binomDist, type = "h")
points(x, binomDist)
abline(h = 0)
lines(x, binomDist)

Example 1

Suppose that a fair die is rolled 10 times. What is the probability of throwing exactly two sixes?

dbinom(2, 10, 1/6)

The probability of throwing exactly two sixes is approximately 0.29 or 29 percent.

Example 2

If you were to roll a fair six-sided die 100 times, what is the probability of rolling a six no more than 10 times? The number of sixes in 100 dice rolls follows a binomial distribution, so you can answer the question with the pbinom function

pbinom(10, 100, 1/6)

From the output, you can see that the probability of rolling no more than 10 sixes is 0.043 (4.3%).

What is the probability of rolling a six more than 20 times?

pbinom(20, 100, 1/6, lower.tail = FALSE)

The probability of rolling more than 20 sixes is approximately 0.15, or 15 percent.
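To make the link between the pmf and the cdf concrete, both answers in this example can be rebuilt from dbinom alone. This is a sketch with illustrative variable names.

```r
# P(X <= 10): add up the probabilities of 0, 1, ..., 10 sixes.
p_at_most_10 <- sum(dbinom(0:10, 100, 1/6))

# P(X > 20): add up the probabilities of 21, 22, ..., 100 sixes.
p_more_than_20 <- sum(dbinom(21:100, 100, 1/6))

# Both agree with the cdf; the upper tail is the complement of P(X <= 20).
print(c(p_at_most_10, pbinom(10, 100, 1/6)))
print(c(p_more_than_20, 1 - pbinom(20, 100, 1/6)))
```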

Example 3

To simulate the number of sixes thrown in 10 rolls of a fair die, use the command:

rbinom(1, 10, 1/6)
[1] 3
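A single draw changes from run to run. Repeating the simulation many times shows the results settling around the theoretical mean Np and variance Npq. The seed and the number of repetitions below are arbitrary choices for this sketch.

```r
set.seed(123)  # arbitrary seed so the run is repeatable

# Simulate 10,000 experiments of 10 rolls each, counting sixes in each.
sims <- rbinom(10000, 10, 1/6)

# Compare simulated and theoretical mean and variance.
sim_mean <- mean(sims)
sim_var <- var(sims)
print(c(simulated = sim_mean, theoretical = 10 * (1/6)))
print(c(simulated = sim_var, theoretical = 10 * (1/6) * (5/6)))
```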

The Poisson Distribution

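The Poisson distribution can be viewed as a limiting case of the binomial distribution: it describes the number of successes when the number of trials is very large and the probability of success on each trial is very small, with the expected number of successes held fixed. We define success and failure in the usual way as 1 and 0, respectively, and as with the binomial distribution, the distinction is often arbitrary.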

The Poisson distribution, unlike the binomial distribution, has no theoretical upper bound on the number of occurrences that can happen within a given interval. We assume the number of occurrences in each interval is independent of the number of occurrences in any other interval. We also assume the probability that an occurrence will happen is the same for every interval. As the interval size decreases, we assume the probability of an occurrence in the interval becomes smaller. In the Poisson distribution, the count of the number of occurrences, X, can take on whole numbers 0, 1, 2, 3, ... The mean number of successes per unit of measure is the value μ. If k is any whole number 0 or greater, then

$P(X=k)=\frac{e^{-\mu}\mu^k}{k!}$
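The formula can be evaluated by hand and compared with dpois. The values μ = 3 and k = 2 are chosen only for illustration.

```r
mu <- 3
k <- 2

# Poisson probability computed directly from the formula.
by_hand <- exp(-mu) * mu^k / factorial(k)

# The same probability from R's built-in pmf.
built_in <- dpois(k, mu)
print(c(by_hand = by_hand, built_in = built_in))

# The mean of the distribution is mu (summing far enough into the tail).
x <- 0:100
pois_mean <- sum(x * dpois(x, mu))
print(pois_mean)
```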

Example 1

The number of lobsters ordered in a restaurant on a given day is known to follow a Poisson distribution with a mean of 20. What is the probability that exactly eighteen lobsters will be ordered tomorrow?

dpois(18, 20)

The probability that exactly eighteen lobsters are ordered is 8.4 percent.

Example 2

The number of lobsters ordered on any given day in a restaurant follows a Poisson distribution with a mean of 20. To simulate the number of lobsters ordered over a seven-day period, use the command:

rpois(7, 20)
[1] 19 10 13 23 21 13 25
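The link between the binomial and Poisson distributions mentioned at the start of this section can be checked numerically. In this sketch the mean is held at 20 while the number of binomial trials n grows, so the success probability 20/n shrinks; the largest gap between the two pmfs should shrink as well. The values of n are arbitrary.

```r
mu <- 20
k <- 0:50
n_values <- c(50, 500, 5000)

# For each n, the largest absolute difference between
# Binomial(n, mu / n) and Poisson(mu) over k = 0, ..., 50.
max_abs_diff <- sapply(n_values, function(n) {
  max(abs(dbinom(k, n, mu / n) - dpois(k, mu)))
})

print(data.frame(n = n_values, max_abs_diff = max_abs_diff))
```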

The Normal Distribution

Continuous variables can take on any value within some specified range. Thus continuous distributions are described by a probability density function (PDF) rather than a discrete probability mass function (PMF).

In contrast to discrete probability distributions, the probability of any single point on the curve is zero, so we do not examine such probabilities and focus instead on areas under the curve. In statistics, the four most commonly used continuous probability distributions are the normal distribution and three other distributions theoretically related to it, namely, the t distribution, the F distribution, and the chi-square distribution.

The normal distribution serves as the backbone of modern statistics. As the distribution is continuous, we are usually interested in finding areas under the normal curve. In particular, we are often interested in left-tailed probabilities, right-tailed probabilities, and the area between two given scores on the normal distribution. There are any number of normal distributions, one for each combination of the population mean μ and a positive population standard deviation σ, so we often find it convenient to work with the unit or standard normal distribution. The unit normal distribution has a mean of 0 (not to be confused in any way with a zero indicating the absence of a quantity), and a standard deviation of 1. Every normal distribution is symmetrical and mound shaped, with its mean, mode, and median all equal; for the standard normal distribution they are all equal to 0. For any normal distribution, we can convert the distribution to the standard normal distribution as follows:

$z=\frac{x-\mu_x}{\sigma_x}$

which is often called z-scoring or standardizing. The empirical rule tells us that for mound-shaped symmetrical distributions like the standard normal distribution, about 68% of the observations will lie within plus or minus 1 standard deviation of the mean. Approximately 95% of the observations will lie within plus or minus 2 standard deviations, and about 99.7% of observations will lie within plus or minus 3 standard deviations. We can use the built-in functions for the normal distribution to see how accurately this empirical rule describes the normal distribution. We find the rule is quite accurate.

pnorm(3) - pnorm(-3)
[1] 0.9973002
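The same check can be run for all three parts of the empirical rule at once. This is a small sketch; the name `within_k_sd` is illustrative.

```r
# Area under the standard normal curve within k standard deviations
# of the mean, for k = 1, 2, 3.
k <- 1:3
within_k_sd <- pnorm(k) - pnorm(-k)

print(round(within_k_sd, 4))  # 0.6827 0.9545 0.9973
```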

To find the value of the pdf at x=2.5 for a normal distribution with a mean of 5 and a standard deviation of 2, use the command:

dnorm(2.5, mean=5, sd=2)
[1] 0.09132454

To find a probability for a nonstandard normal distribution, add the mean and sd arguments. For example, if a random variable is known to be normally distributed with a mean of 5 and a standard deviation of 2 and you wish to find the probability that a randomly selected member will be no more than 6, use the command:

pnorm(6, mean=5, sd=2)
[1] 0.6914625
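The same probability can be obtained by z-scoring first and then using the standard normal distribution, which illustrates the standardizing formula given earlier. Variable names are illustrative.

```r
x <- 6
mu <- 5
sigma <- 2

# Standardize x, then look up the area under the standard normal curve.
z <- (x - mu) / sigma
p_from_z <- pnorm(z)

# Direct calculation on the original scale.
p_direct <- pnorm(x, mean = mu, sd = sigma)

print(c(z = z, p_from_z = p_from_z, p_direct = p_direct))
```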

To find the complementary probability that the value will be greater than 6, set the lower.tail argument to FALSE:

pnorm(6, 5, 2, lower.tail = FALSE)
[1] 0.3085375
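Two related calculations for the same distribution (mean 5, standard deviation 2), sketched here with illustrative values: the area between two scores, found by subtracting one left-tailed probability from another, and the inverse question answered with qnorm.

```r
# P(4 < X < 6): difference of two left-tailed probabilities.
p_between <- pnorm(6, mean = 5, sd = 2) - pnorm(4, mean = 5, sd = 2)
print(p_between)

# Value below which 95% of the distribution falls (inverse cdf).
x_cut <- qnorm(0.95, mean = 5, sd = 2)
print(x_cut)

# Feeding that value back into the cdf recovers 0.95.
print(pnorm(x_cut, mean = 5, sd = 2))
```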

Example of generating random numbers from a normal distribution

Hand span in a particular population is known to be normally distributed with a mean of 195 millimeters and a standard deviation of 17 millimeters. To simulate the hand spans of three randomly selected people, use the command:

rnorm(3, 195, 17)
[1] 186.376 172.164 195.504

One of the most important applications of the normal distribution is its ability to describe the distributions of the means of samples from a population. The central limit theorem tells us that as the sample size increases, the distribution of sample means becomes more and more normal, regardless of the shape of the parent distribution.
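A short simulation can illustrate the theorem. The sketch below draws samples from an exponential distribution, which is strongly skewed, and looks at the distribution of the sample means. The parent distribution, sample size, number of repetitions, and seed are all arbitrary choices, and the theorem assumes the parent distribution has a finite variance.

```r
set.seed(2015)  # arbitrary seed so the run is repeatable

n <- 30       # size of each sample
reps <- 5000  # number of samples drawn

# Each replicate: draw n values from Exponential(rate = 1) and take the mean.
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

# The parent has mean 1 and standard deviation 1, so the sample means
# should centre near 1 with standard deviation near 1 / sqrt(n).
print(c(mean = mean(sample_means), sd = sd(sample_means),
        expected_sd = 1 / sqrt(n)))

# Share of sample means within 2 standard errors of 1 (close to 0.95
# if the sample means are approximately normal).
within_2se <- mean(abs(sample_means - 1) <= 2 / sqrt(n))
print(within_2se)
```

Calling hist(sample_means) after this would show the mound shape directly.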

Vocabulary

Discrete is data that can only take on a countable set of distinct values.

Continuous is quantitative data that can take on any value between the minimum and maximum, and any value between two other values.

Probability is the likelihood of an event occurring.

Mean is the numerical average; calculated as the sum of all of the data values divided by the number of values.

Standard deviation is roughly the typical distance between individual data values and the mean.

Sample is a subset of the population from which data is actually collected.

Population is the entire set of possible observations in which we are interested.

Parameter is a measurable characteristic of a population, such as a mean or standard deviation.

Statistic is a measurable characteristic of a sample, such as a mean or standard deviation.
