**Random Variables**

A *random variable* is a numerical characteristic that takes on different values due to chance.

*Examples*

*Coin Flips*. The number of heads in four flips of a coin is a random variable because the results will vary between trials.

*Heights*. Samples of 100 students are repeatedly drawn from the population of all students and their heights are measured. The mean height of a sample of 100 students is a random variable because the statistic will vary between samples.

Random variables are classified into two broad types: discrete and continuous. A **discrete random variable** has a countable set of distinct possible values. A **continuous random variable** is one for which any value (to any number of decimal places) within some interval is a possible value. In other words, if a variable can take on any value between two specified values, it is called a **continuous variable**; otherwise, it is called a **discrete variable**.

A *binomial random variable* is a specific type of discrete random variable that counts how often a particular event occurs in a fixed number of trials.

For a variable to be a binomial random variable, all of the following conditions must be met:

- There are a fixed number of trials (a fixed sample size)
- On each trial, the event of interest either occurs or does not
- The probability of occurrence (or not) is the same on each trial
- Trials are independent of one another

*Examples*

**Discrete Random Variables:**

- Number of heads in 4 flips of a coin (possible outcomes are 0, 1, 2, 3, 4)
- Number of classes missed last week (possible outcomes are 0, 1, 2, 3, ..., up to the maximum number of classes)
- Amount won or lost when betting $1 on the lottery

**Continuous Random Variables:**

- Heights of individuals
- Time to finish a test
- Hours spent exercising last week

**Probability Distribution**

The *probability of an event* is estimated from the observed data by dividing the number of trials in which the event occurred by the total number of trials. For instance, if it rained 3 out of 10 days with similar conditions as today, the probability of rain today can be estimated as 3 / 10 = 0.30 or 30 percent. Similarly, if 10 out of 50 prior email messages were spam, then the probability of any incoming message being spam can be estimated as 10 / 50 = 0.20 or 20 percent.

To denote these probabilities, we use notation in the form *P(x)*, which signifies the probability of event *x*. For example, *P(rain) = 0.30* and *P(spam) = 0.20*.
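This frequency-based estimate is straightforward to compute in R. Here `messages` is a hypothetical vector standing in for the 50 email messages described above:

```r
# Hypothetical data: 50 messages, 10 of which are spam
messages <- rep(c("spam", "ham"), times = c(10, 40))

# Estimate P(spam) as the relative frequency of spam messages
mean(messages == "spam")  # 0.2
```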

A probability distribution is a table or an equation that links each outcome of a statistical experiment with its probability of occurrence. Consider a simple experiment in which we flip a coin two times. An outcome of the experiment might be the number of heads that we see in two coin flips.

If a random variable is a discrete variable, its probability distribution is called a discrete probability distribution.

An example will make this clear. Suppose you flip a coin two times. This simple statistical experiment can have four possible outcomes: *HH, HT, TH, and TT*. Now, let the random variable X represent the number of Heads that result from this experiment. The random variable X can only take on the values 0, 1, or 2, so it is a discrete random variable.

The probability distribution for this statistical experiment appears below.

| Number of heads | Probability |
|---|---|
| 0 | 0.25 |
| 1 | 0.50 |
| 2 | 0.25 |

The above table represents a discrete probability distribution because it relates each value of a discrete random variable with its probability of occurrence.

There are any number of discrete probability distributions. A discrete random variable can take on only clearly separated values, such as heads or tails, or the number of spots on a six-sided die. The categories must be mutually exclusive and exhaustive. Every event belongs to one and only one category, and the sum of the probabilities is 1.
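The coin-flip distribution above can be reproduced by enumerating the four equally likely outcomes and tabulating their relative frequencies; a quick sketch in R:

```r
# Heads counts for the four equally likely outcomes HH, HT, TH, TT
heads <- c(HH = 2, HT = 1, TH = 1, TT = 0)

# Relative frequency of 0, 1, and 2 heads
probs <- table(heads) / length(heads)
probs       # 0.25, 0.50, 0.25
sum(probs)  # a valid distribution sums to 1
```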

The mean of any discrete probability distribution can be computed as follows:

μ = Σ x P(x)

The variance for a discrete probability distribution is calculated as follows:

σ² = Σ (x − μ)² P(x)
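For the coin-flip distribution above, these calculations give a mean of 1 head and a variance of 0.5; checking in R:

```r
x <- c(0, 1, 2)           # possible numbers of heads
p <- c(0.25, 0.50, 0.25)  # their probabilities

mu <- sum(x * p)               # mean: probability-weighted sum of values
sigma2 <- sum((x - mu)^2 * p)  # variance: probability-weighted squared deviations
mu      # 1
sigma2  # 0.5
```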

R has functions for all of the well-known probability distributions. For each distribution, there are the following four functions:

- *Probability density function or probability mass function (prefix `d`).* For discrete distributions, you can use the probability mass function (pmf) to answer questions of the type *"What is the probability that the outcome will be equal to x?"*
- *Cumulative distribution function (prefix `p`).* You can use the cumulative distribution function (cdf) to answer questions of the type *"If we were to randomly select a member of a given population, what is the probability that it will have a value less than x, or a value between x and y?"*
- *Inverse cumulative distribution function (prefix `q`).* Use the inverse cdf to answer questions such as *"Which value do x% of the population fall below?"* or *"What range of values do x% of the population fall within?"*
- *Random number generator (prefix `r`).* Use the random number generator to simulate a random sample from a given distribution.

The probability density function (pdf) and cumulative distribution function (cdf) are two ways of specifying the probability distribution of a random variable.

The pdf is denoted f(x) and gives the relative likelihood that the value of the random variable will be equal to x. The total area under the curve is equal to 1.

The cdf is denoted F(x) and gives the probability that the value of a random variable will be less than or equal to x.
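For a continuous distribution, F(x) is the accumulated area under f up to x, so numerically integrating the standard normal pdf should reproduce the cdf; a quick check:

```r
# Area under the standard normal pdf from -Inf to 1
area <- integrate(dnorm, lower = -Inf, upper = 1)$value
area      # about 0.8413
pnorm(1)  # the cdf gives the same probability directly
```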

**The Binomial Distribution**

The discrete binomial distribution is very useful for modeling processes in which the binary outcome can be either a success (1) or a failure (0). The random variable *X* is the number of successes in *N* independent trials, for each of which the probability of success, *p*, is the same. The number of successes can range from 0 to *N*. The expected value of *X* is *Np*, the number of trials times the probability of success on a given trial, and the variance of the binomial distribution is *Npq*, where *q* = 1 − *p*. We calculate the probability of exactly *k* successes as follows:

P(X = k) = C(N, k) pᵏ q^(N − k), where C(N, k) = N! / (k! (N − k)!)
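The binomial probability formula can be verified against R's built-in pmf. For example, the probability of exactly 2 heads in 4 fair coin flips:

```r
N <- 4; k <- 2; p <- 0.5; q <- 1 - p

# By the formula: C(N, k) p^k q^(N - k)
choose(N, k) * p^k * q^(N - k)  # 0.375
dbinom(k, N, p)                 # the built-in pmf agrees
```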

Here is a binomial distribution for the number of successes (heads) in 10 tosses of a fair coin, in which the probability of success for each independent trial is .50. We establish a vector of the number of successes (heads), which can range from 0 to 10, with 5 being the most likely value. The `dbinom` function produces a vector of probabilities.

x = 0:10
binomDist = dbinom(x, 10, 0.50)
plot(x, binomDist, type = "h")
points(x, binomDist)
abline(h = 0)
lines(x, binomDist)

**Example 1**

Suppose that a fair die is rolled 10 times. What is the probability of throwing exactly two sixes?

dbinom(2, 10, 1/6)

The probability of throwing two sixes is approximately 0.29 or 29 percent.

**Example 2**

If you were to roll a fair six-sided die 100 times, what is the probability of rolling a six no more than 10 times? The number of sixes in 100 dice rolls follows a binomial distribution, so you can answer the question with the `pbinom` function:

pbinom(10, 100, 1/6)

From the output, you can see that the probability of rolling no more than 10 sixes is 0.043 (4.3%).

What is the probability of rolling a six more than 20 times?

pbinom(20, 100, 1/6, lower.tail=F)

The probability of rolling more than 20 sixes is approximately 0.15, or 15 percent.

**Example 3**

To simulate the number of sixes thrown in 10 rolls of a fair die, use the command:

rbinom(1, 10, 1/6)
[1] 3

**The Poisson Distribution**

The Poisson distribution arises as a limiting case of the binomial distribution, obtained when the number of trials grows large while the expected number of successes stays fixed. We define success and failure in the usual way as 1 and 0, respectively, and as with the binomial distribution, the distinction is often arbitrary.

The Poisson distribution, unlike the binomial distribution, has no theoretical upper bound on the number of occurrences that can happen within a given interval. We assume the number of occurrences in each interval is independent of the number of occurrences in any other interval. We also assume the probability that an occurrence will happen is the same for every interval. As the interval size decreases, we assume the probability of an occurrence in the interval becomes smaller. In the Poisson distribution, the count of the number of occurrences, *X*, can take on whole numbers 0, 1, 2, 3, ... The mean number of successes per unit of measure is the value *μ*. If *k* is any whole number 0 or greater, then

P(X = k) = μᵏ e^(−μ) / k!
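The Poisson pmf can be computed by hand and compared with `dpois`; for instance, with μ = 3 and k = 2:

```r
mu <- 3; k <- 2

# By the formula: mu^k e^(-mu) / k!
exp(-mu) * mu^k / factorial(k)  # about 0.224
dpois(k, mu)                    # the built-in pmf agrees
```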

**Example 1**

The number of lobsters ordered in a restaurant on a given day is known to follow a Poisson distribution with a mean of 20. What is the probability that exactly eighteen lobsters will be ordered tomorrow?

dpois(18, 20)

The probability that exactly eighteen lobsters are ordered is approximately 8.4 percent.

**Example 2**

The number of lobsters ordered on any given day in a restaurant follows a Poisson distribution with a mean of 20. To simulate the number of lobsters ordered over a seven-day period, use the command:

rpois(7, 20)
[1] 19 10 13 23 21 13 25

**The Normal Distribution**

Continuous variables can take on any value within some specified range. Thus continuous probability distributions are described by a probability density function (PDF) rather than the probability mass function (PMF) used for discrete distributions.

In contrast to discrete probability distributions, the probability that a continuous random variable takes any single exact value is zero, so we focus instead on areas under the curve. In statistics, the four most commonly used continuous probability distributions are the normal distribution and three other distributions theoretically related to it: *the t distribution*, *the F distribution*, and *the chi-square distribution*.

The normal distribution serves as the backbone of modern statistics. As the distribution is continuous, we are usually interested in finding areas under the normal curve. In particular, we are often interested in left-tailed probabilities, right-tailed probabilities, and the area between two given scores on the normal distribution. There are infinitely many normal distributions, one for each combination of the population mean μ and positive population standard deviation σ, so we often find it convenient to work with the unit or standard normal distribution. The standard normal distribution has a mean of 0 (a location on the scale, not an absence of quantity) and a standard deviation of 1. It is symmetrical and mound shaped, and its mean, median, and mode are all equal to 0. We can convert any normal distribution to the standard normal distribution as follows:

z = (x − μ) / σ

which is often called z-scoring or standardizing. The empirical rule tells us that for mound-shaped symmetrical distributions like the standard normal distribution, about 68% of the observations will lie between plus and minus 1 standard deviation from the mean. Approximately 95% of the observations will lie within plus or minus 2 standard deviations, and about 99.7% of observations will lie within plus or minus 3 standard deviations. We can use the built-in functions for the normal distribution to see how accurately this empirical rule describes the normal distribution. We find the rule is quite accurate.

pnorm(3) - pnorm(-3)
[1] 0.9973002
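The same check for one and two standard deviations confirms the 68% and 95% figures:

```r
# Area within 1 and 2 standard deviations of the mean
pnorm(1) - pnorm(-1)  # about 0.6827
pnorm(2) - pnorm(-2)  # about 0.9545
```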

To find the value of the pdf at x=2.5 for a normal distribution with a mean of 5 and a standard deviation of 2, use the command:

dnorm(2.5, mean=5, sd=2)
[1] 0.09132454

To find a probability for a nonstandard normal distribution, add the mean and sd arguments. For example, if a random variable is known to be normally distributed with a mean of 5 and a standard deviation of 2 and you wish to find the probability that a randomly selected member will be no more than 6, use the command:

pnorm(6, mean=5, sd=2)
[1] 0.6914625

To find the complementary probability that the value will be greater than 6, set the lower.tail argument to F:

pnorm(6, 5, 2, lower.tail=F)
[1] 0.3085375
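The `q` (inverse cdf) functions run these lookups in reverse, returning the value below which a given proportion of the population falls:

```r
# Which value do about 69.15% of this population fall below?
qnorm(0.6914625, mean = 5, sd = 2)  # about 6, inverting the pnorm result above
```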

**Example of generating random numbers from a normal distribution**

Hand span in a particular population is known to be normally distributed with a mean of 195 millimeters and a standard deviation of 17 millimeters. To simulate the hand spans of three randomly selected people, use the command:

rnorm(3, 195, 17)
[1] 186.376 172.164 195.504

One of the most important applications of the normal distribution is its ability to describe the distributions of the means of samples from a population. The central limit theorem tells us that as the sample size increases, the distribution of sample means becomes more and more normal, regardless of the shape of the parent distribution.
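A small simulation illustrates this. The exponential distribution is strongly right-skewed, yet the means of repeated samples drawn from it cluster symmetrically around the population mean (the sample size of 30 and the 1000 replications are arbitrary choices for illustration):

```r
set.seed(1)  # for reproducibility

# 1000 samples of size 30 from a skewed exponential distribution with mean 1
sample_means <- replicate(1000, mean(rexp(30, rate = 1)))

mean(sample_means)  # close to the population mean of 1
hist(sample_means)  # roughly symmetric and mound shaped
```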

**Vocabulary**

**Discrete** data can take on only a set number of distinct values.

**Continuous** data is quantitative data that can take on any value between the minimum and maximum, including any value between two other values.

**Probability** is the likelihood of an event occurring.

**Mean** is the numerical average, calculated as the sum of all of the data values divided by the number of values.

**Standard deviation** is roughly the average difference between individual data values and the mean.

**Sample** is a subset of the population from which data is actually collected.

**Population** is the entire set of possible observations in which we are interested.

**Parameter** is a measurable characteristic of a population, such as a mean or standard deviation.

**Statistic** is a measurable characteristic of a sample, such as a mean or standard deviation.
