

Inferential Statistics
An Introduction


Prof. David Bernstein
James Madison University

Computer Science Department
bernstdh@jmu.edu


Motivation
  • The Rumor:
    • Green M&Ms are an aphrodisiac
  • The Impact on Someone I Know:
    • He bought a lot of bags of M&Ms
  • What He Concluded:
    • There are too few green M&Ms
Motivation (cont.)
  • The Data:
    • There are 58 M&Ms in a bag
    • There are 6 different colors
  • The Naive Hypothesis:
    • There should be \(58/6 \approx 9.667\) green M&Ms in a bag
  • A More Sophisticated Approach:
    • We should be able to use what we know about probability to calculate the expected value
Making a Probabilistic Argument
  • A Simpler Case:
    • There are 4 M&Ms in a bag
    • There are 2 different colors (red and green)
  • The Sample Space:
    • mm_sample-space
Making a Probabilistic Argument (cont.)
  • A Reasonable Assumption:
    • Each outcome is equally likely
  • The Resulting Probabilities:
    • mm_probabilities
Making a Probabilistic Argument (cont.)
  • Creating a Random Variable:
    • Remember, a random variable is a function that maps from the sample space to a set of numbers
    • In this case, let's let the random variable be the number of green M&Ms in a bag
  • The Random Variable:
    • mm_random-variable
Making a Probabilistic Argument (cont.)
  • Creating a Probability Density Function:
    • Remember, a probability density function is a mapping from the possible values of the random variable to the corresponding probabilities
  • In This Case:
    • \(f(0) = 1/16\)
    • \(f(1) = 1/16 + 1/16 + 1/16 + 1/16 = 4/16\)
    • \(f(2) = 1/16 + 1/16 + 1/16 + 1/16 + 1/16 + 1/16 = 6/16\)
    • \(f(3) = 1/16 + 1/16 + 1/16 + 1/16 = 4/16\)
    • \(f(4) = 1/16\)
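The enumeration above can be checked mechanically. The following is a minimal Python sketch, assuming (as on the earlier slide) that each of the 16 outcomes is equally likely. The names `outcomes`, `counts`, and `pdf` are illustrative and do not come from the slides. Note that Python prints the fractions in lowest terms (e.g., 4/16 appears as 1/4).

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Every way to fill a bag of 4 M&Ms with 2 colors (R = red, G = green).
outcomes = list(product("RG", repeat=4))

# The random variable: the number of green M&Ms in each outcome.
counts = Counter(outcome.count("G") for outcome in outcomes)

# Each outcome has probability 1/16, so
# f(x) = (number of outcomes with x greens) / 16.
pdf = {x: Fraction(counts[x], len(outcomes)) for x in sorted(counts)}

for x, p in pdf.items():
    print(f"f({x}) = {p}")
```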
Making a Probabilistic Argument (cont.)
  • Calculating the Expected Value:
    • Remember, the expected value is the sum, over all values of the random variable, of each value times its probability
  • In This Case:
    • \(E(X) = f(0) \cdot 0 + f(1) \cdot 1 + f(2) \cdot 2 + f(3) \cdot 3 + f(4) \cdot 4\)
    • \(= 1/16 \cdot 0 + 4/16 \cdot 1 + 6/16 \cdot 2 + 4/16 \cdot 3 + 1/16 \cdot 4\)
    • \(= 32/16 = 2\)
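The same sum can be carried out with exact fractions. This is only a sketch of the arithmetic; the dictionary `f` simply restates the probabilities listed earlier, and the variable names are illustrative.

```python
from fractions import Fraction

# Probabilities f(x) for x = 0..4 green M&Ms, as listed in the slides.
f = {
    0: Fraction(1, 16),
    1: Fraction(4, 16),
    2: Fraction(6, 16),
    3: Fraction(4, 16),
    4: Fraction(1, 16),
}

# E(X) = sum over x of f(x) * x
expected_value = sum(p * x for x, p in f.items())
print(expected_value)  # prints 2
```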
Interpretation
  • We "Expect":
    • There will be 2 green M&Ms in a bag
  • We Realize:
    • The bags are filled randomly, so any individual bag might contain fewer or more than 2
    • For example, we know from the probability density function that the probability of 1 green M&M is 4/16
  • Implication:
    • If you want to have any confidence, open more than one bag
    • We can easily calculate the probabilities of different events
Back to the Real World
  • One Way to Proceed:
    • Do the same thing for bags of 58 M&Ms in 6 colors
  • The Difficulty:
    • The sample space contains \(6^{58}\) sample points
    • (approximately 1,357,602,000,000,000,000,000,000,000,000,000,000,000,000,000, i.e., about \(1.36 \times 10^{45}\))
  • A Better Way to Proceed:
    • Inferential Statistics
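To get a feel for the size of this sample space, the count can be computed exactly. This is a small sketch that relies only on Python's built-in arbitrary-precision integers; the name `sample_points` is illustrative.

```python
# One color choice (out of 6) for each of the 58 M&Ms in a bag.
sample_points = 6 ** 58

print(f"{sample_points:,}")        # exact value with thousands separators
print(len(str(sample_points)))     # number of decimal digits
```

Enumerating that many outcomes, as was done for the 4-M&M bag, is not practical.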
Statistics
  • Defined:
    • Numbers that are derived from a set of data
  • Two Important Kinds:
    • Measures of central tendency
    • Measures of variability
Statistics from the Population/Complete Set
  • Arithmetic Mean of the Population:
    • \(\mu = \frac{x_1+x_2+\cdots+x_N}{N}\)
  • Standard Deviation of the Population:
    • \(\sigma = \sqrt{ \frac{(x_1 - \mu)^2 + (x_2-\mu)^2+\cdots+(x_N-\mu)^2} {N}}\)
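As an illustration of the two formulas, the sketch below treats the green counts of the 16 equally likely small bags (4 M&Ms, 2 colors) as if they were a complete population. That choice of "population" is an assumption made here for illustration only, and the variable names are not from the slides.

```python
import math

# Hypothetical complete population: the number of greens in each of the
# 16 equally likely small bags (one 0, four 1s, six 2s, four 3s, one 4).
population = [0] + [1] * 4 + [2] * 6 + [3] * 4 + [4]
N = len(population)

# Arithmetic mean of the population.
mu = sum(population) / N

# Standard deviation of the population (note the divisor N, not N - 1).
sigma = math.sqrt(sum((x - mu) ** 2 for x in population) / N)

print(mu, sigma)  # prints 2.0 1.0
```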
Statistics from the Population (cont.)
  • One Way to Proceed:
    • Open every bag of M&Ms ever made
    • See if the arithmetic mean number of greens is less than 58/6
  • A Better Way to Proceed:
    • Work with a sample (i.e., a part of the population) and make inferences about the population
  • A Limitation:
    • We will have to make some kind of probabilistic statement about the population
Making Probabilistic Statements about the Population
  • What We Need:
    • A probability density function
  • Thus Far:
    • We have assigned probabilities to sample points and used them to find the probability density function
  • Now:
    • We want to make a probabilistic statement about a sample mean
    • To do so, we need the distribution of sample means (i.e., we need the probability that the mean number of green M&Ms in any sample is, say, 2 or any other number)
    • So, we have to calculate this distribution
The Central Limit Theorem
  • Definition of the (Arithmetic) Sample Mean:
    • \(\overline{x} = \frac{x_1+x_2+\cdots+x_n}{n}\)
  • An Observation:
    • Each time we take a sample it will, in general, have a different mean
  • A Question:
    • What is the distribution of the sample means?
The Central Limit Theorem (cont.)
  • The Answer:
    • The normal density function
    • \(f(x) = \frac{1}{\sigma \sqrt{2 \pi}} e^{- \frac{(x - \mu)^2}{2 \sigma ^2}}\)
  • A Visualization:
    • normal-density
  • Amazingly:
    • The random variables do not have to be normally distributed -- as long as the random variables are independent, all have the same distribution, and have a finite standard deviation, the distribution of sample means will tend to be normal as the sample size grows
Back to the Simple Example
  • A Single Simulation:
    • To simulate the purchase of 10 of our small bags of M&Ms I rolled a 16-sided die 10 times
  • To See the Value of the Central Limit Theorem:
    • My first simulated purchase of 10 bags had a mean of 1.2 greens
    • My second simulated purchase of 10 bags had a mean of 1.5 greens
    • I did this 998 more times, calculating the mean each time
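The experiment described above can be reproduced in software. The following is a sketch, not the author's actual procedure: it replaces the physical 16-sided die with Python's pseudo-random number generator, and the seed, function name, and variable names are arbitrary choices made here.

```python
import random
import statistics

random.seed(0)  # arbitrary fixed seed so that the run is repeatable

# Green counts for the 16 equally likely outcomes; picking one uniformly
# at random plays the role of rolling the 16-sided die.
greens_by_outcome = [0] + [1] * 4 + [2] * 6 + [3] * 4 + [4]

def mean_greens(bags=10):
    """Simulate buying `bags` small bags; return the mean number of greens."""
    purchase = [random.choice(greens_by_outcome) for _ in range(bags)]
    return statistics.mean(purchase)

# 1000 simulated purchases of 10 bags each, one sample mean per purchase.
sample_means = [mean_greens() for _ in range(1000)]

print(statistics.mean(sample_means))   # should be close to 2
print(statistics.stdev(sample_means))  # spread of the sample means
```

Tabulating `sample_means` (for example, with `collections.Counter`) gives frequencies that can be plotted as a histogram.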
Back to the Simple Example (cont.)

The Frequencies

mm_frequencies
Back to the Simple Example (cont.)

The Relative Frequencies are "Bell Shaped"

mm_relative-frequencies
The Distribution of Sample Means
  • The Mean of this Distribution (\(\mu_{\overline{x}}\)):
    • \(\mu_{\overline{x}} = \mu\)
  • The Standard Deviation of this Distribution (\(\sigma_{\overline{x}}\)):
    • With replacement (which we will assume):
    • \(\sigma_{\overline{x}} = \frac{\sigma}{\sqrt{n}}\)
    • Without replacement:
    • \(\sigma_{\overline{x}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}} \)
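For the with-replacement case, here is a short worked sketch using the small example (4 M&Ms, 2 colors, samples of 10 bags). The value \(\sigma = 1\) is not stated in the slides; it is derived here from the probabilities \(f(0), \ldots, f(4)\) given earlier, and the variable names are illustrative.

```python
import math

# Probabilities for 0..4 greens in the small example.
f = {0: 1 / 16, 1: 4 / 16, 2: 6 / 16, 3: 4 / 16, 4: 1 / 16}

mu = sum(p * x for x, p in f.items())                            # population mean
sigma = math.sqrt(sum(p * (x - mu) ** 2 for x, p in f.items()))  # population std. dev.

n = 10                             # bags per sample
sigma_xbar = sigma / math.sqrt(n)  # std. dev. of the distribution of sample means

print(mu, sigma, round(sigma_xbar, 3))
```

So sample means of 10 small bags should cluster around 2 with a standard deviation of roughly 0.32.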
Using the Distribution of Sample Means
  • One Particular Case:
    • The area under the normal density function between \(\mu - 1.65 \sigma\) and \(\infty\) is approximately \(0.95\)
  • Visualizing this Case:
    • normal-density_area_one-tailed
  • Using this Case:
    • The probability of observing a value less than \(\mu - 1.65 \sigma\) is approximately \(0.05\) (i.e., \(1 - 0.95\))
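These areas can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library (available in Python 3.8 and later) with the standard normal distribution, which is enough because the 1.65 multiplier does not depend on \(\mu\) or \(\sigma\). The variable names are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Area to the left of mu - 1.65 * sigma (the shaded lower tail).
lower_tail = standard_normal.cdf(-1.65)

# Area between mu - 1.65 * sigma and infinity.
upper_area = 1 - lower_tail

# The cutoff that leaves exactly 0.05 in the lower tail.
exact_cutoff = standard_normal.inv_cdf(0.05)

print(round(lower_tail, 4), round(upper_area, 4), round(exact_cutoff, 3))
```

The exact cutoff is slightly smaller in magnitude than 1.65, which is why the areas are only approximately 0.05 and 0.95.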
Back to Our Example
  • A Hypothesis:
    • The bags are "fair" (i.e., the population mean is 9.667 greens)
  • Suppose:
    • The number of greens is normally distributed with (a mean of 9.667 and) a standard deviation of 2
  • Then We Know:
    • The probability is less than 0.05 that the mean of any sample of bags (not the count in any one bag) is less than \(9.667 - 1.65 \cdot 2 = 6.367\) greens
  • Testing the Hypothesis:
    • I collect a sample of bags
    • The sample mean is 5 greens
    • If the population mean really were 9.667, a sample mean this low would occur with probability less than 0.05, so I can reject the hypothesis
Back to Our Example (cont.)
  • An Important Question:
    • How did you find the standard deviation?
  • A Lame Answer:
    • I didn't -- I said "suppose" it is 2
  • An Important Open Question:
    • What good is this if I have to know the standard deviation?
Estimating the Standard Deviations
  • Estimating the Standard Deviation of the Population:
    • \(s = \sqrt{ \frac{(x_1 - \overline{x})^2 + (x_2 - \overline{x})^2 + \cdots + (x_n - \overline{x})^2 }{n - 1} } \)
  • Estimating the Standard Deviation of the Distribution of Sample Means:
    • \(s_{\overline{x}} = \frac{s}{\sqrt{n}}\)
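The two estimates can be computed as follows. This is a sketch on made-up data: the five green counts in `sample` are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import math
import statistics

# Hypothetical green counts from n = 5 bags (invented for illustration).
sample = [6, 9, 7, 8, 5]
n = len(sample)

x_bar = sum(sample) / n

# Estimate of the population standard deviation (divisor n - 1).
s = math.sqrt(sum((x - x_bar) ** 2 for x in sample) / (n - 1))

# Estimate of the standard deviation of the distribution of sample means.
s_xbar = s / math.sqrt(n)

# The standard library's stdev() uses the same n - 1 divisor.
print(s, statistics.stdev(sample), s_xbar)
```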
Using the Appropriate Distribution
  • An Observation:
    • If we use \(s\) rather than \(\sigma\) we can't use the normal distribution
  • Student's \(t\) Distribution:
    • Bell-shaped
    • Close to normal when \(n\) is large
    • "Squat" when \(n\) is small
  • An Implication:
    • We can't use the 1.65 rule-of-thumb -- the value depends on the sample size
Using the Appropriate Distribution - Nerd Humor
\(t\) Distribution
(Courtesy of xkcd)
Critical Values of \(t\) at the 0.95 Level
t_critical-values
Finishing the Example
  • Study Design:
    • I bought 20 bags of M&Ms
    • I counted the number of green M&Ms in each and denoted the value in bag \(i\) as \(x_i\)
  • The Data:
    • The mean was \(\overline{x} = 7.2\)
    • mm_standard-deviation
Finishing the Example (cont.)
  • Estimating the Standard Deviation of the Population:
    • \(s = \sqrt{ \frac{(x_1 - \overline{x})^2 + (x_2 - \overline{x})^2 + \cdots + (x_n - \overline{x})^2 }{n - 1} } = \sqrt{\frac{133.20}{20-1}} = 2.65\)
  • Estimating the Standard Deviation of the Distribution of Sample Means:
    • \(s_{\overline{x}} = \frac{s}{\sqrt{n}} = \frac{2.65}{\sqrt{20}} = 0.593\)
Finishing the Example (cont.)
  • Hypothesis:
    • There are 9.667 green M&Ms in a bag (i.e., the population mean is 9.667)
  • Testing the Hypothesis at the 0.95 Level:
    • \(n-1=19\) implies that the critical \(t\) value is 1.729
    • This means that the probability is less than 0.05 that a sample of 20 bags will have a mean number of greens less than \(9.667 - 0.593 \cdot 1.729 = 8.642\)
    • Since our actual sample mean is 7.2 we can reject the hypothesis
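The whole test can be retraced from the summary numbers on the slides (\(n = 20\), \(\overline{x} = 7.2\), a sum of squared deviations of 133.20, and a critical \(t\) value of 1.729). The sketch below only redoes that arithmetic; the variable names are illustrative, and the critical value is taken from the table rather than computed.

```python
import math

n = 20               # number of bags in the sample
x_bar = 7.2          # sample mean number of greens
sum_sq_dev = 133.20  # sum of squared deviations from the sample mean
mu_0 = 58 / 6        # hypothesized population mean (about 9.667)
t_critical = 1.729   # critical t value for n - 1 = 19, from the table

s = math.sqrt(sum_sq_dev / (n - 1))   # estimated population std. dev.
s_xbar = s / math.sqrt(n)             # estimated std. dev. of sample means

# Sample means below this threshold occur with probability less than 0.05
# if the hypothesis is true.
threshold = mu_0 - t_critical * s_xbar

# Equivalent view: how many estimated standard deviations below mu_0 we are.
t_statistic = (x_bar - mu_0) / s_xbar

reject = x_bar < threshold
print(round(s, 2), round(s_xbar, 3), round(threshold, 3), round(t_statistic, 2), reject)
```

The unrounded threshold differs from the slide's 8.642 only in the third decimal place, because the slide rounds \(s\) and \(s_{\overline{x}}\) before multiplying. The conclusion is the same either way.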
There's Always More to Learn