Summations
01 Theory
Theory 1
In many contexts it is useful to consider random variables that are summations of a large number of variables.
Summation formulas: $E[X]$ and $\mathrm{Var}(X)$
Suppose $X$ is a large sum of random variables:
$$X = X_1 + X_2 + \cdots + X_n$$
Then:
$$E[X] = E[X_1] + E[X_2] + \cdots + E[X_n]$$
If $X_i$ and $X_j$ are uncorrelated whenever $i \neq j$ (e.g. if they are independent):
$$\mathrm{Var}(X) = \mathrm{Var}(X_1) + \mathrm{Var}(X_2) + \cdots + \mathrm{Var}(X_n)$$
Extra - Derivation of variance of a sum
Using the definition:
$$\begin{aligned}
\mathrm{Var}(X) &= E\big[(X - E[X])^2\big] \\
&= E\Big[\Big(\sum_i (X_i - E[X_i])\Big)^2\Big] \\
&= \sum_i \sum_j E\big[(X_i - E[X_i])(X_j - E[X_j])\big] \\
&= \sum_i \sum_j \mathrm{Cov}(X_i, X_j) \\
&= \sum_i \mathrm{Var}(X_i) + 2 \sum_{i < j} \mathrm{Cov}(X_i, X_j)
\end{aligned}$$
In the last line we use the fact that $\mathrm{Cov}(X_i, X_i) = \mathrm{Var}(X_i)$ for the first term, and the symmetry property $\mathrm{Cov}(X_i, X_j) = \mathrm{Cov}(X_j, X_i)$ for the second term with the factor of 2.
02 Illustration
Example - Binomial expectation and variance
Binomial expectation and variance
Suppose we have $n$ repeated Bernoulli trials with $P(\text{success}) = p$.
The sum $X = X_1 + \cdots + X_n$ of the individual Bernoulli variables is a binomial variable: $X \sim \mathrm{Binom}(n, p)$.
We know $E[X_i] = p$ and $\mathrm{Var}(X_i) = p(1-p)$.
The summation rule for expectation:
$$E[X] = E[X_1] + \cdots + E[X_n] = np$$
The summation rule for variance (the trials are independent):
$$\mathrm{Var}(X) = \mathrm{Var}(X_1) + \cdots + \mathrm{Var}(X_n) = np(1-p)$$
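Both formulas can be checked by simulation. The parameters below ($n = 50$, $p = 0.3$) are chosen here for illustration, not taken from the text:

```python
import random
import statistics

random.seed(1)
n, p = 50, 0.3  # hypothetical parameters, chosen for illustration

# Simulate Binomial(n, p) as a sum of n Bernoulli(p) indicator variables.
samples = [sum(random.random() < p for _ in range(n)) for _ in range(50_000)]

print(statistics.mean(samples))       # ≈ n*p = 15
print(statistics.pvariance(samples))  # ≈ n*p*(1-p) = 10.5
```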
Example - Pascal expectation and variance
Pascal expectation and variance
(1) Let $X \sim \mathrm{Pascal}(m, p)$, the number of trials needed to obtain $m$ successes.
Let $X_1, X_2, \ldots, X_m$ be independent random variables, where:
- $X_1$ counts the trials until the first success
- $X_2$ counts the trials after the first success until the second success
- $X_i$ counts the trials after the $(i-1)^{\text{th}}$ success until the $i^{\text{th}}$ success
Observe that $X = X_1 + X_2 + \cdots + X_m$.
(2) Notice that $X_i \sim \mathrm{Geom}(p)$ for every $i$. Therefore:
$$E[X_i] = \frac{1}{p}, \qquad \mathrm{Var}(X_i) = \frac{1-p}{p^2}$$
(3) Using the summation rule, conclude:
$$E[X] = \frac{m}{p}, \qquad \mathrm{Var}(X) = \frac{m(1-p)}{p^2}$$
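A simulation built exactly this way, as a sum of independent geometric variables, reproduces both conclusions. The parameters ($m = 5$, $p = 0.25$) are hypothetical:

```python
import random
import statistics

random.seed(2)
m, p = 5, 0.25  # hypothetical parameters: 5th success, success probability 1/4

def geometric(p):
    """Trials up to and including the first success."""
    k = 1
    while random.random() >= p:
        k += 1
    return k

# Pascal(m, p): trials until the m-th success = a sum of m geometric variables.
samples = [sum(geometric(p) for _ in range(m)) for _ in range(50_000)]

print(statistics.mean(samples))       # ≈ m/p = 20
print(statistics.pvariance(samples))  # ≈ m*(1-p)/p**2 = 60
```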
Example - Multinomial covariances
Multinomial covariances
Each trial of an experiment has $r$ possible outcomes labeled $1, \ldots, r$ with probabilities of occurrence $p_1, \ldots, p_r$. The experiment is run $n$ times.
Let $X_i$ count the number of occurrences of outcome $i$. So $X_1 + X_2 + \cdots + X_r = n$.
Find $\mathrm{Cov}(X_i, X_j)$ for $i \neq j$.
Solution
Notice that $X_i + X_j$ is also a binomial variable with success probability $p_i + p_j$. ('Success' is an outcome of either $i$ or $j$. 'Failure' is any other value.)
The variance of a binomial is known to be $np(1-p)$, so $\mathrm{Var}(X_i + X_j) = n(p_i + p_j)(1 - p_i - p_j)$.
Compute $\mathrm{Cov}(X_i, X_j)$ by solving:
$$\mathrm{Var}(X_i + X_j) = \mathrm{Var}(X_i) + \mathrm{Var}(X_j) + 2\,\mathrm{Cov}(X_i, X_j)$$
$$n(p_i + p_j)(1 - p_i - p_j) = np_i(1-p_i) + np_j(1-p_j) + 2\,\mathrm{Cov}(X_i, X_j)$$
Expanding both sides and simplifying gives:
$$\mathrm{Cov}(X_i, X_j) = -np_ip_j$$
Example - Months with a birthday
Months with a birthday
Suppose study groups of 10 are formed from a large population.
For a typical study group, how many months out of the year contain a birthday of a member of the group? (Assume all 12 months have equal duration.)
Solution
(1) Let $X_i$ be 1 if month $i$ contains a birthday, and 0 otherwise.
Let $N = X_1 + \cdots + X_{12}$ count the months containing a birthday. So we seek $E[N]$. This equals $E[X_1] + \cdots + E[X_{12}]$.
The answer will be $12\,E[X_1]$ because all terms are equal.
(2) For a given $i$:
The complement event:
$$P(X_i = 0) = \left(\frac{11}{12}\right)^{10}, \qquad \text{so } E[X_i] = P(X_i = 1) = 1 - \left(\frac{11}{12}\right)^{10}$$
(3) Therefore:
$$E[N] = 12\left(1 - \left(\frac{11}{12}\right)^{10}\right) \approx 6.97$$
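A direct simulation of the group-of-10 setup agrees with the exact answer (the code below is an illustration of the computation, not part of the original notes):

```python
import random
import statistics

random.seed(4)

def months_covered(group_size=10):
    # Each member's birth month is uniform over the 12 months.
    return len({random.randrange(12) for _ in range(group_size)})

samples = [months_covered() for _ in range(100_000)]
exact = 12 * (1 - (11 / 12) ** 10)

print(exact)                     # ≈ 6.97
print(statistics.mean(samples))  # should be close to the exact value
```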
Example - Hats in the air
Hats in the air
All the sailors on a ship throw their hats in the air, and each catches a random hat when the hats fall back down.
(a) How many sailors do you expect will catch the hat they own?
(b) What is the variance of this number?
Solution
Strangely, the answers are both 1, regardless of the number of sailors. Here is the reasoning:
(a) Let $X_i = 1$ when sailor $i$ catches their own hat, and $X_i = 0$ otherwise. Thus $X_i$ is Bernoulli with $p = \frac{1}{n}$.
Now $N = X_1 + \cdots + X_n$ counts the total number of hats caught by their owners.
Note that $E[X_i] = \frac{1}{n}$. Therefore:
$$E[N] = E[X_1] + \cdots + E[X_n] = n \cdot \frac{1}{n} = 1$$
(b) We know:
$$\mathrm{Var}(N) = E[N^2] - E[N]^2$$
Now calculate $E[N^2]$:
$$E[N^2] = E\Big[\Big(\sum_i X_i\Big)^2\Big] = \sum_i E[X_i^2] + \sum_{i \neq j} E[X_i X_j]$$
Use $X_i^2 = X_i$. Observe that $E[X_i^2] = \frac{1}{n}$. Therefore:
$$\sum_i E[X_i^2] = n \cdot \frac{1}{n} = 1$$
Now calculate the cross terms:
We need to compute $E[X_i X_j]$ for $i \neq j$.
Notice that $X_i X_j = 1$ when $i$ and $j$ both catch their own hats, and 0 otherwise. So it is Bernoulli. Then:
$$E[X_i X_j] = P(X_i = 1 \text{ and } X_j = 1) = \frac{1}{n} \cdot \frac{1}{n-1} = \frac{1}{n(n-1)}$$
Therefore:
$$\sum_{i \neq j} E[X_i X_j] = n(n-1) \cdot \frac{1}{n(n-1)} = 1$$
Putting everything together:
$$\mathrm{Var}(N) = E[N^2] - E[N]^2 = (1 + 1) - 1^2 = 1$$
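The surprising claim that both answers are 1 for any number of sailors can be tested by simulating random permutations and counting fixed points (this sketch is an illustration added here, with several values of $n$):

```python
import random
import statistics

random.seed(5)

def own_hat_count(n):
    """Sailors who catch their own hat = fixed points of a random permutation."""
    hats = list(range(n))
    random.shuffle(hats)
    return sum(1 for sailor, hat in enumerate(hats) if sailor == hat)

for n in (5, 20, 100):
    samples = [own_hat_count(n) for _ in range(20_000)]
    print(n, statistics.mean(samples), statistics.pvariance(samples))  # both ≈ 1
```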
Central Limit Theorem
03 Theory
Theory 1
Video by 3Blue1Brown:
IID variables
Random variables $X_1, X_2, \ldots, X_n$ are called independent, identically distributed (IID) when they are independent and all have the same distribution.
IID variables: Same distribution, different values
Independent variables cannot be correlated, so the values taken by IID variables will disagree on most outcomes, even though their distributions agree.
We do have, for all $i$, $j$, and $x$:
$$F_{X_i}(x) = F_{X_j}(x)$$
Standardization
Suppose $X$ is any random variable.
The standardization of $X$ is:
$$\widetilde{X} = \frac{X - E[X]}{\mathrm{SD}(X)}$$
The variable $\widetilde{X}$ has $E[\widetilde{X}] = 0$ and $\mathrm{Var}(\widetilde{X}) = 1$. We can reconstruct $X$ by:
$$X = \mathrm{SD}(X) \cdot \widetilde{X} + E[X]$$
Suppose $X_1, X_2, \ldots, X_n, \ldots$ is a collection of IID random variables.
Define:
$$Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}}$$
where:
$$S_n = X_1 + \cdots + X_n, \qquad \mu = E[X_i], \qquad \sigma = \mathrm{SD}(X_i)$$
So $Z_n$ is the standardization of $S_n$.
Let $Z$ be a standard normal random variable, $Z \sim \mathcal{N}(0, 1)$.
Central Limit Theorem
Suppose $S_n = X_1 + \cdots + X_n$ for IID variables $X_i$, and $Z_n$ are the standardizations of $S_n$.
Then for any interval $[a, b]$:
$$\lim_{n \to \infty} P(a \le Z_n \le b) = P(a \le Z \le b)$$
We say that $Z_n$ converges in distribution to the standard normal $Z$.
The distribution of a very large sum of IID variables is determined merely by $\mu$ and $\sigma$ from the original IID variables, while the data of the higher moments fades away.
The name “normal distribution” is used because it arises from a large sum of repetitions of any other kind of distribution. It is therefore ubiquitous in applications.
Misuse of the CLT
It is important to learn when the CLT is applicable and when it is not. Many people (even professionals) apply it wrongly.
For example, sometimes one hears the claim that if enough students take an exam, the distribution of scores will be approximately normal. This is totally wrong!
Intuition for the CLT
The CLT is about the distribution of simultaneity, or (in other words) about accumulated alignment between independent variables.
With a large , deviations of the total sum are predominantly created by simultaneous (correlated) deviations of a large portion of summands away from their means, rather than the contributions of individual summands deviating a large amount.
Simultaneity across a large number of independent items is described by… the bell curve.
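This convergence can be observed directly. The sketch below (an added illustration; the choice of exponential summands is arbitrary, since the CLT applies to any IID distribution with finite variance) standardizes sums of $n = 200$ exponential variables and compares an empirical probability with the standard normal CDF:

```python
import math
import random

random.seed(6)
n = 200          # summands per sample
trials = 20_000  # number of standardized sums drawn

def standardized_sum():
    # Exponential(1) summands: mu = 1 and sigma = 1, so Z_n = (S_n - n)/sqrt(n).
    s = sum(random.expovariate(1.0) for _ in range(n))
    return (s - n) / math.sqrt(n)

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

frac = sum(standardized_sum() <= 1.0 for _ in range(trials)) / trials
print(frac, phi(1.0))  # the empirical fraction approaches Phi(1) ≈ 0.841
```

The exponential distribution is strongly skewed, yet the standardized sums already track the bell curve closely; only $\mu$ and $\sigma$ of the summands survive in the limit.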
04 Illustration
Exercise - Test scores distribution
Test scores distribution
Explain what is wrong with the claim that test scores should be normally distributed when a large number of students take a test.
Can you imagine a scenario with a good argument that test scores would be normally distributed?
(Hint: think about the composition of a single test instead of the number of students taking the test.)
Exercise - Height follows a bell curve
Height follows a bell curve
The height of female American basketball players follows a bell curve. Why?
05 Theory - extra
Theory 2
Extra - Moment Generating Functions
Theory 1
In order to show why the CLT is true, we introduce the technique of moment generating functions. Recall that the $k^{\text{th}}$ moment of a distribution $X$ is simply $E[X^k]$. Write $\mu_k$ for this value.
Recall the power series for $e^x$:
$$e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!} = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots$$
The function $e^x$ has the property of being a bijective differentiable map from $\mathbb{R}$ to $(0, \infty)$, and it converts addition to multiplication: $e^{x+y} = e^x e^y$.
Given a random variable $X$, we can compose $e^x$ with $X$ to obtain a new variable. Define the moment generating function of $X$ as follows:
$$M_X(t) = E[e^{tX}]$$
This is a function of $t \in \mathbb{R}$ and returns values in $(0, \infty)$. It is called the moment generating function because it contains the data of all the higher moments $\mu_k = E[X^k]$. They can be extracted by taking derivatives and evaluating at zero:
$$M_X^{(k)}(0) = E[X^k] = \mu_k$$
It is reasonable to consider $M_X(t)$ as a formal power series in the variable $t$ that has the higher moments for coefficients:
$$M_X(t) = \sum_{k=0}^{\infty} \mu_k \frac{t^k}{k!}$$
Example - Moment generating function of a standard normal
We compute $M_Z(t)$ where $Z \sim \mathcal{N}(0, 1)$. From the formula for the expected value of a function of a random variable, we have:
$$M_Z(t) = E[e^{tZ}] = \int_{-\infty}^{\infty} e^{tz} \cdot \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\, dz$$
Complete the square in the exponent: $tz - \frac{z^2}{2} = -\frac{(z-t)^2}{2} + \frac{t^2}{2}$. Thus:
$$M_Z(t) = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-(z-t)^2/2} \cdot e^{t^2/2}\, dz$$
The last factor can be taken outside the integral, and the remaining integral is the total probability of a $\mathcal{N}(t, 1)$ density, which is 1:
$$M_Z(t) = e^{t^2/2}$$
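As a sanity check on this calculation (an added sketch, not part of the notes), the integral $E[e^{tZ}]$ can be evaluated numerically with the midpoint rule and compared with $e^{t^2/2}$:

```python
import math

def mgf_standard_normal(t, lo=-12.0, hi=12.0, steps=100_000):
    """Numerically integrate E[e^(tZ)] over the standard normal density."""
    dz = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        z = lo + (i + 0.5) * dz  # midpoint rule
        total += math.exp(t * z - z * z / 2) * dz
    return total / math.sqrt(2 * math.pi)

for t in (0.0, 0.5, 1.0, 2.0):
    print(t, mgf_standard_normal(t), math.exp(t * t / 2))  # the two columns agree
```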
Exercise - Moment generating function of an exponential variable
Compute $M_X(t)$ for $X \sim \mathrm{Exp}(\lambda)$.
Moment generating functions have the remarkable property of encoding the distribution itself:
Distributions determined by MGFs
Assume $M_X(t)$ and $M_Y(t)$ both converge. If $M_X(t) = M_Y(t)$ for all $t$, then $F_X(x) = F_Y(x)$ for all $x$.
Moreover, if $M_X(t) = M_Y(t)$ just for $t$ in any interval of values $(-\epsilon, \epsilon)$, then still $F_X(x) = F_Y(x)$ for all $x$ and $P(a \le X \le b) = P(a \le Y \le b)$ for all $[a, b]$.
Be careful about moments vs. generating functions!
Sometimes the moments all exist, but they grow so fast that the moment generating function does not converge. For example, the log-normal distribution $e^Z$ for $Z \sim \mathcal{N}(0, 1)$ has this property.
The fact above does not apply when this happens.
When moment generating functions approximate each other, their corresponding distributions also approximate each other:
Distributions converge when MGFs converge
Suppose that $M_{X_n}(t) \to M_X(t)$ for all $t$ on some interval $(-\epsilon, \epsilon)$. (In particular, assume that $M_X(t)$ converges on some such interval.) Then for any $x$, we have:
$$\lim_{n \to \infty} F_{X_n}(x) = F_X(x)$$
Exercise - Using an MGF
Suppose is nonnegative and when and when . Find a bound on using (a) Markov’s Inequality, and (b) Chebyshev’s Inequality.
Extra - Derivation of CLT
Theory 2
The main role of moment generating functions in the proof of the CLT is to convert the sum into a product by putting the sum into an exponent.
We have $Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}} = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \widetilde{X_i}$, and recall $\widetilde{X_i} = \frac{X_i - \mu}{\sigma}$, so $E[\widetilde{X_i}] = 0$ and $E[\widetilde{X_i}^2] = 1$. First, compute the MGF of $Z_n$. We have:
$$M_{Z_n}(t) = E\big[e^{tZ_n}\big] = E\Big[e^{\frac{t}{\sqrt{n}} \sum_i \widetilde{X_i}}\Big]$$
Exchange the sum in the exponent for a product of exponentials:
$$M_{Z_n}(t) = E\Big[\prod_{i=1}^{n} e^{\frac{t}{\sqrt{n}} \widetilde{X_i}}\Big]$$
Now since the $\widetilde{X_i}$ are independent, the factors $e^{\frac{t}{\sqrt{n}} \widetilde{X_i}}$ are also independent of each other. Use the product rule $E[UV] = E[U]\,E[V]$ when $U, V$ are independent to obtain:
$$M_{Z_n}(t) = \prod_{i=1}^{n} E\Big[e^{\frac{t}{\sqrt{n}} \widetilde{X_i}}\Big]$$
Now expand the exponential in its Taylor series and use linearity of expectation:
$$E\Big[e^{\frac{t}{\sqrt{n}} \widetilde{X_i}}\Big] = E\Big[1 + \frac{t}{\sqrt{n}}\widetilde{X_i} + \frac{t^2}{2n}\widetilde{X_i}^2 + \cdots\Big] = 1 + \frac{t^2}{2n} + \cdots \approx 1 + \frac{t^2}{2n}$$
We don't give a complete argument for the final approximation, but a few remarks are worthwhile. For fixed $t$, and assuming the moments $E[\widetilde{X_i}^k]$ have adequately bounded growth in $k$, the series in each factor converges for all $n$. Using Taylor's theorem we could write an error term as a shrinking function of $n$. The real trick of analysis is to show that in the product of $n$ factors, these error terms shrink fast enough that the limit value is not affected.
In any case, the factors of the last line are independent of $i$, so we have:
$$M_{Z_n}(t) \approx \left(1 + \frac{t^2}{2n}\right)^n \longrightarrow e^{t^2/2} \quad \text{as } n \to \infty$$
But $e^{t^2/2}$ is the MGF of $Z \sim \mathcal{N}(0, 1)$. Therefore $M_{Z_n}(t) \to M_Z(t)$, so $F_{Z_n}(x) \to F_Z(x)$.
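The key limit $\left(1 + \frac{t^2}{2n}\right)^n \to e^{t^2/2}$ is just the classical $(1 + x/n)^n \to e^x$, and is easy to watch converge numerically (an added illustration with an arbitrary choice of $t$):

```python
import math

t = 1.5  # any fixed value of t works
target = math.exp(t * t / 2)

# (1 + t^2/(2n))^n approaches e^(t^2/2) as n grows.
for n in (10, 100, 10_000, 1_000_000):
    print(n, (1 + t * t / (2 * n)) ** n, target)
```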
06 Theory
Theory 3
Normal approximations rely on the limit stated in the CLT to approximate probabilities for large sums of variables.
Normal approximation of binomial
Let $X \sim \mathrm{Binom}(n, p)$, so $X = X_1 + \cdots + X_n$ for IID Bernoulli variables $X_i$, with $E[X] = np$ and $\mathrm{Var}(X) = np(1-p)$.
The normal approximation of $X$ is:
$$P(X \le x) \approx \Phi\left(\frac{x - np}{\sqrt{np(1-p)}}\right)$$
For example, suppose $X \sim \mathrm{Binom}(100, \tfrac{1}{2})$, so $n = 100$ and $p = \tfrac{1}{2}$. We know $E[X] = 50$ and $\mathrm{SD}(X) = 5$. Therefore:
$$P(X \le x) \approx \Phi\left(\frac{x - 50}{5}\right)$$
A rule of thumb is that the normal approximation to the binomial is effective when $np(1-p) \ge 10$.
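The quality of the approximation can be inspected directly. The sketch below (illustrative parameters $n = 100$, $p = \tfrac{1}{2}$, chosen here) computes the exact binomial CDF with a running product, avoiding huge factorials, and compares it with $\Phi$:

```python
import math

def binom_cdf(x, n, p):
    """Exact P(X <= x) for Binom(n, p), via a running product of PMF terms."""
    prob = (1 - p) ** n  # P(X = 0)
    total = prob
    for k in range(1, x + 1):
        prob *= (n - k + 1) / k * p / (1 - p)  # P(X = k) from P(X = k-1)
        total += prob
    return total

def normal_cdf(x, mu, sigma):
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

n, p = 100, 0.5
mu, sigma = n * p, math.sqrt(n * p * (1 - p))  # 50 and 5
for x in (45, 50, 55, 60):
    print(x, binom_cdf(x, n, p), normal_cdf(x, mu, sigma))
```

The two columns differ by a few hundredths at most here; a continuity correction (using $x + \tfrac{1}{2}$ in $\Phi$) would shrink the gap further.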
Efficient computation
This CDF is far easier to compute for large $n$ than the CDF of $X$ itself. The factorials in $\binom{n}{k}$ are hard (even for a computer) when $n$ is large, and the summation over $k$ adds another factor of $n$ to the scaling cost.
07 Illustration
Example - Binomial estimation: 10,000 flips
Binomial estimation: 10,000 flips
Flip a fair coin 10,000 times. Write $X$ for the number of heads.
Estimate the probability that .
Solution
(1) Check the rule of thumb: $n = 10{,}000$ and $p = \tfrac{1}{2}$, so $np(1-p) = 2500 \ge 10$ and the approximation is effective.
(2) Now, calculate needed quantities:
$$E[X] = np = 5000, \qquad \mathrm{SD}(X) = \sqrt{np(1-p)} = \sqrt{2500} = 50$$
(3) Set up CDF:
$$P(X \le x) \approx \Phi\left(\frac{x - 5000}{50}\right)$$
(4) Compute desired probability:
Example - Summing 1000 dice
Summing 1000 dice
Suppose 1,000 dice are rolled.
Estimate the probability that the total sum of rolled numbers is more than 3,600.
Solution
(1) Let $X_i$ be the number rolled on the $i^{\text{th}}$ die.
Let $S = X_1 + \cdots + X_{1000}$, so $S$ sums up the rolled numbers.
We seek $P(S > 3600)$.
(2) Now, calculate needed quantities:
$$E[X_i] = 3.5, \qquad \mathrm{Var}(X_i) = \frac{35}{12}$$
$$E[S] = 1000 \cdot 3.5 = 3500, \qquad \mathrm{SD}(S) = \sqrt{1000 \cdot \tfrac{35}{12}} \approx 54.0$$
(3) Set up CDF:
$$P(S \le x) \approx \Phi\left(\frac{x - 3500}{54.0}\right)$$
(4) Compute desired probability:
$$P(S > 3600) = 1 - P(S \le 3600) \approx 1 - \Phi\left(\frac{3600 - 3500}{54.0}\right) = 1 - \Phi(1.85) \approx 0.032$$
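The steps above translate directly into a short computation, using `math.erf` for $\Phi$ (a sketch added here for illustration):

```python
import math

n = 1000
mu_die, var_die = 3.5, 35 / 12              # mean and variance of one fair die
mu, sigma = n * mu_die, math.sqrt(n * var_die)

z = (3600 - mu) / sigma                     # ≈ 1.85
estimate = 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))  # 1 - Phi(z)
print(round(estimate, 3))  # ≈ 0.032
```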
Exercise - Estimating $S_{1000}$
Estimating $S_{1000}$
The probability that a random poker hand contains exactly one pair is 0.42.
Estimate the probability that at least 450 out of 1000 poker hands will contain one pair.
Exercise - Nutrition study
Nutrition study
A nutrition review board will endorse a diet if it has any positive effect in at least 65% of those tested in a certain study with 100 participants.
Suppose the diet is bogus, but 50% of participants display some positive effect by pure chance.
What is the probability that it will be endorsed?
Answer