Coin-flip experiment and the Central Limit Theorem

Question: I flip a fair coin 100 times. What is the probability distribution of the number of heads that I get?

In [1]:
import numpy as np
from scipy import stats

import matplotlib as mpl
import matplotlib.pyplot as plt
In [2]:
%matplotlib notebook

# M.L. modification of matplotlib defaults
# Changes can also be put in matplotlibrc file, 
# or effected using mpl.rcParams[]
mpl.style.use('classic')
plt.rc('figure', figsize = (6, 4.5)) # Reduces overall size of figures
plt.rc('axes', labelsize=16, titlesize=14)
plt.rc('figure', autolayout = True) # Adjusts subplot parameters for new size

Simulating coin flips

I am going to generate 100 random 0's and 1's, and let 1 represent a "heads" and 0 represent a "tails."

In [3]:
n_flips = 100 # Number of coin flips
data = stats.randint.rvs(0, 2, size=n_flips)
print(data)
[0 1 1 1 0 1 0 1 0 0 0 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 1 0 1 0 0 1 1
 1 1 1 0 1 1 1 1 1 0 0 1 0 0 0 0 0 0 0 1 1 1 1 0 0 0 1 0 1 1 1 0 0 0 1 1 1
 0 1 0 0 0 1 0 0 0 1 1 1 0 1 1 0 1 1 0 0 1 1 0 1 1 0]

I can count the "heads" by summing the array data.

In [4]:
n_heads = np.sum(data)
print(n_heads)
56

Simulating 200 trials of the 100-coin-flip experiment

I store the number of heads in each trial in the array results.

In [5]:
n_expts = 200               # Number of experiments to simulate
results = np.zeros(n_expts) # Create array for results of "experiments"

for i in range(n_expts):
    results[i] = np.sum(stats.randint.rvs(0, 2, size=n_flips))  # Each "experiment" consists of 100 coin flips
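
The loop above works fine; as an aside, the same simulation can be done without an explicit Python loop by drawing all of the flips at once and summing along each row. The following is just a sketch of an assumed-equivalent vectorized approach, not the method used in the rest of this notebook.

In [ ]:
# Sketch: draw an (n_expts, n_flips) array of 0's and 1's in one call,
# then sum along each row to get the number of heads in each experiment
results_alt = stats.randint.rvs(0, 2, size=(n_expts, n_flips)).sum(axis=1)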

Histogram of results

Choose bin boundaries so that each bin includes a single integer, e.g., $[47.5, 48.5]$, $[48.5, 49.5]$, $[49.5, 50.5]$, etc.

In [6]:
low = 35.5
high = 64.5
nbins = int(high-low) 
plt.figure()
plt.xlabel("# of heads")
plt.ylabel("occurrences")
h_out = plt.hist(results, nbins, [low,high])
plt.xlim(low,high);

Increasing the number of trials to 10,000. The distribution looks more Gaussian.

In [7]:
n_expts = 10000   
results = np.zeros(n_expts) # Create array for results of "experiments"

for i in range(n_expts):
    results[i] = np.sum(stats.randint.rvs(0, 2, size=n_flips))
In [8]:
low = 35.5
high = 64.5
nbins = int(high - low)
plt.figure()
plt.xlabel("# of heads")
plt.ylabel("occurrences")
out = plt.hist(results, nbins, [low,high])
plt.xlim(low,high);

Checking the Central Limit Theorem

Following Hughes and Hase's statement of the Central Limit Theorem at the top of p. 33, we should look at the distribution of the sample mean:

$$ \langle x\rangle = \frac{1}{N}(x_1 + x_2 + \cdots + x_N). $$

It's the distribution of the sample mean that approaches the normal distribution.

The individual values $x_i$ are sampled from a discrete parent distribution with two values: $x_i = 0$ with probability 0.5, and $x_i = 1$ with probability 0.5. The mean of this parent distribution is $\mu = 0.5$, and its standard deviation is $\sigma = 0.5$.
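
As a quick check of the parent-distribution values quoted above, the mean and standard deviation of this two-valued distribution can be computed directly with scipy's discrete uniform distribution (a short sketch, not part of the original analysis).

In [ ]:
# Parent distribution of a single flip: discrete uniform on {0, 1}
parent = stats.randint(0, 2)
print("parent mean =", parent.mean(), ", parent std =", parent.std())   # both should be 0.5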

The histogram below is the same as the histogram immediately above, except that the results for the number of heads in each trial have been divided by $N$, the number of coin-flips in a trial (in this example 100).

The Central Limit Theorem states that the total number of heads divided by the number of flips (100), i.e., the sample mean, should tend toward a normal distribution with a mean of 0.5 and a standard deviation of $\sigma_\text{parent}/\sqrt{100}$. Let's check.

In [9]:
low = 35.5/n_flips
high = 64.5/n_flips
# Don't change number of bins from previous histogram
plt.figure()
plt.xlabel("fraction of heads")
plt.ylabel("occurrences")
h_out = plt.hist(results/n_flips, nbins, [low,high])
plt.xlim(low,high);

Check the standard deviation

In [10]:
print("sample mean =", np.mean(results/n_flips), ", sample std =", np.std(results/n_flips))
sample mean = 0.49994800000000006 , sample std = 0.050354317550732434
In [11]:
sigma_parent = 1/2
print("predicted std =", sigma_parent/np.sqrt(n_flips) )
predicted std = 0.05

We can check more than just the numerical value of the standard deviation

Use the CDF of the normal distribution and the bin boundaries to determine the expected occurrences for a normal distribution:

\begin{eqnarray*} P(x_1<x<x_2) &=& \int_{x_1}^{x_2} P_{DF}(x)\, dx \\ &=& \int_{-\infty}^{x_2} P_{DF}(x)\, dx - \int_{-\infty}^{x_1} P_{DF}(x)\, dx \\ &=& C_{DF}(x_2) - C_{DF}(x_1) \end{eqnarray*}

Note: Information about the occurrences and bin boundaries is part of the output of the plt.hist() function. (I gave this output the name h_out in the plotting cell above.) h_out[0] is an array with the counts in each bin, and h_out[1] is an array with the bin boundaries.

In [12]:
h_out
Out[12]:
(array([ 21.,  21.,  57.,  84.,  91., 152., 244., 319., 346., 478., 564.,
        685., 763., 765., 805., 813., 710., 635., 571., 502., 374., 304.,
        228., 148., 103.,  77.,  58.,  29.,  17.]),
 array([0.355, 0.365, 0.375, 0.385, 0.395, 0.405, 0.415, 0.425, 0.435,
        0.445, 0.455, 0.465, 0.475, 0.485, 0.495, 0.505, 0.515, 0.525,
        0.535, 0.545, 0.555, 0.565, 0.575, 0.585, 0.595, 0.605, 0.615,
        0.625, 0.635, 0.645]),
 <BarContainer object of 29 artists>)
In [13]:
low = 35.5/n_flips
high = 64.5/n_flips
# use nbins previously defined
plt.figure()
plt.xlabel("(# of heads)/100")
plt.ylabel("occurrences")
h_out = plt.hist(results/n_flips, nbins, [low,high], alpha = 0.5)
plt.xlim(low,high);

x = np.zeros(len(h_out[1])-1)   # Create empty array for mid-points of histogram bins
y = np.zeros(len(h_out[1])-1)   # Create empty array for expected occurrences from normal dist.
for i in range(len(x)):
    x[i] = (h_out[1][i+1] + h_out[1][i])/2    # fill x with center-of-bin values
    y[i] = n_expts*(stats.norm.cdf(h_out[1][i+1], 0.5, 0.5/np.sqrt(n_flips))\
                    - stats.norm.cdf(h_out[1][i], 0.5, 0.5/np.sqrt(n_flips)))   # fill y with expected counts
                                                                                # from the normal CDF
plt.scatter(x, y, c = 'red');
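
As a design note, the expected counts could also be computed without the explicit loop by evaluating the normal CDF at all of the bin edges and differencing; this is a sketch of an assumed-equivalent alternative, not the calculation used above.

In [ ]:
# Sketch: expected counts per bin from differences of the normal CDF at the bin edges
edges = h_out[1]                       # bin boundaries returned by plt.hist()
centers = (edges[:-1] + edges[1:])/2   # mid-points of the bins
expected = n_expts*np.diff(stats.norm.cdf(edges, 0.5, 0.5/np.sqrt(n_flips)))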

Looks pretty Gaussian!!

In a future class we will talk about how to make quantitative assessments of confidence about whether a sample of random variables comes from a given distribution.

Version information

version_information is from J.R. Johansson (jrjohansson at gmail.com); see Introduction to scientific computing with Python for more information and instructions for package installation.

version_information is installed on the Linux network at Bucknell.

In [14]:
%load_ext version_information
In [15]:
%version_information numpy, scipy, matplotlib
Out[15]:
Software      Version
Python        3.7.8 64bit [GCC 7.5.0]
IPython       7.17.0
OS            Linux 3.10.0-1160.36.2.el7.x86_64 x86_64 with centos 7.9.2009 Core
numpy         1.19.1
scipy         1.5.0
matplotlib    3.3.0
Tue Feb 01 08:56:29 2022 EST