NOTE: In this notebook I use the module scipy.stats for all statistics functions, including generation of random numbers. There are other modules with some overlapping functionality, e.g., the regular python random module and the scipy.random module, but I do not use them here. The scipy.stats module provides tools for a large number of distributions, a large and growing set of statistical functions, and a unified class structure. (And namespace issues are minimized.) See https://docs.scipy.org/doc/scipy/reference/stats.html.

In [1]:

```
import scipy as sp
from scipy import stats
import matplotlib as mpl # As of July 2017 Bucknell computers use v. 2.x
import matplotlib.pyplot as plt
# Following is an IPython magic command that puts figures in the notebook.
# For figures in separate windows, comment out following line and uncomment
# the next line
# Must come before defaults are changed.
%matplotlib notebook
#%matplotlib
# As of Aug. 2017 reverting to 1.x defaults.
# In 2.x text.ustex requires dvipng, texlive-latex-extra, and texlive-fonts-recommended,
# which don't seem to be universal
# See https://stackoverflow.com/questions/38906356/error-running-matplotlib-in-latex-type1cm?
mpl.style.use('classic')
# M.L. modifications of matplotlib defaults using syntax of v.2.0
# More info at http://matplotlib.org/2.0.0/users/deflt_style_changes.html
# Changes can also be put in matplotlibrc file, or effected using mpl.rcParams[]
plt.rc('figure', figsize = (6, 4.5)) # Reduces overall size of figures
plt.rc('axes', labelsize=16, titlesize=14)
plt.rc('figure', autolayout = True) # Adjusts subplot parameters for new size
```

The p.d.f. of the uniform distribution is

$$ P_U(x;\bar{x},a) = \left\{\begin{array}{cl} 1/a & \mbox{if $\bar{x}-a/2\leq x\leq \bar{x}+a/2$}\\ 0 & \mbox{otherwise}. \end{array}\right. $$To show that this is normalized, I integrate it over all values of $x$: $$ \begin{eqnarray*} \int_{-\infty}^\infty P_U(x)\, dx &=& \int_{\bar x - a/2}^{\bar x + a/2} \frac{1}{a}\, dx \\ &=& \left\vert \frac{x}{a}\right\vert_{\bar x -\frac{a}{2}}^{\bar x + \frac{a}{2}} \\ &=& \frac{1}{a}\left[\left(\bar x + \frac{a}{2}\right) - \left(\bar x - \frac{a}{2}\right)\right] \\ &=& \frac{1}{a}\left[\frac{a}{2} + \frac{a}{2}\right] \\ &=& 1 \end{eqnarray*} $$

It's normalized.

The mean is

$$ \begin{eqnarray*} \bar x &\equiv& \int_{-\infty}^\infty x\, P_U(x)\, dx \\ &=& \int_{\bar x - a/2}^{\bar x + a/2} \frac{x}{a}\, dx \\ &=& \frac{1}{a}\left\vert \frac{x^2}{2}\right\vert_{\bar x - \frac{a}{2}}^{\bar x + \frac{a}{2}}\\ &=& \frac{1}{2a}\left[\left(\bar x + \frac{a}{2}\right)^2 - \left(\bar x - \frac{a}{2}\right)^2 \right]\\ &=& \frac{1}{2a}\left[ \bar x^2 + \bar x a + \frac{a^2}{4} - \bar x^2 + \bar x a -\frac{a^2}{4}\right] \\ &=& \bar x \end{eqnarray*} $$The variance, or square of the standard deviation, is the second moment of the p.d.f. about its mean, which is simplified in the second line using Eq. (3.6) from H&H:

$$ \begin{eqnarray*} \sigma^2 &\equiv& \int_{-\infty}^\infty (x - \bar x)^2 P_U(x)\, dx \\ &=& \overline{x^2} - \bar{x}^2 \\ &=& \left(\int_{\bar x- a/2}^{\bar x + a/2} \frac{1}{a} x^2\, dx\right) - \bar x^2\\ &=& \frac{1}{a}\left.\frac{x^3}{3}\right\vert_{\bar x - a/2}^{\bar x + a/2} - \bar x^2\\ &=& \frac{1}{3a}\left[(\bar x + a/2)^3 - (\bar x - a/2)^3\right] - \bar x^2 \\ &=& \frac{1}{3a} \left[\left(\bar x^3 + 3 \bar x^2\frac{a}{2} + 3 \bar x \frac{a^2}{4} + \frac{a^3}{8}\right) - \left(\bar x^3 - 3 \bar x^2\frac{a}{2} + 3 \bar x \frac{a^2}{4} - \frac{a^3}{8}\right)\right] - \bar x^2 \\ &=& \frac{1}{3a}\left[3\bar x^2 a + \frac{a^3}{4}\right] - \bar x^2 \\ &=& \frac{a^2}{12} \end{eqnarray*} $$Note that this is all *much easier to do with a computer algebra system, like Mathematica:*
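The same moments can also be cross-checked numerically in Python with `scipy.stats.uniform` (a minimal sketch; the values of $\bar x$ and $a$ below are arbitrary illustrations):

```python
from scipy import stats

# Arbitrary example values for the center and width (assumptions for illustration)
xbar, a = 2.0, 3.0
U = stats.uniform(loc=xbar - a/2, scale=a)  # uniform on [xbar - a/2, xbar + a/2]

print(U.mean())          # equals xbar
print(U.var(), a**2/12)  # both equal a^2/12
```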

In [2]:

```
counts = sp.array([1,3,4,5,6,7,8,9,10,11,12,13])
occurences = sp.array([1,2,3,6,9,11,8,8,6,2,1,1])
n_expts = sp.sum(occurences) # check
print(n_expts)
```

In [3]:

```
total = sp.dot(occurences,counts)
print("total # of counts =", total)
```

In [4]:

```
for i in range(7):
    print(i, sp.stats.binom.pmf(i, 6, 1/4))
```

This is the best estimate of the mean of the parent Poisson distribution.

In [5]:

```
mean = total/n_expts
print('mean counts/second =', mean)
```

Use the cumulative distribution function for the Poisson distribution:

In [6]:

```
probability = sp.stats.poisson.cdf(5, mean)
print("probability of getting 5 or fewer counts =", probability)
n_expected = n_expts*probability
print("expected number of trials with 5 or fewer counts =", n_expected)
```

Use the cumulative distribution function to find the probability of 19 or fewer counts. The probability of 20 or more is (1 - the probability of 19 or fewer).

In [7]:

```
probability = 1 - sp.stats.poisson.cdf(19, mean)
print("probability of getting 20 or more counts =", probability)
n_expected = n_expts*probability
print("expected number of trials with 20 or more counts =", n_expected)
```

If we use the minute as our unit of time, the determination of the mean count rate is trivial: it's $R = 270\, \mbox{min$^{-1}$}$. If we choose seconds, it's only slightly less trivial: $R=270/60 = 4.5\, \mbox{s$^{-1}$}$.

We are sampling from a Poisson distribution, so the error, given by the standard deviation, is $\sigma = \sqrt{\bar \mu}$:

In [8]:

```
sigma = sp.sqrt(270)
print('sigma (min^-1) =', sigma, '; sigma (s^-1) = ', sigma/60)
```

In [9]:

```
print('fractional error =', sigma/270)
```

In [10]:

```
n_expected = 15*270
print('expected in 15 minutes =', n_expected)
```

The probability of actually getting this number of counts is given by the probability mass function of the Poisson distribution:

In [11]:

```
probability = sp.stats.poisson.pmf(n_expected,n_expected)
print('probability of getting exactly', n_expected,' counts =', probability)
```
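For large mean $\mu$, the Poisson distribution is well approximated by a Gaussian with standard deviation $\sqrt{\mu}$, so the peak probability should be close to $1/\sqrt{2\pi\mu}$. A quick cross-check (an illustration, not part of the original notebook):

```python
import numpy as np
from scipy import stats

mu = 15*270  # expected number of counts in 15 minutes, as above
print(stats.poisson.pmf(mu, mu))  # exact peak probability
print(1/np.sqrt(2*np.pi*mu))      # Gaussian estimate of the peak probability
```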

In [12]:

```
plt.figure(1)
n = sp.linspace(0,60,61)
x = sp.linspace(0,60,301)
p = sp.stats.poisson.pmf(n,35)
g = sp.stats.norm.pdf(x,35,sp.sqrt(35))
plt.scatter(n, p, c='red', label='Poisson')
plt.plot(x, g, label='Gaussian')
plt.xlim(0,60)
plt.axhline(0)
plt.legend();
```

For each of the draws, the probability of getting a heart is $\frac{1}{4}$. The probability of getting $k$ successes in $n$ independent trials, where $p$ is the probability of success in an individual trial, is given by the binomial distribution:
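Explicitly, the probability mass function evaluated below is the standard binomial formula (with $n = 6$ draws and $p = 1/4$):

$$ P(k; n, p) = \binom{n}{k}\, p^k\, (1-p)^{n-k}. $$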

In [13]:

```
n = 6     # number of cards drawn
p = 0.25  # probability that a single card is a heart
for k in range(7):
    print("The probability of", k, "hearts is", sp.stats.binom.pmf(k, n, p))
```

- 6 colors: Yellow, Blue, Orange, Red, Green, and Brown
- Assume 60 M&Ms in every bag
- Assume equal probabilities (well mixed, large "reservoir")
- Assume 24 students (bags) per section

To get started, sample one bag of M&Ms, and count the number of brown M&Ms.

Do this by generating 60 random integers from the set 0, 1, 2, 3, 4, 5, and let's say that "brown" = 5.

In [14]:

```
bag = sp.stats.randint.rvs(0,6,size = 60) # or sp.random.randint(0,6,60)
bag
```

Out[14]:

- Count the number of each color in the bag using `sp.bincount(bag)`. The first element in the array is the number of occurrences of 0 in `bag`, the second element is the number of occurrences of 1, etc.

In [15]:

```
sp.bincount(bag)
```

Out[15]:

- For our "brown" = 5 choice, the number of brown M&Ms is the last element in the array returned by `bincount`, or `sp.bincount(bag)[5]`.

In [16]:

```
sp.bincount(bag)[5]
```

Out[16]:

- Now sample many bags
- Record number of brown M&Ms in each bag

In [17]:

```
# Long version of sampling many bags
nb = 24 # number of bags
data_section = sp.zeros(nb) # array in which to store data for a lab section
for i in range(nb):
    bag = sp.stats.randint.rvs(0,6,size=60)
    data_section[i] = sp.bincount(bag)[5]
data_section
```

Out[17]:

In [18]:

```
# Concise version of sampling many bags
nb = 24 # number of bags
data_section = sp.array([sp.bincount(sp.stats.randint.rvs(0,6,size=60))[5] for i in range(nb)])
data_section
```

Out[18]:

In [19]:

```
sp.mean(data_section), sp.std(data_section), sp.std(data_section)/sp.sqrt(len(data_section)-1)
```

Out[19]:

$\overline N = 10.1 \pm 0.6$

In [27]:

```
plt.figure(2)
nbins = 20
low = 0
high = 20
plt.hist(data_section,nbins,[low,high])
plt.xlim(0,20)
plt.title("Histogram of brown M&Ms per bag - Single Section",fontsize=14)
plt.xlabel("Number of brown M&Ms")
plt.ylabel("Occurrences");
```

In [21]:

```
nb = 24 # Number of bags in a section
ns = 200 # Number of sections in class
data_class = sp.zeros(ns) # array for results from each section
for j in range(ns):
    data_section = sp.zeros(nb) # array in which to store section data
    for i in range(nb):
        bag = sp.stats.randint.rvs(0,6,size=60)
        data_section[i] = sp.bincount(bag)[5]
    data_class[j] = sp.mean(data_section)
```

In [22]:

```
sp.mean(data_class), sp.std(data_class)
```

Out[22]:

The standard deviation of the section averages (0.55) is consistent with that predicted by the standard deviation of the mean determined from a single section (0.6).

For a more thorough comparison with the predictions of the central limit theorem (CLT) I will compare the histogram of section averages with the normal distribution predicted by the CLT. The average of the normal distribution is predicted to be 10, and the standard deviation is predicted to be $\sigma_\text{parent}/\sqrt{N}$, where $N$ is the number of bags in a section.

The parent distribution is a binomial distribution with $n = 60$, and $p = 1/6$.

In [23]:

```
sigmaP = sp.stats.binom.std(60,1/6.)
sigmaP
```

Out[23]:

This result is consistent with the observed experimental standard deviation from a single section.
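Dividing this parent standard deviation by $\sqrt{N}$ ($N = 24$ bags per section) gives the CLT prediction for the spread of the section averages:

```python
import numpy as np
from scipy import stats

sigmaP = stats.binom.std(60, 1/6)  # single-bag (parent) standard deviation
nb = 24                            # bags per section, as above
print(sigmaP/np.sqrt(nb))          # close to the observed section-average spread of 0.55
```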

When comparing the histogram with the normal distribution, we are really comparing the observed occurrences in a bin with an integration of the probability distribution over a finite interval: $$ N_\text{section} \times \int_{x_1}^{x_2} p(x)\, dx \simeq N_\text{section}\, p(\overline x)\, (x_2-x_1), $$ where $(x_2-x_1)$ is the bin width. (For wide bins it would be better to use the cdf rather than this approximation.)
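A sketch of how good the midpoint approximation is for a narrow bin; the values of $\mu$, $\sigma$, and the bin edges below are illustrative assumptions:

```python
import numpy as np
from scipy import stats

mu, sigma = 10.0, 0.55  # illustrative Gaussian parameters (assumptions)
x1, x2 = 9.8, 10.2      # edges of one narrow histogram bin (assumption)

exact = stats.norm.cdf(x2, mu, sigma) - stats.norm.cdf(x1, mu, sigma)
approx = stats.norm.pdf(0.5*(x1 + x2), mu, sigma)*(x2 - x1)
print(exact, approx)    # the two agree to a few percent for a narrow bin
```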

In [28]:

```
plt.figure(3)
nbins = 25
low = 5
high = 15
binwidth = (high-low)/nbins
plt.hist(data_class,nbins,[low,high],alpha=0.5)
plt.xlim(low,high)
x = sp.linspace(0,20,400)
y = sp.stats.norm.pdf(x,10,sigmaP/sp.sqrt(nb))*ns*binwidth
plt.plot(x,y)
plt.title("Histogram of section averages",fontsize=14)
plt.xlabel("Number of brown M&Ms")
plt.ylabel("Occurrences");
```

The histogram compares well with the predictions of the CLT. (For a more quantitative comparison, see Chapter 8 of Hughes and Hase.) Even though the parent distribution is binomial, the CLT says that the distribution of AVERAGES drawn from the parent will be Gaussian.

In [25]:

```
%load_ext version_information
#%install_ext http://raw.github.com/jrjohansson/version_information/master/version_information.py
```

In [26]:

```
%version_information scipy, matplotlib
```

Out[26]:
