PHYS 310, HW #2, Spring 2018

NOTE: In this notebook I use the module scipy.stats for all statistics functions, including generation of random numbers. Other modules offer overlapping functionality, e.g., the standard-library random module and the scipy.random module, but I do not use them here. The scipy.stats module provides tools for a large number of distributions, a large and growing set of statistical functions, and a unified class structure. (And namespace issues are minimized.) See https://docs.scipy.org/doc/scipy/reference/stats.html.

In [1]:
import scipy as sp
from scipy import stats

import matplotlib as mpl       # As of July 2017 Bucknell computers use v. 2.x 
import matplotlib.pyplot as plt

# Following is an IPython magic command that puts figures in the notebook.
# For figures in separate windows, comment out following line and uncomment
# the next line
# Must come before defaults are changed.
%matplotlib notebook
#%matplotlib

# As of Aug. 2017 reverting to 1.x defaults.
# In 2.x text.ustex requires dvipng, texlive-latex-extra, and texlive-fonts-recommended, 
# which don't seem to be universal
# See https://stackoverflow.com/questions/38906356/error-running-matplotlib-in-latex-type1cm?
mpl.style.use('classic')
        
# M.L. modifications of matplotlib defaults using syntax of v.2.0 
# More info at http://matplotlib.org/2.0.0/users/deflt_style_changes.html
# Changes can also be put in matplotlibrc file, or effected using mpl.rcParams[]
plt.rc('figure', figsize = (6, 4.5))            # Reduces overall size of figures
plt.rc('axes', labelsize=16, titlesize=14)
plt.rc('figure', autolayout = True)             # Adjusts subplot parameters for new size

H&H Problem 3.2

(i) Normalization

The p.d.f. of the uniform distribution is

$$ P_U(x;\bar{x},a) = \left\{\begin{array}{cl} 1/a & \mbox{if $\bar{x}-a/2\leq x\leq \bar{x}+a/2$}\\ 0 & \mbox{otherwise}. \end{array}\right. $$

To show that this is normalized, I integrate it over all values of $x$: $$ \begin{eqnarray*} \int_{-\infty}^\infty P_U(x)\, dx &=& \int_{\bar x - a/2}^{\bar x + a/2} \frac{1}{a}\, dx \\ &=& \left.\frac{x}{a}\right\vert_{\bar x -\frac{a}{2}}^{\bar x + \frac{a}{2}} \\ &=& \frac{1}{a}\left[\left(\bar x + \frac{a}{2}\right) - \left(\bar x - \frac{a}{2}\right)\right] \\ &=& \frac{1}{a}\left[\frac{a}{2} + \frac{a}{2}\right] \\ &=& 1 \end{eqnarray*} $$

It's normalized.

(ii) Mean

The mean is

$$ \begin{eqnarray*} \bar x &\equiv& \int_{-\infty}^\infty x\, P_U(x)\, dx \\ &=& \int_{\bar x - a/2}^{\bar x + a/2} \frac{x}{a}\, dx \\ &=& \frac{1}{a}\left.\frac{x^2}{2}\right\vert_{\bar x - \frac{a}{2}}^{\bar x + \frac{a}{2}}\\ &=& \frac{1}{2a}\left[\left(\bar x + \frac{a}{2}\right)^2 - \left(\bar x - \frac{a}{2}\right)^2 \right]\\ &=& \frac{1}{2a}\left[ \bar x^2 + \bar x a + \frac{a^2}{4} - \bar x^2 + \bar x a -\frac{a^2}{4}\right] \\ &=& \bar x \end{eqnarray*} $$

(iii) Variance

The variance, or square of the standard deviation, is the second moment of the p.d.f. about its mean, which is simplified in the second line using Eq. (3.6) from H&H:

$$ \begin{eqnarray*} \sigma^2 &\equiv& \int_{-\infty}^\infty (x - \bar x)^2 P_U(x)\, dx \\ &=& \overline{x^2} - \bar{x}^2 \\ &=& \left(\int_{\bar x- a/2}^{\bar x + a/2} \frac{1}{a} x^2\, dx\right) - \bar x^2\\ &=& \frac{1}{a}\left.\frac{x^3}{3}\right\vert_{\bar x - a/2}^{\bar x + a/2} - \bar x^2\\ &=& \frac{1}{3a}\left[(\bar x + a/2)^3 - (\bar x - a/2)^3\right] - \bar x^2 \\ &=& \frac{1}{3a} \left[\left(\bar x^3 + 3 \bar x^2\frac{a}{2} + 3 \bar x \frac{a^2}{4} + \frac{a^3}{8}\right) - \left(\bar x^3 - 3 \bar x^2\frac{a}{2} + 3 \bar x \frac{a^2}{4} - \frac{a^3}{8}\right)\right] - \bar x^2 \\ &=& \frac{1}{3a}\left[3\bar x^2 a + \frac{a^3}{4}\right] - \bar x^2 \\ &=& \frac{a^2}{12} \end{eqnarray*} $$

Note that this is all much easier to do with a computer algebra system, such as Mathematica.
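As a cross-check (a sketch using SymPy as a stand-in for Mathematica, which isn't reproduced here), the same three integrals can be done symbolically:

```python
import sympy as sym

x, xbar, a = sym.symbols('x xbar a', positive=True)
P = 1/a                          # uniform p.d.f. on [xbar - a/2, xbar + a/2]
lo, hi = xbar - a/2, xbar + a/2

norm = sym.integrate(P, (x, lo, hi))                 # normalization
mean = sym.integrate(x*P, (x, lo, hi))               # first moment
var  = sym.integrate((x - xbar)**2*P, (x, lo, hi))   # second moment about the mean

print(sym.simplify(norm), sym.simplify(mean), sym.simplify(var))
```

The three results simplify to $1$, $\bar x$, and $a^2/12$, in agreement with the hand calculation above.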

H&H Problem 3.7

In [2]:
counts = sp.array([1,3,4,5,6,7,8,9,10,11,12,13])
occurences = sp.array([1,2,3,6,9,11,8,8,6,2,1,1])

n_expts = sp.sum(occurences)  # check 
print(n_expts)
58

(i) Total number of counts

In [3]:
total = sp.dot(occurences,counts)
print("total # of counts =", total)
total # of counts = 423

(ii) Mean number of counts

This is the best estimate of the mean of the parent Poisson distribution.

In [5]:
mean = total/n_expts
print('mean counts/second =', mean)
mean counts/second = 7.29310344828

(iii) Expected # of trials with 5 or fewer counts

Use the cumulative distribution function for the Poisson distribution.

In [6]:
probability = sp.stats.poisson.cdf(5, mean)
print("probability of getting 5 or fewer counts =", probability)
n_expected = n_expts*probability
print("expected number of trials with 5 or fewer counts =", n_expected)
probability of getting 5 or fewer counts = 0.264848929955
expected number of trials with 5 or fewer counts = 15.3612379374

(iv) Expected # of trials with 20 or more counts

Use the cumulative distribution function to find the probability of 19 or fewer counts. The probability of 20 or more is (1 - the probability of 19 or fewer).

In [7]:
probability = 1 - sp.stats.poisson.cdf(19, mean)
print("probability of getting 20 or more counts =", probability)
n_expected = n_expts*probability
print("expected number of trials with 20 or more counts =", n_expected)
probability of getting 20 or more counts = 7.67424359367e-05
expected number of trials with 20 or more counts = 0.00445106128433
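As an aside (my addition, not part of the original solution): scipy.stats also provides the survival function sf, which gives the upper tail directly and avoids the explicit subtraction from 1:

```python
from scipy import stats

mean = 423/58                        # best estimate of the Poisson mean from above
p_tail = stats.poisson.sf(19, mean)  # P(X > 19) = P(X >= 20)
print(p_tail)                        # matches 1 - stats.poisson.cdf(19, mean)
```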

H&H Problem 3.8

(i) Mean count rate

If we use the minute as our unit of time, the determination of the mean count rate is trivial: it's $R = 270\, \mbox{min$^{-1}$}$. If we choose seconds, it's only slightly less trivial: $R=270/60 = 4.5\, \mbox{s$^{-1}$}$.

(ii) Error in the mean count rate

We are sampling from a Poisson distribution, so the error, given by the standard deviation, is $\sigma = \sqrt{\bar \mu}$:

In [8]:
sigma = sp.sqrt(270)
print('sigma (min^-1) =', sigma, '; sigma (s^-1) = ', sigma/60)
sigma (min^-1) = 16.4316767252 ; sigma (s^-1) =  0.273861278753

(iii) Fractional error

In [9]:
print('fractional error =', sigma/270)
fractional error = 0.060858061945

Expected counts in 15 minutes: $N = 15\, \mbox{min} \times R$

In [10]:
n_expected = 15*270
print('expected in 15 minutes =', n_expected)
expected in 15 minutes = 4050

The probability of actually getting this number of counts is given by the probability mass function of the Poisson distribution:

In [11]:
probability = sp.stats.poisson.pmf(n_expected,n_expected)
print('probability of getting exactly', n_expected,' counts =', probability)
probability of getting exactly 4050  counts = 0.00626864416478
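As a sanity check (my addition, not in H&H): for large $\mu$ the Poisson distribution approaches a Gaussian, so the probability at the peak should be close to the Gaussian peak height $1/\sqrt{2\pi\mu}$:

```python
import math
from scipy import stats

mu = 4050
exact = stats.poisson.pmf(mu, mu)    # exact Poisson probability at the mean
approx = 1/math.sqrt(2*math.pi*mu)   # height of the matching Gaussian peak
print(exact, approx)
```

The two values agree to well under a percent, consistent with the result above.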

H&H Problem 3.9

In [12]:
plt.figure(1)
n = sp.linspace(0,60,61)
x = sp.linspace(0,60,301)
p = sp.stats.poisson.pmf(n,35)
g = sp.stats.norm.pdf(x,35,sp.sqrt(35))
plt.scatter(n, p, c='red', label='Poisson')
plt.plot(x, g, label='Gaussian')
plt.xlim(0,60)
plt.axhline(0)
plt.legend();

Problem 5: Drawing cards

For each draw, the probability of getting a heart is $\frac{1}{4}$. The probability of getting $k$ successes in $n$ independent trials, where $p$ is the probability of success in an individual trial, is given by the binomial distribution:

In [13]:
for k in range(7):
    n = 6
    p = 0.25
    print("The probability of",k,"hearts is",sp.stats.binom.pmf(k,n,p))
The probability of 0 hearts is 0.177978515625
The probability of 1 hearts is 0.35595703125
The probability of 2 hearts is 0.296630859375
The probability of 3 hearts is 0.1318359375
The probability of 4 hearts is 0.032958984375
The probability of 5 hearts is 0.00439453125
The probability of 6 hearts is 0.000244140625
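A quick consistency check (my addition): the seven probabilities should sum to 1, and the expected number of hearts should be $np = 1.5$:

```python
from scipy import stats

n, p = 6, 0.25
probs = [stats.binom.pmf(k, n, p) for k in range(n + 1)]
print(sum(probs))               # normalization over the full support
print(stats.binom.mean(n, p))   # mean of the binomial distribution, n*p
```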

Problem 6, Simulation of PHYS 211 M&M Experiment

  • 6 Colors: Yellow, Blue, Orange, Red, Green, and Brown
  • Assume 60 M&Ms in every bag
  • Assume equal probabilities (well mixed, large "reservoir")
  • Assume 24 students (bags) per section

Part 1

To get started, sample one bag of M&Ms and count the number of brown M&Ms.
Do this by generating 60 random integers from the set 0, 1, 2, 3, 4, 5, and let's say that "brown" = 5.

In [14]:
bag = sp.stats.randint.rvs(0,6,size = 60) # or sp.random.randint(0,6,60)
bag
Out[14]:
array([1, 5, 3, 4, 5, 5, 2, 2, 3, 5, 0, 4, 0, 0, 0, 3, 1, 0, 1, 0, 4, 5, 4,
       5, 5, 2, 0, 4, 1, 3, 3, 3, 0, 2, 5, 2, 0, 5, 0, 0, 3, 0, 2, 2, 3, 4,
       1, 1, 3, 0, 0, 5, 1, 1, 4, 3, 3, 0, 0, 2])
  • Count the number of each color in the bag using sp.bincount(bag). The first element in the array is the number of occurrences of 0 in "bag," the second element is the number of occurrences of 1, etc.
In [15]:
sp.bincount(bag)
Out[15]:
array([16,  8,  8, 11,  7, 10])
  • For our "brown" = 5 choice, the number of brown M&Ms is the last element in the array returned by bincount, or sp.bincount(bag)[5].
In [16]:
sp.bincount(bag)[5]
Out[16]:
10
  • Now sample many bags
  • Record number of brown M&Ms in each bag
In [17]:
# Long version of sampling many bags
nb = 24                     # number of bags 
data_section = sp.zeros(nb) # array in which to store data for a lab section
for i in range(nb):
    bag = sp.stats.randint.rvs(0,6,size=60)
    data_section[i] = sp.bincount(bag)[5]

data_section
Out[17]:
array([  8.,  11.,  10.,  21.,   7.,   6.,  15.,   9.,  10.,   9.,  12.,
         8.,   7.,  12.,   7.,  11.,  12.,  12.,  19.,  14.,  11.,   6.,
         7.,  13.])
In [18]:
# Concise version of sampling many bags
nb = 24            # number of bags
data_section = sp.array([sp.bincount(sp.stats.randint.rvs(0,6,size=60))[5] for i in range(nb)])
data_section
Out[18]:
array([10, 13,  8,  9, 11, 13,  9, 11,  9,  5,  7,  7, 12, 12, 14,  9,  6,
       16, 12, 14, 12,  8,  8,  8])
In [19]:
sp.mean(data_section), sp.std(data_section), sp.std(data_section)/sp.sqrt(len(data_section)-1)
Out[19]:
(10.125, 2.7585095613392387, 0.57518900485348978)

Answer for part 1, the results from a single lab section:

$\overline N = 10.1 \pm 0.6$

In [27]:
plt.figure(2)
nbins = 20
low = 0
high = 20
plt.hist(data_section,nbins,[low,high])
plt.xlim(0,20)
plt.title("Histogram of brown M&Ms per bag - Single Section",fontsize=14)
plt.xlabel("Number of brown M&Ms")
plt.ylabel("Occurrences");

Part 2 -- Now we simulate data from 200 lab sections.

In [21]:
nb = 24    # Number of bags in a section
ns = 200   # Number of sections in class
data_class = sp.zeros(ns) # array for results from each section
for j in range(ns):
    data_section = sp.zeros(nb)     # array in which to store section data
    for i in range(nb):
        bag = sp.stats.randint.rvs(0,6,size=60)
        data_section[i] = sp.bincount(bag)[5]
    data_class[j] = sp.mean(data_section)
In [22]:
sp.mean(data_class), sp.std(data_class)
Out[22]:
(9.9941666666666649, 0.6089329236915636)

The standard deviation of the section averages (0.61) is consistent with that predicted by the standard deviation of the mean determined from a single section (0.6).

For a more thorough comparison with the predictions of the central limit theorem (CLT), I will compare the histogram of section averages with the normal distribution predicted by the CLT. The mean of the normal distribution is predicted to be 10, and the standard deviation is predicted to be $\sigma_\text{parent}/\sqrt{N}$, where $N$ is the number of bags in a section.

The parent distribution is a binomial distribution with $n = 60$, and $p = 1/6$.

In [23]:
sigmaP = sp.stats.binom.std(60,1/6.)
sigmaP
Out[23]:
2.8867513459481291

This result is consistent with the observed experimental standard deviation from a single section.

When comparing the histogram with the normal distribution, we are really comparing the observed occurrences in a bin with an integration of the probability distribution over a finite interval: $$ N_\text{section} \times \int_{x_1}^{x_2} p(x)\, dx \simeq N_\text{section}\, p(\overline x)\, (x_2-x_1), $$ where $(x_2-x_1)$ is the bin width. (For wide bins it would be better to use the cdf rather than this approximation.)
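To see how good the midpoint approximation is at the bin width used below, here is a sketch (my addition; the example bin edges are chosen for illustration) comparing the exact cdf-based bin area with the approximation:

```python
import numpy as np
from scipy import stats

ns = 200                                       # number of sections
mu = 10                                        # CLT-predicted mean
sigma = stats.binom.std(60, 1/6)/np.sqrt(24)   # sigma_parent/sqrt(N), N = 24 bags

x1, x2 = 9.8, 10.2                             # one example bin
exact = ns*(stats.norm.cdf(x2, mu, sigma) - stats.norm.cdf(x1, mu, sigma))
approx = ns*stats.norm.pdf(0.5*(x1 + x2), mu, sigma)*(x2 - x1)
print(exact, approx)
```

For bins this narrow the two agree to a few percent, so the pdf-based curve drawn on the histogram is adequate.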

In [28]:
plt.figure(3)
nbins = 25
low = 5
high = 15
binwidth = (high-low)/nbins
plt.hist(data_class,nbins,[low,high],alpha=0.5)
plt.xlim(low,high)
x = sp.linspace(0,20,400)
y = sp.stats.norm.pdf(x,10,sigmaP/sp.sqrt(nb))*ns*binwidth
plt.plot(x,y)
plt.title("Histogram of section averages",fontsize=14)
plt.xlabel("Number of brown M&Ms")
plt.ylabel("Occurrences");

The histogram compares well with the predictions of the CLT. (For a more quantitative comparison, see Chapter 8 of Hughes and Hase.) Even though the parent distribution is binomial, the CLT says that the distribution of AVERAGES drawn from the parent will be Gaussian.

Version information

In [25]:
%load_ext version_information

#%install_ext http://raw.github.com/jrjohansson/version_information/master/version_information.py
In [26]:
version_information scipy, matplotlib
Out[26]:
Software    Version
Python      3.6.1 64bit [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
IPython     6.1.0
OS          Linux 3.10.0 693.11.6.el7.x86_64 x86_64 with redhat 7.4 Maipo
scipy       0.19.1
matplotlib  2.0.2

Thu May 17 11:24:45 2018 EDT