{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Statistics Tools\n",
"\n",
"NOTE: In this notebook I use the`stats` submodule of `scipy` for all statistics functions, including generation of random numbers. There are other modules with some overlapping functionality, e.g., the regular python `random` module, and the `np.random` module, but I do not use them here. The `stats` submodule includes tools for a large number of distributions, it includes a large and growing set of statistical functions, and there is a unified class structure. In addition, potential namespace issues are minimized. See https://docs.scipy.org/doc/scipy/reference/stats.html.\n",
"\n",
"Marty Ligare, August 2020\n",
"\n",
"Modified by Tom Solomon, February 2021\n",
"\n",
"Modified by Ned Ladd, January 2022 to remove np.random references in favor of stats.randint.rvs"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from scipy import stats\n",
"\n",
"import matplotlib as mpl\n",
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Following is an Ipython magic command that puts figures in notebook.\n",
"%matplotlib notebook\n",
" \n",
"# M.L. modifications of matplotlib defaults\n",
"# Changes can also be put in matplotlibrc file, \n",
"# or effected using mpl.rcParams[]\n",
"mpl.style.use('classic') \n",
"plt.rc('figure', figsize = (6, 4.5)) # Reduces overall size of figures\n",
"plt.rc('axes', labelsize=16, titlesize=14)\n",
"plt.rc('figure', autolayout = True) # Adjusts supblot parameters for new size\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Generating random integers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to generate just one random integer, here is a way to do it. You have to specify a \"low\" and \"high\" for the range; i.e., you want a random integer between \"low\" and \"high\". Note that you have to add one to the \"high\" limit.\n",
"\n",
"This example pulls a random integer between 3 and 7. Execute the cell a few times to get a feel for it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"low = 3\n",
"high = 7\n",
"stats.randint.rvs(low,high+1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is more likely that you will want to get a full array of integers in one fell swoop. Here is how to do that."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Generate n integers between low and high:\n",
"low = -3\n",
"high = 6\n",
"n = 100\n",
"\n",
"# or equivalently:\n",
"# loq, high, n = (-3, 6, 100)\n",
"\n",
"stats.randint.rvs(low, high+1, size=n)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sampling random numbers from a uniform p.d.f."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sample n random numbers in interval [0.0,1.0]:\n",
"n = 10\n",
"stats.uniform.rvs(size=n) # \"uniform\" in the command indicates all values are equally likely"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Or, if you need a uniform distribution within a different interval\n",
"# you can specify the lower bound of the interval, and the range\n",
"#\n",
"low = 2\n",
"rng = 7\n",
"stats.uniform.rvs(low,rng,size=10) # this command will produce 10 random values in the interval [2.0,9.0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sampling random numbers from a normal distribution"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sample $n$ random numbers from the normal distribution with mean $\\mu$, standard deviation $\\sigma$, and pdf of Eq.(2.4) of Hughes & Hase:\n",
"\\begin{equation}\n",
"p(x) = \\frac{1}{\\sigma\\sqrt{2\\pi}}\\exp\\left[-\\frac{(x-\\mu)^2}{2\\sigma^2}\\right]\n",
"\\end{equation}"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sampling from normal distribution\n",
"n = 10\n",
"mean = 10.\n",
"sigma = 2.\n",
"stats.norm.rvs(mean, sigma, size=n) # \"norm\" in the command indicates a \"normal\" or \"Gaussian\" distribution"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Graph the Probability Distribution Function (PDF) of a Normal Distribution"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.figure()\n",
"x = np.linspace(mean-3.*sigma, mean+3.*sigma,200) # make an array of 200 evenly spaced values, starting with\n",
" # mean-3*sigma = 4 and going to mean+3*sigma = 10\n",
"y = stats.norm.pdf(x, mean, sigma) # determine the value of the pdf at each of the points in 'x'\n",
"plt.title(\"pdf of normal distribution\")\n",
"plt.xlabel(\"$x$\")\n",
"plt.ylabel(\"$p(x)$\")\n",
"plt.grid()\n",
"plt.plot(x, y);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Graph of the Cumulative Distribution Function (CDF) of normal distribution"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.figure()\n",
"x = np.linspace(mean-3.*sigma, mean+3.*sigma, 200) # don't really need to do this again....\n",
"y = stats.norm.cdf(x, mean, sigma)\n",
"plt.title(\"cdf of normal distribution\")\n",
"plt.xlabel(\"$x$\")\n",
"plt.ylabel(\"$\\int_{-\\infty}^x p(x^\\prime)\\, dx^\\prime$\")\n",
"plt.grid()\n",
"plt.plot(x, y);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Sampling random numbers from a Poisson distribution"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sample $n$ random numbers from the Poisson distribution with average count $\\overline{N}$, and probability distibution given by Eq.(3.1) of Hughes & Hase:\n",
"\\begin{equation}\n",
"p(N;\\overline{N}) = \\frac{\\exp\\left(-\\overline{N}\\right)\\overline{N}^N}{N!}\n",
"\\end{equation}\n",
"The standard deviation of the Poisson distribution is given by \n",
"$$ \\sigma = \\sqrt{\\overline{N}}. $$"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# Sampling from a Poisson distribution \n",
"n = 1000\n",
"mean = 2\n",
"stats.poisson.rvs(mean, size=n)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"np.mean(_) # The underscore \"_\" is similar to Mathematicas \"%\"\n",
" # It refers to the output of the previously executed cell"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"np.std(__) # Notice the double underscore \"__\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"np.sqrt(mean)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Graph of the Probability Mass Function (PMF; it's like a PDF) of Poisson distribution"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.figure()\n",
"x = np.linspace(0, 12, 13)\n",
"y = stats.poisson.pmf(x, mean)\n",
"plt.title(\"pmf of Poisson distribution\")\n",
"plt.xlabel(\"$n$\")\n",
"plt.ylabel(\"$p(n)$\")\n",
"plt.grid()\n",
"plt.axhline(0)\n",
"plt.scatter(x, y);\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Graph of the CDF of Poisson distribution"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.figure()\n",
"x = np.linspace(0, 12, 13)\n",
"y = stats.poisson.cdf(x, mean)\n",
"plt.title(\"cdf of Poisson distribution\")\n",
"plt.xlabel(\"$x$\")\n",
"plt.ylabel(\"$C_{DF}$\")\n",
"plt.xlim(-1, 13)\n",
"plt.grid()\n",
"plt.axhline(0)\n",
"plt.scatter(x, y);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Sampling random numbers from a binomial distribution"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Consider $n$ trials, with probability of success $p$ in each trial. The array below is the number successes in each of $size$ trials."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sampling from a binomial distribution\n",
"n = 2\n",
"p = 0.4\n",
"stats.binom.rvs(n, p, size=100)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"np.mean(_)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The probablity of getting $x$ successes is given by the probability mass function (pmf). This is analogous to the continous pdf (and it's called the PDF in Mathematica). As an example, the probability of 2 successes in 3 trials with a probability of success in each trial of 0.4 is\n",
"29%:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"n, s, p = (3, 2, 0.4)\n",
"stats.binom.pmf(s, n, p)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Graph of pmf (~pdf) of binomial distribution"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.figure()\n",
"n = 5\n",
"x = np.linspace(0, n, n+1)\n",
"y = stats.binom.pmf(x, n, p)\n",
"\n",
"plt.title(\"pmf ($\\sim$pdf) of binom. dist.; $n=5$, $p = 0.4$\")\n",
"plt.xlabel(\"$n$\")\n",
"plt.ylabel(\"probability of $n$ successes\")\n",
"plt.grid()\n",
"plt.axhline(0)\n",
"plt.scatter(x, y);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Graph of cdf of binomial distribution"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.figure()\n",
"n = 5\n",
"x = np.linspace(0, n, n+1)\n",
"y = stats.binom.cdf(x, n, p)\n",
"plt.title(\"cdf of binomial dist.; $n=5$, $p = 0.4$\")\n",
"plt.xlabel(\"$n$\")\n",
"plt.ylabel(\"cdf\")\n",
"plt.grid()\n",
"plt.axhline(0)\n",
"plt.scatter(x, y);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Histograms"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Generate some random data from a normal distribution."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"n = 100\n",
"mean = 10.\n",
"sigma = 2.\n",
"data = stats.norm.rvs(mean, sigma, size=n)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Select number of bins between low and high values.\n",
"\n",
"NOTE: `plt.hist` plots the histogram, and ouputs the binned data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.figure()\n",
"nbins = 6\n",
"low = mean - 3*sigma\n",
"high = mean + 3*sigma\n",
"plt.xlabel(\"value\")\n",
"plt.ylabel(\"occurences\")\n",
"plt.title(\"Histogram; equal sized bins\")\n",
"out = plt.hist(data, nbins, [low,high])\n",
"out[0],out[1] # occurrences and bin boundaries"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### OR ... Select specific bin boundaries\n",
"Again, `plt.hist` outputs the binned data and plots the histogram.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.figure()\n",
"bins = [4, 7, 8, 10, 13, 16]\n",
"plt.xlabel(\"value\")\n",
"plt.ylabel(\"occurences\")\n",
"plt.title(\"Histogram; specified (nonuniform) bins\")\n",
"out = plt.hist(data, bins)\n",
"out[0],out[1] # occurrences and bin boundaries"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Version Information \n",
"`version_information` is from J.R. Johansson (jrjohansson at gmail.com); see Introduction to scientific computing with Python for more information and instructions for package installation.\n",
"\n",
"`version_information` is installed on the linux network at Bucknell"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%load_ext version_information"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"version_information numpy, scipy, matplotlib"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.8"
}
},
"nbformat": 4,
"nbformat_minor": 1
}