{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Statistics Tools\n", "\n", "NOTE: In this notebook I use the`stats` submodule of `scipy` for all statistics functions, including generation of random numbers. There are other modules with some overlapping functionality, e.g., the regular python `random` module, and the `np.random` module, but I do not use them here. The `stats` submodule includes tools for a large number of distributions, it includes a large and growing set of statistical functions, and there is a unified class structure. In addition, potential namespace issues are minimized. See https://docs.scipy.org/doc/scipy/reference/stats.html.\n", "\n", "Marty Ligare, August 2020\n", "\n", "Modified by Tom Solomon, February 2021\n", "\n", "Modified by Ned Ladd, January 2022 to remove np.random references in favor of stats.randint.rvs" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from scipy import stats\n", "\n", "import matplotlib as mpl\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Following is an Ipython magic command that puts figures in notebook.\n", "%matplotlib notebook\n", " \n", "# M.L. modifications of matplotlib defaults\n", "# Changes can also be put in matplotlibrc file, \n", "# or effected using mpl.rcParams[]\n", "mpl.style.use('classic') \n", "plt.rc('figure', figsize = (6, 4.5)) # Reduces overall size of figures\n", "plt.rc('axes', labelsize=16, titlesize=14)\n", "plt.rc('figure', autolayout = True) # Adjusts supblot parameters for new size\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generating random integers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to generate just one random integer, here is a way to do it. You have to specify a \"low\" and \"high\" for the range; i.e., you want a random integer between \"low\" and \"high\". Note that you have to add one to the \"high\" limit.\n", "\n", "This example pulls a random integer between 3 and 7. Execute the cell a few times to get a feel for it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "low = 3\n", "high = 7\n", "stats.randint.rvs(low,high+1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is more likely that you will want to get a full array of integers in one fell swoop. Here is how to do that." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Generate n integers between low and high:\n", "low = -3\n", "high = 6\n", "n = 100\n", "\n", "# or equivalently:\n", "# loq, high, n = (-3, 6, 100)\n", "\n", "stats.randint.rvs(low, high+1, size=n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sampling random numbers from a uniform p.d.f." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sample n random numbers in interval [0.0,1.0]:\n", "n = 10\n", "stats.uniform.rvs(size=n) # \"uniform\" in the command indicates all values are equally likely" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Or, if you need a uniform distribution within a different interval\n", "# you can specify the lower bound of the interval, and the range\n", "#\n", "low = 2\n", "rng = 7\n", "stats.uniform.rvs(low,rng,size=10) # this command will produce 10 random values in the interval [2.0,9.0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sampling random numbers from a normal distribution" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sample $n$ random numbers from the normal distribution with mean $\\mu$, standard deviation $\\sigma$, and pdf of Eq.(2.4) of Hughes & Hase:\n", "\\begin{equation}\n", "p(x) = \\frac{1}{\\sigma\\sqrt{2\\pi}}\\exp\\left[-\\frac{(x-\\mu)^2}{2\\sigma^2}\\right]\n", "\\end{equation}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sampling from normal distribution\n", "n = 10\n", "mean = 10.\n", "sigma = 2.\n", "stats.norm.rvs(mean, sigma, size=n) # \"norm\" in the command indicates a \"normal\" or \"Gaussian\" distribution" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Graph the Probability Distribution Function (PDF) of a Normal Distribution" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure()\n", "x = np.linspace(mean-3.*sigma, mean+3.*sigma,200) # make an array of 200 evenly spaced values, starting with\n", " # mean-3*sigma = 4 and going to mean+3*sigma = 10\n", "y = stats.norm.pdf(x, mean, sigma) # determine the value of the pdf at each of the points in 'x'\n", "plt.title(\"pdf of normal distribution\")\n", "plt.xlabel(\"$x$\")\n", "plt.ylabel(\"$p(x)$\")\n", "plt.grid()\n", "plt.plot(x, y);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Graph of the Cumulative Distribution Function (CDF) of normal distribution" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure()\n", "x = np.linspace(mean-3.*sigma, mean+3.*sigma, 200) # don't really need to do this again....\n", "y = stats.norm.cdf(x, mean, sigma)\n", "plt.title(\"cdf of normal distribution\")\n", "plt.xlabel(\"$x$\")\n", "plt.ylabel(\"$\\int_{-\\infty}^x p(x^\\prime)\\, dx^\\prime$\")\n", "plt.grid()\n", "plt.plot(x, y);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sampling random numbers from a Poisson distribution" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sample $n$ random numbers from the Poisson distribution with average count $\\overline{N}$, and probability distibution given by Eq.(3.1) of Hughes & Hase:\n", "\\begin{equation}\n", "p(N;\\overline{N}) = \\frac{\\exp\\left(-\\overline{N}\\right)\\overline{N}^N}{N!}\n", "\\end{equation}\n", "The standard deviation of the Poisson distribution is given by \n", "$$ \\sigma = \\sqrt{\\overline{N}}. $$" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Sampling from a Poisson distribution \n", "n = 1000\n", "mean = 2\n", "stats.poisson.rvs(mean, size=n)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.mean(_) # The underscore \"_\" is similar to Mathematicas \"%\"\n", " # It refers to the output of the previously executed cell" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.std(__) # Notice the double underscore \"__\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.sqrt(mean)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Graph of the Probability Mass Function (PMF; it's like a PDF) of Poisson distribution" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure()\n", "x = np.linspace(0, 12, 13)\n", "y = stats.poisson.pmf(x, mean)\n", "plt.title(\"pmf of Poisson distribution\")\n", "plt.xlabel(\"$n$\")\n", "plt.ylabel(\"$p(n)$\")\n", "plt.grid()\n", "plt.axhline(0)\n", "plt.scatter(x, y);\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Graph of the CDF of Poisson distribution" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure()\n", "x = np.linspace(0, 12, 13)\n", "y = stats.poisson.cdf(x, mean)\n", "plt.title(\"cdf of Poisson distribution\")\n", "plt.xlabel(\"$x$\")\n", "plt.ylabel(\"$C_{DF}$\")\n", "plt.xlim(-1, 13)\n", "plt.grid()\n", "plt.axhline(0)\n", "plt.scatter(x, y);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sampling random numbers from a binomial distribution" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Consider $n$ trials, with probability of success $p$ in each trial. The array below is the number successes in each of $size$ trials." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sampling from a binomial distribution\n", "n = 2\n", "p = 0.4\n", "stats.binom.rvs(n, p, size=100)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.mean(_)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The probablity of getting $x$ successes is given by the probability mass function (pmf). This is analogous to the continous pdf (and it's called the PDF in Mathematica). As an example, the probability of 2 successes in 3 trials with a probability of success in each trial of 0.4 is\n", "29%:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "n, s, p = (3, 2, 0.4)\n", "stats.binom.pmf(s, n, p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Graph of pmf (~pdf) of binomial distribution" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure()\n", "n = 5\n", "x = np.linspace(0, n, n+1)\n", "y = stats.binom.pmf(x, n, p)\n", "\n", "plt.title(\"pmf ($\\sim$pdf) of binom. dist.; $n=5$, $p = 0.4$\")\n", "plt.xlabel(\"$n$\")\n", "plt.ylabel(\"probability of $n$ successes\")\n", "plt.grid()\n", "plt.axhline(0)\n", "plt.scatter(x, y);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Graph of cdf of binomial distribution" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure()\n", "n = 5\n", "x = np.linspace(0, n, n+1)\n", "y = stats.binom.cdf(x, n, p)\n", "plt.title(\"cdf of binomial dist.; $n=5$, $p = 0.4$\")\n", "plt.xlabel(\"$n$\")\n", "plt.ylabel(\"cdf\")\n", "plt.grid()\n", "plt.axhline(0)\n", "plt.scatter(x, y);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Histograms" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Generate some random data from a normal distribution." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "n = 100\n", "mean = 10.\n", "sigma = 2.\n", "data = stats.norm.rvs(mean, sigma, size=n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Select number of bins between low and high values.\n", "\n", "NOTE: `plt.hist` plots the histogram, and ouputs the binned data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure()\n", "nbins = 6\n", "low = mean - 3*sigma\n", "high = mean + 3*sigma\n", "plt.xlabel(\"value\")\n", "plt.ylabel(\"occurences\")\n", "plt.title(\"Histogram; equal sized bins\")\n", "out = plt.hist(data, nbins, [low,high])\n", "out[0],out[1] # occurrences and bin boundaries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### OR ... Select specific bin boundaries\n", "Again, `plt.hist` outputs the binned data and plots the histogram.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure()\n", "bins = [4, 7, 8, 10, 13, 16]\n", "plt.xlabel(\"value\")\n", "plt.ylabel(\"occurences\")\n", "plt.title(\"Histogram; specified (nonuniform) bins\")\n", "out = plt.hist(data, bins)\n", "out[0],out[1] # occurrences and bin boundaries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Version Information \n", "`version_information` is from J.R. Johansson (jrjohansson at gmail.com); see Introduction to scientific computing with Python for more information and instructions for package installation.\n", "\n", "`version_information` is installed on the linux network at Bucknell" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load_ext version_information" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "version_information numpy, scipy, matplotlib" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.8" } }, "nbformat": 4, "nbformat_minor": 1 }