{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using the $\\chi^2$ test of a distribution: Is this a fair die?\n",
    "\n",
    "In this notebook I provide two data sets from the roll of a nominally standard 6-sided die.\n",
    "The question is: Is this a fair die, or not?  The data is an array of integers chosen \n",
    "from the set {0,1,2,3,4,5}. (We could make this set be {1,2,3,4,5,6} to correspond to the \n",
    "numbers on a die, but with python it's easier to start at 0.)\n",
    "\n",
    "Use the procedure outlined at top of p.112 in Hughes and Hase, and calculate the \n",
    "$\\chi^2$ statistic: \n",
    "\n",
    "$$\n",
    "\\chi^2 = \\sum_i \\frac{(O_i - E_i)^2}{E_i}\n",
    "$$\n",
    "\n",
    "where $O_i$ represents the number of occurrences of something in the sample data, and $E_i$ gives the expected number of occurrences based on the null hypothesis.\n",
    "\n",
    "Marty Ligare, August 2020"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from scipy import stats\n",
    "from scipy import optimize\n",
    "\n",
    "import matplotlib as mpl\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Following is an Ipython magic command that puts figures in the  notebook.\n",
    "%matplotlib notebook\n",
    "        \n",
    "# M.L. modifications of matplotlib defaults\n",
    "# Changes can also be put in matplotlibrc file, \n",
    "# or effected using mpl.rcParams[]\n",
    "mpl.style.use('classic') \n",
    "plt.rc('figure', figsize = (6, 4.5))      # Reduces overall size of figures\n",
    "plt.rc('axes', labelsize=16, titlesize=14)\n",
    "plt.rc('figure', autolayout = True)       # Adjusts supblot parameters for new size"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data set #1: 1000 rolls"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = np.loadtxt('die_1000.dat', dtype=int)\n",
    "data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Formulate null hypothesis: the die is fair\n",
    "That means that each value should occur with a probability of 1/6.\n",
    "for each value\n",
    "\n",
    "#### Calculate expected number of occurrences based on null hypothesis "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "e  = len(data)*np.ones(6)/6.\n",
    "print(e)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Make a \"histogram\" of the observed data\n",
    "Here that just means counting occurrences."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "o = np.bincount(data, minlength=6)\n",
    "print(o)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure()\n",
    "x = np.linspace(0,5,6)\n",
    "plt.scatter(x, e, label='expected')\n",
    "plt.scatter(x, o, color='red', label='observed')\n",
    "plt.xlabel('value')\n",
    "plt.ylabel('occurrences')\n",
    "plt.title('1000 Rolls')\n",
    "plt.legend(loc='upper left');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Is the high number of occurrences of 5 anomalous?\n",
    "Calculate $\\chi^2$ and find probability that random sampling would \n",
    "result in a value of $\\chi^2$ greater than the observed value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "chi2_data = np.sum((o-e)**2/e)\n",
    "print('chi2 = ',chi2_data)\n",
    "print('reduced chi2 = ',chi2_data/6)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "p = 1 - stats.chi2.cdf(chi2_data, 6)\n",
    "print('probability of getting a value of chisq greater than ',chi2_data,'with 6 bins is ',p)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Not a high probability, but not totally unreasonable.  Don't reject null hypothesis based on this data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data set #2: 50000 rolls"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "data = np.loadtxt('die_50000.dat', dtype=int)\n",
    "data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Formulate null hypothesis: the die is fair\n",
    "That means that each value should occur with a probability of 1/6.\n",
    "for each value\n",
    "\n",
    "#### Calculate expected number of occurrences "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "e  = len(data)*np.ones(6)/6.\n",
    "print(e)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Make a \"histogram\" of the observed data\n",
    "Here that just means counting occurrences."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "o = np.bincount(data, minlength=6)\n",
    "print(o)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure()\n",
    "x = np.linspace(0,5,6)\n",
    "plt.scatter(x, e, label='expected')\n",
    "plt.scatter(x, o, color='red', label='observed')\n",
    "plt.xlabel('value')\n",
    "plt.ylabel('occurrences')\n",
    "plt.title('50000 Rolls')\n",
    "plt.legend(loc='upper left');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Is the high number of occurrences of 5 anomalous?\n",
    "Calculate $\\chi^2$ and find probability that random sampling would \n",
    "result in a value of $\\chi^2$ greater than the observed value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "chi2_data = np.sum((o-e)**2/e)\n",
    "print('chi2 = ',chi2_data)\n",
    "print('reduced chi2 = ',chi2_data/6)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "p = 1 - stats.chi2.cdf(chi2_data, 6)\n",
    "print('probability of getting a value of chisq greater than ',chi2_data,'with 6 bins is ',p)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### This is a very low probability.  It's safe to  reject null hypothesis based on this data; it's very probable that the die is NOT fair."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Version Information \n",
    "`version_information` is from J.R. Johansson (jrjohansson at gmail.com).\n",
    "See <a href=http://nbviewer.jupyter.org/github/jrjohansson/scientific-python-lectures/blob/master/Lecture-0-Scientific-Computing-with-Python.ipynb>Introduction to scientific computing with Python</a>\n",
    "for more information and instructions for package installation.<br>\n",
    "\n",
    "`version_information` has been installed system-wide on Bucknell linux networked computers. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext version_information"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%version_information numpy, scipy, matplotlib"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}