{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Using the $\\chi^2$ test of a distribution: Is this a fair die?\n", "\n", "In this notebook I provide two data sets from the roll of a nominally standard 6-sided die.\n", "The question is: Is this a fair die, or not? The data is an array of integers chosen \n", "from the set {0,1,2,3,4,5}. (We could make this set be {1,2,3,4,5,6} to correspond to the \n", "numbers on a die, but with python it's easier to start at 0.)\n", "\n", "Use the procedure outlined at top of p.112 in Hughes and Hase, and calculate the \n", "$\\chi^2$ statistic: \n", "\n", "$$\n", "\\chi^2 = \\sum_i \\frac{(O_i - E_i)^2}{E_i}\n", "$$\n", "\n", "where $O_i$ represents the number of occurrences of something in the sample data, and $E_i$ gives the expected number of occurrences based on the null hypothesis.\n", "\n", "Marty Ligare, August 2020" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from scipy import stats\n", "from scipy import optimize\n", "\n", "import matplotlib as mpl\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Following is an Ipython magic command that puts figures in the notebook.\n", "%matplotlib notebook\n", " \n", "# M.L. modifications of matplotlib defaults\n", "# Changes can also be put in matplotlibrc file, \n", "# or effected using mpl.rcParams[]\n", "mpl.style.use('classic') \n", "plt.rc('figure', figsize = (6, 4.5)) # Reduces overall size of figures\n", "plt.rc('axes', labelsize=16, titlesize=14)\n", "plt.rc('figure', autolayout = True) # Adjusts supblot parameters for new size" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data set #1: 1000 rolls" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = np.loadtxt('die_1000.dat', dtype=int)\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Formulate null hypothesis: the die is fair\n", "That means that each value should occur with a probability of 1/6.\n", "for each value\n", "\n", "#### Calculate expected number of occurrences based on null hypothesis " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "e = len(data)*np.ones(6)/6.\n", "print(e)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Make a \"histogram\" of the observed data\n", "Here that just means counting occurrences." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "o = np.bincount(data, minlength=6)\n", "print(o)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure()\n", "x = np.linspace(0,5,6)\n", "plt.scatter(x, e, label='expected')\n", "plt.scatter(x, o, color='red', label='observed')\n", "plt.xlabel('value')\n", "plt.ylabel('occurrences')\n", "plt.title('1000 Rolls')\n", "plt.legend(loc='upper left');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Is the high number of occurrences of 5 anomalous?\n", "Calculate $\\chi^2$ and find probability that random sampling would \n", "result in a value of $\\chi^2$ greater than the observed value." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "chi2_data = np.sum((o-e)**2/e)\n", "print('chi2 = ',chi2_data)\n", "print('reduced chi2 = ',chi2_data/6)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "p = 1 - stats.chi2.cdf(chi2_data, 6)\n", "print('probability of getting a value of chisq greater than ',chi2_data,'with 6 bins is ',p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Not a high probability, but not totally unreasonable. Don't reject null hypothesis based on this data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data set #2: 50000 rolls" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "data = np.loadtxt('die_50000.dat', dtype=int)\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Formulate null hypothesis: the die is fair\n", "That means that each value should occur with a probability of 1/6.\n", "for each value\n", "\n", "#### Calculate expected number of occurrences " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "e = len(data)*np.ones(6)/6.\n", "print(e)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Make a \"histogram\" of the observed data\n", "Here that just means counting occurrences." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "o = np.bincount(data, minlength=6)\n", "print(o)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure()\n", "x = np.linspace(0,5,6)\n", "plt.scatter(x, e, label='expected')\n", "plt.scatter(x, o, color='red', label='observed')\n", "plt.xlabel('value')\n", "plt.ylabel('occurrences')\n", "plt.title('50000 Rolls')\n", "plt.legend(loc='upper left');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Is the high number of occurrences of 5 anomalous?\n", "Calculate $\\chi^2$ and find probability that random sampling would \n", "result in a value of $\\chi^2$ greater than the observed value." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "chi2_data = np.sum((o-e)**2/e)\n", "print('chi2 = ',chi2_data)\n", "print('reduced chi2 = ',chi2_data/6)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "p = 1 - stats.chi2.cdf(chi2_data, 6)\n", "print('probability of getting a value of chisq greater than ',chi2_data,'with 6 bins is ',p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### This is a very low probability. It's safe to reject null hypothesis based on this data; it's very probable that the die is NOT fair." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Version Information \n", "`version_information` is from J.R. Johansson (jrjohansson at gmail.com).\n", "See Introduction to scientific computing with Python\n", "for more information and instructions for package installation.
\n", "\n", "`version_information` has been installed system-wide on Bucknell linux networked computers. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load_ext version_information" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%version_information numpy, scipy, matplotlib" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }