{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using the $\\chi^2$ test of a distribution: Is this a fair die?\n",
"\n",
"In this notebook I provide two data sets from the roll of a nominally standard 6-sided die.\n",
"The question is: Is this a fair die, or not? The data is an array of integers chosen \n",
"from the set {0,1,2,3,4,5}. (We could make this set be {1,2,3,4,5,6} to correspond to the \n",
"numbers on a die, but with python it's easier to start at 0.)\n",
"\n",
"Use the procedure outlined at top of p.112 in Hughes and Hase, and calculate the \n",
"$\\chi^2$ statistic: \n",
"\n",
"$$\n",
"\\chi^2 = \\sum_i \\frac{(O_i - E_i)^2}{E_i}\n",
"$$\n",
"\n",
"where $O_i$ represents the number of occurrences of something in the sample data, and $E_i$ gives the expected number of occurrences based on the null hypothesis.\n",
"\n",
"Marty Ligare, August 2020"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from scipy import stats\n",
"from scipy import optimize\n",
"\n",
"import matplotlib as mpl\n",
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Following is an Ipython magic command that puts figures in the notebook.\n",
"%matplotlib notebook\n",
" \n",
"# M.L. modifications of matplotlib defaults\n",
"# Changes can also be put in matplotlibrc file, \n",
"# or effected using mpl.rcParams[]\n",
"mpl.style.use('classic') \n",
"plt.rc('figure', figsize = (6, 4.5)) # Reduces overall size of figures\n",
"plt.rc('axes', labelsize=16, titlesize=14)\n",
"plt.rc('figure', autolayout = True) # Adjusts supblot parameters for new size"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data set #1: 1000 rolls"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data = np.loadtxt('die_1000.dat', dtype=int)\n",
"data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Formulate null hypothesis: the die is fair\n",
"That means that each value should occur with a probability of 1/6.\n",
"for each value\n",
"\n",
"#### Calculate expected number of occurrences based on null hypothesis "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"e = len(data)*np.ones(6)/6.\n",
"print(e)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Make a \"histogram\" of the observed data\n",
"Here that just means counting occurrences."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"o = np.bincount(data, minlength=6)\n",
"print(o)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.figure()\n",
"x = np.linspace(0,5,6)\n",
"plt.scatter(x, e, label='expected')\n",
"plt.scatter(x, o, color='red', label='observed')\n",
"plt.xlabel('value')\n",
"plt.ylabel('occurrences')\n",
"plt.title('1000 Rolls')\n",
"plt.legend(loc='upper left');"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Is the high number of occurrences of 5 anomalous?\n",
"Calculate $\\chi^2$ and find probability that random sampling would \n",
"result in a value of $\\chi^2$ greater than the observed value."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"chi2_data = np.sum((o-e)**2/e)\n",
"print('chi2 = ',chi2_data)\n",
"print('reduced chi2 = ',chi2_data/6)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"p = 1 - stats.chi2.cdf(chi2_data, 6)\n",
"print('probability of getting a value of chisq greater than ',chi2_data,'with 6 bins is ',p)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Not a high probability, but not totally unreasonable. Don't reject null hypothesis based on this data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data set #2: 50000 rolls"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"data = np.loadtxt('die_50000.dat', dtype=int)\n",
"data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Formulate null hypothesis: the die is fair\n",
"That means that each value should occur with a probability of 1/6.\n",
"for each value\n",
"\n",
"#### Calculate expected number of occurrences "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"e = len(data)*np.ones(6)/6.\n",
"print(e)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Make a \"histogram\" of the observed data\n",
"Here that just means counting occurrences."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"o = np.bincount(data, minlength=6)\n",
"print(o)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.figure()\n",
"x = np.linspace(0,5,6)\n",
"plt.scatter(x, e, label='expected')\n",
"plt.scatter(x, o, color='red', label='observed')\n",
"plt.xlabel('value')\n",
"plt.ylabel('occurrences')\n",
"plt.title('50000 Rolls')\n",
"plt.legend(loc='upper left');"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Is the high number of occurrences of 5 anomalous?\n",
"Calculate $\\chi^2$ and find probability that random sampling would \n",
"result in a value of $\\chi^2$ greater than the observed value."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"chi2_data = np.sum((o-e)**2/e)\n",
"print('chi2 = ',chi2_data)\n",
"print('reduced chi2 = ',chi2_data/6)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"p = 1 - stats.chi2.cdf(chi2_data, 6)\n",
"print('probability of getting a value of chisq greater than ',chi2_data,'with 6 bins is ',p)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### This is a very low probability. It's safe to reject null hypothesis based on this data; it's very probable that the die is NOT fair."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Version Information \n",
"`version_information` is from J.R. Johansson (jrjohansson at gmail.com).\n",
"See Introduction to scientific computing with Python\n",
"for more information and instructions for package installation.
\n",
"\n",
"`version_information` has been installed system-wide on Bucknell linux networked computers. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%load_ext version_information"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%version_information numpy, scipy, matplotlib"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}