{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"NOTE:\n",
"-----\n",
"\n",
"Please run the below cells first before proceeding- you'll need them soon!"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The sql extension is already loaded. To reload it, use:\n",
" %reload_ext sql\n"
]
},
{
"data": {
"text/plain": [
"u'Connected: None@dataset_1.db'"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%load_ext sql\n",
"%sql sqlite:///dataset_1.db"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Activity 3-2\n",
"------------\n",
"Aggregation operators, GROUP BY"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Exercise #1\n",
"-----------\n",
"\n",
"Consider a set of tables that describe the up-and-coming bagel startup industry; for now let's just look at two tables here, `bagel`, which describes types of bagels made by the different bagel companies:\n",
"> * name STRING\n",
"> * price FLOAT\n",
"> * made_by STRING\n",
"\n",
"And `purchase`:\n",
"> * bagel_name STRING\n",
"> * franchise STRING\n",
"> * date INT\n",
"> * quantity INT\n",
"> * purchaser_age INT\n",
"\n",
"Where `purchase.bagel_name` references `bagel.name` and `purchase.franchise` references `bagel.made_by`:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done.\n"
]
},
{
"data": {
"text/html": [
"
\n",
" \n",
" name | \n",
" price | \n",
" made_by | \n",
"
\n",
" \n",
" Plain with shmear | \n",
" 1.99 | \n",
" Bobs Bagels | \n",
"
\n",
" \n",
" Egg with shmear | \n",
" 2.39 | \n",
" Bobs Bagels | \n",
"
\n",
" \n",
" eBagel Drinkable Bagel | \n",
" 27.99 | \n",
" eBagel | \n",
"
\n",
"
"
],
"text/plain": [
"[(u'Plain with shmear', 1.99, u'Bobs Bagels'),\n",
" (u'Egg with shmear', 2.39, u'Bobs Bagels'),\n",
" (u'eBagel Drinkable Bagel', 27.99, u'eBagel')]"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%sql SELECT * FROM bagel LIMIT 3;"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done.\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" bagel_name | \n",
" franchise | \n",
" date | \n",
" quantity | \n",
" purchaser_age | \n",
"
\n",
" \n",
" Plain with shmear | \n",
" Bobs Bagels | \n",
" 1 | \n",
" 12 | \n",
" 28 | \n",
"
\n",
" \n",
" Egg with shmear | \n",
" Bobs Bagels | \n",
" 2 | \n",
" 6 | \n",
" 47 | \n",
"
\n",
" \n",
" Plain with shmear | \n",
" BAGEL CORP | \n",
" 2 | \n",
" 12 | \n",
" 24 | \n",
"
\n",
"
"
],
"text/plain": [
"[(u'Plain with shmear', u'Bobs Bagels', 1, 12, 28),\n",
" (u'Egg with shmear', u'Bobs Bagels', 2, 6, 47),\n",
" (u'Plain with shmear', u'BAGEL CORP', 2, 12, 24)]"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%sql SELECT * FROM purchase LIMIT 3;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Can you write a query to get the _total revenue_ for each bagel type **which had an average purchaser age over 18**? Type your query below:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done.\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" name | \n",
" Revenue | \n",
"
\n",
" \n",
" Egg with shmear | \n",
" 14.34 | \n",
"
\n",
" \n",
" Plain with shmear | \n",
" 84.51 | \n",
"
\n",
"
"
],
"text/plain": [
"[(u'Egg with shmear', 14.34), (u'Plain with shmear', 84.50999999999999)]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
"SELECT b.name, SUM(p.quantity * b.price) AS Revenue\n",
"FROM bagel b, purchase p\n",
"WHERE b.name = p.bagel_name AND b.made_by = p.franchise\n",
"GROUP BY b.name\n",
"HAVING AVG(p.purchaser_age) > 18;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Exercise #2\n",
"-----------\n",
"\n",
"Here we'll use a simplified version of the `precipitation_full` table, which just has _daily_ rainfall _in CA only_, and has the following schema:\n",
"\n",
"> * station_id\n",
"> * day\n",
"> * precipitation"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done.\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" station_id | \n",
" day | \n",
" precipitation | \n",
"
\n",
" \n",
" 16102 | \n",
" 1 | \n",
" 10 | \n",
"
\n",
" \n",
" 16102 | \n",
" 4 | \n",
" 10 | \n",
"
\n",
" \n",
" 16102 | \n",
" 24 | \n",
" 30 | \n",
"
\n",
" \n",
" 21201 | \n",
" 1 | \n",
" 0 | \n",
"
\n",
" \n",
" 21201 | \n",
" 20 | \n",
" 10 | \n",
"
\n",
"
"
],
"text/plain": [
"[(16102, 1, 10),\n",
" (16102, 4, 10),\n",
" (16102, 24, 30),\n",
" (21201, 1, 0),\n",
" (21201, 20, 10)]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%sql SELECT * FROM precipitation LIMIT 5;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We want to get station_ids which have average precipitations > 75. Try doing this first as a nested query:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done.\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" station_id | \n",
"
\n",
" \n",
" 88302 | \n",
"
\n",
" \n",
" 250002 | \n",
"
\n",
" \n",
" 335701 | \n",
"
\n",
" \n",
" 357302 | \n",
"
\n",
" \n",
" 488301 | \n",
"
\n",
"
"
],
"text/plain": [
"[(88302,), (250002,), (335701,), (357302,), (488301,)]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
"SELECT DISTINCT p.station_id\n",
"FROM precipitation p \n",
"WHERE (\n",
" SELECT AVG(precipitation) \n",
" FROM precipitation \n",
" WHERE station_id = p.station_id) > 75;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, try re-writing as a GROUP BY:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done.\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" station_id | \n",
"
\n",
" \n",
" 88302 | \n",
"
\n",
" \n",
" 250002 | \n",
"
\n",
" \n",
" 335701 | \n",
"
\n",
" \n",
" 357302 | \n",
"
\n",
" \n",
" 488301 | \n",
"
\n",
"
"
],
"text/plain": [
"[(88302,), (250002,), (335701,), (357302,), (488301,)]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
"SELECT station_id\n",
"FROM precipitation\n",
"GROUP BY station_id\n",
"HAVING AVG(precipitation) > 75;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now time it by using `%time` followed by single-line versions of your queries above (clunky, but will work) to see how they compare!\n",
"\n",
"**Note:** Yes, currently the answers are filled in below for convenience... but you should still try getting them on your own above!"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done.\n",
"CPU times: user 41.9 ms, sys: 1.39 ms, total: 43.3 ms\n",
"Wall time: 42.5 ms\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" station_id | \n",
"
\n",
" \n",
" 88302 | \n",
"
\n",
" \n",
" 250002 | \n",
"
\n",
" \n",
" 335701 | \n",
"
\n",
" \n",
" 357302 | \n",
"
\n",
" \n",
" 488301 | \n",
"
\n",
"
"
],
"text/plain": [
"[(88302,), (250002,), (335701,), (357302,), (488301,)]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%time %sql SELECT DISTINCT p.station_id FROM precipitation p WHERE (SELECT AVG(precipitation) FROM precipitation WHERE station_id = p.station_id) > 75;"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done.\n",
"CPU times: user 3.06 ms, sys: 1.43 ms, total: 4.49 ms\n",
"Wall time: 3.11 ms\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" station_id | \n",
"
\n",
" \n",
" 88302 | \n",
"
\n",
" \n",
" 250002 | \n",
"
\n",
" \n",
" 335701 | \n",
"
\n",
" \n",
" 357302 | \n",
"
\n",
" \n",
" 488301 | \n",
"
\n",
"
"
],
"text/plain": [
"[(88302,), (250002,), (335701,), (357302,), (488301,)]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%time %sql SELECT p.station_id FROM precipitation p GROUP BY p.station_id HAVING AVG(p.precipitation) > 75;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**An ~ 10-20x difference in execution time!!**"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 1
}