{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "NOTE:\n", "-----\n", "\n", "Please run the below cells first before proceeding- you'll need them soon!" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The sql extension is already loaded. To reload it, use:\n", " %reload_ext sql\n" ] }, { "data": { "text/plain": [ "u'Connected: None@dataset_1.db'" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%load_ext sql\n", "%sql sqlite:///dataset_1.db" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Activity 3-2\n", "------------\n", "Aggregation operators, GROUP BY" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Exercise #1\n", "-----------\n", "\n", "Consider a set of tables that describe the up-and-coming bagel startup industry; for now let's just look at two tables here, `bagel`, which describes types of bagels made by the different bagel companies:\n", "> * name STRING\n", "> * price FLOAT\n", "> * made_by STRING\n", "\n", "And `purchase`:\n", "> * bagel_name STRING\n", "> * franchise STRING\n", "> * date INT\n", "> * quantity INT\n", "> * purchaser_age INT\n", "\n", "Where `purchase.bagel_name` references `bagel.name` and `purchase.franchise` references `bagel.made_by`:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namepricemade_by
Plain with shmear1.99Bobs Bagels
Egg with shmear2.39Bobs Bagels
eBagel Drinkable Bagel27.99eBagel
" ], "text/plain": [ "[(u'Plain with shmear', 1.99, u'Bobs Bagels'),\n", " (u'Egg with shmear', 2.39, u'Bobs Bagels'),\n", " (u'eBagel Drinkable Bagel', 27.99, u'eBagel')]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%sql SELECT * FROM bagel LIMIT 3;" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bagel_namefranchisedatequantitypurchaser_age
Plain with shmearBobs Bagels11228
Egg with shmearBobs Bagels2647
Plain with shmearBAGEL CORP21224
" ], "text/plain": [ "[(u'Plain with shmear', u'Bobs Bagels', 1, 12, 28),\n", " (u'Egg with shmear', u'Bobs Bagels', 2, 6, 47),\n", " (u'Plain with shmear', u'BAGEL CORP', 2, 12, 24)]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%sql SELECT * FROM purchase LIMIT 3;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Can you write a query to get the _total revenue_ for each bagel type **which had an average purchaser age over 18**? Type your query below:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameRevenue
Egg with shmear14.34
Plain with shmear84.51
" ], "text/plain": [ "[(u'Egg with shmear', 14.34), (u'Plain with shmear', 84.50999999999999)]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "SELECT b.name, SUM(p.quantity * b.price) AS Revenue\n", "FROM bagel b, purchase p\n", "WHERE b.name = p.bagel_name AND b.made_by = p.franchise\n", "GROUP BY b.name\n", "HAVING AVG(p.purchaser_age) > 18;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Exercise #2\n", "-----------\n", "\n", "Here we'll use a simplified version of the `precipitation_full` table, which just has _daily_ rainfall _in CA only_, and has the following schema:\n", "\n", "> * station_id\n", "> * day\n", "> * precipitation" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
station_iddayprecipitation
16102110
16102410
161022430
2120110
212012010
" ], "text/plain": [ "[(16102, 1, 10),\n", " (16102, 4, 10),\n", " (16102, 24, 30),\n", " (21201, 1, 0),\n", " (21201, 20, 10)]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%sql SELECT * FROM precipitation LIMIT 5;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want to get station_ids which have average precipitations > 75. Try doing this first as a nested query:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
station_id
88302
250002
335701
357302
488301
" ], "text/plain": [ "[(88302,), (250002,), (335701,), (357302,), (488301,)]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "SELECT DISTINCT p.station_id\n", "FROM precipitation p \n", "WHERE (\n", " SELECT AVG(precipitation) \n", " FROM precipitation \n", " WHERE station_id = p.station_id) > 75;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, try re-writing as a GROUP BY:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
station_id
88302
250002
335701
357302
488301
" ], "text/plain": [ "[(88302,), (250002,), (335701,), (357302,), (488301,)]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "SELECT station_id\n", "FROM precipitation\n", "GROUP BY station_id\n", "HAVING AVG(precipitation) > 75;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now time it by using `%time` followed by single-line versions of your queries above (clunky, but will work) to see how they compare!\n", "\n", "**Note:** Yes, currently the answers are filled in below for convenience... but you should still try getting them on your own above!" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n", "CPU times: user 41.9 ms, sys: 1.39 ms, total: 43.3 ms\n", "Wall time: 42.5 ms\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
station_id
88302
250002
335701
357302
488301
" ], "text/plain": [ "[(88302,), (250002,), (335701,), (357302,), (488301,)]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%time %sql SELECT DISTINCT p.station_id FROM precipitation p WHERE (SELECT AVG(precipitation) FROM precipitation WHERE station_id = p.station_id) > 75;" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n", "CPU times: user 3.06 ms, sys: 1.43 ms, total: 4.49 ms\n", "Wall time: 3.11 ms\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
station_id
88302
250002
335701
357302
488301
" ], "text/plain": [ "[(88302,), (250002,), (335701,), (357302,), (488301,)]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%time %sql SELECT p.station_id FROM precipitation p GROUP BY p.station_id HAVING AVG(p.precipitation) > 75;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**An ~ 10-20x difference in execution time!!**" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 1 }