CSCI 305: Introduction to Database Systems

Homework 1: Build a SQL Database From the Data in a Spreadsheet

1. Objectives

After complete this assignment, students are expected to be able to

2. Overview

In this assignment, you are given a spreadsheet in Excel form that contains the New York Times Best 10 Books in the years 2015, 2016, and 2017. The spreadsheet has three separate sheets, one named Books, one named Authors, and one named Publishers. One can infer the content of these files from the file names. Specifically, the Books sheet includes the title, the author's name, publisher's name, and publisher year. This sheet also has a column "Translated or edited by" as some of the books are translated from other languages or edited by someone else. The Authors sheet includes the first and last name of the author. And the Publishers has the usual information such as publisher name, parent company, country of origin, when the company was founded, and its postal address.

Your work is to build a SQL database from this set of information such that you, or any other users, can query the database and search for any information that is in the database. For example, one might query all book titles from the database, or all books published in a particular year, or author names who published a particular book.

Please follow the instructions in the sections below to complete the assignment. Use the web as your source of information if you are not sure how to do certain things. Please don't hesitate to talk to your instructor if you have questions.

3. Setup and Data

Create a directory named "hw01" in your "csci305" directory under your git repo. Download the following files to your "hw01" directory.

4. Prepare the Data

The first task is to extract the information from the spreadsheet and save the data into a collection of CSV (Common Separated Values) files. Save the spreadsheet file, named it "nytimes-best-books.xlsx" or a name of your choice. Open this file in a spreadsheet program. You should see three sheets in this spreadsheet as described in the Overview section. Save each individual sheet as CSV file, name these files as books.csv, authors.csv, and publishers.csv, respectively. If you haven't worked with CSV files before, they are just plain text files that you can open with any text editors. Typically you can use any programs to manipulate these files easily.

Of the three CSV files, authors.csv and publishers.csv are ready to be used, that is, they are in good format. The books.csv contains a few lines at the beginning of the file that needs to be removed before they can be taken by the database. You need to open the books.csv file with a text editor and delete the lines before the title line for the columns, that is the line contains "Title", "Author Last Name",... You will need to keep the title line so that the name of these columns can be used directly in the database.

5. Create SQL Database from CSV File

Most SQL database provides tools to build database from CSV files. In sqlite3, you can run a batch command sequence to build database tables from a set of CSV files. The following is an example of how this works. Assume we have a CSV file named "city.csv" that contains the population of over 200 U.S. cities. The CSV file has two columns, one is the city name, and other is the population in that city. You can use the following steps to create a database named "city_pop.db" from the given CSV file.

  1. Create a text file containing the following sqlite3 meta commands. Give this file a reasonable name, e.g., "create_city.script" would be an appropriate name.
    .mode csv
    .import city.csv cities
    .exit
    
    These commands basically means setting the mode as CSV, importing information from the file "city.csv" and build a table named "cities" in the database, then exiting.
  2. To create a database from the above CSV file, you can run the following command from command line (e.g., from a Linux terminal):
    sqlite3 city_pop.db < create_city.script
    
    where "city_pop.db" will be the name for the newly created sqlite3 database file name, and "create_city.script" is the script containing a sequence of sqlite3 commands described above.
  3. If multiple tables are needed in a database (from multiple CSV files), you'd include multiple lines of commands in your script, e.g.,
    .mode csv
    .import city.csv cities
    .import names.csv names
    .import population.csv pops
    .exit
    
    assuming you have three CSV files to be used.

You can also accomplish the above task within the sqlite3 environment. All you have to do is to run the command,

sqlite3
which brings you into the sqlite3 environment, then issue the meta-commands one at a time.

The above examples creates a database from a set of CSV files. Once a sqlite3 database is created, you can create CSV files from the database. There are multiple programs can do the work. Follow this link to see sqltocsv, which is a shell script that converts a SQLite database into a list of tables in CSV form.

The following is a list of files used in the above described example. Please try it out to create the "city_pop.db" database from the "city.csv" file. You do not need to submit this part of the work. But this exercise will help you complete the assignment.

Once you feel comfortable of creating database from a spreadsheet file, download the nytimes-best-books.xlsx file, follow the steps and create a database with three tables in it. Make sure you name your database "books.db" to make the grading easier.

6. Set Primary Key and Foreign Key for the Tables

Once the database "books.db" is created, if you go into sqlite3 to examine the content, for example,

sqlite3 books.db
> .schema
> select * from books;
or some other commands like that, you will see that the database tables you just created do not have any primary keys or foreign keys. Your next task is to alter the table(s) to set proper primary key or foreign key. The basic idea is the following.

You first need to identify what should be your primary key and what should be your foreign key(s). Sometimes for convenience and efficiency you may add extra column, e.g., an integer value for an ID to be used as the primary if needed. For example, in the case of the name of a person, or the title of a book, it might be a good idea to use an integer as the ID as it is possible that different people have the same name, or different books have the same title. A foreign key is needed if the tables (relations) are necessarily connected to each other. In our "books.db" case, it makes sense to have the author name (or ID) as a foreign key in the books table. On the other hand, you should not define unnecessary keys. (One can have only one primary key per table, but one could have multiple foreign keys in a table.)

Once you identify various keys, you will need to alter the existing tables to add the key information. For example, if you want to use title as the primary key in the books table, you will have to basically save the existing table books as a temporary table, create a new table that has all the attributes of the old table except that the title now is the primary key. In another word, in the original table, you'd have the equivalent of

create table books (
   "title" text,
   "author last name" text,
   ...
);
In the new table, you'd have the equivalent of
create table books (
   "title" text primary key not null,
   "author last name" text,
   ...
);
Notice that here the content with double quotes such as "title" or "author last name" are identifiers in sqlite, that is, they are just like variable names in Python or Java title or author_last_name. While identifiers in Python or Java do not allow the space character, sqlite does allow space, except that identifiers containing spaces have to be surrounded by a pair of double quotes. If you prefer not to have space or double quotes, this is the time you can rename these attributes and remove spaces in the identifiers. For example,
create table books (
   title text primary key not null,
   author_last_name text,
   ...
);

Please read this post to see how one can alter the table such that the attributes can be changed, such as setting primary key or foreign key(s). Note There is a minor typo in that post. In the syntax portion of the "Modify column in table" section,

CREATE TABLE table1 (
( column1 datatype [ NULL | NOT NULL],
...
There is an extra open parenthesis "(" before "column1".

Once you set up proper primary key and foreign keys, you now have a complete working database. You may try a few SFW queries of your choice. Record three to five of your favorite queries in a file named "myquery.sql" for submission for testing purpose.

7. Submission

Congratulations! You have just successfully created a database from a set of spreadsheets. Please make sure to submit all your files.

8. References


Last modified: Tue Jan 23 13:20:55 EST 2018