CSCI 305 — Introduction to Database Systems

CSCI 305: Introduction to Database Systems

Project 1: Building a Web Accessible Movie Rating Database

Due Dates
Project parts are to be submitted to gitlab 11:55 pm on the due date.

Project Part Due

Proposal Fri 02/23

Part 1 Wed 03/07

Part 2 Wed 03/21

Part 3 Wed 04/04

0. Objectives

After completing the project, students are expected to be able to

Design proper schema for a real-world database with sizable data;

Implement a multi-table database that allows the user to issue various kinds of queries through a web interface;

Populate the database with data from a collection of CSV files (or directly from spreadsheet(s);

Identify and set up proper constraints and their triggers on the database so that if any data becomes erroneous, warning message(s) will be generated and presented.

1. Overview

This is a multi-part project to create a SQL database from a data set of your choice that may excite you. The project is expected to be completed by a team of 2-3 students. Students can pick their own partner(s). If you choose solo, please let the instructor know and explain the reason(s).

You are given a default data set of movies and their ratings. You are encouraged to find other data sets from the web. The Reference section of the assignment provides a list of potential sources. You can certainly use other sources. If you decide to use your own data sets, the required work will be very similar to what is described here, which is a three-part project. Please contact your instructor if you have any questions regarding using your own data set.

The default data set is the movie rating data from University of Minnesota's research project MovieLens. They are in the form of a collection of CSV files. The information collected includes movies with title, release date, and category, users with name and state of residence, tags that the rating user wrote for the movie, ratings provided by the user, and links between the movie title in the rating database and IMDB (Internet Movie Database). This data set is a small one (ml-latest-small.zip) from the site that contains about 100,000 ratings and 1,300 tags applied to 9,000 movies by 700 users as of October 2016.

You are asked to create a SQLite database using this collection of data and write programs that can support specified class of queries through a web interface. If you choose a different data set, which is highly recommended, your project is expected to accomplish the same goals as the one with default data set.

This is a four-part project, each with its own goals and due date.

2. Part Zero: Choosing a Data Set and Form a Student Team

Students will be given a default data set to work in the project. However it is highly recommended that students find a data set of their interest and build a database from there. The data set of your choice needs to meet three requirements.

The data set is reasonably large, in the order of a few thousands of records or more.

The data set would naturally contain (or be split into) multiple tables.

The data set would contain reasonable structure for a relational database.

Please discuss it with your instructor if you have a data set in mind, but not sure if it is appropriate for a database application.

Students are encouraged to work in teams of two to three. If you prefer to work alone, you should discuss with the instructor.

2.1 Part zero submissions

Submit a proposal no more two to three pages that describe which data set you want to use and what you plan to do. In general, you should follow what is proposed in this project, that is, you will need to create the database from an existing set of files (be it spreadsheet or plain text file or others); specify proper constraints for the database tables including primary keys, foreign keys, and other data integrity constraints and checks; propose and demonstrate reasonable set of queries; and provide a web interface for your database.

In addition, you should pick up a title for your database and specify the team members for the project.

3. Part One: Creating the Database from CSV Files

The students are given the five CSV files as initial data.

movies.csv contains a movie ID, the title, the year of release, and a set of genres;

ratings_with_id.csv contains the ID of the rating itself and the ID of the user who rated the movie, the movie ID, the rating, and SQLite time stamp when the rating was recorded;

tags_with_id.csv contains the ID of the tag, ID of the user who rated the movie, ID of the movie that is rated, the string tag entered by the user, and the time stamp when the rating was recorded;

links.csv contains the movie ID used in this database, the IMDB (Internet Movie DataBase) ID, and TMDB (The Movie DataBase) ID, the two most popular movie databases;

users.csv contains user ID, the user name, and the state where the user resides.

The first four files are from the original data sets. The users.csv was created with random information so that one might query information with a person's name, rather than ID. The original data removed the user name to protect privacy. We put some random names back into the database to make the information a bit more human readable.

You are to complete the following tasks as Part I of the project.
Design the schema for all five tables following the general design guidelines we discussed in class.

Bulk load the CSV files into the database as file tables. We did an exercise on this subject. Refer to https://cs.stanford.edu/people/widom/cs145/sqlite/SQLiteLoad.html for further information. Make sure to specify appropriate primary keys, foreign keys, and constrains. Note also that in a typical bulk loading from CSV files, the attribute names would contain double quotes, such as "Name". This works fine as far as the database is concerned. But it is a bit awkward for human users to read and to query. It is best to rename all these attributes by removing the double quotes in the names, though this is not required.

Once the tables are loaded, write the queries that can answer the following questions. Make sure your database returns proper information. Note that use SQLite's meta command print to print some meaningful headings in the results to help the user understand your result. See the last sample query which asks for the movie titles for which the rating tags contains the word of "boring", "bored", or "bore".
Find the number of users. It should be 699.

Find the number of users from PA. It should be 11.

Find the number of movies. It should be 9125.

Find the number of action movies. It should be 1545.

Find the names of the movies that received rating of greater than or equal to 4.5. It should be 4,035 distinct titles.

Find the number of drama that received rating of equal to 5.0. It should be 1,637 distinct titles.

Find the number of users who gave the movie "Titanic (1997)" rating of 5.0. It should be 26.

Find the user ID, user name, and the frequency who rated most movies. It should be HERBERT|547|2391. That is, the user name is Herbert whose ID is 547, and who provided 2,391 ratings to this movie database.

Find the average rating for movie "Titanic (1997)", and its count (how many ratings from which the average is based.) It should be Titanic (1997)|3.33|164.
Find the names of the movies that received 'boring', 'bore', or 'bored' in the rating tags. The output should look something like the following.
title                                                       
------------------------------------------------------------
While You Were Sleeping (1995)                              
2001: A Space Odyssey (1968)                                
Howards End (1992)                                          
Legend of Bagger Vance, The (2000)                          
Adaptation (2002)                                           
Lord of the Rings: The Two Towers, The (2002)               
Lord of the Rings: The Return of the King, The (2003)       
3.1 Part 1 submissions

Submit the following files through your git repo.

The SQLite scripts that create and revise the database from the collection of CSV files.

All the source CSV files.

The database file itself.

A SQLite script file that contains all queries to accomplish the tasks list above.

A brief README text file of a few paragraphs describing what you accomplished, any challenges, or any comments regarding this part of the project.

4. Part Two: Creating Triggers and Constraints

Students are asked to develop constraints and triggers when there is a violation of the constraints. But first students are asked to add referential constraints to the existing database tables. (Essentially we need to drop the old tables and reload the data based on the referential constraints.)

NOTE: Once we add 'check' to the constraints, the table headings are no longer allowed. For example, the first row of the table ratings_with_id.csv is as follows.
ratingId, userId, movieId, rating, timestamp
when checking of (rating ≥ 0.0 and rating ≤ 5.0) is enforced, the heading row (the first row) won't pass because they are not numerical values in the desired ranges!

In order to implement the trigger/constraints that the timestamp of ratings and tags has to be later than that of movie's release time, we have to had a column in the Movies table named timestamp that corresponds to the year that the movie was released. To support this feature, we are going to ask students to do the following with the given data.

Use Microsoft Excel or other spreadsheet software to extract the year of the movie from the title. A typical movie title is in the form of 'Toy Story (1995)'. In Excel, one can use cell function left(right(5), 4) to extract '1995' out of the title [right(5) results in '1995)', then left(4) results in '1995'].

Use SQL command to add a column named 'timestamp' in the Movie table

Use SQL command to update the values in this column such that the value of the timestamp for each movie is the same as 'year-01-01', that is, January 1st of the year when the movie was released. The reason to take this format is that the built-in function strftime() in SQLite requires such a date format. For example, strftime('%s', '2017-07-08') returns the number of seconds since Unix epoch (January 1st, 1970). Thus the timestamp value of the movie now is compatible with that in rating and tag and they can be compared.

When completed, the following referential constraints should be met. That is, the primary keys, any foreign keys and their type, as well as attributes and their type should meet the listed conditions.
The user table
userId (int, primary key), userName (text), state (char(2))
The tags table
The ratings table
ratingId (int, primary key), userId (int) references User(userId), movieId (int) references Movie(movieId), rating (real), timestamp (int)
The movies table
movieId (int primary key), title (text), year (int), genres (text)
The links table
movieId (int primary key) references Movie(movieId), imdbId (int), tmdbId (int)
The following constraints should be checked.

Constraints for users

No two users can share the same userId.

All tags and ratings must come from a valid user (users must already exist).

Constraints for ratings

No two ratings can share the same ratingId.

Ratings must be between 0 and 5 inclusive.

Ratings must be given by a valid user.

Ratings must be given to a valid movie.

Ratings must be dated after the movie is released.

Constraints for tags

No two tags can share the same tagId.

Tags must be given by a valid user.

Tags must be given to a valid movie.

Tags must be dated after the movie is released.

Constraints for movies

No two movies can share the same movieId.

Each movie must have a release date (year, assuming 1/1/year).

4.1 Part 2 Submissions

Submit the following files through your git repo.

All SQLite script files that creates the constraints and triggers;

The database file itself;

Sample runs that demonstrate your triggers work in the way they are designed; (You can use Linux script (1) command, or any other mechanism to capture the screen output.)

A brief README text file of a few paragraphs describing what you accomplished, any challenges, or any comments regarding this part of the project.

5. Part Three: Creating A Web Interface

In this part of the project, you are to write programs (using any "language") to enable operations from the web. You can use Python libraries such as webpy or http.client, or any library from other programming languages to complete the tasks.

Required functionality through a web interface includes the following.

Be able to support queries of the type:

Search for movie title, including wild card, e.g., "Toy Story" or "Toy*";

Search for movie title along with its average rating, highest rating, lowest rating, number of ratings;

Search by genres of a particular movie, e.g., genres of "Toy Story";

Search for movies that fall into a genre category, along with its average rating, e.g., list all movies in "comedy" with their average rating, print the first 30 of them;

Count the number of users, movies, ratings, or tags;

List or count a particular category of users, e.g., users from the state of PA;

Be able to support viewing of all information pertaining to a particular movie, title, genres, date, all ratings, and all tags;

Be able to support adding new ratings or new tags to a particular movie;

Be able to reject queries that violate any constraints and send proper error message over the web;

Be able to protect the database by only accepting a subset of pre-defined queries.

5.1 Part 3 Submissions

Make a zip file of everything you have and submit it through your git repo. The key is that when downloaded from git, the files should be self-contained that they can be unzipped and run directly without requiring other files.

6. References

Five-Thirty-Eight data collection https://github.com/fivethirtyeight/data. Accessed Feb 20, 2018.

Data.gov: U.S. government data collections https://catalog.data.gov/dataset?res_format=CSV. Accessed Feb 20, 2018.

Movielens data sets https://grouplens.org/datasets/movielens/. Accessed Feb 20, 2018.

Public data repository from Stanford's Department of Statistics http://statweb.stanford.edu/~sabatti/data.html. Accessed Feb 20, 2018.

The Home of Data Science & Machine Learning https://www.kaggle.com/datasets. Accessed Mar 12, 2018.

Project Part	Due
Proposal	Fri 02/23
Part 1	Wed 03/07
Part 2	Wed 03/21
Part 3	Wed 04/04

Last modified: Wed Feb 13 13:20:55 EST 2018