Search Engine Project Part 4: Putting It All Together

Summer 2014

Project
Part 1
Part 2
Part 3
Part 4
Part 5

Introduction

In the previous phases of the project, you built a web server that can interact with any browser, you created an indexer that can parse and index a collection of text documents, and you developed a crawler that can collect live pages from the web. In this phase of the project, you are asked to put the three previous phases together to make a working search engine. The expected product of this phase of the project is a system that can do the following.

These components collectively make a working search engine. The crawler collects web pages. These pages are sent to the indexer for parsing and indexing. The indexer builds an inverted index that links each individual term in the document collection to all the documents that contain the term. The indexer also writes the result to a disk file so the index can be reused without crawling again. On the other side, the user interface component takes the user query and sends it to the ranking/retrieving system for processing. The ranking component takes the query, searches the inverted index for the relevant documents, and returns the list of relevant URLs to the user.
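The indexing and retrieval flow described above can be sketched in Java roughly as follows. This is only a minimal illustration, not the required design: the class name InvertedIndexSketch, its method names, the tokenization rule (lowercase, split on non-alphanumeric characters), and the one-line-per-term file layout are all assumptions made for this example. Your own indexer from the earlier phase may differ on every one of these points.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical sketch of an inverted index: term -> URLs of pages containing it. */
public class InvertedIndexSketch {
    // Posting lists keep URLs in the order the pages were indexed.
    private final Map<String, Set<String>> postings = new TreeMap<>();

    /** Index one crawled page under every term that appears in its text. */
    public void addDocument(String url, String text) {
        // Assumed tokenization: lowercase, split on anything that is not a-z or 0-9.
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new LinkedHashSet<>()).add(url);
            }
        }
    }

    /** Return the (unranked) posting list for a single-term query. */
    public List<String> lookup(String term) {
        Set<String> urls = postings.get(term.toLowerCase(Locale.ROOT));
        return urls == null ? new ArrayList<String>() : new ArrayList<>(urls);
    }

    /** Serialize as one line per term: the term, a tab, then space-separated URLs. */
    public List<String> toLines() {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : postings.entrySet()) {
            lines.add(e.getKey() + "\t" + String.join(" ", e.getValue()));
        }
        return lines;
    }

    /** Rebuild an index from lines produced by toLines(). */
    public static InvertedIndexSketch fromLines(List<String> lines) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        for (String line : lines) {
            int tab = line.indexOf('\t');
            if (tab < 0) continue; // skip malformed lines
            Set<String> urls = new LinkedHashSet<>();
            for (String url : line.substring(tab + 1).split(" ")) {
                if (!url.isEmpty()) urls.add(url);
            }
            index.postings.put(line.substring(0, tab), urls);
        }
        return index;
    }

    /** Write the index to disk so it can be reused without crawling again. */
    public void save(Path file) throws IOException {
        Files.write(file, toLines(), StandardCharsets.UTF_8);
    }

    /** Load an index previously written by save(). */
    public static InvertedIndexSketch load(Path file) throws IOException {
        return fromLines(Files.readAllLines(file, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        InvertedIndexSketch index = new InvertedIndexSketch();
        // In the real system these pages would come from the crawler.
        index.addDocument("http://www.cs.cmu.edu/", "School of Computer Science");
        index.addDocument("http://www.csail.mit.edu/", "Computer science and AI lab");

        Path file = Files.createTempFile("index", ".txt");
        index.save(file);

        // Later runs can answer queries from the saved file alone.
        InvertedIndexSketch reloaded = InvertedIndexSketch.load(file);
        System.out.println(reloaded.lookup("computer"));
    }
}
```

The posting lists here are returned in indexing order, which matches the unranked behavior expected in this phase. The space-separated file layout assumes URLs contain no literal spaces.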

The first three phases of the project accomplish all of the above, except that the returned list is not ranked in any particular order. Because of limited time, the computation of ranking is left as the last phase of the project.

Your Tasks

As described in the Introduction section, your task in this phase is to combine the previous three phases into a working search engine. One necessary piece of functionality that we didn't emphasize in the first three phases of the project is formatting the output nicely for display in the browser. When a list of URLs is returned from the ranker to the browser, you need to add HTML tags for each URL in the list. The following example illustrates this idea.

Assume the search query term "computer" produced the following URLs from your ranker, in the given order (e.g., the order on the posting list).

Figure 1: Raw List of URLs Generated by Ranker

Taking this list, your ranker or the user interface component of the program is responsible for adding the necessary HTML tags so the list will be formatted properly for display in the browser. (Use the Java class Html.java as your code base, if you'd like.) Here is how the list may look with the added HTML tags.

Figure 2: List of URLs with Added HTML Tags

With the HTML tags in place, here is how the returned page containing the list of URLs may look.

  1. http://cose.seu.edu.cn/
  2. http://www.cs.cmu.edu/
  3. http://www.csail.mit.edu/
  4. http://www-cs.stanford.edu/faculty
  5. http://www.bucknell.edu/ComputerScience.xml

Figure 3: Formatted Output Displayed on Screen
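One way the formatting step could be written is sketched below. It is an illustration only and is not the contents of Html.java: the class name UrlListFormatter, the methods escapeHtml and toHtml, and the exact page layout (a title plus an ordered list of links) are assumptions made for this example. The sketch escapes HTML special characters so that a URL or query containing characters such as & or < cannot break the page.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: wrap a raw list of URLs in HTML for display in a browser. */
public class UrlListFormatter {

    /** Replace characters that have special meaning in HTML text and attributes. */
    public static String escapeHtml(String s) {
        return s.replace("&", "&amp;")   // must be replaced first
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Build a complete HTML page with one numbered, clickable entry per URL. */
    public static String toHtml(String query, List<String> urls) {
        StringBuilder sb = new StringBuilder();
        sb.append("<html>\n<head><title>Results for ")
          .append(escapeHtml(query))
          .append("</title></head>\n<body>\n<ol>\n");
        for (String url : urls) {
            String safe = escapeHtml(url);
            // The URL is used both as the link target and as the visible text.
            sb.append("  <li><a href=\"").append(safe).append("\">")
              .append(safe).append("</a></li>\n");
        }
        sb.append("</ol>\n</body>\n</html>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // The URLs shown in Figure 3, in the order returned by the ranker.
        List<String> urls = Arrays.asList(
                "http://cose.seu.edu.cn/",
                "http://www.cs.cmu.edu/",
                "http://www.csail.mit.edu/",
                "http://www-cs.stanford.edu/faculty",
                "http://www.bucknell.edu/ComputerScience.xml");
        System.out.print(toHtml("computer", urls));
    }
}
```

An ordered list is used here because the browser then numbers the entries itself, as in Figure 3. Your web server still has to send the resulting string as the body of its HTTP response.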

What to Hand In

Your team needs to hand in the following in the order given.

  1. A team report for phase four of the project with a cover page. The report shouldn't be too long, maybe two to three pages. The report should include the following at a minimum.
  2. Source code for the programs and any other supporting documents. (This should be a complete set of a working search engine, without ranking.)
  3. Snapshots of sample runs. Use any copy-and-paste features to capture screen shots.
  4. Email the instructor a copy of the complete source code and sample runs in zip or tar format.
