Web Search - An Application of Information Retrieval Theory

CSCI379.01 Term Project

Fall 2003

Fourth Phase Assigned: Wednesday October 29, 2003
Fourth Phase Due: Friday November 14, 2003


Overview Of This Phase Of The Project

You are to implement a crawler (robot) in this phase of the project to collect information from the Web. The crawler starts with a given set of one or more URLs. It retrieves the pages specified by this starting set, parses them, and extracts new URLs from them; it then visits the pages named by these new URLs. This process continues until the time allowed has expired, the number of retrieved pages has reached its limit, or there are no new pages to visit.

The retrieved Web pages are passed to the Indexer you built in the second phase of the project for processing, where an inverted index is built over all the retrieved pages, ready for an end user to search. The inverted indexing system itself was built in the third phase of the project.
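One simple way to hand pages off to the Indexer is to store each downloaded page in a disk file. Below is a minimal sketch of that idea; the file layout (numbered .html files plus an index.txt mapping file numbers back to URLs) is only an assumption, so adapt it to whatever format your own Indexer expects.

```python
import os

def save_page(url, html, out_dir="pages", index_file="pages/index.txt"):
    """Store a downloaded page so the Indexer can process it later.

    Each page goes into a numbered .html file; index.txt maps file
    numbers back to their source URLs. (This layout is an assumption,
    not part of the assignment -- match your Indexer's input format.)
    """
    os.makedirs(out_dir, exist_ok=True)
    # Count the pages already stored to pick the next file number.
    n = sum(1 for f in os.listdir(out_dir) if f.endswith(".html"))
    path = os.path.join(out_dir, "%05d.html" % n)
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    with open(index_file, "a", encoding="utf-8") as f:
        f.write("%05d\t%s\n" % (n, url))
    return path
```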

General Algorithm

Traversing the Web is very similar to traversing a general graph: each Web page can be considered a node, and each hyperlink a link (edge) between nodes. From this point of view, crawling the Web is not very different from the graph-traversal algorithms you learned in a typical data structures course. The following is the general Web-traversal algorithm we discussed in lecture.
Initialize queue (Q) with the initial set of known URLs
Until Q is empty or the page or time limit is reached do
{
  Pop URL, L, from the front of Q
  If L does not point to an HTML page (.gif, .jpeg, .ps, .pdf, .ppt, ...)
     continue loop
  If L has already been visited, continue loop
  If the page can't be downloaded (404 error, or robot exclusion)
     continue loop
  Download the page, P, for L
  Index P (or store it to a disk file)
  Parse P to obtain a list of new URLs (this can also be done by the Indexer)
  Append the new URLs to the end of Q
}
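The loop above can be sketched in Python using only the standard library. This is an illustrative skeleton, not a complete solution: the names (crawl, LinkParser, max_pages) are assumptions, the pages dictionary stands in for your Indexer, and a real crawler would add the robustness discussed below.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen
from urllib.error import URLError

# Extensions the pseudocode tells us to skip (non-HTML content).
SKIP_EXTENSIONS = (".gif", ".jpeg", ".jpg", ".ps", ".pdf", ".ppt")

class LinkParser(HTMLParser):
    """Collect the href targets of <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=100):
    queue = deque(seed_urls)        # Q, initialized with known URLs
    visited = set()
    pages = {}                      # URL -> page text; stands in for the Indexer
    while queue and len(pages) < max_pages:
        url = queue.popleft()       # pop URL L from the front of Q
        if url.lower().endswith(SKIP_EXTENSIONS):
            continue                # L does not point to an HTML page
        if url in visited:
            continue                # already visited L
        visited.add(url)
        try:
            with urlopen(url, timeout=10) as resp:
                page = resp.read().decode("utf-8", errors="replace")
        except (URLError, ValueError):
            continue                # page can't be downloaded (404, timeout, ...)
        pages[url] = page           # index P (or store it to a disk file)
        parser = LinkParser()
        parser.feed(page)           # parse P to obtain new URLs
        for link in parser.links:
            absolute, _ = urldefrag(urljoin(url, link))
            queue.append(absolute)  # append new URLs to the end of Q
    return pages
```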

Some Issues To Be Considered

Because of the vast size of the Web, there are technical and engineering issues that must be addressed to build a successful, less-intrusive crawler. Among the issues to consider are those already visible in the algorithm above: detecting pages that have already been visited, filtering out non-HTML content, handling pages that cannot be downloaded, honoring robot exclusion, and enforcing the page and time limits.

What to Hand In

Your team needs to hand in the following in the order given. Please staple all pages.

  1. A team report for phase four of the project with a cover sheet. The report should include the following at a minimum.
    1. A description of the main contribution of each team member.
    2. A summary of the working process, e.g. what the team started with, what the team has accomplished, any problems encountered and how the team solved them, and any other thoughts on the project.

  2. Source code for the programs (for this phase of the project only).

  3. Snapshots of sample runs.

  4. Email the instructor a copy of the complete source code (all phases up to now) in zip or tar format.

Meng Xiannong 2003-10-28