Web Search - An Application of Information Retrieval Theory

CSCI379.01 Term Project

Fall 2003

First Phase Assigned: Monday, September 1, 2003
First Phase Due: Wednesday, September 17, 2003

Introduction

The goal of the project is to produce a limited-scale but functional search engine. Like a commercial search engine, it should return a list of relevant documents in response to a query. It is limited in scale in that it only needs to collect a modest number of documents (e.g., on the order of a few hundred to a few thousand). The more documents your search engine can collect, the better.

This is a multi-phase team project that starts at the beginning of the semester and runs through the end of it. The detailed scope of the project, the team organization, technical information, and other details will be given as the semester progresses. This handout gives an overview and the first part of the project.

Components of a Search Engine

A search engine consists of a collection of software components that work together to collect and analyze a large number of documents from the Internet and to give the user a list of relevant documents and URLs when a query is issued.

Major components of a typical search engine include the following:

Figure 1 indicates the relation among different components in a typical search engine.

Figure 1: Components of a Typical Search Engine
[Figure: arch.eps]

A search engine consists of two major parts, somewhat independent of each other, as the figure shows. The part on the left side of the document collection answers users' queries. The part on the right side of the document collection gathers information from the Web so that URLs related to user queries can be retrieved. A crawler traverses the Web, reading Web pages and extracting information about each page it reads. This information is then sent to an indexer, which creates the links between keywords and the documents that contain them. The result is typically saved in a file or a collection of files. When a user issues a query, the document list is searched and a collection of relevant documents is generated. The ranker is responsible for ranking these documents according to certain algorithms and measures, and the top-ranked documents are returned to the user for review. At this point the user may review the documents and send feedback to the search engine. The ranker may take this feedback into account and re-select or re-rank the documents for the user to view.
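The indexer and ranker roles described above can be illustrated with a very small in-memory sketch. This is only an illustration under stated assumptions, not the design required for the project: the class name TinyIndex, its method names, and the scoring rule (total number of query-term occurrences in a document) are all hypothetical, and a real engine would store the index in files and use a better ranking measure.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; names and scoring rule are assumptions.
final class TinyIndex {
    // keyword -> (document URL -> number of occurrences in that document)
    private final Map<String, Map<String, Integer>> postings = new HashMap<>();

    // Indexer step: link every keyword in a crawled page's text to its URL.
    void addDocument(String url, String text) {
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token if text starts with a delimiter
            }
            postings.computeIfAbsent(token, k -> new HashMap<>())
                    .merge(url, 1, Integer::sum);
        }
    }

    // Query and ranker steps: collect documents containing any query term,
    // score each by the total occurrences of query terms, highest score first.
    List<String> search(String query) {
        Map<String, Integer> scores = new TreeMap<>(); // sorted by URL
        for (String term : query.toLowerCase().split("[^a-z0-9]+")) {
            if (term.isEmpty()) {
                continue;
            }
            Map<String, Integer> docs =
                    postings.getOrDefault(term, Collections.<String, Integer>emptyMap());
            for (Map.Entry<String, Integer> e : docs.entrySet()) {
                scores.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        List<String> ranked = new ArrayList<>(scores.keySet());
        // List.sort is stable, so documents with equal scores stay in URL order.
        ranked.sort((a, b) -> Integer.compare(scores.get(b), scores.get(a)));
        return ranked;
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.addDocument("http://a.example/", "Web search and web crawling");
        index.addDocument("http://b.example/", "Search engines rank documents");
        // prints [http://a.example/, http://b.example/]
        System.out.println(index.search("web search"));
    }
}
```

In this sketch the crawler is represented only by the calls to addDocument; in the project, those calls would be fed by pages actually fetched from the Web.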

Project Team

The project will be carried out in teams. The details of the teamwork are given in a separate handout.

Your Work in Phase One and Some Technical Details

Your phase one work is to implement a basic version of the interface and the back-end engine. This is just a framework. As the project progresses, some of these components will need to be enhanced. The interface part is responsible for the following main tasks.

The back-end engine takes care of the operations of files and network. It does the following.

There are some code examples for this part of the project in the directory
<http://www.eg.bucknell.edu/~xmeng/Course/CS379/code/javaServer/>. Students who use C++ may find a similar example in C in the directory
<http://www.eg.bucknell.edu/~xmeng/Course/CS379/code/cServer/>. You should try out these examples and observe their behavior before you design and develop your own program. Do the following.

Let's discuss some technical details as we walk through an example. Assume the search engine is running on host polaris at port 9999, and we have entered the following URL in our browser:

http://polaris.eg.bucknell.edu:9999/search
An HTML form is displayed in the browser as a result. Suppose we type "123" in the first input box and "abc" in the second input box, and then click the "Submit" button. Let's now examine what happens between the browser and the server.
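One plausible shape of that exchange can be sketched in Java. The sketch rests on assumptions that this handout does not confirm: that the form is submitted with the GET method, so the values arrive in the URL's query string, and that the two input boxes are named first and second (hypothetical names). Under those assumptions the browser's request would begin with a line such as "GET /search?first=123&second=abc HTTP/1.1". The class RequestLineParser and its methods are likewise hypothetical helpers, not part of the provided examples.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; assumes a GET form submission and a well-formed request line.
final class RequestLineParser {

    // Extracts the request target, e.g. "/search?first=123&second=abc".
    private static String target(String requestLine) {
        String[] parts = requestLine.trim().split(" ");
        if (parts.length < 2) {
            throw new IllegalArgumentException("Malformed request line: " + requestLine);
        }
        return parts[1];
    }

    // Returns the path without the query string, e.g. "/search".
    static String path(String requestLine) {
        String t = target(requestLine);
        int q = t.indexOf('?');
        return q < 0 ? t : t.substring(0, q);
    }

    // Returns the decoded name/value pairs of the query string, in order.
    static Map<String, String> queryParams(String requestLine) {
        Map<String, String> params = new LinkedHashMap<>();
        String t = target(requestLine);
        int q = t.indexOf('?');
        if (q < 0) {
            return params; // no query string, e.g. the first visit to /search
        }
        for (String pair : t.substring(q + 1).split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(decode(name), decode(value));
        }
        return params;
    }

    // Form encoding uses '+' for spaces and %XX escapes; URLDecoder handles both.
    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Builds a minimal HTTP response carrying an HTML body.
    static String htmlResponse(String body) {
        int length = body.getBytes(StandardCharsets.UTF_8).length;
        return "HTTP/1.0 200 OK\r\n"
                + "Content-Type: text/html; charset=UTF-8\r\n"
                + "Content-Length: " + length + "\r\n"
                + "\r\n"
                + body;
    }
}
```

A server would read the first line of the request from the socket's input stream, pass it to queryParams, and write the string returned by htmlResponse back to the socket. If the form used POST instead, the same name/value pairs would arrive in the request body rather than in the request line.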

What to Hand In

Your team needs to hand in the following in the order given. Please staple all pages.

  1. A team report for phase one of the project with a cover sheet. The report should be brief, about two to three pages, and should include at least the following:
    1. The name of your team (should be the same as the name of your search engine);
    2. A description of the role and contributions of each team member;
    3. A summary of the working process, e.g. what the team started with, what the team has accomplished, any problems encountered and how the team solved them, and any thoughts on the project;
    4. Team meeting minutes during this phase of the work.

  2. Source code for the programs and HTML pages.

  3. Snapshots of sample runs from a Web browser.

  4. Email the instructor a copy of the complete source code in zip or tar format.


Meng Xiannong 2003-08-31