Search Engine Project Part 1: Building a Web Server

Summer 2014


Introduction

The goal of the project is to build a small-scale but functional search engine. Given a query, the search engine should return a list of relevant documents, just as a commercial search engine would. It is limited in scale in that it needs to collect only a limited number of documents (e.g., on the order of a few hundred to a few thousand). The more documents your search engine can collect, the better.

This is a multi-phase team project. It starts at the beginning of the course and lasts through the end of the course. The detailed scope of the project, the team organization, technical information, and other details will be given as the course progresses. An overview and the first part of the project are given here.

Components of a Search Engine

A search engine consists of a collection of software components that work together to collect and analyze a large number of documents from the web and to give the user a list of relevant documents and URLs when a query is issued.

While the details vary, major components of a minimal search engine include the following.

Search Engine Architecture

Figure 1: Architecture of a Simple Search Engine

Figure 1 illustrates the relations among the components of a typical search engine. As the figure shows, these components can be roughly divided into two major parts that are somewhat independent of each other. One part, on the left side of the document collection, answers users' queries. The other part, on the right side of the document collection, collects information from the web so that the URLs related to user queries can be retrieved. A crawler goes around the web to read web pages and to extract information from each page it reads. The information is then sent to an indexer. The indexer builds an index from the collected information, creating links between keywords and the documents that contain those keywords. The result is typically saved into a file or a collection of files. When a user issues a query, the document list is searched and a collection of relevant documents is generated. The ranker is responsible for ranking these documents according to certain algorithms and measures. The top-ranked documents are returned to the user for review. At this point the user may review the documents and send feedback to the search engine. The ranker may take this feedback into account and re-select or re-rank the documents for the user to view.
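
To make the indexer's role concrete, here is a minimal sketch in Python of a keyword-to-document index (often called an inverted index). The function names build_index and lookup and the two sample documents are assumptions made for illustration; they are not part of the provided sample code. A real indexer would also need to handle punctuation, stop words, and ranking.

```python
from collections import defaultdict

def build_index(documents):
    """Map each keyword to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in text.lower().split():
            index[word].add(name)
    return index

def lookup(index, query):
    """Return the names of documents containing every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    matches = set(index.get(words[0], set()))
    for word in words[1:]:
        matches &= index.get(word, set())
    return sorted(matches)

# Hypothetical sample documents, for illustration only.
docs = {
    "a.html": "web server basics",
    "b.html": "search engine basics",
}
index = build_index(docs)
```

With these sample documents, lookup(index, "web basics") keeps only the documents that contain both words.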

Your Work in Phase One and Some Technical Details

Your phase one work is to implement a basic version of the interface and a web server program that can retrieve web pages from the local file system for the user. At this point, only a basic framework is needed. As the project progresses, some of these components will be further enhanced. The interface part is responsible for the following main tasks.

The web server program takes care of the operations of files and network. It does the following.

There are some Java code examples in the directory for this part of the project. Students who use C++ may find a similar example in C in the directory. You may certainly implement the project in another programming language. Please make sure to let me know if you use anything other than Java, C/C++, or Python. Before you design and develop your own program, please do the following.

Let's discuss some technical details as we walk through an example. Assume the search engine is running at host localhost at port 9999. We have issued the URL in our browser as
http://localhost:9999/search

An HTML form is displayed in the browser as the result. We typed "123" in the first input box and "abc" in the second input box, and then we clicked on the "Submit" button. Let's now examine what happens between the browser and the server.

When a web browser contacts a web server, it sends, among other things, a command of the following form to the server

GET /path HTTP/1.0\r\n\r\n

This means the browser is requesting (GET) the page specified by /path, and the protocol that the browser is using is HTTP 1.0. The command must end with two consecutive line terminators, each of which consists of a carriage-return character (\r) followed by a new-line character (\n).
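
As a rough sketch of how a server might take such a command apart, the Python function below splits a request line such as "GET /search HTTP/1.0" into its three fields. The name parse_request_line is an assumption for illustration and does not come from the provided sample code.

```python
def parse_request_line(line):
    """Split an HTTP request line into (method, path, version)."""
    parts = line.strip().split()
    if len(parts) != 3:
        raise ValueError("malformed request line: %r" % line)
    method, path, version = parts
    return method, path, version

method, path, version = parse_request_line("GET /search HTTP/1.0\r\n")
```

Raising an error on a malformed line lets the caller decide how to respond, for example by sending an error page back to the browser.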

In our example, the command sent to the server from the browser is

GET /search HTTP/1.0\r\n\r\n

This caused the form to be displayed in the browser.

The server has to parse this command to understand what the browser wants to do. A GET command indicates that the browser is requesting a web page. A POST command indicates that the browser is sending some information to the server, for example, a search query.

The browser also sends the length of the input to the server when it posts a form. This length tells the server how many bytes the browser is sending as content. The server then expects to read exactly the number of bytes specified by the content length.
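
The Python sketch below shows one way a server might read the header lines and then exactly as many content bytes as the content length announces. The function name read_request and the short sample request (path /form, body q=abc) are assumptions for illustration. The example reads from an in-memory stream; with a real connection, the same function could be given the file object returned by the socket's makefile("rb") method.

```python
import io

def read_request(stream):
    """Read the request line, headers, and body from a binary stream."""
    request_line = stream.readline().decode("iso-8859-1").strip()
    headers = {}
    while True:
        line = stream.readline().decode("iso-8859-1")
        if line in ("\r\n", "\n", ""):  # a blank line ends the header part
            break
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    # Read only as many body bytes as the browser announced.
    length = int(headers.get("content-length", "0"))
    body = stream.read(length) if length > 0 else b""
    return request_line, headers, body

# Hypothetical request, for illustration only.
raw = (b"POST /form HTTP/1.0\r\n"
       b"Content-type: application/x-www-form-urlencoded\r\n"
       b"Content-length: 5\r\n"
       b"\r\n"
       b"q=abc")
request_line, headers, body = read_request(io.BytesIO(raw))
```

Header names are stored in lower case here because browsers differ in how they capitalize them.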

In our example, when we fill in the form and click the "Submit" button, the browser sends a POST request to the search engine. The content of the form is sent to the server as the actual content after the header information.

The following is a sample string the browser sends to the server when it is requesting a regular web page. When you run the sample program, you will see a message similar to this echoed on your server screen.

GET / HTTP/1.0
Connection: Keep-Alive
User-Agent: Mozilla/4.78 [en] (X11; U; SunOS 5.8 sun4u)
Host: polaris:9999
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*
Accept-Encoding: gzip
Accept-Language: en
Accept-Charset: iso-8859-1,*,utf-8

The following is a sample string the browser sends to the server when it is posting a form to the server. When you run the sample program, you will see this echoed on your server screen.

POST /form HTTP/1.0
Referer: http://polaris:9999/search
Connection: Keep-Alive
User-Agent: Mozilla/4.78 [en] (X11; U; SunOS 5.8 sun4u)
Host: polaris:9999
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*
Accept-Encoding: gzip
Accept-Language: en
Accept-Charset: iso-8859-1,*,utf-8
Content-type: application/x-www-form-urlencoded
Content-length: 44

Note that the last line indicates that the content length is 44 characters. In this case, after the header part (shown above), the browser sends the form data to the server in the following format.

FirstInput=123&SecondInput=abc&Submit=Submit

which is exactly 44 characters. This string represents the form that is sent from the browser. The content length (44) and the content string depend on your input to the form. In our example, we typed "123" in the first input box and "abc" in the second input box. The browser forms the content string as a sequence of name/value pairs. The name and value in a pair are separated by an equal sign "=" and the pairs are separated by an ampersand "&". In our example, FirstInput, SecondInput, and Submit are the names of the form entries (read form.html to see where they are specified) and 123, abc, and Submit are the values corresponding to these names.
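
As an illustration, the content string from this example can be split into its name/value pairs with the standard Python function urllib.parse.parse_qs. This is only a sketch of one possible approach; the variable names are assumptions, and you may equally split the string yourself on "&" and "=".

```python
from urllib.parse import parse_qs

body = "FirstInput=123&SecondInput=abc&Submit=Submit"

# parse_qs returns a list of values per name; keep the first value of each.
fields = {name: values[0] for name, values in parse_qs(body).items()}

first_input = fields["FirstInput"]
second_input = fields["SecondInput"]
```

Note that parse_qs also decodes characters the browser has escaped (for example "+" for a space) and, by default, drops fields whose value is empty; passing keep_blank_values=True keeps them.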

Once the server parses the input form string, it captures all the input values. The server can then act accordingly. If this is a search engine, the server can pass this value as the query string to the ranker/retriever component to retrieve relevant documents. In this phase of the project, you may just echo the query term to indicate that the server understands the query. The actual retrieval operation will be implemented in a later phase.
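
A minimal sketch of such an echo in Python is shown below: it builds a complete HTTP/1.0 response whose page repeats the query term. The function name build_echo_response and the wording of the page are assumptions for illustration, not requirements of the project.

```python
import html

def build_echo_response(query):
    """Return a complete HTTP/1.0 response (as bytes) that echoes the query."""
    page = ("<html><body><p>You searched for: %s</p></body></html>"
            % html.escape(query))
    body = page.encode("utf-8")
    header = ("HTTP/1.0 200 OK\r\n"
              "Content-Type: text/html; charset=utf-8\r\n"
              "Content-Length: %d\r\n"
              "\r\n" % len(body))
    return header.encode("ascii") + body

response = build_echo_response("abc")
```

The query is passed through html.escape so that characters such as "<" typed into the form are displayed as text rather than interpreted as HTML.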

What to Hand In

Your team needs to hand in the following in the order given.

  1. A team report for phase one of the project, with a cover page. Name this file phase1-report-teamname.docx, where teamname is your team name. If your file is a PDF, use the .pdf extension. The report shouldn't be too long, perhaps two to three pages. The report should include the following as a minimum.
  2. Source code for the programs and HTML pages.
  3. Snapshots of sample runs from a Web browser.
  4. Follow the instructions in submissions.html to submit your work.
