Project 1: Search Engine - A Client Server Program Pair

In this project, students working in teams will create a simple but functional search engine. The search engine collects and organizes information about online course materials in the computer science department and provides users with a search interface similar to that of most search engines, e.g., www.google.com. A user should be able to enter a key phrase such as assembly or architecture, and your search engine should return a list of resources that contain the key word.

The software will consist of a few components. The first is a crawler that visits and collects information from course websites that are accessible, for example, http://www.eg.bucknell.edu/~cs363/ and http://www.eg.bucknell.edu/~csci203. Your crawler will read these web pages and break them into a list of words and the URLs that appear on the pages. The second component is a web server that will act as the basis of the search engine: when a user issues a query, the search engine responds with the data the server has. The third component is an inverted index, which is built from the retrieved web pages. Each word in this index points to a list of web pages that contain that word. When a query is given, the search engine looks through its inverted index and returns the list of web pages that contain the key word. You can use a very simple mechanism to sort the returned list. Optionally, you can use a slightly more sophisticated ranking algorithm to rank the returned list. See Extra Credit for details.
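Purely as a sketch of the idea (none of these names or sizes are part of the required interface), an inverted index can be represented in C as a table mapping each word to the list of page URLs that contain it:

    /* Sketch of an inverted index: each word maps to the list of page URLs
     * that contain it.  Names and sizes are illustrative only. */
    #include <string.h>

    #define MAX_WORDS 1000
    #define MAX_URLS    64

    struct index_entry {
        const char *word;              /* the indexed word                 */
        const char *urls[MAX_URLS];    /* pages that contain this word     */
        int         url_count;         /* number of URLs stored so far     */
    };

    static struct index_entry inverted_index[MAX_WORDS];
    static int entry_count = 0;

    /* Return the URL list for a key word, or NULL if the word is unknown. */
    const char **lookup(const char *key, int *count)
    {
        for (int i = 0; i < entry_count; i++) {
            if (strcmp(inverted_index[i].word, key) == 0) {
                *count = inverted_index[i].url_count;
                return inverted_index[i].urls;
            }
        }
        *count = 0;
        return NULL;
    }

Under this view, answering a query reduces to a lookup on the key word, with the returned URL list sorted or ranked before it is sent back to the user.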

This is a multi-phase team project. You can work in a team of two or three, or you can work alone; you are encouraged to work in teams. I would prefer that you write the program in C, though other languages such as Python, Java, or a mix of them are acceptable.

Due dates

Project assigned: Wednesday January 27th, 2016.

Phase 1: Monday February 8th, 2016.

Phase 2: Friday March 4th, 2016.

Phase 3: Monday March 28th, 2016.

A Few Words of Caution

Before implementing the project, I would ask everyone to keep the following in mind.

Search Engine Architecture

A search engine consists of two major parts that are largely independent of each other, as can be seen from the figure. The part on the left side of the Document Collection answers user queries. A user interacts with the search engine, typically through a browser; the query is passed through the web server to the retriever, which retrieves the collection of URLs relevant to the query.

The part on the right side of the Document Collection gathers information from the web so that the URLs related to user queries can be retrieved. A crawler goes around the web reading web pages. Each page is sent to a parser that breaks it down into a collection of words and extracts the URLs it contains; these URLs point from the current page to other pages on the internet. If the current web page is denoted w and the set of URLs extracted from w is S, then page w effectively points to each page s_i in S. When a user issues a query, the document collection is searched and a list of URLs whose pages contain the key word is generated. The data structure that holds this information and makes the search possible is what the indexer builds and augments. Because in this data structure each word points to the documents (URLs) that contain it, rather than each document pointing to the words it contains, we call it an inverted index.

Figure: Architecture of A Simple Search Engine

Phase 1: Retrieve a Web Page and Parse URLs in the Page

The first phase of the project is to retrieve a web page and parse out all the URLs contained in the web page. You are to construct a TCP client program that is able to send an HTTP GET request to a web server and retrieve the page back from the server. Once a page is retrieved, the second task of this phase is to parse out all the URLs in the page.

An HTTP client (a.k.a. web client) establishes a connection to a web server through a TCP socket. After creating the socket, the client makes a connection request to the server; the server acknowledges the connection, and communication between the server and the client is established. Once the TCP connection is ready, the client can send either an HTTP 1.0 or an HTTP 1.1 request. The server sends back the page being requested, and the client can then process the web page as needed.
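As a minimal sketch of this exchange (the provided webclient.c and your own CSCI 315 wrappers are the real starting point; the host and path below are only examples), an HTTP/1.0 GET over a plain TCP socket might look like this:

    /* Minimal sketch of fetching a page with an HTTP/1.0 GET over a TCP socket.
     * Error handling is abbreviated; host and path are illustrative only. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <netdb.h>
    #include <sys/socket.h>

    int main(void)
    {
        struct addrinfo hints, *res;
        memset(&hints, 0, sizeof hints);
        hints.ai_family   = AF_INET;
        hints.ai_socktype = SOCK_STREAM;

        /* Resolve the server name and connect on port 80. */
        if (getaddrinfo("www.eg.bucknell.edu", "80", &hints, &res) != 0)
            return 1;
        int sock = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
        if (sock < 0 || connect(sock, res->ai_addr, res->ai_addrlen) < 0)
            return 1;

        /* Send an HTTP/1.0 GET request; HTTP/1.0 closes the connection
         * after the response, which simplifies reading. */
        const char *request =
            "GET /~xmeng/index.html HTTP/1.0\r\n"
            "Host: www.eg.bucknell.edu\r\n"
            "\r\n";
        write(sock, request, strlen(request));

        /* Read the response until the server closes the connection. */
        char buf[4096];
        ssize_t n;
        while ((n = read(sock, buf, sizeof buf)) > 0)
            fwrite(buf, 1, (size_t)n, stdout);

        close(sock);
        freeaddrinfo(res);
        return 0;
    }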

See the web client program example for code details. The client program in that example uses a library, tcplib.c. You are asked to use your own wrapper functions developed before, e.g., in CSCI 315. In addition, the example program simply prints the received web page to the standard output (screen); your program must save the content in a string variable so that it can be used by other parts of the program.
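One possible way to keep the page in a string rather than printing it, assuming you read from a connected socket in chunks (the function and variable names here are made up), is to grow a buffer with realloc:

    /* Sketch: accumulate data read from the socket into one growing string
     * instead of printing it.  'sock' is assumed to be a connected TCP socket. */
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    char *read_whole_page(int sock)
    {
        size_t len = 0, cap = 4096;
        char *page = malloc(cap);
        char chunk[1024];
        ssize_t n;

        if (page == NULL)
            return NULL;

        while ((n = read(sock, chunk, sizeof chunk)) > 0) {
            if (len + (size_t)n + 1 > cap) {           /* grow buffer as needed */
                cap *= 2;
                char *bigger = realloc(page, cap);
                if (bigger == NULL) { free(page); return NULL; }
                page = bigger;
            }
            memcpy(page + len, chunk, (size_t)n);
            len += (size_t)n;
        }
        page[len] = '\0';                              /* page is now one string */
        return page;
    }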

The second task in this phase of the project is to parse out all the URLs contained in the received page. Below is a sketch of a simple web page that contains multiple URLs. Your task is to extract all the URLs and to complete those that are relative.
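For illustration only (the file names below are made up), such a page might mix absolute and relative links:

    <html>
      <body>
        <h1>Course Materials</h1>
        <!-- an absolute URL: usable as-is -->
        <a href="http://www.eg.bucknell.edu/~cs363/2016-spring/index.html">Course home</a>
        <!-- relative URLs: must be completed using the URL of the page itself -->
        <a href="notes/lecture01.html">Lecture 1 notes</a>
        <a href="../~csci203/index.html">CSCI 203 home</a>
      </body>
    </html>

Here the last two links have to be combined with the URL of the page they appear on before your crawler can fetch them.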

In the code example directory, some utility programs are provided that may help you extract a URL and convert it to its absolute form. In particular, look at the program library htmllinks.c and its associated files including the Makefile.

Here are some discussions of technical details that may help you carry out the task.

Strategy to tackle the problem

You can approach the problem in many different ways. Here is what I would suggest.

  1. Develop a simple web client using the socket interface first, similar (or identical) to the example given in code/web-client-server-c/webclient.c. This client program should be able to retrieve a web page from a given URL.
  2. Test your program on some simple pages such as http://www.eg.bucknell.edu/~xmeng/index.html and http://www.eg.bucknell.edu/~xmeng/testpages/index.html.
  3. When the web-page retrieval part is working, move on to the next task: extracting all the URL links. What you can do is save the retrieved page to a text file, for example by redirecting the output from the screen to a file, and then develop the code that extracts URLs from the saved text file. Doing so avoids unnecessary network traffic.
  4. You can divide the task of extracting URLs into two steps. First work with the absolute URLs in a web page, then work with the relative URLs. Again, you can start with simple pages such as my home page, then move on to more complicated pages with many relative URLs, such as http://www.eg.bucknell.edu/~xmeng/test/relative-urls.html
  5. When everything is working, download the course home page at http://www.eg.bucknell.edu/~cs363/2016-spring/index.html, extract all the URLs there, and convert any relative URLs into absolute URLs (a sketch of the idea appears right after this list).
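The provided path_util.c and testpaths.c are the intended tools for this conversion; purely as a sketch of the idea (it handles only the simple same-directory case and is not a complete solution), completing a relative URL against the URL of the page it came from could look like this:

    /* Sketch: build an absolute URL from a base page URL and a relative link.
     * This handles only the simple case (no "../", no query strings, etc.);
     * the provided path_util.c is more complete. */
    #include <stdio.h>
    #include <string.h>

    void make_absolute(const char *base, const char *rel, char *out, size_t outsize)
    {
        if (strncmp(rel, "http://", 7) == 0 || strncmp(rel, "https://", 8) == 0) {
            snprintf(out, outsize, "%s", rel);         /* already absolute */
            return;
        }
        /* Copy the base URL up to and including its last '/'. */
        const char *slash = strrchr(base, '/');
        size_t prefix = slash ? (size_t)(slash - base) + 1 : strlen(base);
        snprintf(out, outsize, "%.*s%s", (int)prefix, base, rel);
    }

    int main(void)
    {
        char url[512];
        make_absolute("http://www.eg.bucknell.edu/~cs363/2016-spring/index.html",
                      "lectures/lecture01.html", url, sizeof url);
        printf("%s\n", url);   /* .../~cs363/2016-spring/lectures/lecture01.html */
        return 0;
    }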

Some Code Examples That May Help

The examples in this directory may help you work on the project.

The two files fsm-url.c and regexp-url.c show two different ways of extracting URLs from a web page. fsm-words.c is a simple example of parsing a piece of text into a collection of words. testpaths.c uses path_util.c to convert each relative URL in a page into an absolute URL.
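If you want to see the basic idea before reading those files, a naive approach (far less robust than the finite-state-machine and regular-expression versions provided with the course code) is simply to scan the page text for href=" markers:

    /* Naive sketch: scan an HTML page held in a string for href="..." links.
     * This is only meant to show the idea; the provided fsm-url.c and
     * regexp-url.c handle many more cases. */
    #include <stdio.h>
    #include <string.h>

    void print_links(const char *page)
    {
        const char *p = page;
        while ((p = strstr(p, "href=\"")) != NULL) {
            p += strlen("href=\"");                 /* skip past href="      */
            const char *end = strchr(p, '"');       /* link ends at closing " */
            if (end == NULL)
                break;
            printf("%.*s\n", (int)(end - p), p);    /* print the URL itself   */
            p = end + 1;
        }
    }

    int main(void)
    {
        const char *sample =
            "<a href=\"http://www.eg.bucknell.edu/~cs363/\">course</a>"
            "<a href=\"notes/lecture01.html\">notes</a>";
        print_links(sample);
        return 0;
    }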

Deliverable

You are to submit the following to your Gitlab account by the deadline for this phase.

  1. Your programs and their associated header files or libraries;
  2. The CSCI 363 course home page for this semester, as retrieved by your program, and a text file containing the list of completed URLs your program extracted from that page;
  3. A text file that briefly describes the following (total length no more than two pages, or 800 words):
  4. Submit your work through Gitlab: create a project1 folder under the csci363 repository you created earlier, and upload (push) all your files into the project1 directory by the deadline.