Project 1: Search Engine - A Client Server Program Pair

In this project, students working in teams will create a simple but functional search engine. The search engine collects and organizes information about online course materials in the computer science department and provides users with a search interface similar to that of most search engines, e.g., www.google.com. A user should be able to enter a key phrase such as assembly or architecture, and your search engine should return a list of resources that contain the key word.

The software will consist of a few components. The first is a crawler that visits and collects information from course websites that are accessible, for example, http://www.eg.bucknell.edu/~cs363/ and http://www.eg.bucknell.edu/~csci203. Your crawler will read these web pages and break them into a list of words and the URLs that appear on the pages. The second component is a web server that will act as the basis of the search engine: when a user issues a query, the search engine responds with the data the server has. The third component is an inverted index, which is built from the retrieved web pages. Each word in this index points to a list of web pages that contain that word. When a query is given, the search engine looks through its inverted index and returns the list of web pages that contain the key word. You can use a very simple mechanism to sort the returned list. Optionally, you can use a slightly more sophisticated ranking algorithm to rank the returned list. See Extra Credit for details.
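Purely as a sketch of the idea (none of these names or sizes are part of the required interface), an inverted index can be represented in C as a table mapping each word to the list of page URLs that contain it:

    /* Sketch of an inverted index: each word maps to the list of page URLs
     * that contain it.  Names and sizes are illustrative only. */
    #include <string.h>

    #define MAX_WORDS 1000
    #define MAX_URLS    64

    struct index_entry {
        const char *word;              /* the indexed word                 */
        const char *urls[MAX_URLS];    /* pages that contain this word     */
        int         url_count;         /* number of URLs stored so far     */
    };

    static struct index_entry inverted_index[MAX_WORDS];
    static int entry_count = 0;

    /* Return the URL list for a key word, or NULL if the word is unknown. */
    const char **lookup(const char *key, int *count)
    {
        for (int i = 0; i < entry_count; i++) {
            if (strcmp(inverted_index[i].word, key) == 0) {
                *count = inverted_index[i].url_count;
                return inverted_index[i].urls;
            }
        }
        *count = 0;
        return NULL;
    }

Under this view, answering a query reduces to a lookup on the key word, with the returned URL list sorted or ranked before it is sent back to the user.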

This is a multi-phase team project. You can work in a team of two or three, or you can work alone; you are encouraged to work in teams. I would prefer that you write the program in C, though other languages such as Python, Java, or a mix of them are acceptable.

Due dates

Project assigned: Wednesday January 27th, 2016.

Phase 1: Monday February 8th, 2016.

Phase 2: Friday March 4th, 2016.

Phase 3: Monday March 28th, 2016.

A Few Words of Caution

Before implementing the project, I would ask everyone to keep the following in mind.

Search Engine Architecture

A search engine consists of two major parts that are largely independent of each other, as can be seen from the figure. The part on the left side of the Document Collection answers user queries. A user interacts with the search engine, typically through a browser; the query is passed through the web server to the retriever, which retrieves the collection of URLs relevant to the query.

The part on the right side of the Document Collection gathers information from the web so that the URLs related to user queries can be retrieved. A crawler goes around the web reading web pages. Each page is sent to a parser that breaks it down into a collection of words and extracts the URLs it contains; these URLs point from the current page to other pages on the internet. If the current web page is denoted w and the set of URLs extracted from w is S, then page w effectively points to each page s_i in S. When a user issues a query, the document collection is searched and a list of URLs whose pages contain the key word is generated. The data structure that holds this information and makes the search possible is what the indexer builds and augments. Because in this data structure each word points to the documents (URLs) that contain it, rather than each document pointing to the words it contains, we call it an inverted index.

Figure: Architecture of A Simple Search Engine

Phase 1: Retrieve a Web Page and Parse URLs in the Page

The first phase of the project is to retrieve a web page and parse out all the URLs contained in the web page. You are to construct a TCP client program that is able to send an HTTP GET request to a web server and retrieve the page back from the server. Once a page is retrieved, the second task of this phase is to parse out all the URLs in the page.

An HTTP client (a.k.a. web client) establishes a connection to a web server through a TCP socket. After creating the socket, the client makes a connection request to the server; the server acknowledges the connection, and communication between the server and the client is established. Once the TCP connection is ready, the client can send either an HTTP 1.0 or an HTTP 1.1 request. The server sends back the page being requested, and the client can then process the web page as needed.
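As a minimal sketch of this exchange (the provided webclient.c and your own CSCI 315 wrappers are the real starting point; the host and path below are only examples), an HTTP/1.0 GET over a plain TCP socket might look like this:

    /* Minimal sketch of fetching a page with an HTTP/1.0 GET over a TCP socket.
     * Error handling is abbreviated; host and path are illustrative only. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <netdb.h>
    #include <sys/socket.h>

    int main(void)
    {
        struct addrinfo hints, *res;
        memset(&hints, 0, sizeof hints);
        hints.ai_family   = AF_INET;
        hints.ai_socktype = SOCK_STREAM;

        /* Resolve the server name and connect on port 80. */
        if (getaddrinfo("www.eg.bucknell.edu", "80", &hints, &res) != 0)
            return 1;
        int sock = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
        if (sock < 0 || connect(sock, res->ai_addr, res->ai_addrlen) < 0)
            return 1;

        /* Send an HTTP/1.0 GET request; HTTP/1.0 closes the connection
         * after the response, which simplifies reading. */
        const char *request =
            "GET /~xmeng/index.html HTTP/1.0\r\n"
            "Host: www.eg.bucknell.edu\r\n"
            "\r\n";
        write(sock, request, strlen(request));

        /* Read the response until the server closes the connection. */
        char buf[4096];
        ssize_t n;
        while ((n = read(sock, buf, sizeof buf)) > 0)
            fwrite(buf, 1, (size_t)n, stdout);

        close(sock);
        freeaddrinfo(res);
        return 0;
    }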

See the web client program example for code details. The client program in that example uses a library, tcplib.c. You are asked to use your own wrapper functions developed before, e.g., in CSCI 315. In addition, the example program simply prints the received web page to the standard output (screen); your program must save the content in a string variable so that it can be used by other parts of the program.
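One possible way to keep the page in a string rather than printing it, assuming you read from a connected socket in chunks (the function and variable names here are made up), is to grow a buffer with realloc:

    /* Sketch: accumulate data read from the socket into one growing string
     * instead of printing it.  'sock' is assumed to be a connected TCP socket. */
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    char *read_whole_page(int sock)
    {
        size_t len = 0, cap = 4096;
        char *page = malloc(cap);
        char chunk[1024];
        ssize_t n;

        if (page == NULL)
            return NULL;

        while ((n = read(sock, chunk, sizeof chunk)) > 0) {
            if (len + (size_t)n + 1 > cap) {           /* grow buffer as needed */
                cap *= 2;
                char *bigger = realloc(page, cap);
                if (bigger == NULL) { free(page); return NULL; }
                page = bigger;
            }
            memcpy(page + len, chunk, (size_t)n);
            len += (size_t)n;
        }
        page[len] = '\0';                              /* page is now one string */
        return page;
    }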

The second task in this phase of the project is to parse out all the URLs contained in the received page. Below is a sketch of a simple web page that contains multiple URLs. Your task is to extract all the URLs and to complete those that are relative.
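For illustration only (the file names below are made up), such a page might mix absolute and relative links:

    <html>
      <body>
        <h1>Course Materials</h1>
        <!-- an absolute URL: usable as-is -->
        <a href="http://www.eg.bucknell.edu/~cs363/2016-spring/index.html">Course home</a>
        <!-- relative URLs: must be completed using the URL of the page itself -->
        <a href="notes/lecture01.html">Lecture 1 notes</a>
        <a href="../~csci203/index.html">CSCI 203 home</a>
      </body>
    </html>

Here the last two links have to be combined with the URL of the page they appear on before your crawler can fetch them.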

In the code example directory, some utility programs are provided that may help you extract a URL and convert it to its absolute form. In particular, look at the program library htmllinks.c and its associated files including the Makefile.

Here are some discussions of technical details that may help you carry out the task.

Strategy to tackle the problem

You can approach the problem in many different ways. Here is what I would suggest.

  1. Develop a simple web client using the socket interface first, similar (or identical) to the example given in code/web-client-server-c/webclient.c. This client program should be able to retrieve a web page from a given URL.
  2. Test your program on some simple pages such as http://www.eg.bucknell.edu/~xmeng/index.html and http://www.eg.bucknell.edu/~xmeng/testpages/index.html.
  3. When the web-page retrieval part is working, move on to the next task: extracting all the URL links. What you can do is save the retrieved page to a text file, for example by redirecting the output from the screen to a file, and then develop the code that extracts URLs from the saved text file. Doing so avoids unnecessary network traffic.
  4. You can divide the task of extracting URLs into two steps. First work with the absolute URLs in a web page, then work with the relative URLs. Again, you can start with simple pages such as my home page, then move on to more complicated pages with many relative URLs, such as http://www.eg.bucknell.edu/~xmeng/test/relative-urls.html
  5. When everything is working, download the course home page at http://www.eg.bucknell.edu/~cs363/2016-spring/index.html, extract all the URLs there, and convert any relative URLs into absolute URLs (a sketch of the idea appears right after this list).
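The provided path_util.c and testpaths.c are the intended tools for this conversion; purely as a sketch of the idea (it handles only the simple same-directory case and is not a complete solution), completing a relative URL against the URL of the page it came from could look like this:

    /* Sketch: build an absolute URL from a base page URL and a relative link.
     * This handles only the simple case (no "../", no query strings, etc.);
     * the provided path_util.c is more complete. */
    #include <stdio.h>
    #include <string.h>

    void make_absolute(const char *base, const char *rel, char *out, size_t outsize)
    {
        if (strncmp(rel, "http://", 7) == 0 || strncmp(rel, "https://", 8) == 0) {
            snprintf(out, outsize, "%s", rel);         /* already absolute */
            return;
        }
        /* Copy the base URL up to and including its last '/'. */
        const char *slash = strrchr(base, '/');
        size_t prefix = slash ? (size_t)(slash - base) + 1 : strlen(base);
        snprintf(out, outsize, "%.*s%s", (int)prefix, base, rel);
    }

    int main(void)
    {
        char url[512];
        make_absolute("http://www.eg.bucknell.edu/~cs363/2016-spring/index.html",
                      "lectures/lecture01.html", url, sizeof url);
        printf("%s\n", url);   /* .../~cs363/2016-spring/lectures/lecture01.html */
        return 0;
    }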

Some Code Examples That May Help

The examples in this directory may help you work on the project.

The two files fsm-url.c and regexp-url.c show two different ways of extracting URLs from a web page. fsm-words.c is a simple example of parsing a piece of text into a collection of words. testpaths.c uses path_util.c to convert each relative URL in a page into an absolute URL.
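If you want to see the basic idea before reading those files, a naive approach (far less robust than the finite-state-machine and regular-expression versions provided with the course code) is simply to scan the page text for href=" markers:

    /* Naive sketch: scan an HTML page held in a string for href="..." links.
     * This is only meant to show the idea; the provided fsm-url.c and
     * regexp-url.c handle many more cases. */
    #include <stdio.h>
    #include <string.h>

    void print_links(const char *page)
    {
        const char *p = page;
        while ((p = strstr(p, "href=\"")) != NULL) {
            p += strlen("href=\"");                 /* skip past href="      */
            const char *end = strchr(p, '"');       /* link ends at closing " */
            if (end == NULL)
                break;
            printf("%.*s\n", (int)(end - p), p);    /* print the URL itself   */
            p = end + 1;
        }
    }

    int main(void)
    {
        const char *sample =
            "<a href=\"http://www.eg.bucknell.edu/~cs363/\">course</a>"
            "<a href=\"notes/lecture01.html\">notes</a>";
        print_links(sample);
        return 0;
    }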

Deliverable

You are to submit the following to your Gitlab account by the deadline for this phase.

  1. Your programs and their associated header files or libraries;
  2. The CSCI 363 course home page for this semester, as retrieved by your program, and a text file containing the list of completed URLs your program extracted from that page;
  3. A text file that briefly describes the following (total length no more than two pages, or 800 words):
  4. Submit your work through Gitlab: create a project1 folder under the csci363 repository you created earlier, and upload (push) all your files into the project1 directory by the deadline.