Project 1: Search Engine - A Client Server Program Pair

The goal of the project is to produce a limited-scale, but functional, search engine. The search engine should provide a count and a partial list of the web pages that point to a searched (target) web page. For example, if one searches for www.bucknell.edu, your search engine should report that a total of 2,345 web pages (this is not a real count) in the Bucknell domain point to www.bucknell.edu, and it should list the first n URLs pointing to the target page, where n is a reasonable number (e.g., 30 or 50). The search engine you are about to build is limited in scale in that it is only required to collect a limited number of documents (e.g., on the order of a few thousand to a few tens of thousands of pages). The more pages your search engine can collect, the better. It would be difficult, if not impossible, for a lab desktop computer to collect all pages in a given domain, e.g., bucknell.edu.

This is a multi-phased team project. You can work in a team of two to three, or you can work alone. You are encouraged to work in teams. You are asked to write the program in C. However, if you have a good reason to develop the project in another programming language, please discuss it with me. In general, if you use a more modern language, you are asked to provide more functionality in the project.

Due dates

Phase 1: Wednesday February 6th.

Phase 2: Wednesday February 13th.

Phase 3: Wednesday February 27th.

Phase 4: Friday March 8th.

Phase 4: Monday March 25

A Few Words of Caution

Before implementing the project, I would ask that all of us keep the following in mind.

Search Engine Architecture

A search engine consists of two major parts, somewhat independent of each other, as can be seen from the figure. One is on the left side of the Document Collection, which answers users' queries. The other is on the right side of the Document Collection, which collects information from the web so the URLs related to the user queries can be retrieved. A crawler goes around the web to read web pages. The information is then sent to a parser. (In a general-purpose search engine, a more elaborate indexer is needed. Since we are only interested in the hyper-text links that point to a particular web page, we can omit the indexing part.) The parser extracts the URLs contained in the given web page. These URLs point from this web page to other web pages. If the current web page is denoted as w and the set of URLs extracted from w is s, then effectively page w points to each page s_i in the set s. When a user issues a query, the document collection is searched and a list of URLs is generated. Each of the URLs presented as an answer to the query points to the queried (target) web page.


Figure: Architecture of A Simple Search Engine

Phase 1: Retrieve a Web Page and Parse URLs in the Page

The first phase of the project is to retrieve a web page and parse out all the URLs contained in the web page. You are to construct a TCP client program that is able to send an HTTP GET request to a web server and retrieve the page back from the server. Once a page is retrieved, the second task of this phase is to parse out all the URLs in the page.

An HTTP client (a.k.a. web client) establishes a connection to a web server through a TCP socket. After creating the socket, the client makes a connection request to the server, the server acknowledges the connection and now the communication between the server and the client is established. Once a TCP connection is ready, the client can send either an HTTP 1.0 request or an HTTP 1.1 request. The server will send back the page being requested. The client then can process the page.

See the web client program example for code details. The client program in that example uses a library, tcplib.c. You are asked to use your own wrapper programs developed in your lab assignment. In addition, the example program simply prints the received web page to the standard output (screen); your program must instead save the content in a variable that can be used by other parts of the program.
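
For reference, a minimal sketch of such a client using the bare socket interface might look like the following; the host name and path are only placeholders, and a real version would append the response to a buffer instead of printing it.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>

int main(void)
{
    /* Resolve the server name and the HTTP port (80). */
    struct addrinfo hints, *res;
    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo("www.eg.bucknell.edu", "80", &hints, &res) != 0)
        return 1;

    /* Create a TCP socket and connect to the server. */
    int sock = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (sock < 0 || connect(sock, res->ai_addr, res->ai_addrlen) < 0)
        return 1;

    /* Send a minimal HTTP/1.0 GET request for one page. */
    const char *request =
        "GET /~xmeng/index.html HTTP/1.0\r\n"
        "Host: www.eg.bucknell.edu\r\n"
        "\r\n";
    write(sock, request, strlen(request));

    /* Read the response; your program would store it in a buffer here. */
    char buf[4096];
    ssize_t n;
    while ((n = read(sock, buf, sizeof(buf))) > 0)
        fwrite(buf, 1, (size_t)n, stdout);

    close(sock);
    freeaddrinfo(res);
    return 0;
}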

The second task in this phase of the project is to parse out all URLs contained in the received page. The following is a simple web page that contains multiple URLs. Your task is to extract all the URLs and to complete the URLs that are relative.

Here are discussions on some technical details that may help you to carry out the task.

Strategy to tackle the problem

You can approach the problem in many different ways. Here is what I would suggest.

  1. Develop a simple web client using the socket interface first, similar (or identical) to the example given in code/client-server-c/webclient.c. This client program should be able to retrieve a web page from a given URL.
  2. Test your program on some simple pages such as http://www.eg.bucknell.edu/~xmeng/index.html and http://www.eg.bucknell.edu/~xmeng/testpages/index.html.
  3. When the page-retrieval part is working fine, move on to the next task: extracting all the URL links. What you can do is save the retrieved page to a text file, for example by redirecting the output from the screen to a file, then develop the part of your code that extracts URLs from the saved text file. Doing so avoids unnecessary network traffic.
  4. You can divide the task of extracting URLs into two steps. First work with the absolute URLs in a web page. Then work with the relative URLs in a page (see the sketch after this list). Again, you can start with the simple pages such as my home page, then move on to more complicated pages with many relative URLs, such as http://www.eg.bucknell.edu/~xmeng/test/relative-urls.html
  5. When everything is working fine, try to download Bucknell's home page at http://www.bucknell.edu/, extract all the URLs there, and convert any relative URLs into absolute URLs. The number of URLs in Bucknell's home page was about 130 or so the last time I tried it. Be very careful not to overrun the network and Bucknell's web server. See the Words of Caution at the beginning of the assignment.
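
As one possible starting point for the URL-extraction step (certainly not the only way), the following sketch scans an in-memory copy of a page for href="..." attributes. The function url_found() is a hypothetical handler standing in for whatever your program does with each URL, such as completing relative URLs against the base URL.

#include <stdio.h>
#include <string.h>

/* Hypothetical handler: in the real program this would store the URL,
   completing relative URLs against the base URL first. */
static void url_found(const char *url, size_t len)
{
    printf("%.*s\n", (int)len, url);
}

/* Scan an in-memory copy of a page for href="..." attributes. Real pages
   may also use single quotes or uppercase HREF; handle those as needed. */
static void extract_urls(const char *page)
{
    const char *p = page;
    while ((p = strstr(p, "href=\"")) != NULL) {
        p += strlen("href=\"");
        const char *end = strchr(p, '"');    /* closing quote of the URL */
        if (end == NULL)
            break;
        url_found(p, (size_t)(end - p));
        p = end + 1;
    }
}

int main(void)
{
    const char *sample =
        "<a href=\"http://www.bucknell.edu/\">Bucknell</a>"
        "<a href=\"images/photos.html\">a relative link</a>";
    extract_urls(sample);
    return 0;
}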

Deliverable

You are to submit to your Gitlab account the following by the deadline for this phase.

  1. Your C program and its associated header files;
  2. Bucknell's home page you retrieved and a text file containing the list of URLs your program extracted from Bucknell's home page;
  3. A text file that briefly describes this phase of the project, including team member names, the challenges your team encountered, and any thoughts you'd like to share (total length no more than two pages, or 800 words);
  4. Submit your work through Gitlab. Create a project1 folder under the csci363-s13 repository you created earlier, and upload (push) all your files in the project1 directory by the deadline.

Phase 2: Build a Simple Web Search Engine

The second phase of the project asks you to create a web server (a simple search engine) that can answer user queries about a web page your program has visited and indexed. The type of query your search engine needs to be able to handle is very simple. The query gives a URL for a web page. Your server program should be able to provide information such as the number of out-going URLs collected from this web page and a list of the first 20 of these URLs, if the target web page has that many out-going URLs. A typical user interaction would look as follows (here we only list text; you should be able to do it in a browser).

The following sections discuss some of the technical details of how we can implement such a simple search engine.

2.1 Creating a server program as a web server

A web server is essentially a TCP-based server program that uses HTTP as its application-layer protocol. Take a look at the echo-server.c program in the following.

In this program, the server waits for client connection requests at an agreed-upon port. Once a connection is accepted, the server reads a string from the client and sends it right back to the client with an inserted phrase "Echo --> ". From the client's point of view, once a connection is accepted by the server, the client sends a message to the server, reads the message sent back by the server, and prints it on the screen. This is the application protocol for this particular echo service!
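
The listing itself is not reproduced here, but a minimal sketch of such an echo server (with 2500 as a placeholder port and most error checking omitted) could look like the following.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    /* Create a listening TCP socket on the agreed-upon port (2500 here). */
    int listenfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(2500);
    bind(listenfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(listenfd, 5);

    for (;;) {
        /* Accept one client connection at a time. */
        int connfd = accept(listenfd, NULL, NULL);
        if (connfd < 0)
            continue;

        /* Read a message from the client ... */
        char buf[1024], reply[1100];
        ssize_t n = read(connfd, buf, sizeof(buf) - 1);
        if (n > 0) {
            buf[n] = '\0';
            /* ... and send it right back with the inserted phrase. */
            snprintf(reply, sizeof(reply), "Echo --> %s", buf);
            write(connfd, reply, strlen(reply));
        }
        close(connfd);
    }
    return 0;
}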

For a search engine, the interaction between a client and a server follows the HTTP protocol, which is slightly more complicated than a service such as echo. To understand how HTTP works, first let's do the following experiment.

In your echo-server.c program, after reading the input from a client, instead of echoing the message back, have the server print what is read from the client on the screen and then send back the following message to the client.

Note that your server now essentially sends back an HTTP response code first (HTTP/1.0 200 OK), followed by two carriage-return/line-feed (CRLF) pairs. The HTTP response code is then followed by a web page.
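
In C, this response can be sent with a single write(); a minimal sketch, where connfd is the connected socket returned by accept() and the HTML body is only an illustration:

#include <string.h>
#include <unistd.h>

/* Send a minimal HTTP response on an already-connected socket. */
static void send_simple_page(int connfd)
{
    const char *response =
        "HTTP/1.0 200 OK\r\n"   /* status line, ends with one CRLF       */
        "\r\n"                  /* second CRLF ends the (empty) header   */
        "<html><body><h1>Hello from my server!</h1></body></html>\r\n";
    write(connfd, response, strlen(response));
}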

When your program prints what it reads from the client (a browser), you should see something similar to the following on your screen. We will explore the meaning of this request later.
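
The exact header lines differ from browser to browser, but the request typically looks something along these lines:

GET / HTTP/1.1
Host: dana213-lnx-1:2500
User-Agent: Mozilla/5.0 (X11; Linux x86_64)
Accept: text/html,application/xml
Accept-Language: en-US,en
Accept-Encoding: gzip, deflate
Connection: keep-alive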

Compile and run the server program. Assume the server program is running on a lab computer, e.g., dana213-lnx-1. Point your favorite web browser at the URL dana213-lnx-1:2500. Here the server name is dana213-lnx-1 and 2500 is the port at which the server program is running. You should use the name of the computer on which your server program is running and the port number that the server program is using. You should see the browser display the content sent back by the server.

The above is the simplest scenario of interaction between a web client and a web server. How does a client request a specific web page from a web server? In the same directory where your simple web server resides, create a simple web page with the following content (you can certainly make a more elaborate web page if you would like), calling the file simple.html or a name of your choice.

Set the access permission of simple.html as readable by the world. If you are not familiar with how to set permissions, please read the manual page on the command chmod. For what we need, you can simply do

chmod 644 simple.html

which sets the file readable by all and writable by the owner (you) only.

Revise your echo-server.c program by the following steps.

  1. Revise the write() statement in the echo-server.c program so that it only sends back the HTTP response code ("HTTP/1.0 200 OK"), not the in-line simple web page code.
  2. Have your server program open and read the file simple.html from the disk and send the contents being read directly to the client (web browser) using the write() system call (see the sketch after this list). With this change, the web browser that made the request to your server should display simple.html, and you should be able to click the hypertext link from within that page.
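
A sketch of step 2 might look like the following; it assumes connfd is the connected socket returned by accept() and that simple.html sits in the server's working directory.

#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Send the HTTP status line, then the contents of simple.html,
   to an already-connected client socket. */
static void send_file(int connfd)
{
    const char *header = "HTTP/1.0 200 OK\r\n\r\n";
    write(connfd, header, strlen(header));

    FILE *fp = fopen("simple.html", "r");
    if (fp == NULL)
        return;                       /* a real server would send 404 here */

    char buf[1024];
    size_t n;
    while ((n = fread(buf, 1, sizeof(buf), fp)) > 0)
        write(connfd, buf, n);        /* pass the page through unchanged */
    fclose(fp);
}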

Now let's look at what the original client requested. The client (a web browser) sends a request to the web server when the browser makes a connection to the server. The command GET / HTTP/1.1 indicates that the browser wants to read the root HTML page at the server. If a browser requests a specific file, e.g., simple.html, the parameter of the GET command looks as follows.

GET /simple.html HTTP/1.1

Confirm this behavior by changing the URL the browser uses to access the web page to

http://dana213-lnx-1:2500/simple.html

Load the page again. You should still see the content of simple.html displayed on the browser screen. In addition, you should see on the server side that the parameter of the GET command has changed from the root "/" to "/simple.html". Other common pieces of a browser request include the host name of the server, the agent name (the browser), the content types the browser can handle (e.g., text/html or application/xml), the accepted languages, and the accepted encoding mechanisms, among others.
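
On the server side, the requested path can be pulled out of the first request line with something as simple as sscanf; a small sketch (the buffer sizes are arbitrary):

#include <stdio.h>

int main(void)
{
    /* What the server reads from the browser (only the first line matters here). */
    const char *request = "GET /simple.html HTTP/1.1\r\nHost: dana213-lnx-1:2500\r\n\r\n";

    char method[16], path[256];
    /* The request line is "METHOD path HTTP/version"; pull out the
       first two fields, with widths bounded by the buffer sizes. */
    if (sscanf(request, "%15s %255s", method, path) == 2)
        printf("method=%s, requested path=%s\n", method, path);
    return 0;
}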

2.2 How to send a query to a web server

Now that we know how a web client (e.g., a browser) interacts with a web server, we can turn our attention to how to make a web server a search engine, how a web client such as a browser sends a query to the search engine, and how the search engine sends the search results back to the client.

A web client can send a piece of information such as a query to a web server by using the HTTP POST command. Let's first concentrate on the client side to see how we can post a request to the server.

The basic mechanism to post a query from a web client is to use a form submission method in HTML. Once again, let's change the program echo-server.c to make it accept and process a query from a client. Instead of sending simple.html to the client, let's have the server send back the form.html which reads as follows.

Now have your browser point to your server, e.g.,
http://dana213-lnx-1:2500/

This time, instead of a simple web page, your browser screen should display a form with two input text boxes and a Submission button. Type some values into these two input text boxes and click the Submission button. In our example, we use an integer 123 and a string "abc" as the input values. Have your server read and print whatever the client is sending to the server. You should see something similar to the following printed on the server's screen.

Notice that in the above information sent from the browser to the server after clicking the "Submit" button, the command is "POST" instead of "GET", because the client is trying to send, thus post, information to the server. The amount and the content of the information sent from the client to the server are included in the message as well. The amount of information is indicated by the attribute Content-length, which has the value of 44, meaning the content of the form is 44 bytes long. The actual content of the post, i.e., the content of the form, follows the header, separated from it by a blank line (a carriage-return/line-feed, CRLF, pair by itself). Remember that a CRLF pair by itself on a line indicates the end of the header in HTTP. In our example, the content is a string of 44 characters as follows.

FirstInput=123&SecondInput=abc&Submit=Submit

The content of the form submission is divided into key=value pairs, separated by an &. For example, the above submission contains three pairs of key and value. The first key is FirstInput, the name of the first text input box in the original form, and 123 is the value corresponding to this key. The second pair, SecondInput=abc, holds the string text we typed. The third pair comes from the Submit button. By parsing this form submission, the server can figure out what was sent from the client. If the server is a search engine, it takes this input as a search query and sends the search result back to the client in HTML format.
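
A sketch of pulling the key/value pairs out of such a string with the standard strtok() and strchr() functions (URL decoding of special characters such as %20 is left out):

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* The 44-byte form body from the example above; a real server would
       read Content-length bytes after the blank line instead. */
    char body[] = "FirstInput=123&SecondInput=abc&Submit=Submit";

    /* Split on '&' to get each key=value pair, then on '=' inside it. */
    for (char *pair = strtok(body, "&"); pair != NULL; pair = strtok(NULL, "&")) {
        char *eq = strchr(pair, '=');
        if (eq != NULL) {
            *eq = '\0';
            printf("key: %-12s value: %s\n", pair, eq + 1);
        }
    }
    return 0;
}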

2.3 Make your server program a search engine for web page out-degree count

Now that we know the basics, this part of the project is to make your server a search engine that can answer queries about the out-degree of a web page in a collection of web pages. That is, if a client wants to find out how many out-going HTML links a target web page contains, your search engine should be able to provide the answer in HTML-formatted output.

In the first phase of the project, you wrote a client program that can access a web page and extract all the out-going HTML links in the target web page. In this phase of the project, your server program reads this information (either from a file saved by the program in the first phase, or from a communication channel such as a pipe or shared memory) and sends the result back to the client (typically a browser) in HTML format.

Here is a typical scenario of the interaction between the server (your program) and a client (e.g., a browser).

Deliverable of phase two

You are to submit to your Gitlab account the following by the deadline.

  1. Completed code of your server (and client from the first phase if your server depends on that for its data).
  2. Screen shots of sample runs. Try your server program (and thus the client from the first phase as well) with a few different web pages. The data must include Bucknell's main web page. The output should be a count of the out-going URL links in the target page and up to 20 actual URLs that are clickable.
  3. A text file that contains a brief description of this phase of the project: it should include team member names, highlight the challenges your team encountered, and mention any thoughts you'd like to share. The length of the description should be no more than two pages or 800 words.

Phase 3: Collect Pages from Bucknell Domain

After completing the first two phases, now you have a pair of programs, a client program and a server program. The client program is able to retrieve a web page from a given website and to parse out all URLs in that page. The server program is able to take a URL as a query from a web browser and can send back the count of out-going URLs and the first 20 of these URLs.

Your next task is to further develop the client program so it can retrieve multiple web pages in a given domain. We can call this program that collects multiple web pages a crawler (please refer to the architecture figure). In addition to collecting the pages, you will also extend the capability of the parser so that it can build a relationship matrix in which the row for page x contains all the URLs retrieved from page x. Let's look at some of the details.

3.1 Crawling the web

In Phase 2 of the project, you completed a client program that can retrieve a web page from a given URL and parse out all the URLs contained in that web page. Now if your client program follows each of the URL links harvested in the page to retrieve more pages, the program becomes a crawler that is able to retrieve all web pages in a connected component of the web. The connected component here has exactly the same meaning as the one you learned in your data structures class; that is, viewing the collection as a directed graph, a root node (web page) can reach every node (web page) in the collection. Viewed from the graph perspective, the crawling process really becomes a traversal process. The starting web page becomes the root of the traversal tree in this directed graph, and the task of a crawler is to traverse all the nodes in the graph by either a breadth-first traversal or a depth-first traversal. For our purpose, we are going to pursue a breadth-first traversal. Depth-first traversal works in a similar way, using a different type of data structure.

How do we perform a breadth-first traversal in a graph from a given starting node x? We can use a queue data structure. Assuming we have a starting page x and a queue toVisitQ, a breadth-first traversal can be accomplished as follows.
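
The loop is sketched below on a toy in-memory graph of integer page IDs with a fixed-size array as the queue; these are simplifications for illustration. In the real crawler, "visiting" a page means retrieving it with your Phase 1 client, parsing out its URLs, and enqueueing only the acceptable ones (see the conditions discussed below).

#include <stdio.h>

#define MAX_PAGES 100

/* A toy in-memory "web": page i links to the pages listed in links[i].
   In the real crawler these would come from retrieving and parsing pages. */
static const int links[4][3] = { {1, 2, -1}, {3, -1, -1}, {1, 3, -1}, {-1, -1, -1} };

int main(void)
{
    int toVisitQ[MAX_PAGES];        /* queue of pages still to visit      */
    int head = 0, tail = 0;
    int visited[MAX_PAGES] = {0};   /* marks pages we have already seen   */
    int limit = MAX_PAGES;          /* cap on how many pages to crawl     */
    int count = 0;

    toVisitQ[tail++] = 0;           /* start from page x = 0              */
    visited[0] = 1;

    while (head < tail && count < limit) {
        int page = toVisitQ[head++];            /* dequeue the next page  */
        printf("visiting page %d\n", page);     /* retrieve + parse here  */
        count++;

        /* Enqueue every out-going link that has not been seen yet.       */
        for (int i = 0; i < 3 && links[page][i] >= 0; i++) {
            int next = links[page][i];
            if (!visited[next]) {
                visited[next] = 1;
                toVisitQ[tail++] = next;
            }
        }
    }
    return 0;
}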

We need to set a reasonable limit so the crawling process doesn't take too long. I would suggest a limit of a few thousand to a few tens of thousands of pages. We can also gradually raise this limit (see the notes in Section 3.3).

A number of factors should be considered when deciding whether a page should be visited. First of all, a page that has been visited before should never be visited again; otherwise the traversal would go into an infinite loop. Second, for our purpose, we should only visit text-based pages such as HTML files or plain text files. We can simply check the file extension to see if a web page falls into these text file categories. (In the Bucknell domain, an HTML page would have the extension *.html or *.xml, while a text file typically has the extension *.txt.) If you would like, you could also visit and index pages in PDF using some Unix facility (or other program) that can convert a PDF file into text. But this is not required.

We should limit our crawler to visiting web pages in the Bucknell domain only. That is, we don't want to visit a URL that points to pages outside of bucknell.edu. This is a critical factor to consider when deciding whether or not to visit a web page.
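
Those conditions can be collected into one small test before a URL is enqueued; a sketch (the check for "have we visited it already" is left to whatever visited-URL structure you keep):

#include <stdio.h>
#include <string.h>

/* Decide whether the crawler should fetch this URL: it must be inside
   the bucknell.edu domain and end in a text-based extension. */
static int should_visit(const char *url)
{
    if (strstr(url, "bucknell.edu") == NULL)
        return 0;                                /* outside the domain */

    const char *slash = strrchr(url, '/');
    const char *name = (slash != NULL) ? slash + 1 : url;
    const char *dot = strrchr(name, '.');
    if (dot == NULL)
        return 1;                                /* no extension, e.g. a directory URL */
    return strcmp(dot, ".html") == 0 || strcmp(dot, ".xml") == 0 ||
           strcmp(dot, ".txt") == 0;
}

int main(void)
{
    const char *tests[] = {
        "http://www.bucknell.edu/",
        "http://www.eg.bucknell.edu/~xmeng/index.html",
        "http://www.bucknell.edu/photo.jpg",
        "http://www.google.com/index.html"
    };
    for (int i = 0; i < 4; i++)
        printf("%-50s %s\n", tests[i], should_visit(tests[i]) ? "visit" : "skip");
    return 0;
}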

3.2 Building a degree matrix

The second task is to build a degree matrix for the visited web pages. Assume your crawler visited n web pages, and each of these n pages contains some number of links that point to other web pages. If we consider the entire collection of n web pages, we can form an n x n matrix in which each element n(i,j) indicates the number of times page i points to page j. Suppose we have the following 4 x 4 matrix,

the matrix indicates that web page p0 contains five URL links: one pointing to itself, one pointing to p1, and three pointing to p2. The other rows have similar interpretations.
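
For a small collection, the matrix can be held as a plain 2-D array of counts; the following toy sketch fills in only row p0, using the counts just described (Phase 4 discusses why a sparse, list-based representation scales better for large collections).

#include <stdio.h>

#define N 4     /* number of pages in this small example */

int main(void)
{
    /* degree[i][j] counts how many links page i has that point to page j. */
    int degree[N][N] = {{0}};

    /* Row p0 as described above: one link to itself, one to p1, three to p2. */
    degree[0][0] = 1;
    degree[0][1] = 1;
    degree[0][2] = 3;

    /* Each time the parser finds a link from page i to page j,
       the crawler would simply do: degree[i][j]++;               */

    /* The out-degree of a page is the sum of its row. */
    int out = 0;
    for (int j = 0; j < N; j++)
        out += degree[0][j];
    printf("out-degree of p0 = %d\n", out);     /* prints 5 */
    return 0;
}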

The logical concept here isn't very complicated. However, we need to pay attention to some technical, or engineering, issues.

3.3 Collecting data incrementally

While collecting web pages over the network, we should consider two issues. One is that we should try not to overcrowd the network; the other is that we should try not to repeat work that has already been done. We will discuss these two issues in this section.

Network capacity is a very precious resource. We should use it very carefully. We can choose our program options to make the program less aggressive in its use of network capacity; doing so also saves our own time because the program needs to be run less often. We can do at least two things to help: limit the number of pages retrieved in each run (gradually raising the limit, as suggested in Section 3.1), and save the crawling state, i.e., the to-be-visited queue and the visited list, so that a later run can resume where the previous run stopped instead of re-fetching pages.
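
As one possible approach (the file name and layout here are only placeholders), the crawler could write its state out at the end of each run and read it back at the start of the next.

#include <stdio.h>

/* Write the crawl state to a text file so the next run can resume:
   first the two counts, then the visited URLs, then the URLs still
   waiting in the to-visit queue. */
static void save_state(const char *fname,
                       const char *visited[], int nvisited,
                       const char *tovisit[], int ntovisit)
{
    FILE *fp = fopen(fname, "w");
    if (fp == NULL)
        return;
    fprintf(fp, "%d %d\n", nvisited, ntovisit);
    for (int i = 0; i < nvisited; i++)
        fprintf(fp, "%s\n", visited[i]);
    for (int i = 0; i < ntovisit; i++)
        fprintf(fp, "%s\n", tovisit[i]);
    fclose(fp);
}

int main(void)
{
    const char *visited[] = { "http://www.bucknell.edu/" };
    const char *tovisit[] = { "http://www.eg.bucknell.edu/~xmeng/index.html" };
    save_state("crawl-state.txt", visited, 1, tovisit, 1);
    return 0;
}

Reading the state back is symmetric: read the two counts, then the visited URLs, then the to-be-visited URLs, and seed toVisitQ with the latter before crawling resumes.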

Deliverable of phase three

You are to submit to your Gitlab account the following by the deadline.

  1. Completed code of your programs.
  2. Data file(s) that contain the current information of your crawling. As stated in the project description, this data file should contain the current list of to-be-visited URLs, the visited URLs, and a total count of how many pages the crawler has visited (length of the visited URLs data structure.)
  3. Screen shots of sample runs. Try your server program with a few different web pages. The output should be a count of the out-going URL links in the target page and up to 20 actual URLs that are clickable.
  4. A text file that contains a brief description of this phase of the project: it should include team member names, highlight the challenges your team encountered, and mention any thoughts you'd like to share. In particular, instructions on how to run your program(s) should be included. The length of the description should be no more than two pages or 800 words.

Phase 4: Put Everything Together

Just like you did in previous phases, create a separate directory for this phase so the programs can be better organized and located.

Upon completing Phase 3, your programs should be able to collect a set of web pages and serve user queries by returning a portion of the list of URLs present on a particular web page (query). As a result, you should have a collection of data (e.g., a linked list) where each node represents a web page and the node contains a list of URLs found in the web page.

In the final phase of the project, you are asked to complete the programs so that they can answer two types of queries. The first type of query is, for a given URL, to compute the count and display the first set (20) of web pages that point to this URL. The second is to display the URLs of the top-20 most popular web pages in your collection. The popularity count here is defined as the number of in-coming links to a web page (i.e., how many other web pages point to the target page).

In order for your program to answer the two types of queries, the program will need to use two submission boxes: one for the web page query, the other for the popularity-count query. The one for the count query does not need an input box; it is simply a submission button.

You may use any reasonable data structures and algorithms you wish. If you have a large collection of pages, e.g., a few thousand to a few tens of thousands, a plain 2-D array implementation of the matrix is probably not a very good data structure. In general, if you used a linked list structure to represent your matrix in Phase 3 (a list of lists), you can build an inverse list in addition to the original list in this phase. Let's call this inverse list of lists L. You can then build two sorted lists (lists of pointers into L), one sorted in alphabetical order (by the name of the URL), the other sorted by the count of web pages pointing to the target web page. Doing so allows the user to search either by the URL name or by the count of web pages.
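
To give the flavor of the inversion step on a small scale, the following sketch computes in-coming counts from the out-going counts of Section 3.2 and ranks pages by popularity; only row p0 is filled in, so the numbers are purely illustrative, and a list-of-lists implementation would follow the same logic with pointers rather than array indices.

#include <stdio.h>

#define N 4

int main(void)
{
    /* out[i][j] = number of links on page i pointing to page j.
       Only row p0 is filled in here, using the counts from Section 3.2;
       the remaining rows are left at zero.                              */
    int out[N][N] = { {1, 1, 3, 0} };

    /* Invert: the in-coming count of page j is the sum of column j. */
    int indeg[N] = {0};
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            indeg[j] += out[i][j];

    /* Rank pages by popularity with a simple exchange sort on an index
       array; a real search engine would sort a list of pointers instead. */
    int order[N];
    for (int j = 0; j < N; j++)
        order[j] = j;
    for (int a = 0; a < N; a++)
        for (int b = a + 1; b < N; b++)
            if (indeg[order[b]] > indeg[order[a]]) {
                int t = order[a]; order[a] = order[b]; order[b] = t;
            }

    for (int a = 0; a < N; a++)
        printf("p%d: %d in-coming link(s)\n", order[a], indeg[order[a]]);
    return 0;
}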

Deliverable of phase four

You are to submit to your Gitlab account the following by the deadline.

  1. Completed code of your programs.
  2. Data file(s) that are used and generated by the programs.
  3. Screen shots of sample runs. Try your program with a few different web pages. The output should be a count of the in-coming URL links to the target page and up to 20 actual URLs that are clickable. Also try your program with the query for the top 20 most popular pages.
  4. A text file that contains a brief description of this phase of the project: it should include team member names, highlight the challenges your team encountered, and mention any thoughts you'd like to share. In particular, instructions on how to run your program(s) should be included. The length of the description should be no more than two pages or 800 words.