The goal of the project is to produce a limited-scale but functional search engine. The search engine should provide a count and a partial list of the web pages that point to a searched (target) web page. For example, if one searches for www.bucknell.edu, your search engine should report that a total of 2,345 web pages (not a real count) in the Bucknell domain point to www.bucknell.edu, and list the first n URLs pointing to the target page, where n is a reasonable number (e.g., 30 or 50). The search engine you are about to build is limited in scale in that it is only required to collect a limited number of documents (e.g., on the order of a few thousand to a few tens of thousands of pages). The more pages your search engine can collect, the better. It would be difficult, if not impossible, for a lab desktop computer to collect all pages in a given domain, e.g., bucknell.edu.
This is a multi-phased team project. You can work in a team of two to three, or you can work alone; you are encouraged to work in teams. You are asked to write the program in C. However, if you have a good reason to develop the project in another programming language, please discuss it with me. In general, if you use a more modern language, you will be asked to provide more functionality in the project.
Phase 1: Wednesday February 6th.
Phase 2: Wednesday February 13th.
Phase 3: Wednesday February 27th.
Phase 4: Friday March 8th.
Phase 5: Monday March 25th.
Before implementing the project, I would ask that all of us keep the following in mind.
Note that it is fine to study such public code and the algorithms used in it, and to implement the ideas by yourself (that is, you can't just copy-and-paste the code in its entirety or use a set of libraries without doing the two things listed above). In this case, you should also cite the references.
- Use the nice command to lower the priority of your process.
- Do not run your crawler on the shared Linux servers (linuxremote, or linuxremote1, linuxremote2, and linuxremote3). Rather, run your program on the lab computers in Breakiron 164 or Dana 213. You may remote log into the servers, but please ssh into the lab computers before running your programs. The names of these lab computers are dana213-lnx-n or brki164-lnx-n, where n is a number (1-24).
- If you create multiple threads or processes (e.g., with fork()), the one thread (process) that visits web servers must pause or sleep after visiting every few tens of pages.

A search engine consists of two major parts, somewhat independent of each other, as can be seen from the figure. One is on the left side of the Document Collection, which answers users' queries. The other is on the right side of the Document Collection, which collects information from the web so that the URLs related to user queries can be retrieved. A crawler goes around the web to read web pages. The information is then sent to a parser. (In a general-purpose search engine, a more elaborate indexer is needed. Since we are only interested in the hyper-text links that point to a particular web page, we can omit the indexing part.) The parser extracts the URLs contained in the given web page; these URLs point from the given page to other web pages. If the current web page is denoted as w and the set of URLs extracted from w is s, then effectively page w points to each page si for all i in the set s. When a user issues a query, the document collection is searched and a list of URLs is generated. Each of the URLs presented as an answer to the query points to the queried web page (target).
Figure: Architecture of a Simple Search Engine
The first phase of the project is to retrieve a web page and parse out all the URLs contained in the web page. You are to construct a TCP client program that is able to send an HTTP GET request to a web server and retrieve the page back from the server. Once a page is retrieved, the second task of this phase is to parse out all the URLs in the page.
An HTTP client (a.k.a. web client) establishes a connection to a web server through a TCP socket. After creating the socket, the client makes a connection request to the server, the server acknowledges the connection and now the communication between the server and the client is established. Once a TCP connection is ready, the client can send either an HTTP 1.0 request or an HTTP 1.1 request. The server will send back the page being requested. The client then can process the page.
See the web client program example for code details. The client program in that example uses a library, tcplib.c; you are asked to use your own wrapper programs developed in your lab assignment instead. In addition, while the example program simply prints the received web page to the standard output (screen), your program must save the content in a variable that can be used by other parts of the program.
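As a rough sketch (not the required implementation), here is what such a client could look like written directly against the socket interface; the host, path, port, and buffer sizes below are placeholder choices, and your version should use your own wrapper routines:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void) {
    const char *host = "www.eg.bucknell.edu";  /* placeholder host */
    const char *path = "/~xmeng/index.html";   /* placeholder path */
    struct hostent *hp;
    struct sockaddr_in servaddr;
    char request[512], buf[4096];
    char *page = NULL;                         /* the page, kept in memory */
    size_t total = 0;
    ssize_t n;
    int sockfd;

    /* look up the server's IP address */
    hp = gethostbyname(host);
    if (hp == NULL) { fprintf(stderr, "unknown host %s\n", host); exit(1); }

    /* create a TCP socket and connect to port 80 */
    sockfd = socket(AF_INET, SOCK_STREAM, 0);
    memset(&servaddr, 0, sizeof(servaddr));
    servaddr.sin_family = AF_INET;
    memcpy(&servaddr.sin_addr, hp->h_addr_list[0], hp->h_length);
    servaddr.sin_port = htons(80);
    if (connect(sockfd, (struct sockaddr *)&servaddr, sizeof(servaddr)) < 0) {
        perror("connect"); exit(1);
    }

    /* send an HTTP 1.0 GET request */
    snprintf(request, sizeof(request),
             "GET %s HTTP/1.0\r\nHost: %s\r\n\r\n", path, host);
    write(sockfd, request, strlen(request));

    /* accumulate the whole reply in a buffer instead of printing it */
    while ((n = read(sockfd, buf, sizeof(buf))) > 0) {
        page = realloc(page, total + n + 1);
        if (page == NULL) { perror("realloc"); exit(1); }
        memcpy(page + total, buf, n);
        total += n;
    }
    if (page != NULL)
        page[total] = '\0';
    printf("retrieved %zu bytes\n", total);

    free(page);
    close(sockfd);
    return 0;
}

Note how the reply is accumulated into the variable page rather than printed, so the parser built in the next task can work on it.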
The second task in this phase of the project is to parse out all URLs contained in the received page. The following is a simple web page that contains multiple URLs. Your task is to extract all the URLs and to complete the URLs that are relative.
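The sample page itself is not reproduced in this section; a small stand-in containing both absolute and relative URLs (all of them made up) would be:

<html>
<head><title>Example Page with URLs</title></head>
<body>
<p>An absolute URL: <a href="http://www.bucknell.edu/admissions.html">Admissions</a></p>
<p>A relative URL: <a href="../teaching.html">Teaching</a></p>
<p>Another relative URL: <a href="./courses/csci363/index.html">CSCI 363</a></p>
</body>
</html>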
Here are discussions on some technical details that may help you to carry out the task.
URLs come in two forms, absolute and relative. An absolute URL contains the complete path to the resource, e.g., http://www.eg.bucknell.edu/~xmeng/teaching.html#eg290. When an absolute URL is extracted from a web page, it can be used directly to retrieve the web page specified by the URL. On the other hand, a relative URL gives a path relative to the current web page. For example, if the web page from which the relative URL is extracted is http://www.eg.bucknell.edu/~xmeng/index.html (often called a base URL) and the relative URL ../teaching.html is extracted from the given page, then the absolute URL should be http://www.eg.bucknell.edu/~xmeng/teaching.html. Your program will need to convert all relative URLs to absolute URLs.

The basic strategy for converting a relative URL to an absolute URL is to scan the URL file path from left to right: every time a current-directory notation "./" is met, simply remove these two characters from the URL; every time a parent-directory notation "../" is met, remove the directory name above this level. For example, given the relative URL "level1/./level2/../level3/file.html", its absolute form after the process should be "level1/level3/file.html". When the file path scan and conversion is finished, attach the entire path to the current host (e.g., http://www.bucknell.edu/ or http://www.cnn.com/).
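As a sketch of this left-to-right scan in C (normalize_path is a hypothetical helper, not part of the required interface; it keeps the logic simple and assumes the path never climbs above its first component):

#include <stdio.h>
#include <string.h>

/* Collapse "./" and "../" components of a '/'-separated path, in place.
   A sketch only: it keeps at most 128 components. */
void normalize_path(char *path) {
    char copy[1024];
    char *parts[128];
    int depth = 0, i;

    strncpy(copy, path, sizeof(copy) - 1);
    copy[sizeof(copy) - 1] = '\0';

    /* scan left to right, keeping a stack of directory names */
    for (char *tok = strtok(copy, "/"); tok != NULL; tok = strtok(NULL, "/")) {
        if (strcmp(tok, ".") == 0)
            continue;                    /* "./": just drop it */
        else if (strcmp(tok, "..") == 0) {
            if (depth > 0) depth--;      /* "../": pop the directory above */
        } else if (depth < 128)
            parts[depth++] = tok;        /* ordinary component: push it */
    }

    /* rebuild the normalized path */
    path[0] = '\0';
    for (i = 0; i < depth; i++) {
        strcat(path, parts[i]);
        if (i < depth - 1)
            strcat(path, "/");
    }
}

int main(void) {
    char url[1024] = "level1/./level2/../level3/file.html";
    normalize_path(url);
    printf("%s\n", url);   /* prints: level1/level3/file.html */
    return 0;
}

Keeping a stack of path components is equivalent to the character-removal description above and is easier to get right in C.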
You can approach the problem in many different ways. Here is what I would suggest.
Build a client program using the socket interface first, similar (or identical) to the example given in code/client-server-c/webclient.c. This client program should be able to retrieve a web page from a given URL.

You are to submit the following to your Gitlab account by the deadline for this phase: in the project1 folder under the csci363-s13 repository you created earlier, upload (push) all your files in the project1 directory.
The second phase of the project asks you to create a web server (a simple search engine) that can answer user queries about a web page your program has visited and indexed. The type of query your search engine needs to handle is very simple: the query gives a URL for a web page, and your server program should be able to provide information such as the number of out-going URLs collected from this web page and a list of the first 20 of these URLs, if the target web page has that many out-going URLs. A typical user interaction would look as follows (here we only list text; you should be able to do it in a browser).
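The original transcript is not reproduced in this section; an illustrative exchange, with a made-up URL, count, and result list, might read:

query: http://www.eg.bucknell.edu/~xmeng/index.html
3 out-going URLs were collected from this page:
  1. http://www.eg.bucknell.edu/~xmeng/teaching.html
  2. http://www.eg.bucknell.edu/~xmeng/research.html
  3. http://www.bucknell.edu/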
The following sections discuss some of the technical details of how we can implement such a simple search engine.
A web server is essentially a TCP-based server program that follows the HTTP protocol as its application-layer protocol. Take a look at the echo-server.c program in the following.
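The course's full echo-server.c is not reproduced here; the sketch below is a minimal version of such an echo server, written directly against the socket interface (port 2500 is taken from the discussion later in this section):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>

#define PORT 2500
#define BUFSIZE 1024

int main(void) {
    int listenfd, connfd;
    struct sockaddr_in servaddr;
    char buf[BUFSIZE];
    char reply[BUFSIZE + 16];
    ssize_t n;

    /* create a TCP socket and bind it to the agreed-upon port */
    listenfd = socket(AF_INET, SOCK_STREAM, 0);
    if (listenfd < 0) { perror("socket"); exit(1); }

    memset(&servaddr, 0, sizeof(servaddr));
    servaddr.sin_family = AF_INET;
    servaddr.sin_addr.s_addr = htonl(INADDR_ANY);
    servaddr.sin_port = htons(PORT);
    if (bind(listenfd, (struct sockaddr *)&servaddr, sizeof(servaddr)) < 0) {
        perror("bind"); exit(1);
    }
    if (listen(listenfd, 5) < 0) { perror("listen"); exit(1); }

    for (;;) {
        /* wait for a client connection */
        connfd = accept(listenfd, NULL, NULL);
        if (connfd < 0) { perror("accept"); continue; }

        /* read one message and echo it back with the inserted phrase */
        n = read(connfd, buf, sizeof(buf) - 1);
        if (n > 0) {
            buf[n] = '\0';
            snprintf(reply, sizeof(reply), "Echo --> %s", buf);
            write(connfd, reply, strlen(reply));
        }
        close(connfd);
    }
    return 0;   /* not reached */
}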
In this program, the server waits for client connection requests at an agreed-upon port. Once a connection is accepted, the server reads a string from the client and sends it right back with the inserted phrase "Echo --> ". From the client's point of view, once the connection is accepted by the server, the client sends a message to the server, reads the message sent back by the server, and prints it on the screen. This is the application protocol for this particular echo service!
For a search engine, the interaction between a client and a server follows the HTTP protocol, which is slightly more complicated than a service such as echo. To understand how HTTP works, first let's do the following experiment.
In your echo-server.c program, after reading the input from a client, instead of echoing the message back, print what is read from the client on the screen, and then send back the following message to the client.
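The exact message is not reproduced in this section; a version consistent with the description below would be sent with write() like this (the page text itself is only a placeholder):

/* HTTP status line, a blank line, then a small web page */
const char *response =
    "HTTP/1.0 200 OK\r\n"
    "\r\n"
    "<html><body>\n"
    "<h1>Hello from my server!</h1>\n"
    "</body></html>\n";
write(connfd, response, strlen(response));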
Note that your server now essentially sends back an HTTP response code first (HTTP/1.0 200 OK), followed by two carriage-return/newline pairs (\r\n\r\n). The HTTP response code is then followed by a web page.
When your program prints what it reads from the client (a browser), you should see something similar to the following on your screen. We will explore the meaning of this request later.
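The exact text varies from browser to browser; a typical request looks roughly like the following (the header values here are illustrative):

GET / HTTP/1.1
Host: dana213-lnx-1:2500
User-Agent: Mozilla/5.0 (X11; Linux x86_64) ...
Accept: text/html,application/xhtml+xml
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive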
Compile and run the server program. Assume the server program is running on a lab computer, e.g., dana213-lnx-1. Have your favorite web browser point at the URL dana213-lnx-1:2500. Here the server name is dana213-lnx-1 and 2500 is the port at which the server program is running. You should use the name of the computer on which your server program is running and the port number that the server program is using. You should see the browser display the content sent back by the server.
The above is the simplest scenario of interaction between a web client and a web server. How does a client request a specific web page from a web server? In the same directory where your simple web server resides, create a simple web page with the following content (you can certainly make a more elaborate web page if you would like), calling the file simple.html or a name of your choice.
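The page content is not reproduced in this section; a minimal stand-in, with a hypertext link included so a later step can be tested, might be:

<html>
<head><title>A Simple Page</title></head>
<body>
<h1>My Simple Web Page</h1>
<p>This page is served by my own web server.</p>
<p>Here is a link to <a href="http://www.bucknell.edu/">Bucknell University</a>.</p>
</body>
</html>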
Set the access permission of simple.html so it is readable by the world. If you are not familiar with how to set permissions, please read the manual page for the command chmod. For what we need, you can simply do
chmod 644 simple.html
which sets the file readable by all and writable by the owner (you) only.
Revise your echo-server.c program by the following steps.
- Change the write() statement in the echo-server.c program so that it only sends back the HTTP response code ("HTTP/1.0 200 OK" plus the blank line), not the in-line simple web page code.
- Read the content of simple.html and send it to the client with the write() system call.

Doing so, the web browser that made the request to your server should see simple.html displayed, and you should be able to click the hypertext link from within that page.

Now let's read what the client originally requested. The client (a web browser) sends a request to the web server when the browser makes a connection to the server. The command GET / HTTP/1.1 indicates that the browser wants to read the root HTML page at the server. If a browser requests a specific file, e.g., simple.html, the parameter of the GET command looks as follows.
GET /simple.html HTTP/1.1
Confirm this phenomenon by having the browser access the web page with the URL
http://dana213-lnx-1:2500/simple.html
Load the web page again. You should still see the content of simple.html displayed on the browser screen. In addition, you should see from the server side that the parameter of the GET command has changed from the root "/" to "/simple.html". Other common pieces of a browser's request include the host name of the server, the agent name (the browser), the types of application the browser can handle (e.g., text/html or application/xml), the accepted languages, and the accepted encoding mechanisms, among others.
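Pulling the two revision steps and the GET parameter together, the body of the accept loop might be sketched as follows (the request parsing here is deliberately naive, and the default file name is an assumption):

/* buf already holds the request read from the connected socket connfd */
char reqpath[256];
const char *file = "simple.html";              /* default page */
if (sscanf(buf, "GET %255s", reqpath) == 1 && strcmp(reqpath, "/") != 0)
    file = reqpath + 1;                        /* skip the leading '/' */

FILE *fp = fopen(file, "r");
if (fp == NULL) {
    const char *notfound = "HTTP/1.0 404 Not Found\r\n\r\n";
    write(connfd, notfound, strlen(notfound));
} else {
    const char *ok = "HTTP/1.0 200 OK\r\n\r\n";
    char filebuf[1024];
    size_t nread;
    write(connfd, ok, strlen(ok));
    /* copy the requested file to the socket */
    while ((nread = fread(filebuf, 1, sizeof(filebuf), fp)) > 0)
        write(connfd, filebuf, nread);
    fclose(fp);
}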
Now that we know how a web client (e.g., a browser) interacts with a web server, we can turn our attention to how to make a web server a search engine, how a web client such as a browser sends a query to a search engine, and how a search engine sends the search results back to the client.
A web client can send a piece of information such as a query to a web server by using the HTTP POST command. Let's first concentrate on the client side to see how we can post a request to the server.
The basic mechanism to post a query from a web client is to use a form submission method in HTML. Once again, let's change the program echo-server.c to make it accept and process a query from a client. Instead of sending simple.html to the client, let's have the server send back form.html, which reads as follows.
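form.html itself does not appear in this section; a minimal stand-in consistent with the POST mechanism described above (the field name and action path are assumptions) could be:

<html>
<body>
<h1>Simple Search</h1>
<form method="post" action="/search">
URL to look up: <input type="text" name="url" size="60">
<input type="submit" value="Search">
</form>
</body>
</html>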