Project 1: Search Engine - A Client Server Program Pair

Phase 2: Build a Simple Web Search Engine

The second phase of the project asks you to create a web server (a simple search engine) that can answer user queries about a web page your program has visited and indexed. The actual data (web pages) can be collected manually by using the web client program from the first phase. The queries your search engine needs to handle are very simple. For example, if the query is network, your server should return a list of web pages that contain the word network. A typical user interaction would look as follows (here we only list text; you should be able to do it in a browser).

The following sections discuss some of the technical details of how we can implement such a simple search engine.

2.1 Creating a server program as a web server

A web server is essentially a TCP-based server program that uses HTTP as its application-layer protocol. Take a look at the echo-server.c program. (You can certainly study a slightly more complicated and more complete set of programs in web-server.c.)

In this program, the server waits for client connection requests at an agreed-upon port. Once a connection is accepted, the server reads a string from the client and sends it right back to the client with an inserted phrase "Echo --> ". From the client's point of view, after a connection is accepted by the server, the client sends a message to the server, reads the message sent back by the server, and prints it on the screen. This is the application protocol for this particular echo service!
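The server side of this echo exchange can be sketched as follows. This is a minimal illustration, not the course's echo-server.c: the function name echo_once and the buffer size MAX_MSG are assumptions, and the connection is assumed to have been accepted already.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define ECHO_PREFIX "Echo --> "
#define MAX_MSG 1024            /* assumed upper bound on one message */

/* Serve one exchange on an already-accepted connection:
 * read one message, then send it back with the prefix inserted.
 * Returns the number of bytes written, or -1 on error. */
ssize_t echo_once(int conn_fd)
{
    char in[MAX_MSG];
    char out[sizeof(ECHO_PREFIX) + MAX_MSG];

    assert(conn_fd >= 0);
    ssize_t n = read(conn_fd, in, sizeof(in) - 1);
    if (n <= 0)
        return -1;              /* client closed the connection or read failed */
    in[n] = '\0';

    int len = snprintf(out, sizeof(out), "%s%s", ECHO_PREFIX, in);
    return write(conn_fd, out, (size_t)len);
}
```

In a full server this function would be called with the descriptor returned by accept(), typically inside the accept loop.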

For a search engine, the interaction between a client and a server follows the HTTP protocol, which is slightly more complicated than a service such as echo. To understand how HTTP works, first let's do the following experiment.

In your echo-server.c program, after reading the input from a client, instead of echoing the message back to the client, your echo server program prints what is read from the client on the screen, and then sends back the following message to the client.

Note that your server now sends back an HTTP status line first (HTTP/1.0 200 OK), followed by two carriage-return/newline pairs (\r\n\r\n): the first ends the status line and the second is the blank line that ends the header. The header is then followed by a simple, but complete web page.
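One way to assemble such a response is sketched below. The function name build_response is an assumption, and the page text passed in is up to you; only the status line and the blank line are taken from the description above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build "status line + blank line + page" into out.
 * Returns the response length, or -1 if it does not fit in cap bytes. */
int build_response(char *out, size_t cap, const char *html)
{
    assert(out != NULL && html != NULL);
    /* The first \r\n ends the status line; the second is the blank line. */
    int n = snprintf(out, cap, "HTTP/1.0 200 OK\r\n\r\n%s", html);
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}
```

The server would then send the buffer with write(conn_fd, out, n), where conn_fd is the accepted connection.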

Now start a web browser. Assuming your server program is running on a lab machine, e.g., dana132-lnx-4, enter your echo server's URL in the browser's address bar, e.g., http://dana132-lnx-4:2500, where "2500" is the port number on which your echo server is listening. Observe the behavior of the server.

When your server program prints what it reads from the client (a browser), you should see something similar to the following on your screen. We will explore the meaning of this request later.

You should see the browser display the content sent back by the server.

The above is the simplest scenario of interaction between a web client and a web server. How does a client request a specific web page from a web server? In the same directory where your simple web server resides, create a simple web page with the following content (you can certainly make a more elaborate web page if you like). Save the file as simple.html or under a name of your choice.

Set the access permission of simple.html as readable by the world. If you are not familiar with how to set permissions, please read the manual page on the command chmod. For what we need, you can simply do

chmod 644 simple.html

which sets the file readable by all and writable by the owner (you) only.

Revise your echo-server.c program by the following steps.

  1. Revise the write() statement in the echo-server.c program so that it sends back only the HTTP status line ("HTTP/1.0 200 OK"), without the in-line web page. Remember that the status line must be followed by a blank line (a carriage return and newline on a line by themselves), which marks the end of the header.
  2. Have your server program open and read the file simple.html from disk and send its contents directly to the client (web browser) using the write() system call. The browser that made the request should then display simple.html, and you should be able to click the hypertext link within that page.
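The two steps above can be sketched as a single helper. The names send_file and write_all are assumptions for illustration; the sketch always reports 200 OK and simply returns -1 when the file cannot be opened.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write all len bytes, retrying after short writes. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n <= 0)
            return -1;
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Send the status line and blank line, then the file's bytes.
 * Returns 0 on success, -1 on any error. */
int send_file(int conn_fd, const char *path)
{
    static const char header[] = "HTTP/1.0 200 OK\r\n\r\n";
    char chunk[4096];
    ssize_t n = 0;

    assert(path != NULL);
    int file_fd = open(path, O_RDONLY);
    if (file_fd < 0)
        return -1;                      /* missing or unreadable file */

    int rc = write_all(conn_fd, header, sizeof(header) - 1);
    while (rc == 0 && (n = read(file_fd, chunk, sizeof(chunk))) > 0)
        rc = write_all(conn_fd, chunk, (size_t)n);

    close(file_fd);
    return (rc == 0 && n == 0) ? 0 : -1;
}
```

Reading in fixed-size chunks keeps memory use constant regardless of the file size, and the retry loop covers the case where write() accepts fewer bytes than requested.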

Now let's read the original client request. The client (a web browser) sends a request to the web server when it connects to the server. The command GET / HTTP/1.1 indicates that the browser wants to read the root HTML page on the server. If a browser requests a specific file, e.g., simple.html, the GET command looks as follows.

GET /simple.html HTTP/1.1

Confirm this by changing the URL that the browser uses to access the web page to

http://host-name:2500/simple.html

Reload the page (refresh the browser). You should still see the content of simple.html displayed on the browser screen. In addition, you should see on the server side that the parameter of the GET command has changed from the root "/" to "/simple.html". The other common parts of a browser request remain the same, including the host name of the server, the agent name (the browser), the content types the browser can handle (e.g., text/html or application/xml), the accepted language, and the accepted encoding mechanism, among others.
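To act on the requested path, the server has to pull it out of the first line of the request. A minimal sketch follows; the function name parse_request_line and the buffer sizes are assumptions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Extract the method and path from the first line of a request,
 * e.g. "GET /simple.html HTTP/1.1". Returns 0 on success, -1 otherwise. */
int parse_request_line(const char *request, char method[8], char path[256])
{
    assert(request != NULL);
    /* %7s and %255s bound the copies to the buffer sizes. */
    if (sscanf(request, "%7s %255s", method, path) != 2)
        return -1;
    return 0;
}
```

A server would typically map the path "/" to a default file and strip the leading slash before opening anything else. It should also refuse paths containing ".." so that a client cannot read files outside the server's directory.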

2.2 How to send a query to a web server

Now that we know how a web client (e.g., a browser) interacts with a web server, we can turn our attention to three questions: how to make a web server a search engine, how a web client such as a browser sends a query to a search engine, and how the search engine sends the search results back to the client.

A web client can send a piece of information such as a query to a web server by using the HTTP POST command. Let's first concentrate on the client side to see how we can post a request to the server.

The basic mechanism for posting a query from a web client is form submission in HTML. Once again, let's change the program echo-server.c to make it accept and process a query from a client. Instead of sending simple.html to the client, let's have the server send back the file form.html, which reads as follows.

Now have your browser point to your server, e.g.,
http://host-name:2500/

This time, instead of a simple web page, your browser screen should display a form with two input text boxes and a Submit button. Type some values into the two input text boxes and click the Submit button. In our example, we use an integer 123 and a string "abc" as the input values. Have your server read and print whatever the client sends to the server. You should see something similar to the following printed on the server's screen.

Notice that in the information sent from the browser to the server after clicking the "Submit" button, the command is "POST" instead of "GET" because the client is sending (posting) information to the server. The message includes both the size and the content of that information. The size is given by the Content-length header, whose value of 44 means that the content of the form is 44 bytes long. The actual content of the form follows the header, separated from it by a blank line. Remember that a carriage return (CR) and newline (NL) pair on a line by itself indicates the end of the header in HTTP. In our example, the content is a string of 44 characters as follows.

FirstInput=123&SecondInput=abc&Submit=Submit
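Locating this content in the raw request can be sketched as follows. The function names content_length and request_body are assumptions, and the sketch assumes the whole request is already in one NUL-terminated buffer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX) */

/* Return the Content-Length value from the header, or -1 if it is absent.
 * Header names are case-insensitive, so the comparison ignores case. */
long content_length(const char *request)
{
    static const char name[] = "Content-Length:";
    assert(request != NULL);
    for (const char *line = request; line != NULL && *line != '\0'; ) {
        if (line[0] == '\r' || line[0] == '\n')
            break;                              /* blank line: end of header */
        if (strncasecmp(line, name, sizeof(name) - 1) == 0)
            return strtol(line + sizeof(name) - 1, NULL, 10);
        line = strchr(line, '\n');              /* advance to the next line */
        if (line != NULL)
            line++;
    }
    return -1;
}

/* Return a pointer to the body (just past the blank line), or NULL. */
const char *request_body(const char *request)
{
    assert(request != NULL);
    const char *end = strstr(request, "\r\n\r\n");
    return end != NULL ? end + 4 : NULL;
}
```

A single read() is not guaranteed to deliver the entire request, so a robust server keeps reading until it has received as many body bytes as Content-Length announces.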

The content of the form submission is divided into key=value pairs separated by &. For example, the above submission contains three pairs. The first has the key FirstInput, which is the name of the first text input box in the original form, and the value 123. The second has the key SecondInput and the value abc, the string we typed. The third corresponds to the Submit button. By parsing this form submission, the server can figure out what the client sent. If the server is a search engine, the server then takes this input as a search query and sends the search result back to the client in an HTML format.
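Looking up one value in such a body can be sketched like this. The function name form_value is an assumption, and the sketch copies the value exactly as sent: browsers URL-encode form data (a space arrives as '+', other special characters as %XX), and decoding is left out here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copy the value stored under key in a "k1=v1&k2=v2" body into out.
 * Returns 0 if the key is present, -1 otherwise. No URL decoding is done. */
int form_value(const char *body, const char *key, char *out, size_t cap)
{
    assert(body != NULL && key != NULL && out != NULL && cap > 0);
    size_t klen = strlen(key);

    for (const char *p = body; *p != '\0'; ) {
        size_t pair_len = strcspn(p, "&");      /* length of this key=value pair */
        if (pair_len > klen && strncmp(p, key, klen) == 0 && p[klen] == '=') {
            /* %.*s limits the copy to the value; snprintf limits it to cap. */
            snprintf(out, cap, "%.*s", (int)(pair_len - klen - 1), p + klen + 1);
            return 0;
        }
        p += pair_len;
        if (*p == '&')
            p++;                                /* step over the separator */
    }
    return -1;
}
```

With the example body, form_value(body, "SecondInput", value, sizeof(value)) leaves "abc" in value.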

2.3 Make your server program a search engine

In the first phase of the project, you wrote a client program that can access a web page and extract all the out-going HTML links in the target web page. In this phase of the project, your server program reads this information (e.g., from a file saved by the program in the first phase) and sends the result back to the client (typically, a browser) in an HTML format.

The basic idea of a search engine is to return, for a given query, the web pages that contain the words in the query. For a fully functional search engine, your program will need to crawl many more pages and collect more information to answer user queries. Because your program visits only one page in the first phase of the project, we will work with one page only at this time. In this phase, if a user enters a query whose words appear in the web page you collected, your program (search engine) should return the link to that web page; if the queried words do not appear in the web page at all, your search engine should return an appropriate message to the client, e.g., "No page was found."
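The two possible replies can be produced by one helper, sketched below. The function name build_result_page and the exact HTML markup are assumptions; the URL is inserted as-is, without HTML escaping.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the HTTP response for a search result.
 * url is the matching page, or NULL (or empty) when nothing matched.
 * Returns the response length, or -1 if out is too small. */
int build_result_page(char *out, size_t cap, const char *url)
{
    int n;
    assert(out != NULL);
    if (url != NULL && strlen(url) > 0)
        n = snprintf(out, cap,
                     "HTTP/1.0 200 OK\r\n\r\n"
                     "<html><body><p><a href=\"%s\">%s</a></p></body></html>",
                     url, url);
    else
        n = snprintf(out, cap,
                     "HTTP/1.0 200 OK\r\n\r\n"
                     "<html><body><p>No page was found.</p></body></html>");
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}
```

Returning a normal 200 OK page for the "not found" case keeps the browser showing your message rather than its own error page.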

Assume your client program from Phase One of the project has visited the following web page whose URL is http://www.eg.bucknell.edu/~xmeng/simple.html.

The first phase of your program has parsed out URLs contained in the page, if any (in the above example, you should have one URL). Your phase-one client program now needs to also parse out all the words and put them in a list or a hash table. For example, we would have the following list of words.

html
body
p
hello
happy
browser
this
is
my
page
a
href
http
www
bucknell
edu
website
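A word list of this kind can be produced by a small tokenizer, sketched below under some assumptions: a word is any run of letters or digits, words are lowercased, and repeats are skipped. The name extract_words and the limits MAX_WORDS and MAX_WORD_LEN are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define MAX_WORDS 256
#define MAX_WORD_LEN 32

/* Split text into lowercase alphanumeric words, skipping repeats.
 * Words longer than MAX_WORD_LEN - 1 characters are truncated.
 * Returns the number of distinct words stored in words[]. */
size_t extract_words(const char *text, char words[][MAX_WORD_LEN])
{
    size_t count = 0;
    assert(text != NULL);

    while (*text != '\0' && count < MAX_WORDS) {
        char word[MAX_WORD_LEN];
        size_t len = 0;

        while (*text != '\0' && !isalnum((unsigned char)*text))
            text++;                               /* skip separators */
        while (isalnum((unsigned char)*text)) {
            if (len < MAX_WORD_LEN - 1)
                word[len++] = (char)tolower((unsigned char)*text);
            text++;
        }
        if (len == 0)
            break;                                /* reached the end of text */
        word[len] = '\0';

        size_t i = 0;
        while (i < count && strcmp(words[i], word) != 0)
            i++;                                  /* linear duplicate check */
        if (i == count)
            strcpy(words[count++], word);
    }
    return count;
}
```

Because tag names are runs of letters too, words such as html, body, and p end up in the list alongside the visible text.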

Theoretically, if a user types a query that contains any of these words, your server program (search engine) should return the current URL that contains the word to the user as a search result; in our case, this URL would be http://www.eg.bucknell.edu/~xmeng/simple.html. However, many of the words in this web page do not provide any meaningful result for the purpose of search. For example, consider the words html, p, this, is, my, a, href, http, www. These words would appear in almost all web pages. Your program has to eliminate these words from your list. A common way of accomplishing this task is to consult a stop-word list and remove from your list any word that appears in it. You can make up your own stop-word list, or consult the web for a commonly accepted one. Here is an example of a publicly available stop-word list.
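Stop-word filtering can be sketched as follows. The stop-word array here is a tiny illustrative list covering only this example, not a commonly accepted one, and the function names are assumptions.

```c
#include <assert.h>
#include <string.h>

/* A tiny illustrative stop-word list; a real one would be much longer. */
static const char *const STOP_WORDS[] = {
    "a", "href", "html", "http", "is", "my", "p", "this", "www"
};

/* Return 1 if word is a stop word, 0 otherwise. */
int is_stop_word(const char *word)
{
    assert(word != NULL);
    for (size_t i = 0; i < sizeof(STOP_WORDS) / sizeof(STOP_WORDS[0]); i++)
        if (strcmp(word, STOP_WORDS[i]) == 0)
            return 1;
    return 0;
}

/* Remove stop words in place, preserving order. Returns the new count. */
size_t remove_stop_words(const char *words[], size_t count)
{
    size_t kept = 0;
    for (size_t i = 0; i < count; i++)
        if (!is_stop_word(words[i]))
            words[kept++] = words[i];
    return kept;
}
```

Applied to the 17 words listed earlier, this leaves 8 words: body, hello, happy, browser, page, bucknell, edu, and website.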

In addition, for your program to search for a particular word in the list, you need to either sort the list or put the list into a hash table to allow quick lookup. Assume we sort the list of words. Then after removing the common stop words and sorting, the list would look as follows.

body
browser
bucknell
edu
happy
hello
page
website

If a user enters any of these words as a query, your search engine should return the one URL you have as the result. Any other query should result in a message such as "No page was found." In the next (final) phase of the project, you will be asked to collect many more pages from the CS course websites and perform similar searches on these pages.
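With a sorted list, the lookup can use the standard library's qsort() and bsearch(). The sketch below assumes a single indexed page and a query that is already lowercased; the names sort_words and lookup are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Comparison function for arrays of C strings (used by qsort and bsearch). */
static int compare_words(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Sort the word list alphabetically so that it can be binary-searched. */
void sort_words(const char *words[], size_t count)
{
    qsort(words, count, sizeof(words[0]), compare_words);
}

/* Return url if query is in the sorted word list, otherwise NULL. */
const char *lookup(const char *query, const char *const words[],
                   size_t count, const char *url)
{
    assert(query != NULL);
    return bsearch(&query, words, count, sizeof(words[0]), compare_words)
               != NULL ? url : NULL;
}
```

A NULL result is the cue to send back the "No page was found." message. Once several pages are indexed, each word would need to map to a list of URLs instead of a single one.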

Here is a typical scenario of the interaction between the server (your program) and a client (e.g., a browser).

How should a query with multiple words be handled? If the user enters a query with multiple words, you can treat each word separately (i.e., take these words as if they mean "word1" or "word2" or "word3", etc.). If you have time, you can make it a bit fancier by allowing the user to use quotes to indicate that the words should be ANDed together. That is, if the query has the form "word1 word2 word3" (three words inside a pair of quotes), your search engine requires all three words to appear in a web page before returning that page. For example, using the above simple.html, if the user enters the query body arm, the search engine should return the URL of the web page because the page contains the word body, even though arm does not appear in the page. However, if the user enters the query as "body arm", then the search engine should return "No page was found." because the page does not contain both words. This feature is optional, not required.
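Both behaviors can be sketched in one function. The names query_matches and in_index are assumptions, and the query is assumed to be URL-decoded already (in a form submission, a space arrives as '+' and a double quote as %22).

```c
#include <assert.h>
#include <string.h>

/* Return 1 if the len-character word is in the index (linear scan for brevity). */
static int in_index(const char *word, size_t len,
                    const char *const words[], size_t count)
{
    for (size_t i = 0; i < count; i++)
        if (strlen(words[i]) == len && strncmp(words[i], word, len) == 0)
            return 1;
    return 0;
}

/* Unquoted query: match if ANY word is indexed (OR).
 * Query wrapped in double quotes: match only if ALL words are indexed (AND). */
int query_matches(const char *query, const char *const words[], size_t count)
{
    assert(query != NULL);
    size_t qlen = strlen(query);
    int require_all = qlen >= 2 && query[0] == '"' && query[qlen - 1] == '"';
    int seen = 0, hits = 0;

    for (const char *p = query; *p != '\0'; ) {
        p += strspn(p, " \"");                  /* skip spaces and quotes */
        size_t len = strcspn(p, " \"");         /* length of the next word */
        if (len == 0)
            break;
        seen++;
        hits += in_index(p, len, words, count);
        p += len;
    }
    if (seen == 0)
        return 0;                               /* empty query matches nothing */
    return require_all ? hits == seen : hits > 0;
}
```

With the eight indexed words from the example, the query body arm matches, while the quoted query "body arm" does not.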

Deliverables of phase two

You are to submit the following to your GitLab account by the deadline.

  1. Completed code of your server (and client from the first phase if your server depends on that for its data).
  2. A set of instructions on how to run your program, including the crawling program if your server program depends directly on it.
  3. Screenshots of sample runs. Try your server program (and thus the client from the first phase as well) with a few different web pages.
  4. A text file that contains a brief description of this phase of the project. It should include team member names, highlight the challenges your team encountered, and share any thoughts that you'd like to add. The description should be no longer than two pages or 800 words.