Web Information Retrieval -- Xiannnong Meng

Search Engine Project Part 5: Ranking the Search Results

Summer 2014

Project	Part 1	Part 2	Part 3	Part 4	Part 5

Introduction

At the end of the last phase of the project, your programs were able to collect a set of web pages, build an inverted index, and answer user queries by retrieving the posting list for a particular query term from the index list. However the search results are not ranked in any meaningful way. In this phase of the project, you are to add two more components to your program to accomplish the task of ranking the returned search result. One is the term weight which we will use the well-known tf-idf (term-frequency-inverse-document-frequency) in the vector space model. The other is a revised ranking module, which should return a ranked list of documents for a given query.

Ranking URLs Based on User Queries

Ranking web pages based on user queries can be a complicated process. Many factors can be considered in ranking, such as the frequency of a term appeared in a document, the location of the term, the font, and many others. Google, for example, uses over 200 factors to rank its results. (See Google Basics for an overview.) We will use a very simple, but effective measure, tf-idf (term-frequency and inverse-document-frequency) as the basis of our ranking. Term frequency is the the number of times that a term appears in a document; document frequency is the number of documents that contain the term; inverse document frequency is some form of inverse of the document frequency. The measure tf-idf captures the fact that if a term is used more frequently in a particular document, and fewer other documents in the document collection contain this term, then the term is probably more important (more relevant) to represent this particular document.

To build the term weight, you will need to use the statistics your program has collected in the indexing phase of the project. There you should have collected the term frequency for each term in every document. You most likely have saved the term frequency in your posting node. The document frequency of a term is defined as the number of documents that contains the term, which is simply the length of the posting list. With df and tf, we can compute idf, the inverse document frequency and the product of tf and idf. Thus the weight for term i in document j is computed as
\[w_{i,j} = tf_{i,j} * idf_i = tf_{i,j} * log(\frac{N}{df_i}) \]
The value w_i,j typically is stored in the posting node (a.k.a. DocNode) on the posting list. As you can see, your program should be able to compute the term weight for all terms when the indexing is finished, before any search query is processed.

The following is a description of a set of algorithms that can accomplish the above task. We discussed these algorithms in class. You have implemented a portion of it in the indexing phase.
Create an empty index term list I; For each document D in the document set V { For each non-empty token T in D { If T is not already in I Insert T into I; Find the location for T in I; If (T, D) is in the posting list for T Increase its term frequency for T; Else { Create (T, D); Add it to the posting list for T; } } }
Figure 1: Algorithm to create an inverted index

You should have implemented this algorithm in the indexing phase. The algorithm is listed again for references when other algorithms are discussed.

Once the inverted indexing has been established, the document frequency for each term is just the length of the posting list. Since the value of document frequency is going to be used often, it is a good practice to keep the value in the header of its posting list, instead of re-computing when needed. The term frequency for each term in each document is stored in the corresponding posting node. With these two values, we can proceed to compute tf-idf as follows. First, the algorithm to compute IDF is presented.
Let N be the total number of documents; For each token T in I { Determine the total number of documents, M, in which T occurs (the length of T's posting list); Set the IDF for T to be log(N/M); }
Figure 2: Algorithm to compute IDF

As we discussed in the lectures, the search process for a given search query is to find the documents in the collection that are similar to the query term. Similarity can be measured in many different ways. The measure we propose to use is called cosine similarity, which is effective and relatively simple. In cosine similarity measure, the terms in a single query are considered as a vector, so is each document in the collection. A normalized inner product is performed between the document vector consisting of query terms and every other document vector in the document collection that contain any terms in the query. The normalized inner product generates a value between 0 and 1, which is a measure of the angle between the query and other documents. Assume Q is the query document vector, and D_j is a document in the collection, for all values of j. Then the cosine similarity is computed as follows.
\[ sim(Q, D_j) = \frac{Q \bullet D_j}{|Q||D_j|} \]
From the above formula for the inner product, we know the similarity calculation involves the inner product of two vectors and the length of a vector. In addition, we know that the length of the document vectors can and should be computed off-line, while the inner product between the query vector and every other document vector containing the query term has to be computed at the query time, that is, real time. The following algorithm computes the vector length for a document.
Initialize all document vector length to be 0.0; For each token T in I { Let idf be the IDF weight of T; For each token T in document D (use posting list) { Let C be the count of T in D; Increment the vector length of D by (idf * C)^2; } } For each document D in the document set { Set the vector length of D to be the square-root of the current stored length; }
Figure 3: Algorithm to compute document vector length

As can be seen from the algorithms, it would be better if you maintain a document list as well (similar to term list, except this is for documents) where each node stores the document name, document id, and other statistics such as document vector length. Both the posting list and this document vector list should be sorted according to either document name or document id. Using document id throughout the process instead of using its full name would save space if that is a concern.

When a query is received, the query processing algorithm is executed and a list of ranked documents is retrieved. You should decide how many documents should be listed in each return page. Typical choices are listing the top 10 or top 20 as the first page. Providing the capability of navigating through different pages in the returning list can be a complicated task. We do not put this as a primary requirement. Implement this feature if you have extra time.

Finally the following is a description of the algorithm for retrieval when a query is given.
Create a vector Q for the query; Create an empty collection R to store retrieved documents with scores; For each token T in Q { Let idf be the IDF of T and K be the count of T in Q; Set the weight of T in Q: W = K * idf; Let L be the posting list of T from I (the term list); For each node O in L { Let D be the document of O, and C be the term frequency of T in D; If D is not already in R { Add D to R; Initialize its score to 0.0; } Increment D's score by W * idf * C; } } Compute the length L of the vector Q; For each retrieved document D in R { Let S be the current accumulated score of D; // S is the inner-product of D and Q now Let Y be the length of D; Normalize D's final score S = S/(L*Y); // cosine similarity } Sort retrieved documents in R by final score; Return the top-N results as specified;
Figure 4: Inverted-index retrieval algorithm

Your Task

Your task in this phase is to implement the algorithms that can rank the documents based on the query terms using cosine similarity as a measurement of ranking.

What to Hand In

Your team needs to hand in the following in the order given.

A team report for phase three of the project with a cover page. The report shouldn't be too long, maybe two to three pages. The report should include the following as a minimum.

The name of your team (should be same as your search engine);

The name of each team member;

A description of the roles of each team member and the contributions of each member;

A summary of the project, i.e., what the team started with, what the team has accomplished, any problems encountered, how the team solved them, any thoughts on the project, among others.

Team meeting minutes during this phase of the work.

Source code for the programs and any other supporting documents.

Snapshots of sample runs. You can use any copy-and-paste features to save the screen images.

Email the instructor a copy of the complete source code and sample runs in zip or tar format.

Project	Part 1	Part 2	Part 3	Part 4	Part 5