Web Information Retrieval

Xiannong Meng
Computer Science Department
Bucknell University
Lewisburg, PA 17837
May 2014

Homework Four

Due: Monday 06/09/2014 at 11:59 p.m.

Please follow this link to see the submission instructions.

This assignment is designed to familiarize you with the basic vector model in IR. You are to calculate tf-idf weights and the similarity between a query and a set of documents using different similarity measures, given some basic statistics about the documents and the terms (keywords).

Note: It may be easier to use spreadsheet software or to write a program to do the computation and the sorting.

  1. Table 1 lists how many times each term appears in each document of a set of documents.

    Doc/Term   retrieval   database   computer   text   information
    D1         4           10         2          0      1
    D2         3           0          7          4      5
    D3         7           2          4          6      8

    Table 1: Term Frequencies In A Given Set Of Documents

    We also know that the total number of documents in the set is 1000. Table 2 shows the document frequencies of these terms.

    Term                 retrieval   database   computer   text   information
    Document frequency   100         70         220        80     110

    Table 2: Document Frequencies For A Given Set Of Terms

    Compute tf-idf for each of the (doc, term) pairs listed in Table 1. List your results sorted from the largest tf-idf value to the smallest.

  2. Assume we use tf-idf as the term weight in the vector model. Write down the document-term matrix using the results from the previous problem. Remember that a document-term matrix has documents as its rows and terms as its columns.
  3. Now assume we have the query "computer information". Compute the similarity between the query and each of the documents listed in Table 1, using both the inner product similarity and the cosine similarity. Indicate which formula you use for the term weights (either one discussed in class is fine). Which document is the most relevant under each similarity measure? Which one is the least relevant?
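
For those who choose to write a program, the following is a minimal Python sketch of the three computations involved. It rests on several assumptions that are not part of this handout: the weight is taken to be raw term frequency times log10(N / df), which is only one common tf-idf variant and may differ from the formulas discussed in class; the query is given binary weights; and the two-term, two-document collection (documents "A" and "B") is a hypothetical toy example, not the data in Tables 1 and 2. All function and variable names are illustrative.

```python
import math


def tf_idf(tf, df, n_docs):
    """Assumed variant: raw term frequency times log10(N / df)."""
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log10(n_docs / df)


def inner_product(u, v):
    """Sum of the element-wise products of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Inner product divided by the product of the vector lengths."""
    norm_u = math.sqrt(inner_product(u, u))
    norm_v = math.sqrt(inner_product(v, v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return inner_product(u, v) / (norm_u * norm_v)


# Hypothetical toy collection (NOT the homework data): two terms, two documents.
N_DOCS = 1000
DOC_FREQ = [10, 100]                     # document frequency of each term
TERM_FREQ = {"A": [3, 0], "B": [1, 2]}   # term frequencies per document

# One row of the document-term matrix per document.
weights = {
    doc: [tf_idf(tf, df, N_DOCS) for tf, df in zip(tfs, DOC_FREQ)]
    for doc, tfs in TERM_FREQ.items()
}

query = [1.0, 1.0]  # assumption: binary weights, both terms appear in the query

# Rank the documents by each similarity measure, largest score first.
by_inner = sorted(weights, key=lambda d: inner_product(query, weights[d]), reverse=True)
by_cosine = sorted(weights, key=lambda d: cosine(query, weights[d]), reverse=True)

for doc, vec in weights.items():
    print(doc, vec, inner_product(query, vec), round(cosine(query, vec), 4))
print("inner product ranking:", by_inner)
print("cosine ranking:", by_cosine)
```

In this toy collection the two measures disagree: the inner product favors document A, whose single large weight dominates, while the cosine favors document B, whose vector points in the same direction as the query. The rankings for the homework data should therefore be computed separately for each measure.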