Written Assignment Three

CSCI 379.01 - Information Retrieval and Web Search

Assigned: October 22nd, 2003, Wednesday
Due: October 27th, 2003, Monday

This assignment is designed to familiarize you with similarity measures other than the inner product and cosine similarity measures.

  1. Association matrix. An association matrix is an $n$-by-$n$ matrix whose entries measure the closeness of indexing terms based on how frequently they co-occur. Assuming there are a total of $n$ unique terms in the document collection, the association matrix looks as follows.

              $t_1$      $t_2$      ......     $t_n$
    $t_1$   $c_{11}$   $c_{12}$     ......   $c_{1n}$
    $t_2$   $c_{21}$   $c_{22}$     ......   $c_{2n}$
    $t_3$   $c_{31}$   $c_{32}$     ......   $c_{3n}$
    ......
    $t_n$   $c_{n1}$   $c_{n2}$     ......   $c_{nn}$

    where each of the $c_{ij}$ is the correlation factor between term $i$ and term $j$.

    \begin{displaymath}c_{ij} = \sum_{d_k \in D} f_{ik} * f_{jk} \end{displaymath}

    where $f_{ik}$ is the frequency of term $i$ in document $k$, and $D$ is the document collection.

    Given the following $f_{ik}$'s, please compute $c_{ij}$. Note that the matrix is symmetric, so you only need to compute half of it. Assume we have three documents and four terms.

    $f_{11} = 2$ $f_{12} = 1$ $f_{13} = 0$
    $f_{21} = 0$ $f_{22} = 2$ $f_{23} = 1$
    $f_{31} = 4$ $f_{32} = 1$ $f_{33} = 3$
    $f_{41} = 2$ $f_{42} = 0$ $f_{43} = 1$
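    The computation above can be sketched in a few lines of Python; this is a minimal sketch using the term-document frequencies given in the problem, not a required part of the solution.

```python
# Association matrix c_ij = sum over documents k of f_ik * f_jk,
# using the term-document frequencies from the assignment.

# f[i][k] = frequency of term i+1 in document k+1
f = [
    [2, 1, 0],  # term 1
    [0, 2, 1],  # term 2
    [4, 1, 3],  # term 3
    [2, 0, 1],  # term 4
]

n_terms = len(f)
n_docs = len(f[0])

# c is symmetric, so c[i][j] == c[j][i]
c = [[sum(f[i][k] * f[j][k] for k in range(n_docs))
      for j in range(n_terms)]
     for i in range(n_terms)]

for row in c:
    print(row)
```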

  2. Normalized association matrix. One problem with the regular association matrix is that it favors long documents, which contain more terms and in which a term may appear many more times. One way to counter this effect is to normalize the association matrix so that the entries fall between 0 and 1.

    \begin{displaymath}s_{ij} = \frac{c_{ij}} {c_{ii} + c_{jj} - c_{ij}} \end{displaymath}

    Using the information in Problem 1, compute the normalized association matrix.
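    The normalization step can be sketched as follows; this recomputes $c_{ij}$ from the Problem 1 frequencies so the snippet is self-contained, and is only an illustration of the formula.

```python
# Normalized association matrix s_ij = c_ij / (c_ii + c_jj - c_ij),
# built from the same term-document frequencies as Problem 1.

f = [
    [2, 1, 0],  # term 1
    [0, 2, 1],  # term 2
    [4, 1, 3],  # term 3
    [2, 0, 1],  # term 4
]
n = len(f)
docs = range(len(f[0]))

c = [[sum(f[i][k] * f[j][k] for k in docs) for j in range(n)]
     for i in range(n)]

# Each s_ij falls in [0, 1], and the diagonal s_ii is always 1
# (assuming no term has an all-zero row, which holds here).
s = [[c[i][j] / (c[i][i] + c[j][j] - c[i][j]) for j in range(n)]
     for i in range(n)]
```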

  3. Metric correlation matrix. The association correlation does not account for the proximity of terms within documents, only their co-occurrence frequencies. Metric correlation takes term proximity (the distance between two term occurrences) into account.

    \begin{displaymath}c_{ij} = \sum_{k_u \in V_i} \sum_{k_v \in V_j} \frac{1}{r(k_u, k_v)} \end{displaymath}

    $V_i$: Set of all occurrences of term $i$ in any document.
    $r(k_u, k_v)$: Distance in words between word occurrences $k_u$ and $k_v$ ($\infty$ if $k_u$ and $k_v$ are occurrences in different documents).

    Assume the document collection contains only the following two documents; compute the metric correlation matrix. Note that the matrix is symmetric, so you only need to compute half of it.

    Document 1:
    Higher education in Pennsylvania.

    Document 2:
    University of Pennsylvania is a fine higher education institute.

    You should ignore the stopwords: in, of, is, and a.
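    A sketch of the metric correlation for a pair of terms follows. It assumes word positions are counted after stopword removal, which is one possible reading of the assignment; occurrence pairs in different documents contribute 0 (since $r = \infty$).

```python
# Metric correlation c_ij = sum of 1/r(k_u, k_v) over all pairs of
# occurrences of term i and term j; pairs from different documents
# contribute nothing. Positions are counted after stopword removal
# (an assumption, not stated explicitly in the assignment).

STOPWORDS = {"in", "of", "is", "a"}

docs = [
    "Higher education in Pennsylvania.",
    "University of Pennsylvania is a fine higher education institute.",
]

# occurrences[term] = list of (doc_index, position) after stopword removal
occurrences = {}
for d, text in enumerate(docs):
    pos = 0
    for word in text.split():
        w = word.lower().strip(".")
        if w in STOPWORDS:
            continue
        occurrences.setdefault(w, []).append((d, pos))
        pos += 1

def metric_correlation(term_i, term_j):
    total = 0.0
    for (d1, p1) in occurrences[term_i]:
        for (d2, p2) in occurrences[term_j]:
            # same document and distinct occurrences, else r is infinite
            if d1 == d2 and (d1, p1) != (d2, p2):
                total += 1.0 / abs(p1 - p2)
    return total

print(metric_correlation("higher", "education"))  # adjacent in both docs -> 2.0
```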



Meng Xiannong 2003-10-22