Written Assignment Three

CSCI 379.01 - Information Retrieval and Web Search

Assigned: October 22nd, 2003, Wednesday
Due: October 27th, 2003, Monday

This assignment is designed to familiarize you with similarity measures other than the inner product and cosine similarity measures.

  1. Association matrix. An association matrix is an $n$-by-$n$ matrix whose entries measure the closeness of indexing terms based on how frequently they co-occur. Assuming there are a total of $n$ unique terms in the document collection, the association matrix looks as follows.

              $t_1$      $t_2$      ......     $t_n$
    $t_1$   $c_{11}$   $c_{12}$     ......   $c_{1n}$
    $t_2$   $c_{21}$   $c_{22}$     ......   $c_{2n}$
    $t_3$   $c_{31}$   $c_{32}$     ......   $c_{3n}$
    ......
    $t_n$   $c_{n1}$   $c_{n2}$     ......   $c_{nn}$

    where each of the $c_{ij}$ is the correlation factor between term $i$ and term $j$.

    \begin{displaymath}c_{ij} = \sum_{d_k \in D} f_{ik} * f_{jk} \end{displaymath}

    where $f_{ik}$ is the frequency of term $i$ in document $k$, and $D$ is the document collection.

    Given the following $f_{ik}$'s, please compute $c_{ij}$. Note that the matrix is symmetric, so you only need to compute half of it. Assume we have three documents and four terms.

    $f_{11} = 2$ $f_{12} = 1$ $f_{13} = 0$
    $f_{21} = 0$ $f_{22} = 2$ $f_{23} = 1$
    $f_{31} = 4$ $f_{32} = 1$ $f_{33} = 3$
    $f_{41} = 2$ $f_{42} = 0$ $f_{43} = 1$
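    The computation above can be sketched in a few lines of Python; this is a minimal sketch using the term-document frequencies given in the problem, not a required part of the solution.

```python
# Association matrix c_ij = sum over documents k of f_ik * f_jk,
# using the term-document frequencies from the assignment.

# f[i][k] = frequency of term i+1 in document k+1
f = [
    [2, 1, 0],  # term 1
    [0, 2, 1],  # term 2
    [4, 1, 3],  # term 3
    [2, 0, 1],  # term 4
]

n_terms = len(f)
n_docs = len(f[0])

# c is symmetric, so c[i][j] == c[j][i]
c = [[sum(f[i][k] * f[j][k] for k in range(n_docs))
      for j in range(n_terms)]
     for i in range(n_terms)]

for row in c:
    print(row)
```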

  2. Normalized association matrix. One problem with the regular association matrix is that it favors long documents, which contain more terms and in which a term may appear many more times. One way to counter this effect is to normalize the association matrix so that the entries fall between 0 and 1.

    \begin{displaymath}s_{ij} = \frac{c_{ij}} {c_{ii} + c_{jj} - c_{ij}} \end{displaymath}

    Using the information in Problem 1, compute the normalized association matrix.
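    The normalization step can be sketched as follows; this recomputes $c_{ij}$ from the Problem 1 frequencies so the snippet is self-contained, and is only an illustration of the formula.

```python
# Normalized association matrix s_ij = c_ij / (c_ii + c_jj - c_ij),
# built from the same term-document frequencies as Problem 1.

f = [
    [2, 1, 0],  # term 1
    [0, 2, 1],  # term 2
    [4, 1, 3],  # term 3
    [2, 0, 1],  # term 4
]
n = len(f)
docs = range(len(f[0]))

c = [[sum(f[i][k] * f[j][k] for k in docs) for j in range(n)]
     for i in range(n)]

# Each s_ij falls in [0, 1], and the diagonal s_ii is always 1
# (assuming no term has an all-zero row, which holds here).
s = [[c[i][j] / (c[i][i] + c[j][j] - c[i][j]) for j in range(n)]
     for i in range(n)]
```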

  3. Metric correlation matrix. The association correlation does not account for the proximity of terms within documents, only their co-occurrence frequencies. Metric correlation takes term proximity (the distance between two term occurrences) into account.

    \begin{displaymath}c_{ij} = \sum_{k_u \in V_i} \sum_{k_v \in V_j} \frac{1}{r(k_u, k_v)} \end{displaymath}

    $V_i$: Set of all occurrences of term $i$ in any document.
    $r(k_u, k_v)$: Distance in words between word occurrences $k_u$ and $k_v$ ($\infty$ if $k_u$ and $k_v$ are occurrences in different documents).

    Assume the document collection contains only the following two documents; compute the metric correlation matrix. Note that the matrix is symmetric, so you only need to compute half of it.

    Document 1:
    Higher education in Pennsylvania.

    Document 2:
    University of Pennsylvania is a fine higher education institute.

    You should ignore the stopwords: in, of, is, and a.
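    A sketch of the metric correlation for a pair of terms follows. It assumes word positions are counted after stopword removal, which is one possible reading of the assignment; occurrence pairs in different documents contribute 0 (since $r = \infty$).

```python
# Metric correlation c_ij = sum of 1/r(k_u, k_v) over all pairs of
# occurrences of term i and term j; pairs from different documents
# contribute nothing. Positions are counted after stopword removal
# (an assumption, not stated explicitly in the assignment).

STOPWORDS = {"in", "of", "is", "a"}

docs = [
    "Higher education in Pennsylvania.",
    "University of Pennsylvania is a fine higher education institute.",
]

# occurrences[term] = list of (doc_index, position) after stopword removal
occurrences = {}
for d, text in enumerate(docs):
    pos = 0
    for word in text.split():
        w = word.lower().strip(".")
        if w in STOPWORDS:
            continue
        occurrences.setdefault(w, []).append((d, pos))
        pos += 1

def metric_correlation(term_i, term_j):
    total = 0.0
    for (d1, p1) in occurrences[term_i]:
        for (d2, p2) in occurrences[term_j]:
            # same document and distinct occurrences, else r is infinite
            if d1 == d2 and (d1, p1) != (d2, p2):
                total += 1.0 / abs(p1 - p2)
    return total

print(metric_correlation("higher", "education"))  # adjacent in both docs -> 2.0
```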



Meng Xiannong 2003-10-22