%% sample.tex
%% example to show how latex works
%% X. Meng
%% oct-27-2004

\input epsf


\setlength{\textwidth}{6.5in}              %%% Preset settings
%\setlength{\footheight}{0.0in}            %%% not recognized 2002-09-19



Xiannong Meng\\
Department of Computer Science\\
Bucknell University\\
Lewisburg, PA 17837, U.S.A.\\
Email: xmeng@bucknell.edu 
second author\\
Department of Computer Science\\
Some University\\
Some city, ST 12345, U.S.A. \\



\begin{center}{\bf Abstract}\end{center}
{\small There are other measures that
  improve over the precision-recall measure, such as \emph{expected
  search length} (ESL) and \emph{average search length} (ASL). These
  measures are neither intuitive nor easy to compute. We propose in
  this paper \emph{RankPower} as an alternative measure for evaluating
  the effectiveness of IR systems such as web search engines.
  \emph{RankPower} takes both the rank and the number of
  relevant documents into consideration. \emph{RankPower} is bounded
  below as the number of relevant documents approaches infinity. Thus
  comparisons among different systems using \emph{RankPower} become
  intuitive and easy.

{\bf Keywords}:
information retrieval, user preference,
relevance feedback, adaptive algorithm,
term weighting, web search}





\emph{Recall} and \emph{precision} are the two best-known traditional
measures of the quality of an information retrieval system.
Assume the whole document collection is \emph{S} and the set of
documents in \emph{S} that are relevant to a given query \emph{q} is
\emph{R}. When the query \emph{q} is issued, an information retrieval
(IR) system (or specifically, a web search engine) returns the
collection $S_q$, among which $R_q$ is the set of relevant
documents. Then
\[ Precision = \frac{|R_q|}{|S_q|} ~~~~~~ Recall = \frac{|R_q|}{|R|} \]
That is, \emph{precision} measures how many documents are relevant to
the query among
the returned documents; and \emph{recall} measures how many relevant
documents are returned 
among the total relevant document set. We will use \emph{P} to denote
the precision measure
and \emph{R} to denote the recall measure for the rest of the paper.
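For concreteness, the two definitions can be sketched as follows (the function name and the use of document-id sets are our illustration, not part of the paper):

```python
def precision_recall(returned, relevant):
    """Precision and recall for one query.

    returned -- document ids returned by the system (S_q)
    relevant -- all document ids relevant to the query (R)
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant           # R_q: relevant documents returned
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# 4 of the 5 returned documents are relevant; 4 of the 8 relevant are found
p, r = precision_recall([1, 2, 3, 4, 5], [2, 3, 4, 5, 6, 7, 8, 9])
# p = 0.8, r = 0.5
```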
The goal of an IR system is to have both high precision, meaning that
a high percentage of the returned documents are relevant, and high
recall, meaning that a high percentage of the relevant documents are
returned. In a closed laboratory testing environment where different
algorithms can be studied, the complete document collection is known,
and the test queries and the set of documents relevant to each query
are identified by experts. Under these conditions, precision and
recall are very good measures of the degree of success of an IR
system. However, since precision and recall are two separate
measures, it is not always convenient to compare different IR systems
using them. To
amend this deficiency, many \emph{single value} measures have been
proposed and studied\,\cite{korfhage:97}.
The \emph{harmonic mean} measure is defined as the inverse of the sum
of the inverse of precision and 
the inverse of recall. So for a \emph{harmonic mean} value to be high, both
precision and recall have
to be high.
\[harmonic~mean = \frac{2}{\frac{1}{P} + \frac{1}{R}} \]
Variations of the basic \emph{harmonic mean} measure include E-measure
among others.
\[E~measure = \frac{(1+\beta^2)}{\frac{\beta^2}{R} + \frac{1}{P}} \]
where $\beta$ controls the trade-off between precision and recall. If
$\beta$ is one, precision and recall are equally weighted and the
measure reduces to the \emph{harmonic mean}; if $\beta$ is greater
than one, recall is weighed more; otherwise precision is weighed
more.
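A small sketch of both single-value measures (function names are ours; inputs are assumed to be nonzero precision and recall values):

```python
def harmonic_mean(p, r):
    """Harmonic mean of precision p and recall r (both nonzero)."""
    return 2.0 / (1.0 / p + 1.0 / r)

def e_measure(p, r, beta=1.0):
    """E-measure: beta = 1 reduces to the harmonic mean;
    beta > 1 weighs recall more, beta < 1 weighs precision more."""
    return (1.0 + beta ** 2) / (beta ** 2 / r + 1.0 / p)

# With P = 0.8 and R = 0.5:
print(harmonic_mean(0.8, 0.5))        # 0.6153...
print(e_measure(0.8, 0.5, beta=2.0))  # pulled toward R: 0.5405...
```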

  A closed-form expression of the optimal \emph{RankPower} can be found
  such that
  comparisons of different web information retrieval systems can be
  easily made. The \emph{RankPower} measure reaches its optimal value when
  all returned documents are relevant. The rest of the paper is
  organized as follows. First the various measures of the
  effectiveness of IR systems including web systems are reviewed in
  Section \ref{sec:lit}. The definition, derivation, and rationales of
  the measure \emph{RankPower} are presented in Section
  \ref{sec:core}. Section \ref{sec:result} contains some numerical
  results and comparisons among different measures, followed by some
  concluding remarks in Section \ref{sec:conclude}.

\section{Example of Citing References\label{sec:lit}}

The classic measures of user-oriented performance of an IR system are
\emph{precision} and \emph{recall}, which can be traced back to the
1960s\,\cite{cleverdon:66, treu:67}. The ideal goal is a high
precision rate as well as a high recall rate. Several other measures
are related to precision and recall. The \emph{average precision at
seen relevant documents}\,\cite{yates:99} takes the average of the
precision values seen so far to give a single value.


The following notions will be used throughout the discussion.
\begin{itemize}
   \item For a given query, a set of \emph{N} documents is returned.
   \item Among the \emph{N} returned documents, $R_N$ represents
   the set of documents that are relevant to the query, $|R_N| \le N$.
   \item Each of the relevant documents in $R_N$ is placed among
   the \emph{N} different places. These places are denoted as $L_i$,
   $1 \le i \le |R_N|$, where $1 \le L_i \le N$.
   \item The number of relevant documents in the returned set, $|R_N|$,
   is denoted as $C_N$.
\end{itemize}
Note that we are considering practical retrieval systems such as web
search engines. In this situation the size of the total
document set is unknown, nor do we know the exact relevant document
set from the complete document collection for any given query.

The average rank of the relevant documents is defined as
\[ R_{avg} = \frac{\sum_{i=1}^{C_N} L_i} {C_N} \]
In an ideal case where every single returned document is relevant,
$C_N = N$ and $L_i = i$, so the average rank has the following form.
\[ R_{avg} = \frac{\sum_{i=1}^{N} i} {N} = \frac{\frac{N(N+1)}{2}}{N}
           = \frac{N+1}{2}\]
Consequently,
\[ \lim_{N \rightarrow \infty} R_{avg} (N) = \infty \]
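The average-rank computation can be sketched as follows (the function name is ours):

```python
def average_rank(positions):
    """Average rank R_avg of the relevant documents.

    positions -- the 1-based ranks L_i at which relevant documents appear
    """
    positions = list(positions)
    return sum(positions) / len(positions)

# Ideal case: all N = 9 returned documents are relevant
print(average_rank(range(1, 10)))   # (N + 1) / 2 = 5.0
```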

We now introduce a new measure, \emph{RankPower}, to tackle the above
issues. The basic idea is that the average rank captures only the
average position over a set. We would like to see a small average
rank and a large relevant-document count. Combining these two
factors, we define the \emph{RankPower} as follows.
\[ RankPower = \frac{R_{avg}} {C_N} = \frac{\sum_{i=1}^{C_N} L_i} {{C_N}^2} \]
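A direct transcription of the definition (function name ours). Smaller values are better; when all $N$ returned documents are relevant, the formula gives $(N+1)/(2N)$, which approaches the lower bound $1/2$ as $N$ grows:

```python
def rank_power(positions):
    """RankPower = R_avg / C_N = sum(L_i) / C_N**2, where the L_i are
    the ranks of the C_N relevant documents in the returned list."""
    positions = list(positions)
    c = len(positions)
    return sum(positions) / (c * c)

# Three relevant documents at ranks 1, 2 and 4
print(rank_power([1, 2, 4]))      # 7/9 = 0.777...
# All of the top N = 10 documents relevant: (N + 1) / (2 * N)
print(rank_power(range(1, 11)))   # 0.55
```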

An example of a table.

Precision-Recall: We use the traditional 11 recall
  intervals. The results are listed in Table \ref{tbl:rp}.
\[Recall = \frac{|R_q|}{|R|} \]
\[Precision = \frac{|R_q|}{|S_q|} \]
\begin{table}[htb]
\centering
\caption{Example of Recall and Precision\label{tbl:rp}}
\begin{tabular}{|l|l|l|l|l|l|} \hline
Place     & 1     & 2    & 3    & 4    & 5  \\ \hline
Recall    & 0.1   & 0.2  & 0.3  & 0.4  & 0.5 \\ \hline
Precision & 0.50  & 0.40 & 0.43 & 0.00 & 0.00 \\ \hline\hline
Place     & 6     & 7    & 8    & 9    & 10  \\ \hline
Recall    & 0.6  & 0.7 & 0.8 & 0.9 & 1.0 \\ \hline
Precision & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 \\ \hline
\end{tabular}
\end{table}
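The precision values in Table \ref{tbl:rp} can be reproduced from a ranked result list. The helper below and the particular ranking (relevant documents at ranks 2, 5 and 7, with $|R| = 10$) are our reconstruction, chosen to be consistent with the values shown:

```python
def precision_at_recall_levels(relevance, total_relevant):
    """Precision observed at each recall level k / total_relevant.

    relevance -- booleans, one per returned document, in rank order
    Recall levels never reached get precision 0.0.
    """
    precision_at = {}
    found = 0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            found += 1
            precision_at[found] = found / rank
    return [precision_at.get(k, 0.0) for k in range(1, total_relevant + 1)]

ranking = [False, True, False, False, True, False, True]
levels = precision_at_recall_levels(ranking, 10)
# [0.5, 0.4, 0.428..., 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```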

\section{Figure Example\label{sec:result}}

Some more results.

\begin{figure}[htb]
\centering
% figure body omitted in this sample
\caption{Architecture of PAWS-cluster\label{fig:arch}}
\end{figure}


Some conclusions.

\noindent{\bf URLs} referenced in the paper:

[AW] $<$http://www.alltheweb.com/$>$

[GG] $<$http://www.google.com/$>$ 




R.~Baeza-Yates and B.~Ribeiro-Neto, 
\emph{Modern Information Retrieval}, 
Addison Wesley, 

C.W.~Cleverdon, J.~Mills, \& E.M.~Keen, Factors Determining the
Performance of Indexing Systems, Volume 1 -- Design, Aslib Cranfield
Research Project, Cranfield, England, 1966.

W.S.~Cooper, Expected search length: A single measure of retrieval
effectiveness based on weak ordering action of retrieval systems,
\emph{Journal of the American Society for Information Science},
19(1), 30-41,

\emph{Information Storage and Retrieval},
John Wiley \& Sons,

Foundation of evaluation, 
\emph{Journal of Documentation},
30(4), 365-373,

\emph{Text Retrieval and Filtering: Analytic Models of Performance},
Kluwer Publisher, Boston,

Measuring Search Engine Quality and Query Difficulty: Ranking with
Target and Freestyle,
\emph{Journal of the American Society for Information Science},
50(10), 882-889,

When Information Retrieval Measures Agree about the Relative Quality
of Document Rankings,
\emph{Journal of the American Society for Information Science},
51(9), 834-840,

X. Meng \& Z. Chen,
MARS: Multiplicative Adaptive Refinement Web Search,
submitted for publication,

G.~Salton, \emph{Automatic Text Processing}, 
Addison Wesley, 

On the foundation of evaluation,
\emph{Journal of the American Society for Information Science},
37(5), 346-348,

S. Treu, Testing and Evaluation -- Literature Review, \emph{Electronic
Handling of Information: Testing and Evaluation}, A. Kent,
O.E. Taulbee, J. Belzer, and G.D. Goldstein, editors, Thompson Book
Co., Washington, D.C. 1967, 71-88.