%%% my own latex stuff are commented to use pdpta2001 format
%%% \documentstyle[11pt,cprog,html]{article}
%%% \setlength{\textheight}{9.6in}
%%% \setlength{\voffset}{-1.4in}
%%% \setlength{\textwidth}{7.2in}
%%% \setlength{\hoffset}{-1.2in}
%%% %\parskip 2ex
%%% \renewcommand{\baselinestretch}{1.2}
%% PDPTA-2001 format from www.ashland.edu/~iajwa/conferences
\documentstyle[twocolumn,11pt, html]{article}
\def\htlink{\htmladdnormallink} %%%% I added it for my convenience
\pagestyle{empty} %%%% No page Numbering
\setlength{\textheight}{9.0in}
\setlength{\columnsep}{0.375in}
\setlength{\textwidth}{6.5in}
%%% Preset settings
\setlength{\footheight}{0.0in}
\setlength{\topmargin}{-0.0625in}
\setlength{\headheight}{0.0in}
\setlength{\headsep}{0.0in}
\setlength{\oddsidemargin}{0.0in}
\setlength{\parindent}{1pc}
%\title{The Architecture of Yarrow: A Real-Time Intelligent Meta-Search Engine}
%\author{Xiannong Meng\& Zhixiang Chen \\
%Department of Computer Science\\
%University of Texas - Pan American\\
%Edinburg, TX 78539-2999\\
%Contacting and presenting author: Xiannong Meng\\
%meng@cs.panam.edu \\
%Phone: (956) 316-7062 \\
%Fax: (956) 384-5099
%}
%\date{February 22, 2001}
\title{The Architecture of Yarrow: A Real-Time Intelligent Meta-Search Engine}
\author{
Xiannong Meng\\
Department of Computer Science\\
University of Texas - Pan American\\
Edinburg, TX 78539-2999, U.S.A.\\
\and
Zhixiang Chen\\
Department of Computer Science\\
University of Texas - Pan American\\
Edinburg, TX 78539-2999, U.S.A.\\
}
\date{}
\input epsf
\begin{document}
\maketitle
%\begin{abstract}
\noindent
{\bf Abstract}
{\small\em
In this paper we present the architecture of Yarrow[a]\footnote{{\em Yarrow} is the name of a family of small plants and happens to be the name of the street where the two authors live.} -- an intelligent web meta-search engine.
Yarrow takes a user query and sends it automatically to a number of major search engines.
As the results are sent back from these search engines, Yarrow processes them using a practically efficient on-line learning algorithm before displaying the re-ranked results to the user.
Users can interactively refine the search results presented by Yarrow, which dynamically promotes or demotes results until the user has located a satisfactory set of pages.
}

\vspace{0.5cm}
\noindent
{\it Keywords:} {\small meta-search engine, World Wide Web, Internet application, adaptive learning}
%\end{abstract}

\section{Introduction\label{sec:intro}}

The web provides a vast amount of information.
According to a recent study \cite{lawrence99a}, there are an estimated 800 million pages on the web.
Finding information on the web in a reasonable amount of time is very difficult.
General-purpose search engines such as AltaVista [g], Yahoo! [n], and NorthernLight [m] do help.
However, with the exponential growth in the size of the web, the coverage of the web by general search engines has been decreasing, with no engine indexing more than about 16\% of the estimated size of the publicly indexable web \cite{lawrence99a}.

In response to this difficulty, two approaches have been taken recently.
One is the development of {\em meta-search engines} that forward user queries to multiple search engines at the same time in order to increase coverage, in the hope of {\em including} what the user wants in a short list of top-ranked results.
Examples of such meta-search engines include MetaCrawler [b], Inference Find [c], and Dogpile [d].
Another approach is the development of {\em topic-specific} search engines that specialize in particular topics.
These topics range from vacation guides [e] to kids' health [f].

General search engines cover large amounts of information, even though the percentage of coverage is decreasing.
But users have a hard time efficiently locating what they want.
The first generation of meta-search engines addressed the problem of decreasing coverage by simultaneously querying multiple general-purpose engines.
These meta-search engines suffer, to a certain extent, from the inherited problem of {\em information overflow}: it is difficult for users to pin down the specific information for which they are searching.
Specialized search engines typically contain much more accurate and narrowly focused information.
However, it is not easy for a novice user to know which specialized engine to use and where to find it.

Meta-search engines may be classified into two categories: {\em shallow meta-search engines} and {\em deep meta-search engines}.
A shallow meta-search engine simply echoes the search results of one or several general-purpose search engines.
There may be some collating, filtering, or sorting, but such efforts are very limited.
A deep meta-search engine uses the search results of the general-purpose search engines as its starting search space and adaptively learns from the user's feedback to improve on the search performance and relevance accuracy of those engines.
It may use clustering, filtering, and other methods to help its adaptive learning process.

From an engineering point of view, a meta-search engine is usually lightweight: it does not require the support of very complicated data structures, a large database, or a large amount of memory.
It can, and should, emphasize the intelligent processing of the search results returned by general-purpose search engines.
Recent research on web communities \cite{kleinberg99,gibson98,chakrabarti98} has used a short list of hits returned by a search engine as a starting set for further expansion.
Considerable effort has gone into applying machine learning to web-search-related applications, for example, scientific article location and user profiling \cite{kurt98,kurt99,lawrence99}, and focused crawling \cite{rennie99}.

This paper presents Yarrow [a], a second-generation meta-search engine that is an intelligent deep meta-search engine.
Currently, Yarrow can query eight of the most popular general-purpose search engines and performs document parsing, indexing, and learning in real time on the client side.
The predominant feature of Yarrow is that, in contrast to existing meta-search engines, which lack adaptive learning features, it is equipped with an on-line learning algorithm, TW2 (Tailored Winnow2) \cite{chenyarrow}, %{chenquery,chenwebsail}
so that it can help the user search for the desired documents using the user's feedback.
In \cite{chenquery} %,chenwebsail}
we designed the learning algorithm TW2, a version of Winnow2 \cite{littlestone88} tailored to web search.
When used to learn a disjunction of at most $k$ relevant keywords, TW2 has surprisingly small mistake bounds that are independent of the dimensionality of the indexing keywords.
TW2 has been successfully used as part of the learning components in our other projects \cite{chenwebsail,chenfeatures}.

......

\section{Concluding Remarks\label{sec:concl}}

This paper describes the architecture of Yarrow, an intelligent meta-search engine, and some of its implementation details.
From the engineering point of view, deep meta-search is possible because a meta-search engine is usually lightweight and requires neither a large database nor a large amount of memory.
Yarrow is a first step toward building a deep meta-search engine.
It is powered by an efficient learning algorithm and is also equipped with functions for document parsing and indexing.
It adaptively learns from the user's feedback to search for the desired documents.
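
To make the promotion and demotion of search results concrete, the following sketch recalls the standard Winnow2 update of \cite{littlestone88}, of which TW2 is a tailored version.
It is given only as an illustration: the symbols $w_i$, $x_i$, $\alpha$, and $\theta$ are notation assumed here, and the modifications that distinguish TW2 are described in \cite{chenquery} and are not reproduced.
Let $x_i \in \{0,1\}$ indicate whether indexing keyword $i$ occurs in a document, let $w_i > 0$ be the weight of that keyword, and let $\theta > 0$ be a threshold.
A document is predicted relevant if $\sum_i w_i x_i > \theta$.
When the user's feedback contradicts the prediction, each weight is updated as
\[
w_i \leftarrow \left\{
\begin{array}{ll}
\alpha\, w_i & \mbox{if $x_i = 1$ (promotion),} \\
w_i / \alpha & \mbox{if $x_i = 1$ (demotion),} \\
w_i & \mbox{if $x_i = 0$,}
\end{array}
\right.
\]
where $\alpha > 1$, promotion applies when a relevant document was predicted irrelevant, and demotion applies when an irrelevant document was predicted relevant.
For standard Winnow2, the number of mistakes made in learning a disjunction of $k$ out of $n$ keywords grows as $O(k \log n)$.
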
In the future, we plan to implement Yarrow on a cluster of computers and to add personalization features.\\
{\noindent{\large\bf URLs Used in the Paper}}\\

\noindent
[a] Yarrow
$<$\htlink{http://www.cs.panam.edu/\~{}\\ chen/WebSearch/Yarrow.html}
{http://www.cs.panam.edu/\~{}chen/WebSearch/Yarrow.html}$>$

\noindent
[b] MetaCrawler \\
$<$\htlink{http://www.metacrawler.com}
{http://www.metacrawler.com}$>$

\noindent
[c] Inference Find
$<$\htlink{http://www.infind.com}
{http://www.infind.com}$>$

\noindent
[d] Dogpile
$<$\htlink{http://www.dogpile.com}
{http://www.dogpile.com}$>$

\noindent
[e] VacationSpot.com \\
$<$\htlink{http://www.vacationspot.com}
{http://www.vacationspot.com}$>$

\noindent
[f] KidsHealth
$<$\htlink{http://www.kidshealth.com}
{http://www.kidshealth.com}$>$

\noindent
[g] AltaVista
$<$\htlink{http://www.altavista.com}
{http://www.altavista.com}$>$

\noindent
[h] Excite
$<$\htlink{http://www.excite.com}
{http://www.excite.com}$>$

\noindent
[i] GoTo
$<$\htlink{http://www.goto.com}
{http://www.goto.com}$>$

\noindent
[j] HotBot
$<$\htlink{http://www.hotbot.com}
{http://www.hotbot.com}$>$

\noindent
[k] InfoSeek
$<$\htlink{http://www.infoseek.com}
{http://www.infoseek.com}$>$

\noindent
[l] Lycos
$<$\htlink{http://www.lycos.com}
{http://www.lycos.com}$>$

\noindent
[m] NorthernLight\\
$<$\htlink{http://www.northernlight.com}
{http://www.northernlight.com}$>$

\noindent
[n] Yahoo!
$<$\htlink{http://www.yahoo.com}
{http://www.yahoo.com}$>$

\bibliographystyle{plain}
\bibliography{/home/accounts/facultystaff/x/xmeng/lib/tex/web}
\end{document}