Search Engine Project Part 2: Basic Text Processing and Indexing

Summer 2014


Introduction

In this phase of the project you are to build the indexing component of the search engine. Refer to the description of the first phase of the project for the relation between the indexing component and the rest of the system.

The indexer takes a sequence of file names as input. For each named file, the indexer processes the file in the following steps.

  1. Lexical analysis (tokenizing)
  2. Stopwords removal
  3. Stemming
  4. Selection of indexing terms among the word collection
  5. Updating the indexing system for the newly processed file

You may restrict the types of files to process to a subset of all files. At a minimum, your program should be able to handle plain text files and HTML files. Your program may ignore other types of files for now.

The result of the processing should be an inverted index containing all index terms found in the set of files. In later stages of the project, a crawler will provide a sequence of URLs (file names) to the indexer. The result of the indexing system can then be used by the retriever to select relevant URLs based on the search key words.

Technical Details

Here we describe some of the details of each step involved in the indexing.

Lexical Analysis

In general you can consider the input file as one long text string. The task of lexical analysis is to parse this string into a collection of words based on the specified delimiter(s). In this part of the processing you should also be able to extract the URLs contained in the file if they are part of anchors. For example, if we have the following anchor inside a web page,

    <a href="http://www.somewhere.com/SomePath/aFile.html">some description</a>

the lexical analysis unit should be able to separate and extract the following two components.

  1. The URL text as http://www.somewhere.com/SomePath/aFile.html
  2. A list of words from this segment: http, www, somewhere, com, SomePath, aFile, html, some, description (the commas are added here only to separate the words.)

Although we do not use these URLs to access other web pages at this stage of the process (we only extract them), these URLs will be used in the crawling part of the project.

The lexical analysis can be done in different ways. You can use a pattern-matching approach (regular expressions), you can design and implement your own finite-state machine to recognize words and URLs, or you can use the string match and search functions or other software modules available in the high-level programming language you choose (e.g., Java, Python, or C).
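For instance, a regular-expression tokenizer in Java might look like the following minimal sketch (the pattern and class name are illustrative assumptions, not part of the assignment):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class Tokenizer {
        // Runs of letters and digits count as words; everything else is a delimiter.
        private static final Pattern WORD = Pattern.compile("[A-Za-z0-9]+");

        public static List<String> tokenize(String text) {
            List<String> words = new ArrayList<>();
            Matcher m = WORD.matcher(text);
            while (m.find()) {
                words.add(m.group());
            }
            return words;
        }
    }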

To ease the processing, it might be a good idea to pre-process the text such that

  1. Runs of white space (i.e., new-line, tab, and space characters) are squeezed into one space, with all new-line and tab characters converted into the space character;
  2. All characters are converted into lower case.
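A minimal sketch of this pre-processing in Java, assuming the whole file has already been read into one string:

    // Convert to lower case and collapse every run of white space
    // (spaces, tabs, new-lines) into a single space character.
    public static String preprocess(String text) {
        return text.toLowerCase().replaceAll("\\s+", " ");
    }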

Squeezing multiple white spaces into one and converting the text to lower case make searching for the URL pattern very easy. All you have to do is search for the pattern <a href; once you locate this pattern, extracting the URL that comes after it is easy. Converting all characters into lower case also makes the pattern search easier, but it presents one problem: how do we preserve a URL's path and file name part, which is case-sensitive? In the following URL,

    http://www.somewhere.com/SomePath/aFile.html

we want to preserve the case of the path part of the URL. We don't want to extract the URL as

    http://www.somewhere.com/somepath/afile.html

since this is not the same URL as the original one. While there are many different ways of handling this problem, one possible solution is to keep two copies of the input text: one copy keeps the original case, and the other contains the lower-case version of the original. We search for the URL pattern <a href in the lower-case text while extracting the URL from the original one. Note that both copies should have had their white spaces squeezed. Alternatively, if you'd like to keep the original HTML documents as they are, you may find useful the existing Java classes such as the HTMLLinkExtractor.java available at the course web site

http://www.eg.bucknell.edu/~xmeng/webir-course/2014/code/misc/HTMLLinkExtractor.java
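As an illustration of the two-copy idea, here is a minimal Java sketch that searches the lower-case copy for <a href and pulls the URL out of the original-case copy at the same offsets (quote handling is deliberately simplified; real pages need more care):

    import java.util.ArrayList;
    import java.util.List;

    public class LinkExtractor {
        // Both strings must already have had their white space squeezed,
        // so they have identical lengths and character offsets.
        public static List<String> extractUrls(String original, String lower) {
            List<String> urls = new ArrayList<>();
            int from = 0;
            while ((from = lower.indexOf("<a href", from)) >= 0) {
                int start = lower.indexOf('"', from);              // opening quote
                int end = (start >= 0) ? lower.indexOf('"', start + 1) : -1;
                if (start < 0 || end < 0) break;
                urls.add(original.substring(start + 1, end));      // keep original case
                from = end + 1;
            }
            return urls;
        }
    }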

Stopwords Removal

Once the text is tokenized, we should remove the stopwords that we don't want to include in the index collection. Here you may come up with a list of stopwords of your own, or you may search the internet for a source that has a list of common stopwords. These lists may differ; that is fine. Your program needs to go through the list of words extracted from the text in the first step and remove the ones that are in the stopwords list. That is, if a word is found on the stopwords list, do not add it as an index term. You should use the same list of stopwords in indexing and later on in searching.
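A minimal sketch, assuming the stopwords are kept one per line in a file whose name you pass in (the file layout and class name are illustrative):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import java.util.stream.Collectors;

    public class StopwordFilter {
        private final Set<String> stopwords = new HashSet<>();

        public StopwordFilter(String stopwordFile) throws IOException {
            stopwords.addAll(Files.readAllLines(Paths.get(stopwordFile)));
        }

        // Keep only the tokens that are not on the stopwords list.
        public List<String> filter(List<String> tokens) {
            return tokens.stream()
                         .filter(t -> !stopwords.contains(t))
                         .collect(Collectors.toList());
        }
    }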

Stemming

The task of stemming helps to reduce the number of different words, because many words that share the same root (stem) have similar meanings. You may use Porter's algorithm for stemming. I found a few implementations of the algorithm on the web; you may experiment with them a little and use the one you feel comfortable with. You can certainly implement your own version of the algorithm. Click this link http://www.eg.bucknell.edu/~xmeng/webir-course/2014/code/Porters/ to see a list of implementations of Porter's algorithm.
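Whichever implementation you adopt, it helps to hide it behind a small wrapper so the rest of the indexer does not depend on one particular interface. A sketch, assuming a hypothetical PorterStemmer class with a stem(String) method (the implementations linked above have somewhat different interfaces, so adapt accordingly):

    import java.util.List;
    import java.util.stream.Collectors;

    public class StemmingStep {
        // PorterStemmer stands for whichever implementation you choose;
        // its stem(String) method is an assumed, illustrative interface.
        private final PorterStemmer stemmer = new PorterStemmer();

        public List<String> stemAll(List<String> words) {
            return words.stream()
                        .map(stemmer::stem)
                        .collect(Collectors.toList());
        }
    }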

Selecting Index Terms

For simplicity, we use all individual words as our indexing terms for now. If you run into the problem of not having enough memory to store and process all words, you may have to select some terms from the whole collection. This can be implementation-dependent.

Building and Updating the Indexing System

The most important part of this phase of the project is to build the indexing system. We will use an inverted index system; that is, each indexing term points to a list of documents that contain the term. Figure 1 shows what an inverted indexing system may look like.


Figure 1: An Example of Inverted Indexing System

The basic index system has a list of index terms across the whole document set. Each index term has a list of nodes, called posting nodes, each of which contains the information about that particular term in that particular document. The list for each index term is called a posting list. In the simplest form, the pair of document ID and term frequency is kept in each posting node, as can be seen in Figure 1. For example, in the posting list of the term computer, three documents contain the term (thus, its df is three); the first posting node indicates that the term computer appears in document D7 four times. You may keep other information, such as the location of the term within the document and the importance of the term as perceived within the text (whether or not it appears in a heading, in bold face, ...), as needed by the search system.

The index terms should be sorted alphabetically for easy search. You could implement the inverted indexing system as a list of lists, or you can use other data structures such as hash tables or balanced binary search trees. Using a hash table may reduce the number of terms you can index, since hash tables are typically less space-efficient.
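One minimal way to realize this structure in Java, as a sketch (all names are illustrative): a posting holds one (document ID, term frequency) pair, and a sorted map takes each term to its posting list.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.TreeMap;

    public class InvertedIndex {
        // A posting node: one (document ID, term frequency) pair.
        public static class Posting {
            public final int docId;
            public final int tf;
            public Posting(int docId, int tf) { this.docId = docId; this.tf = tf; }
        }

        // TreeMap keeps the index terms sorted alphabetically; the length of
        // each posting list is the term's document frequency (df).
        private final TreeMap<String, List<Posting>> index = new TreeMap<>();

        public void add(String term, int docId, int tf) {
            index.computeIfAbsent(term, t -> new ArrayList<>()).add(new Posting(docId, tf));
        }

        public List<Posting> postings(String term) {
            return index.getOrDefault(term, new ArrayList<>());
        }
    }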

Which data structure to use in the indexing system is critical to the search system. In our project the main consideration should be space efficiency, as we will use our campus lab computers to implement the system. The more space-efficient your indexing system is, the more URLs you can index, and thus the more searches you can support. Your team should choose an appropriate data structure. One easy way to save some space is not to store document names (URLs) everywhere. Rather, you should assign a document ID to each document and use this ID when processing and storing information about the document, keeping just one copy of the document name and its mapping to the ID. So a mapping from document name to a unique number is needed. You can do the same thing for the indexing terms, that is, use a unique index term ID throughout the system while keeping only one copy of each term and a mapping between the term and the term ID.
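A sketch of the name-to-ID mapping, assuming IDs are handed out in first-seen order (class and method names are illustrative):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class DocTable {
        private final Map<String, Integer> idOf = new HashMap<>();
        private final List<String> nameOf = new ArrayList<>();  // position = doc ID

        // Return the existing ID for this document name, or assign the next one.
        public int idFor(String name) {
            return idOf.computeIfAbsent(name, n -> {
                nameOf.add(n);
                return nameOf.size() - 1;
            });
        }

        public String nameFor(int id) {
            return nameOf.get(id);
        }
    }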

The following is a general description of the algorithm to create an inverted index system.

Figure 2: Algorithm to Create Inverted Indexing
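As one possible reading of the general algorithm, here is a minimal Java sketch that ties together the illustrative classes from the earlier sections (Tokenizer, StopwordFilter, StemmingStep, DocTable, InvertedIndex); it is a sketch of the overall loop, not the exact algorithm in the figure:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class IndexBuilder {
        // For each file: read, pre-process, tokenize, remove stopwords, stem,
        // count term frequencies, then append one posting per distinct term.
        public static void buildIndex(List<String> fileNames, StopwordFilter stop,
                                      StemmingStep stem, DocTable docs,
                                      InvertedIndex index) throws IOException {
            for (String name : fileNames) {
                String text = new String(Files.readAllBytes(Paths.get(name)));
                String clean = text.toLowerCase().replaceAll("\\s+", " ");
                List<String> words = stem.stemAll(stop.filter(Tokenizer.tokenize(clean)));

                Map<String, Integer> tf = new HashMap<>();
                for (String w : words) {
                    tf.merge(w, 1, Integer::sum);
                }
                int docId = docs.idFor(name);
                for (Map.Entry<String, Integer> e : tf.entrySet()) {
                    index.add(e.getKey(), docId, e.getValue());
                }
            }
        }
    }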

Test Files

You may select any test files to work with. These files should be accessible from local file systems, i.e., not from the web directly. Keep in mind that your program should be able to work with at least a few hundred terms and a few thousand documents. The document set should contain plain text and HTML documents. If you don't have other sources, you may try to index a portion of my CSCI 363: Computer Networks course site by downloading and extracting from

http://www.eg.bucknell.edu/~xmeng/webir-course/common-files/cs363-s13-sample.zip

After saving all files onto your local file system, you can use the file manipulation functions provided by your chosen programming language, e.g., Java, to process these local files. If you are not familiar with these functions, please check out some examples of working with files; examples for Java and C++ are linked from the course web site.
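For instance, a minimal Java sketch that walks a local directory tree and lists every regular file (the directory name below is just a placeholder for wherever you extracted the test set):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    public class FileWalker {
        public static void main(String[] args) throws IOException {
            Path root = Paths.get("cs363-s13-sample");  // placeholder directory
            try (Stream<Path> paths = Files.walk(root)) {
                paths.filter(Files::isRegularFile)
                     .forEach(System.out::println);     // feed these names to the indexer
            }
        }
    }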

To test whether your indexing system works, show that after the index is built, your program is able to answer queries from the user by listing the documents that contain the query term. For example, using the indexing system shown in Figure 1 as the test case, if the user types in the word computer, your system should return the names of the three documents that contain the term computer, along with the term frequencies. The output from your program may look something similar to the following.

    computer ==> hello.html (4)  world.html (2)  how.html (5)

Here hello.html is D7, in which the term appears four times; world.html is D8, in which the term appears two times; and how.html is D10, in which the term appears five times. (Please refer to Figure 1 of this document.)
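A sketch of such a lookup, reusing the illustrative InvertedIndex and DocTable classes from the earlier sketches (remember to lower-case and stem the query term the same way the index terms were processed):

    // Print every document that contains the query term, with its term frequency.
    public static void answer(String term, InvertedIndex index, DocTable docs) {
        for (InvertedIndex.Posting p : index.postings(term)) {
            System.out.println(docs.nameFor(p.docId) + " (" + p.tf + ")");
        }
    }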

What to Hand In

Your team needs to hand in the following in the order given.

  1. A team report for phase two of the project with a cover page. The report shouldn't be too long, maybe two to three pages. The report should include the following as a minimum.
  2. Source code for the programs and HTML pages.
  3. Snapshots of sample runs. You can use any copy-and-paste features to save the result of running the programs to a text file.
  4. Follow the instructions in submissions.html to submit your work.
