Search Engine Project Part 3: Crawling the Web

Summer 2014


Introduction

In this phase of the project you will implement a crawler (also called a robot or spider) to collect information from the web. The crawler starts with a given set of one or more seed URLs. It retrieves the pages at those URLs, parses them, and extracts new URLs from them. The crawler then visits the pages these newly harvested URLs point to. This process continues until the allowed time has expired, the number of retrieved web pages has reached its limit, or there are no new pages to visit.

The retrieved web pages are passed to the Indexer you built in the second phase of the project. The Indexer builds an inverted index over all the retrieved pages, ready for a user to search. In the index-building phase, your Indexer took a set of local files as its input source; now it takes its input from the web pages retrieved by the crawler.
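As a rough sketch of this change (the Indexer interface shown here is hypothetical; substitute whatever entry point your Part 2 code actually exposes), the only real difference is that a document is now identified by its URL rather than a file name:

    import re

    def strip_tags(html):
        # Crude tag removal for illustration; your Part 2 HTML
        # handling is likely more careful than this.
        return re.sub(r"<[^>]+>", " ", html)

    def index_crawled_pages(indexer, pages):
        # `pages` maps URL -> raw HTML, as produced by your crawler.
        # indexer.index_document(doc_id, text) is a hypothetical entry
        # point standing in for your Part 2 Indexer's interface.
        for url, html in pages.items():
            indexer.index_document(doc_id=url, text=strip_tags(html))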

General Algorithm

Traversing the web is very similar to traversing a general graph: each web page can be considered a node, and each hyperlink an edge between nodes. From this point of view, crawling the web is not much different from the graph-traversal algorithms you learned in a typical data structures course.

The following is the general web-traversal algorithm we discussed in lecture.

Figure 1: General Algorithm for Crawling the Web
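In case the figure does not display, here is a minimal Python sketch of the same idea: a frontier queue of URLs to visit, a visited set, and a loop that fetches, records, and expands pages until a limit is reached. The fetch() and extract_links() helpers and the page limit are illustrative assumptions, not part of the lecture algorithm; a real crawler would use a proper HTML parser rather than a regular expression.

    import re
    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen

    def fetch(url):
        # Retrieve one page, returning its text or None on failure.
        try:
            with urlopen(url, timeout=5) as resp:
                return resp.read().decode("utf-8", errors="replace")
        except Exception:
            return None

    def extract_links(html):
        # Very rough href extraction; use a real HTML parser in practice.
        return re.findall(r'href=[\'"]?([^\'" >]+)', html, re.IGNORECASE)

    def crawl(seed_urls, max_pages=50):
        frontier = deque(seed_urls)   # URLs waiting to be visited
        visited = set()               # URLs already attempted
        pages = {}                    # URL -> page content
        while frontier and len(pages) < max_pages:
            url = frontier.popleft()
            if url in visited:
                continue
            visited.add(url)
            html = fetch(url)
            if html is None:
                continue
            pages[url] = html
            for link in extract_links(html):   # harvest new URLs
                absolute = urljoin(url, link)
                if absolute.startswith("http") and absolute not in visited:
                    frontier.append(absolute)
        return pages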

Some Issues To Be Considered

Because of the vast size of the web, a successful, unobtrusive crawler must address several technical and engineering issues. We list here some of the issues to consider when crawling the web.
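One such issue is politeness: a crawler should honor each site's robots.txt and pause between requests so it does not overload the server. Below is a minimal sketch using Python's standard urllib.robotparser; the user-agent string and one-second delay are arbitrary choices, and fetch() refers to the sketch under Figure 1.

    import time
    import urllib.robotparser
    from urllib.parse import urlparse, urlunparse

    _robots = {}  # host -> cached RobotFileParser

    def allowed_to_fetch(url, agent="student-crawler"):
        # Consult the site's robots.txt, fetching and caching it per host.
        parts = urlparse(url)
        if parts.netloc not in _robots:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(urlunparse((parts.scheme, parts.netloc,
                                   "/robots.txt", "", "", "")))
            try:
                rp.read()
            except OSError:
                pass  # unreachable robots.txt: can_fetch() then returns False
            _robots[parts.netloc] = rp
        return _robots[parts.netloc].can_fetch(agent, url)

    def polite_fetch(url, delay=1.0):
        # Fetch only when robots.txt permits, pausing between requests.
        if not allowed_to_fetch(url):
            return None
        time.sleep(delay)
        return fetch(url)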

Testing Your Crawler

Test your programs on a small website first; for example, you can test your crawler against the small website you built in the first phase of this project. Our goal is to crawl the English website of Southeast University.

If all components work correctly, you should be able to combine this phase with the second phase to provide a query/answer system. In the second phase of the project, your program answered queries by returning the names of the documents containing the query term, along with the term frequency counts. If you feed the crawled web pages into that program, it should return the URLs of the web pages that contain the query term.
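For instance, if your inverted index maps each term to the documents containing it together with frequency counts (a hypothetical layout; adapt this to whatever structure your Part 2 Indexer actually builds), the query side needs no change beyond the document identifiers now being URLs:

    def answer_query(inverted_index, term):
        # inverted_index is assumed to map term -> {doc_id: frequency},
        # where doc_id is now a URL instead of a local file name.
        postings = inverted_index.get(term.lower(), {})
        return sorted(postings.items(), key=lambda p: p[1], reverse=True)

    # Example: list pages mentioning "admission", most frequent first.
    # for url, freq in answer_query(index, "admission"):
    #     print(freq, url)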

What to Hand In

Your team needs to hand in the following in the order given.

  1. A team report for phase three of the project, with a cover page. The report shouldn't be too long; two to three pages is enough. The report should include the following as a minimum.
  2. Source code for the programs and any other supporting documents.
  3. Snapshots of sample runs. You can use copy-and-paste to save the output of running your programs to a text file.
  4. Email the instructor a copy of the complete source code and sample runs in zip or tar format.
