### Review Question Set Three

The final exam will take place on Friday, December 8th, at 8:00 am in
O'Leary 232. The exam will be comprehensive, with more emphasis on the
material covered since the last exam. The exam questions will be based on
the concepts covered by these review questions plus the two sets of
questions for the mid-term exams. You are allowed to bring one information
sheet. You will need to do some computational work, so a calculator would
be helpful but not absolutely necessary.

The questions below are based on the new material covered since the second
mid-term exam.
- Text Properties
  - Zipf's law regarding word frequency and rank
  - Predicting occurrence frequencies using Zipf's law
  - Heaps' law regarding the size of the vocabulary in a document collection
  - Meta-data to describe a document
  - Mark-up language as a special case of meta-data
  - Structured mark-up language, XML
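
As a study aid for the computational questions, the two laws listed above can be sketched in a few lines of Python. This is a minimal sketch, not the course's official formulation: it assumes Zipf's law in the form f(r) = f(1) / r^s with exponent s = 1 by default, and Heaps' law in the form V = k * n^beta. The function names and the default values of `k` and `beta` are illustrative assumptions; real values depend on the collection.

```python
def zipf_predicted_frequency(rank, top_frequency, exponent=1.0):
    """Predict the occurrence count of the word at `rank` under Zipf's law.

    Assumes f(rank) = f(1) / rank**exponent, where f(1) is the count of
    the most frequent word.
    """
    if rank < 1:
        raise ValueError("rank starts at 1")
    return top_frequency / rank ** exponent


def heaps_vocabulary_size(num_tokens, k=40.0, beta=0.5):
    """Estimate vocabulary size under Heaps' law: V = k * n**beta.

    k and beta are collection-dependent; the defaults are illustrative only.
    """
    return k * num_tokens ** beta


if __name__ == "__main__":
    # If the most frequent word occurs 60,000 times, estimate the 10th word.
    print(zipf_predicted_frequency(10, 60000))
    # Estimated vocabulary size for a collection of one million tokens.
    print(heaps_vocabulary_size(1_000_000))
```

Under these assumptions, the product of rank and frequency stays roughly constant, which is the usual way to check a Zipf prediction by hand.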

- Text Clustering
  - Un-supervised clustering
  - Agglomerative (bottom-up) vs. divisive (top-down)
  - Direct clustering method
  - Elements in clustering: similarity function, threshold
  - Hierarchical Agglomerative Clustering algorithm (HAC)
  - Cluster similarity measurement:
    - Single link
    - Complete link
    - Group average
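
The three cluster similarity measurements above differ only in how the pairwise similarities between two clusters are combined. The sketch below is one possible illustration, assuming cosine similarity as the similarity function and a stopping threshold. The names `cluster_similarity` and `hac` are hypothetical, and "average" here is the mean over cross-cluster pairs, which is one common variant of group average and may differ from the definition used in class.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def cluster_similarity(c1, c2, sim, method="single"):
    """Combine pairwise similarities between clusters c1 and c2."""
    scores = [sim(x, y) for x in c1 for y in c2]
    if method == "single":      # most similar pair
        return max(scores)
    if method == "complete":    # least similar pair
        return min(scores)
    if method == "average":     # mean over all cross-cluster pairs
        return sum(scores) / len(scores)
    raise ValueError(f"unknown method: {method}")


def hac(points, sim, method="single", threshold=0.0):
    """Bottom-up clustering: merge the most similar pair of clusters
    until the best similarity drops below `threshold`."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), cluster_similarity(clusters[i], clusters[j], sim, method))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]  # i < j, so index j is still valid
        del clusters[j]
    return clusters


if __name__ == "__main__":
    docs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]
    print(hac(docs, cosine, method="single", threshold=0.9))
```

This version recomputes every pairwise cluster similarity on each pass, which is easy to follow but slow; it is meant for working small examples by hand, not for large collections.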

  - Non-hierarchical clustering
    - K-means algorithm
    - Distance metrics
      - Euclidean distance
      - L1 norm
      - Cosine similarity
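
For practice with the computational questions, the three metrics above can be written out directly. This is a minimal sketch using plain Python sequences as vectors; the function names are illustrative. Note that Euclidean distance and the L1 norm are distances (smaller means closer), while cosine similarity is a similarity (larger means closer).

```python
import math


def euclidean_distance(u, v):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def l1_distance(u, v):
    """L1 norm of the difference: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


if __name__ == "__main__":
    u, v = (0, 0), (3, 4)
    print(euclidean_distance(u, v), l1_distance(u, v))
    print(cosine_similarity((1, 2), (2, 4)))
```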

    - Buckshot algorithm
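
The K-means loop can be sketched as follows, assuming Euclidean distance and caller-supplied initial seeds. Taking the seeds as a parameter is a deliberate choice for illustration: in a Buckshot-style setup the seeds would come from running HAC on a small random sample of the collection, whereas plain K-means usually picks them at random. The function name `kmeans` and its parameters are assumptions, not a reference implementation.

```python
import math


def kmeans(points, seeds, max_iterations=100):
    """Alternate between assigning points to the nearest centroid and
    recomputing each centroid as the mean of its members."""
    centroids = [tuple(s) for s in seeds]
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: index of the closest centroid for every point.
        new_assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:  # converged: nothing moved
            break
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
    print(kmeans(pts, seeds=[(0, 0), (10, 10)]))
```

`math.dist` requires Python 3.8 or later.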

- Text Categorization
  - The basic concepts and applications of text categorization: determining the proper category to which a given piece of text belongs
  - Algorithms
    - Rocchio's vector space algorithm
    - K nearest-neighbor algorithm
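
The two categorization algorithms above can be compared side by side on a toy example. The sketch below is a simplified reading, assuming documents are already represented as term-weight vectors and compared by cosine similarity: the Rocchio-style classifier builds one prototype per category by summing its training vectors, and the nearest-neighbor classifier takes a majority vote among the `k` most similar training examples. All function names and the toy labels are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def rocchio_train(examples):
    """Build one prototype per category by summing its training vectors.
    Cosine ignores vector length, so the sum works as well as the mean."""
    prototypes = {}
    for vec, label in examples:
        if label in prototypes:
            prototypes[label] = [s + x for s, x in zip(prototypes[label], vec)]
        else:
            prototypes[label] = list(vec)
    return prototypes


def rocchio_classify(prototypes, vec):
    """Assign the category whose prototype is most similar to `vec`."""
    return max(prototypes, key=lambda label: cosine(prototypes[label], vec))


def knn_classify(examples, vec, k=3):
    """Majority vote among the k training examples most similar to `vec`."""
    neighbors = sorted(examples, key=lambda ex: cosine(ex[0], vec), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    training = [
        ((1.0, 0.0), "sports"), ((0.9, 0.2), "sports"),
        ((0.0, 1.0), "politics"), ((0.1, 0.8), "politics"),
    ]
    query = (0.8, 0.1)
    print(rocchio_classify(rocchio_train(training), query))
    print(knn_classify(training, query, k=3))
```

The Rocchio-style classifier does all its work at training time and compares a new document against one vector per category, while the nearest-neighbor classifier stores every training example and does its work at classification time.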