Abstract
This paper presents an index structure for retrieving electronic documents in digital libraries. The documents we consider may contain mistyped words or spelling errors. Given a query string (e.g., a search key), we want to find those documents that approximately contain the query, i.e., certain inserts, deletes and mismatches are allowed when matching the query with a word (or phrase) in the documents. Our approach is to store the documents sequentially in a database and hash their 'fingerprints' into a number of 'fingerprint files.' When the query is given, its fingerprints are also hashed into the files and a histogram of votes is constructed on the documents. We derive a lower bound, based on which one can prune a large number of nonqualifying documents (i.e., those whose votes are below the lower bound) during searching. The paper presents some experimental results, which demonstrate the effectiveness of the index structure and the lower bound.
Original language | English (US) |
---|---|
Pages (from-to) | 208-215 |
Number of pages | 8 |
Journal | Proceedings of the International Conference on Tools with Artificial Intelligence |
State | Published - 1995 |
Event | Proceedings of the 1995 IEEE 7th International Conference on Tools with Artificial Intelligence - Herndon, VA, USA Duration: Nov 5 1995 → Nov 8 1995 |
All Science Journal Classification (ASJC) codes
- Software