Abstract
The contents of office documents can be divided into structured and unstructured parts. In this paper, we present a sample-based approach that analyzes a document to derive its layout and conceptual structures and then extracts information from the structured part of the document. We represent a document's layout structure as an ordered labeled tree built by nested segmentation, and its conceptual structure as a set of attribute-type pairs. Layout similarities between the document to be processed and the samples are identified with an approximate tree matching method. Conceptual similarities are identified by analyzing the contents of the document and the samples and by measuring their degree of conceptual closeness. Finally, information is extracted by instantiating the attributes specified in the conceptual structure, based on the results of the document structure analysis.
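As a rough illustration of comparing layout structures represented as ordered labeled trees, the sketch below computes a simple top-down tree edit distance (in the style of Selkow's algorithm) between a sample layout and a document layout. This is a stand-in technique chosen for brevity and is not necessarily the approximate tree matching method used in the paper. The `Node` class, the unit edit costs, the `similarity` normalization, and the block labels (`letter`, `header`, and so on) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical ordered labeled tree node; the order of children matters."""
    label: str
    children: tuple = ()


def size(t: Node) -> int:
    """Number of nodes in the subtree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def top_down_distance(a: Node, b: Node) -> int:
    """Top-down edit distance with unit costs (an assumed cost model).

    Roots are always mapped to each other (cost 1 if labels differ).
    Child subtrees are aligned left to right: a subtree is either
    deleted or inserted whole (cost = its size) or matched recursively.
    """
    root_cost = 0 if a.label == b.label else 1
    m, n = len(a.children), len(b.children)
    # d[i][j]: cost of aligning the first i children of a with the first j of b
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            ca, cb = a.children[i - 1], b.children[j - 1]
            d[i][j] = min(
                d[i - 1][j] + size(ca),                      # delete subtree ca
                d[i][j - 1] + size(cb),                      # insert subtree cb
                d[i - 1][j - 1] + top_down_distance(ca, cb)  # match ca with cb
            )
    return root_cost + d[m][n]


def similarity(a: Node, b: Node) -> float:
    """Assumed normalization: 1.0 for identical trees, lower as they diverge."""
    return 1.0 - top_down_distance(a, b) / (size(a) + size(b))


# Hypothetical sample layout and a document that lacks the "date" block.
sample = Node("letter", (
    Node("header", (Node("sender"), Node("date"))),
    Node("body"),
    Node("signature"),
))
document = Node("letter", (
    Node("header", (Node("sender"),)),
    Node("body"),
    Node("signature"),
))

print(top_down_distance(sample, document))  # 1 (the missing "date" block)
print(similarity(sample, document))
```

Under these assumptions, a document would be compared against every stored sample, and the sample with the highest similarity would supply the conceptual structure whose attributes are then instantiated from the document's contents.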
Original language | English (US) |
---|---|
Pages (from-to) | 245-274 |
Number of pages | 30 |
Journal | Information Sciences |
Volume | 91 |
Issue number | 3-4 |
State | Published - Jun 1996 |
All Science Journal Classification (ASJC) codes
- Software
- Information Systems and Management
- Artificial Intelligence
- Theoretical Computer Science
- Control and Systems Engineering
- Computer Science Applications