The contents of office documents can be divided into structured and unstructured parts. In this paper, we present a sample-based approach that analyzes a document to derive its layout and conceptual structures, and then extracts information from the structured part of the document. We represent a document's layout structure as an ordered labeled tree obtained by nested segmentation, and its conceptual structure as a set of attribute-type pairs. Layout similarities between the document to be processed and the samples are identified by an approximate tree matching method. Conceptual similarities are identified by analyzing the contents of the document and the samples, and by measuring their degree of conceptual closeness. Finally, information is extracted by instantiating the attributes specified in the conceptual structure, based on the result of the document structure analysis.
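As a rough illustration of the layout-matching step, the sketch below models a layout structure as an ordered labeled tree and picks the closest sample by a tree distance. The abstract does not say which approximate tree matching method is used, so a simple top-down edit distance (in the style of Selkow) is swapped in here as a stand-in. The names `Node`, `tree_distance`, and `closest_sample`, the unit edit costs, and the segment labels are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """One segment of a layout structure: a label plus ordered children."""
    label: str
    children: List["Node"] = field(default_factory=list)


def size(t: Node) -> int:
    """Number of nodes in the subtree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def tree_distance(a: Node, b: Node) -> int:
    """Top-down edit distance between two ordered labeled trees.

    Assumed costs: relabeling a node costs 1, and inserting or deleting
    a whole subtree costs its node count.
    """
    relabel = 0 if a.label == b.label else 1
    m, n = len(a.children), len(b.children)
    # dp[i][j]: cost of aligning the first i children of a
    # with the first j children of b, preserving order.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(a.children[i - 1]),   # delete subtree
                dp[i][j - 1] + size(b.children[j - 1]),   # insert subtree
                dp[i - 1][j - 1]
                + tree_distance(a.children[i - 1], b.children[j - 1]),
            )
    return relabel + dp[m][n]


def closest_sample(doc: Node, samples: Dict[str, Node]) -> str:
    """Name of the sample whose layout tree is nearest to doc."""
    return min(samples, key=lambda name: tree_distance(doc, samples[name]))


# Hypothetical sample layouts and a document to classify.
memo = Node("doc", [Node("header", [Node("line"), Node("line")]),
                    Node("body")])
letter = Node("doc", [Node("header", [Node("line")]),
                      Node("body"), Node("signature")])
doc = Node("doc", [Node("header", [Node("line"), Node("line")]),
                   Node("body"), Node("footer")])

best = closest_sample(doc, {"memo": memo, "letter": letter})
```

Under these assumed costs, `doc` is at distance 1 from `memo` (one extra `footer` segment) and distance 2 from `letter`, so `best` is `"memo"`. In the approach described above, the conceptual structure attached to the selected sample (its attribute-type pairs) would then guide which attributes to instantiate from the document.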
All Science Journal Classification (ASJC) codes
- Theoretical Computer Science
- Control and Systems Engineering
- Computer Science Applications
- Information Systems and Management
- Artificial Intelligence