Information extraction from the structured part of office documents

Xiaolong Hao, Jason T.L. Wang, Peter A. Ng

Research output: Contribution to journal, Article, peer-review

8 Scopus citations

Abstract

The contents of office documents can be divided into structured and unstructured parts. In this paper, we present a sample-based approach to analyzing a document to form its layout and conceptual structures, and then extracting information from the structured part of the office documents. We represent a document's layout structure as an ordered labeled tree structure using nested segmentation, and its conceptual structure as a set of attribute type pairs. The layout similarities between the document to be processed and samples are identified by employing an approximate tree matching method. The conceptual similarities are identified by analyzing document and sample contents, and by measuring the degree of conceptual closeness. Finally, the information is extracted by instantiating the attributes specified in the conceptual structure based on the result of document structure analysis.
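As an illustration of the layout-matching idea in the abstract, the sketch below models a layout structure as an ordered labeled tree and compares a document against a sample with a simple top-down (Selkow-style) tree edit distance. This is a swapped-in, simplified technique, not necessarily the approximate tree matching method used in the paper. The names `Node`, `size`, and `tree_distance`, the unit edit costs, and the block labels ("letter", "header", and so on) are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A node of an ordered labeled tree (hypothetical layout block)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def size(t: Node) -> int:
    """Number of nodes in the subtree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def tree_distance(a: Node, b: Node) -> int:
    """Top-down ordered tree edit distance with unit costs (assumed).

    Relabeling a node costs 1; inserting or deleting a whole subtree
    costs the number of nodes in that subtree.
    """
    cost = 0 if a.label == b.label else 1
    m, n = len(a.children), len(b.children)
    # d[i][j]: cost of aligning the first i children of a
    # with the first j children of b, preserving sibling order.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(a.children[i - 1]),   # delete subtree
                d[i][j - 1] + size(b.children[j - 1]),   # insert subtree
                d[i - 1][j - 1]
                + tree_distance(a.children[i - 1], b.children[j - 1]),
            )
    return cost + d[m][n]


# Hypothetical sample layout and a document to be processed.
sample = Node("letter", [
    Node("header", [Node("date"), Node("sender")]),
    Node("body"),
])
doc = Node("letter", [
    Node("header", [Node("date")]),
    Node("body"),
    Node("signature"),
])

print(tree_distance(sample, doc))
```

In this setup a smaller distance means the document's layout is closer to the sample's, so the sample with the lowest distance would be the natural candidate for guiding extraction.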

Original language: English (US)
Pages (from-to): 245-274
Number of pages: 30
Journal: Information Sciences
Volume: 91
Issue number: 3-4
State: Published - Jun 1996

All Science Journal Classification (ASJC) codes

  • Software
  • Information Systems and Management
  • Artificial Intelligence
  • Theoretical Computer Science
  • Control and Systems Engineering
  • Computer Science Applications
