Improving OCR performance in biomedical literature retrieval through preprocessing and postprocessing

Songhua Xu, James McCusker, Martin Schultz, Michael Krauthammer

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

3 Scopus citations

Abstract

Today's information retrieval (IR) techniques are mostly text-based. As a consequence, some types of information are beyond the reach of text-based IR systems, which fail where textual information cannot be easily accessed, e.g., text embedded in biomedical images and figures. To tackle such situations, we propose to augment IR systems with the ability to perform optical character recognition (OCR). A principal obstacle is the accuracy of the OCR procedure, which is often error-prone. In our work, we introduce preprocessing and postprocessing techniques for improving OCR performance. Our preprocessing stage separates text from graphical elements in an image so that the graphics do not degrade OCR performance, as today's OCR engines are optimized for documents without graphical elements. Our postprocessing stage performs context-based correction of the OCR results. Experimental results show that these preprocessing and postprocessing techniques can consistently improve the performance of biomedical image OCR in terms of either precision or recall.
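
The abstract does not say how the context-based correction is carried out, so the following is only a minimal sketch of one plausible approach, not the authors' method. It assumes the "context" is a vocabulary of words taken from text near the figure (for example, its caption) and that each OCR token is replaced by its closest vocabulary word when the two are similar enough. The function name `correct_tokens`, the `context_vocab` argument, and the `cutoff` value are all hypothetical choices made for illustration; the similarity measure is the one provided by Python's standard `difflib` module.

```python
import difflib


def correct_tokens(ocr_tokens, context_vocab, cutoff=0.75):
    """Replace each OCR token with its closest context word, if close enough.

    ocr_tokens    -- list of strings produced by an OCR engine (assumed input)
    context_vocab -- words taken from surrounding text, e.g. a caption (assumed)
    cutoff        -- minimum difflib similarity in [0, 1] to accept a correction
    """
    # Compare case-insensitively, but return the spelling used in the context.
    lookup = {word.lower(): word for word in context_vocab}
    candidates = list(lookup)

    corrected = []
    for token in ocr_tokens:
        key = token.lower()
        if key in lookup:
            # Exact match: nothing to correct.
            corrected.append(lookup[key])
            continue
        # Ask difflib for the single best candidate above the cutoff.
        matches = difflib.get_close_matches(key, candidates, n=1, cutoff=cutoff)
        # Keep the original token when no context word is similar enough.
        corrected.append(lookup[matches[0]] if matches else token)
    return corrected


if __name__ == "__main__":
    # Illustrative data only: typical digit-for-letter OCR confusions.
    tokens = ["pr0tein", "k1nase", "xyz"]
    vocab = ["protein", "kinase", "expression"]
    print(correct_tokens(tokens, vocab))
```

A higher `cutoff` favors precision (fewer, safer corrections), while a lower one favors recall (more tokens mapped to context words, at the risk of wrong replacements), which mirrors the precision/recall trade-off mentioned in the abstract.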

Original language: English (US)
Title of host publication: 3rd International Symposium on Semantic Mining in Biomedicine, SMBM 2008 - Proceedings
Pages: 161-164
Number of pages: 4
State: Published - Dec 1 2008
Externally published: Yes
Event: 3rd International Symposium on Semantic Mining in Biomedicine, SMBM 2008 - Turku, Finland
Duration: Sep 1 2008 to Sep 3 2008

Other

Other: 3rd International Symposium on Semantic Mining in Biomedicine, SMBM 2008
Country: Finland
City: Turku
Period: 9/1/08 to 9/3/08

All Science Journal Classification (ASJC) codes

  • Computer Science Applications
  • Biomedical Engineering
