Abstract
Today's information retrieval (IR) techniques are mostly text-based. As a consequence, text-based IR systems fail when textual information cannot be easily accessed, e.g. text embedded in biomedical images and figures. To tackle such situations, we propose augmenting IR systems with the ability to perform optical character recognition (OCR). A principal obstacle is the accuracy of OCR, which is often error-prone. We introduce preprocessing and postprocessing techniques for improving OCR performance. The preprocessing stage separates text from graphical elements in an image so that the graphics do not degrade OCR performance, since today's OCR engines are optimized for documents without graphical elements. The postprocessing stage applies context-based correction to the OCR results. Experimental results show that these preprocessing and postprocessing techniques consistently improve the performance of biomedical image OCR in terms of either precision or recall.
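The abstract does not specify how the context-based correction or the evaluation is implemented. As a rough illustration only, the sketch below assumes that "context" is a word list (for example, terms taken from text surrounding a figure) and that correction means snapping each OCR token to its closest entry in that list by string similarity. The names `correct_tokens`, `precision_recall`, `context_vocab`, and the `cutoff` value are hypothetical and are not taken from the paper.

```python
from difflib import get_close_matches


def correct_tokens(ocr_tokens, context_vocab, cutoff=0.75):
    """Replace each OCR token with its closest context word, if one is close enough.

    Assumption: context_vocab is a plain list of words; the paper's actual
    notion of context may differ.
    """
    # Map lowercase forms back to the original spelling in the vocabulary.
    lowered = {word.lower(): word for word in context_vocab}
    candidates = list(lowered)
    corrected = []
    for token in ocr_tokens:
        match = get_close_matches(token.lower(), candidates, n=1, cutoff=cutoff)
        # Keep the raw token when nothing in the vocabulary is similar enough.
        corrected.append(lowered[match[0]] if match else token)
    return corrected


def precision_recall(predicted, reference):
    """Set-based precision and recall of predicted words against reference words."""
    pred, ref = set(predicted), set(reference)
    true_pos = len(pred & ref)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(ref) if ref else 0.0
    return precision, recall


# Toy data, invented for illustration (not results from the paper).
raw = ["pr0tein", "kinase", "xyz"]
vocab = ["protein", "kinase", "binding"]
fixed = correct_tokens(raw, vocab)
print(fixed)
print(precision_recall(raw, vocab), precision_recall(fixed, vocab))
```

On this toy input the misread token `pr0tein` is mapped to `protein`, while `xyz` is left unchanged because no vocabulary entry is similar enough. A lower `cutoff` would correct more aggressively at the risk of replacing tokens that were read correctly.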
Original language | English (US) |
---|---|
Title of host publication | 3rd International Symposium on Semantic Mining in Biomedicine, SMBM 2008 - Proceedings |
Pages | 161-164 |
Number of pages | 4 |
State | Published - Dec 1 2008 |
Externally published | Yes |
Event | 3rd International Symposium on Semantic Mining in Biomedicine, SMBM 2008 - Turku, Finland; Duration: Sep 1 2008 → Sep 3 2008 |
Other
Other | 3rd International Symposium on Semantic Mining in Biomedicine, SMBM 2008 |
---|---|
Country/Territory | Finland |
City | Turku |
Period | 9/1/08 → 9/3/08 |
All Science Journal Classification (ASJC) codes
- Computer Science Applications
- Biomedical Engineering