Auditing National Cancer Institute thesaurus neoplasm concepts in groups of high error concentration

Ling Zheng, Hua Min, Yan Chen, Julia Xu, James Geller, Yehoshua Perl

Research output: Contribution to journal › Article › peer-review

5 Scopus citations

Abstract

The National Cancer Institute thesaurus is an important knowledge resource that should ideally be error-free. We investigated the occurrence of errors in the Neoplasm subhierarchy, which is part of the Disease, Disorder or Finding hierarchy of the National Cancer Institute thesaurus. There are five key findings in this study. (1) Errors in the Neoplasm subhierarchy are not uniformly distributed. (2) A partial-area taxonomy, which is a compact network for summarizing the structure and content of an ontology, helped uncover groups of concepts, called "small partial-areas," in the Neoplasm subhierarchy. (3) The error rate in "small partial-areas" is twice that in "large partial-areas" (44% versus 22%), a statistically significant difference. Thus, we conclude that higher error concentrations exist in small partial-areas. (4) Group-based auditing can be used successfully to identify additional suspicious concepts in a small group once a few members of the group are already known to be erroneous. (5) Error correction propagation can be used successfully and with minimal effort to correct additional errors in the Neoplasm subhierarchy that occur outside of an initial small group of erroneous concepts. We present examples of errors and examples of how corrections transform and simplify the partial-area taxonomy.

Original language: English (US)
Pages (from-to): 113-130
Number of pages: 18
Journal: Applied Ontology
Volume: 12
Issue number: 2
State: Published - 2017

All Science Journal Classification (ASJC) codes

  • General Computer Science
  • Language and Linguistics
  • Linguistics and Language

Keywords

  • Abstraction network
  • Error concentration
  • NCI thesaurus
  • Ontology auditing
  • Ontology quality assurance
