Abstract
Duplicate entity detection in biological data is an important research task. In this paper, we propose a novel, context-sensitive Shortest Path Edit Distance (SPED) that extends and supplements our previous work on Markov Random Field-based Edit Distance (MRFED). SPED transforms the edit distance computation into the calculation of the shortest path between two selected vertices of a graph. We produce several variants of SPED by applying Levenshtein, arithmetic mean, histogram difference and TF-IDF techniques to solve subtasks. We compare the performance of SPED with that of other well-known distance algorithms for biological entity matching. The experimental results show that SPED produces competitive outcomes.
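The abstract gives no algorithmic detail, so the following is only a minimal sketch of the general idea it names: casting an edit distance as a shortest path between two vertices of a graph. It is not the authors' SPED. It assumes the classic edit graph with unit Levenshtein costs, where vertex `(i, j)` means "the first `i` characters of `a` have been turned into the first `j` characters of `b`", and it uses Dijkstra's algorithm as the path finder. The function name `edit_distance_as_shortest_path` is hypothetical, and the context-sensitive weights that SPED derives from the arithmetic mean, histogram difference and TF-IDF subtasks are not modelled here.

```python
import heapq


def edit_distance_as_shortest_path(a: str, b: str) -> int:
    """Levenshtein distance computed as a shortest path in the edit graph.

    Illustrative assumption: unit costs for insertion, deletion and
    substitution. This is NOT the weighting scheme used by SPED.
    """
    m, n = len(a), len(b)
    source, target = (0, 0), (m, n)
    dist = {source: 0}
    heap = [(0, source)]

    while heap:
        d, (i, j) = heapq.heappop(heap)
        if (i, j) == target:
            return d
        if d > dist[(i, j)]:
            continue  # stale queue entry, a shorter path was already found

        edges = []
        if i < m:
            edges.append(((i + 1, j), 1))  # delete a[i]
        if j < n:
            edges.append(((i, j + 1), 1))  # insert b[j]
        if i < m and j < n:
            # Diagonal edge: free on a match, cost 1 on a substitution.
            edges.append(((i + 1, j + 1), 0 if a[i] == b[j] else 1))

        for node, weight in edges:
            candidate = d + weight
            if candidate < dist.get(node, float("inf")):
                dist[node] = candidate
                heapq.heappush(heap, (candidate, node))

    return dist[target]


if __name__ == "__main__":
    # Two made-up entity names that differ in a single character.
    print(edit_distance_as_shortest_path("BRCA1", "BRCA2"))
```

Under this framing, changing the distance measure amounts to changing the edge weights while the path search stays the same, which is one plausible reading of how different techniques could be plugged in to solve subtasks.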
Original language | English (US) |
---|---|
Pages (from-to) | 395-410 |
Number of pages | 16 |
Journal | International Journal of Data Mining and Bioinformatics |
Volume | 4 |
Issue number | 4 |
State | Published - Jul 2010 |
All Science Journal Classification (ASJC) codes
- Information Systems
- General Biochemistry, Genetics and Molecular Biology
- Library and Information Sciences
Keywords
- Biological entity matching
- Duplicate record detection
- Histogram matching
- Levenshtein
- SPED
- Shortest path edit distance
- Text mining