Abstract
Presence-only data arise in classification problems and consist of a sample of observations from the presence class together with a large number of background observations whose presence/absence status is unknown. Because absence data are generally unavailable, conventional semi-supervised learning approaches are no longer appropriate: they tend to degenerate and assign all observations to the presence class. In this article, we propose a generalized class balance constraint that can be incorporated into semi-supervised learning approaches to prevent this degeneration. Furthermore, to circumvent the difficulty of model tuning with presence-only data, we develop a selection criterion based on classification stability, which measures the robustness of any given classification algorithm against sampling randomness. The effectiveness of the proposed approach is demonstrated through a variety of simulated examples and an application to gene function prediction.
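The abstract does not spell out the constraint or the stability criterion, so the following is only a minimal sketch of the general idea, not the method of the article. It assumes a toy one-dimensional threshold rule in which a hypothetical tuning parameter `q` fixes the fraction of background observations labeled as presence (a crude stand-in for a class balance constraint), and it measures instability as the average disagreement between two rules fitted on independent random half-samples. All function names, the parameter `q`, and the simulated data are illustrative assumptions.

```python
import random
from statistics import mean


def fit_threshold_rule(presence, background, q):
    """Toy 1-D rule that labels roughly a fraction q of the background as presence.

    Fixing q rules out the degenerate solution that assigns every
    observation to the presence class (assumed stand-in for a balance constraint).
    """
    # Use the labeled presence sample only to decide which side is "presence".
    high_side = mean(presence) >= mean(background)
    ranked = sorted(background, reverse=high_side)
    k = max(1, min(len(ranked) - 1, round(q * len(ranked))))
    cut = ranked[k - 1]  # k-th most presence-like background value
    if high_side:
        return lambda x: x >= cut
    return lambda x: x <= cut


def disagreement(labels_a, labels_b):
    """Fraction of observations on which two label vectors differ."""
    return sum(a != b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def instability(presence, background, q, n_rep=50, seed=0):
    """Average disagreement between rules fitted on two random half-samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rep):
        rules = []
        for _ in range(2):
            p_sub = rng.sample(presence, len(presence) // 2)
            b_sub = rng.sample(background, len(background) // 2)
            rules.append(fit_threshold_rule(p_sub, b_sub, q))
        # Compare the two fitted rules on the full background sample.
        labels = [[rule(x) for x in background] for rule in rules]
        total += disagreement(labels[0], labels[1])
    return total / n_rep


# Simulated presence-only data (illustrative): background is a 30/70 mixture.
data_rng = random.Random(1)
presence = [data_rng.gauss(2.0, 1.0) for _ in range(40)]
background = [
    data_rng.gauss(2.0, 1.0) if data_rng.random() < 0.3 else data_rng.gauss(0.0, 1.0)
    for _ in range(200)
]

# Instability for a few candidate values of the hypothetical parameter q.
scores = {q: instability(presence, background, q) for q in (0.1, 0.3, 0.5)}
```

A tuning procedure in this spirit could compare the values in `scores` and favor the candidate with lower instability. How the article actually combines stability with the balance constraint is not stated in the abstract.
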
| Original language | English (US) |
|---|---|
| Pages (from-to) | 134-143 |
| Number of pages | 10 |
| Journal | Computational Statistics and Data Analysis |
| Volume | 59 |
| Issue number | 1 |
| State | Published - Mar 2013 |
| Externally published | Yes |
All Science Journal Classification (ASJC) codes
- Statistics and Probability
- Computational Mathematics
- Computational Theory and Mathematics
- Applied Mathematics
Keywords
- Cross validation
- Functional genomics
- Stability
- Support vector machine
- Tuning