Model-Robust Subdata Selection for Big Data

Chenlu Shi, Boxin Tang

Research output: Contribution to journal, Article, peer-review

9 Scopus citations

Abstract

Subdata selection is necessary because of challenges arising from statistical analysis of big data using limited computing resources. The existing work on subdata selection relies heavily on a specified model, which calls for an approach that is robust to model misspecification. We propose the use of space-filling designs for subdata selection and examine a fast algorithm for its implementation. Our algorithm performs surprisingly well when compared to the reference distribution given by complete search. Simulations are conducted to compare our approach with a recently introduced IBOSS method, and the results show that our method is not just robust to model misspecification but also robust to model uncertainty. While robustness to model misspecification and uncertainty may be expected due to the nature of space-filling designs, we discover that our method enjoys an additional property of robustness when there exist substantial correlations among covariates.
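
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of space-filling subdata selection, not the authors' method. It assumes a greedy farthest-point heuristic as a stand-in for a maximin distance criterion: each new point is the one farthest from the points already chosen. The function name `greedy_maximin_subdata`, its parameters, and the choice to rescale covariates to [0, 1] are all hypothetical choices made for illustration.

```python
import numpy as np


def greedy_maximin_subdata(X, k, seed=0):
    """Return k row indices of X chosen by greedy farthest-point selection.

    Illustrative heuristic only (an assumption, not the paper's algorithm):
    it spreads the selected rows over the covariate space without using
    any response values or any regression model.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of rows")

    # Rescale each covariate to [0, 1] so no single column dominates distances.
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0  # constant columns: avoid division by zero
    Z = (X - lo) / span

    # Start from a random row (assumed starting rule).
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(n))]

    # min_dist[i] = squared distance from row i to its nearest chosen row.
    min_dist = np.sum((Z - Z[chosen[0]]) ** 2, axis=1)

    for _ in range(k - 1):
        nxt = int(np.argmax(min_dist))  # row farthest from the current selection
        chosen.append(nxt)
        # Update nearest-chosen distances with the newly added row.
        min_dist = np.minimum(min_dist, np.sum((Z - Z[nxt]) ** 2, axis=1))

    return np.array(chosen)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    full_data = rng.normal(size=(10_000, 3))  # synthetic covariates
    idx = greedy_maximin_subdata(full_data, k=50)
    print(idx.shape)
```

Because the selection depends only on the covariates, the same subdata can be reused for any model fitted afterwards, which is the sense in which such a rule is model-independent. Each step scans the data once, so the sketch performs about n times k distance evaluations in total.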

Original language: English (US)
Article number: 82
Journal: Journal of Statistical Theory and Practice
Volume: 15
Issue number: 4
State: Published - Dec 2021
Externally published: Yes

All Science Journal Classification (ASJC) codes

  • Statistics and Probability

Keywords

  • Massive data
  • Maximin distance design
  • Model-independent method
  • Space-filling design
