Efficient approximate top-k mutual information based feature selection

Md Abdus Salam, Senjuti Basu Roy, Gautam Das

Research output: Contribution to journalArticlepeer-review

1 Scopus citations

Abstract

Feature selection is an important step in the data science pipeline, and it is critical to develop efficient algorithms for this step. Mutual Information (MI) is one of the important measures used for feature selection, where attributes are sorted according to descending score of MI, and top-k attributes are retained. The goal of this work is to develop a new measure Attribute Average Conflict to effectively approximate top-k attributes, without actually calculating MI. Our proposed method is based on using the database concept of approximate functional dependency to quantify MI rank of attributes which to our knowledge has not been studied before. We demonstrate the effectiveness of our proposed measure with a Monte-Carlo simulation. We also perform extensive experiments using high dimensional synthetic and real datasets with millions of records. Our results show that our proposed method demonstrates perfect accuracy in selecting the top-k attributes, yet is significantly more efficient than state-of-art baselines, including exact methods for computing Mutual Information based feature selection, as well as adaptive random- sampling based approaches. We also investigate the upper and lower bounds of the proposed new measure and show that tighter bounds can be derived by using marginal frequency of attributes in specific arrangements. The bounds on the proposed measure can be used to select top-k attributes without full scan of the dataset in a single pass. We perform experimental evaluation on real datasets to show the accuracy and effectiveness of this approach.

Original languageEnglish (US)
Pages (from-to)191-223
Number of pages33
JournalJournal of Intelligent Information Systems
Volume61
Issue number1
DOIs
StatePublished - Aug 2023

All Science Journal Classification (ASJC) codes

  • Software
  • Information Systems
  • Hardware and Architecture
  • Computer Networks and Communications
  • Artificial Intelligence

Keywords

  • Data science
  • Feature selection
  • Mutual information
  • Top-k attribute selection

Fingerprint

Dive into the research topics of 'Efficient approximate top-k mutual information based feature selection'. Together they form a unique fingerprint.

Cite this