Cluster analysis, a.k.a. Clustering, is a machine-learning technique used to group together entities that are related or share similar characteristics. Clustering has applications in medicine, biology, social sciences, robotics, and earth sciences, including high-stakes domains such as medical image processing or medical diagnosis, predicting disease-related genes, or resource allocation in acute disease. However, currently, users or developers of applications that incorporate clustering have no assurances that the applications are reliable; this calls into question results or diagnoses obtained with the use of clustering, and discourages researchers or practitioners from using clustering applications. This project will make clustering implementations more reliable, easier to develop, and easier to fix. Software engineers will benefit from implementations that are easier to program/fix/check. In turn, the resulting software will be more reliable, benefiting end-users. The project will introduce students and IT professionals to challenges in, as well as approaches for, dependable machine learning; this will make students and professionals better equipped for tackling emerging software research and development challenges. Ongoing outreach and support efforts, to minorities and underrepresented groups, will continue.
Clustering use will expand with the increasing interest in machine learning in general, and increased adoption of machine-learning implementations (hardware or software) in software-driven products. Hence it is imperative that clustering implementations be dependable. Researchers in this space lack definitions of basic clustering-correctness properties and effective/efficient analyses for verifying clustering implementations' properties. This project will address the aforementioned issues by defining clustering correctness starting from first principles, e.g., program determinism; and constructing approaches for verifying correctness via program analysis. Approaches will include differential execution, white-box as well as black-box techniques, dynamic slicing, and symbolic execution. The scope of this work includes 'purely software' clustering implementations as well as implementations that use hardware acceleration. Using these tools, developers and researchers will be able to gain effective insights into clustering-implementation behavior and dependability. Researchers will be able to use the general principles introduce developed in this work to construct program analyses in other domains, e.g., scientific computing, numerical computing, high-performance computing, and use the tools/approaches for lockstep-executing, slicing, or symbolically executing other categories of data-intensive applications.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
|Effective start/end date||10/1/20 → 9/30/23|
- National Science Foundation: $400,000.00