The increasing availability of bridge data from the National Bridge Inventory (NBI) offers a great opportunity to perform predictive analytics (such as bridge deterioration prediction) using machine learning (ML) pipelines to support bridge asset management. However, data quality issues (e.g., outliers and missing values) can significantly affect ML pipelines, so upstream tasks must be performed to ensure the validity, applicability, and generalizability of the pipelines. Among these tasks, outlier removal and missing value imputation are the most challenging because the process is highly laborious, data governance is lacking, and the data mix heterogeneous quality issues and data types. To address this challenge, this paper proposes an upstream workflow for enhancing the downstream predictive analytics of bridge-related ML pipelines. The proposed upstream workflow was developed based on the NBI data collected for all states in the United States, which include a total of 617,084 observations (bridges). Existing bridge domain knowledge from multiple sources (such as the bridge design manual and regulations) was leveraged to remove outliers. Then, this study applied and compared 10 statistical and ML-based data imputation techniques to impute missing values. Statistical analysis and imputation evaluation of the NBI data indicated that: (1) of the 38 frequently used features (variables), 19 had outliers and 15 had missing values; (2) categorical features are generally more prone to data dropping due to inapplicable values, while numeric features are more subject to outliers; and (3) ML-based data imputation is more suitable than statistical imputation for both numeric and categorical features, especially for features with high missing rates. The proposed workflow was validated on its capability to improve downstream predictive analytics for bridge deck condition prediction, increasing the balanced accuracy by 6.85% to 9.76%.
This paper contributes to the body of knowledge by offering a novel upstream workflow that can be used as a benchmark to guide researchers and bridge engineering practitioners in handling NBI data quality issues, so that predictive analytics using ML pipelines can be performed more effectively.
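The three stages described above (rule-based outlier removal, imputation, and evaluation by balanced accuracy) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the records, the field names (`year_built`, `deck_width_m`, `adt`), and the validity bounds are hypothetical and are not NBI item definitions or the rules used in the paper. Mean imputation and a simple nearest-neighbour imputation stand in for the statistical and ML-based families; they are not necessarily among the 10 techniques the study compared.

```python
from statistics import mean

# Hypothetical bridge records (not real NBI data); None marks a missing value.
bridges = [
    {"year_built": 1975, "deck_width_m": 12.0, "adt": 8000},
    {"year_built": 1890, "deck_width_m": 6.5, "adt": None},
    {"year_built": 2050, "deck_width_m": 10.0, "adt": 5000},   # implausible year
    {"year_built": 1998, "deck_width_m": -3.0, "adt": 12000},  # negative width
    {"year_built": 2005, "deck_width_m": 14.0, "adt": None},
    {"year_built": 1960, "deck_width_m": 9.0, "adt": 3000},
]

# Illustrative domain rules; real bounds would come from design manuals and regulations.
RULES = {
    "year_built": lambda v: 1800 <= v <= 2024,
    "deck_width_m": lambda v: 0 < v <= 100,
}

def remove_outliers(rows, rules):
    """Keep rows whose non-missing values satisfy every domain rule."""
    return [r for r in rows
            if all(r[k] is None or ok(r[k]) for k, ok in rules.items())]

def impute_mean(rows, target):
    """Statistical baseline: fill missing values with the observed mean."""
    fill = mean(r[target] for r in rows if r[target] is not None)
    return [dict(r, **{target: fill}) if r[target] is None else dict(r)
            for r in rows]

def impute_knn(rows, target, predictors, k=1):
    """Fill missing values with the mean of the k nearest complete rows,
    using squared Euclidean distance on range-scaled predictors."""
    complete = [r for r in rows if r[target] is not None]
    spans = {p: (max(r[p] for r in rows) - min(r[p] for r in rows)) or 1.0
             for p in predictors}

    def dist(a, b):
        return sum(((a[p] - b[p]) / spans[p]) ** 2 for p in predictors)

    out = []
    for r in rows:
        if r[target] is None:
            nearest = sorted(complete, key=lambda c: dist(r, c))[:k]
            out.append(dict(r, **{target: mean(c[target] for c in nearest)}))
        else:
            out.append(dict(r))
    return out

def balanced_accuracy(y_true, y_pred):
    """Average of per-class recall, which is robust to class imbalance."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

clean = remove_outliers(bridges, RULES)
by_mean = impute_mean(clean, "adt")
by_knn = impute_knn(clean, "adt", ["year_built", "deck_width_m"], k=1)
```

In this toy setup the mean fills every gap with the same value, whereas the nearest-neighbour variant borrows the value from the most similar bridge. Balanced accuracy is used for the downstream check because condition classes are typically imbalanced, so plain accuracy would reward predicting the majority class.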
All Science Journal Classification (ASJC) codes
- Civil and Structural Engineering
- Building and Construction