TY - JOUR
T1 - Improving the Predictive Analytics of Machine-Learning Pipelines for Bridge Infrastructure Asset Management Applications
T2 - An Upstream Data Workflow to Address Data Quality Issues in the National Bridge Inventory Database
AU - Hu, Xi
AU - Assaad, Rayan H.
N1 - Publisher Copyright: © 2023 American Society of Civil Engineers.
PY - 2024/1/1
Y1 - 2024/1/1
N2 - The increasing availability of bridge data from the National Bridge Inventory (NBI) offers a great opportunity to perform predictive analytics (such as bridge deterioration prediction) using machine learning (ML) pipelines for supporting bridge asset management. However, data quality issues (e.g., outliers and missing values) can significantly affect ML pipelines, requiring upstream tasks to be performed to ensure the validity, applicability, and generalizability of pipelines. Among these tasks, outlier removal and missing value imputation are the most challenging due to a highly laborious process, a lack of data governance, and a mixture of heterogeneous data quality issues and data types. To address this challenge, this paper proposes an upstream workflow for enhancing the downstream predictive analytics of bridge-related ML pipelines. The proposed upstream workflow was developed based on the NBI data collected for all states in the United States, which include a total of 617,084 observations/bridges. Existing bridge domain knowledge from multiple sources (such as the bridge design manual and regulations) was leveraged to remove outliers. Then, this study applied and compared 10 statistical and ML-based data imputation techniques to impute missing values. Statistical analysis and imputation evaluation of NBI data indicated that: (1) 19 and 15 out of the total 38 frequently used features or variables had outliers and missing values, respectively; (2) categorical features are generally more prone to data dropping due to inapplicable values, while numeric features are more subject to outliers; and (3) ML-based data imputation is more suitable than statistical imputation for both numeric and categorical features, especially for features with a high missing rate. The proposed workflow was validated on its capability of improving downstream predictive analytics for bridge deck condition prediction, increasing the balanced accuracy by 6.85%-9.76%. This paper contributes to the body of knowledge by offering a novel upstream workflow that can be utilized as a benchmark for guiding researchers and bridge engineering practitioners to handle NBI data quality issues to better perform predictive analytics using ML pipelines.
AB - The increasing availability of bridge data from the National Bridge Inventory (NBI) offers a great opportunity to perform predictive analytics (such as bridge deterioration prediction) using machine learning (ML) pipelines for supporting bridge asset management. However, data quality issues (e.g., outliers and missing values) can significantly affect ML pipelines, requiring upstream tasks to be performed to ensure the validity, applicability, and generalizability of pipelines. Among these tasks, outlier removal and missing value imputation are the most challenging due to a highly laborious process, a lack of data governance, and a mixture of heterogeneous data quality issues and data types. To address this challenge, this paper proposes an upstream workflow for enhancing the downstream predictive analytics of bridge-related ML pipelines. The proposed upstream workflow was developed based on the NBI data collected for all states in the United States, which include a total of 617,084 observations/bridges. Existing bridge domain knowledge from multiple sources (such as the bridge design manual and regulations) was leveraged to remove outliers. Then, this study applied and compared 10 statistical and ML-based data imputation techniques to impute missing values. Statistical analysis and imputation evaluation of NBI data indicated that: (1) 19 and 15 out of the total 38 frequently used features or variables had outliers and missing values, respectively; (2) categorical features are generally more prone to data dropping due to inapplicable values, while numeric features are more subject to outliers; and (3) ML-based data imputation is more suitable than statistical imputation for both numeric and categorical features, especially for features with a high missing rate. The proposed workflow was validated on its capability of improving downstream predictive analytics for bridge deck condition prediction, increasing the balanced accuracy by 6.85%-9.76%. This paper contributes to the body of knowledge by offering a novel upstream workflow that can be utilized as a benchmark for guiding researchers and bridge engineering practitioners to handle NBI data quality issues to better perform predictive analytics using ML pipelines.
UR - http://www.scopus.com/inward/record.url?scp=85175302540&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85175302540&partnerID=8YFLogxK
U2 - 10.1061/JBENF2.BEENG-6012
DO - 10.1061/JBENF2.BEENG-6012
M3 - Article
AN - SCOPUS:85175302540
SN - 1084-0702
VL - 29
JO - Journal of Bridge Engineering
JF - Journal of Bridge Engineering
IS - 1
M1 - 04023103
ER -