TY - GEN
T1 - Data-Centric Explainable Debiasing for Improving Fairness in Pre-trained Language Models
AU - Li, Yingji
AU - Du, Mengnan
AU - Song, Rui
AU - Wang, Xin
AU - Wang, Ying
N1 - Publisher Copyright: © 2024 Association for Computational Linguistics.
PY - 2024
Y1 - 2024
AB - The human-like social biases that pre-trained language models (PLMs) exhibit on downstream tasks have attracted increasing attention. Potential flaws in the training data are the main cause of unfairness in PLMs. Existing data-centric debiasing strategies mainly use explicit bias words (defined as sensitive attribute words specific to demographic groups) for counterfactual data augmentation to balance the training data. However, they do not consider implicit bias words that may be associated with explicit bias words in data with complex distributions, and these words indirectly harm the fairness of PLMs. To this end, we propose a Data-Centric Debiasing method (named Data-Debias), which uses an explainability method to search for implicit bias words to assist in debiasing PLMs. Specifically, we compute the feature attributions of all tokens using the Integrated Gradients method, and then treat the tokens that have a large impact on the model's decision as implicit bias words. To make the search results more precise, we iteratively train a biased model to amplify the bias with each iteration. Finally, we use the implicit bias words found in the last iteration to assist in debiasing PLMs. Extensive experiments on debiasing multiple PLMs across three different classification tasks demonstrate that Data-Debias achieves state-of-the-art debiasing performance and strong generalization while maintaining predictive ability.
UR - http://www.scopus.com/inward/record.url?scp=85205292363&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85205292363&partnerID=8YFLogxK
U2 - 10.18653/v1/2024.findings-acl.226
DO - 10.18653/v1/2024.findings-acl.226
M3 - Conference contribution
AN - SCOPUS:85205292363
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 3773
EP - 3786
BT - 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024 - Proceedings of the Conference
A2 - Ku, Lun-Wei
A2 - Martins, Andre
A2 - Srikumar, Vivek
PB - Association for Computational Linguistics (ACL)
T2 - Findings of the 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024
Y2 - 11 August 2024 through 16 August 2024
ER -