Synthetic Data Generation for Credit Scoring Models: Leveraging AI and Machine Learning to Improve Predictive Accuracy and Reduce Bias in Financial Services

Gunaseelan Namperumal; Akila Selvaraj; Yeswanth Surampudi

Authors

Gunaseelan Namperumal, ERP Analysts Inc, USA
Akila Selvaraj, iQi Inc, USA
Yeswanth Surampudi, Groupon, USA

Keywords:

synthetic data generation, credit scoring models

Abstract

The financial services sector is rapidly adopting AI and machine learning (ML) in credit scoring models to increase accuracy and remove biases that can result in unjust lending. Biases in historical data can perpetuate inequality, and synthetic data may help improve credit scoring. This study argues that AI and ML can generate synthetic credit assessment data that boosts prediction accuracy and reduces bias. It examines generative adversarial networks (GANs), variational autoencoders (VAEs), and differentially private (DP) synthetic data as ways to address privacy, dataset biases, and data constraints. These techniques produce realistic yet synthetic data for training and validating a fair credit scoring algorithm.

This paper addresses how synthetic data can enhance credit scoring algorithms and reduce bias. GANs may augment underrepresented groups by generating high-fidelity synthetic data that matches real-world distributions. VAEs might provide interpretable latent representations and probabilistic synthetic data that retain the trends relevant to credit risk assessment. A DP approach adds controlled noise and may strike a balance between data privacy and data utility. This work investigates synthetic data techniques that improve model fairness and generalizability. For each approach, we assess computational cost, scalability, and the risk of overfitting or of generating unrealistic data points.
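The trade-off between controlled noise and data utility mentioned above can be illustrated with the Laplace mechanism, the standard building block of differential privacy. The sketch below is a minimal illustration under stated assumptions and not the paper's generation pipeline: the function name `dp_mean`, the clipping bounds, and the sample debt-to-income values are all hypothetical, and the dataset size is treated as public.

```python
import random


def dp_mean(values, lower, upper, epsilon, rng=None):
    """Estimate the mean of `values` with epsilon-differential privacy.

    Hypothetical helper for illustration; assumes the record count is public.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if upper <= lower:
        raise ValueError("upper must be greater than lower")
    if not values:
        raise ValueError("values must be non-empty")
    rng = rng or random.Random()

    # Clip every value into [lower, upper] so that replacing one record
    # can shift the sum by at most (upper - lower).
    clipped = [min(max(v, lower), upper) for v in values]
    n = len(clipped)

    # Sensitivity of the mean under replacement of a single record.
    sensitivity = (upper - lower) / n
    scale = sensitivity / epsilon

    # The difference of two i.i.d. exponentials with mean `scale`
    # is distributed as Laplace(0, scale).
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return sum(clipped) / n + noise


# Hypothetical debt-to-income ratios, in percent.
ratios = [12.0, 35.5, 48.0, 22.5, 61.0]
strong_privacy = dp_mean(ratios, 0.0, 100.0, epsilon=0.1)  # more noise
weak_privacy = dp_mean(ratios, 0.0, 100.0, epsilon=1.0)    # less noise
```

A smaller `epsilon` gives a stronger privacy guarantee but a noisier, less useful estimate. DP synthetic data generators face the same tension at the level of the whole dataset.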
