Experimental Comparison of Financial Distress Prediction Models Using Imbalanced Data Sets

Document Type : Research Paper


1 Department of Accounting, Neyshabur Branch, Islamic Azad University, Neyshabur, Iran

2 Department of Accounting, Neyshabur Branch, Islamic Azad University, Neyshabur, Iran

3 Department of Accounting, Sabzevar Branch, Islamic Azad University, Sabzevar, Iran

4 Department of Accounting, Neyshabur Branch, Islamic Azad University, Neyshabur, Iran



From a machine learning perspective, predicting financial distress is challenging because the class distribution is extremely imbalanced. The goal of this study was to compare the performance of financial distress prediction models on imbalanced data sets with different class proportions. The data were taken from the year preceding financial distress and cover 760 company-years over the period 2007-2017. The models compared were the traditional classifiers logistic regression, linear discriminant analysis, and artificial neural network, together with the least squares support vector machine with four kernel functions, random forest, and the k-nearest neighbors (KNN) algorithm. Performance was measured by the area under the curve (AUC), and the Friedman and Nemenyi tests were used to determine the models' average ranks and the significance of the differences in their AUC. The models' optimal parameters were selected by combining grid search optimization with cross-validation. The results of this experimental study showed that random forest performed best on the balanced data sets and on the imbalanced data sets with lower imbalance proportions. On the more imbalanced data sets, the best performance belonged to the least squares support vector machine with sigmoid, radial, and linear kernel functions; the performance of the KNN algorithm did not differ significantly from that of the other models, and the performance of the artificial neural network was average or acceptable. The linear models, logistic regression and linear discriminant analysis, performed worse than the nonlinear models.
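The rank-based comparison described above (average ranks, the Friedman test, and the Nemenyi test on AUC values) can be illustrated with a minimal sketch. It is not the authors' implementation. The AUC table, the model labels, and all function names are hypothetical and serve only to show the arithmetic. The Friedman statistic is computed without a tie correction, and the Nemenyi critical values are the standard tabulated ones for a significance level of 0.05.

```python
import math

# Nemenyi critical values q_alpha at alpha = 0.05, indexed by the number of models k.
Q_ALPHA_05 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
              7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164}


def rank_row(scores):
    """Rank one data set's AUC values; rank 1 is the highest AUC, ties share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next score equals the current one (a tie).
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # mean of the 1-based positions in the tie block
        for j in range(pos, end + 1):
            ranks[order[j]] = mean_rank
        pos = end + 1
    return ranks


def average_ranks(auc_table):
    """Average rank of each model (column) over all data sets (rows)."""
    rows = [rank_row(row) for row in auc_table]
    k = len(auc_table[0])
    return [sum(r[j] for r in rows) / len(rows) for j in range(k)]


def friedman_statistic(avg_ranks, n):
    """Friedman chi-square statistic (no tie correction) for n data sets."""
    k = len(avg_ranks)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4)


def nemenyi_cd(k, n):
    """Critical difference: two models differ if their average ranks differ by more than this."""
    return Q_ALPHA_05[k] * math.sqrt(k * (k + 1) / (6 * n))


# Hypothetical AUC values (NOT results from the study): 4 data sets x 3 models.
models = ["model_A", "model_B", "model_C"]
auc_table = [
    [0.90, 0.85, 0.80],
    [0.88, 0.86, 0.79],
    [0.91, 0.84, 0.82],
    [0.87, 0.89, 0.78],
]

ranks = average_ranks(auc_table)
stat = friedman_statistic(ranks, n=len(auc_table))
cd = nemenyi_cd(k=len(models), n=len(auc_table))
print(dict(zip(models, ranks)), stat, cd)
```

In this toy table, two models would be declared significantly different only if their average ranks differ by more than the critical difference. In practice the Friedman statistic is first compared against a chi-square distribution with k - 1 degrees of freedom before the pairwise Nemenyi comparison is made.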


  • Receive Date: 22 July 2020
  • Revise Date: 28 February 2021
  • Accept Date: 01 March 2021
  • First Publish Date: 03 May 2021