This chapter reviews the current literature on support vector machines (SVMs) in credit scoring, with a special focus on comparing SVM performance with that of other classifiers, especially logistic regression (LR).
Over time, a few standardized data sets have come to be used so that the performance of different models can be compared across studies and over time. Notable examples include the German data set [31] and the Australian data set [3], both of which I refer to below.
The original idea of separating linearly separable observations with a hyperplane, chosen so as to maximize the minimal distance from the observations, dates back to 1962 [60]. The concept remained unnoticed by most researchers until the pivotal work of Cortes and Vapnik [12] was published in 1995. That paper formulated support vector machines (then referred to as support vector networks) in their current form (soft margin, kernel function) and showed the superiority of the SVM over the state-of-the-art classification algorithms of that time on the example of handwritten digit recognition.
This work strongly boosted interest in SVMs towards the close of the 20th century and at the beginning of the 21st. Its effect can be compared with that of the pioneering paper [52] on the back-propagation algorithm, which started the late-1980s interest in neural networks, neglected and underestimated until then.
Papers examining the performance of SVMs in credit scoring and other financial applications then started to appear, especially after 2000.
Baesens (2003) [4] applied SVMs and other classifiers to several credit data sets. He concludes that SVMs perform well compared with other algorithms; they do not, however, always yield the best-performing model. He also notes a peculiarity of credit data: they are typically hard to separate by any decision surface, because the data cannot capture the complexities of an individual’s life. Misclassification rates of around 20 % to 30 % are quite common on credit data.
Li (2004) [43] studied SVM performance on 1 000 credit records of a Chinese commercial bank. He concludes that the SVM performs noticeably better in terms of hit rate (by more than 50 %) than the credit scoring methodology the bank used at the time. However, he does not specify which methodology the bank used, which considerably weakens the significance of the whole paper.
Schebesch and Stecking (2005) [54] apply SVM to a database of applicants for building and loan credit. They conclude that SVMs perform slightly better than LR, but not significantly so.
In Huang et al. (2007) [36], SVM performance on the German and Australian credit data sets is compared against other data mining methods (back-propagation neural network, genetic programming and decision trees). For the support vector machines, several feature selection approaches are tested: an unrestricted model, feature selection by F-score and feature selection by a genetic algorithm (GA). The GA approach used a significantly lower number of features than the other methods, with a slightly higher hit rate. The authors conclude that the SVM is a competitive method for credit scoring when compared against other commonly used data mining algorithms, but that it is not significantly more accurate than them.
Bellotti et al. (2009) [5] used a data set consisting of the records of 25 000 credit card users and compared the performance of the SVM with LR, linear discriminant analysis (LDA) and k-nearest neighbours (kNN). They found that non-linear kernels do not perform better than the simple linear SVM. The polynomial kernel in particular performed poorly, which the authors attribute to possible over-fitting. The method for selecting significant features using the squared weights on the features, proposed by Guyon et al. (2002) [27], was used and enhanced. The paper concluded that LR and SVM tend to select the same features as the most important ones. Overall, the SVM performed slightly better than logistic regression, according to Bellotti.
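The core of the weight-based criterion mentioned above can be illustrated with a minimal sketch. It assumes a linear SVM trained by plain subgradient descent on the regularised hinge loss and then ranks the features by their squared weight. The function names, the toy data and the hyperparameters (lam, lr, epochs) are illustrative assumptions; this is neither the enhanced procedure of [5] nor the full method of [27], only the ranking idea.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=50):
    """Fit weights w and bias b by subgradient descent on the
    L2-regularised hinge loss. Labels in y must be +1 or -1."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # Functional margin of this observation under the current model.
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            for j in range(n_features):
                # The hinge term contributes only when the margin is violated.
                hinge_grad = -yi * xi[j] if margin < 1 else 0.0
                w[j] -= lr * (lam * w[j] + hinge_grad)
            if margin < 1:
                b += lr * yi
    return w, b


def rank_features(w):
    """Return feature indices ordered from largest to smallest squared weight."""
    return sorted(range(len(w)), key=lambda j: w[j] ** 2, reverse=True)


# Toy data (hypothetical): feature 0 is strongly informative,
# feature 1 weakly informative, feature 2 is constant and carries no signal.
X = [[1.0, 0.1, 0.0], [-1.0, -0.1, 0.0]] * 5
y = [1, -1] * 5

w, b = train_linear_svm(X, y)
print(rank_features(w))  # [0, 1, 2]
```

On real credit data the features would have to be put on a comparable scale first, since the magnitude of a weight depends on the scale of its feature.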
Ghodselahi (2011) [25] used 10 SVM classifiers as members of an ensemble model and concludes that its performance is significantly better than that of any individual SVM model or of logistic regression. The training was performed on the standardized German data set. However, the reported area under the ROC curve (AUC) for logistic regression is worse than the AUC reported by others on the very same data set (e.g. [42]), which casts doubt on the conclusions of this paper. It may be that the conclusion rests on the simple fact that the author was not able to fully exploit the strengths of the LR approach. This paper is nevertheless representative of a recent research trend in which ensemble models are preferred to individual ones.
Credit data sets are usually highly imbalanced, as bad cases occur significantly less often than good ones. Brown (2012) [7] examined the effect of data imbalance on several data mining algorithms. Support vector machines tended to perform poorly as the percentage of bad cases in the training data set decreased and the data imbalance increased.
Solving the data imbalance is not as simple as it may appear. Simple undersampling²¹ does not necessarily solve the problem. On the contrary, it may lead to solutions that are further from the ideal one, as [1] shows and proves. That paper suggests oversampling instead: a technique that generates new artificial bad cases from the actual ones, so that their total number matches the number of good cases. The authors achieved positive results with their method on 10 different data sets²². Despite this, I find the proposed idea somewhat “spooky” and not suitable for credit scoring.
The problem of unbalanced data in combination with support vector machines remains unresolved.
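The two resampling ideas discussed above can be sketched in a few lines. The sketch assumes the simplest variants: undersampling drops good cases at random until the classes are balanced, and oversampling duplicates existing bad cases at random. The latter is a deliberate simplification and not the method of [1], which generates new artificial cases. The function names and the toy data are hypothetical.

```python
import random


def undersample(goods, bads, seed=0):
    """Randomly drop good cases until goods and bads are in a 50:50 ratio."""
    rng = random.Random(seed)
    kept_goods = rng.sample(goods, k=min(len(goods), len(bads)))
    return kept_goods, list(bads)


def oversample(goods, bads, seed=0):
    """Randomly duplicate bad cases until their number matches the goods.

    Assumes at least one bad case is available.
    """
    rng = random.Random(seed)
    n_extra = max(0, len(goods) - len(bads))
    extra_bads = [rng.choice(bads) for _ in range(n_extra)]
    return list(goods), list(bads) + extra_bads


# Toy data (hypothetical): 90 good and 10 bad cases.
goods = [("good", i) for i in range(90)]
bads = [("bad", i) for i in range(10)]

goods_u, bads_u = undersample(goods, bads)
goods_o, bads_o = oversample(goods, bads)
print(len(goods_u), len(bads_u))  # 10 10
print(len(goods_o), len(bads_o))  # 90 90
```

The sketch makes the trade-off visible: undersampling discards 80 of the 90 good cases, while oversampling keeps all the data but lets the classifier see the same 10 bad cases repeatedly.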
Recently, doubts have been raised about the AUC as an indicator for comparing the performance of different classifiers, owing to its fundamental incoherence [30]. Several alternative measures have been proposed in the literature, one of them being the H-measure [29]. However, according to the extensive empirical classifier comparison in [32], there is a high correlation between the AUC and the H-measure. In other words, both measures lead to the same conclusions in the vast majority of cases. The authors conclude that using the AUC to evaluate credit scoring performance remains “safe” from the empirical point of view.
Probably the most in-depth analysis of current state-of-the-art machine learning algorithms was performed by Baesens et al. (2013) [42], exactly 10 years after their first comparative paper [4] was published.
The authors compared the performance of different algorithms under many different settings (1141 models in total) on 7 real-life credit data sets (the German and Australian sets among them), using 4 different performance measures to evaluate the models (percentage correctly classified, AUC, H-measure and Brier score). Models from 3 broad families were examined: individual classifiers (mostly linear classifiers), homogeneous ensembles and heterogeneous ensembles. [42]
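Three of the four performance measures named above are simple enough to be sketched directly. The following is a minimal illustration, assuming labels coded as 1 for bad and 0 for good cases and a model that outputs a predicted probability of being bad. The function names, the 0.5 threshold and the toy scores are assumptions made for the example. The H-measure is omitted because its computation is considerably more involved.

```python
def pcc(y_true, p_bad, threshold=0.5):
    """Percentage correctly classified at a fixed probability threshold."""
    preds = [1 if p >= threshold else 0 for p in p_bad]
    hits = sum(1 for yt, yp in zip(y_true, preds) if yt == yp)
    return hits / len(y_true)


def brier_score(y_true, p_bad):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - yt) ** 2 for yt, p in zip(y_true, p_bad)) / len(y_true)


def auc(y_true, p_bad):
    """AUC as the share of (bad, good) pairs in which the bad case
    receives the higher score; ties count as one half."""
    bad_scores = [p for yt, p in zip(y_true, p_bad) if yt == 1]
    good_scores = [p for yt, p in zip(y_true, p_bad) if yt == 0]
    wins = sum(
        1.0 if pb > pg else 0.5 if pb == pg else 0.0
        for pb in bad_scores
        for pg in good_scores
    )
    return wins / (len(bad_scores) * len(good_scores))


# Toy example (hypothetical): three good cases followed by two bad cases.
y_true = [0, 0, 0, 1, 1]
p_bad = [0.1, 0.4, 0.6, 0.35, 0.8]

print(round(pcc(y_true, p_bad), 4))          # 0.6
print(round(brier_score(y_true, p_bad), 4))  # 0.1985
print(round(auc(y_true, p_bad), 4))          # 0.6667
```

Note that the percentage correctly classified depends on the chosen threshold, whereas the AUC depends only on the ranking of the scores and the Brier score on their calibration, which is one reason why several measures are reported side by side.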
The results are not favourable to the SVM. The performance of SVM models with a linear kernel was moderate compared with other individual classifiers. The Gaussian kernel performed slightly better. Logistic regression noticeably outperformed all other linear classifiers and ended up as the second-best individual algorithm, beaten only by artificial neural networks, and by a small margin. According to the paper, the SVM can perform very well on some data sets, but its overall results across data sets are average at best. [42]
The authors remark that the overall performance of the “individual classifiers” has not significantly improved over the last decade. In their opinion, this implies that the limits of this approach have been reached. Ensemble classifiers (homogeneous as well as heterogeneous) performed markedly better at the 99 % significance level, with random forests (RF) being the best of the homogeneous ensembles and HCES-Bag the best of the heterogeneous ensembles (and the best overall). [42]
The authors propose using random forests as the benchmark for future research on new classification algorithms in credit scoring. Although the method ended up second best, it is (unlike HCES-Bag) easily available and implemented in many standard data mining packages. [42]
The authors argue against the current common practice of using logistic regression as the only benchmark against which a newly proposed classifier is compared, since beating logistic regression is no longer a challenge. Outperforming random forests, by contrast, can be regarded as a signal of methodological advancement. [42]
The most recent progress in machine learning in general has been achieved with deep learning, a set of algorithms derived from and extending neural networks, though with a significantly more complex architecture.
The pioneers of this approach accomplished some notable achievements in computer vision, beating other existing algorithms by a large margin and achieving human-competitive performance on major benchmarks.²³ [11]
Although the idea is not new, it did not gain the attention of researchers until recently, when new applications started to be developed and explored, such as speech recognition, natural language processing, facial recognition and others. When a new breakthrough machine learning algorithm is developed, its performance is usually tested on tasks like these first.
Credit scoring is usually not attractive enough to the engineers and technically oriented researchers who originate such advancements. Consequently, research on their application to credit scoring frequently lags several years behind the latest trends. So far I have not been able to find any relevant paper on deep learning for credit scoring, and I believe this might be an interesting topic for further research.
Footnotes
21 Randomly removing good cases from the training data set, so that the final ratio of good to bad cases is 50:50.
22 None of them is a credit scoring data set, however.
23 See Deep Learning Wins 2012 Brain Image Segmentation Contest at http://www.idsia.ch/~juergen/deeplearningwinsbraincontest.html