Application of machine learning in credit risk assessment: a prelude to smart banking
Abstract
A precise credit risk assessment system is vital to the sound functioning of a financial
institution. Accurate estimates of credit risk allow such an institution to continue
its operations in a profitable and transparent way. As the rate of loan defaults is gradually
increasing, bank authorities are finding it more and more difficult to correctly assess loan
requests. Credit risk has therefore become a widely discussed and studied topic
throughout the world. Numerous solutions have been proposed, some more efficient than
others, and studies addressing this difficult problem are still ongoing. With
the implications of this problem in mind, this paper proposes to build
a machine learning model which can precisely assess credit risk and predict possible loan
defaulters for any credit lending institution. Taking into account a borrower’s financial and
social history, this paper proposes a way to accurately determine whether a customer’s loan
request should be accepted or not, which in turn can save the creditor from incurring further
losses. Using data from previous successful borrowers and loan defaulters, a comparative
analysis has been made with our supervised learning model, and the results obtained can be
used to predict the behavior of future borrowers. This model can assist a financial institution
in assessing whether it should accept a loan request or not. Different combinations of feature
selection algorithms and classifiers have been evaluated, and the best model has been selected
based upon metrics such as accuracy, AUC score and F1 score. Recursive feature elimination
with cross-validation (RFECV) and Principal Component Analysis (PCA) have been used to
find the optimum number of features needed to make an accurate prediction. This allows us to
make more efficient use of the limited available resources. The assessment is
performed in a supervised setting, so Support Vector Machines (SVM), Random
Forest, Extreme Gradient Boosting and Logistic Regression have been used as the classifiers.
To ensure that every combination is properly tested, k-fold cross-validation
has been used, which yields a more balanced result. Furthermore, GridSearchCV has been used
to tune the selected hyperparameters of each model in order to obtain the best possible result.
Based upon these results, a comparison is presented in tabular form that identifies the
most and the least accurate models for assessing loan requests.
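
The workflow summarised in this abstract can be illustrated with a minimal sketch, assuming scikit-learn is available. It is not the paper's actual implementation. The data are synthetic (generated with make_classification as a stand-in for real loan records), only Logistic Regression is shown out of the four classifiers, and the fold counts, the hyperparameter grid and the choice of AUC as the selection metric are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a loan dataset (assumption: roughly 20% defaulters).
X, y = make_classification(
    n_samples=300,
    n_features=12,
    n_informative=4,
    n_redundant=2,
    n_clusters_per_class=1,
    weights=[0.8, 0.2],
    random_state=0,
)

# Outer k-fold cross-validation used to score each hyperparameter setting.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Scale the features, select a subset with RFECV, then fit the classifier.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", RFECV(LogisticRegression(max_iter=1000),
                     step=1, cv=3, scoring="roc_auc")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid over the regularisation strength of the classifier.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=cv)
search.fit(X, y)

# Number of features kept by RFECV in the refitted best pipeline.
n_selected = search.best_estimator_.named_steps["select"].n_features_
# Mean cross-validated AUC of the best hyperparameter setting.
best_auc = search.best_score_

print("features kept:", n_selected)
print("best C:", search.best_params_["clf__C"])
print("mean CV AUC:", round(best_auc, 3))
```

Placing the feature selector inside the pipeline means it is refitted on the training portion of every fold, so the held-out fold does not influence which features are kept. The other classifiers named in the abstract, or PCA in place of RFECV, could be swapped into the same pipeline and compared on the same folds.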