Credit Risk Modeling Using Machine Learning Approach (Part 1)

In this post, we demonstrate a machine learning approach to modeling credit risk in the peer-to-peer (P2P) lending domain. This is the first part of a two-part series on credit risk modeling. Here, we discuss the basics of credit risk modeling, the P2P lending platform, the dataset used, and exploratory data analysis.

Application of Machine Learning in Credit Risk Modeling

Credit risk modeling is a technique creditors use to identify the level of credit risk associated with borrowers. This raises the question:

What is credit risk exactly?

Credit risk is the risk that arises when an individual or corporate borrower is unable or fails to pay their debts on time. It means that the creditor who extended the debt may not receive the principal and interest associated with it. This creates an imbalance in cash flow, as principal and interest are the basic returns on which a creditor runs its business. A higher level of credit risk can therefore affect the creditor adversely by increasing collection costs and disrupting the consistency of cash flows.

About P2P Lending Platform

In P2P lending, loans are typically uncollateralized, i.e., issued without physical security, and lenders seek higher returns as compensation for the financial risk they take. In addition, lenders must make decisions under information asymmetry that works in favor of the borrowers. To make rational decisions, lenders want to minimize the risk of default on each lending decision and realize a return that compensates for the risk. An overview of the P2P lending framework is shown in Fig. 1.

Fig. 1 Overview of Peer-to-peer lending framework (Source:

Machine Learning Pipeline

In this project, our machine learning pipeline consists of the following steps: data understanding, data extraction, data pre-processing, data normalization, feature engineering, model building, splitting of the dataset, 10-fold cross-validation, model evaluation and validation, deriving critical features, and model deployment.
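The 10-fold cross-validation step named above can be sketched as follows. This is a minimal, dependency-free illustration rather than the code used in this project: the function name `k_fold_indices`, the seed, and the sample count of 25 are all assumptions made for the example.

```python
import random

def k_fold_indices(n_samples, k=10, seed=42):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Distribute samples so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

# Hypothetical usage: 25 samples split into 10 folds.
folds = list(k_fold_indices(25, k=10))
for fold_no, (train_idx, test_idx) in enumerate(folds, start=1):
    print(fold_no, len(train_idx), len(test_idx))
```

Each sample lands in exactly one test fold, so every record is used for evaluation once and for training nine times. In practice a library routine such as scikit-learn's `KFold` or `StratifiedKFold` would usually replace this hand-written splitter.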

Fig. 2 The architecture of machine learning pipeline

About Dataset

The dataset used in this study was retrieved from the publicly available data of Bondora, a leading European P2P lending platform. The retrieved data is a pool of both defaulted and non-defaulted loans issued between 1st March 2009 and 27th January 2020. It comprises demographic and financial information about borrowers, as well as loan transaction features. The dataset can be accessed from here.

The original dataset consists of 134529 borrowers with 112 features. The distribution of loan status in the dataset is shown in Fig. 3:

Fig. 3 Distribution of loan status in the dataset

For this study, we selected only loans with repaid or late status, since the outcome of current-status loans, which are still operational, is not yet known. After removing invalid records from the dataset, we were left with 71782 records: 40175 late-status loans (treated as default loans) and 31607 loans fully repaid by borrowers. The features in the dataset, along with their data types, are described in Table 1.

Name | Data Type | Description
Age | Numeric | Borrower’s age in years
Gender | Nominal | Borrower’s gender
Country | Nominal | The country in which the borrower resides
Language code | Nominal | Native language of the borrower
Education | Ordinal | The level of education of the borrower
Marital Status | Nominal | Marital status of the borrower
Employment Status | Nominal | Employment status of the borrower
Occupation Area | Nominal | Occupation of the borrower, i.e., the sector in which the borrower works
Home Ownership Type | Nominal | Home ownership status of the borrower
Income Total | Numeric | Borrower’s total monthly income
Applied Amount | Numeric | The loan amount applied for by the borrower
Amount | Numeric | Amount of loan sanctioned
Loan Duration | Numeric | Current duration of the loan in months
Interest | Numeric | Maximum interest rate applied in the loan application
Monthly Payment | Numeric | Estimated amount the borrower has to pay every month
Use of Loan | Nominal | Actual purpose for which the loan was taken by the borrower
Rating | Ordinal | Bondora Rating issued by the Rating model
CreditScoreEsMicroL | Ordinal | A score specifically designed for risk-classifying subprime borrowers
Debt To Income | Numeric | Ratio of the borrower’s monthly gross income that goes toward paying loans
Existing Liabilities | Numeric | Borrower’s number of existing liabilities
Liabilities Total | Numeric | Total monthly liabilities of the borrower
Refinance Liabilities | Numeric | The total amount of liabilities of the borrower after refinancing
No. Of Previous Loans Before Loans | Numeric | Number of previous loans of the borrower
Amount of Previous Loans Before Loans | Numeric | Value of previous loans of the borrower
Previous Repayments before loan | Numeric | How much the borrower had repaid on previous loans prior to this loan
Previous early repayment count before loan | Numeric | Number of times the borrower repaid a loan early
Free Cash | Numeric | Discretionary income of the borrower after monthly liabilities
Bids Portfolio Manager | Numeric | The amount of investment offers made by Portfolio Managers
Bids Api | Numeric | The amount of investment offers made via API
Bids Manual | Numeric | The amount of investment offers made manually
New Credit Customer | Nominal | Whether the customer had prior credit history with Bondora
Verification Type | Nominal | Method used for loan application data verification
Monthly Payment Day | Numeric | The day of the month the loan payments are scheduled for
Interest and Penalty Payments Made | Numeric | Interest and penalty payments made by the borrower so far
Employment Duration Current Employer | Ordinal | Employment time of the borrower with the current employer
Default | Binary | Default status of the borrower. 0: Loan Repaid, 1: Loan Default
Table 1. The description of dataset features
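The Default target in the last row of Table 1 comes from keeping only repaid and late loans and mapping them to 0 and 1. A minimal sketch of that step is given below. The field name `status`, its values ("Repaid", "Late", "Current"), and the sample records are assumptions for illustration and may not match the raw dataset's column names. The final lines compute the class share from the record counts reported earlier.

```python
from collections import Counter

# Hypothetical records; the field name "status" and its values are assumptions.
loans = [
    {"loan_id": 1, "status": "Repaid"},
    {"loan_id": 2, "status": "Late"},
    {"loan_id": 3, "status": "Current"},
    {"loan_id": 4, "status": "Late"},
]

def build_default_label(records):
    """Keep repaid/late loans and attach Default: 0 = repaid, 1 = late."""
    label_for = {"Repaid": 0, "Late": 1}
    labeled = []
    for rec in records:
        if rec["status"] in label_for:  # drops still-running "Current" loans
            labeled.append({**rec, "Default": label_for[rec["status"]]})
    return labeled

labeled = build_default_label(loans)
counts = Counter(rec["Default"] for rec in labeled)

# Class share using the record counts reported in the text.
n_default, n_repaid = 40175, 31607
default_share = n_default / (n_default + n_repaid)
print(counts, round(default_share, 4))
```

With the reported counts, default loans make up roughly 56% of the 71782 retained records, so the two classes are only mildly imbalanced.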

Data Cleaning

In this step, we first removed features that are not relevant for predicting credit risk, such as Loan ID, Loan Number, Listed on UTC, Username, Bidding Started on, etc. We then removed features with more than 40% missing values. After this, we were left with only the 35 features shown in Table 1. Missing values in the remaining features were imputed with the median, because the median was more representative than the mean (it is less sensitive to extreme values).
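The cleaning rules described above (drop irrelevant features, drop features with more than 40% missing values, impute the rest with the median) might look roughly like this. It is a plain-Python sketch under stated assumptions: the function `clean`, the sample rows, and the column names `LoanId` and `SparseFeature` are hypothetical, `None` marks a missing value, and all retained features are assumed to be numeric.

```python
from statistics import median

MISSING_THRESHOLD = 0.40  # features with a larger missing share are dropped

def clean(rows, irrelevant=()):
    """Drop irrelevant and mostly-missing features, then median-impute the rest.

    `rows` is a list of dicts with identical keys; None marks a missing value.
    """
    columns = [c for c in rows[0] if c not in irrelevant]
    kept = {}
    for col in columns:
        values = [row[col] for row in rows]
        missing_share = sum(v is None for v in values) / len(values)
        if missing_share > MISSING_THRESHOLD:
            continue  # too sparse to impute reliably
        fill = median(v for v in values if v is not None)
        kept[col] = [fill if v is None else v for v in values]
    # Rebuild row-wise records from the retained columns.
    return [dict(zip(kept, vals)) for vals in zip(*kept.values())]

# Hypothetical sample: "SparseFeature" is 75% missing and gets dropped.
rows = [
    {"LoanId": "a1", "Age": 35, "IncomeTotal": 1200.0, "SparseFeature": None},
    {"LoanId": "a2", "Age": None, "IncomeTotal": 900.0, "SparseFeature": None},
    {"LoanId": "a3", "Age": 52, "IncomeTotal": None, "SparseFeature": 1.0},
    {"LoanId": "a4", "Age": 41, "IncomeTotal": 5000.0, "SparseFeature": None},
]
cleaned = clean(rows, irrelevant=("LoanId",))
print(cleaned)
```

In the sample, the single large income of 5000.0 would pull a mean-based fill value well above the typical income, while the median stays at a typical value. With pandas, the same idea is usually expressed with `DataFrame.isna().mean()` for the missing share and `fillna` with the column medians.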

Exploratory Data Analysis (EDA)

In this step, we have analyzed different features of the dataset by performing exploratory data analysis.

Fig. 4 Distribution of Default Loans

As Fig. 4 shows, the majority of loans in the dataset are default loans (40175 of 71782, about 56%), which helps in analyzing the patterns of default loans.

Fig. 5 Age and Gender distribution of defaulters

Fig. 5 shows that the average age of defaulters is around 40 years and that male borrowers account for more default loans than female borrowers or those with undefined gender. From Fig. 6, we can observe that borrowers with a secondary level of education have the highest number of defaults, as do borrowers who didn’t specify their employment status.

Fig. 6 Education and Employment status wise distribution of defaulters

From Fig. 7, we can see that borrowers who didn’t specify their marital status have the highest share of default loans, i.e., 33.3%, and that Estonian-, Finnish-, and Spanish-speaking borrowers defaulted most, which is unsurprising, as this peer-to-peer lending platform primarily targets European countries.

Fig. 7 Distribution of marital status and language of defaulters

In Fig. 8, we can observe that loans with no clear purpose defaulted most, while loans for "other" purposes and for home improvement are the second and third most defaulted.

Fig. 8 Distribution of purpose of loan in case of default loans

As Fig. 9 shows, most defaulters have been with their current employer for more than 5 years, while the second-largest group has been employed for up to 1 year. It is surprising that the most experienced professionals defaulted the most.

Fig. 9 Distribution of employment duration of defaulters
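The per-category distributions in Figs. 5 to 9 amount to counting default loans per category value and converting the counts to percentages. A small sketch of that computation follows. The helper `default_share_by` and the sample records are hypothetical and only illustrate the idea, not the plotting code behind the figures.

```python
from collections import Counter

def default_share_by(records, feature):
    """Return {category: percent of all default loans} for one feature."""
    defaults = [r[feature] for r in records if r["Default"] == 1]
    counts = Counter(defaults)
    total = len(defaults)
    # most_common() orders categories from most to least frequent.
    return {cat: round(100 * n / total, 1) for cat, n in counts.most_common()}

# Hypothetical records with the Default label from Table 1.
records = [
    {"Gender": "Male", "Default": 1},
    {"Gender": "Male", "Default": 1},
    {"Gender": "Female", "Default": 1},
    {"Gender": "Female", "Default": 0},
]
shares = default_share_by(records, "Gender")
print(shares)
```

Note that these shares describe the composition of defaulters only. They do not say how likely a borrower in a given category is to default, which would require dividing by the total number of loans in that category.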

In the next part, we will cover data pre-processing, feature engineering, modeling, and performance analysis of different models, and discuss the business objective achieved using the best model.


About the author

Manu Siddhartha

Hi! I am Siddhartha, an aspiring blogger with a passion for sharing my knowledge of machine learning and data science. This blog is dedicated to demonstrating applications of machine learning in different domains with real-time case studies.
