Credit Risk Analysis


Credit risk is the possibility of a loss resulting from a borrower's failure to repay a loan or meet contractual obligations. To assess it, creditors evaluate customers using information such as their credit scores. Credit risk is an inherently imbalanced classification problem because good loans far outnumber risky loans. Our task is to build a machine learning classification model that predicts the credit risk of a client. The analysis uses a credit card credit dataset from LendingClub, a peer-to-peer lending services company. We apply several techniques designed for imbalanced data (RandomOverSampler, SMOTE, ClusterCentroids, SMOTEENN, BalancedRandomForestClassifier, and EasyEnsembleClassifier) to train and evaluate models, and then recommend the best model for credit risk predictions.

Resources

  • Analysis Software: Jupyter Notebook 6.4.12
  • Data Source: LoanStats_2019Q1.csv
  • Programming Languages: Python 3.10

Resampling


For each resampling technique, we trained a logistic regression model on the resampled training data, then calculated the balanced accuracy score with sklearn.metrics, printed the confusion matrix, and generated a classification report with imbalanced-learn.


Naive Random Oversampling

In random oversampling, instances of the minority class are randomly selected and added to the training set until the majority and minority classes are balanced. Oversampling addresses class imbalance by duplicating or mimicking existing data.
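The duplication step can be sketched from scratch in a few lines. This is an illustrative sketch only, not imbalanced-learn's implementation; the function name `random_oversample` and the toy rows are made up for the example.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=1):
    """Duplicate randomly chosen rows of each smaller class until it
    matches the size of the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(rows))  # sampling with replacement
            y_out.append(label)
    return X_out, y_out


# Toy data (made up): 6 low-risk rows and 2 high-risk rows.
X_toy = [[700], [710], [720], [690], [705], [715], [580], [600]]
y_toy = ["low_risk"] * 6 + ["high_risk"] * 2

X_res, y_res = random_oversample(X_toy, y_toy)
print(Counter(y_res))
```

Because the added rows are exact copies, the resampled set contains no new information; it only changes how much weight the minority class carries during training.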

Python Code:


    from imblearn.over_sampling import RandomOverSampler
    ros = RandomOverSampler(random_state=1)
    X_resampled, y_resampled = ros.fit_resample(X_train, y_train)
                

Balanced Accuracy Score:

0.663188044716539

Confusion Matrix:

                        Predicted High Risk    Predicted Low Risk
    Actual High Risk                     76                    25
    Actual Low Risk                    7288                  9816

Classification Report:

                   
                    pre       rec       spe        f1       geo       iba       sup
                
    high_risk       0.01      0.75      0.57      0.02      0.66      0.44       101
    low_risk       1.00      0.57      0.75      0.73      0.66      0.42     17104
                
    avg / total       0.99      0.57      0.75      0.72      0.66      0.42     17205
                

The Naive Random Oversampling model has a balanced accuracy score of 66.3%, which is the average of the recall for the two classes. The precision of the model is 0.01 for high risk and 1.00 (after rounding) for low risk. In other words, when it predicts that a client is high risk, it is correct about 1% of the time, and when it predicts that a client is low risk, it is correct nearly 100% of the time. The recall is 0.75 for high risk and 0.57 for low risk, meaning the model correctly identifies 75% of all high-risk clients and 57% of all low-risk clients.
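These figures follow directly from the confusion matrix. As a worked check, treating high risk as the positive class (the variable names are chosen for this example only):

```python
# Cells of the Naive Random Oversampling confusion matrix.
tp, fn = 76, 25        # actual high risk: predicted high, predicted low
fp, tn = 7288, 9816    # actual low risk:  predicted high, predicted low

recall_high = tp / (tp + fn)        # share of high-risk clients caught
recall_low = tn / (tn + fp)         # share of low-risk clients caught
precision_high = tp / (tp + fp)     # how often a "high risk" call is right
precision_low = tn / (tn + fn)      # how often a "low risk" call is right

# Balanced accuracy is the mean of the per-class recalls.
balanced_accuracy = (recall_high + recall_low) / 2

print(f"balanced accuracy: {balanced_accuracy:.4f}")
print(f"precision (high, low): {precision_high:.2f}, {precision_low:.2f}")
print(f"recall (high, low): {recall_high:.2f}, {recall_low:.2f}")
```

The same arithmetic applies to every confusion matrix in this analysis.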


SMOTE Oversampling

The synthetic minority oversampling technique (SMOTE) is another oversampling approach for dealing with unbalanced datasets. In SMOTE, as in random oversampling, the size of the minority class is increased. The key difference between the two lies in how the new instances are produced. In random oversampling, existing instances from the minority class are randomly selected and duplicated. In SMOTE, by contrast, new instances are interpolated. That is, for an instance from the minority class, a number of its closest neighbors are chosen, and new values are created based on the values of these neighbors.
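The interpolation idea can be sketched as follows. This is a simplified illustration, not imbalanced-learn's SMOTE implementation; the function `smote_sketch`, its parameters, and the toy points are assumptions made for the example.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=1):
    """Create n_new synthetic points, each on the line segment between a
    minority point and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Toy minority class (made up): four points with two features each.
minority_toy = [[1.0, 1.0], [1.5, 1.2], [2.0, 0.8], [1.2, 1.9]]
synthetic = smote_sketch(minority_toy, n_new=4)
for point in synthetic:
    print([round(v, 3) for v in point])
```

Unlike random oversampling, none of the generated points has to coincide with an existing row, but each one stays between two real minority points.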

Python Code:


    from imblearn.over_sampling import SMOTE
    X_resampled, y_resampled = SMOTE(
        random_state=1, sampling_strategy='auto'
    ).fit_resample(X_train, y_train)
                

Balanced Accuracy Score:

0.6621894942066704

Confusion Matrix:

                        Predicted High Risk    Predicted Low Risk
    Actual High Risk                     64                    37
    Actual Low Risk                    5290                 11814

Classification Report:

                  
                    pre       rec       spe        f1       geo       iba       sup
                
    high_risk       0.01      0.63      0.69      0.02      0.66      0.44       101
    low_risk       1.00      0.69      0.63      0.82      0.66      0.44     17104

    avg / total       0.99      0.69      0.63      0.81      0.66      0.44     17205
                

The SMOTE Oversampling model has a balanced accuracy score of 66.2%, which is the average of the recall for the two classes. The precision of the model is 0.01 for high risk and 1.00 (after rounding) for low risk. In other words, when it predicts that a client is high risk, it is correct about 1% of the time, and when it predicts that a client is low risk, it is correct nearly 100% of the time. The recall is 0.63 for high risk and 0.69 for low risk, meaning the model correctly identifies 63% of all high-risk clients and 69% of all low-risk clients.


Cluster Centroids (Undersampling)

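Cluster centroid undersampling is akin to SMOTE in that it generates synthetic data points, but it shrinks the majority class instead of growing the minority class. The algorithm identifies clusters of the majority class, then generates synthetic data points, called centroids, that are representative of the clusters. The centroids replace the original majority rows, so the majority class is undersampled down to the size of the minority class.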
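A minimal from-scratch sketch of the idea, assuming plain k-means (Lloyd's algorithm) with as many clusters as there are minority rows. It is not imbalanced-learn's ClusterCentroids implementation, and `kmeans_centroids` and the toy points are hypothetical.

```python
import math
import random


def kmeans_centroids(points, k, n_iter=20, seed=1):
    """Run a basic k-means loop and return the k cluster centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


# Toy data (made up): 8 majority rows and 2 minority rows.
majority_toy = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.1], [1.1, 1.3],
                [5.0, 5.0], [5.2, 4.9], [4.8, 5.1], [5.1, 5.3]]
minority_toy = [[3.0, 0.5], [3.2, 0.7]]

# Replace the majority rows with one centroid per minority row.
centroids = kmeans_centroids(majority_toy, k=len(minority_toy))
X_res = centroids + minority_toy
y_res = ["low_risk"] * len(centroids) + ["high_risk"] * len(minority_toy)
print(X_res)
print(y_res)
```

Collapsing thousands of majority rows into a handful of centroids discards a lot of detail, which may help explain why this technique scores lowest in this analysis.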

Python Code:


    from imblearn.under_sampling import ClusterCentroids
    cc = ClusterCentroids(random_state=1)
    X_resampled, y_resampled = cc.fit_resample(X_train, y_train)
                

Balanced Accuracy Score:

0.5447339051023905

Confusion Matrix:

                        Predicted High Risk    Predicted Low Risk
    Actual High Risk                     70                    31
    Actual Low Risk                   10324                  6780

Classification Report:

                   
                    pre       rec       spe        f1       geo       iba       sup
                
    high_risk       0.01      0.69      0.40      0.01      0.52      0.28       101
    low_risk       1.00      0.40      0.69      0.57      0.52      0.27     17104

    avg / total       0.99      0.40      0.69      0.56      0.52      0.27     17205
                

The Cluster Centroids model has a balanced accuracy score of 54.5%, which is the average of the recall for the two classes. The precision of the model is 0.01 for high risk and 1.00 (after rounding) for low risk. In other words, when it predicts that a client is high risk, it is correct about 1% of the time, and when it predicts that a client is low risk, it is correct nearly 100% of the time. The recall is 0.69 for high risk and 0.40 for low risk, meaning the model correctly identifies 69% of all high-risk clients and 40% of all low-risk clients.


SMOTEENN (Combination Sampling)

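SMOTEENN is an approach to resampling that combines aspects of both oversampling and undersampling. The minority class is first oversampled with SMOTE, and the result is then cleaned with Edited Nearest Neighbours (ENN), an undersampling step that drops data points whose class disagrees with that of their nearest neighbors.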
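The undersampling half of SMOTEENN is an Edited Nearest Neighbours (ENN) cleaning pass that runs after SMOTE. The sketch below is a simplified version that uses a majority vote among the k nearest neighbors; imbalanced-learn's selection rule and defaults may differ, and the function name and toy data are made up for illustration.

```python
import math
from collections import Counter


def edited_nearest_neighbours(X, y, k=3):
    """Keep only the rows whose label matches the most common label
    among their k nearest neighbors (simplified ENN)."""
    keep_X, keep_y = [], []
    for i, (x, label) in enumerate(zip(X, y)):
        other_rows = (j for j in range(len(X)) if j != i)
        neighbors = sorted(other_rows, key=lambda j: math.dist(x, X[j]))[:k]
        votes = Counter(y[j] for j in neighbors)
        if votes.most_common(1)[0][0] == label:
            keep_X.append(x)
            keep_y.append(label)
    return keep_X, keep_y


# Toy data (made up): one low-risk row sits inside the high-risk cluster.
X_toy = [[1.0], [2.0], [3.0], [4.0], [10.0], [11.0], [12.0], [11.5]]
y_toy = ["low_risk"] * 4 + ["high_risk"] * 3 + ["low_risk"]

X_clean, y_clean = edited_nearest_neighbours(X_toy, y_toy)
print(X_clean)
print(y_clean)
```

In this toy set, the low-risk row at 11.5 is surrounded by high-risk rows, so the cleaning pass removes it and leaves a sharper boundary between the classes.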

Python Code:


    from imblearn.combine import SMOTEENN
    smote_enn = SMOTEENN(random_state=0)
    X_resampled, y_resampled = smote_enn.fit_resample(X_train, y_train)
                

Balanced Accuracy Score:

0.644711676499736

Confusion Matrix:

                        Predicted High Risk    Predicted Low Risk
    Actual High Risk                     73                    28
    Actual Low Risk                    7412                  9692

Classification Report:

                  
                    pre       rec       spe        f1       geo       iba       sup
                
    high_risk       0.01      0.72      0.57      0.02      0.64      0.42       101
    low_risk       1.00      0.57      0.72      0.72      0.64      0.40     17104
                
    avg / total       0.99      0.57      0.72      0.72      0.64      0.40     17205
                

The SMOTEENN model has a balanced accuracy score of 64.5%, which is the average of the recall for the two classes. The precision of the model is 0.01 for high risk and 1.00 (after rounding) for low risk. In other words, when it predicts that a client is high risk, it is correct about 1% of the time, and when it predicts that a client is low risk, it is correct nearly 100% of the time. The recall is 0.72 for high risk and 0.57 for low risk, meaning the model correctly identifies 72% of all high-risk clients and 57% of all low-risk clients.


Ensemble Learners


For each ensemble model, we trained the model on the training data, then calculated the balanced accuracy score with sklearn.metrics, printed the confusion matrix, and generated a classification report with imbalanced-learn.


Balanced Random Forest Classifier

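Instead of building a single, complex decision tree, a random forest algorithm samples the data and builds many smaller, simpler decision trees, then combines their predictions. Each tree is simpler because it is built from a random subset of features. The balanced variant used here additionally undersamples the majority class in the sample drawn for each tree, so that every tree trains on balanced classes.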
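What makes the forest "balanced" is the sample each tree is trained on: the majority class is randomly undersampled so that both classes are equally represented. The sketch below illustrates only that sampling step under simplified assumptions; the exact procedure in imbalanced-learn may differ by version, and `balanced_bootstrap` and the toy labels are hypothetical.

```python
import random
from collections import Counter


def balanced_bootstrap(y, seed):
    """Return row indices for one tree: the same number of rows from
    every class, drawn with replacement."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(y):
        by_class.setdefault(label, []).append(idx)
    n_min = min(len(idxs) for idxs in by_class.values())
    sample = []
    for idxs in by_class.values():
        sample.extend(rng.choices(idxs, k=n_min))
    return sample


# Toy labels (made up): 8 low-risk rows and 2 high-risk rows.
y_toy = ["low_risk"] * 8 + ["high_risk"] * 2

# Each tree in the forest would get its own balanced sample.
for tree_seed in range(3):
    sample = balanced_bootstrap(y_toy, seed=tree_seed)
    print(tree_seed, sorted(sample), Counter(y_toy[i] for i in sample))
```

Because every tree sees a different subset of the majority class, the forest as a whole can still make use of most of the low-risk rows.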

Python Code:


    from imblearn.ensemble import BalancedRandomForestClassifier
    rf_model = BalancedRandomForestClassifier(n_estimators=100, random_state=1)
    rf_model.fit(X_train, y_train)
                

Balanced Accuracy Score:

0.7885466545953005

Confusion Matrix:

                        Predicted High Risk    Predicted Low Risk
    Actual High Risk                     71                    30
    Actual Low Risk                    2153                 14951

Classification Report:

                   
                    pre       rec       spe        f1       geo       iba       sup
                
    high_risk       0.03      0.70      0.87      0.06      0.78      0.60       101
    low_risk       1.00      0.87      0.70      0.93      0.78      0.62     17104
                
    avg / total       0.99      0.87      0.70      0.93      0.78      0.62     17205
                

The Balanced Random Forest Classifier model has a balanced accuracy score of 78.9%, which is the average of the recall for the two classes. The precision of the model is 0.03 for high risk and 1.00 (after rounding) for low risk. In other words, when it predicts that a client is high risk, it is correct about 3% of the time, and when it predicts that a client is low risk, it is correct nearly 100% of the time. The recall is 0.70 for high risk and 0.87 for low risk, meaning the model correctly identifies 70% of all high-risk clients and 87% of all low-risk clients.


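Easy Ensemble AdaBoost Classifier

The EasyEnsembleClassifier used here is an ensemble of AdaBoost learners, each trained on a balanced sample of the training data obtained by randomly undersampling the majority class. In AdaBoost, a model is trained and then evaluated. After evaluating the errors of the first model, another model is trained. This time, however, the model gives extra weight to the errors from the previous model. The purpose of this weighting is to minimize similar errors in subsequent models. Then, the errors from the second model are given extra weight for the third model. This process is repeated until the error rate is minimized or the requested number of models has been trained.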
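The reweighting step can be illustrated with one round of the classic (discrete) AdaBoost update for labels in {-1, +1}. This is a textbook sketch under that assumption; scikit-learn's and imbalanced-learn's implementations may differ in detail, and the toy labels and predictions are made up.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost round: compute the model's vote (alpha) and the
    updated, normalized sample weights. Assumes 0 < weighted error < 1."""
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified samples are scaled up, correct ones scaled down.
    scaled = [w * math.exp(alpha if t != p else -alpha)
              for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(scaled)
    return alpha, [w / total for w in scaled]


# Toy round (made up): five samples, the fourth one is misclassified.
y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, 1, 1]
weights = [1 / len(y_true)] * len(y_true)

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(round(alpha, 3))
print([round(w, 3) for w in new_weights])
```

After the update, the misclassified sample carries as much weight as all the correctly classified samples combined, so the next model is pushed to get it right.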

Python Code:


    from imblearn.ensemble import EasyEnsembleClassifier
    EE_model = EasyEnsembleClassifier(n_estimators=100, random_state=1)
    EE_model.fit(X_train, y_train)
                

Balanced Accuracy Score:

0.9316600714093861

Confusion Matrix:

                        Predicted High Risk    Predicted Low Risk
    Actual High Risk                     93                     8
    Actual Low Risk                     983                 16121

Classification Report:

                   
                    pre       rec       spe        f1       geo       iba       sup
                
    high_risk       0.09      0.92      0.94      0.16      0.93      0.87       101
    low_risk       1.00      0.94      0.92      0.97      0.93      0.87     17104

    avg / total       0.99      0.94      0.92      0.97      0.93      0.87     17205
                

The Easy Ensemble AdaBoost Classifier model has a balanced accuracy score of 93.2%, which is the average of the recall for the two classes. The precision of the model is 0.09 for high risk and 1.00 (after rounding) for low risk. In other words, when it predicts that a client is high risk, it is correct about 9% of the time, and when it predicts that a client is low risk, it is correct nearly 100% of the time. The recall is 0.92 for high risk and 0.94 for low risk, meaning the model correctly identifies 92% of all high-risk clients and 94% of all low-risk clients.


Summary


  • EasyEnsembleClassifier: 93.2% balanced accuracy, 9% high-risk precision, and 92% high-risk recall
  • BalancedRandomForestClassifier: 78.9% balanced accuracy, 3% high-risk precision, and 70% high-risk recall
  • RandomOverSampler: 66.3% balanced accuracy, 1% high-risk precision, and 75% high-risk recall
  • SMOTE: 66.2% balanced accuracy, 1% high-risk precision, and 63% high-risk recall
  • SMOTEENN: 64.5% balanced accuracy, 1% high-risk precision, and 72% high-risk recall
  • ClusterCentroids: 54.5% balanced accuracy, 1% high-risk precision, and 69% high-risk recall
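As a cross-check, the summary figures can be recomputed from the six confusion matrices reported in this analysis and ranked by balanced accuracy. The dictionary layout and the helper `summarize` are choices made for this sketch, with high risk treated as the positive class.

```python
# (tp, fn, fp, tn) for each model, copied from the confusion matrices.
matrices = {
    "RandomOverSampler": (76, 25, 7288, 9816),
    "SMOTE": (64, 37, 5290, 11814),
    "ClusterCentroids": (70, 31, 10324, 6780),
    "SMOTEENN": (73, 28, 7412, 9692),
    "BalancedRandomForestClassifier": (71, 30, 2153, 14951),
    "EasyEnsembleClassifier": (93, 8, 983, 16121),
}


def summarize(tp, fn, fp, tn):
    """Balanced accuracy plus high-risk precision and recall."""
    recall_high = tp / (tp + fn)
    recall_low = tn / (tn + fp)
    return {
        "balanced_accuracy": (recall_high + recall_low) / 2,
        "precision_high": tp / (tp + fp),
        "recall_high": recall_high,
    }


summary = {name: summarize(*cells) for name, cells in matrices.items()}
ranking = sorted(summary,
                 key=lambda name: summary[name]["balanced_accuracy"],
                 reverse=True)

for name in ranking:
    s = summary[name]
    print(f"{name}: {s['balanced_accuracy']:.1%} balanced accuracy, "
          f"{s['precision_high']:.0%} precision, {s['recall_high']:.0%} recall")
```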

Based on the results, the best overall model is the EasyEnsembleClassifier, an ensemble of AdaBoost learners. This model has a 93.2% balanced accuracy score, a precision of 9% for high risk, and a sensitivity (recall) of 92% for high risk. These results were the highest of all the models we tested in our analysis, so this is the model we recommend using.

Although this model is the best of those we tested, its precision is still low. Since the precision for high risk is only 0.09, a client it predicts to be high risk actually is high risk only about 9% of the time: of the 1,076 test clients it flagged as high risk, 983 were low risk. The classifier therefore returns a lot of false positives. Credit card companies may still find this trade-off acceptable, since wrongly rejecting a low-risk applicant is generally less costly than approving a risky loan.