Assign a credit grade to a borrower
One of the fundamental tasks in credit risk management is to assign a credit grade to a borrower. Grades are used to rank customers according to their perceived creditworthiness: better grades mean less risky customers; similar grades mean a similar level of risk. Grades come in two categories: credit ratings and credit scores. Credit ratings are a small number of discrete classes, usually labeled with letters, such as 'AAA', 'BB-', etc. Credit scores are numeric grades, such as '640' or '720'. Credit grades are one of the key elements in regulatory frameworks such as Basel II.
​
Assigning a credit grade involves analyzing information on the borrower. If the borrower is an individual, information of interest could be the individual's income, outstanding debt (mortgage, credit cards), household size, residential status, etc. For corporate borrowers, one may consider certain financial ratios (e.g., sales divided by total assets), industry, etc. Here, we refer to these pieces of information about a borrower as features or predictors.
​
This example shows how we can develop an automated stage of a credit rating process.
​
We assume that historical information is available in the form of a data set where each record contains the features of a borrower and the credit rating that was assigned to it. These may be internal ratings, assigned by a committee that followed policies and procedures already in place.
​
The existing historical data is the starting point, and it is used to train the bagged decision tree that will automate the credit rating. In the vocabulary of statistical learning, this training process falls in the category of supervised learning. The classifier is then used to assign ratings to new customers. In practice, these automated or predicted ratings would most likely be regarded as tentative, until a credit committee of experts reviews them. The type of classifier we use here can also facilitate the revision of these ratings, because it provides a measure of certainty for the predicted ratings, a classification score.
​
The data set contains financial ratios, industry sector, and credit ratings for a list of corporate customers. This is simulated, not real data. The first column is a customer ID. Then we have five columns of financial ratios.
​
- Working capital / Total Assets (WC_TA)
- Retained Earnings / Total Assets (RE_TA)
- Earnings Before Interest and Taxes / Total Assets (EBIT_TA)
- Market Value of Equity / Book Value of Total Debt (MVE_BVTD)
- Sales / Total Assets (S_TA)
​
Next, we have an industry sector label, an integer value ranging from 1 to 12. The last column has the credit rating assigned to the customer. We copy the features into a matrix X, and the corresponding classes, the ratings, into a vector Y.
The features to be stored in the matrix X are the five financial ratios and the industry label. Industry is a categorical variable, nominal in fact, because there is no ordering in the industry sectors. The response variable, the credit ratings, is also categorical, though it is an ordinal variable, because, by definition, ratings imply a ranking of creditworthiness.
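As a sketch of this step, assuming the historical data is stored in a delimited text file with the column names listed above (both the file name CreditRating_Historical.dat and the column names are illustrative assumptions), the predictor matrix and the response vector could be built as follows:

```matlab
% Load the historical data (file and column names are assumptions for illustration).
creditData = readtable('CreditRating_Historical.dat');

% Predictor matrix X: five financial ratios plus the industry sector label.
X = [creditData.WC_TA, creditData.RE_TA, creditData.EBIT_TA, ...
     creditData.MVE_BVTD, creditData.S_TA, creditData.Industry];

% Response vector Y: the credit rating assigned to each customer by the committee.
Y = creditData.Rating;
```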
​
We use the predictors X and the response Y to fit a particular type of classification ensemble called a bagged decision tree. "Bagging," in this context, stands for "bootstrap aggregation." The methodology consists in generating a number of sub-samples, or bootstrap replicas, from the data set. These sub-samples are randomly generated, sampling with replacement from the list of customers in the data set. For each replica, a decision tree is grown. Each decision tree is a trained classifier on its own, and could be used in isolation to classify new customers. The predictions of two trees grown from two different bootstrap replicas may be different, though.
The ensemble aggregates the predictions of all the decision trees that are grown for all the bootstrap replicas. If the majority of the trees predict one particular class for a new customer, it is reasonable to consider that prediction to be more robust than the prediction of any single tree alone. Moreover, if a different class is predicted by a smaller set of trees, that information is useful, too. In fact, the proportion of trees that predict different classes is the basis for the classification scores that are reported by the ensemble when classifying new data.
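To make the aggregation concrete, here is a minimal, hand-rolled sketch of bagging: grow one tree per bootstrap replica and take a majority vote across the trees. The matrix Xnew, holding the features of hypothetical new customers, is a placeholder introduced only for this illustration.

```matlab
% Minimal bagging sketch: one tree per bootstrap replica, majority vote to classify.
nReplicas = 25;                      % number of bootstrap replicas
n = size(X,1);
votes = strings(size(Xnew,1), nReplicas);
for k = 1:nReplicas
    idx = randsample(n, n, true);    % sample customers with replacement
    tree = fitctree(X(idx,:), Y(idx), 'CategoricalPredictors', 6);
    votes(:,k) = string(predict(tree, Xnew));
end
predictedRating = mode(categorical(votes), 2);   % class predicted by most trees
```

In the sketches that follow we rely instead on a bagged ensemble object, TreeBagger, which automates the resampling and the voting and also returns the per-class classification scores mentioned above.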
​
The first step in constructing our classification ensemble is to find a good leaf size for the individual trees; here we try sizes of 1, 5, and 10. We start with a small number of trees, only 25, because we mostly want to compare the initial trend in the classification error for the different leaf sizes.
The errors are comparable for the three leaf-size options. We will therefore work with a leaf size of 10, because it results in leaner trees and more efficient computations.
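One way to run this comparison, assuming the ensembles are grown with TreeBagger (with predictor 6, the industry sector, flagged as categorical), is to grow a small ensemble for each candidate leaf size and track the out-of-bag error as trees are added:

```matlab
% Compare out-of-bag error curves for candidate minimum leaf sizes (sketch).
leafSizes = [1 5 10];
nTrees = 25;
figure; hold on
for ii = 1:numel(leafSizes)
    b = TreeBagger(nTrees, X, Y, 'Method','classification', ...
                   'OOBPrediction','on', 'CategoricalPredictors',6, ...
                   'MinLeafSize',leafSizes(ii));
    plot(oobError(b));   % error as a function of the number of grown trees
end
hold off
xlabel('Number of grown trees'); ylabel('Out-of-bag classification error');
legend('Leaf size 1','Leaf size 5','Leaf size 10');
```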
​
Note that we did not have to split the data into training and test subsets. This is done internally; it is implicit in the sampling procedure that underlies the method. At each bootstrap iteration, the bootstrap replica is the training set, and any customers left out ("out-of-bag") are used as test points to estimate the out-of-bag classification error reported above.
Next, we want to find out whether all the features are important for the accuracy of our classifier. We also try a larger number of trees now and store the classification error for further comparisons below.
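A sketch of this step, again assuming TreeBagger (property and option names follow recent releases of the toolbox): grow a larger ensemble with out-of-bag predictor importance turned on, keep its error curve for the comparison below, and inspect the permuted-predictor importance of each feature. The number of trees, 100, is an illustrative choice.

```matlab
% Larger ensemble with out-of-bag estimates of predictor importance (sketch).
nTrees = 100;
b = TreeBagger(nTrees, X, Y, 'Method','classification', ...
               'OOBPrediction','on', 'OOBPredictorImportance','on', ...
               'CategoricalPredictors',6, 'MinLeafSize',10);

oobErrFullModel = oobError(b);            % stored for the comparison below
bar(b.OOBPermutedPredictorDeltaError)     % importance of each of the six features
xlabel('Feature number'); ylabel('Out-of-bag feature importance');
```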
​
Features 2, 4 and 6 stand out from the rest. Feature 4, market value of equity / book value of total debt, is the most important predictor for this data set. This ratio is closely related to the predictors of creditworthiness in structural models, such as Merton's model, where the value of the firm's equity is compared to its outstanding debt to determine the default probability.
Information on the industry sector, feature 6, is also relatively more important than other variables to assess the creditworthiness of a firm for this data set.
​
Feature 2, retained earnings / total assets, although not as important as feature 4, also stands out from the rest. There is a correlation between retained earnings and the age of a firm (the longer a firm has existed, the more earnings it can accumulate, in general), and in turn the age of a firm is correlated with its creditworthiness (older firms tend to be more likely to survive in tough times).
Let us fit a new classification ensemble using only features 2, 4, and 6. We compare its classification error with that of the previous classifier, which uses all the features.
The accuracy of the classification does not deteriorate significantly when we remove the features with relatively low importance (1, 3, and 5), so we will use the more parsimonious classification ensemble for our predictions. At this point, the classifier could be saved to be loaded in a future session to classify new customers.
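A sketch of the reduced ensemble, keeping only features 2, 4, and 6 (note that the industry sector then becomes column 3 of the reduced predictor matrix), together with the error comparison against the full model and the save step mentioned above; variable and file names are illustrative:

```matlab
% Reduced ensemble using only the three most important features (sketch).
Xreduced = X(:, [2 4 6]);
b246 = TreeBagger(100, Xreduced, Y, 'Method','classification', ...
                  'OOBPrediction','on', 'CategoricalPredictors',3, ...
                  'MinLeafSize',10);

% Compare the out-of-bag error of the reduced model with the full model.
oobErrReducedModel = oobError(b246);
plot([oobErrFullModel oobErrReducedModel])
xlabel('Number of grown trees'); ylabel('Out-of-bag classification error');
legend('All features','Features 2, 4, 6');

% The classifier can be saved and loaded in a future session to rate new customers.
save CreditRatingClassifier b246
```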
Validation, or back-testing, is the process of profiling or assessing the quality of the credit ratings. There are many different measures and tests related to this task (see, for example, Basel Committee on Banking Supervision). Here, we focus on how accurate the predicted ratings are, as compared to the actual ratings. "Predicted ratings" refers to those obtained from the automated classification process, and "actual ratings" to those assigned by a credit committee that combines the predicted ratings and their classification scores with other pieces of information, such as news and the state of the economy, to determine a final rating.
For each specific rating, we can compute a measure of agreement between predicted and actual ratings. We can build a Receiver Operating Characteristic (ROC) curve and check the area under the curve (AUC). Let us build the ROC curve and calculate the AUC for rating 'BBB' in our example.
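A sketch of this computation, using the reduced ensemble from the earlier sketch and, for illustration, scoring the historical data itself: the predict method returns one score column per rating class, and perfcurve turns the 'BBB' scores and the actual ratings into an ROC curve and its AUC.

```matlab
% ROC curve and AUC for the 'BBB' rating (sketch).
[~, scores] = predict(b246, Xreduced);        % one score column per rating class
bbbCol = strcmp(b246.ClassNames, 'BBB');      % column holding the 'BBB' scores

% Y holds the actual (committee-assigned) ratings; 'BBB' is the positive class.
[xROC, yROC, ~, auc] = perfcurve(Y, scores(:, bbbCol), 'BBB');

plot(xROC, yROC)
xlabel('False positive rate (1 - specificity)');
ylabel('True positive rate (sensitivity)');
title(['ROC curve for class ''BBB'', AUC = ' num2str(auc)]);
```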
Here is an explanation of how the ROC curve is built. Recall that, for each customer, the automated classifier returns a classification score for each of the credit ratings; in particular, the 'BBB' score can be interpreted as how likely it is that this particular customer should be rated 'BBB.' In order to build the ROC curve, one needs to vary the classification threshold, that is, the minimum score required to classify a customer as 'BBB.' In other words, if the threshold is t, we only classify customers as 'BBB' if their 'BBB' score is greater than or equal to t. For example, suppose that company XYZ has a 'BBB' score of 0.87. If the actual rating of XYZ is 'BBB,' then XYZ would be correctly classified as 'BBB' for any threshold of up to 0.87. This would be a true positive, and it would increase what is called the sensitivity of the classifier. For any threshold greater than 0.87, this company would not receive a 'BBB' rating, and we would have a false negative. To complete the description, suppose instead that XYZ's actual rating is 'BB.' Then it would be correctly rejected as a 'BBB' for thresholds above 0.87, becoming a true negative and thus increasing the so-called specificity of the classifier. However, for thresholds of up to 0.87, it would become a false positive (it would be classified as 'BBB' when it actually is a 'BB'). The ROC curve is constructed by plotting the proportion of true positives (sensitivity) versus false positives (1 - specificity) as the threshold varies from 0 to 1.
The AUC, as its name indicates, is the area under the ROC curve. The closer the AUC is to 1, the more accurate the classifier (a perfect classifier would have an AUC of 1). In this example, the AUC seems high enough, but it would be up to the committee to decide which level of AUC for the ratings should trigger a recommendation to improve the automated classifier.