
Corporate Bond Default Risk

Seeing the wood through the trees?


Alan Hanna
Queen's University Belfast
January 29, 2016

Abstract
Practitioners are faced with an increasing choice of default prediction models which are often
constructed from a limited number of forecasting variables. We adapt a random forest approach,
a well-known machine learning classification technique capable of accommodating a large number of
market and accounting input variables, to create a vote-based comparative ranking of default risk.
Using a data set drawn from US Corporate bonds spanning the period 2000-2012, a walk-forward
approach is used to evaluate the performance of the model relative to existing benchmarks such as
distance-to-default. We highlight benefits of random forests and their potential to produce rank-
based measures of economic outcomes. Increased out-of-sample predictive ability over three and
twelve-month time horizons is found, suggesting a benefit from retaining a wider range of predictor
variables.

keywords: random forest, default prediction, distance-to-default


JEL classification: G11, G17, G33, C45.

1 Introduction
Bond investors face an overwhelming range of forecasting variables and competing models with which
to assess the risk of default and inform their investment decisions (see, for example, Giesecke et al.
(2011)). Selecting a particular modelling approach necessarily entails rejecting others such that relevant
information may be discarded. The goal of this paper is to develop a modelling framework to rank
bonds according to their likelihood of default, while allowing practitioners the flexibility to incorporate
covariates from alternative models, and to overcome data-related issues.
The framework is based upon the machine learning technique of random forests which allows a
large number of input variables to be retained without imposing further assumptions. In adopting this
approach, we accept that the causes of default are multifaceted and retain a comparatively large number
of variables. Existing models may be overly restrictive in terms of their assumptions or overly narrow
in their choice of input variables. Such models have the potential to overlook external variables that
might otherwise have provided early warning of default. We adapt the traditional role of a random forest
as a classification model by utilising the built-in voting mechanism to create a rank ordering of those
observations deemed more likely to default within a given time horizon. We do not attempt to quantify
the probability of default. In focusing on a rank-based order we deliberately avoid the somewhat artificial
binary classification of predicting precisely those bonds we expect to default. Our model has advantages
in terms of factor selection, model specification and in dealing with data quality issues.
A data set based on US non-investment grade corporate bonds spanning the period 2000-2012 is
used to evaluate the model relative to alternative benchmark models. Performance is evaluated by
inspection of Cumulative Accuracy Profile curves, and a comparison of related accuracy scores and
decile tabulations. We find that the random forest model produces improved predictive accuracy of
default over both three and twelve-month time horizons compared to the alternative benchmarks such
as distance-to-default. On average, random forests achieve higher accuracy ratios and allocate a greater
percentage of defaulting bonds to the lowest decile ranks, with significantly fewer defaulting bonds being
allocated to the top five decile ranks. This improved performance is largely maintained during periods
of market stress.
By randomly selecting observations for model construction and model testing, we consider the sta-
bility of the model and its robustness to noise. We observe that the random forest model is more stable
in the run up to default than the distance-to-default model, as the random forest appears to better
detect the default signal in the presence of noise. We find the approach to be robust to configuration
changes and interpret our results as supporting earlier claims that single composite measures such as
distance-to-default are insufficient predictors in and of themselves.
While attempts have been made to apply machine learning techniques to financial forecasting (see
section 3), we are not aware of any that explicitly adapt the voting mechanism to create a rank measure.
Attempts to apply other such techniques to credit-related predictions lack significant data sets or tend
to focus on the potential to apply the technique rather than the financial theory or its application by a
practitioner. We contribute to this body of work by establishing the ability of random forests to provide
an ordered measure of the relative likelihood of default. The performance is evaluated in line with
existing financial literature over a long time period including a major economic crisis, in a manner that
could be applied by a practitioner facing data deficiencies. By contrast to existing work, we focus on
speculative grade ratings and do not exclude financial firms, yet achieve matching or superior predictive
accuracy.
The remainder of this paper is structured as follows. Sections 2 and 3 review the existing literature
relating to machine learning, credit default prediction, and their intersection. Section 4 describes the
data set and methodology. Results are presented in section 5 and contrasted with previous findings.
Section 6 concludes.

2 Default prediction
We begin by considering the difficulties faced by a practitioner in selecting a modelling approach. It is not
difficult to think of factors that might influence debt valuations and credit outcomes. Notwithstanding
the degree of significance any factor may have, practical issues and limitations naturally arise. Macroe-
conomic and accounting measures are updated infrequently, perhaps only on a monthly or quarterly
lagged basis. Accounting-based measures are by their very nature backward-looking metrics: a quarterly
filing conveys information only about the firm's past. Company accounts possess a degree
of conservatism concerning asset valuations, and so will tend to be downwardly biased (Hillegeist et
al., 2004). Credit ratings are subjective and subject to dampening, with rating agencies reluctant to
downgrade given the potential negative market implications (Hamilton & Cantor, 2004). Market data
may be subject to the influences of noise trading or, through the absence of liquidity, prices may update
infrequently at a significant distance from fair value. More generally, data can be missing, delayed, noisy
or not (directly) observable.
A second difficulty arises in model selection. Approaches to default modelling can broadly be seg-
mented into two theoretical strands: reduced form models and structural models. In the former case,
default is seen as exogenous to the firm, and can be modelled as an intensity process with theoretical
tractability and high predictive power (Duffie & Singleton, 2003). One early approach by Altman (1968)
used multiple discriminant analysis to produce a series of five accounting ratios and their weightings to
comprise the now eponymous Z-Score. A structural model makes the link between the firm's underlying
asset value, its liabilities and a default point which, once reached, will trigger a default event. Equity
can be viewed as a call option on the firm's assets, thus volatility and leverage are key components of
this approach. The Merton (1974) model assumes that the firm's total asset value V follows a geometric
Brownian motion, with the firm's debt structured as a single zero-coupon bond with face value F
maturing at time T. This leads to a derivation of the distance-to-default (DD) measure:

    DD_M = [ln(V/F) + (μ - σ_V²/2)T] / (σ_V √T)          (1)

where μ represents the expected return on the firm's assets, and σ_V the volatility of the asset price
process. The probability of default can then be derived directly from DD_M using the standard normal
cumulative distribution function. Note that the asset value and its volatility are not directly observable.
Jessen and Lando (2015) find that, apart from a few exceptions, the DD measure is extremely robust
when used as a metric to rank firms according to their default risk. Hillegeist et al. (2004) tested the
performance of the Z-Score (and Ohlson's (1980) O-score) against a structural model approach, and
found the accountancy models to have several deficiencies. However, they warn that the potential benefits
come at the cost of relying on the model's simplifying assumptions, many of which do not hold in
practice. Others, such as Campbell et al. (2008), find that DD has little additional explanatory power
for models already incorporating its key components of leverage and volatility.
Bharath and Shumway (2008) show that DD is not a sufficient measure for default prediction and
propose a simpler or 'naive' alternative that is computationally less expensive but equivalent in predictive
power, concluding that the benefit of the DD approach lies in its functional form. First they define a
naive firm volatility σ_N, composed of the equity volatility σ_E, a term-structure component (0.05) and
a default-risk component (0.25 σ_E):

    σ_N = [E/(E+F)] σ_E + [F/(E+F)] (0.05 + 0.25 σ_E)          (2)

where E represents the value of the firm's equity, and σ_E its volatility. A naive distance-to-default is
then calculated as:

    DD_N = [ln((E+F)/F) + (r_{t-1} - σ_N²/2)T] / (σ_N √T)          (3)

where the expected return on the firm's assets is set to the previous year's stock return r_{t-1}. Of course
there is no reason why DD (or its variants) cannot be incorporated in other models such as the one
constructed by Duffie et al. (2007).
A third difficulty arises when faced with the vast number of possible variables to include in any
default model. Research on corporate bankruptcy prediction and credit scores suggest various accounting
variables that could be used as input (Huang et al., 2004; Sun & Li, 2008; Zhao et al., 2009; W.-Y. Lin
et al., 2012). These ratios and metrics could broadly be classified as belonging to: growth, profitability,
leverage and efficiency (Sung et al., 1999). Vassalou and Xing (2004) found default risk to be highly
correlated with firm size and book-to-market value, but these associations were present
only in the lowest two deciles: the smallest firms tended to have the highest book-to-market ratios.
Stratification by firm size within the lowest two deciles revealed that smaller firms tended to have a higher
default risk than larger firms.1 Defining high book-to-market stocks as 'value' and low book-to-market
stocks as 'growth' (Fama & French, 1998), a higher default risk was found in value stocks compared to
growth stocks.
Other work has used debt and equity-related market data measured through lagged (excess) returns
and historic volatilities across several time horizons (for example R. Jarrow (2001)). Bharath and
Shumway (2008) found CDS rates and yield spreads over US treasury rates to be only weakly correlated
1 Survival bias may exacerbate this phenomenon: a firm in financial distress will tend to have a lower stock price - and,
by definition, a lower market capitalization - so firms with declining value (the lowest market capitalizations) would be
expected to have a higher default likelihood.

with default. Credit ratings and credit watches were considered by Hamilton and Cantor (2004), while
Lando and Nielsen (2010) and Azizpour et al. (2011) found default rates of firms to react when a firm
in a related field or industry defaults. Previous studies find a positive correlation between the macro
economy and default (Jonsson & Fridson, 1996; Giesecke et al., 2011). Controlling for the business
cycle is possible (R. A. Jarrow & Turnbull, 2000; Duffie et al., 2007, 2009) through the use of indices
(stock, high yield, investment grade and treasury) and macroeconomic factors such as GDP, inflation,
and unemployment rates.

3 Machine learning
Machine learning is primarily concerned with finding patterns present in large amounts of data via
the implementation of algorithms. Rows of data are termed 'instances', whilst variables are termed
'attributes'. Dependent variables are termed 'targets'. Although a large field in itself, there are two
general strands of research within machine learning: supervised and unsupervised learning. Unsupervised
learning seeks to establish structure in unlabelled data. For supervised learning, there is an attribute
of particular interest to the researcher, and the aim is to identify patterns in the other instances that
influence the target attribute. For a continuous target, supervised learning seeks to solve a regression
problem. When the target is categorical (default or not default for example), supervised learning seeks
to solve a classification problem. An advantage of machine learning techniques is that no assumptions
need be imposed on the structure of the data; relationships (linear or otherwise) are inferred rather than
imposed.
The general schema requires distinct data sets to construct the model (training or learning phase),
refine the model (validation phase) and evaluate the model (testing phase). As Breiman et al. (1984) note,
'classifiers are not constructed whimsically. They are based on past experience. Doctors know, for
example, that elderly heart attack patients with low blood pressure are generally high risk.' Examples of
supervised learning algorithms are decision trees, ensembles, support vector machines, artificial neural
networks, and genetic algorithms (Witten & Frank, 2005).

3.1 Applications within finance


While machine learning techniques have been applied to various financial problems including fraud
detection and stock selection, we briefly survey applications to default prediction. Sun and Li (2008)
used a decision tree classification algorithm to predict corporate bankruptcy in 198 publicly listed Chinese
companies from 2000-2005. Using accounting data they were able to achieve a predictive accuracy of
80-95%. Sung et al. (1999) proposed a bankruptcy prediction model based on decision trees that would
work equally well in normal and crisis economic conditions. Using around 80 public firms listed on the
Korean exchange over distinct periods between 1991 and 1998, they found that a predictive accuracy of
81-83% was achievable over both cycles, versus roughly 30% when a 'normal' model was applied in crisis
conditions, and vice versa.
Much of the machine learning literature focuses on bankruptcy prediction using the artificial neural
network algorithm. Angelini et al. (2008) predict bankruptcy for 76 businesses of an Italian bank
during the period 2001-2003 using typical balance sheet, income statement, profitability and leverage
metrics, reporting average errors as low as 7%. F. Y. Lin and McClean (2001), using a dataset comprising
1,133 UK firms with quarterly accounting data from 1980-1999, concluded that a relatively simple
logistic regression is as competitive in predicting bankruptcy as decision trees and neural networks. Min
and Jeong (2009) apply a binary classification model to their dataset of 2,542 small and medium-sized
Korean manufacturing firms, using audited accounting data to generate 27 financial ratios, which led to
over 70% of bankruptcies being predicted. Kim and Kang (2010) apply various implementations of neural
network algorithms to 1,458 Korean firms from 2002-2005, and are able to predict bankruptcy in over
70% of the cases. Atiya (2001) models a neural network on 911 US firms as a means of predicting
bankruptcy; again typical accounting ratios were utilized to achieve predictive accuracy of up to 89%.
Huang et al. (2004) model corporate default for 120 companies using half-yearly accounting data from
1997-2001 and found that either a simple decision tree or a neural network could correctly predict
bankruptcy over 70% of the time.
W.-Y. Lin et al. (2012) provide an extensive review of the literature on machine learning in corporate
bankruptcy prediction. They observe that most algorithms have been applied, ranging from simple
decision trees (Sun & Li, 2008) to more complex ones such as neural network and genetic algorithms
(Shin & Lee, 2002). More recently Jones et al. (2015) conduct a review of binary classifiers for rating
changes, finding machine learning techniques to outperform conventional techniques.
One of the more significant criticisms of the existing literature is the lack of large-scale sample
sizes and attribute sets, with most samples comprising fewer than 1,000 instances.
Notwithstanding this, supervised learning techniques are well suited to the task of bankruptcy prediction:
by keeping the classification problem as a binary outcome - 'default' or 'not default' - the various
supervised learning algorithms outlined above are specifically calibrated to this task, and can achieve
high predictive accuracy as a result.

3.2 Decision trees


The decision tree - so named due to its tree-like visualization - is perhaps one of the most widely used
supervised learning techniques in machine learning, in part due to its ease of interpretation (Kuhn &
Johnson, 2013). Starting from the root node, each branching operation represents a further (binary)
partitioning of the data predicated on the outcome of a (boolean) test applied to a single attribute of
the remaining instances. Each branching test is chosen carefully to maximise some formalised notion of
gain and is therefore deterministic. Breiman et al. (1984) note that '[t]he fundamental idea is to select
each split of a subset so that the data in each of the descendant subsets are purer than the data in the
parent subset.' The process continues iteratively until defined stopping criteria are met. Terminal leaf
nodes are ultimately classified based on the target attribute by a simple majority. The final tree (see
Figure 1) can be construed as a rule-based algorithm to apply when seeking to classify new instances.

[Figure 1 diagram: a binary tree with root node X0, internal nodes X1, X2, X4 and X5, and leaf nodes X3, X6, X7, X8, X9 and X10.]
Figure 1
Basic schema of a decision tree. The root node represents the entire data set X0. Each branching operation further partitions the
data set predicated on the outcome of a (boolean) test applied to a single attribute, thus X0 = X1 ∪ X2 and X1 ∩ X2 = ∅. The
branching continues until pre-defined criteria are met. Leaf nodes (X3, X6, X7, X8, X9, X10) are classified based on the target
attribute on a simple majority basis.

A major benefit of decision trees is that only the ordering of attribute values matters, not their
magnitude. Magnitude scaling and treatment of outliers using techniques such as winsorization are
unnecessary, as the decision tree simply partitions the data.

One potential issue with decision trees is their tendency to overfit: rules constructed from a training
data set may not generalize well to new and unseen data. Consequently, the misclassification rate on the
training set may be extremely low, yet increase significantly when the tree is evaluated on unseen data.
Therefore, in practice a pruning phase is applied, in which the fully developed tree based on the training
set is reduced (or 'pruned') in order to produce a more generalizable tree that minimizes the
misclassification rate.
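This behaviour can be demonstrated on synthetic data. The sketch below uses scikit-learn (an assumption; the paper does not name its software), with cost-complexity pruning via `ccp_alpha` standing in for the pruning phase described above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data standing in for bond-month observations.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A fully grown tree fits the training data (near-)perfectly...
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# ...while a pruned tree trades training accuracy for a simpler,
# more generalizable set of rules.
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_tr, y_tr)
```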

3.3 Random forests


Random forests generate a large number of predictor models (decision trees) which, while individually
they may be poor predictors, combine to create an aggregate model with high predictive accuracy,
low bias and reduced variance (Breiman, 2001). The use of multiple classifiers in such an ensemble
compensates for the high variance that arises when only one classifier is used (Ho, 1998).
One of the innovations of the random forest is the use of bootstrap aggregation (or bagging) of the
data set. Each tree within the forest is constructed using a random sample from the training set with
replacement. Some of the original instances will be present more than once, whilst others will
not feature at all (Witten & Frank, 2005). A second source of randomness for each tree is the use of
attribute subsetting. A third is the random selection of the attribute considered at each branching
decision. While individual trees may generate different classification predictions for new instances, the
overall forest prediction is determined by a simple majority vote or a defined voting threshold. Much in
the same way that a committee of experts is more likely than an individual to arrive at a correct decision,
the ensemble of trees predicts the correct classification better than an individual decision tree: '[w]hat one
loses, with the trees, is a simple and interpretable structure. What one gains is increased accuracy'
(Breiman, 1996, p. 137).
An ensemble of decision trees can comprise hundreds or thousands of individual decision trees. The
random elements of the procedure make the algorithm robust to noise and outliers. Bias is reduced by
allowing the trees to grow to maximal depth and correlation is reduced by the randomness present in
the procedure, so the ensemble tends not to overfit the data.
If we accept the Merton approach and its stochastic framework, then we must concede that even
those bonds deemed most likely to default will, on occasion, survive. Any derived prediction model
will therefore be imperfect. Although a classification system can neatly label bonds as residing in the
default class or not, this crispness is artificial: there is an element of partiality. In contrast to classical
set theory, fuzzy sets (Zadeh, 1965) permit membership of a class to be a matter of degree by defining
a grade-of-membership (GoM) function. We adopt the percentage of trees within the random
forest classifying a bond as a default as our GoM function. Bonds with a higher GoM are considered a
higher default risk. This allows bonds to be ranked and success to be measured based on the percentage
of defaulting bonds allocated to different deciles as per the approach of Bharath and Shumway (2008)
and Duffie et al. (2009).
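A minimal sketch of this GoM ranking on synthetic data: in scikit-learn (assumed here), `predict_proba` averages the per-tree class probabilities, which for fully grown trees with pure leaves coincides with the fraction of trees voting for each class.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for bond-month observations and default outcomes.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 10))
y_train = (X_train[:, 0] + rng.normal(scale=0.5, size=1000) > 1).astype(int)
X_test = rng.normal(size=(200, 10))

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)

# Fraction of trees voting "default" for each test observation: the GoM.
gom = rf.predict_proba(X_test)[:, 1]

# Rank order: highest GoM first, i.e. deemed most likely to default.
ranking = np.argsort(-gom)
```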

3.4 Default model evaluation


Simple binary classification models allow standard performance measures such as accuracy and specificity
to be derived from counts of true and false positive predictions. Note that given the infrequency of bond
defaults, artificially high accuracy scores would be achieved by a classifier that designated all observations
as non-defaulters. Moreover, most models permit a degree of configuration that allows the prevalence
of predicting a particular outcome to be influenced (for example, the voting threshold in the case of the
random forest). By varying the threshold, it is possible to construct a receiver operating characteristic
(ROC) curve which plots the true positive rate against the false positive rate, measures which are
independent of class priors (Fawcett, 2006).
A diagonal line from the bottom left corner to the top right corner shows the performance of a random
classification model: a classification model that resides below this will therefore not be performing better
than mere chance. A good classification model should reside in the upper left hand corner of the graph;
the better the classification model, the more the curve follows the left and top sides of the square. By
itself the ROC curve does not provide a statistic of classification performance, rather it is the calculation
of the area under the curve (AUC) which permits a readily interpretable performance measure related
to the Mann-Whitney test. Witten and Frank (2005) observe that the AUC has 'a nice interpretation
as the probability that the classifier ranks a randomly chosen positive instance above a randomly chosen
negative one'. An alternative approach is the Cumulative Accuracy Profile (CAP) curve with its
associated accuracy ratio, equivalent to twice the area under the ROC curve lying above the diagonal.
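The relationship between the two summary statistics is thus simply accuracy ratio = 2 × (AUC − 0.5). A sketch on toy data, using scikit-learn's `roc_auc_score` (an assumed dependency):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# 1 = defaulted, 0 = survived; scores are model outputs (higher = riskier).
y_true = np.array([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])
scores = np.array([.10, .20, .15, .90, .30, .70, .05, .40, .80, .25])

auc = roc_auc_score(y_true, scores)   # area under the ROC curve
ar = 2 * (auc - 0.5)                  # accuracy ratio from the CAP curve

# Here every defaulter outranks every survivor, so auc == 1.0 and ar == 1.0.
```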

4 Data
Given that the rate of default in speculative grade bonds is over 18 times higher than that
experienced in investment grade bonds (Ou et al., 2011), we focus on this higher-risk subset of the bond
universe. All US corporate bonds issued by public firms that had ever been rated speculative grade
(Ba1 and below) by the rating agency Moody's Investors Service were selected from the period January
2000 - November 2011. Focusing on debt issued by public firms permits quarterly accounting data and
associated equity pricing data to be used in the model; our choice of start date is due to Bloomberg's
historic records of Moody's credit ratings commencing in 1999. Unlike other credit risk models we do
not exclude bonds issued by financial or insurance firms: our aim is to build a general model with
universal application.
All data was taken from Bloomberg's professional subscription service, providing a bond universe of
4,213 bonds issued by 718 public companies, yielding 196,724 bond-month observations. Our aim is to
rank bonds based on their likelihood of default over a forward-looking three-month or twelve-month
horizon. The proportion of bond-month observations that subsequently default within these horizons is
outlined in Table 1.
Table 1
Number of bond observation months per year from which training and testing data sets can be drawn,
and the percentage of those instances which subsequently default within a three-month or twelve-month
time horizon.

Year    Instances    3m Default (%)    12m Default (%)
2000       7,048          0.8               4.2
2001       7,405          1.9               6.2
2002       7,686          1.1               3.8
2003       8,895          0.3               1.0
2004      11,431          0.2               0.5
2005      14,415          0.2               0.7
2006      17,608          0.1               0.2
2007      22,465          0.1              10.8
2008      24,135         10.0              26.4
2009      19,932          0.4               0.8
2010      25,565          0.1               0.4
2011      30,139          0.3               1.8

Data from 1999 and 2012 was used only to determine backward-looking differences and forward-looking
outcomes. The number of defaulting observations is also influenced by companies with multiple
bonds in issuance.
Market data for each bond comprised metrics such as yield and CDS rates (where available). Yield
spreads over US treasury rates were also included although Bharath and Shumway (2008) found these
only weakly correlated. Differing bond characteristics, such as (time to) maturity and (change in) credit
ratings were also captured. Credit rating variables were converted to a numeric scale (Huang et al.,
2004). Data for the associated equity (belonging to the ultimate parent company issuing the bond/debt)
comprised the stock's lagged returns and historic volatilities across several time horizons (see R. Jarrow
(2001)). Historic excess returns were also included (Bharath & Shumway, 2008).
Additionally, Altman's Z-score (as well as its constituent accounting ratios) and a DD measure based
on equation 3 were added where the availability of data permitted. In constructing the face value of
the debt (F), we follow the approach of Vassalou and Xing (2004) and Bharath and Shumway (2008),
using current liabilities plus half of long-term liabilities. Note that the distance-to-default measure is
not transformed into a default probability, as the main interest is the relative position of the firms.
The target attributes, recording whether the bond experienced default within three (3m) or twelve
(12m) months, were constructed as binary attributes coded 1 for default and 0 otherwise. This
determination was based on default records provided by Bloomberg, defined as comprising any of: Chapter
7, Chapter 10 or Chapter 11 bankruptcy proceedings, a grace period, a missed coupon payment or a missed
principal payment (see Duffie et al. (2007) for similar treatment). Using this definition there were 10,913
bond-months coded 1 at the 12m default horizon, and 2,992 bond-months at the 3m default horizon.
Other forms of exit were due to acquisition or de-listing, both of which had a specific date provided by
Bloomberg. Should a firm be acquired by a private company, its debt would be deemed private from
this moment on, and hence exit the sample. Where a firm merged or was acquired by another public
firm, its debt was combined into the new entity and remained in the sample. Name changes, entries into
and exits out of the sample were strictly controlled.

4.1 Walk forward approach


To aid reliability the model was tested using a walk-forward approach (see Stein (2007)), so that testing
was performed with out-of-time data. By construction, those bonds which default will necessarily be
out-of-sample. At each stage, a cutoff date was chosen and a data set comprising observations from the
prior 36 months was formed. First, bond-month observations which were known to default within the
given time frame were identified. A single bond-month observation was then selected randomly for each
of the associated companies, thus avoiding selection of sibling bonds. For the remaining bond-month
observations linked to companies without defaults, a single bond-month observation was also selected
randomly. Note that the year indicated represents the start of the testing period, thus 2004 uses training
data comprised of bond-month observations from the calendar years 2000-2002, and makes predictions
during the calendar year 2004. Bond-month observations from 2003 cannot be used in the training set
as their required classification would require knowledge from the future. This is illustrated in Figure 2.
The test data was drawn randomly from all bond-month observations over a full year, but separated
from the training set by a full year. Cutoff dates were chosen annually to coincide with year end. The
methodology adopted replicates how the model could be used in practice; only data that would have
been observable at the point of prediction was included in the random forest construction. Note that
the test data is not entirely out of sample: bonds from the training set can appear in the test set (at
future dates); however, given the separating year, no defaulting bonds from the training data set can
appear.
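The windowing just described can be sketched as follows, assuming a hypothetical pandas DataFrame `obs` with an `obs_date` column of bond-month observations. For test year 2008, for example, this yields a January 2004 to December 2006 training window with 2007 as the buffer:

```python
import pandas as pd

def walk_forward_split(obs, test_year):
    """36-month training window, one-year buffer, one-year test window."""
    train_end = pd.Timestamp(year=test_year - 1, month=1, day=1)
    train_start = train_end - pd.DateOffset(months=36)
    train = obs[(obs["obs_date"] >= train_start) & (obs["obs_date"] < train_end)]
    test = obs[obs["obs_date"].dt.year == test_year]
    return train, test
```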
Missing values within the training data set are imputed using the median value of non-missing obser-
vations from the training data set prior to construction of the random forest. The same median values are
later used to impute missing values for the test data set. Observations with missing attributes should

[Figure 2 diagram: rolling training/test windows for test years 2004-2011, plotted against calendar years 2000-2012.]

Figure 2
Rolling windows. Here the training data set spans a three-year period (grey fill, striped) and the test
data set spans a single calendar year (dotted). The intermediate buffer (red) is omitted to exclude
bond observations with outcomes that would be unavailable to the practitioner. For example, results
presented for 2008 use a training set drawn from January 2004 to December 2006, and a test set drawn
from January to December 2008. In this example, data from 2007 is used only for the purpose of
determining training set outcomes and test set lags.

therefore result in non-exceptional treatment. To reduce the number of such substitutions,
accounting data published within the previous six months is preferred.
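A sketch of this imputation step, with training-set medians reused for the test set (pandas assumed; the values are illustrative):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for training and test attributes with gaps.
train = pd.DataFrame({"leverage":   [0.20, np.nan, 0.60, 0.40],
                      "volatility": [0.30, 0.50, np.nan, 0.40]})
test = pd.DataFrame({"leverage": [np.nan], "volatility": [0.45]})

# Medians are computed on the training data only...
medians = train.median()

# ...then applied to both sets, so the test set never informs the fill values.
train_filled = train.fillna(medians)
test_filled = test.fillna(medians)
```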
After building the random forest using the training data, the resultant ensemble classifier was used
to predict the default outcomes for observations from the test data. A reasonable default configuration
of the model was selected, with no further attempts to optimise model performance. The percentage
of votes cast for each observation, rather than the predicted outcome, is the output metric
of interest. This provides a rank ordering of observations, with those receiving the highest
number of votes obtaining the highest grade of membership and therefore being considered the most likely to
default. Ranked observations can be decomposed into deciles to calculate the percentage of those bonds
which actually default within each decile. To keep our results consistent with earlier work we label the
deciles such that decile 1 contains those observations with the most votes. Clearly a good classifier would
hope to have a large percentage of defaulting bonds occurring in this first decile.
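The decile decomposition can be sketched as follows (an illustrative helper; `votes` and `defaulted` are hypothetical arrays of per-observation default votes and realised outcomes):

```python
import numpy as np

def decile_breakdown(votes, defaulted):
    """Decile 1 holds the observations with the most default votes.
    Returns, for each decile, the percentage of all defaulters it contains."""
    order = np.argsort(-np.asarray(votes))     # most votes first
    outcomes = np.asarray(defaulted)[order]
    deciles = np.array_split(outcomes, 10)
    total = outcomes.sum()
    return [100.0 * d.sum() / total for d in deciles]
```

A good classifier concentrates the mass of this breakdown in the first entry.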
Given the random nature of those observations selected for the training set, the results of the random
forest will vary based on the chosen seed. Thus the process must be repeated to calculate an average
outcome and determine how sensitive the ensemble classifier is to its training set. Equally, because of the
relatively low numbers of defaulters, individual observations have the potential to significantly influence
model results. Resampling diminishes the potential impact this can have on a model's performance
and allows the variability to be understood (Stein, 2007). One thousand iterations of the process were
performed.
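The seed-resampling loop can be sketched with scikit-learn's RandomForestClassifier (an illustrative sketch of the procedure, not the paper's exact configuration; the fraction of trees voting for default, obtained via `predict_proba`, serves as the vote score):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def resampled_auc(X_train, y_train, X_test, y_test, n_iter=25):
    """Rebuild the forest under different seeds and report the mean and
    standard deviation of the out-of-sample AUC."""
    aucs = []
    for seed in range(n_iter):
        rf = RandomForestClassifier(n_estimators=200, random_state=seed)
        rf.fit(X_train, y_train)
        votes = rf.predict_proba(X_test)[:, 1]  # fraction of trees voting default
        aucs.append(roc_auc_score(y_test, votes))
    return float(np.mean(aucs)), float(np.std(aucs))
```

In the paper the loop runs for 1,000 iterations; fewer are used here for brevity.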

5 Out-of-sample results
We begin by reviewing the out-of-sample predictive ability of random forests across all years, contrasting them with those of other single-variable models. Average CAP curve representations for each
model, displaying the cumulative fraction of defaulters identified by percentile, are shown in Figure 3.
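A CAP curve of this kind can be computed directly from model scores (a minimal sketch; a perfect classifier reaches 1.0 after examining only the defaulters, while a random classifier traces the diagonal):

```python
import numpy as np

def cap_curve(scores, defaults):
    """Cumulative fraction of defaulters captured as observations are
    examined from most to least suspicious (highest score first)."""
    order = np.argsort(-np.asarray(scores))
    hits = np.asarray(defaults)[order]
    return np.cumsum(hits) / hits.sum()
```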

Figure 3
Average CAP curves (12m). Results are presented for the random forest (RF), the random forest filtered on those observations
where a naive DD measure was available (RF DD), naive distance-to-default (DD), company prior year excess return (EQTY) and
Altman Z-score (Z-score). The Random series represents the expected behaviour of a random classifier.

We see immediately that the random forest outperforms the other models. For all models, the CAP curve rises rapidly as they quickly identify a
cohort of bonds likely to default. This is to be expected: at least some firms on the path to future default
are likely to exhibit the tell-tale signs such as poor returns, high leverage, or increased volatility, that
will be detected by the models. However, those firms which do not exhibit the obvious signs of distress
sought by our benchmark models will avoid early detection, so the rate of ascent of the CAP curve
diminishes.
Decile and AUC scores for a 12m default horizon are presented in Table 2. The results show that
DD has been able to rank 65.4% of defaulting firms into the lowest decile, rising to 74.5% for the lowest
quintile, but also allocates some 17.6% of defaulting firms to the highest five deciles. Interestingly,
the simple prior year excess return achieves similar results. Z-score proves an insightful but inferior
measure. These results are comparable with the findings of Bharath and Shumway (2008) for quarter
ahead defaults and support their inclusion as input variables for the random forest approach. Immaterial
differences are obtained between the RF and RF DD models.
The random forest model can be seen to outperform the benchmark models through a superior
AUC average, much lower AUC standard deviation and higher lowest decile (80.3%) and quintile scores
(90.5%). Using a Wilcoxon signed rank test, the null hypothesis that the median difference between the
random forest and each other model is zero is rejected at the 0.1% level for the distance-to-default, equity and
Z-score samples. This, alongside the higher AUC mean and lower standard deviation, suggests
the random forest model is superior to the other models tested in predicting default. Hamilton and
Cantor (2004) report a 12m ahead baseline accuracy ratio of 65% using Moody's ratings, which they
enhance to achieve a 74% rate. On a similar basis Duffie et al. (2007) report an accuracy ratio of 88% for
industrial firms. Converting the average random forest AUC score to an equivalent accuracy ratio (AR = 2 x AUC - 1) yields 2 x 0.947 - 1 = 89.4%. Note that in focusing only on speculative grade credits, one might expect poorer results:

              RF      RF DD   DD      EQTY    Z-score
Decile 1      80.3    80.7    65.4    64.7    55.8
Decile 2      10.2    11.2     9.1    11.3    14.1
Decile 3       5.5     5.2     4.9     5.9     9.9
Decile 4       2.3     1.7     1.2     4.6     3.5
Decile 5       1.0     0.7     1.8     3.4     2.1
Deciles 6-10   0.7     0.5    17.6    10.0    14.6
AUC mean      94.7    94.9    83.4    85.6    78.2
AUC std dev    4.6     5.0    19.3    12.0    23.6
p-value                0.21   <0.001  <0.001  <0.001

Table 2
Out-of-sample 12 month prediction performance. Results are presented for the random forest (RF), the random forest filtered
on those observations where a naive DD measure was available (RF DD), naive distance-to-default (DD), company prior year
excess return (EQTY) and Altman Z-score (Z-score). The percentage of bond observation months that default within a 12 month
horizon are shown by decile, along with the mean and standard deviation AUC score. Finally, p-values from a Wilcoxon signed rank test for
group differences against the RF model are reported. Results are averaged over 1,000 iterations and 8 rolling windows.

intuitively including investment grades should push bonds likely to default into the lower deciles. Indeed
Cantor and Mann (2003) report higher accuracy ratios for mixed pools of investment and speculative-grade
credits.
Similar results for a 3m default horizon are presented in Table 3. The performance of all models
improves as the predictive time horizon is shortened. The RF model now allocates 92.5% of defaulting
firms into the lowest quintile compared with 81% for DD, 83.7% for excess return and 77.6% for Z-Score.
Once again a Wilcoxon signed rank test was utilized; the null hypothesis that the median difference
between the random forest and each other model is zero is rejected at the 0.1% level across all models.
Interestingly, the null is now also rejected between the random forest and filtered random forest, whereas at the
12m interval we were not able to reject it. The random forest appears to have superior predictive
accuracy to the other models tested.
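The pairwise comparison can be reproduced with SciPy's implementation of the Wilcoxon signed rank test (an illustrative sketch assuming matched per-iteration AUC samples for the two models):

```python
from scipy.stats import wilcoxon

def compare_auc(auc_rf, auc_other, alpha=0.001):
    """Wilcoxon signed rank test: H0 is that the median paired difference
    between the RF and benchmark AUC scores is zero."""
    stat, p = wilcoxon(auc_rf, auc_other)
    return p, p < alpha
```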

              RF      RF DD   DD      EQTY    Z-score
Decile 1      87.5    91.2    73.6    78.3    70.1
Decile 2       5.0     5.3     7.4     5.4     7.5
Decile 3       3.9     2.4     1.4     4.0     5.7
Decile 4       1.6     0.5     0.5     3.3     2.2
Decile 5       0.8     0.2     2.5     4.6     0.6
Deciles 6-10   1.1     0.4    14.5     4.3    13.9
AUC mean      95.8    97.2    85.8    91.0    80.6
AUC std dev    7.2     5.8    22.7    12.0    29.9
p-value       <0.001  <0.001  <0.001  <0.001

Table 3
Out-of-sample 3 month prediction performance. Results are presented for the random forest (RF), the random forest filtered on
those observations where a naive DD measure was available (RF DD), naive distance-to-default (DD), company prior year excess
return (EQTY) and Altman Z-score (Z-score). The percentage of bond observation months that default within a 3 month horizon
are shown by decile, along with the mean and standard deviation AUC score. Finally, p-values from a Wilcoxon signed rank test for group
differences against the RF model are reported. Results are averaged over 1,000 iterations and 8 rolling windows.

For a 3m horizon, Bharath and Shumway (2008) report out of sample results with 77% of defaulting
bonds allocated to their lowest decile. This contrasts with a 69% score reported for their Moody's KMV
baseline model. Duffie et al. (2009) report results of 94% based on this measure. Considering only those
bonds with DD measures, for which such a comparison could be made, the random forest approach on
average allocates 91.2% of defaulting bonds into this decile. This reduces to 87.5% when all bonds are
included.
Given the relatively short window for the training set (relative to the length of the business cycle), it
is perhaps unsurprising that (unreported) attempts to use macroeconomic data were largely unfruitful.

While macroeconomic data will have different impacts across industries, its broad effect should be felt
by all, diminishing its impact on relative comparisons between bonds.

5.1 Results by year


We now consider the prediction results broken down at an annual level to examine model performance
under changing market conditions, and variance under resampling. CAP curve representations for the
RF model in Figure 4 show that while the model performs significantly better in some years than in
others, on average most defaulters remain in the top quintile, with few being allocated to the highest
five deciles.

Figure 4
Average RF CAP curves by year for 12m predictions. For example, the year labelled 2005 refers to predictions made in 2005 for
outcomes in 2006.

The results presented in Table 4 show the mean and standard deviation AUC scores for 1,000 iterations of predictions over a 12m horizon. This table also shows the percentage of actual
defaulters contained within each decile. Particularly high scores (with low standard deviations) are
achieved for the years 2004-7 and 2010, with over 89% of defaulters captured in the top quintile in
these years. Unsurprisingly, the performance of the model deteriorates through the turmoil of the global
financial crisis. In every year, at least 61% of defaulters were ranked within the lowest decile, at least 78% within the
lowest two deciles, and less than 2% in the highest five deciles.
Examining the random forest voting mechanism in more detail, during normal years, relatively
small numbers of bonds attract high numbers of votes. While this includes some bonds which do not
ultimately default within the 12m horizon, it broadly captures all those that do, leading to a high score
for the top decile. During times of market stress, most bonds which ultimately default within the 12m
horizon receive high numbers of votes, but larger numbers of other bonds attract equal suspicion. This
dilutes the success rate achieved within the lowest decile. Moreover, a number of other bonds which
ultimately default attract insufficient votes and therefore appear in higher deciles. Both phenomena
reduce average AUC scores. When results are filtered to only include bond-month observations for which
a naive DD measure could be derived from available data attributes, performance is broadly similar to
the unfiltered set.

2004 2005 2006 2007 2008 2009 2010 2011
Decile 1 80.38 83.42 99.90 86.50 61.60 75.48 93.09 61.97
Decile 2 11.21 5.72 0.10 10.39 16.88 13.22 3.64 20.56
Decile 3 6.56 3.03 0.00 1.99 13.95 5.86 1.79 10.50
Decile 4 1.50 3.72 0.00 0.77 4.63 2.69 0.76 4.39
Decile 5 0.32 2.55 0.00 0.12 1.47 1.50 0.44 1.55
Deciles 6-10 0.03 1.57 0.00 0.23 1.47 1.24 0.27 1.03
AUC Mean 95.40 94.29 98.96 96.57 91.01 93.19 97.76 90.06
AUC Std Dev 2.78 3.48 0.71 2.61 2.59 3.72 3.06 6.47

Table 4
Random forest out-of-sample 12 month prediction performance by year. The percentage of bond observation months that default
within a 12 month horizon are shown by decile, along with the mean and standard deviation of AUC scores. Results are averaged
over 1,000 iterations.

2004 2005 2006 2007 2008 2009 2010 2011


Decile 1 52.03 77.90 23.45 92.78 69.37 82.99 76.63 48.08
Decile 2 24.15 2.46 0.00 3.87 12.99 12.36 8.53 8.17
Decile 3 16.08 0.02 0.00 1.23 7.15 3.45 8.82 2.62
Decile 4 0.53 3.35 0.00 2.12 2.90 0.08 0.10 0.72
Decile 5 0.03 1.10 0.00 0.00 5.51 0.00 0.03 7.37
Deciles 6-10 7.17 15.17 76.55 0.00 2.08 1.12 5.88 33.04
AUC Mean 87.14 87.83 44.32 96.89 90.91 94.84 92.13 73.08
AUC Std Dev 9.39 7.21 18.97 2.35 1.87 2.69 8.37 16.91
p-value <0.001 <0.001 <0.001 <0.001 0.028 <0.001 <0.001 <0.001

Table 5
DD out-of-sample 12 month prediction performance by year filtered on bond observations with a DD measure. The percentage of
bond observation months that default within a 12 month horizon are shown by decile, along with the mean and standard deviation
of AUC scores. Finally, p-values from a Wilcoxon signed rank test for group differences against the RF model are reported. Results are averaged over 1,000
iterations.

Table 5 presents equivalent results for the DD model. While in some years the performance is
similar to the random forest, in broad terms the results achieved by the random forest approach are
superior: higher AUC scores, lower variance, more defaulting bonds assigned to lower deciles and
fewer defaulting bonds assigned to higher deciles. Year on year, the performance of DD appears
less stable. The poor performance in 2006 is in part due to the relatively small number of defaulting
bond-observation months with DD values. With the null hypothesis of the Wilcoxon signed rank test
being that the median difference between the random forest and distance-to-default is zero within each
individual sample year, we are able to reject the null at the 0.1% level in all but one year. In 2008 we
are unable to reject the null, and so must concede that during this period of relative market turmoil
the distance-to-default model has similar performance to the random forest. However, this changes at
the 3 month time horizon, where we are able to reject the null hypothesis at the 0.1% level across all
time periods. It could therefore be argued that the predictive advantage of the random forest over distance-to-default is greater at the
shorter 3m time interval than at the 12m interval.
The drop in the random forest's predictive accuracy in 2011 bears closer scrutiny.
Recall that these scores are based on forward prediction from 2011 into 2012. Since relatively small
numbers of bonds from our sample experience default during this time, each model's performance becomes particularly sensitive to misclassified bonds. Two companies are indicative of an issue contributing
to this outcome: American Airlines Group Inc (AAMRQ) and MF Global Holdings Ltd (MFGLQ). In
both cases, these bonds attract low numbers of votes until just prior to default, when the peril of the
company is readily apparent.
In the case of American Airlines, the company sought bankruptcy as a means of protection. While
clearly in some distress, it appears that the company was not in immediate danger and therefore opted
for a strategic default (Peterson & Daily, 2011; Surowiecki, 2011). MF Global Holdings was fined
by the United States Commodity Futures Trading Commission and stood accused of using customer

funds to cover liquidity shortfalls while its auditors, PricewaterhouseCoopers, faced and settled claims
of fraudulently deceiving investors (Miedema & Hall, 2014; Stempel, 2015). Such situations will prove
difficult to predict for any quantitative model.
The drop in model performance for the period 2008, particularly for those bond observations inap-
propriately assigned to the higher deciles, is in part due to model oversight; for example due to bonds
issued by Washington Mutual (WAMUQ) and Nortel Networks Corporation (NRTLQ). However, here
the bonds tend to attract significant numbers of votes, yet are simply overlooked due to other bonds
attracting considerably higher numbers.
We now review the results of prediction over a 3m time horizon. Results presented in Table 6 show
the mean and standard deviation AUC scores for 1,000 iterations. On average, the AUC scores exceed
89% in all years. With the notable exceptions of 2005 and 2011, AUC scores exceed 95%, with over 80%
of defaulters ranked in the lowest decile alone. Defaulters appear in the highest five deciles in
2005, 2009 and 2011 only. Averaged across all years, 87.6% and 92.8% of defaulters appeared in the
lowest decile and quintile respectively.

2004 2005 2006 2007 2008 2009 2010 2011


Decile 1 99.75 69.60 100.00 100.00 81.04 90.63 100.00 60.22
Decile 2 0.25 4.99 0.00 0.00 9.51 6.60 0.00 20.01
Decile 3 0.00 9.13 0.00 0.00 8.11 1.72 0.00 12.02
Decile 4 0.00 7.01 0.00 0.00 1.18 0.51 0.00 3.05
Decile 5 0.00 4.33 0.00 0.00 0.16 0.43 0.00 1.25
Deciles 6-10 0.00 4.94 0.00 0.00 0.00 0.10 0.00 3.44
AUC Mean 98.97 89.01 98.48 99.74 95.25 97.14 99.57 89.23
AUC Std Dev 0.99 10.29 1.11 0.37 3.07 2.14 0.43 13.55

Table 6
Random Forest out-of-sample 3 month prediction performance by year. The percentage of bond observation months that default
within a 3 month horizon are shown by decile, along with the mean and standard deviation of AUC scores. Results are averaged
over 1,000 iterations.

As with 12m, results are also presented in Table 7, filtered for bond observations for which a DD
measure was available. Both models generally perform exceptionally well, but the DD model assigns
defaulting bonds to the highest 5 deciles more often, suggesting that the random forest gains a benefit
from its additional input variables.

2004 2005 2006 2007 2008 2009 2010 2011


Decile 1 70.10 99.92 0.00 100.00 81.87 87.25 100.00 41.27
Decile 2 29.90 0.08 0.00 0.00 2.19 12.28 0.00 20.78
Decile 3 0.00 0.00 0.00 0.00 3.13 0.48 0.00 9.22
Decile 4 0.00 0.00 0.00 0.00 3.14 0.00 0.00 0.00
Decile 5 0.00 0.00 0.00 0.00 4.45 0.00 0.00 20.20
Deciles 6-10 0.00 0.00 100.00 0.00 5.23 0.00 0.00 8.53
AUC Mean 93.74 98.34 30.97 99.32 89.75 96.11 99.80 78.37
AUC Std Dev 4.36 1.44 2.83 0.55 5.81 1.53 0.29 17.98
p-value <0.001 <0.001 <0.001 <0.001 <0.001 <0.001 <0.001 <0.001

Table 7
DD out-of-sample 3 month prediction performance by year filtered on bond observations with a DD measure. The percentage of
bond observation months that default within a 3 month horizon are shown by decile, along with the mean and standard deviation
of AUC scores. Finally, p-values from a Wilcoxon signed rank test for group differences against the RF model are reported. Results are averaged over 1,000
iterations.

5.2 Robustness
In this section we examine the robustness of the results. The choice of a three year period for
the training data set was somewhat arbitrary. A natural question is: could the results be improved by

using a larger training set spanning more years? Table 8 presents the random forest results when the
rolling window used to define the training data set is varied to two or four years. No significant impact
is observed.
2Y 3Y 4Y
Decile 1 80.5 80.3 81.6
Decile 2 10.1 10.2 9.0
Decile 3 5.2 5.5 5.0
Decile 4 2.4 2.3 2.5
Decile 5 1.0 1.0 1.1
Deciles 6-10 0.8 0.7 0.8
AUC mean 94.6 94.7 94.7
AUC std dev 4.7 4.6 4.7

Table 8
Out-of-sample 12 month prediction performance, when the rolling window for training data sets is varied between two and four
years. Results are averaged over 1,000 iterations and multiple rolling windows.

Other studies have specifically excluded financial firms because of the distinct nature of their business
and its impact on measures of financial performance. Removing financial firms (those with a Bloomberg
industry sector code of 10008) from our sample reduces the data set to 143,047 bond-month observations.
Around 10% of non-financial companies experience default over our sample period, compared with 16%
of financial firms. Table 9 shows the model performance results for a 12m horizon when the data set is
filtered to exclude financials.
              RF      RF DD   DD      EQTY    Z-score
Decile 1      86.4    85.3    71.5    64.7    55.1
Decile 2       8.7     9.3     8.7    10.2    14.4
Decile 3       3.2     3.7     4.2     6.4    10.6
Decile 4       1.1     1.2     1.4     7.6     3.4
Decile 5       0.3     0.3     1.3     3.8     2.1
Deciles 6-10   0.2     0.2    13.0     7.3    14.4
AUC mean      96.2    96.0    86.7    86.2    78.3
AUC std dev    3.6     3.9    18.7    13.4    23.4

Table 9
Out-of-sample 12 month prediction performance when bond observations relating to financial companies are excluded. Results are
averaged over 1,000 iterations and 8 rolling windows.

Improvements are evident for the RF and DD related models, but not for the excess return and Z-Score
models. This is despite the fact that our original model included an attribute to identify financial
firms. Segregation of financial and non-financial firms may therefore be merited. Direct comparisons
of our full-sample results with existing results that exclude financial firms would therefore initially appear less flattering.
In the random forest model proposed above, we have included DD as one of the algorithm's
attributes. As has been noted elsewhere (Bharath & Shumway, 2008), if the Merton (1974) model is
literally true then DD should be the only metric required to explain corporate default. It could be
argued that DD is the main contributor to the algorithm's performance and that the other attributes used
possess little explanatory power. It is therefore important to test this possibility empirically.
Assessing the change in the AUC by including and excluding the DD can reveal how sensitive the
model is to this attribute. For the sake of completeness, the Altman Z-score is included in the tests as
well. Although the random forest samples attributes at random for each tree, if these key risk metrics
provide valuable insights into the credit risk of a bond, then random forests including these attributes
should improve the AUC scores over the baseline model. Four random forests were run: 1) the baseline
model, outlined in the previous section; 2) baseline model excluding Z-Score attribute; 3) baseline model
excluding DD attribute; 4) baseline model excluding Z-score and DD attributes. The procedure followed

that of the methodology above, where 1,000 iterations were performed for each year from 2004-2011.
The AUC scores for each random forest model were used to implement Kruskal-Wallis tests, with
the null hypothesis being that the samples have equal locations (Khan & Rayner, 2003). Across all but one
time period (2009), the p-values are statistically significant at the 0.1% level, and so we can in general reject the null
hypothesis that the samples have equal locations.
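The location test is available in SciPy (an illustrative sketch; the per-model AUC samples passed in would come from the 1,000 iterations described above):

```python
from scipy.stats import kruskal

def attribute_ablation_test(*auc_samples):
    """Kruskal-Wallis H-test across AUC samples from the baseline model
    and its ablated variants; H0 is that all samples share a location."""
    stat, p = kruskal(*auc_samples)
    return p
```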
Another important issue relates to the extent that adding randomness to the data impacts the
predictive accuracy of the algorithm. If the attributes selected provide non-spurious insights into bond
default prediction, then increasing the proportion of attributes with random values should lead to a
deterioration in the predictive accuracy of the model. Five random forests were run: 1) the baseline
model, outlined in the previous section; 2) baseline model with 25% of attributes replaced with random
values; 3) baseline model with 50% of attributes replaced with random values; 4) baseline model with
75% of attributes replaced with random values; 5) baseline model with 100% of attributes replaced with
random values. AUC scores were used in the Kruskal-Wallis test, with the null hypothesis being that the
samples have equal locations. In each time period the null hypothesis is rejected at the 0.1% level.
The inclusion of attributes and targets with random values therefore produces a statistically significant
decrease in accuracy, as measured by AUC scores, over the 2004-2011 time periods.
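The noise-injection step can be sketched as follows (a hypothetical helper, not the paper's implementation; it replaces a chosen fraction of attribute columns with uniform random values):

```python
import numpy as np

def randomise_attributes(X, fraction, seed=0):
    """Replace a given fraction of attribute columns with uniform noise,
    leaving the remaining columns untouched."""
    rng = np.random.default_rng(seed)
    X = np.array(X, dtype=float, copy=True)
    n_cols = X.shape[1]
    k = int(round(fraction * n_cols))
    cols = rng.choice(n_cols, size=k, replace=False)
    X[:, cols] = rng.uniform(size=(X.shape[0], k))
    return X
```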
However, recall that the random forest algorithm is robust to noise (Breiman, 2001), so its predictive accuracy should not decline when a comparatively small portion of its attributes are pure noise
(random numbers). As an increasing proportion of attributes is replaced with random numbers, the
predictive power of the forest declines significantly, until its median AUC score reaches 50% when all
attributes are random. The model's predictive accuracy is therefore not a consequence of spuriousness in the attributes selected. This suggests that the inclusion of additional attributes does improve
the accuracy of the model.

5.3 Decile Boundaries


We now consider how the model dynamics change over time. Figure 5 shows random forest scores, relative
to decile boundaries, for selected companies that ultimately default. Figure 6 shows corresponding DD
results. To aid comparison, votes cast against default are plotted so that, like DD, a higher score implies
a bond is less likely to default. Loosely interpreted, crossing the threshold and moving into the lowest
decile could be viewed as a warning of potential imminent default.
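Loosely, this warning rule can be sketched as follows (a hypothetical helper; scores are votes against default, so a bond enters the lowest decile when its score drops below the cross-sectional 10th percentile for that month):

```python
import numpy as np

def warning_flags(scores_by_month, company_scores):
    """Flag months in which a company's score (votes against default)
    falls below the cross-sectional 10th percentile, i.e. the bond
    enters the lowest decile."""
    flags = []
    for cross_section, score in zip(scores_by_month, company_scores):
        boundary = np.percentile(cross_section, 10)
        flags.append(bool(score < boundary))
    return flags
```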

(a) THMRQ (TMST Inc) (b) UHAL (AMERCO)

Figure 5
Average random forest results for two defaulting companies. Random forest votes (against) are shown as a solid red line. The
dotted lines represent the 10th, 50th and 90th percentiles. Crossing the threshold into the lowest decile would be interpreted as
an increased likelihood of default.

The change in RF voting for TMST Inc in August 2007 is dramatic, while the DD measure experiences
swings from 2005 and enters the lowest DD decile at an earlier time. Both models react to the large
fall in share price and increased volatility; however, the RF model benefits from the inclusion of an
additional 10-day volatility measure, which increased nineteenfold, and the three-notch credit rating cut
from Ba2 to B2. By contrast, AMERCO is viewed with suspicion much earlier by the RF model. The
continuous DD distribution has a relatively stable spread, although this narrows when the market is under
stress. By contrast, the (bounded) random forest votes allow for relatively little discrimination between
bonds in the lower deciles except when the market is under stress. Both measures react strongly to
market stress.

(a) THMRQ (TMST Inc) (b) UHAL (AMERCO)

Figure 6
Average DD results for two defaulting companies. DD scores are shown as a solid red line. The dotted lines represent the 10th,
50th and 90th percentiles. Crossing the threshold into the lowest decile would be interpreted as an increased likelihood of default.

6 Conclusion
In this paper we have presented a methodology for implementing a random forest approach to provide
a rank measure of the likelihood of bond default. In permitting a moderate number of variables to
influence prediction, the model can be considered as an expert that blends the combined wisdom from
multiple sources. Important measures such as distance-to-default can be incorporated into the model,
but moderated with other financial indicators that can heighten or dampen the perceived risk. Using
US corporate bond data from 2000-2012, we have further demonstrated the potential performance of
such an approach compared with simple and composite variables. Performance results match or exceed
those of previous studies. The model therefore appears to have merit as an early warning system and
aid in the financial decision making process over short-term time horizons.
Traditional approaches include relatively few variables drawn from economic, accounting and market
data, often in combination to create a single composite measure such as distance-to-default. Decision
makers are likely to draw on information outside their preferred model, but this becomes increasingly
difficult with large data sets. Random forests can accommodate a large number of input parameters
without making any judgements about their economic or statistical significance. Advantages of the
approach include the ease with which new variables (including nominal and ordinal ones) and modeler
preferences can be incorporated into the model, its ability to accommodate outliers without special
treatment, and the capacity to handle missing data attributes. An interesting outcome is that the
random forest ranking appears stable relative to the DD measure, in that the forest detects the
default signal amid the noise of the data in a more consistent manner than the DD
measure.
There are however some disadvantages of the random forest model that deserve mention. Firstly,
the random forest is by its very nature a black box model. While interpretation of the voting results
is routine, understanding the configuration and contribution of individual decision trees requires effort.
Related to this is the fact that the random forest algorithm is computationally demanding when compared
to the (relative) simplicity of the DD model. In terms of the dataset itself, the starting date was
constrained by the availability of credit rating data; ideally a longer period would be preferred. However,
our sample period does span different macroeconomic cycles and includes the turmoil of the financial
crisis of 2008-2009.
We interpret our findings as supporting earlier claims that while distance-to-default is an important
predictor of default, its significance diminishes greatly in the presence of other variables. We do not
claim to have found the optimal set of input variables; further improvements are undoubtedly possible.
Future work could be directed at devising a robust methodology for selecting the optimal set of input
variables, and further extensions - such as inclusion of industry effects - may enhance performance of
the forest. A longer data set would also provide greater scope for testing predictive accuracy for longer
time horizons. Finally, we believe that further enhancements to the model are possible and that the
approach could be productively applied to other predictive tasks in the area of credit and beyond.

References
Altman, E. I. (1968). Financial ratios, discriminant analysis, and the prediction of corporate bankruptcy.
Journal of Finance, 25 (4), 589609.
Angelini, E., di Tollo, G., & Roli, A. (2008). A neural network approach for credit risk evaluation. The
Quarterly Review of Economics and Finance, 48 (4), 73355.
Atiya, A. F. (2001). Bankruptcy prediction for credit risk using neural networks: A survey and new
results. IEEE Transactions on Neural Networks, 12 (4), 92935.
Azizpour, S., Giesecke, K., & Kim, B. (2011). Premia for correlated default risk. Journal of Economic
Dynamics & Control , 35 (8), 134057.
Bharath, S. T., & Shumway, T. (2008). Forecasting default with the Merton distance to default model.
Review of Financial Studies, 21 (3), 13391369.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24 (2), 12340.
Breiman, L. (2001). Random forests. Machine Learning, 45 , 532.
Breiman, L., Freidman, J., Olshen, R. A., & Stone, C. S. (1984). Classification and regression trees.
Chapman & Hall.
Campbell, J. Y., Hilscher, J., & Szilagyi, J. (2008). In search of distress risk. The Journal of Finance,
63 (6), 28992939.
Cantor, R., & Mann, C. (2003). Measuring the performance of corporate bond ratings. available from
http://ssrn.com/abstract=996025.
Duffie, D., Eckner, A., Horel, G., & Saita, L. (2009). Frailty correlated default. The Journal of Finance,
64 (5), 20892123.
Duffie, D., Saita, L., & Wang, K. (2007). Multi-period corporate default prediction with stochastic
covariates. Journal of Financial Economics, 83 , 63565.
Duffie, D., & Singleton, K. J. (2003). Credit risk : pricing, measurement, and management. Princeton
University Press.
Fama, E. F., & French, K. R. (1998). Dividend yields and expected stock returns. Journal of Financial
Economics, 22 (1), 325.
Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27 (8), 86174.

18
Giesecke, K., Longstaff, F. A., Schaefer, S., & Strebulaev, I. (2011). Corporate bond default risk: A
150-year perspective. Journal of Financial Economics, 102(2), 233–250.
Hamilton, D. T., & Cantor, R. (2004). Rating transitions and default rates conditioned on outlooks.
Journal of Fixed Income, 14(2), 54–70.
Hillegeist, S. A., Keating, E. K., Cram, D. P., & Lundstedt, K. G. (2004). Assessing the probability of
bankruptcy. Review of Accounting Studies, 9(1), 5–34.
Ho, T. K. (1998). The random subspace method for constructing decision forests. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 20(8), 832–844.
Huang, Z., Chen, H., Hsu, C.-J., Chen, W.-H., & Wu, S. (2004). Credit rating analysis with support
vector machines and neural networks: a market comparative study. Decision Support Systems,
37(4), 543–558.
Jarrow, R. (2001). Default parameter estimation using market prices. Financial Analysts Journal,
57(5), 75–92.
Jarrow, R. A., & Turnbull, S. M. (2000). The intersection of market and credit risk. Journal of Banking
& Finance, 24(1), 271–299.
Jessen, C., & Lando, D. (2015). Robustness of distance-to-default. Journal of Banking & Finance, 50,
493–505.
Jones, S., Johnstone, D., & Wilson, R. (2015). An empirical evaluation of the performance of binary
classifiers in the prediction of credit ratings changes. Journal of Banking & Finance, 56, 72–85.
Jonsson, J. G., & Fridson, M. S. (1996). Forecasting default rates on high-yield bonds. Journal of Fixed
Income, 6(1), 69–77.
Khan, A., & Rayner, G. D. (2003). Robustness to non-normality of common tests for the many-sample
location problem. Journal of Applied Mathematics and Decision Sciences, 7(4), 187–206.
Kim, M.-J., & Kang, D.-K. (2010). Ensemble with neural networks for bankruptcy prediction. Expert
Systems with Applications, 37(4), 3373–3379.
Kuhn, M., & Johnson, K. (2013). Applied predictive modeling. Springer.
Lando, D., & Nielsen, M. S. (2010). Correlation in corporate defaults: Contagion or conditional
independence? Journal of Financial Intermediation, 19(3), 355–372.
Lin, F. Y., & McClean, S. (2001). A data mining approach to the prediction of corporate failure.
Knowledge-Based Systems, 14, 189–195.
Lin, W.-Y., Hu, Y.-H., & Tsai, C.-F. (2012). Machine learning in financial crisis prediction: A survey.
IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 42(4),
421–436.
Merton, R. C. (1974). On the pricing of corporate debt: The risk structure of interest rates. Journal of
Finance, 29(2), 449–470.
Miedema, D., & Hall, K. V. (2014). MF Global Holdings to pay $100 mln fine in CFTC settlement.
available from www.reuters.com/article/2014/12/24/mf-global-hldg-settlement
-idUSL1N0U813I20141224. (retrieved 7th October 2015)
Min, J. H., & Jeong, C. (2009). A binary classification method for bankruptcy prediction. Expert
Systems with Applications, 36(3), 5256–5263.
Ohlson, J. A. (1980). Financial ratios and the probabilistic prediction of bankruptcy. Journal of
Accounting Research, 18(1), 109–131.
Ou, S., Chiu, D., & Metz, A. (2011). Corporate default and recovery rates, 1920–2010. Moody's
Investors Service, Report Number 131388.
Peterson, K., & Daily, M. (2011). American Airlines files for bankruptcy. available from www.reuters
.com/article/2011/11/30/us-americanairlines-idUSTRE7AS0T220111130. (retrieved 7th October 2015)
Shin, K.-S., & Lee, Y.-J. (2002). A genetic algorithm application in bankruptcy prediction modeling.
Expert Systems with Applications, 23(3), 321–328.
Stein, R. M. (2007). Benchmarking default prediction models: pitfalls and remedies in model validation.
Journal of Risk Model Validation, 1(1), 77–113.
Stempel, J. (2015). PwC to pay $65 mln to resolve lawsuit over MF Global. available from
www.reuters.com/article/2015/04/17/pricewaterhousecoopers-mfglobal-lawsuit
-idUSL2N0XE1UK20150417. (retrieved 7th October 2015)
Sun, J., & Li, H. (2008). Data mining method for listed companies' financial distress prediction.
Knowledge-Based Systems, 21(1), 1–5.
Sung, T. K., Chang, N., & Lee, G. (1999). Dynamics of modeling in data mining: Interpretive approach
to bankruptcy prediction. Journal of Management Information Systems, 16(1), 63–85.
Surowiecki, J. (2011). Living by default. available from www.newyorker.com/magazine/2011/12/19/
living-by-default. (retrieved 7th October 2015)
Vassalou, M., & Xing, Y. (2004). Default risk in equity returns. The Journal of Finance, 59(2),
831–868.
Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques.
Elsevier.
Zadeh, L. A. (1965). Fuzzy sets. Information and Control, 8(3), 338–353.
Zhao, H., Sinha, A. P., & Ge, W. (2009). Effects of feature construction on classification performance:
An empirical study in bank failure prediction. Expert Systems with Applications, 36(2), 2633–2644.