SSRN Id4726993

Machine Learning Approach for Predicting U.S.
ETFs’ Tracking Errors

– Implications on U.S. Invested Fund
Jin-Hyung Cho*, Gun-Hee Lee**, Won-Eung Lee***, Bong-Jun Kim****
First version: Oct. 16th, 2023
ABSTRACT
In recent decades, machine learning (ML) algorithms have gained wide popularity in the
finance literature. The goal of this research is to use machine learning techniques to
analyze the effect of exchange-traded fund (ETF) illiquidity on tracking errors. We
demonstrate the superior performance of machine learning models, Random Forest and
Gradient Boosted Decision Tree (GBDT) in particular, over traditional linear models in
predicting U.S. ETFs’ tracking errors. Moreover, our variable importance analysis suggests
that two features, whether the underlying assets are U.S. assets (Invested in US Asset) and
the expense ratio (Expense Ratio), are the key factors in predicting the tracking errors that
arise from ETF illiquidity. Finally, we apply the SHAP (Shapley Additive exPlanations)
technique to observe the impact of a particular variable (feature) on the difference between
the considered prediction and the average prediction of our machine learning models. Our
results indicate that the most relevant variable is Invested in US Asset, which is in line with
the preceding importance analysis.
JEL Classification: G11, G12

Keywords: ETF, Tracking Error, Machine Learning, SHAP
*Corresponding Author, Research Fellow (Ph.D.), Kakao, E-mail: enish27@daum.net

**First Author, College of Economics and Finance, Hanyang University, E-mail: egh962@hanyang.ac.kr
***Co-Author, College of Economics and Finance, Hanyang University, woneunglee@hanyang.ac.kr
****Co-Author, Viterbi School of Engineering, University of Southern California, bongjun@usc.edu
Electronic copy available at: https://ssrn.com/abstract=4726993

I. Introduction
What determines the effect of exchange-traded fund (ETF) liquidity on tracking errors?
Despite the growing body of research on ETFs, this question remains largely unexplored. In
past years, a substantial number of studies have examined the performance of ETFs
(Dorocáková, 2017; Elton et al., 2019; Gallagher & Segara, 2005; Gastineau et al., 2004;
Harper et al., 2006; Levy and Lieberman, 2013; Lu et al., 2009; Marshall et al., 2013;
Svetina, 2010; Wu et al., 2021), volatility and risk premium (Ben-David et al., 2018; Hurlin
et al., 2019), the relationship with the underlying portfolio (Box et al., 2021) and so on. In
contrast, studies on tracking errors that arise from ETF illiquidity are still quite limited
(Bae and Kim, 2020; Ben-David et al., 2017; Broman and Shum, 2018; Dorocáková, 2017;
Madhavan, 2014; Pu and Xie, 2022). We delve further into the topic by employing machine
learning methodologies, which have recently gained wide popularity in finance research.
To be specific, we employ machine learning techniques to capture the determinants that
explain the relationship between the illiquidity of U.S. exchange-traded funds and their
tracking errors. We adopt machine learning methodologies such as Random Forest and GBDT
(Gradient Boosted Decision Tree) and compare their performance in predicting the tracking
errors of U.S. ETFs with that of traditional methodologies such as linear regression, LASSO
and Ridge regression. We demonstrate the superior performance of machine learning over the
traditional methodologies, which is discussed in detail later.
One of the well-known studies on the illiquidity of ETFs listed in the U.S. market (U.S.
ETFs, hereafter) and their tracking errors is Bae and Kim (2020). Specifically, they discuss
the effect of U.S. ETF illiquidity on tracking errors and examine whether a ‘positive
liquidity premium’ exists in the U.S. ETF market. We adopt their research as the theoretical
framework on which our machine learning techniques are tested. Our study further captures
the determinants of tracking errors that result from the illiquidity of U.S. ETFs, and
presents the interaction between the variables by conducting SHAP (Shapley Additive
exPlanations) analysis. To the best of our knowledge, this is the first study to apply
machine learning methodologies to the illiquidity of U.S. ETFs and their tracking errors. We
further document that the tracking error of U.S. ETFs is best captured and explained by
variables including whether the underlying portfolio is based on the U.S. market (Invested
in US Asset) and the expense ratio (Expense Ratio), which is in line with previous research
on ETFs (Aber et al., 2009; Dorocáková et al., 2017; Johnson, 2009; Saunders, 2018; Tsalikis
& Papadopoulos, 2019).
The U.S. ETF market provides an ideal laboratory for examining ETF tracking error, which is
mainly the result of illiquidity (Buetow & Henderson, 2012; Chen et al., 2017; Guo & Leung,
2015; Osterhoff & Kaserer, 2016; Shin & Soydemir, 2010). In the fourth quarter of 2022,
average daily trading volumes for U.S. equities and U.S. ETFs were $502.2 billion and $157.7
billion, respectively. This means that U.S. ETFs accounted for 31.4% of the total U.S.
composite volume in the secondary market over the quarter.1 The growing volume of ETF
investment in the U.S. suggests that the ETF has become one of the best-known investment
products in the financial market. However, because this popularity is concentrated in a few
funds, the top ten ETFs account for roughly 36.8% of total assets under management (AUM) at
the end of 2022.2 The remaining “non-popular” ETFs suffer from relatively low liquidity,
thus generating high transaction costs for investors and the capital market (Angel et al.,
2016; Bae and Kim, 2020; Cheng & Madhavan, 2009; Houweling, 2012; Johnson et al., 2016;
Liebi, 2020; Tokat & Hayrullahoğlu, 2022).
However, our paper suggests that the “low liquidity” of the less popular ETFs is
attributable not only to relatively low trading volume, but also to other factors related to
the characteristics of the ETFs and of the financial market. For example, if an investor
considers investing in two ETFs with similar liquidity, factors such as the underlying
portfolio (e.g., U.S. invested, emerging market, currency) or dollar trading volume (DTV)
could be of more concern to him or her. Further, we depart from previous research in that we
focus more on the “explanatory characteristics” of the determinants and their interaction in
explaining the tracking errors of U.S. ETFs, which could be unique to the U.S. financial
market, whereas the majority of previous studies on the effect of U.S. ETF illiquidity on
tracking errors (for example, Bae and Kim, 2020) concentrate on explaining the secondary
market liquidity issue.
Among passive ETFs, the popular structures are the physical ETF and the synthetic ETF. While
a physical ETF tracks the target index by holding the underlying assets of the index, a
synthetic ETF replicates the index by using derivatives. The advantage of a synthetic ETF is
its cost-efficiency, in contrast with a physical ETF, which can be quite expensive when it
tracks an emerging market or less liquid market index. As for active funds, it is widely
known that the tracking error of an ETF relates to the capability of its managers. If ETF
managers record high tracking errors in replicating the index, it is believed that they
underperform the index ETF. In this respect, it is reasonable to believe that “capable” ETF
managers would be able to handle low liquidity and outperform the market to some extent
(Koont et al., 2022; Li, 2022).
Following Bae and Kim (2020), we exclude active funds from our dataset, as the tracking
errors of active funds can be the result of “management style”. By focusing on passive ETFs,
we can better capture the determinants of the effect of U.S. ETFs’ illiquidity on tracking
errors through machine learning techniques. To be specific, we believe that if the
prediction performance of our machine learning methodologies is superior to that of the
traditional methodologies, it would be reasonable to assume that the machine
1 Global ETF Market Facts: Three Things to Know from Q4 2022, 2023, Nasdaq.
https://www.nasdaq.com/articles/global-etf-market-facts:-three-things-to-know-from-q4-2022#:~:text=In%20the%20fourth%20quarter%20of%202022%2C%20average%20daily,volume%20in%20the%20secondary%20market%20over%20the%20quarter.
2 See Table 1 in Appendix for further details.

learning algorithms would be better able to capture the determinants that cause tracking
errors of U.S. ETFs. Further, by adopting SHAP (Shapley Additive exPlanations), we attempt
to show how one important feature, which is a determinant in the prediction of U.S. ETFs’
tracking errors, interacts with other variables and influences the overall output of our
model’s prediction.
The major findings of our paper are as follows. First, we demonstrate the superior
performance of machine learning methodologies (e.g., Random Forest, GBDT) over traditional
regressions (e.g., linear regression, LASSO, Ridge). This implies that machine learning
successfully handles the nonlinearities and overfitting issues embedded in the U.S. ETF
dataset. Second, our machine learning analysis reveals that whether the underlying assets
are U.S. assets (Invested in US Asset) and the expense ratio (Expense Ratio) are the two key
determinants that explain the effect of the illiquidity of U.S.-listed ETFs on their
tracking errors. This result aligns well with the traditional view that investors prefer
cost-efficiency and low management fees. To be specific, in an economic downturn, investors
could be more likely to allocate their underlying portfolio to developed markets (e.g.,
U.S., Europe) over developing markets (Jhunjhunwala & Sethi, 2022; Madhavan & Maheswaran,
2016; Madhavan, 2014; Narend & Thenmozhi, 2019; Sarkar et al., 2013). Third, our SHAP
analysis indicates that the aforementioned determinants, especially Invested in US Asset,
tend to negatively affect the model output, which implies that these determinants lower the
predicted output (tracking error).
The contributions of this research are as follows. First, to our knowledge, this study is
the first to adopt machine learning techniques in analyzing the illiquidity effect of U.S.
ETFs. Thanks to the superior performance of machine learning, we verify the determinants,
Invested in US Asset and Expense Ratio in particular, of the effect of U.S. ETFs’
illiquidity on their tracking errors, which aligns with previous research (Blitz & Huij,
2012; Charupat & Miu, 2013; Shin & Soydemir, 2010; Tsalikis & Papadopoulos, 2019; Zawadzki,
2020). Second, our findings reaffirm the importance of specific variables in predicting the
tracking errors that come from the illiquidity of U.S. ETFs. The “consistency” of our
machine learning models in capturing Invested in US Asset suggests that its importance could
have been largely underestimated, possibly because previous research overemphasized trading
volume as an indicator of tracking error (Dorocáková, 2017; Gallagher & Segara, 2005; Shin
& Soydemir, 2010; Tsalikis & Papadopoulos, 2019; Vardharaj et al., 2004). This is also
supported by our analysis of the traditional methodologies, which suggest Log(AUM) as the
main determinant.
The rest of our paper is organized as follows. Section 2 reviews previous research and
Section 3 presents the data and methodology. Section 4 presents the empirical analysis,
followed by the discussion in Section 5. Finally, Section 6 concludes.

II. Literature review
2-1. The Effect of Illiquidity of ETFs on Tracking Error
Generally, ETFs tend to co-move with their respective underlying portfolios, which may
affect systematic (in)stability, as investors may “simultaneously” benefit from ETF returns
or face losses. In an effort to deliver stable returns, the number of ETFs has grown
rapidly, along with a wider range of underlying portfolios such as bonds, commodities,
currencies and equities. In particular, previous research indicates that, due to the
popularity of the top ETFs, relatively non-popular ETFs suffer from low liquidity, which
causes high transaction costs and tracking errors, so that authorized participants (APs)
could be discouraged from replicating the underlying portfolio (Bae and Kim, 2020;
Dorocáková, 2017; Pu and Xie, 2022).
By definition, the tracking error of an ETF refers to the standard deviation of the
difference between the return of the ETF portfolio and the return of the underlying
portfolio (Vardharaj et al., 2004). It is widely believed that tracking error is
unavoidable, so APs endeavor to keep tracking errors as low as possible. If the decoupling
between the returns of an ETF and its underlying portfolio widens, it could lead to a loss
of faith in the liquidity formation of ETFs. In this context, previous research notes that
the fundamentals of ETF performance are the price volatility of the underlying portfolio and
the arbitrage mechanism (Ben-David et al., 2018; Grossman, 1987; Humphreys & McClain, 1998;
Pagano et al., 2019; Tuzun, 2013). At the same time, it is believed that trading ETFs could
also lead to price discovery for the underlying assets, which is mainly the result of the
interaction between the liquidity of ETFs and “noisy” traders (Box et al., 2021; Ivanov et
al., 2013).
Previous research notes that tracking error may be attributable to a number of factors.
First, tracking errors largely depend on the regional features of the underlying portfolios.
For instance, using samples of different ETFs issued by iShares, six for each of three
regions (two Americas, two Asia and two Europe), Zawadzki (2020) notes that ETFs fail to
mimic their underlying indexes, showing that the calculated tracking errors are often
significantly negative and, further, that the value of the tracking errors depends on the
region and market maturity. Second, economic events also influence the tracking error of ETF
returns. For example, Svetina (2010) demonstrates that the 2008-2009 financial crisis
negatively affected the daily tracking performance of all ETFs listed on the London Stock
Exchange (LSE).
Lastly, and most importantly, our research is closely linked with Bae and Kim (2020), who
document that when ETFs are not liquid, tracking errors are large. Their research
demonstrates that ETF illiquidity relates to ETF tracking errors along with variance and
expected returns. They further imply that, because of ETF illiquidity, APs incur higher
transaction costs for arbitrage trading, since they fail to address tracking errors
immediately, which reflects their inability to properly track the underlying index. It is
also worth noting that ETF structure relates to the effect of ETF illiquidity on tracking
errors, for example whether the ETF is of the in-kind type or based on U.S. stocks. The key
difference between our paper and our benchmark paper, Bae and Kim (2020), is that while
their regression-based approach focuses on the relationship between tracking errors and U.S.
ETFs’ illiquidity, our machine learning-based approach concentrates on capturing the
“determinants” and their interactions with other variables in the relationship between U.S.
ETF illiquidity and tracking errors. This will be discussed later in detail.
2-2. The Machine Learning approach to financial market (ETF)
In finance research, a growing number of studies have adopted machine learning techniques in
analyzing capital structure (Amini et al., 2021; Bilgin, 2023; Eliasy & Przychodzen, 2020;
Gaytan et al., 2022), dividend payout policy (Avramov et al., 2021; Hu et al., 2021; Ivașcu,
2023; Kamalov et al., 2021; Obthong et al., 2020; Yaseen & Dragotă, 2021) and the
performance of corporate and treasury bonds (Bali et al., 2020; Kim et al., 2021; Kim,
2021). The majority of these studies employ various machine learning techniques to predict
performance and capture variable importance, and mostly report superior performance over
traditional regression methods such as linear regression and LASSO.
Among them, a number of studies on ETFs have employed machine learning techniques, mostly to
analyze the “performance aspect”. It is interesting to note that the results of research
using machine learning algorithms align well with the traditional random walk theory, which
asserts that individual stocks do not move in a distinguishable pattern, so that short-term
future movements are difficult to predict in advance (Cootner, 1964; Fama, 1965). For
example, Liew and Mayster (2017) employ machine learning techniques such as Deep Neural
Networks, Random Forest and Support Vector Machines, and split the information sets into
past returns, past volume, dummies for days and months, and a combination of all three. They
show that the machine learning algorithms perform well over horizons of one to three months,
but that the calendar dummies add little predictive power over shorter periods (e.g., days).
Moreover, employing a variety of machine learning algorithms, such as LSTM and the
CNN-BiLSTM-AM model, Zheng (2021) reports that LSTM shows higher accuracy in ETF price
prediction than ARIMA and DNN, and that simulated trading on a test data set of about three
years yields a net gain of 63.7%.
III. Data and Methodology
We now define the variables for our sample. As mentioned above, we base our variables and
the calculation of tracking errors on our benchmark paper, Bae and Kim (2020). Our primary
sample includes passive ETFs listed in the U.S. market, which track gold, commodities,
physical assets, equities and indexes as their underlying portfolios. Table 1 below presents
the yearly composition of our ETF sample. As shown in the table, our observation period runs
from 2015 to 2022, and both yearly and daily ETF data are used in the analysis. The table
also shows the growing trading volume of ETFs in the U.S. market. Specifically, the dollar
volume of our sample nearly tripled, from $48.62 billion in 2015 to $141.52 billion in 2022.
Table 2 then presents the composition of our sample by underlying asset as of 2022. Out of
1,568 samples in total, stock index constituents account for more than 50% of all indexes,
with a dollar trading volume of roughly $97.9 billion. In contrast, physical assets account
for only 6% of all indexes, with a volume of roughly $0.99 billion.
[Table 1 Insert Here]
As mentioned above, we take the research framework of Bae and Kim (2020) as our baseline and
selectively employ their variables. Their control variables fall into three categories: (1)
size and volatility, (2) characteristics of the ETF and (3) synthesis and management
expense. Log(AUM), Log(Dollar Trading Volume), Index Volatility, Log(Shares Outstanding) and
Shares Volatility belong to the first category, while Equity-Type ETF, Invested in US
Assets, Swap Based, Derivates Based, Leveraged Fund, Futures Available and Options Available
are included in the second category. The variables of the third category are In-Kind,
Optimized, and Expense Ratio. We download the financial data for each variable from
Bloomberg and the NYSE Trade and Quote (TAQ) database. The details and sources of our
variables are presented in Table 3.
Next, we present the measure of ETF illiquidity. We borrow the liquidity measure from Bae
and Kim (2020): the daily relative effective half-spread, which is based on the absolute
difference between the quote midpoint and the trade price. As in our benchmark study, we
calculate it from the NYSE TAQ database. The TAQ database contains intraday transaction
data, both trades and quotes, for securities listed on the NYSE, the American Stock Exchange
(AMEX) and the Nasdaq National Market System (NMS). In eq. (1) below, each trade’s absolute
deviation from the quote midpoint is standardized by the trade price, and the results are
averaged over the day’s trades to obtain each security’s daily relative effective
half-spread. In the equation, $p^{i}_{k,t}$, $m^{i}_{k,t}$ and $n^{i}_{t}$ refer to the trade
price, the midpoint of the quote price and the number of trades, at time $k$ on day $t$ for
each security $i$, respectively.

$$c^{i}_{t} = \frac{1}{n^{i}_{t}} \sum_{k=1}^{n^{i}_{t}} \frac{\left| p^{i}_{k,t} - m^{i}_{k,t} \right|}{p^{i}_{k,t}} \qquad \cdots\ (1)$$
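As a minimal sketch of eq. (1), the function below averages the price-standardized absolute deviation of each trade from its quote midpoint for one security on one day. The function name and the input numbers are illustrative assumptions, not taken from the TAQ data or from Bae and Kim (2020); matching each trade to its prevailing quote is assumed to have been done beforehand.

```python
from statistics import mean


def relative_effective_half_spread(trade_prices, quote_midpoints):
    """Eq. (1): mean of |p_k - m_k| / p_k over the trades of one security on one day."""
    if len(trade_prices) != len(quote_midpoints):
        raise ValueError("each trade needs a matching quote midpoint")
    if not trade_prices:
        raise ValueError("at least one trade is required")
    return mean(abs(p - m) / p for p, m in zip(trade_prices, quote_midpoints))


# Hypothetical intraday trades for one ETF on one day (illustrative numbers only).
prices = [100.0, 100.0, 50.0]
midpoints = [99.0, 101.0, 50.5]
spread = relative_effective_half_spread(prices, midpoints)
print(f"daily relative effective half-spread: {spread:.4f}")
```

Each of the three hypothetical trades deviates from its midpoint by 1% of the trade price, so the daily average is also 1%.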
Subsequently, we present the equations for our tracking error measures, which are
daily-based and yearly-based, respectively. First, eq. (2) and (3) present the daily
tracking error measures for the return difference between the ETF and its net asset value
(NAV) or index. We calculate daily tracking errors for the two return differences (ETF
returns minus NAV returns, ETF returns minus index returns) in absolute value. Second, eq.
(4) shows the yearly tracking error, defined as the absolute value of the difference
between one and the coefficient of $X$ from the regression of $Y$ on $X$. In the equations,
$r_{i,t}$, $v_{i,t}$, $f_{i,t}$ and $\beta_{i,t}$ denote the daily ETF return, the net asset
value (NAV) return, the index (IND) return and the regression coefficient, respectively. The
summary statistics and correlations of our tracking error measures are presented in Table 4.

$$\text{(Daily) Tracking Error}_{i,t} = \left| r_{i,t} - v_{i,t} \right| \qquad \cdots\ (2)$$

$$\text{(Daily) Tracking Error}_{i,t} = \left| r_{i,t} - f_{i,t} \right| \qquad \cdots\ (3)$$

$$\text{(Yearly) Tracking Error}_{i,t} = \left| 1 - \beta_{i,t} \right| \qquad \cdots\ (4)$$
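The tracking error measures in eq. (2) to (4) can be sketched as follows. The text does not say which series is Y and which is X in the yearly regression; this sketch assumes Y is the ETF return and X is the benchmark (NAV or index) return, and that the regression includes an intercept. All names and numbers are hypothetical.

```python
def daily_tracking_error(etf_return, benchmark_return):
    """Eq. (2)/(3): absolute daily return gap between the ETF and its NAV or index."""
    return abs(etf_return - benchmark_return)


def ols_slope(x, y):
    """Slope of the least-squares regression of y on x (with an intercept)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has no variation")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def yearly_tracking_error(etf_returns, benchmark_returns):
    """Eq. (4): |1 - beta|, with beta from regressing ETF returns on benchmark returns
    (assumed direction of the regression)."""
    return abs(1.0 - ols_slope(benchmark_returns, etf_returns))


# Hypothetical daily returns: the ETF captures 90% of each index move plus a constant.
index_returns = [0.01, -0.02, 0.03, 0.00]
etf_returns = [0.9 * r + 0.001 for r in index_returns]

daily_te = daily_tracking_error(0.012, 0.010)
yearly_te = yearly_tracking_error(etf_returns, index_returns)
print(f"daily: {daily_te:.4f}, yearly: {yearly_te:.4f}")
```

In this toy case the regression slope is 0.9, so the yearly measure equals 0.1 even though the constant offset of 0.001 does not enter it.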
Next, we briefly explain the traditional and machine learning methodologies used in our
analysis. As mentioned above, we employ traditional methodologies such as (multivariate)
linear regression, LASSO and Ridge, and machine learning algorithms including Random Forest
and GBDT (Gradient Boosted Decision Tree).
First, we describe the traditional methodologies. Linear regression predicts the value of
the dependent variable from a set of independent variables. In machine learning terms,
linear regression is a supervised learning method. The model assumes a linear relationship
between the independent and dependent variables, and each slope coefficient shows how much
the dependent variable changes for a one-unit change in the corresponding independent
variable. LASSO, a penalized regression, selects the subset of variables that minimizes
prediction error by shrinking some coefficients to zero (Tibshirani, 1996). Ridge regression
is useful in mitigating multicollinearity in linear regression. It imposes a penalty on the
size of the coefficients and minimizes a penalized residual sum of squares (Hoerl & Kennard,
2012). Through this process, the coefficient estimates become more stable, and their
variance, and often their mean squared error (MSE), becomes smaller than that of ordinary
least squares (OLS) estimates.
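For reference, the three linear estimators can be written in their standard textbook forms. The notation here ($N$ observations, $p$ predictors, penalty weight $\lambda \ge 0$) is an assumption, since the text states neither the objective functions nor how the penalty weight is chosen.

```latex
\begin{align*}
\hat{\beta}^{\mathrm{OLS}}   &= \arg\min_{\beta_0,\,\beta} \sum_{i=1}^{N} \bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2 \\
\hat{\beta}^{\mathrm{LASSO}} &= \arg\min_{\beta_0,\,\beta} \Bigl\{ \sum_{i=1}^{N} \bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \Bigr\} \\
\hat{\beta}^{\mathrm{Ridge}} &= \arg\min_{\beta_0,\,\beta} \Bigl\{ \sum_{i=1}^{N} \bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2 + \lambda \sum_{j=1}^{p} \beta_j^{2} \Bigr\}
\end{align*}
```

With $\lambda = 0$ both penalized estimators reduce to OLS. The absolute-value penalty in LASSO can set coefficients exactly to zero, which is why it acts as a variable selector, while the squared penalty in Ridge only shrinks them.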
We now explain our machine learning algorithms, Random Forest and GBDT. First introduced by
Breiman (2001), a random forest is an ensemble estimator that fits decision trees on a
variety of randomly selected subsamples of the dataset. The algorithm then averages the
trees to improve prediction accuracy and, at the same time, to control over-fitting. The
other machine learning algorithm, GBDT, fits an approximation, $\hat{y}_i$, of the function
$F(x_i)$, which maps the inputs $x$ to the output values, in a stage-wise fashion. Using
decision trees as base learners, the algorithm allows the optimization of an arbitrary
differentiable loss function (Bentéjac et al., 2021).
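The stage-wise idea behind GBDT can be illustrated with a deliberately small sketch: one feature, squared-error loss, and depth-one trees (stumps) fitted to the current residuals at every stage. This is not the implementation used in the paper, whose library and hyperparameters are not stated; the data, function names, learning rate and number of stages are all assumptions for illustration.

```python
def fit_stump(x, targets):
    """Return (threshold, left_mean, right_mean) minimizing squared error on one feature."""
    best = None
    distinct = sorted(set(x))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2.0
        left = [t for xi, t in zip(x, targets) if xi <= threshold]
        right = [t for xi, t in zip(x, targets) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1:]


def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean


def fit_gbdt(x, y, n_stages=100, learning_rate=0.1):
    """Stage-wise boosting for squared loss: each stump fits the current residuals,
    which are the negative gradient of 0.5 * (y - F(x))**2 with respect to F(x)."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_stages):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [pi + learning_rate * predict_stump(stump, xi)
                       for pi, xi in zip(predictions, x)]
    return base, stumps, learning_rate


def predict_gbdt(model, xi):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, xi) for s in stumps)


def training_mse(model, x, y):
    return sum((yi - predict_gbdt(model, xi)) ** 2 for xi, yi in zip(x, y)) / len(y)


# Toy nonlinear target (illustrative only): y = x^2 on ten points.
x = [float(i) for i in range(10)]
y = [xi ** 2 for xi in x]
baseline_mse = training_mse(fit_gbdt(x, y, n_stages=0), x, y)  # mean-only model
boosted_mse = training_mse(fit_gbdt(x, y, n_stages=100), x, y)
print(f"mean-only MSE: {baseline_mse:.2f}, boosted MSE: {boosted_mse:.2f}")
```

For other differentiable losses the residuals would be replaced by the corresponding negative gradients; the rest of the loop stays the same.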
IV. Empirical analysis
IV-1. The main analysis
We now present the results of the empirical analysis. We first evaluate the prediction
performance of the different methodologies and compare them. Tables 5 and 6 below show the
results of the training and test sets for the prediction of the two daily-based tracking
error measures (|ETF-IND|, |ETF-NAV|) and the two yearly-based tracking error measures
(θ(ETF-IND), θ(ETF-NAV)), respectively. In both tables, prediction performance is measured
by R² and four widely used regression metrics: Mean Absolute Error (MAE), Mean Squared Error
(MSE), Root Mean Squared Error (RMSE) and Median Absolute Error (MedianAE).
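The five evaluation metrics can be computed directly from actual and predicted values, as in the sketch below. The function name and the sample numbers are hypothetical and are not results from the paper.

```python
from math import sqrt
from statistics import mean, median


def regression_metrics(y_true, y_pred):
    """R², MAE, MSE, RMSE and MedianAE for paired actual and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = mean(y_true)
    total_ss = sum((t - mean_true) ** 2 for t in y_true)
    if total_ss == 0:
        raise ValueError("R² is undefined when y_true is constant")
    residual_ss = sum(e * e for e in errors)
    mse = residual_ss / len(errors)
    return {
        "R2": 1.0 - residual_ss / total_ss,
        "MAE": mean([abs(e) for e in errors]),
        "MSE": mse,
        "RMSE": sqrt(mse),
        "MedianAE": median([abs(e) for e in errors]),
    }


# Hypothetical tracking errors and predictions (illustrative numbers only).
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]
metrics = regression_metrics(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this toy case a single large miss drives MAE, MSE and RMSE while MedianAE stays at zero, which shows why the median-based metric can diverge from the others when prediction errors are concentrated in a few observations.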
First, in Table 5, for the daily-based tracking errors, the R² of the machine learning
algorithms is similar to that of the traditional regressions for |ETF-IND|; however, it is
much higher for |ETF-NAV|, suggesting that performance on daily-based tracking errors
depends on the measure used. For the yearly-based tracking errors, on the other hand, the R²
of the machine learning algorithms is much higher for both measures, θ(ETF-IND) and
θ(ETF-NAV). Also, most of our prediction error metrics (MAE, MSE, RMSE and MedianAE)
indicate that the prediction errors of the machine learning algorithms for the yearly-based
tracking errors (θ(ETF-IND), θ(ETF-NAV)) are much smaller than those of the traditional
methodologies. In contrast, there is little difference between the machine learning
algorithms and the traditional methods in the prediction errors for the daily-based tracking
errors (|ETF-IND|, |ETF-NAV|).

The results in Table 6 further confirm those in Table 5. The difference from Table 5 is that
the R² of the machine learning algorithms for both daily-based tracking error measures
(|ETF-IND|, |ETF-NAV|) is greatly improved. However, most of the prediction error metrics
(MAE, MSE, RMSE and MedianAE) remain quite similar to the values in Table 5.
Most importantly, we now turn to the variable importance analysis for the traditional
methodologies (Linear, LASSO and Ridge) and the machine learning algorithms (Random Forest
and GBDT), which is presented in Table 7. First, the results for the traditional methods
point to Log(AUM) as the most important determinant of the effect of U.S. ETFs’ illiquidity
on tracking errors, regardless of the frequency (daily, yearly) and the type of method.
The results for the machine learning algorithms, however, contrast sharply between the daily
and yearly tracking error measures. The results for the daily-based tracking errors are
quite mixed across the training and test sets. For example, for the first measure
(|ETF-IND|), Intraday Volatility is the most important determinant for both Random Forest
and GBDT on the training set; on the test set, however, the most important determinants are
Lev. Fund and Log(AUM), respectively. The “inconsistency” in the results is similar for the
second measure (|ETF-NAV|). The analysis suggests Shares Outst. as the most important
determinant for both Random Forest and GBDT on the training set, but on the test set the
most important determinants are Log(AUM) and Log(Shares outst.), respectively.
Interestingly, however, the results for the machine learning algorithms are consistent for
the yearly-based tracking error measures (θ(ETF-IND), θ(ETF-NAV)). For both measures, Random
Forest and GBDT indicate that Inv. US asset is the most important determinant, followed by
Expense Ratio. The only difference is that while the Random Forest analysis for the first
measure (θ(ETF-IND)) mostly indicates Log(DTV) as the third most important variable, the
GBDT analysis consistently suggests Index Volatility as the third, regardless of the yearly
measure.
In summary, in contrast with the traditional methodologies, which point to the “size aspect”
(Log(AUM)) of U.S. ETFs’ tracking errors regardless of the frequency of the measures, the
machine learning algorithms excel in prediction capability, especially for the yearly-based
tracking error measures (θ(ETF-IND), θ(ETF-NAV)). The rationale for the machine learning
algorithms’ capturing Inv. US asset as the most important determinant is examined in the
Discussion section.
IV-2. Shapley value
We now apply the SHAP technique to provide explainability for our models. Introduced by L.
Shapley, the Shapley value originates from cooperative game theory. Here, each feature value
is a “player” in a game in which the payoff is our prediction. In a coalition game (𝑁, v),
where 𝑁 is the set of players and v is the value function, the Shapley value divides the
total payoff, v(𝑁), among the players; this allows us to estimate the impact of a particular
feature value on the difference between the considered prediction and the average
prediction, given the current values of all of the features (Lundberg & Lee, 2017). On the
basis of the Symmetry, Linearity and Dummy player properties, the difference from the
average prediction is distributed fairly among the feature values. Figure 1 and Figure 2
below present the daily-based and yearly-based SHAP beeswarm plots, respectively.
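To make the game-theoretic definition concrete, the sketch below computes exact Shapley values by enumerating all coalitions for a toy linear model. The model, its coefficients, the feature names and the baseline values are hypothetical and are not estimates from the paper; features outside a coalition are simply set to their baseline value, which is one common simplification. Practical SHAP implementations for tree models use faster algorithms than this brute-force enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values for a value function value(frozenset_of_players) -> float."""
    n = len(players)
    result = {}
    for player in players:
        others = [p for p in players if p != player]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                total += weight * (value(s | {player}) - value(s))
        result[player] = total
    return result


def toy_model(features):
    """Hypothetical linear predictor of tracking error (made-up coefficients)."""
    return (0.5
            - 0.3 * features["inv_us_asset"]
            + 2.0 * features["expense_ratio"]
            - 0.05 * features["log_aum"])


baseline = {"inv_us_asset": 0.5, "expense_ratio": 0.4, "log_aum": 6.0}  # "average" ETF
instance = {"inv_us_asset": 1.0, "expense_ratio": 0.2, "log_aum": 8.0}  # ETF considered


def coalition_value(coalition):
    """Prediction when only the coalition's features take the instance's values."""
    mixed = {name: (instance[name] if name in coalition else baseline[name])
             for name in baseline}
    return toy_model(mixed)


phi = shapley_values(list(baseline), coalition_value)
gap = toy_model(instance) - toy_model(baseline)
for name, contribution in phi.items():
    print(f"{name}: {contribution:+.3f}")
print(f"sum of contributions: {sum(phi.values()):+.3f}, prediction gap: {gap:+.3f}")
```

The contributions add up to the gap between the considered prediction and the baseline prediction, which is the fair-distribution property described above.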
Specifically, Figures 1a and 1b illustrate the Random Forest analysis of the daily-based
tracking error measures (|ETF-IND|, |ETF-NAV|), while Figures 1c and 1d show the GBDT
analysis of the same measures. Figures 2a and 2b illustrate the Random Forest analysis of
the yearly-based tracking error measures (θ(ETF-IND), θ(ETF-NAV)), while Figures 2c and 2d
show the GBDT analysis of the same yearly-based measures.
[Figure 1 Insert Here]
[Figure 2 Insert Here]
As is evident in Figure 1, none of the panels captures the relevance of the determinants in
our analysis. This could be due to the “noisy” characteristics of the daily short-term
movements of passive funds, which our machine learning algorithms do not adequately explain.
In contrast, all of the panels in Figure 2 show that the variable with the greatest
relevance is Invested in US Asset, which is the most important determinant in the preceding
variable importance analysis. More specifically, high values of this variable are mostly
associated with negative SHAP values, suggesting that higher values of Invested in US Asset
tend to lower the predicted outcome, which is the tracking error from U.S. ETF illiquidity.
On the other hand, the relevance of Expense Ratio, the second most important determinant, is
quite mixed; this variable shows both positive and negative signs in most of the panels, and
only Figure 2b (θ(ETF-NAV)) illustrates a positive effect of the feature on our outcome.
V. Discussion
Our research employs a variety of approaches to explain the tracking errors arising from
U.S. ETF illiquidity. Departing from Bae and Kim (2020), we attempt to identify the most
important determinant of tracking errors by adopting machine learning algorithms, namely
Random Forest and GBDT, which show superior performance to the traditional methodologies.
Beyond the prediction performance of the machine learning algorithms, a number of points
merit discussion.
First, the fact that the machine learning techniques capture Invested in US Asset as the
most important determinant suggests that this feature could have been largely underestimated
in the literature on ETF tracking errors. For example, features related to the “size
effect”, such as the number of stocks and fund size, have traditionally been emphasized in
the literature as variables that affect the tracking errors of ETFs (Dorocáková, 2017;
Vardharaj et al., 2004). This is also confirmed by our traditional methods, which
consistently suggest Log(AUM) as the main determinant. When the lack of liquidity is
concerned, however, both of our machine learning algorithms point to Invested in US Asset as
the most important determinant, while the traditional methods still present Log(AUM). This
outcome suggests that investors’ tendency to follow “safe havens”, such as long-term
Treasuries or the dominant U.S. dollar, could be reflected in the actual outcomes (tracking
errors) of U.S. ETF illiquidity (Habib et al., 2020; Kaczmarek et al., 2022; Kim, 2021). Our
Shapley value analysis further supports the reliability of the outcomes of the variable
importance analysis.
Second, our machine learning analysis remains in line with the traditional random walk
theory (Cootner, 1964; Fama, 1965) in that only the “yearly-based” analysis of the tracking
errors of US ETF illiquidity provides reliable outcomes when compared with the “daily-based”
outcomes. For example, consistent results in terms of prediction performance and of
identifying a determinant (Log(AUM)) appear only in the yearly-based outcomes in Tables 5
through 7 and in Figures 1 and 2. A possible explanation is that adopting machine learning
algorithms does not significantly alter the previous finding that short-term movements are
almost impossible to predict in advance. As mentioned above, this could be due to the
“noisy” characteristics of the short-term movement of funds. Our explanation is also in
line with previous studies that applied machine learning algorithms to ETF analysis, which
mainly focus on predicting “yearly” performance (Liew and Mayster, 2017; Zheng, 2021).
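The noise argument above can be illustrated with a small simulation. The sketch below does
not use the paper's data or models: it assumes a hypothetical panel in which each fund has
one persistent feature and each daily outcome equals a small multiple of that feature plus
large independent noise. All names and parameter values (`N_FUNDS`, `N_DAYS`,
`SIGNAL_SLOPE`, `NOISE_SD`) are illustrative assumptions, and the paper's yearly measure is
constructed differently, so this only shows the general signal-to-noise point.

```python
import random
import statistics


def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid version-specific helpers."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


random.seed(0)
N_FUNDS, N_DAYS = 200, 250          # hypothetical panel size
SIGNAL_SLOPE, NOISE_SD = 0.5, 3.0   # hypothetical signal and noise levels

# One persistent characteristic per fund (a stand-in for any slow-moving feature).
feature = [random.gauss(0.0, 1.0) for _ in range(N_FUNDS)]

daily_x, daily_y, yearly_y = [], [], []
for x in feature:
    # Daily outcome = weak signal from the feature + strong idiosyncratic noise.
    days = [SIGNAL_SLOPE * x + random.gauss(0.0, NOISE_SD) for _ in range(N_DAYS)]
    daily_x.extend([x] * N_DAYS)
    daily_y.extend(days)
    yearly_y.append(statistics.fmean(days))  # averaging shrinks the noise

# Squared correlation = R^2 of a one-variable linear fit.
r2_daily = pearson(daily_x, daily_y) ** 2
r2_yearly = pearson(feature, yearly_y) ** 2
print(f"daily  R^2: {r2_daily:.3f}")
print(f"yearly R^2: {r2_yearly:.3f}")
```

Under these assumptions the yearly average is far more predictable than any single day,
because averaging over `N_DAYS` observations divides the noise variance by `N_DAYS` while
leaving the signal intact.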
VI. Conclusion
The findings of our research are as follows. First, machine learning methodologies (e.g.,
Random Forest, GBDT) outperform traditional regressions (e.g., linear regression, LASSO,
Ridge) in forecasting the effect of U.S. ETF illiquidity on tracking errors. This suggests
that machine learning algorithms better address the nonlinearities and overfitting issues
embedded in the dataset. Second, our study reveals that underlying assets based on U.S.
assets (Invested in US Asset) and the expense ratio (Expense Ratio) are key determinants
that play a significant role in the effect of the illiquidity of U.S.-listed ETFs on their
tracking errors. This result aligns with the traditional view that investors prefer cost
efficiency and low management fees. Third, our SHAP analysis shows that the aforementioned
determinants, especially Invested in US Asset, negatively affect the model output, implying
that they tend to decrease the predicted tracking error.
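Shapley values split the gap between one prediction and the average prediction across the
features. For a model with only a few inputs they can be computed exactly by enumerating
every feature coalition, which the sketch below does in plain Python. The model
`toy_model`, its coefficients, the feature names, and the background rows are all
hypothetical stand-ins chosen so that the US-asset dummy lowers the output; they are not
the paper's fitted Random Forest or GBDT. Absent features are handled by averaging over
the background rows, which is one common convention and not necessarily the one the
authors used.

```python
from itertools import combinations
from math import factorial


def toy_model(row):
    """Hypothetical fitted model; the coefficients are NOT the paper's estimates."""
    inv_us, expense_ratio, log_aum = row
    return 0.60 - 0.25 * inv_us + 0.40 * expense_ratio - 0.02 * log_aum * (1 - inv_us)


def coalition_value(model, x, background, subset):
    """Average prediction when the features in `subset` are fixed to x's values
    and the remaining features take the values of each background row."""
    total = 0.0
    for b in background:
        mixed = tuple(x[j] if j in subset else b[j] for j in range(len(x)))
        total += model(mixed)
    return total / len(background)


def shapley_values(model, x, background):
    """Exact Shapley values by enumerating all coalitions (feasible for few features)."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                gain = (coalition_value(model, x, background, s | {i})
                        - coalition_value(model, x, background, s))
                phi[i] += weight * gain
    return phi


# Hypothetical background sample: (inv_us, expense_ratio, log_aum)
background = [(0, 0.75, 4.0), (1, 0.20, 8.0), (0, 0.50, 5.0), (1, 0.10, 9.0)]
x = (1, 0.20, 8.0)  # the observation being explained

phi = shapley_values(toy_model, x, background)
baseline = sum(toy_model(b) for b in background) / len(background)
for name, value in zip(("inv_us", "expense_ratio", "log_aum"), phi):
    print(f"{name:>14}: {value:+.4f}")
print(f"baseline + sum(phi) = {baseline + sum(phi):.4f}, f(x) = {toy_model(x):.4f}")
```

A negative value for `inv_us` means that, for this observation, knowing the fund holds US
assets pushes the predicted tracking error below the background average, which is the
sense in which a feature "negatively affects" the model output. The attributions sum to
the difference between the prediction and the baseline.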
We contribute to the research on ETFs in the following ways. First, we believe that this
research is the first to employ machine learning algorithms in analyzing the effect of U.S.
ETFs’ illiquidity on tracking errors. The superior performance of machine learning supports
the determinants it identifies, such as Invested in US Asset and Expense Ratio, which play
a role in the relationship between U.S. ETFs’ illiquidity and their tracking errors.
Second, our findings confirm the importance of determinants such as Invested in US Asset,
which may have been largely underestimated because previous research overemphasized trading
volume as an indicator of the tracking error (Dorocáková, 2017; Gallagher & Segara, 2005;
Shin & Soydemir, 2010; Tsalikis & Papadopoulos, 2019; Vardharaj et al., 2004). As pointed
out, this tendency is captured by our traditional methodologies, which present Log(AUM) as
a main determinant.
The limitations of our study are as follows. Although we adopt the Random Forest and GBDT
models, we do not take other machine learning models or deep learning models into
consideration. Further, to elaborate on the meaning and implications of Invested in US
Asset, the most important determinant in the machine learning algorithms, additional
robustness tests, such as a comparative analysis with active funds, would be necessary.
This research area is left for future study.
References
Aber, J. W., Li, D., Can, L., 2009. Price volatility and tracking ability of ETFs. Journal of Asset
Management 10, 210-221.
Amini, S., Elmore, R., Öztekin, Ö., Strauss, J., 2021. Can machines learn capital structure
dynamics? Journal of Corporate Finance 70, 102073.
Angel, J. J., Broms, T. J., Gastineau, G. L., 2016. ETF transaction costs are often higher than
investors realize. The Journal of Portfolio Management 42(3), 65-75.
Avramov, D., Li, M., Wang, H., 2021. Predicting corporate policies using downside risk: A
machine learning approach. Journal of Empirical Finance 63, 1-26.
Bae, K., Kim, D., 2020. Liquidity risk and exchange-traded fund returns, variances, and
tracking errors. Journal of Financial Economics 138(1), 222-253.
Bali, T. G., Goyal, A., Huang, D., Jiang, F., Wen, Q., 2020. Predicting corporate bond returns:
Merton meets machine learning. Georgetown McDonough School of Business Research Paper
(3686164), 20-110.
Ben-David, I., Franzoni, F., Moussawi, R., 2018. Do ETFs Increase Volatility? The Journal of
Finance 73, 2471-2535.
Bentéjac, C., Csörgő, A., Martínez-Muñoz, G., 2021. A comparative analysis of gradient
boosting algorithms. Artificial Intelligence Review 54, 1937-1967.
Bilgin, R., 2023. The Selection of Control Variables In Capital Structure Research With
Machine Learning.
Blitz, D., Huij, J., 2012. Evaluating the performance of global emerging markets equity
exchange-traded funds. Emerging markets review 13(2), 149-158.
Breiman, L., 2001. Random Forests. Machine Learning 45, 5–32.
Broman, M.S., Shum, P., 2018. Relative Liquidity, Fund Flows and Short-Term Demand:
Evidence from Exchange-Traded Funds. Financial Review 53, 87-115.
Buetow, G. W., Henderson, B. J., 2012. An empirical analysis of exchange-traded funds.
Journal of Portfolio Management 38(4), 112.
Box, T., Davis, R., Evans, R., Lynch, A., 2021. Intraday arbitrage between ETFs and their
underlying portfolios. Journal of Financial Economics 141(3), 1078-1095.
Chen, J., Chen, Y., Frijns, B., 2017. Evaluating the tracking performance and tracking error of
New Zealand exchange traded funds. Pacific Accounting Review 29(3), 443-462.
Cheng, M., Madhavan, A., 2009. The dynamics of leveraged and inverse exchange-traded
funds. Journal of investment management 16(4), 43.
Charupat, N., Miu, P., 2013. Recent developments in Exchange-Traded Fund literature:
Pricing efficiency, tracking ability, and effects on underlying securities. Managerial Finance
39(5), 427-443.
Cootner, P.H. (Ed.), 1967. The Random Character of Stock Market Prices. The MIT Press.
Dorocáková, M., 2017. Comparison of ETF's performance related to the tracking error.
Journal of International Studies 10(4), 154-165.
Elton, E.J., Gruber, M.J., de Souza, A., 2019. Passive mutual funds and ETFs: Performance and
comparison. Journal of Banking & Finance 106, 265-275.
Eliasy, A., Przychodzen, J., 2020. The role of AI in capital structure to enhance corporate
funding strategies. Array 6, 100017.
Fama, E.F., 1965. Random Walks in Stock Market Prices. Financial Analysts Journal 21(5),
55-59.
Gallagher, D. R., Segara, R., 2005. The performance and trading characteristics of exchange-
traded funds. Journal of Investment Strategy 1(1), 47-58.
Gastineau, G. L., 2004. The benchmark index ETF performance problem. The Journal of
Portfolio Management 30(2), 96-103.
Grossman, S. J., 1987. An analysis of the implications for stock and futures price volatility of
program trading and dynamic hedging strategies.
Guo, K., Leung, T., 2015. Understanding the tracking errors of commodity leveraged ETFs.
Springer New York.
Habib, M. M., Stracca, L., Venditti, F., 2020. The fundamentals of safe assets. Journal of
International Money and Finance, 102.
Harper, J. T., Madura, J., Schnusenberg, O., 2006. Performance comparison between exchange-
traded funds and closed-end country funds. Journal of International Financial Markets,
Institutions and Money 16(2), 104-122.
Hoerl, A.E., Kennard, R.W., 1970. Ridge Regression: Biased Estimation for Nonorthogonal
Problems. Technometrics 12(1), 55-67. doi: 10.1080/00401706.1970.10488634.
Houweling, P., 2012. On the performance of fixed-income exchange-traded funds. The Journal
of Beta Investment Strategies 3(1), 39-44.
Humphreys, H. B., McClain, K. T., 1998. Reducing the impacts of energy price volatility
through dynamic portfolio selection. The Energy Journal 19(3).
Hurlin, C., Iseli, G., Pérignon, C., Yeung, S., 2019. The counterparty risk exposure of ETF
investors. Journal of Banking & Finance 102, 215-230.
Hu, Z., Zhao, Y., Khushi, M., 2021. A survey of forex and stock price prediction using deep
learning. Applied System Innovation 4(1), 9.
Ivanov, S.I., Jones, F.J., Zaima, J.K., 2013. Analysis of DJIA, S&P 500, S&P 400, NASDAQ 100
and Russell 2000 ETFs and their influence on price discovery. Global Finance Journal 24(3),
171-187.
Ivașcu, C. F., 2023. Understanding Dividend Puzzle Using Machine Learning. Computational
Economics 1-19.
Johnson, B., Bioy, H., Boyadzhiev, D., 2016. Assessing the true cost of strategic-beta ETFs. The
Journal of Index Investing 7(1), 35.
Johnson, W. F., 2009. Tracking errors of exchange traded funds. Journal of Asset Management
10, 253-262.
Jhunjhunwala, S., Sethi, A., 2022. Do ETFs affect the return co-movement of their underlying
assets? Evidence from an emerging market. Managerial Finance 48(11), 1661-1686.
Kaczmarek, T., Będowska-Sójka, B., Grobelny, P., Perez, K., 2022. False Safe Haven Assets:
Evidence From the Target Volatility Strategy Based on Recurrent Neural Network. Research
in International Business and Finance, 60, 101610. ISSN 0275-5319.
Kamalov, F., Gurrib, I., Rajab, K., 2021. Financial forecasting with machine learning: price vs
return. Journal of Computer Science 17(3), 251-264.
Kim, J. M., Kim, D. H., Jung, H., 2021. Applications of machine learning for corporate bond
yield spread forecasting. The North American Journal of Economics and Finance 58, 101540.
Kim, M., 2021. Adaptive trading system integrating machine learning and back-testing:
Korean bond market case. Expert Systems with Applications 176, 114767.
Kim, T., 2021. Safe Asset Demand, Global Capital Flows and Wealth Concentration. IMF
Working Paper No. 2021/254. Available at SSRN: https://ssrn.com/abstract=4026482.
Koont, N., Ma, Y., Pastor, L., Zeng, Y., 2022. Steering a ship in illiquid waters: Active
management of passive funds. National Bureau of Economic Research.
Levy, A., Lieberman, O., 2013. Overreaction of country ETFs to US market returns: Intraday
vs. daily horizons and the role of synchronized trading. Journal of Banking & Finance 37(5),
1412-1421.
Liebi, L. J., 2020. The effect of ETFs on financial markets: a literature review. Financial Markets
and Portfolio Management 34(2), 165-178.
Liew, J.K.S., & Mayster, B., 2017. Forecasting ETFs with Machine Learning Algorithms. Johns
Hopkins Carey Business School. Version 1.3. http://etfprediction.pythonanywhere.com/. Date:
January 14, 2017.
Lu, L., Wang, J., Zhang, G., 2009. Long term performance of leveraged ETFs. Available at
SSRN 1344133.
Lundberg, S. M., Lee, S.-I., 2017. A Unified Approach to Interpreting Model Predictions.
Advances in Neural Information Processing Systems 30.
Madhavan, V., Maheswaran, S., 2016. Indian exchange traded funds: relationship with
underlying indices. Economic and Political Weekly 142-148.
Madhavan, V., 2014. Investigating the nature of nonlinearity in indian exchange traded funds
(ETFs). Managerial Finance 40(4), 395-415.
Marshall, B.R., Nguyen, N.H., Visaltanachoti, N., 2013. ETF arbitrage: Intraday evidence.
Journal of Banking & Finance 37(9), 3486-3498.
Narend, S., Thenmozhi, M., 2019. Do country ETFs influence foreign stock market index?
Evidence from India ETFs. Journal of Emerging Market Finance 18(1_suppl), S59-S86.
Obthong, M., Tantisantiwong, N., Jeamwatthanachai, W., Wills, G., 2020. A survey on
machine learning for stock price prediction: Algorithms and techniques.
Osterhoff, F., Kaserer, C., 2016. Determinants of tracking error in German ETFs–The role of
market liquidity. Managerial Finance 42(5), 417-437.
Pagano, M., Sánchez Serrano, A., Zechner, J., 2019. Can ETFs Contribute to Systemic Risk?
ESRB: Advisory Scientific Committee Reports 9.
Rompotis, G.G., 2011. The Performance of Actively Managed Exchange-Traded Funds. The
Journal of Index Investing 1(4), 53-65.
Saunders, K. T., 2018. Analysis of international ETF tracking error in country-specific funds.
Atlantic Economic Journal 46, 151-160.
Sarkar, S. S., Dutta, S., Dutta, P., 2013. A review of Indian index funds. Global Business
Review 14(1), 89-98.
Shin, S., Soydemir, G., 2010. Exchange-traded funds, persistence in tracking errors and
information dissemination. Journal of Multinational Financial Management 20(4-5), 214-234.
Svetina, M., 2010. Exchange traded funds: Performance and competition. Journal of Applied
Finance (Formerly Financial Practice and Education) 20(2).
Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society. Series B (Methodological), 267-288.
Tokat, E., Hayrullahoğlu, A. C., 2022. Pairs trading: is it applicable to exchange-traded funds?.
Borsa Istanbul Review 22(4), 743-751.
Tsalikis, G., Papadopoulos, S., 2019. ETFs-Performance, tracking errors and their
determinants in Europe and the USA. Risk Governance & Control: Financial Markets &
Institutions 9(4).
Tuzun, T., 2013. Are Leveraged and Inverse ETFs the New Portfolio Insurers. FEDS
Working Paper No. 2013-48.
Vardharaj, R., Fabozzi, F.J., Jones, F.J., 2004. Determinants of Tracking Error for Equity
Portfolios. The Journal of Investing 13(2), 37-47.
Wu, C., Xiong, X., Gao, Y., 2021. Performance comparisons between ETFs and traditional
index funds: Evidence from China. Finance Research Letters 40, 101740.
Yaseen, H., Dragotă, V., 2021. Forecasting the dividend policy using machine learning
approach: Decision tree regression models. In Eurasian Business and Economics Perspectives:
Proceedings of the 31st Eurasia Business and Economics Society Conference (pp. 19-39).
Cham: Springer International Publishing.
Zawadzki, K., 2020. The performance of ETFs on developed and emerging markets with
consideration of regional diversity. Quantitative Finance and Economics 4(3), 515-525.
Zhang, C., Amir Sjarif, N.N., & Ibrahim, R., 2023. Deep learning models for price forecasting
of financial time series: A review of recent advancements: 2020–2022. WIREs Data Mining and
Knowledge Discovery.
Appendix
Table 1. Exchange-traded fund (ETF) trends
                               Market Value                  Volume                         Dollar Volume
Year  Created  Delisted  N     Bill. $  Top3 (%)  Top10 (%)  Mill. Sh  Top3 (%)  Top10 (%)  Bill. $  Top3 (%)  Top10 (%)
2015  69       9         695   1388.76  22.0      36.6       758.34    30.7      52.1       48.64    55.7      67.3
2016  101      7         789   1520.86  21.2      37.1       897.87    28.7      49.7       49.30    49.4      62.3
2017  113      5         897   2055.59  21.1      37.0       692.85    27.1      47.2       43.82    46.5      58.8
2018  160      1         1056  2608.50  19.9      36.6       1021.70   23.4      42.9       71.56    49.4      60.8
2019  115      5         1166  2923.03  19.2      35.9       928.94    20.3      41.2       62.73    45.3      57.5
2020  109      1         1274  3386.46  18.9      36.2       1408.45   16.0      34.0       90.86    45.3      57.5
2021  158      1         1431  4922.39  18.4      35.2       1205.25   15.6      34.6       105.53   45.5      56.8
2022  141      4         1568  5051.45  18.6      35.8       2119.22   18.0      36.8       138.25   44.1      56.6
Table 2. The Composition of Exchange-traded fund (ETF) Sample
Constituents    Sample  Sample (%)  Dollar Volume (Bill. $)  Dollar Volume (%)
Physical asset  9       0.6         0.98                     0.7
Stock           287     18.3        29.8                     21.5
Stock index     929     59.2        95.9                     69.3
Bond            288     18.4        8.67                     6.3
Futures         55      3.5         2.92                     2.1
Table 3. Variable definitions 3
Variable | Definition | Freq | Source
Absolute Order Imbalance | Absolute value of normalized order imbalance. The imbalance is the order imbalance divided by both the fund creation and redemption size. Order imbalance is calculated as the difference between the trading volume executed at the ask side and that executed at the bid side. | Daily | NYSE TAQ
AR(1) Coefficient (φ) | Coefficient estimate of the AR(1) model for daily underlying index returns. | Yearly | Bloomberg
AUM | Assets under management, which are net asset value times the number of shares outstanding. | Daily | NYSE TAQ
Derivatives-Based | A dummy variable that is 1 if the ETF uses a derivatives contract to replicate the underlying index. | Daily | Bloomberg
Dollar Trading Volume | Daily volume multiplied by daily closing price. Set to 0 if there is no trading volume. | Daily | Bloomberg
Effective Spread | Average of the trade-weighted effective half-spread, which is the absolute difference between the trade price and the quote midpoint of the associated price. | Daily | NYSE TAQ
Equity-Type ETF | A dummy variable that is 1 if the ETF's underlying portfolio consists of stocks. | Daily | Bloomberg/ETF webpage
Expense Ratio | Annual expense ratio. | Yearly | Bloomberg/ETF webpage
Futures Available | A dummy variable that is 1 if the ETF has a futures contract underlying it. | Daily | Bloomberg
Index Volatility | Annual standard deviation calculated from daily index returns. | Yearly | Bloomberg
In-Kind | A dummy variable that is 1 if the ETF delivers or receives physical assets when creating or redeeming shares. | Daily | Bloomberg
Intraday Volatility | Daily standard deviation of intraday price changes. | Daily | NYSE TAQ
Invested in US Assets | A dummy variable that is 1 if the ETF consists of US assets. | Daily | Bloomberg/ETF webpage
Leveraged Fund | A dummy variable that is 1 if the ETF is either inverse or leveraged. | Daily | Bloomberg
Non-trading Prob (p) | Number of non-trading days divided by total business days each year. | Yearly | Bloomberg
Optimized | A dummy variable that is 1 if the ETF optimizes the portfolio when replicating its underlying index. | Daily | Bloomberg
Options Available | A dummy variable that is 1 if the ETF has option contracts based on itself. | Daily | Bloomberg
Quoted Spread | Average trade-weighted half-spread, which is the difference between the ask and bid prices of the quote divided by two. | Daily | NYSE TAQ
ETF Illiquidity | Average relative effective half-spread of the day. The variable is defined as the effective half-spread divided by the trade price, where the effective half-spread is the absolute difference between the trade price and the quote midpoint of the associated price. | Daily | NYSE TAQ
Relative Quoted Spread | Average trade-weighted relative half-spread, which is the half-spread divided by the quote midpoint. | Daily | NYSE TAQ
Shares Outstanding Growth | Log of the ratio of shares outstanding to lagged shares outstanding, at the daily level. | Daily | NYSE TAQ
Shares Volatility | Volatility of shares outstanding growth, defined as the standard deviation of daily shares outstanding growth. | Yearly | NYSE TAQ
Shares Outstanding | Number of shares outstanding. | Daily | NYSE TAQ
Swap Based | A dummy variable that is 1 if the ETF uses swap contracts to replicate its underlying index. | Daily | Bloomberg
3 Please note that our set of control variables is essentially the same as that of Bae and Kim (2020).
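The spread definitions in Table 3 reduce to a few arithmetic steps per trade. The sketch
below works through them for two made-up trades. The `Trade` class, the function names, and
the sample prices are hypothetical, and weighting each trade by its size is an assumption
about what "trade-weighted" means here; the paper's TAQ processing may differ.

```python
from dataclasses import dataclass
from math import log


@dataclass
class Trade:
    price: float   # trade price
    bid: float     # prevailing bid quote
    ask: float     # prevailing ask quote
    size: int      # shares traded (used as the weight; an assumption)


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def daily_spread_measures(trades):
    """Size-weighted daily averages of the four spread measures in Table 3."""
    sizes = [t.size for t in trades]
    mid = [(t.bid + t.ask) / 2 for t in trades]                # quote midpoint
    eff_half = [abs(t.price - m) for t, m in zip(trades, mid)]  # effective half-spread
    quoted_half = [(t.ask - t.bid) / 2 for t in trades]         # quoted half-spread
    return {
        "effective_spread": weighted_mean(eff_half, sizes),
        "quoted_spread": weighted_mean(quoted_half, sizes),
        # ETF Illiquidity: effective half-spread relative to the trade price.
        "etf_illiquidity": weighted_mean(
            [e / t.price for e, t in zip(eff_half, trades)], sizes),
        # Relative quoted spread: quoted half-spread relative to the midpoint.
        "relative_quoted_spread": weighted_mean(
            [q / m for q, m in zip(quoted_half, mid)], sizes),
    }


def shares_outstanding_growth(shares_today, shares_yesterday):
    """Log of shares outstanding over lagged shares outstanding."""
    return log(shares_today / shares_yesterday)


# Two made-up trades against the same quote (bid 99.98, ask 100.02).
trades = [
    Trade(price=100.02, bid=99.98, ask=100.02, size=300),
    Trade(price=99.99, bid=99.98, ask=100.02, size=100),
]
measures = daily_spread_measures(trades)
growth = shares_outstanding_growth(1_050_000, 1_000_000)
for name, value in measures.items():
    print(f"{name:>24}: {value:.6f}")
print(f"{'shares_growth':>24}: {growth:.6f}")
```

Because the relative measures divide by a price, they are comparable across ETFs with very
different price levels, which is presumably why the relative effective half-spread serves
as the illiquidity variable.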
Table 4. Summary Statistics and Correlations of Tracking Error Measures of ETF
Panel A: Summary statistics for estimated tracking errors
            ETF-IND  ETF-NAV  θ(ETF-IND)  θ(ETF-NAV)
Mean        0.75%    0.83%    44.95%      42.47%
Std.Dev.    1.30%    5.60%    30.69%      26.89%
Panel B: Tracking error correlations for individual ETFs
            ETF-IND  ETF-NAV  θ(ETF-IND)  θ(ETF-NAV)
ETF-IND     1
ETF-NAV     0.20     1
θ(ETF-IND)  0.07     0.06     1
θ(ETF-NAV)  0.07     0.08     0.83        1
Table 5. Forecasting Performance on the Tracking Errors of ETFs (Training Set) 4
Target      Freq    Model          R²       MAE   MSE   RMSE  MAPE               MedianAE
|ETF-IND|   Daily   Linear         0.38     0.01  0.00  0.01  17.94              0.00
|ETF-IND|   Daily   Lasso          0.29     0.01  0.00  0.01  18.40              0.00
|ETF-IND|   Daily   Ridge          0.38     0.01  0.00  0.01  17.98              0.00
|ETF-IND|   Daily   Random Forest  0.37     0.01  0.00  0.01  16.08              0.00
|ETF-IND|   Daily   GBDT           0.44     0.01  0.00  0.01  17.06              0.00
|ETF-NAV|   Daily   Linear         0.02     0.01  0.00  0.06  13,554.38          0.00
|ETF-NAV|   Daily   Lasso          (0.00)   0.01  0.00  0.06  13,252.72          0.00
|ETF-NAV|   Daily   Ridge          0.02     0.01  0.00  0.06  13,574.62          0.00
|ETF-NAV|   Daily   Random Forest  0.49     0.01  0.00  0.05  8,480.75           0.00
|ETF-NAV|   Daily   GBDT           0.71     0.01  0.00  0.03  8,223.54           0.00
θ(ETF-IND)  Yearly  Linear         0.33     0.19  0.06  0.25  286,350,200,000    0.15
θ(ETF-IND)  Yearly  Lasso          0.30     0.19  0.07  0.26  349,944,400,000    0.15
θ(ETF-IND)  Yearly  Ridge          0.31     0.19  0.06  0.25  340,124,600,000    0.15
θ(ETF-IND)  Yearly  Random Forest  0.66     0.12  0.03  0.18  415,667,000,000    0.08
θ(ETF-IND)  Yearly  GBDT           0.55     0.15  0.04  0.21  406,992,400,000    0.12
θ(ETF-NAV)  Yearly  Linear         0.34     0.17  0.05  0.23  298,368,900,000    0.14
θ(ETF-NAV)  Yearly  Lasso          0.31     0.18  0.05  0.23  354,095,300,000    0.14
θ(ETF-NAV)  Yearly  Ridge          0.31     0.18  0.05  0.23  349,009,600,000    0.14
θ(ETF-NAV)  Yearly  Random Forest  0.61     0.13  0.03  0.17  417,206,200,000    0.10
θ(ETF-NAV)  Yearly  GBDT           0.54     0.14  0.04  0.19  386,756,500,000    0.11
4 Values in parentheses are negative.
Table 6. Forecasting Performance on the Tracking Errors of ETFs (Testing Set) 5
Target      Freq    Model          R²       MAE   MSE   RMSE  MAPE               MedianAE
|ETF-IND|   Daily   Linear         (11.30)  0.01  0.00  0.04  11.56              0.00
|ETF-IND|   Daily   Lasso          (5.67)   0.01  0.00  0.03  11.75              0.00
|ETF-IND|   Daily   Ridge          (11.30)  0.01  0.00  0.04  11.58              0.00
|ETF-IND|   Daily   Random Forest  0.21     0.01  0.00  0.01  10.53              0.00
|ETF-IND|   Daily   GBDT           0.22     0.01  0.00  0.01  11.23              0.00
|ETF-NAV|   Daily   Linear         (1.37)   0.01  0.00  0.07  2,883.99           0.01
|ETF-NAV|   Daily   Lasso          (0.00)   0.01  0.00  0.04  3,976.37           0.00
|ETF-NAV|   Daily   Ridge          (1.37)   0.01  0.00  0.07  2,887.85           0.01
|ETF-NAV|   Daily   Random Forest  0.49     0.01  0.00  0.03  1,433.75           0.00
|ETF-NAV|   Daily   GBDT           0.42     0.01  0.00  0.03  1,498.39           0.00
θ(ETF-IND)  Yearly  Linear         (20.39)  0.23  1.98  1.41  801,212,800,000    0.15
θ(ETF-IND)  Yearly  Lasso          (8.55)   0.23  0.88  0.94  734,426,000,000    0.17
θ(ETF-IND)  Yearly  Ridge          (18.90)  0.24  1.84  1.36  754,952,800,000    0.16
θ(ETF-IND)  Yearly  Random Forest  0.36     0.17  0.06  0.24  969,479,000,000    0.12
θ(ETF-IND)  Yearly  GBDT           0.37     0.18  0.06  0.24  1,010,695,000,000  0.15
θ(ETF-NAV)  Yearly  Linear         (0.16)   0.18  0.07  0.26  785,105,900,000    0.15
θ(ETF-NAV)  Yearly  Lasso          (0.77)   0.19  0.10  0.32  728,321,900,000    0.16
θ(ETF-NAV)  Yearly  Ridge          (0.38)   0.18  0.08  0.28  741,192,000,000    0.16
θ(ETF-NAV)  Yearly  Random Forest  0.29     0.16  0.04  0.20  1,046,803,000,000  0.14
θ(ETF-NAV)  Yearly  GBDT           0.31     0.16  0.04  0.20  1,039,112,000,000  0.14
5 Values in parentheses are negative.
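Two features of Tables 5 and 6 can look surprising: negative R² values and MAPE values in
the billions. The sketch below computes the six reported metrics from scratch on made-up
numbers to show how both can arise. Flooring the MAPE denominator at machine epsilon is an
assumption (some libraries, such as scikit-learn, follow this convention); the paper does
not state which implementation was used, and `regression_metrics` and the sample vectors
are hypothetical.

```python
import sys
from statistics import fmean, median

EPS = sys.float_info.epsilon  # assumed floor for the MAPE denominator


def regression_metrics(y_true, y_pred):
    """R², MAE, MSE, RMSE, MAPE (as a fraction) and MedianAE for two equal-length lists."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    abs_errors = [abs(e) for e in errors]
    mean_true = fmean(y_true)
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    mse = ss_res / len(errors)
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "MAE": fmean(abs_errors),
        "MSE": mse,
        "RMSE": mse ** 0.5,
        "MAPE": fmean([a / max(abs(t), EPS) for a, t in zip(abs_errors, y_true)]),
        "MedianAE": median(abs_errors),
    }


# Made-up targets; the first one is exactly zero, as a tracking error can be.
y_true = [0.0, 0.4, 0.5, 0.9]
y_good = [0.1, 0.4, 0.5, 0.8]    # close to the truth
y_bad = [1.5, -0.6, 1.6, -0.3]   # worse than predicting the mean

good = regression_metrics(y_true, y_good)
bad = regression_metrics(y_true, y_bad)
for label, metrics in (("good", good), ("bad", bad)):
    print(label, {k: f"{v:.4g}" for k, v in metrics.items()})
```

R² falls below zero whenever the squared errors exceed those of simply predicting the
sample mean, and a single true value at or near zero is enough to inflate MAPE by many
orders of magnitude. For targets that can be close to zero, MAE, RMSE, and MedianAE are
therefore easier to compare across models.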
Table 7. Variable Importance for U.S. ETFs’ Tracking Errors
Linear LASSO Ridge RF GBDT
Rank train test train test train test train test train test
Intraday Intraday
1 Log(AUM) Log(AUM) Log(AUM) Log(AUM) Log(AUM) Log(AUM) Lev. Fund Log(AUM)
Volatility Volatility
ETF-IND Intraday Intraday Intraday ETF
(Daily) 2 Lev. Fund Lev. Fund Lev. Fund ETF Illiquidity Log(AUM) Log(DTV)
Volatility Volatility Volatility Illiquidity
3 Log(DTV) Log(DTV) Log(DTV) Log(DTV) Log(DTV) Log(DTV) Der. Based Der. Based Log(DTV) ETF Illiquidity
Shares Shares
1 Log(AUM) Log(AUM) Opt. Opt. Log(AUM) Log(AUM) outst. Expense Ratio Oust. Log(Shares oust.)
growth growth
ETF-NAV
Log(Shares Log(Shares Log(Shares Log(Shares Expense Shares outst. Log(Shares Shares Oust.
(Daily) 2 Der. Based Der. Based
oust.) oust.) oust.) oust.) Ratio growth oust.) growth
Intraday Intraday
3 Lev. Fund AOI AOI Lev. Fund Log(AUM) Log(AUM) Log(DTV) Log(AUM)
Volatility Volatility
Inv. US Inv. US
1 Log(AUM) Log(AUM) Log(AUM) Log(AUM) Log(AUM) Log(AUM) Inv. US asset Inv. US asset
asset asset
θ(ETF- Inv. US Inv. US Inv. US Log(Shares Expense Expense
IND) 2 Inv. US asset Inv. US asset Expense Ratio Expense Ratio
asset asset asset oust.) Ratio Ratio
(Yearly)
Log(Shares Log(Shares Log(Shares Log(Shares Log(Shares Index
3 Inv. US asset Log(DTV) Log(DTV) Index Volatility
oust.) oust.) oust.) oust.) oust.) Volatility
Inv. US Inv. US
1 Log(AUM) Log(AUM) Log(AUM) Log(AUM) Log(AUM) Log(AUM) Inv. US asset Inv. US asset
asset asset
θ(ETF- Inv. US Log(Shares Inv. US Log(Shares Inv. US Log(Shares Expense Expense
NAV) 2 Expense Ratio Expense Ratio
asset oust.) asset oust.) asset oust.) Ratio Ratio
(Yearly)
Log(Shares Log(Shares Log(Shares Index
3 Log(DTV) Inv. US asset Inv. US asset Log(DTV) Optimized Index Volatility
oust.) oust.) oust.) Volatility
Figure 1. The Decomposition of SHAP Values (Daily)
Figure 2. The Decomposition of SHAP Values (Yearly)