Landis and Skouras (2021)

Journal of Banking and Finance 130 (2021) 106128
Guidelines for asset pricing research using international equity data from Thomson Reuters Datastream
Conrad Landis, Spyros Skouras∗
Department of International and European Economic Studies, Athens University of Economics and Business, 76 Patision St., Athens, Greece
Article info
Article history: Received 2 August 2018; Accepted 22 March 2021; Available online 14 May 2021
JEL classification: G12, G15, C89
Keywords: Stock market data; Data collection; Stock returns

Abstract
We provide detailed guidelines and code to derive high quality international equity data from Thomson Reuters Datastream (TDS) data. Our approach increases stock and country coverage (to 91 countries), improves data accuracy, filters problematic data and reduces survivorship bias and data staleness. We validate our approach by demonstrating that our U.S. TDS factors are statistically and economically indistinguishable from standard Fama-French CRSP factors. On the other hand, when we compare our international factors to other publicly available international factors, differences are significant, so we justify and detail every aspect of our proposed guidelines. Our guidelines and accompanying code and data should be especially useful for international research focused on wide coverage, equal weighted portfolios, small stocks and countries with a limited number of stocks and for researchers wishing to analyze the US market with access to only TDS but not CRSP-Compustat data.
© 2021 Elsevier B.V. All rights reserved.
∗ Corresponding author. E-mail addresses: conradfelixmichel@gmail.com (C. Landis), skouras@aueb.gr (S. Skouras).
https://doi.org/10.1016/j.jbankfin.2021.106128

1. Introduction

Empirical research on international equity markets beyond the US is hampered by lack of consensus on data source and procedures for data pre-processing. The most commonly used data source is Thomson Reuters Datastream (TDS),¹ though recent research by Fama and French (2012, 2017) uses data ’primarily from Bloomberg, supplemented by Datastream and Worldscope’. Asness and Frazzini (2013) and other researchers affiliated with AQR use the XpressFeed Global database, while FactSet is also used occasionally, as are a plethora of country-specific data-sets. In terms of data pre-processing procedures, Ince and Porter (2006), Schmidt, Von Arx, Schrimpf, Wagner and Ziegler (2017, 2019) and Karolyi and Wu (2012) provide detailed advice on how to pre-process monthly TDS data.

¹ In our Internet Appendix Table B.1, we list forty-four papers that have appeared in five leading finance journals during 2009-2019. We note that in the last few years Thomson Reuters has switched from providing TDS as a stand-alone product to delivering TDS via its Eikon platform (Eikon is a front-end which provides access to the same database).

Unfortunately, simple analyses that are very robust when using U.S. CRSP-Compustat data are in fact quite sensitive when working with international data, so caution is warranted when interpreting the results from published research on international data. To motivate our research, as we will see in Section 4.2 in greater detail, correlations of widely used publicly available international factor return series for the four most commonly used factors (market, value, size and momentum) can be as low as 50-60%, while the level of mean returns can differ substantially even when comparing U.S. market returns (widely considered to be very robust) where we observe differences as large as 20 basis points (bps) per month in returns created using data from the same TDS database.

Evidently, data sources and data processing can have a very significant impact on empirical research conclusions when studying standard value-weighted international factors, problems which one would expect would be significantly larger if studying equal weighted portfolios (for example Lu, Stambaugh and Yuan, 2017, is a recent study focused on equal weighted portfolios), portfolios that depend on more error-prone fundamental data, small stock portfolios (which are typically more error prone), or if studying individual stocks or portfolios with few stocks where any random errors will not cancel out through averaging. This issue may be one reason for which Karolyi (2016) finds that only 16% of empirical studies published in top Finance journals examine non-U.S. countries without evidence of a fast increase in this rate.

Our paper aims to make a contribution to best practices for working with international data, focusing on TDS (complemented with fundamental data from Worldscope) because it is the most widely available international data source in the academic community² and because we can build on significant pre-existing research focused on refining and using TDS data. Combining TDS with advanced data pre-processing, we hope to contribute to the development of a transparent, consensus and broadly usable methodology for international equity returns, similar to what already exists for the U.S. (where CRSP-Compustat data are combined with simple pre-processing rules such as the exclusion of low priced stocks). We make three main innovations: Firstly, we use a new approach to extracting data which delivers data for all stocks available in TDS³ and we extract this information at maximum accuracy in the ’default’ currency (i.e., the currency in which the stock was traded on the date of each observation). Secondly, we substantially refine existing best practices for TDS data filtering, most notably by introducing filters that require daily data. Thirdly, as part of this filtering we develop an approach to identify survivorship bias in TDS data.

² While Bloomberg is a close second, standard Bloomberg licenses place significant limits on the number of stocks that can be extracted each month, to discourage the type of comprehensive data extraction academic research usually involves.

³ Our approach is juxtaposed to the standard practice in the literature of using TDS recommended ’constituent lists’ of stocks for each country which turn out to be limited in many cases.

We emphasize that even after applying our methodology, international data is unlikely to be of equal quality to CRSP-Compustat data, which was designed to address the academic community’s quality concerns and has been refined through decades of feedback from the academic community and a significant commercial investment. By contrast, international data providers still focus primarily on practitioners who have different priorities, for example they are more concerned with timely and accurate data for currently active companies than in historical data to be used in analyses that require long histories with high quality even in older dates. Our goal is to significantly improve, not perfect, international data used by the academic community.

It is also worth emphasizing that there is no ’source of truth’ against which to judge international data that has been processed according to our procedures. It may seem that collecting raw data directly from all international trading venues could provide a benchmark, but - aside from the enormous task of collecting data from the 183 venues covered in this study - such data would be of little use without additional information such as corporate actions, classification of stocks into common vs. other classes, firm headquarters, number of shares outstanding among many other necessary data items not typically provided in raw feeds⁴. We are however able to validate our data in the special case of U.S. data against CRSP-Compustat data, which is widely viewed as being very high quality, and we find excellent agreement.

⁴ Of course, it is in principle possible to collect all raw feeds and merge them with multiple databases including TDS, Bloomberg and XpressFeed Global after accounting for the (most likely different) problems in each database. This would be a very major endeavor beyond the scope of this paper, which is more likely to be pursued as a commercial rather than an academic project. While we show that data processed according to the methods discussed here are of high quality, we acknowledge that further improvements remain possible.

As we will later discuss in detail, each innovation we propose addresses a specific shortcoming in the data and they are intended as uncontroversial refinements that should collectively increase the accuracy of future empirical research. Considering that there are significant differences among even coarse statistics such as average factor returns reported by different groups of researchers discussed above, there is very significant scope for accuracy improvements. Such significant differences across factors should have an economically meaningful impact on a very broad range of empirical estimates that depend on such factors.

There are of course many metrics one might use to compare results across different data sets. We focus on factor returns because (i) these series are provided by several research teams, thereby allowing comparisons, (ii) factors should be constructed with broad coverage, filtering out only relevant data errors, whereas some studies involve specialized data requirements (in principle the filtering process to create factor returns should not necessarily be the same as the filtering process for more specialized data used to address specific research questions), (iii) factor returns, especially when value weighted, are generally very robust, so by focusing on differences across value weighted factor returns, we are setting the bar high and would expect more significant differences in terms of other metrics, and (iv) since many research conclusions depend on factor returns e.g. for risk adjustments, imperfections in factor return data are likely to affect the conclusions of much empirical research in international finance.

There are also reasons to believe our refinements may not just increase precision but also reduce biases that naturally arise as a consequence of limited coverage and data errors. Limited coverage likely causes biases through three channels: First, analyses with poor stock coverage are likely to mechanically underestimate factor returns based on sorted characteristics because there are fewer stocks to sort on. Second, because coverage is almost certainly inversely correlated with firm size, the size factor’s return is likely further underestimated because disproportionately many small stocks will be missing in any analysis with poor coverage. To the extent that size factor returns are positive, this also suggests market factor returns will be underestimated. Third, where coverage is imperfect, missing stocks are more likely to be stocks no longer in existence, so it is likely associated with backfill bias. On the other hand, data errors that affect specific date ranges of stocks’ observations are likely to cause artificial reversals in stock returns, inducing a downward bias in estimates of momentum factor returns and an upward bias in estimates of value factor returns (since value is closely related to long-term reversal effects).

We do not attempt to refute the central message of any pre-existing research using international data, but instead suggest that a broad range of published empirical research in international finance may be less precise than is widely appreciated. Re-evaluating previous work with our data would no doubt affect the magnitude of a wide range of published estimates that depend on international factors, considering that our factor returns have different means to other widely used factors and that correlations are in several cases quite low. Of course, there is also a glass-half-full interpretation of our results, according to which the problems we report are of limited severity. At the very least, our work should provide a deeper understanding of TDS data to potential users and some new best practices when dealing with this data. Our filtered data covers a far larger than usual 91 countries, so another contribution is that we believe our guidelines make it possible to meaningfully analyze countries that previous research has viewed as off-limits, presumably because of data issues which we hope we have tackled.

Our paper is structured as follows. In Section 2, we describe in detail the approach we propose to extracting TDS data for international research involving total returns, market capitalization and book-to-market values, which are necessary to construct Fama-French factors (specifically we refer to the implementation and data vintage of French’s Data Library in January of 2017, which we refer to in what follows as French, 2017). This involves extracting information for all equities traded in all stock exchanges of each country using TDS’ Navigator GUI, instead of relying on market-specific ’constituent lists’ of equities provided by TDS, which to the best of our knowledge is the approach that has been used by all research teams that have worked with TDS data (Ince and Porter, 2006; Schmidt et al., 2019; Karolyi, Lee and Van Dijk, 2012, are a few examples which will be relevant in what follows). Our approach leads to a significantly larger cross-section of equities than has been used in many earlier studies. We also extract data with full data accuracy and in local currency, dealing with changes in the currencies in which stocks are traded over time (e.g. pre/post Euro adoption) to deal with significant rounding problems in default TDS data. Finally, we extract daily return indexes to create new filters for data problems which would have been difficult to recognize with just monthly return indexes, or to enhance filters which have been used with monthly observations, but which can be further improved with daily observations. A step-by-step guide to implementing our data extraction approach is provided in our Internet Appendix.

In Section 3, we propose a comprehensive list of filters which should be applied to the granular TDS data described above. Some of these are original and some are standard, while some are significantly enhanced versions of standard filters, with enhancements coming from the use of daily data, our method for recognizing survivorship bias or other methodological refinements, including for example the use of data from stocks traded in secondary exchanges to avoid sample selection bias, calibration of filter parameters on CRSP data to avoid overfiltering and size breakpoints rules that are designed for international firms. Of course, our list also aims to be comprehensive in its inclusion of standard filters developed by other researchers. MATLAB code to implement our filtering (or select subsets of our filters) as well as our international factor data are provided online⁵.

⁵ Code and data can be downloaded from the corresponding author’s website.

In Section 4 we evaluate our guidelines. We begin by demonstrating that international factors developed in previous research have significant differences to each other and then compare our factors to those international factors. We report excellent agreement with French (2017) factors in the US, but significant differences between our international factors and those of others, underscoring the importance of detailed and explicit data pre-processing when dealing with international data. We then examine the reasons for these differences and the impact of specific features of our guidelines.

Section 5 concludes with a summary of our main results, while we report many additional details in an Internet Appendix.

2. Extracting raw data

In this section we describe a method of extracting TDS data in an original way designed to maximize stock coverage and data accuracy and detail the problems with data extracted with a standard approach. A step-by-step illustrated guide to extracting data according to our guidelines is provided in our Internet Appendix A.

2.1. Maximizing stock coverage

In all research based on TDS data that we are aware of, where sufficiently detailed information about how data was extracted is provided, researchers seem to create their initial stock universe by merging various ’constituent lists’ of stocks provided by TDS. These lists do not always exist for all countries, do not have precise naming patterns across countries and researchers don’t always use the same set of lists. The most widely used lists are those including currently active and suspended stocks for each country (so called ’research’ lists), those including delisted stocks (usually referred to as ’dead’) and Worldscope lists (as provided by TDS)⁶.

⁶ According to Reuters Helpdesk, research lists include stocks for which multiple datatypes beyond price are supported. We note that we have had extensive communications with Reuters to confirm the various data issues we discuss in this paper.

Unfortunately, the set of stocks contained in constituent lists is only a subset of stocks covered by TDS. The only estimate of list coverage that we are aware of has been provided by Bekaert, Harvey and Lundblad (2007) who report that the total number of firms in the TDS constituent lists they use is about 90% of the number of firms reported as domestically listed firms in World Bank’s World Development Indicators. However, this calculation likely ignores the problem that not all instruments in TDS constituent lists are common stocks of domestically listed firms. Indeed, constituent lists contain instruments we believe most researchers do not typically intend to include in their analyses and sometimes attempt to filter out: for example, they include instruments that are not incorporated in the country of interest (but are traded in it) as well as instruments that are not equities (e.g. ETFs, unit trusts, depository receipts among others). Therefore, a filtered version of constituent lists is likely to achieve substantially less than 90% coverage.

To make matters worse, constituent lists suffer from backfill bias - small stocks are more likely to be ignored but when a stock becomes large enough to be included, its entire history is included. In order to maximize coverage and minimize backfill biases in our sample, we propose a granular approach to extracting data from TDS: We use TDS’ GUI (’Navigator’ in TDS terminology) to manually locate all instruments in Category: Equities and Type: Equity, for each Exchange available on TDS, except exchanges in which only stocks ’primarily’ listed elsewhere are traded⁷. This approach extracted a sample of 204,337 instruments from 183 venues spanning 107 countries at the time of our October 2015 data vintage analyzed in this paper. This is far larger than any raw sample we are aware of, though some of these instruments will later be filtered out, e.g. as non-domestic stocks (detailed instrument filters are proposed in Section 3).

⁷ At the time of our data extraction these were EASDAQ, SEAQ, NEWEX and SWXEurope. Exchanges in each country and raw sample sizes by country and region as well as additional information are provided in Internet Appendix Table B.2.

Note that some researchers prefer to exclude stocks listed in secondary exchanges, on the assumption that stocks of secondary exchanges are likely to be small and/or OTC. However, a major shortcoming of this approach when applied to TDS data is that TDS only contains current exchange classifications for stocks and therefore excluding secondary exchanges will induce a sample selection bias because it will lead to the exclusion of stocks that are in secondary exchanges because they have been demoted from a primary exchange due to poor performance. To the extent that the motivation for excluding small or OTC stocks is to reduce data errors, microstructural and illiquidity effects, we deal with such problems directly with the filters we propose in Section 3. The influence of small stocks on any analysis is more safely controlled by conditioning analyses directly on size than by excluding exchanges.

2.2. Maximum precision data

2.2.1. Decimal settings

By default, TDS rounds all data to the second decimal. This is particularly dangerous because, unlike CRSP, it does not provide returns directly, but instead provides a TDS ’return index’ which is a price-like index tracking the value of a hypothetical investment in which all cash flows (notably dividends) are reinvested⁸. If this return index takes small values, its rounding will have an unacceptable impact on any return calculations. To avoid this, we suggest extracting data with the largest number of available decimal points (6 in TDS’ DPL function), which is sufficient in all cases we have encountered. Fig. 1 shows an example of how this problem manifests in the case of a specific U.S. stock in default TDS data: default

⁸ TDS’ return index (datatype RI) is described at http://product.datastream.com/navigator/search.aspx. The percent change of TDS’ RI is analogous to CRSP’s return variable (daily item id ’ret’; see www.crsp.com/files/data_descriptions_guide_0.pdf). Descriptions of all TDS and Worldscope datatypes, and CRSP item ids used in this study are summarized in Internet Appendix Table B.3.
Fig. 1. Necessity of data extraction at max precision.

We illustrate the impact of TDS rounding under default settings for data extraction for the stock with Datastream code (datatype DSCD) 32543W, and CRSP code (datatype
PERMNO) 91014, focusing on a time window over which the problem is easily depicted. The figure depicts three variants of prices normalized by us. The first is based on
unfiltered data downloaded with TDS default settings (datatype RI) (‘tds default’), the second (‘ours’) is calculated from TDS data that conforms to our guidelines, notably it
is extracted at maximum precision, and the third (‘crsp’) uses CRSP data (datatype RET).
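
The rounding problem depicted in Fig. 1 can be reproduced in a few lines. The following is a minimal Python sketch and not part of the authors' MATLAB code: the return index values and the helper name `returns_from_index` are hypothetical, and `np.round` is assumed to mimic a download at 2 decimals (default) versus 6 decimals (maximum).

```python
import numpy as np

def returns_from_index(ri):
    """Simple returns implied by a return index: r_t = RI_t / RI_{t-1} - 1."""
    ri = np.asarray(ri, dtype=float)
    return ri[1:] / ri[:-1] - 1.0

# Hypothetical return index of a stock that has lost most of its value
# since its base date, so the index level is small.
true_ri = np.array([0.0412, 0.0431, 0.0398, 0.0405, 0.0441])

# Assumption: rounding stands in for the precision chosen at download time.
ri_default = np.round(true_ri, 2)   # default setting, 2 decimals
ri_max = np.round(true_ri, 6)       # maximum setting, 6 decimals

r_default = returns_from_index(ri_default)
r_max = returns_from_index(ri_max)
```

With these assumed values every default-precision index level rounds to 0.04, so `r_default` is zero on each day, while `r_max` retains the underlying return variation.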
precision extraction leads to zero returns where in fact significant return variation occurs and can be matched exactly between CRSP and TDS when used with maximum precision extraction. Where previous studies filter out many zero prices as nonsense values, it is likely that such zero prices were in fact due to data extraction at default accuracy.

2.2.2. Currency settings

There are two reasons why TDS’ price-like return indexes can have small values, causing default decimal settings to be problematic. The first, examined above, is that the stock has performed very poorly since its return index base date. The second is that the return index may have been extracted in a currency with units in a different scale than that of the currency in which the stock was traded (e.g. an unsuspecting researcher wishing to do a global analysis in U.S. dollars might extract return indexes in U.S. dollars rather than extract return indexes in tens of default currencies to then apply all relevant exchange rates in order to obtain his own calculation of U.S. dollar returns). In this second scenario, data extraction at default accuracy but in a non-traded currency can have devastating consequences - for example extracting Italian return indexes in U.S. dollars at default precision will lead to returns that are zero in 73% of all stocks whose currency is the Italian Lira (i.e., were delisted before Italy’s entry to the Eurozone) because the exchange rate of the Italian Lira to the U.S. dollar takes values above 1000. But even maximum precision data extraction in common (e.g. U.S. dollar) but non-traded currencies is insufficient to avoid significant rounding errors. For example, Fig. 2 shows a case where a Japanese stock’s return index extracted in U.S. dollars with 6 decimal places (maximum) accuracy is not sufficient to avoid some rounding of the return index which can be avoided if the return index is extracted in Yen at maximum accuracy. Evidently, a reliable download requires extraction with maximum precision and in traded currency.

3. Data filtering

Having described our TDS data extraction process, we now describe a comprehensive set of filters to deal with TDS data problems. Our filters span stock filters (which exclude the entire history of an instrument), stockday filters (which exclude a specific day of a specific stock), and factor holding portfolio filters (which exclude stocks from the universe on which factor holding portfolios are formed on a specific investment date). We provide a MATLAB implementation of our filtering process online, which gives users the flexibility to choose which filters to apply and to thereby investigate the impact of each filter separately.

One subtle point is that in any fully specified filtering methodology, the sequence in which filters are applied needs to be determined. We apply our filters after removing all stocks for which there is no return index data (our filter 11), which we do so that the threshold parameters of our filters are meaningful. Having done this, we apply each of the remaining filters independently of all others.

An important difficulty in proposing sensible filters is that it is often hard to decide on thresholds or parameters that control the tradeoff between too much and too little filtering. Preferring to err on the side of underfiltering, we have specified the parameters of all parametric filters (filters 7 to 10, 12 and 14) at plausible values but subject to the constraint that they remove no more than 0.5% of all CRSP observations when applied to CRSP data. Where other thresholds appear in our filters (e.g. in our filters for nonsense values or holidays), they remove no data from CRSP.

This filter parameterization is interpreted as one which would have very small impact if applied to data that was of high quality (as in the case of CRSP), so in that sense our filters do not overfilter. While underfiltering is also a concern, we aim at sufficient comprehensiveness in our filters to hopefully capture the most salient problems in the data.

Table 1 collects all our filters which are described in detail below. In the left panel, we summarize how each filter has been used
Table 1
TDS data filters.
We list the filters we use, modify and introduce in this paper, compare them to filters that have been previously used in the literature and measure their impact on our raw data.⁹ In Panel A, we list all our filters and compare them to filters used by other researchers who explain in detail how they filter TDS data. G refers to our guidelines, IP to Ince and Porter (2006), S+ to Schmidt, Von Arx, Schrimpf, Wagner, and Ziegler (2019), L to Lee (2011), GKN to Griffin, Kelly and Nardari (2010), and KLD to Karolyi, Lee and Van Dijk (2012). ’I’ means that a specific filter was introduced by a specific author; ’U’ means that a filter was used and ’M’ means it was used in modified form. The right hand side columns report the fraction of instruments, stockdays or stockmonths, removed by each filter in our data for each region. For Global factors we include two versions, one with the same 23 country coverage used in French’s (2017) Data Library, denoted (G23) and one with full 91 country coverage (G91). For U.S. factors, we report separately the impact on our daily and monthly implementations.

The Marks column lists the I/U/M entries for the columns crsp, G, IP, S+, L, GKN, KLD in the order extracted; the column position of blank entries is not recoverable from the source text. The remaining columns report % removed.

No. Filter                                          Marks           G23     G91     US daily  US monthly  JPN     UK

Panel A
A.1 Filters for stocks based on static information
Significantly enhanced
1   non-common stocks (text strings)                U M U U U U U   22.93   18.91   13.40     13.40       2.64    3.24
2   cross listed stocks                             M               13.75   10.40   19.06     19.06       1.93    7.92
3   non-common stocks (duplicate LOC)               M               0.11    0.08    0.25      0.25        0.00    0.00
Standard
4   non-domestic headquarters                       U U I U U U     16.83   12.39   20.98     20.98       2.05    16.17
5   non-domestic currency                           U U             0.95    1.37    0.03      0.03        0.00    6.43
6   small countries                                 M M             0.00    0.06    0.00      0.00        0.00    0.00

A.2 Filters for stocks based on return index information
Original requiring daily data
7   implausibility (>98% returns same sign)         I               6.62    7.98    9.94      5.97        0.24    4.81
8   few observations (%-wise) (>95% daily zero)     I               6.16    7.49    9.54      5.52        0.23    3.11
Significantly enhanced using daily data
9   high volatility (>0.4)                          I               2.99    2.65    5.70      11.89       0.00    0.00
10  low volatility (<10⁻⁶)                          I               1.25    2.33    16.83     5.16        0.16    1.51
Standard
11  return index data unavailable                   U U U U U U U   15.28   16.88   14.63     14.63       5.87    14.66
12  few observations (<120 days)                    U U M I         0.06    0.06    0.08      0.07        0.00    0.03

A.3 Filters for stockdays
Significantly enhanced using daily data
13  stocks no longer traded (>10 days at end)       U I U U U U     42.34   38.79   43.16     42.41       14.02   52.21
14  staleness (>30 days consecutively)              I               10.63   11.93   16.05     17.27       0.47    5.92
15  outlier errors (>100% & <-50%)                  M M I           0.04    0.03    0.06      0.32        0.66    0.01
Original
16  holidays (<0.5% stocks available)               I               2.65    3.12    2.97      0.19        4.92    1.72
Significantly enhanced
17  survivorship biased or incomplete dividends     I               Dec84   Dec84   Dec84     Dec84       Dec90   Dec84
Standard
18  capital adjustment inconsistencies (<5bps dif)  M I U           0.38    0.38    0.13      0.13        0.18    1.4
19  nonsense values (p<=0)                          U               0.02    0.02    0.05      0.27        0       0

Panel B
Filters for stocks from investment universe of factors on each investment date
Original
20  book-to-market staleness                        I               13.9    12.3    2.65      2.62        3.45    6.52
Significantly enhanced
21  penny stocks (20% lowest prices)                U M U M         20      20      20        20          20      20

⁹ The percentages measure slightly different things in each scenario. In Panel A.1, the percentages reported refer to stocks removed from the raw sample by each filter. In Panel A.2 and A.3 it reports the percentage of stockdays removed from the sample that remains after applying the filters that remove missing observations (filters 11 and 13). For filter 11 it reports the percentage of stockday observations removed as a fraction of total stockdays; and for filter 13, the percentage of stockdays removed after applying filter 11. In Panel B’s filter 19 we report dates after which data is of acceptable quality; for filter 20 the percentage of stocks removed on each investment date when calculating market returns after applying filter 11; and for filter 21 the percentage of stocks removed on each investment date for size and value factor returns, after applying filters 11, 13 and 14.
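
As a rough illustration of how the return-based stock filters in Panel A.2 could operate, the Python sketch below flags a stock from its daily returns. The thresholds are those printed in Table 1, but the way each statistic is computed (shares taken over all non-missing daily returns, volatility taken as the standard deviation of daily returns) is an assumption made here, as are all function and variable names; this is not the authors' MATLAB implementation. The hypothetical helper `removed_share` mirrors the calibration idea of checking what fraction of observations a filter removes from a reference sample.

```python
import numpy as np

# Thresholds as printed in Table 1 (filters 7-10 and 12).
SAME_SIGN_MAX = 0.98   # filter 7: >98% of returns with the same sign
ZERO_SHARE_MAX = 0.95  # filter 8: >95% of daily returns equal to zero
VOL_MAX = 0.4          # filter 9: high volatility
VOL_MIN = 1e-6         # filter 10: low volatility
MIN_DAYS = 120         # filter 12: fewer than 120 daily observations

def stock_flags(daily_returns):
    """Flag one stock; the statistic definitions are assumptions, see lead-in."""
    r = np.asarray(daily_returns, dtype=float)
    r = r[~np.isnan(r)]
    if r.size == 0:
        return {"implausible": False, "mostly_zero": False,
                "high_vol": False, "low_vol": False, "too_short": True}
    same_sign = max((r > 0).mean(), (r < 0).mean())
    vol = r.std()
    return {
        "implausible": bool(same_sign > SAME_SIGN_MAX),
        "mostly_zero": bool((r == 0).mean() > ZERO_SHARE_MAX),
        "high_vol": bool(vol > VOL_MAX),
        "low_vol": bool(vol < VOL_MIN),
        "too_short": bool(r.size < MIN_DAYS),
    }

def removed_share(panel, flag_name):
    """Share of all observations lost when stocks with this flag are dropped."""
    total = sum(len(r) for r in panel.values())
    removed = sum(len(r) for r in panel.values() if stock_flags(r)[flag_name])
    return removed / total

# Synthetic panel (hypothetical data): one ordinary stock and two problem cases.
rng = np.random.default_rng(0)
stale = np.zeros(500)
stale[0] = 0.01
panel = {
    "A": rng.normal(0.0, 0.02, 500),   # ordinary daily returns
    "B": stale,                        # almost always zero
    "C": np.full(300, 0.001),          # every return positive and identical
}
```

In a calibration along the lines described in the text, one would evaluate `removed_share` on a high quality reference sample and accept a threshold only if the removed share stays below the chosen bound (0.5% of CRSP observations in the paper).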
Fig. 2. Necessity of data extraction in local currency at max precision.

We illustrate the impact of TDS rounding under maximum precision settings when data is extracted in a foreign currency for stock with Datastream code (datatype DSCD)
871503, focusing on a time window over which the problem is easily depicted. The figure depicts four variants of adjusted prices normalized by us. The first is extracted
with TDS default settings in local currency (datatype RI) (‘tds default’), the second (which overlaps with the first exactly) is as above but extracted in dollars (‘tds default
dollar’), the third (‘ours’) is calculated from TDS data that conforms to our guidelines, notably extracted at maximum precision in local (Yen) currency, while the fourth
(‘ours-dollars’) is as above but extracted at maximum precision in dollars.
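
The currency effect shown in Fig. 2 can be sketched in the same spirit. This is an illustrative Python example under stated assumptions: the index levels and the exchange rates (1500 lira and 110 yen per U.S. dollar, held constant) are hypothetical, the helper names are invented for this sketch, and `np.round` stands in for the precision of the download.

```python
import numpy as np

def returns_from_index(ri):
    """Simple returns implied by a return index: r_t = RI_t / RI_{t-1} - 1."""
    ri = np.asarray(ri, dtype=float)
    return ri[1:] / ri[:-1] - 1.0

def extract_in_usd(ri_local, local_per_usd, decimals):
    """Convert a local-currency index to U.S. dollars, then round to the download precision."""
    return np.round(np.asarray(ri_local, dtype=float) / local_per_usd, decimals)

# Hypothetical lira-denominated index extracted in dollars at default precision.
ri_itl = np.array([100.0, 103.0, 98.0, 101.0])
r_itl_usd_default = returns_from_index(extract_in_usd(ri_itl, 1500.0, 2))

# Hypothetical low-valued yen index: local currency versus dollars, both at 6 decimals.
ri_jpy = np.array([0.0123, 0.0124, 0.0125])
r_jpy_local = returns_from_index(np.round(ri_jpy, 6))
r_jpy_usd_max = returns_from_index(extract_in_usd(ri_jpy, 110.0, 6))

# With a constant exchange rate the true dollar return equals the local return,
# so any difference between the two series is pure rounding error.
rounding_error = np.abs(r_jpy_usd_max - r_jpy_local)
```

Under these assumptions the lira index rounds to 0.07 on every date, so all dollar returns at default precision are zero, and the yen index extracted in dollars still carries a rounding error of several basis points per observation even at 6 decimals.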
in the extant literature, including whether it is commonly applied to U.S. CRSP data. The right hand-side panel of Table 1 reports the fraction of data removed by each filter, which helps quantify the importance of each filter. Within the table, the filters are organized in groups defined by what they filter (stocks, stockdays, or stock-holding periods used in investment portfolios which define factors), what information they use for the filtering (static data, return indexes or return indexes requiring daily data) and by their degree of originality (original, enhanced or standard). Within subgroups, filters are ordered by the impact they have in terms of how much filtering they cause on our global market factor.

3.1. Stock filters

3.1.1. Based on static information
(1) Non-Common stocks: The sharp distinction applied to U.S. CRSP data between ’common stocks’ and other types of listed instruments, notably preferred share classes of equities, is not necessarily sensible in all countries. Preferred and common share classes have very different meanings in different countries - for example, in some countries the distinction may focus on voting rights, while in others dividend policy or seniority may be more important. Nevertheless, our intent is to adopt a data processing methodology which is as close to U.S. conventions as possible while avoiding a case-by-case treatment of each country. We therefore follow the most popular approach in the literature, which involves excluding all instruments that cannot be classified as common stocks. This has limitations, but the classification of a stock as common does not change over time, so an advantage of using this filter is that it cannot cause survivorship bias.

In the last few years, TDS has introduced a datatype which specifies whether a stock is common or not, making it apparently straightforward to filter common stocks from a broader universe. On this basis, we exclude all stocks with security type code (datatype TRAC) taking any value other than "ORD", "ORDSUBR", "FULLPAID", "UKNOWN", "UNKNOW" and "KNOW". However, the majority of international stocks either do not have this datatype populated, or it is populated with a value that means it is unknown (this happens frequently in delisted stocks, which can cause an unsuspecting user to filter out delisted stocks instead of non-common stocks, biasing the filtered sample). While one might expect that non-equity instruments (e.g. funds, unit trusts etc.) would have been excluded by TDS at the extraction stage based on the process we outlined in Section 2.1, unfortunately TDS’ ’Equity’ classification is not reliable.

We therefore refined and updated previous researchers’ efforts to collect text strings in stock names which identify in each country whether a stock is non-common (e.g. preferred stock, depository receipts, certificates, duplicates, warrants and rights issues, etc.), hence our text strings have been designed to also exclude non-equity instruments, erroneously classified by TDS. Specifically, we use stocks’ extended names (datatype ENAME) and remove all stocks where ENAME includes any of the country-specific strings listed in column 2 of our Table 2.

From the first row of Table 1 we observe that this filter removes around 23% of instruments from Fama and French’s 23 country global data universe (Fama and French, 2012, 2017; French, 2017)10, so it clearly has a very significant impact. This impact can vary widely across countries as evidenced by a 13% filtering rate in the U.S. vs. a 3% filtering rate in Japan and the U.K. We classify this filter as a significant enhancement on existing approaches to detecting non-common stocks as our text strings are country-specific (many researchers filter data from all countries based on the same

10 This includes the following markets: Australia, Austria, Belgium, Canada, Switzerland, Germany, Denmark, Spain, Finland, France, UK, Greece, Hong Kong, Ireland, Italy, Japan, Netherlands, Norway, New Zealand, Portugal, Sweden, Singapore and U.S.
Table 2
Country specific text strings and abbreviations used in filters.
For each country, we report strings that can be used when searching various datatypes to classify stocks. When the strings of Column 2 appear in the TDS’ ENAME datatype,
these stocks are classified as non-common in our filter 1 (inverted commas indicate specific whitespace patterns were part of the search to avoid matching strings which
are part of stock names). The abbreviations of column 3 are used to identify cross listed stocks in our filter 2, while in the fourth column we report comma separated lists
of headquarter locations which we allow for each country in filter 4. In the final column we report the currency string that must appear in a stock’s currency identifier in
order for it to be included in the data of each country.
Columns: (1) Country; (2) Non common stock identifiers (ENAME); (3) Cross Listing Abbreviations (ENAME); (4) Country Identifier (GEOGN); (5) Currency Identifier (PCUR)
Panel A - Africa
Botswana REDEEMABLE (BOT), (BTW) BOTSWANA PU
Egypt REIT, “ EDR”, NIL PAID, “ARABIA C” (EGYPT), (CAI) EGYPT E£
Ghana DEPOSITORY SHARES (GHA) GHANA CT
Ivory Coast COTE D IVOIRE CX
Kenya NIL PAID (KENYA), (NAI) KENYA KS
Malawi (MSW) MALAWI MK
Mauritius REAL ESTATE INVESTMENT (MAURITIUS), (MAU) MAURITIUS MR
Morocco NIL PAID, “ RDF” MOROCCO MD
Namibia FUND (NAM) NAMIBIA, SOUTH AFRICA NR
Nigeria “REIT ”, FUND (NSA) NIGERIA NG
Rwanda (RSE) RWANDA RF
South Africa PREFERENCE, CUMULATIVE, “FUND ”, DUPLICATE, “ RFD”, (JSE) SOUTH AFRICA R
“ CPF”, NIL PAID, “ ETF”, HDG.GP
Tanzania (DAR) TANZANIA TS
Tunis BONUS RIGHTS, FULLY PAID, NIL PAID TUNISIA UD
Uganda (UGA) UGANDA UG
Zambia “ DRC” (ZAM) ZAMBIA K
Zimbabwe “ ZDR”, “ FUND”, “ BOND” (ZIM) ZIMBABWE U$,R$
Panel B – Americas
Argentina “CFI ” (BUE) ARGENTINA AP
Bahamas BAHAMAS U$
Barbados (BARBADOS) BARBADOS $B
Bermuda BERMUDA BBD
Bolivia BOLIVIA BP
Brazil “ UNT”, “ BDR”, BRAZIL UNITS, PARTICIPATIONS UNITS, (BSP) BRAZIL C
“DR3 ”, “UNITS ”, DEPOSITORY SHARES, DEP SER, DEP SR,
NIL PAID, “DR2 ”, “ CL ”, PARTICIPACOES UNITS, “PNE ”,
“PNG ”, UNT N2, ON NP, ON NM, NIL PAID, REPRESENT
Canada SUBSCRIPTION RIGHTS, FUND SHARES, INVESTMENT (TSX), (VSE), (MON), CANADA C$
TRUST, UNIT, EXPD, PROPERTY TRUST, “ N L”, “ NL ”, (TSE), (TOR)
INCOME TRUST UNITS, INCOME COMMERCE, EXPIRY, NL
ORDINARY, COMMERCE PAR, DEFERRED, EXPIRED, “ CDN”,
INCOME FUND, SUBSCRIPTION RECEIPTS, “ RECEIPT”, “
RIGHTS”, SUB VOTING, TRUST UNIT, MBS TRUST, BOND
FUND, PARTNERSHIP UNIT, PARTNER UNIT, “ REIT”, FUND
UNITS, LOAN FUND, “ RIGHTS”, CONVERTIBLE FUND,
PREFERENCE SHARES, COMBINED UNITS, “"IDS" UNITS”, “
SPLIT”, REDEEMABLE, DIVIDEND FUND, TOTAL RETURN,
DEBENTURE, WARRANT, CREDIT FUND, “ ETF”, STAPLE,
INDEX NOTES, “ TR UNIT”, “ NOTES”, “ LINKED”,
CUMULATIVE, EXCHANGE CERTIFICATE, APPRECIATION
FUND, INCOME TRUST, REAL ESTATE INVESTMENT TRUST,
MORTGAGE INCOME FUND, REAL ESTATE FUND, HIGH
INCOME MORTGAGE FUND, “ NOTES”, PRIN PROTECTED,
PRESERVATION LISTED FUCOMMERCIAL REIT,
REDEEMABLE UNITS, REAL ESTATE UNITS,
INFRASTRUCTURE FUND TRUST UNIT, STRATEGY FUND,
INCOME REIT, FINANCIAL TRUST, “ SERIES”, REAL ESTATE
INVESTMENT, ALLOCATION TRUST, “ BACKED”,
ADVANTAGED, RECOVERY FUND, LOAN FUND, SENIOR
LOAD FUND, PROPERTY SERIES, OFFERED SHARES, “ CDA”,
“ CNQ”, PROPERTY TRUST, “ EQUITIES”, FAMILY CORE
FUND, FAMILY CORE CLASS, SUBSCRIPTION RECEIPTS, MBS
TRUST, UNIT PARTNERSHIP, “ REIT LP”, PARTICIPATING
SECURITIES, CANADIAN FUND, EQUITY FUND,
CONVERTIBLE FUND, DEBANTURE, “ CONVERTIBLE”,
FOCUS FUND, SPLIT PRIORITY SHARES, “ SPLIT”, NIL PAID,
TOTAL RETURN TRUST, BALANCED FUND, CREDIT FUND,
VOTING RIGHTS, VOTING SHARES, “ UNT ”
Cayman Islands CAYMAN ISLANDS CD
Chile “RTS ”, “RIGHTS ”, “ OSA ”, EXPIRED, AFP (SGO) CHILE CE
Colombia PRNC, PREFERENCIAL, “ ESP” (BOG) COLOMBIA CP
Costa Rica COSTA RICA CC
Ecuador NRFD, PREFERIDAS, SUCRE, “ SERIE” ECUADOR SU, U$
Jamaica INVESTMENT TRUST (JAM) JAMAICA J$

Mexico “ REIT”, “ DE CV”, “ DE C V”, “ EXPD”, “ ETF”, “ NPV”, “ (MEX) MEXICO MP
BCP”, “ SERIES”, “ ENR”, NIL PAID
Panama PANAMA £
Peru EXPIRED, “ RIGHTS”, NIL PAID, “ SERIES”, “XXXX” (LIM) PERU PS
Trinidad & Tobago TRINIDAD + TOBAGO T$
USA “ TRUST ”, “ REPR ”, “ RIGHT”, “ SERIES ”, “ NV ”, “ IV (NYS), (NAS), (ASE), UNITED STATES U$
TST”, REAL ESTATE INVESTMENT, “REALTY ”, “ RLTY”, (OTC), (XSQ), (XQB)
ROYALTY INVESTMENT, ASSET INVESTMENT, CAPITAL
INVESTMENT, ASSET MANAGEMENT, CAPITAL
MANAGEMENT, INVESTMENT MANAGEMENT, VENTURE
CAPITAL, FINANCIAL SHBI, PROPERTY INVESTORS, INCOME
PROPERTY, “ UNITS ”, “ UNIT ”, LIMITED PARTENERSHIP, “
FUND ”, EQUITY PARTNERS, LIMITED VOTING, SUB
VOTING, TIER ONE SUB, VARIABLE VOTING, “NON
VOTINGREIT ”, “ RESIDENTIAL”, “R E I T”, BENEFICIAL,
BENEFICIARY, BENEFIT INTEREST, BEN INTEREST, “ SH BEN
INT ”, “WARRANT”, “ WRTS”, “ L P ”, L P INTEREST, “LP UT
”, HOLDINGS LP, PARTENERS UNIT, PART INT, UNIT
PARTENERSHIP, UNIT LIMITED, “ MORTGAGE”, “ REAL
ESTATE”, “ CERTIFICATE”, NO PAR VALUE, HOLDING UNIT,
“ BACKED”, “ ST MIN”, “ CORTS ”, “ TORPS ”,“ TOPRS ”,
SECURITIES TRUPS, “ QUIPS ”, STRATS HIGH YIELD, TOTAL
RETURN, DIVERSIFIED HOLDINGS, “(SICAV)”, “
DEPOSITARY”, “ DEPOSITOR”, “ RECEIPT”, “REP & SHARES”,
“ GLOBAL SHARES”, “ ADR ”, “ GDR”, “EXPD.”, EXPIRED, “
DUPLICATE”, CONVERTIBLE, “CNVRT.”, “CONVRT.”, “
EXCH.”, DEBANTURE, “(DEB)”, NIL PAID, STRUCTURED
ASSET, “ CALLABLE”, FLOATING RATE, “ ADJUSTABLE”,
REDEEMABLE, “ PAIRED CTF”, CONSOLIDATED, “
INSURED”, CAPITAL SHARES, DEBT STRATEGIES,
LIQUIDATING, LIQUID UNIT, “L UNIT”, “- LASD”, “
ACQUISITION”, “CAP UNIT”, INCOME UNIT, PREFERRED
Venezuela OPTC (CAR) VENEZUELA VO, VB
Panel C - Asia Pacific
Arab Emirates NIL PAID, “ ADS” (DIF), (DFM), (ADH) ABU DHABI ED,U$
Australia STAPLED, FULLY PAID, INDUSTRIAL FUND, SECURITIES (ASX) AUSTRALIA A$
FUND, DEFERRED, “ CHESS”, DEPOSITORY INTEREST, “
RESTRICTED”, PARTLY PAID, ABSOLUTE RETURN FUND,
INCOME TRUST, PROPERTY TRUST, “ OPT DEF”, PROPERTY
GROUP, “ TRUST”, “ REIT”, SHARE FUND, PROPERTY FUND,
DELAWARE, INDUSTRIAL TRUST, OFFICE TRUST, RENTS
TRUST, RETAIL TRUST, DUPLICATE, “% CN”, “ DEF ”, “CDI ”,
DEMERGED, SHAREHOLDERS FUND, “RTS ”, YIELD FUND,
EXPIRED, “ FUND”, TRUST INCOME, RESETTABLE,
PREFERRED, FLOATING RATE FUND, OPPORTUNITIES
FUND, RECEIPTS,TOTAL RETURN FUND, PREFERENCE
SHARE, EQUITY FUND, NIL PAID, UNIT TRUST,
DIVERSIFIED, SHARE FUND, RECEIPTS, “ CVR”, DE
PROPERTY
Bahrain (BAH) BAHRAIN BH,U$
Bangladesh MUTUAL FUND, “ CF”, MUTUAL FD, “ MF ” (BD) BANGLADESH TK
China OAO, “ B ” (SHG), (SHZ), (HAN) CHINA CH
Hong Kong NIL PAID, LEGAL SHARES, DEPOSITORY RECEIPTS, (HKG) HONG KONG K$
INVESTMENT TRUST, DEFERRED, “ H SHARES”, “ HDR”
India “ IDR”, DUPLICATE, “ DVR”, PREFERENCE SHARES (NSE) INDIA IR
Indonesia DUPLICATE, “ FB”, “ TBK”, “ TBL” (JKT) INDONESIA RI
Iraq IRAQ ID
Israel DUPLICATE, PARTNERSHIP, “ L LIMITED”, TRUSTS (TAE) ISRAEL I£
INVESTMENTS LIMITED, NIL PAID, “ REIT”
Japan “ REIT”, DUPLICATE, REAL ESTATE INVESTMENT, “ FUND”, (TKS) JAPAN Y
NIL PAID, “ RIGHTS”, EXPIRED, PREFERENCE
Jordan NIL PAID, “ REIT” JORDAN JD
Kuwait “ FUND”, REAL ESTATE KUWAIT KD
LAOS LAOS LK
Lebanon “ PR ” (BEY) LEBANON U$
Malaysia REIT, REAL ESTATE INVESTMENT TRUST, PREFERENCE (KLS) MALAYSIA M$
SHARE, RESTRICTED, NIL PAID, DUPLICATE, CUMULATIVE
Mongolia MONGOLIA MT
New Zealand RIGHTS, INCOME FUND, “ SERIES”, SHAREHOLDERS FUND, (NZE) NEW ZEALAND Z$
PROPERTY TRUST, STAPLED, “ RTS”, RECEIPT, NIL PAID,
CONVERTIBLE, PREFERRED
Oman NIL PAID, FUND UNSUPPORTED OMAN OR

Pakistan MUTUAL FUND, NIL PAID, “ FUND”, MUTUAL (PAK.), (PAKISTAN) PAKISTAN PR
Palestine PALESTINE JD,U$
Philippines “ PDR”, “ UNIT=” (PHS) PHILIPPINES PP
Qatar REIT QATAR Q
Saudi Arabia RTS, NIL PAID SAUDI ARABIA SR
Singapore REAL ESTATE INVESTMENT, “ FUND”, “ TRUST”, “ NCPS”, (SES) SINGAPORE S$
NCCPS, “ SGD”, INVTRUST, “ SDS”, DEPOSITORY,
DUPLICATE, NIL PAID, FULLY PAID, “ REIT”
South Korea “(2P)”, “ 1PB ”, “ 2PB”, “ 3PB ”, “ 1P”, “ REIT”, “ PF2”, “ SOUTH KOREA KW
FUND”, REAL ESTATE INVESTMENT, “ 1PF”, “ 1PFD”, “
KDR”, NIL PAID, “(DETACHED)”
Sri Lanka NIL PAID, NON VOTING (LANKA) SRI LANKA CR
Syria SYRIAN ARAB REPUBLIC Y£
Taiwan “ TDR”, REAL ESTATE INVESTMENT, REIT, “ ETF”, “ (TW.), (TAIWAN) TAIWAN TW
(DETACHED)”, “ CONV.”,
Thailand “ FB”, PROPERTY FUND, “ DR1:1”, “ FUND”, REAL ESTATE (THAILAND), (THAI.), THAILAND TB
INVESTMENT, “ NVDR”, NIL PAID, DUPLICATE, “ UNITS” (BANGKOK)
Vietnam VIETNAM VD
Panel D – Europe
Austria CERTIFICATE, “ ZT ”, “ NK5 ”, DUPLICATE, PARTICIPATION (WBO) AUSTRIA AS, E
CERTIFICATE, “CERT ”, “ VI ”, “REIT ”,“ % S ”, NIL PAID
Azerbaijan AZERBAIJAN AM
Belgium “STR ”, “STR VV ”, “ STRIP”, “ STRIPS”, “ VVPR”, (BRU) BELGIUM BF, E
CERTIFICATE, “ CERT”, “ PC ”, “ CNP ”, “ IDR”, “ UNITS”,
DELAWARE, “ ST VV”, “ CS 1”, “ STRIP VV PR”, FULLY, “
CVA”, “ RNC”, RIGHTS, “ REIT”
Bosnia-Herzegovina BOSNIA AND HERZEGOVINA BO
Bulgaria FUND, REIT, NIL PAID, MONTSTROY BULGARIA BL
Croatia PREFERENCE SHARES, “ PIF” CROATIA KA
Czech Republic (PRA) CZECH REPUBLIC CY, E
Cyprus NIL PAID, “ RTS” (CYP) CYPRUS CK
Denmark NIL PAID, REGD CERT (CSE) DENMARK DN
Estonia ADDITIONAL SHARE, “ NRFD”, TUIENDAV AKTSIA ESTONIA EK, E
Finland “ FDR”, SUBSCRIPTION RECEIPT, SALES RIGHTS (HEL) FINLAND M, E
France CERTIFICATE, DELAWARE, LIMITED DATA, BONUS RIGHTS, (PAR) FRANCE FF, E
“ BDR”, “ ADP”, “ SPA”, PREFERRED, STOCK DIVIDEND, “
SPA RP”, “ AFV ”, NIL PAID, “ NV ”, “ NRFD ”, “NR ”, “ CVA
”, DROIT DE VOTE, “ PS ”, NIL PAID, “ ADP”, “ FDR”
Germany REIT, “ SWAP”, GENUSSSCHEINE, PREFERENCE, “ NPV”, NIL (FRA), (STU), (HAM), GERMANY DM, E
PAID, BONUS RIGHTS, “ GS ”, “ PF ”, SUB RIGHTS, “ CDI ”, (DUS), (MUN), (XET)
“ RSP ”, DEPOSITORY RECEIPTS, “ UNIT ”, CHESS
DEPOSITORY INTEREST, DEFERRED, PARTICIPATE
CERTIFICATE, LIMITED PARTENERSHIP, “ GDRS”, “ TRUST”,
“ REIT”, “ REFINERY”
Greece “ PR ”, “ UNITS”, PREFERENCE (ATH) GREECE DR, E
Hungary “ UNITS”, “ TRUST” (BUD) HUNGARY HF
Iceland (ICE) ICELAND IK
Ireland DUPLICATE, “ FUND”, “ UNITS”, “ REIT” (DUB), (IEX), (ESM) IRELAND I£, E
Italy “ RSP”, “ CONV RTS”, NIL PAID, SUB RIGHTS, BONUS (MIL) ITALY L, E
RIGHTS, “ RIGHTS ”, “ RP ”, “ RCV”, “ NRFD”, FULLY PAID,
FULLY PIAD, “ ETN ”
Kazakhstan PREFERENCE LIMITED, PREFERENCE SHARES (KAZ) KAZAKHSTAN KT
Latvia “FB ” LATVIA LV, E
Lithuania LITHUANIA LT, E
Luxembourg “ IDR ”, DEPOSITARY RECEIPT, DEPOSITARY, “ EDR ”, “ (LUX) LUXEMBOURG LF, E
VVPR”, DELAWARE, “ EDR ”, “ CDR ”, “ FDR ”, “ CERT”, “
GDR ”, “ BDR”
Macedonia MACEDONIA MC
Malta (MALTA), (MAL.) MALTA M£, E
Montenegro MONTENEGRO E
Netherlands CERTIFICATE, DUPLICATE, DEPOSITARY, BONUS RIGHTS, “ (AMS), (FL) NETHERLANDS FL, E
% STOCK ”, “ CERT”, CERTS, TRUST INCOME, “ STRIP”, “ CT
”, “ DUPL”, “ UNITS”, “ SPA ”, STRIP VVPR, PREFERENCE
Norway “ DUPLI”, NEW SHARES, NIL PAID (OSL) NORWAY NK
Poland NIL PAID (WAR) POLAND PZ
Portugal BONUS RIGHT, NIL PAID (LIS) PORTUGAL PE, E
Romania (BSE) ROMANIA RL
Russia “ PREF”, TRAST, “ RDP”, PREFERENCE, “ PREF.” RUSSIAN FEDERATION UR, U$
Serbia “ CF ” SERBIA YD
Slovakia “ FOND”, “ VP”, “ POV P”, “ PP”,“ LINKV”, “ ZSP” SLOVAKIA KK, E
Slovenia SLOVENIA TO, E

Spain NIL PAID, BONUS RIGHTS, BUNUS RIGHTS, “ SHARES”, “ (MAD) SPAIN EP, E
LIMITED DATA”, “ CPO ”
Sweden “ SDB”, “ UNIT”, NIL PAID REDEMPTION, “ REDEMP”, “ (OME), (XSQ) SWEDEN SK
SDR”, FULLY PAID, “ RFD”, INTERIM SHARE, “ RIGHTS”,
DEPOSITARY, RECEIPTS, “ SR 1”, “ RFD”
Switzerland WHEN ISSUED, CERTIFICATE, DELAWARE, “ SERIES”, REAL (SWX), (BRN) SWITZERLAND SF
ESTATE FUND, “ CERT”, DUPLICATE, “ UNITS”, “ BOND”,
BONUS RIGHTS, REAL ESTATE IFCA, PROPERTY FUND, “
MIXED”, “ REIT”, COMMERCIAL FUND, “ DRC”
Turkey CERT, “ NRFD”, NIL PAID, FULLY PAID (IST) TURKEY TL
UK “ FUND ”, “ TRUST ”, NIL PAID, STOCK UNIT, ANNUITY (LON), (UNITED UNITED KINGDOM £
UNIT, “UNIT £”, UNIT TRUST, “ UNITS”, “ ZDP ”, REIT, POST KINGDOM)
RED, DEPOSITARY, “ RECEIPT”, INTERIM SHARES,
REEDEMABLE, PREFERENCE, INVESTMENT TRUST, “ ADR”,
FULLY PAID, PARTLY PAID, “ BDR”, “ NRDF”, DEFERRED
Ukraine “ CF ”, “ FUND”, “ CLOSED FUND ” UKRAINE KB
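As a rough illustration of how the columns of Table 2 feed the static filters 1, 2, 4 and 5, the following minimal Python sketch checks one stock against a two-country excerpt of the table. The dictionary layout, the function name `first_failed_filter` and the example stock names are hypothetical and are not taken from our accompanying code; the strings themselves are copied from the Japan and Namibia rows. Filter 3 (duplicate local codes) is not covered.

```python
# Hypothetical two-country excerpt of Table 2 (columns 2 to 5).
# Leading or trailing spaces inside a string are part of the search pattern.
TABLE2 = {
    "Japan": {
        "non_common": [" REIT", "DUPLICATE", "REAL ESTATE INVESTMENT", " FUND",
                       "NIL PAID", " RIGHTS", "EXPIRED", "PREFERENCE"],
        "cross_listing": ["(TKS)"],
        "geogn": ["JAPAN"],
        "pcur": ["Y"],
    },
    "Namibia": {
        "non_common": ["FUND"],
        "cross_listing": ["(NAM)"],
        "geogn": ["NAMIBIA", "SOUTH AFRICA"],
        "pcur": ["NR"],
    },
}


def first_failed_filter(stock, country, table=TABLE2):
    """Return the number of the first static filter the stock fails, or None.

    `stock` is a dict with the TDS static datatypes ENAME, GEOGN and PCUR.
    """
    spec = table[country]
    ename = stock["ENAME"]
    if any(s in ename for s in spec["non_common"]):
        return 1  # non-common stock string found in the extended name
    if any(s in ename for s in spec["cross_listing"]):
        return 2  # local listing identifier found in the extended name
    if stock["GEOGN"] not in spec["geogn"]:
        return 4  # headquarters outside the allowed locations
    if stock["PCUR"] not in spec["pcur"]:
        return 5  # traded in a non-domestic currency
    return None


# Made-up stocks, for illustration only.
ok = {"ENAME": "EXAMPLE MOTOR", "GEOGN": "JAPAN", "PCUR": "Y"}
reit = {"ENAME": "EXAMPLE OFFICE REIT", "GEOGN": "JAPAN", "PCUR": "Y"}
print(first_failed_filter(ok, "Japan"), first_failed_filter(reit, "Japan"))
```

The whitespace inside a pattern matters: a name such as "STREIT HOLDINGS" is not matched by the Japanese pattern with a leading space before REIT, which is the purpose of the inverted commas in column 2.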
list of strings) and significantly more comprehensive than what has been made available in previous research.

(2) Cross-listed stocks: TDS provides primary listing information for a very limited fraction of its stock universe, which we assume is what researchers have relied on so far to identify cross-listed stocks. We have instead found that stocks which are listed in another country can be identified from information in their name (i.e., the name itself will suggest that it is the local listing of a stock primarily traded elsewhere). On this premise, our cross-listing filter removes a stock if its Expanded Name (datatype ENAME) contains any of the country-specific local listing identifiers displayed in column 3 of Table 2 (see footnote 11). This removes around 14% of stocks globally, though again there is significant variation across countries.

(3) Non-common stock identification from duplicate local codes: We have noticed that multiple TDS stocks can have the same ’local code’ (datatype LOC), which is based on the code assigned to an instrument by the exchange on which it is traded. In cases where TDS flags a particular stock as primary (datatype ISINID equals "P"), we have noticed that all other stocks with the same local code are non-common stocks. We therefore exclude all stocks which have a non-unique LOC and whose ISINID is not P, as long as there exists one stock with this LOC that does have ISINID = P.

To the best of our knowledge, previous research has aimed to remove non-common stocks exclusively on the basis of text strings, so this seems like a significant enhancement to existing approaches. However, the filter only has a moderate impact, removing 0.11% of all stocks from the 23-country global universe.

(4) Non-domestic headquarters: We follow Ince and Porter (2006) and exclude any security for which TDS’ geographical classification (datatype GEOGN) is different to the identifier for the country being analyzed. For each country the valid identifier is reported in column 4 of Table 2. This standard filter removes 17% of global stocks.

(5) Non-domestic currency: We exclude all securities traded in a currency other than the domestic currency for each country. We list the TDS currency abbreviations which are domestic for each country in column 5 of Table 2. For Ecuador, Bahrain, Lebanon, Palestine and Russia, some stocks listed in those markets are traded in U.S. dollars, so dollars are viewed as a second local currency. In other countries, there is a change in the local currency during the sample, so multiple currencies need to be included: for Euro area countries this means the Euro as well as the pre-Euro currency are treated as local; and for Venezuela, both the Bolivar (VB) and the Bolivar Fuerte (VO) are local. To implement this filter, we go through each country and exclude any stock whose TDS currency (datatype PCUR) is not one of the country’s currency abbreviations listed in column 5 of Table 2. The overall impact of this filter is small.

(6) Small countries: To avoid dealing with countries that have a trivially small number of stocks, we exclude all stocks from a country with fewer than 20 stocks available in the sample. This filter has also been used by Griffin, Kelly and Nardari (2010). It affects only our full global data universe (our other universes are unaffected by it), where it removes 0.06% of global stocks and 14 of 107 countries, specifically Bahamas, Barbados, Bermuda, Bolivia, Cayman Islands, Costa Rica, Laos, Lebanon, Malawi, Panama, Rwanda, Tanzania, Trinidad and Tobago and Uganda. Note that after applying our filters, Azerbaijan and Namibia contribute zero filtered stocks to our sample, so after applying all our guidelines there are 91 countries contributing stocks to our filtered sample.

3.1.2. Based on return index data
(7) Implausibility: We remove stocks of which more than 98% of their non-zero daily returns are either non-negative or non-positive. This filter has an important effect on our overall data since it removes 7% of stockdays globally.

These situations can reflect a variety of problems. In Fig. 3, we provide an example of a U.S. stock removed in this way12, in which the underlying problem is that TDS apparently based its return index calculation on a dividend yield factor implying a dividend was paid out on every date for which data is available. We studied this case in detail and found that the dividend yield factor applied was based on a dividend yield calculation corresponding to a four cent dividend payment which according to CRSP data was indeed paid out, but only once on 3rd May 1985, whereas TDS applies this dividend yield on every day after that date.

Fig. 3. Necessity of implausibility filter.

We illustrate the usefulness of our implausibility filter (filter 7 of Table 1) for stock with Datastream code (datatype DSCD) 912391, and CRSP code (datatype PERMNO) 78552, focusing on a time window over which the problem is easily depicted. The implausibility filter removes this stock entirely, so here we illustrate cumulative returns from CRSP data (datatype RET) (‘crsp’), compared to appropriately normalized adjusted prices from unfiltered TDS data (datatype RI) (‘raw’), a monthly implementation of our filters denoted P1 in Table 7 (‘monthly’), and adjusted prices calculated from TDS data (datatype RI) where we use all of our guidelines except the implausibility filter (‘daily ex implausibility’).

We believe that we are the first to detect this type of problem. This filter is a good example of a problem which would not be easily obvious in monthly return index data and for which it would be difficult to implement a filter using monthly data, because monthly returns have more balanced signs than daily data. Indeed, a visual inspection of Fig. 3 confirms that if we were to focus only on monthly data (the diamond line), we would see cumulative returns that are not entirely implausible, while we would miss the extreme stability of daily returns (the straight sections of the squared line reveal large numbers of days across which returns were identical), which is a sure give-away that the data is problematic.

Note that even in this highly irregular case of Fig. 3, the implausibility filter is not redundant, since applying all other filters we propose is not sufficient to eliminate the offending stock.

(8) Few zero observations (small percentage of sample): We remove stocks for which the return calculated from the return index (datatype RI) is zero in more than 95% of their sample (i.e., days between their appearance in the database and the last date of our sample or their delisting date identified according to the rule of filter 13).

Fig. 4 illustrates a case where applying this filter eliminates the stock. Comparing the cumulative returns in CRSP data to several variants of returns obtained from TDS data (including raw data, data after applying all daily filters but without eliminating this stock (i.e. all but this filter), and data filtered with monthly filters including this filter) we see that there are significant discrepancies in all cases. Furthermore, it is evident that the reason for these discrepancies is that several non-zero returns in CRSP data appear as zero returns in TDS data. Note that while CRSP distinguishes between missing observations and zero returns, no such distinction exists in TDS, so it is likely that the underlying cause of the problem in the TDS data is a high fraction of missing data for this stock. Evidently, our filter is acting sensibly in removing this stock altogether.

Fig. 4. Necessity of few zero observations (small percentage of sample) filter.

We illustrate the usefulness of our few zero observations (small percentage of sample) filter (filter 8 of Table 1) for stock with Datastream code (datatype DSCD) 982362, and CRSP code (datatype PERMNO) 18949, focusing on a time window over which the problem is easily depicted. This filter removes this stock entirely, so here we illustrate cumulative returns from CRSP data (datatype RET) (‘crsp’), compared to appropriately normalized unfiltered TDS adjusted prices (datatype RI) (‘raw’), adjusted prices with a monthly implementation of our filters denoted P1 in Table 7 (‘monthly’) and adjusted prices calculated from TDS data where we use all of our guidelines except the filter under analysis (‘daily ex few’).

As in the previous case, it would be difficult to be sure that this data is problematic using monthly data, since applying our filter on the entire history of monthly data for this stock (not just the period depicted in Fig. 4) would have no effect, as only 62% of unfiltered monthly observations are zero.

This filter is also very significant, removing 6% of all stockdays globally, and reinforces the usefulness of working with daily data13.

(9) High volatility: Stocks with a daily standard deviation of more than 40% are eliminated from the sample. This is a regular occurrence in TDS data, which we have observed in some cases being due to missing observations (which express themselves as padded return indexes and therefore zero returns), to large errors (for example incorrect adjustments for corporate events14) or to extreme illiquidity which leads to only very rare price updates for certain stocks15. As an indication of the impact of the volatility-specific filters, in CRSP data the range across stocks of daily standard deviations is zero to 83% (mean 4.3% and median 3.6%). By comparison, in raw TDS data the standard deviation range is zero to 48,414,624% (mean 1688% and median 32%). After applying all our filters on the TDS data, the standard deviation range is 0.11% to 40% (mean 8.4% and median 4.6%).

We note that this filter and the next low-volatility filter can be implemented with daily data more precisely than with monthly data, in the sense that the standard error of monthly volatility estimates based on daily data will be significantly lower than that based on monthly data and hence our filtering will be less affected by statistical variation. The filter removes around 3% of global observations.

(10) Low volatility: Stocks with a daily standard deviation of less than 0.01 bps are eliminated from the sample. This is more common than one might expect, affecting 2% of all stockdays. The filter captures stocks that have many zero returns which may be due to extreme but genuine illiquidity, as well as situations where TDS pads missing prices (i.e., missing prices are replaced with the most recent available price, leading to zero returns), most likely reflecting stock trading suspensions, closed venues, data collection or other database errors.

(11) Return index data is unavailable: For some stocks, TDS’ return index (datatype RI) is not available for any date (though e.g., accounting data may exist). All such instruments are excluded as is standard in the literature. This accounts for approximately 15% of global stocks.

(12) Few observations (small total number): As is standard, in order to avoid stocks which have an unusually brief history, which can be due to TDS limitations, we restrict attention to stocks which have more than some threshold of observations. Specifically, we require 120 valid daily observations, unless the stock’s first available observation is within the last 120 days of the sample. This eliminates 0.06% of all stockdays.

3.2. Stockday filters
(13) No longer traded stocks: We have noticed that TDS delisting dates typically occur much later than the last date for which return index data is available. This means that many stocks appear with constant return indexes towards the end of their series even when the series have been truncated at their delisting dates.

As in Ince and Porter (2006), we eliminate data for stocks where we observe padded values for return indexes immediately preceding their delisting date. While Ince and Porter (2006) and others (e.g., Schmidt et al., 2019) implement this by removing the second and subsequent padded value in monthly data, we remove the tenth and subsequent padded daily observation.

This means we will not have returns for a stockmonth that would have otherwise been included on the basis of a monthly filter, when its last trading date is at the start of the month, since end of month data will be unavailable. This is a very significant filter removing 42% of global stockdays.

(14) Staleness: In CRSP, daily returns are calculated with respect to the most recent available price, as long as this price is not more than 9 days old16 and cumulated daily returns match monthly returns exactly. Because TDS has padded prices (i.e., on each date the price at the most recent available date is used to calculate all datatypes, so missing values are not reported), we need a mechanism to recognize excessively stale prices in order to follow a CRSP-like convention for calculating returns. This convention is important, because monthly returns calculated from stale prices will bias many types of analyses premised on synchronous observation of returns across stocks.

11 We do not apply this filter to any German stock which is traded on the XETRA exchange (implemented by ignoring this filter for any stock with a GEOGN equal to "GERMANY" and EXMNEM equal to "XET" and an ISIN (datatype ISIN) or ’local code’ (datatype LOC) equal to the ISIN or LOC of any stock with exchange mnemonic (datatype EXMNEM) equal to ’FRA’ and GEOGN equal to "GERMANY"). The reason is that we want to keep data from stocks which were cross-listed on Xetra and Frankfurt, in order to merge their time series as discussed in Internet Appendix Section B.1.
12 Internet Appendix Figures B.2 and B.3 illustrate a stock from a large and from a small international market filtered out in this way.
13 Internet Appendix Figures B.4 and B.5 provide additional examples for a big and a small country in our sample.
14 For example, the stock with Datastream Code 51792L has an implausible adjustment factor spike (datatype AF) for a single day on June 6th, 2012, while the unadjusted price (datatype UP) remains constant, causing a huge spike in the return index (datatype RI) on that date.
15 Some researchers remove absolute returns above a specific level or exclude returns outside an extreme fractile (e.g., 5% tails), but this filter is designed to recognize TDS problems that are stock-specific and lead to repeated extreme returns in the same stock, rather than to recognize a problematic stockday.
16 See www.crsp.com/files/data_descriptions_guide_0.pdf
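The stock-level return index filters described above (filters 7 to 12) and the padding rule of filter 13 can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the accompanying code: the function names are hypothetical, returns are expressed as fractions (so 40% is 0.40 and 0.01 bps is 1e-6), "valid daily observations" is read as the number of available return index values, and the order in which the checks are applied is an assumption.

```python
import statistics


def daily_returns(ri):
    """Simple daily returns from a list of return index (RI) values."""
    return [b / a - 1.0 for a, b in zip(ri, ri[1:])]


def drop_trailing_padding(ri, keep=9):
    """Filter 13 sketch: drop the tenth and later repeats of the final RI value."""
    first = len(ri) - 1
    while first > 0 and ri[first - 1] == ri[-1]:
        first -= 1
    # ri[first] is the first occurrence of the final value; repeats start at first + 1.
    return ri[: first + 1 + keep]


def keep_stock(ri, recently_listed=False):
    """Return True if a stock survives sketches of filters 7 to 12."""
    if len(ri) < 2:
        return False  # filter 11: no usable return index data
    if len(ri) < 120 and not recently_listed:
        return False  # filter 12: too few observations
    r = daily_returns(ri)
    zeros = sum(1 for x in r if x == 0.0)
    if zeros / len(r) > 0.95:
        return False  # filter 8: more than 95% zero returns
    nonzero = len(r) - zeros
    positive = sum(1 for x in r if x > 0.0)
    if max(positive, nonzero - positive) / nonzero > 0.98:
        return False  # filter 7: non-zero returns almost all of one sign
    sd = statistics.pstdev(r)
    if sd > 0.40 or sd < 1e-6:
        return False  # filters 9 and 10: daily volatility out of range
    return True
```

Because the sample used in filter 8 ends at the delisting date identified by filter 13, a natural usage under these assumptions is `keep_stock(drop_trailing_padding(ri))`.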
Fig. 5. Outlier errors filter.

We illustrate the need to apply our outlier errors filter on daily rather than monthly data (filter 15 of Table 1) for stock with Datastream code (datatype DSCD) 271924, and
CRSP code (datatype PERMNO) 87156, focusing on a time window over which the problem is easily depicted. We illustrate cumulative returns from unfiltered CRSP data
(datatype RET) (‘crsp’) compared to appropriately normalized unfiltered TDS cumulative returns (datatype RI) (‘raw’), cumulative returns after a monthly implementation of
our filters denoted P1 in Table 7 (‘monthly’) and after using all of our guidelines (‘ours’).
The rule we use is that if 30 consecutive prices are identical, all subsequent price observations are
eliminated until the next price change. Note this is a very significant issue - for example, approximately
70% of all daily returns are zero in the raw U.S. TDS sample, while this is only 23% in the CRSP sample.
After applying our filters our U.S. sample has only around 30% zero daily returns, so agreement is much
better. The rule removes 11% of global stockdays.

(15) Outlier errors: We use a filter to control for extreme returns which we have found are typically
due to errors such as erroneous adjustments for stock splits. To operationalize this, if the return on
date t is greater than 100% (respectively lower than -50%) and the return on day t+1 is lower than -50%
(respectively greater than 100%) then both days are eliminated. Similar filters were used on monthly
returns by Ince and Porter (2006) and Schmidt et al. (2019) but such reversals in daily data are far less
plausible than in monthly data. Naturally, for monthly returns, elimination of such stockdays will have
a more significant impact when date t is the last date of a month.

Figure 5 illustrates an example in which applying this filter on monthly data will unnecessarily filter
out observations that would not have been removed if the filter were instead applied on daily data.

(16) Holidays: On a country-by-country basis, we exclude likely holidays or days on which markets are
closed, by removing days on which non-missing or non-zero returns account for less than 0.5% of the
total number of stocks available for that country across all days (after applying filters 11 and 13).17
This is an original filter we have had to introduce as previous work on filtering has focused on monthly
data so has not had to deal with this issue. The filter removes 3% of global stockdays.

(17) Exclusion of early periods where TDS has survivorship bias or dividend data is incomplete:
CRSP-Compustat data are known to be subject to survivorship biases (Davis, 1996) and so is
TDS-Worldscope (Andrikopoulos, Daynes, Latimer and Pagas, 2007; Ulbricht and Weiner, 2005). However,
most researchers using TDS data have not attempted to exclude data on this basis. While this problem can
seriously distort many analyses, the good news is that it seems limited to older data, so it can be dealt
with by limiting attention to more recent data. For example, Figure 6 illustrates delisting rates across
CRSP, Compustat, TDS and Worldscope for U.S. data18. Evidently, no TDS U.S. stocks with valid Worldscope
book value information delist before 1985, but after that date, there is no obvious survivorship bias
using this simple measure19. As a further check for survivorship bias in TDS data, we compared TDS data
we extracted in 2010, 2013 and 2015 vintages and found that the 2015 data vintage did not add pre-2013
observations relative to those already available in the 2010 and 2013 vintages. This suggests that in
recent years, this bias should not be a concern.

17 We confirmed this filter correctly identified all U.S. holidays by comparing to CRSP data.
18 Internet Appendix Figure B.6 reports the same for U.K. and Japan.
19 Note that the fact that Worldscope has significantly lower delisting rates than TDS, CRSP or Compustat
after 1985 may reflect its focus on larger stocks rather than survivorship bias.

Fig. 6. Survivorship bias in CRSP, Compustat, TDS and Worldscope data.

We calculate the fraction of US stocks that were traded on the last date of each year but not traded on the last date of the next year in CRSP, Compustat, TDS and Worldscope
data. A stock is considered to be traded over the interval between its sample start date and its end date as determined after applying the filter for no longer traded stocks
(13). For CRSP data we look at the sample of the RET datatype, for Compustat, we use book equity calculated as in Davis, Fama and French (2001), for TDS the RI datatype,
and for Worldscope we use the WC03501 datatype. Note TDS and Worldscope samples begin a lot later than the other data.

An additional problem with older historical TDS data is that before ex-dividend dates become available
for each country (1st January 1988 for all markets except Canada and USA where coverage begins 1st
January 1972), TDS constructs return indexes by smoothing annual dividends across all days in that year.
Ignoring this can be extremely problematic, especially when using monthly TDS data that is stale, in
which case TDS’ monthly return will be non-zero and consistently positive due to the dividend smoothing.
This apparently underappreciated problem likely affects many studies using older monthly TDS data.

On the basis of these two observations, we propose country-specific start dates for TDS analyses after
which delisting rates are non-zero in TDS and Worldscope book value data and ex-dividend data is also
available (analyses which do not require Worldscope book values can potentially start from earlier
dates). These dates are reported in Table 1 for Global, US, Japan and U.K. analyses. Global portfolios
will have countries added to them after each individual country’s start date is reached20.

(18) Adjustment inconsistencies: Ince and Porter (2006) note that there is an inconsistency between
adjusted prices provided by TDS and the prices implied from unadjusted prices and adjustment factors
provided by TDS. Following Ince and Porter (2006), we filter out cases where there is a large discrepancy
(more than 5%) between the two. More specifically, we filter out all stockdays for which UP is more than
5% different to P ∗ AF, where UP is TDS’ unadjusted price, P is the adjusted price and AF is the
adjustment factor. This standard filter does not have a large impact, removing just 0.4% of global
stockdays.

(19) Nonsense values: We remove all stockdays for which unadjusted prices contain zero or negative
values. This standard and obvious filter removes 0.02% of global stockdays. Note that this filtering rate
would have been much higher if we had downloaded data at TDS’ default precision since it would have
captured many stockdays whose unadjusted prices would have been rounded to zero (e.g., in the U.S. at
default precision this filter removes 3.7% of observations whereas at maximum precision it removes only
0.05%, which is relevant for the discussion of Section 2).

3.3. Filters for stocks from investment universe of factors on each investment date

Our final set of filters removes stocks from the universe from which factor holding portfolios are
formed, aiming to be consistent with conventions applied when calculating U.S. factors from U.S.
CRSP-Compustat data. These filters are applied after applying all the filters in the previous subsection
and are applied on each investment date of each factor holding portfolio. In all respects, our factor
holding portfolios follow French (2017), so for example we remove stocks from the investment universe of
each investment date if they do not have a holding period return, market cap or a signal (e.g., momentum)
available on that date. These extremely standard filters are not analyzed here, but we note we apply them
after applying all stock and stockday filters above.

(20) Book-to-market staleness (applies only to value and size factors): TDS provides market-to-book data
(datatype MTBV) which is widely used, in particular to create value and size factors. To the best of our
knowledge, we are the first to have noticed that this series can be problematic as it uses the most
recent available market and book values for each series independently to form this ratio. As a
consequence, and because TDS’ market values can be stale by years, the two terms in the ratio may be
measured at very distant dates from each other, while there will be no indication that this is the case
in the MTBV series.
To avoid this, following the French (2017) methodology closely, we filter out all stocks from the
investment universes of factors that require MTBV (i.e., value and size) when their market value
(datatype MV) or book value (datatype WC03501) is unavailable on the date at which MTBV is measured. In
the French (2017) convention, the investment date for value and size factors is the end of June of each
year and the measurement date for MTBV is the last working day of the previous year. For example, for the
stock with code (datatype DSCD) 777345, TDS’ MTBV uses the market value from 11 August 1987 and the book
value from 31 December 1987 in reporting a market-to-book value of 1.69 for 31 December 1987, whereas
from CRSP-Compustat we match this stock to the one with PERMNO 63853 where we observe a value of 0.76 (we
verify that the difference is due to a drop in market cap towards the end of 1987 which is not reflected
in TDS’ stale data). This filter is applied after filters 1-20 and eliminates 2.4% of stocks on average
across investment dates of value and size factors (book-to-market information is not used to construct
the market or momentum factor).

(21) Penny stocks: In research on U.S. data, it is common practice to remove stocks with prices below a
price threshold of $1 or $5. Similar rules based on local currency (e.g., Ince and Porter, 2006, among
others) or USD (Hou, Karolyi and Kho, 2011) have been applied to international data. However, this
practice can be seriously problematic in certain countries where the usual price ranges are very
different to what is normal in the U.S. and application of such filters can eliminate all of the data
over some periods. Ireland is a case in point where the $1 (respectively $5) rule would remove 60%
(respectively 90%) of its sample (using thresholds with these values in local currency would not
significantly affect these filtering rates).

To avoid these issues, we propose removing stocks from the investment universe of month t when their
unadjusted close price in month t - 1 is in the lowest quartile of stocks available on that month (we
apply filter 11 to remove stocks with no missing return indexes before applying this filter as quartile
breakpoints would otherwise be less interpretable).

3.4. Holding portfolios breakpoints for country-specific international factors

In creating international factor returns, it becomes necessary to take a view on an important design
decision for holding portfolios. We discuss this here, acknowledging and emphasizing that this is a
design decision rather than a data filter.

Time-varying market cap percentiles (breakpoints) are commonly used to ensure factor portfolios are
balanced in the size of the stocks they contain. For example, the standard French (2017) methodology
requires that high book-to-market portfolios contain two sub-portfolios of stocks with high book to
market ratios from a distinct large and small market cap stock universe. For U.S. CRSP data, the size
threshold is based on NYSE’s median market values whereas for international regions, the French (2017)
threshold is obtained by defining as big, those stocks whose size ‘sums up to the top 90% of June market
cap for the region’.

Since the French (2017) methodology has not been refined to create country-specific factors (with the
exception of Japan which is viewed as a region), there is no obvious standard for how to create
country-specific breakpoints and researchers creating such country-specific factors have tried various
approaches. We propose a breakpoint methodology for country-specific factors designed to closely follow
the logic of the French (2017) methodology for U.S. data. Specifically, we suggest using breakpoints
based on each country’s main exchange. When a country has multiple exchanges, we define the main exchange
to be the one with the largest total capitalization (market cap on each day is the sum of available
market value (datatype MV) observations for all stocks on that day, after applying all stock and stockday
filters, but before applying any holding portfolios filters) and use this main exchange’s median market
cap as the size breakpoint, in line with the U.S. approach of French (2017). We define ‘Big’ stocks as
those with market cap above the size breakpoint and follow French (2017) in defining size and momentum
thresholds on the basis of the 30th and 70th percentiles of the corresponding characteristics, but our
calculation is modified in that these percentiles are calculated for the main exchange only in each
international country.

This approach reduces the weight of small stocks in each country and is especially important for
country-specific analyses as it also reduces the weight of OTC stocks. OTC stocks are usually traded on
secondary exchanges so an advantage of our approach is that they will not affect breakpoint values (as
discussed earlier in Section 2.1, in TDS data it is possible to identify stocks’ venues only at the time
of download, so using this as a sample selection criterion will typically create a survivorship bias in
the sample).

Our breakpoint approach influences the results we report for the U.S. (main exchange: NYSE) and Japan
(Tokyo) and would influence country-specific analyses for any of the 30 countries in our sample with
multiple markets21. In Fig. 7 we verify that our breakpoints are sensible: we compare them to French
(2017) CRSP NYSE breakpoints and find they are very similar throughout our sample.

4. Evaluation of our guidelines

In Tables 3 and 4 we report summary statistics for our market, size, value and momentum factors across
various regions as well as those developed and provided by six other independent research teams,
specifically French (2017), AQR (2017), Schmidt et al. (2019), Karolyi et al. (2012), Gregory, Tharyan
and Christidis (2013) and Dimson, Nagel and Quigley (2003). The data sources, filters, implementation
details, country, factor and frequency coverage differ across researchers22. The choice of countries and
regions (U.S., Japan, U.K. and Global) analyzed in this section was designed to maximize the number of
studies with which we could compare our factors.

4.1. Differences among international factors used in previous research

We consider first the variation observed across others’ widely used factor series over the dates each
series is provided publicly, since it is an indication of the sensitivity of factor performance estimates
to data and factor design decisions of independent researchers and of the size of potential biases from
relying on any one study.

In Table 3, we find this variation is statistically and economically meaningful: differences in
estimates of mean monthly returns of the order of 10-20 bps are common and, in some cases, can be as much
as 50 bps (e.g., for U.K. market returns), but this is not entirely surprising since time samples do not
coincide and samples are not always very long. However, note that the major reason for short samples and
variation in sample start dates is that the start date chosen by each research team typically reflects
the date after which they feel confident in the quality of their data and hence is intimately tied to
data processing decisions. The variation we see in this table is therefore affected by variation in data
processing decisions both directly and indirectly.

20 In Internet Appendix Table B.5 we provide such dates for each of the 107 countries in our sample.
21 Internet Appendix Table B.4 presents, for all countries with multiple exchanges in our sample, the
exchange classified as "main", as well as its share of stocks and market cap relative to secondary
exchanges. As an exception to our rule, for Arab Emirates the union of Abu Dhabi and Dubai is considered
to be the major exchange and for China, the union of Shanghai and Shenzhen. These cases are treated
differently because the market capitalization share is balanced between these exchanges.
22 Internet Appendix Table B.6 provides links to the publicly available datasets of factor returns used
in these studies.
Fig. 7. Book-to-market percentiles for size and value factors: comparison of CRSP and TDS.
We illustrate the 30% and 70% book-to-market percentiles of NYSE stocks, used to construct breakpoints for size and value factors, based on both CRSP and TDS data. As is
standard, CRSP breakpoints use book values constructed according to Davis, Fama and French (2001) and information on which stocks are traded in NYSE on each date. With
TDS data, we use the Worldscope book value (datatype WC03501) divided by market value (datatype MV), and allocate stocks to NYSE based on their classification at the
time of the data download (datatype EXMNEM).
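The country-specific breakpoint rule of Section 3.4, whose U.S. book-to-market percentiles Fig. 7 compares with CRSP, can be sketched for a single date as follows. This is an illustrative sketch only: the record fields (`exchange`, `mv`, `signal`), the helper names and the linear-interpolation percentile convention are assumptions made for the example, not details taken from the paper or its code.

```python
from statistics import median


def main_exchange(stocks):
    """Return the exchange with the largest total market value (MV)."""
    totals = {}
    for s in stocks:
        totals[s["exchange"]] = totals.get(s["exchange"], 0.0) + s["mv"]
    return max(totals, key=totals.get)


def percentile(values, q):
    """Percentile with linear interpolation (an assumed convention), q in [0, 100]."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def breakpoints(stocks):
    """Size and 30/70 signal breakpoints from main-exchange stocks only."""
    main = main_exchange(stocks)
    on_main = [s for s in stocks if s["exchange"] == main]
    signals = [s["signal"] for s in on_main]
    return {
        "main": main,
        "size": median(s["mv"] for s in on_main),  # median market cap
        "p30": percentile(signals, 30),
        "p70": percentile(signals, 70),
    }


# Hypothetical cross-section: four main-exchange stocks and two OTC stocks.
stocks = [
    {"id": "A", "exchange": "MAIN", "mv": 100.0, "signal": 0.2},
    {"id": "B", "exchange": "MAIN", "mv": 300.0, "signal": 0.4},
    {"id": "C", "exchange": "MAIN", "mv": 500.0, "signal": 0.6},
    {"id": "D", "exchange": "MAIN", "mv": 700.0, "signal": 0.8},
    {"id": "E", "exchange": "OTC", "mv": 5.0, "signal": 3.0},
    {"id": "F", "exchange": "OTC", "mv": 8.0, "signal": 4.0},
]
bp = breakpoints(stocks)
# Breakpoints come from the main exchange but classify every stock.
big = [s["id"] for s in stocks if s["mv"] > bp["size"]]
```

The two small OTC stocks are classified with the breakpoints but do not move them, which is the property the text highlights.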
In Table 4, we report a more cleanly interpretable comparison across alternative factor series, limiting
attention to the intersection of dates for which pairwise comparisons are possible (in other words, we
compare all factors to each other on the same dates). We find that correlations of factors can be as low
as around 60% (e.g., the size factor returns of Dimson et al., 2003, and AQR, 2017, for the U.K. have a
correlation of 56% in their overlapping period from December 1984 to December 2001), while even the
correlation of French (2017) and AQR (2017) returns for global portfolios, which should be very robust,
can be as low as 82% (for size returns). In addition, the level of their returns can differ by more than
10 basis points per month even though we have restricted attention to the same and therefore most recent
time periods. Large differences are even observed when we compare market returns using data from the same
TDS database - e.g., when comparing U.S. market returns provided by Karolyi et al. (2012) and Schmidt et
al. (2019) we find a difference of 19 bps per month, with a relatively low correlation (by the standards
of what we would expect for value-weighted market returns) in the range 0.83-0.9. Note that in this
table, low correlations make estimated differences across means and Sharpe Ratios noisy and hence make it
harder to reject the null of equality. Contrary to usual interpretation, here the lack of statistical
significance of an economically meaningful difference is a cause of greater rather than lesser concern
because it reflects a low correlation of the series. Therefore, pairs with either correlation confidence
intervals in low ranges or statistically significant differences should be considered to have
statistically large discrepancies (note these results should be interpreted subject to the usual
qualifications about multiple testing)23.

4.2. Comparison of our international factors to previously available international factors

Tables 3 and 4 also allow a comparison of our factors created according to our guidelines in Sections 2
and 3 with those of other researchers discussed in the previous subsection. Focusing on samples covering
the same dates, in Table 4 we observe excellent agreement of our factors and French (2017) factors in the
U.S., where the largest mean return difference is just 5 bps (with a standard error of 28 bps for the
size factor) and the lowest correlation is 91% (for momentum). This is good news for researchers wishing
to study the U.S. equity market with access to TDS but not CRSP data. At least for the purpose of
creating factors, researchers interested in analyses after 1984 can reliably work with U.S. TDS data
processed according to our guidelines if they do not have access to CRSP-Compustat data.

On the other hand, there are significant differences between our international factors and those of
other researchers. For example, our U.K. HML factor has a statistically significant 24 bps discrepancy to
that of Gregory et al. (2013) and just 61% correlation with that of Schmidt et al. (2019). However, these
discrepancies cannot be attributed entirely to data issues, since there are also differences in factor
construction methodologies across researchers and there may be good reason for these differences across
studies with different goals24. To isolate the impact of our guidelines, in Panel A of Table 5 we compare
our Global (G91, G23) and our Global excluding U.S. factors (GxUS91, GxUS23) to the corresponding factors
of French (2017), whose factor construction methodology we follow very closely so that any differences in
this table are likely driven by data differences.

23 In unreported results we show that the hypothesis that the Sharpe Ratio of returns is equal across all
researchers’ factors was rejected at the 5% level in 13 out of the 16 factors and regions we considered,
including rejection for U.S. and U.K. market returns.
24 For reference purposes, in Internet Appendix Table B.7 we summarize factor implementation details of
different research teams listed in Internet Appendix Table B.6.
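The point estimates behind the matched-sample comparisons of Table 4 can be sketched as follows. The example restricts two factor series to their common dates and computes the mean difference, the Pearson correlation and the Sharpe Ratio difference. It is a hedged illustration: the date-keyed dictionaries, the function names and the definition of the Sharpe Ratio as mean over sample standard deviation of the factor return are assumptions, and the Newey-West and Memmel (2003) tests reported in the table are deliberately not reproduced.

```python
from statistics import mean, stdev


def matched(a, b):
    """Restrict two {date: return} series to the dates they share."""
    dates = sorted(set(a) & set(b))
    return [a[d] for d in dates], [b[d] for d in dates]


def pearson(x, y):
    """Pearson correlation of two equally long return lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    sxx = sum((u - mx) ** 2 for u in x)
    syy = sum((v - my) ** 2 for v in y)
    return sxy / (sxx * syy) ** 0.5


def compare(a, b):
    """Point estimates only; no standard errors or hypothesis tests."""
    x, y = matched(a, b)
    return {
        "n": len(x),
        "mean_diff_x100": 100 * (mean(x) - mean(y)),
        "corr": pearson(x, y),
        "sr_diff_x100": 100 * (mean(x) / stdev(x) - mean(y) / stdev(y)),
    }


# Hypothetical monthly returns (decimals) with partially overlapping dates.
ours = {"1990-07": 0.01, "1990-08": 0.03, "1990-09": -0.02, "1990-10": 0.02}
other = {"1990-08": 0.02, "1990-09": -0.01, "1990-10": 0.01, "1990-11": 0.05}
result = compare(ours, other)
```

Only the three overlapping months enter the comparison, mirroring the restriction to the intersection of dates described in the text.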
Table 3
Descriptive statistics for publicly available factors.
We report the monthly mean multiplied by 100 (μ x100), Sharpe Ratio (SR), number of observations (#obs) and start and end date (Start, End) of market (MKT), size
(SMB), value (HML) and momentum (WML) factors for the U.S., Japan, U.K. and a Global portfolio constructed according to our guidelines (G), as well as corresponding
Fama-French factors (FF) downloaded from French (2017), AQR factors (AQR) and factors developed by Schmidt, Von Arx, Schrimpf, Wagner, and Ziegler (2019) (S+),
Karolyi, Lee and Van Dijk (2012) (KLD), Gregory, Tharyan and Christidis (2013) (GTC) and Dimson, Nagel and Quigley (2003) (DNQ). For our granular Global Factors, we
report two variants, one (G23) covering the same countries as those of French (2017) in global portfolios and one (G91) covering all 91 countries for which we have filtered
data. Table cells are left empty where a factor is unavailable. Start dates are 1st December 1984 (the date after which our data is of satisfactory quality - see the discussion
of our filter 17), or the start date of each dataset, whichever is later. End dates are the last date of our sample or the last date of each dataset available at the time of
download. All factor returns are in US dollars.
             US                                   JPN
                   G     FF    AQR     S+    KLD        G     FF    AQR     S+    KLD
MKT  μ x100     0.96   0.96   0.95   0.97   0.98     0.25   0.27   0.27   0.20  -0.05
     SR         0.22   0.22   0.21   0.21   0.23     0.04   0.05   0.05   0.03  -0.01
     #obs        370    370    370    327    180      298    298    298    263    180
     Start      DE84   DE84   DE84   DE84   JA95     DE90   DE90   DE90   DE90   JA95
     End        SE15   SE15   SE15   FE12   DE09     SE15   SE15   SE15   OC12   DE09
SMB  μ x100     0.13   0.08   0.06   0.14           -0.02  -0.01   0.02  -0.10
     SR         0.04   0.02   0.02   0.04           -0.00  -0.00   0.01  -0.03
     #obs        370    370    370    327             298    298    298    263
     Start      DE84   DE84   DE84   DE84            DE90   DE90   DE90   DE90
     End        SE15   SE15   SE15   FE12            SE15   SE15   SE15   OC12
HML  μ x100     0.18   0.19   0.14   0.22            0.63   0.41   0.51   0.54
     SR         0.06   0.07   0.05   0.07            0.15   0.15   0.20   0.18
     #obs        370    370    370    327             298    298    298    263
WML  μ x100     0.59   0.61   0.67   0.69            0.39   0.12   0.19   0.05
     SR         0.13   0.13   0.15   0.11            0.07   0.03   0.04   0.01
     #obs        370    370    370    327             298    298    298    263

             UK                                          Global
                   G    AQR     S+    KLD    GTC    DNQ      G23    G91     FF    AQR
MKT  μ x100     0.99   0.84   0.92   0.81   1.02   1.33     0.80   0.78   0.65   0.81
     SR         0.19   0.16   0.17   0.17   0.20   0.25     0.19   0.18   0.15   0.18
     #obs        370    357    307    180    367    205      370    370    303    359
     Start      DE84   JA86   DE86   JA95   DE84   DE84     DE84   DE84   JL90   NO85
     End        SE15   SE15   JN12   DE09   JN15   DE01     SE15   SE15   SE15   SE15
SMB  μ x100     0.01  -0.03  -0.10          0.29   0.15    -0.03  -0.02   0.03   0.05
     SR         0.00  -0.01  -0.03          0.07   0.03    -0.01  -0.01   0.01   0.03
     #obs        370    327    300           367    205      370    370    303    327
     Start      DE84   JL88   JL87          DE84   DE84     DE84   DE84   JL90   JL88
     End        SE15   SE15   JN12          JN15   DE01     SE15   SE15   SE15   SE15
HML  μ x100     0.61   0.40   0.49          0.42   0.51     0.40   0.45   0.33   0.32
     SR         0.15   0.12   0.16          0.09   0.12     0.16   0.18   0.14   0.15
     #obs        370    327    300           367    205      370    370    303    327
     Start      DE84   JL88   JL87          DE84   DE84     DE84   DE84   JL90   JL88
     End        SE15   SE15   JN12          JN15   DE01     SE15   SE15   SE15   SE15
WML  μ x100     1.60   1.07   1.16          1.07            0.75   0.75   0.64   0.67
     SR         0.31   0.24   0.21          0.20            0.21   0.21   0.16   0.18
     #obs        370    345    300           367             370    370    299    347
     Start      DE84   JA87   JL87          DE84            DE84   DE84   NO90   NO86
     End        SE15   SE15   JN12          JN15            SE15   SE15   SE15   SE15
Restricting attention to the same 23 country universe of French (2017)25, we notice statistically and
economically meaningful differences of the order of 13 basis points in the GxUS23 MKT returns, a
correlation of just 83% in the SMB factor and statistically significant differences in Sharpe Ratios in
all but the HML factor. Factors match slightly better for the Global portfolio (G23), reflecting the good
agreement of our factors in the U.S. discussed above (large U.S. stocks will dominate value weighted
Global portfolios and this effect is particularly strong in earlier periods where Global portfolios have
few countries in them). Stock coverage within this 23 country universe is likely an important determinant
of these differences since, as is evident in Panel B of Table 5, our 23 country Global excluding U.S.,
annually rebalanced size and value holding portfolios contain e.g., 6482 stocks on December 1990 whereas
the corresponding number reported in French (2017) data is 5667. A significant coverage gap persists
through the end of the sample used in our study.

25 French (2017) include the following markets: Australia, Austria, Belgium, Canada, Switzerland,
Germany, Denmark, Spain, Finland, France, U.K., Greece, Hong Kong, Ireland, Italy, Japan, Netherlands,
Norway, New Zealand, Portugal, Sweden, Singapore and U.S.

4.2.1. Comparison of our full country coverage international factors with other widely used international factors

These effects become even more pronounced in our factors which have full country coverage extending
beyond the countries of French (2017). Highlighting the impact of full country coverage, we see in Panel
A of Table 5 that the Global excluding U.S. WML factor correlation confidence interval drops from 94-96%
to 53-67% when moving from the same 23 country universe to the broadest 91 country universe. Our point
estimates for the mean returns of Global excluding U.S. market returns are 6 bps per month higher than
French (2017), size returns are 8 bps higher, value returns are 16 bps lower and momentum returns are 22
bps higher. Our results suggest that Global momentum is even stronger than was
Table 4
Pairwise factor comparisons on matched samples.
We report pairwise comparisons of the factor returns described in Table 3 on the intersection of dates for which both elements of each pair are available (availability dates
are reported in Table 3). Panel A covers the market and size factors and Panel B the value and momentum factors, respectively for four regions, the U.S., Japan, U.K. and
Global. In panel A.1, we report the monthly mean of the row factor minus the column factor multiplied by 100 and below it in parentheses the p-value of a two sample
mean equality test using Newey-West standard errors with 12 lags; in panel A.2, the Pearson correlation of the underlying series and below it a lower and upper bound
for its 95% confidence interval under a t-distribution with n-2 d.f.; and in panel A.3, the difference between the monthly Sharpe Ratios multiplied by 100, and below it,
in parentheses the p-value for Memmel’s (2003) two sample Sharpe Ratio equality test. In each of the panels A.1, A.2, and A.3 of panel A the upper and lower triangulars
correspond to market and size factors, respectively. Panel B reports the results for value and momentum factors in a similar way.
Panel A: MKT & SMB Returns
MKT Returns
US JPN
SMB Returns A.1 μ x100

G ’0.00 ’0.01 ’-0.01 ’-0.20 ’-0.02 ’-0.03 ’-0.04 ’0.05
’(0.89)’ ’(0.32)’ ’(0.64)’ ’(0.19)’ ’(0.20)’ ’(0.08)’ ’(0.26)’ ’(0.66)’
FF ’-0.05 ’0.01 ’-0.02 ’-0.21 ’0.01 ’-0.00 ’-0.02 ’0.07
’(0.28)’ ’(0.42)’ ’(0.59)’ ’(0.19)’ ’(0.97)’ ’(0.81)’ ’(0.67)’ ’(0.46)’
AQR ’-0.07 ’-0.02 ’-0.02 ’-0.18 ’0.03 ’0.02 ’-0.02 ’0.06
’(0.06)’ ’(0.65)’ ’(0.61)’ ’(0.27)’ ’(0.88)’ ’(0.69)’ ’(0.67)’ ’(0.57)’
S+ ’-0.03 ’0.04 ’0.05 ’-0.19 ’-0.21 ’-0.07 ’-0.12 ’0.10
’(0.54)’ ’(0.25)’ ’(0.24)’ ’(0.23)’ ’(0.34)’ ’(0.18)’ ’(0.03)’ ’(0.39)’
A.2
Correlation
G ’100 ’100 ’94 ’93 ’100 ’100 ’94 ’97
[100,100]’ [100,100]’ [93,95]’ [90,94]’ [100,100]’ [100,100]’ [92,95]’ [96,98]’
FF ’95 ’100 ’94 ’92 ’72 ’100 ’94 ’98
[94,96]’ [100,100]’ [93,95]’ [90,94]’ [66,77]’ [100,100]’ [93,95]’ [97,98]’
AQR ’96 ’94 ’94 ’92 ’68 ’88 ’94 ’98
[95,97]’ [92,95]’ [93,96]’ [90,94]’ [62,74]’ [85,91]’ [92,95]’ [97,98]’
S+ ’91 ’95 ’88 ’87 ’63 ’89 ’83 ’90
[89,93]’ [93,96]’ [86,90]’ [83,90]’ [56,70]’ [86,91]’ [79,87]’ [87,93]’
A.3 SR x100
G ’0.02 ’0.43 ’0.09 ’-6.43 ’-0.34 ’-0.40 ’-0.62 ’0.93
’(0.82)’ ’(0.03)’ ’(0.94)’ ’(0.00)’ ’(0.32)’ ’(0.10)’ ’(0.69)’ ’(0.44)’
FF ’-2.03 ’0.40 ’0.03 ’-6.54 ’0.06 ’-0.06 ’-0.19 ’1.32
(0.11)’ ’(0.07)’ ’(0.98)’ ’(0.00)’ ’(0.98)’ ’(0.86)’ ’(0.90)’ ’(0.25)’
AQR ’-2.32 ’-0.29 ’-0.09 ’-6.25 ’0.89 ’0.83 ’-0.20 ’1.10
’(0.02)’ ’(0.83)’ ’(0.95)’ ’(0.00)’ ’(0.78)’ ’(0.68)’ ’(0.90)’ ’(0.34)’
S+ ’-1.25 ’1.43 ’1.40 ’-6.78 ’-5.60 ’-2.38 ’-4.02 ’1.66
’(0.44)’ ’(0.27)’ ’(0.46)’ ’(0.01)’ ’(0.13)’ ’(0.25)’ ’(0.11)’ ’(0.48)’
MKT Returns
UK Global
SMB Returns G AQR S+ KLD GTC DNQ G23 G91 FF AQR
A.1 μ x100
G ’0.07 ’-0.03 ’-0.05 ’0.00 ’-0.04 G23 ’0.02 ’-0.06 ’-0.06
’(0.00)’ ’(0.50)’ ’(0.28)’ ’(0.98)’ ’(0.06)’ ’(0.65)’ ’(0.24)’ ’(0.27)’
AQR ’0.19 ’-0.10 ’-0.12 ’-0.06 ’-0.12 G91 ’0.01 ’-0.08 ’-0.08
’(0.30)’ ’(0.02)’ ’(0.02)’ ’(0.01)’ ’(0.00)’ ’(0.86)’ ’(0.32)’ ’(0.26)’
S+ ’0.08 ’-0.06 ’-0.03 ’0.04 ’-0.01 FF ’0.07 ’0.06 ’0.00
’(0.68)’ ’(0.62)’ ’(0.62)’ ’(0.38)’ ’(0.69)’ ’(0.22)’ ’(0.57)’ ’(0.77)’
KLD ’0.08 ’0.02 AQR ’0.10 ’0.08 ’0.02
’(0.12)’ ’(0.71)’ ’(0.09)’ ’(0.45)’ ’(0.73)’
GTC ’0.28 ’0.14 ’0.15 ’-0.04
’(0.03)’ ’(0.40)’ ’(0.42)’ ’(0.08)’
DNQ ’0.08 ’-0.17 ’0.03 ’-0.07
’(0.42)’ ’(0.52)’ ’(0.88)’ ’(0.46)’
A.2
Correlation
G ’100 ’94 ’99 ’100 ’100 G23 ’98 ’98 ’98
[100,100]’ [92,95]’ [98,99]’ [100,100]’ [100,100]’ [97,98]’ [98,98]’ [98,98]’
AQR ’67 ’94 ’99 ’100 ’100 G91 ’85 ’96 ’96
[60,72]’ [92,95]’ [98,99]’ [100,100]’ [99,100]’ [82,88]’ [95,97]’ [96,97]’
S+ ’70 ’78 ’90 ’93 ’95 FF ’89 ’78 ’100
[63,75]’ [73,82]’ [86,92]’ [92,95]’ [94,97]’ [86,91]’ [73,82]’ [100,100]’
KLD ’99 ’98 AQR ’86 ’71 ’82
[98,99]’ [97,99]’ [83,88]’ [65,76]’ [78,85]’
GTC ’89 ’69 ’61 ’100
[86,91]’ [63,75]’ [53,68]’ [100,100]’
DNQ ’96 ’56 ’71 ’95
[95,97]’ [45,66]’ [62,77]’ [93,96]’
A.3 SR x100
G ’1.28 ’0.38 ’-0.64 ’-0.03 ’-0.72 G23 ’0.39 ’-1.25 ’-0.67
’(0.00)’ ’(0.80)’ ’(0.43)’ ’(0.90)’ ’(0.04)’ ’(0.64)’ ’(0.13)’ ’(0.37)’
AQR ’3.48 ’-1.06 ’-2.50 ’-1.14 ’-2.26 G91 ’0.62 ’-1.56 ’-1.06
’(0.27)’ ’(0.47)’ ’(0.01)’ ’(0.00)’ ’(0.00)’ ’(0.76)’ ’(0.19)’ ’(0.29)’
S+ ’0.85 ’-1.38 ’-1.52 ’-0.19 ’-0.49 FF ’3.31 ’2.53 ’0.31
’(0.79)’ ’(0.62)’ ’(0.54)’ ’(0.90)’ ’(0.76)’ ’(0.09)’ ’(0.35)’ ’(0.28)’
KLD ’1.37 ’0.30 AQR ’4.82 ’3.99 ’1.14
’(0.10)’ ’(0.86)’ ’(0.02)’ ’(0.18)’ ’(0.64)’
GTC ’6.24 ’3.31 ’3.84 ’-0.82
’(0.00)’ ’(0.28)’ ’(0.29)’ ’(0.03)’
DNQ ’1.81 ’-2.45 ’1.77 ’-1.83
’(0.21)’ ’(0.64)’ ’(0.67)’ ’(0.25)’
HML Returns
US JPN
G FF AQR S+ G FF AQR S+
WML Returns B.1 μ x100

G ’-0.01 ’0.04 ’-0.01 ’0.22 ’0.12 ’0.32
        (0.86)    (0.54)    (0.85)    (0.35)    (0.57)    (0.08)
FF       0.02      0.05      0.01     -0.26     -0.10     -0.11
        (0.57)    (0.21)    (0.86)    (0.18)    (0.24)    (0.22)
AQR      0.09      0.07     -0.02     -0.20      0.07      0.01
        (0.04)    (0.05)    (0.75)    (0.34)    (0.24)    (0.93)
S+       0.12      0.10      0.06     -0.46     -0.05     -0.13
        (0.28)    (0.31)    (0.59)    (0.03)    (0.69)    (0.19)
B.2 Correlation
G        91        90        90        51        45        59
        [89,93]   [88,92]   [88,92]   [42,59]   [36,54]   [51,67]
FF       99        96        93        78        87        82
        [99,99]   [95,97]   [91,94]   [73,82]   [84,89]   [77,85]
AQR      98        99        91        79        98        77
        [98,99]   [99,99]   [89,93]   [74,83]   [98,99]   [71,81]
S+       91        93        92        72        92        95
        [89,93]   [91,94]   [91,94]   [66,78]   [90,94]   [93,96]
B.3 SR x100
G       -0.18      0.92     -0.07      0.28     -4.89      2.42
        (0.91)    (0.57)    (0.97)    (0.95)    (0.26)    (0.54)
FF      -0.32      1.10      0.52     -4.64     -5.17     -2.74
        (0.51)    (0.30)    (0.73)    (0.09)    (0.02)    (0.30)
AQR      1.54      1.86      0.31     -3.23      1.41      2.73
        (0.02)    (0.00)    (0.85)    (0.23)    (0.07)    (0.37)
S+      -1.03     -0.76     -2.04     -8.41     -1.16     -2.75
        (0.53)    (0.61)    (0.19)    (0.01)    (0.50)    (0.05)
HML Returns
UK Global
WML Returns G AQR S+ GTC DNQ G23 G91 FF AQR
B.1 μ x100
G        0.03      0.07      0.24      0.19        G23   -0.05      0.07      0.06
        (0.89)    (0.72)    (0.07)    (0.09)             (0.34)    (0.05)    (0.40)
AQR     -0.48      0.02      0.27      0.22        G91   -0.00      0.15      0.14
        (0.00)    (0.88)    (0.20)    (0.38)             (0.97)    (0.03)    (0.07)
S+      -0.29      0.15      0.21      0.21        FF    -0.08     -0.06     -0.00
        (0.11)    (0.12)    (0.34)    (0.40)             (0.27)    (0.60)    (0.96)
GTC     -0.50     -0.02     -0.20      0.14        AQR   -0.05     -0.05      0.06
        (0.00)    (0.90)    (0.36)    (0.33)             (0.61)    (0.69)    (0.46)
B.2 Correlation
G        53        61        88        90          G23    89        94        87
        [44,60]   [54,68]   [85,90]   [87,92]            [87,91]   [92,95]   [84,89]
AQR      79        75        63        61          G91    88        82        79
        [75,83]   [70,80]   [56,69]   [51,70]            [86,90]   [78,86]   [74,82]
S+       79        92        62        61          FF     95        89        94
        [74,83]   [90,93]   [55,69]   [51,70]            [94,96]   [86,91]   [93,95]
GTC      93        78        74        91          AQR    89        84        96
        [92,94]   [74,82]   [69,79]   [88,93]            [87,91]   [80,86]   [96,97]
B.3 SR x100
G       -1.37     -2.02      7.23      4.37        G23   -1.99      0.72     -0.18
        (0.72)    (0.58)    (0.00)    (0.06)             (0.26)    (0.63)    (0.93)
AQR     -6.18     -0.42      9.88      5.67        G91   -0.17      3.81      2.77
        (0.02)    (0.89)    (0.00)    (0.25)             (0.93)    (0.12)    (0.29)
S+      -5.31     -0.14     10.06      9.15        FF    -3.04     -2.66     -1.37
        (0.05)    (0.94)    (0.01)    (0.06)             (0.02)    (0.18)    (0.34)
GTC     -9.87     -3.65     -3.91      1.02        AQR   -2.65     -2.63      1.37
        (0.00)    (0.16)    (0.19)    (0.64)             (0.14)    (0.24)    (0.21)
Table 5
Comparison of global granular factors to global French Data Library factors from July 1990 to September 2015.
Panel A reports pairwise comparisons of our global and global excluding U.S. factors, against corresponding French (2017) factors when using the same 23 countries for
international analyses and when using our full 91 country universe. In panel A.1, we report the difference of the monthly mean (μ) and below it, in parentheses, the
p-value of a two sample mean equality test using Newey-West standard errors with 12 lags; in panel A.2, a lower and upper bound for the 95% confidence interval of
Pearson’s correlation coefficient under a t-distribution with n-2 d.f.; and in panel A.3, the difference between the monthly Sharpe Ratios multiplied by 100 (SR x100),
and below it, in parentheses, the p-value for Memmel’s (2003) two sample Sharpe Ratio equality test. Differences that are statistically significant at the 10% level are in
bold. Differences are always calculated by subtracting FF factors from our granular factors. Panel B reports the number of stocks used in the annually rebalanced size and
value holding portfolios on December of each year, distinguishing also between French’s (2017) 23 country universe and our filtered 91 country universe.
Panel A: Returns
                      23 countries            91 countries
                    ex US       G23         ex US       G91
A.1 μ x100
Market              -0.13      -0.06        -0.06      -0.08
                    (0.03)     (0.24)       (0.67)     (0.32)
SMB                 -0.10      -0.07        -0.08      -0.06
                    (0.20)     (0.22)       (0.55)     (0.57)
HML                  0.09       0.07         0.16       0.15
                    (0.12)     (0.05)       (0.13)     (0.03)
WML                  0.03       0.08        -0.22       0.06
                    (0.20)     (0.27)       (0.55)     (0.60)
A.2 Correlation Coefficient x100
Market               98         98           84         96
                    [98,98]    [98,98]      [80,87]    [95,97]
SMB                  83         89           73         78
                    [79,86]    [86,91]      [67,77]    [73,82]
HML                  86         94           69         82
                    [83,89]    [92,95]      [62,74]    [78,86]
WML                  95         95           60         89
                    [94,96]    [94,96]      [53,67]    [86,91]
A.3 SR x100
Market              -2.49      -1.25        -1.76      -1.56
                    (0.00)     (0.13)       (0.45)     (0.19)
SMB                 -4.37      -3.31        -3.26      -2.53
                    (0.07)     (0.09)       (0.28)     (0.35)
HML                  1.68       0.72         3.44       3.81
                    (0.44)     (0.63)       (0.29)     (0.12)
WML                  2.16       3.04        -9.16       2.66
                    (0.07)     (0.02)       (0.02)     (0.18)

Panel B: Number of stocks
         French Data Library      23 countries        91 countries
year      ex US     Global       ex US      G23      ex US      G91
1990       5677       8144        6482    10549       7887    11954
1991       6380       9480        7286    11649       9059    13422
1992       7083      10561        7391    12155       9526    14290
1993       7703      11705        7994    13256      10647    15909
1994       8257      12727        8375    14049      11644    17318
1995       9185      13973        9104    15152      13027    19075
1996       9698      14735        9724    16361      14374    21011
1997      10202      15504       10220    17250      15003    22033
1998      10473      15640       10330    17478      15224    22372
1999      10664      15554       10673    18210      16781    24318
2000      10938      15698       11163    18488      17351    24676
2001      11607      16178       10867    18046      15378    22557
2002      11585      15760       10871    18059      18025    25213
2003      11342      15341       11164    18414      18455    25705
2004      11339      15245       11583    18900      19648    26965
2005      11719      15691       12113    19432      21150    28469
2006      12146      16091       12671    19996      22145    29470
2007      12719      16574       13021    20295      23333    30607
2008      12939      16725       11942    18925      22427    29410
2009      12429      15942       12024    18985      22670    29631
2010      11843      15162       12045    18866      24422    31243
2011      11755      14980       12582    19201      25779    32398
2012      11581      14726       12307    18781      25748    32222
2013      11266      14305       12125    18579      27384    33838
2014      11229      14306       12261    18762      28037    34538
previously appreciated when the sample is extended to include countries that have not been previously studied26. Stock coverage improvements when moving from 23 to 91 countries are very significant - for example, in 1990 our 23 country universe holding portfolios contain 6482 stocks and our 91 country universe holding portfolios contain 7887 stocks.
A comparison among factors with different country coverage is meaningful to the extent that what we are ultimately interested in understanding is the behavior of factors with as wide country coverage as possible, and in the context where other researchers have excluded countries which we have included to avoid dealing with tedious data issues that our methodology addresses. Indeed, data pre-processing decisions for international data are necessarily intertwined with decisions about which countries have a sufficient quality and quantity of data and with decisions about what sample start dates should be used. For example, Griffin et al. (2010) exclude countries in which their filtered data contain fewer than 50 stocks and this leads them to use a sample with just 40 countries while excluding important countries such as Israel from their analysis (with our filter 6 we exclude countries with fewer than 20 stocks but include 91 countries in total).
One plausible explanation for why our global factors include many more countries than the factors used in previous research is that our guidelines improve data quality and intracountry stock coverage, so this allows us to include countries other researchers have excluded because in their data these countries contained too few stocks with data of sufficient quality. If this is the case, then our Global excluding U.S. factors are likely better proxies for the ’true’ but unobservable factors they aim to approximate. This holds especially when aiming to create market returns (e.g., for the Global excluding U.S. market), since achieving maximum coverage is at the very core of the concept of this factor.

4.3. Impact of specific aspects of our guidelines

4.3.1. Max coverage data extraction
The broadest country coverage we are aware of in previous research is that of Griffin et al. (2010) who cover 56 countries, which still falls considerably short of our 91 countries. Lee (2011) uses 50 markets and also reports stock coverage, with 58,300 unfiltered and 30,069 filtered stocks, which is less than half of our 204,337 unfiltered and 76,468 filtered stock sample. This reflects the excellent coverage TDS has if data is extracted using our guidelines27. In the previous section we reported a significant impact on global factors when expanding coverage from French’s (2017) 23 country universe to our 91 country universe. We therefore now turn attention to the impact of improved coverage within a country.
In Table 6, we have collected stock coverage information for Japan from all research we are aware of where such information is made available. We have chosen to focus on Japan, since this is one of the most frequently studied international markets, so a comparison is possible across a broad universe of studies. In Panel A, we

26 In Internet Appendix Table C.1 we extend the comparison of Table 4 to all regions covered by French (2017) and observe that including more countries in the Asia Pacific region changes factors in that region dramatically - for example, the confidence interval for the correlation of the granular and the French (2017) Asia Pacific region HML factor is [23,43]%.
27 Our Internet Appendix Table C.2 reports the number of stocks in our filtered sample for U.S., U.K., Japan and all regions at several dates in our sample from December 1984 to September 2015.
Table 6
Factor coverage comparisons.
Panel A reports the number of filtered stocks in Japan value and size holding portfolios on various year end dates, in our sample and in the sample of French’s (2017) Data Library. Panel B compares the overall number of
stocks (across all dates in the sample) in the research universe of various papers that mention this number, against the number of stocks in our sample over the same period. Panel C reports a comparison of our factors to
French’s (2017) Data Library factors, when applying an identical implementation of size breakpoints. We report the monthly mean of our factors minus French’s (μ) and below it, in parentheses, the p-value of a two
sample mean equality test using Newey-West standard errors with 12 lags; a Pearson correlation and below it a lower and upper bound for its 95% confidence interval under a t-distribution with n-2 d.f.; and the difference
between the monthly Sharpe Ratios multiplied by 100 (SR x100), and below it, in parentheses, the p-value for Memmel’s (2003) two sample Sharpe Ratio equality test. Differences that are statistically significant at the 10%
level are in bold.
Panel A: Coverage Comparison on five year intervals
                      1990    1993    1998    2003    2008    2013
Granular coverage     1766    2583    3256    3622    3862    3536
French (2017)         1216    1837    2443    2666    2845    2564

Panel B: Coverage Comparison on matched samples
Sample                                                    Dates             # stocks    # stocks in granular universe (matched sample dates)
Asness and Frazzini (2013)                                Dec83-Dec11       4952        4953
Hou, Karolyi and Kho (2011)                               Jul81-Dec03^28    2844        4137
Karolyi and Wu (2011)                                     Nov89-Dec10       4301        4894
Karolyi, Lee and Van Dijk (2012)                          Dec95-Dec09       3309        4813
Chui, Titman and Wei (2010)                               Dec96^29          2374        2152
Lee (2011)                                                Jan88-Dec07       3106        4778
Schmidt, Von Arx, Schrimpf, Wagner, and Ziegler (2019)    Dec84-Dec11       4944        4951

Panel C: French (2017) Factor Comparison on matched Implementation Details
                                 MKT          SMB        HML        WML
μ x100                          -0.02        -0.01       0.18       0.03
                                (0.20)       (0.86)     (0.00)     (0.86)
Correlation Coefficient x100     100          95         90         96
                                [100,100]    [93,96]    [88,92]    [95,97]
SR x100                         -0.34        -0.37       7.76       0.40
                                (0.32)       (0.78)     (0.00)     (0.39)

28 The number of stocks mentioned in this study includes stocks with at least one of the following data items available: book-to-market equity, cash flow-to-price, dividend-to-price, earnings-to-price, long-term debt-to-common equity and size, while we have no such requirement in our universe, except that the stocks have size (market cap) available in the filtered sample.
29 The number of stocks used in this study refers to a universe including both domestic and foreign stocks, whereas we only include the former.
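
The three statistics reported in the comparison tables (difference in means, correlation with a confidence interval, difference in Sharpe Ratios) can be illustrated with a minimal sketch. It is not the authors' code. All function names are hypothetical, the mean test is applied to the paired difference series with a Bartlett-weighted (Newey-West) long-run variance and a normal approximation, the Sharpe Ratio test uses the Jobson-Korkie statistic with the correction attributed to Memmel (2003), and the correlation interval uses the Fisher z approximation as a stand-in for the interval construction described in the table notes.

```python
import math
from statistics import NormalDist, mean


def newey_west_mean_test(a, b, lags=12):
    """Difference in means of two paired series and a two-sided p-value.

    Assumption: the test is run on d_t = a_t - b_t with a Bartlett-kernel
    long-run variance. Fails (division by zero) if d_t is constant.
    """
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    d_bar = mean(d)
    e = [x - d_bar for x in d]
    lrv = sum(v * v for v in e) / n  # lag-0 autocovariance
    for lag in range(1, min(lags, n - 1) + 1):
        gamma = sum(e[t] * e[t - lag] for t in range(lag, n)) / n
        lrv += 2.0 * (1.0 - lag / (lags + 1.0)) * gamma
    z = d_bar / math.sqrt(lrv / n)
    return d_bar, 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def _moments(a, b):
    """Sample means, standard deviations and covariance of two paired series."""
    n = len(a)
    mu_a, mu_b = mean(a), mean(b)
    var_a = sum((x - mu_a) ** 2 for x in a) / (n - 1)
    var_b = sum((y - mu_b) ** 2 for y in b) / (n - 1)
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / (n - 1)
    return n, mu_a, mu_b, math.sqrt(var_a), math.sqrt(var_b), cov


def memmel_sharpe_test(a, b):
    """Difference in Sharpe Ratios (a minus b) and a two-sided p-value."""
    n, mu_a, mu_b, sd_a, sd_b, cov = _moments(a, b)
    theta = (2.0 * sd_a**2 * sd_b**2
             - 2.0 * sd_a * sd_b * cov
             + 0.5 * mu_a**2 * sd_b**2
             + 0.5 * mu_b**2 * sd_a**2
             - (mu_a * mu_b / (sd_a * sd_b)) * cov**2) / n
    z = (sd_b * mu_a - sd_a * mu_b) / math.sqrt(theta)
    return mu_a / sd_a - mu_b / sd_b, 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def pearson_with_ci(a, b, level=0.95):
    """Pearson correlation with a Fisher z interval (an approximation chosen here)."""
    n, _, _, sd_a, sd_b, cov = _moments(a, b)
    r = cov / (sd_a * sd_b)
    half = NormalDist().inv_cdf(0.5 + level / 2.0) / math.sqrt(n - 3)
    z = math.atanh(r)
    return r, math.tanh(z - half), math.tanh(z + half)
```

Each function takes two equally long lists of monthly returns in decimal units; multiplying the returned mean and Sharpe Ratio differences by 100 gives the scale used in the tables.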
compare our Japanese sample to that of French (2017) on various dates and find that we have 30-40% greater coverage. In Panel B, we see some researchers using significantly fewer stocks than us, while others have a similar number. This latter group likely reflects weaker filtering since our understanding is that all these studies extract data based on constituent lists with limited coverage. In Panel C, we compare our factors for Japan, when applying the exact implementation details of French (2017). Note the version of our Japanese factors analyzed in this table differs slightly from that used in other tables in that here we use French (2017) breakpoint methodology for Japan rather than our methodology as developed in Section 3.4. It is important to note that the correlation between all factors increases significantly relative to the results of Table 4, but statistically significant differences persist after accounting for differences in breakpoint methodologies, which further suggests differences in coverage are significant.
In sum, our approach to extracting data most likely has a significant impact both on the analysis of specific countries and on which countries are included in global portfolios. We note that our max coverage approach will likely have especially significant impact on studies that focus on smaller stocks and/or returns of equally weighted portfolios.

4.3.2. Impact of variations to our guidelines
In section 3, we discussed the percentage of the raw sample filtered by each of our filters as reported in Table 1. Some of these filters remove a large fraction of the data indicating salient issues in TDS’ data. To the extent that researchers do not apply some version of these highly active filters, it is natural that their factors will be significantly different.
In order to understand more deeply the impact of specific features of our guidelines, we have created variants of all our factors in which certain features have been dropped or modified. We have studied 19 variants of our benchmark guidelines, summarized in Table 7, analyzing the impact of specific features of our guidelines (1) relative to no data processing or filtering; (2) relative to using filters previously used in the literature; (3) relative to a version of our guidelines applied to monthly rather than daily data; (4) relative to various modifications in specific filters which may be of particular interest; and (5) relative to alternative designs for our breakpoints in holding portfolios. We analyze each of our four factors for each of our 19 filtering and design variants and in Table 8 report a comparison of each factor variant with our baseline proposed factor and the corresponding French (2017) factor. For each of these 19 variants, we focus on a factor affected very significantly, mostly restricting the analysis to U.S. and Japan, though we report U.K. results for one variant where all factors are very similar to those from our baseline guidelines in both the U.S. and Japan. This provides an indication of the importance of each aspect of our guidelines studied below30.

4.3.2.1. Impact relative to no filtering. We study four variants of no filtering as follows.
N1. Here we apply none of our stock or stockday filters and use data extracted at default two decimal place accuracy. We follow French (2017) procedures in constructing factor holding portfolios but use the median size of all stocks for the size breakpoints and use extracted data with TDS’ default two decimal place accuracy.
N2. This is similar to the previous filter, except that we apply our nonsense value filter, we use our size breakpoints and data extracted using our maximum accuracy approach.
N3. This is similar to the previous filter, except that we do not use data extracted using our maximum accuracy approach, while we do apply the non-common stock filter.
In the first and the third cases we see the impact on the U.S. value factor is devastating, leading to extremely low correlations (1 to 13%) to both our factor and the French (2017) factor. In the second case, correlations increase dramatically to 88% but there remains a large and statistically significant difference in mean returns among factors. Evidently, these filters are absolutely essential.
N4. In this case, we apply just our static stock filters and our penny stock filter. The results are similar to the previous case, suggesting that applying a small group of fairly standard filters is not sufficient to deal with problems in the data.

4.3.2.2. Impact relative to previous research. The variants below examine the contribution of our filters beyond what is achievable based on relatively standard filters applied on monthly data. The aim is to see to what extent daily filters are really useful since it may be worth considering implementing our guidelines except for daily filters. Considering how tedious daily filters are to implement, we acknowledge that they need to surpass a high usefulness hurdle and our goal here is to quantify this usefulness.
P1. We apply the filters proposed by Ince and Porter (2006) on monthly data where we have made every effort to implement them as closely as possible except that we use our static filters, to focus our comparison on issues beyond the use of static information.
P2. We refine the monthly data filters used in the previous variant, applying all our guidelines but using monthly data only. Our original filters which require daily data are applied on monthly data, i.e., we require that less than 98% of monthly returns have the same sign and that less than 95% of monthly observations are zero, while our volatility filters are applied on monthly returns; our filter for stocks with few observations is thresholded at six months rather than 120 days; our staleness filter is implemented as dropping the third and subsequent consecutively constant return index observations; and our delisting filter removes stock months after one month. The impact of adding a monthly version of our filters on the quality of the U.S. size factor is ambiguous relative to using standard monthly filters. This is not too surprising considering that some of our filters were designed to be used with daily data.
P3. This variant augments the previous one by applying the three filters which we have flagged as requiring daily data. The incremental impact of these filters is significant relative to filters that use only monthly data, reducing discrepancies to both the benchmark and our baseline U.S. size factor by around 10 bps relative to what is achievable with variant P2 which uses only monthly data. Correlations also increase by approximately 10%. The impact of a daily implementation relative to a monthly implementation of our filters can also be gauged by comparing the filtering rates in the ’US daily’ and ’US monthly’ columns of Table 1. Evidently, the implausibility filter and the few observations filter underfilter when applied to monthly data, whereas the outlier error filter overfilters when applied to monthly data relative to its effect on daily data, in agreement with the stock level examples we have provided when discussing these filters.
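
The monthly versions of the stock-level filters described for variant P2 can be sketched as follows. This is an illustrative reading of the text rather than the authors' implementation: the function names are hypothetical, missing observations are represented by `None`, the same-sign share is taken over non-missing returns, and the monthly volatility and delisting filters are omitted.

```python
def same_sign_share(returns):
    """Largest share of non-missing returns that are all positive or all negative."""
    obs = [r for r in returns if r is not None]
    if not obs:
        return 0.0
    positive = sum(1 for r in obs if r > 0)
    negative = sum(1 for r in obs if r < 0)
    return max(positive, negative) / len(obs)


def zero_share(returns):
    """Share of non-missing returns that are exactly zero (1.0 if none are observed)."""
    obs = [r for r in returns if r is not None]
    if not obs:
        return 1.0
    return sum(1 for r in obs if r == 0) / len(obs)


def passes_monthly_stock_filters(returns, max_same_sign=0.98,
                                 max_zero=0.95, min_months=6):
    """Keep a stock only if it clears the P2-style monthly thresholds.

    Assumed reading of the text: at least six monthly observations, less
    than 98% of returns with the same sign, less than 95% zero returns.
    """
    obs = [r for r in returns if r is not None]
    if len(obs) < min_months:
        return False
    if same_sign_share(returns) >= max_same_sign:
        return False
    if zero_share(returns) >= max_zero:
        return False
    return True


def drop_stale_observations(return_index):
    """Blank out the third and later values in any run of identical index values."""
    cleaned = list(return_index)
    run_length = 1
    for t in range(1, len(return_index)):
        current, previous = return_index[t], return_index[t - 1]
        if current is not None and current == previous:
            run_length += 1
        else:
            run_length = 1
        if run_length >= 3:
            cleaned[t] = None
    return cleaned
```

For example, `drop_stale_observations([100, 100, 100, 100, 101, 101])` keeps the first two observations of the constant run and blanks the third and fourth, while a stock whose twelve monthly returns are all positive is rejected by `passes_monthly_stock_filters`.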

4.3.2.3. Impact of special interest filters. Groups of filters
Here we study the impact of not using certain filters which form natural groups.

30 Internet Appendix Tables C.3 and C.4 present the descriptive data and the differences between all 23 variants, as well as a benchmark for U.S., U.K. and Japan. For the U.S. we use FF factors as a benchmark, while for U.K., AQR.
Table 7
Summary of variants of our guidelines.
Our proposed methodology is compared to 19 variants, summarized in separate columns, where each row indicates whether a particular feature of our methodology is applied. Blank cells mean that the feature of our
methodology listed in the cell’s row is applied in each variant’s column. ’N’ means that it is not applied. Where a feature of our methodology is applied but in modified form, we succinctly summarize the modification in the
corresponding cell and discuss in greater length in the main body of the paper.
Comparison Type: No Filtering (N1 N2 N3 N4), Previous Research (P1 P2 P3), Block Modification (B1 B2 B3), Single Modification (S1 S2 S3 S4 S5), Holding Portfolio Modification (H1 H2 H3 H4)

Panel A
A.1 Filters for stocks based on static information
  Significantly enhanced
   1  non-common stocks (text strings)                          N N N
   2  cross listed stocks                                       N N N N
   3  non-common stocks (duplicate LOC)                         N N N N
  Standard
   4  non-domestic headquarters                                 N N N N
   5  non-domestic currency                                     N N N N
   6  small countries                                           N N N N
A.2 Filters for stocks based on return index information
  Original requiring daily data
   7  implausibility (>98% returns same sign)                   N N N N N daily N
   8  few observations (%-wise) (>90% missing or >95% zero)     N N N N N daily N
  Significantly enhanced using daily data
   9  high daily volatility (>0.4)                              N N N N N N N [N, remove r>10k%, r<-99%]
  10  low daily volatility (<10^-6)                             N N N N N N N
  Standard
  11  return index data unavailable
  12  few observations (<120 days)                              N N N N N 6m 6m N
A.3 Filters for stockdays
  Significantly enhanced using daily data
  13  stocks no longer traded (>10 days at end)                 N N N N 1m 1m N
  Original requiring daily data
  14  staleness (>30 days consecutively)                        N N N N N 2m daily N
  Significantly enhanced using daily data
  15  outlier errors (>100% & <-50%)                            N N N N [300% & -70%] N N
  Original
  16  holidays (<0.5% stocks available)                         N N N N N N
  Significantly enhanced
  17  survivorship biased or incomplete dividends
  Standard
  18  capital adjustment inconsistencies (<5bps dif)            N N N N N N
  19  nonsense values (p<=0)                                    N N N

Panel B.1
Filters for stocks from investment universe of factors on each investment date
  Original
  20  book-to-market staleness
  Significantly enhanced
  21  penny stocks (20% lowest prices)                          N N N [<1 local currency] N
B.2 Variants of holding portfolios design
  Original (in international data)
  22  large size breakpoint (50% of main exchange)              [50% all] [50% all] [50% NYSE]
  Standard
  23  do not require Worldscope BM                              N
  24  require market cap

Panel C
C.1 Data Processing
  Original
  25  our data refinements (own BM, own adj. price, max precision)   N N N
  Standard
  26  use all exchanges                                         [only main]

Table 8
Factor returns for variants of our guidelines.
We report a comparison of 19 variants to our baseline factors and to benchmark factors (French (2017) Data Library factors for U.S. and Japan and AQR factors for the
U.K.). We use the same benchmark factor for similar groups of variants and the relevant benchmark with its mean and Sharpe Ratio is reported above each variant or
group of variants. All means are reported in percent terms. For each variant, we report the mean of the variant minus the comparison factor and below it, the p-value
of a two sample mean equality test using Newey-West standard errors with 12 lags. The SR column, notes a similar difference between monthly Sharpe Ratios multiplied
by 100 and below it, the p-value for Memmel’s (2003) two sample Sharpe Ratio equality test. The final column in each comparison reports the Pearson correlation of the
underlying series and below it a lower and upper bound for its 95% confidence interval under a t-distribution with n-2 d.f.. Bold denotes statistical significance at the 10%
level. All differences refer to factor in a row minus factor in a column.
Comparison Type        Comparison to our factors          Comparison to benchmark
Market-Factor          mean      SR        correlation    mean      SR        correlation
No Filtering
US-HML                 0.18      0.06                     0.19      0.07
N1                     4.99     -0.12      2              4.98     -0.30      1
                      (0.24)    (0.98)    [-9, 12]       (0.24)    (0.95)    [-9, 12]
N2                     0.22      6.85      88             0.21      6.85      82
                      (0.01)    (0.00)    [85, 90]       (0.03)    (0.00)    [78, 85]
N3                     1.02      0.12      15             1.01     -0.05      13
                      (0.28)    (0.98)    [5, 25]        (0.28)    (0.99)    [3, 23]
N4                     1.02      0.11      15             1.01     -0.06      13
                      (0.28)    (0.98)    [5, 24]        (0.28)    (0.99)    [3, 23]
Previous Research
US-SMB                 0.13      0.04                     0.08      0.02
P1                    -0.11     -3.95      99            -0.06     -1.92      94
                      (0.00)    (0.00)    [99, 99]       (0.24)    (0.14)    [93, 95]
P2                     0.11      2.89      88             0.16      4.92      84
                      (0.21)    (0.11)    [85, 90]       (0.12)    (0.02)    [80, 86]
P3                     0.01      0.54      100            0.06      2.57      95
                      (0.46)    (0.11)    [99, 100]      (0.21)    (0.03)    [94, 96]
Block Modification
US-WML                 0.59      0.13                     0.61      0.13
B1                     0.20      4.17      97             0.18      4.49      97
                      (0.00)    (0.00)    [96, 97]       (0.01)    (0.00)    [96, 97]
B2                    -1.11    -15.24      14            -1.13    -14.92      14
                      (0.42)    (0.00)    [4, 24]        (0.42)    (0.00)    [4, 24]
B3                     0.14      1.90      96             0.12      2.22      96
                      (0.02)    (0.08)    [95, 97]       (0.11)    (0.04)    [95, 96]
Single Modification
US-WML                 0.59      0.13                     0.61      0.13
S1                    -0.10     -1.48      98            -0.12     -1.16      97
                      (0.01)    (0.04)    [98, 98]       (0.04)    (0.17)    [97, 98]
US-MKT                 0.96      0.22                     0.96      0.22
S2                     4.87    -15.31      2              4.87    -15.28      2
                      (0.31)    (0.00)    [-9, 12]       (0.31)    (0.00)    [-9, 12]
US-HML                 0.18      0.06                     0.19      0.07
S3                     0.19      5.36      89             0.18      5.18      82
                      (0.05)    (0.00)    [87, 91]       (0.06)    (0.02)    [79, 85]
JPN-MKT                0.25      0.04                     0.27      0.05
S4                    -0.25     -4.47      95            -0.27     -4.80      95
                      (0.01)    (0.00)    [94, 96]       (0.01)    (0.00)    [93, 96]
US-HML                 0.18      0.06                     0.19      0.07
S5                     0.11      2.97      93             0.10      2.80      85
                      (0.07)    (0.03)    [92, 95]       (0.19)    (0.16)    [82, 88]
Holding Portfolios Modification
US-WML                 0.59      0.13                     0.61      0.13
H1                    -0.10     -2.25      99            -0.12     -1.93      99
                      (0.01)    (0.00)    [99, 99]       (0.01)    (0.00)    [98, 99]
UK-WML                 1.60      0.31                     1.07      0.24
H2                    -0.27     -7.20      98             0.19     -1.32      82
                      (0.00)    (0.00)    [97, 98]       (0.20)    (0.58)    [78, 85]
JPN-SMB               -0.02     -0.00                    -0.01     -0.00
H3                     0.92      9.83      67             0.89      7.64      52
                      (0.06)    (0.03)    [58, 75]       (0.11)    (0.22)    [40, 62]
US-SMB                 0.13      0.04                     0.08      0.02
H4                     0.25      9.46      89             0.30      5.77      84
                      (0.00)    (0.00)    [87, 91]       (0.00)    (0.28)    [81, 87]
B1. We apply all our guidelines, while excluding (not applying) our stock filters based on static information. The impact on mean returns of the U.S. momentum portfolio is statistically and economically significant (20 bps per month).
B2. We exclude our stock filters based on return index information. The impact is huge, leading to very low correlations of around 14% as well as very large differences in mean returns.
B3. We exclude all our stockday filters. The impact on mean returns of the U.S. momentum portfolio is statistically and economically significant (20 bps per month).

Single filters
Here we study the impact of excluding a single filter, for five exclusion cases that we believe are particularly interesting.
S1. We use extracted data at default two decimal place precision rather than our max precision approach.
The impact of our maximum precision extraction approach is statistically and economically significant on the U.S. momentum factor, affecting it by around 10 bps. It is worth emphasizing here that downloading default TDS data has a mechanical impact on the momentum factor because rounding is not distributed randomly across legs of the factor. Specifically, rounding is more likely to be concentrated in loser stocks whose return indexes have low values and are therefore subjected to more rounding.
S2. This variant does not apply our volatility filters. Evidently, this creates nonsense returns even for the U.S. market series.
S3. We replace our volatility filter with an alternative that imposes trimming. This is much better than not applying any volatility filter at all as per the previous variant, but it still leads to a statistically significant discrepancy of around 18 bps to both the French (2017) U.S. value factor and ours.
S4. We do not apply our filter for outliers. This leads to a significant discrepancy of around 20 bps to our baseline Japan market factor and an even larger discrepancy to the French (2017) factor.
S5. This variant excludes our penny stock filter, leading to a significant discrepancy of around 25 bps to our baseline Japan market factor.

4.3.2.4. Holding portfolio filters.
H1. Here we eliminate stockmonths for which TDS’ market to book variable (datatype MTBV) is unavailable on its investment date, regardless of whether this information is necessary to create a holding portfolio. Note that our baseline factors only eliminate stockmonths from holding portfolios if they are missing data required to create the holding portfolio (e.g., when creating a momentum factor, a missing MTBV value would not cause a stockmonth to be excluded). This leads to a statistically significant 10 bps impact on the U.S. momentum factor away from the direction of the French (2017) factor.

4.3.2.5. Impact of holding portfolio breakpoint design.
H2. In this variant, we replace our size breakpoint rule with a breakpoint that labels as Big the 20% biggest market cap stocks on each investment date. This works well for U.S. and Japan factors, so here we flag the U.K. momentum factor as one case in which this approach leads to significant differences to both benchmark factors and the factor that implements our recommended breakpoint design.
H3. This variant replaces our size breakpoint rule with a breakpoint that labels as Big the stocks whose size is above NYSE’s median size on each investment date (French, 2017, U.S. size breakpoint). This is clearly unworkable as it leads to size factors that are very far removed from the French (2017) size factor in Japan as well as the one obtained with our baseline breakpoint design.
H4. Uses all stocks from the main exchange only. This variant also seems unworkable, leading to large discrepancies to our baseline and French (2017) size factors in the U.S.

4.3.2.6. Summary. In summary, we find that in all cases studied in this subsection, all aspects of our guidelines have a significant impact on one of the four factors analyzed in the U.S. or Japan and the impact of some filters is very large. In our Internet Appendix31, we report results according to which there is significant variation in the behavior across factors and across regions. The evidence reported here suggests that while application of all our guidelines may not be essential for all applications, researchers who would like to create high-quality data that can be consistently used across a broad range of applications would be well advised to follow all our guidelines, including certain original aspects.

31 See our Internet Appendix Tables C.3 and C.4.

5. Conclusion
We introduce detailed guidelines for extracting data from TDS and for constructing reliable filtered data which we are able to do for 91 countries. We attempt to motivate each of our filters and to provide evidence for why our guidelines are helpful. To support implementation of our guidelines, we provide detailed documentation for our data extraction approach, code to implement our filters as well as our factor returns.
In doing so, we highlight the importance of extracting data with maximum precision and of limiting attention to country-specific recent dates to avoid survivorship bias and to ensure dividend data is available. We also provide evidence that using filters based on daily data can significantly improve the quality of TDS data and that our innovations to standard TDS filtering processes will filter a significant fraction of raw data and will have a significant impact on standard factor series. Undoubtedly, some data problems will remain even after applying our guidelines but hopefully the code and detailed data extraction documentation we provide will make our analysis easily replicable and will spur additional refinements.
Our guidelines will likely be particularly useful for research on equal weighted portfolios, small stocks, countries with few stocks or analyses that require unusual datatypes (e.g., to construct factors or characteristics involving detailed accounting data) which may have more severe versions of the problems (e.g., survivorship bias) that we have observed even in fairly standard datatypes. In any case, the differences between our factors and standard factors, especially in global excluding U.S. research, are such that many quantitative estimates that depend on factor returns, such as alphas of international trading strategies, would inevitably be affected by the data used to construct the factors. Re-evaluating standard factors using our data, we find global market, size and value average returns that are slightly lower than those of Fama and French, but that momentum returns are slightly larger, whether we focus on their 23 country universe or our expanded 91 country universe. By making our international factor data available, we hope to improve the accuracy of future research especially in regions with more problematic data.
It is worth emphasizing that TDS data processed using our guidelines seems to be a viable and widely available option for studying U.S. factors, since our analyses deliver almost identical results to those obtained from CRSP-Compustat data. Our guidelines can also be used to construct usable data for unexplored countries that have remained outside the sample typically explored by researchers, thus making out-of-sample validation (e.g., using the cross-sectional approach of Lu et al., 2017) feasible in a range of applications.

Author Statement
Conrad Landis: Conceptualization, Methodology, Software, Validation, Formal Analysis, Investigation, Data Curation, Visualization
Spyros Skouras: Conceptualization, Methodology, Resources, Writing, Supervision, Project administration
Supplementary materials
Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.jbankfin.2021.106128.

References
Andrikopoulos, P., Daynes, A., Latimer, D., Pagas, P., 2007. UK Market, Financial Databases and Evidence of Bias. Working Paper. De Montfort University, Leicester. Occasional Paper Series, Paper no.79.
AQR, C., 2017. Data library. https://www.aqr.com/library/data-sets.
Asness, C., Frazzini, A., 2013. The devil in HML details. J. Portf. Manag. 39, 49–68.
Bekaert, G., Harvey, C., Lundblad, C., 2007. Liquidity and expected returns: lessons from emerging markets. Rev. Financ. Studies 20, 1783–1831.
Chui, A.C.W., Titman, S., Wei, K.C.J., 2010. Individualism and momentum around the world. J. Finance 65, 361–392.
Davis, J.L., 1996. The cross section of stock returns and survivorship bias: evidence from delisted stocks. Q. Rev. Econ. Finance 36, 365–375.
Dimson, E., Nagel, S., Quigley, G., 2003. Capturing the value premium in the United Kingdom. Financ. Anal. J. 59, 35–45.
Griffin, J.M., Kelly, P.J., Nardari, F., 2010. Do market efficiency measures yield correct inferences? a comparison of developed and emerging markets. Rev. Financ. Stud. 23, 3225–3277.
Hou, K., Karolyi, G.A., Kho, B., 2011. What factors drive global stock returns. Rev. Financ. Stud. 24, 2527–2574.
Ince, O.S., Porter, S.B., 2006. Individual equity return data from Thomson Datastream: handle with care! J. Financ. Res. 24, 463–479.
Karolyi, G., Wu, Y., 2012. The role of investability restrictions on size, value, and momentum in international stock returns. SSRN Electron. J. doi:10.2139/ssrn.2043156.
Karolyi, G.A., 2016. Home bias, an academic puzzle. Rev. Finance 20, 2049–2078.
Karolyi, G.A., Lee, K., Van Dijk, M.A., 2012. Understanding commonality in liquidity around the world. J. Financ. Econ. 105, 82–112.
Lee, K., 2011. The world price of liquidity risk. J. Financ. Econ. 99, 136–161.
Lu, X., Stambaugh, R.F., Yuan, Y., 2017. Anomalies Abroad: Beyond Data Mining. The Wharton School Working Paper.
Memmel, C., 2003. Performance hypothesis testing with the Sharpe ratio. Finance Lett. 1, 21–23.
Schmidt, P.S., Von Arx, U., Schrimpf, A., Wagner, A.F., Ziegler, A., 2017. Size and Momentum Profitability in International Stock Markets. Swiss Finance Institute Working Paper 15-29.
Schmidt, P.S., Von Arx, U., Schrimpf, A., Wagner, A.F., Ziegler, A., 2019. Common risk
Fama, E.F., French, K.R., 2012. Size, value, and momentum in international stock re-
factors in international stock markets. In: Financial Markets and Portfolio Man-
turns. J. Financ. Econ. 105, 457–472.
agement, 33. Springer; Swiss Society for Financial Market Research, pp. 213–241.
Fama, E.F., French, K.R., 2017. International tests of a five-factor asset pricing model.
Ulbricht, N., Weiner, C., 2005. Worldscope Meets Compustat: a Comparison of Fi-
J. Financ. Econ. 123, 441–463.
nancial Databases. Humboldt-Universitat zu Berlin Working Paper 2005-064.
French, R.K., 2017. Data library. http://mba.tuck.dartmouth.edu/pages/faculty/ken.
french/data_library.html.
Gregory, A., Tharyan, R., Christidis, A., 2013. Constructing and testing alternative ver-
sions of the Fama-French and Carhart models in the Uk. J. Bus. Finance Account.
40, 172–214.