Financial Data

The Thalesians: It is easy for philosophers to be rich if they choose

Dr. Paul A. Bilokon, Founder, Thalesians Ltd

2023.02.03
Financial data

Question

What is special about financial data? How do financial datasets differ from those found in
other areas, such as image processing, engineering, physics, chemistry, biology, and
medicine?

Financial data

The City and, more generally, the finance industry are among the most interesting sources of
data for data science. Financial data have some distinctive characteristics
(image: Gary Yeowell/Getty Images).

Characteristics of financial data


Financial data usually exhibit some of the following characteristics:
I Time series: financial data are usually presented in the form of time series.
I Non-stationary: financial time series are usually non-stationary.
I Non-Gaussian: while assuming the normal distribution is very convenient
mathematically, real financial data often violate this assumption.
I Influenced by “animal spirits” — a “spontaneous urge to action rather than
inaction” [Key36].
I Often deals with unobservables, such as volatility.
I Affected by human errors, including fat-finger errors.
I Complex and interrelated, so that they may be represented by dynamic complex
networks.
I Often multivariate: in equities, fixed income and credit, one usually deals with large
portfolios and multipoint yield curves, whereas even for a single bond the static data is
quite complex.
I Big: in terms of the number of data points; big data are also often repurposed data.
I High-frequency, also known as tick data.
I Changing: finance is a human endeavour and the “rules of the game” are constantly
reviewed and adjusted; data reflects these changes.
I Often expensive and/or difficult to get hold of.
I No longer purely financial, including alternative data.

Financial data: time series (i)

I Financial variables are usually observed at specific times. A time series is just a list of
such observations (often multivariate), presented in chronological order along with the
observation times. For example, an $m$-dimensional time series of length $n$, $n \in \mathbb{N}^*$,
could be written down as $x_{t_1}, x_{t_2}, \ldots, x_{t_n} \in \mathbb{R}^m$, with
$t_1 \le t_2 \le \ldots \le t_n$ being the times.
I Note that, depending on the situation, if the timestamps $t_i$ and $t_{i+1}$ are equal, the
ordering $x_{t_i}, x_{t_{i+1}}$ may still be important and not equivalent to the ordering
$x_{t_{i+1}}, x_{t_i}$. Thus the indices $1, 2, \ldots, n$ may contain information additional to
that contained in the times $t_1, t_2, \ldots, t_n$. For example, suppose that
$x_{t_1}, x_{t_2}, \ldots, x_{t_n}$ represent trades and the times $t_1, t_2, \ldots, t_n$ are known
only to the nearest millisecond. We may still be interested in the fact that the trade
represented by $x_{t_i}$ arrived before the trade represented by $x_{t_{i+1}}$, even if they
arrived within a millisecond of each other and the timestamps coincide, $t_i = t_{i+1}$.
I In practice, the components of the vectors $x_{t_i}$ may not all be real numbers; some may,
for example, be strings. (The idealisation that the observation vectors are real-valued is
nevertheless often convenient in analysis.)
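
The point about equal timestamps can be sketched in a few lines of Python. This is only an illustration: the Trade record, its seq field (an explicit arrival index) and the prices are hypothetical and do not come from the slides.

```python
from typing import NamedTuple


class Trade(NamedTuple):
    timestamp_ms: int  # observation time, known only to the nearest millisecond
    seq: int           # arrival index i (hypothetical field, assumed to be recorded)
    price: float


# Three trades; the last two share a timestamp but arrived in a definite order.
trades = [
    Trade(timestamp_ms=34_200_004, seq=1, price=39.98),
    Trade(timestamp_ms=34_200_009, seq=2, price=40.00),
    Trade(timestamp_ms=34_200_009, seq=3, price=39.99),
]

# Ordering on (timestamp, arrival index) reproduces the arrival order.
by_arrival = sorted(reversed(trades), key=lambda t: (t.timestamp_ms, t.seq))

# Breaking timestamp ties on some other field (here the price) does not.
by_time_and_price = sorted(trades, key=lambda t: (t.timestamp_ms, t.price))

print([t.seq for t in by_arrival])         # [1, 2, 3]
print([t.seq for t in by_time_and_price])  # [1, 3, 2]
```

Keeping the index alongside the timestamp is what preserves the extra information; the timestamp alone cannot recover it.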

Financial data: time series (ii)


I Moreover, in practice the vectors $x_{t_i}$ may have a lot more structure:

q)select from quotes where date=2016.04.18, time within 09:30:00.000 09:30:00.020, sym=`CLM16

date       time         sym   bidprice bidsize askprice asksize
---------------------------------------------------------------
2016.04.18 09:30:00.004 CLM16 39.98    35      40       40
2016.04.18 09:30:00.004 CLM16 39.98    35      40       41
2016.04.18 09:30:00.004 CLM16 39.98    35      40       42
2016.04.18 09:30:00.004 CLM16 39.98    33      40       42
2016.04.18 09:30:00.009 CLM16 39.98    33      39.99    1
2016.04.18 09:30:00.009 CLM16 39.98    33      40       41
2016.04.18 09:30:00.012 CLM16 39.98    33      40       40
2016.04.18 09:30:00.012 CLM16 39.98    33      40       39
2016.04.18 09:30:00.012 CLM16 39.98    33      40       38
2016.04.18 09:30:00.012 CLM16 39.98    33      40       37
2016.04.18 09:30:00.016 CLM16 39.98    33      40       36
2016.04.18 09:30:00.016 CLM16 39.98    35      40       36
2016.04.18 09:30:00.016 CLM16 39.98    36      40       36
2016.04.18 09:30:00.016 CLM16 39.98    36      40       35
2016.04.18 09:30:00.016 CLM16 39.99    1       40       35
2016.04.18 09:30:00.016 CLM16 39.98    37      40       35
2016.04.18 09:30:00.016 CLM16 39.99    1       40       35

I In this case each vector $x_{t_i}$ represents the $i$th row of the table (in the language of
kdb+/q) or data frame (in the language of R or the Python library pandas).
I Each component $(x_{t_i})_j$ of the vector $x_{t_i}$ has a certain data type. Moreover, all
values in the same column of the table have the same data type.

Financial data: time series (iii)

I In kdb+/q (the table above was produced by a kdb+/q query), the column types can be
examined with meta:

q)meta quotes
c       | t f a
--------| -----
date    | d
time    | t
sym     | s
bidprice| f
bidsize | h
askprice| f
asksize | h

I Here ‘d’ stands for “date”, ‘t’ for “time”, ‘s’ for “symbol”, ‘f’ for float (floats are 8 bytes in
q and correspond to doubles in some other languages, such as Java), ‘h’ for short (2
bytes).
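
For readers more at home in Python than in q, a rough analogue of one typed row is sketched below. The Quote class and the mapping from q types to Python types are assumptions made for illustration; the struct module is used only to confirm the 8-byte and 2-byte widths mentioned above.

```python
import datetime as dt
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class Quote:
    """One row of the quotes table (hypothetical Python counterpart)."""
    date: dt.date     # q type 'd'
    time: dt.time     # q type 't', millisecond resolution
    sym: str          # q type 's'
    bidprice: float   # q type 'f', an 8-byte float
    bidsize: int      # q type 'h', a 2-byte short
    askprice: float   # q type 'f'
    asksize: int      # q type 'h'


# First row of the table above; 4 ms is stored as 4000 microseconds.
first_row = Quote(dt.date(2016, 4, 18), dt.time(9, 30, 0, 4000),
                  "CLM16", 39.98, 35, 40.0, 40)

float_bytes = struct.calcsize("<d")  # C double, the counterpart of a q float
short_bytes = struct.calcsize("<h")  # C short, the counterpart of a q short

# Largest value a signed two-byte integer can represent.
short_max = 2 ** (8 * short_bytes - 1) - 1

print(float_bytes, short_bytes, short_max)  # 8 2 32767
```

The narrow short type keeps the table compact, at the cost of a hard ceiling on the sizes that a column can store.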

Financial data: time series (iv)

I In classical quantitative finance, time series are usually modelled as continuous-time
stochastic processes, usually disregarding their microstructural properties.
I Econometricians prefer to work in discrete time. It is typically easier to estimate and
forecast time series in discrete time.

Financial data: non-stationary (i)

I Recall [DMS14, Chapter 1] that a stochastic process is strictly stationary if

$$\{X_{t_1}, X_{t_2}, \ldots, X_{t_n}\} \overset{d}{=} \{X_{t_1+h}, X_{t_2+h}, \ldots, X_{t_n+h}\},$$

for all $n \in \mathbb{N}^*$, all times $t_1, t_2, \ldots, t_n \in \mathbb{R}^*$, and all time
shifts $h \in \mathbb{R}$. That is, its finite-dimensional distributions are invariant under
time shifts.
I This is a rather strict notion of stationarity; in practice, milder versions are used.
I For example, a finite-variance stochastic process $\{X_t, t \in \mathbb{R}^*\}$ is second-order
stationary (or weakly stationary) if its mean function $\mu_t := \mathbb{E}[X_t]$ (for all
$t \in \mathbb{R}^*$; assuming $\mathbb{E}[|X_t|] < \infty$) is constant, i.e. does not depend
on time, and its autocovariance function
$\gamma(s, t) := \mathrm{Cov}[X_s, X_t] = \mathbb{E}[(X_s - \mu_s)(X_t - \mu_t)]$ (for all
$s, t \in \mathbb{R}^*$) depends on $s$ and $t$ only through their difference, $|s - t|$.
I Informally, a stationary time series is one whose statistical properties (mean,
variance, autocorrelation, correlation for multivariate time series, etc.) are constant
over time.
I Non-stationary time series exhibit behaviour such as random walks, trends, and cycles.
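
A short worked derivation may help to show why a random walk fails the second-order stationarity test. The process below, the innovations $\varepsilon_i$ and the variance $\sigma^2$ are illustrative assumptions, and time is taken to be discrete for simplicity.

```latex
% Random walk: X_t = \sum_{i=1}^{t} \varepsilon_i, with \varepsilon_i i.i.d.,
% \mathbb{E}[\varepsilon_i] = 0 and \mathrm{Var}[\varepsilon_i] = \sigma^2.
\begin{align*}
\mu_t &= \mathbb{E}[X_t] = \sum_{i=1}^{t} \mathbb{E}[\varepsilon_i] = 0, \\
\gamma(s, t) &= \mathrm{Cov}[X_s, X_t]
              = \sum_{i=1}^{\min(s, t)} \mathrm{Var}[\varepsilon_i]
              = \sigma^2 \min(s, t), \\
\mathrm{Var}[X_t] &= \gamma(t, t) = \sigma^2 t.
\end{align*}
```

The mean is constant, but the autocovariance depends on $\min(s, t)$ rather than only on $|s - t|$, and the variance grows linearly in $t$, so the random walk is not weakly stationary. For white noise, by contrast, $\gamma(s, t)$ equals $\sigma^2$ when $s = t$ and zero otherwise, which depends on $s$ and $t$ only through $|s - t|$.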

Financial data: non-stationary (ii)

An example of a stationary time series — white noise. Image from [JTWH15].



Financial data: non-stationary (iii)

Examples of non-stationary behaviour. Image from Investopedia: Introduction To Stationary
And Non-Stationary Processes by Tzveta Iordanova:
http://www.investopedia.com/articles/trading/07/stationary.asp

Financial data: non-stationary (iv)

Rolling window volatility time series for Goodyear from 1992 to 2012. The (simple, net)
return interval is one trading day and the time window has length 60 trading days. [SCSG13]
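
A rolling-window volatility of the kind plotted can be sketched as follows. This is a minimal illustration, not the method of [SCSG13]: the function names and the toy prices are hypothetical, and the sample standard deviation is assumed as the volatility estimator.

```python
from statistics import stdev


def simple_returns(prices):
    """Simple (net) one-period returns: r_t = p_t / p_{t-1} - 1."""
    return [p1 / p0 - 1.0 for p0, p1 in zip(prices, prices[1:])]


def rolling_volatility(returns, window=60):
    """Sample standard deviation of the returns in each trailing window."""
    return [stdev(returns[i - window:i])
            for i in range(window, len(returns) + 1)]


# Toy daily closing prices (hypothetical), with a two-day window for brevity.
prices = [100.0, 110.0, 99.0, 108.9, 108.9]
returns = simple_returns(prices)
vols = rolling_volatility(returns, window=2)
print(len(vols))  # 3
```

With real data one would pass daily closes and keep the default window of 60 trading days; a volatility series that drifts over time is itself evidence of non-stationarity.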

Financial data: non-stationary (v)

Correlation matrix of 306 continuously traded (between 1992 and 2012) companies in the
S&P 500 index for the fourth quarter of 2005 and the first quarter of 2006; the darker, the
stronger the correlation. The companies are sorted according to industrial
sectors. [SCSG13]
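
A correlation matrix like the one shown can be computed from return series with a few lines of Python. The sketch assumes Pearson correlation and uses three hypothetical return series; applying it to each quarter separately would show how the matrix changes over time.

```python
from statistics import mean


def correlation(x, y):
    """Pearson correlation coefficient of two equally long series."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def correlation_matrix(returns_by_name):
    """Nested dict m[a][b] holding the correlation between series a and b."""
    names = list(returns_by_name)
    return {a: {b: correlation(returns_by_name[a], returns_by_name[b])
                for b in names}
            for a in names}


# Hypothetical daily returns for three stocks.
returns_by_name = {
    "A": [0.01, -0.02, 0.03, 0.00],
    "B": [0.02, -0.04, 0.06, 0.00],    # moves with A
    "C": [-0.01, 0.02, -0.03, 0.00],   # moves against A
}
matrix = correlation_matrix(returns_by_name)
```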

Financial data: non-Gaussian (i)


I Wesley Clair Mitchell in his 1915 work The Making and Using of Index
Numbers [Mit15] studied the characteristics of fluctuations in commodity prices.
I He produces a chart (his Chart 1, see next slide)...
...drawn to a logarithmic scale, [which] gives a more vivid idea of these price fluctuations.
It shows for each year the whole range covered by the recorded changes
from prices in the preceding year by vertical lines, which connect the points of
greatest rise with the points of greatest fall. These lines differ considerably in
length, which indicates that price changes cover a wider range in some years than
in others. The heavy dots upon the vertical lines show the positions of the decils.
One-tenth of the commodities quoted in any given year rose above their prices of
the year before by percentages scattered between the top of the line for that year
and the highest of the dots. Another tenth fell in price by percentages scattered
between the bottom of the line and the lowest of the dots. The fluctuations of
the remaining eight-tenths of the commodities were concentrated within the much
narrower range between the lowest and the highest dots. The dots grow closer
together toward the central dot, which is the median. This concentration indicates,
of course, that the number of commodities showing fluctuations of relatively slight
extent was much larger than the number showing the wide fluctuations falling outside
the highest and lowest decils, or even between these decils and the decils
next inside them.
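
The deciles Mitchell describes can be computed as below. The price changes are hypothetical, and statistics.quantiles with its default interpolation method is assumed; other decile conventions would give slightly different cut points.

```python
from statistics import median, quantiles

# Hypothetical year-on-year price changes (in percent) for eleven commodities.
changes = [-12.0, -5.0, -3.0, -1.0, 0.0, 1.0, 2.0, 3.0, 5.0, 9.0, 30.0]

# Nine cut points that divide the data into ten equally populated groups.
deciles = quantiles(changes, n=10)

full_range = (min(changes), max(changes))  # the whole vertical line
inner_range = (deciles[0], deciles[-1])    # holds the middle eight-tenths

print(len(deciles))                    # 9
print(deciles[4] == median(changes))   # True
```

The gap between the full range and the inner range is the visual signature of the heavy tails discussed on the following slides.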

Financial data: non-Gaussian (ii)

Mitchell’s Chart 1: conspectus of yearly changes in prices 1891–1918.



Financial data: non-Gaussian (iii)

The middle dots or medians in successive years are connected by a heavy black
line, which represents the general upward or downward drift of the whole set of
fluctuations. To make this drift clear the median of each year is taken as the
starting point from which the upward or downward movements in the following
year are measured. Hence the chart has no fixed base line. But in this respect
it represents faithfully the figures from which it is made; since these figures are
percentages of prices in the preceding year, a price fluctuation in any year are the
units from which all the changes proceed is further emphasized by connecting the
nine decils, as well as the points of greatest rise and fall, with the median of the
year before by light diagonal lines. The chart suggests a series of bursting bomb
shells, the bombs being represented by the median dots of the years before and
the scattering of their fragments by the lines which radiate to the decils and the
points of greatest rise and fall.

Financial data: non-Gaussian (iv)

Time is well spent in studying this chart, because it is capable of giving a truer
impression of the characteristics of price changes than can be given by any other
device. The marked diversity of the fluctuations of different commodities in the same
year — some rising, some falling, some remaining unchanged — the wide range
covered by these fluctuations, the erratic occurrence of extremely large changes,
and the fact that the greatest percentages of rise far surpass the greatest percentages
of fall are strikingly shown; but so also are the much greater frequency of
rather small variations, the dense concentration near the center of the field, the
existence of a general drift in the whole complex of changes, and the frequent
alterations in the direction and the degree of this drift.

Financial data: non-Gaussian (v)


I In his 1963 work Variation of Certain Speculative Prices [Man63], Benoit Mandelbrot
(1924–2010) examines cotton price changes.
I He cites Mitchell’s work [Mit15] and proposes using the stable Paretian distributions in
place of the Gaussian distribution to deal with price changes.
I Mandelbrot confirms that the distribution of price changes is leptokurtic (from the Greek
λεπτός, meaning “slender”, and κυρτός, meaning “curved, arching”); that is, it has kurtosis

$$\mathrm{Kurt}[X] = \mathbb{E}\left[\left(\frac{X - \mu}{\sigma}\right)^4\right] = \frac{\mathbb{E}[(X - \mu)^4]}{\left(\mathbb{E}[(X - \mu)^2]\right)^2}$$

greater than that of the normal distribution, which is 3. (Here µ is the mean, σ the
standard deviation of the distribution.) High kurtosis is observed in distributions where
I the probability density is concentrated around the mean and the data-generating process
produces occasional values far from the mean, or
I the probability density is concentrated in the tails of the distribution, leading to the so-called
fat tails.
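I The data may also be negatively or positively skewed (which is also catered for by
stable Paretian distributions):

$$\mathrm{Skew}[X] = \mathbb{E}\left[\left(\frac{X - \mu}{\sigma}\right)^3\right] = \frac{\mathbb{E}[(X - \mu)^3]}{\left(\mathbb{E}[(X - \mu)^2]\right)^{3/2}}.$$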
I Mandelbrot’s empirical work will later lead to his development of the notion of fractal.
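
The two moment ratios above translate directly into code. The sketch below uses population moments (dividing by the number of observations), matching the formulas as written; the toy sample is hypothetical and chosen so that the arithmetic can be checked by hand.

```python
def central_moment(xs, k):
    """k-th central moment, dividing by n (population convention)."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** k for x in xs) / len(xs)


def skewness(xs):
    """Third central moment divided by the variance to the power 3/2."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5


def kurtosis(xs):
    """Fourth central moment divided by the squared variance."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2


# Mostly small values with one large outlier (hypothetical data).
sample = [0.0, 0.0, 0.0, 0.0, 10.0]
print(kurtosis(sample))  # 3.25
print(skewness(sample))  # 1.5
```

Here the mean is 2, the variance 16, the third central moment 96 and the fourth 832, giving a kurtosis of 3.25 (above the Gaussian value of 3) and a positive skewness of 1.5.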

Financial data: non-Gaussian (vi)

Mandelbrot’s Figure 1: Two histograms illustrating departure from normality of the fifth and
tenth difference of monthly wool prices, 1890–1937. In each case, the continuous
bell-shaped curve represents the Gaussian “interpolate” based upon the sample variance.
Source: Gerhard Tintner, The Variate-Difference Method (Bloomington, Ind., 1940).

Financial data: influenced by “animal spirits”

I As described by John Maynard Keynes [Key36]:


Even apart from the instability due to speculation, there is the instability due to
the characteristic of human nature that a large proportion of our positive activities
depend on spontaneous optimism rather than mathematical expectations, whether
moral or hedonistic or economic. Most, probably, of our decisions to do something
positive, the full consequences of which will be drawn out over many days to come,
can only be taken as the result of animal spirits — a spontaneous urge to action
rather than inaction, and not as the outcome of a weighted average of quantitative
benefits multiplied by quantitative probabilities.
I From Bartholomew Traheron’s 1543 translation of Vigo’s Chirurg:
Physitions teache that there ben thre kindes of spirites; animal; vital; and naturall.
The animal spirite hath his seate in the brayne; and is spredde in to all the bodye by
synnowes; gyuyng facultie of mouynge; and felynge. It is called animal; bycause
it is the first instrument of the soule; whych the Latins call animam.

Financial data: often deals with unobservables (i)

(a) Cat (b) Volatility

Figure: Image processing versus finance

It is much easier to more or less unambiguously spot a cat (source: Getty Images) than the
times when the S&P 500 index (sources: Google Finance, SIX) is volatile.

Financial data: often deals with unobservables (ii)

Examples of “unobservables” in finance:


I Volatility.
I Volatility of volatility.
I Correlation.
I Out-of-the-money option prices, which may be extrapolated.
I Survival probabilities.
I Persistence parameters of mean-reverting processes.
I Sentiment (consumer, news, etc.).
I Factors, such as Nelson–Siegel curve factors or principal components in PCA.

Financial data: affected by fat-finger (more generally, human) errors (i)

I Fat-finger errors are keyboard input errors in financial markets, e.g. where an order to
buy or sell is placed for a wrong asset or contract, for a wrong size, or at a wrong price.
A special case of the more general human errors contributing to operational risk.
I Have been known to cause significant problems (including material losses) when
surfacing in production systems.
I Numerous fat-finger errors live on undetected in financial datasets.
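
One simple way to look for candidate fat-finger errors in a price series is a robust outlier filter. The sketch below is an assumption on our part, not a method from the slides: it flags prices that lie more than a chosen multiple of the median absolute deviation (MAD) from the median. The function name, the threshold and the prices are hypothetical.

```python
from statistics import median


def flag_suspect_prices(prices, threshold=10.0):
    """Indices of prices further than threshold * MAD from the median."""
    centre = median(prices)
    mad = median(abs(p - centre) for p in prices)
    if mad == 0:
        # Degenerate case: most prices are identical, so flag any that differ.
        return [i for i, p in enumerate(prices) if p != centre]
    return [i for i, p in enumerate(prices)
            if abs(p - centre) > threshold * mad]


# Hypothetical prices in yen; the fourth entry mimics a price/size mix-up.
prices = [610000.0, 611000.0, 609000.0, 6.0, 610000.0]
suspects = flag_suspect_prices(prices)
print(suspects)  # [3]
```

A filter like this only raises candidates for review: a genuine market move can look like an error, and a small error can pass unnoticed.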

Financial data: affected by fat-finger (more generally, human) errors (ii)

I Examples of publicised fat-finger errors from Wikipedia1 :


I In 2001, UBS sold 610,000 Dentsu-shares at 6 yen, instead of 6 Dentsu-shares at 610,000
yen. Even though the error was spotted immediately, the Tokyo Stock Exchange did not
cancel the trades and UBS had to buy back the shares at market-value which caused them a
loss of 100m USD.
I In 2006, a fat-finger error by a trader at Mizuho Securities in Japan caused the firm to short
sell a stock in an error that cost the firm 40 billion Yen to unwind.
I In 2014, a Japanese broker erroneously placed orders for more than US$600bn of stock in
leading Japanese companies, including Nomura, Toyota Motors and Honda which was
subsequently cancelled. If the order had been fulfilled it would have exceeded the value of
the economy of Sweden.
I In 2015, a junior employee at Deutsche Bank whose superior was on vacation confused
gross and net amounts while processing a trade, causing a payment to a US hedge fund of
$6bn, orders of magnitude higher than the correct amount. The bank reported the error to the
British Financial Conduct Authority, the European Central Bank and the US Federal Reserve
Bank, and retrieved the money on the following day.
I In 2015, the Investor Armon S. bought certificates from BNP Paribas at a price of 108 EUR
instead of 54,400 EUR each. This caused a loss of 160m EUR for BNP.
I In 2016, it was believed a fat-finger error caused the British pound to drop 6% in just a few
minutes to $1.1841, its lowest value for 31 years. A report by the Bank for International
Settlements later concluded that the drop was not caused by a single factor.

1 From https://en.wikipedia.org/wiki/Fat-finger_error#Examples

Financial data: complex and interrelated

Financial data occurs not only as time series but also as (evolving, dynamic)
networks [CMeS13, ACM16].

Figure 13.1 from [CMeS13]: Brazilian interbank network, December 2007. The number of
financial conglomerates is n = 125 and the number of links in this representation at any
date does not exceed 1200.
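
The figures in the caption give a quick sense of how sparse the network is. The sketch below assumes a directed network without self-links, which is an assumption about the representation rather than something stated in the caption.

```python
def density(n_nodes, n_links, directed=True):
    """Fraction of possible links that are present (no self-links)."""
    possible = n_nodes * (n_nodes - 1)
    if not directed:
        possible //= 2
    return n_links / possible


# n = 125 conglomerates and at most 1200 links, so this is an upper bound.
max_density = density(125, 1200)
print(round(max_density, 4))  # 0.0774
```

Under these assumptions fewer than 8% of the possible links are present, so a sparse representation such as an adjacency list is a natural choice for data of this kind.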

Financial data: Data Mining and Big Data

I Big Data (BD): data mining on data sets collected for administrative, legal, insurance,
financial planning, medical care, etc. purposes; such data are not intended in the first
place to gain insight into a phenomenon or to solve a real-world problem, but the
collected records may contain data — and usually do — that may be useful for other
purposes than those originally intended. (L. Grandinetti, G. R. Joubert, M. Kunze)
I Data Mining: the use of machine learning algorithms to find faint patterns of
relationship between data elements in large, noisy, and messy data sets, which can
lead to actions to increase benefit in some form (diagnosis, profit, detection, etc.).
I Noisy multidimensional time series.
I New and specialised datasets.
I Big — but also repurposed — data:
I Data governance and other compliance functions (logs, audit, internal data) are becoming
integrated with big data platforms.
I Software such as Apache Hadoop and Apache Spark is being adopted by financial services.

Financial data: high-frequency (i)

I Definitions of high-frequency data within finance vary from one source to the next, for
example:
I High-frequency financial data are observations on financial variables taken daily or at a finer
time scale, and are often irregularly spaced over time. [YZ05]
I High-frequency financial data usually refer to data sampled at a time horizon smaller than a
trading day. [Con10]
I High-frequency data, also known as tick data, are records of live market activity. Every time
a customer, a dealer, or another entity posts a so-called limit order to buy s units of a specific
security with ticker X at price q, a bid quote $q^b_{t_b}$ is logged at time $t_b$ to buy $s^b_{t_b}$ units of X.
Market orders are incorporated into tick data in different ways... [Ald13]
I Often such data is updated at sub-second frequency.

Financial data: high-frequency (ii)


Irene Aldridge [Ald13, Table 4.1] describes some properties of high-frequency data:
I Voluminous: Each day of high-frequency data contains the number of observations equivalent to
30 years of daily data.
I Pros: Large numbers of observations carry lots of information.
I Cons: High-frequency data are difficult to handle manually.
I Subject to bid-ask bounce: Unlike traditional data based on just closing prices, tick data carry
additional supply-and-demand information in the form of bid and ask prices and offering sizes.
I Pros: Bid and ask quotes can carry valuable information about impending market moves,
which can be harnessed to the researcher’s advantage.
I Cons: Bid and ask quotes are separated by a spread. Continuous movement from bid to ask
and back introduces a jump process, difficult to deal with through many conventional models.
I Not normally or lognormally distributed: Returns computed from tick data are not normal or lognormal.
I Pros: Many tradable models are still to be discovered.
I Cons: Traditional asset pricing models assuming lognormality of prices do not apply.
I Irregularly spaced in time: Arrivals of tick data are asynchronous.
I Pros: Durations between data arrival carry information.
I Cons: Most traditional models require regularly spaced data; need to convert high-frequency
data to some regular intervals, or “bars” of data. Converted data are often sparse (populated
with zero returns), once again making traditional econometric inferences difficult.
I Do not include buy or sell trade direction information: Level I and Level II data do not include
information on whether the trade was a result of a market buy or a market sell order.
I Pros: Data are leaner without trade direction information; trade information is more difficult
for bystanders to extract.
I Cons: The information on whether a trade is buyer initiated or seller initiated is a desired
input in many models.
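
The conversion of irregularly spaced ticks into regular bars, and the zero returns it produces, can be sketched as follows. The to_bars function, the bar width and the ticks are hypothetical; empty bars are assumed to carry the previous price forward, which is one common convention among several.

```python
def to_bars(ticks, start, end, width):
    """Last-price bars on a regular grid; empty bars repeat the previous price."""
    n_bars = (end - start) // width
    closes = [None] * n_bars
    for t, price in ticks:  # ticks are assumed to be sorted by time
        k = (t - start) // width
        if 0 <= k < n_bars:
            closes[k] = price  # a later tick in the same bar overwrites an earlier one
    last = None
    for k in range(n_bars):
        if closes[k] is None:
            closes[k] = last   # stays None until the first tick has arrived
        else:
            last = closes[k]
    return closes


# (milliseconds after 09:30:00, price); hypothetical, irregularly spaced ticks.
ticks = [(4, 39.98), (9, 39.99), (16, 40.00)]
bars = to_bars(ticks, start=0, end=20, width=5)
bar_returns = [b / a - 1.0 for a, b in zip(bars, bars[1:])]

print(bars)            # [39.98, 39.99, 39.99, 40.0]
print(bar_returns[1])  # 0.0
```

The third bar contains no tick, so its price is carried forward and the corresponding return is exactly zero. With narrow bars and illiquid instruments such zero returns dominate the series.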

Financial data: changing (i)

I The makeup of the S&P 500 is constantly changing [Ro15]:


[Morgan Stanley’s Adam] Parker points out that the makeup of indexes like the
S&P 500 is constantly evolving. “One of the main items that changes over time in
the S&P 500 is the actual constituents of the index,” Parker writes. “While there has
been relatively less turnover lately, with only 3% of the companies changing since
2013, the cumulative effect of adding and subtracting companies is surprisingly
substantial. Ten percent of the companies in today’s index are different since 2011,
17% are different since 2009, and fully half the companies are different since 1999.
Said another way, at least half the companies in the S&P 500 today were not in
the S&P 500 when your average portfolio manager started running a portfolio.”
I According to Wikipedia,
Survivorship bias or survival bias is the logical error of concentrating on the
people, organisations, or things that made it past some selection process and
overlooking those that did not, typically because of their lack of visibility. This can
lead to false conclusions in several different ways. It is a form of selection bias.

Financial data: changing (ii)

I Its impact on data analysis can be significant [Con14]:


In finance, here’s one example. The S&P 500’s compound annual long-run return
is 9.8% (less depending when you start/how you measure it). So that must be my
bogey for the entire stock market.
Each of those examples misses some crucial hidden factor. You only see famous
actors because the failures went home. The Melanesians hadn’t heard of the
Wright brothers, or the Industrial Revolution.
In the stock example, a lot of Darwinism has taken place by the time a company is
significant enough to be part of the S&P 500.
The same process is at work in the fund industry. If a mutual fund performs poorly
and is shut down, it will take its poor record to the grave. It will be booted from cat-
egory averages, for the very straightforward reason that you can no longer invest
in it.
If you think this is some marginal trend, it’s not. I recently asked Morningstar about
their tally of fund closures. The company tracks some 7,461 mutual funds today.
But its tally of “dead” funds — liquidated or more often merged away — is nearly
as large, at 7,183.
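
The effect of dropping dead funds from an average is easy to reproduce with made-up numbers. All returns below are hypothetical and chosen only to make the arithmetic transparent.

```python
from statistics import mean

# (annual return, still alive today?) for five hypothetical funds.
funds = [
    (0.10, True),
    (0.08, True),
    (0.06, True),
    (-0.12, False),  # liquidated
    (-0.20, False),  # merged away
]

survivor_mean = mean(r for r, alive in funds if alive)
full_mean = mean(r for r, _ in funds)
bias = survivor_mean - full_mean

print(round(survivor_mean, 3))  # 0.08
print(round(full_mean, 3))      # -0.016
print(round(bias, 3))           # 0.096
```

A dataset that contains only today's survivors would report an 8% average return for a population whose true average was slightly negative.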

Financial data: changing (iii)


I Libor, the London inter-bank lending rate, was considered to be one of the most
crucial interest rates in finance.
I As reported by BBC2 :
I As early as 2005 there was evidence Barclays had tried to manipulate dollar Libor and
Euribor rates at the request of its derivatives traders and other banks.
I Between January 2005 and June 2009, Barclays derivatives traders made a total of 257
requests to fix Libor and Euribor rates, according to a report by the FSA. One trader told a
trader from another bank in relation to three-month dollar Libor: “duuuude... what’s up with ur
guys 34.5 3m fix... tell him to get it up!”
I At the onset of the financial crisis in September 2007 with the collapse of Northern Rock,
liquidity concerns drew public scrutiny towards Libor. Barclays manipulated Libor
submissions to give a healthier picture of the bank’s credit quality and its ability to raise funds.
I On 16 April 2008, the Wall Street Journal published a report that questioned the integrity of
Libor.
I In late 2011, RBS sacked four people for their alleged roles in the Libor-fixing scandal.
I By 2012.07m, the breadth of the scandal became evident. Around 20 major banks had
been named in investigations and court cases.
I On 2017.07.27, the FCA, tasked with overseeing Libor, announced that the benchmark would
be phased out by 2021. In many places the overnight indexed swap (OIS)3 is
already used instead of Libor as a risk-free rate.

2 http://www.bbc.co.uk/news/business-18671255
3 http://www.investopedia.com/articles/forex/053014/will-ois-replace-libor.asp

Financial data: changing (iv)


Regulation is constantly evolving, leading to changes in financial data (sources:
AbideFinancial4, Accenture5):
I 2007.11.01: MiFID go-live
I 2012.12.31: Dodd-Frank Act go-live
I 2013.10.01: ASIC phase-in approach begins
I 2013.10.31: MAS phase-in approach begins
I 2014.02.12: EMIR go-live
I 2015.09.28: ESMA publish bulk of final RTS & ITS on MiFID II/R
I 2015.10.07: REMIT phase 1
I 2015.11.13: EU Commission EMIR Review
I 2016.01.03: ESMA submit remaining MiFID II/R RTS & ITS to the commission
I 2016.01.12: SFTR partially comes into force
I 2016.04.07: REMIT phase 2
I 2017Q2: EMIR rewrite go-live
I 2017Q4: FRTB implementation of standardised approach and internal model approach components to be completed
I 2018.01.03: MiFID II/R go-live
I 2018Q2–3: FRTB production parallel run to begin
I 2018H2: SFTR go-live
I 2019.01.12: EU Commission SFTR review
I 2019.03.03: EU Commission MiFID II/R Review
I 2019.12m: FRTB compliance deadline
4 http://www.nexregreporting.com/wp-content/uploads/2016/07/Multi-Regime-Regulatory-Timeline-2020.pdf
5 https://www.accenture.com/t20170711T040306__w__/us-en/_acnmedia/PDF-56/Accenture-Fundamental-Review-of-the-Trading-Book-Theory-to-Action.pdf

Financial data: changing (v)

(L to R) Sir George Iacobescu, Chairman of Canary Wharf Group, Eric Van Der Kleij, Head
of Level39 and London Mayor Boris Johnson open Canary Wharf’s newest FinTech
accelerator, Level39 on the (image: London Loves Business, 2013.04.08 article by Shruti
Tripathi Chopra)

Financial data: changing (vi)

I Heraclitus, as quoted by Plato in Cratylus, 402a:


Everything changes and nothing stands still.
I Also Heraclitus, as quoted ibid.:
You could not step twice into the same river.
I The financial markets a decade, let alone a century, ago are different from the financial
markets today. And yet...
I Ecclesiastes 1:9 (ASV):
That which hath been is that which shall be; and that which hath been done is that
which shall be done: and there is no new thing under the sun.

Financial data: often expensive and/or difficult to get hold of

I Andrew Ng [Ng16]:
Among leading AI teams, many can likely replicate others’ software in, at most, 1–
2 years. But it is exceedingly difficult to get access to someone else’s data. Thus
data, rather than software, is the defensible barrier for many businesses.
I Andrew Ng (@AndrewYNg on Twitter, 2017.02.05):
For AI to be free we need not just Open Source, but also a strong Open Data
movement.
I Sources of open financial data (see also [Mar16]):
I Quandl (https://www.quandl.com/): a marketplace for financial, economic and alternative
data delivered in modern formats for today’s analysts, including Python, Excel, Matlab, and R.
I DATA.GOV (https://www.data.gov/finance/): Hundreds of free data sets on financial
services, including banking, lending, retirement, investments, and insurance.
I FRED (https://fred.stlouisfed.org/): Federal Reserve Economic Data.
I Kaggle Datasets (https://www.kaggle.com/datasets): a search engine to find open
datasets.
I Google Finance (https://www.google.com/finance): 40 years’ worth of stock market
data.
I Yahoo! Finance (https://uk.finance.yahoo.com/): financial data, also accessible via an
API.

Financial data: no longer purely financial

I Financial institutions are beginning to consume more and more non-financial data:
I News and news sentiment data [ZS10, Ame15, BM15].
I Tweets [RHC+ 12].
I Chat data [Kis16].
I Blogs [ZS10].
I Company filings [Kis16].
I Shopping surveys [Kis16].
I Though this is far from easy [Kis16]:
The most unfathomable U.S. presidential race in memory gave Predata, a startup
firm backed by hedge fund manager Kyle Bass, a chance to cut through the vitriol
and coolly calculate the winner by monitoring chat data. Turns out that Predata,
which anointed Hillary Clinton the likely victor days before the Nov. 8 vote, was no
smarter than the cable news pundits.
Finance as a source of applications for data science and machine learning

Why finance is a difficult source of applications


I Unlike e.g. face recognition, which is a low-noise task, finance is generally high-noise:
the daily volatility of the S&P 500 index is an order of magnitude higher than its mean
daily return.
I (The various forms of) the Efficient Market Hypothesis: A market is efficient with
respect to information set $I_t$ if it is impossible to make economic profits by trading on
the basis of this information set. [Jen78]
I Understanding the Reward-to-Risk ratio is critical in finance. As early as the 1980s it was
understood that the expected return can be estimated far less accurately than the
variance of returns; that the variance of the market return changes significantly over
time; and that financial time series are often non-stationary. [Mer80]
I At the time it was thought [Mer80] that volatility estimates could be made arbitrarily
precise provided that the sampling interval be allowed to shrink to zero. It is now
[BRT00] understood that fat tails, microstructure noise, dependence in the returns and
in their squared returns, deterministic patterns in the variance, the high kurtosis in
high-frequency data, and time varying volatility further limit the precision of volatility
estimates.
I Thus second and (especially) first moment estimation (let alone prediction) in finance
is hard.
I @AndrewYNg: “For AI to be free we need not just Open Source, but also a strong Open
Data movement.”
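
The remark about noise at the top of this list can be made concrete with a back-of-the-envelope calculation. The numbers below are assumptions for illustration (a mean daily return of 0.04% and a daily volatility of 1%, so that volatility is 25 times the mean), and the returns are treated as independent and identically distributed, which the slides argue is optimistic.

```python
def observations_needed(mean_return, volatility, z=2.0):
    """Number of i.i.d. observations n at which mean / (volatility / sqrt(n)) equals z."""
    return (z * volatility / mean_return) ** 2


daily_mean = 0.0004  # assumed: 0.04% per trading day
daily_vol = 0.01     # assumed: 1% per trading day

n_days = observations_needed(daily_mean, daily_vol)
n_years = n_days / 252  # assuming 252 trading days per year

print(round(n_days))      # 2500
print(round(n_years, 1))  # 9.9
```

Even in this idealised setting, roughly ten years of daily data are needed before the sample mean return sits two standard errors away from zero, which is one way of seeing why first-moment estimation is hard.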

Why finance is an easy source of applications

I “Despite stock returns once having been thought to be unforecastable, there is now
plenty of optimism that this is not so... Benefits can arise from taking a longer horizon,
from using disaggregated data, from carefully removing outliers or exceptional events,
and especially from considering non-linear models. [...] The papers also suggest that
some sub-periods may be more forecastable than others — such as summer months
or January — and this is worth exploring. If many component series are available, then
ranks may produce further information that is helpful with forecasting. There seems to
be many opportunities for forecasters, many of whom need to break away from simple
linear univariate ARIMA or multivariate transfer functions. It is often not easy to beat
convincingly these simple methods, so they make excellent base-line models, but they
often can be beaten.” [Gra92]
I Those of us doing algorithmic trading know that at least some markets at least some of
the time aren’t fully efficient.
I XTX Markets, the electronic market maker co-led by Deutsche Bank’s former head of
foreign exchange trading Zar Amrolia, brought in more than £10 million a month in
revenues during its first six months of operations. (Financial News, 2016.09.06)
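
Granger's remarks that non-linear models may help, and that simple linear models make excellent baselines [Gra92], suggest scoring any candidate forecaster out of sample against those baselines. The sketch below is an illustration only, not Granger's procedure. It uses a synthetic series with a deliberately planted threshold effect (an assumption, not real market data) and compares the out-of-sample mean squared error of a zero forecast, a linear AR(1) fitted by least squares, and a hypothetical two-regime threshold AR(1). All names and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic series with a planted threshold effect (assumption, not real data):
# negative values tend to reverse, positive values tend to persist weakly.
n = 20_000
x = np.zeros(n)
eps = rng.normal(0.0, 1.0, n)
for t in range(1, n):
    phi = -0.4 if x[t - 1] < 0 else 0.2
    x[t] = phi * x[t - 1] + eps[t]

lagged, target = x[:-1], x[1:]
split = n // 2  # first half for fitting, second half for evaluation
lag_tr, y_tr = lagged[:split], target[:split]
lag_te, y_te = lagged[split:], target[split:]


def ols_slope(a, b):
    """Least-squares slope of b on a, without intercept."""
    return float(a @ b / (a @ a))


# Baseline 1: always forecast zero (the "unforecastable" view).
mse_zero = float(np.mean(y_te ** 2))

# Baseline 2: linear AR(1).
phi_lin = ols_slope(lag_tr, y_tr)
mse_ar1 = float(np.mean((y_te - phi_lin * lag_te) ** 2))

# Candidate: threshold AR(1) with separate slopes below and above zero.
neg = lag_tr < 0
phi_neg = ols_slope(lag_tr[neg], y_tr[neg])
phi_pos = ols_slope(lag_tr[~neg], y_tr[~neg])
pred = np.where(lag_te < 0, phi_neg, phi_pos) * lag_te
mse_tar = float(np.mean((y_te - pred) ** 2))

print(f"zero: {mse_zero:.4f}  AR(1): {mse_ar1:.4f}  threshold AR(1): {mse_tar:.4f}")
```

Because the non-linearity is planted, the threshold model wins here by construction. On real returns any gain would be far smaller and would have to survive transaction costs, so the useful part of the sketch is the evaluation pattern: fit on one period, score on another, and always report the simple baselines alongside the candidate.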
Bibliography

Hamed Amini, Rama Cont, and Andreea Minca.
Resilience to contagion in financial networks.
Mathematical Finance, 26(2):329–365, April 2016.
Irene Aldridge.
High-Frequency Trading: A Practical Guide to Algorithmic Strategies and Trading
Systems.
Wiley, 2nd edition, 2013.
Saeed Amen.
Bond over big data.
Technical report, Thalesians, January 2015.
Svetlana Borovkova and Diego Mahakena.
News, volatility and jumps: the case of natural gas futures.
Quantitative Finance, 15(7):1217–1242, 2015.
Xuezheng Bai, Jeffrey R. Russell, and George C. Tiao.
Beyond Merton’s utopia: effects of non-normality and dependence on the precision of
variance estimates using high-frequency financial data.
Unpublished paper, Graduate School of Business, University of Chicago, 2000.
Rama Cont, Amal Moussa, and Edson Bastos e Santos.
Network Structure and Systemic Risk in Banking Systems, pages 327–368.
Cambridge University Press, 2013.
Rama Cont, editor.
Encyclopedia of Quantitative Finance.
John Wiley and Sons, Inc., 2010.
Brendan Conway.
Survivorship bias: Why most investors should own index funds.
Barrons, February 2014.
Randal Douc, Eric Moulines, and David Stoffer.
Nonlinear Time Series: Theory, Methods and Applications with R Examples.
Texts in Statistical Science. CRC Press, Taylor & Francis Group, 2014.
Clive W.J. Granger.
Forecasting stock market prices: Lessons for forecasters.
International Journal of Forecasting, 8:3–13, 1992.
Michael C. Jensen.
Some anomalous evidence regarding market efficiency.
Journal of Financial Economics, 6(2/3):95–101, 1978.
Andrew T. Jebb, Louis Tay, Wei Wang, and Qiming Huang.
Time series analysis for psychological research: examining and forecasting change.
Frontiers in Psychology, 6, June 2015.
John Maynard Keynes.
The General Theory of Employment, Interest and Money.
Macmillan, 1936.
Saijel Kishan.
Big data is a big mess for hedge funds hunting signals.
Bloomberg, November 2016.
Benoit B. Mandelbrot.
The variation of certain speculative prices.
Journal of Business, XXXVI:392–417, 1963.
Bernard Marr.
Big data: 33 brilliant and free data sources for 2016.
Forbes, February 2016.
Robert C. Merton.
On estimating the expected return on the market: An exploratory investigation.
Journal of Financial Economics, 8:323–361, 1980.
Wesley Clair Mitchell.
The Making and Using of Index Numbers: Introduction to Index Numbers and
Wholesale Prices in the United States and Foreign Countries, volume 173 of Bulletins
of the U.S. Bureau of Labor Statistics.
U.S. Bureau of Labor Statistics, 1915.
Andrew Ng.
What artificial intelligence can and can’t do right now.
Harvard Business Review, November 2016.
Eduardo J. Ruiz, Vagelis Hristidis, Carlos Castillo, Aristides Gionis, and Alejandro
Jaimes.
Correlating financial time series with micro-blogging activity.
In Proceedings of the fifth ACM international conference on Web search and data
mining (WSDM’12), pages 513–522, February 2012.
Sam Ro.
The makeup of the S&P 500 is constantly changing.
Business Insider UK, June 2015.
Thilo A. Schmitt, Desislava Chetalova, Rudi Schäfer, and Thomas Guhr.
Non-stationarity in financial time series: Generic features and tail behaviour.
EPL (Europhysics Letters), 2013.
Bingcheng Yan and Eric Zivot.
Analysis of high-frequency financial data with S-Plus.
Technical Report UWEC-2005-03, University of Washington, Department of
Economics, 2005.
Wenbin Zhang and Steven Skiena.
Trading strategies to exploit blog and news sentiment.
In Proceedings of the Fourth International Conference on Weblogs and Social Media
(ICWSM 2010), May 2010.
