7 Financial Data
Financial Data
2023.02.03
Financial Data
Financial data
Question
What is special about financial data? How do financial datasets differ from those found in
other areas, such as image processing, engineering, physics, chemistry, biology, and
medicine?
The City and, more generally, the finance industry, is one of the most interesting sources of
data for data science. This data — the financial data — has some interesting
characteristics (image: Gary Yeowell/Getty Images)
- Financial variables are usually observed at specific times. A time series is just a list of
  such observations (often multivariate), presented in chronological order along with the
  observation times. For example, an $m$-dimensional time series of length $n$, $n \in \mathbb{N}^*$,
  could be written down as $x_{t_1}, x_{t_2}, \ldots, x_{t_n} \in \mathbb{R}^m$, with
  $t_1 \le t_2 \le \ldots \le t_n$ being the times.
- Note that, depending on the situation, if the timestamps $t_i$ and $t_{i+1}$ are equal, the
  ordering $x_{t_i}, x_{t_{i+1}}$ may still be important and not equivalent to the ordering
  $x_{t_{i+1}}, x_{t_i}$. Thus the indices $1, 2, \ldots, n$ may contain information additional to
  that contained in the times $t_1, t_2, \ldots, t_n$. For example, suppose that
  $x_{t_1}, x_{t_2}, \ldots, x_{t_n}$ represent trades and the times $t_1, t_2, \ldots, t_n$ are known
  only to the nearest millisecond. We may still be interested in the fact that the trade
  represented by $x_{t_i}$ arrived before the trade represented by $x_{t_{i+1}}$, even if they
  arrived within a millisecond of each other and the timestamps coincide, $t_i = t_{i+1}$.
- In practice, the components of the vectors $x_{t_i}$ may not necessarily all be real numbers,
  but some may, for example, be strings. (Although the idealisation of assuming that the
  observation vectors are real numbers is often convenient in analysis.)
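The point about equal timestamps can be illustrated with a small pandas sketch. The trades, the symbol ABC, and the column name seq are hypothetical; the only assumption is that an explicit sequence number is stored alongside the millisecond timestamps so that arrival order survives a reshuffle.

```python
import pandas as pd

# Hypothetical trades. Timestamps are known only to the nearest millisecond,
# so the second and third trades share a timestamp.
trades = pd.DataFrame(
    {
        "time": pd.to_datetime(
            [
                "2023-02-03 09:30:00.001",
                "2023-02-03 09:30:00.002",
                "2023-02-03 09:30:00.002",
                "2023-02-03 09:30:00.005",
            ]
        ),
        "sym": ["ABC", "ABC", "ABC", "ABC"],  # a string-valued component
        "price": [100.0, 100.5, 100.4, 100.6],
    }
)

# The sequence number plays the role of the index i: it records arrival order,
# which the timestamps alone cannot do for the two tied trades.
trades["seq"] = range(len(trades))

# Shuffle the rows, then restore arrival order by sorting on (time, seq).
# Sorting on time alone could not be relied on to order the tied trades.
shuffled = trades.sample(frac=1.0, random_state=0)
restored = shuffled.sort_values(["time", "seq"])["seq"].tolist()
print(restored)
```

Without the seq column, the relative order of the two trades stamped 09:30:00.002 would depend on how the rows happened to be stored.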
- In this case each vector $x_{t_i}$ represents the $i$th row of the table (in the language of
  kdb+/q) or data frame (in the language of R or the Python library pandas).
- Each component $(x_{t_i})_j$ of the vector $x_{t_i}$ has a certain data type. Moreover, all
  values in the same column of the table have the same data type.
- In kdb+/q (for the above table was produced by a kdb+/q query), the types can be
  examined with meta:

    q)meta quotes
    c       | t f a
    --------| -----
    date    | d
    time    | t
    sym     | s
    bidprice| f
    bidsize | h
    askprice| f
    asksize | h

- Here ‘d’ stands for “date”, ‘t’ for “time”, ‘s’ for “symbol”, ‘f’ for float (floats are 8 bytes in
  q and correspond to doubles in some other languages, such as Java), ‘h’ for short (2
  bytes).
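For readers working in pandas rather than kdb+/q, a rough analogue of inspecting column types is sketched below. The column names follow the meta output; the values and the choice of pandas dtypes (a categorical for the symbol, int16 for the 2-byte short, float64 for the 8-byte float) are assumptions made for illustration, not a statement of how q types must be mapped.

```python
import numpy as np
import pandas as pd

# Hypothetical quotes with the same column names as in the meta output.
quotes = pd.DataFrame(
    {
        "date": pd.to_datetime(["2023-02-03", "2023-02-03"]),
        "time": pd.to_timedelta(["09:30:00.001", "09:30:00.002"]),
        "sym": pd.Categorical(["ABC", "ABC"]),
        "bidprice": np.array([100.0, 100.1], dtype=np.float64),  # 8-byte float
        "bidsize": np.array([500, 300], dtype=np.int16),         # 2-byte integer
        "askprice": np.array([100.2, 100.3], dtype=np.float64),
        "asksize": np.array([400, 600], dtype=np.int16),
    }
)

# One dtype per column: every value in a column shares the column's type.
print(quotes.dtypes)

# Storage size in bytes of a single value in the numeric columns.
byte_sizes = {col: quotes[col].dtype.itemsize for col in ["bidprice", "bidsize"]}
print(byte_sizes)
```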
- A stochastic process $\{X_t, t \in \mathbb{R}^*\}$ is strictly stationary if
  $$\{X_{t_1}, X_{t_2}, \ldots, X_{t_n}\} \overset{d}{=} \{X_{t_1+h}, X_{t_2+h}, \ldots, X_{t_n+h}\},$$
  for all $n \in \mathbb{N}^*$, all times $t_1, t_2, \ldots, t_n \in \mathbb{R}^*$, and all time
  shifts $h \in \mathbb{R}$. That is, its finite-dimensional distributions are invariant under
  time shifts.
- This is a pretty strict notion of stationarity; in practice, milder versions are used.
- For example, a finite-variance stochastic process $\{X_t, t \in \mathbb{R}^*\}$ is second-order
  stationary (or weakly stationary) if its mean function $\mu_t := E[X_t]$ (for all
  $t \in \mathbb{R}^*$; assuming $E[|X_t|] < \infty$) is constant, i.e. does not depend on time, and
  its autocovariance function $\gamma(s, t) := \mathrm{Cov}[X_s, X_t] = E[(X_s - \mu_s)(X_t - \mu_t)]$
  (for all $s, t \in \mathbb{R}^*$) depends on $s$ and $t$ only through their difference, $|s - t|$.
- Informally, a stationary time series is one whose statistical properties (mean,
  variance, autocorrelation, correlation for multivariate time series, etc.) are constant
  over time.
- Non-stationary time series exhibit non-stationary behaviour: random walks,
  trends, cycles, etc.
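The contrast between stationary and non-stationary behaviour can be sketched with a small simulation. Everything here is an illustrative assumption: Gaussian white noise stands in for a stationary series, its cumulative sum for a random walk, and the variance is estimated across simulated paths at two time steps.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_paths, n_steps = 5000, 200

# White noise: i.i.d. standard normal observations (stationary).
noise = rng.standard_normal((n_paths, n_steps))
# Random walk: cumulative sum of the same shocks (non-stationary).
walk = noise.cumsum(axis=1)


def variance_at(paths, t):
    """Sample variance across simulated paths at the (1-based) time step t."""
    return paths[:, t - 1].var(ddof=1)


# For a stationary process the variance does not depend on t, so the ratio is
# close to 1. For the random walk the variance grows linearly in t, so the
# ratio between t = 200 and t = 50 is close to 4.
noise_ratio = variance_at(noise, 200) / variance_at(noise, 50)
walk_ratio = variance_at(walk, 200) / variance_at(walk, 50)
print(f"white noise: {noise_ratio:.2f}, random walk: {walk_ratio:.2f}")
```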
Rolling window volatility time series for Goodyear from 1992 to 2012. The (simple, net)
return interval is one trading day and the time window has length 60 trading days. [SCSG13]
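A rolling-window volatility of the kind shown in the figure can be sketched in pandas as follows. The price path is synthetic (not Goodyear data), and the seed, dates, and daily volatility of 2% are arbitrary assumptions; only the return definition (simple, net, one trading day) and the 60-day window come from the caption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)
dates = pd.bdate_range("2010-01-04", periods=500)  # business days only

# Synthetic daily closing prices (an assumption, used in place of real data).
log_returns = rng.normal(loc=0.0, scale=0.02, size=len(dates))
prices = pd.Series(100.0 * np.exp(np.cumsum(log_returns)), index=dates)

# Simple net one-day returns: p_t / p_{t-1} - 1.
returns = prices.pct_change().dropna()

# Sample standard deviation of returns over a trailing 60-day window.
window = 60
rolling_vol = returns.rolling(window).std()
print(rolling_vol.dropna().head())
```

The first window - 1 values are undefined because a full window of returns is not yet available.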
Correlation matrix of 306 continuously traded (between 1992 and 2012) companies in the
S&P 500 index for the fourth quarter of 2005 and the first quarter of 2006; the darker, the
stronger the correlation. The companies are sorted according to industrial
sectors. [SCSG13]
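Quarter-by-quarter correlation matrices like those in the figure can be computed with a few lines of pandas. The sketch below uses four hypothetical tickers in two hypothetical sectors with synthetic returns driven by a common sector shock; it is not the S&P 500 data of [SCSG13].

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2)
dates = pd.bdate_range("2005-10-03", "2006-03-31")

# Hypothetical tickers, sorted by (hypothetical) sector as in the figure.
sectors = {"A1": "tech", "A2": "tech", "B1": "energy", "B2": "energy"}

# Each return is a shared sector shock plus idiosyncratic noise.
sector_shocks = {s: rng.normal(0.0, 0.01, len(dates)) for s in ("tech", "energy")}
returns = pd.DataFrame(
    {
        name: sector_shocks[sector] + rng.normal(0.0, 0.005, len(dates))
        for name, sector in sectors.items()
    },
    index=dates,
)

# One correlation matrix per quarter.
q4_2005 = returns.loc["2005-10-01":"2005-12-31"].corr()
q1_2006 = returns.loc["2006-01-01":"2006-03-31"].corr()
print(q4_2005.round(2))
print(q1_2006.round(2))
```

With the columns sorted by sector, stronger within-sector correlations show up as blocks along the diagonal.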
The middle dots or medians in successive years are connected by a heavy black
line, which represents the general upward or downward drift of the whole set of
fluctuations. To make this drift clear the median of each year is taken as the
starting point from which the upward or downward movements in the following
year are measured. Hence the chart has no fixed base line. But in this respect
it represents faithfully the figures from which it is made; since these figures are
percentages of prices in the preceding year, a price fluctuation in any year [...] are the
units from which all the changes proceed is further emphasized by connecting the
nine decils, as well as the points of greatest rise and fall, with the median of the
year before by light diagonal lines. The chart suggests a series of bursting bomb
shells, the bombs being represented by the median dots of the years before and
the scattering of their fragments by the lines which radiate to the decils and the
points of greatest rise and fall.
Time is well spent in studying this chart, because it is capable of giving a truer
impression of the characteristics of price changes than can be given by any other
device. The market diversity of the fluctuations of different commodities in the same
year — some rising, some falling, some remaining unchanged — the wide range
covered by these fluctuations, the erratic occurrence of extremely large changes,
and the fact that the greatest percentages of rise far surpass the greatest
percentages of fall are strikingly shown; but so also are the much greater frequency of
rather small variations, the dense concentration near the center of the field, the
existence of a general drift in the whole complex of changes, and the frequent
alterations in the direction and the degree of this drift.
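The quantities plotted in the chart described above (prices as percentages of the preceding year's prices, their median, the nine deciles, and the greatest rise and fall) can be sketched with NumPy. The prices are synthetic and the heavy-tailed growth factors are an arbitrary assumption; this is not a reconstruction of the original wholesale price data.

```python
import numpy as np

rng = np.random.default_rng(seed=3)
n_years, n_commodities = 6, 200

# Hypothetical prices: one row per year, one column per commodity.
growth = np.exp(0.1 * rng.standard_t(df=4, size=(n_years, n_commodities)))
prices = 100.0 * np.cumprod(growth, axis=0)

# Each price as a percentage of the same commodity's price in the preceding year.
relatives = 100.0 * prices[1:] / prices[:-1]

# Nine deciles per year; the fifth decile is the median.
deciles = np.percentile(relatives, np.arange(10, 100, 10), axis=1)
medians = deciles[4]

# Greatest rise and fall per year, in percentage points.
greatest_rise = relatives.max(axis=1) - 100.0
greatest_fall = 100.0 - relatives.min(axis=1)
print(np.round(medians, 1))
```

A price cannot fall by more than 100% but can rise by any amount, which is one reason the greatest rises can far exceed the greatest falls.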
Mandelbrot’s Figure 1: Two histograms illustrating departure from normality of the fifth and
tenth difference of monthly wool prices, 1890–1937. In each case, the continuous
bell-shaped curve represents the Gaussian “interpolate” based upon the sample variance.
Source: Gerhard Tintner, The Variate-Difference Method (Bloomington, Ind., 1940).
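The kind of departure from normality shown in the histograms can be sketched by comparing tail frequencies with what a Gaussian fitted to the sample variance would predict. As an assumption for illustration, a Student t distribution with 3 degrees of freedom stands in for a heavy-tailed price-change distribution; it is not the distribution family used in the original paper, nor the wool price data.

```python
import math

import numpy as np

rng = np.random.default_rng(seed=4)
n = 200_000
samples = {
    "gaussian": rng.standard_normal(n),
    "student_t3": rng.standard_t(df=3, size=n),  # heavy-tailed stand-in
}

# Probability that a Gaussian variable lies more than 3 standard deviations
# from its mean: P(|Z| > 3) = erfc(3 / sqrt(2)).
gaussian_tail = math.erfc(3.0 / math.sqrt(2.0))

# Observed frequency of deviations beyond 3 sample standard deviations.
tail_freq = {}
for name, x in samples.items():
    sigma = x.std(ddof=1)
    tail_freq[name] = float(np.mean(np.abs(x - x.mean()) > 3.0 * sigma))

print(f"Gaussian prediction: {gaussian_tail:.4f}")
print(tail_freq)
```

For the heavy-tailed sample, large deviations occur more often than the Gaussian curve based on the sample variance would suggest.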
It is much easier to more or less unambiguously spot a cat (source: Getty Images) than the
times when the S&P 500 index (sources: Google Finance, SIX) is volatile.
- Fat-finger errors are keyboard input errors in financial markets, e.g. where an order to
  buy or sell is placed for a wrong asset or contract, for a wrong size, or at a wrong price.
  They are a special case of the more general human errors contributing to operational risk.
- They have been known to cause significant problems (including material losses) when
  surfacing in production systems.
- Numerous fat-finger errors live on undetected in financial datasets.
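One simple way to screen a price series for suspected fat-finger entries is a robust outlier filter. The sketch below is only one possible approach under stated assumptions: the prices are made up, the window length and threshold are arbitrary, and deviations are measured from a centred rolling median and scaled by the median absolute deviation (MAD).

```python
import pandas as pd

# Hypothetical prices; the fifth entry looks like a misplaced decimal point.
prices = pd.Series(
    [100.0, 100.2, 100.1, 100.3, 1002.0, 100.2, 100.4, 100.3, 100.5, 100.4]
)

# Centred rolling median as a robust estimate of the local price level.
window = 5
rolling_median = prices.rolling(window, center=True, min_periods=1).median()

# Absolute deviation from the local level, scaled by the MAD of those
# deviations. The factor 1.4826 makes the MAD comparable to a standard
# deviation under normality.
abs_dev = (prices - rolling_median).abs()
mad = abs_dev.median()
robust_z = abs_dev / (1.4826 * mad)

# Flag points whose robust z-score exceeds an (arbitrary) threshold.
threshold = 10.0
suspects = prices[robust_z > threshold]
print(suspects)
```

The median and the MAD are used because a single extreme value barely moves them, whereas it would inflate a mean and standard deviation and so hide itself. A flagged point is only a candidate for review: a genuine price jump would be flagged in the same way.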
1 From https://en.wikipedia.org/wiki/Fat-finger_error#Examples
Financial data occurs not only as time series but also as (evolving, dynamic)
networks [CMeS13, ACM16].
Figure 13.1 from [CMeS13]: Brazilian interbank network, December 2007. The number of
financial conglomerates is n = 125 and the number of links in this representation at any
date does not exceed 1200.
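The figures in the caption imply that the network is sparse. The short calculation below makes this concrete; since the caption does not say whether links are counted as directed or undirected, both readings are computed as an assumption.

```python
n_nodes = 125     # financial conglomerates (from the caption)
max_links = 1200  # upper bound on the number of links at any date

# Number of possible links between distinct nodes.
possible_directed = n_nodes * (n_nodes - 1)
possible_undirected = possible_directed // 2

# Density: the fraction of possible links that are present, at most.
density_directed = max_links / possible_directed
density_undirected = max_links / possible_undirected

print(f"directed:   at most {density_directed:.1%} of {possible_directed} possible links")
print(f"undirected: at most {density_undirected:.1%} of {possible_undirected} possible links")
```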
- Big Data (BD): data mining on data sets collected for administrative, legal, insurance,
  financial planning, medical care, etc. purposes; such data are not intended in the first
  place to gain insight into a phenomenon or to solve a real-world problem, but the
  collected records may contain data (and usually do) that may be useful for other
  purposes than those originally intended. (L. Grandinetti, G. R. Joubert, M. Kunze)
- Data Mining: the use of machine learning algorithms to find faint patterns of
  relationship between data elements in large, noisy, and messy data sets, which can
  lead to actions to increase benefit in some form (diagnosis, profit, detection, etc.).
- Noisy multidimensional time series.
- New and specialised datasets.
- Big (but also repurposed) data:
  - Data governance and other compliance functions (logs, audit, internal data) are becoming
    integrated with big data platforms.
  - Software such as Apache Hadoop and Apache Spark is being adopted by financial services.
- Definitions of high-frequency data within finance vary from one source to the next, for
  example:
  - High-frequency financial data are observations on financial variables taken daily or at a
    finer time scale, and are often irregularly spaced over time. [YZ05]
  - High-frequency financial data usually refer to data sampled at a time horizon smaller than
    a trading day. [Con10]
  - High-frequency data, also known as tick data, are records of live market activity. Every
    time a customer, a dealer, or another entity posts a so-called limit order to buy $s$ units
    of a specific security with ticker $X$ at price $q$, a bid quote $q^b_{t_b}$ is logged at
    time $t_b$ to buy $S^b_{t_b}$ units of $X$. Market orders are incorporated into tick data in
    different ways... [Ald13]
- Often such data is updated at sub-second frequency.
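Irregularly spaced tick data are often aggregated onto a regular grid before analysis. The sketch below uses hypothetical ticks and an arbitrary one-second bar length; it measures the gaps between ticks and then builds open/high/low/close bars with pandas.

```python
import pandas as pd

# Hypothetical ticks with irregular, sub-second spacing. Two ticks share a
# millisecond timestamp.
ticks = pd.DataFrame(
    {"price": [100.0, 100.5, 99.5, 101.0]},
    index=pd.to_datetime(
        [
            "2023-02-03 09:30:00.100",
            "2023-02-03 09:30:00.100",
            "2023-02-03 09:30:00.750",
            "2023-02-03 09:30:02.300",
        ]
    ),
)

# Time between consecutive ticks, in milliseconds (the first gap is undefined).
gaps_ms = ticks.index.to_series().diff().dt.total_seconds().mul(1000.0)

# Aggregate onto a regular one-second grid as open/high/low/close bars.
# Seconds without any tick produce a row of missing values.
bars = ticks["price"].resample(pd.Timedelta(seconds=1)).ohlc()
print(bars)
```

How to treat the empty bars (leave them missing, forward-fill the last price, or drop them) is a modelling choice that depends on the application.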
2 http://www.bbc.co.uk/news/business-18671255
3 http://www.investopedia.com/articles/forex/053014/will-ois-replace-libor.asp
(L to R) Sir George Iacobescu, Chairman of Canary Wharf Group, Eric Van Der Kleij, Head
of Level39 and London Mayor Boris Johnson open Canary Wharf’s newest FinTech
accelerator, Level39 on the (image: London Loves Business, 2013.04.08 article by Shruti
Tripathi Chopra)
- Andrew Ng [Ng16]:
  Among leading AI teams, many can likely replicate others’ software in, at most, 1–2
  years. But it is exceedingly difficult to get access to someone else’s data. Thus
  data, rather than software, is the defensible barrier for many businesses.
- Andrew Ng (@AndrewYNg on Twitter, 2017.02.05):
  For AI to be free we need not just Open Source, but also a strong Open Data
  movement.
- Sources of open financial data (see also [Mar16]):
  - Quandl (https://www.quandl.com/): a marketplace for financial, economic and alternative
    data delivered in modern formats for today’s analysts, including Python, Excel, Matlab,
    and R.
  - DATA.GOV (https://www.data.gov/finance/): hundreds of free data sets on financial
    services, including banking, lending, retirement, investments, and insurance.
  - FRED (https://fred.stlouisfed.org/): Federal Reserve Economic Data.
  - Kaggle Datasets (https://www.kaggle.com/datasets): a search engine to find open
    datasets.
  - Google Finance (https://www.google.com/finance): 40 years’ worth of stock market
    data.
  - Yahoo! Finance (https://uk.finance.yahoo.com/): financial data, also accessible via an
    API.
- Financial institutions are beginning to consume more and more non-financial data:
  - News and news sentiment data [ZS10, Ame15, BM15].
  - Tweets [RHC+12].
  - Chat data [Kis16].
  - Blogs [ZS10].
  - Company filings [Kis16].
  - Shopping surveys [Kis16].
- Though this is far from easy [Kis16]:
  The most unfathomable U.S. presidential race in memory gave Predata, a startup
  firm backed by hedge fund manager Kyle Bass, a chance to cut through the vitriol
  and coolly calculate the winner by monitoring chat data. Turns out that Predata,
  which anointed Hillary Clinton the likely victor days before the Nov. 8 vote, was no
  smarter than the cable news pundits.
Finance as a source of applications for data science and machine learning
- “Despite stock returns once having been thought to be unforecastable, there is now
  plenty of optimism that this is not so... Benefits can arise from taking a longer horizon,
  from using disaggregated data, from carefully removing outliers or exceptional events,
  and especially from considering non-linear models. [...] The papers also suggest that
  some sub-periods may be more forecastable than others, such as summer months
  or January, and this is worth exploring. If many component series are available, then
  ranks may produce further information that is helpful with forecasting. There seems to
  be many opportunities for forecasters, many of whom need to break away from simple
  linear univariate ARIMA or multivariate transfer functions. It is often not easy to beat
  convincingly these simple methods, so they make excellent base-line models, but they
  often can be beaten.” [Gra92]
- Those of us doing algorithmic trading know that at least some markets at least some of
  the time aren’t fully efficient.
- XTX Markets, the electronic market maker co-led by Deutsche Bank’s former head of
  foreign exchange trading Zar Amrolia, brought in more than £10 million a month in
  revenues during its first six months of operations. (Financial News, 2016.09.06)
Bibliography
Saijel Kishan.
Big data is a big mess for hedge funds hunting signals.
Bloomberg, November 2016.
Benoit B. Mandelbrot.
The variation of certain speculative prices.
Journal of Business, XXXVI:392–417, 1963.
Bernard Marr.
Big data: 33 brilliant and free data sources for 2016.
Forbes, February 2016.
Robert C. Merton.
On estimating the expected return on the market: An exploratory investigation.
Journal of Financial Economics, 8:323–361, 1980.
Wesley Clair Mitchell.
The Making and Using of Index Numbers: Introduction to Index Numbers and
Wholesale Prices in the United States and Foreign Countries, volume 173 of Bulletins
of the U.S. Bureau of Labor Statistics.
U.S. Bureau of Labor Statistics, 1915.
Andrew Ng.
What artificial intelligence can and can’t do right now.
Harvard Business Review, November 2016.
Eduardo J. Ruiz, Vagelis Hristidis, Carlos Castillo, Aristides Gionis, and Alejandro
Jaimes.