How to use dates & times with pandas


MANIPULATING TIME SERIES DATA IN PYTHON

Stefan Jansen
Founder & Lead Data Scientist at
Applied Artificial Intelligence
Date & time series functionality
At the root: data types for date & time information
Objects for points in time and periods

Attributes & methods reflect time-related details

Sequences of dates & periods:

Series or DataFrame columns

Index: convert object into Time Series

Many Series/DataFrame methods rely on time information in the index to provide time-series functionality

Basic building block: pd.Timestamp
import pandas as pd # assumed imported going forward
from datetime import datetime # To manually create dates
time_stamp = pd.Timestamp(datetime(2017, 1, 1))
pd.Timestamp('2017-01-01') == time_stamp

True # Understands dates as strings

time_stamp # type: pandas.Timestamp

Timestamp('2017-01-01 00:00:00')

Basic building block: pd.Timestamp
Timestamp object has many attributes to store time-specific information

time_stamp.year

2017

time_stamp.day_name()

'Sunday'

More building blocks: pd.Period & freq
period = pd.Period('2017-01')
period # default: month-end

Period('2017-01', 'M')

Period object has freq attribute to store frequency info

period.asfreq('D') # convert to daily

Period('2017-01-31', 'D')

Convert pd.Period() to pd.Timestamp() and back

period.to_timestamp().to_period('M')

Period('2017-01', 'M')

More building blocks: pd.Period & freq
Frequency info enables basic date arithmetic

period + 2

Period('2017-03', 'M')

pd.Timestamp('2017-01-31', 'M') + 1

Timestamp('2017-02-28 00:00:00', freq='M')
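
The Period and Timestamp conversions on these slides can be collected into one short sketch. It assumes a recent pandas release, where integer arithmetic on a Timestamp is not available, so the month-end step uses `pd.offsets.MonthEnd` instead. The names `later`, `daily`, `round_trip` and `next_month_end` are illustrative only.

```python
import pandas as pd

period = pd.Period('2017-01', freq='M')    # monthly period

later = period + 2                          # two months later
daily = period.asfreq('D')                  # last day of the month by default
round_trip = period.to_timestamp().to_period('M')

# Assumption: an explicit offset replaces integer arithmetic on a Timestamp
next_month_end = pd.Timestamp('2017-01-31') + pd.offsets.MonthEnd(1)

print(later, daily, round_trip, next_month_end)
```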

Sequences of dates & times
pd.date_range : start , end , periods , freq

index = pd.date_range(start='2017-1-1', periods=12, freq='M')

index

DatetimeIndex(['2017-01-31', '2017-02-28', '2017-03-31', ...,
'2017-09-30', '2017-10-31', '2017-11-30', '2017-12-31'],
dtype='datetime64[ns]', freq='M')

pd.DatetimeIndex : sequence of Timestamp objects with frequency info

Sequences of dates & times
index[0]

Timestamp('2017-01-31 00:00:00', freq='M')

index.to_period()

PeriodIndex(['2017-01', '2017-02', '2017-03', '2017-04', ...,
'2017-11', '2017-12'], dtype='period[M]', freq='M')

Create a time series: pd.DatetimeIndex
pd.DataFrame({'data': index}).info()

RangeIndex: 12 entries, 0 to 11
Data columns (total 1 columns):
data 12 non-null datetime64[ns]
dtypes: datetime64[ns](1)

Create a time series: pd.DatetimeIndex
np.random.random :
Random numbers: [0,1)

12 rows, 2 columns

import numpy as np
data = np.random.random(size=(12, 2))
pd.DataFrame(data=data, index=index).info()

DatetimeIndex: 12 entries, 2017-01-31 to 2017-12-31


Freq: M
Data columns (total 2 columns):
0 12 non-null float64
1 12 non-null float64
dtypes: float64(2)
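
As a self-contained sketch of the steps above, the following builds a month-end DatetimeIndex and attaches it to random data. It assumes an offset object (`pd.offsets.MonthEnd()`) in place of the `'M'` alias, because alias spellings differ between pandas versions, and it uses NumPy's `default_rng` with an arbitrary seed so the run is repeatable.

```python
import numpy as np
import pandas as pd

# Assumption: offset object used instead of the 'M' alias shown on the slide
index = pd.date_range(start='2017-01-01', periods=12,
                      freq=pd.offsets.MonthEnd())

rng = np.random.default_rng(seed=42)   # arbitrary seed for repeatability
data = rng.random(size=(12, 2))        # 12 rows, 2 columns, values in [0, 1)

df = pd.DataFrame(data=data, index=index)
df.info()
```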

Frequency aliases & time info

Indexing & resampling time series

Time series transformation
Basic time series transformations include:

Parsing string dates and converting to datetime64

Selecting & slicing for specific subperiods

Setting & changing DatetimeIndex frequency


Upsampling vs Downsampling

Getting GOOG stock prices
google = pd.read_csv('google.csv') # import pandas as pd
google.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 504 entries, 0 to 503
Data columns (total 2 columns):
date 504 non-null object
price 504 non-null float64
dtypes: float64(1), object(1)

google.head()

date price
0 2015-01-02 524.81
1 2015-01-05 513.87
2 2015-01-06 501.96
3 2015-01-07 501.10
4 2015-01-08 502.68

Converting string dates to datetime64
pd.to_datetime() :
Parse date string

Convert to datetime64

google.date = pd.to_datetime(google.date)
google.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 504 entries, 0 to 503
Data columns (total 2 columns):
date 504 non-null datetime64[ns]
price 504 non-null float64
dtypes: datetime64[ns](1), float64(1)

Converting string dates to datetime64
.set_index() :
Date into index

inplace :
don't create copy

google.set_index('date', inplace=True)
google.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 504 entries, 2015-01-02 to 2016-12-30
Data columns (total 1 columns):
price 504 non-null float64
dtypes: float64(1)

Plotting the Google stock time series
import matplotlib.pyplot as plt
google.price.plot(title='Google Stock Price')
plt.tight_layout(); plt.show()

Partial string indexing
Selecting/indexing using strings that parse to dates

google.loc['2015'].info() # Pass string for part of date

DatetimeIndex: 252 entries, 2015-01-02 to 2015-12-31


Data columns (total 1 columns):
price 252 non-null float64
dtypes: float64(1)

google['2015-3': '2016-2'].info() # Slice includes last month

DatetimeIndex: 252 entries, 2015-03-02 to 2016-02-29


Data columns (total 1 columns):
price 252 non-null float64
dtypes: float64(1)
memory usage: 3.9 KB

Partial string indexing
google.loc['2016-6-1', 'price'] # Use full date with .loc[]

734.15
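
Partial string indexing can be tried without the Google file. The sketch below assumes a synthetic business-day series (the values are just row counters, not prices) and uses `.loc` for every selection.

```python
import pandas as pd

# Synthetic stand-in for the stock data: one row per business day
idx = pd.date_range('2015-01-01', '2016-12-31', freq='B')
prices = pd.DataFrame({'price': [float(i) for i in range(len(idx))]},
                      index=idx)

year_2015 = prices.loc['2015']               # every row in 2015
span = prices.loc['2015-3':'2016-2']         # slice includes the last month
one_day = prices.loc['2016-06-01', 'price']  # full date returns a scalar

print(len(year_2015), len(span), one_day)
```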

.asfreq(): set frequency
.asfreq('D') :
Convert DatetimeIndex to calendar day frequency

google.asfreq('D').info() # set calendar day frequency

DatetimeIndex: 729 entries, 2015-01-02 to 2016-12-30


Freq: D
Data columns (total 1 columns):
price 504 non-null float64
dtypes: float64(1)

.asfreq(): set frequency
Upsampling:
Higher frequency implies new dates => missing data

google.asfreq('D').head()

price
date
2015-01-02 524.81
2015-01-03 NaN
2015-01-04 NaN
2015-01-05 513.87
2015-01-06 501.96

.asfreq(): reset frequency
.asfreq('B') :
Convert DatetimeIndex to business day frequency

google = google.asfreq('B') # Change to business day frequency


google.info()

DatetimeIndex: 521 entries, 2015-01-02 to 2016-12-30


Freq: B
Data columns (total 1 columns):
price 504 non-null float64
dtypes: float64(1)

.asfreq(): reset frequency
google[google.price.isnull()] # Select missing 'price' values

price
date
2015-01-19 NaN
2015-02-16 NaN
...
2016-11-24 NaN
2016-12-26 NaN

Business days that were not trading days
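
A three-row sketch shows the difference between the two frequencies. It reuses the first three prices from the earlier `head()` output; the names `daily` and `business` are illustrative.

```python
import pandas as pd

dates = pd.to_datetime(['2015-01-02', '2015-01-05', '2015-01-06'])
prices = pd.Series([524.81, 513.87, 501.96], index=dates)

daily = prices.asfreq('D')      # calendar days: the weekend appears as NaN
business = prices.asfreq('B')   # business days: no new dates in this sample

print(daily)
print(business)
```

Calendar-day frequency adds Saturday and Sunday as missing values, while business-day frequency only adds weekdays that are absent from the data, such as market holidays.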

Lags, changes, and returns for stock price series

Basic time series calculations
Typical Time Series manipulations include:
Shift or lag values back or forward in time

Get the difference in value for a given time period

Compute the percent change over any number of periods

pandas built-in methods rely on pd.DatetimeIndex

Getting GOOG stock prices
Let pd.read_csv() do the parsing for you!

google = pd.read_csv('google.csv', parse_dates=['date'], index_col='date')

google.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 504 entries, 2015-01-02 to 2016-12-30
Data columns (total 1 columns):
price 504 non-null float64
dtypes: float64(1)

Getting GOOG stock prices
google.head()

price
date
2015-01-02 524.81
2015-01-05 513.87
2015-01-06 501.96
2015-01-07 501.10
2015-01-08 502.68

.shift(): Moving data between past & future
.shift() :
defaults to periods=1

1 period into future

google['shifted'] = google.price.shift() # default: periods=1


google.head(3)

price shifted
date
2015-01-02 524.81 NaN
2015-01-05 513.87 524.81
2015-01-06 501.96 513.87

.shift(): Moving data between past & future
.shift(periods=-1) :
lagged data

1 period back in time

google['lagged'] = google.price.shift(periods=-1)
google[['price', 'lagged', 'shifted']].tail(3)

price lagged shifted


date
2016-12-28 785.05 782.79 791.55
2016-12-29 782.79 771.82 785.05
2016-12-30 771.82 NaN 782.79
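
The two shift directions can be checked on a short series. This sketch reuses the first five prices from the earlier `head()` output and keeps the slide's column names `shifted` and `lagged`.

```python
import pandas as pd

dates = pd.to_datetime(['2015-01-02', '2015-01-05', '2015-01-06',
                        '2015-01-07', '2015-01-08'])
price = pd.Series([524.81, 513.87, 501.96, 501.10, 502.68], index=dates)

shifted = price.shift()             # each date receives the previous value
lagged = price.shift(periods=-1)    # each date receives the next value

print(pd.DataFrame({'price': price, 'shifted': shifted, 'lagged': lagged}))
```

A positive shift leaves the first row empty, a negative shift leaves the last row empty.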

Calculate one-period percent change
$x_t / x_{t-1}$
google['change'] = google.price.div(google.shifted)
google[['price', 'shifted', 'change']].head(3)

price shifted change


Date
2017-01-03 786.14 NaN NaN
2017-01-04 786.90 786.14 1.000967
2017-01-05 794.02 786.90 1.009048

Calculate one-period percent change
google['return'] = google.change.sub(1).mul(100)
google[['price', 'shifted', 'change', 'return']].head(3)

price shifted change return


date
2015-01-02 524.81 NaN NaN NaN
2015-01-05 513.87 524.81 0.98 -2.08
2015-01-06 501.96 513.87 0.98 -2.32

.diff(): built-in time-series change
Difference in value for two adjacent periods

$x_t - x_{t-1}$
google['diff'] = google.price.diff()
google[['price', 'diff']].head(3)

price diff
date
2015-01-02 524.81 NaN
2015-01-05 513.87 -10.94
2015-01-06 501.96 -11.91

.pct_change(): built-in time-series % change
Percent change for two adjacent periods
$\frac{x_t}{x_{t-1}}$

google['pct_change'] = google.price.pct_change().mul(100)
google[['price', 'return', 'pct_change']].head(3)

price return pct_change


date
2015-01-02 524.81 NaN NaN
2015-01-05 513.87 -2.08 -2.08
2015-01-06 501.96 -2.32 -2.32

Looking ahead: Get multi-period returns
google['return_3d'] = google.price.pct_change(periods=3).mul(100)
google[['price', 'return_3d']].head()

price return_3d
date
2015-01-02 524.81 NaN
2015-01-05 513.87 NaN
2015-01-06 501.96 NaN
2015-01-07 501.10 -4.517825
2015-01-08 502.68 -2.177594

Percent change for two periods, 3 trading days apart
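
The manual and built-in calculations can be compared side by side. The sketch below reuses the five prices shown in the output above; `manual_return` and `builtin_return` are illustrative names.

```python
import pandas as pd

dates = pd.to_datetime(['2015-01-02', '2015-01-05', '2015-01-06',
                        '2015-01-07', '2015-01-08'])
price = pd.Series([524.81, 513.87, 501.96, 501.10, 502.68], index=dates)

# Manual route: divide by the shifted series, subtract 1, scale to percent
manual_return = price.div(price.shift()).sub(1).mul(100)

# Built-in equivalents
builtin_return = price.pct_change().mul(100)
diff = price.diff()
return_3d = price.pct_change(periods=3).mul(100)

print(pd.DataFrame({'manual': manual_return, 'builtin': builtin_return,
                    'diff': diff, 'return_3d': return_3d}))
```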

Compare time series growth rates

Comparing stock performance
Stock price series: hard to compare at different levels

Simple solution: normalize price series to start at 100

Divide all prices by first in series, multiply by 100

Same starting point

All prices relative to starting point

Difference to starting point in percentage points

Normalizing a single series (1)
google = pd.read_csv('google.csv', parse_dates=['date'], index_col='date')
google.head(3)

price
date
2010-01-04 313.06
2010-01-05 311.68
2010-01-06 303.83

first_price = google.price.iloc[0] # int-based selection


first_price

313.06

first_price == google.loc['2010-01-04', 'price']

True

Normalizing a single series (2)
normalized = google.price.div(first_price).mul(100)
normalized.plot(title='Google Normalized Series')

Normalizing multiple series (1)
prices = pd.read_csv('stock_prices.csv',
parse_dates=['date'],
index_col='date')
prices.info()

DatetimeIndex: 1761 entries, 2010-01-04 to 2016-12-30


Data columns (total 3 columns):
AAPL 1761 non-null float64
GOOG 1761 non-null float64
YHOO 1761 non-null float64
dtypes: float64(3)

prices.head(2)

AAPL GOOG YHOO


Date
2010-01-04 30.57 313.06 17.10
2010-01-05 30.63 311.68 17.23

Normalizing multiple series (2)
prices.iloc[0]

AAPL 30.57
GOOG 313.06
YHOO 17.10
Name: 2010-01-04 00:00:00, dtype: float64

normalized = prices.div(prices.iloc[0])
normalized.head(3)

AAPL GOOG YHOO


Date
2010-01-04 1.000000 1.000000 1.000000
2010-01-05 1.001963 0.995592 1.007602
2010-01-06 0.985934 0.970517 1.004094

.div() : automatic alignment of Series index & DataFrame columns

Comparing with a benchmark (1)
index = pd.read_csv('benchmark.csv', parse_dates=['date'], index_col='date')
index.info()

DatetimeIndex: 1826 entries, 2010-01-01 to 2016-12-30


Data columns (total 1 columns):
SP500 1762 non-null float64
dtypes: float64(1)

prices = pd.concat([prices, index], axis=1).dropna()


prices.info()

DatetimeIndex: 1761 entries, 2010-01-04 to 2016-12-30


Data columns (total 4 columns):
AAPL 1761 non-null float64
GOOG 1761 non-null float64
YHOO 1761 non-null float64
SP500 1761 non-null float64
dtypes: float64(4)

Comparing with a benchmark (2)
prices.head(1)

AAPL GOOG YHOO SP500


2010-01-04 30.57 313.06 17.10 1132.99

normalized = prices.div(prices.iloc[0]).mul(100)
normalized.plot()

Plotting performance difference
tickers = ['GOOG', 'YHOO', 'AAPL']
diff = normalized[tickers].sub(normalized['SP500'], axis=0)

GOOG YHOO AAPL


2010-01-04 0.000000 0.000000 0.000000
2010-01-05 -0.752375 0.448669 -0.115294
2010-01-06 -3.314604 0.043069 -1.772895

.sub(..., axis=0) : Subtract a Series from each DataFrame column by aligning indexes

Plotting performance difference
diff.plot()
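
The normalize-then-subtract pattern can be verified on numbers small enough to check by hand. The tickers `A` and `B`, the `BENCH` column and all values below are made up for illustration.

```python
import pandas as pd

dates = pd.to_datetime(['2010-01-04', '2010-01-05', '2010-01-06'])
prices = pd.DataFrame({'A': [10.0, 11.0, 12.0],
                       'B': [200.0, 190.0, 210.0],
                       'BENCH': [1000.0, 1010.0, 1020.0]}, index=dates)

# Divide each column by its first value so every series starts at 100
normalized = prices.div(prices.iloc[0]).mul(100)

# Subtract the benchmark from each stock column, aligning on the date index
tickers = ['A', 'B']
diff = normalized[tickers].sub(normalized['BENCH'], axis=0)
print(diff)
```

The result is each stock's lead or shortfall against the benchmark in percentage points: `A` ends 18 points ahead, `B` ends 3 points ahead.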

Changing the time series frequency: resampling

Changing the frequency: resampling
DatetimeIndex : set & change freq using .asfreq()

But frequency conversion affects the data

Upsampling: fill or interpolate missing data

Downsampling: aggregate existing data

pandas API:
.asfreq() , .reindex()

.resample() + transformation method

Getting started: quarterly data
dates = pd.date_range(start='2016', periods=4, freq='Q')
data = range(1, 5)
quarterly = pd.Series(data=data, index=dates)
quarterly

2016-03-31 1
2016-06-30 2
2016-09-30 3
2016-12-31 4
Freq: Q-DEC, dtype: int64 # Default: year-end quarters

Upsampling: quarter => month
monthly = quarterly.asfreq('M') # to month-end frequency

2016-03-31 1.0
2016-04-30 NaN
2016-05-31 NaN
2016-06-30 2.0
2016-07-31 NaN
2016-08-31 NaN
2016-09-30 3.0
2016-10-31 NaN
2016-11-30 NaN
2016-12-31 4.0
Freq: M, dtype: float64

Upsampling creates missing values

monthly = monthly.to_frame('baseline') # to DataFrame

Upsampling: fill methods
monthly['ffill'] = quarterly.asfreq('M', method='ffill')
monthly['bfill'] = quarterly.asfreq('M', method='bfill')
monthly['value'] = quarterly.asfreq('M', fill_value=0)

Upsampling: fill methods
bfill : backfill

ffill : forward fill

baseline ffill bfill value


2016-03-31 1.0 1 1 1
2016-04-30 NaN 1 2 0
2016-05-31 NaN 1 2 0
2016-06-30 2.0 2 2 2
2016-07-31 NaN 2 3 0
2016-08-31 NaN 2 3 0
2016-09-30 3.0 3 3 3
2016-10-31 NaN 3 4 0
2016-11-30 NaN 3 4 0
2016-12-31 4.0 4 4 4
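
The table above can be reproduced with a self-contained sketch. It assumes offset objects (`pd.offsets.QuarterEnd`, `pd.offsets.MonthEnd`) in place of the `'Q'` and `'M'` aliases, since alias spellings differ between pandas versions; everything else follows the slides.

```python
import pandas as pd

# Assumption: offset objects stand in for the 'Q' and 'M' aliases
dates = pd.date_range(start='2016-01-01', periods=4,
                      freq=pd.offsets.QuarterEnd(startingMonth=3))
quarterly = pd.Series(data=range(1, 5), index=dates)

month_end = pd.offsets.MonthEnd()
monthly = quarterly.asfreq(month_end).to_frame('baseline')
monthly['ffill'] = quarterly.asfreq(month_end, method='ffill')  # carry last value forward
monthly['bfill'] = quarterly.asfreq(month_end, method='bfill')  # pull next value back
monthly['value'] = quarterly.asfreq(month_end, fill_value=0)    # constant for new dates
print(monthly)
```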

Add missing months: .reindex()
dates = pd.date_range(start='2016', periods=12, freq='M')

DatetimeIndex(['2016-01-31', '2016-02-29', ..., '2016-11-30', '2016-12-31'],
dtype='datetime64[ns]', freq='M')

quarterly.reindex(dates)

2016-01-31 NaN
2016-02-29 NaN
2016-03-31 1.0
2016-04-30 NaN
2016-05-31 NaN
2016-06-30 2.0
2016-07-31 NaN
2016-08-31 NaN
2016-09-30 3.0
2016-10-31 NaN
2016-11-30 NaN
2016-12-31 4.0

.reindex() :
conform DataFrame to new index

same filling logic as .asfreq()

Upsampling & interpolation with .resample()

Frequency conversion & transformation methods
.resample() : similar to .groupby()

Groups data within resampling period and applies one or several methods to each group

New date determined by offset - start, end, etc

Upsampling: fill from existing or interpolate values

Downsampling: apply aggregation to existing data

Getting started: monthly unemployment rate
unrate = pd.read_csv('unrate.csv', parse_dates=['Date'], index_col='Date')
unrate.info()

DatetimeIndex: 208 entries, 2000-01-01 to 2017-04-01


Data columns (total 1 columns):
UNRATE 208 non-null float64 # no frequency information
dtypes: float64(1)

unrate.head()

UNRATE
DATE
2000-01-01 4.0
2000-02-01 4.1
2000-03-01 4.0
2000-04-01 3.8
2000-05-01 4.0

Reporting date: 1st day of month

Resampling Period & Frequency Offsets
Resample creates new date for frequency offset

Several alternatives to calendar month end

Frequency             Alias  Sample Date
Calendar Month End    M      2017-04-30
Calendar Month Start  MS     2017-04-01
Business Month End    BM     2017-04-28
Business Month Start  BMS    2017-04-03

Resampling logic

Assign frequency with .resample()
unrate.asfreq('MS').info()

DatetimeIndex: 208 entries, 2000-01-01 to 2017-04-01


Freq: MS
Data columns (total 1 columns):
UNRATE 208 non-null float64
dtypes: float64(1)

unrate.resample('MS') # creates Resampler object

DatetimeIndexResampler [freq=<MonthBegin>, axis=0, closed=left, label=left, convention=start, base=0]

Assign frequency with .resample()
unrate.asfreq('MS').equals(unrate.resample('MS').asfreq())

True

.resample() : returns data only when calling another method

Quarterly real GDP growth
gdp = pd.read_csv('gdp.csv', parse_dates=['DATE'], index_col='DATE') # column name assumed from the index label shown below
gdp.info()

DatetimeIndex: 69 entries, 2000-01-01 to 2017-01-01


Data columns (total 1 columns):
gpd 69 non-null float64 # no frequency info
dtypes: float64(1)

gdp.head(2)

gpd
DATE
2000-01-01 1.2
2000-04-01 7.8

Interpolate monthly real GDP growth
gdp_1 = gdp.resample('MS').ffill().add_suffix('_ffill')

gpd_ffill
DATE
2000-01-01 1.2
2000-02-01 1.2
2000-03-01 1.2
2000-04-01 7.8

Interpolate monthly real GDP growth
gdp_2 = gdp.resample('MS').interpolate().add_suffix('_inter')

gpd_inter
DATE
2000-01-01 1.200000
2000-02-01 3.400000
2000-03-01 5.600000
2000-04-01 7.800000

.interpolate() : finds points on straight line between existing data
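
Forward fill and linear interpolation can be compared on the two quarterly values shown above (1.2 and 7.8). The sketch builds the frame inline instead of reading `gdp.csv`; the column name `gdp` is illustrative.

```python
import pandas as pd

gdp = pd.DataFrame({'gdp': [1.2, 7.8]},
                   index=pd.to_datetime(['2000-01-01', '2000-04-01']))

filled = gdp.resample('MS').ffill()              # repeat the last known value
interpolated = gdp.resample('MS').interpolate()  # straight line between points

print(pd.concat([filled.add_suffix('_ffill'),
                 interpolated.add_suffix('_inter')], axis=1))
```

Forward fill holds 1.2 until the April observation arrives, whereas interpolation moves in equal steps of 2.2 per month.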

Concatenating two DataFrames
df1 = pd.DataFrame([1, 2, 3], columns=['df1'])
df2 = pd.DataFrame([4, 5, 6], columns=['df2'])
pd.concat([df1, df2])

df1 df2
0 1.0 NaN
1 2.0 NaN
2 3.0 NaN
0 NaN 4.0
1 NaN 5.0
2 NaN 6.0

Concatenating two DataFrames
pd.concat([df1, df2], axis=1)

df1 df2
0 1 4
1 2 5
2 3 6

axis=1 : concatenate horizontally

Plot interpolated real GDP growth
pd.concat([gdp_1, gdp_2], axis=1).loc['2015':].plot()

Combine GDP growth & unemployment
pd.concat([unrate, gdp_2], axis=1).plot(); # gdp_2: interpolated monthly GDP growth

Downsampling & aggregation

Downsampling & aggregation methods
So far: upsampling, fill logic & interpolation

Now: downsampling
hour to day

day to month, etc

How to represent the existing values at the new date?


Mean, median, last value?

Air quality: daily ozone levels
ozone = pd.read_csv('ozone.csv',
parse_dates=['date'],
index_col='date')
ozone.info()

DatetimeIndex: 6291 entries, 2000-01-01 to 2017-03-31


Data columns (total 1 columns):
Ozone 6167 non-null float64
dtypes: float64(1)

ozone = ozone.resample('D').asfreq()
ozone.info()

DatetimeIndex: 6300 entries, 2000-01-01 to 2017-03-31


Freq: D
Data columns (total 1 columns):
Ozone 6167 non-null float64
dtypes: float64(1)

Creating monthly ozone data
ozone.resample('M').mean().head()

Ozone
date
2000-01-31 0.010443
2000-02-29 0.011817
2000-03-31 0.016810
2000-04-30 0.019413
2000-05-31 0.026535

ozone.resample('M').median().head()

Ozone
date
2000-01-31 0.009486
2000-02-29 0.010726
2000-03-31 0.017004
2000-04-30 0.019866
2000-05-31 0.026018

.resample().mean() : Monthly average, assigned to end of calendar month

Creating monthly ozone data
ozone.resample('M').agg(['mean', 'std']).head()

Ozone
mean std
date
2000-01-31 0.010443 0.004755
2000-02-29 0.011817 0.004072
2000-03-31 0.016810 0.004977
2000-04-30 0.019413 0.006574
2000-05-31 0.026535 0.008409

.resample().agg() : List of aggregation functions like groupby
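
Downsampling with several aggregations can be checked on synthetic data whose monthly statistics are easy to work out by hand. The sketch assumes a daily series whose value is simply the day of the month, and an offset object in place of the `'M'` alias, since alias spellings differ between pandas versions.

```python
import pandas as pd

days = pd.date_range('2000-01-01', '2000-03-31', freq='D')
# Synthetic data: each value equals the day of the month
daily = pd.Series(days.day.to_numpy(dtype='float64'), index=days, name='value')

# Assumption: offset object used instead of the 'M' alias shown on the slides
monthly = daily.resample(pd.offsets.MonthEnd()).agg(['mean', 'median'])
print(monthly)
```

Each month collapses to one row labeled with the month-end date; January and March average 16, and the 29-day February of 2000 averages 15.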

Plotting resampled ozone data
ozone = ozone.loc['2016':]
ax = ozone.plot()
monthly = ozone.resample('M').mean()
monthly.add_suffix('_monthly').plot(ax=ax)

Resampling multiple time series
data = pd.read_csv('ozone_pm25.csv',
parse_dates=['date'],
index_col='date')
data = data.resample('D').asfreq()
data.info()

DatetimeIndex: 6300 entries, 2000-01-01 to 2017-03-31


Freq: D
Data columns (total 2 columns):
Ozone 6167 non-null float64
PM25 6167 non-null float64
dtypes: float64(2)

Resampling multiple time series
data = data.resample('BM').mean()
data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 207 entries, 2000-01-31 to 2017-03-31
Freq: BM
Data columns (total 2 columns):
ozone 207 non-null float64
pm25 207 non-null float64
dtypes: float64(2)

Resampling multiple time series
df.resample('M').first().head(4)

Ozone PM25
date
2000-01-31 0.005545 20.800000
2000-02-29 0.016139 6.500000
2000-03-31 0.017004 8.493333
2000-04-30 0.031354 6.889474

df.resample('MS').first().head()

Ozone PM25
date
2000-01-01 0.004032 37.320000
2000-02-01 0.010583 24.800000
2000-03-01 0.007418 11.106667
2000-04-01 0.017631 11.700000
2000-05-01 0.022628 9.700000

Rolling window functions with pandas

Window functions in pandas
Windows identify sub periods of your time series

Calculate metrics for sub periods inside the window

Create a new time series of metrics

Two types of windows:


Rolling: same size, sliding (this video)

Expanding: contain all prior values (next video)

Calculating a rolling average
data = pd.read_csv('google.csv', parse_dates=['date'], index_col='date')

DatetimeIndex: 1761 entries, 2010-01-04 to 2016-12-30


Data columns (total 1 columns):
price 1761 non-null float64
dtypes: float64(1)

Calculating a rolling average
# Integer-based window size
data.rolling(window=30).mean() # fixed # observations

DatetimeIndex: 1761 entries, 2010-01-04 to 2017-05-24


Data columns (total 1 columns):
price 1732 non-null float64
dtypes: float64(1)

window=30 : # business days

min_periods : choose value < 30 to get results for first days

Calculating a rolling average
# Offset-based window size
data.rolling(window='30D').mean() # fixed period length

DatetimeIndex: 1761 entries, 2010-01-04 to 2017-05-24


Data columns (total 1 columns):
price 1761 non-null float64
dtypes: float64(1)

30D : # calendar days
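
The difference between the two window types is easiest to see on a tiny series. This sketch assumes ten synthetic business-day values (1 through 10) and a 3-unit window instead of 30, so the results can be checked by eye.

```python
import pandas as pd

idx = pd.date_range('2010-01-04', periods=10, freq='B')   # starts on a Monday
price = pd.Series(range(1, 11), index=idx, dtype='float64')

by_count = price.rolling(window=3).mean()                 # needs 3 observations
by_count_min1 = price.rolling(window=3, min_periods=1).mean()
by_offset = price.rolling(window='3D').mean()             # 3 calendar days

print(pd.DataFrame({'price': price, 'by_count': by_count,
                    'by_count_min1': by_count_min1, 'by_offset': by_offset}))
```

The integer window leaves the first two rows empty unless `min_periods` is lowered. The offset window never does, but on a Monday it covers only that day, because the preceding Friday lies exactly three days back and falls outside the window.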

90 day rolling mean
r90 = data.rolling(window='90D').mean()
data.join(r90.add_suffix('_mean_90')).plot()

90 & 360 day rolling means
data['mean90'] = r90
r360 = data['price'].rolling(window='360D').mean()
data['mean360'] = r360; data.plot()

Multiple rolling metrics (1)
r = data.price.rolling('90D').agg(['mean', 'std'])
r.plot(subplots = True)

Multiple rolling metrics (2)
rolling = data.price.rolling('360D')
q10 = rolling.quantile(0.1).to_frame('q10')
median = rolling.median().to_frame('median')
q90 = rolling.quantile(0.9).to_frame('q90')
pd.concat([q10, median, q90], axis=1).plot()

Expanding window functions with pandas

Expanding windows in pandas
From rolling to expanding windows

Calculate metrics for periods up to current date

New time series reflects all historical values

Useful for running rate of return, running min/max

Two options with pandas:


.expanding() - just like .rolling()

.cumsum() , .cumprod() , .cummin() / .cummax()

The basic idea
df = pd.DataFrame({'data': range(5)})
df['expanding sum'] = df.data.expanding().sum()
df['cumulative sum'] = df.data.cumsum()
df

data expanding sum cumulative sum


0 0 0.0 0
1 1 1.0 1
2 2 3.0 3
3 3 6.0 6
4 4 10.0 10

Get data for the S&P 500
data = pd.read_csv('sp500.csv', parse_dates=['date'], index_col='date')

DatetimeIndex: 2519 entries, 2007-05-24 to 2017-05-24


Data columns (total 1 columns):
SP500 2519 non-null float64

How to calculate a running return
Single period return $r_t$: current price over last price minus 1:

$$r_t = \frac{P_t}{P_{t-1}} - 1$$

Multi-period return: product of $(1 + r_t)$ for all periods, minus 1:

$$R_T = (1 + r_1)(1 + r_2) \cdots (1 + r_T) - 1$$

For the period return: .pct_change()

For basic math .add() , .sub() , .mul() , .div()

For cumulative product: .cumprod()

Running rate of return in practice
pr = data.SP500.pct_change() # period return
pr_plus_one = pr.add(1)
cumulative_return = pr_plus_one.cumprod().sub(1)
cumulative_return.mul(100).plot()
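
The running return can be cross-checked against a direct calculation from the first price. The four prices below are made up so that the arithmetic is easy to follow.

```python
import pandas as pd

price = pd.Series([100.0, 110.0, 99.0, 108.9])   # made-up prices

period_return = price.pct_change()                           # NaN, 0.10, -0.10, 0.10
cumulative_return = period_return.add(1).cumprod().sub(1)    # compound the returns
direct = price.div(price.iloc[0]).sub(1)                     # price relative to start

running_max = price.expanding().max()                        # highest price so far

print(pd.DataFrame({'cumulative': cumulative_return, 'direct': direct,
                    'running_max': running_max}))
```

Compounding the period returns and dividing by the first price agree at every date after the first, which is the identity behind the multi-period return formula.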

Getting the running min & max
data['running_min'] = data.SP500.expanding().min()
data['running_max'] = data.SP500.expanding().max()
data.plot()

Rolling annual rate of return
import numpy as np

def multi_period_return(period_returns):
    return np.prod(period_returns + 1) - 1
pr = data.SP500.pct_change() # period return
r = pr.rolling('360D').apply(multi_period_return)
data['Rolling 1yr Return'] = r.mul(100)
data.plot(subplots=True)

Case study: S&P500 price simulation

Random walks & simulations
Daily stock returns are hard to predict

Models often assume they are random in nature

Numpy allows you to generate random numbers

From random returns to prices: use .cumprod()

Two examples:
Generate random returns

Randomly selected actual SP500 returns

Generate random numbers
import seaborn as sns
from numpy.random import normal, seed
from scipy.stats import norm
seed(42)
random_returns = normal(loc=0, scale=0.01, size=1000)
sns.distplot(random_returns, fit=norm, kde=False)

Create a random price path
return_series = pd.Series(random_returns)
random_prices = return_series.add(1).cumprod().sub(1)
random_prices.mul(100).plot()
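
A variant of the same idea, wrapped in a function so that paths can be regenerated from a seed. This is a sketch under a few assumptions: it uses NumPy's `default_rng` rather than the global `seed()`, and the helper name `simulate_price_path` and the `start_price` parameter are hypothetical, not part of the course code.

```python
import numpy as np
import pandas as pd

def simulate_price_path(n_obs, seed_value, start_price=100.0):
    """Hypothetical helper: compound normal random returns into a price path."""
    rng = np.random.default_rng(seed_value)
    random_returns = rng.normal(loc=0, scale=0.01, size=n_obs)
    # (1 + r_1)(1 + r_2)... scaled by the starting price
    return pd.Series(random_returns).add(1).cumprod().mul(start_price)

path = simulate_price_path(n_obs=1000, seed_value=42)
print(path.head())
```

Unlike the slide's version, which subtracts 1 to plot a cumulative return, this path is expressed as a price level starting near `start_price`.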

S&P 500 prices & returns
data = pd.read_csv('sp500.csv', parse_dates=['date'], index_col='date')
data['returns'] = data.SP500.pct_change()
data.plot(subplots=True)

S&P return distribution
sns.distplot(data.returns.dropna().mul(100), fit=norm)

Generate random S&P 500 returns
from numpy.random import choice
sample = data.returns.dropna()
n_obs = data.returns.count()
random_walk = choice(sample, size=n_obs)
random_walk = pd.Series(random_walk, index=sample.index)
random_walk.head()

DATE
2007-05-29 -0.008357
2007-05-30 0.003702
2007-05-31 -0.013990
2007-06-01 0.008096
2007-06-04 0.013120

Random S&P 500 prices (1)
start = data.SP500.first('D')

DATE
2007-05-25 1515.73
Name: SP500, dtype: float64

sp500_random = pd.concat([start, random_walk.add(1)]) # Series.append() was removed in pandas 2.0
sp500_random.head()

DATE
2007-05-25 1515.730000
2007-05-29 0.998290
2007-05-30 0.995190
2007-05-31 0.997787
2007-06-01 0.983853
dtype: float64

Random S&P 500 prices (2)
data['SP500_random'] = sp500_random.cumprod()
data[['SP500', 'SP500_random']].plot()

Relationships between time series: correlation

Correlation & relations between series
So far, focus on characteristics of individual variables

Now: characteristic of relations between variables

Correlation: measures linear relationships

Financial markets: important for prediction and risk management

pandas & seaborn have tools to compute & visualize

Correlation & linear relationships
Correlation coefficient: how similar is the pairwise movement of two variables around their averages?

Varies between -1 and +1

$$r = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{s_x s_y}$$
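
The correlation coefficient can be computed by hand and compared with pandas. This sketch assumes that $s_x$ and $s_y$ are sample standard deviations and that the sum of cross-deviations is scaled by $N - 1$, a factor the slide's compact notation does not show. The two short series are made up.

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 1.0, 4.0, 3.0, 5.0])

n = len(x)
# Sum of products of deviations from each mean
cross = ((x - x.mean()) * (y - y.mean())).sum()
# Assumption: sample standard deviations, so the sum is scaled by N - 1
r_manual = cross / ((n - 1) * x.std() * y.std())

r_pandas = x.corr(y)   # Pearson correlation by default
print(r_manual, r_pandas)
```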

Importing five price time series
data = pd.read_csv('assets.csv', parse_dates=['date'],
index_col='date')
data = data.dropna()
data.info()

DatetimeIndex: 2469 entries, 2007-05-25 to 2017-05-22


Data columns (total 5 columns):
sp500 2469 non-null float64
nasdaq 2469 non-null float64
bonds 2469 non-null float64
gold 2469 non-null float64
oil 2469 non-null float64

Visualize pairwise linear relationships
daily_returns = data.pct_change()
sns.jointplot(x='sp500', y='nasdaq', data=daily_returns);

Calculate all correlations
correlations = daily_returns.corr()
correlations

bonds oil gold sp500 nasdaq


bonds 1.000000 -0.183755 0.003167 -0.300877 -0.306437
oil -0.183755 1.000000 0.105930 0.335578 0.289590
gold 0.003167 0.105930 1.000000 -0.007786 -0.002544
sp500 -0.300877 0.335578 -0.007786 1.000000 0.959990
nasdaq -0.306437 0.289590 -0.002544 0.959990 1.000000

Visualize all correlations
sns.heatmap(correlations, annot=True)

Select index
components &
import data
M A N I P U L AT I N G T I M E S E R I E S D ATA I N P Y T H O N

Stefan Jansen
Founder & Lead Data Scientist at
Applied Artificial Intelligence
Market value-weighted index
Composite performance of various stocks

Components weighted by market capitalization


Share Price x Number of Shares => Market Value

Larger components get higher percentage weightings

Key market indexes are value-weighted:
S&P 500, NASDAQ, Wilshire 5000, Hang Seng

MANIPULATING TIME SERIES DATA IN PYTHON


Build a cap-weighted Index
Apply new skills to construct value-weighted index
Select components from exchange listing data

Get component number of shares and stock prices

Calculate component weights

Calculate index

Evaluate performance of components and index
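
The steps above can be sketched end to end on toy data. The tickers, prices and share counts below are hypothetical and only illustrate the mechanics: market value = price x shares, weights = each component's share of total value, index = aggregate value normalized to 100.

```python
import pandas as pd

# Hypothetical prices for three made-up tickers over three days
prices = pd.DataFrame(
    {"AAA": [10.0, 11.0, 12.0], "BBB": [20.0, 19.0, 21.0], "CCC": [5.0, 5.5, 5.0]},
    index=pd.to_datetime(["2016-01-04", "2016-01-05", "2016-01-06"]),
)
# Hypothetical shares outstanding (in millions) per ticker
shares = pd.Series({"AAA": 100.0, "BBB": 50.0, "CCC": 200.0})

# Market value per component and day: price x number of shares
market_cap = prices.mul(shares)

# Component weights on the first day
weights = market_cap.iloc[0].div(market_cap.iloc[0].sum())

# Aggregate market value, normalized to start at 100
agg_mcap = market_cap.sum(axis=1)
index = agg_mcap.div(agg_mcap.iloc[0]).mul(100)
print(weights)
print(index)
```

In this toy setup every component starts with the same market value, so the initial weights are equal; with real listings the larger companies dominate.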

MANIPULATING TIME SERIES DATA IN PYTHON


Load stock listing data
nyse = pd.read_excel('listings.xlsx', sheet_name='nyse',
na_values='n/a')
nyse.info()

RangeIndex: 3147 entries, 0 to 3146


Data columns (total 7 columns):
Stock Symbol 3147 non-null object # Stock Ticker
Company Name 3147 non-null object
Last Sale 3079 non-null float64 # Latest Stock Price
Market Capitalization 3147 non-null float64
IPO Year 1361 non-null float64 # Year of listing
Sector 2177 non-null object
Industry 2177 non-null object
dtypes: float64(3), object(4)

MANIPULATING TIME SERIES DATA IN PYTHON


Load & prepare listing data
nyse.set_index('Stock Symbol', inplace=True)
nyse.dropna(subset=['Sector'], inplace=True)
nyse['Market Capitalization'] /= 1e6 # in Million USD

Index: 2177 entries, DDD to ZTO


Data columns (total 6 columns):
Company Name 2177 non-null object
Last Sale 2175 non-null float64
Market Capitalization 2177 non-null float64
IPO Year 967 non-null float64
Sector 2177 non-null object
Industry 2177 non-null object
dtypes: float64(3), object(3)

MANIPULATING TIME SERIES DATA IN PYTHON


Select index components
components = nyse.groupby(['Sector'])['Market Capitalization'].nlargest(1)
components.sort_values(ascending=False)

Sector Stock Symbol


Health Care JNJ 338834.390080
Energy XOM 338728.713874
Finance JPM 300283.250479
Miscellaneous BABA 275525.000000
Public Utilities T 247339.517272
Basic Industries PG 230159.644117
Consumer Services WMT 221864.614129
Consumer Non-Durables KO 183655.305119
Technology ORCL 181046.096000
Capital Goods TM 155660.252483
Transportation UPS 90180.886756
Consumer Durables ABB 48398.935676
Name: Market Capitalization, dtype: float64
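
A minimal sketch of the selection step on made-up listings: `groupby(...).nlargest(1)` keeps the largest company per sector and returns a Series indexed by (Sector, Stock Symbol), from which the ticker level can be extracted. All tickers and values here are hypothetical.

```python
import pandas as pd

# Hypothetical listing data: made-up tickers, sectors and market caps
listings = pd.DataFrame(
    {
        "Sector": ["Energy", "Energy", "Finance", "Finance"],
        "Market Capitalization": [300.0, 120.0, 250.0, 400.0],
    },
    index=pd.Index(["E1", "E2", "F1", "F2"], name="Stock Symbol"),
)

# Largest company per sector; result has a (Sector, Stock Symbol) MultiIndex
largest = listings.groupby("Sector")["Market Capitalization"].nlargest(1)

# Pull the ticker level out of the MultiIndex
tickers = largest.index.get_level_values("Stock Symbol").tolist()
print(largest)
print(tickers)
```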

MANIPULATING TIME SERIES DATA IN PYTHON


Import & prepare listing data
tickers = components.index.get_level_values('Stock Symbol')
tickers

Index(['PG', 'TM', 'ABB', 'KO', 'WMT', 'XOM', 'JPM', 'JNJ', 'BABA', 'T',
       'ORCL', 'UPS'], dtype='object', name='Stock Symbol')

tickers.tolist()

['PG',
'TM',
'ABB',
'KO',
'WMT',
...
'T',
'ORCL',
'UPS']

MANIPULATING TIME SERIES DATA IN PYTHON


Stock index components
columns = ['Company Name', 'Market Capitalization', 'Last Sale']
component_info = nyse.loc[tickers, columns]
pd.options.display.float_format = '{:,.2f}'.format

Company Name Market Capitalization Last Sale


Stock Symbol
PG Procter & Gamble Company (The) 230,159.64 90.03
TM Toyota Motor Corp Ltd Ord 155,660.25 104.18
ABB ABB Ltd 48,398.94 22.63
KO Coca-Cola Company (The) 183,655.31 42.79
WMT Wal-Mart Stores, Inc. 221,864.61 73.15
XOM Exxon Mobil Corporation 338,728.71 81.69
JPM J P Morgan Chase & Co 300,283.25 84.40
JNJ Johnson & Johnson 338,834.39 124.99
BABA Alibaba Group Holding Limited 275,525.00 110.21
T AT&T Inc. 247,339.52 40.28
ORCL Oracle Corporation 181,046.10 44.00
UPS United Parcel Service, Inc. 90,180.89 103.74

MANIPULATING TIME SERIES DATA IN PYTHON


Import & prepare listing data
data = pd.read_csv('stocks.csv', parse_dates=['Date'],
index_col='Date').loc[:, tickers.tolist()]
data.info()

DatetimeIndex: 252 entries, 2016-01-04 to 2016-12-30


Data columns (total 12 columns):
ABB 252 non-null float64
BABA 252 non-null float64
JNJ 252 non-null float64
JPM 252 non-null float64
KO 252 non-null float64
ORCL 252 non-null float64
PG 252 non-null float64
T 252 non-null float64
TM 252 non-null float64
UPS 252 non-null float64
WMT 252 non-null float64
XOM 252 non-null float64
dtypes: float64(12)

MANIPULATING TIME SERIES DATA IN PYTHON


Let's practice!
M A N I P U L AT I N G T I M E S E R I E S D ATA I N P Y T H O N
Build a market-cap
weighted index
M A N I P U L AT I N G T I M E S E R I E S D ATA I N P Y T H O N

Stefan Jansen
Founder & Lead Data Scientist at
Applied Artificial Intelligence
Build your value-weighted index
Key inputs:
number of shares

stock price series

Normalize index to start at 100

MANIPULATING TIME SERIES DATA IN PYTHON


Stock index components
components

Company Name Market Capitalization Last Sale


Stock Symbol
PG Procter & Gamble Company (The) 230,159.64 90.03
TM Toyota Motor Corp Ltd Ord 155,660.25 104.18
ABB ABB Ltd 48,398.94 22.63
KO Coca-Cola Company (The) 183,655.31 42.79
WMT Wal-Mart Stores, Inc. 221,864.61 73.15
XOM Exxon Mobil Corporation 338,728.71 81.69
JPM J P Morgan Chase & Co 300,283.25 84.40
JNJ Johnson & Johnson 338,834.39 124.99
BABA Alibaba Group Holding Limited 275,525.00 110.21
T AT&T Inc. 247,339.52 40.28
ORCL Oracle Corporation 181,046.10 44.00
UPS United Parcel Service, Inc. 90,180.89 103.74

MANIPULATING TIME SERIES DATA IN PYTHON


Number of shares outstanding
shares = components['Market Capitalization'].div(components['Last Sale'])

Stock Symbol
PG 2,556.48 # Outstanding shares in million
TM 1,494.15
ABB 2,138.71
KO 4,292.01
WMT 3,033.01
XOM 4,146.51
JPM 3,557.86
JNJ 2,710.89
BABA 2,500.00
T 6,140.50
ORCL 4,114.68
UPS 869.30
dtype: float64

Market Capitalization = Number of Shares x Share Price

MANIPULATING TIME SERIES DATA IN PYTHON


Historical stock prices
data = pd.read_csv('stocks.csv', parse_dates=['Date'],
                   index_col='Date').loc[:, tickers.tolist()]
market_cap_series = data.mul(shares)  # price x number of shares
market_cap_series.info()

DatetimeIndex: 252 entries, 2016-01-04 to 2016-12-30


Data columns (total 12 columns):
ABB 252 non-null float64
BABA 252 non-null float64
JNJ 252 non-null float64
JPM 252 non-null float64
...
TM 252 non-null float64
UPS 252 non-null float64
WMT 252 non-null float64
XOM 252 non-null float64
dtypes: float64(12)

MANIPULATING TIME SERIES DATA IN PYTHON


From stock prices to market value
market_cap_series.iloc[[0, -1]]  # first and last day; DataFrame.append() was removed in pandas 2.0

                  ABB       BABA        JNJ        JPM         KO       ORCL
Date
2016-01-04  37,470.14 191,725.00 272,390.43 226,350.95 181,981.42 147,099.95
2016-12-30  45,062.55 219,525.00 312,321.87 307,007.60 177,946.93 158,209.60

                   PG          T         TM       UPS        WMT        XOM
Date
2016-01-04 200,351.12 210,926.33 181,479.12 82,444.14 186,408.74 321,188.96
2016-12-30 214,948.60 261,155.65 175,114.05 99,656.23 209,641.59 374,264.34

MANIPULATING TIME SERIES DATA IN PYTHON


Aggregate market value per period
agg_mcap = market_cap_series.sum(axis=1) # Total market cap
agg_mcap.plot(title='Aggregate Market Cap')

MANIPULATING TIME SERIES DATA IN PYTHON


Value-based index
index = agg_mcap.div(agg_mcap.iloc[0]).mul(100) # Divide by 1st value
index.plot(title='Market-Cap Weighted Index')

MANIPULATING TIME SERIES DATA IN PYTHON


Let's practice!
M A N I P U L AT I N G T I M E S E R I E S D ATA I N P Y T H O N
Evaluate index
performance
M A N I P U L AT I N G T I M E S E R I E S D ATA I N P Y T H O N

Stefan Jansen
Founder & Lead Data Scientist at
Applied Artificial Intelligence
Evaluate your value-weighted index
Index return:
Total index return

Contribution by component

Performance vs Benchmark
Total period return

Rolling returns for sub periods

MANIPULATING TIME SERIES DATA IN PYTHON


Value-based index - recap
agg_market_cap = market_cap_series.sum(axis=1)
index = agg_market_cap.div(agg_market_cap.iloc[0]).mul(100)
index.plot(title='Market-Cap Weighted Index')

MANIPULATING TIME SERIES DATA IN PYTHON


Value contribution by stock
agg_market_cap.iloc[-1] - agg_market_cap.iloc[0]

315,037.71

MANIPULATING TIME SERIES DATA IN PYTHON


Value contribution by stock
change = market_cap_series.iloc[[0, -1]]  # first and last day; DataFrame.append() was removed in pandas 2.0
change.diff().iloc[-1].sort_values() # or: .loc['2016-12-30']

TM -6,365.07
KO -4,034.49
ABB 7,592.41
ORCL 11,109.65
PG 14,597.48
UPS 17,212.08
WMT 23,232.85
BABA 27,800.00
JNJ 39,931.44
T 50,229.33
XOM 53,075.38
JPM 80,656.65
Name: 2016-12-30 00:00:00, dtype: float64

MANIPULATING TIME SERIES DATA IN PYTHON


Market-cap based weights
market_cap = components['Market Capitalization']
weights = market_cap.div(market_cap.sum())
weights.sort_values().mul(100)

Stock Symbol
ABB 1.85
UPS 3.45
TM 5.96
ORCL 6.93
KO 7.03
WMT 8.50
PG 8.81
T 9.47
BABA 10.55
JPM 11.50
XOM 12.97
JNJ 12.97
Name: Market Capitalization, dtype: float64

MANIPULATING TIME SERIES DATA IN PYTHON


Value-weighted component returns
index_return = (index.iloc[-1] / index.iloc[0] - 1) * 100

14.06

weighted_returns = weights.mul(index_return)
weighted_returns.sort_values().plot(kind='barh')
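
Because the weights sum to one, the weighted returns add back up to the total index return. The sketch below illustrates this with hypothetical market caps; only the 14.06 figure comes from the slide. Note that this allocates the index return in proportion to weight, not according to each stock's own price performance.

```python
import pandas as pd

# Hypothetical market caps for three made-up tickers
market_cap = pd.Series({"AAA": 500.0, "BBB": 300.0, "CCC": 200.0})
weights = market_cap.div(market_cap.sum())

index_return = 14.06  # total index return in percent, as on the slide

# Allocate the index return to components in proportion to their weights
weighted_returns = weights.mul(index_return)
print(weighted_returns.sort_values())
print(weighted_returns.sum())
```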

MANIPULATING TIME SERIES DATA IN PYTHON


Performance vs benchmark
data = index.to_frame('Index') # Convert pd.Series to pd.DataFrame
data['SP500'] = pd.read_csv('sp500.csv', parse_dates=['Date'],
index_col='Date')
data.SP500 = data.SP500.div(data.SP500.iloc[0], axis=0).mul(100)

MANIPULATING TIME SERIES DATA IN PYTHON


Performance vs benchmark: 30D rolling return
import numpy as np

def multi_period_return(r):
    # Compound the period returns in the window into one return, in percent
    return (np.prod(r + 1) - 1) * 100

data.pct_change().rolling('30D').apply(multi_period_return).plot()
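
To see what the rolling compounding does, here is a small sketch with made-up returns and a 3-day window instead of 30 days. The series and window length are assumptions chosen so the arithmetic is easy to follow.

```python
import numpy as np
import pandas as pd

def multi_period_return(r):
    """Compound a window of simple period returns into one return, in percent."""
    return (np.prod(r + 1) - 1) * 100

# Hypothetical daily returns of 10% on three consecutive days
returns = pd.Series(
    [0.10, 0.10, 0.10],
    index=pd.date_range("2016-01-04", periods=3, freq="D"),
)

# Time-based window: each value compounds the returns of the trailing 3 days
rolling = returns.rolling("3D").apply(multi_period_return)
print(rolling)
```

The last value compounds all three days: 1.1 x 1.1 x 1.1 - 1 = 33.1%, more than the 30% a simple sum would give.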

MANIPULATING TIME SERIES DATA IN PYTHON


Let's practice!
M A N I P U L AT I N G T I M E S E R I E S D ATA I N P Y T H O N
Index correlation &
exporting to Excel
M A N I P U L AT I N G T I M E S E R I E S D ATA I N P Y T H O N

Stefan Jansen
Founder & Lead Data Scientist at
Applied Artificial Intelligence
Some additional analysis of your index
Daily return correlations:

Calculate among all components

Visualize the result as heatmap

Write results to Excel using .xls and .xlsx formats:

Single worksheet

Multiple worksheets

MANIPULATING TIME SERIES DATA IN PYTHON


Index components - price data
data = DataReader(tickers, 'google', start='2016', end='2017')['Close']
data.info()

DatetimeIndex: 252 entries, 2016-01-04 to 2016-12-30


Data columns (total 12 columns):
ABB 252 non-null float64
BABA 252 non-null float64
JNJ 252 non-null float64
JPM 252 non-null float64
KO 252 non-null float64
ORCL 252 non-null float64
PG 252 non-null float64
T 252 non-null float64
TM 252 non-null float64
UPS 252 non-null float64
WMT 252 non-null float64
XOM 252 non-null float64

MANIPULATING TIME SERIES DATA IN PYTHON


Index components: return correlations
daily_returns = data.pct_change()
correlations = daily_returns.corr()

ABB BABA JNJ JPM KO ORCL PG T TM UPS WMT XOM


ABB 1.00 0.40 0.33 0.56 0.31 0.53 0.34 0.29 0.48 0.50 0.15 0.48
BABA 0.40 1.00 0.27 0.27 0.25 0.38 0.21 0.17 0.34 0.35 0.13 0.21
JNJ 0.33 0.27 1.00 0.34 0.30 0.37 0.42 0.35 0.29 0.45 0.24 0.41
JPM 0.56 0.27 0.34 1.00 0.22 0.57 0.27 0.13 0.49 0.56 0.14 0.48
KO 0.31 0.25 0.30 0.22 1.00 0.31 0.62 0.47 0.33 0.50 0.25 0.29
ORCL 0.53 0.38 0.37 0.57 0.31 1.00 0.41 0.32 0.48 0.54 0.21 0.42
PG 0.34 0.21 0.42 0.27 0.62 0.41 1.00 0.43 0.32 0.47 0.33 0.34
T 0.29 0.17 0.35 0.13 0.47 0.32 0.43 1.00 0.28 0.41 0.31 0.33
TM 0.48 0.34 0.29 0.49 0.33 0.48 0.32 0.28 1.00 0.52 0.20 0.30
UPS 0.50 0.35 0.45 0.56 0.50 0.54 0.47 0.41 0.52 1.00 0.33 0.45
WMT 0.15 0.13 0.24 0.14 0.25 0.21 0.33 0.31 0.20 0.33 1.00 0.21
XOM 0.48 0.21 0.41 0.48 0.29 0.42 0.34 0.33 0.30 0.45 0.21 1.00

MANIPULATING TIME SERIES DATA IN PYTHON


Index components: return correlations
sns.heatmap(correlations, annot=True)
plt.xticks(rotation=45)
plt.title('Daily Return Correlations')

MANIPULATING TIME SERIES DATA IN PYTHON


Saving to a single Excel worksheet
correlations.to_excel(excel_writer= 'correlations.xls',
sheet_name='correlations',
startrow=1,
startcol=1)

MANIPULATING TIME SERIES DATA IN PYTHON


Saving to multiple Excel worksheets
data.index = data.index.date # Keep only date component
with pd.ExcelWriter('stock_data.xlsx') as writer:
    # One sheet per DataFrame, all in the same workbook
    correlations.to_excel(excel_writer=writer, sheet_name='correlations')
    data.to_excel(excel_writer=writer, sheet_name='prices')
    data.pct_change().to_excel(writer, sheet_name='returns')

MANIPULATING TIME SERIES DATA IN PYTHON


Let's practice!
M A N I P U L AT I N G T I M E S E R I E S D ATA I N P Y T H O N
Congratulations!
M A N I P U L AT I N G T I M E S E R I E S D ATA I N P Y T H O N

Stefan Jansen
Founder & Lead Data Scientist at
Applied Artificial Intelligence
Introduction to the
Course
T I M E S E R I E S A N A LY S I S I N P Y T H O N

Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
Example of Time Series: Google Trends

TIME SERIES ANALYSIS IN PYTHON


Example of Time Series: Climate Data

TIME SERIES ANALYSIS IN PYTHON


Example of Time Series: Quarterly Earnings Data

TIME SERIES ANALYSIS IN PYTHON


Example of Multiple Series: Natural Gas and Heating
Oil

TIME SERIES ANALYSIS IN PYTHON


Goals of Course
Learn about time series models

Fit data to a time series model

Use the models to make forecasts of the future

Learn how to use the relevant statistical packages in Python

Provide concrete examples of how these models are used

TIME SERIES ANALYSIS IN PYTHON


Some Useful Pandas Tools
Changing an index to datetime

df.index = pd.to_datetime(df.index)

Plotting data

df.plot()

Slicing data

df.loc['2012']  # row selection by year needs .loc in current pandas

TIME SERIES ANALYSIS IN PYTHON


Some Useful Pandas Tools
Join two DataFrames

df1.join(df2)

Resample data (e.g. from daily to weekly)

df = df.resample(rule='W').last()
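
A small sketch of the resampling step on made-up data: two weeks of daily values are reduced to one value per week by keeping the last observation in each weekly bin. The series is hypothetical.

```python
import pandas as pd

# Two weeks of made-up daily values; 2024-01-01 is a Monday
daily = pd.Series(
    range(14),
    index=pd.date_range("2024-01-01", periods=14, freq="D"),
)

# Weekly bins end on Sunday by default; keep the last observation in each bin
weekly = daily.resample("W").last()
print(weekly)
```

Using `.last()` suits price data, where the end-of-period level matters; `.mean()` or `.sum()` would be other possible aggregations.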

TIME SERIES ANALYSIS IN PYTHON


More pandas Functions
Computing percent changes and differences of a time series

df['col'].pct_change()
df['col'].diff()

pandas correlation method of Series

df['ABC'].corr(df['XYZ'])

pandas autocorrelation

df['ABC'].autocorr()

TIME SERIES ANALYSIS IN PYTHON


Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Correlation of Two
Time Series
T I M E S E R I E S A N A LY S I S I N P Y T H O N

Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
Correlation of Two Time Series
Plot of S&P500 and JPMorgan stock

TIME SERIES ANALYSIS IN PYTHON


Correlation of Two Time Series
Scatter plot of S&P500 and JP Morgan returns

TIME SERIES ANALYSIS IN PYTHON


More Scatter Plots
Correlation = 0.9 Correlation = 0.4

Correlation = -0.9 Correlation = 1.0

TIME SERIES ANALYSIS IN PYTHON


Common Mistake: Correlation of Two Trending Series
Dow Jones Industrial Average and UFO Sightings
(www.nuforc.org)

Correlation of levels: 0.94

Correlation of percent changes: ≈0
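
The effect can be reproduced with simulated data. The sketch below builds two independent series that both drift upward (the drift and volatility values are arbitrary assumptions) and compares the correlation of levels with the correlation of percent changes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Two independent series that both trend upward (geometric random walks)
log_a = np.cumsum(0.01 + 0.01 * rng.standard_normal(n))
log_b = np.cumsum(0.01 + 0.01 * rng.standard_normal(n))
a = pd.Series(np.exp(log_a))
b = pd.Series(np.exp(log_b))

# Levels: dominated by the shared upward trend
corr_levels = a.corr(b)
# Percent changes: the trend is removed, only independent noise remains
corr_changes = a.pct_change().corr(b.pct_change())
print(corr_levels, corr_changes)
```

The level correlation comes out high even though the series are unrelated by construction, which is why correlations are computed on returns rather than prices.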

TIME SERIES ANALYSIS IN PYTHON


Example: Correlation of Large Cap and Small Cap
Stocks
Start with stock prices of SPX (large cap) and R2000 (small
cap)

First step: Compute percentage changes of both series


df['SPX_Ret'] = df['SPX_Prices'].pct_change()
df['R2000_Ret'] = df['R2000_Prices'].pct_change()

TIME SERIES ANALYSIS IN PYTHON


Example: Correlation of Large Cap and Small Cap
Stocks
Visualize correlation with scatter plot
plt.scatter(df['SPX_Ret'], df['R2000_Ret'])
plt.show()

TIME SERIES ANALYSIS IN PYTHON


Example: Correlation of Large Cap and Small Cap
Stocks
Use pandas correlation method for Series

correlation = df['SPX_Ret'].corr(df['R2000_Ret'])
print("Correlation is: ", correlation)

Correlation is: 0.868

TIME SERIES ANALYSIS IN PYTHON


Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Simple Linear
Regressions
T I M E S E R I E S A N A LY S I S I N P Y T H O N

Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
What is a Regression?
Simple linear regression:

yt = α + βxt + ϵt

TIME SERIES ANALYSIS IN PYTHON


What is a Regression?
Ordinary Least Squares (OLS)

TIME SERIES ANALYSIS IN PYTHON


Python Packages to Perform Regressions
In statsmodels:
import statsmodels.api as sm
sm.OLS(y, x).fit()

In numpy:
np.polyfit(x, y, deg=1)

In pandas (pd.ols was removed in pandas 0.20; shown for reference only):
pd.ols(y, x)

In scipy:
from scipy import stats
stats.linregress(x, y)

Warning: the order of x and y is not consistent across packages

TIME SERIES ANALYSIS IN PYTHON


Example: Regression of Small Cap Returns on Large
Cap
Import the statsmodels module
import statsmodels.api as sm

As before, compute percentage changes in both series


df['SPX_Ret'] = df['SPX_Prices'].pct_change()
df['R2000_Ret'] = df['R2000_Prices'].pct_change()

Add a constant to the DataFrame for the regression intercept


df = sm.add_constant(df)

TIME SERIES ANALYSIS IN PYTHON


Regression Example (continued)
Notice that the first row of returns is NaN
SPX_Price R2000_Price SPX_Ret R2000_Ret
Date
2012-11-01 1427.589966 827.849976 NaN NaN
2012-11-02 1414.199951 814.369995 -0.009379 -0.016283

Delete the row of NaN


df = df.dropna()

Run the regression


results = sm.OLS(df['R2000_Ret'],df[['const','SPX_Ret']]).fit()
print(results.summary())

TIME SERIES ANALYSIS IN PYTHON


Regression Example (continued)
Regression output

Intercept in results.params[0]

Slope in results.params[1]

TIME SERIES ANALYSIS IN PYTHON


Regression Example (continued)
Regression output

TIME SERIES ANALYSIS IN PYTHON


Relationship Between R-Squared and Correlation
[corr(x, y)]² = R² (or R-squared)
sign(corr) = sign(regression slope)
In last example:
R-Squared = 0.753

Slope is positive

correlation = +√0.753 = 0.868
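
The identity can be checked numerically. This sketch fits a simple regression with NumPy on made-up data (coefficients and noise level are assumptions) and compares R-squared with the squared correlation.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(200)
y = 0.5 + 1.2 * x + rng.standard_normal(200)  # made-up linear relation plus noise

# OLS fit of y on x; np.polyfit takes x first and returns [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)

# R-squared = 1 - residual sum of squares / total sum of squares
residuals = y - (intercept + slope * x)
r_squared = 1 - np.sum(residuals**2) / np.sum((y - y.mean()) ** 2)

corr = np.corrcoef(x, y)[0, 1]
print(r_squared, corr**2)
print(np.sign(corr) == np.sign(slope))
```

The equality holds for a regression with one explanatory variable and an intercept; with several regressors R-squared no longer equals a single squared correlation.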

TIME SERIES ANALYSIS IN PYTHON


Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Autocorrelation
T I M E S E R I E S A N A LY S I S I N P Y T H O N

Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
What is Autocorrelation?
Correlation of a time series with a lagged copy of itself

Also called serial correlation

Lag-one autocorrelation
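
In pandas terms, the lag-one autocorrelation is the correlation between a series and itself shifted by one step. A minimal sketch on a made-up series:

```python
import pandas as pd

# Made-up series for illustration
s = pd.Series([1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 5.0, 8.0])

# Correlation of the series with a copy of itself lagged by one step
manual = s.corr(s.shift(1))
# Built-in shortcut
builtin = s.autocorr(lag=1)
print(manual, builtin)
```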

TIME SERIES ANALYSIS IN PYTHON


Interpretation of Autocorrelation
Mean Reversion - Negative autocorrelation

TIME SERIES ANALYSIS IN PYTHON


Interpretation of Autocorrelation
Momentum, or Trend Following - Positive autocorrelation

TIME SERIES ANALYSIS IN PYTHON


Traders Use Autocorrelation to Make Money
Individual stocks
Historically have negative autocorrelation

Measured over short horizons (days)

Trading strategy: Buy losers and sell winners

Commodities and currencies


Historically have positive autocorrelation

Measured over longer horizons (months)

Trading strategy: Buy winners and sell losers

TIME SERIES ANALYSIS IN PYTHON


Example of Positive Autocorrelation: Exchange Rates
Use daily ¥/$ exchange rates in DataFrame df from FRED

Convert index to datetime


# Convert index to datetime
df.index = pd.to_datetime(df.index)
# Downsample from daily to monthly data
df = df.resample(rule='M').last()
# Compute returns from prices
df['Return'] = df['Price'].pct_change()
# Compute autocorrelation
autocorrelation = df['Return'].autocorr()
print("The autocorrelation is: ",autocorrelation)

The autocorrelation is: 0.0567

TIME SERIES ANALYSIS IN PYTHON


Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Autocorrelation
Function
T I M E S E R I E S A N A LY S I S I N P Y T H O N

Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
Autocorrelation Function
Autocorrelation Function (ACF): The autocorrelation as a
function of the lag

Equals one at lag-zero

Interesting information beyond lag-one
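
A rough sketch of how the ACF values can be computed by hand, using one common textbook estimator (the helper name `acf_manual` and the data are made up). The series repeats every four steps, loosely mimicking quarterly seasonality, so the autocorrelation peaks at lag 4 rather than lag 1.

```python
import numpy as np

def acf_manual(x, nlags):
    """Sample autocorrelation for lags 0..nlags (simple textbook estimator)."""
    x = np.asarray(x, dtype=float)
    dev = x - x.mean()
    denom = np.sum(dev**2)
    return np.array([np.sum(dev[k:] * dev[:len(x) - k]) / denom
                     for k in range(nlags + 1)])

# Made-up series that repeats every 4 steps
x = np.tile([1.0, 2.0, 3.0, 10.0], 10)
values = acf_manual(x, nlags=8)
print(values.round(2))
```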

TIME SERIES ANALYSIS IN PYTHON


ACF Example 1: Simple Autocorrelation Function
Can use last two values in series for forecasting

TIME SERIES ANALYSIS IN PYTHON


ACF Example 2: Seasonal Earnings
Earnings for H&R Block ACF for H&R Block

TIME SERIES ANALYSIS IN PYTHON


ACF Example 3: Useful for Model Selection
Model selection

TIME SERIES ANALYSIS IN PYTHON


Plot ACF in Python
Import module:
from statsmodels.graphics.tsaplots import plot_acf

Plot the ACF:


plot_acf(x, lags= 20, alpha=0.05)

TIME SERIES ANALYSIS IN PYTHON


Confidence Interval of ACF

TIME SERIES ANALYSIS IN PYTHON


Confidence Interval of ACF
Argument alpha sets the width of confidence interval

Example: alpha=0.05
5% chance that if true autocorrelation is zero, it will fall
outside blue band

Confidence bands are wider if:
Alpha lower

Fewer observations

Under some simplifying assumptions, 95% confidence bands are ±2/√N
If you want no bands on plot, set alpha=1

TIME SERIES ANALYSIS IN PYTHON


ACF Values Instead of Plot
from statsmodels.tsa.stattools import acf
print(acf(x))

[ 1. -0.6765505 0.34989905 -0.01629415 -0.02507


-0.03186545 0.01399904 -0.03518128 0.02063168 -0.02620
...
0.07191516 -0.12211912 0.14514481 -0.09644228 0.05215

TIME SERIES ANALYSIS IN PYTHON


Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
White Noise
T I M E S E R I E S A N A LY S I S I N P Y T H O N

Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
What is White Noise?
White Noise is a series with:
Constant mean

Constant variance

Zero autocorrelations at all lags

Special Case: if data has normal distribution, then Gaussian White Noise

TIME SERIES ANALYSIS IN PYTHON


Simulating White Noise
It's very easy to generate white noise
import numpy as np
noise = np.random.normal(loc=0, scale=1, size=500)

TIME SERIES ANALYSIS IN PYTHON


What Does White Noise Look Like?
plt.plot(noise)

TIME SERIES ANALYSIS IN PYTHON


Autocorrelation of White Noise
plot_acf(noise, lags=50)

TIME SERIES ANALYSIS IN PYTHON


Stock Market Returns: Close to White Noise
Autocorrelation Function for the S&P500

TIME SERIES ANALYSIS IN PYTHON


Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Random Walk
T I M E S E R I E S A N A LY S I S I N P Y T H O N

Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
What is a Random Walk?
Today's Price = Yesterday's Price + Noise

Pt = Pt−1 + ϵt
Plot of simulated data

TIME SERIES ANALYSIS IN PYTHON


What is a Random Walk?
Today's Price = Yesterday's Price + Noise

Pt = Pt−1 + ϵt
Change in price is white noise

Pt − Pt−1 = ϵt
Can't forecast a random walk

Best forecast for tomorrow's price is today's price
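
A minimal simulation sketch of these statements: a random walk is the running sum of white noise, and differencing it gives the noise back. The starting level of 100 and the noise parameters are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
noise = rng.normal(loc=0, scale=1, size=500)  # white-noise steps

# Random walk: each price is the previous price plus noise
prices = 100 + np.cumsum(noise)

# First differences recover the white-noise steps
changes = np.diff(prices)
print(np.allclose(changes, noise[1:]))
```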

TIME SERIES ANALYSIS IN PYTHON


What is a Random Walk?
Today's Price = Yesterday's Price + Noise

Pt = Pt−1 + ϵt
Random walk with drift:

Pt = μ + Pt−1 + ϵt
Change in price is white noise with non-zero mean:

Pt − Pt−1 = μ + ϵt

TIME SERIES ANALYSIS IN PYTHON


Statistical Test for Random Walk
Random walk with drift

Pt = μ + Pt−1 + ϵt
Regression test for random walk

Pt = α + β Pt−1 + ϵt
Test:

H0 : β = 1 (random walk)
H1 : β < 1 (not random walk)

TIME SERIES ANALYSIS IN PYTHON


Statistical Test for Random Walk
Regression test for random walk

Pt = α + β Pt−1 + ϵt
Equivalent to

Pt − Pt−1 = α + β Pt−1 + ϵt
Test:

H0 : β = 0 (random walk)
H1 : β < 0 (not random walk)

TIME SERIES ANALYSIS IN PYTHON


Statistical Test for Random Walk
Regression test for random walk

Pt − Pt−1 = α + β Pt−1 + ϵt
Test:

H0 : β = 0 (random walk)
H1 : β < 0 (not random walk)
This test is called the Dickey-Fuller test

If you add more lagged changes on the right hand side, it's
the Augmented Dickey-Fuller test
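
The regression behind the test can be sketched with plain NumPy. The helper `df_slope` below is hypothetical and only estimates the slope; it does not compute a p-value, since the test statistic has to be compared with Dickey-Fuller critical values rather than standard t-tables.

```python
import numpy as np

def df_slope(p):
    """OLS slope beta in: P_t - P_{t-1} = alpha + beta * P_{t-1} + eps_t."""
    p = np.asarray(p, dtype=float)
    lagged = p[:-1]
    change = np.diff(p)
    design = np.column_stack([np.ones_like(lagged), lagged])
    coef, *_ = np.linalg.lstsq(design, change, rcond=None)
    return coef[1]

rng = np.random.default_rng(0)
eps = rng.standard_normal(1000)

random_walk = np.cumsum(eps)  # beta should be close to 0
white_noise = eps             # stationary: beta should be close to -1

beta_rw = df_slope(random_walk)
beta_wn = df_slope(white_noise)
print(beta_rw, beta_wn)
```

A slope near zero is consistent with a random walk, while a clearly negative slope points to mean reversion.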

TIME SERIES ANALYSIS IN PYTHON


ADF Test in Python
Import module from statsmodels

from statsmodels.tsa.stattools import adfuller

Run Augmented Dickey-Fuller test

adfuller(x)

TIME SERIES ANALYSIS IN PYTHON


Example: Is the S&P500 a Random Walk?
# Run Augmented Dickey-Fuller Test on SPX data
results = adfuller(df['SPX'])

# Print p-value
print(results[1])

0.782253808587

# Print full results


print(results)

(-0.91720490331127869,
0.78225380858668414,
0,
1257,
{'1%': -3.4355629707955395,
'10%': -2.567995644141416,
'5%': -2.8638420633876671},
10161.888789598503)

TIME SERIES ANALYSIS IN PYTHON


Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Stationarity
T I M E S E R I E S A N A LY S I S I N P Y T H O N

Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
What is Stationarity?
Strong stationarity: entire distribution of data is time-invariant

Weak stationarity: mean, variance and autocorrelation are
time-invariant (i.e., for autocorrelation, corr(X_t, X_{t−τ}) is
only a function of τ)

TIME SERIES ANALYSIS IN PYTHON


Why Do We Care?
If parameters vary with time, too many parameters to
estimate

Can only estimate a parsimonious model with a few parameters

TIME SERIES ANALYSIS IN PYTHON


Examples of Nonstationary Series
Random Walk

TIME SERIES ANALYSIS IN PYTHON


Examples of Nonstationary Series
Seasonality in series

TIME SERIES ANALYSIS IN PYTHON


Examples of Nonstationary Series
Change in Mean or Standard Deviation over time

TIME SERIES ANALYSIS IN PYTHON


Transforming Nonstationary Series Into Stationary
Series
Random Walk
plt.plot(SPY)
First difference
plt.plot(SPY.diff())

TIME SERIES ANALYSIS IN PYTHON


Transforming Nonstationary Series Into Stationary
Series
Seasonality
plt.plot(HRB)
Seasonal difference
plt.plot(HRB.diff(4))

TIME SERIES ANALYSIS IN PYTHON


Transforming Nonstationary Series Into Stationary
Series
# AMZN Quarterly Revenues
plt.plot(AMZN)

# Log of AMZN Revenues
plt.plot(np.log(AMZN))

# Log, then seasonal difference
plt.plot(np.log(AMZN).diff(4))
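
The transformations can be tried on synthetic data where the answer is known. In this sketch (all numbers are made up), a lag-4 difference removes a repeating quarterly pattern and turns a linear trend into a constant, and a log followed by a first difference turns exponential growth into a constant.

```python
import numpy as np
import pandas as pd

# Made-up quarterly series: linear trend plus a pattern repeating every 4 quarters
t = np.arange(24)
seasonal_pattern = np.tile([5.0, -2.0, 0.0, 3.0], 6)
series = pd.Series(2.0 * t + seasonal_pattern)

# Seasonal difference (lag 4): the pattern cancels, the trend becomes 4 * 2.0
seasonal_diff = series.diff(4).dropna()
print(seasonal_diff.unique())

# Exponential growth of 5% per period: log makes it linear, diff makes it constant
growth = pd.Series(100 * 1.05 ** t)
log_diff = np.log(growth).diff().dropna()
print(log_diff.round(6).unique())
```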

TIME SERIES ANALYSIS IN PYTHON


Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Introducing an AR
Model
T I M E S E R I E S A N A LY S I S I N P Y T H O N

Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
Mathematical Description of AR(1) Model
Rt = μ + ϕ Rt−1 + ϵt
Since only one lagged value on right hand side, this is called:
AR model of order 1, or

AR(1) model

AR parameter is ϕ
For stationarity, −1 < ϕ < 1

TIME SERIES ANALYSIS IN PYTHON


Interpretation of AR(1) Parameter
Rt = μ + ϕ Rt−1 + ϵt
Negative ϕ: Mean Reversion
Positive ϕ: Momentum

TIME SERIES ANALYSIS IN PYTHON


Comparison of AR(1) Time Series
ϕ = 0.9 ϕ = −0.9

ϕ = 0.5 ϕ = −0.5

TIME SERIES ANALYSIS IN PYTHON


Comparison of AR(1) Autocorrelation Functions
ϕ = 0.9 ϕ = −0.9

ϕ = 0.5 ϕ = −0.5

TIME SERIES ANALYSIS IN PYTHON


Higher Order AR Models
AR(1)

Rt = μ + ϕ1 Rt−1 + ϵt
AR(2)

Rt = μ + ϕ1 Rt−1 + ϕ2 Rt−2 + ϵt
AR(3)

Rt = μ + ϕ1 Rt−1 + ϕ2 Rt−2 + ϕ3 Rt−3 + ϵt


...

TIME SERIES ANALYSIS IN PYTHON


Simulating an AR Process
from statsmodels.tsa.arima_process import ArmaProcess
ar = np.array([1, -0.9])
ma = np.array([1])
AR_object = ArmaProcess(ar, ma)
simulated_data = AR_object.generate_sample(nsample=1000)
plt.plot(simulated_data)
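
ArmaProcess expects the coefficients of the lag polynomial, which is why ϕ = 0.9 is entered as [1, -0.9] above. The same process can be sketched directly from the AR(1) equation with a NumPy loop; the seed, sample size and μ = 0 are assumptions made for simplicity.

```python
import numpy as np
import pandas as pd

phi = 0.9  # AR(1) parameter; must satisfy -1 < phi < 1 for stationarity
rng = np.random.default_rng(0)
eps = rng.standard_normal(5000)

# R_t = phi * R_{t-1} + eps_t  (mu = 0 for simplicity)
r = np.zeros_like(eps)
for t in range(1, len(eps)):
    r[t] = phi * r[t - 1] + eps[t]

# For an AR(1), the lag-one autocorrelation should be close to phi
lag1 = pd.Series(r).autocorr(lag=1)
print(lag1)
```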

TIME SERIES ANALYSIS IN PYTHON


Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Estimating and
Forecasting an AR
Model
T I M E S E R I E S A N A LY S I S I N P Y T H O N

Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
Estimating an AR Model
To estimate parameters from data (simulated)

from statsmodels.tsa.arima_model import ARMA


mod = ARMA(data, order=(1,0))
result = mod.fit()

ARMA has been deprecated and replaced with ARIMA

from statsmodels.tsa.arima.model import ARIMA


mod = ARIMA(data, order=(1,0,0))
result = mod.fit()

For ARMA, order=(p,q)

For ARIMA, order=(p,d,q)

TIME SERIES ANALYSIS IN PYTHON


Estimating an AR Model
Full output (true μ = 0 and ϕ = 0.9)
print(result.summary())

TIME SERIES ANALYSIS IN PYTHON


Estimating an AR Model
Only the estimates of μ and ϕ (true μ = 0 and ϕ = 0.9)
print(result.params)

array([-0.03605989, 0.90535667])

TIME SERIES ANALYSIS IN PYTHON


Forecasting With an AR Model
from statsmodels.graphics.tsaplots import plot_predict
fig, ax = plt.subplots()
data.plot(ax=ax)
plot_predict(result, start='2012-09-27', end='2012-10-06', alpha=0.05, ax=ax)
plt.show()

Arguments of function plot_predict()


First argument is fitted model

Set alpha=None for no confidence interval

Set ax=ax to plot the data and prediction on same axes

TIME SERIES ANALYSIS IN PYTHON


Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Choosing the Right
Model
T I M E S E R I E S A N A LY S I S I N P Y T H O N

Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
Identifying the Order of an AR Model
The order of an AR(p) model will usually be unknown

Two techniques to determine order


Partial Autocorrelation Function

Information criteria

TIME SERIES ANALYSIS IN PYTHON


Partial Autocorrelation Function (PACF)

TIME SERIES ANALYSIS IN PYTHON


Plot PACF in Python
Same as ACF, but use plot_pacf instead of plot_acf

Import module
from statsmodels.graphics.tsaplots import plot_pacf

Plot the PACF


plot_pacf(x, lags= 20, alpha=0.05)

TIME SERIES ANALYSIS IN PYTHON


Comparison of PACF for Different AR Models
AR(1) AR(2)

AR(3) White Noise

TIME SERIES ANALYSIS IN PYTHON


Information Criteria
Information criteria: adjusts goodness-of-fit for number of
parameters

Two popular adjusted goodness-of-fit measures


AIC (Akaike Information Criterion)

BIC (Bayesian Information Criterion)

TIME SERIES ANALYSIS IN PYTHON


Information Criteria
Estimation output

TIME SERIES ANALYSIS IN PYTHON


Getting Information Criteria From statsmodels
You learned earlier how to fit an AR model
from statsmodels.tsa.arima.model import ARIMA
mod = ARIMA(simulated_data, order=(1,0,0))
result = mod.fit()

And to get full output


result.summary()

Or just the parameters


result.params

To get the AIC and BIC


result.aic
result.bic

TIME SERIES ANALYSIS IN PYTHON


Information Criteria
Fit a simulated AR(3) to different AR(p) models

Choose p with the lowest BIC

TIME SERIES ANALYSIS IN PYTHON


Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Describe Model
T I M E S E R I E S A N A LY S I S I N P Y T H O N

Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
Mathematical Description of MA(1) Model
Rt = μ + ϵt + θ ϵt−1
Since only one lagged error on right hand side, this is called:
MA model of order 1, or

MA(1) model

MA parameter is θ
Stationary for all values of θ

TIME SERIES ANALYSIS IN PYTHON


Interpretation of MA(1) Parameter
Rt = μ + ϵt + θ ϵt−1
Negative θ: One-Period Mean Reversion
Positive θ: One-Period Momentum
Note: One-period autocorrelation is θ/(1 + θ²), not θ
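
The note can be checked by simulation. The sketch below generates an MA(1) series with θ = 0.5 (seed and sample size are arbitrary assumptions) and compares the sample lag-one autocorrelation with θ/(1 + θ²) = 0.4.

```python
import numpy as np
import pandas as pd

theta = 0.5
rng = np.random.default_rng(0)
eps = rng.standard_normal(20001)

# R_t = eps_t + theta * eps_{t-1}  (mu = 0 for simplicity)
r = eps[1:] + theta * eps[:-1]

lag1_sample = pd.Series(r).autocorr(lag=1)
lag1_theory = theta / (1 + theta**2)         # 0.4 for theta = 0.5
lag2_sample = pd.Series(r).autocorr(lag=2)   # near zero for an MA(1)
print(lag1_sample, lag1_theory, lag2_sample)
```

The lag-two value staying near zero reflects the "one-period" nature of the MA(1): shocks affect only the current and the next observation.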

TIME SERIES ANALYSIS IN PYTHON


Comparison of MA(1) Autocorrelation Functions
θ = 0.9 θ = −0.9

θ = 0.5 θ = −0.5

TIME SERIES ANALYSIS IN PYTHON


Example of MA(1) Process: Intraday Stock Returns

TIME SERIES ANALYSIS IN PYTHON


Autocorrelation Function of Intraday Stock Returns

TIME SERIES ANALYSIS IN PYTHON


Higher Order MA Models
MA(1)

Rt = μ + ϵt + θ1 ϵt−1
MA(2)

Rt = μ + ϵt + θ1 ϵt−1 + θ2 ϵt−2
MA(3)

Rt = μ + ϵt + θ1 ϵt−1 + θ2 ϵt−2 + θ3 ϵt−3


...

TIME SERIES ANALYSIS IN PYTHON


Simulating an MA Process
from statsmodels.tsa.arima_process import ArmaProcess
ar = np.array([1])
ma = np.array([1, 0.5])
AR_object = ArmaProcess(ar, ma)
simulated_data = AR_object.generate_sample(nsample=1000)
plt.plot(simulated_data)

TIME SERIES ANALYSIS IN PYTHON


Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Estimation and
Forecasting an MA
Model
T I M E S E R I E S A N A LY S I S I N P Y T H O N

Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
Estimating an MA Model
Same as estimating an AR model (except order=(0,0,1) )

from statsmodels.tsa.arima.model import ARIMA


mod = ARIMA(simulated_data, order=(0,0,1))
result = mod.fit()

TIME SERIES ANALYSIS IN PYTHON


Forecasting an MA Model
from statsmodels.graphics.tsaplots import plot_predict
fig, ax = plt.subplots()
data.plot(ax=ax)
plot_predict(result, start='2012-09-27', end='2012-10-06', ax=ax)
plt.show()

TIME SERIES ANALYSIS IN PYTHON


Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
ARMA models
T I M E S E R I E S A N A LY S I S I N P Y T H O N

Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
ARMA Model
ARMA(1,1) model:

Rt = μ + ϕ Rt−1 + ϵt + θ ϵt−1

TIME SERIES ANALYSIS IN PYTHON


Converting Between ARMA, AR, and MA Models
Converting AR(1) into an MA(∞)

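R_t = \mu + \phi R_{t-1} + \epsilon_t

R_t = \mu + \phi(\mu + \phi R_{t-2} + \epsilon_{t-1}) + \epsilon_t

R_t = \frac{\mu}{1-\phi} + \epsilon_t + \phi\epsilon_{t-1} + \phi^2\epsilon_{t-2} + \phi^3\epsilon_{t-3} + \dots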
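
Repeated substitution gives each past shock ϵ_{t−k} the weight ϕ^k. A small numerical sketch, assuming μ = 0 and a series started at R_0 = ϵ_0, confirms that the AR(1) recursion and the moving-average sum produce the same values.

```python
import numpy as np

phi = 0.9
rng = np.random.default_rng(0)
eps = rng.standard_normal(50)

# AR(1) recursion with mu = 0, started at R_0 = eps_0
r_ar = np.zeros_like(eps)
r_ar[0] = eps[0]
for t in range(1, len(eps)):
    r_ar[t] = phi * r_ar[t - 1] + eps[t]

# Same series as a moving average of all past shocks with weights phi**k
r_ma = np.array([sum(phi**k * eps[t - k] for k in range(t + 1))
                 for t in range(len(eps))])

print(np.allclose(r_ar, r_ma))
```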

TIME SERIES ANALYSIS IN PYTHON


Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Cointegration
Models
T I M E S E R I E S A N A LY S I S I N P Y T H O N

Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
What is Cointegration?
Two series, Pt and Qt can be random walks
But the linear combination Pt − c Qt may not be a random
walk!

If that's true
Pt − c Qt is forecastable
Pt and Qt are said to be cointegrated

TIME SERIES ANALYSIS IN PYTHON


Analogy: Dog on a Leash
Pt = Owner
Qt = Dog
Both series look like a random walk

Difference, or distance between them, looks mean reverting


If dog falls too far behind, it gets pulled forward

If dog gets too far ahead, it gets pulled back

TIME SERIES ANALYSIS IN PYTHON


Example: Heating Oil and Natural Gas
Heating Oil and Natural Gas both look like random walks...

TIME SERIES ANALYSIS IN PYTHON


Example: Heating Oil and Natural Gas
But the spread (difference) is mean reverting

TIME SERIES ANALYSIS IN PYTHON


What Types of Series are Cointegrated?
Economic substitutes
Heating Oil and Natural Gas

Platinum and Palladium

Corn and Wheat

Corn and Sugar

...

Bitcoin and Ethereum?

How about competitors?


Coke and Pepsi?

Apple and Blackberry? No! Leash broke and dog ran away

TIME SERIES ANALYSIS IN PYTHON


Two Steps to Test for Cointegration
Regress Pt on Qt and get slope c
Run Augmented Dickey-Fuller test on Pt − c Qt to test for
random walk

Alternatively, use the coint function in statsmodels, which
combines both steps

from statsmodels.tsa.stattools import coint


coint(P,Q)

TIME SERIES ANALYSIS IN PYTHON


Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Case Study: Climate
Change
T I M E S E R I E S A N A LY S I S I N P Y T H O N

Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
Analyzing Temperature Data
Temperature data:
New York City from 1870-2016

Downloaded from National Oceanic and Atmospheric
Administration (NOAA)

Convert index to datetime object

Plot data

TIME SERIES ANALYSIS IN PYTHON


Analyzing Temperature Data
Test for Random Walk

Take first differences

Compute ACF and PACF

Fit a few AR, MA, and ARMA models

Use Information Criterion to choose best model

Forecast temperature over next 30 years

TIME SERIES ANALYSIS IN PYTHON


Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Congratulations
T I M E S E R I E S A N A LY S I S I N P Y T H O N

Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
Advanced Topics
GARCH Models

Nonlinear Models

Multivariate Time Series Models

Regime Switching Models

State Space Models and Kalman Filtering

...

TIME SERIES ANALYSIS IN PYTHON


Keep practicing!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Welcome to the
course!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N

Thomas Vincent
Head of Data Science, Getty Images
Prerequisites
Intro to Python for Data Science

Intermediate Python for Data Science

VISUALIZING TIME SERIES DATA IN PYTHON


Time series in the field of Data Science
Time series are a fundamental way to store and analyze
many types of data

Financial, weather and device data are all best handled as
time series

VISUALIZING TIME SERIES DATA IN PYTHON


Time series in the field of Data Science

VISUALIZING TIME SERIES DATA IN PYTHON


Course overview
Chapter 1: Getting started and personalizing your first time
series plot

Chapter 2: Summarizing and describing time series data

Chapter 3: Advanced time series analysis

Chapter 4: Working with multiple time series

Chapter 5: Case Study

VISUALIZING TIME SERIES DATA IN PYTHON


Reading data with Pandas
import pandas as pd
df = pd.read_csv('ch2_co2_levels.csv')
print(df)

datestamp co2
0 1958-03-29 316.1
1 1958-04-05 317.3
2 1958-04-12 317.6
...
...
...
2281 2001-12-15 371.2
2282 2001-12-22 371.3
2283 2001-12-29 371.5

VISUALIZING TIME SERIES DATA IN PYTHON


Preview data with Pandas
print(df.head(n=5))

datestamp co2
0 1958-03-29 316.1
1 1958-04-05 317.3
2 1958-04-12 317.6
3 1958-04-19 317.5
4 1958-04-26 316.4

print(df.tail(n=5))

datestamp co2
2279 2001-12-01 370.3
2280 2001-12-08 370.8
2281 2001-12-15 371.2
2282 2001-12-22 371.3
2283 2001-12-29 371.5

VISUALIZING TIME SERIES DATA IN PYTHON


Check data types with Pandas
print(df.dtypes)

datestamp object
co2 float64
dtype: object

VISUALIZING TIME SERIES DATA IN PYTHON


Working with dates
To work with time series data in pandas , your date columns
need to be of the datetime64 type.

pd.to_datetime(['2009/07/31', 'test'])

ValueError: Unknown string format

pd.to_datetime(['2009/07/31', 'test'], errors='coerce')

DatetimeIndex(['2009-07-31', 'NaT'],
dtype='datetime64[ns]', freq=None)
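A short sketch of how this might be applied to a DataFrame column. The three-row frame below is made up for illustration (the third date is deliberately invalid); only the column names datestamp and co2 come from the slides.

```python
import pandas as pd

# Illustrative data with one unparseable date
df = pd.DataFrame({
    'datestamp': ['1958-03-29', '1958-04-05', 'not a date'],
    'co2': [316.1, 317.3, 317.6],
})

# Unparseable entries become NaT instead of raising an error
df['datestamp'] = pd.to_datetime(df['datestamp'], errors='coerce')
n_bad = int(df['datestamp'].isna().sum())

# Drop the rows that could not be parsed, then use the dates as the index
df = df.dropna(subset=['datestamp']).set_index('datestamp')

print(n_bad)
print(df.index.dtype)
```

Counting the NaT values before dropping them makes it visible how many rows were lost to coercion.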

VISUALIZING TIME SERIES DATA IN PYTHON


Let's get started!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Plot your first time
series
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N

Thomas Vincent
Head of Data Science, Getty Images
The Matplotlib library
In Python, matplotlib is an extensive package used to plot
data

The pyplot submodule of matplotlib is traditionally imported
using the plt alias

import matplotlib.pyplot as plt

VISUALIZING TIME SERIES DATA IN PYTHON


Plotting time series data

VISUALIZING TIME SERIES DATA IN PYTHON


Plotting time series data
import matplotlib.pyplot as plt
import pandas as pd

df = df.set_index('date_column')
df.plot()
plt.show()

VISUALIZING TIME SERIES DATA IN PYTHON


Adding style to your plots
plt.style.use('fivethirtyeight')
df.plot()
plt.show()

VISUALIZING TIME SERIES DATA IN PYTHON


FiveThirtyEight style

VISUALIZING TIME SERIES DATA IN PYTHON


Matplotlib style sheets
print(plt.style.available)

['seaborn-dark-palette', 'seaborn-darkgrid',
'seaborn-dark', 'seaborn-notebook',
'seaborn-pastel', 'seaborn-white',
'classic', 'ggplot', 'grayscale',
'dark_background', 'seaborn-poster',
'seaborn-muted', 'seaborn', 'bmh',
'seaborn-paper', 'seaborn-whitegrid',
'seaborn-bright', 'seaborn-talk',
'fivethirtyeight', 'seaborn-colorblind',
'seaborn-deep', 'seaborn-ticks']

VISUALIZING TIME SERIES DATA IN PYTHON


Describing your graphs with labels
ax = df.plot(color='blue')

ax.set_xlabel('Date')
ax.set_ylabel('The values of my Y axis')
ax.set_title('The title of my plot')
plt.show()

VISUALIZING TIME SERIES DATA IN PYTHON


Figure size, linewidth, linestyle and fontsize
ax = df.plot(figsize=(12, 5), fontsize=12,
linewidth=3, linestyle='--')
ax.set_xlabel('Date', fontsize=16)
ax.set_ylabel('The values of my Y axis', fontsize=16)
ax.set_title('The title of my plot', fontsize=16)
plt.show()

VISUALIZING TIME SERIES DATA IN PYTHON


Let's practice!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Customize your time
series plot
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N

Thomas Vincent
Head of Data Science, Getty Images
Slicing time series data
discoveries['1960':'1970']

discoveries['1950-01':'1950-12']

discoveries['1960-01-01':'1960-01-15']
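The slices above rely on a DatetimeIndex, where partial date strings select whole years or months and both endpoints are included. A small sketch with a made-up monthly frame (the data and the column name Y are assumptions); it uses .loc, which is the explicit form of the same row selection.

```python
import pandas as pd

# Illustrative monthly data from 1958 to 1972
idx = pd.date_range('1958-01-01', '1972-12-01', freq='MS')
discoveries = pd.DataFrame({'Y': range(len(idx))}, index=idx)

# Year-level slice: every row from January 1960 through December 1970
decade = discoveries.loc['1960':'1970']

# Month-level slice: January through June 1965
half_year = discoveries.loc['1965-01':'1965-06']

print(len(decade), len(half_year))
```

Unlike integer slicing, the end label is part of the result, so '1960':'1970' covers eleven full years.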

VISUALIZING TIME SERIES DATA IN PYTHON


Plotting subset of your time series data
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
df_subset = discoveries['1960':'1970']

ax = df_subset.plot(color='blue', fontsize=14)
plt.show()

VISUALIZING TIME SERIES DATA IN PYTHON


Adding markers
ax.axvline(x='1969-01-01',
color='red',
linestyle='--')

ax.axhline(y=100,
color='green',
linestyle='--')

VISUALIZING TIME SERIES DATA IN PYTHON


Using markers: the full code
ax = discoveries.plot(color='blue')
ax.set_xlabel('Date')
ax.set_ylabel('Number of great discoveries')
ax.axvline('1969-01-01', color='red', linestyle='--')
ax.axhline(4, color='green', linestyle='--')

VISUALIZING TIME SERIES DATA IN PYTHON


Highlighting regions of interest
ax.axvspan('1964-01-01', '1968-01-01',
color='red', alpha=0.5)

ax.axhspan(8, 6, color='green',
alpha=0.2)

VISUALIZING TIME SERIES DATA IN PYTHON


Highlighting regions of interest: the full code
ax = discoveries.plot(color='blue')
ax.set_xlabel('Date')
ax.set_ylabel('Number of great discoveries')

ax.axvspan('1964-01-01', '1968-01-01', color='red',
           alpha=0.3)
ax.axhspan(8, 6, color='green', alpha=0.3)

VISUALIZING TIME SERIES DATA IN PYTHON


Let's practice!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Clean your time
series data
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N

Thomas Vincent
Head of Data Science, Getty Images
The CO2 level time series
A snippet of the weekly measurements of CO2 levels at the
Mauna Loa Observatory, Hawaii.

datestamp co2
1958-03-29 316.1
1958-04-05 317.3
1958-04-12 317.6
...
...
2001-12-15 371.2
2001-12-22 371.3
2001-12-29 371.5

VISUALIZING TIME SERIES DATA IN PYTHON


Finding missing values in a DataFrame
print(df.isnull())

datestamp co2
1958-03-29 False
1958-04-05 False
1958-04-12 False

print(df.notnull())

datestamp co2
1958-03-29 True
1958-04-05 True
1958-04-12 True
...

VISUALIZING TIME SERIES DATA IN PYTHON


Counting missing values in a DataFrame
print(df.isnull().sum())

datestamp 0
co2 59
dtype: int64

VISUALIZING TIME SERIES DATA IN PYTHON


Replacing missing values in a DataFrame
print(df)

...
5 1958-05-03 316.9
6 1958-05-10 NaN
7 1958-05-17 317.5
...

df = df.bfill()  # same as fillna(method='bfill'), which newer pandas deprecates
print(df)

...
5 1958-05-03 316.9
6 1958-05-10 317.5
7 1958-05-17 317.5
...
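Backfilling is one of several ways to replace a gap. The sketch below reuses the three values shown above and compares backward fill, forward fill and linear interpolation; which one is appropriate depends on the data, and the slides only use backward fill.

```python
import numpy as np
import pandas as pd

co2 = pd.Series(
    [316.9, np.nan, 317.5],
    index=pd.to_datetime(['1958-05-03', '1958-05-10', '1958-05-17']),
)

n_missing = int(co2.isnull().sum())

filled_back = co2.bfill()          # next valid value: 317.5
filled_forward = co2.ffill()       # previous valid value: 316.9
filled_linear = co2.interpolate()  # midpoint of the neighbours: 317.2

print(n_missing)
print(filled_back.iloc[1], filled_forward.iloc[1], filled_linear.iloc[1])
```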

VISUALIZING TIME SERIES DATA IN PYTHON


Let's practice!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Plot aggregates of
your data
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N

Thomas Vincent
Head of Data Science, Getty Images
Moving averages
In the field of time series analysis, a moving average can be
used for many different purposes:
smoothing out short-term fluctuations

removing outliers

highlighting long-term trends or cycles.

VISUALIZING TIME SERIES DATA IN PYTHON


The moving average model
co2_levels_mean = co2_levels.rolling(window=52).mean()

ax = co2_levels_mean.plot()
ax.set_xlabel("Date")
ax.set_ylabel("The values of my Y axis")
ax.set_title("52 weeks rolling mean of my time series")

plt.show()
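To see what .rolling(window=...).mean() actually returns, here is a tiny sketch on six made-up values with a window of 3. The first window − 1 entries are NaN because a full window is not yet available; min_periods=1 is shown as one way to relax that.

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Mean of the current value and the two before it
rolling_mean = values.rolling(window=3).mean()

# Same window, but averages whatever is available at the start
rolling_mean_partial = values.rolling(window=3, min_periods=1).mean()

print(rolling_mean.tolist())
print(rolling_mean_partial.tolist())
```

With window=52 on weekly data, the first 51 points of the smoothed series are therefore missing from the plot.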

VISUALIZING TIME SERIES DATA IN PYTHON


A plot of the moving average for the CO2 data

VISUALIZING TIME SERIES DATA IN PYTHON


Computing aggregate values of your time series
co2_levels.index

DatetimeIndex(['1958-03-29', '1958-04-05',...],
dtype='datetime64[ns]', name='datestamp',
length=2284, freq=None)

print(co2_levels.index.month)

array([ 3, 4, 4, ..., 12, 12, 12], dtype=int32)

print(co2_levels.index.year)

array([1958, 1958, 1958, ..., 2001, 2001, 2001], dtype=int32)

VISUALIZING TIME SERIES DATA IN PYTHON


Plotting aggregate values of your time series
index_month = co2_levels.index.month
co2_levels_by_month = co2_levels.groupby(index_month).mean()
co2_levels_by_month.plot()

plt.show()
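A minimal sketch of the same grouping on made-up data: two years of monthly values where each month's value equals its month number, so the monthly means are easy to verify. The frame and column name are assumptions for illustration.

```python
import pandas as pd

idx = pd.date_range('2000-01-01', periods=24, freq='MS')
df = pd.DataFrame({'value': list(range(1, 13)) * 2}, index=idx)

# Group rows by calendar month (1-12) and average across years
index_month = df.index.month
by_month = df.groupby(index_month).mean()

print(by_month.head(3))
```

The result has one row per calendar month, regardless of how many years the series spans.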

VISUALIZING TIME SERIES DATA IN PYTHON


Plotting aggregate values of your time series

VISUALIZING TIME SERIES DATA IN PYTHON


Let's practice!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Summarizing the
values in your time
series data
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N

Thomas Vincent
Head of Data Science, Getty Images
Obtaining numerical summaries of your data
What is the average value of this data?

What is the maximum value observed in this time series?

VISUALIZING TIME SERIES DATA IN PYTHON


The .describe() method automatically computes key
statistics of all numeric columns in your DataFrame

print(df.describe())

co2
count 2284.000000
mean 339.657750
std 17.100899
min 313.000000
25% 323.975000
50% 337.700000
75% 354.500000
max 373.900000
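Individual statistics can be pulled out of the summary by label. A small sketch using the first five CO2 values shown earlier in the deck:

```python
import pandas as pd

df = pd.DataFrame({'co2': [316.1, 317.3, 317.6, 317.5, 316.4]})

summary = df.describe()

# The summary is itself a DataFrame indexed by statistic name
mean_co2 = summary.loc['mean', 'co2']
max_co2 = summary.loc['max', 'co2']

print(mean_co2, max_co2)
```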

VISUALIZING TIME SERIES DATA IN PYTHON


Summarizing your data with boxplots
ax1 = df.boxplot()
ax1.set_xlabel('Your first boxplot')
ax1.set_ylabel('Values of your data')
ax1.set_title('Boxplot values of your data')

plt.show()

VISUALIZING TIME SERIES DATA IN PYTHON


A boxplot of the values in the CO2 data

VISUALIZING TIME SERIES DATA IN PYTHON


Summarizing your data with histograms
ax2 = df.plot(kind='hist', bins=100)
ax2.set_xlabel('Your first histogram')
ax2.set_ylabel('Frequency of values in your data')
ax2.set_title('Histogram of your data with 100 bins')

plt.show()

VISUALIZING TIME SERIES DATA IN PYTHON


A histogram plot of the values in the CO2 data

VISUALIZING TIME SERIES DATA IN PYTHON


Summarizing your data with density plots
ax3 = df.plot(kind='density', linewidth=2)
ax3.set_xlabel('Your first density plot')
ax3.set_ylabel('Density values of your data')
ax3.set_title('Density plot of your data')

plt.show()

VISUALIZING TIME SERIES DATA IN PYTHON


A density plot of the values in the CO2 data

VISUALIZING TIME SERIES DATA IN PYTHON


Let's practice!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Autocorrelation and
Partial
autocorrelation
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N

Thomas Vincent
Head of Data Science, Getty Images
Autocorrelation in time series data
Autocorrelation is measured as the correlation between a
time series and a delayed copy of itself

For example, an autocorrelation of order 3 returns the
correlation between a time series at points ( t_1 , t_2 , t_3 ,
...) and its own values lagged by 3 time points, i.e. ( t_4 , t_5
, t_6 , ...)

It is used to find repetitive patterns or periodic signals in time
series
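The definition can be written out directly. This sketch, on a made-up sine wave with a period of 12 points, computes the lag-3 autocorrelation by correlating the series with a shifted copy of itself and compares it with pandas' Series.autocorr.

```python
import numpy as np
import pandas as pd

# Illustrative periodic signal: 5 full cycles of 12 points each
s = pd.Series(np.sin(2 * np.pi * np.arange(60) / 12))

lag = 3
# Correlation between (t_4, t_5, ...) and (t_1, t_2, ...)
manual = np.corrcoef(s.values[lag:], s.values[:-lag])[0, 1]
builtin = s.autocorr(lag=lag)

# A full period apart the values repeat; half a period apart they flip sign
full_period = s.autocorr(lag=12)
half_period = s.autocorr(lag=6)

print(manual, builtin, full_period, half_period)
```

Peaks in the autocorrelation at multiples of 12 are the kind of periodic signal an autocorrelation plot is meant to reveal.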

VISUALIZING TIME SERIES DATA IN PYTHON


Statsmodels
statsmodels is a Python module that provides classes and
functions for the estimation of many different statistical
models, as well as for conducting statistical tests, and
statistical data exploration.

VISUALIZING TIME SERIES DATA IN PYTHON


Plotting autocorrelations
import matplotlib.pyplot as plt
from statsmodels.graphics import tsaplots
fig = tsaplots.plot_acf(co2_levels['co2'], lags=40)

plt.show()

VISUALIZING TIME SERIES DATA IN PYTHON


Interpreting autocorrelation plots

VISUALIZING TIME SERIES DATA IN PYTHON


Partial autocorrelation in time series data
Contrary to autocorrelation, partial autocorrelation removes
the effect of previous time points

For example, a partial autocorrelation function of order 3
returns the correlation between our time series ( t1 , t2 , t3 ,
...) and lagged values of itself by 3 time points ( t4 , t5 , t6 ,
...), but only after removing all effects attributable to lags 1
and 2
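One way to make "removing the effects of shorter lags" concrete is a regression: regress the series on its first k lags and read off the coefficient on lag k. The sketch below does this with NumPy on a simulated AR(1) series with coefficient 0.7 (an arbitrary choice). The helper pacf_by_regression is an illustrative name and is not how statsmodels necessarily computes its default estimate.

```python
import numpy as np

def pacf_by_regression(x, lag):
    """Coefficient on x_{t-lag} when x_t is regressed on lags 1..lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    target = x[lag:]
    # Column k-1 holds the series shifted back by k steps
    lagged = np.column_stack([x[lag - k: n - k] for k in range(1, lag + 1)])
    coefs, *_ = np.linalg.lstsq(lagged, target, rcond=None)
    return float(coefs[-1])

# Simulated AR(1): x_t = 0.7 * x_{t-1} + eps_t
rng = np.random.default_rng(0)
n, phi = 20_000, 0.7
eps = rng.standard_normal(n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + eps[t]

pacf1 = pacf_by_regression(x, 1)
pacf2 = pacf_by_regression(x, 2)
pacf3 = pacf_by_regression(x, 3)
print(pacf1, pacf2, pacf3)
```

For an AR(1) the partial autocorrelation is large at lag 1 and close to zero afterwards, even though the ordinary autocorrelation decays slowly.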

VISUALIZING TIME SERIES DATA IN PYTHON


Plotting partial autocorrelations
import matplotlib.pyplot as plt

from statsmodels.graphics import tsaplots


fig = tsaplots.plot_pacf(co2_levels['co2'], lags=40)

plt.show()

VISUALIZING TIME SERIES DATA IN PYTHON


Interpreting partial autocorrelations plot

VISUALIZING TIME SERIES DATA IN PYTHON


Let's practice!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Seasonality, trend
and noise in time
series data
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N

Thomas Vincent
Head of Data Science, Getty Images
Properties of time series

VISUALIZING TIME SERIES DATA IN PYTHON


The properties of time series
Seasonality: does the data display a clear periodic pattern?

Trend: does the data follow a consistent upwards or
downwards slope?

Noise: are there any outlier points or missing values that are
not consistent with the rest of the data?

VISUALIZING TIME SERIES DATA IN PYTHON


Time series decomposition
import statsmodels.api as sm
import matplotlib.pyplot as plt
from pylab import rcParams

rcParams['figure.figsize'] = 11, 9
decomposition = sm.tsa.seasonal_decompose(
co2_levels['co2'])
fig = decomposition.plot()

plt.show()
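To give a feel for what an additive decomposition does, here is a simplified pandas-only sketch: the trend is a centred moving average over one full period, the seasonal part is the average detrended value per calendar month, and the residual is what remains. It is an approximation under stated assumptions (monthly data, period of 12, made-up series), not a reproduction of the statsmodels implementation.

```python
import numpy as np
import pandas as pd

period = 12
idx = pd.date_range('2000-01-01', periods=72, freq='MS')
t = np.arange(len(idx))

# Illustrative series: linear trend plus a yearly cycle, no noise
series = pd.Series(0.5 * t + 10 * np.sin(2 * np.pi * t / period), index=idx)

# Trend: 12-point mean, averaged over two adjacent windows and re-centred
trend = series.rolling(period).mean().rolling(2).mean().shift(-(period // 2))

# Seasonal: mean of the detrended values for each calendar month
detrended = series - trend
seasonal = detrended.groupby(detrended.index.month).transform('mean')

# Residual: whatever trend and seasonality do not explain
resid = series - trend - seasonal

print(trend.dropna().head(3))
print(resid.dropna().abs().max())
```

The trend is undefined for the first and last half period, which is why decomposition output typically starts and ends with NaN values.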

VISUALIZING TIME SERIES DATA IN PYTHON


A plot of time series decomposition on the CO2 data

VISUALIZING TIME SERIES DATA IN PYTHON


Extracting components from time series
decomposition
print(dir(decomposition))

['__class__', '__delattr__', '__dict__',
... 'plot', 'resid', 'seasonal', 'trend']

print(decomposition.seasonal)

datestamp
1958-03-29 1.028042
1958-04-05 1.235242
1958-04-12 1.412344
1958-04-19 1.701186

VISUALIZING TIME SERIES DATA IN PYTHON


Seasonality component in time series
decomp_seasonal = decomposition.seasonal

ax = decomp_seasonal.plot(figsize=(14, 2))
ax.set_xlabel('Date')
ax.set_ylabel('Seasonality of time series')
ax.set_title('Seasonal values of the time series')

plt.show()

VISUALIZING TIME SERIES DATA IN PYTHON


Seasonality component in time series

VISUALIZING TIME SERIES DATA IN PYTHON


Trend component in time series
decomp_trend = decomposition.trend

ax = decomp_trend.plot(figsize=(14, 2))
ax.set_xlabel('Date')
ax.set_ylabel('Trend of time series')
ax.set_title('Trend values of the time series')

plt.show()

VISUALIZING TIME SERIES DATA IN PYTHON


Trend component in time series

VISUALIZING TIME SERIES DATA IN PYTHON


Noise component in time series
decomp_resid = decomposition.resid

ax = decomp_resid.plot(figsize=(14, 2))
ax.set_xlabel('Date')
ax.set_ylabel('Residual of time series')
ax.set_title('Residual values of the time series')

plt.show()

VISUALIZING TIME SERIES DATA IN PYTHON


Noise component in time series

VISUALIZING TIME SERIES DATA IN PYTHON


Let's practice!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
A review on what
you have learned so
far
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N

Thomas Vincent
Head of Data Science, Getty Images
So far ...
Visualize aggregates of time series data

Extract statistical summaries

Autocorrelation and Partial autocorrelation

Time series decomposition

VISUALIZING TIME SERIES DATA IN PYTHON


The airline dataset

VISUALIZING TIME SERIES DATA IN PYTHON


Let's analyze this
data!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Working with more
than one time series
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N

Thomas Vincent
Head of Data Science, Getty Images
Working with multiple time series
An isolated time series

date ts1
1949-01 112
1949-02 118
1949-03 132

A file with multiple time series

date ts1 ts2 ts3 ts4 ts5 ts6 ts7


2012-01-01 2113.8 10.4 1987.0 12.1 3091.8 43.2 476.7
2012-02-01 2009.0 9.8 1882.9 12.3 2954.0 38.8 466.8
2012-03-01 2159.8 10.0 1987.9 14.3 3043.7 40.1 502.1

VISUALIZING TIME SERIES DATA IN PYTHON


The Meat production dataset
import pandas as pd
meat = pd.read_csv("meat.csv")
print(meat.head(5))

date beef veal pork lamb_and_mutton broilers


0 1944-01-01 751.0 85.0 1280.0 89.0 NaN
1 1944-02-01 713.0 77.0 1169.0 72.0 NaN
2 1944-03-01 741.0 90.0 1128.0 75.0 NaN
3 1944-04-01 650.0 89.0 978.0 66.0 NaN
4 1944-05-01 681.0 106.0 1029.0 78.0 NaN

other_chicken turkey
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN

VISUALIZING TIME SERIES DATA IN PYTHON


Summarizing and plotting multiple time series
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
ax = df.plot(figsize=(12, 4), fontsize=14)

plt.show()

VISUALIZING TIME SERIES DATA IN PYTHON


Area charts
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
ax = df.plot.area(figsize=(12, 4), fontsize=14)

plt.show()

VISUALIZING TIME SERIES DATA IN PYTHON


Let's practice!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Plot multiple time
series
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N

Thomas Vincent
Head of Data Science, Getty Images
Clarity is key
In this plot, the default matplotlib color scheme assigns the
same color to the beef and turkey time series.

VISUALIZING TIME SERIES DATA IN PYTHON


The colormap argument
ax = df.plot(colormap='Dark2', figsize=(14, 7))
ax.set_xlabel('Date')
ax.set_ylabel('Production Volume (in tons)')

plt.show()

For the full set of available colormaps, see the Matplotlib colormap reference.

VISUALIZING TIME SERIES DATA IN PYTHON


Changing line colors with the colormap argument

VISUALIZING TIME SERIES DATA IN PYTHON


Enhancing your plot with information
ax = df.plot(colormap='Dark2', figsize=(14, 7))
df_summary = df.describe()

# Specify values of cells in the table
ax.table(cellText=df_summary.values,
         # Specify width of the table columns
         colWidths=[0.3]*len(df.columns),
         # Specify row labels
         rowLabels=df_summary.index,
         # Specify column labels
         colLabels=df_summary.columns,
         # Specify location of the table
         loc='top')

plt.show()

VISUALIZING TIME SERIES DATA IN PYTHON


Adding Statistical summaries to your plots

VISUALIZING TIME SERIES DATA IN PYTHON


Dealing with different scales

VISUALIZING TIME SERIES DATA IN PYTHON


Only veal

VISUALIZING TIME SERIES DATA IN PYTHON


Facet plots
df.plot(subplots=True,
linewidth=0.5,
layout=(2, 4),
figsize=(16, 10),
sharex=False,
sharey=False)

plt.show()

VISUALIZING TIME SERIES DATA IN PYTHON


Time for some
action!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Find relationships
between multiple
time series
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N

Thomas Vincent
Head of Data Science, Getty Images
Correlations between two variables
In the field of Statistics, the correlation coefficient is a
measure used to determine the strength or lack of
relationship between two variables:
Pearson's coefficient can be used to compute the
correlation coefficient between variables for which the
relationship is thought to be linear

Kendall Tau or Spearman rank can be used to compute the
correlation coefficient between variables for which the
relationship is thought to be non-linear

VISUALIZING TIME SERIES DATA IN PYTHON


Compute correlations
from scipy.stats import pearsonr, spearmanr, kendalltau
x = [1, 2, 4, 7]
y = [1, 3, 4, 8]
pearsonr(x, y)

(0.9843, 0.01569)

spearmanr(x, y)

SpearmanrResult(correlation=1.0, pvalue=0.0)

kendalltau(x, y)

KendalltauResult(correlation=1.0, pvalue=0.0415)
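The gap between the Pearson and Spearman results comes from what each one measures: Spearman's coefficient is Pearson's coefficient applied to the ranks of the data. A short pandas sketch with the same x and y:

```python
import pandas as pd

x = pd.Series([1, 2, 4, 7])
y = pd.Series([1, 3, 4, 8])

# Pearson: linear association of the raw values
pearson = x.corr(y)

# Spearman: Pearson correlation of the ranks
spearman = x.rank().corr(y.rank())

print(round(pearson, 4), spearman)
```

Because y increases whenever x does, the ranks agree perfectly and the rank correlation is 1.0 even though the raw values are not exactly on a line.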

VISUALIZING TIME SERIES DATA IN PYTHON


What is a correlation matrix?
When computing the correlation coefficient between more
than two variables, you obtain a correlation matrix
Range: [-1, 1]

0: no relationship

1: strong positive relationship

-1: strong negative relationship

VISUALIZING TIME SERIES DATA IN PYTHON


What is a correlation matrix?
A correlation matrix is always "symmetric"

The diagonal values will always be equal to 1

x y z
x 1.00 -0.46 0.49
y -0.46 1.00 -0.61
z 0.49 -0.61 1.00
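Both properties can be confirmed on any data. A minimal sketch with three made-up random columns named x, y and z:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((50, 3)), columns=['x', 'y', 'z'])

corr = df.corr(method='pearson')

# corr(x, y) equals corr(y, x), and every variable correlates perfectly with itself
is_symmetric = np.allclose(corr.values, corr.values.T)
has_unit_diagonal = np.allclose(np.diag(corr.values), 1.0)

print(is_symmetric, has_unit_diagonal)
```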

VISUALIZING TIME SERIES DATA IN PYTHON


Computing Correlation Matrices with Pandas
corr_p = meat[['beef', 'veal','turkey']].corr(method='pearson')
print(corr_p)

beef veal turkey


beef 1.000 -0.829 0.738
veal -0.829 1.000 -0.768
turkey 0.738 -0.768 1.000

corr_s = meat[['beef', 'veal','turkey']].corr(method='spearman')


print(corr_s)

beef veal turkey


beef 1.000 -0.812 0.778
veal -0.812 1.000 -0.829
turkey 0.778 -0.829 1.000

VISUALIZING TIME SERIES DATA IN PYTHON


Computing Correlation Matrices with Pandas
corr_mat = meat.corr(method='pearson', numeric_only=True)  # skip the non-numeric date column

VISUALIZING TIME SERIES DATA IN PYTHON


Heatmap
import seaborn as sns
sns.heatmap(corr_mat)

VISUALIZING TIME SERIES DATA IN PYTHON


Heatmap

VISUALIZING TIME SERIES DATA IN PYTHON


Clustermap
sns.clustermap(corr_mat)

VISUALIZING TIME SERIES DATA IN PYTHON


Let's practice!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Apply your
knowledge to a new
dataset
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N

Thomas Vincent
Head of Data Science, Getty Images
The Jobs dataset

VISUALIZING TIME SERIES DATA IN PYTHON


Let's get started!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Beyond summary
statistics
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N

Thomas Vincent
Head of Data Science, Getty Images
Facet plots of the jobs dataset
jobs.plot(subplots=True,
layout=(4, 4),
figsize=(20, 16),
sharex=True,
sharey=False)

plt.show()

VISUALIZING TIME SERIES DATA IN PYTHON


Annotating events in the jobs dataset
ax = jobs.plot(figsize=(20, 14), colormap='Dark2')
ax.axvline('2008-01-01', color='black',
linestyle='--')
ax.axvline('2009-01-01', color='black',
linestyle='--')

VISUALIZING TIME SERIES DATA IN PYTHON


Taking seasonal average in the jobs dataset
print(jobs.index)

DatetimeIndex(['2000-01-01', '2000-02-01', '2000-03-01',


'2000-04-01', '2009-09-01','2009-10-01',
'2009-11-01', '2009-12-01','2010-01-01', '2010-02-01'],
dtype='datetime64[ns]', name='datestamp',
length=122, freq=None)

index_month = jobs.index.month
jobs_by_month = jobs.groupby(index_month).mean()
print(jobs_by_month)

datestamp Agriculture Business services Construction


1 13.763636 7.863636 12.909091
2 13.645455 7.645455 13.600000
3 13.830000 7.130000 11.290000
4 9.130000 6.270000 9.450000
5 7.100000 6.600000 8.120000
...

VISUALIZING TIME SERIES DATA IN PYTHON


Monthly averages in the jobs dataset
ax = jobs_by_month.plot(figsize=(12, 5),
colormap='Dark2')

ax.legend(bbox_to_anchor=(1.0, 0.5),
loc='center left')

VISUALIZING TIME SERIES DATA IN PYTHON


Monthly averages in the jobs dataset

VISUALIZING TIME SERIES DATA IN PYTHON


Time to practice!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Decompose time
series data
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N

Thomas Vincent
Head of Data Science, Getty Images
Python dictionaries
# Initialize a Python dictionary
my_dict = {}

# Add a key and value to your dictionary
my_dict['your_key'] = 'your_value'

# Add a second key and value to your dictionary
my_dict['your_second_key'] = 'your_second_value'

# Print out your dictionary
print(my_dict)

{'your_key': 'your_value',
'your_second_key': 'your_second_value'}

VISUALIZING TIME SERIES DATA IN PYTHON


Decomposing multiple time series with Python
dictionaries
# Import the statsmodels library
import statsmodels.api as sm
# Initialize a dictionary
my_dict = {}
# Extract the names of the time series
ts_names = df.columns
print(ts_names)

['ts1', 'ts2', 'ts3']

# Run time series decomposition on each series
for ts in ts_names:
    ts_decomposition = sm.tsa.seasonal_decompose(df[ts])
    my_dict[ts] = ts_decomposition

VISUALIZING TIME SERIES DATA IN PYTHON


Extract decomposition components of multiple time
series
# Initialize a new dictionary
my_dict_trend = {}
# Extract the trend component
for ts in ts_names:
    my_dict_trend[ts] = my_dict[ts].trend
# Convert to a DataFrame
trend_df = pd.DataFrame.from_dict(my_dict_trend)
print(trend_df)

ts1 ts2 ts3


datestamp
2000-01-01 2.2 1.3 3.6
2000-02-01 3.4 2.1 4.7
...
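The dictionary-to-DataFrame pattern works for any per-series result that is a pandas Series. In this sketch a centred 3-point rolling mean stands in for the decomposition trend (an assumption made to keep the example free of statsmodels), and the two columns are made up.

```python
import pandas as pd

idx = pd.date_range('2000-01-01', periods=6, freq='MS')
df = pd.DataFrame(
    {'ts1': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
     'ts2': [6.0, 5.0, 4.0, 3.0, 2.0, 1.0]},
    index=idx,
)

# One entry per series name, each holding a Series with the same index
trend_by_name = {}
for ts in df.columns:
    trend_by_name[ts] = df[ts].rolling(window=3, center=True).mean()

# Keys become column names; the shared index is preserved
trend_df = pd.DataFrame.from_dict(trend_by_name)
print(trend_df)
```

Keeping the results keyed by series name means the columns of the combined DataFrame line up with the original columns.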

VISUALIZING TIME SERIES DATA IN PYTHON


Python dictionaries
for the win!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Compute
correlations
between time series
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N

Thomas Vincent
Head of Data Science, Getty Images
Trends in Jobs data
print(trend_df)

datestamp Agriculture Business services Construction

2000-01-01 NaN NaN NaN


2000-02-01 NaN NaN NaN
2000-03-01 NaN NaN NaN
2000-04-01 NaN NaN NaN
2000-05-01 NaN NaN NaN
2000-06-01 NaN NaN NaN
2000-07-01 9.170833 4.787500 6.329167
2000-08-01 9.466667 4.820833 6.304167
...

VISUALIZING TIME SERIES DATA IN PYTHON


Plotting a clustermap of the jobs correlation matrix
# Get correlation matrix of the trend_df DataFrame
trend_corr = trend_df.corr(method='spearman')

# Customize the clustermap of the trend_corr correlation matrix
fig = sns.clustermap(trend_corr, annot=True, linewidth=0.4)

plt.setp(fig.ax_heatmap.yaxis.get_majorticklabels(),
rotation=0)

plt.setp(fig.ax_heatmap.xaxis.get_majorticklabels(),
rotation=90)

VISUALIZING TIME SERIES DATA IN PYTHON


The jobs correlation matrix

VISUALIZING TIME SERIES DATA IN PYTHON


Let's practice!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Congratulations!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N

Thomas Vincent
Head of Data Science, Getty Images
Going further with time series
Data from Zillow Research

Kaggle competitions

Reddit Data

VISUALIZING TIME SERIES DATA IN PYTHON


Going further with time series
The importance of time series in business:
to identify seasonal patterns and trends

to study past behaviors

to produce robust forecasts

to evaluate and compare company achievements

VISUALIZING TIME SERIES DATA IN PYTHON


Getting to the next level
Manipulating Time Series Data in Python

Importing & Managing Financial Data in Python

Statistical Thinking in Python (Part 1)

Supervised Learning with scikit-learn

VISUALIZING TIME SERIES DATA IN PYTHON


Thank you!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Introduction to time
series and
stationarity
ARIMA MODELS IN PYTHON

James Fulton
Climate informatics researcher
Motivation
Time series are everywhere

Science

Technology
Business

Finance

Policy

ARIMA MODELS IN PYTHON


Course content
You will learn

Structure of ARIMA models

How to fit ARIMA models


How to optimize the model

How to make forecasts

How to calculate uncertainty in predictions

ARIMA MODELS IN PYTHON


Loading and plotting
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('time_series.csv', index_col='date', parse_dates=True)

date values
2019-03-11 5.734193
2019-03-12 6.288708
2019-03-13 5.205788
2019-03-14 3.176578

ARIMA MODELS IN PYTHON


Trend
fig, ax = plt.subplots()
df.plot(ax=ax)
plt.show()

ARIMA MODELS IN PYTHON


Seasonality

ARIMA MODELS IN PYTHON


Cyclicality

ARIMA MODELS IN PYTHON


White noise
White noise series has uncorrelated values

Heads, heads, heads, tails, heads, tails, ...

0.1, -0.3, 0.8, 0.4, -0.5, 0.9, ...

ARIMA MODELS IN PYTHON


Stationarity
Stationary Not stationary

Trend stationary: Trend is zero

Variance is constant

Autocorrelation is constant

ARIMA MODELS IN PYTHON


Train-test split
# Train data - all data up to the end of 2018
df_train = df.loc[:'2018']

# Test data - all data from 2019 onwards


df_test = df.loc['2019':]
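A quick sketch of the split on made-up daily data, checking that the two parts do not overlap and that every test date comes after every training date. Unlike a random split, the order of observations is kept.

```python
import numpy as np
import pandas as pd

# Illustrative daily series covering 2017 through 2019
idx = pd.date_range('2017-01-01', '2019-12-31', freq='D')
df = pd.DataFrame({'values': np.arange(len(idx))}, index=idx)

# Train on everything up to the end of 2018, test on 2019 onwards
df_train = df.loc[:'2018']
df_test = df.loc['2019':]

print(df_train.index.max(), df_test.index.min())
```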

ARIMA MODELS IN PYTHON


Let's Practice!
ARIMA MODELS IN PYTHON
Making time series
stationary
ARIMA MODELS IN PYTHON

James Fulton
Climate informatics researcher
Overview
Statistical tests for stationarity
Making a dataset stationary

ARIMA MODELS IN PYTHON


The augmented Dickey-Fuller test
Tests for trend non-stationarity
Null hypothesis: the time series is non-stationary

ARIMA MODELS IN PYTHON


Applying the adfuller test

from statsmodels.tsa.stattools import adfuller

results = adfuller(df['close'])

ARIMA MODELS IN PYTHON


Interpreting the test result
print(results)

(-1.34, 0.60, 23, 1235, {'1%': -3.435, '5%': -2.863, '10%': -2.568}, 10782.87)

0th element is test statistic (-1.34)


More negative means more likely to be stationary

1st element is p-value: (0.60)


If p-value is small → reject null hypothesis. Reject non-stationary.

4th element is a dictionary of critical values for the test statistic

1 https://www.statsmodels.org/dev/generated/statsmodels.tsa.stattools.adfuller.html
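Unpacking the tuple by name can make the result easier to read. The sketch below uses the printed values from this slide rather than calling adfuller again; the variable names and the 0.05 significance level are choices made for illustration.

```python
# The result tuple as printed on the slide
results = (-1.34, 0.60, 23, 1235,
           {'1%': -3.435, '5%': -2.863, '10%': -2.568}, 10782.87)

# Elements 0-5: statistic, p-value, lags used, observations used,
# critical values, and the information criterion of the chosen lag
test_stat, p_value, used_lags, n_obs, critical_values, icbest = results

# Reject the null of non-stationarity only if the p-value is small
reject_non_stationary = p_value < 0.05

# Equivalent view: the statistic must be more negative than the critical value
below_5pct_critical = test_stat < critical_values['5%']

print(reject_non_stationary, below_5pct_critical)
```

Here both checks come out False, so the null hypothesis stands and the series is treated as non-stationary.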

ARIMA MODELS IN PYTHON


The value of plotting
Plotting time series can stop you making wrong assumptions

ARIMA MODELS IN PYTHON


The value of plotting

ARIMA MODELS IN PYTHON


Making a time series stationary

ARIMA MODELS IN PYTHON


Taking the difference

Difference: Δyt = yt − yt−1

ARIMA MODELS IN PYTHON


Taking the difference
df_stationary = df.diff()

city_population
date
1969-09-30 NaN
1970-03-31 -0.116156
1970-09-30 0.050850
1971-03-31 -0.153261
1971-09-30 0.108389

ARIMA MODELS IN PYTHON


Taking the difference
df_stationary = df.diff().dropna()

city_population
date
1970-03-31 -0.116156
1970-09-30 0.050850
1971-03-31 -0.153261
1971-09-30 0.108389
1972-03-31 -0.029569
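Differencing a random walk recovers the random steps it was built from, which is why it often yields a stationary series. A small sketch on simulated data (the series and the column name are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
steps = rng.standard_normal(200)

# A random walk is the running sum of its steps
idx = pd.date_range('2000-01-31', periods=200, freq='D')
df = pd.DataFrame({'value': np.cumsum(steps)}, index=idx)

# y_t - y_{t-1}; the first row has no predecessor, so it is dropped
df_stationary = df.diff().dropna()

print(len(df), len(df_stationary))
```

One observation is lost for each round of differencing.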

ARIMA MODELS IN PYTHON


Taking the difference

ARIMA MODELS IN PYTHON


Other transforms
Examples of other transforms

Take the log


np.log(df)

Take the square root


np.sqrt(df)

Take the proportional change


df.shift(1)/df

ARIMA MODELS IN PYTHON


Let's practice!
ARIMA MODELS IN PYTHON
Intro to AR, MA and
ARMA models
ARIMA MODELS IN PYTHON

James Fulton
Climate informatics researcher
AR models
Autoregressive (AR) model

AR(1) model:
$y_t = a_1 y_{t-1} + \epsilon_t$

ARIMA MODELS IN PYTHON


AR models
Autoregressive (AR) model

AR(1) model:
$y_t = a_1 y_{t-1} + \epsilon_t$

AR(2) model:
$y_t = a_1 y_{t-1} + a_2 y_{t-2} + \epsilon_t$

AR(p) model:
$y_t = a_1 y_{t-1} + a_2 y_{t-2} + \dots + a_p y_{t-p} + \epsilon_t$

ARIMA MODELS IN PYTHON


MA models
Moving average (MA) model

MA(1) model:
$y_t = m_1 \epsilon_{t-1} + \epsilon_t$

MA(2) model:
$y_t = m_1 \epsilon_{t-1} + m_2 \epsilon_{t-2} + \epsilon_t$

MA(q) model:
$y_t = m_1 \epsilon_{t-1} + m_2 \epsilon_{t-2} + \dots + m_q \epsilon_{t-q} + \epsilon_t$

ARIMA MODELS IN PYTHON


ARMA models
Autoregressive moving-average (ARMA) model

ARMA = AR + MA

ARMA(1,1) model:
$y_t = a_1 y_{t-1} + m_1 \epsilon_{t-1} + \epsilon_t$

ARMA(p, q)

p is order of AR part

q is order of MA part

ARIMA MODELS IN PYTHON


Creating ARMA data
$y_t = a_1 y_{t-1} + m_1 \epsilon_{t-1} + \epsilon_t$

ARIMA MODELS IN PYTHON


Creating ARMA data
$y_t = 0.5 y_{t-1} + 0.2 \epsilon_{t-1} + \epsilon_t$

from statsmodels.tsa.arima_process import arma_generate_sample


ar_coefs = [1, -0.5]
ma_coefs = [1, 0.2]
y = arma_generate_sample(ar_coefs, ma_coefs, nsample=100, scale=0.5)
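Note the sign convention: `arma_generate_sample` takes lag-polynomial coefficients, so each list starts with 1 for the zero lag and the AR coefficient is entered with its sign reversed (0.5 becomes -0.5). To make the recursion itself visible, here is a minimal standard-library sketch of the same ARMA(1,1) equation. The function `simulate_arma11` and its `seed` argument are hypothetical, and it will not reproduce the exact numbers statsmodels generates.

```python
import random


def simulate_arma11(a1, m1, n, scale=1.0, seed=0):
    """Simulate y_t = a1*y_{t-1} + m1*eps_{t-1} + eps_t, starting from zero."""
    rng = random.Random(seed)
    y = []
    prev_y, prev_eps = 0.0, 0.0
    for _ in range(n):
        eps = rng.gauss(0.0, scale)  # new shock with standard deviation `scale`
        value = a1 * prev_y + m1 * prev_eps + eps
        y.append(value)
        prev_y, prev_eps = value, eps
    return y


# Same coefficients as the slide: a1 = 0.5, m1 = 0.2
y = simulate_arma11(a1=0.5, m1=0.2, n=100, scale=0.5)
print(len(y))  # 100
```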

ARIMA MODELS IN PYTHON


Creating ARMA data
$y_t = 0.5 y_{t-1} + 0.2 \epsilon_{t-1} + \epsilon_t$

ARIMA MODELS IN PYTHON


Fitting an ARMA model
from statsmodels.tsa.arima.model import ARIMA
# Instantiate model object
model = ARIMA(y, order=(1,0,1))
# Fit model
results = model.fit()

ARIMA MODELS IN PYTHON


Let's practice!
ARIMA MODELS IN PYTHON
Fitting time series
models
ARIMA MODELS IN PYTHON

James Fulton
Climate informatics researcher
Creating a model
from statsmodels.tsa.arima.model import ARIMA

# This is an ARMA(p,q) model


model = ARIMA(timeseries, order=(p,0,q))

ARIMA MODELS IN PYTHON


Creating AR and MA models
ar_model = ARIMA(timeseries, order=(p,0,0))

ma_model = ARIMA(timeseries, order=(0,0,q))

ARIMA MODELS IN PYTHON


Fitting the model and fit summary
model = ARIMA(timeseries, order=(2,0,1))
results = model.fit()

print(results.summary())

ARIMA MODELS IN PYTHON


Fit summary
SARIMAX Results
==============================================================================
Dep. Variable: y No. Observations: 1000
Model: ARMA(2, 1) Log Likelihood 148.580
Date: Thu, 25 Apr 2022 AIC -287.159
Time: 22:57:00 BIC -262.621
Sample: 0 HQIC -277.833
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -0.0017 0.012 -0.147 0.883 -0.025 0.021
ar.L1.y 0.5253 0.054 9.807 0.000 0.420 0.630
ar.L2.y -0.2909 0.042 -6.850 0.000 -0.374 -0.208
ma.L1.y 0.3679 0.052 7.100 0.000 0.266 0.469

ARIMA MODELS IN PYTHON


Fit summary
SARIMAX Results
==============================================================================
Dep. Variable: y No. Observations: 1000
Model: ARMA(2, 1) Log Likelihood 148.580
Date: Thu, 25 Apr 2022 AIC -287.159
Time: 22:57:00 BIC -262.621
Sample: 0 HQIC -277.833
Covariance Type: opg

ARIMA MODELS IN PYTHON


Fit summary
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -0.0017 0.012 -0.147 0.883 -0.025 0.021
ar.L1.y 0.5253 0.054 9.807 0.000 0.420 0.630
ar.L2.y -0.2909 0.042 -6.850 0.000 -0.374 -0.208
ma.L1.y 0.3679 0.052 7.100 0.000 0.266 0.469
sigma2 1.6306 0.339 6.938 0.000 0.583 1.943

ARIMA MODELS IN PYTHON


Introduction to ARMAX models
Exogenous ARMA
Use external variables as well as time series

ARMAX = ARMA + linear regression

ARIMA MODELS IN PYTHON


ARMAX equation
ARMA(1,1) model:
$y_t = a_1 y_{t-1} + m_1 \epsilon_{t-1} + \epsilon_t$

ARMAX(1,1) model:
$y_t = x_1 z_t + a_1 y_{t-1} + m_1 \epsilon_{t-1} + \epsilon_t$

ARIMA MODELS IN PYTHON


ARMAX example

ARIMA MODELS IN PYTHON


ARMAX example

ARIMA MODELS IN PYTHON


Fitting ARMAX
# Instantiate the model
model = ARIMA(df['productivity'], order=(2,0,1), exog=df['hours_sleep'])

# Fit the model


results = model.fit()

ARIMA MODELS IN PYTHON


ARMAX summary
==============================================================================
coef std err z P>|z| [0.025 0.975]
-----------------------------------------------------------------------------
const -0.1936 0.092 -2.098 0.041 -0.375 -0.013
x1 0.1131 0.013 8.602 0.000 0.087 0.139
ar.L1.y 0.1917 0.252 0.760 0.450 -0.302 0.686
ar.L2.y -0.3740 0.121 -3.079 0.003 -0.612 -0.136
ma.L1.y -0.0740 0.259 -0.286 0.776 -0.581 0.433

ARIMA MODELS IN PYTHON


Let's practice!
ARIMA MODELS IN PYTHON
Forecasting
ARIMA MODELS IN PYTHON

James Fulton
Climate informatics researcher
Predicting the next value
Take an AR(1) model

$y_t = a_1 y_{t-1} + \epsilon_t$

Predict next value

$y_t = 0.6 \times 10 + \epsilon_t$

$y_t = 6.0 + \epsilon_t$

Uncertainty on prediction

$5.0 < y_t < 7.0$
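The arithmetic above can be reproduced, and extended to several steps ahead, with a short sketch. It assumes the shock has standard deviation 1, which is what would make the interval on the slide a one-standard-deviation band. `ar1_forecast` is a hypothetical helper. For an AR(1) model the h-step forecast error variance is the sum of sigma^2 * a1^(2i) for i from 0 to h-1, so the band widens with the horizon.

```python
import math


def ar1_forecast(a1, last_value, steps, sigma=1.0):
    """Return (mean, standard error) for horizons 1..steps of an AR(1) model."""
    forecasts = []
    mean = last_value
    variance = 0.0
    for h in range(1, steps + 1):
        mean = a1 * mean                            # point forecast decays towards zero
        variance += sigma ** 2 * a1 ** (2 * (h - 1))  # shocks accumulate over the horizon
        forecasts.append((mean, math.sqrt(variance)))
    return forecasts


forecasts = ar1_forecast(a1=0.6, last_value=10.0, steps=3, sigma=1.0)
mean_1, se_1 = forecasts[0]
print(mean_1 - se_1, mean_1, mean_1 + se_1)  # 5.0 6.0 7.0
```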

ARIMA MODELS IN PYTHON


One-step-ahead predictions

ARIMA MODELS IN PYTHON


Making one-step-ahead predictions
# Make predictions for last 25 values
results = model.fit()
# Make in-sample prediction
forecast = results.get_prediction(start=-25)

ARIMA MODELS IN PYTHON


Making one-step-ahead predictions
# Make predictions for last 25 values
results = model.fit()
# Make in-sample prediction
forecast = results.get_prediction(start=-25)
# forecast mean
mean_forecast = forecast.predicted_mean

Predicted mean is a pandas series

2013-10-28 1.519368
2013-10-29 1.351082
2013-10-30 1.218016

ARIMA MODELS IN PYTHON


Confidence intervals
# Get confidence intervals of forecasts
confidence_intervals = forecast.conf_int()

Confidence interval method returns pandas DataFrame

lower y upper y
2013-09-28 -4.720471 -0.815384
2013-09-29 -5.069875 0.112505
2013-09-30 -5.232837 0.766300
2013-10-01 -5.305814 1.282935
2013-10-02 -5.326956 1.703974

ARIMA MODELS IN PYTHON


Plotting predictions
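import matplotlib.pyplot as plt

# Dates and limits come from the prediction objects created earlier
dates = mean_forecast.index
lower_limits = confidence_intervals.loc[:, 'lower y']
upper_limits = confidence_intervals.loc[:, 'upper y']

plt.figure()

# Plot prediction
plt.plot(dates,
         mean_forecast.values,
         color='red',
         label='forecast')
# Shade uncertainty area
plt.fill_between(dates, lower_limits, upper_limits, color='pink')

plt.show()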

ARIMA MODELS IN PYTHON


Plotting predictions

ARIMA MODELS IN PYTHON


Dynamic predictions

ARIMA MODELS IN PYTHON


Making dynamic predictions
results = model.fit()
forecast = results.get_prediction(start=-25, dynamic=True)

# forecast mean
mean_forecast = forecast.predicted_mean

# Get confidence intervals of forecasts


confidence_intervals = forecast.conf_int()

ARIMA MODELS IN PYTHON


Forecasting out of sample
forecast = results.get_forecast(steps=20)

# forecast mean
mean_forecast = forecast.predicted_mean

# Get confidence intervals of forecasts


confidence_intervals = forecast.conf_int()

ARIMA MODELS IN PYTHON


Forecasting out of sample
forecast = results.get_forecast(steps=20)

ARIMA MODELS IN PYTHON


Let's practice!
ARIMA MODELS IN PYTHON
Introduction to
ARIMA models
ARIMA MODELS IN PYTHON

James Fulton
Climate informatics researcher
Non-stationary time series recap

ARIMA MODELS IN PYTHON


Non-stationary time series recap

ARIMA MODELS IN PYTHON


Forecast of differenced time series

ARIMA MODELS IN PYTHON


Reconstructing original time series after differencing
diff_forecast = results.get_forecast(steps=10).predicted_mean
from numpy import cumsum
mean_forecast = cumsum(diff_forecast)

ARIMA MODELS IN PYTHON


Reconstructing original time series after differencing
diff_forecast = results.get_forecast(steps=10).predicted_mean
from numpy import cumsum
mean_forecast = cumsum(diff_forecast) + df.iloc[-1,0]
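The reconstruction step is just a running sum of the forecast changes added to the last observed level. A minimal standard-library sketch, with made-up numbers and a hypothetical helper name `integrate`:

```python
from itertools import accumulate


def integrate(diff_forecast, last_observed):
    """Undo first differencing: running sum of changes plus the last known level."""
    return [last_observed + total for total in accumulate(diff_forecast)]


# Hypothetical forecast of the differenced series and last observed value
diff_forecast = [1.0, -0.5, 2.0]
mean_forecast = integrate(diff_forecast, last_observed=10.0)
print(mean_forecast)  # [11.0, 10.5, 12.5]
```

Without the `last_observed` term the forecast would start from zero instead of continuing from the end of the data.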

ARIMA MODELS IN PYTHON


Reconstructing original time series after differencing

ARIMA MODELS IN PYTHON


The ARIMA model

Take the difference

Fit ARMA model


Integrate forecast

Can we avoid doing so much work?

Yes!

ARIMA - Autoregressive Integrated Moving Average

ARIMA MODELS IN PYTHON


Using the ARIMA model
from statsmodels.tsa.arima.model import ARIMA
model = ARIMA(df, order=(p,d,q))

p - number of autoregressive lags


d - order of differencing

q - number of moving average lags

ARIMA(p, 0, q) = ARMA(p, q)

ARIMA MODELS IN PYTHON


Using the ARIMA model
# Create model
model = ARIMA(df, order=(2,1,1))
# Fit model
results = model.fit()
# Make forecast
mean_forecast = results.get_forecast(steps=10).predicted_mean

ARIMA MODELS IN PYTHON


Using the ARIMA model
# Make forecast
mean_forecast = results.get_forecast(steps=steps).predicted_mean

ARIMA MODELS IN PYTHON


Picking the difference order
adf = adfuller(df.iloc[:,0])
print('ADF Statistic:', adf[0])
print('p-value:', adf[1])

ADF Statistic: -2.674


p-value: 0.0784

adf = adfuller(df.diff().dropna().iloc[:,0])
print('ADF Statistic:', adf[0])
print('p-value:', adf[1])

ADF Statistic: -4.978


p-value: 2.44e-05

ARIMA MODELS IN PYTHON


Picking the difference order
model = ARIMA(df, order=(p,1,q))

ARIMA MODELS IN PYTHON


Let's practice!
ARIMA MODELS IN PYTHON
Intro to ACF and
PACF
ARIMA MODELS IN PYTHON

James Fulton
Climate informatics researcher
Motivation

ARIMA MODELS IN PYTHON


ACF and PACF
ACF - Autocorrelation Function

PACF - Partial autocorrelation function

ARIMA MODELS IN PYTHON


What is the ACF
lag-1 autocorrelation → $\mathrm{corr}(y_t, y_{t-1})$
lag-2 autocorrelation → $\mathrm{corr}(y_t, y_{t-2})$

...

lag-n autocorrelation → $\mathrm{corr}(y_t, y_{t-n})$
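To make the definition concrete, the sketch below computes the sample autocorrelation at a few lags for a made-up series that flips sign at every step. The function name `autocorrelation` is hypothetical, and the estimator shown (lagged products divided by the lag-0 sum of squares) is one common choice, stated here as an assumption rather than as what any particular plotting function uses.

```python
def autocorrelation(values, lag):
    """Sample autocorrelation at the given lag, normalised by the lag-0 sum."""
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    denominator = sum(d * d for d in deviations)
    numerator = sum(deviations[t] * deviations[t - lag] for t in range(lag, n))
    return numerator / denominator


# Hypothetical series that flips sign at every step
series = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
acf_values = [autocorrelation(series, lag) for lag in range(3)]
print(acf_values)  # [1.0, -0.875, 0.75]
```

Neighbouring values are strongly negatively correlated (lag 1) while values two steps apart are positively correlated (lag 2).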

ARIMA MODELS IN PYTHON


What is the ACF

ARIMA MODELS IN PYTHON


What is the PACF

ARIMA MODELS IN PYTHON


Using ACF and PACF to choose model order

AR(2) model →

ARIMA MODELS IN PYTHON


Using ACF and PACF to choose model order

MA(2) model →

ARIMA MODELS IN PYTHON


Using ACF and PACF to choose model order

ARIMA MODELS IN PYTHON


Using ACF and PACF to choose model order

ARIMA MODELS IN PYTHON


Implementation in Python
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Create figure
fig, (ax1, ax2) = plt.subplots(2,1, figsize=(8,8))
# Make ACF plot
plot_acf(df, lags=10, zero=False, ax=ax1)
# Make PACF plot
plot_pacf(df, lags=10, zero=False, ax=ax2)

plt.show()

ARIMA MODELS IN PYTHON


Implementation in Python

ARIMA MODELS IN PYTHON


Over/under differencing and ACF and PACF

ARIMA MODELS IN PYTHON


Over/under differencing and ACF and PACF

ARIMA MODELS IN PYTHON


Let's practice!
ARIMA MODELS IN PYTHON
AIC and BIC
ARIMA MODELS IN PYTHON

James Fulton
Climate informatics researcher
AIC - Akaike information criterion
Lower AIC indicates a better model
AIC likes to choose simple models with lower order

ARIMA MODELS IN PYTHON


BIC - Bayesian information criterion
Very similar to AIC
Lower BIC indicates a better model

BIC likes to choose simple models with lower order

ARIMA MODELS IN PYTHON


AIC vs BIC
BIC favors simpler models than AIC
AIC is better at choosing predictive models

BIC is better at choosing good explanatory model

ARIMA MODELS IN PYTHON


AIC and BIC in statsmodels
# Create model
model = ARIMA(df, order=(2,0,0))
# Fit model
results = model.fit()
# Print fit summary
print(results.summary())

Statespace Model Results


==============================================================================
Dep. Variable: y No. Observations: 1000
Model: SARIMAX(2, 0, 0) Log Likelihood -1399.704
Date: Fri, 10 May 2019 AIC 2805.407
Time: 01:06:11 BIC 2820.131
Sample: 01-01-2013 HQIC 2811.003
- 09-27-2015
Covariance Type: opg
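Both criteria are simple functions of the log likelihood: AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where k is the number of estimated parameters and n the number of observations. The sketch below recomputes the values in the summary above. Taking k = 3 (two AR coefficients plus the noise variance, no constant) is an assumption inferred from the numbers, not something the summary states.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion; the penalty grows with the sample size."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Values read from the summary above; n_params = 3 is an assumption
log_likelihood = -1399.704
n_obs = 1000
n_params = 3

print(round(aic(log_likelihood, n_params), 2))         # 2805.41
print(round(bic(log_likelihood, n_params, n_obs), 2))  # 2820.13
```

Because ln(1000) is about 6.9, each extra parameter costs more under BIC than the fixed 2 under AIC, which is why BIC tends to pick simpler models.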

ARIMA MODELS IN PYTHON


AIC and BIC in statsmodels
# Create model
model = ARIMA(df, order=(1,0,1))
# Fit model
results = model.fit()
# Print AIC and BIC
print('AIC:', results.aic)
print('BIC:', results.bic)

AIC: 2806.36
BIC: 2821.09

ARIMA MODELS IN PYTHON


Searching over AIC and BIC
# Loop over AR order
for p in range(3):
    # Loop over MA order
    for q in range(3):
        # Fit model
        model = ARIMA(df, order=(p,0,q))
        results = model.fit()
        # print the model order and the AIC/BIC values
        print(p, q, results.aic, results.bic)

0 0 2900.13 2905.04
0 1 2828.70 2838.52
0 2 2806.69 2821.42
1 0 2810.25 2820.06
1 1 2806.37 2821.09
1 2 2807.52 2827.15
...

ARIMA MODELS IN PYTHON


Searching over AIC and BIC
order_aic_bic = []
# Loop over AR order
for p in range(3):
    # Loop over MA order
    for q in range(3):
        # Fit model
        model = ARIMA(df, order=(p,0,q))
        results = model.fit()
        # Add order and scores to list
        order_aic_bic.append((p, q, results.aic, results.bic))

# Make DataFrame of model order and AIC/BIC scores
order_df = pd.DataFrame(order_aic_bic, columns=['p','q', 'aic', 'bic'])

ARIMA MODELS IN PYTHON


Searching over AIC and BIC
# Sort by AIC
print(order_df.sort_values('aic'))

   p  q      aic      bic
7  2  1  2804.54  2824.17
6  2  0  2805.41  2820.13
4  1  1  2806.37  2821.09
2  0  2  2806.69  2821.42
...

# Sort by BIC
print(order_df.sort_values('bic'))

   p  q      aic      bic
3  1  0  2810.25  2820.06
6  2  0  2805.41  2820.13
4  1  1  2806.37  2821.09
2  0  2  2806.69  2821.42
...
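The same selection can be done without sorting a DataFrame. The sketch below uses only the rows actually printed on the slides; rows hidden behind "..." are left out rather than guessed, so the result is illustrative only.

```python
# (p, q, aic, bic) rows printed in the search above; elided rows are omitted
order_aic_bic = [
    (0, 0, 2900.13, 2905.04),
    (0, 1, 2828.70, 2838.52),
    (0, 2, 2806.69, 2821.42),
    (1, 0, 2810.25, 2820.06),
    (1, 1, 2806.37, 2821.09),
    (1, 2, 2807.52, 2827.15),
    (2, 0, 2805.41, 2820.13),
    (2, 1, 2804.54, 2824.17),
]

# Lowest score wins for both criteria
best_by_aic = min(order_aic_bic, key=lambda row: row[2])
best_by_bic = min(order_aic_bic, key=lambda row: row[3])
print(best_by_aic[:2])  # (2, 1)
print(best_by_bic[:2])  # (1, 0)
```

The two criteria disagree here: AIC prefers the larger ARMA(2,1) model, while BIC prefers the simpler AR(1).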

ARIMA MODELS IN PYTHON


Non-stationary model orders
# Fit model
model = ARIMA(df, order=(2,0,1))
results = model.fit()

ValueError: Non-stationary starting autoregressive parameters


found with `enforce_stationarity` set to True.

ARIMA MODELS IN PYTHON


When certain orders don't work
# Loop over AR order
for p in range(3):
    # Loop over MA order
    for q in range(3):

        # Fit model
        model = ARIMA(df, order=(p,0,q))
        results = model.fit()

        # Print the model order and the AIC/BIC values
        print(p, q, results.aic, results.bic)

ARIMA MODELS IN PYTHON


When certain orders don't work
# Loop over AR order
for p in range(3):
    # Loop over MA order
    for q in range(3):
        try:
            # Fit model
            model = ARIMA(df, order=(p,0,q))
            results = model.fit()

            # Print the model order and the AIC/BIC values
            print(p, q, results.aic, results.bic)
        except Exception:
            # Print AIC and BIC as None when the fit fails
            print(p, q, None, None)

ARIMA MODELS IN PYTHON


Let's practice!
ARIMA MODELS IN PYTHON
Model diagnostics
ARIMA MODELS IN PYTHON

James Fulton
Climate informatics researcher
Introduction to model diagnostics
How good is the final model?

ARIMA MODELS IN PYTHON


Residuals

ARIMA MODELS IN PYTHON


Residuals
# Fit model
model = ARIMA(df, order=(p,d,q))
results = model.fit()
# Assign residuals to variable
residuals = results.resid

2013-01-23 1.013129
2013-01-24 0.114055
2013-01-25 0.430698
2013-01-26 -1.247046
2013-01-27 -0.499565
... ...

ARIMA MODELS IN PYTHON


Mean absolute error
How far are the predictions from the real values?

mae = np.mean(np.abs(residuals))
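For reference, the same calculation written out with the standard library, applied to the five residuals printed on the previous slide. This is only a sketch on a handful of values; the real MAE would use every residual from the fit.

```python
def mean_absolute_error(residuals):
    """Average size of the residuals, ignoring their sign."""
    return sum(abs(r) for r in residuals) / len(residuals)


# First five residuals shown on the slide
residuals = [1.013129, 0.114055, 0.430698, -1.247046, -0.499565]
mae = mean_absolute_error(residuals)
print(round(mae, 4))  # 0.6609
```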

ARIMA MODELS IN PYTHON


Plot diagnostics
If the model fits well the residuals will be
white Gaussian noise

# Create the 4 diagnostics plots


results.plot_diagnostics()
plt.show()

ARIMA MODELS IN PYTHON


Residuals plot

ARIMA MODELS IN PYTHON


Residuals plot

ARIMA MODELS IN PYTHON


Histogram plus estimated density

ARIMA MODELS IN PYTHON


Normal Q-Q

ARIMA MODELS IN PYTHON


Correlogram

ARIMA MODELS IN PYTHON


Summary statistics
print(results.summary())

...
===================================================================================
Ljung-Box (Q): 32.10 Jarque-Bera (JB): 0.02
Prob(Q): 0.81 Prob(JB): 0.99
Heteroskedasticity (H): 1.28 Skew: -0.02
Prob(H) (two-sided): 0.21 Kurtosis: 2.98
===================================================================================

Prob(Q) - p-value for null hypothesis that residuals are uncorrelated

Prob(JB) - p-value for null hypothesis that residuals are normal

ARIMA MODELS IN PYTHON


Let's practice!
ARIMA MODELS IN PYTHON
Box-Jenkins method
ARIMA MODELS IN PYTHON

James Fulton
Climate informatics researcher
The Box-Jenkins method
From raw data → production model

identification

estimation
model diagnostics

ARIMA MODELS IN PYTHON


Identification
Is the time series stationary?
What differencing will make it stationary?

What transforms will make it stationary?

What values of p and q are most


promising?

ARIMA MODELS IN PYTHON


Identification tools
Plot the time series
df.plot()

Use augmented Dickey-Fuller test


adfuller()

Use transforms and/or differencing


df.diff() , np.log() , np.sqrt()

Plot ACF/PACF
plot_acf() , plot_pacf()

ARIMA MODELS IN PYTHON


Estimation
Use the data to train the model coefficients
Done for us using model.fit()

Choose between models using AIC and BIC


results.aic , results.bic

ARIMA MODELS IN PYTHON


Model diagnostics
Are the residuals uncorrelated

Are residuals normally distributed


results.plot_diagnostics()

results.summary()

ARIMA MODELS IN PYTHON


Decision

ARIMA MODELS IN PYTHON


Repeat
We go through the process again with more
information

Find a better model

ARIMA MODELS IN PYTHON


Production
Ready to make forecasts
results.get_forecast()

ARIMA MODELS IN PYTHON


Box-Jenkins

ARIMA MODELS IN PYTHON


Let's practice!
ARIMA MODELS IN PYTHON
Seasonal time series
ARIMA MODELS IN PYTHON

James Fulton
Climate informatics researcher
Seasonal data
Has predictable and repeated patterns
Repeats after any amount of time

ARIMA MODELS IN PYTHON


Seasonal decomposition

ARIMA MODELS IN PYTHON


Seasonal decomposition

time series = trend + seasonal + residual

ARIMA MODELS IN PYTHON


Seasonal decomposition using statsmodels
# Import
from statsmodels.tsa.seasonal import seasonal_decompose

# Decompose data
decomp_results = seasonal_decompose(df['IPG3113N'], period=12)

type(decomp_results)

statsmodels.tsa.seasonal.DecomposeResult

ARIMA MODELS IN PYTHON


Seasonal decomposition using statsmodels
# Plot decomposed data
decomp_results.plot()
plt.show()

ARIMA MODELS IN PYTHON


Finding seasonal period using ACF

ARIMA MODELS IN PYTHON


Identifying seasonal data using ACF

ARIMA MODELS IN PYTHON


Detrending time series
# Subtract long rolling average over N steps
df = df - df.rolling(N).mean()
# Drop NaN values
df = df.dropna()

ARIMA MODELS IN PYTHON


Identifying seasonal data using ACF
# Create figure
fig, ax = plt.subplots(1,1, figsize=(8,4))

# Plot ACF
plot_acf(df.dropna(), ax=ax, lags=25, zero=False)
plt.show()

ARIMA MODELS IN PYTHON


ARIMA models and seasonal data

ARIMA MODELS IN PYTHON


Let's practice!
ARIMA MODELS IN PYTHON
SARIMA models
ARIMA MODELS IN PYTHON

James Fulton
Climate informatics researcher
The SARIMA model
Seasonal ARIMA = SARIMA
SARIMA(p,d,q)(P,D,Q)S

Non-seasonal orders

p: autoregressive order
d: differencing order
q: moving average order

Seasonal orders

P: seasonal autoregressive order
D: seasonal differencing order
Q: seasonal moving average order
S: number of time steps per cycle

ARIMA MODELS IN PYTHON


The SARIMA model
ARIMA(2,0,1) model:
$y_t = a_1 y_{t-1} + a_2 y_{t-2} + m_1 \epsilon_{t-1} + \epsilon_t$

SARIMA(0,0,0)(2,0,1)7 model:
$y_t = a_7 y_{t-7} + a_{14} y_{t-14} + m_7 \epsilon_{t-7} + \epsilon_t$

ARIMA MODELS IN PYTHON


Fitting a SARIMA model
# Imports
from statsmodels.tsa.statespace.sarimax import SARIMAX
# Instantiate model
model = SARIMAX(df, order=(p,d,q), seasonal_order=(P,D,Q,S))
# Fit model
results = model.fit()

ARIMA MODELS IN PYTHON


Seasonal differencing
Subtract the time series value of one season ago

$\Delta y_t = y_t - y_{t-S}$

# Take the seasonal difference


df_diff = df.diff(S)
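A small sketch of why seasonal differencing helps: the made-up series below repeats a three-step pattern while its level rises by 10 each cycle. Subtracting the value from one season earlier removes the repeating pattern and leaves only the per-cycle change. `seasonal_difference` is a hypothetical helper mirroring `df.diff(S)` without the leading NaN rows.

```python
def seasonal_difference(values, period):
    """Return values[t] - values[t - period] for every t where both exist."""
    return [values[t] - values[t - period] for t in range(period, len(values))]


# Hypothetical data: a pattern of period 3 on top of a level that rises each cycle
pattern = [1.0, 5.0, 3.0]
series = [level + season for level in (0.0, 10.0, 20.0, 30.0) for season in pattern]

seasonal_diff = seasonal_difference(series, period=3)
print(seasonal_diff)  # nine values, all equal to 10.0
```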

ARIMA MODELS IN PYTHON


Differencing for SARIMA models

Time series

ARIMA MODELS IN PYTHON


Differencing for SARIMA models

First difference of time series

ARIMA MODELS IN PYTHON


Differencing for SARIMA models

First difference and first seasonal difference of time series

ARIMA MODELS IN PYTHON


Finding p and q

ARIMA MODELS IN PYTHON


Finding P and Q

ARIMA MODELS IN PYTHON


Plotting seasonal ACF and PACF
# Create figure
fig, (ax1, ax2) = plt.subplots(2,1)

# Plot seasonal ACF


plot_acf(df_diff, lags=[12,24,36,48,60,72], ax=ax1)

# Plot seasonal PACF


plot_pacf(df_diff, lags=[12,24,36,48,60,72], ax=ax2)

plt.show()

ARIMA MODELS IN PYTHON


Let's practice!
ARIMA MODELS IN PYTHON
Automation and
saving
ARIMA MODELS IN PYTHON

James Fulton
Climate informatics researcher
Searching over model orders
import pmdarima as pm

results = pm.auto_arima(df)

Performing stepwise search to minimize aic


ARIMA(2,0,2)(1,1,1)[12] intercept : AIC=inf, Time=3.33 sec
ARIMA(0,0,0)(0,1,0)[12] intercept : AIC=2648.467, Time=0.062 sec
ARIMA(1,0,0)(1,1,0)[12] intercept : AIC=2279.986, Time=1.171 sec

...

ARIMA(3,0,3)(1,1,1)[12] intercept : AIC=2173.508, Time=12.487 sec


ARIMA(3,0,3)(0,1,0)[12] intercept : AIC=2297.305, Time=2.087 sec

Best model: ARIMA(3,0,3)(1,1,1)[12]


Total fit time: 245.812 seconds

ARIMA MODELS IN PYTHON


pmdarima results
print(results.summary())

results.plot_diagnostics()

ARIMA MODELS IN PYTHON


Non-seasonal search parameters

ARIMA MODELS IN PYTHON


Non-seasonal search parameters
results = pm.auto_arima( df, # data
d=0, # non-seasonal difference order
start_p=1, # initial guess for p
start_q=1, # initial guess for q
max_p=3, # max value of p to test
max_q=3, # max value of q to test
)

1 https://www.alkaline-ml.com/pmdarima/modules/generated/pmdarima.arima.auto_arima.html

ARIMA MODELS IN PYTHON


Seasonal search parameters
results = pm.auto_arima( df, # data
... , # non-seasonal arguments
seasonal=True, # is the time series seasonal
m=7, # the seasonal period
D=1, # seasonal difference order
start_P=1, # initial guess for P
start_Q=1, # initial guess for Q
max_P=2, # max value of P to test
max_Q=2, # max value of Q to test
)

ARIMA MODELS IN PYTHON


Other parameters
results = pm.auto_arima( df, # data
... , # model order parameters
information_criterion='aic', # used to select best model
trace=True, # print results whilst training
error_action='ignore', # ignore orders that don't work
stepwise=True, # apply intelligent order search
)

ARIMA MODELS IN PYTHON


Saving model objects
# Import
import joblib

# Select a filepath
filepath ='localpath/great_model.pkl'

# Save model to filepath


joblib.dump(model_results_object, filepath)

ARIMA MODELS IN PYTHON


Saving model objects
# Select a filepath
filepath ='localpath/great_model.pkl'

# Load model object from filepath


model_results_object = joblib.load(filepath)

ARIMA MODELS IN PYTHON


Updating model
# Add new observations and update parameters
model_results_object.update(df_new)

ARIMA MODELS IN PYTHON


Update comparison

ARIMA MODELS IN PYTHON


Let's practice!
ARIMA MODELS IN PYTHON
SARIMA and Box-
Jenkins
ARIMA MODELS IN PYTHON

James Fulton
Climate informatics researcher
Box-Jenkins

ARIMA MODELS IN PYTHON


Box-Jenkins with seasonal data
Determine if time series is seasonal
Find seasonal period

Find transforms to make data stationary


Seasonal and non-seasonal differencing

Other transforms

ARIMA MODELS IN PYTHON


Mixed differencing
D should be 0 or 1

d + D should be 0-2

ARIMA MODELS IN PYTHON


Weak vs strong seasonality

Weak seasonal pattern: use seasonal differencing if necessary

Strong seasonal pattern: always use seasonal differencing

ARIMA MODELS IN PYTHON


Additive vs multiplicative seasonality

Additive series = trend + season: proceed as usual with differencing

Multiplicative series = trend x season: apply log transform first - np.log
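The reason the log transform works is that the logarithm of a product is a sum, so a multiplicative series becomes additive on the log scale. A minimal check with made-up trend and seasonal factors:

```python
import math

# Hypothetical trend and seasonal factors
trend = [100.0, 110.0, 120.0, 130.0]
season = [0.8, 1.2, 0.8, 1.2]

# Multiplicative series and its log
series = [t * s for t, s in zip(trend, season)]
log_series = [math.log(v) for v in series]

# On the log scale the two components simply add
log_parts = [math.log(t) + math.log(s) for t, s in zip(trend, season)]
print(all(math.isclose(a, b) for a, b in zip(log_series, log_parts)))  # True
```

After the transform, the usual seasonal and non-seasonal differencing can be applied to the logged series.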

ARIMA MODELS IN PYTHON


Multiplicative to additive seasonality

ARIMA MODELS IN PYTHON


Let's practice!
ARIMA MODELS IN PYTHON
Congratulations!
ARIMA MODELS IN PYTHON

James Fulton
Climate informatics researcher
The SARIMAX model

ARIMA MODELS IN PYTHON


Time series modeling framework
Test for stationarity and seasonality

Find promising model orders

Fit models and narrow selection with


AIC/BIC

Perform model diagnostics tests

Make forecasts

Save and update models

ARIMA MODELS IN PYTHON


Further steps
Fit data created using arma_generate_sample()

Tackle real world data! Either your own or examples from statsmodels

ARIMA MODELS IN PYTHON


Further steps
Fit data created using arma_generate_sample()
Tackle real world data! Either your own or examples from statsmodels

More time series courses here

1 https://www.statsmodels.org/stable/datasets/index.html

ARIMA MODELS IN PYTHON


Good luck!
ARIMA MODELS IN PYTHON
Timeseries kinds and
applications
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Chris Holdgraf
Fellow, Berkeley Institute for Data
Science
Time Series

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Time Series

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


What makes a time series?
Datapoint Datapoint Datapoint Datapoint Datapoint Datapoint
1 34 12 54 76 40

Timepoint Timepoint Timepoint Timepoint Timepoint Timepoint


2:00 2:01 2:02 2:03 2:04 2:05

Timepoint Timepoint Timepoint Timepoint Timepoint Timepoint


Jan Feb March April May Jun

Timepoint Timepoint Timepoint Timepoint Timepoint Timepoint


1e-9 2e-9 3e-9 4e-9 5e-9 6e-9

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Reading in a time series with Pandas
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('data.csv')
data.head()

date symbol close volume


0 2010-01-04 AAPL 214.009998 123432400.0
46 2010-01-05 AAPL 214.379993 150476200.0
92 2010-01-06 AAPL 210.969995 138040000.0
138 2010-01-07 AAPL 210.580000 119282800.0
184 2010-01-08 AAPL 211.980005 111902700.0

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Plotting a pandas timeseries
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(12, 6))
data.plot('date', 'close', ax=ax)
ax.set(title="AAPL daily closing price")

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


A timeseries plot

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Why machine learning?
We can use really big data and really complicated data

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Why machine learning?
We can...

Predict the future

Automate this process

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Why combine these two?

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


A machine learning pipeline
Feature extraction

Model fitting

Prediction and validation

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Let's practice!
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Machine learning
basics
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Chris Holdgraf
Fellow, Berkeley Institute for Data
Science
Always begin by looking at your data
array.shape

(10, 3)

array[:3]

array([[ 0.735528 , 1.00122818, -0.28315978],


[-0.94478393, 0.18658748, -0.00241224],
[-0.74822942, -1.46636618, 0.69835096]])

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Always begin by looking at your data
df.head()

col1 col2 col3


0 0.735528 1.001228 -0.283160
1 -0.944784 0.186587 -0.002412
2 -0.748229 -1.466366 0.698351
3 1.038589 -0.171248 0.831457
4 -0.161904 0.003972 -0.321933

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Always visualize your data
Make sure it looks the way you'd expect.

# Using matplotlib
fig, ax = plt.subplots()
ax.plot(...)

# Using pandas
fig, ax = plt.subplots()
df.plot(..., ax=ax)

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Scikit-learn
Scikit-learn is the most popular machine learning library in Python

from sklearn.svm import LinearSVC

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Preparing data for scikit-learn
scikit-learn expects a particular structure of data:

(samples, features)

Make sure that your data is at least two-dimensional

Make sure the first dimension is samples

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


If your data is not shaped properly
If the axes are swapped:

array.T.shape

(10, 3)

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


If your data is not shaped properly
If we're missing an axis, use .reshape() :

array.shape

(10,)

array.reshape(-1, 1).shape

(10, 1)

-1 will automatically fill that axis with remaining values

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Fitting a model with scikit-learn
# Import a support vector classifier
from sklearn.svm import LinearSVC

# Instantiate this model


model = LinearSVC()

# Fit the model on some data


model.fit(X, y)

It is common for y to be of shape (samples, 1)

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Investigating the model
# There is one coefficient per input feature
model.coef_

array([[ 0.69417875, -0.5289162 ]])

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Predicting with a fit model
# Generate predictions
predictions = model.predict(X_test)

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Let's practice
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Combining
timeseries data with
machine learning
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Chris Holdgraf
Fellow, Berkeley Institute for Data
Science
Getting to know our data
The datasets that we'll use in this course are all freely-available online

There are many datasets available to download on the web; the ones we'll use come from
Kaggle

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


The Heartbeat Acoustic Data
Many recordings of heart sounds from different patients

Some had normally-functioning hearts, others had abnormalities

Data comes in the form of audio files + labels for each file

Can we find the "abnormal" heart beats?

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Loading auditory data
from glob import glob
files = glob('data/heartbeat-sounds/files/*.wav')

print(files)

['data/heartbeat-sounds/proc/files/murmur__201101051104.wav',
...
'data/heartbeat-sounds/proc/files/murmur__201101051114.wav']

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Reading in auditory data
import librosa as lr
# `load` accepts a path to an audio file
audio, sfreq = lr.load('data/heartbeat-sounds/proc/files/murmur__201101051104.wav')

print(sfreq)

2205

In this case, the sampling frequency is 2205 , meaning there are 2205 samples per second

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Inferring time from samples
If we know the sampling rate of a timeseries, then we know the timestamp of each
datapoint relative to the first datapoint

Note: this assumes the sampling rate is fixed and no data points are lost

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Creating a time array (I)
Create an array of indices, one for each sample, and divide by the sampling frequency

indices = np.arange(0, len(audio))


time = indices / sfreq

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Creating a time array (II)
Find the timestamp of the last data point (index N-1). Then use linspace() to interpolate from zero
to that time, with one point per sample

final_time = (len(audio) - 1) / sfreq


time = np.linspace(0, final_time, len(audio))
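Both constructions should give the same timestamps, provided the interpolated array has one point per sample. The sketch below checks this on a made-up five-sample recording using plain Python lists; the variable names are illustrative only.

```python
sfreq = 2205
n_samples = 5  # hypothetical, very short recording

# Method I: sample index divided by sampling frequency
time_from_indices = [i / sfreq for i in range(n_samples)]

# Method II: evenly spaced points from 0 to the time of the last sample
final_time = (n_samples - 1) / sfreq
step = final_time / (n_samples - 1)
time_interpolated = [i * step for i in range(n_samples)]

print(time_from_indices[-1], time_interpolated[-1])
```

If the number of interpolated points differed from the number of samples, the time array could not be paired with the audio array for plotting.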

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


The New York Stock Exchange dataset
This dataset consists of company stock values for 10 years

Can we detect any patterns in historical records that allow us to predict the value of
companies in the future?

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Looking at the data
data = pd.read_csv('path/to/data.csv')

data.columns

Index(['date', 'symbol', 'close', 'volume'], dtype='object')

data.head()

date symbol close volume


0 2010-01-04 AAPL 214.009998 123432400.0
1 2010-01-04 ABT 54.459951 10829000.0
2 2010-01-04 AIG 29.889999 7750900.0
3 2010-01-04 AMAT 14.300000 18615100.0
4 2010-01-04 ARNC 16.650013 11512100.0

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Timeseries with Pandas DataFrames
We can investigate the object type of each column by accessing the dtypes attribute

df['date'].dtypes

0 object
1 object
2 object
dtype: object

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Converting a column to a time series
To ensure that a column within a DataFrame is treated as time series, use the
to_datetime() function

df['date'] = pd.to_datetime(df['date'])

df['date']

0 2017-01-01
1 2017-01-02
2 2017-01-03
Name: date, dtype: datetime64[ns]

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Let's practice!
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Classification and
feature engineering
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Chris Holdgraf
Fellow, Berkeley Institute for Data
Science
Always visualize raw data before fitting models

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Visualize your timeseries data!
ixs = np.arange(audio.shape[-1])
time = ixs / sfreq
fig, ax = plt.subplots()
ax.plot(time, audio)

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


What features to use?
Raw timeseries data is too noisy for classification

We need to calculate features!

An easy start: summarize your audio data

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Calculating multiple features
print(audio.shape)
# (n_files, time)

(20, 7000)

means = np.mean(audio, axis=-1)


maxs = np.max(audio, axis=-1)
stds = np.std(audio, axis=-1)

print(means.shape)
# (n_files,)

(20,)

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Fitting a classifier with scikit-learn
We've just collapsed a 2-D dataset (samples x time) into several 1-D features (one value
per sample)

We can combine these features and use them as inputs to a model

If we have a label for each sample, we can use scikit-learn to create and fit a classifier

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Preparing your features for scikit-learn
# Import a linear classifier
from sklearn.svm import LinearSVC

# Stack the features column-wise so X has shape (n_samples, n_features);
# the labels are reshaped into a column to line up with the rows of X
X = np.column_stack([means, maxs, stds])
y = labels.reshape(-1, 1)
model = LinearSVC()
model.fit(X, y)

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Scoring your scikit-learn model
from sklearn.metrics import accuracy_score

# Predict on different (held-out) input data
predictions = model.predict(X_test)

# Score our model with the fraction of correct predictions
# Manually
percent_score = sum(predictions == labels_test) / len(labels_test)
# Using a sklearn scorer
percent_score = accuracy_score(labels_test, predictions)

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Let's practice!
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Improving the features we use for classification
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Chris Holdgraf
Fellow, Berkeley Institute for Data
Science
The auditory envelope
Smooth the data to calculate the auditory envelope

Related to the total amount of audio energy present at each moment of time

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Smoothing over time
Instead of averaging over all time, we can do a local average

This is called smoothing your timeseries

It removes short-term noise, while retaining the general pattern

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Smoothing your data

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Calculating a rolling window statistic
# Audio is a Pandas DataFrame
print(audio.shape)
# (n_times, n_audio_files)

(5000, 20)

# Smooth our data by taking the rolling mean in a window of 50 samples


window_size = 50
windowed = audio.rolling(window=window_size)
audio_smooth = windowed.mean()

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Calculating the auditory envelope
First rectify your audio, then smooth it

audio_rectified = audio.apply(np.abs)
audio_envelope = audio_rectified.rolling(50).mean()
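The rectify-then-smooth recipe can be tried on synthetic data. The sketch below assumes a made-up signal (noise that is quiet for the first half and loud for the second) and a sampling rate chosen for illustration; it is not the course's audio dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_samples = 2000

# Synthetic "sound": noise that is quiet in the first half and loud in the second
amplitude = np.where(np.arange(n_samples) < 1000, 0.1, 1.0)
audio = pd.Series(amplitude * rng.standard_normal(n_samples))

# Rectify (absolute value), then smooth with a 50-sample rolling mean
audio_rectified = audio.abs()
audio_envelope = audio_rectified.rolling(50).mean()

# Average envelope level in each half (skipping the transition region)
quiet_level = audio_envelope.iloc[:1000].mean()
loud_level = audio_envelope.iloc[1050:].mean()
```

The first 49 envelope values are NaN because the rolling window is not yet full; the remaining values track how much energy the signal carries at each moment.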

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Feature engineering the envelope
# Calculate several features of the envelope, one per sound
envelope_mean = np.mean(audio_envelope, axis=0)
envelope_std = np.std(audio_envelope, axis=0)
envelope_max = np.max(audio_envelope, axis=0)

# Create our training data for a classifier


X = np.column_stack([envelope_mean, envelope_std, envelope_max])

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Preparing our features for scikit-learn
X = np.column_stack([envelope_mean, envelope_std, envelope_max])
y = labels.reshape(-1, 1)

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Cross validation for classification
cross_val_score automates the process of:
Splitting data into training / validation sets

Fitting the model on training data

Scoring it on validation data

Repeating this process

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Using cross_val_score
from sklearn.model_selection import cross_val_score

model = LinearSVC()
scores = cross_val_score(model, X, y, cv=3)
print(scores)

[0.60911642 0.59975305 0.61404035]

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Auditory features: The Tempogram
We can summarize more complex temporal information with timeseries-specific functions

librosa is a great library for auditory and timeseries feature engineering

Here we'll calculate the tempogram, which estimates the tempo of a sound over time

We can calculate summary statistics of tempo in the same way that we can for the
envelope

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Computing the tempogram
# Import librosa and calculate the tempo of a 1-D sound array
import librosa as lr
audio_tempo = lr.beat.tempo(y=audio, sr=sfreq,
                            hop_length=2**6, aggregate=None)

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Let's practice!
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
The spectrogram - spectral changes to sound over time
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Chris Holdgraf
Fellow, Berkeley Institute for Data
Science
Fourier transforms
Timeseries data can be described as a combination of quickly-changing things and slowly-
changing things

At each moment in time, we can describe the relative presence of fast- and slow-moving
components

The simplest way to do this is called a Fourier Transform

This converts a single timeseries into an array that describes the timeseries as a
combination of oscillations

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


A Fourier Transform (FFT)

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Spectrograms: combinations of windowed Fourier
transforms
A spectrogram is a collection of windowed Fourier transforms over time

Similar to how a rolling mean was calculated:


1. Choose a window size and shape

2. At a timepoint, calculate the FFT for that window

3. Slide the window over by one

4. Aggregate the results

Called a Short-Time Fourier Transform (STFT)
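The four steps above can be sketched with NumPy alone. This is a simplified illustration, not librosa's implementation: the helper name simple_stft is hypothetical, the window slides by hop_length samples rather than by one, and there is no padding or centering, so the number of frames differs from librosa's default output.

```python
import numpy as np

def simple_stft(signal, n_fft=128, hop_length=16):
    """Hypothetical helper: magnitude STFT built from windowed FFTs."""
    window = np.hanning(n_fft)                         # 1. choose window size and shape
    n_frames = 1 + (len(signal) - n_fft) // hop_length
    frames = []
    for i in range(n_frames):
        start = i * hop_length                         # 3. slide the window along
        segment = signal[start:start + n_fft] * window
        frames.append(np.abs(np.fft.rfft(segment)))    # 2. FFT of this window
    return np.array(frames).T                          # 4. aggregate: (n_freqs, n_frames)

sfreq = 1000                                           # assumed sampling rate (Hz)
t = np.arange(sfreq) / sfreq
tone = np.sin(2 * np.pi * 100 * t)                     # one second of a 100 Hz tone

spec = simple_stft(tone)
freqs = np.fft.rfftfreq(128, d=1 / sfreq)
peak_freq = freqs[spec.mean(axis=1).argmax()]          # frequency bin with the most energy
```

For the pure tone, the strongest frequency bin lands next to 100 Hz; the frequency resolution is limited to sfreq / n_fft per bin.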

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Calculating the STFT
We can calculate the STFT with librosa

There are several parameters we can tweak (such as window size)

For our purposes, we'll convert into decibels which normalizes the average values of all
frequencies

We can then visualize it with the specshow() function

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Calculating the STFT with code
# Import the functions we'll use for the STFT
from librosa.core import stft, amplitude_to_db
from librosa.display import specshow
import matplotlib.pyplot as plt

# Calculate our STFT
HOP_LENGTH = 2**4
SIZE_WINDOW = 2**7
audio_spec = stft(audio, hop_length=HOP_LENGTH, n_fft=SIZE_WINDOW)

# Convert the magnitude into decibels for visualization
# (the STFT is complex-valued, so take its absolute value first)
spec_db = amplitude_to_db(np.abs(audio_spec))

# Visualize
fig, ax = plt.subplots()
specshow(spec_db, sr=sfreq, x_axis='time',
         y_axis='hz', hop_length=HOP_LENGTH, ax=ax)

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Spectral feature engineering
Each timeseries has a different spectral pattern.

We can calculate these spectral patterns by analyzing the spectrogram.

For example, spectral bandwidth and spectral centroids describe where most of the energy
is at each moment in time
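To show what these two features measure, here is a NumPy sketch using a simplified definition: the centroid is the magnitude-weighted mean frequency of each frame, and the bandwidth is the weighted standard deviation around it. The helper name and the tiny hand-made spectrogram are assumptions for illustration; librosa's functions offer more options than this.

```python
import numpy as np

def centroid_and_bandwidth(spec, freqs):
    """Hypothetical helper. spec: magnitudes, shape (n_freqs, n_frames)."""
    weights = spec / spec.sum(axis=0, keepdims=True)      # normalize each frame to sum to 1
    centroid = (freqs[:, None] * weights).sum(axis=0)     # magnitude-weighted mean frequency
    spread = (freqs[:, None] - centroid[None, :]) ** 2
    bandwidth = np.sqrt((weights * spread).sum(axis=0))   # weighted std around the centroid
    return centroid, bandwidth

freqs = np.array([0.0, 100.0, 200.0, 300.0])
# Frame 0: all energy at 100 Hz. Frame 1: energy split evenly between 100 and 300 Hz.
spec = np.array([[0.0, 0.0],
                 [1.0, 1.0],
                 [0.0, 0.0],
                 [0.0, 1.0]])
centroids, bandwidths = centroid_and_bandwidth(spec, freqs)
```

Frame 0 has a centroid of 100 Hz with zero bandwidth, while frame 1 has a centroid of 200 Hz and a bandwidth of 100 Hz, reflecting energy spread on both sides of the centroid.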

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Calculating spectral features
# Calculate the spectral centroid and bandwidth for the spectrogram
bandwidths = lr.feature.spectral_bandwidth(S=spec)[0]
centroids = lr.feature.spectral_centroid(S=spec)[0]

# Display these features on top of the spectrogram
fig, ax = plt.subplots()
specshow(spec, x_axis='time', y_axis='hz', hop_length=HOP_LENGTH, ax=ax)
ax.plot(times_spec, centroids)
ax.fill_between(times_spec, centroids - bandwidths / 2,
                centroids + bandwidths / 2, alpha=0.5)

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Combining spectral and temporal features in a
classifier
centroids_all = []
bandwidths_all = []
for spec in spectrograms:
    bandwidths = lr.feature.spectral_bandwidth(S=lr.db_to_amplitude(spec))
    centroids = lr.feature.spectral_centroid(S=lr.db_to_amplitude(spec))
    # Calculate the mean spectral bandwidth
    bandwidths_all.append(np.mean(bandwidths))
    # Calculate the mean spectral centroid
    centroids_all.append(np.mean(centroids))

# Create our X matrix
X = np.column_stack([means, stds, maxs, tempo_mean,
                     tempo_max, tempo_std, bandwidths_all, centroids_all])

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Let's practice!
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Predicting data over time
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Chris Holdgraf
Fellow, Berkeley Institute for Data
Science
Classification vs. Regression
CLASSIFICATION

classification_model.predict(X_test)

array([0, 1, 1, 0])

REGRESSION

regression_model.predict(X_test)

array([0.2, 1.4, 3.6, 0.6])

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Correlation and regression
Regression is similar to calculating correlation, with some key differences
Regression: A process that results in a formal model of the data

Correlation: A statistic that describes the data. It carries less information than a regression model.

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Correlation between variables often changes over time
Timeseries often have patterns that change over time

Two timeseries that seem correlated at one moment may not remain so over time
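One way to see a changing relationship is a rolling correlation. The sketch below uses synthetic data built for the purpose: the second series follows the first for 100 points and then mirrors it. The series, the window of 50 samples, and the variable names are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = rng.standard_normal(200)

# Synthetic pair: y tracks x for the first 100 points, then mirrors it
sign = np.where(np.arange(200) < 100, 1.0, -1.0)
x = pd.Series(values)
y = pd.Series(sign * values)

overall_corr = x.corr(y)                 # one number for the whole record
rolling_corr = x.rolling(50).corr(y)     # correlation inside each 50-sample window
```

The single overall correlation hides the change, while the rolling version is close to +1 in the first half and close to -1 in the second.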

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Visualizing relationships between timeseries
fig, axs = plt.subplots(1, 2)

# Make a line plot for each timeseries


axs[0].plot(x, c='k', lw=3, alpha=.2)
axs[0].plot(y)
axs[0].set(xlabel='time', title='X values = time')

# Encode time as color in a scatterplot


axs[1].scatter(x_long, y_long, c=np.arange(len(x_long)), cmap='viridis')
axs[1].set(xlabel='x', ylabel='y', title='Color = time')

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Visualizing two timeseries

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Regression models with scikit-learn
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X, y)
model.predict(X)

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Visualize predictions with scikit-learn
from sklearn.linear_model import Ridge

# Assumes ax is an existing Matplotlib Axes and cmap is a colormap (e.g. plt.cm.viridis)
alphas = [.1, 1e2, 1e3]
ax.plot(y_test, color='k', alpha=.3, lw=3)
for ii, alpha in enumerate(alphas):
    y_predicted = Ridge(alpha=alpha).fit(X_train, y_train).predict(X_test)
    ax.plot(y_predicted, c=cmap(ii / len(alphas)))
ax.legend(['True values', 'Model 1', 'Model 2', 'Model 3'])
ax.set(xlabel="Time")

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Visualize predictions with scikit-learn

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Scoring regression models
Two most common methods:
Correlation ($r$)

Coefficient of Determination ($R^2$)

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Coefficient of Determination ($R^2$)
The value of $R^2$ is bounded above by 1, and can be infinitely low

Values closer to 1 mean the model does a better job of predicting outputs

$$R^2 = 1 - \frac{\mathrm{error}(\mathrm{model})}{\mathrm{variance}(\mathrm{test\ data})}$$
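The formula can be worked through by hand. The sketch below assumes the usual reading of the two terms (error as the sum of squared prediction errors, variance as the total sum of squares of the test data around its mean); the helper name and the four-point example are hypothetical.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Hypothetical helper: 1 - (sum of squared errors) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    error = np.sum((y_true - y_pred) ** 2)
    variance = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - error / variance

y_test = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r_squared(y_test, y_test)                          # no error at all
baseline = r_squared(y_test, np.full(4, y_test.mean()))      # always predict the mean
poor = r_squared(y_test, np.array([4.0, 3.0, 2.0, 1.0]))     # predictions reversed
```

Perfect predictions give 1, always predicting the mean gives 0, and the reversed predictions give 1 - 20/5 = -3, which shows how the score can fall below zero.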

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


$R^2$ in scikit-learn
from sklearn.metrics import r2_score
# r2_score expects the true values first, then the predictions
print(r2_score(y_test, y_predicted))

0.08

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Let's practice!
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Cleaning and improving your data
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Chris Holdgraf
Fellow, Berkeley Institute for Data
Science
Data is messy
Real-world data is often messy

The two most common problems are missing data and outliers

These often arise from human error, machine sensor malfunction, database failures,
etc.

Visualizing your raw data makes it easier to spot these problems

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


What messy data looks like

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Interpolation: using time to fill in missing data
A common way to deal with missing data is to interpolate missing values

With timeseries data, you can use time to assist in interpolation.

In this case, interpolation means using the known values on either side of a gap in the
data to make assumptions about what's missing.
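A small worked example makes the idea concrete. The prices and dates below are invented for illustration; the second half of the sketch shows the "time" method, which weights by the spacing of the timestamps rather than by row position.

```python
import numpy as np
import pandas as pd

# Hypothetical daily prices with a two-day gap
dates = pd.date_range("2017-01-01", periods=5, freq="D")
prices = pd.Series([10.0, np.nan, np.nan, 13.0, 14.0], index=dates)

missing = prices.isna()                                 # True where a value is absent
prices_interp = prices.interpolate(method="linear")     # straight line across the gap

# With unevenly spaced timestamps, method="time" uses the actual time gaps
uneven = pd.Series(
    [0.0, np.nan, 10.0],
    index=pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-11"]),
)
uneven_interp = uneven.interpolate(method="time")
```

In the first series the gap is filled with 11 and 12. In the second, the missing point sits one day into a ten-day gap, so the time-based fill is 1.0 rather than the midpoint 5.0.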

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Interpolation in Pandas
# Return a boolean that notes where missing values are
missing = prices.isna()

# Interpolate linearly within missing windows


prices_interp = prices.interpolate('linear')

# Plot the interpolated data in red and the data w/ missing values in black
ax = prices_interp.plot(c='r')
prices.plot(c='k', ax=ax, lw=2)

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Visualizing the interpolated data

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Using a rolling window to transform data
Another common use of rolling windows is to transform the data

We've already done this once, in order to smooth the data

However, we can also use this to do more complex transformations

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Transforming data to standardize variance
A common transformation to apply to data is to standardize its mean and variance over
time. There are many ways to do this.

Here, we'll show how to convert your dataset so that each point represents the % change
over a previous window.

This makes timepoints more comparable to one another if the absolute values of data
change a lot

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Transforming to percent change with Pandas
def percent_change(values):
    """Calculates the % change between the last value
    and the mean of previous values"""
    # Work on a plain array so positional indexing also works for Series input
    values = np.asarray(values)
    # Separate the last value and all previous values into variables
    previous_values = values[:-1]
    last_value = values[-1]

    # Calculate the % difference between the last value
    # and the mean of earlier values
    percent_change = (last_value - np.mean(previous_values)) \
        / np.mean(previous_values)
    return percent_change
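A short worked example of the same calculation, restated so the block runs on its own. The input numbers are invented, and the rolling call uses apply() with raw=True (so the function receives NumPy arrays) as one reasonable way to run it over windows.

```python
import numpy as np
import pandas as pd

def percent_change(values):
    """Fractional change of the last value relative to the mean of the earlier ones."""
    values = np.asarray(values, dtype=float)
    previous_mean = np.mean(values[:-1])
    return (values[-1] - previous_mean) / previous_mean

# Worked arithmetic: previous mean = (10 + 10 + 10) / 3 = 10, last value = 12
single_window = percent_change([10, 10, 10, 12])      # (12 - 10) / 10

# The same function applied over a rolling window of 4 samples
prices = pd.Series([10.0, 10.0, 10.0, 12.0, 12.0])
rolling_change = prices.rolling(4).apply(percent_change, raw=True)
```

Note that the result is a fraction (0.2 for a 20% rise); multiply by 100 if a percentage is wanted.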

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Applying this to our data
# Plot the raw data
fig, axs = plt.subplots(1, 2, figsize=(10, 5))
ax = prices.plot(ax=axs[0])

# Calculate % change and plot


ax = prices.rolling(window=20).aggregate(percent_change).plot(ax=axs[1])
ax.legend_.set_visible(False)

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Finding outliers in your data
Outliers are datapoints that are statistically significantly different from the rest of the dataset.

They can have negative effects on the predictive power of your model, biasing it away from
its "true" value

One solution is to remove or replace outliers with a more representative value

Be very careful about doing this: it is often difficult to tell a legitimately
extreme value from an aberration

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Plotting a threshold on our data
fig, axs = plt.subplots(1, 2, figsize=(10, 5))
for data, ax in zip([prices, prices_perc_change], axs):
    # Calculate the mean / standard deviation for the data
    this_mean = data.mean()
    this_std = data.std()

    # Plot the data, with a window that is 3 standard deviations
    # around the mean
    data.plot(ax=ax)
    ax.axhline(this_mean + this_std * 3, ls='--', c='r')
    ax.axhline(this_mean - this_std * 3, ls='--', c='r')

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Visualizing outlier thresholds

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Replacing outliers using the threshold
# Center the data so the mean is 0
prices_outlier_centered = prices_outlier_perc - prices_outlier_perc.mean()

# Calculate standard deviation
std = prices_outlier_perc.std()

# Use the absolute value of each datapoint
# to make it easier to find outliers
outliers = np.abs(prices_outlier_centered) > (std * 3)

# Replace outliers with the median value
# We'll use np.nanmedian since there may be nans around the outliers
prices_outlier_fixed = prices_outlier_centered.copy()
prices_outlier_fixed[outliers] = np.nanmedian(prices_outlier_fixed)

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Visualize the results
fig, axs = plt.subplots(1, 2, figsize=(10, 5))
prices_outlier_centered.plot(ax=axs[0])
prices_outlier_fixed.plot(ax=axs[1])

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Let's practice!
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Creating features over time
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Chris Holdgraf
Fellow, Berkeley Institute for Data
Science
Extracting features with windows

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Using .aggregate for feature extraction
# Visualize the raw data
print(prices.head(3))

symbol            AIG        ABT
date
2010-01-04  29.889999  54.459951
2010-01-05  29.330000  54.019953
2010-01-06  29.139999  54.319953

# Calculate a rolling window, then extract two features


feats = prices.rolling(20).aggregate([np.std, np.max]).dropna()
print(feats.head(3))

                 AIG                  ABT
                 std       amax       std       amax
date
2010-02-01  2.051966  29.889999  0.868830  56.239949
2010-02-02  2.101032  29.629999  0.869197  56.239949
2010-02-03  2.157249  29.629999  0.852509  56.239949

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Check the properties of your features!

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Using partial() in Python
# If we just take the mean, it returns a single value
a = np.array([[0, 1, 2], [0, 1, 2], [0, 1, 2]])
print(np.mean(a))

1.0

# We can use the partial function to initialize np.mean


# with an axis parameter
from functools import partial
mean_over_first_axis = partial(np.mean, axis=0)

print(mean_over_first_axis(a))

[0. 1. 2.]

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Percentiles summarize your data
Percentiles are a useful way to get more fine-grained summaries of your data (as opposed
to using np.mean )

For a given dataset, the Nth percentile is the value where N% of the data is below that
value, and (100-N)% of the data is above it.

print(np.percentile(np.linspace(0, 200), q=20))

40.0

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Combining np.percentile() with partial functions to
calculate a range of percentiles
data = np.linspace(0, 100)

# Create a list of functions using a list comprehension


percentile_funcs = [partial(np.percentile, q=ii) for ii in [20, 40, 60]]

# Calculate the output of each function in the same way


percentiles = [i_func(data) for i_func in percentile_funcs]
print(percentiles)

[20.0, 40.00000000000001, 60.0]

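# Calculate multiple percentiles of a rolling window
# (rolling needs a pandas object, and aggregate takes the functions, not their outputs)
pd.Series(data).rolling(20).aggregate(percentile_funcs)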

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Calculating "date-based" features
Thus far we've focused on calculating "statistical" features - these are features that
correspond to statistical properties of the data, like "mean", "standard deviation", etc.

However, don't forget that timeseries data often has more "human" features associated with
it, like days of the week, holidays, etc.

These features are often useful when dealing with timeseries data that spans multiple years
(such as stock value over time)

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


datetime features using Pandas
# Ensure our index is datetime
prices.index = pd.to_datetime(prices.index)

# Extract datetime features


day_of_week_num = prices.index.weekday
print(day_of_week_num[:10])

Index([0 1 2 3 4 0 1 2 3 4], dtype='object')

# weekday_name was removed in pandas 1.0; day_name() is its replacement
day_of_week = prices.index.day_name()
print(day_of_week[:10])

Index(['Monday' 'Tuesday' 'Wednesday' 'Thursday' 'Friday' 'Monday' 'Tuesday'
       'Wednesday' 'Thursday' 'Friday'], dtype='object')
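Several date-based features can be collected into columns in one step. The sketch below assumes a hypothetical ten-row frame on a business-day index; the column names are illustrative, not from the course data.

```python
import pandas as pd

# Hypothetical business-day index starting on Monday 2010-01-04
index = pd.date_range("2010-01-04", periods=10, freq="B")
prices = pd.DataFrame({"close": range(10)}, index=index)

# Attach date-based features as new columns
date_features = prices.assign(
    day_of_week=prices.index.weekday,          # Monday=0 ... Sunday=6
    day_name=prices.index.day_name(),
    month=prices.index.month,
    is_month_end=prices.index.is_month_end,
)
```

Columns such as day_of_week can then be stacked alongside the statistical features as inputs to a model.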

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Let's practice!
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Time-delayed features and auto-regressive models
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Chris Holdgraf
Fellow, Berkeley Institute for Data
Science
The past is useful
Timeseries data almost always have information that is shared between timepoints

Information in the past can help predict what happens in the future

Often the features best-suited to predict a timeseries are previous values of the same
timeseries.

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


A note on smoothness and auto-correlation
A common question to ask of a timeseries: how smooth is the data?

In other words, how correlated is a timepoint with its neighboring timepoints (called autocorrelation)?

The amount of auto-correlation in data will impact your models.
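A quick way to measure this is the lag-1 autocorrelation. The sketch below compares two synthetic signals made up for illustration: white noise (rough) and its cumulative sum, a random walk (smooth).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
noise = rng.standard_normal(1000)

rough = pd.Series(noise)               # white noise: neighbors are unrelated
smooth = pd.Series(noise.cumsum())     # random walk: each point stays near the previous one

# Correlation between each point and the point one step before it
rough_autocorr = rough.autocorr(lag=1)
smooth_autocorr = smooth.autocorr(lag=1)
```

The rough signal's autocorrelation sits near zero, while the smooth signal's is close to one.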

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Creating time-lagged features
Let's see how we could build a model that uses values in the past as input features.

We can use this to assess how auto-correlated our signal is (and lots of other stuff too)

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Time-shifting data with Pandas
print(df)

df
0 0.0
1 1.0
2 2.0
3 3.0
4 4.0

# Shift the values down by 3 rows, so each row holds the value from 3 steps in the past


print(df.shift(3))

df
0 NaN
1 NaN
2 NaN
3 0.0
4 1.0

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Creating a time-shifted DataFrame
# data is a pandas Series containing time series data
data = pd.Series(...)

# Shifts
shifts = [0, 1, 2, 3, 4, 5, 6, 7]

# Create a dictionary of time-shifted data


many_shifts = {'lag_{}'.format(ii): data.shift(ii) for ii in shifts}

# Convert them into a dataframe


many_shifts = pd.DataFrame(many_shifts)

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Fitting a model with time-shifted features
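from sklearn.linear_model import Ridge
# Fit the model using these input features
# (rows with NaNs introduced by shifting are dropped, since scikit-learn rejects them)
X = many_shifts.dropna()
model = Ridge()
model.fit(X, data.loc[X.index])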
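To see what the fitted coefficients of a time-lagged model can reveal, here is a self-contained sketch on simulated data. It makes several assumptions that differ from the slides: the signal is a simulated AR(1) process with a known coefficient of 0.8, only lags 1 to 3 are used (so the target itself is not among the inputs), and ordinary least squares via NumPy stands in for Ridge to avoid extra dependencies.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n, true_coef = 2000, 0.8

# Simulate an AR(1) process: each value is 0.8 times the previous one plus noise
values = np.zeros(n)
for t in range(1, n):
    values[t] = true_coef * values[t - 1] + rng.standard_normal()
data = pd.Series(values)

# Build lagged features and drop the leading rows that contain NaNs
lags = pd.DataFrame({f"lag_{ii}": data.shift(ii) for ii in [1, 2, 3]}).dropna()
target = data.loc[lags.index]

# Ordinary least squares in place of Ridge, to keep the sketch dependency-free
coefs, *_ = np.linalg.lstsq(lags.to_numpy(), target.to_numpy(), rcond=None)
```

The first coefficient lands near the true value of 0.8 and the others near zero, which is the pattern expected when only the immediately preceding value carries information.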

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Interpreting the auto-regressive model coefficients
# Visualize the fit model coefficients
fig, ax = plt.subplots()
ax.bar(many_shifts.columns, model.coef_)
ax.set(xlabel='Coefficient name', ylabel='Coefficient value')

# Set formatting so it looks nice


plt.setp(ax.get_xticklabels(), rotation=45, horizontalalignment='right')

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Visualizing coefficients for a rough signal

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Visualizing coefficients for a smooth signal

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Let's practice!
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Cross-validating timeseries data
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Chris Holdgraf
Fellow, Berkeley Institute for Data
Science
Cross validation with scikit-learn
# Iterating over the "split" method yields train/test indices
for tr, tt in cv.split(X, y):
    model.fit(X[tr], y[tr])
    model.score(X[tt], y[tt])

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Cross validation types: KFold
KFold cross-validation splits your data into multiple "folds" of equal size

It is one of the most common cross-validation routines

from sklearn.model_selection import KFold


cv = KFold(n_splits=5)
for tr, tt in cv.split(X, y):
    ...

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Visualizing model predictions
fig, axs = plt.subplots(2, 1)

# Plot the indices chosen for validation on each loop
axs[0].scatter(tt, [0] * len(tt), marker='_', s=2, lw=40)
axs[0].set(ylim=[-.1, .1], title='Test set indices (color=CV loop)',
           xlabel='Index of raw data')

# Plot the model predictions on each iteration
axs[1].plot(model.predict(X[tt]))
axs[1].set(title='Test set predictions on each CV loop',
           xlabel='Prediction index')

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Visualizing KFold CV behavior

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


A note on shuffling your data
Many CV iterators let you shuffle data as a part of the cross-validation process.

This only works if the data is i.i.d., which timeseries usually is not.

You should not shuffle your data when making predictions with timeseries.

from sklearn.model_selection import ShuffleSplit

cv = ShuffleSplit(n_splits=3)
for tr, tt in cv.split(X, y):
    ...

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Visualizing shuffled CV behavior

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Using the time series CV iterator
Thus far, we've broken the linear passage of time in the cross validation

However, you generally should not use datapoints in the future to predict data in the past

One approach: Always use training data from the past to predict the future
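The "train on the past, test on the future" idea can be sketched with a hand-rolled splitter. This is an illustration of the concept only: the helper name expanding_window_splits is hypothetical, and in practice scikit-learn's TimeSeriesSplit is the tool to use. The optional max_train_size argument shows how the training window can be capped to the most recent samples.

```python
import numpy as np

def expanding_window_splits(n_samples, n_splits, max_train_size=None):
    """Hypothetical helper: yield (train, test) index arrays in time order."""
    test_size = n_samples // (n_splits + 1)
    for ii in range(1, n_splits + 1):
        test_start = n_samples - (n_splits - ii + 1) * test_size
        train = np.arange(test_start)            # everything before the test block
        if max_train_size is not None:
            train = train[-max_train_size:]      # optionally keep only the most recent samples
        test = np.arange(test_start, test_start + test_size)
        yield train, test

splits = list(expanding_window_splits(n_samples=8, n_splits=3))
windowed = list(expanding_window_splits(n_samples=8, n_splits=3, max_train_size=2))
```

In every split the training indices end before the test indices begin, so the model never sees data from the future of its validation block.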

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Visualizing time series cross validation iterators
# Import and initialize the cross-validation iterator
from sklearn.model_selection import TimeSeriesSplit
cv = TimeSeriesSplit(n_splits=10)

fig, ax = plt.subplots(figsize=(10, 5))


for ii, (tr, tt) in enumerate(cv.split(X, y)):
    # Plot training and test indices
    l1 = ax.scatter(tr, [ii] * len(tr), c=[plt.cm.coolwarm(.1)],
                    marker='_', lw=6)
    l2 = ax.scatter(tt, [ii] * len(tt), c=[plt.cm.coolwarm(.9)],
                    marker='_', lw=6)
    ax.set(ylim=[10, -1], title='TimeSeriesSplit behavior',
           xlabel='data index', ylabel='CV iteration')
    ax.legend([l1, l2], ['Training', 'Validation'])

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Visualizing the TimeSeriesSplit cross validation iterator

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Custom scoring functions in scikit-learn
def myfunction(estimator, X, y):
    y_pred = estimator.predict(X)
    my_custom_score = my_custom_function(y_pred, y)
    return my_custom_score

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


A custom correlation function for scikit-learn
def my_pearsonr(est, X, y):
    # Generate predictions and convert to a vector
    y_pred = est.predict(X).squeeze()

    # Use the numpy "corrcoef" function to calculate a correlation matrix
    my_corrcoef_matrix = np.corrcoef(y_pred, y.squeeze())

    # Return a single correlation value from the matrix
    my_corrcoef = my_corrcoef_matrix[1, 0]
    return my_corrcoef

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Let's practice!
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Stationarity and stability
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Chris Holdgraf
Fellow, Berkeley Institute for Data
Science
Stationarity
Stationary time series do not change their statistical properties over time

E.g., mean, standard deviation, trends

Most time series are non-stationary to some extent

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Model stability
Non-stationary data results in variability in our model

The statistical properties the model finds may change with the data

In addition, we will be less certain about the correct values of model parameters

How can we quantify this?

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Cross validation to quantify parameter stability
One approach: use cross-validation

Calculate model parameters on each iteration

Assess parameter stability across all CV splits

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Bootstrapping the mean
Bootstrapping is a common way to assess variability

The bootstrap:
1. Take a random sample of data with replacement

2. Calculate the mean of the sample

3. Repeat this process many times (1000s)

4. Calculate the percentiles of the result (usually 2.5, 97.5)

The result is a 95% confidence interval of the mean of each coefficient.
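The four bootstrap steps can be sketched with NumPy for a single coefficient. The fold values below are simulated (50 draws around 0.5), purely as a stand-in for coefficients collected across cross-validation splits.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical values of one model coefficient across 50 CV folds
coef_per_fold = rng.normal(loc=0.5, scale=0.1, size=50)

n_boots = 1000
bootstrap_means = np.zeros(n_boots)
for ii in range(n_boots):
    # 1. Sample with replacement, 2. take the mean, 3. repeat many times
    sample = rng.choice(coef_per_fold, size=len(coef_per_fold), replace=True)
    bootstrap_means[ii] = sample.mean()

# 4. The 2.5th and 97.5th percentiles bound a 95% confidence interval
lower, upper = np.percentile(bootstrap_means, [2.5, 97.5])
```

A narrow interval suggests the coefficient is stable across folds; a wide one suggests its value depends heavily on which data the model saw.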

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Bootstrapping the mean
from sklearn.utils import resample

# cv_coefficients has shape (n_cv_folds, n_coefficients)


n_boots = 100
n_coefficients = cv_coefficients.shape[-1]
bootstrap_means = np.zeros([n_boots, n_coefficients])
for ii in range(n_boots):
    # Generate random indices for our data with replacement,
    # then take the sample mean
    random_sample = resample(cv_coefficients)
    bootstrap_means[ii] = random_sample.mean(axis=0)

# Compute the percentiles of choice for the bootstrapped means
percentiles = np.percentile(bootstrap_means, (2.5, 97.5), axis=0)

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Plotting the bootstrapped coefficients
fig, ax = plt.subplots()
ax.scatter(many_shifts.columns, percentiles[0], marker='_', s=200)
ax.scatter(many_shifts.columns, percentiles[1], marker='_', s=200)

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Assessing model performance stability
If you use TimeSeriesSplit, you can plot the model's score over time.

This is useful for finding regions of time that hurt the score

It is also useful for finding non-stationary signals

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Model performance over time
def my_corrcoef(est, X, y):
    """Return the correlation coefficient
    between model predictions and a validation set."""
    return np.corrcoef(y, est.predict(X))[1, 0]

# Grab the date of the first index of each validation set


first_indices = [data.index[tt[0]] for tr, tt in cv.split(X, y)]

# Calculate the CV scores and convert to a Pandas Series


cv_scores = cross_val_score(model, X, y, cv=cv, scoring=my_corrcoef)
cv_scores = pd.Series(cv_scores, index=first_indices)

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Visualizing model scores as a timeseries
fig, axs = plt.subplots(2, 1, figsize=(10, 5), sharex=True)

# Calculate a rolling mean of scores over time


cv_scores_mean = cv_scores.rolling(10, min_periods=1).mean()
cv_scores.plot(ax=axs[0])
axs[0].set(title='Validation scores (correlation)', ylim=[0, 1])

# Plot the raw data


data.plot(ax=axs[1])
axs[1].set(title='Validation data')

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Visualizing model scores

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Fixed windows with time series cross-validation
# Only keep the last 100 datapoints in the training data
window = 100

# Initialize the CV with this window size


cv = TimeSeriesSplit(n_splits=10, max_train_size=window)

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Non-stationary signals

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Let's practice!
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Wrapping-up
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Chris Holdgraf
Fellow, Berkeley Institute for Data
Science
Timeseries and machine learning
The many applications of time series + machine learning

Always visualize your data first

The scikit-learn API standardizes this process

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Feature extraction and classification
Summary statistics for time series classification

Combining multiple features into a single input matrix

Feature extraction for time series data

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Model fitting and improving data quality
Time series features for regression

Generating predictions over time

Cleaning and improving time series data

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Validating and assessing our model performance
Cross-validation with time series data (don't shuffle the data!)

Time series stationarity

Assessing model coefficient and score stability

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Advanced concepts in time series
Advanced window functions

Signal processing and filtering details

Spectral analysis

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Advanced machine learning
Advanced time series feature extraction (e.g., tsfresh )

More complex model architectures for regression and classification

Production-ready pipelines for time series analysis

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Ways to practice
There are a lot of opportunities to practice your skills with time series data.

Kaggle has a number of time series predictions challenges

Quantopian is also useful for learning and using predictive models others have built.

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON


Let's practice!
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
