Time Series With Python
Stefan Jansen
Founder & Lead Data Scientist at
Applied Artificial Intelligence
Date & time series functionality
At the root: data types for date & time information
Objects for points in time and periods
Timestamp('2017-01-01 00:00:00')
time_stamp.year
2017
time_stamp.day_name()
'Sunday'
Period('2017-01-31', 'D')
Convert pd.Period() to pd.Timestamp() and back:
period.to_timestamp().to_period('M')
Period('2017-01', 'M')
pd.Timestamp('2017-01-31', 'M') + 1
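The Timestamp and Period behavior above can be sketched end to end; a minimal example (the dates match the slide's outputs):

```python
import pandas as pd

# pd.Timestamp is a point in time; pd.Period is a span of time.
# Conversion between them is frequency-aware.
time_stamp = pd.Timestamp('2017-01-01')
print(time_stamp.year)                        # 2017
print(time_stamp.day_name())                  # Sunday

period = pd.Period('2017-01', 'M')            # a monthly period
print(period.to_timestamp().to_period('M'))   # round-trips to the same period
print(period + 1)                             # Period('2017-02', 'M')
```

Adding an integer to a Period moves it by whole units of its frequency, which is what makes Period arithmetic convenient for calendar logic.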
index
index.to_period()
RangeIndex: 12 entries, 0 to 11
Data columns (total 1 columns):
data 12 non-null datetime64[ns]
dtypes: datetime64[ns](1)
12 rows, 2 columns
data = np.random.random(size=(12, 2))
pd.DataFrame(data=data, index=index).info()
Time series transformation
Basic time series transformations include:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 504 entries, 0 to 503
Data columns (total 2 columns):
date 504 non-null object
price 504 non-null float64
dtypes: float64(1), object(1)
google.head()
date price
0 2015-01-02 524.81
1 2015-01-05 513.87
2 2015-01-06 501.96
3 2015-01-07 501.10
4 2015-01-08 502.68
Convert to datetime64
google.date = pd.to_datetime(google.date)
google.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 504 entries, 0 to 503
Data columns (total 2 columns):
date 504 non-null datetime64[ns]
price 504 non-null float64
dtypes: datetime64[ns](1), float64(1)
inplace=True : don't create a copy
google.set_index('date', inplace=True)
google.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 504 entries, 2015-01-02 to 2016-12-30
Data columns (total 1 columns):
price 504 non-null float64
dtypes: float64(1)
google.asfreq('D').head()
price
date
2015-01-02 524.81
2015-01-03 NaN
2015-01-04 NaN
2015-01-05 513.87
2015-01-06 501.96
price
date
2015-01-19 NaN
2015-02-16 NaN
...
2016-11-24 NaN
2016-12-26 NaN
Basic time series calculations
Typical Time Series manipulations include:
Shift or lag values forward or back in time
google.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 504 entries, 2015-01-02 to 2016-12-30
Data columns (total 1 columns):
price 504 non-null float64
dtypes: float64(1)
price
date
2015-01-02 524.81
2015-01-05 513.87
2015-01-06 501.96
2015-01-07 501.10
2015-01-08 502.68
price shifted
date
2015-01-02 524.81 NaN
2015-01-05 513.87 542.81
2015-01-06 501.96 513.87
google['lagged'] = google.price.shift(periods=-1)
google[['price', 'lagged', 'shifted']].tail(3)
xt − xt−1
google['diff'] = google.price.diff()
google[['price', 'diff']].head(3)
price diff
date
2015-01-02 524.81 NaN
2015-01-05 513.87 -10.94
2015-01-06 501.96 -11.91
google['return_3d'] = google.price.pct_change(periods=3).mul(100)
google[['price', 'return_3d']].head()
price return_3d
date
2015-01-02 524.81 NaN
2015-01-05 513.87 NaN
2015-01-06 501.96 NaN
2015-01-07 501.10 -4.517825
2015-01-08 502.68 -2.177594
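The relationship between the three calculations can be sketched on a toy price series (values made up for illustration):

```python
import pandas as pd

# Toy price series illustrating shift, diff, and pct_change
prices = pd.Series([100.0, 102.0, 101.0, 104.0],
                   index=pd.date_range('2017-01-02', periods=4, freq='B'))

shifted = prices.shift(1)            # previous period's value
diffed = prices.diff()               # x_t - x_{t-1}
pct = prices.pct_change().mul(100)   # one-period return in percent

# .diff() is the same as subtracting the shifted series
print(diffed.equals(prices.sub(shifted)))
```

The first element of each derived series is NaN because there is no earlier observation to compare against.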
Comparing stock performance
Stock price series: hard to compare at different levels
price
date
2010-01-04 313.06
2010-01-05 311.68
2010-01-06 303.83
prices.head(2)
AAPL 30.57
GOOG 313.06
YHOO 17.10
Name: 2010-01-04 00:00:00, dtype: float64
normalized = prices.div(prices.iloc[0])
normalized.head(3)
normalized = prices.div(prices.iloc[0]).mul(100)
normalized.plot()
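The normalization step can be sketched in full (the GOOG values come from the slide's table; the later AAPL values are made up):

```python
import pandas as pd

# Normalize each series to 100 at the first date so that price series
# at very different levels become comparable
prices = pd.DataFrame({'AAPL': [30.57, 30.63, 30.14],
                       'GOOG': [313.06, 311.68, 303.83]},
                      index=pd.date_range('2010-01-04', periods=3))

normalized = prices.div(prices.iloc[0]).mul(100)
print(normalized)
```

After this transformation, every series starts at 100, so a value of 110 later on means a 10% gain regardless of the original price level.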
Changing the frequency: resampling
DatetimeIndex : set & change freq using .asfreq()
pandas API:
.asfreq() , .reindex()
2016-03-31 1
2016-06-30 2
2016-09-30 3
2016-12-31 4
Freq: Q-DEC, dtype: int64 # Default: year-end quarters
2016-03-31 1.0
2016-04-30 NaN
2016-05-31 NaN
2016-06-30 2.0
2016-07-31 NaN
2016-08-31 NaN
2016-09-30 3.0
2016-10-31 NaN
2016-11-30 NaN
2016-12-31 4.0
Freq: M, dtype: float64
ffill : forward fill
new index
Frequency conversion & transformation methods
.resample() : similar to .groupby()
unrate.head()
UNRATE
DATE
2000-01-01 4.0
2000-02-01 4.1
2000-03-01 4.0
2000-04-01 3.8
2000-05-01 4.0
True
gdp.head(2)
gdp
DATE
2000-01-01 1.2
2000-04-01 7.8
gdp_ffill
DATE
2000-01-01 1.2
2000-02-01 1.2
2000-03-01 1.2
2000-04-01 7.8
gdp_inter
DATE
2000-01-01 1.200000
2000-02-01 3.400000
2000-03-01 5.600000
2000-04-01 7.800000
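The ffill-versus-interpolate comparison above can be reproduced on the slide's two quarterly values; a minimal sketch:

```python
import pandas as pd

# Quarterly series (values from the slide), upsampled to month-start frequency
gdp = pd.Series([1.2, 7.8],
                index=pd.to_datetime(['2000-01-01', '2000-04-01']))

monthly = gdp.resample('MS').asfreq()   # introduces NaN for Feb and Mar
ffilled = monthly.ffill()               # repeat the last known value
interpolated = monthly.interpolate()    # linear fill: 1.2, 3.4, 5.6, 7.8
print(interpolated)
```

Forward-filling treats the series as a step function, while linear interpolation assumes a steady change between observations; which is appropriate depends on how the underlying quantity actually evolves.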
df1 df2
0 1.0 NaN
1 2.0 NaN
2 3.0 NaN
0 NaN 4.0
1 NaN 5.0
2 NaN 6.0
df1 df2
0 1 4
1 2 5
2 3 6
Downsampling & aggregation methods
So far: upsampling, ll logic & interpolation
Now: downsampling
hour to day
ozone = ozone.resample('D').asfreq()
ozone.info()
Ozone Ozone
date date
2000-01-31 0.010443 2000-01-31 0.009486
2000-02-29 0.011817 2000-02-29 0.010726
2000-03-31 0.016810 2000-03-31 0.017004
2000-04-30 0.019413 2000-04-30 0.019866
2000-05-31 0.026535 2000-05-31 0.026018
.resample().mean() : monthly average, assigned to end of calendar month
Ozone
mean std
date
2000-01-31 0.010443 0.004755
2000-02-29 0.011817 0.004072
2000-03-31 0.016810 0.004977
2000-04-30 0.019413 0.006574
2000-05-31 0.026535 0.008409
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 207 entries, 2000-01-31 to 2017-03-31
Freq: BM
Data columns (total 2 columns):
ozone 207 non-null float64
pm25 207 non-null float64
dtypes: float64(2)
Ozone PM25
date
2000-01-31 0.005545 20.800000
2000-02-29 0.016139 6.500000
2000-03-31 0.017004 8.493333
2000-04-30 0.031354 6.889474
df.resample('MS').first().head()
Ozone PM25
date
2000-01-01 0.004032 37.320000
2000-02-01 0.010583 24.800000
2000-03-01 0.007418 11.106667
2000-04-01 0.017631 11.700000
2000-05-01 0.022628 9.700000
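The downsampling patterns above can be sketched on synthetic daily data (the seed, scale, and 90-day span are assumptions):

```python
import numpy as np
import pandas as pd

# Downsample daily readings to monthly summary statistics
rng = np.random.default_rng(0)
daily = pd.Series(rng.normal(0.015, 0.005, size=90),
                  index=pd.date_range('2000-01-01', periods=90, freq='D'))

monthly_stats = daily.resample('M').agg(['mean', 'std'])  # month-end labels
first_of_month = daily.resample('MS').first()             # month-start labels
print(monthly_stats)
```

`.agg(['mean', 'std'])` produces one column per statistic, matching the two-column table shown above, and the choice of 'M' versus 'MS' controls whether results are stamped at the end or the start of each month.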
Window functions in pandas
Windows identify sub-periods of your time series
Expanding windows in pandas
From rolling to expanding windows
RT = (1 + r1 )(1 + r2 )...(1 + rT ) − 1
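The cumulative return formula above maps directly onto a cumulative product; a minimal sketch with hypothetical period returns:

```python
import pandas as pd

# R_T = (1 + r_1)(1 + r_2)...(1 + r_T) - 1 via a running (expanding) product
returns = pd.Series([0.10, -0.05, 0.02])
cumulative = returns.add(1).cumprod().sub(1)
print(cumulative)
```

Each element of `cumulative` is the total return from the start through that period, which is exactly what an expanding window of the product computes.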
Random walks & simulations
Daily stock returns are hard to predict
Two examples:
Generate random returns
DATE
2007-05-29 -0.008357
2007-05-30 0.003702
2007-05-31 -0.013990
2007-06-01 0.008096
2007-06-04 0.013120
DATE
2007-05-25 1515.73
Name: SP500, dtype: float64
sp500_random = start.append(random_walk.add(1))  # in newer pandas: pd.concat([start, random_walk.add(1)])
sp500_random.head()
DATE
2007-05-25 1515.730000
2007-05-29 0.998290
2007-05-30 0.995190
2007-05-31 0.997787
2007-06-01 0.983853
dtype: float64
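The simulation idea can be sketched end to end: draw random returns and compound them onto a starting price (the seed and volatility here are assumptions):

```python
import numpy as np
import pandas as pd

# Simulate a price path: start level times the cumulative product of (1 + r_t)
rng = np.random.default_rng(42)
random_returns = pd.Series(rng.normal(0, 0.01, size=250))

start = 1515.73   # last observed S&P 500 level, as on the slide
random_walk = start * random_returns.add(1).cumprod()
print(random_walk.head())
```

Because each simulated price is the previous price scaled by a random factor, the resulting path is a multiplicative random walk, which is the standard toy model for daily stock prices.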
Correlation & relations between series
So far, focus on characteristics of individual variables
Market value-weighted index
Composite performance of various stocks
Calculate index
Index(['PG', 'TM', 'ABB', 'KO', 'WMT', 'XOM', 'JPM', 'JNJ', 'BABA', 'T',
       'ORCL', 'UPS'], dtype='object', name='Stock Symbol')
tickers.tolist()
['PG',
'TM',
'ABB',
'KO',
'WMT',
...
'T',
'ORCL',
'UPS']
Build your value-weighted index
Key inputs:
number of shares
Stock Symbol
PG 2,556.48 # Outstanding shares in million
TM 1,494.15
ABB 2,138.71
KO 4,292.01
WMT 3,033.01
XOM 4,146.51
JPM 3,557.86
JNJ 2,710.89
BABA 2,500.00
T 6,140.50
ORCL 4,114.68
UPS 869.30
dtype: float64
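Turning shares outstanding into index weights can be sketched as follows (the share counts are from the slide, in millions; the prices are made up for illustration):

```python
import pandas as pd

# Value weights: market cap = shares x price, normalized to sum to 1
shares = pd.Series({'PG': 2556.48, 'TM': 1494.15, 'UPS': 869.30})
prices = pd.Series({'PG': 90.0, 'TM': 120.0, 'UPS': 105.0})

market_cap = shares.mul(prices)
weights = market_cap.div(market_cap.sum())
print(weights)
```

Multiplying these weights by each component's return and summing gives the index return, which is how the contribution analysis later in this section works.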
Evaluate your value-weighted index
Index return:
Total index return
Contribution by component
Performance vs Benchmark
Total period return
315,037.71
TM -6,365.07
KO -4,034.49
ABB 7,592.41
ORCL 11,109.65
PG 14,597.48
UPS 17,212.08
WMT 23,232.85
BABA 27,800.00
JNJ 39,931.44
T 50,229.33
XOM 53,075.38
JPM 80,656.65
Name: 2016-12-30 00:00:00, dtype: float64
Stock Symbol
ABB 1.85
UPS 3.45
TM 5.96
ORCL 6.93
KO 7.03
WMT 8.50
PG 8.81
T 9.47
BABA 10.55
JPM 11.50
XOM 12.97
JNJ 12.97
Name: Market Capitalization, dtype: float64
14.06
weighted_returns = weights.mul(index_return)
weighted_returns.sort_values().plot(kind='barh')
Some additional analysis of your index
Daily return correlations:
Single worksheet
Multiple worksheets
Congratulations!
Manipulating Time Series Data in Python
Introduction to the Course
Time Series Analysis in Python
Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
Example of Time Series: Google Trends
df.index = pd.to_datetime(df.index)
df.plot()
Slicing data
df['2012']
df1.join(df2)
df = df.resample(rule='W').last()
df['col'].pct_change()
df['col'].diff()
df['ABC'].corr(df['XYZ'])
pandas autocorrelation
df['ABC'].autocorr()
Correlation of Two Time Series
Plot of S&P500 and JPMorgan stock
correlation = df['SPX_Ret'].corr(df['R2000_Ret'])
print("Correlation is: ", correlation)
What is a Regression?
Simple linear regression:
yt = α + βxt + ϵt
In numpy:
np.polyfit(x, y, deg=1)
In pandas:
pd.ols(y, x)  # removed from modern pandas; use np.polyfit or statsmodels instead
In scipy:
from scipy import stats
stats.linregress(x, y)
Intercept in results.params[0]
Slope in results.params[1]
Slope is positive
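The regression APIs listed above can be checked on exact data; with y = 2x + 1, np.polyfit should recover the known slope and intercept (numpy alone suffices for this sketch):

```python
import numpy as np

# Fit y_t = alpha + beta * x_t on noiseless data so the estimates are known
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0

beta, alpha = np.polyfit(x, y, deg=1)  # coefficients, highest degree first
print(beta, alpha)
```

Note that np.polyfit returns coefficients from highest degree down, so for `deg=1` the slope comes first and the intercept second, the reverse of `results.params` in statsmodels.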
What is Autocorrelation?
Correlation of a time series with a lagged copy of itself
Lag-one autocorrelation
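The lag-one autocorrelation can be computed directly with pandas; a minimal sketch with toy series whose autocorrelations are known in advance:

```python
import pandas as pd

# Lag-one autocorrelation: correlation of a series with itself shifted by one
trend = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])         # perfectly linear
alternating = pd.Series([1.0, -1.0, 1.0, -1.0, 1.0])  # sign-flipping

print(trend.autocorr())        # 1.0
print(alternating.autocorr())  # -1.0
```

A steadily trending series has autocorrelation near +1, while a series that flips sign every step has autocorrelation near -1; white noise sits near 0.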
Autocorrelation Function
Autocorrelation Function (ACF): The autocorrelation as a
function of the lag
Example: alpha=0.05
5% chance that if true autocorrelation is zero, it will fall
outside blue band
Fewer observations
What is White Noise?
White Noise is a series with:
Constant mean
Constant variance
What is a Random Walk?
Today's Price = Yesterday's Price + Noise
Pt = Pt−1 + ϵt
Plot of simulated data
Pt = Pt−1 + ϵt
Change in price is white noise
Pt − Pt−1 = ϵt
Can't forecast a random walk
Pt = Pt−1 + ϵt
Random walk with drift:
Pt = μ + Pt−1 + ϵt
Change in price is white noise with non-zero mean:
Pt − Pt−1 = μ + ϵt
Pt = μ + Pt−1 + ϵt
Regression test for random walk
Pt = α + β Pt−1 + ϵt
Test:
H0 : β = 1 (random walk)
H1 : β < 1 (not random walk)
Pt = α + β Pt−1 + ϵt
Equivalent to
Pt − Pt−1 = α + β Pt−1 + ϵt
Test:
H0 : β = 0 (random walk)
H1 : β < 0 (not random walk)
Pt − Pt−1 = α + β Pt−1 + ϵt
Test:
H0 : β = 0 (random walk)
H1 : β < 0 (not random walk)
This test is called the Dickey-Fuller test
If you add more lagged changes on the right hand side, it's
the Augmented Dickey-Fuller test
results = adfuller(x)
# Print p-value
print(results[1])
0.782253808587
(-0.91720490331127869,
0.78225380858668414,
0,
1257,
{'1%': -3.4355629707955395,
'10%': -2.567995644141416,
'5%': -2.8638420633876671},
10161.888789598503)
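The regression behind the test can be illustrated with plain numpy; statsmodels' adfuller is the proper tool, and this toy version only shows why the sign of beta distinguishes a random walk from a stationary series:

```python
import numpy as np

# Regress the changes P_t - P_{t-1} on the lagged level P_{t-1}
# and inspect the sign of beta (sketch only, not a real unit-root test)
rng = np.random.default_rng(0)
eps = rng.normal(size=1000)

random_walk = np.cumsum(eps)   # unit root: beta should be near 0
stationary = eps               # no unit root: beta clearly negative

def df_beta(p):
    """Slope from regressing first differences on the lagged level."""
    dp = np.diff(p)
    beta, _ = np.polyfit(p[:-1], dp, deg=1)
    return beta

print(df_beta(random_walk), df_beta(stationary))
```

For white noise the changes actively pull the series back toward its mean, so beta is strongly negative; for a random walk the level carries no information about the next change, so beta hovers near zero, and only the Dickey-Fuller distribution tells you whether it is significantly below it.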
What is Stationarity?
Strong stationarity: entire distribution of data is time-invariant
Mathematical Description of AR(1) Model
Rt = μ + ϕ Rt−1 + ϵt
Since only one lagged value on right hand side, this is called:
AR model of order 1, or
AR(1) model
AR parameter is ϕ
For stationarity, −1 < ϕ < 1
ϕ = 0.5 ϕ = −0.5
AR(1): Rt = μ + ϕ1 Rt−1 + ϵt
AR(2): Rt = μ + ϕ1 Rt−1 + ϕ2 Rt−2 + ϵt
AR(3): Rt = μ + ϕ1 Rt−1 + ϕ2 Rt−2 + ϕ3 Rt−3 + ϵt
Estimating an AR Model
To estimate parameters from data (simulated)
For ARIMA, order=(p,d,q)
array([-0.03605989, 0.90535667])
Identifying the Order of an AR Model
The order of an AR(p) model will usually be unknown
Information criteria
Import module
from statsmodels.graphics.tsaplots import plot_pacf
Mathematical Description of MA(1) Model
Rt = μ + ϵt + θ ϵt−1
Since only one lagged error on right hand side, this is called:
MA model of order 1, or
MA(1) model
MA parameter is θ
Stationary for all values of θ
θ = 0.5 θ = −0.5
MA(1): Rt = μ + ϵt + θ1 ϵt−1
MA(2): Rt = μ + ϵt + θ1 ϵt−1 + θ2 ϵt−2
MA(3): Rt = μ + ϵt + θ1 ϵt−1 + θ2 ϵt−2 + θ3 ϵt−3
Estimating an MA Model
Same as estimating an AR model (except order=(0,0,1) )
ARMA Model
ARMA(1,1) model:
Rt = μ + ϕ Rt−1 + ϵt + θ ϵt−1
Rt = μ + ϕRt−1 + ϵt
⋮
Rt = μ/(1−ϕ) + ϵt + ϕ ϵt−1 + ϕ² ϵt−2 + ϕ³ ϵt−3 + ...
What is Cointegration?
Two series, Pt and Qt can be random walks
But the linear combination Pt − c Qt may not be a random
walk!
If that's true
Pt − c Qt is forecastable
Pt and Qt are said to be cointegrated
...
Apple and Blackberry? No! Leash broke and dog ran away
Analyzing Temperature Data
Temperature data:
New York City from 1870-2016
Plot data
Advanced Topics
GARCH Models
Nonlinear Models
...
Thomas Vincent
Head of Data Science, Getty Images
Prerequisites
Intro to Python for Data Science
datestamp co2
0 1958-03-29 316.1
1 1958-04-05 317.3
2 1958-04-12 317.6
...
...
...
2281 2001-12-15 371.2
2282 2001-12-22 371.3
2283 2001-12-29 371.5
datestamp co2
0 1958-03-29 316.1
1 1958-04-05 317.3
2 1958-04-12 317.6
3 1958-04-19 317.5
4 1958-04-26 316.4
print(df.tail(n=5))
datestamp co2
2279 2001-12-01 370.3
2280 2001-12-08 370.8
2281 2001-12-15 371.2
2282 2001-12-22 371.3
2283 2001-12-29 371.5
datestamp object
co2 float64
dtype: object
pd.to_datetime(['2009/07/31', 'test'], errors='coerce')
DatetimeIndex(['2009-07-31', 'NaT'],
dtype='datetime64[ns]', freq=None)
The Matplotlib library
In Python, matplotlib is an extensive package used to plot
data
df = df.set_index('date_column')
df.plot()
plt.show()
['seaborn-dark-palette', 'seaborn-darkgrid',
'seaborn-dark', 'seaborn-notebook',
'seaborn-pastel', 'seaborn-white',
'classic', 'ggplot', 'grayscale',
'dark_background', 'seaborn-poster',
'seaborn-muted', 'seaborn', 'bmh',
'seaborn-paper', 'seaborn-whitegrid',
'seaborn-bright', 'seaborn-talk',
'fivethirtyeight', 'seaborn-colorblind',
'seaborn-deep', 'seaborn-ticks']
ax.set_xlabel('Date')
ax.set_ylabel('The values of my Y axis')
ax.set_title('The title of my plot')
plt.show()
Slicing time series data
discoveries['1960':'1970']
discoveries['1950-01':'1950-12']
discoveries['1960-01-01':'1960-01-15']
ax = df_subset.plot(color='blue', fontsize=14)
plt.show()
ax.axhline(y=100,
color='green',
linestyle='--')
ax.axhspan(6, 8, color='green',
           alpha=0.2)
The CO2 level time series
A snippet of the weekly measurements of CO2 levels at the
Mauna Loa Observatory, Hawaii.
datestamp co2
1958-03-29 316.1
1958-04-05 317.3
1958-04-12 317.6
...
...
2001-12-15 371.2
2001-12-22 371.3
2001-12-29 371.5
datestamp co2
1958-03-29 False
1958-04-05 False
1958-04-12 False
print(df.notnull())
datestamp co2
1958-03-29 True
1958-04-05 True
1958-04-12 True
...
datestamp 0
co2 59
dtype: int64
...
5 1958-05-03 316.9
6 1958-05-10 NaN
7 1958-05-17 317.5
...
df = df.fillna(method='bfill')
print(df)
...
5 1958-05-03 316.9
6 1958-05-10 317.5
7 1958-05-17 317.5
...
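The count-then-backfill workflow above can be sketched on a tiny slice of the CO2 data:

```python
import numpy as np
import pandas as pd

# Count missing values, then back-fill with the next valid observation
co2 = pd.Series([316.9, np.nan, 317.5],
                index=pd.to_datetime(['1958-05-03', '1958-05-10', '1958-05-17']))

print(co2.isnull().sum())   # 1
filled = co2.bfill()        # equivalent to fillna(method='bfill')
print(filled)
```

Back-filling copies the next valid value backward, so the missing 1958-05-10 reading takes on 317.5, matching the output shown above.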
Moving averages
In the field of time series analysis, a moving average can be
used for many different purposes:
smoothing out short-term fluctuations
removing outliers
ax = co2_levels_mean.plot()
ax.set_xlabel("Date")
ax.set_ylabel("The values of my Y axis")
ax.set_title("52 weeks rolling mean of my time series")
plt.show()
DatetimeIndex(['1958-03-29', '1958-04-05',...],
dtype='datetime64[ns]', name='datestamp',
length=2284, freq=None)
print(co2_levels.index.month)
print(co2_levels.index.year)
plt.show()
Obtaining numerical summaries of your data
What is the average value of this data?
print(df.describe())
co2
count 2284.000000
mean 339.657750
std 17.100899
min 313.000000
25% 323.975000
50% 337.700000
75% 354.500000
max 373.900000
Autocorrelation in time series data
Autocorrelation is measured as the correlation between a
time series and a delayed copy of itself
Properties of time series
Noise: are there any outlier points or missing values that are
not consistent with the rest of the data?
rcParams['figure.figsize'] = 11, 9
decomposition = sm.tsa.seasonal_decompose(
co2_levels['co2'])
fig = decomposition.plot()
plt.show()
print(decomposition.seasonal)
datestamp
1958-03-29 1.028042
1958-04-05 1.235242
1958-04-12 1.412344
1958-04-19 1.701186
ax = decomp_seasonal.plot(figsize=(14, 2))
ax.set_xlabel('Date')
ax.set_ylabel('Seasonality of time series')
ax.set_title('Seasonal values of the time series')
plt.show()
ax = decomp_trend.plot(figsize=(14, 2))
ax.set_xlabel('Date')
ax.set_ylabel('Trend of time series')
ax.set_title('Trend values of the time series')
plt.show()
ax = decomp_resid.plot(figsize=(14, 2))
ax.set_xlabel('Date')
ax.set_ylabel('Residual of time series')
ax.set_title('Residual values of the time series')
plt.show()
So far ...
Visualize aggregates of time series data
Working with multiple time series
An isolated time series
date ts1
1949-01 112
1949-02 118
1949-03 132
other_chicken turkey
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
Clarity is key
In this plot, the default matplotlib color scheme assigns the
same color to the beef and turkey time series.
Correlations between two variables
In the field of Statistics, the correlation coefficient is a
measure used to determine the strength or lack of
relationship between two variables:
Pearson's coefficient can be used to compute the
correlation coefficient between variables for which the
relationship is thought to be linear
SpearmanrResult(correlation=0.9843, pvalue=0.01569)
spearmanr(x, y)
SpearmanrResult(correlation=1.0, pvalue=0.0)
kendalltau(x, y)
KendalltauResult(correlation=1.0, pvalue=0.0415)
0: no relationship
x y z
x 1.00 -0.46 0.49
y -0.46 1.00 -0.61
z 0.49 -0.61 1.00
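The difference between the correlation methods can be sketched on a monotonic but nonlinear relationship, where Spearman scores exactly 1 while Pearson does not:

```python
import pandas as pd

# Pearson measures linear association; Spearman uses ranks, so any
# monotonic relationship scores exactly 1
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3   # monotonic but nonlinear

print(x.corr(y, method='pearson'))
print(x.corr(y, method='spearman'))  # 1.0
```

This is why Spearman (and Kendall's tau) are preferred when the relationship is believed to be monotonic but not necessarily linear.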
The Jobs dataset
Facet plots of the jobs dataset
jobs.plot(subplots=True,
layout=(4, 4),
figsize=(20, 16),
sharex=True,
sharey=False)
plt.show()
index_month = jobs.index.month
jobs_by_month = jobs.groupby(index_month).mean()
print(jobs_by_month)
ax.legend(bbox_to_anchor=(1.0, 0.5),
loc='center left')
Python dictionaries
# Initialize a Python dictionary
my_dict = {}
{'your_key': 'your_value',
'your_second_key': 'your_second_value'}
Trends in Jobs data
print(trend_df)
plt.setp(fig.ax_heatmap.yaxis.get_majorticklabels(),
rotation=0)
plt.setp(fig.ax_heatmap.xaxis.get_majorticklabels(),
rotation=90)
Going further with time series
Data from Zillow Research
Kaggle competitions
Reddit Data
James Fulton
Climate informatics researcher
Motivation
Time series are everywhere
Science
Technology
Business
Finance
Policy
date values
2019-03-11 5.734193
2019-03-12 6.288708
2019-03-13 5.205788
2019-03-14 3.176578
Mean is constant
Variance is constant
Autocorrelation is constant
Overview
Statistical tests for stationarity
Making a dataset stationary
results = adfuller(df['close'])
(-1.34, 0.60, 23, 1235, {'1%': -3.435, '5%': -2.863, '10%': -2.568}, 10782.87)
1 https://www.statsmodels.org/dev/generated/statsmodels.tsa.stattools.adfuller.html
city_population
date
1969-09-30 NaN
1970-03-31 -0.116156
1970-09-30 0.050850
1971-03-31 -0.153261
1971-09-30 0.108389
city_population
date
1970-03-31 -0.116156
1970-09-30 0.050850
1971-03-31 -0.153261
1971-09-30 0.108389
1972-03-31 -0.029569
AR models
Autoregressive (AR) model
AR(1) model :
yt = a1 yt−1 + ϵt
AR(1) model :
yt = a1 yt−1 + ϵt
AR(2) model :
yt = a1 yt−1 + a2 yt−2 + ϵt
AR(p) model :
yt = a1 yt−1 + a2 yt−2 + ... + ap yt−p + ϵt
MA(1) model :
yt = m1 ϵt−1 + ϵt
MA(2) model :
yt = m1 ϵt−1 + m2 ϵt−2 + ϵt
MA(q) model :
yt = m1 ϵt−1 + m2 ϵt−2 + ... + mq ϵt−q + ϵt
ARMA = AR + MA
ARMA(1,1) model :
yt = a1 yt−1 + m1 ϵt−1 + ϵt
ARMA(p, q)
p is order of AR part
q is order of MA part
Creating a model
from statsmodels.tsa.arima.model import ARIMA
print(results.summary())
ARMAX(1,1) model :
yt = x1 zt + a1 yt−1 + m1 ϵt−1 + ϵt
Predicting the next value
Take an AR(1) model
yt = a1 yt−1 + ϵt
yt = 0.6 × 10 + ϵt
yt = 6.0 + ϵt
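The arithmetic above is just the conditional expectation of the AR(1) model; a one-line sketch:

```python
# One-step-ahead AR(1) forecast: with y_{t-1} = 10 and a1 = 0.6, the best
# guess sets the future shock epsilon_t to its mean of zero
a1 = 0.6
y_prev = 10.0
forecast = a1 * y_prev
print(forecast)  # 6.0
```

The shock term contributes nothing to the point forecast, but its variance is exactly what produces the widening uncertainty bands discussed next.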
Uncertainty on prediction
2013-10-28 1.519368
2013-10-29 1.351082
2013-10-30 1.218016
lower y upper y
2013-09-28 -4.720471 -0.815384
2013-09-29 -5.069875 0.112505
2013-09-30 -5.232837 0.766300
2013-10-01 -5.305814 1.282935
2013-10-02 -5.326956 1.703974
# Plot prediction
plt.plot(dates,
mean_forecast.values,
color='red',
label='forecast')
# Shade uncertainty area
plt.fill_between(dates, lower_limits, upper_limits, color='pink')
plt.show()
# forecast mean
mean_forecast = forecast.predicted_mean
Non-stationary time series recap
Yes!
ARIMA(p, 0, q) = ARMA(p, q)
adf = adfuller(df.diff().dropna().iloc[:,0])
print('ADF Statistic:', adf[0])
print('p-value:', adf[1])
Motivation
...
AR(2) model →
MA(2) model →
# Create figure
fig, (ax1, ax2) = plt.subplots(2,1, figsize=(8,8))
# Make ACF plot
plot_acf(df, lags=10, zero=False, ax=ax1)
# Make PACF plot
plot_pacf(df, lags=10, zero=False, ax=ax2)
plt.show()
AIC - Akaike information criterion
Lower AIC indicates a better model
AIC likes to choose simple models with lower order
AIC: 2806.36
BIC: 2821.09
p q AIC     BIC
0 0 2900.13 2905.04
0 1 2828.70 2838.52
0 2 2806.69 2821.42
1 0 2810.25 2820.06
1 1 2806.37 2821.09
1 2 2807.52 2827.15
...
# Fit model
model = ARIMA(df, order=(p,0,q))
results = model.fit()
Introduction to model diagnostics
How good is the final model?
2013-01-23 1.013129
2013-01-24 0.114055
2013-01-25 0.430698
2013-01-26 -1.247046
2013-01-27 -0.499565
... ...
mae = np.mean(np.abs(residuals))
...
===================================================================================
Ljung-Box (Q): 32.10 Jarque-Bera (JB): 0.02
Prob(Q): 0.81 Prob(JB): 0.99
Heteroskedasticity (H): 1.28 Skew: -0.02
Prob(H) (two-sided): 0.21 Kurtosis: 2.98
===================================================================================
The Box-Jenkins method
From raw data → production model
identification
estimation
model diagnostics
Plot ACF/PACF
plot_acf() , plot_pacf()
results.summary()
Seasonal data
Has predictable and repeated patterns
Repeats after a fixed, known period of time
# Decompose data
decomp_results = seasonal_decompose(df['IPG3113N'], period=12)
type(decomp_results)
statsmodels.tsa.seasonal.DecomposeResult
# Plot ACF
plot_acf(df.dropna(), ax=ax, lags=25, zero=False)
plt.show()
The SARIMA model
Seasonal ARIMA = SARIMA : SARIMA(p,d,q)(P,D,Q)S
SARIMA(0,0,0)(2,0,1)7 model:
yt = a7 yt−7 + a14 yt−14 + m7 ϵt−7 + ϵt
Δyt = yt − yt−S
Time series
plt.show()
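The seasonal difference Δyt = yt − yt−S can be sketched on a synthetic series whose pattern repeats every 12 steps (a pure sine, used here only as a stand-in for seasonality):

```python
import numpy as np
import pandas as pd

# Seasonal differencing with period S removes a pattern that repeats
# every S steps: a sine with period 12 differences away to ~0
t = np.arange(48)
seasonal = pd.Series(np.sin(2 * np.pi * t / 12))

S = 12
diffed = seasonal.diff(S)   # y_t - y_{t-S}
print(diffed.dropna().abs().max())
```

The first S values are NaN because no observation exists S steps earlier, and the rest are (numerically) zero, which is why seasonal differencing is the D step in a SARIMA model.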
Searching over model orders
import pmdarima as pm
results = pm.auto_arima(df)
...
1 https://www.alkaline-ml.com/pmdarima/modules/generated/pmdarima.arima.auto_arima.html
# Select a filepath
filepath ='localpath/great_model.pkl'
Box-Jenkins
Other transforms
d + D should be 0-2
The SARIMAX model
Make forecasts
Tackle real world data! Either your own or examples from statsmodels
1 https://www.statsmodels.org/stable/datasets/index.html
Chris Holdgraf
Fellow, Berkeley Institute for Data
Science
Time Series Modeling
Always begin by looking at your data
array.shape
(10, 5)
array[:3]
# Using matplotlib
fig, ax = plt.subplots()
ax.plot(...)
# Using pandas
fig, ax = plt.subplots()
df.plot(..., ax=ax)
(samples, features)
array.T.shape
(10, 3)
array.shape
(10,)
array.reshape(-1, 1).shape
(10, 1)
Getting to know our data
The datasets that we'll use in this course are all freely-available online
There are many datasets available to download on the web, the ones we'll use come from
Kaggle
print(files)
['data/heartbeat-sounds/proc/files/murmur__201101051104.wav',
...
'data/heartbeat-sounds/proc/files/murmur__201101051114.wav']
print(sfreq)
2205
In this case, the sampling frequency is 2205, meaning there are 2205 samples per second
Note: this assumes the sampling rate is fixed and no data points are lost
Can we detect any patterns in historical records that allow us to predict the value of
companies in the future?
data.columns
data.head()
df['date'].dtypes
0 object
1 object
2 object
dtype: object
df['date'] = pd.to_datetime(df['date'])
df['date']
0 2017-01-01
1 2017-01-02
2 2017-01-03
Name: date, dtype: datetime64[ns]
Always visualize raw data before fitting models
(20, 7000)
print(means.shape)
# (n_files,)
(20,)
If we have a label for each sample, we can use scikit-learn to create and fit a classifier
The auditory envelope
Smooth the data to calculate the auditory envelope
Related to the total amount of audio energy present at each moment of time
(5000, 20)
audio_rectified = audio.apply(np.abs)
audio_envelope = audio_rectified.rolling(50).mean()
model = LinearSVC()
scores = cross_val_score(model, X, y, cv=3)
print(scores)
Here we'll calculate the tempogram, which estimates the tempo of a sound over time
We can calculate summary statistics of tempo in the same way that we can for the
envelope
Fourier transforms
Timeseries data can be described as a combination of quickly-changing things and slowly-changing things
At each moment in time, we can describe the relative presence of fast- and slow-moving
components
This converts a single timeseries into an array that describes the timeseries as a
combination of oscillations
For our purposes, we'll convert into decibels which normalizes the average values of all
frequencies
# Visualize
fig, ax = plt.subplots()
specshow(spec_db, sr=sfreq, x_axis='time',
         y_axis='hz', hop_length=HOP_LENGTH, ax=ax)
For example, spectral bandwidth and spectral centroids describe where most of the energy
is at each moment in time
Classification vs. Regression
CLASSIFICATION REGRESSION
classification_model.predict(X_test) regression_model.predict(X_test)
Correlation: A statistic that describes the data. Less information than regression model.
Two timeseries that seem correlated at one moment may not remain so over time
0.08
Data is messy
Real-world data is often messy
The two most common problems are missing data and outliers
This often happens because of human error, machine sensor malfunction, database failures,
etc
In this case, interpolation means using the known values on either side of a gap in the
data to make assumptions about what's missing.
# Plot the interpolated data in red and the data w/ missing values in black
ax = prices_interp.plot(c='r')
prices.plot(c='k', ax=ax, lw=2)
Here, we'll show how to convert your dataset so that each point represents the % change
over a previous window.
This makes timepoints more comparable to one another if the absolute values of data
change a lot
They can have negative effects on the predictive power of your model, biasing it away from
its "true" value
Extracting features with windows
AIG ABT
std amax std amax
date
2010-02-01 2.051966 29.889999 0.868830 56.239949
2010-02-02 2.101032 29.629999 0.869197 56.239949
2010-02-03 2.157249 29.629999 0.852509 56.239949
1.0
print(mean_over_first_axis(a))
[0. 1. 2.]
For a given dataset, the Nth percentile is the value where N% of the data is below that
datapoint, and 100-N% of the data is above that datapoint.
40.0
However, don't forget that timeseries data often has more "human" features associated with
it, like days of the week, holidays, etc.
These features are often useful when dealing with timeseries data that spans multiple years
(such as stock value over time)
day_of_week = prices.index.day_name()  # .weekday_name in older pandas
print(day_of_week[:10])
The past is useful
Timeseries data almost always have information that is shared between timepoints
Information in the past can help predict what happens in the future
Often the features best-suited to predict a timeseries are previous values of the same
timeseries.
AKA, how correlated is a timepoint with its neighboring timepoints (called autocorrelation).
We can use this to assess how auto-correlated our signal is (and lots of other stuff too)
df
0 0.0
1 1.0
2 2.0
3 3.0
4 4.0
df
0 NaN
1 NaN
2 NaN
3 0.0
4 1.0
# Shifts
shifts = [0, 1, 2, 3, 4, 5, 6, 7]
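The `shifts` list above feeds a pattern like the following: one lagged copy of the series per shift, collected into a feature matrix (toy data, shorter shift list):

```python
import pandas as pd

# Build lagged copies of a series as model features, one column per shift
ts = pd.Series(range(10), dtype=float)

shifts = [0, 1, 2, 3]
lagged = pd.DataFrame({f'lag_{n}': ts.shift(n) for n in shifts})
print(lagged.head())
```

Rows whose lags reach before the start of the series contain NaN and are typically dropped before fitting a model.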
Cross validation with scikit-learn
# Iterating over the "split" method yields train/test indices
for tr, tt in cv.split(X, y):
model.fit(X[tr], y[tr])
model.score(X[tt], y[tt])
This only works if the data is i.i.d., which timeseries usually is not.
You should not shuffle your data when making predictions with timeseries.
cv = ShuffleSplit(n_splits=3)
for tr, tt in cv.split(X, y):
...
However, you generally should not use datapoints in the future to predict data in the past
One approach: Always use training data from the past to predict the future
Stationarity
Stationary time series do not change their statistical properties over time
The statistical properties the model finds may change with the data
In addition, we will be less certain about the correct values of model parameters
The bootstrap:
1. Take a random sample of data with replacement
The result is a 95% confidence interval of the mean of each coefficient.
This is useful in finding certain regions of time that hurt the score
Timeseries and machine learning
The many applications of time series + machine learning
Spectral analysis
Quantopian is also useful for learning and using predictive models others have built.