Time Series With Python
Stefan Jansen
Founder & Lead Data Scientist at
Applied Artificial Intelligence
Date & time series functionality
At the root: data types for date & time information
Objects for points in time and periods
Timestamp('2017-01-01 00:00:00')
time_stamp.year
2017
time_stamp.day_name()
'Sunday'
Period('2017-01-31', 'D')
Convert pd.Period() to pd.Timestamp() and back:
period.to_timestamp().to_period('M')
Period('2017-01', 'M')
pd.Timestamp('2017-01-31', 'M') + 1
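The Timestamp and Period behavior above can be sketched end to end; a minimal example (the dates match the slide's outputs):

```python
import pandas as pd

# pd.Timestamp is a point in time; pd.Period is a span of time.
# Conversion between them is frequency-aware.
time_stamp = pd.Timestamp('2017-01-01')
print(time_stamp.year)                        # 2017
print(time_stamp.day_name())                  # Sunday

period = pd.Period('2017-01', 'M')            # a monthly period
print(period.to_timestamp().to_period('M'))   # round-trips to the same period
print(period + 1)                             # Period('2017-02', 'M')
```

Adding an integer to a Period moves it by whole units of its frequency, which is what makes Period arithmetic convenient for calendar logic.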
index
index.to_period()
RangeIndex: 12 entries, 0 to 11
Data columns (total 1 columns):
data 12 non-null datetime64[ns]
dtypes: datetime64[ns](1)
12 rows, 2 columns
data = np.random.random(size=(12, 2))
pd.DataFrame(data=data, index=index).info()
Time series transformation
Basic time series transformations include:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 504 entries, 0 to 503
Data columns (total 2 columns):
date 504 non-null object
price 504 non-null float64
dtypes: float64(1), object(1)
google.head()
date price
0 2015-01-02 524.81
1 2015-01-05 513.87
2 2015-01-06 501.96
3 2015-01-07 501.10
4 2015-01-08 502.68
Convert to datetime64
google.date = pd.to_datetime(google.date)
google.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 504 entries, 0 to 503
Data columns (total 2 columns):
date 504 non-null datetime64[ns]
price 504 non-null float64
dtypes: datetime64[ns](1), float64(1)
inplace=True : don't create a copy
google.set_index('date', inplace=True)
google.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 504 entries, 2015-01-02 to 2016-12-30
Data columns (total 1 columns):
price 504 non-null float64
dtypes: float64(1)
google.asfreq('D').head()
price
date
2015-01-02 524.81
2015-01-03 NaN
2015-01-04 NaN
2015-01-05 513.87
2015-01-06 501.96
price
date
2015-01-19 NaN
2015-02-16 NaN
...
2016-11-24 NaN
2016-12-26 NaN
Basic time series calculations
Typical Time Series manipulations include:
Shift or lag values forward or back in time
google.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 504 entries, 2015-01-02 to 2016-12-30
Data columns (total 1 columns):
price 504 non-null float64
dtypes: float64(1)
price
date
2015-01-02 524.81
2015-01-05 513.87
2015-01-06 501.96
2015-01-07 501.10
2015-01-08 502.68
price shifted
date
2015-01-02 524.81 NaN
2015-01-05 513.87 542.81
2015-01-06 501.96 513.87
google['lagged'] = google.price.shift(periods=-1)
google[['price', 'lagged', 'shifted']].tail(3)
xt − xt−1
google['diff'] = google.price.diff()
google[['price', 'diff']].head(3)
price diff
date
2015-01-02 524.81 NaN
2015-01-05 513.87 -10.94
2015-01-06 501.96 -11.91
google['return_3d'] = google.price.pct_change(periods=3).mul(100)
google[['price', 'return_3d']].head()
price return_3d
date
2015-01-02 524.81 NaN
2015-01-05 513.87 NaN
2015-01-06 501.96 NaN
2015-01-07 501.10 -4.517825
2015-01-08 502.68 -2.177594
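The relationship between the three calculations can be sketched on a toy price series (values made up for illustration):

```python
import pandas as pd

# Toy price series illustrating shift, diff, and pct_change
prices = pd.Series([100.0, 102.0, 101.0, 104.0],
                   index=pd.date_range('2017-01-02', periods=4, freq='B'))

shifted = prices.shift(1)            # previous period's value
diffed = prices.diff()               # x_t - x_{t-1}
pct = prices.pct_change().mul(100)   # one-period return in percent

# .diff() is the same as subtracting the shifted series
print(diffed.equals(prices.sub(shifted)))
```

The first element of each derived series is NaN because there is no earlier observation to compare against.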
Comparing stock performance
Stock price series: hard to compare at different levels
price
date
2010-01-04 313.06
2010-01-05 311.68
2010-01-06 303.83
prices.head(2)
AAPL 30.57
GOOG 313.06
YHOO 17.10
Name: 2010-01-04 00:00:00, dtype: float64
normalized = prices.div(prices.iloc[0])
normalized.head(3)
normalized = prices.div(prices.iloc[0]).mul(100)
normalized.plot()
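The normalization step can be sketched in full (the GOOG values come from the slide's table; the later AAPL values are made up):

```python
import pandas as pd

# Normalize each series to 100 at the first date so that price series
# at very different levels become comparable
prices = pd.DataFrame({'AAPL': [30.57, 30.63, 30.14],
                       'GOOG': [313.06, 311.68, 303.83]},
                      index=pd.date_range('2010-01-04', periods=3))

normalized = prices.div(prices.iloc[0]).mul(100)
print(normalized)
```

After this transformation, every series starts at 100, so a value of 110 later on means a 10% gain regardless of the original price level.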
Changing the frequency: resampling
DatetimeIndex : set & change freq using .asfreq()
pandas API:
.asfreq() , .reindex()
2016-03-31 1
2016-06-30 2
2016-09-30 3
2016-12-31 4
Freq: Q-DEC, dtype: int64 # Default: year-end quarters
2016-03-31 1.0
2016-04-30 NaN
2016-05-31 NaN
2016-06-30 2.0
2016-07-31 NaN
2016-08-31 NaN
2016-09-30 3.0
2016-10-31 NaN
2016-11-30 NaN
2016-12-31 4.0
Freq: M, dtype: float64
ffill : forward fill
new index
Frequency conversion & transformation methods
.resample() : similar to .groupby()
unrate.head()
UNRATE
DATE
2000-01-01 4.0
2000-02-01 4.1
2000-03-01 4.0
2000-04-01 3.8
2000-05-01 4.0
True
gdp.head(2)
gdp
DATE
2000-01-01 1.2
2000-04-01 7.8
gdp_ffill
DATE
2000-01-01 1.2
2000-02-01 1.2
2000-03-01 1.2
2000-04-01 7.8
gdp_inter
DATE
2000-01-01 1.200000
2000-02-01 3.400000
2000-03-01 5.600000
2000-04-01 7.800000
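The ffill-versus-interpolate comparison above can be reproduced on the slide's two quarterly values; a minimal sketch:

```python
import pandas as pd

# Quarterly series (values from the slide), upsampled to month-start frequency
gdp = pd.Series([1.2, 7.8],
                index=pd.to_datetime(['2000-01-01', '2000-04-01']))

monthly = gdp.resample('MS').asfreq()   # introduces NaN for Feb and Mar
ffilled = monthly.ffill()               # repeat the last known value
interpolated = monthly.interpolate()    # linear fill: 1.2, 3.4, 5.6, 7.8
print(interpolated)
```

Forward-filling treats the series as a step function, while linear interpolation assumes a steady change between observations; which is appropriate depends on how the underlying quantity actually evolves.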
df1 df2
0 1.0 NaN
1 2.0 NaN
2 3.0 NaN
0 NaN 4.0
1 NaN 5.0
2 NaN 6.0
df1 df2
0 1 4
1 2 5
2 3 6
Downsampling & aggregation methods
So far: upsampling, ll logic & interpolation
Now: downsampling
hour to day
ozone = ozone.resample('D').asfreq()
ozone.info()
Ozone Ozone
date date
2000-01-31 0.010443 2000-01-31 0.009486
2000-02-29 0.011817 2000-02-29 0.010726
2000-03-31 0.016810 2000-03-31 0.017004
2000-04-30 0.019413 2000-04-30 0.019866
2000-05-31 0.026535 2000-05-31 0.026018
.resample().mean() : monthly average, assigned to end of calendar month
Ozone
mean std
date
2000-01-31 0.010443 0.004755
2000-02-29 0.011817 0.004072
2000-03-31 0.016810 0.004977
2000-04-30 0.019413 0.006574
2000-05-31 0.026535 0.008409
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 207 entries, 2000-01-31 to 2017-03-31
Freq: BM
Data columns (total 2 columns):
ozone 207 non-null float64
pm25 207 non-null float64
dtypes: float64(2)
Ozone PM25
date
2000-01-31 0.005545 20.800000
2000-02-29 0.016139 6.500000
2000-03-31 0.017004 8.493333
2000-04-30 0.031354 6.889474
df.resample('MS').first().head()
Ozone PM25
date
2000-01-01 0.004032 37.320000
2000-02-01 0.010583 24.800000
2000-03-01 0.007418 11.106667
2000-04-01 0.017631 11.700000
2000-05-01 0.022628 9.700000
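The downsampling patterns above can be sketched on synthetic daily data (the seed, scale, and 90-day span are assumptions):

```python
import numpy as np
import pandas as pd

# Downsample daily readings to monthly summary statistics
rng = np.random.default_rng(0)
daily = pd.Series(rng.normal(0.015, 0.005, size=90),
                  index=pd.date_range('2000-01-01', periods=90, freq='D'))

monthly_stats = daily.resample('M').agg(['mean', 'std'])  # month-end labels
first_of_month = daily.resample('MS').first()             # month-start labels
print(monthly_stats)
```

`.agg(['mean', 'std'])` produces one column per statistic, matching the two-column table shown above, and the choice of 'M' versus 'MS' controls whether results are stamped at the end or the start of each month.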
Window functions in pandas
Windows identify sub-periods of your time series
Expanding windows in pandas
From rolling to expanding windows
RT = (1 + r1 )(1 + r2 )...(1 + rT ) − 1
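The cumulative return formula above maps directly onto a cumulative product; a minimal sketch with hypothetical period returns:

```python
import pandas as pd

# R_T = (1 + r_1)(1 + r_2)...(1 + r_T) - 1 via a running (expanding) product
returns = pd.Series([0.10, -0.05, 0.02])
cumulative = returns.add(1).cumprod().sub(1)
print(cumulative)
```

Each element of `cumulative` is the total return from the start through that period, which is exactly what an expanding window of the product computes.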
Random walks & simulations
Daily stock returns are hard to predict
Two examples:
Generate random returns
DATE
2007-05-29 -0.008357
2007-05-30 0.003702
2007-05-31 -0.013990
2007-06-01 0.008096
2007-06-04 0.013120
DATE
2007-05-25 1515.73
Name: SP500, dtype: float64
sp500_random = start.append(random_walk.add(1))  # in newer pandas: pd.concat([start, random_walk.add(1)])
sp500_random.head()
DATE
2007-05-25 1515.730000
2007-05-29 0.998290
2007-05-30 0.995190
2007-05-31 0.997787
2007-06-01 0.983853
dtype: float64
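The simulation idea can be sketched end to end: draw random returns and compound them onto a starting price (the seed and volatility here are assumptions):

```python
import numpy as np
import pandas as pd

# Simulate a price path: start level times the cumulative product of (1 + r_t)
rng = np.random.default_rng(42)
random_returns = pd.Series(rng.normal(0, 0.01, size=250))

start = 1515.73   # last observed S&P 500 level, as on the slide
random_walk = start * random_returns.add(1).cumprod()
print(random_walk.head())
```

Because each simulated price is the previous price scaled by a random factor, the resulting path is a multiplicative random walk, which is the standard toy model for daily stock prices.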
Correlation & relations between series
So far, focus on characteristics of individual variables
Market value-weighted index
Composite performance of various stocks
Calculate index
Index(['PG', 'TM', 'ABB', 'KO', 'WMT', 'XOM', 'JPM', 'JNJ', 'BABA', 'T',
       'ORCL', 'UPS'], dtype='object', name='Stock Symbol')
tickers.tolist()
['PG',
'TM',
'ABB',
'KO',
'WMT',
...
'T',
'ORCL',
'UPS']
Build your value-weighted index
Key inputs:
number of shares
Stock Symbol
PG 2,556.48 # Outstanding shares in million
TM 1,494.15
ABB 2,138.71
KO 4,292.01
WMT 3,033.01
XOM 4,146.51
JPM 3,557.86
JNJ 2,710.89
BABA 2,500.00
T 6,140.50
ORCL 4,114.68
UPS 869.30
dtype: float64
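Turning shares outstanding into index weights can be sketched as follows (the share counts are from the slide, in millions; the prices are made up for illustration):

```python
import pandas as pd

# Value weights: market cap = shares x price, normalized to sum to 1
shares = pd.Series({'PG': 2556.48, 'TM': 1494.15, 'UPS': 869.30})
prices = pd.Series({'PG': 90.0, 'TM': 120.0, 'UPS': 105.0})

market_cap = shares.mul(prices)
weights = market_cap.div(market_cap.sum())
print(weights)
```

Multiplying these weights by each component's return and summing gives the index return, which is how the contribution analysis later in this section works.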
Evaluate your value-weighted index
Index return:
Total index return
Contribution by component
Performance vs Benchmark
Total period return
315,037.71
TM -6,365.07
KO -4,034.49
ABB 7,592.41
ORCL 11,109.65
PG 14,597.48
UPS 17,212.08
WMT 23,232.85
BABA 27,800.00
JNJ 39,931.44
T 50,229.33
XOM 53,075.38
JPM 80,656.65
Name: 2016-12-30 00:00:00, dtype: float64
Stock Symbol
ABB 1.85
UPS 3.45
TM 5.96
ORCL 6.93
KO 7.03
WMT 8.50
PG 8.81
T 9.47
BABA 10.55
JPM 11.50
XOM 12.97
JNJ 12.97
Name: Market Capitalization, dtype: float64
14.06
weighted_returns = weights.mul(index_return)
weighted_returns.sort_values().plot(kind='barh')
Some additional analysis of your index
Daily return correlations:
Single worksheet
Multiple worksheets
Congratulations!
Manipulating Time Series Data in Python
Introduction to the Course
Time Series Analysis in Python
Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
Example of Time Series: Google Trends
df.index = pd.to_datetime(df.index)
df.plot()
Slicing data
df['2012']
df1.join(df2)
df = df.resample(rule='W').last()
df['col'].pct_change()
df['col'].diff()
df['ABC'].corr(df['XYZ'])
pandas autocorrelation
df['ABC'].autocorr()
Correlation of Two Time Series
Plot of S&P500 and JPMorgan stock
correlation = df['SPX_Ret'].corr(df['R2000_Ret'])
print("Correlation is: ", correlation)
What is a Regression?
Simple linear regression:
yt = α + βxt + ϵt
In numpy:
np.polyfit(x, y, deg=1)
In pandas:
pd.ols(y, x)  # removed from modern pandas; use np.polyfit or statsmodels instead
In scipy:
from scipy import stats
stats.linregress(x, y)
Intercept in results.params[0]
Slope in results.params[1]
Slope is positive
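The regression APIs listed above can be checked on exact data; with y = 2x + 1, np.polyfit should recover the known slope and intercept (numpy alone suffices for this sketch):

```python
import numpy as np

# Fit y_t = alpha + beta * x_t on noiseless data so the estimates are known
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0

beta, alpha = np.polyfit(x, y, deg=1)  # coefficients, highest degree first
print(beta, alpha)
```

Note that np.polyfit returns coefficients from highest degree down, so for `deg=1` the slope comes first and the intercept second, the reverse of `results.params` in statsmodels.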
What is Autocorrelation?
Correlation of a time series with a lagged copy of itself
Lag-one autocorrelation
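The lag-one autocorrelation can be computed directly with pandas; a minimal sketch with toy series whose autocorrelations are known in advance:

```python
import pandas as pd

# Lag-one autocorrelation: correlation of a series with itself shifted by one
trend = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])         # perfectly linear
alternating = pd.Series([1.0, -1.0, 1.0, -1.0, 1.0])  # sign-flipping

print(trend.autocorr())        # 1.0
print(alternating.autocorr())  # -1.0
```

A steadily trending series has autocorrelation near +1, while a series that flips sign every step has autocorrelation near -1; white noise sits near 0.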
Autocorrelation Function
Autocorrelation Function (ACF): The autocorrelation as a
function of the lag
Example: alpha=0.05
5% chance that if true autocorrelation is zero, it will fall
outside blue band
Fewer observations
What is White Noise?
White Noise is a series with:
Constant mean
Constant variance
What is a Random Walk?
Today's Price = Yesterday's Price + Noise
Pt = Pt−1 + ϵt
Plot of simulated data
Pt = Pt−1 + ϵt
Change in price is white noise
Pt − Pt−1 = ϵt
Can't forecast a random walk
Pt = Pt−1 + ϵt
Random walk with drift:
Pt = μ + Pt−1 + ϵt
Change in price is white noise with non-zero mean:
Pt − Pt−1 = μ + ϵt
Pt = μ + Pt−1 + ϵt
Regression test for random walk
Pt = α + β Pt−1 + ϵt
Test:
H0 : β = 1 (random walk)
H1 : β < 1 (not random walk)
Pt = α + β Pt−1 + ϵt
Equivalent to
Pt − Pt−1 = α + β Pt−1 + ϵt
Test:
H0 : β = 0 (random walk)
H1 : β < 0 (not random walk)
Pt − Pt−1 = α + β Pt−1 + ϵt
Test:
H0 : β = 0 (random walk)
H1 : β < 0 (not random walk)
This test is called the Dickey-Fuller test
If you add more lagged changes on the right hand side, it's
the Augmented Dickey-Fuller test
results = adfuller(x)
# Print p-value
print(results[1])
0.782253808587
(-0.91720490331127869,
0.78225380858668414,
0,
1257,
{'1%': -3.4355629707955395,
'10%': -2.567995644141416,
'5%': -2.8638420633876671},
10161.888789598503)
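The regression behind the test can be illustrated with plain numpy; statsmodels' adfuller is the proper tool, and this toy version only shows why the sign of beta distinguishes a random walk from a stationary series:

```python
import numpy as np

# Regress the changes P_t - P_{t-1} on the lagged level P_{t-1}
# and inspect the sign of beta (sketch only, not a real unit-root test)
rng = np.random.default_rng(0)
eps = rng.normal(size=1000)

random_walk = np.cumsum(eps)   # unit root: beta should be near 0
stationary = eps               # no unit root: beta clearly negative

def df_beta(p):
    """Slope from regressing first differences on the lagged level."""
    dp = np.diff(p)
    beta, _ = np.polyfit(p[:-1], dp, deg=1)
    return beta

print(df_beta(random_walk), df_beta(stationary))
```

For white noise the changes actively pull the series back toward its mean, so beta is strongly negative; for a random walk the level carries no information about the next change, so beta hovers near zero, and only the Dickey-Fuller distribution tells you whether it is significantly below it.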
What is Stationarity?
Strong stationarity: entire distribution of data is time-invariant
Mathematical Description of AR(1) Model
Rt = μ + ϕ Rt−1 + ϵt
Since only one lagged value on right hand side, this is called:
AR model of order 1, or
AR(1) model
AR parameter is ϕ
For stationarity, −1 < ϕ < 1
ϕ = 0.5 ϕ = −0.5
AR(1): Rt = μ + ϕ1 Rt−1 + ϵt
AR(2): Rt = μ + ϕ1 Rt−1 + ϕ2 Rt−2 + ϵt
AR(3): Rt = μ + ϕ1 Rt−1 + ϕ2 Rt−2 + ϕ3 Rt−3 + ϵt
Estimating an AR Model
To estimate parameters from data (simulated)
For ARIMA, order=(p,d,q)
array([-0.03605989, 0.90535667])
Identifying the Order of an AR Model
The order of an AR(p) model will usually be unknown
Information criteria
Import module
from statsmodels.graphics.tsaplots import plot_pacf
Mathematical Description of MA(1) Model
Rt = μ + ϵt + θ ϵt−1
Since only one lagged error on right hand side, this is called:
MA model of order 1, or
MA(1) model
MA parameter is θ
Stationary for all values of θ
θ = 0.5 θ = −0.5
MA(1): Rt = μ + ϵt + θ1 ϵt−1
MA(2): Rt = μ + ϵt + θ1 ϵt−1 + θ2 ϵt−2
MA(3): Rt = μ + ϵt + θ1 ϵt−1 + θ2 ϵt−2 + θ3 ϵt−3
Estimating an MA Model
Same as estimating an AR model (except order=(0,0,1) )
ARMA Model
ARMA(1,1) model:
Rt = μ + ϕ Rt−1 + ϵt + θ ϵt−1
Rt = μ + ϕRt−1 + ϵt
⋮
Rt = μ/(1−ϕ) + ϵt + ϕ ϵt−1 + ϕ² ϵt−2 + ϕ³ ϵt−3 + ...
What is Cointegration?
Two series, Pt and Qt can be random walks
But the linear combination Pt − c Qt may not be a random
walk!
If that's true
Pt − c Qt is forecastable
Pt and Qt are said to be cointegrated
...
Apple and Blackberry? No! Leash broke and dog ran away
Analyzing Temperature Data
Temperature data:
New York City from 1870-2016
Plot data
Advanced Topics
GARCH Models
Nonlinear Models
...
Thomas Vincent
Head of Data Science, Getty Images
Prerequisites
Intro to Python for Data Science
datestamp co2
0 1958-03-29 316.1
1 1958-04-05 317.3
2 1958-04-12 317.6
...
...
...
2281 2001-12-15 371.2
2282 2001-12-22 371.3
2283 2001-12-29 371.5
datestamp co2
0 1958-03-29 316.1
1 1958-04-05 317.3
2 1958-04-12 317.6
3 1958-04-19 317.5
4 1958-04-26 316.4
print(df.tail(n=5))
datestamp co2
2279 2001-12-01 370.3
2280 2001-12-08 370.8
2281 2001-12-15 371.2
2282 2001-12-22 371.3
2283 2001-12-29 371.5
datestamp object
co2 float64
dtype: object
pd.to_datetime(['2009/07/31', 'test'], errors='coerce')
DatetimeIndex(['2009-07-31', 'NaT'],
dtype='datetime64[ns]', freq=None)
The Matplotlib library
In Python, matplotlib is an extensive package used to plot
data
df = df.set_index('date_column')
df.plot()
plt.show()
['seaborn-dark-palette', 'seaborn-darkgrid',
'seaborn-dark', 'seaborn-notebook',
'seaborn-pastel', 'seaborn-white',
'classic', 'ggplot', 'grayscale',
'dark_background', 'seaborn-poster',
'seaborn-muted', 'seaborn', 'bmh',
'seaborn-paper', 'seaborn-whitegrid',
'seaborn-bright', 'seaborn-talk',
'fivethirtyeight', 'seaborn-colorblind',
'seaborn-deep', 'seaborn-ticks']
ax.set_xlabel('Date')
ax.set_ylabel('The values of my Y axis')
ax.set_title('The title of my plot')
plt.show()
Slicing time series data
discoveries['1960':'1970']
discoveries['1950-01':'1950-12']
discoveries['1960-01-01':'1960-01-15']
ax = df_subset.plot(color='blue', fontsize=14)
plt.show()
ax.axhline(y=100,
color='green',
linestyle='--')
ax.axhspan(6, 8, color='green',
           alpha=0.2)
The CO2 level time series
A snippet of the weekly measurements of CO2 levels at the
Mauna Loa Observatory, Hawaii.
datestamp co2
1958-03-29 316.1
1958-04-05 317.3
1958-04-12 317.6
...
...
2001-12-15 371.2
2001-12-22 371.3
2001-12-29 371.5
datestamp co2
1958-03-29 False
1958-04-05 False
1958-04-12 False
print(df.notnull())
datestamp co2
1958-03-29 True
1958-04-05 True
1958-04-12 True
...
datestamp 0
co2 59
dtype: int64
...
5 1958-05-03 316.9
6 1958-05-10 NaN
7 1958-05-17 317.5
...
df = df.fillna(method='bfill')
print(df)
...
5 1958-05-03 316.9
6 1958-05-10 317.5
7 1958-05-17 317.5
...
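The count-then-backfill workflow above can be sketched on a tiny slice of the CO2 data:

```python
import numpy as np
import pandas as pd

# Count missing values, then back-fill with the next valid observation
co2 = pd.Series([316.9, np.nan, 317.5],
                index=pd.to_datetime(['1958-05-03', '1958-05-10', '1958-05-17']))

print(co2.isnull().sum())   # 1
filled = co2.bfill()        # equivalent to fillna(method='bfill')
print(filled)
```

Back-filling copies the next valid value backward, so the missing 1958-05-10 reading takes on 317.5, matching the output shown above.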
Moving averages
In the field of time series analysis, a moving average can be
used for many different purposes:
smoothing out short-term fluctuations
removing outliers
ax = co2_levels_mean.plot()
ax.set_xlabel("Date")
ax.set_ylabel("The values of my Y axis")
ax.set_title("52 weeks rolling mean of my time series")
plt.show()
DatetimeIndex(['1958-03-29', '1958-04-05',...],
dtype='datetime64[ns]', name='datestamp',
length=2284, freq=None)
print(co2_levels.index.month)
print(co2_levels.index.year)
plt.show()
Obtaining numerical summaries of your data
What is the average value of this data?
print(df.describe())
co2
count 2284.000000
mean 339.657750
std 17.100899
min 313.000000
25% 323.975000
50% 337.700000
75% 354.500000
max 373.900000
Autocorrelation in time series data
Autocorrelation is measured as the correlation between a
time series and a delayed copy of itself
Properties of time series
Noise: are there any outlier points or missing values that are
not consistent with the rest of the data?
rcParams['figure.figsize'] = 11, 9
decomposition = sm.tsa.seasonal_decompose(
co2_levels['co2'])
fig = decomposition.plot()
plt.show()
print(decomposition.seasonal)
datestamp
1958-03-29 1.028042
1958-04-05 1.235242
1958-04-12 1.412344
1958-04-19 1.701186
ax = decomp_seasonal.plot(figsize=(14, 2))
ax.set_xlabel('Date')
ax.set_ylabel('Seasonality of time series')
ax.set_title('Seasonal values of the time series')
plt.show()
ax = decomp_trend.plot(figsize=(14, 2))
ax.set_xlabel('Date')
ax.set_ylabel('Trend of time series')
ax.set_title('Trend values of the time series')
plt.show()
ax = decomp_resid.plot(figsize=(14, 2))
ax.set_xlabel('Date')
ax.set_ylabel('Residual of time series')
ax.set_title('Residual values of the time series')
plt.show()
So far ...
Visualize aggregates of time series data
Working with multiple time series
An isolated time series
date ts1
1949-01 112
1949-02 118
1949-03 132
other_chicken turkey
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
Clarity is key
In this plot, the default matplotlib color scheme assigns the
same color to the beef and turkey time series.
Correlations between two variables
In the field of Statistics, the correlation coefficient is a
measure used to determine the strength or lack of
relationship between two variables:
Pearson's coefficient can be used to compute the
correlation coefficient between variables for which the
relationship is thought to be linear
SpearmanrResult(correlation=0.9843, pvalue=0.01569)
spearmanr(x, y)
SpearmanrResult(correlation=1.0, pvalue=0.0)
kendalltau(x, y)
KendalltauResult(correlation=1.0, pvalue=0.0415)
0: no relationship
x y z
x 1.00 -0.46 0.49
y -0.46 1.00 -0.61
z 0.49 -0.61 1.00
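The difference between the correlation methods can be sketched on a monotonic but nonlinear relationship, where Spearman scores exactly 1 while Pearson does not:

```python
import pandas as pd

# Pearson measures linear association; Spearman uses ranks, so any
# monotonic relationship scores exactly 1
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3   # monotonic but nonlinear

print(x.corr(y, method='pearson'))
print(x.corr(y, method='spearman'))  # 1.0
```

This is why Spearman (and Kendall's tau) are preferred when the relationship is believed to be monotonic but not necessarily linear.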
The Jobs dataset
Facet plots of the jobs dataset
jobs.plot(subplots=True,
layout=(4, 4),
figsize=(20, 16),
sharex=True,
sharey=False)
plt.show()
index_month = jobs.index.month
jobs_by_month = jobs.groupby(index_month).mean()
print(jobs_by_month)
ax.legend(bbox_to_anchor=(1.0, 0.5),
loc='center left')
Python dictionaries
# Initialize a Python dictionary
my_dict = {}
{'your_key': 'your_value',
'your_second_key': 'your_second_value'}
Trends in Jobs data
print(trend_df)
plt.setp(fig.ax_heatmap.yaxis.get_majorticklabels(),
rotation=0)
plt.setp(fig.ax_heatmap.xaxis.get_majorticklabels(),
rotation=90)
Going further with time series
Data from Zillow Research
Kaggle competitions
Reddit Data
James Fulton
Climate informatics researcher
Motivation
Time series are everywhere
Science
Technology
Business
Finance
Policy
date values
2019-03-11 5.734193
2019-03-12 6.288708
2019-03-13 5.205788
2019-03-14 3.176578
Mean is constant
Variance is constant
Autocorrelation is constant
Overview
Statistical tests for stationarity
Making a dataset stationary
results = adfuller(df['close'])
(-1.34, 0.60, 23, 1235, {'1%': -3.435, '5%': -2.863, '10%': -2.568}, 10782.87)
1 https://www.statsmodels.org/dev/generated/statsmodels.tsa.stattools.adfuller.html
city_population
date
1969-09-30 NaN
1970-03-31 -0.116156
1970-09-30 0.050850
1971-03-31 -0.153261
1971-09-30 0.108389
city_population
date
1970-03-31 -0.116156
1970-09-30 0.050850
1971-03-31 -0.153261
1971-09-30 0.108389
1972-03-31 -0.029569
AR models
Autoregressive (AR) model
AR(1) model :
yt = a1 yt−1 + ϵt
AR(1) model :
yt = a1 yt−1 + ϵt
AR(2) model :
yt = a1 yt−1 + a2 yt−2 + ϵt
AR(p) model :
yt = a1 yt−1 + a2 yt−2 + ... + ap yt−p + ϵt
MA(1) model :
yt = m1 ϵt−1 + ϵt
MA(2) model :
yt = m1 ϵt−1 + m2 ϵt−2 + ϵt
MA(q) model :
yt = m1 ϵt−1 + m2 ϵt−2 + ... + mq ϵt−q + ϵt
ARMA = AR + MA
ARMA(1,1) model :
yt = a1 yt−1 + m1 ϵt−1 + ϵt
ARMA(p, q)
p is order of AR part
q is order of MA part
Creating a model
from statsmodels.tsa.arima.model import ARIMA
print(results.summary())
ARMAX(1,1) model :
yt = x1 zt + a1 yt−1 + m1 ϵt−1 + ϵt
Predicting the next value
Take an AR(1) model
yt = a1 yt−1 + ϵt
yt = 0.6 × 10 + ϵt
yt = 6.0 + ϵt
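The arithmetic above is just the conditional expectation of the AR(1) model; a one-line sketch:

```python
# One-step-ahead AR(1) forecast: with y_{t-1} = 10 and a1 = 0.6, the best
# guess sets the future shock epsilon_t to its mean of zero
a1 = 0.6
y_prev = 10.0
forecast = a1 * y_prev
print(forecast)  # 6.0
```

The shock term contributes nothing to the point forecast, but its variance is exactly what produces the widening uncertainty bands discussed next.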
Uncertainty on prediction
2013-10-28 1.519368
2013-10-29 1.351082
2013-10-30 1.218016
lower y upper y
2013-09-28 -4.720471 -0.815384
2013-09-29 -5.069875 0.112505
2013-09-30 -5.232837 0.766300
2013-10-01 -5.305814 1.282935
2013-10-02 -5.326956 1.703974
# Plot prediction
plt.plot(dates,
mean_forecast.values,
color='red',
label='forecast')
# Shade uncertainty area
plt.fill_between(dates, lower_limits, upper_limits, color='pink')
plt.show()
# forecast mean
mean_forecast = forecast.predicted_mean
Non-stationary time series recap
Yes!
ARIMA(p, 0, q) = ARMA(p, q)
adf = adfuller(df.diff().dropna().iloc[:,0])
print('ADF Statistic:', adf[0])
print('p-value:', adf[1])
Motivation
...
AR(2) model →
MA(2) model →
# Create figure
fig, (ax1, ax2) = plt.subplots(2,1, figsize=(8,8))
# Make ACF plot
plot_acf(df, lags=10, zero=False, ax=ax1)
# Make PACF plot
plot_pacf(df, lags=10, zero=False, ax=ax2)
plt.show()
AIC - Akaike information criterion
Lower AIC indicates a better model
AIC likes to choose simple models with lower order
AIC: 2806.36
BIC: 2821.09
p q AIC     BIC
0 0 2900.13 2905.04
0 1 2828.70 2838.52
0 2 2806.69 2821.42
1 0 2810.25 2820.06
1 1 2806.37 2821.09
1 2 2807.52 2827.15
...
# Fit model
model = ARIMA(df, order=(p,0,q))
results = model.fit()
Introduction to model diagnostics
How good is the final model?
2013-01-23 1.013129
2013-01-24 0.114055
2013-01-25 0.430698
2013-01-26 -1.247046
2013-01-27 -0.499565
... ...
mae = np.mean(np.abs(residuals))
...
===================================================================================
Ljung-Box (Q): 32.10 Jarque-Bera (JB): 0.02
Prob(Q): 0.81 Prob(JB): 0.99
Heteroskedasticity (H): 1.28 Skew: -0.02
Prob(H) (two-sided): 0.21 Kurtosis: 2.98
===================================================================================
The Box-Jenkins method
From raw data → production model
identification
estimation
model diagnostics
Plot ACF/PACF
plot_acf() , plot_pacf()
results.summary()
Seasonal data
Has predictable and repeated patterns
Repeats after a fixed, known period of time
# Decompose data
decomp_results = seasonal_decompose(df['IPG3113N'], period=12)
type(decomp_results)
statsmodels.tsa.seasonal.DecomposeResult
# Plot ACF
plot_acf(df.dropna(), ax=ax, lags=25, zero=False)
plt.show()
The SARIMA model
Seasonal ARIMA = SARIMA : SARIMA(p,d,q)(P,D,Q)S
SARIMA(0,0,0)(2,0,1)7 model:
yt = a7 yt−7 + a14 yt−14 + m7 ϵt−7 + ϵt
Δyt = yt − yt−S
Time series
plt.show()
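The seasonal difference Δyt = yt − yt−S can be sketched on a synthetic series whose pattern repeats every 12 steps (a pure sine, used here only as a stand-in for seasonality):

```python
import numpy as np
import pandas as pd

# Seasonal differencing with period S removes a pattern that repeats
# every S steps: a sine with period 12 differences away to ~0
t = np.arange(48)
seasonal = pd.Series(np.sin(2 * np.pi * t / 12))

S = 12
diffed = seasonal.diff(S)   # y_t - y_{t-S}
print(diffed.dropna().abs().max())
```

The first S values are NaN because no observation exists S steps earlier, and the rest are (numerically) zero, which is why seasonal differencing is the D step in a SARIMA model.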
Searching over model orders
import pmdarima as pm
results = pm.auto_arima(df)
...
1 https://www.alkaline-ml.com/pmdarima/modules/generated/pmdarima.arima.auto_arima.html
# Select a filepath
filepath ='localpath/great_model.pkl'
Box-Jenkins
Other transforms
d + D should be 0-2
The SARIMAX model
Make forecasts
Tackle real world data! Either your own or examples from statsmodels
1 https://www.statsmodels.org/stable/datasets/index.html
Chris Holdgraf
Fellow, Berkeley Institute for Data
Science
Time Series Modeling
Always begin by looking at your data
array.shape
(10, 5)
array[:3]
# Using matplotlib
fig, ax = plt.subplots()
ax.plot(...)
# Using pandas
fig, ax = plt.subplots()
df.plot(..., ax=ax)
(samples, features)
array.T.shape
(10, 3)
array.shape
(10,)
array.reshape(-1, 1).shape
(10, 1)
Getting to know our data
The datasets that we'll use in this course are all freely-available online
There are many datasets available to download on the web, the ones we'll use come from
Kaggle
print(files)
['data/heartbeat-sounds/proc/files/murmur__201101051104.wav',
...
'data/heartbeat-sounds/proc/files/murmur__201101051114.wav']
print(sfreq)
2205
In this case, the sampling frequency is 2205, meaning there are 2205 samples per second
Note: this assumes the sampling rate is fixed and no data points are lost
Can we detect any patterns in historical records that allow us to predict the value of
companies in the future?
data.columns
data.head()
df['date'].dtypes
0 object
1 object
2 object
dtype: object
df['date'] = pd.to_datetime(df['date'])
df['date']
0 2017-01-01
1 2017-01-02
2 2017-01-03
Name: date, dtype: datetime64[ns]
Always visualize raw data before fitting models
(20, 7000)
print(means.shape)
# (n_files,)
(20,)
If we have a label for each sample, we can use scikit-learn to create and fit a classifier
The auditory envelope
Smooth the data to calculate the auditory envelope
Related to the total amount of audio energy present at each moment of time
(5000, 20)
audio_rectified = audio.apply(np.abs)
audio_envelope = audio_rectified.rolling(50).mean()
model = LinearSVC()
scores = cross_val_score(model, X, y, cv=3)
print(scores)
Here we'll calculate the tempogram, which estimates the tempo of a sound over time
We can calculate summary statistics of tempo in the same way that we can for the
envelope
Fourier transforms
Timeseries data can be described as a combination of quickly-changing things and slowly-changing things
At each moment in time, we can describe the relative presence of fast- and slow-moving
components
This converts a single timeseries into an array that describes the timeseries as a
combination of oscillations
For our purposes, we'll convert into decibels which normalizes the average values of all
frequencies
# Visualize
fig, ax = plt.subplots()
specshow(spec_db, sr=sfreq, x_axis='time',
         y_axis='hz', hop_length=HOP_LENGTH, ax=ax)
For example, spectral bandwidth and spectral centroids describe where most of the energy
is at each moment in time
Classification vs. Regression
CLASSIFICATION REGRESSION
classification_model.predict(X_test) regression_model.predict(X_test)
Correlation: A statistic that describes the data. Less information than regression model.
Two timeseries that seem correlated at one moment may not remain so over time
0.08
Data is messy
Real-world data is often messy
The two most common problems are missing data and outliers
This often happens because of human error, machine sensor malfunction, database failures,
etc
In this case, interpolation means using the known values on either side of a gap in the
data to make assumptions about what's missing.
# Plot the interpolated data in red and the data w/ missing values in black
ax = prices_interp.plot(c='r')
prices.plot(c='k', ax=ax, lw=2)
Here, we'll show how to convert your dataset so that each point represents the % change
over a previous window.
This makes timepoints more comparable to one another if the absolute values of data
change a lot
They can have negative effects on the predictive power of your model, biasing it away from
its "true" value
Extracting features with windows
AIG ABT
std amax std amax
date
2010-02-01 2.051966 29.889999 0.868830 56.239949
2010-02-02 2.101032 29.629999 0.869197 56.239949
2010-02-03 2.157249 29.629999 0.852509 56.239949
1.0
print(mean_over_first_axis(a))
[0. 1. 2.]
For a given dataset, the Nth percentile is the value where N% of the data is below that
datapoint, and 100-N% of the data is above that datapoint.
40.0
However, don't forget that timeseries data often has more "human" features associated with
it, like days of the week, holidays, etc.
These features are often useful when dealing with timeseries data that spans multiple years
(such as stock value over time)
day_of_week = prices.index.day_name()  # .weekday_name in older pandas
print(day_of_week[:10])
The past is useful
Timeseries data almost always have information that is shared between timepoints
Information in the past can help predict what happens in the future
Often the features best-suited to predict a timeseries are previous values of the same
timeseries.
AKA, how correlated is a timepoint with its neighboring timepoints (called autocorrelation).
We can use this to assess how auto-correlated our signal is (and lots of other stuff too)
df
0 0.0
1 1.0
2 2.0
3 3.0
4 4.0
df
0 NaN
1 NaN
2 NaN
3 0.0
4 1.0
# Shifts
shifts = [0, 1, 2, 3, 4, 5, 6, 7]
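The `shifts` list above feeds a pattern like the following: one lagged copy of the series per shift, collected into a feature matrix (toy data, shorter shift list):

```python
import pandas as pd

# Build lagged copies of a series as model features, one column per shift
ts = pd.Series(range(10), dtype=float)

shifts = [0, 1, 2, 3]
lagged = pd.DataFrame({f'lag_{n}': ts.shift(n) for n in shifts})
print(lagged.head())
```

Rows whose lags reach before the start of the series contain NaN and are typically dropped before fitting a model.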
Cross validation with scikit-learn
# Iterating over the "split" method yields train/test indices
for tr, tt in cv.split(X, y):
model.fit(X[tr], y[tr])
model.score(X[tt], y[tt])
This only works if the data is i.i.d., which timeseries usually is not.
You should not shuffle your data when making predictions with timeseries.
cv = ShuffleSplit(n_splits=3)
for tr, tt in cv.split(X, y):
...
However, you generally should not use datapoints in the future to predict data in the past
One approach: Always use training data from the past to predict the future
Stationarity
Stationary time series do not change their statistical properties over time
The statistical properties the model finds may change with the data
In addition, we will be less certain about the correct values of model parameters
The bootstrap:
1. Take a random sample of data with replacement
The result is a 95% confidence interval of the mean of each coefficient.
This is useful in finding certain regions of time that hurt the score
Timeseries and machine learning
The many applications of time series + machine learning
Spectral analysis
Quantopian is also useful for learning and using predictive models others have built.