Time Series Analysis: Data Exploration and Visualization

A simple walkthrough to handle time-series data and the statistics

Himani Gulati Jan 5 · 12 min read

A picture is worth a thousand words, as the saying goes, and it definitely
holds true in data analysis.

As a beginner, I really struggled to put the pieces of the time series
puzzle together. Hence I have tried to cover everything from the most basic
things to, hopefully, the bigger ones, which makes this a beginner-friendly
project. You can find the notebook for the source code here.

I would still suggest you pick up Statistics as a subject if this is the
field you are headed into. But don't forget, Machine Learning != Statistics.

Statistics is key… well, at least one of the keys.

Time series data is simply a sequence of data points in chronological order
(i.e. following the order of occurrence), which businesses use to analyze
past data and make better decisions. This project aims to create a basic
understanding of how to deal with and visualize time series data.

I have used stock data, so I have also tried to reach a buy or sell decision
by implementing one of the trend-following strategies, and I explored a few
others too, but only theoretically. I am still not sure whether Machine
Learning models can be used to predict movement in the stock markets. I
found this in one of the articles I read and I couldn't agree more.

Note: There are many forms of analysis to determine the worth of an
investment/trade. I have focused only on the technical analysis of the
stock market here, which solely takes into account historical data directly
related to the particular stock. You can find out how ML helps with
fundamental analysis, which involves understanding a stock's
intrinsic/inherent value, here in this set of articles by Marco

Data Exploration:
The data I have used in this fieldwork was scraped by Prakhar Goel and can
be found here and here. These are two CSV files containing data from the
Indian stock market on about 65 shares, from April to September 2020. The
first CSV file contains open_price (opening price), close_price (closing
price), high_price (highest price), low_price (lowest price), the
timestamp, and the scrip_id for stock identification. The second one
contains the names of the stocks and their exchange.

import pandas as pd

stocks_df = pd.read_csv(data_directory)


              id  timestamp  open_price  high_price  low_price  close_price  volume  scrip_id
0              1                1170.00     1170.00    1149.25      1164.60  104528         2
1          13008                 318.00      318.00     311.65       312.30   22036         1
2          26015                 828.95      828.95     825.45       825.85   22222         3
3          39022                1672.00     1672.00    1665.00      1667.10   15844         4
4          52029                1469.75     1469.75    1463.10      1465.05  150673         5
...          ...  ...               ...         ...        ...          ...     ...       ...
2321227  2666522                 108.40      108.70     108.40       108.70    1317        41
2321228  2683019                 497.00      497.00     496.10       497.00    1752        42
2321229  2699517                 827.00      828.00     826.95       828.00      54        43
2321230  2716011                 765.70      766.45     764.00       766.00     606        44
2321231  2731697                1840.00     1840.00    1837.30      1840.00     197        45

2321232 rows × 8 columns

First DataFrame

df = pd.read_csv(data_directory_two)

   id        name  zerodha_id  exchange
0   1    GLENMARK     1895937       NSE
1   2  INDUSINDBK     1346049       NSE
2   3       TECHM     3465729       NSE
3   4   KOTAKBANK      492033       NSE
4   5    RELIANCE      738561       NSE

Second DataFrame

I have joined the two to get a single DataFrame called merged_df. Below it
you can see the names of the unique stocks whose information I have in the
datasets.

df.rename(columns = {'id':'scrip_id'}, inplace=True)

merged_df = stocks_df.merge(df, on='scrip_id')

   timestamp  open_price  high_price  low_price  close_price  volume  scrip_id        name  zerodha_id
0                1170.00     1170.00    1149.25      1164.60  104528         2  INDUSINDBK     1346049
1                1165.60     1168.50    1159.00      1160.75  109720         2  INDUSINDBK     1346049
2                1162.35     1164.45    1161.00      1163.05   48984         2  INDUSINDBK     1346049
3                1164.10     1168.85    1163.65      1168.85   67387         2  INDUSINDBK     1346049
4                1168.35     1173.00    1166.45      1173.00   78541         2  INDUSINDBK     1346049

Merged Dataframe.



'NIFTY20DEC9000PE', 'NIFTY20DEC8000PE', 'NIFTY20DEC10000CE',
'NIFTY20DEC9000CE', 'NIFTY20DEC10000PE', 'NIFTY20DEC8000CE',
'HDFCBANK20SEPFUT'], dtype=object)

List of all stocks on which I have data.

DON'T FORGET to convert the timestamp column from the object data type to
the pandas datetime data type. It's important so that we can aggregate our
data and resample it as needed. You can find the code for everything
mentioned but not shown in the notebook itself.
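
As a rough sketch of that conversion (not the notebook's exact code), the
snippet below builds a tiny stand-in frame called toy, converts its
timestamp column with pd.to_datetime, and then resamples. The name toy and
the fourth row are made up for illustration; the first three rows reuse
values from the April table shown later in this article.

```python
import pandas as pd

# Toy frame standing in for the real CSV. The first three rows reuse values
# from the April table in this article; the fourth row is made up.
toy = pd.DataFrame({
    "timestamp": [
        "2020-04-20 09:15:00+05:30",
        "2020-04-20 09:16:00+05:30",
        "2020-04-20 09:17:00+05:30",
        "2020-04-21 09:15:00+05:30",
    ],
    "close_price": [9308.40, 9291.40, 9281.50, 9300.00],
})

# Convert the object (string) column to a timezone-aware datetime column.
toy["timestamp"] = pd.to_datetime(toy["timestamp"])

# Datetime columns expose .dt accessors, e.g. for building a month column.
toy["month"] = toy["timestamp"].dt.month

# With a datetime index the data can be resampled, here to the last
# closing price of each calendar day.
daily_close = toy.set_index("timestamp")["close_price"].resample("D").last()
print(daily_close)
```

Without the conversion, the .dt accessor and resample are not available on
the column, since pandas treats it as plain strings.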

Exploratory Data Analysis:

The data I have above is about 2 million rows for about 65 different stocks
and ranges from April to September. But because I wanted to perform an
exploratory data analysis, I narrowed it down to the data of one
particular stock, i.e. Nifty 50.

Below are two graphs representing the data I have curated from the entire
dataset of 2 million to about 35K covering nifty50's movement in the
selected time period.

Now there's a difference between the two graphs, but they basically
represent the same closing-price values. The difference is that you will
notice some straight connecting segments in the first graph. This is only
because in the first plot I have used the timestamp as my index, and in the
second plot I have plotted the closing prices in order of occurrence. The
gaps arise because I have no data for the times when the markets are
closed.

Now, for further analysis, observations over the entire time range weren't
so helpful, so I further narrowed the data to a particular month, which I
chose to be April, with about 3000 data points.


                      timestamp  open_price  high_price  low_price  close_price  volume  month
0     2020-04-20 09:15:00+05:30     9390.20     9390.20    9306.25      9308.40       0      4
1     2020-04-20 09:16:00+05:30     9310.75     9319.35    9289.80      9291.40       0      4
2     2020-04-20 09:17:00+05:30     9292.00     9292.00    9270.70      9281.50       0      4
3     2020-04-20 09:18:00+05:30     9280.95     9296.55    9278.35      9290.30       0      4
4     2020-04-20 09:19:00+05:30     9288.25     9297.15    9280.75      9291.95       0      4
...                         ...         ...         ...        ...          ...     ...    ...
3370  2020-04-30 15:25:00+05:30     9866.00     9866.20    9860.05      9861.75       0      4
3371  2020-04-30 15:26:00+05:30     9861.65     9862.85    9857.10      9858.70       0      4
3372  2020-04-30 15:27:00+05:30     9858.50     9858.60    9856.20      9857.60       0      4
3373  2020-04-30 15:28:00+05:30     9857.25     9857.95    9853.80      9856.20       0      4
3374  2020-04-30 15:29:00+05:30     9856.15     9858.45    9848.10      9851.85       0      4

3375 rows × 7 columns

DataFrame on which analysis has been performed.

Moving on to the important analysis plots in a time series.

Correlation Plot/Matrix:
The correlation coefficient tells us the linear relationship between two
variables. It is bounded between -1 and 1. Make sure you understand the
meaning of correlation here, because the correlation plot is not so useful
with time series, but the autocorrelation and partial autocorrelation plots
are, and they will determine the model we use to forecast our time series.
A coefficient close to 1 means a positive and robust association between
the two variables.

A coefficient close to -1 means a strong negative association between the
two variables. In the terms I understood, something like an inverse
relationship.
Correlation matrix plot for the Nifty 50 data.

Autocorrelation and Partial Autocorrelation Plot:
These are important plots for time series. They graphically summarize the
strength of the relationships between observations in a time series.

In autocorrelation, we calculate the correlation of time-series
observations with observations at previous time steps, called lags. Because
the correlation is calculated with values of the same series at previous
times, it is called a serial correlation or an autocorrelation.
Below I have plotted the ACF for the Nifty 50 April data with the help of
the statsmodels library.

import matplotlib.pyplot as plt
import statsmodels.api as sm

plt.rc("figure", figsize=(10,6))
sm.graphics.tsa.plot_acf(nifty_50_april['close_price'], lags=50);

High Autocorrelation graph

The horizontal axis of an autocorrelation plot shows the size of the lag
between the elements of the time series. In simple terms, the kth lag is
the time period that happened k time points before time i. You can
optionally set how many lags you want to observe.
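
To see what a single bar of the plot measures, here is a small sketch that
computes the sample autocorrelation at one lag in plain Python. The helper
name autocorrelation and the two toy series are assumptions for
illustration; the plots in this article come from statsmodels, not from
this function.

```python
def autocorrelation(series, k):
    """Sample autocorrelation of `series` at lag k (k = 0 gives 1.0)."""
    n = len(series)
    mean = sum(series) / n
    denominator = sum((x - mean) ** 2 for x in series)
    # Pair each observation with the one k steps before it.
    numerator = sum(
        (series[t] - mean) * (series[t - k] - mean) for t in range(k, n)
    )
    return numerator / denominator


trend = list(range(100))   # steadily rising series
zigzag = [1, -1] * 50      # flips sign at every step

print(autocorrelation(trend, 1))    # about 0.97: neighbours look alike
print(autocorrelation(zigzag, 1))   # -0.99: neighbours are opposites
```

A slowly moving series such as a price level behaves like the trend case,
which is why its ACF stays high across many lags.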

Observe that our data has a very high autocorrelation. You will see below
why that is and how it can be mended. Both of these plots also help us with
modeling the data, as you will see.

The autocorrelation plot is used for a time series forecasting model called
Moving Averages, which you will read about later in this article.

Partial Autocorrelation Plot:
A partial autocorrelation is a summary of the relationship between an
observation in a time series and observations at prior time steps, with the
relationships of intervening observations removed. Meaning, the effects of
the lags in between are removed, and we can see the direct impact a
previous observation has on the value to be predicted at time t.

PACF can be computed by regression.

Regression is a statistical method to determine the strength and character
of the relationship between one dependent variable and other variables
that are independent.
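
One way to picture "PACF by regression" is sketched below: regress the
series on its first k lags with ordinary least squares and read off the
coefficient of the k-th lag. This is an illustrative version on a simulated
series, assuming NumPy; the names pacf_by_regression, phi and ar1 are made
up here, and statsmodels may use a different estimator internally.

```python
import numpy as np


def pacf_by_regression(series, k):
    """Partial autocorrelation at lag k: the last coefficient of an
    ordinary least squares regression of x(t) on x(t-1), ..., x(t-k)."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    n = len(x)
    target = x[k:]
    # Column j holds the series shifted back by j steps (j = 1..k).
    lagged = np.column_stack([x[k - j : n - j] for j in range(1, k + 1)])
    coefficients, *_ = np.linalg.lstsq(lagged, target, rcond=None)
    return coefficients[-1]


# Simulated series where each value depends directly on the previous one only.
rng = np.random.default_rng(0)
phi = 0.7
noise = rng.normal(size=5000)
ar1 = np.zeros(5000)
for t in range(1, 5000):
    ar1[t] = phi * ar1[t - 1] + noise[t]

print(pacf_by_regression(ar1, 1))   # should land near phi
print(pacf_by_regression(ar1, 2))   # should land near 0
```

The second value is small because, once lag 1 is accounted for, lag 2 has
no direct effect left in this simulated series.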

Below I have plotted the PACF for the Nifty 50 April data, again directly
with the help of the statsmodels library.

plt.rc("figure", figsize=(10,6))
sm.graphics.tsa.plot_pacf(nifty_50_april['close_price']);

Plot for Partial Autocorrelation of my data

The PACF plot is used in the autoregressive model for forecasting, which
again you will see later in the article.

Stationarity of Our Data:

Stationarity is an essential concept in time series analysis. If our data
is stationary, it means that the summary statistics of our data (or rather
of the process generating it) are consistent and do not change over time.
It is important to check for stationarity because many useful analytical
tools and statistical models rely on it. I have tried two methods below:

1) Summary Statistics:
One of the most basic methods to check whether our data is stationary is to
compare its summary statistics across different parts of the series. This
is not a very accurate way; sometimes the outcome of this check can be a
statistical fluke.

Histogram plot of closing prices.

X = nifty_50['close_price'].values
split = round(len(X) / 2)
X1, X2 = X[0:split], X[split:]
mean1, mean2 = X1.mean(), X2.mean()
var1, var2 = X1.var(), X2.var()
print('mean1=%f, mean2=%f' % (mean1, mean2))
print('variance1=%f, variance2=%f' % (var1, var2))

mean1=9603.421572, mean2=11105.933332
variance1=223878.109347, variance2=114346.102623

Comparing mean and variances

2) Dickey-Fuller test:
Another way to check for stationarity in our data is to use statistical
tests. I have used the Augmented Dickey-Fuller test, which is a unit root
test. The null hypothesis of the test is that the time series can be
represented by a unit root, which means the data is not stationary.

p-value > 0.05: Fail to reject the null hypothesis (H0), the data has a
unit root and is non-stationary.

p-value <= 0.05: Reject the null hypothesis (H0), the data does not
have a unit root and is stationary.

from statsmodels.tsa.stattools import adfuller

X = nifty_50['close_price'].values

result = adfuller(X)
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))

ADF Statistic: -0.663206

p-value: 0.856051
Critical Values:
1%: -3.431
5%: -2.862
10%: -2.567

Value of P = 0.856, meaning we have non-stationary data.

The data I have is non-stationary, as I expected. Hence I used a basic
technique to get rid of the trend in my data, i.e. DIFFERENCING.

Differencing can be used to remove the series' dependence on time, also
called temporal dependence. This includes structures like trends and
seasonality. It is simply performed by subtracting the previous
observation from the current observation.

difference(t) = observation(t) - observation(t-1)

Using the same, I have performed differencing on my data with the piece of
code below:

data = nifty_april['close_price'] - nifty_april['close_price'].shift(1)
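
As a small check of what differencing produces, the sketch below applies it
to the first five closing prices from the April table above. The variable
names are illustrative; it also shows that pandas' built-in diff() gives
the same result as subtracting a shifted copy.

```python
import pandas as pd

# First five closing prices from the April table in this article.
prices = pd.Series([9308.40, 9291.40, 9281.50, 9290.30, 9291.95])

by_shift = prices - prices.shift(1)   # subtract the previous observation
by_diff = prices.diff()               # pandas shortcut for the same thing

# The first entry has no previous observation, so it is NaN in both.
print(by_shift.round(2).tolist())
print(by_diff.round(2).tolist())
```

Because the first value is NaN, it has to be dropped (for example with
[1:]) before the differenced series is passed to a statistical test.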


Using this, I conducted the Dickey-Fuller test again, which gave me the
following results:

X = data[1:].values  # skip the first value, which is NaN after differencing

result = adfuller(X)
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))

ADF Statistic: -58.045640

p-value: 0.000000
Critical Values:
1%: -3.432
5%: -2.862
10%: -2.567

p ≤ 0.05, meaning the differenced data is now stationary.

Below I have plotted the graphs of my data before and after removing the
temporal dependence in them.

Plots of my data before and after removing trend/seasonality.

Further, I plotted the ACF and PACF for the same.

Note: Our prior high autocorrelation plot showed no seasonality or trend.

By plotting the ACF and PACF after differencing we shall be able to get a
decent start on what kind of model we should use. Note that, these tools
help us to get a starting point on understanding the time series we are
dealing with, not to get us our final answer.

data = nifty_50_april['close_price'] - nifty_50_april['close_price'].shift(1)

plt.rc("figure", figsize=(10,6))
sm.graphics.tsa.plot_acf(data[1:], lags=50);

Autocorrelation plot after differencing

plt.rc("figure", figsize=(10,6))
sm.graphics.tsa.plot_pacf(data[1:]);

Partial Autocorrelation Plot after differencing

Moving Averages Model to smooth our data:

The time-series data is very noisy, hence it's very difficult to gauge a
trend/pattern in the data. This is where a moving average (rolling mean)
helps. It works by splitting the data into windows, aggregating each window
with a function (here the mean), and so creating a constantly updated,
averaged-out version of the data. The code for it is very straightforward.
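
A minimal sketch of a rolling mean with pandas, on made-up prices rather
than the Nifty data (the series and the window of 3 are chosen only to keep
the arithmetic easy to follow):

```python
import pandas as pd

# Made-up prices, only for illustration.
prices = pd.Series([10.0, 12.0, 11.0, 13.0, 15.0, 14.0])

# Each output value is the mean of the current and the two previous prices.
rolling_mean = prices.rolling(window=3).mean()
print(rolling_mean.tolist())
```

The first two entries are NaN because a full window of three values is not
available yet. A larger window gives a smoother but slower-reacting line.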


Moving averages with upper and lower bounds, for clearer movement.

plot_moving_average(nifty_50_april, 30, column='small_ma', plot_intervals=True)

Moving average: 30

plot_moving_average(nifty_50_april, 100, column='large_ma', plot_intervals=True)

Moving average: 100

Simple Exponential Smoothing:

Forecasts produced using exponential smoothing methods are weighted
averages of past observations, with the weights decaying exponentially
as the observations get older. Let me explain that in simple terms along
with how it is different from Simple Moving Averages.

Exponential Smoothing considers past data in a certain order/fashion. In
exponential smoothing, the most recent observation gets a little more
weight than the 2nd most recent observation, and the 2nd most recent
observation gets a little more weight than the 3rd most recent
observation. Hope the graph below helps.

The difference in Moving Average’s and Exponential smoothing’s emphasis on past observations.

Another important advantage of the SES model over the SMA model is
that the SES model uses a smoothing parameter that is continuously
variable, so it can easily be optimized by using a “solver” algorithm to
minimize the mean squared error.

Mathematically, if F(t) is the forecast at time 't' and A(t) is the actual
value at time 't', then:

F(t+1) = F(t) + α(A(t)-F(t))

where, α → smoothing constant, (0≤α≤1)
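
Expanding this recursion shows where the exponentially decaying weights
come from. The lines below are only a rearrangement of the formula above;
k is an extra symbol introduced here for the number of substitution steps.

```latex
\begin{aligned}
F(t+1) &= F(t) + \alpha\,\bigl(A(t) - F(t)\bigr) \\
       &= \alpha A(t) + (1-\alpha)\,F(t) \\
       &= \alpha A(t) + \alpha(1-\alpha)\,A(t-1) + (1-\alpha)^{2}\,F(t-1) \\
       &= \alpha \sum_{j=0}^{k-1} (1-\alpha)^{j}\,A(t-j) + (1-\alpha)^{k}\,F(t+1-k)
\end{aligned}
```

For example, with α = 0.2 the three most recent observations get weights
0.2, 0.16 and 0.128, so each older observation counts for a little less.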

I have used two ways to get my exponentially smoothed column. First, I have
again used the statsmodels library (via statsmodels.tsa.api), taking three
different instances as follows:

α (alpha) = 0.2

α (alpha) = 0.8

α (alpha) value automatically optimized by statsmodels itself, which is
the recommended one.

I used the Nifty 50 April data for exponential smoothing, but the results
weren't so helpful. Have a look :/

So to get graphs with visible results, I narrowed my data frame down to a
single day's data, and well, then the results were obviously :))
SPECTACULAR!!!

In this plot, the black line is the actual data. Of the others, the red
line is the most accurate, as it is plotted with the optimized value
determined by statsmodels itself.

Next, I have used the mathematical formula directly with the piece of code
below:

def exponential_smoothing(series, alpha):
    result = [series[0]]  # first value is the same as the series
    for n in range(1, len(series)):
        result.append(alpha * series[n] + (1 - alpha) * result[n-1])
    return result

plot_exponential_smoothing(stock_one_day['close_price'], [0.05, 0.3])

Simple ES for two alpha values.

Trend-Following Strategies in Algorithmic Trading.

Here's a brief summary of what I think I should have known before I
started, and what you should know too if you're a beginner. I was clueless
about time series analysis when I started, and statistics turned out to be
a pretty emotional subject… but don't give up! It's all coming to you!

Well, there are mainly 5 traditional models that are most commonly used
in time series forecasting:

AR: Autoregressive Models. These models express the current value of the
time series linearly in terms of its previous values and the current
residual.

In case you're wondering, residual values (in a time series model) are
what is left over after fitting a model.

In simple terms, in an autoregressive model the value Y(t) that we want to
predict at time t depends on its own past values. Fitting the model
involves regressing on the past values we are considering for our forecast.
This is why we use the PACF to determine the order of our AR model, meaning
how many specific past values we will consider for predicting Y(t). I hope
that made sense!!!

F(t) = f[y(t-1), y(t-2), …, y(t-p)], where p is the order of the model

Also, for an AR model, we expect the ACF to exhibit diminishing behavior,
precisely like the one I have in my data. Normally we would then plot the
PACF for further evidence. In my data I did see a decaying ACF plot, but
when I plotted the PACF I saw that it wasn't very informative. This is one
of the reasons why stock prices are among the most challenging time series
to predict.

MA: Moving Average Models. These express the current value of the time
series linearly in terms of its current and previous residual values. In
the moving average model, rather than taking the past values of the
forecast variable, we consider past forecast errors. Therefore this time,
the value to be forecasted is a function of errors from previous forecasts.

F(t) = f[E(t-1), E(t-2), …, E(t-q)], where q is the order of the model

As I mentioned before, here we use the ACF plot to determine the order,
i.e. how many past errors are taken into account in our model.
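
As a rough illustration of why the ACF helps pick the MA order, the sketch
below simulates a series that depends on the current and one previous error
only, then checks its autocorrelation. Everything here (theta, the seed,
the helper autocorrelation) is an assumption for the example and is not
taken from the notebook.

```python
import random


def autocorrelation(series, k):
    """Sample autocorrelation of `series` at lag k."""
    n = len(series)
    mean = sum(series) / n
    denominator = sum((x - mean) ** 2 for x in series)
    numerator = sum(
        (series[t] - mean) * (series[t - k] - mean) for t in range(k, n)
    )
    return numerator / denominator


random.seed(42)
theta = 0.6   # weight given to the previous error
n = 5000
errors = [random.gauss(0, 1) for _ in range(n + 1)]

# Each value is the current error plus theta times the previous error.
series = [errors[t] + theta * errors[t - 1] for t in range(1, n + 1)]

print(autocorrelation(series, 1))   # near theta / (1 + theta**2), about 0.44
print(autocorrelation(series, 3))   # near 0: no shared errors at this lag
```

The autocorrelation is clearly non-zero at lag 1 and drops to roughly zero
afterwards, which is the cut-off pattern one looks for in the ACF plot when
choosing how many past errors to include.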

ARMA: AutoRegressive Moving Average Models. This is a combination of the
autoregressive and moving average models. Here, the current values of the
time series are expressed linearly in terms of its previous values and in
terms of both current and previous residual values.

NOTE: The Above Three models are for STATIONARY PROCESSES.

ARIMA: AutoRegressive Integrated Moving Average Model, and SARIMA: Seasonal
AutoRegressive Integrated Moving Average Model. There is very little
difference between the two. ARIMA generally fits non-stationary time
series based on ARMA, with a differencing process that effectively
transforms the non-stationary data into stationary data, whereas SARIMA
models combine seasonal differencing with ARIMA, i.e. they handle
time-series data with periodic characteristics.

Sorry, I have mentioned a lot of theory here, but like you know…

"The best practice is inspired by theory." (Donald Knuth)

Now, as I mentioned, I have tried to come to a buy/sell decision using
moving averages. I have applied a very simple concept that creates two
scenarios: one where the closing price falls below the small moving
average to generate a SELL signal, and the other where it exceeds the large
moving average to generate a BUY signal. The small MA has a window of 30.
The large MA has a window of 100.

def small_ma(df, column='close_price'):
    df['small_ma'] = df[column].rolling(window=30).mean()
    return df

def large_ma(df, column='close_price'):
    df['large_ma'] = df[column].rolling(window=100).mean()
    return df

def choose_month(df, month):
    df = df[(df.month == month)]
    return df

#def s_ma_slope(df, column):

Code for calculating Moving averages

Generating a BUY/SELL signal:

I have created 2 case scenarios, where if the Closing Price crosses the
moving averages it generates a buy/sell signal.

Plotting the generated buy/sell signal.

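import numpy as np

# This function tells me what price to buy at and what price to sell at.
def buy_sell(signal):
    sigPriceBuy = []
    sigPriceSell = []
    flag = -1
    for i in range(0, len(signal)):
        if signal['small_ma'][i] > signal['large_ma'][i]:  # buying condition
            if flag != 1:  # in case we haven't bought already
                # assumed: the recorded signal price is the closing price
                sigPriceBuy.append(signal['close_price'][i])
                sigPriceSell.append(np.nan)
                flag = 1  # meaning now we have bought
            else:
                sigPriceBuy.append(np.nan)
                sigPriceSell.append(np.nan)
        elif signal['small_ma'][i] < signal['large_ma'][i]:  # selling condition
            if flag != 0:  # in case we don't already hold the stock
                sigPriceBuy.append(np.nan)
                sigPriceSell.append(signal['close_price'][i])
                flag = 0  # now we have sold
            else:
                sigPriceBuy.append(np.nan)
                sigPriceSell.append(np.nan)
        else:  # for NaN values
            sigPriceBuy.append(np.nan)
            sigPriceSell.append(np.nan)
    return (sigPriceBuy, sigPriceSell)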

Generation of the buy/sell signal.

Possible profits after using the buy/sell signal generated on the basis of
moving averages:
I created a transaction table based on the buy/sell signal and then applied
a profit formula that ignores taxes.
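
The exact profit formula lives in the notebook, so the sketch below is only
one plausible version, assuming one unit is bought and sold per trade and
that each buy is paired with the next sell. The function name
trade_profits and the prices are made up for illustration.

```python
def trade_profits(buy_prices, sell_prices):
    """Pair the i-th buy with the i-th sell; return (per_trade, total).

    Assumes one unit per trade and ignores taxes and fees.
    """
    per_trade = [sell - buy for buy, sell in zip(buy_prices, sell_prices)]
    return per_trade, sum(per_trade)


# Hypothetical signal prices, not taken from the Nifty data.
buys = [9300.00, 9410.50]
sells = [9350.25, 9390.00]

per_trade, total = trade_profits(buys, sells)
print(per_trade)   # [50.25, -20.5]
print(total)       # 29.75
```

A negative entry simply means that trade was closed at a loss; the total is
what the strategy would have made per unit before any costs.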


Informative BOOK( Very Insightful)




WORDS: (That might be helpful)

The trend in data

Summary statistics provide information about our sample data. They tell us
about the location of the average and the skewness/kurtosis of the data, in
order to communicate a large amount of information in the simplest terms.

The null hypothesis is a typical statistical theory which suggests that no
statistical relationship or significance exists in a set of given single
observed variables, or between two sets of observed data and measured
phenomena.

Residual values, in a time series analysis, are what is left over after
fitting a model.

Hope this article helps!! Find me on LinkedIn here.

Also, hope you are doing well :)



