Time Series Analysis - Data Exploration and Visualization. - by Himani Gulati - Jovian - Medium
I have used stock data, so I also tried to reach a buy-or-sell decision by implementing one of the trend-following strategies, and I explored a few others, though only theoretically. I am still not certain whether machine learning models can be used to predict movement in the stock markets. I found this in one of the articles I read, and I couldn't agree more.
Data Exploration:
The data I have used in this project was scraped by Prakhar Goel and can be found here and here. These are two CSV-formatted files containing data from the Indian stock market on about 65 shares, from late February to September 2020. The first CSV file contains open_price (opening price), close_price (closing price), high_price (highest price), low_price (lowest price), the timestamp, and the Scrip_id for stock identification. The second one contains the names of the particular stocks and their exchange.
stocks_df = pd.read_csv(data_directory)
print(len(stocks_df))
stocks_df
2321232

0          1        1170.00  1170.00  1149.25  1164.60  104528  2   2020-02-24 09:15:00+05:30
1          13008    318.00   318.00   311.65   312.30   22036   1   2020-02-24 09:15:00+05:30
2          26015    828.95   828.95   825.45   825.85   22222   3   2020-02-24 09:15:00+05:30
3          39022    1672.00  1672.00  1665.00  1667.10  15844   4   2020-02-24 09:15:00+05:30
4          52029    1469.75  1469.75  1463.10  1465.05  150673  5   2020-02-24 09:15:00+05:30
...
2321227    2666522  108.40   108.70   108.40   108.70   1317    41  2020-09-03 15:29:00+05:30
2321228    2683019  497.00   497.00   496.10   497.00   1752    42  2020-09-03 15:29:00+05:30
2321229    2699517  827.00   828.00   826.95   828.00   54      43  2020-09-03 15:29:00+05:30
2321230    2716011  765.70   766.45   764.00   766.00   606     44  2020-09-03 15:29:00+05:30
2321231    2731697  1840.00  1840.00  1837.30  1840.00  197     45  2020-09-03 15:29:00+05:30
First DataFrame
df = pd.read_csv(data_directory_two)
df.head()
Second DataFrame
I have joined the two to get a single DataFrame called merged_df, and from it you can list the names of the unique stocks whose information I have in the datasets.
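The article doesn't show the join itself, so here is a minimal sketch of how such a merge could look with pandas. The column names (`scrip_id`, `name`) and the tiny stand-in tables are assumptions based on the description above, not the original files; `STOCK_B` is a made-up name.

```python
import pandas as pd

# Toy stand-ins for the two CSVs; the shared key 'scrip_id' is an assumption.
prices = pd.DataFrame({
    'scrip_id': [2, 2, 42],
    'close_price': [1164.60, 1160.75, 497.00],
})
names = pd.DataFrame({
    'scrip_id': [2, 42],
    'name': ['INDUSINDBK', 'STOCK_B'],
})

# An inner join keeps only rows whose scrip_id appears in both files,
# attaching the stock name to every price row.
merged_df = prices.merge(names, on='scrip_id', how='inner')
print(merged_df['name'].unique())
```

With `how='inner'`, any price rows whose scrip has no matching name (or vice versa) are silently dropped, which is usually what you want when the second file is a lookup table.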
0  1170.00  1170.00  1149.25  1164.60  104528  2  INDUSINDBK  1346049  2020-02-24 09:15:00+05:30
1  1165.60  1168.50  1159.00  1160.75  109720  2  INDUSINDBK  1346049  2020-02-24 09:16:00+05:30
2  1162.35  1164.45  1161.00  1163.05  48984   2  INDUSINDBK  1346049  2020-02-24 09:17:00+05:30
3  1164.10  1168.85  1163.65  1168.85  67387   2  INDUSINDBK  1346049  2020-02-24 09:18:00+05:30
4  1168.35  1173.00  1166.45  1173.00  78541   2  INDUSINDBK  1346049  2020-02-24 09:19:00+05:30
Merged DataFrame.
merged_df['name'].unique()
Below are two graphs representing the data I curated from the full dataset of over 2.3 million rows down to about 35K rows, covering NIFTY 50's movement in the selected time period.
The two graphs differ, yet they represent the same closing-price values. You will notice gaps between some points in the first graph. This is because in the first plot I used the timestamp as my index, so there is missing data for the times the markets aren't open; in the second plot I plotted the closing prices simply in order of occurrence, which removes those gaps.
Now, for further analysis, observations over the entire time range weren't very helpful, so I narrowed the data to a single month, which I chose to be April, with about 3,000 data points.
nifty_50_april
Correlation Plot/Matrix:
The correlation coefficient measures the linear relationship between two variables and is bounded between -1 and 1. Make sure you understand what correlation means here, because a plain correlation plot is not very useful for time series; the autocorrelation and partial autocorrelation plots are, and they will determine the model we use to forecast our time series data.
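To make the -1 to 1 bound concrete, here is a tiny illustration (the series values are made up): a variable that is an exact linear function of another has a Pearson correlation of exactly 1.

```python
import numpy as np

# Pearson correlation of two short series; a perfectly linear relationship
# gives +1, a perfectly inverse one gives -1.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1                     # exact linear function of x
r = np.corrcoef(x, y)[0, 1]       # off-diagonal entry of the 2x2 matrix
print(round(r, 6))                # -> 1.0
```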
Below I have plotted the ACF for the NIFTY 50 April data with the help of the statsmodels library. → import statsmodels.api as sm
plt.rc("figure", figsize=(10,6))
sm.graphics.tsa.plot_acf(nifty_50_april['close_price'], lags=50);
The horizontal axis of an autocorrelation plot shows the size of the lag between the elements of the time series. In simple terms, the 'kth' lag is the observation that happened k time points before time t. You can optionally set how many lags you want to observe.
Observe → that our data has a very high autocorrelation. You will see below why this is and how it can be mended. Both of these plots also help us with modelling the data, as you will see.
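What `plot_acf` draws at each lag is just the sample autocorrelation. As a hedged sketch (the helper function and the toy series are mine, not the article's), here it is computed by hand: the covariance of the series with a k-step-shifted copy of itself, normalised by the overall variance.

```python
import numpy as np

def autocorr(x, k):
    """Sample autocorrelation at lag k (k >= 1): covariance of the series
    with a copy of itself shifted by k steps, divided by the variance."""
    x = np.asarray(x, dtype=float)
    xm = x - x.mean()
    return np.sum(xm[k:] * xm[:-k]) / np.sum(xm * xm)

# A steadily trending series (like a stock price) stays highly correlated
# with its recent past, so small lags give large positive values.
series = [100, 101, 102, 103, 104, 105, 106, 107]
print(round(autocorr(series, 1), 3))  # -> 0.625
```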
Partial Autocorrelation Plot:
A partial autocorrelation summarises the relationship between an observation in a time series and observations at prior time steps, with the relationships of the intervening observations removed. In other words, the effects of the lags in between are removed, and we can see the direct impact a previous observation has on the value to be predicted at time t.
Below I have plotted the PACF for the NIFTY 50 April data, again directly with the help of the statsmodels library.
plt.rc("figure", figsize=(10,6))
sm.graphics.tsa.plot_pacf(nifty_50_april['close_price']);
1) Summary Statistics:
One of the most basic methods to check whether our data is stationary is to compare summary statistics across splits of the series. This is not a very accurate method; sometimes the outcome of this test can be a statistical fluke.
X = nifty_50['close_price'].values
split = round(len(X) / 2)
X1, X2 = X[0:split], X[split:]
mean1, mean2 = X1.mean(), X2.mean()
var1, var2 = X1.var(), X2.var()
print('mean1=%f, mean2=%f' % (mean1, mean2))
print('variance1=%f, variance2=%f' % (var1, var2))
mean1=9603.421572, mean2=11105.933332
variance1=223878.109347, variance2=114346.102623
2) Dickey-Fuller Test:
Another method to check for stationarity in our data is a statistical test. I have used the Augmented Dickey-Fuller test, which is a unit-root test. The null hypothesis of the test is that the time series can be represented by a unit root, which would mean the data is not stationary.
p-value > 0.05: Fail to reject the null hypothesis (H0); the data has a unit root and is non-stationary.
p-value <= 0.05: Reject the null hypothesis (H0); the data does not have a unit root and is stationary.
from statsmodels.tsa.stattools import adfuller

X = nifty_50['close_price'].values
result = adfuller(X)
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))
Using the same, I have performed differencing on my data with this piece
of code below:
data = nifty_april['close_price'] - nifty_april['close_price'].shift(1)
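Subtracting a one-step-shifted copy is exactly what pandas' built-in `diff` does; a small sketch on made-up numbers (the toy series is mine) shows the equivalence, and why the first element comes out as NaN:

```python
import pandas as pd

s = pd.Series([100.0, 103.0, 101.0, 104.0])

# First-order differencing: subtract the previous observation.
# s - s.shift(1) and s.diff(1) are equivalent; the first element is NaN
# because there is nothing before it to subtract.
manual = s - s.shift(1)
builtin = s.diff(1)
print(manual.tolist())  # -> [nan, 3.0, -2.0, 3.0]
```

That leading NaN is why the ADF test below is run on `data[1:]` rather than the full differenced series.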
Using this, I conducted the dickey-fuller test again which gave me the
following results:
X = data[1:].values
result = adfuller(X)
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))
Below I have plotted the graphs of my data before and after removing the
temporal dependence in them.
plt.rc("figure", figsize=(10,6))
sm.graphics.tsa.plot_pacf(data[1:]);
df.rolling(window).mean()
Moving averages with upper and lower bounds, for clearer movement.
Moving average: 30
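The article doesn't state which bounds it plotted around the moving average, so here is a minimal sketch under an assumption: bounds at the rolling mean plus or minus 1.96 rolling standard deviations (a common choice); the helper name, the scale, and the toy series are all mine.

```python
import pandas as pd

def rolling_bounds(series, window, scale=1.96):
    """Rolling mean with upper/lower bounds at mean +/- scale * rolling std.
    The 1.96-std width is an assumption, not the article's exact bound."""
    ma = series.rolling(window).mean()
    sd = series.rolling(window).std()
    return ma, ma + scale * sd, ma - scale * sd

prices = pd.Series([10.0, 11.0, 12.0, 11.0, 13.0, 12.0])
ma, upper, lower = rolling_bounds(prices, window=3)
print(ma.tolist())  # first two entries are NaN until the window fills
```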
The difference in Moving Average’s and Exponential smoothing’s emphasis on past observations.
Another important advantage of the SES model over the SMA model is that SES uses a continuously variable smoothing parameter, which can easily be optimized by a "solver" algorithm to minimize the mean squared error.
Mathematically, if F(t) is the forecast at time t and A(t) is the actual value at time t, the simple exponential smoothing update is:

F(t+1) = α·A(t) + (1 − α)·F(t)

SimpleExpSmoothing
𝜶(alpha) = 0.2
𝜶(alpha) = 0.8
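As a minimal sketch of how that smoothing parameter can be tuned, the grid search below stands in for the real solver that a library would use; the function names and the toy series are mine, not the article's.

```python
def ses_forecasts(series, alpha):
    """One-step-ahead SES forecasts: F(t+1) = alpha*A(t) + (1-alpha)*F(t),
    initialised with F(1) = A(0)."""
    f = [series[0]]
    for t in range(1, len(series)):
        f.append(alpha * series[t - 1] + (1 - alpha) * f[t - 1])
    return f

def ses_mse(series, alpha):
    """Mean squared one-step-ahead forecast error for a given alpha."""
    f = ses_forecasts(series, alpha)
    return sum((a - b) ** 2 for a, b in zip(series, f)) / len(series)

# Crude grid search over alpha in (0, 1), standing in for a real optimiser.
series = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 13.0]
best_alpha = min((a / 100 for a in range(1, 100)),
                 key=lambda a: ses_mse(series, a))
```

A real solver (like the one statsmodels runs when you fit without fixing alpha) does the same minimisation, just more efficiently than a grid.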
I used this NIFTY 50 April data for exponential smoothing, but the results weren't very helpful. Have a look :/
In this plot, the black line is the actual distribution of the data; of the others, the red line is the most accurate, as it is plotted with the optimized smoothing value determined by statsmodels itself.
Next, I have used the mathematical formula directly with this piece of code below:
def exponential_smoothing(series, alpha):
    result = [series[0]]  # the first smoothed value is the first observation
    for n in range(1, len(series)):
        result.append(alpha * series[n] + (1 - alpha) * result[n - 1])
    return result
Well, there are mainly 5 traditional models that are most commonly used in time series forecasting: AR (autoregression), MA (moving average), ARMA, ARIMA, and SARIMA.
In case you're wondering, (in a time series model) residual values are what is left over after fitting a model.
MA: Moving Average Models. These express the current value of the time series linearly in terms of the current and previous residual values. Rather than using past values of the forecast variable, the moving average model uses past forecast errors: the value to be forecasted is a function of the errors from previous forecasts. As I mentioned before, we use the ACF plot to determine the order of errors to be taken into account to build our model.
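The reason the ACF identifies the MA order is that an MA(q) process has zero autocorrelation beyond lag q, so the plot "cuts off". As a small illustration under stated assumptions (the function is mine; it encodes the standard theoretical result for an MA(1) process x_t = e_t + θ·e_{t−1}):

```python
def ma1_acf(theta, k):
    """Theoretical autocorrelation of an MA(1) process:
    rho(0) = 1, rho(1) = theta / (1 + theta**2), rho(k) = 0 for k > 1.
    The cut-off after lag 1 is what the ACF plot reveals."""
    if k == 0:
        return 1.0
    if k == 1:
        return theta / (1 + theta ** 2)
    return 0.0

print(ma1_acf(0.5, 1))  # -> 0.4
print(ma1_acf(0.5, 2))  # -> 0.0
```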
Sorry, I have mentioned a lot of theory here, but like you know…
def small_ma(df, column='close_price'):
    df['small_ma'] = df[column].rolling(window=30).mean()
    return df

def large_ma(df, column='close_price'):
    df['large_ma'] = df[column].rolling(window=100).mean()
    return df

def choose_month(df, month):
    df = df[(df.month == month)]
    return df
#this function will tell me what price to buy at and what price to sell at.
def buy_sell(signal):
    sigPriceBuy = []
    sigPriceSell = []
    flag = -1
    for i in range(0, len(signal)):
        if signal['small_ma'][i] > signal['large_ma'][i]:  # buying condition
            if flag != 1:  # in case we haven't bought already
                sigPriceBuy.append(signal['close_price'][i])
                sigPriceSell.append(np.nan)
                flag = 1  # meaning now we have bought
            else:
                sigPriceBuy.append(np.nan)
                sigPriceSell.append(np.nan)
        elif signal['small_ma'][i] < signal['large_ma'][i]:  # selling condition
            if flag != 0:  # in case we haven't sold already
                sigPriceSell.append(signal['close_price'][i])
                sigPriceBuy.append(np.nan)
                flag = 0  # meaning now we have sold
            else:
                sigPriceBuy.append(np.nan)
                sigPriceSell.append(np.nan)
        else:
            sigPriceBuy.append(np.nan)
            sigPriceSell.append(np.nan)
    return (sigPriceBuy, sigPriceSell)
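To see the crossover logic in action, here is the same flag-based loop inlined on a toy DataFrame (the numbers are made up purely for illustration; the column names match the functions above): the short MA crosses above the long MA at index 2, triggering one buy, and back below at index 4, triggering one sell.

```python
import numpy as np
import pandas as pd

# Toy data: small_ma crosses above large_ma at index 2, below at index 4.
signal = pd.DataFrame({
    'close_price': [100.0, 101.0, 103.0, 104.0, 102.0, 101.0],
    'small_ma':    [100.0, 100.5, 102.0, 103.0, 101.0, 100.5],
    'large_ma':    [101.0, 101.0, 101.5, 102.0, 101.5, 101.5],
})

buy, sell = [], []
flag = -1  # -1 = no position yet, 1 = holding, 0 = sold
for i in range(len(signal)):
    if signal['small_ma'][i] > signal['large_ma'][i]:
        buy.append(signal['close_price'][i] if flag != 1 else np.nan)
        sell.append(np.nan)
        flag = 1
    elif signal['small_ma'][i] < signal['large_ma'][i]:
        sell.append(signal['close_price'][i] if flag != 0 else np.nan)
        buy.append(np.nan)
        flag = 0
    else:
        buy.append(np.nan)
        sell.append(np.nan)
print(buy)   # buy signal only at the upward crossover (index 2)
```

Note the flag means only the first bar of each crossover produces a signal; every later bar on the same side yields NaN.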
REFERENCES:
Stats,
Stationarity,
Differencing