STOCK_MARKET_PROJECT - Jupyter Notebook
import warnings
warnings.filterwarnings('ignore')

import datetime
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from pandas.plotting import lag_plot
import matplotlib.pyplot as plt
import seaborn as sns
import pandas_datareader.data as web
from statsmodels.tsa.arima_model import ARIMA
from sklearn.metrics import mean_squared_error

listofcomp = ['AAPL', 'GOOG', 'MSFT', 'AMZN']
end = datetime.datetime.now()
start = datetime.datetime(end.year - 1, end.month, end.day)  # one year of data

for x in listofcomp:
    globals()[x] = web.DataReader(x, 'yahoo', start, end)
In [4]: AAPL.describe()
Out[4]: (summary statistics for High, Low, Open, Close, Volume and Adj Close;
        table values omitted in this export)

In [5]: AAPL.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 253 entries, 2020-12-14 to 2021-12-14
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   High       253 non-null    float64
 1   Low        253 non-null    float64
 2   Open       253 non-null    float64
 3   Close      253 non-null    float64
 4   Volume     253 non-null    float64
 5   Adj Close  253 non-null    float64
dtypes: float64(6)
memory usage: 13.8 KB
In [6]: AAPL['Adj Close'].plot(legend=True, figsize=(10,4))
Out[6]: <AxesSubplot:xlabel='Date'>
In [7]: AAPL['Volume'].plot(legend=True,figsize=(10,4))
Out[7]: <AxesSubplot:xlabel='Date'>
In [8]: ma_day = [10, 20, 50]

        for ma in ma_day:
            column_name = 'MA for %s days' % (str(ma))
            AAPL[column_name] = AAPL['Adj Close'].rolling(window=ma).mean()
Out[9]: <AxesSubplot:xlabel='Date'>
In [10]: AAPL['Daily Return'] = AAPL['Adj Close'].pct_change()
         AAPL['Daily Return'].plot(figsize=(10,4), legend=True, linestyle='--', marker='o')
Out[10]: <AxesSubplot:xlabel='Date'>
In [11]: AAPL['Daily Return'].hist(color='purple')
Out[11]: <AxesSubplot:>
Out[15]: (first rows of closing_df, the Adj Close prices for Symbols AAPL, GOOG,
         MSFT, AMZN indexed by Date; table values omitted in this export)

In [17]: tech_rets.head()
Out[17]: (first rows of tech_rets, the daily returns for Symbols AAPL, GOOG,
         MSFT, AMZN indexed by Date; table values omitted in this export)
In [20]: sns.pairplot(tech_rets.dropna())

In [21]: returns_fig = sns.PairGrid(tech_rets.dropna())
         returns_fig.map_upper(plt.scatter, color='purple')
         returns_fig.map_lower(sns.kdeplot, cmap='cool_d')
         returns_fig.map_diag(plt.hist, bins=30)
We can also analyze the correlation of the closing prices using this exact same
technique.
In [22]: returns_fig = sns.PairGrid(closing_df)
returns_fig.map_upper(plt.scatter,color='purple')
returns_fig.map_lower(sns.kdeplot,cmap='cool_d')
returns_fig.map_diag(plt.hist,bins=30)
All the plot types used above are explained well here:
https://www.hackerearth.com/blog/developers/data-visualization-techniques/
Value at Risk
We can treat value at risk (VaR) as the amount of money we could expect to lose
(i.e., the amount we put at risk) at a given confidence level. There are several
methods for estimating value at risk.

For this method we will calculate the empirical quantiles from a histogram of daily
returns.
In [25]: sns.distplot(AAPL['Daily Return'].dropna(),bins=100,color='purple')
In [26]: rets['AAPL'].quantile(0.05)
Out[26]: -0.025191669250715143
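The empirical-quantile step above can be reproduced with plain NumPy on synthetic returns (a sketch only; the return distribution and the $1,000,000 position size below are made-up stand-ins, not the AAPL figures from this notebook):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic daily returns standing in for the 'Daily Return' column
daily_returns = rng.normal(loc=0.001, scale=0.02, size=1000)

# The 0.05 empirical quantile: on 95% of days the loss should not exceed this
var_95 = np.quantile(daily_returns, 0.05)

# For a hypothetical $1,000,000 position, the one-day 95% VaR in dollars
position = 1_000_000
dollar_var = -var_95 * position

print(f"95% daily VaR: {var_95:.4f} ({dollar_var:,.0f} dollars at risk)")
```

This is exactly what `rets['AAPL'].quantile(0.05)` computes above, just with a dollar interpretation attached.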
In [28]: def stock_monte_carlo(start_price, days, mu, sigma):
             # mu and sigma are the mean and std of daily returns computed earlier
             dt = 1/days

             price = np.zeros(days)
             price[0] = start_price

             shock = np.zeros(days)
             drift = np.zeros(days)

             for x in range(1, days):
                 shock[x] = np.random.normal(loc=mu*dt, scale=sigma*np.sqrt(dt))
                 drift[x] = mu*dt
                 price[x] = price[x-1] + (price[x-1]*(drift[x] + shock[x]))

             return price
In [29]: start_price = 569.85

         for run in range(100):
             plt.plot(stock_monte_carlo(start_price, days, mu, sigma))

         plt.xlabel('Days')
         plt.ylabel('Price')
         plt.title('Monte Carlo Analysis for Google')
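The simulated paths can also be turned into a Monte Carlo VaR estimate by taking an empirical quantile of the final simulated prices. The sketch below bundles a minimal, self-contained variant of stock_monte_carlo so it runs on its own; the mu and sigma values are hypothetical stand-ins, not the GOOG estimates from this notebook:

```python
import numpy as np

def stock_monte_carlo(start_price, days, mu, sigma, rng):
    """Simulate one price path with drift mu and volatility sigma."""
    dt = 1 / days
    price = np.zeros(days)
    price[0] = start_price
    for x in range(1, days):
        shock = rng.normal(loc=mu * dt, scale=sigma * np.sqrt(dt))
        price[x] = price[x - 1] * (1 + mu * dt + shock)
    return price

rng = np.random.default_rng(101)
start_price = 569.85                   # same Google start price as above
days, mu, sigma = 365, 0.0002, 0.02    # hypothetical drift / volatility

# Collect the final simulated price of many independent runs
simulations = np.array([
    stock_monte_carlo(start_price, days, mu, sigma, rng)[-1]
    for run in range(1000)
])

# The 1% empirical quantile of end prices: 99% of outcomes end above it
q = np.percentile(simulations, 1)
var_99 = start_price - q
print(f"1% quantile: {q:.2f}  ->  99% VaR: {var_99:.2f}")
```

The difference between the starting price and that 1% quantile is the amount at risk at 99% confidence for this simulated horizon.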
In [32]: listofcomp = ['AAPL', 'GOOG', 'MSFT', 'AMZN']
         end = datetime.datetime.now()
         start = datetime.datetime(end.year - 5, end.month, end.day)

         for x in listofcomp:
             globals()[x] = web.DataReader(x, 'yahoo', start, end)
In [33]: AAPL.describe()
Out[33]: (summary statistics for High, Low, Open, Close, Volume and Adj Close;
         table values omitted in this export)

In [34]: AAPL.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1259 entries, 2016-12-14 to 2021-12-14
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   High       1259 non-null   float64
 1   Low        1259 non-null   float64
 2   Open       1259 non-null   float64
 3   Close      1259 non-null   float64
 4   Volume     1259 non-null   float64
 5   Adj Close  1259 non-null   float64
dtypes: float64(6)
memory usage: 68.9 KB
ARIMA models explain a time series from its own past values: its own lags and the
lagged forecast errors.

An ARIMA model is characterized by three terms (p, d, q): p is the order of the AR
term, d is the number of differencing passes required to make the time series
stationary, and q is the order of the MA term.

As the d parameter suggests, any time series that can be made stationary by
differencing can be modeled with ARIMA.
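As a toy illustration of the AR part, here is an AR(1) process fitted by a simple lag-1 least-squares regression on synthetic data (a sketch only; the notebook itself relies on statsmodels' ARIMA, not this hand-rolled estimator):

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulate an AR(1) series: x_t = phi * x_{t-1} + noise
phi_true = 0.7
n = 5000
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi_true * x[t - 1] + rng.normal(scale=1.0)

# Least-squares estimate of phi from the lag-1 relationship:
# regress x_t on x_{t-1} (no intercept)
phi_hat = np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)
print(f"true phi = {phi_true}, estimated phi = {phi_hat:.3f}")
```

With p lags instead of one, the regression gains p coefficients, which is exactly what the p in (p, d, q) counts.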
STATIONARITY
A stationary time series is one whose properties do not depend on the time at which the
series is observed. Thus, time series with trends, or with seasonality, are not stationary —
the trend and seasonality will affect the value of the time series at different times.
To difference the series, subtract the previous value from the current value.
Differencing once may not be enough to obtain a stationary series, so we may need
to repeat the operation.

The minimum number of differencing operations needed to make the series
stationary is the d value that goes into our ARIMA model.
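Differencing itself is just np.diff (or pandas .diff()). A quick NumPy sketch of why a linearly trending series needs exactly one differencing pass (synthetic data, not the AAPL series):

```python
import numpy as np

# A series with a deterministic linear trend: clearly non-stationary
t = np.arange(20, dtype=float)
series = 3.0 * t + 5.0

# Subtract the previous value from the current value
first_diff = np.diff(series)
second_diff = np.diff(first_diff)

# One pass removes the trend entirely: the differenced series is constant
print(first_diff[:5])   # all 3.0
print(second_diff[:5])  # all 0.0
```

For the linear trend d = 1 suffices; a quadratic trend would need d = 2, and so on.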
ADF TEST
We'll use the Augmented Dickey-Fuller (ADF) test to check whether the price series
is stationary.
The null hypothesis of the ADF test is that the time series is non-stationary. So,
if the p-value of the test is less than the significance level (0.05), we can
reject the null hypothesis and infer that the time series is indeed stationary.

In our case, if the p-value > 0.05 we'll need to find the order of differencing.
from statsmodels.tsa.stattools import adfuller

result = adfuller(AAPL.Close.dropna())
print(f"ADF Statistic: {result[0]}")
print(f"p-value: {result[1]}")
Out[39]: 1
Therefore the d value is 1.
p is the order of the Auto Regressive (AR) term. It refers to the number of lags of
the series to be used as predictors. We can find the required number of AR terms by
inspecting the Partial Autocorrelation (PACF) plot.

The partial autocorrelation at lag k is the correlation between the series and its
lag-k values after removing the effect of the intermediate lags.
from statsmodels.graphics.tsaplots import plot_pacf

diff = AAPL.Close.diff().dropna()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(diff)
ax1.set_title('Differenced once')
ax2.set_ylim(0, 1)
plot_pacf(diff, ax=ax2);
q is the order of the Moving Average (MA) term. It refers to the number of lagged forecast
errors that should go into the ARIMA model.
from statsmodels.graphics.tsaplots import plot_acf

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(diff)
ax1.set_title('Differenced once')
ax2.set_ylim(0, 1)
plot_acf(diff, ax=ax2);
To evaluate the ARIMA model, I decided to use two different error functions: Mean
Squared Error (MSE) and Symmetric Mean Absolute Percentage Error (SMAPE). SMAPE is
commonly used as an accuracy measure based on relative errors.

Since SMAPE is not currently supported as a loss function in scikit-learn, I first
had to implement it myself.
In [43]: def smape_kun(y_true, y_pred):
             return np.mean(np.abs(y_pred - y_true) * 200 / (np.abs(y_pred) + np.abs(y_true)))
Testing Mean Squared Error: 4.879
Symmetric mean absolute percentage error: 9.893
SMAPE is a commonly used loss function for time-series problems and can therefore
provide a more reliable analysis; here it indicates that our model performs well.
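To put figures like these in context, it helps to compare against a naive walk-forward baseline where each day's forecast is simply the previous day's value. A NumPy sketch with the same two metrics on a synthetic random walk (the smape helper here mirrors the smape_kun definition above):

```python
import numpy as np

def smape(y_true, y_pred):
    # Same formula as smape_kun above, expressed as a percentage
    return np.mean(np.abs(y_pred - y_true) * 200 / (np.abs(y_pred) + np.abs(y_true)))

rng = np.random.default_rng(3)
prices = 100 + np.cumsum(rng.normal(scale=1.0, size=300))  # synthetic random walk

# Naive walk-forward: predict tomorrow's price as today's price
y_true = prices[1:]
y_pred = prices[:-1]

mse = np.mean((y_true - y_pred) ** 2)
print(f"Naive MSE:   {mse:.3f}")
print(f"Naive SMAPE: {smape(y_true, y_pred):.3f}")
```

An ARIMA model only adds value if it beats this kind of trivial baseline on the same test window.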
In [45]: print(model_fit.summary())
         (ARIMA model summary table with columns coef, std err, z, P>|z|,
         [0.025, 0.975]; full output omitted in this export)
In [46]: residuals = pd.DataFrame(model_fit.resid)
residuals.plot()
Out[46]: <AxesSubplot:>
In [47]: residuals.plot(kind='kde')
Out[47]: <AxesSubplot:ylabel='Density'>
In [48]: residuals.describe()
Out[48]:
0
count 1257.000000
mean 0.000004
std 1.593496
min -10.996909
25% -0.521333
50% -0.043387
75% 0.538770
max 9.974758
In [49]: plt.figure(figsize=(14,7))
         plt.plot(AAPL['Close'], color='blue', label='Training Data')
         plt.plot(test_data.index, predictions, color='green', marker='o', linestyle='dashed', label='Predicted Price')
         plt.plot(test_data.index, test_data['Close'], color='red', label='Actual Price')
         plt.title('Apple Prices Prediction')
         plt.xlabel('Years')
         plt.ylabel('Prices')
         plt.legend()
In [51]: plt.figure(figsize=(14,7))
         plt.plot(test_data.index, predictions, color='green', marker='o', linestyle='dashed', label='Predicted Price')
         plt.plot(test_data.index, test_data['Close'], color='red', label='Actual Price')
         plt.title('APPLE Stock Prices Prediction')
         plt.xlabel('Years')
         plt.ylabel('Prices')
         plt.legend()
This ARIMA analysis led overall to appreciable results: the model demonstrated good
prediction accuracy and is relatively fast compared to alternatives such as RNNs
(Recurrent Neural Networks).