Trabajo Definitivo Econometria
AN ECONOMETRIC ANALYSIS
FOR THE COVID-19
Daniel Segura Pérez
Universidad Autónoma de Madrid
May 2020
INDEX
1. INTRODUCTION
2. DATA ANALYSIS
·China·
·Italy·
·Spain·
·Similarities and differences between countries·
·Types of functions·
·Volatility·
·Weekly analysis·
·Chinese experience·
3. ALTERNATIVE MODELS
4. PREDICTION
·Identification·
·Estimation·
·Diagnosis·
5. CONTAGIONS
6. RECOVERED
7. DEATHS
8. CONCLUSION
9. BIBLIOGRAPHY
1. INTRODUCTION
What is COVID-19?
The coronavirus COVID-19 pandemic is the defining global health crisis of our time and
the greatest challenge we have faced since World War Two.
Since its emergence in Asia, the virus has spread to every continent except Antarctica.
But COVID-19 is much more than a health crisis: it has the potential to create
devastating social, economic, and political crises that will leave deep scars.
As I work in the Department of Health Economics, I have to prepare a set of consecutive
reports on the consequences of the international pandemic in the three countries
studied.
So, I am going to prepare a set of reports analyzing three variables in each country. The
variables that I am going to study are:
-The total confirmed cases
-The total confirmed deaths owing to COVID-19
-The total recovered people who tested positive for COVID-19.
Finally, I am going to compare econometric models to choose the best one for each sample
of the variables in Spain, and I am going to find the optimal forecast for the samples of
the three variables. I assume that an optimal forecast is a prediction obtained by
minimizing the expected loss function.
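As a hedged illustration of this definition, and assuming a quadratic loss function (an assumption, since the text does not state which loss is used), the optimal forecast of $y_{T+h}$ given the information set $I_T$ available at time $T$ is the conditional mean. The symbols $f$, $L$, $I_T$ and $h$ are introduced here only for the illustration:

```latex
\hat{y}_{T+h|T} = \arg\min_{f} \; \mathbb{E}\left[ L\left(y_{T+h} - f\right) \mid I_T \right],
\qquad
L(e) = e^{2} \;\Longrightarrow\; \hat{y}_{T+h|T} = \mathbb{E}\left[ y_{T+h} \mid I_T \right].
```

Under a different loss function the optimal forecast would generally differ from the conditional mean.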
With the predictions of the variables for each sample, I am going to analyze
the quality and precision of the prediction depending on the sample size and the
model used.
2. DATA ANALYSIS
·China·
Analyzing the data of the three countries, I am going to start with the country where
the pandemic started, China. I have data about the province of Hubei, and the data run from
1/22/2020 to 3/25/2020. China is the country about which I have the most information,
as I have a sample of 64 observations.
The first important feature of the data in China is that, in the initial observations, the
number of contagions increases slowly and linearly. The first observation I
have is 444 contagions, which is a high quantity for a first observation.
In the first two weeks, the number of contagions increased rapidly, from 3554 to 33366
by 2/11/2020, an increase of roughly 839%.
The number of contagions continues to increase during the first 4 weeks, growing
exponentially through the weeks.
The hardest week in terms of contagions is the fourth week, when 28665 new contagions
were recorded, an average of 4095 new contagions per day.
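The percentage increase and the daily average quoted for Hubei can be reproduced with a short sketch. The function names are hypothetical helpers introduced only for this illustration; the input figures are the ones quoted in the text.

```python
def percent_increase(start, end):
    """Percentage change when a quantity goes from `start` to `end`."""
    return (end - start) / start * 100


def daily_average(weekly_total, days=7):
    """Average per day for a total accumulated over `days` days."""
    return weekly_total / days


# Figures quoted in the text for Hubei
growth = percent_increase(3554, 33366)   # contagions over the first two weeks
fourth_week = daily_average(28665)       # new contagions in the fourth week

print(f"Increase: {growth:.1f}%")
print(f"Average new contagions per day: {fourth_week:.0f}")
```

The first figure comes out at about 838.8% and the second at exactly 4095 per day, assuming a seven-day week.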
Over the following weeks the daily increase shows a smoother positive trend,
ending with a constant daily increase and a more linear growth in contagions.
Regarding the recovered cases that tested positive for COVID-19, their evolution
is very similar to the evolution of the contagions. This is logical
because the virus has a low mortality-to-case ratio, so, in general, most
confirmed cases in Hubei were correctly treated and the patients recovered from the
virus. I can see the similar evolution if I analyze the recovered cases graph.
In graph 1.2 I can observe a positive trend in the recovered cases very similar to that
of the contagions in Hubei. In graph 1.2, which concerns the
contagions, the growth is more pronounced, but in the case of the recovered we can see
two different phases in the graph.
In the first phase, we observe an exponential increase in the number of recovered cases;
then the observations smooth their trend and lead to a lower, more constant
increase in the recovered cases, basically because there are no new
confirmed cases.
The best day in terms of recovered cases is 2/21/2020, with a
remarkable peak of 9181 recovered cases.
Hubei may be an ideal reference for other countries, with an average
of 986 recovered cases per day.
To end with Hubei, I am going to analyze the deaths during the sample period. China is
one of the countries with the most deaths in the
international pandemic.
The behavior of the variable deaths in China is completely different from the other 2
countries. The deaths start with a slow and constant increase from
1/22/2020 to 2/19/2020, followed by an exponential increase over the next few
weeks up to the last week of observations, when the daily deaths start to decrease
until the last day of observations, when there were only 3 daily deaths.
·Italy·
First, it is important to say that Italy was the epicenter of the virus in Europe, so the fast
expansion across Europe is very similar to the expansion within Italy.
The first observation for Italy has only 3 confirmed contagions. During the next 8
days the virus spread through the whole country, mostly in the region of
Lombardia, one of the most affected regions of Italy.
After that initial week with no increase in contagions, the number of confirmed cases
starts to increase very quickly. From the week starting on 3/4/2020 the contagions
increase exponentially. In Italy, as in
Spain, it is very important to analyze the data weekly, because the data change considerably
depending on which week we analyze. The worst week for the variable
contagions in Italy is the last week of observations, which tells me that the exponential
increase in the number of contagions in Italy is going to continue in the following weeks of
March and April.
Secondly, analyzing the variable of recovered people who tested positive for COVID-19,
in the graph I can see that the variable starts at 0, because for the first weeks the sample
has no information about Italy, whose data start on 2/7/2020.
As I commented for China, the graph of the variable recovered is very similar to that of
the confirmed contagions. This is because the virus has a low death rate in
general, so the recovered graph closely follows the confirmed cases. It is even
more exponential because some people have only a few symptoms and recover
easily.
The week with the most recovered cases is also the last week of observations, with
an average of 589 recovered people per day. The highest number of recovered
people in a single day was 1300, on 3/22/2020.
To finish with Italy, I am going to analyze the variable deaths. Italy is one of the
hardest-hit countries, with a total number of deaths in my observations of 10950. If
I look at the total number of deaths as of today, 5/26/2020, the total confirmed
deaths in Italy are more than 30000, which places it among the European countries with
the highest number of deaths.
Obviously, in the first weeks of the sample the deaths are equal to 0, but later on the
variable also starts to increase exponentially. However, I think that both the
recovered and contagions variables have a steeper curve than the variable deaths.
Italy does not reach its peak within the sample; that would occur at a later date.
·Spain·
Now I am going to analyze Spain, our country and the country for which I am going to
forecast the three coronavirus variables.
First of all, I want to highlight that Spain is the second most affected country in
Europe, behind Italy, so the behavior of the data in Italy and Spain is very similar and
the two have many things in common.
If I observe the evolution of the confirmed cases in Spain, it fits an exponential
curve, even more pronounced than in Italy, where it was already steep.
As in the case of Italy, the curves are very similar because the effect of the
coronavirus in both countries is very similar too. Also, both countries share culture,
traditions and geographical zone, which may also contribute to the
similarity of the graphs.
Analyzing the data, the first confirmed cases in Spain appear on February 27, when
the government registered 2 confirmed cases of coronavirus in Valencia. The evolution in
Spain is very linear in the first 4-5 weeks, because of the few confirmed cases, but the
contagions increase after this initial period. As in the case of Italy, it is better to
analyze the data weekly.
For example, it makes no sense to analyze all the data together because the function is
exponential: the first values are around 0 or very small, and within
only 1 or 2 weeks the contagions increase massively.
The week with the highest number of confirmed cases is the week that starts on
3/18/2020, the last week of the sample. As in Italy, the new cases confirmed in Spain
will be higher in the following weeks.
For the variable recovered cases, since in Spain it is better to analyze the data
weekly, I am going to focus on the last 2 weeks of observations. However, the
observations from these last weeks are not fully
representative.
The number of recovered people increased significantly from the week of 3/11/2020
and continues along an exponential path, with some positive and negative peaks,
and an average over the last 2 weeks of 888 recovered cases
per day.
The recovered cases and their graphs for Spain and Italy are very similar. Analyzing
the graphs closely, I see that the graph for Spain is slightly more exponential than the
Italian one.
The variable deaths in Spain follows a trend very similar to recovered and
confirmed cases, but the curve is smoother than for the other 2 variables.
As I see in the graph, it is again better to analyze the variable deaths weekly. In the
first 2 weeks of observations the deaths in Spain were not too high, but when the virus
spread the deaths increased sharply. For example, once the deaths started
increasing, up to week 3 they grew by an average of 75% from
one day to the next, with a highest daily change of 168% from 3/7/2020 to 3/8/2020.
I am confident that the number of deaths will continue to increase in the following weeks.
So, both Spain and Italy have similar averages, but China does not, which also
suggests that Spain and Italy have managed the pandemic worse than
China up to this moment.
·Types of functions·
EXPONENTIAL FUNCTION
The exponential growth function applies to a variable whose variation grows
faster and faster over time. The variable is multiplied each period by a constant growth
factor, which must be greater than 1, so the factor is raised to the power of time.
This exponential function fits our data series very well because the virus has
had a very rapid growth trend: starting from a very small group of people, it has grown
exponentially over time.
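A minimal sketch of this idea follows. The starting value and growth factor are hypothetical numbers chosen only for illustration, not estimates from the report's data. It shows that with a constant growth factor above 1 the increments themselves keep growing.

```python
def exponential_growth(y0, factor, periods):
    """Return y0 * factor**t for t = 0 .. periods-1 (growth requires factor > 1)."""
    if factor <= 1:
        raise ValueError("growth factor must be greater than 1")
    return [y0 * factor ** t for t in range(periods)]


# Hypothetical example: 2 initial cases growing by 30% per day
series = exponential_growth(y0=2, factor=1.3, periods=8)

# Day-to-day increments: each one is larger than the previous one
increments = [later - earlier for earlier, later in zip(series, series[1:])]

for day, value in enumerate(series):
    print(day, round(value, 2))
```

The ratio between consecutive values stays constant while the absolute increments widen, which is the pattern described for the first weeks of the contagions series.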
LINEAR FUNCTION
If I said that some type of linear function represents our data, I would be lying because
it is not true.
A linear function should maintain constant growth throughout the days and, as we can
see in the graphs for each of the countries, the data series have spikes and falls with
irregular peaks that make it clear that the linear function is not the most suitable for
this series of data.
LOGARITHMIC FUNCTION
The logarithmic function could be a good candidate for our data.
The logarithmic function is the inverse of the exponential function. It can be
described as a function in which the data grow quickly at the beginning but, little by
little as the weeks go by, the growth slows and the data level off.
The problem with the logarithmic function is that, in this case, the virus has not grown as
fast at the start as a logarithmic function does.
QUADRATIC FUNCTION
The quadratic function is a function whose graph is a parabola; that is, from an analysis
point of view, we could represent our variables as a parabola. Analyzing the 3
variables, the contagions and deaths variables are the ones that most resemble
a quadratic function, and only in China, over the second and third phases.
Therefore, we could say that the function that best represents this crisis in general
would be the exponential function for the recovered variable, since it grows in step
with the number of infected but continues to grow when contagions are 0.
For the contagion and death variables I think that the most suitable functions are
both the exponential and the quadratic: the exponential because it represents very
clearly the very marked increase in cases, and the quadratic because it helps us
understand how cases have become much more constant and have been
decreasing in a controlled way until reaching 0 infections and 0 deaths.
With the sample we have, the exponential function would represent all the variables
and countries. However, as I know what happened in the following weeks, namely that
government measures stabilized the pandemic and the number of contagions and
deaths became more constant, a quadratic function is more likely.
·Volatility·
Now I want to study the volatility and dispersion of the variables in our
data. So, first I compute the main statistics, which are the mean, median and variance, for
the 3 countries.
To check for volatility, I can also look at the standard deviation, which is the
square root of the variance, and at the range.
I am going to analyze the data on a weekly basis; in this way the data
will be much more representative and reliable, because our sample varies a lot from
the beginning to the end of the period.
If I look first at the variance calculated with all the data together, as a
single group, I can see that it is very high. As we know, the variance
is a measure of dispersion defined as the expectation of the squared deviation of
a variable from its mean. It represents the dispersion of the values of the
variable during the sample period, so if we compute the variance over the whole
period, it is much higher because the distance between the first value
and the last one is enormous.
So, it is better to look at the data weekly. If we compare both variances, the difference is
very large; even so, the weekly variance is itself large, which means that our data are
widely dispersed and, therefore, the volatility of the series is high.
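The comparison between the whole-sample variance and the weekly variances can be sketched as follows. The daily figures are hypothetical numbers chosen for easy arithmetic, not the report's data, and the helper names are assumptions made for this illustration. The sample variance (dividing by n - 1) is used; a statistical package may report a different convention.

```python
import statistics


def summarize(values):
    """Main dispersion statistics discussed in the text."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "variance": statistics.variance(values),   # sample variance (n - 1)
        "std_dev": statistics.stdev(values),       # square root of the variance
        "range": max(values) - min(values),
    }


def weekly_chunks(values, days=7):
    """Split a daily series into consecutive blocks of `days` observations."""
    return [values[i:i + days] for i in range(0, len(values), days)]


# Hypothetical daily new cases for two weeks (illustrative only)
daily_cases = [1, 2, 3, 4, 5, 6, 7, 10, 20, 30, 40, 50, 60, 70]

whole = summarize(daily_cases)
weekly = [summarize(week) for week in weekly_chunks(daily_cases)]

print("whole-sample variance:", round(whole["variance"], 2))
for number, stats in enumerate(weekly, start=1):
    print(f"week {number} variance:", round(stats["variance"], 2))
```

In this made-up series the whole-sample variance is larger than either weekly variance, because it also absorbs the jump in level between the two weeks. Note that a final block with fewer than two observations would need separate handling.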
The volatility of the sample is going to continue increasing because the pandemic is
just arriving in Spain and Italy, so as the values of the following weeks continue
increasing, the variance of the total period is going to be higher and higher.
If I compare the variance of the different countries I can see that, for example, for the
variable contagions, Spain has the highest variance over the whole sample
period. This is because Spain suffered a massive increase in confirmed cases
in a shorter period than Italy and China. The variances of the other 2 variables are
similar, except that the variance in China is slightly lower than in the other 2 countries,
because its data are more constant in the last weeks.
Now I can analyze the case of Spain, as it is the country that I am going to predict,
and I am going to analyze the variable contagions.
Looking at the weekly variances, the highest value is for the week from 3/11/2020 to
3/18/2020, with a variance equal to 830000. That is very high, but it is
because the deviations of the data from the weekly mean are large.
The variance evolves through the weeks. In the first week it was not very
high, with a value of 222; as I commented before, this is because the values in the
first weeks are lower, since the virus was still settling in the country.
Once the virus settled, the variance of each week becomes greater than that of the previous
one, except in the last week, where I observe a variance slightly lower than in the previous week.
The case of Italy is very similar to that of Spain, and China has lower
variance values.
In conclusion, analyzing the variance over the whole sample period could mislead us,
because the dispersion is very high in variables like the ones I have, so it is much better
to analyze the variance in weekly periods.
Variance measures volatility, so the country with the highest volatility is
Spain in the three variables, mainly because of the larger difference between the first and
the last value and the shorter sample period.
So, of course, the results are going to be more reliable if we analyze them weekly, not
only for the variance but also for the mean.
The mean is going to be more representative of the whole sample if it is calculated for
each week; otherwise the mean may not represent well the first and the
last values of the sample.
·Weekly analysis·
China
Analyzing the data of China weekly, I can divide the variable contagions into 2 stages.
The first stage covers the first 4 weeks, when the number of confirmed cases increased
sharply; in the second stage, the last 4 weeks, the new confirmed cases decrease day
by day and remain close to 0.
For the variable of recovered cases, I cannot divide it into 2 stages, because
when the number of contagions starts to decrease, the number of recovered people
continues increasing: people infected in the previous weeks
recover one or two weeks after they caught the virus. In general, during the first 4-5 weeks the
number of recovered people increases at the same rate as the confirmed ones,
but in the last 2-3 weeks the recovered cases decrease significantly
because there are no new confirmed cases.
To finish with China, the variable deaths reaches its highest average deaths per day in
week 3, and then starts to decrease because of the good management of the
pandemic by the Chinese government.
Italy
The variable contagions in Italy shows a continuous increase in its
variance through the weeks up to the last week, when the variance was still high but
decreased slightly.
Looking at the recovered cases, here I can see the relation between the variances of
the contagions and the recovered variables. Once the contagions start to remain
constant and around 0, the recovered cases continue increasing and the dispersion is
higher and higher. So, while the contagions variance increases continuously up to the last
week, the variance of the variable recovered increases steadily in all the weeks.
The variable deaths follows the same trend as the recovered cases, increasing
sharply in the first weeks; in the last 2 weeks the deaths continue increasing but not
at the same growth rate as in the previous weeks.
Spain
First, the variable contagions in Spain shows the highest increase of the 3
countries. I only have 4 weeks, so it is difficult to analyze, but the contagions increased
less in the first week than in the others, and the massive increase in the last
2 weeks makes the variance much higher.
Secondly, the recovered cases in Spain follow the same trend and evolution as the
variable contagions, only with a short delay because of the minimum time
that people need to recover from the virus. The recovered cases increase week by
week and I believe that the variable will continue to increase in the following
weeks.
To end with Spain, the variable deaths is perhaps the variable with the highest
dispersion in Spain, and in the three countries. The weekly data show that the
average deaths per week increase over the weeks and will keep increasing during the next
weeks if the confirmed cases also increase.
·Chinese experience·
I think that the Chinese experience is not valuable for Spain and Italy, basically because
the sample period is different, and also because China managed the virus much better than
the Mediterranean countries.
I would change my opinion if the question were whether Italy could be a good model for
Spain; my answer would be yes, mainly because Spain and Italy are two very similar
countries, and because Spain saw what was happening in Italy before the virus arrived in
Spain. But Spain and Italy could not fully assess the situation in China, because of the
absence of information and reliable data.
Regarding your question about any odd observation in the data of China, the only
unusual observation, in my view, is one value in the variable of recovered
cases: on 2/21/2020, 9181 people recovered from the coronavirus in Hubei. The
odd feature of this value is that if I leave it out when calculating
the average recovered cases for that week, the average is equal to 2204
recovered cases per day, but if I include that extreme value, the mean
of the week is 3201.
In conclusion, this is why economists have to be careful with random variables
and with peculiar values in the series. The average recovered cases per day increases by about
1000 units depending on whether the value is included, so a single value,
which covers only 1 day, clearly distorts the whole average of the week.
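The effect of this single observation on the weekly mean can be checked with a small sketch. It assumes a seven-day week, so that the average of 2204 refers to the other six days; the function name is a hypothetical helper.

```python
def mean_with_extra(mean_without, n_without, extra_value):
    """Mean of a sample after adding one more observation to it."""
    return (mean_without * n_without + extra_value) / (n_without + 1)


# Figures quoted in the text: six ordinary days averaging 2204,
# plus the extreme value of 9181 recovered cases on 2/21/2020
weekly_mean = mean_with_extra(mean_without=2204, n_without=6, extra_value=9181)

print(round(weekly_mean))
```

Under that assumption the result rounds to 3201, matching the weekly mean reported above, which shows how one extreme day lifts the average by roughly 1000 cases.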
3. ALTERNATIVE MODELS
First, before starting the prediction of the variables analyzed in point 2, I have to
introduce the simplest and most widely used models in time series prediction and
econometrics.
Econometric models are constructed from economic data with the aid of the
techniques of statistical inference. They are usually based on
economic theories that assume optimizing behavior on the part of economic agents.
-MA model: The MA representation is known as a moving average because it is a sum
of weighted shocks, and it is “moving” because the shocks are different in each period.
So, by theory, an MA(q) is a moving average model of order q, such that the dynamics of
the process are a linear function of the last q innovations.
The parameters of the MA model are called theta, and I am going to obtain as many
parameters as the order q.
-AR model: The autoregressive model predicts future behavior based on past behavior.
It is used for forecasting when there is some correlation between values in a time
series and the values that precede and succeed them.
I am going to use only past data to model the behavior, hence the name
autoregressive. The process is basically a linear regression of the data in the current
series against one or more past values in the same series.
So, by theory, an AR(p) is an autoregressive model of order p, such that the dynamics of
the process are a linear function of the last p observations.
The parameters of the AR model are called phi, so I am going to have as many parameters as
the order p.
-ARMA model: The autoregressive-moving-average ARMA(p,q) models provide a
parsimonious description of a stationary stochastic process in terms of two
polynomials, one for the autoregression AR and the second for the moving average
MA.
-White noise: Stochastic process characterized by a lack of autocorrelation at any
displacement.
Basically, it is a random signal with equal intensity at every frequency, and it is often
defined in statistics as a signal whose samples are a sequence of uncorrelated random
variables with zero mean and finite variance. In some cases, it may also be required that
the samples are independent and identically distributed.
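The model definitions above can be illustrated with a small simulation. This is only a sketch under stated assumptions: Gaussian shocks, first-order models, the sign convention y_t = e_t + theta * e_{t-1} for the MA(1), and hypothetical parameter values (theta = 0.6, phi = 0.7). It does not reproduce any model estimated in this report.

```python
import random


def white_noise(n, sigma=1.0, seed=0):
    """Independent Gaussian shocks with zero mean and constant variance."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, sigma) for _ in range(n)]


def simulate_ma1(theta, shocks):
    """MA(1): y_t = e_t + theta * e_{t-1} (linear function of the last shock)."""
    return [shocks[t] + theta * shocks[t - 1] for t in range(1, len(shocks))]


def simulate_ar1(phi, shocks):
    """AR(1): y_t = phi * y_{t-1} + e_t (linear function of the last observation)."""
    y = [shocks[0]]
    for shock in shocks[1:]:
        y.append(phi * y[-1] + shock)
    return y


def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    num = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return num / denom


shocks = white_noise(5000, seed=0)
ma1 = simulate_ma1(0.6, shocks)
ar1 = simulate_ar1(0.7, shocks)

for name, series in [("white noise", shocks), ("MA(1)", ma1), ("AR(1)", ar1)]:
    lags = [round(autocorrelation(series, k), 2) for k in (1, 2, 3)]
    print(name, lags)
```

In theory the white noise has no autocorrelation at any lag, the MA(1) has it only at lag 1 (theta / (1 + theta^2)), and the AR(1) shows autocorrelations that decay geometrically (phi, phi^2, ...); the simulated values should be close to these, up to sampling error.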
4. PREDICTION
To get started with the prediction of the variables, I must first introduce the
prediction process theoretically and explain briefly how I choose the best
model for each variable, and how I calculate and interpret all the results in EViews.
I split the whole prediction process into 4 substages, which I believe is
the best and most practical way to do it.
·Identification·
First, I must identify the model which best fits the series, so the first part is to see if the
series has only a regular part or also has a seasonal part.
As I have learned in my process of becoming an econometrics specialist, if I have
non-stationary data, it would in principle not be possible to work with the MA and AR
models. However, I have to keep in mind that the consequences of using non-stationary series
to forecast in the short run are not relevant in econometric terms: the
consequences of non-stationarity are only important when the number of
observations tends to infinity (∞).
In this situation, with COVID-19, I have a relatively small sample to work with, so
non-stationarity is not going to have relevant consequences for me in the
prediction process.
Within the stage of identifying the correct model, I also have 4 steps.
-First step: I have to study whether the variable (series) is stationary in terms of variance.
This step is very simple: I just have to check the graph of the series and see if the variable
follows a constant trend and if it is homogeneous.
If the graph is homogeneous and follows a constant trend (upward or downward), the
variable is stationary in terms of variance.
Otherwise, if the graph has a marked positive trend that is
not homogeneous (for example, it is constant at first but then starts to
increase in the last part of the graph), then it is heterogeneous.
So, if the graph of the series is not homogeneous, I am going to apply logarithms to the
variable.
With a confidence level of 95%, and hence a significance level of 5%, I am going to
reject or not reject the Null Hypothesis.
If I cannot reject the Null Hypothesis that the variable has a unit root, it means
that the variable is not stationary in terms of mean.
If the variable has a unit root, and as I want the variable to be stationary with
respect to the mean, I need to transform the series to make it stationary with
respect to the mean.
To do so, I need to take regular
differences. Why regular differences?
As I have already taken logarithms of the series to stabilize the variance, taking one
regular difference helps to remove the trend and to transform the series into one with a more
representative mean, which is around 0.
1st difference: Difference between the value of a random variable and its 1-period
lagged value.
I am going to take that first difference and then check again whether the series still has a unit
root. So, once I have taken logarithms and one regular difference of the variable, I ask
EViews for the Augmented Dickey-Fuller test and I set up a hypothesis test with:
·Null Hypothesis: The variable has a unit root
·Alternative Hypothesis: The variable does not have a unit root
If the Augmented Dickey-Fuller test shows that the variable still has a unit root, I
must take a second regular difference and repeat the test. Usually it is not
necessary to take more than 2 regular differences.
After these steps, my series is stationary with respect to the variance and with respect to
the mean.
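The two transformations described in these steps, logarithms and regular differences, can be sketched as follows. The series is a hypothetical one growing 20% per day, not the report's data, and the helper names are assumptions made for this illustration. The unit root test itself is not implemented here.

```python
import math


def log_transform(series):
    """Natural logarithm of each value (requires strictly positive data)."""
    if any(value <= 0 for value in series):
        raise ValueError("logarithms need strictly positive values")
    return [math.log(value) for value in series]


def difference(series, order=1):
    """Regular differences: y_t - y_{t-1}, applied `order` times."""
    for _ in range(order):
        series = [later - earlier for earlier, later in zip(series, series[1:])]
    return series


# Hypothetical cumulative cases growing 20% per day (illustrative only)
cases = [100 * 1.2 ** t for t in range(10)]

d_log = difference(log_transform(cases), order=1)   # analogue of d(log(series),1)
d2_log = difference(log_transform(cases), order=2)  # second regular difference

print([round(value, 4) for value in d_log])
```

For a purely exponential series the first difference of the logarithms is constant and equal to the log growth rate, so a second difference is already around 0. Observations equal to 0, such as the first days of a sample, cannot be log-transformed and would need to be dropped or handled separately.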
-Third step: Identify the best model for our data series. I am now going to look for the
best model for our series, and this theoretical comparison table is very useful for
choosing it.
MODEL         STATIONARITY   INVERTIBILITY   A.C.F   P.A.C.F
MA(1)         YES            CHECK           1       ∞
MA(2)         YES            CHECK           2       ∞
AR(1)         CHECK          YES             ∞       1
AR(2)         CHECK          YES             ∞       2
ARMA(p,q)     CHECK          CHECK           ∞       ∞
WHITE NOISE   YES            YES             0       0
I only have to ask EViews for the correlogram of the series in logarithms and with
the first or second (if necessary) regular difference.
The correlogram gives me the significant values of the ACF and PACF, which are the
ones that exceed the significance bands. So, I just have to look at the correlogram and,
with the table, choose the best model for our series.
With this last step, I have finished the identification substage, so now I have the
2 or 3 models I think are going to fit the data best, and I can check them
through the diagnosis process.
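The reading of the correlogram can be sketched in code. This is a hedged illustration, not the EViews procedure: it assumes approximate 95% significance bands of ±1.96/√n, computes the PACF with the Durbin-Levinson recursion, and uses the theoretical ACF of an AR(1) with phi = 0.5 and a hypothetical sample size of 100.

```python
import math


def sample_acf(series, max_lag):
    """Sample autocorrelations for lags 1 .. max_lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    return [
        sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)) / denom
        for lag in range(1, max_lag + 1)
    ]


def pacf_from_acf(acf):
    """Partial autocorrelations via Durbin-Levinson; acf[k-1] is the lag-k value."""
    pacf, prev = [], []
    for k in range(1, len(acf) + 1):
        if k == 1:
            phi_kk = acf[0]
            curr = [phi_kk]
        else:
            num = acf[k - 1] - sum(prev[j] * acf[k - 2 - j] for j in range(k - 1))
            den = 1 - sum(prev[j] * acf[j] for j in range(k - 1))
            phi_kk = num / den
            curr = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        pacf.append(phi_kk)
        prev = curr
    return pacf


def significant_lags(values, n_obs):
    """Lags whose coefficient lies outside the approximate 95% bands."""
    band = 1.96 / math.sqrt(n_obs)
    return [lag for lag, value in enumerate(values, start=1) if abs(value) > band]


# Theoretical ACF of an AR(1) with phi = 0.5 (hypothetical example)
acf = [0.5 ** k for k in range(1, 6)]
pacf = pacf_from_acf(acf)

print("significant ACF lags: ", significant_lags(acf, n_obs=100))
print("significant PACF lags:", significant_lags(pacf, n_obs=100))
```

Here the PACF has a single significant lag while the ACF decays gradually, which corresponds to the AR(1) row of the table. With real data the function `sample_acf` would supply the ACF values, and the pattern is rarely this clean.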
·Estimation·
This substage is very simple and quick: I just ask EViews to estimate the equation
for the series in logarithms and with the necessary regular differences,
and I include in the equation the model I think predicts the data best.
Estimate: Specific value of the estimator based on sample information. From each
estimation I am going to obtain an estimate.
Estimation: Branch of statistical inference that aims to calculate the parameters of a
population model based on sample information.
For example, d(log(series),1) c AR(1)
I only include the constant in the equation if it is significant; if it is not, I can exclude it.
Once I have estimated all my alternative models for the data, I compare
them with each other in the diagnosis substage. This diagnosis is going to show me which
model is the best for my data series.
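To give an idea of what estimating an AR(1) with a constant involves, here is a minimal sketch that regresses the transformed series on its own first lag by ordinary least squares. This is an assumption made for illustration: EViews may use a different estimator for a specification such as `c AR(1)`, so the numbers need not coincide. The input series is an artificial one that follows y_t = 1 + 0.5 * y_{t-1} exactly.

```python
def fit_ar1_ols(y):
    """Least squares fit of y_t = intercept + phi * y_{t-1} + error."""
    lagged, current = y[:-1], y[1:]
    n = len(current)
    mean_lagged = sum(lagged) / n
    mean_current = sum(current) / n
    s_xx = sum((x - mean_lagged) ** 2 for x in lagged)
    s_xy = sum((x - mean_lagged) * (z - mean_current) for x, z in zip(lagged, current))
    phi = s_xy / s_xx
    intercept = mean_current - phi * mean_lagged
    return intercept, phi


# Artificial series (hypothetical): y_t = 1 + 0.5 * y_{t-1}, starting at 0
y = [0.0]
for _ in range(10):
    y.append(1 + 0.5 * y[-1])

intercept, phi = fit_ar1_ols(y)
long_run_mean = intercept / (1 - phi)   # mean the series converges to

print(round(intercept, 6), round(phi, 6), round(long_run_mean, 6))
```

In practice `y` would be the series after taking logarithms and regular differences. The intercept of this regression equals the process mean multiplied by (1 - phi), so it should not be compared directly with a constant reported as the mean of the series.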
·Diagnosis·
First, I have different methods to check the model. They are:
-Significant parameters: As I know, the MA, AR and ARMA models each have their own
parameters: in the case of the Moving Average Model (MA), the parameter theta (θ);
for the Autoregressive Model (AR), the parameter phi (φ); and for the
Autoregressive Moving Average Model (ARMA), both parameters are considered.
These parameters are supposed to be significant. If they are, the model will predict
the data series well. If not, the prediction is not going to be as precise as I want it to be.
I set up a hypothesis test for each parameter where:
·Null Hypothesis: The parameter is equal to zero (it is not significant)
·Alternative Hypothesis: The parameter is different from zero (it is significant)
The parameter is significant if its p-value is below 0.05.
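The significance check can be sketched in Python as below. This is a rough illustration that uses the normal approximation for the p-value instead of the exact t distribution reported by EViews, so small samples would give slightly different numbers. The function names are my own.

```python
import math

def two_sided_p_value(coef, std_err):
    """Approximate two-sided p-value for H0: coefficient = 0 (normal approximation)."""
    t_stat = coef / std_err
    return math.erfc(abs(t_stat) / math.sqrt(2))

def is_significant(coef, std_err, alpha=0.05):
    """True when the null of a zero coefficient is rejected at level alpha."""
    return two_sided_p_value(coef, std_err) < alpha
```

For example, a coefficient of 0.8 with a standard error of 0.2 is significant under this rule, while 0.1 with the same standard error is not.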
-The residuals are white noise: Another important check in the diagnosis substage is
that the residuals of the model are white noise. I can check that assumption in two
ways. First, I look at the correlogram of the residuals and check whether any value of the ACF
and PACF surpasses the significance bands. Second, I can check the Q-stat, whose p-value has to be
greater than 0.05 in order to confirm that the residuals are white noise.
Q-stat: Test to assess the joint statistical significance of several autocorrelation
coefficients.
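As an illustration of the Q-stat, the following Python sketch computes the Ljung-Box statistic Q = n(n+2) Σ r_k²/(n−k) and its p-value from a chi-square distribution. The function names are hypothetical, and the `n_params` argument (subtracting the number of estimated ARMA terms from the degrees of freedom) is an assumption about how the p-value should be adjusted for model residuals.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square with integer df (incomplete-gamma recurrence)."""
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q

def ljung_box(residuals, max_lag, n_params=0):
    """Return (Q statistic, p-value) for the first max_lag autocorrelations."""
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [e - mean for e in residuals]
    c0 = sum(d * d for d in dev)
    q = 0.0
    for k in range(1, max_lag + 1):
        r_k = sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        q += r_k ** 2 / (n - k)
    q *= n * (n + 2)
    return q, chi2_sf(q, max_lag - n_params)
```

A p-value above 0.05 means the joint null of no autocorrelation is not rejected, which is the result I want for the residuals.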
-Stationarity and Invertibility
·Stationarity: A stationary time series is one whose statistical properties, such as mean,
variance and autocorrelation, are all constant over time. It is the property to check in AR models.
·Invertibility: Property of an MA model that guarantees an equivalent AR
representation in which the present is a function of past information.
I must check if my MA models are invertible, because I know that an MA model is
always stationary.
And I must check if my AR models are stationary, because I know that an AR
model is always invertible.
So, to check if my models are stationary or invertible, I am going to look at the
“Inverted MA/AR Roots” and check whether the modulus of every root is
lower than 1:
·If all the inverted roots are lower than 1 in modulus, the AR model is stationary (or the MA model is invertible)
·If any inverted root is equal to or higher than 1 in modulus, the condition fails
So, I check the inverted roots and decide if the models are stationary or invertible.
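For a model of order 2, the check on the inverted roots can be reproduced by hand. The sketch below is a minimal Python illustration under the assumption that the AR(2) is written as y_t = phi1·y_{t-1} + phi2·y_{t-2} + e_t, so that the inverted roots solve z² − phi1·z − phi2 = 0. The function names are my own.

```python
import cmath

def inverted_roots_ar2(phi1, phi2):
    """Roots of z^2 - phi1*z - phi2 = 0 for y_t = phi1*y_{t-1} + phi2*y_{t-2} + e_t."""
    disc = cmath.sqrt(phi1 ** 2 + 4 * phi2)
    return (phi1 + disc) / 2, (phi1 - disc) / 2

def all_inside_unit_circle(roots):
    """True when every root has modulus strictly lower than 1."""
    return all(abs(z) < 1 for z in roots)
```

When the roots are complex, `abs(z)` returns the modulus, which is the quantity to compare with 1. Assuming the MA(2) is written as y_t = e_t + theta1·e_{t-1} + theta2·e_{t-2}, the same function can be called with `(-theta1, -theta2)` to check invertibility.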
-Residuals normality: The last step in the diagnosis process is to check if the residuals
are distributed as a normal random variable. I must look at the Jarque-Bera test.
·Jarque-Bera: The Jarque-Bera is a goodness-of-fit test of whether sample data have
the skewness and kurtosis matching a normal distribution.
I ask EViews for the histogram of the residuals of the model and I obtain the Jarque-Bera
test. Once I have it, I set up a hypothesis test:
·Null hypothesis: The residuals follow a normal distribution
·Alternative hypothesis: The residuals do not follow a normal distribution
I do not reject that the residuals follow a normal distribution if the p-value of the
Jarque-Bera test is greater than 0.05.
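The Jarque-Bera statistic can be written as JB = n/6 · (S² + (K − 3)²/4), where S is the skewness and K the kurtosis of the residuals, and it is compared with a chi-square with 2 degrees of freedom. A minimal Python sketch, with a hypothetical function name and without any small-sample correction, could be:

```python
import math

def jarque_bera(residuals):
    """Return (JB statistic, p-value) using sample skewness and kurtosis."""
    n = len(residuals)
    mean = sum(residuals) / n
    m2 = sum((e - mean) ** 2 for e in residuals) / n
    m3 = sum((e - mean) ** 3 for e in residuals) / n
    m4 = sum((e - mean) ** 4 for e in residuals) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)
    # For a chi-square with 2 degrees of freedom the upper tail is exp(-x/2).
    return jb, math.exp(-jb / 2)
```

A p-value below 0.05 leads to rejecting the null hypothesis of normal residuals.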
-Akaike and Schwarz criteria: If the diagnosis of the alternative models does not
make clear which one is the best to predict the data series, I can appeal to the Akaike
and Schwarz criteria.
·Information criteria (AIC, SIC): Measures to select the best time series model by
minimizing the residual variance while applying a penalty function to compensate for
irrelevant regressors. The model with the lower value is preferred.
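As a reference for how these criteria trade off fit and penalty, here is a small Python sketch. It assumes the per-observation definitions AIC = −2ℓ/n + 2k/n and SIC = −2ℓ/n + k·ln(n)/n with a Gaussian log-likelihood ℓ, which I believe is the form EViews reports; other programs use versions that are not divided by n, so the values are only comparable within one convention.

```python
import math

def gaussian_loglik(ssr, n):
    """Gaussian log-likelihood of a regression with sum of squared residuals ssr."""
    return -n / 2 * (1 + math.log(2 * math.pi) + math.log(ssr / n))

def aic(loglik, k, n):
    """Akaike criterion per observation: fit term plus penalty 2k/n."""
    return -2 * loglik / n + 2 * k / n

def sic(loglik, k, n):
    """Schwarz criterion per observation: fit term plus penalty k*ln(n)/n."""
    return -2 * loglik / n + k * math.log(n) / n
```

Here k is the number of estimated parameters and n the number of observations; the Schwarz penalty is the heavier one once n is larger than about 7.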
If I take logarithms of the series, the series becomes more stationary with respect to the
variance. I can also see whether the series needs logarithms by looking at the
correlogram of the series.
As the Dickey-Fuller test shows me that the variable has a unit root, I must take one regular
difference to make the series stationary with respect to the mean.
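The logic of the unit root test can be sketched in Python as below. This is the simple Dickey-Fuller regression with a constant and without the lagged differences that the augmented version adds, so it is an illustration rather than what EViews computes. The critical value of −2.86 is the usual asymptotic 5% value for the case with a constant and no trend; the function name and the simulated data are my own.

```python
import math
import random

def dickey_fuller_t(y):
    """t-statistic of gamma in: diff(y)_t = alpha + gamma * y_{t-1} + e_t."""
    x = y[:-1]
    dy = [y[t] - y[t - 1] for t in range(1, len(y))]
    n = len(dy)
    mx, md = sum(x) / n, sum(dy) / n
    sxx = sum((a - mx) ** 2 for a in x)
    gamma = sum((a - mx) * (d - md) for a, d in zip(x, dy)) / sxx
    alpha = md - gamma * mx
    ssr = sum((d - alpha - gamma * a) ** 2 for a, d in zip(x, dy))
    se_gamma = math.sqrt(ssr / (n - 2) / sxx)
    return gamma / se_gamma

CRITICAL_5PCT = -2.86  # asymptotic 5% value, constant and no trend

# Simulated white noise is stationary, so the unit root should be rejected.
random.seed(0)
white_noise = [random.gauss(0, 1) for _ in range(200)]
t_stat = dickey_fuller_t(white_noise)
rejects_unit_root = t_stat < CRITICAL_5PCT
```

The null hypothesis is that the series has a unit root, so a statistic above the critical value (less negative) means I do not reject it and I take a regular difference.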
I can see that once I took logarithms of the series, it became a more
homogeneous series with a constant tendency.
Stationarity with respect to the mean
Analyzing both the correlogram and the Dickey-Fuller test, I cannot reject the null hypothesis of the
Dickey-Fuller test, so I can confirm that the variable has a unit root and that it needs a
regular difference to be stationary with respect to the mean.
To identify the model, I check the correlogram of the series in logarithms and
with one regular difference.
A series whose regular difference is white noise is predicted as a Random Walk, where the predictions are going
to be equal to the last available value of the sample.
Obviously, the forecast error is going to increase the longer the prediction horizon.
Forecast Error: Difference between the realized value of the variable of interest and its
prediction.
I can see that, after taking logarithms of the series, it is more homogeneous and has a
constant positive tendency.
Looking at the correlogram of the series in logarithms and the Augmented
Dickey-Fuller test, I cannot reject the Null Hypothesis, which confirms that the variable has
a unit root, so I am going to take one regular difference and check the Dickey-Fuller
statistic again.
After checking that the variable does not need more regular differences, I can go to the
identification of the models, looking at the correlogram of the logged series with one
regular difference.
As with the last 2 sample periods, it is normal that the first predictions of each variable
are white noise because the samples are formed with few observations.
The prediction for a white noise is the Random Walk, where the prediction for the next
5 days is the last observation available in the sample.
DATE        REAL VALUE   PREDICTION   ERROR
3/31/2020   95923        87956        9%
4/1/2020    104118       87956        18%
4/2/2020    112065       87956        27%
4/3/2020    119199       87956        36%
4/4/2020    126168       87956        43%
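The figures in the table can be reproduced with a few lines of Python. This is a minimal sketch with hypothetical function names; it assumes that the percentage error is measured relative to the prediction, which appears to be the convention behind the table because it reproduces its figures.

```python
def random_walk_forecast(sample, horizon):
    """Every forecast equals the last available observation of the sample."""
    return [sample[-1]] * horizon

def pct_errors(actual, predicted):
    """Forecast error as a percentage of the predicted value."""
    return [100 * (a - p) / p for a, p in zip(actual, predicted)]

last_observation = 87956  # last value of the sample, taken from the table
actual = [95923, 104118, 112065, 119199, 126168]
forecast = random_walk_forecast([last_observation], len(actual))
errors = [round(e) for e in pct_errors(actual, forecast)]
```

Because the forecast stays flat while the real series keeps growing, the error rises with every additional day of the horizon.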
An interesting point is that, if I compare the 3 predictions of the
variable contagions made with the 3 different sample periods, I can see that the error of the
prediction decreases the longer the sample period is.
As I can observe in the series graph, the series is not homogeneous, so I
applied logarithms to the series and now the tendency is more homogeneous and
constant.
Stationarity with respect to the mean
Looking at the correlogram of the logged series, I can see that the series needs a regular
difference, because the correlogram has an infinitely decreasing ACF and only one
very high value of the PACF. The Augmented Dickey-Fuller statistic confirms that the
series needs one regular difference because it has one unit root.
MA(2) DIAGNOSIS
Analyzing the stationarity and the invertibility, I know that MA models are always
stationary, so what I am going to check is whether the model is invertible.
Both Inverted MA Roots are lower than one. Even though the roots are complex numbers, I
calculated their modulus and it is lower than 1, so the MA(2) is stationary and invertible.
Residuals
Checking the residuals, I must check if they are white noise and if they follow a
normal distribution.
The residuals are white noise because neither the ACF nor the PACF values surpass the
significance bands in the correlogram.
The residuals are white noise also because the p-values of the Q-stat are all greater than 0.05.
The residuals of the model do not follow a normal distribution because I have to reject
the Null Hypothesis, as the p-value of the Jarque-Bera test is lower than 0.05.
AR(2) DIAGNOSIS
Residuals
The residuals should be white noise and follow a normal distribution in order to pass
the diagnosis process.
The residuals are white noise because neither the ACF nor the PACF values surpass the
significance bands and also because the p-values of the Q-stat are all greater than 0.05.
However, the residuals do not follow a normal distribution because the p-value of the
Jarque-Bera test is lower than 0.05.
DIAGNOSIS COMPARISON
                            MA(2)   AR(2)
SIGNIFICANT PARAMETERS      NO      YES
STATIONARY AND INVERTIBLE   YES     YES
RESIDUALS WHITE NOISE       YES     YES
NORMALITY                   NO      NO
As both models are very similar in the diagnosis, except that the AR(2) has significant
parameters, I am also going to use the information criteria.
The AR(2) has lower values in the Akaike Info Criterion and in the Schwarz Criterion, so I
am going to choose the AR(2) model to predict the sample.
Prediction
The predictions are more or less precise in the first 2 days, but the further ahead the day I
want to predict, the greater the forecast error.
Interval forecast: Collection of forecasts enclosed between a lower and upper bound.
4/5/2020 (93589.88,202358.21)
4/6/2020 (77106.93,269210.03)
4/7/2020 (37582.31,387264.94)
4/8/2020 (-13499.90,534145.08)
4/9/2020 (-95763.98,747816.85)
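To see how the uncertainty grows with the horizon, the intervals above can be summarised by their midpoint and half-width. The Python sketch below assumes that each interval is symmetric around the point forecast (forecast plus or minus a multiple of its standard error), which is an assumption on my side; under it, the midpoint recovers the point forecast. The function name is hypothetical.

```python
# Interval forecasts copied from the list above: date -> (lower, upper).
intervals = {
    "4/5/2020": (93589.88, 202358.21),
    "4/6/2020": (77106.93, 269210.03),
    "4/7/2020": (37582.31, 387264.94),
    "4/8/2020": (-13499.90, 534145.08),
    "4/9/2020": (-95763.98, 747816.85),
}

def midpoint_and_half_width(lower, upper):
    """Centre of the interval and distance from the centre to each bound."""
    return (lower + upper) / 2, (upper - lower) / 2

summary = {date: midpoint_and_half_width(*bounds) for date, bounds in intervals.items()}
half_widths = [half_width for _, half_width in summary.values()]
```

The half-widths increase at every step, which is the numerical counterpart of the statement that the forecast error grows with the horizon.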
I logged the series in order to make it stationary with respect to the variance.
Stationarity with respect to the mean
Therefore, as the correlogram of the logged series and the Augmented Dickey-Fuller
test confirm, the series has a unit root, so I do not reject the Null Hypothesis and I
take one regular difference of the series.
Identification
I identify the proper model in the correlogram of the logged series with one
regular difference.
MA(3) DIAGNOSIS
The residuals are white noise because, first, neither the ACF nor the PACF values surpass
the significance bands and also because all the p-values of the Q-stat are greater than
0.05.
The residuals do not follow a normal distribution, because the histogram of the
residuals shows me that the p-value of the Jarque-Bera test is lower than 0.05.
AR(2) DIAGNOSIS
The residuals are white noise because neither the ACF nor the PACF values surpass the
significance bands of the correlogram, and also because the p-values of the Q-stat are all
greater than 0.05.
The normality check in the histogram of the residuals shows me that the p-value of the
Jarque-Bera test is lower than 0.05, so the residuals do not follow a normal distribution.
DIAGNOSIS COMPARISON
                            MA(3)   AR(2)
SIGNIFICANT PARAMETERS      NO      YES
STATIONARY AND INVERTIBLE   YES     YES
RESIDUALS WHITE NOISE       YES     YES
NORMALITY                   NO      NO
As the MA(3) does not have any significant parameter, I am going to choose the AR(2) to predict
the sample.
Prediction
As in the last prediction, the forecast error is larger the further ahead the prediction date is.
Forecast interval
4/10/2020 (112909.02,234223.03)
4/11/2020 (92596.99,2988484.93)
4/12/2020 (45345.15,417606.50)
4/13/2020 (-10544.42,556957.37)
4/14/2020 (-99868.66,762292.25)
4/15/2020 (-217967.92,1019750.63)
I am going to log the series in order to reduce the variance and transform the series
with the objective of creating a more homogeneous tendency.
Stationarity with respect to the mean
I ask EViews for the correlogram and the Augmented Dickey-Fuller test for the logged
series.
The Augmented Dickey-Fuller and the correlogram confirm that the logged series has a
unit root and needs one regular difference in order to transform the series and make
the mean more representative over the sample.
Identification
MA(5) DIAGNOSIS
The estimated equation of the MA(5) is clearly not going to be a good model to predict my
sample, because none of the theta parameters is significant, so I am not going to continue the
diagnosis with MA(5).
AR(2) DIAGNOSIS
The residuals are white noise, and I have two ways to know it. First, the
correlogram of the residuals does not have any ACF or PACF value that surpasses the
significance bands. Second, I look at the p-values of the Q-stat, and almost all of
them are greater than 0.05.
The histogram of the residuals shows me if the residuals follow a normal distribution.
In this case, the p-value of the Jarque-Bera test is not greater than 0.05, so I have to reject the Null
Hypothesis and the residuals do not follow a normal distribution.
AR(3) DIAGNOSIS
The residuals are white noise, since I focus on the most important values, which are the
first 4-5 lags. The p-values of the Q-stat confirm this.
The residuals do not follow a normal distribution, as the p-value of the Jarque-Bera test is lower
than 0.05.
DIAGNOSIS COMPARISON
AR(2) AR(3)
As both models pass the diagnosis with very similar results, I am going to appeal
to the Information Criteria.
The Akaike Info Criterion and the Schwarz Criterion are lower in the AR(3) estimation,
so I choose AR(3) to predict the sample.
In the last prediction for the variable contagions, I can observe that the forecast error
in % terms is very small in relation to the first prediction for
the same variable.
4/16/2020 (126472.87,246260.32)
4/17/2020 (99117.33,293930.10)
4/18/2020 (50419.66,36992.54)
4/19/2020 (-16885.22,470738.47)
4/20/2020 (-99597.14,593761.81)
6. RECOVERED
The correlogram with the ACF and PACF and also the Augmented Dickey-Fuller test confirm
that the series has a unit root, so I cannot reject the Null Hypothesis and the series
needs a regular difference.
Identification
Prediction
The white noise is going to be predicted as a Random Walk, where the predictions
are equal to the last available observation of the sample.
DATE        REAL VALUE   PREDICTION   ERROR
3/21/2020   2125         1588         34%
3/22/2020   2575         1588         62%
3/23/2020   2575         1588         62%
3/24/2020   3794         1588         139%
3/25/2020   5367         1588         238%
I log the series to obtain a more constant tendency and also because the series
graph is not homogeneous.
Both the correlogram and the Augmented Dickey-Fuller test reveal that the logged series has a unit root, so the series
needs a regular difference.
Identification
The white noise is predicted with a process called Random Walk, where the
predictions are going to be equal to the last available observation of
the sample.
Prediction
I take logarithms of the series in order to make the tendency more constant
and homogeneous.
Stationarity with respect to the mean
Both the correlogram and the Augmented Dickey-Fuller statistic reveal that the series
needs at least one regular difference.
I know this because, in the correlogram of the logged series, the ACF values are
infinitely decreasing and the PACF has only one high value.
Identification
Prediction
The prediction with a white noise is made with a process called Random Walk, in which the
predictions are going to be equal to the last available observation of the sample.
As the series graph has a non-constant tendency, I log the series to obtain a more
homogeneous series with a positive constant tendency.
The correlogram of the logged series and the Augmented Dickey-Fuller test confirm that
the series has a unit root. Hence, I do not reject the Null Hypothesis and, in order to
make the mean more representative over the sample, I am going to take one regular
difference of the series.
Identification
DIAGNOSIS MA(1)
MA models are always stationary, so I must check if the MA(1) model is invertible in order
to fulfill the condition of the diagnosis.
I look at the Inverted MA Roots and I confirm that this MA model is invertible because the
value is lower than 1.
Residuals
The residuals are not white noise because the values of the ACF and PACF surpass the
significance bands.
The residuals are not distributed as a normal variable because the p-value of the Jarque-Bera test is lower than
0.05.
DIAGNOSIS AR(3)
The AR(3) model is always invertible, hence I have to check if it is stationary. As the three
Inverted AR Roots are lower than 1, the AR(3) is invertible and stationary.
Residuals
The residuals are white noise in the AR(3) model, because neither the ACF nor the PACF values
surpass the significance bands. Also, I know that the residuals are white noise because almost all the
p-values of the Q-stat are greater than 0.05.
DIAGNOSIS COMPARISON
                                 MA(1)   AR(3)
SIGNIFICANT PARAMETERS           NO      YES
STATIONARITY AND INVERTIBILITY   YES     YES
RESIDUALS WHITE NOISE            NO      YES
NORMALITY                        NO      NO
I am going to choose the AR(3) to predict our sample.
Prediction
As the model passes the diagnosis very well, the predictions in this case are very precise.
Forecast interval
4/5/2020 (36945.17,38668.06)
4/6/2020 (39147.42,43373.14)
4/7/2020 (40791.50,48346.55)
4/8/2020 (41884.26,53561.97)
4/9/2020 (42424.98,59002.40)
I log the series in order to make the tendency more
constant and more homogeneous.
Stationarity with respect to the mean
Both the correlogram of the logged series and the Augmented Dickey-Fuller statistic
reveal that the series has a unit root and hence needs one regular difference.
So, I cannot reject the null hypothesis yet.
Identification
I am going to test both models in the diagnosis and choose the best one for the
prediction.
DIAGNOSIS MA(1)
First, although the p-values of the Q-stat are all greater than 0.05, one value of
the ACF and one of the PACF are significant enough not to consider the
residuals white noise.
The residuals do not follow a normal distribution because the p-value of the Jarque-Bera test
is lower than 0.05.
DIAGNOSIS AR(3)
The residuals are white noise. The AR(3) residuals do not follow a normal distribution
because the p-value of the Jarque-Bera test is lower than 0.05.
DIAGNOSIS COMPARISON
                                 MA(1)   AR(3)
SIGNIFICANT PARAMETERS           NO      YES
STATIONARITY AND INVERTIBILITY   NO      YES
RESIDUALS WHITE NOISE            NO      YES
NORMALITY                        NO      NO
I am going to choose the AR(3) model.
Prediction
Forecast interval
4/10/2020 (54943.59,57409.92)
4/11/2020 (57253.66,62786.00)
4/12/2020 (59002.71,68375.14)
4/13/2020 (60215.84,74139.89)
4/14/2020 (60906.61,80055.85)
4/15/2020 (61083.11,86105.54)
So, the model used to predict this sample series was an ARIMA(3,1,0).
I log the series of the sample in order to have a more homogeneous and constant
tendency.
Both images, one from the correlogram and the other from the Augmented Dickey-Fuller
statistic, reveal that the series has one unit root and needs one regular
difference.
Identification
ARMA(1,1) DIAGNOSIS
Residuals
The residuals are not white noise because one value each of the ACF and the PACF surpasses
the significance bands.
The residuals of the ARMA(1,1) model do not follow a normal distribution because the
p-value of the Jarque-Bera test is not higher than 0.05.
DIAGNOSIS AR(3)
Residuals
The residuals are not white noise because some values of the ACF and PACF surpass the
significance bands, and also because the p-values of the Q-stat are all lower than 0.05.
The residuals do not follow a normal distribution since the p-value of the Jarque-Bera test is lower
than 0.05.
DIAGNOSIS COMPARISON
                                 ARMA(1,1)   AR(3)
SIGNIFICANT PARAMETERS           NO          YES
STATIONARITY AND INVERTIBILITY   NO          YES
RESIDUALS WHITE NOISE            NO          NO
NORMALITY                        NO          NO
Prediction
Forecast interval
4/16/2020 (72828.57,75154.79)
4/17/2020 (74343.89,79539.28)
4/18/2020 (75313.00,84087.26)
4/19/2020 (75760.45,88771.70)
4/20/2020 (75702.03,93575.37)
7. DEATHS
I log the original series with the objective of making the tendency more
constant and more homogeneous.
Stationarity with respect to the mean
Both the correlogram and the Augmented Dickey-Fuller statistic reveal that the series
has a unit root and needs at least one regular difference.
Identification
A white noise is predicted with a process called Random Walk,
in which the predictions are equal to the last available observation of the
sample.
DATE        REAL VALUE   PREDICTION   ERROR
3/21/2020   1045         830          26%
3/22/2020   1375         830          66%
3/23/2020   1772         830          113%
3/24/2020   2311         830          178%
3/25/2020   2908         830          250%
As I can see, if I apply logarithms to the series, the series now has a constant and
positive tendency, which is also more homogeneous.
Both the correlogram and the Augmented Dickey-Fuller statistic make clear that the series has
a unit root and needs at least one regular difference.
Identification
Prediction
I am going to make the rest of the predictions of the deaths variable in a different
way.
It is a different approach from the Box-Jenkins methodology, but the objective is exactly
the same: to obtain a precise prediction for a specific sample.
I am going to identify the tendency in my series graph, and then I am going to work with
the filtered series in order to isolate the tendency from the series.
Later, I am going to predict the variable without the tendency and I am also going to
predict the tendency alone.
So the global prediction is going to be the sum of both predictions, and it is going to
be interesting to look at the results.
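The idea of this alternative approach can be sketched in Python as follows. It is only a minimal illustration under one strong assumption: the tendency is taken to be a straight line in time fitted by OLS, while the equation actually estimated in EViews may have a different form. The function names are hypothetical, and the forecast of the filtered series is passed in as given, since it would come from the model identified for that series.

```python
def fit_linear_trend(y):
    """OLS intercept and slope of y_t = a + b*t, with t = 0, 1, 2, ..."""
    n = len(y)
    t = list(range(n))
    mt, my = sum(t) / n, sum(y) / n
    slope = (sum((a - mt) * (b - my) for a, b in zip(t, y))
             / sum((a - mt) ** 2 for a in t))
    return my - slope * mt, slope

def filtered_series(y):
    """Residuals of the trend equation: the series without its tendency."""
    intercept, slope = fit_linear_trend(y)
    return [v - (intercept + slope * i) for i, v in enumerate(y)]

def combined_forecast(y, filtered_forecast):
    """Global prediction = trend forecast + forecast of the filtered series."""
    intercept, slope = fit_linear_trend(y)
    n = len(y)
    trend_forecast = [intercept + slope * (n + h) for h in range(len(filtered_forecast))]
    return [tr + f for tr, f in zip(trend_forecast, filtered_forecast)]
```

The two parts are forecast separately and only added at the end, which mirrors the sum of both predictions described above.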
Sample range from 2/26/2020 to 3/30/2020
Series graph and the filtered series graph
Both graphs come from the same series; the only difference is that the left
graph is the original series and the right graph is the filtered series.
So, I am going to create an estimated equation with the tendency of the graph, and
from that equation I am going to obtain the residuals, which are the filtered series.
Therefore, I am going to identify the best model to predict that filtered series
and use it to predict the filtered series without the tendency.
First, I am going to predict the tendency with the following estimated equation
So, I review the correlogram of the filtered series, and I must check whether the filtered series
needs any regular differences. I check the correlogram
and the Augmented Dickey-Fuller statistic, and both confirm that the filtered series
needs at least one regular difference.
After taking one regular difference, the filtered series needs at least another regular difference.
I am just going to compare the diagnosis of both models, keeping the calculations out of the
document, because otherwise the research would be enormous.
The AR(1) model
Tendency forecast
Filtered prediction
So, as both you and I can see, the error of the predictions is very low, which demonstrates
that this method, predicting the tendency first and the filtered series later, is also
valuable and correct.
I check the correlogram of the filtered series in order to know whether the filtered series
has a unit root and whether it needs a regular difference.
After taking one regular difference, the filtered series still has a unit root and needs another regular
difference.
So, now I am going to compare each model to choose the best one to predict.
AR(1) MODEL
So, as both models are similar in the diagnosis, I choose the AR(1) model based on
the Information Criteria: both the Akaike Info Criterion and the Schwarz Criterion are lower in
the AR(1) than in the MA(1).
Tendency forecast
Filtered forecast
So, I review the correlogram of the filtered series, and I must check whether the filtered series
needs any regular differences.
I check the correlogram and the Augmented Dickey-Fuller statistic. The correlogram,
with infinite and decreasing ACF values and a single high PACF value, reveals that the
filtered series needs one regular difference, contrary to what the Augmented Dickey-Fuller
test said.
After taking one regular difference, the filtered series needs at least another regular difference.
Tendency forecast
Filtered forecast
So, I check the correlogram and the Augmented Dickey-Fuller statistic. The correlogram,
with infinite and decreasing ACF values and a single high PACF value, reveals that the
filtered series needs one regular difference, contrary to what the Augmented Dickey-Fuller
test said.
After taking one regular difference, the filtered series needs at least another regular
difference.
Tendency forecast
Filtered forecast
8. CONCLUSION
To sum up the research, I would say that COVID-19 is going to have
unpredictable political, economic and social consequences that will
affect the whole world.
Turning to the analysis of the data and the results of the calculations and predictions:
The variable contagions was first a white noise because of the small size of the sample,
but later the bigger samples were predicted with AR(2), AR(2) and AR(3) in the successive
predictions. The Autoregressive model was the main character in the contagions
scenario, and it is honestly the most reliable and precise model to predict the data,
because the Moving Average models do not represent the data as well as is necessary for
the predictions.
The variable recovered was also a white noise at first, as was the deaths
variable, because there were not enough observations to fit a model.
The autoregressive model was again the protagonist here, with two AR(3) and one AR(2).
The Moving Average in this variable was better than in the contagions case, but not
sufficiently to predict a series well.
If we talk about deaths, it is curious because, while in the case of contagions and
recovered the first three predictions were white noise, in the case of deaths
only the first two predictions are white noise.
The variables contagions and recovered cases were predicted with the Box-Jenkins
methodology. However, I tried to implement a different methodology for the last 4
predictions of the deaths variable.
This methodology splits each prediction into 2 different predictions that add up to the total
result. The two predictions are one for the variable without the
tendency and another for the tendency of the series alone.
The results reveal that either methodology is correct if the process
is carried out properly.
To conclude, the main lesson learned from this research is
that it is very difficult to analyze and predict series that have few observations.
During the research, the prediction was better the longer the range of the sample. The
last predictions of the variables are better than the first ones.
An important thing to highlight about the research is that it is very important to carry out a
good diagnosis of the alternative models, because sometimes a model looks
suitable, but when you test the significance of the parameters or the residuals you realize
that maybe the model is not the best one to predict the sample.
I hope you like my research.
9. BIBLIOGRAPHY
-Investopedia Volatility
https://www.investopedia.com/terms/v/volatility.asp
-ScienceDirect
https://www.sciencedirect.com/topics/economics-econometrics-and-finance/econometric-model
-DeepAi
https://deepai.org/machine-learning-glossary-and-terms/white-noise
-Economipedia
https://economipedia.com/definiciones/logaritmos-en-econometria.html
-People.duke
https://people.duke.edu/~rnau/411diff.htm