A project report submitted for the Master's programme in Statistics

by

ROSS CYRIAC

Department of Statistics
Irinjalakuda

2021
CERTIFICATE

This is to certify that this project work has been carried out in Statistics in partial fulfillment of the requirements for the award of the Master's degree.

Irinjalakuda
05/08/2021

External Examiner:
DECLARATION

I hereby declare that this project has been composed by me under the guidance and supervision of my supervisor. I also declare that this project has not previously formed the basis for the award of any degree, diploma, associateship, fellowship etc. of any other university or institution.

Irinjalakuda
ACKNOWLEDGEMENT

This project would not have been possible without the guidance and the help of several individuals who contributed and extended their valuable assistance in the preparation and completion of this project, without which this project would not have materialized.

I would like to give my sincere thanks to my teachers for the inspiration, encouragement and technical help they bestowed upon me. I am indebted to the faculty of the department for sharing with me their knowledge base, for giving me a better perspective of the subject and for providing the necessary facilities. My sincere thanks are also due to the Librarian and the non-teaching staff of the department for the help and warmth I could enjoy from them. Last but not least, I am indebted to my family and friends for their constant support.
ROSS CYRIAC
Contents

1 INTRODUCTION
2 METHODOLOGY
  2.2.4 Advantages
3 ANALYSIS
  3.2 Correlation
4 CONCLUSION
REFERENCE
Chapter 1

INTRODUCTION

1.1 PYRAMID SCHEME
A pyramid scheme is a business model that recruits members through a promise of payments for enrolling others into the scheme, rather than from any real investment or sale of products. Each level of new investors makes payments to those above them and similarly receives payments from those below them. People in the upper layers of the pyramid typically gain profit, whereas those in the lower layers usually lose money. Since most of the members in the scheme are at the bottom layers, most participants in a pyramid scheme lose money.
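To see concretely why the bottom layers dominate, consider an idealized scheme in which every member recruits the same number of new members; the short Python sketch below counts the members in each layer (the recruitment factor and depth are illustrative assumptions, not values from this study):

```python
# Members per layer of an idealized pyramid scheme in which each member
# recruits m new members. The values of m and depth are illustrative
# assumptions, not quantities taken from the study's dataset.
def layer_sizes(m: int, depth: int) -> list[int]:
    """Number of members at each level, starting from a single founder."""
    return [m ** level for level in range(depth)]

sizes = layer_sizes(m=5, depth=6)   # [1, 5, 25, 125, 625, 3125]
total = sum(sizes)
print("members per layer:", sizes)
print(f"share in the bottom layer: {sizes[-1] / total:.1%}")   # about 80%
```

With a recruitment factor of 5 and six levels, roughly four out of five members sit in the bottom layer, which is why most participants lose money.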
1.2 REGRESSION ANALYSIS

Regression analysis is a set of statistical methods for estimating the relationship between a dependent variable and one or more independent variables. It can be used to analyse the strength of the relationship between the variables and also to model the relationship between them. There are several types of regression analysis, such as simple linear, multiple linear and nonlinear. The most commonly used models are simple linear and multiple linear. Nonlinear regression analysis is generally used for complicated datasets where the dependent and independent variables show a nonlinear relationship.
Many techniques for carrying out regression analysis have been developed. Familiar methods such as linear regression and ordinary least squares regression are parametric, in that the regression function is defined in terms of a finite number of unknown parameters that are estimated from the data.
Regression models involve the following components:

• The unknown parameters, denoted as β, which may represent a scalar or a vector.
• The independent variables, X.
• The dependent variable, Y.

A regression model relates Y to a function of X and β:
$$Y \approx f(X, \beta)$$
DATA DESCRIPTION
The secondary data for the study were collected from KAGGLE. The Pyra-
• cost price : it is the money that a person should pay while joining this
scheme.
• profit markup : it refers to the value that the person adds to the cost
price.
• depth of tree : it gives the length of each chain in the scheme. The
gain. Here in this study we take this variable as the dependent variable.
Chapter 2
METHODOLOGY
2.1 Simple Linear Regression

Mainly, there are two things that can be found out by the method of simple linear regression:

1. Whether there is a statistically significant relationship between two variables (for example, between a rise in temperature and the melting of glaciers).

2. The value of the dependent variable at a given value of the independent variable.
A model with a single regressor (independent variable) x that has a straight-line relationship with a response y is given by
$$y = \beta_0 + \beta_1 x + \varepsilon \qquad (2.1)$$
where the intercept β0 and the slope β1 are unknown constants and ε is a random error component. The parameters β0 and β1 are usually called the regression coefficients; the slope β1 is the change in the mean of y produced by a unit change in x, and the intercept β0 is the mean of y when x = 0.
The parameters β0 and β1 are unknown and must be estimated using sample data. Suppose that we have n pairs of data, say, (y1, x1), (y2, x2), ..., (yn, xn). The method of least squares is used to estimate β0 and β1. That is, we will estimate β0 and β1 so that the sum of the squares of the differences between the observations yi and the straight line is a minimum. From equation (2.1)
we may write,
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, 2, \dots, n \qquad (2.2)$$

The least-squares criterion is
$$S(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$
and the least-squares estimators of β0 and β1, say β̂0 and β̂1, must satisfy
$$\left. \frac{\partial S}{\partial \beta_0} \right|_{\hat\beta_0, \hat\beta_1} = -2 \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0
\;\Rightarrow\; \sum_{i=1}^{n} y_i = n\hat\beta_0 + \hat\beta_1 \sum_{i=1}^{n} x_i \qquad (2.3)$$
and
$$\left. \frac{\partial S}{\partial \beta_1} \right|_{\hat\beta_0, \hat\beta_1} = -2 \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i) x_i = 0
\;\Rightarrow\; \sum_{i=1}^{n} y_i x_i = \hat\beta_0 \sum_{i=1}^{n} x_i + \hat\beta_1 \sum_{i=1}^{n} x_i^2 \qquad (2.4)$$
Equations (2.3) and (2.4) are called the least-squares normal equations. Solving them gives
$$\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} \qquad (2.5)$$
$$\hat\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \qquad (2.6)$$
where $\bar{y}$ and $\bar{x}$ are the averages of the $y_i$ and $x_i$ respectively. Also, the numerator of equation (2.6) is the corrected sum of cross products of $x_i$ and $y_i$ and the denominator is the corrected sum of squares of $x_i$.
Therefore, β̂0 and β̂1 in equations (2.5) and (2.6) are the least-squares estimators of the intercept and slope. The fitted sample regression model is then
$$\hat{y} = \hat\beta_0 + \hat\beta_1 x$$
The difference between the observed value yi and the corresponding fitted value ŷi is called the residual, ei = yi − ŷi.
From equations (2.5) and (2.6) note that β̂0 and β̂1 are linear combinations of the observations yi.
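As an illustration, the estimators in equations (2.5) and (2.6) can be computed directly; below is a minimal NumPy sketch on made-up data (the arrays are illustrative, not from the study's dataset):

```python
import numpy as np

# Illustrative data, not values from the study's dataset.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Numerator of eq. (2.6): corrected sum of cross products of x and y.
Sxy = np.sum((x - x.mean()) * (y - y.mean()))
# Denominator of eq. (2.6): corrected sum of squares of x.
Sxx = np.sum((x - x.mean()) ** 2)

beta1_hat = Sxy / Sxx                          # slope, eq. (2.6)
beta0_hat = y.mean() - beta1_hat * x.mean()    # intercept, eq. (2.5)

residuals = y - (beta0_hat + beta1_hat * x)
print(beta0_hat, beta1_hat)
print(residuals.sum())   # numerically zero, a property of the fit
```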
Some useful properties of the least-squares fit are:

1. The least-squares estimators β̂0 and β̂1 are unbiased estimators of the model parameters β0 and β1.
4. The sum of the observed values yi equals the sum of the fitted values ŷi:
$$\sum_{i=1}^{n} y_i = \sum_{i=1}^{n} \hat{y}_i$$
5. The least-squares regression line always passes through the centroid (x̄, ȳ) of the data.

For the purpose of testing hypotheses about the regression coefficients, the errors εi are further assumed to be normally distributed.
2.1.4 Components of Simple Linear Regression Model
Residuals : Residuals are the difference between the actual observed response values and the response values that the model predicted. The Residuals section of the model output breaks them down into five summary points: minimum, first quartile, median, third quartile and maximum.

Coefficients : The model output gives two coefficients; the first one is the intercept and the second one is the slope. The intercept is the expected value of the dependent variable when the independent variable is zero. The slope is the change in the dependent variable over the change in the independent variable.

t-value and p-value : The t-value measures how many standard errors the coefficient estimate is away from zero, and the p-value is the probability of obtaining any value equal to or larger than |t|. The p-value tests whether or not there is a statistically significant relationship between the regressor and the response variable. The lower the p-value, the more significant the regressor variable. That is, if the p-value is less than the significance level (usually 0.05), we reject the null hypothesis that the regressor variable is not significant (no relationship between the regressor and the response).

Residual standard error : It is a measure of the quality of a linear regression fit. It is the average amount that the response will deviate from the true regression line. Theoretically, every linear model is assumed to contain an error term. Due to the presence of this error term, we are not capable of perfectly predicting the response variable from the predictor variable.
R-squared : The R-squared statistic gives a measure of how well the model fits the data. It is a measure of the linear relationship between the independent variable and the dependent variable. The value of R-squared always lies between 0 and 1; a number close to 0 represents a regression that does not explain the variance in the response variable well, and a number close to 1 represents a regression that does.

F-statistic : The F-statistic is an indicator of whether there is a relationship between the regressor and the response variables. The further the F-statistic is from 1, the better the relationship is. However, how much larger than 1 the F-statistic needs to be depends on both the number of data points and the number of predictors.

2.2 Multiple Linear Regression

Multiple linear regression is used when we want to predict the value of a variable based on the value of two or more other variables.
Multiple regression also allows us to determine the overall fit of the model and the relative contribution of each of the predictors to the total variance explained. The multiple linear model is essentially similar to the simple linear model, with the exception that multiple regressors are included.
The regression model that involves more than one regressor variable (independent variable) is called a multiple regression model. With k regressors it is given by
$$y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \varepsilon \qquad (2.7)$$
The parameter βj represents the expected change in the response y per unit change in xj when all the remaining regressors are held constant. For this reason the parameters βj, j = 1, 2, ..., k, are often called partial regression coefficients.
In matrix notation the multiple linear regression model is given by
$$Y = X\beta + \varepsilon$$
where
$$Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad
X = \begin{bmatrix} 1 & x_{11} & x_{12} & \dots & x_{1k} \\ 1 & x_{21} & x_{22} & \dots & x_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \dots & x_{nk} \end{bmatrix}, \quad
\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix}, \quad
\varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}$$
In general, Y is an n × 1 vector of the observations, X is an n × p matrix of the levels of the regressor variables (with p = k + 1), β is a p × 1 vector of the regression coefficients and ε is an n × 1 vector of random errors.
The method of least squares can be used to estimate the regression coefficients. That is, we find the vector of least-squares estimates β̂ that minimizes
$$S(\beta) = \sum_{i=1}^{n} \varepsilon_i^2 = \varepsilon'\varepsilon = (Y - X\beta)'(Y - X\beta) = Y'Y - 2\beta'X'Y + \beta'X'X\beta$$
The least-squares estimator β̂ must satisfy
$$\left. \frac{\partial S}{\partial \beta} \right|_{\hat\beta} = -2X'Y + 2X'X\hat\beta = 0$$
$$\Rightarrow (X'X)\hat\beta = X'Y \qquad (2.8)$$
Equation (2.8) is the set of least-squares normal equations for β̂, which can be solved by multiplying both sides by the inverse of X'X. Thus the least-squares estimator of β is $\hat\beta = (X'X)^{-1}X'Y$, provided that the inverse matrix $(X'X)^{-1}$ exists. The vector of fitted values is
$$\hat{Y} = X\hat\beta = X(X'X)^{-1}X'Y = PY$$
where the n × n matrix $P = X(X'X)^{-1}X'$, usually called the hat matrix, maps the vector of observed values into a vector of fitted values.
The difference between the vector of observed values Y and the corresponding vector of fitted values Ŷ is the residual vector
$$e = Y - \hat{Y} = Y - X\hat\beta = Y - PY = (I - P)Y$$
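The matrix computations above translate directly into NumPy; here is a minimal sketch on made-up arrays (the data are illustrative assumptions, not the study's dataset):

```python
import numpy as np

# Illustrative design: n = 5 observations, k = 2 regressors plus an intercept.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
Y = np.array([3.1, 3.9, 7.8, 8.2, 11.0])

# X is n x p with a leading column of ones, so p = k + 1 as in the text.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Solve the normal equations (X'X) beta_hat = X'Y directly;
# np.linalg.solve is numerically preferable to forming the inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

P = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
Y_fitted = P @ Y                       # equals X @ beta_hat
residuals = Y - Y_fitted               # equals (I - P) @ Y
print(beta_hat)
```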
2.2.2 Properties of Least Squares Estimates

Assuming that the errors are unbiased (that is, E(ε) = 0) and the columns of X are linearly independent, the least-squares estimator β̂ has the following properties.

1. Unbiasedness: since E[Y] = E[Xβ + ε] = Xβ,
$$E[\hat\beta] = E[(X'X)^{-1}X'Y] = (X'X)^{-1}X'E[Y] = (X'X)^{-1}X'X\beta = \beta$$
2. Variance: assume that the εi's are uncorrelated and have the same variance, that is, var(ε) = σ²Iₙ. Then var(Y) = var(ε) = σ²Iₙ and
$$var(\hat\beta) = (X'X)^{-1}X'\,var(Y)\,X(X'X)^{-1} = \sigma^2 (X'X)^{-1}X'X(X'X)^{-1} = \sigma^2 (X'X)^{-1}$$
2.2.3 Generalized Least Squares

Suppose now that var(ε) = σ²V, where V is a known n × n positive definite matrix, so that the errors need not have equal variances. Since V is positive definite, we can write V = KK' for some nonsingular matrix K. Transforming the model Y = Xβ + ε by K⁻¹ gives
$$Z = B\beta + \eta$$
where Z = K⁻¹Y, B = K⁻¹X and η = K⁻¹ε. The transformed errors satisfy
$$E(\eta) = E(K^{-1}\varepsilon) = K^{-1}E(\varepsilon) = 0$$
$$var(\eta) = var(K^{-1}\varepsilon) = K^{-1}\,var(\varepsilon)\,(K^{-1})' = \sigma^2 K^{-1}V(K^{-1})' = \sigma^2 K^{-1}KK'(K^{-1})' = \sigma^2 I_n$$
Minimizing the least-squares function η'η with respect to β, we get the generalized least-squares estimator
$$\beta^* = (B'B)^{-1}B'Z = [X'(K^{-1})'K^{-1}X]^{-1}X'(K^{-1})'K^{-1}Y = (X'V^{-1}X)^{-1}X'V^{-1}Y$$
since $(K^{-1})'K^{-1} = (KK')^{-1} = V^{-1}$.
The estimator β* is unbiased:
$$E(\beta^*) = E[(X'V^{-1}X)^{-1}X'V^{-1}Y] = (X'V^{-1}X)^{-1}X'V^{-1}E(Y) = (X'V^{-1}X)^{-1}X'V^{-1}X\beta = \beta$$
and its dispersion matrix is
$$D(\beta^*) = var(\beta^*) = \sigma^2 (B'B)^{-1} = \sigma^2 [X'(K^{-1})'K^{-1}X]^{-1} = \sigma^2 (X'V^{-1}X)^{-1}$$
The generalized least-squares estimator is simply the ordinary least-squares estimator applied to the transformed model. Therefore β* has the same optimal properties, namely that a'β* is the best linear unbiased estimator of a'β.
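A small NumPy sketch of the generalized least-squares formula $\beta^* = (X'V^{-1}X)^{-1}X'V^{-1}Y$; the data and the covariance matrix V here are assumed for illustration, not taken from the study:

```python
import numpy as np

# Illustrative data and a known, positive definite error covariance V.
X = np.column_stack([np.ones(4), np.array([1.0, 2.0, 3.0, 4.0])])
Y = np.array([2.0, 4.1, 5.9, 8.2])
V = np.diag([1.0, 2.0, 4.0, 8.0])   # heteroscedastic errors: var(eps) = sigma^2 V

V_inv = np.linalg.inv(V)
# Generalized least-squares estimator: (X' V^-1 X)^-1 X' V^-1 Y.
beta_star = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ Y)

# With V = I the formula reduces to the ordinary least-squares estimator.
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_star, beta_ols)
```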
2.2.4 Advantages
There are two main advantages to analysing data using a multiple regression model.

Firstly, the ability to determine the relative influence of one or more predictor variables on the criterion value. For example, a real estate agent could find that the size of the homes and the number of bedrooms have a strong correlation to the price of a home, while the proximity to schools has no correlation at all, or even a negative correlation if it is primarily a retirement community.

Secondly, the ability to identify outliers or anomalies. For example, while reviewing data related to employee salaries, a human resources manager could find that the number of hours worked, the department size and its budget all had a strong correlation to salaries, while seniority did not.
2.3 Ordinary Least Squares Regression

Ordinary least squares (OLS) regression estimates the parameters of a linear model by minimizing the sum of squared residuals. In the simple case the model is
$$Y = \beta_0 + \beta_1 X + \varepsilon \qquad (2.9)$$
where β0 is the intercept, β1 measures the change of Y for a unit change of X, and ε is the error term.
There are mainly four properties of the ordinary least squares estimators:
1. Linearity : An estimator is said to be linear if it is a linear function of the values of the dependent variable. The OLS estimators are linear with respect to the values of the dependent variable only, and not necessarily with respect to the independent variables, X.

2. Unbiasedness : If we take out several samples and estimate the parameters from each, we will find that the mean of all the estimates from the samples will be equal to the actual values of the parameters. Unbiasedness is a good property for an estimator to have.

3. Efficiency : Among the class of linear unbiased estimators, the OLS estimators have the minimum variance.

4. Consistency : The OLS estimator is a consistent estimator since it satisfies both the conditions for consistency:
• it is asymptotically unbiased, and
• its variance converges to zero as the sample size increases.
In practice, people often tend to ignore the assumptions of OLS before interpreting the results. It is therefore useful to understand the summary statistics reported with an OLS fit.

R-squared : The higher the value of R-squared, the better the model fits the data. This statistic has a drawback: its value increases as the number of predictors (independent variables) increases.

Adjusted R-squared : The adjusted R-squared increases only when an additional variable adds to the explanatory power of the regression; it therefore gives a more reliable view of the correlation.
F-statistic and Prob(F-statistic) : The F-statistic is used to assess the significance level of all the variables together. The null hypothesis under this is H0 : all the regression coefficients are equal to zero.
AIC/BIC : AIC stands for Akaike's Information Criterion and is used for model selection. It penalizes the error made when a new variable is added to the model, and is calculated as a function of the number of parameters minus the likelihood of the overall model. A lower AIC implies a better model. BIC (Bayesian Information Criterion) is similar to AIC, but penalizes additional parameters more heavily.

Omnibus and Prob(Omnibus) : These test the normality of the residuals. If the errors are normally distributed with constant variance and the other OLS assumptions hold, then we say that the coefficients estimated are Best Linear Unbiased Estimators (BLUE).

Durbin-Watson : It tests for autocorrelation in the errors; a related OLS assumption is homoscedasticity, which implies that the variance of the errors is constant. Its preferred value is between 1 and 2.
Jarque-Bera and Prob(Jarque-Bera) : It is in line with the Omnibus test; it is also used for testing the normality of the residuals and confirms the results of the Omnibus test. A large value of the Jarque-Bera statistic indicates that the errors are not normally distributed.
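These are exactly the statistics reported in a statsmodels OLS summary; here is a minimal sketch on randomly generated stand-in data (the column names mirror the study's variables, but the values and coefficients are illustrative assumptions):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Stand-in data frame; column names mirror the study's variables, but the
# values are randomly generated for demonstration only.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "cost_price": rng.uniform(10, 100, n),
    "profit_markup": rng.uniform(1, 20, n),
    "depth_of_tree": rng.integers(1, 15, n),
})
df["profit"] = (50 - 4 * df["depth_of_tree"]
                + 0.3 * df["profit_markup"] + rng.normal(0, 2, n))

X = sm.add_constant(df[["cost_price", "profit_markup", "depth_of_tree"]])
model = sm.OLS(df["profit"], X).fit()

# The summary table reports R-squared, Adj. R-squared, F-statistic,
# Prob(F-statistic), AIC/BIC, Omnibus, Durbin-Watson and Jarque-Bera.
print(model.summary())
```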
Chapter 3
ANALYSIS
3.1 Linearity

Linearity means that the mean values of the dependent variable for each increment of the independent variables lie along a straight line. In linear regression, the relationship between the independent and the dependent variables needs to be linear. The linearity assumption can be tested using a scatter plot. The Pair-wise Linearity Plot (Figure 3.1) shows the pairwise scatter plot of the variables in the dataset. From this graph we can see that the relationship between the dependent variable, profit, and the independent variables (cost price, profit markup, depth of tree) is approximately linear.
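A pairwise scatter plot of this kind is one call in pandas; a sketch assuming a data frame `df` with the four study variables, such as the stand-in frame in the earlier OLS sketch:

```python
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# Pairwise scatter plots of all variables; assumes `df` contains the
# cost_price, profit_markup, depth_of_tree and profit columns.
scatter_matrix(df, figsize=(8, 8), diagonal="hist")
plt.suptitle("Pair-wise Linearity Plot")
plt.show()
```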
It is also important to check for outliers in the variables, because linear regression is sensitive to outliers. Outliers are observations that fall far from the other points. These points are important as they can have a strong influence on the least-squares line. The Outlier Detection boxplot of each variable (Figure 3.2) is used to examine the variables for such points.
Figure 3.2: Outlier Detection Plot
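Boxplots like the one in Figure 3.2 mark points beyond the whiskers as potential outliers; a sketch, again assuming the stand-in `df`:

```python
import matplotlib.pyplot as plt

# One boxplot per column; points beyond the whiskers are potential outliers.
df.plot(kind="box", subplots=True, layout=(2, 2), figsize=(8, 6))
plt.suptitle("Outlier Detection Plot")
plt.show()
```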
3.2 Correlation
Correlation gives information on the strength and direction of the linear relationship between two variables. The value of the correlation lies between +1 and −1. A value near ±1 implies high correlation, a value near zero implies little correlation and a value near ±0.5 implies moderate correlation. A negative correlation implies that an increase in one variable is associated with a decrease in the other variable, while a positive correlation
implies that both variables move in the same direction. Zero correlation implies no linear relationship between the variables.
From the correlation matrix graph (Figure 3.3) we can see that there is a high negative correlation between the dependent variable, profit, and the independent variable, depth of tree. This means that as the depth of tree increases the profit will decrease. This is shown in the Profit Vs Depth of tree graph (Figure 3.4).
Figure 3.4: Profit Vs Depth of tree
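The correlation matrix behind Figures 3.3 and 3.4 is a single pandas call; a sketch assuming the stand-in `df` (with the study's data, the profit and depth of tree entry is about −0.9):

```python
import matplotlib.pyplot as plt

# Pearson correlation matrix of all variables.
corr = df.corr()
print(corr)

# Display the matrix as a simple heatmap.
plt.matshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.xticks(range(len(corr.columns)), corr.columns, rotation=45)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.colorbar()
plt.show()
```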
The OLS Regression Results table below gives the output of the ordinary least squares regression on the data under study. This table gives the multiple regression model for the data and shows how well the model fits the data. It also gives the results of tests for the different assumptions of the model.
From this table we can see that, since the R-squared and Adj. R-squared values remain close to each other even with the change in all the independent variables, and also since the value of R-squared is high, the model fits the data well. The Prob(F-statistic) shows the overall significance of the model and the p-values of the coefficients show the significance of the individual regressors; here, both these values are nearly zero, which implies that the model and the regressor variables are significant. The independence of the error terms is tested using the Durbin-Watson statistic, which here is near zero, indicating that the errors show some positive autocorrelation.
The plot below is a predicted against actual plot, which shows the effectiveness of the model; the predicted value is plotted on the Y-axis and the actual value is plotted on the X-axis. From the graph we can see that the plotted points lie close to the fitted line Y = X, which implies that the model is a good fit.
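A sketch of such a predicted-against-actual plot, assuming the fitted statsmodels `model` and the stand-in `df` from the earlier sketches:

```python
import matplotlib.pyplot as plt

# Points near the 45-degree line Y = X indicate a good fit.
actual = df["profit"]
predicted = model.fittedvalues
plt.scatter(actual, predicted, s=10)
lims = [actual.min(), actual.max()]
plt.plot(lims, lims)   # reference line Y = X
plt.xlabel("Actual profit")
plt.ylabel("Predicted profit")
plt.show()
```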
From the correlation matrix we could see that the variable depth of tree is highly correlated with the dependent variable, profit, with a correlation coefficient of −0.9, which indicates that as the depth of tree increases the profit will decrease.
Here we fit a simple linear regression model with depth of tree as the independent variable and profit as the dependent variable. The fitted model is of the form Y = β̂0 + β̂1X, where Y is the profit and X is the depth of tree. Here, the p-value (Pr(>|t|)) is less than the significance level. This indicates that the regression coefficient is significant.
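A sketch of this simple fit using the statsmodels formula API, again on the stand-in `df` (the column names are assumptions):

```python
import statsmodels.formula.api as smf

# Simple linear regression of profit on depth_of_tree; the Pr(>|t|) column
# of the summary gives the p-value of the slope coefficient.
simple_model = smf.ols("profit ~ depth_of_tree", data=df).fit()
print(simple_model.summary())
```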
Chapter 4
CONCLUSION
The purpose of this study was to predict profit or loss in a pyramid scheme
using regression analysis. Here, for the analysis we take profit as the dependent variable and all the remaining variables (cost price, profit markup, depth of tree) as independent variables, and check whether these independent variables are significant or not. From the results of the OLS regression method we get that these variables are significant, that is, a change in an independent variable shows a corresponding change in the dependent variable. Also, from the Actual Vs Predicted plot we can see that the fitted model is good, since the plotted points lie close to the fitted line.
From the correlation matrix we can see that the independent variable depth of tree and the dependent variable profit have the highest correlation. Thus we fit a
simple linear regression model using these two variables, and from the result we see that the regression coefficient is significant.
From the Profit Vs Depth of tree graph we can see that as the depth of tree increases the profit decreases. This means that in a pyramid scheme, as the number of people joining the scheme increases, the people at the lowest level of the chain receive very little money, which leads to a loss for those members.
Reference

1. George A. F. Seber and Alan J. Lee (2003). Linear Regression Analysis, 2nd Edition, Wiley.
5. https://www.kaggle.com/datasets
6. https://www.investopedia.com/insights/what-is-a-pyramid-scheme
7. https://en.wikipedia.org/wiki/Pyramid_scheme
8. https://statisticsbyjim.com
9. https://www.albert.io/blog/ultimate-properties-of-ols-estimators-guide/
11. https://jyotiyadav99111.medium.com
12. https://blog.minitab.com
13. https://stats.stackexchange.com