A project report submitted for the Master's programme in Statistics

by

ROSS CYRIAC

Department of Statistics
Irinjalakuda

2021
CERTIFICATE

This is to certify that this project work has been carried out in Statistics in partial fulfillment of the requirements for the award of the Master's degree.

Irinjalakuda
05/08/2021

External Examiner:
DECLARATION

I hereby declare that this project has been composed by me under the guidance and supervision of my supervisor. I also declare that this project has not previously formed the basis for the award of any degree, diploma, associateship, fellowship etc. of any other university or institution.

Irinjalakuda
ACKNOWLEDGEMENT

This project would not have been possible without the guidance and the help of several individuals who contributed and extended their valuable assistance in the preparation and completion of this project, without which this project would not have materialized.

I would like to give my sincere thanks to my teachers for the inspiration, encouragement and technical help they bestowed upon me. I am indebted to the faculty of the department for sharing with me their knowledge base, for giving me a better perspective of the subject and for providing the necessary facilities. My sincere thanks are also due to the Librarian and the non-teaching staff of the department for the help and warmth I could enjoy from them. Last but not least, I am indebted to my family and friends for their constant support.
ROSS CYRIAC
Contents

1 INTRODUCTION
2 METHODOLOGY
  2.2.4 Advantages
3 ANALYSIS
  3.2 Correlation
4 CONCLUSION
REFERENCE
Chapter 1

INTRODUCTION

1.1 PYRAMID SCHEME
A pyramid scheme is a business model that recruits members through a promise of payments for enrolling others into the scheme, rather than from any real investment or sale of products. Each level of new investors makes payments to those above them and similarly receives payments from those below them. People in the upper layers of the pyramid typically gain profit, whereas those in the lower layers usually lose money. Since most of the members in the scheme are at the bottom layers, most participants in a pyramid scheme lose money.
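To see concretely why the bottom layers dominate, consider an idealized scheme in which every member recruits the same number of new members; the short Python sketch below counts the members in each layer (the recruitment factor and depth are illustrative assumptions, not values from this study):

```python
# Members per layer of an idealized pyramid scheme in which each member
# recruits m new members. The values of m and depth are illustrative
# assumptions, not quantities taken from the study's dataset.
def layer_sizes(m: int, depth: int) -> list[int]:
    """Number of members at each level, starting from a single founder."""
    return [m ** level for level in range(depth)]

sizes = layer_sizes(m=5, depth=6)   # [1, 5, 25, 125, 625, 3125]
total = sum(sizes)
print("members per layer:", sizes)
print(f"share in the bottom layer: {sizes[-1] / total:.1%}")   # about 80%
```

With a recruitment factor of 5 and six levels, roughly four out of five members sit in the bottom layer, which is why most participants lose money.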
1.2 REGRESSION ANALYSIS

Regression analysis is a set of statistical methods for estimating the relationship between a dependent variable and one or more independent variables. It can be used to analyse the strength of the relationship between the variables and also to model the relationship between them. There are several types of regression analysis, such as simple linear, multiple linear and nonlinear. The most commonly used models are simple linear and multiple linear. Nonlinear regression analysis is generally used for complicated datasets where the dependent and independent variables show a nonlinear relationship.
Many techniques for carrying out regression analysis have been developed. Familiar methods such as linear regression and ordinary least squares regression are parametric, in that the regression function is defined in terms of a finite number of unknown parameters that are estimated from the data.
Regression models involve the following components:

• The unknown parameters, denoted as β, which may represent a scalar or a vector.
• The independent variables, X.
• The dependent variable, Y.

A regression model relates Y to a function of X and β:
$$Y \approx f(X, \beta)$$
DATA DESCRIPTION
The secondary data for the study were collected from KAGGLE. The Pyra-
• cost price : it is the money that a person should pay while joining this
scheme.
• profit markup : it refers to the value that the person adds to the cost
price.
• depth of tree : it gives the length of each chain in the scheme. The
gain. Here in this study we take this variable as the dependent variable.
Chapter 2
METHODOLOGY
2.1 Simple Linear Regression

Mainly, there are two things that can be found out by the method of simple linear regression:

1. Whether there is a statistically significant relationship between two variables (for example, between a rise in temperature and the melting of glaciers).

2. The value of the dependent variable at a given value of the independent variable.
A model with a single regressor (independent variable) x that has a straight-line relationship with a response y is given by
$$y = \beta_0 + \beta_1 x + \varepsilon \qquad (2.1)$$
where the intercept β0 and the slope β1 are unknown constants and ε is a random error component. The parameters β0 and β1 are usually called the regression coefficients; the slope β1 is the change in the mean of y produced by a unit change in x, and the intercept β0 is the mean of y when x = 0.
The parameters β0 and β1 are unknown and must be estimated using sample data. Suppose that we have n pairs of data, say, (y1, x1), (y2, x2), ..., (yn, xn). The method of least squares is used to estimate β0 and β1. That is, we will estimate β0 and β1 so that the sum of the squares of the differences between the observations yi and the straight line is a minimum. From equation (2.1)
we may write,
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, 2, \dots, n \qquad (2.2)$$

The least-squares criterion is
$$S(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$
and the least-squares estimators of β0 and β1, say β̂0 and β̂1, must satisfy
$$\left. \frac{\partial S}{\partial \beta_0} \right|_{\hat\beta_0, \hat\beta_1} = -2 \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0
\;\Rightarrow\; \sum_{i=1}^{n} y_i = n\hat\beta_0 + \hat\beta_1 \sum_{i=1}^{n} x_i \qquad (2.3)$$
and
$$\left. \frac{\partial S}{\partial \beta_1} \right|_{\hat\beta_0, \hat\beta_1} = -2 \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i) x_i = 0
\;\Rightarrow\; \sum_{i=1}^{n} y_i x_i = \hat\beta_0 \sum_{i=1}^{n} x_i + \hat\beta_1 \sum_{i=1}^{n} x_i^2 \qquad (2.4)$$
Equations (2.3) and (2.4) are called the least-squares normal equations. Solving them gives
$$\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} \qquad (2.5)$$
$$\hat\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \qquad (2.6)$$
where $\bar{y}$ and $\bar{x}$ are the averages of the $y_i$ and $x_i$ respectively. Also, the numerator of equation (2.6) is the corrected sum of cross products of $x_i$ and $y_i$ and the denominator is the corrected sum of squares of $x_i$.
Therefore, β̂0 and β̂1 in equations (2.5) and (2.6) are the least-squares estimators of the intercept and slope. The fitted sample regression model is then
$$\hat{y} = \hat\beta_0 + \hat\beta_1 x$$
The difference between the observed value yi and the corresponding fitted value ŷi is called the residual, ei = yi − ŷi.
From equations (2.5) and (2.6) note that β̂0 and β̂1 are linear combinations of the observations yi.
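As an illustration, the estimators in equations (2.5) and (2.6) can be computed directly; below is a minimal NumPy sketch on made-up data (the arrays are illustrative, not from the study's dataset):

```python
import numpy as np

# Illustrative data, not values from the study's dataset.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Numerator of eq. (2.6): corrected sum of cross products of x and y.
Sxy = np.sum((x - x.mean()) * (y - y.mean()))
# Denominator of eq. (2.6): corrected sum of squares of x.
Sxx = np.sum((x - x.mean()) ** 2)

beta1_hat = Sxy / Sxx                          # slope, eq. (2.6)
beta0_hat = y.mean() - beta1_hat * x.mean()    # intercept, eq. (2.5)

residuals = y - (beta0_hat + beta1_hat * x)
print(beta0_hat, beta1_hat)
print(residuals.sum())   # numerically zero, a property of the fit
```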
Some useful properties of the least-squares fit are:

1. The least-squares estimators β̂0 and β̂1 are unbiased estimators of the model parameters β0 and β1.
4. The sum of the observed values yi equals the sum of the fitted values ŷi:
$$\sum_{i=1}^{n} y_i = \sum_{i=1}^{n} \hat{y}_i$$
5. The least-squares regression line always passes through the centroid (x̄, ȳ) of the data.

For the purpose of testing hypotheses about the regression coefficients, the errors εi are further assumed to be normally distributed.
2.1.4 Components of Simple Linear Regression Model
Residuals : Residuals are the difference between the actual observed response values and the response values that the model predicted. The Residuals section of the model output breaks them down into five summary points: minimum, first quartile, median, third quartile and maximum.

Coefficients : The model output gives two coefficients; the first one is the intercept and the second one is the slope. The intercept is the expected value of the dependent variable when the independent variable is zero. The slope is the change in the dependent variable over the change in the independent variable.

t-value and p-value : The t-value measures how many standard errors the coefficient estimate is away from zero, and the p-value is the probability of obtaining any value equal to or larger than |t|. The p-value tests whether or not there is a statistically significant relationship between the regressor and the response variable. The lower the p-value, the more significant the regressor variable. That is, if the p-value is less than the significance level (usually 0.05), we reject the null hypothesis that the regressor variable is not significant (no relationship between the regressor and the response).

Residual standard error : It is a measure of the quality of a linear regression fit. It is the average amount that the response will deviate from the true regression line. Theoretically, every linear model is assumed to contain an error term. Due to the presence of this error term, we are not capable of perfectly predicting the response variable from the predictor variable.
R-squared : The R-squared statistic gives a measure of how well the model fits the data. It is a measure of the linear relationship between the independent variable and the dependent variable. The value of R-squared always lies between 0 and 1; a number close to 0 represents a regression that does not explain the variance in the response variable well, and a number close to 1 represents a regression that does.

F-statistic : The F-statistic is an indicator of whether there is a relationship between the regressor and the response variables. The further the F-statistic is from 1, the better the relationship is. However, how much larger than 1 the F-statistic needs to be depends on both the number of data points and the number of predictors.

2.2 Multiple Linear Regression

Multiple linear regression is used when we want to predict the value of a variable based on the value of two or more other variables.
Multiple regression also allows us to determine the overall fit of the model and the relative contribution of each of the predictors to the total variance explained. The multiple linear model is essentially similar to the simple linear model, with the exception that multiple regressors are included.
The regression model that involves more than one regressor variable (independent variable) is called a multiple regression model. With k regressors it is given by
$$y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \varepsilon \qquad (2.7)$$
The parameter βj represents the expected change in the response y per unit change in xj when all the remaining regressors are held constant. For this reason the parameters βj, j = 1, 2, ..., k, are often called partial regression coefficients.
In matrix notation the multiple linear regression model is given by
$$Y = X\beta + \varepsilon$$
where
$$Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad
X = \begin{bmatrix} 1 & x_{11} & x_{12} & \dots & x_{1k} \\ 1 & x_{21} & x_{22} & \dots & x_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \dots & x_{nk} \end{bmatrix}, \quad
\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix}, \quad
\varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}$$
In general, Y is an n × 1 vector of the observations, X is an n × p matrix of the levels of the regressor variables (with p = k + 1), β is a p × 1 vector of the regression coefficients and ε is an n × 1 vector of random errors.
The method of least squares can be used to estimate the regression coefficients. That is, we find the vector of least-squares estimates β̂ that minimizes
$$S(\beta) = \sum_{i=1}^{n} \varepsilon_i^2 = \varepsilon'\varepsilon = (Y - X\beta)'(Y - X\beta) = Y'Y - 2\beta'X'Y + \beta'X'X\beta$$
The least-squares estimator β̂ must satisfy
$$\left. \frac{\partial S}{\partial \beta} \right|_{\hat\beta} = -2X'Y + 2X'X\hat\beta = 0$$
$$\Rightarrow (X'X)\hat\beta = X'Y \qquad (2.8)$$
Equation (2.8) is the set of least-squares normal equations for β̂, which can be solved by multiplying both sides by the inverse of X'X. Thus the least-squares estimator of β is $\hat\beta = (X'X)^{-1}X'Y$, provided that the inverse matrix $(X'X)^{-1}$ exists. The vector of fitted values is
$$\hat{Y} = X\hat\beta = X(X'X)^{-1}X'Y = PY$$
where the n × n matrix $P = X(X'X)^{-1}X'$, usually called the hat matrix, maps the vector of observed values into a vector of fitted values.
The difference between the vector of observed values Y and the corresponding vector of fitted values Ŷ is the residual vector
$$e = Y - \hat{Y} = Y - X\hat\beta = Y - PY = (I - P)Y$$
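The matrix computations above translate directly into NumPy; here is a minimal sketch on made-up arrays (the data are illustrative assumptions, not the study's dataset):

```python
import numpy as np

# Illustrative design: n = 5 observations, k = 2 regressors plus an intercept.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
Y = np.array([3.1, 3.9, 7.8, 8.2, 11.0])

# X is n x p with a leading column of ones, so p = k + 1 as in the text.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Solve the normal equations (X'X) beta_hat = X'Y directly;
# np.linalg.solve is numerically preferable to forming the inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

P = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
Y_fitted = P @ Y                       # equals X @ beta_hat
residuals = Y - Y_fitted               # equals (I - P) @ Y
print(beta_hat)
```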
2.2.2 Properties of Least Squares Estimates

Assuming that the errors are unbiased (that is, E(ε) = 0) and the columns of X are linearly independent, the least-squares estimator β̂ has the following properties.

1. Unbiasedness: since E[Y] = E[Xβ + ε] = Xβ,
$$E[\hat\beta] = E[(X'X)^{-1}X'Y] = (X'X)^{-1}X'E[Y] = (X'X)^{-1}X'X\beta = \beta$$
2. Variance: assume that the εi's are uncorrelated and have the same variance, that is, var(ε) = σ²Iₙ. Then var(Y) = var(ε) = σ²Iₙ and
$$var(\hat\beta) = (X'X)^{-1}X'\,var(Y)\,X(X'X)^{-1} = \sigma^2 (X'X)^{-1}X'X(X'X)^{-1} = \sigma^2 (X'X)^{-1}$$
2.2.3 Generalized Least Squares

Suppose now that var(ε) = σ²V, where V is a known n × n positive definite matrix, so that the errors need not have equal variances. Since V is positive definite, we can write V = KK' for some nonsingular matrix K. Transforming the model Y = Xβ + ε by K⁻¹ gives
$$Z = B\beta + \eta$$
where Z = K⁻¹Y, B = K⁻¹X and η = K⁻¹ε. The transformed errors satisfy
$$E(\eta) = E(K^{-1}\varepsilon) = K^{-1}E(\varepsilon) = 0$$
$$var(\eta) = var(K^{-1}\varepsilon) = K^{-1}\,var(\varepsilon)\,(K^{-1})' = \sigma^2 K^{-1}V(K^{-1})' = \sigma^2 K^{-1}KK'(K^{-1})' = \sigma^2 I_n$$
Minimizing the least-squares function η'η with respect to β, we get the generalized least-squares estimator
$$\beta^* = (B'B)^{-1}B'Z = [X'(K^{-1})'K^{-1}X]^{-1}X'(K^{-1})'K^{-1}Y = (X'V^{-1}X)^{-1}X'V^{-1}Y$$
since $(K^{-1})'K^{-1} = (KK')^{-1} = V^{-1}$.
The estimator β* is unbiased:
$$E(\beta^*) = E[(X'V^{-1}X)^{-1}X'V^{-1}Y] = (X'V^{-1}X)^{-1}X'V^{-1}E(Y) = (X'V^{-1}X)^{-1}X'V^{-1}X\beta = \beta$$
and its dispersion matrix is
$$D(\beta^*) = var(\beta^*) = \sigma^2 (B'B)^{-1} = \sigma^2 [X'(K^{-1})'K^{-1}X]^{-1} = \sigma^2 (X'V^{-1}X)^{-1}$$
The generalized least-squares estimator is simply the ordinary least-squares estimator applied to the transformed model. Therefore β* has the same optimal properties, namely that a'β* is the best linear unbiased estimator of a'β.
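A small NumPy sketch of the generalized least-squares formula $\beta^* = (X'V^{-1}X)^{-1}X'V^{-1}Y$; the data and the covariance matrix V here are assumed for illustration, not taken from the study:

```python
import numpy as np

# Illustrative data and a known, positive definite error covariance V.
X = np.column_stack([np.ones(4), np.array([1.0, 2.0, 3.0, 4.0])])
Y = np.array([2.0, 4.1, 5.9, 8.2])
V = np.diag([1.0, 2.0, 4.0, 8.0])   # heteroscedastic errors: var(eps) = sigma^2 V

V_inv = np.linalg.inv(V)
# Generalized least-squares estimator: (X' V^-1 X)^-1 X' V^-1 Y.
beta_star = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ Y)

# With V = I the formula reduces to the ordinary least-squares estimator.
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_star, beta_ols)
```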
2.2.4 Advantages
There are two main advantages to analysing data using a multiple regression model.

Firstly, the ability to determine the relative influence of one or more predictor variables on the criterion value. For example, a real estate agent could find that the size of the homes and the number of bedrooms have a strong correlation to the price of a home, while the proximity to schools has no correlation at all, or even a negative correlation if it is primarily a retirement community.

Secondly, the ability to identify outliers or anomalies. For example, while reviewing data related to employee salaries, a human resources manager could find that the number of hours worked, the department size and its budget all had a strong correlation to salaries, while seniority did not.
2.3 Ordinary Least Squares Regression

Ordinary least squares (OLS) regression estimates the parameters of a linear model by minimizing the sum of squared residuals. In the simple case the model is
$$Y = \beta_0 + \beta_1 X + \varepsilon \qquad (2.9)$$
where β0 is the intercept, β1 measures the change of Y for a unit change of X, and ε is the error term.
There are mainly four properties of the ordinary least squares estimators:
1. Linearity : An estimator is said to be linear if it is a linear function of the values of the dependent variable. The OLS estimators are linear with respect to the values of the dependent variable only, and not necessarily with respect to the independent variables, X.

2. Unbiasedness : If we take out several samples and estimate the parameters from each, we will find that the mean of all the estimates from the samples will be equal to the actual values of the parameters. Unbiasedness is a good property for an estimator to have.

3. Efficiency : Among the class of linear unbiased estimators, the OLS estimators have the minimum variance.

4. Consistency : The OLS estimator is a consistent estimator since it satisfies both the conditions for consistency:
• it is asymptotically unbiased, and
• its variance converges to zero as the sample size increases.
In practice, people often tend to ignore the assumptions of OLS before interpreting the results. It is therefore useful to understand the summary statistics reported with an OLS fit.

R-squared : The higher the value of R-squared, the better the model fits the data. This statistic has a drawback: its value increases as the number of predictors (independent variables) increases.

Adjusted R-squared : The adjusted R-squared increases only when an additional variable adds to the explanatory power of the regression; it therefore gives a more reliable view of the correlation.
F-statistic and Prob(F-statistic) : The F-statistic is used to assess the significance level of all the variables together. The null hypothesis under this is H0 : all the regression coefficients are equal to zero.
AIC/BIC : AIC stands for Akaike's Information Criterion and is used for model selection. It penalizes the error made when a new variable is added to the model, and is calculated as a function of the number of parameters minus the likelihood of the overall model. A lower AIC implies a better model. BIC (Bayesian Information Criterion) is similar to AIC, but penalizes additional parameters more heavily.

Omnibus and Prob(Omnibus) : These test the normality of the residuals. If the errors are normally distributed with constant variance and the other OLS assumptions hold, then we say that the coefficients estimated are Best Linear Unbiased Estimators (BLUE).

Durbin-Watson : It tests for autocorrelation in the errors; a related OLS assumption is homoscedasticity, which implies that the variance of the errors is constant. Its preferred value is between 1 and 2.
Jarque-Bera and Prob(Jarque-Bera) : It is in line with the Omnibus test; it is also used for testing the normality of the residuals and confirms the results of the Omnibus test. A large value of the Jarque-Bera statistic indicates that the errors are not normally distributed.
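These are exactly the statistics reported in a statsmodels OLS summary; here is a minimal sketch on randomly generated stand-in data (the column names mirror the study's variables, but the values and coefficients are illustrative assumptions):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Stand-in data frame; column names mirror the study's variables, but the
# values are randomly generated for demonstration only.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "cost_price": rng.uniform(10, 100, n),
    "profit_markup": rng.uniform(1, 20, n),
    "depth_of_tree": rng.integers(1, 15, n),
})
df["profit"] = (50 - 4 * df["depth_of_tree"]
                + 0.3 * df["profit_markup"] + rng.normal(0, 2, n))

X = sm.add_constant(df[["cost_price", "profit_markup", "depth_of_tree"]])
model = sm.OLS(df["profit"], X).fit()

# The summary table reports R-squared, Adj. R-squared, F-statistic,
# Prob(F-statistic), AIC/BIC, Omnibus, Durbin-Watson and Jarque-Bera.
print(model.summary())
```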
Chapter 3
ANALYSIS
3.1 Linearity

Linearity means that the mean values of the dependent variable for each increment of the independent variables lie along a straight line. In linear regression, the relationship between the independent and the dependent variables needs to be linear. The linearity assumption can be tested using a scatter plot. The Pair-wise Linearity Plot (Figure 3.1) shows the pairwise scatter plot of the variables in the dataset. From this graph we can see that the relationship between the dependent variable, profit, and the independent variables (cost price, profit markup, depth of tree) is approximately linear.
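A pairwise scatter plot of this kind is one call in pandas; a sketch assuming a data frame `df` with the four study variables, such as the stand-in frame in the earlier OLS sketch:

```python
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# Pairwise scatter plots of all variables; assumes `df` contains the
# cost_price, profit_markup, depth_of_tree and profit columns.
scatter_matrix(df, figsize=(8, 8), diagonal="hist")
plt.suptitle("Pair-wise Linearity Plot")
plt.show()
```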
It is also important to check for outliers in the variables, because linear regression is sensitive to outliers. Outliers are observations that fall far from the other points. These points are important as they can have a strong influence on the least-squares line. The Outlier Detection boxplot of each variable (Figure 3.2) is used to examine the variables for such points.
Figure 3.2: Outlier Detection Plot
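Boxplots like the one in Figure 3.2 mark points beyond the whiskers as potential outliers; a sketch, again assuming the stand-in `df`:

```python
import matplotlib.pyplot as plt

# One boxplot per column; points beyond the whiskers are potential outliers.
df.plot(kind="box", subplots=True, layout=(2, 2), figsize=(8, 6))
plt.suptitle("Outlier Detection Plot")
plt.show()
```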
3.2 Correlation
Correlation gives information on the strength and direction of the linear relationship between two variables. The value of the correlation lies between +1 and −1. A value near ±1 implies high correlation, a value near zero implies little correlation and a value near ±0.5 implies moderate correlation. A negative correlation implies that an increase in one variable is associated with a decrease in the other variable, while a positive correlation
implies that both variables move in the same direction. Zero correlation implies no linear relationship between the variables.
From the correlation matrix graph (Figure 3.3) we can see that there is a high negative correlation between the dependent variable, profit, and the independent variable, depth of tree. This means that as the depth of tree increases the profit will decrease. This is shown in the Profit Vs Depth of tree graph (Figure 3.4).
Figure 3.4: Profit Vs Depth of tree
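The correlation matrix behind Figures 3.3 and 3.4 is a single pandas call; a sketch assuming the stand-in `df` (with the study's data, the profit and depth of tree entry is about −0.9):

```python
import matplotlib.pyplot as plt

# Pearson correlation matrix of all variables.
corr = df.corr()
print(corr)

# Display the matrix as a simple heatmap.
plt.matshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.xticks(range(len(corr.columns)), corr.columns, rotation=45)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.colorbar()
plt.show()
```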
The OLS Regression Results table below gives the output of the ordinary least squares regression on the data under study. This table gives the multiple regression model for the data and shows how well the model fits the data. It also gives the results of tests for the different assumptions of the model.
From this table we can see that, since the R-squared and Adj. R-squared values remain close to each other even with the change in all the independent variables, and also since the value of R-squared is high, the model fits the data well. The Prob(F-statistic) shows the overall significance of the model and the p-values of the coefficients show the significance of the individual regressors; here, both these values are nearly zero, which implies that the model and the regressor variables are significant. The independence of the error terms is tested using the Durbin-Watson statistic, which here is near zero, indicating that the errors show some positive autocorrelation.
The plot below is a predicted against actual plot, which shows the effectiveness of the model; the predicted value is plotted on the Y-axis and the actual value is plotted on the X-axis. From the graph we can see that the plotted points lie close to the fitted line Y = X, which implies that the model is a good fit.
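A sketch of such a predicted-against-actual plot, assuming the fitted statsmodels `model` and the stand-in `df` from the earlier sketches:

```python
import matplotlib.pyplot as plt

# Points near the 45-degree line Y = X indicate a good fit.
actual = df["profit"]
predicted = model.fittedvalues
plt.scatter(actual, predicted, s=10)
lims = [actual.min(), actual.max()]
plt.plot(lims, lims)   # reference line Y = X
plt.xlabel("Actual profit")
plt.ylabel("Predicted profit")
plt.show()
```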
From the correlation matrix we could see that the variable depth of tree is highly correlated with the dependent variable, profit, with a correlation coefficient of −0.9, which indicates that as the depth of tree increases the profit will decrease.
Here we fit a simple linear regression model with depth of tree as the independent variable and profit as the dependent variable. The fitted model is of the form Y = β̂0 + β̂1X, where Y is the profit and X is the depth of tree. Here, the p-value (Pr(>|t|)) is less than the significance level. This indicates that the regression coefficient is significant.
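A sketch of this simple fit using the statsmodels formula API, again on the stand-in `df` (the column names are assumptions):

```python
import statsmodels.formula.api as smf

# Simple linear regression of profit on depth_of_tree; the Pr(>|t|) column
# of the summary gives the p-value of the slope coefficient.
simple_model = smf.ols("profit ~ depth_of_tree", data=df).fit()
print(simple_model.summary())
```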
Chapter 4
CONCLUSION
The purpose of this study was to predict profit or loss in a pyramid scheme
using regression analysis. Here, for the analysis we take profit as the dependent variable and all the remaining variables (cost price, profit markup, depth of tree) as independent variables, and check whether these independent variables are significant or not. From the results of the OLS regression method we get that these variables are significant, that is, a change in an independent variable shows a corresponding change in the dependent variable. Also, from the Actual Vs Predicted plot we can see that the fitted model is good, since the plotted points lie close to the fitted line.
From the correlation matrix we can see that the independent variable depth of tree and the dependent variable profit have the highest correlation. Thus we fit a
simple linear regression model using these two variables, and from the result we see that the regression coefficient is significant.
From the Profit Vs Depth of tree graph we can see that as the depth of tree increases the profit decreases. This means that in a pyramid scheme, as the number of people joining the scheme increases, the people at the lowest level of the chain receive very little money, which leads to a loss for those members.
Reference

1. George A. F. Seber and Alan J. Lee (2003). Linear Regression Analysis, 2nd Edition, Wiley.
5. https://www.kaggle.com/datasets
6. https://www.investopedia.com/insights/what-is-a-pyramid-scheme
7. https://en.wikipedia.org/wiki/Pyramid_scheme
8. https://statisticsbyjim.com
9. https://www.albert.io/blog/ultimate-properties-of-ols-estimators-guide/
11. https://jyotiyadav99111.medium.com
12. https://blog.minitab.com
13. https://stats.stackexchange.com