
PYRAMID SCHEME

PREDICTING PROFIT OR LOSS

Project report submitted to Christ College (Autonomous)

in partial fulfilment for the award of the M.Sc. Degree

programme in Statistics

by

ROSS CYRIAC

Register No. CCATMST013

Department of Statistics

Christ College (Autonomous)

Irinjalakuda

2021
CERTIFICATE

This is to certify that the project entitled ‘PYRAMID SCHEME

PREDICTING PROFIT OR LOSS’, submitted to the Department of

Statistics in partial fulfillment of the requirements for the award of the Master's

Degree in Statistics, is a bona fide record of original research work done

by ROSS CYRIAC (CCATMST013) during the period of her study

in the Department of Statistics, Christ College (Autonomous) Irinjalakuda,

Thrissur, under my supervision and guidance during the year 2020-2021.

Jiji M B                                        Dr. Davis Antony Mundassery

Assistant Professor Head of the Department

Department of Statistics Department of Statistics

Christ College (Autonomous) Christ College (Autonomous)

Irinjalakuda Irinjalakuda

External Examiner:

Irinjalakuda

05/08/2021
DECLARATION

I hereby declare that the matter embodied in the project entitled

‘PYRAMID SCHEME PREDICTING PROFIT OR LOSS’, submitted to

the Department of Statistics in partial fulfillment of the requirements for the

award of the Masters Degree in Statistics, is the result of my studies and

this project has been composed by me under the guidance and supervision

of JIJI M B, Assistant Professor, Department of Statistics, Christ College

(Autonomous) Irinjalakuda, during 2020-2021.

I also declare that this project has not previously formed the basis

for the award of any degree, diploma, associateship, fellowship etc. of any

other university or institution.

Irinjalakuda

05/08/2021 ROSS CYRIAC


ACKNOWLEDGEMENT

This project would not have been possible without the guidance and the help

of several individuals who in one way or another contributed and extended

their valuable assistance in the preparation and completion of the study.

First, I would like to express my deepest gratitude to my Guide Jiji M B,

Assistant Professor, Department of Statistics, Christ College (Autonomous)

Irinjalakuda, for her generous help, constructive criticism, scholarly guid-

ance, valuable supervision and encouragement throughout the preparation of

this project, without which it would not have materialized.

I would like to give my sincere thanks to my teachers for the inspiration, en-

couragement and technical help they bestowed upon me. I am indebted to the

faculty of the department for sharing with me their knowledge base and for

giving me a better perspective of the subject and for providing the necessary

facilities during the span of my study.

My sincere thanks are also due to the Librarian and non-teaching staff of

Christ College (Autonomous) Irinjalakuda for their help and co-operation.

Also, I register my heartfelt thanks to my classmates for the co-operation and

warmth I could enjoy from them. Last but not least, I am indebted to my

parents for their unconditional love and support.

ROSS CYRIAC
Contents

1 INTRODUCTION 8

2 METHODOLOGY 11

2.1 Simple Linear Regression . . . . . . . . . . . . . . . . . . . . . 11

2.1.1 Estimation of β0 and β1 . . . . . . . . . . . . . . . . . 12

2.1.2 Properties of Least Square Estimators . . . . . . . . . 14

2.1.3 Assumptions of Linear Regression . . . . . . . . . . . . 15

2.1.4 Components of Simple Linear Regression Model . . . . 16

2.2 Multiple Linear Regression . . . . . . . . . . . . . . . . . . . . 17

2.2.1 Estimation of Regression Coefficient . . . . . . . . . . . 19

2.2.2 Properties of Least Square Estimates . . . . . . . . . . 21

2.2.3 Generalized Least Squares . . . . . . . . . . . . . . . . 22

2.2.4 Advantages . . . . . . . . . . . . . . . . . . . . . . . . 23

2.3 Ordinary Least Squares . . . . . . . . . . . . . . . . . . . . . . 24

2.3.1 Properties of Ordinary Least Squares . . . . . . . . . . 24

2.3.2 Statistics in Ordinary Least Squares . . . . . . . . . . 26

3 ANALYSIS 29

3.1 Linearity and Outliers . . . . . . . . . . . . . . . . . . . . . . 29

3.2 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.3 OLS Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.4 Simple Linear Regression . . . . . . . . . . . . . . . . . . . . . 35

4 CONCLUSION 37

REFERENCE 39
List of Figures

3.1 Pair-wise Linearity Plot . . . . . . . . . . . . . . . . . . . . . 30

3.2 Outlier Detection Plot . . . . . . . . . . . . . . . . . . . . . . 31

3.3 Correlation Matrix . . . . . . . . . . . . . . . . . . . . . . . . 32

3.4 Profit Vs Depth of tree . . . . . . . . . . . . . . . . . . . . . . 33

3.5 Actual Vs Predicted Plot . . . . . . . . . . . . . . . . . . . . . 35


Chapter 1

INTRODUCTION

PYRAMID SCHEME

A pyramid scheme is so named because of the hierarchical structure which is

formed by its investors or recruits. It is a business model that recruits

investors by a promise of payments for enrolling others into the scheme.

Each level of new investors makes payments to those above them and in turn

receives payments from those below them. People in the upper layers of the

pyramid typically gain profit, whereas those in the lower layers usually lose

money. Since most of the members in the scheme are at the bottom layers,

most of the participants will not make any money.

REGRESSION ANALYSIS

Regression analysis is a statistical method used to estimate the relationship

between a dependent variable and one or more independent variables. It

can be used to analyse the strength of the relationship between the variables

and also to model the relationship between them. Regression analysis is

of several types, such as simple linear, multiple linear and nonlinear. The

most commonly used models are simple linear and multiple linear. Nonlin-

ear regression analysis is generally used for complicated datasets where the

dependent variable and independent variables have a nonlinear relationship.

Regression analysis is widely used for prediction and forecasting. It offers

numerous applications in various fields of study.

Many techniques for carrying out regression analysis have been developed.

Familiar methods such as linear regression and ordinary least squares regres-

sion are parametric, in that the regression function is defined in terms of

a finite number of unknown parameters that are estimated from the data.

Nonparametric regression refers to techniques that allow the regression func-

tion to lie in a specified set of functions, which may be infinite-dimensional.

Regression models involve the following parameters and variables:

• The independent variables, X

• The dependent variable, Y

• The unknown parameters, denoted as β, which may represent a scalar

or a vector.

In general a regression model relates Y to a function of X and β.

Y ≈ f (X, β)

DATA DESCRIPTION

The secondary data for the study were collected from Kaggle. The Pyra-

mid Scheme data for this study includes the following variables (a short

Python loading sketch is given after the list):

• cost price : it is the money that a person pays when joining this

scheme.

• profit markup : it refers to the value that the person adds to the cost

price.

• depth of tree : it gives the length of each chain in the scheme. The

length increases as each person recruits another person to the scheme.

• sales commission : it is the amount that a person receives when he

recruits another person to the scheme.

• profit : it describes how much a person receives from investing in this

scheme. The profit can be either negative or positive, referring to loss or

gain. In this study we take this variable as the dependent variable.
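The dataset can be loaded and inspected with a short Python sketch. The file name pyramid_scheme.csv and the exact column names are assumptions for illustration only; the actual Kaggle download may use different names.

    import pandas as pd

    # Assumed local copy of the Kaggle pyramid scheme dataset;
    # the file name and column names are illustrative.
    df = pd.read_csv("pyramid_scheme.csv")

    print(df.head())        # first few records
    print(df.describe())    # summary statistics for each variable
    print(df.isna().sum())  # check for missing values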

Chapter 2

METHODOLOGY

2.1 Simple Linear Regression

Simple linear regression can be described as a statistical analysis method that

can be used to study the relationship between two quantitative variables.

Mainly, there are two things that can be found out by the method of simple

linear regression:

1. Strength of the relationship between the given two variables. For

example, the relationship between global warming and the melting of

glaciers.

2. The value that the dependent variable takes at a given value of the

independent variable. For example, the amount of melting of a glacier

at a certain level of global warming or temperature.

A model with a single regressor (independent variable) x that has a straight-

line relationship with a response (dependent variable) y is the simple linear

regression model. The model is given by,

y = β0 + β1 x + ε (2.1)

where the intercept β0 and the slope β1 are unknown constants and ε is a random error

component which has mean zero and unknown variance σ².

There is a probability distribution for y at each possible value of x. The

mean and variance of this distribution are

    E[y|x] = β0 + β1 x   and   var[y|x] = var(β0 + β1 x + ε) = σ²

The parameters β0 and β1 are usually called the regression coefficients. The

slope β1 is the change in the mean of the distribution of y produced by a

unit change in x. If the range of data on x includes x = 0, then the intercept

β0 is the mean of the distribution of the response y when x = 0, and if the

range of x does not include zero, then β0 has no practical interpretation.

2.1.1 Estimation of β0 and β1

The parameters β0 and β1 are unknown and must be estimated using sample

data. Suppose that we have n pairs of data, say, (y1 , x1 ), (y2 , x2 ), ..., (yn , xn ).

The method of least squares is used to estimate β0 and β1 . That is, we will

estimate β0 and β1 so that the sum of the squares of the difference between

the observations yi and the straight line is minimum. From equation (2.1)

we may write,

yi = β0 + β1 xi + εi , i = 1, 2, ..., n (2.2)

The least square criterion is,

    S(β0, β1) = Σ_{i=1}^{n} (yi − β0 − β1 xi)²

The least square estimators of β0 and β1, say β̂0 and β̂1, must satisfy,

    ∂S/∂β0 |_(β̂0, β̂1) = −2 Σ_{i=1}^{n} (yi − β̂0 − β̂1 xi) = 0

    ⇒ Σ_{i=1}^{n} (yi − β̂0 − β̂1 xi) = 0

    ⇒ Σ_{i=1}^{n} yi = n β̂0 + β̂1 Σ_{i=1}^{n} xi        (2.3)

and

    ∂S/∂β1 |_(β̂0, β̂1) = −2 Σ_{i=1}^{n} (yi − β̂0 − β̂1 xi) xi = 0

    ⇒ Σ_{i=1}^{n} (yi − β̂0 − β̂1 xi) xi = 0

    ⇒ Σ_{i=1}^{n} yi xi = β̂0 Σ_{i=1}^{n} xi + β̂1 Σ_{i=1}^{n} xi²        (2.4)

The equations (2.3) and (2.4) are called the least square normal equations.

Solving these two equations we get

    β̂0 = ȳ − β̂1 x̄        (2.5)

    β̂1 = [ Σ_{i=1}^{n} yi xi − (Σ_{i=1}^{n} yi)(Σ_{i=1}^{n} xi)/n ] / [ Σ_{i=1}^{n} xi² − (Σ_{i=1}^{n} xi)²/n ]        (2.6)

where ȳ = (1/n) Σ_{i=1}^{n} yi and x̄ = (1/n) Σ_{i=1}^{n} xi are the averages of yi and xi

respectively. Also the numerator of equation (2.6) is the corrected sum of cross

product of xi and yi and the denominator is the corrected sum of squares of xi.

Therefore, β̂0 and β̂1 in equations (2.5) and (2.6) are the least square es-

timators of the intercept and slope. The fitted sample regression model is

then,

ŷ = β̂0 + β̂1 x

The difference between the observed value yi and the corresponding fitted

value ŷi is a residual. Mathematically the ith residual is,

ei = yi − ŷi = yi − (β̂0 + β̂1 xi ) ; i = 1, 2, ..., n
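As an illustration of equations (2.5) and (2.6), the estimates and residuals can be computed directly with NumPy; the data below are simulated, since the formulas apply to any sample of pairs (xi, yi).

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=50)
    y = 2.0 + 1.5 * x + rng.normal(0, 1, size=50)  # simulated data, true beta0 = 2, beta1 = 1.5
    n = len(x)

    # Corrected sum of cross products (numerator of 2.6) and corrected sum of squares (denominator)
    Sxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n
    Sxx = np.sum(x ** 2) - np.sum(x) ** 2 / n

    beta1_hat = Sxy / Sxx                        # slope estimate, equation (2.6)
    beta0_hat = y.mean() - beta1_hat * x.mean()  # intercept estimate, equation (2.5)

    y_hat = beta0_hat + beta1_hat * x            # fitted values
    e = y - y_hat                                # residuals

    print(beta0_hat, beta1_hat)
    print(e.sum(), (x * e).sum())                # both sums are numerically zero

The two printed sums illustrate the residual properties discussed in the next subsection.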

2.1.2 Properties of Least Square Estimators

From equations (2.5) and (2.6) note that β̂0 and β̂1 are linear combinations

of the observations yi .

1. The least square estimators β̂0 and β̂1 are the unbiased estimators of

the model parameters β0 and β1 . That is,

E(β̂0 ) = β0 and E(β̂1 ) = β1

2. The variances of β̂0 and β̂1 are given by,


 
    var(β̂0) = σ² (1/n + x̄²/Sxx)   and   var(β̂1) = σ²/Sxx

where Sxx = Σ_{i=1}^{n} (xi − x̄)² is the corrected sum of squares of the xi.

3. The sum of the residuals in any regression model that contains an

intercept β0 is always zero, that is,


    Σ_{i=1}^{n} ei = 0
4. The sum of the observed values yi equals the sum of the fitted values

ŷi, or

    Σ_{i=1}^{n} yi = Σ_{i=1}^{n} ŷi

5. The least squares regression line always passes through the centroid of

the data.

6. The sum of the residuals weighted by the corresponding value of the

regressor variable always equals zero, that is,


    Σ_{i=1}^{n} xi ei = 0

7. The sum of the residuals weighted by the corresponding fitted value

always equals zero, that is,


    Σ_{i=1}^{n} ŷi ei = 0

2.1.3 Assumptions of Linear Regression

There are four assumptions for a linear regression model (a short diagnostic sketch in Python follows the list):

1. Linearity : The relation between the independent variable X and the

mean of the dependent variable Y is linear.

2. Homoscedasticity : The variance of the error is constant.

3. Independence : The observations are independent of each other.

4. Normality : For any fixed value of regressor X, response Y is normally

distributed.
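A rough diagnostic sketch for these assumptions, using simulated data and a statsmodels fit; with the real dataset the same checks would be run on the model fitted in the analysis chapter.

    import numpy as np
    import matplotlib.pyplot as plt
    import statsmodels.api as sm
    from scipy import stats
    from statsmodels.stats.stattools import durbin_watson

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 10, 100)
    y = 3.0 - 0.8 * x + rng.normal(0, 1, 100)

    fit = sm.OLS(y, sm.add_constant(x)).fit()
    resid = fit.resid

    # Linearity / homoscedasticity: residuals vs fitted values should show no pattern
    plt.scatter(fit.fittedvalues, resid)
    plt.axhline(0, color="red")
    plt.xlabel("Fitted values")
    plt.ylabel("Residuals")
    plt.show()

    # Normality: Shapiro-Wilk test on the residuals
    print(stats.shapiro(resid))

    # Independence: Durbin-Watson statistic (values near 2 suggest no autocorrelation)
    print(durbin_watson(resid))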

2.1.4 Components of Simple Linear Regression Model

Residuals : Residuals are the difference between the actual observed re-

sponse values and the response values that the model predicted. The Resid-

uals section of the model output breaks it down into 5 summary points; min-

imum (Min), 1st quartile (1Q), median (Median), 3rd quartile (3Q) and maximum (Max).

Coefficient-Estimate : The coefficient estimate contains two rows; the first

one is the intercept and the second one is the slope. The intercept is the expected

average value of the dependent variable when the independent variable is

zero. The slope is the change in the dependent variable over the change in

the independent variable.

Coefficient-Pr(>t) : The Pr(>t) relates to the probability of observing

any value equal to or larger than t. The p-value tests whether or not there is a sta-

tistically significant relationship between a given regressor and the response

variable. The lower the p-value, the more significant the regressor variable.

That is, if the p-value is less than the significance level (usually 0.05), we reject

the null hypothesis that the regressor variable is not significant (no relation-

ship between the regressor and response variable).

Residual standard error : The residual standard error is a measure of the qual-

ity of a linear regression fit. It is the average amount that the response will

deviate from the true regression line. Theoretically, every linear model is

assumed to contain an error term. Due to the presence of this error term, we

are not capable of perfectly predicting the response variable from the predic-

tor variable.

Multiple R-squared, Adjusted R-squared : The R-squared statistic

gives a measure of how well the model fits the data. It is a measure of the

linear relationship between the independent variable and the dependent vari-

able. The value of R-squared always lies between 0 and 1 (that is, a number

close to 0 represents a regression that does not explain the variance in the

response variable, and a number near 1 indicates a regression that explains most of the observed variance in

the response variable).

F-statistic : The F-statistic is a good indicator of whether there is a relation-

ship between the regressor and the response variables. When there is no relationship,

the F-statistic is expected to be close to 1; the further it is above 1, the stronger

the evidence of a relationship. However, how much larger than 1 the F-statistic

needs to be depends on both the number of data points and the number of

regressors. Generally, when the F-statistic is much larger than 1, we can conclude that

there is a relationship between the regressor and the response variables.

2.2 Multiple Linear Regression

Multiple regression is an extension of simple linear regression. It is used

when we want to predict the value of a variable based on the value of two

or more other variables. Multiple regression also allows us to determine the

overall fit of the model and the relative contribution of each of the predictors

to the total variance explained. Multiple linear regression analysis is essen-

tially similar to the simple linear model, with the exception that multiple

independent variables are used in the model.

The regression model that involves more than one regressor variable (in-

dependent variable) is called a multiple linear regression model. The response

(dependent variable) y may be related to k regressor or predictor variables.

The mathematical representation of multiple linear regression is,

y = β0 + β1 x1 + ... + βk xk + ε (2.7)

The parameters βj, j = 0, 1, ..., k are called the regression coefficients. The

parameter βj represents the expected change in the response variable y per

unit change in xj, when all the remaining regressor variables xi (i ≠ j) are

held constant. For this reason the parameters βj, j = 1, 2, ..., k, are often

called partial regression coefficients.

In matrix notation the multiple linear regression model is given by,

    Y = Xβ + ε

where

    Y = (y1, y2, ..., yn)′,   β = (β0, β1, ..., βk)′,

    X = ( 1  x11  x12  ...  x1k )
        ( 1  x21  x22  ...  x2k )
        ( ...  ...  ...  ...    )
        ( 1  xn1  xn2  ...  xnk )

In general, Y is an n × 1 vector, X is an n × p matrix of the levels of the

regressor variables, β is a p × 1 vector of regression coefficients, and ε is an

n × 1 vector of random errors, where p = k + 1.

2.2.1 Estimation of Regression Coefficient

The method of least squares can be used to estimate the regression coeffi-

cients. That is, we find the vector of least square estimate β̂ that minimizes,
    S(β) = Σ_{i=1}^{n} εi² = ε′ε

         = (Y − Xβ)′(Y − Xβ)

         = Y′Y − Y′Xβ − β′X′Y + β′X′Xβ

         = Y′Y − 2β′X′Y + β′X′Xβ

since β′X′Y is a 1 × 1 matrix, or a scalar, and its transpose (β′X′Y)′ = Y′Xβ

is the same scalar.

The least square estimate must satisfy,

    ∂S/∂β = ∂/∂β (Y′Y − 2β′X′Y + β′X′Xβ) = −2X′Y + 2X′Xβ = 0

so that at β = β̂,

    ⇒ −2X′Y + 2X′X β̂ = 0

    ⇒ (X′X) β̂ = X′Y        (2.8)

Equation (2.8) is the normal equation for β̂, which can be solved by

multiplying both sides by the inverse of X′X, that is (X′X)⁻¹.

Thus the least square estimate of β is β̂ = (X′X)⁻¹X′Y, provided that the

inverse of the matrix exists.

We denote the fitted values by,

    Ŷ = X β̂ = X(X′X)⁻¹X′Y = P Y

The n × n matrix P = X(X′X)⁻¹X′ is usually called the hat matrix. It maps

the vector of observed values into a vector of fitted values. The hat matrix

and its properties play a central role in regression analysis.

The difference between the observed value Y and the corresponding fitted

value Ŷ is the residual e. The n residuals may be conveniently written in

matrix notation as,

    e = Y − Ŷ = Y − X β̂ = Y − P Y = (I − P)Y
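The matrix formulas above are easy to verify numerically. A minimal NumPy sketch with simulated data and k = 2 regressors (so p = 3):

    import numpy as np

    rng = np.random.default_rng(2)
    n, k = 100, 2
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # n x p design matrix, p = k + 1
    beta_true = np.array([1.0, 2.0, -3.0])
    Y = X @ beta_true + rng.normal(scale=0.5, size=n)

    # Least squares estimate beta_hat = (X'X)^(-1) X'Y
    # (np.linalg.solve is preferred over an explicit inverse for numerical stability)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

    P = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix
    Y_hat = P @ Y                             # fitted values
    e = (np.eye(n) - P) @ Y                   # residuals

    print(beta_hat)
    print(np.allclose(Y_hat, X @ beta_hat))   # the two routes to the fitted values agree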

2.2.2 Properties of Least Square Estimates

Assuming that errors are unbiased (that is, E(ε) = 0) and the columns of X

are linearly independent then,

1. Unbiased estimate of β

    E(β̂) = E[(X′X)⁻¹X′Y]

          = (X′X)⁻¹X′E[Y]

          = (X′X)⁻¹X′Xβ        (since E[Y] = Xβ)

          = β

Therefore, β̂ is an unbiased estimate of β.

2. Variance of β̂

Assume that the εi's are uncorrelated and have the same variance, that is,

cov(εi, εj) = δij σ², so that var(ε) = σ² In. Then

    var(Y) = var(Y − Xβ) = var(ε) = σ² In

and

    var(β̂) = var[(X′X)⁻¹X′Y]

           = (X′X)⁻¹X′ var(Y) X(X′X)⁻¹

           = σ²(X′X)⁻¹(X′X)(X′X)⁻¹

           = σ²(X′X)⁻¹

Therefore, the variance of β̂ is σ²(X′X)⁻¹.

2.2.3 Generalized Least Squares

Consider the model Y = Xβ + ε where E(ε) = 0. Instead of var(ε) = σ²In, assume

that var(ε) = σ²V, where V is a known n × n positive definite matrix.

Since V is positive definite, there exists an n × n non-singular matrix K such

that V = KK′. Set Z = K⁻¹Y, B = K⁻¹X and η = K⁻¹ε.

We have the model Z = Bβ + η, where B is n × p of rank p. Also,

    E(η) = E(K⁻¹ε) = K⁻¹E(ε) = 0

    var(η) = var(K⁻¹ε)

           = K⁻¹ var(ε) (K⁻¹)′

           = K⁻¹ σ²V (K⁻¹)′

           = σ² K⁻¹KK′(K⁻¹)′

           = σ² In

Minimizing the least squares function η′η with respect to β, we get the least

square estimate of this transformed model as,

    β* = (B′B)⁻¹B′Z

       = [(K⁻¹X)′(K⁻¹X)]⁻¹ (K⁻¹X)′K⁻¹Y

       = [X′(K⁻¹)′K⁻¹X]⁻¹ X′(K⁻¹)′K⁻¹Y

       = [X′(KK′)⁻¹X]⁻¹ X′(KK′)⁻¹Y

       = (X′V⁻¹X)⁻¹ X′V⁻¹Y

    E(β*) = E[(X′V⁻¹X)⁻¹X′V⁻¹Y]

          = (X′V⁻¹X)⁻¹X′V⁻¹E(Y)

          = (X′V⁻¹X)⁻¹X′V⁻¹Xβ

          = β

The dispersion matrix of β* is given by

    D(β*) = var(β*)

          = σ²(B′B)⁻¹

          = σ²[X′(K⁻¹)′K⁻¹X]⁻¹

          = σ²(X′V⁻¹X)⁻¹

The generalized least square estimate is simply the ordinary least square

estimate for a transformed model.

Therefore, β* has the same optimal properties, namely that a′β* is the best

linear unbiased estimate (BLUE) of a′β.
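As a rough numerical check of this equivalence, the sketch below computes β* both from the transformed model and from the closed form; the AR(1)-type matrix V is an assumption chosen purely for illustration, and in practice a library routine such as statsmodels' GLS performs the same computation.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 60
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    beta_true = np.array([2.0, 1.0])

    # Known positive definite V (here an AR(1)-type correlation structure)
    rho = 0.6
    V = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    K = np.linalg.cholesky(V)                    # V = K K'
    Y = X @ beta_true + K @ rng.normal(size=n)   # errors with covariance sigma^2 V (sigma = 1)

    # Closed form: beta* = (X' V^(-1) X)^(-1) X' V^(-1) Y
    Vinv = np.linalg.inv(V)
    beta_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ Y)

    # Ordinary least squares on the transformed model Z = B beta + eta
    Kinv = np.linalg.inv(K)
    Z, B = Kinv @ Y, Kinv @ X
    beta_transformed = np.linalg.solve(B.T @ B, B.T @ Z)

    print(beta_gls)
    print(np.allclose(beta_gls, beta_transformed))   # the two estimates coincide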

2.2.4 Advantages

There are two main advantages to analysing data using a multiple regression

model.

Firstly, the ability to determine the relative influence of one or more predictor

variables on the criterion value. For example, a real estate agent could

find that the size of the homes and the number of bedrooms have a strong

correlation to the price of a home, while the proximity to schools has no

correlation at all, or even a negative correlation if it is primarily a retirement

community.

Secondly, the ability to identify outliers or anomalies. For example, while

reviewing the data related to management salaries, the human resources

manager could find that the number of hours worked, the department size

and its budget all had a strong correlation to salaries, while seniority did not.

2.3 Ordinary Least Squares

Ordinary least squares (OLS) is a type of linear least squares method for

estimating the unknown parameters in a linear regression model. It chooses

the parameters by the principle of least squares. OLS is a method used to

estimate the equation:

Y = β0 + β1 X + ε (2.9)

The intercept β0 represents the value of Y when X is 0, and the slope β1

measures the change of Y for a unit change of X, and ε is the error term.

This method has been widely used in research.

2.3.1 Properties of Ordinary Least Squares

There are mainly four properties for ordinary least squares regression,

1. Linear : OLS estimators are linear functions of the dependent variable,

Y, that are linearly combined using weights which are a non-linear

function of the independent variables, X. The OLS estimators are linear

with respect to the values of the dependent variable only, and not

necessarily with respect to the values of the independent variables.

2. Unbiasedness : Unbiasedness is one of the most desirable properties of

any estimator. The estimator should ideally be an unbiased estimator

of the actual values. The unbiasedness property of the OLS method says

that if we draw repeated samples, we will find that the mean of

the estimates obtained from the samples is equal to the actual values of

the parameters in the population.

3. Efficient-Minimum Variance : OLS estimators have the least variance

among the class of all linear unbiased estimators. This property is a

way to determine which estimator to use.

• An estimator that is unbiased but does not have the minimum

variance is not good.

• An estimator that has the minimum variance but is biased is not

good.

• An estimator that is unbiased and has the minimum variance of

all other estimators is efficient.

The OLS estimator is an efficient estimator.

4. Consistency : A consistent estimator is one that approaches the actual

value of the parameter in the population as the sample size increases.

The OLS estimator is a consistent estimator since it satisfies both the con-

ditions of a consistent estimator, which are

• it is asymptotically unbiased.

• its variance converges to 0 as the sample size increases.

2.3.2 Statistics in Ordinary Least Squares

Linear regression is a simple and powerful tool to analyze the relationship be-

tween a set of dependent and independent variables. But often there is a

tendency to ignore the assumptions of OLS before interpreting the results. Hence,

it is necessary to analyze various statistics of OLS.

R-squared : It gives the percentage of variation in the dependent variable that

is explained by the independent variables. The value of R-squared is between 0

and 1. R-squared = Explained variation / Total variation. In general, the

higher the value of R-squared, the better the model fits the data. This statis-

tic has a drawback that it increases as the number of predictors (independent

variables) increases.

Adj.R-squared : This is a modified version of R-squared that is adjusted

for the number of independent variables in the regression model. It increases

only when an additional variable adds to the explanatory power of the regres-

sion. It adds reliability and precision by considering the impact of additional

independent variables. Adjusted R-squared can also provide a more exact

view of the correlation.

Prob(F-Statistic) : It tells the overall significance of the regression. This

is used to assess the significance level of all the variables together. The null

hypothesis under this is, H0 : all the regression coefficients are equal to zero.

Prob(F-statistic) is the p-value of this test; a small value means the null hypothesis can be rejected.

AIC/BIC : AIC stands for Akaike’s Information Criteria and is used for

model selection. It corrects for the error introduced when a new variable is added

to the regression equation by penalizing model complexity: it is computed from the number of

parameters and the maximized log-likelihood of the model (AIC = 2k − 2 ln L). A lower AIC implies a better model.

BIC stands for the Bayesian information criterion and is a variant of

AIC in which the penalty for additional parameters is more severe.

Prob(Omnibus) : The Omnibus test is performed to check one of the as-

sumptions of OLS, that the errors are normally distributed. Prob(Omnibus)

is supposed to be close to 1 in order to satisfy the normality assumption.

Then we say that the coefficients estimated are Best Linear Unbiased Esti-

mators (BLUE).

Durbin-Watson : Another assumption of OLS is that the errors are uncorrelated,

that is, there is no autocorrelation. The Durbin-Watson statistic tests for autocorrelation

in the residuals; values close to 2 indicate no autocorrelation, while values toward

0 or 4 indicate positive or negative autocorrelation respectively.
Prob(Jarque-Bera) : It is in line with the Omnibus test. It is also used for

the distribution analysis of the regression errors. It is expected to agree with the

results of the Omnibus test. A large value of the Jarque-Bera statistic indicates that the

errors are not normally distributed.
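All of these quantities are reported together in the summary of a fitted statsmodels OLS model; the sketch below uses simulated data with illustrative variable names.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(4)
    n = 200
    data = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
    data["y"] = 1.0 + 2.0 * data["x1"] - 1.5 * data["x2"] + rng.normal(size=n)

    X = sm.add_constant(data[["x1", "x2"]])
    results = sm.OLS(data["y"], X).fit()

    # The summary table reports R-squared, Adj. R-squared, F-statistic,
    # Prob(F-statistic), AIC, BIC, Omnibus, Prob(Omnibus), Durbin-Watson,
    # Jarque-Bera (JB) and Prob(JB), among other statistics.
    print(results.summary())

    # Individual statistics are also available as attributes:
    print(results.rsquared, results.rsquared_adj, results.f_pvalue, results.aic, results.bic)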

Chapter 3

ANALYSIS

3.1 Linearity and Outliers

Linearity means that the mean values of the dependent variable for each incre-

ment of the independent variables lie along a straight line. In linear regression

the relationship between the dependent and independent variables needs to

be linear. The linearity assumption can be tested using a scatter plot. The

Pair-wise Linearity Plot graph below shows the pairwise scatter plot of the

variables under study.

From this graph we can see that the relationship between the dependent

variable, profit, and the independent variables (cost price, profit markup,

depth of tree, and sales commission) is linear.

It is also important to check for outliers in the variables, because linear regres-

sion is sensitive to outliers. Outliers are observations that fall far from other
points. These points are important as they can have a strong influence on

the least squares line. From the Outlier Detection boxplot of each variable

we can see that there are no outliers.

Figure 3.1: Pair-wise Linearity Plot

Figure 3.2: Outlier Detection Plot
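Plots of the kind shown in Figures 3.1 and 3.2 can be produced with a sketch like the one below; the file name and column names are the same assumptions used in the loading sketch of Chapter 1.

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.read_csv("pyramid_scheme.csv")   # assumed file and column names
    cols = ["cost_price", "profit_markup", "depth_of_tree", "sales_commission", "profit"]

    # Pair-wise scatter plots to judge linearity between the variables
    sns.pairplot(df[cols])
    plt.show()

    # Box plots of each variable for a quick outlier check
    df[cols].plot(kind="box", subplots=True, layout=(1, 5), figsize=(14, 3))
    plt.tight_layout()
    plt.show()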

3.2 Correlation

Correlation gives information on the strength and direction of the linear

relationship between two variables. The value of the correlation lies between

+1 and -1. A value near ±1 implies high correlation, a value near

zero implies little correlation and a value near ±0.5 implies moderate

correlation. A negative value implies that an increase in one variable is asso-

ciated with a decrease in the other variable, while a positive correlation

implies that both variables move in the same direction. Zero correlation implies

no relationship between the variables.

Figure 3.3: Correlation Matrix

From the above correlation matrix graph we can see that there is a high

negative correlation between the dependent variable, profit, and the in-

dependent variable, depth of tree. This means that as the depth of tree in-

creases the profit will decrease. This is shown in the Profit Vs Depth of tree

graph given below.

Figure 3.4: Profit Vs Depth of tree
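The correlation matrix of Figure 3.3 and the profit-versus-depth scatter of Figure 3.4 can be reproduced roughly as follows (same assumed file and column names).

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.read_csv("pyramid_scheme.csv")   # assumed file and column names

    # Correlation matrix of all variables, drawn as an annotated heat map
    corr = df.corr()
    sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
    plt.show()

    # Scatter plot of profit against depth_of_tree to show the negative relationship
    df.plot.scatter(x="depth_of_tree", y="profit")
    plt.show()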

3.3 OLS Regression

The OLS Regression Results table below gives the output of the ordinary

least squares regression on the data under study. This table gives the multiple

regression model for the data and shows how well the model fits the data. It

also gives the results of checks on the different assumptions of the model.

From this table we can see that, since the R-squared and Adj.R-squared

values are approximately 1, the independent variables together explain almost

all of the variation in the profit variable, and since the value of R-

squared is high, the model fits the data well. The Prob(F-statistic) shows

that the regression coefficients are jointly significant.

Prob(Omnibus) and Prob(JB) are used to test the normality assumption of

the errors. Here, both these values are nearly zero, which implies that the errors

are not normally distributed. Another assumption of OLS is that the errors are

uncorrelated. This is checked using the Durbin-Watson statistic, which is near zero

here, indicating that the errors are positively autocorrelated rather than independent.

The plot below plots the predicted values against the actual values and shows

how well the fitted model predicts. The actual value is plotted on the Y-axis and the predicted

value is plotted on the X-axis. From the graph we can see that the points

lie close to the reference line Y = X, which implies that the model is

a good fit.

Figure 3.5: Actual Vs Predicted Plot
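A sketch of the multiple OLS fit and the actual-versus-predicted plot described above; the file and column names are assumptions, and the reported statistics come from the study's own output.

    import pandas as pd
    import statsmodels.api as sm
    import matplotlib.pyplot as plt

    df = pd.read_csv("pyramid_scheme.csv")   # assumed file and column names

    predictors = ["cost_price", "profit_markup", "depth_of_tree", "sales_commission"]
    X = sm.add_constant(df[predictors])
    y = df["profit"]

    results = sm.OLS(y, X).fit()
    print(results.summary())     # R-squared, Prob(F-statistic), Omnibus, Durbin-Watson, JB, ...

    # Actual vs predicted: points near the 45-degree line indicate a good fit
    pred = results.predict(X)
    plt.scatter(pred, y)
    plt.plot([y.min(), y.max()], [y.min(), y.max()], color="red")   # reference line Y = X
    plt.xlabel("Predicted profit")
    plt.ylabel("Actual profit")
    plt.show()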

3.4 Simple Linear Regression

From the correlation matrix we can see that the variable depth of tree is

highly correlated with the dependent variable, profit, with a correlation co-

efficient of −0.9, which indicates that as the depth of tree increases the profit

will decrease.
Here we fit a simple linear regression model with independent variable depth of tree

and the dependent variable profit. Given below is the result of the simple linear

regression model.

The simple linear regression model is given by Y = 9909.89 − 1010.50X,

where Y is the profit and X is the depth of tree. Here, the p-value (Pr(>|t|)) is

less than the significance level. This indicates that the regression coefficient is

significant.
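A minimal sketch of this simple linear fit using the statsmodels formula interface; the coefficient values 9909.89 and −1010.50 are the ones reported above, while the file and column names remain assumptions.

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("pyramid_scheme.csv")   # assumed file and column names

    model = smf.ols("profit ~ depth_of_tree", data=df).fit()
    print(model.params)      # intercept and slope of the fitted line
    print(model.pvalues)     # Pr(>|t|) for each coefficient
    print(model.rsquared)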

Chapter 4

CONCLUSION

The purpose of this study was to predict profit or loss in a pyramid scheme

using regression analysis. Here, for the analysis we take profit as the de-

pendent variable and all the remaining variables (cost price, profit markup,

depth of tree, and sales commission) as the independent variables.

Firstly, the ordinary least squares regression method was used to test whether

these independent variables are significant or not. From the results of the

OLS regression method we find that these variables are significant, that is,

a change in the independent variables produces a corresponding change in the de-

pendent variable. Also, from the Actual Vs Predicted plot we can see that

the fitted model is good since the plotted points lie close to the fitted line.

From the correlation matrix we can see that the independent variable depth of tree

and the dependent variable profit have the highest correlation. Thus we fit a
simple linear regression model using these two variables and from the result

of this model we get that the independent variable is significant.

From the Profit Vs Depth of tree graph we can see that as depth of tree

increases the profit decreases. This means that in a pyramid scheme when

the number of people joining the scheme increases, the people at the lowest

level of the chain receive very little money, which leads to a loss for those

participants.
Reference

1. George A.F. Seber and Alan J. Lee, 2003. Linear Regression Analysis,

Second Edition, John Wiley and Sons, Inc.

2. Douglas C. Montgomery, Elizabeth A. Peck and G. Geoffrey Vining,

2012. Introduction to Linear Regression Analysis, Fifth Edition, John Wiley

and Sons, Inc.

3. Alvin C. Rencher and G. Bruce Schaalje, 2008. Linear Models in Statis-

tics, Second Edition, John Wiley and Sons, Inc.

4. John O. Rawlings, Sastry G. Pantula and David A. Dickey, 1998. Ap-

plied Regression Analysis: A Research Tool, Second Edition, John Wi-

ley and Sons, Inc.

5. https://www.kaggle.com/datasets

6. https://www.investopedia.com/insights/what-is-a-pyramid-scheme

7. https://en.wikipedia.org/wiki/Pyramid_scheme

8. https://statisticsbyjim.com

9. https://www.albert.io/blog/ultimate-properties-of-ols-estimators-guide/

10. https://en.wikibooks.org/wiki/Econometric_Theory/Properties_of_OLS_Estimators

11. https://jyotiyadav99111.medium.com

12. https://blog.minitab.com

13. https://stats.stackexchange.com
