Statistics 19.06 v.2


Presenting Data
− Graphical
− Numerical

Data Collection
− Sources of data
− Types of data

Sources of Data

Population
− Advantages: accuracy, reliability, confidence
− Disadvantages: time, cost, more effort, moral factors

Sample
− Properties of a good sample:
  o Random
  o Sufficient sample size (depends on confidence level, margin of error, and homogeneity)
− Sampling methods: simple random, systematic, stratified, cluster

Types of Data

Qualitative (nominal, ordinal)
You cannot perform any mathematical operations on the values, and if you do, the result will have no meaning.

Quantitative (interval, ratio)
You can perform at least one mathematical operation on the values, and if you do, the result will have a meaning.

Not all numbers are quantitative, because for some numbers it is meaningless to perform operations on them, e.g. mobile phone numbers and serial numbers.

N.B.: Counting is not a mathematical operation.

Types of quantitative variables
− Discrete: clear gaps between values
− Continuous: no gaps between values
Steps
1. Calculate the range (max value − min value)
2. No. of intervals (by default = 5)
3. Interval length = range / no. of intervals

Commenting
1. Min & max value
2. Insights
3. Recommendation

Graphical Presentation

Qualitative
− Nominal: pie chart
− Ordinal: bar chart

Quantitative
− Discrete: bar chart
− Continuous: histogram

N.B.: A line chart is the best choice when presenting data over time.

Histogram Construction

The frequency table has the columns: Intervals, Tally, Frequency, Relative Frequency, Percentage.

− Frequency = count
− Relative Frequency = Frequency / total no. of values
− Percentage = Relative Frequency × 100

Step 1: Range = 90 − 40 = 50 million $

Step 2: No. of intervals = 5

Step 3: Interval length = 50 / 5 = 10 million $

Intervals: 40 – 50, 50 – 60, 60 – 70, 70 – 80, 80 – 90
Intervals   Tally        Frequency   Relative Frequency   Percentage
40 – 50     //           2           2/25 = 0.08          8%
50 – 60     ///          3           3/25 = 0.12          12%
60 – 70     ////         4           4/25 = 0.16          16%
70 – 80     //// /       6           6/25 = 0.24          24%
80 – 90     //// ////    10          10/25 = 0.40         40%
Total                    25          1                    100%
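The three steps and the frequency table can be reproduced with a short sketch. The frequencies are copied from the solved example; the function names `interval_edges` and `frequency_table` are illustrative choices, not part of the notes.

```python
# Frequencies per interval, copied from the solved example (debts in million $).
intervals = ["40-50", "50-60", "60-70", "70-80", "80-90"]
frequencies = [2, 3, 4, 6, 10]

def interval_edges(min_value, max_value, n_intervals=5):
    """Interval length = range / no. of intervals; return the boundaries."""
    length = (max_value - min_value) / n_intervals
    return [min_value + i * length for i in range(n_intervals + 1)]

def frequency_table(labels, counts):
    """Rows of (interval, frequency, relative frequency, percentage)."""
    total = sum(counts)
    return [(label, c, c / total, 100 * c / total)
            for label, c in zip(labels, counts)]

edges = interval_edges(40, 90)
table = frequency_table(intervals, frequencies)
for label, freq, rel, pct in table:
    print(f"{label:>7}  {freq:>2}  {rel:.2f}  {pct:.0f}%")
```

The relative frequencies always sum to 1, which is a quick check that no value was missed in the tally.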

Comment on solved example:

- The minority of the debts are found in the interval from 40-50 million $, and they represent 8% of the total number of debts (25 debts), whereas the majority of the debts are found in the highest interval, from 80-90 million $, and they represent 40% of the sample size.
  So the gap between the minimum and the maximum indicates a high level of heterogeneity.
- The majority of the debts are found in the intervals from 70-90 million $, and they represent 64% (40% + 24%) of the total number of debts, which is considered to be a very alarming indicator.
- Based on the histogram, the distribution of the debts is skewed to the left, which is considered to be a very unsatisfactory indicator because the variable of study is the "debts".
- Recommendations:
  o Start collecting the debts which are found in the highest interval, and revise the cases in terms of their values and level of risk.
  o Offer different payment methods.
  o Offer different payment plans.

A comment should cover:
1- Majority and minority (interval & the % it represents)
2- Gap & homogeneity/heterogeneity
3- Larger interval view
4- Skewness
5- Recommendation

Numerical Presentation
The main goal is to summarize all the values in the given dataset in one or more values, so that looking at these values tells us what happened in the dataset.

What? How? When?

Central Measures
The main goal is to summarize all the values in one value where the majority of the values are around it.

1) Mean
It is the value at the center of the dataset, where the majority of the values are around it.

Mean = Sum of the values / n, where n is the count of the values

Advantages:
− Easy to be calculated
− Easy to be explained
− Takes all the values into calculation

Disadvantages:
− It is affected by outliers

Comment:
The mean of the profits is 90 million $, which represents the value at the center of the dataset, where the majority of the values are around it.

2) Median
It is the value at 50% distance of the ordered dataset.

Step 1: Put the values in order (smallest to largest).
Step 2: Find the location of the median: (n + 1) / 2.
        For an even sample size, the location falls between positions n/2 and (n/2) + 1.
Step 3: Find the value of the median.
        For an even sample size, it is the average of the 2 middle values.

Advantages:
− Easy to be calculated
− Easy to be explained
− It is less sensitive to the outliers

Disadvantages:
− It concentrates on the location more than the value
− It does not take into calculation all the values in the dataset
− It is not applicable with qualitative data, especially if it is nominal

Comment:

The median of the profits is ………. which represents the value at 50% distance of the ordered dataset.

3) Mode
It is the most frequent / repeated value(s).

Types of mode:
− Unimodal: has one peak
− Bimodal: 2 peaks exactly equivalent to each other
− No mode: no repetition (N.B.: we don't say the mode is zero)

Advantages:
− Easy to be calculated
− Easy to be explained
− It is applicable with qualitative data

Disadvantages:
− It is not preferred to be used with continuous variables due to:
  o Fail to estimate a value
  o Misleading values
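The three central measures can be checked with a minimal sketch. It uses the four profit values that appear in the range example of these notes (92, 85, 88, 95); the helper `modes` is an illustrative name, written so that "no mode" is reported as an empty list rather than zero.

```python
from collections import Counter
from statistics import mean, median

profits = [92, 85, 88, 95]  # million $, from the range example in these notes

def modes(values):
    """Return the most frequent value(s); an empty list means 'no mode'."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:  # no value repeats, so there is no mode
        return []
    return sorted(v for v, c in counts.items() if c == top)

print(mean(profits))    # sum of the values / n
print(median(profits))  # even n: average of the two middle ordered values
print(modes(profits))   # empty list, because no profit value repeats
```

A dataset such as 1, 2, 2, 3, 3 would return two modes (2 and 3), which is the bimodal case.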
Absolute Dispersion Measures
The main goal is to evaluate how far the values are from each other and how far they are from the center of the dataset. As a result, we can evaluate whether the values are homogeneous or heterogeneous.

1) Range
It is the distance between the min. value and the max. value.

e.g.:
Profits in million $: 92, 85, 88, 95
Range = Max. value – Min. value
      = 95 – 85 = 10 million $

Comment:
The range of the profits is 10 million $, which represents the distance between the min. profit (85 million $) and the max. profit (95 million $).
According to the value of the gap: as the gap increases, dispersion increases and homogeneity decreases.

Advantages:
− Easy to be calculated
− Easy to be explained
− It combines the tails of the dataset

Disadvantages:
− It takes only two values into calculation
− It does not provide us with the average distance around the mean
− It is affected by outliers
2) Variance and Standard Deviation

− SD is the average distance around the mean.

Comment:
− SD of profits is 4.4 million $, which represents the average distance around the mean profit (90 million $).
− As a result of that, the majority of the values range from 85.6 million $ to 94.4 million $ on average.

Advantages:
− Easy to be calculated
− Easy to be explained
− It takes all values into calculation

Disadvantages:
− It is affected by outliers, because the main component in its calculation is the mean, which has the main drawback of being impacted by outliers

− When SD & variance = zero:
This means perfect homogeneity, i.e. all values are the same (one constant value repeated; a flat distribution).
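The range and SD of the profit example can be recomputed with the sketch below. The formula for the variance does not survive in these notes, so the sample version (dividing by n − 1) is an assumption; it is used here because it reproduces the SD of 4.4 million $ quoted in the comment.

```python
from statistics import mean, stdev, variance

profits = [92, 85, 88, 95]  # million $

value_range = max(profits) - min(profits)
var = variance(profits)  # sample variance: sum((x - mean)^2) / (n - 1)  (assumed)
sd = stdev(profits)      # square root of the sample variance
m = mean(profits)

print(value_range)
print(round(var, 2), round(sd, 1))
# Interval "mean ± SD" used in the comment on the majority of the values
print(round(m - sd, 1), round(m + sd, 1))
```

A dataset with one constant value repeated, e.g. `[90, 90, 90]`, gives a variance and SD of zero, which is the perfect-homogeneity case.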

3) Interquartile Range (IQR)

Step 1: Put the values in order (smallest to largest).

50 65 67 70 72 75 77 80 82 112

Step 2: Location of Q1 = ¼ (n + 1) = ¼ (10 + 1) = 2.75

Value of Q1 = Start + ratio × distance = 65 + 0.75 (67 – 65) = 66.5 million $

Step 3: Location of Q3 = ¾ (n + 1) = ¾ (10 + 1) = 8.25

Value of Q3 = Start + ratio × distance = 80 + 0.25 (82 – 80) = 80.5 million $

Step 4: IQR = Q3 – Q1 = 80.5 – 66.5 = 14 million $

Comment

− Q1 of profits is 66.5 million $, which represents the value at 25% distance of the ordered dataset.
− Q3 of profits is 80.5 million $, which represents the value at 75% distance of the ordered dataset.
− IQR of profits is 14 million $, which represents the range of the middle 50% of the ordered dataset after excluding the lowest and the highest 25% of the ordered dataset.
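The four steps can be sketched as a small function that follows the same "start + ratio × distance" rule. The name `quantile_by_location` is illustrative, and the sketch assumes the computed location falls between 1 and n, as it does in this example.

```python
def quantile_by_location(values, fraction):
    """Value at position fraction * (n + 1) of the ordered data (1-based),
    interpolated as start + ratio * distance.
    Assumes the position falls between 1 and n."""
    ordered = sorted(values)
    location = fraction * (len(ordered) + 1)
    whole = int(location)      # position of the 'start' value
    ratio = location - whole   # fractional part of the location
    start = ordered[whole - 1]
    if ratio == 0:
        return start
    return start + ratio * (ordered[whole] - start)

profits = [50, 65, 67, 70, 72, 75, 77, 80, 82, 112]  # million $
q1 = quantile_by_location(profits, 0.25)
q3 = quantile_by_location(profits, 0.75)
iqr = q3 - q1
print(q1, q3, iqr)
```

Calling the same function with a fraction of 0.5 gives the median, since (n + 1) / 2 is the median location.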

Test of Outliers
A. Box Plot

B. Test of Skewness

$$\text{Skewness Coefficient (SC)} = \frac{3\,(\text{Mean} - \text{Median})}{SD}$$

− Negatively skewed (skewed to the left): SC < –0.5
− Symmetric: SC = 0 ± 0.5 (from –0.5 to +0.5)
− Positively skewed (skewed to the right): SC > +0.5
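As a hedged illustration, the skewness coefficient, 3 × (mean − median) / SD, can be computed for the profit dataset of the solved example after the value 112 is removed. The sample SD (n − 1) is assumed, and the function names are illustrative.

```python
from statistics import mean, median, stdev

def skewness_coefficient(values):
    """SC = 3 * (mean - median) / SD, using the sample SD (assumed)."""
    return 3 * (mean(values) - median(values)) / stdev(values)

def classify(sc):
    """Apply the ±0.5 thresholds from the notes."""
    if sc < -0.5:
        return "negatively skewed"
    if sc > 0.5:
        return "positively skewed"
    return "symmetric"

# Profits in million $ with the outlier (112) removed
profits = [50, 65, 67, 70, 72, 75, 77, 80, 82]
sc = skewness_coefficient(profits)
print(round(sc, 2), classify(sc))
```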

In outlier problems:
1. Calculate the IQR.
2. Calculate the upper & lower bounds and draw the box plot.
3. Compare: any values that fall outside the bounds are outliers.

Comment for solved example:

- In case there are outliers: since the profit of 112 million $ is greater than the upper bound of the test (101.5 million $), this value is confirmed to be an outlier.
  o As a result, the given dataset of profits is skewed to the right, the best central measure is the median, and the best absolute dispersion measure is the IQR.
- Since the dataset contains outliers, it is skewed and the best absolute dispersion measure for it is the "IQR".
  o IQR of profits is 14 million $, which represents the range of the middle 50% of the ordered dataset after excluding the lowest and the highest 25% of the ordered dataset.
- Since the dataset contains outliers, it is skewed and the best central measure for it is the "Median".
  o The median of the profits is 73.5 million $, which represents the value at 50% distance of the ordered dataset.
In addition, the IQR is equal to 14 million $, which indicates that the majority of the values in the given dataset range from 59.5 million $ to 87.5 million $ (median ± IQR) on average.
After removal of the outlier value (skewness must be measured):
- Since the skewness coefficient (-0.34) is between -0.5 and 0.5, the given dataset of profits after removing the value of 112 (outlier) is confirmed to be symmetric, and the best central measure is the mean (70.9 million $), which represents the value at the center of the dataset, where the majority of the values are around it.

Majority of values: from (Median − IQR) to (Median + IQR)
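The outlier test can be sketched as follows. The notes quote only the resulting upper bound (101.5 million $); the common rule "Q3 + 1.5 × IQR" and "Q1 − 1.5 × IQR" is assumed here because it reproduces that figure.

```python
profits = [50, 65, 67, 70, 72, 75, 77, 80, 82, 112]  # million $
q1, q3, median = 66.5, 80.5, 73.5                    # from the solved example

iqr = q3 - q1
# Assumed rule: bounds are 1.5 * IQR beyond the quartiles
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

outliers = [v for v in profits if v < lower_bound or v > upper_bound]
# "Majority of values" interval used in the comment: median ± IQR
majority_range = (median - iqr, median + iqr)

print(lower_bound, upper_bound, outliers, majority_range)
```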

Scheme for testing outliers

C. Coefficient of Variation
Can be used to compare the variability of two or more sets of data measured in
different units.

The lower the CV, the higher the level of homogeneity.

− High CV: heterogeneous
− Low CV: homogeneous
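The formula itself does not survive in these notes, so the sketch below assumes the usual definition, CV = (SD / mean) × 100, with the sample SD. It compares the two profit datasets used elsewhere in the notes purely as an illustration.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """CV in percent: SD / mean * 100 (usual definition, assumed here)."""
    return stdev(values) / mean(values) * 100

profits_a = [92, 85, 88, 95]                            # million $
profits_b = [50, 65, 67, 70, 72, 75, 77, 80, 82, 112]  # million $

cv_a = coefficient_of_variation(profits_a)
cv_b = coefficient_of_variation(profits_b)
print(round(cv_a, 1), round(cv_b, 1))
# The dataset with the lower CV is the more homogeneous one
print("A" if cv_a < cv_b else "B", "is more homogeneous")
```

Because the CV is a ratio, it has no unit, which is why it can compare datasets measured in different units.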

Linear correlation (correlation analysis) & linear regression

Linear correlation
It finds the direction of the linear relationship between two or more variables and the strength of this relationship.

Linear correlation is described by:
− Direction: negative or positive
− Strength (by the absolute value of r):
  o 0: no linear relationship
  o Above zero up to 0.5: weak
  o Above 0.5 up to 0.70: moderate
  o Above 0.70 up to 1: strong
  o 1: perfect linear relationship

− Before we calculate the correlation, we have to define the independent and the dependent variable.
− Independent variable (x): it is the variable which affects the dependent variable (y), such as income affects consumption, experience affects salary, and cost affects profit.
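A. Correlation coefficient (r)

$$r = \frac{n \sum xy - \sum x \sum y}{\sqrt{\left[n \sum x^2 - \left(\sum x\right)^2\right]\left[n \sum y^2 - \left(\sum y\right)^2\right]}}$$

Note that $\sum x^2 \neq \left(\sum x\right)^2$, and $-1 \le r \le +1$.

To apply the formula, build a table with the columns y, x, xy, x², and y², and compute the sum of each column.

We determine the direction of the relationship between the two variables based on the sign of the correlation coefficient.

Example: r = 0.85
Comment: there is a positive strong linear relationship between income and consumption.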

Loans (y) and deposits (x)

Loans in million $ (y)   Deposits in million $ (x)   xy         x²          y²
245                      1400                        343000     1960000     60025
312                      1600                        499200     2560000     97344
279                      1700                        474300     2890000     77841
308                      1875                        577500     3515625     94864
199                      1100                        218900     1210000     39601
219                      1550                        339450     2402500     47961
405                      2350                        951750     5522500     164025
324                      2450                        793800     6002500     104976
319                      1425                        454575     2030625     101761
255                      1700                        433500     2890000     65025
Sum: 2865                17150                       5085975    30983750    853423

$$r = \frac{(10 \times 5085975) - (17150 \times 2865)}{\sqrt{\left[(10 \times 30983750) - (17150)^2\right] \times \left[(10 \times 853423) - (2865)^2\right]}} = 0.76$$

Correlation coefficient r = 0.76

Comment: there is a positive strong linear relationship between deposits and loans (measured in million $).
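The hand calculation can be cross-checked with a short sketch that builds the same column sums from the table and applies the formula for r. The function name `pearson_r` is an illustrative choice.

```python
from math import sqrt

# Data from the table (million $)
loans = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]                # y
deposits = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]  # x

def pearson_r(x, y):
    """Correlation coefficient from the column sums, as in the formula above."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

r = pearson_r(deposits, loans)
print(round(r, 2))
```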

B. Simple Linear regression:


It is a statistical technique that evaluates the impact of the independent variable on the dependent variable.

It estimates the amount of change in Y that corresponds to a one-unit change in X.

Y = β0 + β1 X + U

− Y: dependent variable (such as marks)
− X: independent variable (such as studying hours)
− β0: intercept
− β1: slope or regression coefficient
− U: error

Examples:
− Salary (Y) = 2500 + 1000 Experience (X)
− Marks (Y) = 50 + 10 Studying hours (X): X = 1 gives Y = 60, X = 2 gives Y = 70, X = 3 gives Y = 80

Comment:

- On β1: when X increases by one unit (one hour), the mark will increase on average by 10 marks.
- On β0: when X = 0, Y will be 50 marks on average.

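How to calculate $\hat{\beta}_1$ and $\hat{\beta}_0$

$$\hat{\beta}_1 = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2} = \frac{(10 \times 5085975) - (17150 \times 2865)}{(10 \times 30983750) - (17150)^2} \approx 0.11$$

$$\bar{y} = \frac{\sum y}{n} = \frac{2865}{10} = 286.5 \qquad \bar{x} = \frac{\sum x}{n} = \frac{17150}{10} = 1715$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 286.5 - (0.11 \times 1715) = 97.85$$

$$Y = \beta_0 + \beta_1 X$$

$$\text{loans} = \beta_0 + \beta_1\,\text{deposits}$$

$$\widehat{\text{loans}} = 97.85 + 0.11\,\text{deposits}$$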
Comment:

- Slope = 0.11: when deposits increase by one unit (one million $), the loans will increase on average by 0.11 million $.
- Intercept: when X (deposits) is equal to zero, the loans will be 97.85 million $ on average.
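A minimal sketch of the slope and intercept calculation on the loans and deposits data follows. The function name `fit_line` is illustrative. Note that the notes round the slope to 0.11 before computing the intercept, which gives 97.85; keeping the unrounded slope gives an intercept of about 98.25, so both are shown.

```python
# Data from the loans/deposits table (million $)
loans = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]                # y
deposits = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]  # x

def fit_line(x, y):
    """Least-squares intercept and slope from the column sums."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    intercept = sum_y / n - slope * sum_x / n
    return intercept, slope

intercept, slope = fit_line(deposits, loans)
x_bar, y_bar = sum(deposits) / len(deposits), sum(loans) / len(loans)
intercept_rounded_slope = y_bar - round(slope, 2) * x_bar  # the notes' route

print(round(slope, 2), round(intercept, 2), round(intercept_rounded_slope, 2))
```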

C. Hypothesis Testing and P-value
i. In simple linear regression model
Simple = one y and one x

Hypothesis Testing
− Null hypothesis (H0): there is no linear relationship between the two variables (correlation equal to zero, ρ = 0).
− Alternative hypothesis (H1): there is a linear relationship between the two variables (correlation not equal to zero, ρ ≠ 0).

Level of significance:

− α: the probability of rejecting the null hypothesis when it is true; by default α = 0.05.
− Alpha (error) = 10%, confidence = 90%
− Alpha = 5%, confidence = 95%
− Alpha = 1%, confidence = 99%

− P-value: the probability of obtaining a result at least as extreme as the one observed in the sample if the null hypothesis is true. The smaller the P-value, the stronger the evidence against the null hypothesis.
− P-value > alpha: do not reject (accept) the null hypothesis (H0), i.e. no linear relationship.
− P-value < alpha (0.05): reject H0, i.e. there is a linear relationship.

The null hypothesis can either be rejected or not.

If we reject the null hypothesis, we conclude that a relationship exists between the 2 variables.

The P-value is used for this conclusion.

Example:

Correlations
                        yield     Nearby stores   population
Nearby stores           -0.144
                         0.423
population               0.362     0.690
                         0.038     0.000
income of population     0.537    -0.032           0.166
                         0.001     0.862           0.357

Cell contents: Pearson correlation (first line), P-value (second line)

Comment

Regarding nearby stores
− The correlation coefficient between yield and nearby stores is -0.144, which indicates an inverse weak linear relationship between yield and nearby stores.
− Since the P-value (0.423) is greater than α (0.05), we accept H0: there is no linear relationship.

Regarding population
− The correlation coefficient between yield and population is 0.362, which indicates a positive weak linear relationship between yield and population, which indicates that the market is open.
− The P-value (0.038) is less than 0.05, so we reject H0: there is a linear relationship.

Regarding income of population
− The correlation coefficient between yield and income of population is 0.537, which indicates a direct moderate linear relationship between yield and income of population.
− The P-value (0.001) is less than 0.05, so we reject H0: there is a linear relationship.

Based on the previous correlation analysis, the factor of interest to study is income of population.
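The reading of the correlation table can be sketched as three small rules: direction from the sign of r, strength from the thresholds given earlier in these notes (up to 0.5 weak, up to 0.70 moderate, above that strong), and the decision from comparing the P-value with alpha. The function names are illustrative.

```python
ALPHA = 0.05  # default level of significance in these notes

def direction(r):
    if r > 0:
        return "positive"
    if r < 0:
        return "negative"
    return "none"

def strength(r):
    size = abs(r)
    if size == 0:
        return "no linear relationship"
    if size <= 0.5:
        return "weak"
    if size <= 0.70:
        return "moderate"
    if size < 1:
        return "strong"
    return "perfect"

def decision(p_value, alpha=ALPHA):
    return "reject H0" if p_value < alpha else "do not reject H0"

# (Pearson correlation with yield, P-value) from the table
yield_correlations = {
    "Nearby stores": (-0.144, 0.423),
    "population": (0.362, 0.038),
    "income of population": (0.537, 0.001),
}
for factor, (r, p) in yield_correlations.items():
    print(factor, direction(r), strength(r), decision(p))
```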

Regression equation
Y = β0 + β1 X + U

Yield = 133032 − 1684 Nearby stores

Comment
In case of considering nearby stores:
− When x (nearby stores) increases by one unit, the yield (revenue) is expected to decrease by 1684 $ on average.
− When x (the independent variable, nearby stores) is equal to zero, the yield (revenue) is expected to be 133032 $ on average. This is the maximum expected revenue, because the slope is negative.

In case of considering population instead of nearby stores:
− The constant (b0) will represent the minimum revenue, because the slope is positive.

ii. In multiple regression model


Multiple regression model = more than one x

Regression equation
Y = β0 + β1 X1 + β2 X2 + … + U

yield = 94400 − 7494 Nearby stores + 0.2839 population + 1.718 income of population

Comment:

− On β1 (nearby stores coefficient): when nearby stores increase by one unit, the yield (revenue) is expected to decrease by 7494 on average, holding the other factors constant.

− On β2 (population coefficient): when the population increases by one unit (1,000 people), the yield is expected to increase by 0.2839 units on average, holding the other factors constant.
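The phrase "holding the other factors constant" can be illustrated with the fitted equation. The coefficients are taken from the notes; the input values passed to `predict_yield` are hypothetical and only serve to show that changing one factor by one unit moves the prediction by exactly that factor's coefficient.

```python
def predict_yield(nearby_stores, population, income):
    """Fitted multiple regression equation from the notes."""
    return 94400 - 7494 * nearby_stores + 0.2839 * population + 1.718 * income

# HYPOTHETICAL inputs, chosen only for illustration
base = predict_yield(nearby_stores=3, population=100_000, income=30_000)
one_more_store = predict_yield(nearby_stores=4, population=100_000, income=30_000)
one_more_person = predict_yield(nearby_stores=3, population=100_001, income=30_000)

print(round(one_more_store - base, 4))   # effect of +1 nearby store
print(round(one_more_person - base, 4))  # effect of +1 unit of population
```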

D. Testing overall significance of the model

Hypothesis Testing

− Null hypothesis (H0): there is no linear relationship between the two variables (correlation equal to zero, ρ = 0).
− Alternative hypothesis: there is a linear relationship between the two variables (correlation not equal to zero).

Analysis of Variance

Source                   DF   Adj SS        Adj MS       F-Value   P-Value
Regression               3    9195480914    3065160305   12.95     0.000
  Nearby stores          1    3327474641    3327474641   14.05     0.001
  population             1    4302754086    4302754086   18.17     0.000
  income of population   1    2327418069    2327418069   9.83      0.004
Error                    29   6866702968    236782861
Total                    32   16062183882

Since the P-value (0.000) is less than alpha (0.05), we reject H0, which means the model as a whole is significant; the model is good.
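The relations between the ANOVA columns are not spelled out in these notes; the sketch below assumes the standard ones (mean square = sum of squares / degrees of freedom, F = regression mean square / error mean square) and checks them against the table.

```python
# Values from the Analysis of Variance table
ss_regression, df_regression = 9195480914, 3
ss_error, df_error = 6866702968, 29

# Assumed standard relations: MS = SS / DF and F = MS regression / MS error
ms_regression = ss_regression / df_regression
ms_error = ss_error / df_error
f_value = ms_regression / ms_error

ss_total = ss_regression + ss_error
df_total = df_regression + df_error

print(round(ms_regression), round(ms_error), round(f_value, 2))
print(ss_total, df_total)
```

The same division gives the F-value of each single term, e.g. 3327474641 / 236782861 for nearby stores.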

E. Testing single factor or parameter

Hypothesis Testing

− Null hypothesis (H0): the parameter is not significant.
− Alternative hypothesis (H1): the parameter is significant.

Since the P-value (0.001) is less than alpha (0.05), we reject H0; this parameter is significant.

Coefficients

Term                   Coef     SE Coef   T-Value   P-Value   VIF
Constant               94400    13193     7.16      0.000
Nearby stores          -7494    1999      -3.75     0.001     1.99
population             0.2839   0.0666    4.26      0.000     2.04
income of population   1.718    0.548     3.14      0.004     1.07
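As a hedged cross-check of the table, each T-Value should equal the coefficient divided by its standard error (the standard relation, assumed here since the notes do not state it). The sketch recomputes the column and applies the same P-value rule with alpha = 0.05.

```python
ALPHA = 0.05

# (Coef, SE Coef, P-Value) from the Coefficients table
coefficients = {
    "Constant": (94400, 13193, 0.000),
    "Nearby stores": (-7494, 1999, 0.001),
    "population": (0.2839, 0.0666, 0.000),
    "income of population": (1.718, 0.548, 0.004),
}

# Assumed standard relation: T-Value = Coef / SE Coef
t_values = {term: coef / se for term, (coef, se, _) in coefficients.items()}
significant = {term: p < ALPHA for term, (_, _, p) in coefficients.items()}

for term in coefficients:
    print(term, round(t_values[term], 2), significant[term])
```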

F. Coefficient of Determination

Model Summary

S         R-sq     R-sq(adj)   R-sq(pred)
15387.8   57.25%   52.83%      43.93%

Coefficient of Determination (R-sq)

It is a statistical measure that explains to which extent the independent variables are able to explain the variation in the dependent variable; the rest is due to the error.

Comment:
R-squared is equal to 57.25%, which indicates that the independent variables are able to explain 57.25% of the variation in the dependent variable (yield), and the rest (42.75%) is due to the error (U) (omitted variables).
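The Model Summary figures can be recovered from the Analysis of Variance table. The formulas are not given in these notes, so the sketch assumes the standard ones: R-sq = SS regression / SS total, R-sq(adj) = 1 − (1 − R-sq)(n − 1)/(n − k − 1), and S = square root of the error mean square, with n = 33 observations (total DF + 1) and k = 3 independent variables.

```python
from math import sqrt

# Values from the Analysis of Variance table
ss_regression = 9195480914
ss_error = 6866702968
ss_total = 16062183882
n = 33  # observations, inferred from total DF = 32
k = 3   # independent variables

# Assumed standard formulas
r_sq = ss_regression / ss_total
r_sq_adj = 1 - (1 - r_sq) * (n - 1) / (n - k - 1)
s = sqrt(ss_error / (n - k - 1))

print(round(100 * r_sq, 2), round(100 * r_sq_adj, 2), round(s, 1))
print(round(100 * (1 - r_sq), 2))  # share of the variation left to the error
```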

Forecasting and Time series

Types of economic dataset

− Cross sectional: a type of data in which observations are collected for more than one cross section at the same time interval.
− Time series: a type of data in which observations are collected for one cross section over several time intervals.
− Panel data: a type of data in which observations are collected for more than one cross section over more than one time interval.

Time Series Approaches

− Classical
  o Additive Model
  o Multiplicative Model
− Box-Jenkins

Time series components

− Trend (T): a systematic behavior that happens at a certain point in time and has the same impact in terms of behavior on the following interval.
− Seasonality (S): a systematic behavior that happens at a certain point/interval in a specific year and gets repeated every year.
− Cycle (C): a systematic behavior that happens at a certain point/interval in a specific year and gets repeated every set of years.
− Shocks (I): incidents / unexpected events (= error).

Steps for Forecasting by Multiplicative Model

Steps:

1- Graph the dataset to detect seasonality visually.
   Comment on the solved example in slides:
   Based on the graph, revenues go down in the 2nd interval of each year, which indicates there is seasonality.

2- Smooth the dataset by using the MA (moving average) technique.
   From observation on the graph:
   − Gaps indicate a seasonal effect.
   − We can conclude from the points of intersection that there is no effect of seasonality in the third interval of each year, and it is driven by the trend.

3- Estimate the seasonal effect for each interval in each year.
   o Seasonal effect of 2014-II:
     (0.88 − 1) × 100 = -12%
     Comment:
     Revenue in the second interval of 2014 is less than the average by 12%.
   o For 2015-I:
     (1.11 − 1) × 100 = +11%
     Comment:
     Revenue in the first interval (trimester) of 2015 is higher than the average by 11%.

4- Estimate the seasonal index for each interval across all years.
   o Seasonal index for the first interval (I) = 1.11
     (1.11 − 1) × 100 = +11%
     Comment:
     This means the revenue in the first interval (trimester) is higher than the average by 11% for any given year.
   o Seasonal index for the second interval (II) = 0.88
     (0.88 − 1) × 100 = -12%
     Comment:
     This means the revenue in the second interval is less than the average by 12% for any given year.
   o Seasonal index for the third interval (III) = 1
     Comment:
     This means there is no seasonal effect in the third interval.

5- Estimate the deseasonalized dataset.
   Des. Rev. = Yt / Seasonal Index

6- Estimate the linear trend model by using the dataset in (5).

7- Make the forecast by the model in (6) and adjust the value by the seasonal index.
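Steps 3 to 7 can be sketched end to end as below. The seasonal indexes (1.11, 0.88, 1) come from the notes, but the revenue series is HYPOTHETICAL, since the slide data is not reproduced here; the function names are illustrative, and the trend is fitted by ordinary least squares against the period number.

```python
seasonal_index = {"I": 1.11, "II": 0.88, "III": 1.00}  # from the notes

# HYPOTHETICAL revenues: two years of three intervals each
periods = ["I", "II", "III", "I", "II", "III"]
revenue = [116.55, 96.8, 115.0, 133.2, 110.0, 130.0]

def seasonal_effect(index):
    """Percentage above (+) or below (-) the average: (index - 1) * 100."""
    return (index - 1) * 100

def fit_trend(y):
    """Least-squares linear trend of y against t = 1, 2, ..., n."""
    n = len(y)
    t = list(range(1, n + 1))
    sum_t, sum_y = sum(t), sum(y)
    sum_ty = sum(a * b for a, b in zip(t, y))
    sum_t2 = sum(a * a for a in t)
    slope = (n * sum_ty - sum_t * sum_y) / (n * sum_t2 - sum_t ** 2)
    intercept = sum_y / n - slope * sum_t / n
    return intercept, slope

# Step 5: deseasonalize (Yt / seasonal index)
deseasonalized = [y / seasonal_index[p] for y, p in zip(revenue, periods)]
# Step 6: linear trend on the deseasonalized data
intercept, slope = fit_trend(deseasonalized)
# Step 7: trend forecast for the next period, adjusted by its seasonal index
next_t, next_period = len(revenue) + 1, "I"
forecast = (intercept + slope * next_t) * seasonal_index[next_period]

print(round(seasonal_effect(0.88)), round(seasonal_effect(1.11)))
print(round(intercept, 2), round(slope, 2), round(forecast, 2))
```

In the multiplicative model the adjustment in step 7 is a multiplication by the seasonal index, which mirrors the division used to deseasonalize in step 5.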
