Statistics 19.06 v.2
[Diagram: Data: Graphical, Numerical, Data Collection (Sources of Data, Types of Data)]
e.g. mobile phone numbers and serial numbers
Types of variables
Discrete   Continuous
Interval length = Range / No. of intervals
Nominal   Ordinal   Discrete   Continuous
Histogram Construction
Step 1: Range = 90 − 40 = 50 million $
Step 2: # of intervals = 5
Step 3: Interval length = 50/5 = 10 million $
Intervals:
40 – 50
50 – 60
60 – 70
70 – 80
80 – 90
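The three steps above can be sketched in Python. This is a minimal illustration only; the function name `build_intervals` is hypothetical, and the minimum (40), maximum (90) and number of intervals (5) are taken from the steps.

```python
def build_intervals(minimum, maximum, n_intervals):
    """Return (range, interval length, list of (lower, upper) intervals)."""
    data_range = maximum - minimum                  # Step 1: range
    length = data_range / n_intervals               # Step 3: interval length
    intervals = [
        (minimum + i * length, minimum + (i + 1) * length)
        for i in range(n_intervals)                 # Step 2: number of intervals
    ]
    return data_range, length, intervals


data_range, length, intervals = build_intervals(40, 90, 5)
print(data_range, length)   # range and interval length in million $
for lower, upper in intervals:
    print(f"{lower:g} – {upper:g}")
```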
Intervals   Tally   Frequency   Relative Frequency   Percentage
40 – 50     //      2           2/25 = 0.08          8%
Total               25          1                    100%
Comment:
- The minority of the debts are found in the interval from 40–50 million $; they represent 8% of the total number of debts (25 debts).
  The majority of the debts are found in the highest interval, 80–90 million $; they represent 40% of the sample size.
  So the gap between the minimum and the maximum indicates a high level of heterogeneity.
- The majority of the debts are found in the intervals from 70–90 million $; they represent 64% (40% + 24%) of the total number of debts, which is considered a very alarming indicator.
- Based on the histogram, the distribution of the debts is skewed to the left, which is considered a very unsatisfactory indicator because the variable of study is "debts".
- Recommendations:
  Start collecting the debts found in the highest interval; revise the cases in terms of their values and level of risk.
  Offer different payment methods.
  Offer different payment plans.

What to cover in the comment:
1- Majority and minority (interval & the % they represent)
2- Gap & homogeneity/heterogeneity
3- Larger interval view
4- Skewness
5- Recommendation
Numerical Presentation
The main goal is to summarize all the values in the given dataset in one or more values, so that by looking at these values we can tell what happened in the dataset.
Central Measures
The main goal is to summarize all the values in one value where the majority of the values are around it.
1) Mean
It is the value at the center of dataset where the majority of the values are around it
2) Median
It is the value at 50% distance of the ordered dataset.
Step 1: put the values in order (smallest to largest)
Step 2: location of the median = (n + 1)/2
        In case of an even sample size, the location is between n/2 and (n/2) + 1
Step 3: value of the median
        Even sample size: average of the 2 values
Advantages:
− Easy to calculate
− Easy to explain
− It is less sensitive to outliers
Disadvantages:
− It concentrates on the location more than the value
− It does not take all the values in the dataset into calculation
− It is not applicable with qualitative data, especially if it is nominal
Comment:
The median of the profits is ………. which represents the value at 50% distance of the ordered
dataset
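The median steps can be sketched as follows. The function name `median_by_location` and the sample values are illustrative assumptions, not part of the notes.

```python
def median_by_location(values):
    """Median found by ordering the data and locating the middle position."""
    ordered = sorted(values)                  # Step 1: smallest to largest
    n = len(ordered)
    if n % 2 == 1:
        location = (n + 1) // 2               # Step 2: 1-based location
        return ordered[location - 1]          # Step 3: value at that location
    lower, upper = n // 2, n // 2 + 1         # Step 2 (even n): two locations
    return (ordered[lower - 1] + ordered[upper - 1]) / 2   # average of the 2 values


sample = [92, 85, 88, 95]                     # illustrative values
print(median_by_location(sample))
```

With an even sample size the two middle values (88 and 92 here) are averaged.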
3) Mode
It is the most frequent / repeated value(s).
Types of mode:
− Unimodal: has one peak
− Bimodal: 2 peaks exactly equivalent to each other
− No mode: no repetition (N.B.: we don't say zero)
Advantages:
− Easy to calculate
− Easy to explain
− It is applicable with qualitative data
Disadvantages:
− It is not preferred with continuous variables because it may:
  o Fail to estimate a value
  o Give misleading values
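A small sketch of finding the mode(s), using only the standard library. The function name `modes` is hypothetical. Returning an empty list for "no mode" is a design choice that mirrors the note that we do not report zero.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s); an empty list means no mode."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:                 # no repetition at all: no mode
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([1, 2, 2, 3]))       # unimodal
print(modes([1, 1, 2, 2, 3]))    # bimodal: two equally frequent values
print(modes([1, 2, 3]))          # no mode
```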
Absolute Dispersion Measures
The main goal is to evaluate how far the values are from each other and how far they are from the center of the
dataset. As a result, we can evaluate whether the values are homogeneous or heterogeneous.
1) Range
It is the distance between the min. value and the max. value.
e.g.:
Profits in million $: 92, 85, 88, 95
Range = Max. value − Min. value
      = 95 − 85 = 10 million $
Comment:
The range of the profits is 10 million $, which represents the distance between the min. profit (85 million $)
and the max. profit (95 million $).
According to the value of the gap: as the gap increases, dispersion increases and homogeneity decreases.
Advantages:
− Easy to calculate
− Easy to explain
− It combines the tails of the dataset
Disadvantages:
− It takes only two values into calculation
− It does not provide us with the average distance around the mean
− It is affected by outliers
2) Variance and Standard Deviation
Advantages:
− Easy to calculate
− Easy to explain
− It takes all values into calculation
Disadvantages:
− It is affected by outliers, because the main component in its calculation is the mean, whose main
  drawback is being impacted by outliers
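As an illustration, the range, variance and standard deviation of the profit values from the range example can be computed with Python's `statistics` module. The notes do not say whether the sample (n − 1) or population (n) formula is intended; the sketch below assumes the sample version.

```python
import statistics

profits = [92, 85, 88, 95]                    # million $, from the range example

value_range = max(profits) - min(profits)     # max − min
mean = statistics.mean(profits)
variance = statistics.variance(profits)       # sample variance (divides by n − 1): an assumption
std_dev = statistics.stdev(profits)           # square root of the sample variance

print(value_range, mean)
print(round(variance, 2), round(std_dev, 2))
```

Use `statistics.pvariance` and `statistics.pstdev` instead if the population formulas are the ones required.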
3) Interquartile Range (IQR)
Comment
Test of Outliers
A. Box Plot
B. Test of Skewness
$$\text{Skewness Coefficient} = \frac{3\,(\text{Mean} - \text{Median})}{SD}$$
In outlier problems:
1. Calculate the IQR.
2. Calculate the upper & lower bounds and draw the box plot.
3. Compare: any values outside the bounds are outliers.
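The outlier steps and the skewness coefficient can be sketched as below. The notes do not state how the bounds are computed; this sketch assumes the common convention Q1 − 1.5 × IQR and Q3 + 1.5 × IQR, and the quartile method is whatever `statistics.quantiles` uses by default, so hand-calculated quartiles may differ slightly. Function names and the sample data are hypothetical.

```python
import statistics


def outlier_bounds(values, k=1.5):
    """Lower and upper bounds; k = 1.5 is an assumed convention, not from the notes."""
    q1, _, q3 = statistics.quantiles(values, n=4)   # quartiles
    iqr = q3 - q1                                   # step 1: IQR
    return q1 - k * iqr, q3 + k * iqr               # step 2: bounds


def find_outliers(values, k=1.5):
    lower, upper = outlier_bounds(values, k)
    return [v for v in values if v < lower or v > upper]   # step 3: compare


def skewness_coefficient(values):
    """3 × (mean − median) / SD, using the sample standard deviation."""
    return 3 * (statistics.mean(values) - statistics.median(values)) / statistics.stdev(values)


data = [10, 12, 13, 14, 15, 16, 18, 60]    # hypothetical data with one extreme value
print(find_outliers(data))
print(skewness_coefficient(data) > 0)      # mean above median: skewed to the right
```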
[Figure: Scheme for testing outliers. Box plot with the Median at the center and − IQR / + IQR marked on either side; the majority of values lie between them.]
C. Coefficient of Variation
Can be used to compare the variability of two or more sets of data measured in
different units.
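The notes do not give the formula; the sketch below assumes the usual definition, CV = (SD / mean) × 100, with the sample standard deviation. The function name and both datasets are hypothetical.

```python
import statistics


def coefficient_of_variation(values):
    """CV in percent: (standard deviation / mean) × 100 (assumed standard definition)."""
    return statistics.stdev(values) / statistics.mean(values) * 100


heights_cm = [160, 170, 180]    # hypothetical data in centimetres
weights_kg = [60, 75, 90]       # hypothetical data in kilograms

print(round(coefficient_of_variation(heights_cm), 2))
print(round(coefficient_of_variation(weights_kg), 2))
```

Because the CV is unit-free, the two results can be compared directly even though one dataset is in cm and the other in kg.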
Linear correlation (correlation analysis) & linear regression
Linear correlation
Finds the direction of the relationship between two or more variables and the strength of this linear
relationship.
[Diagram: linear correlation gives direction and strength; 0 = no linear relationship, 1 = perfect linear relationship]
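$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$
Note: $\sum x^2 \neq \left(\sum x\right)^2$
$$-1 \le r \le +1$$
Working table columns: y, x, xy, x², y², with a final Sum row.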
We determine the direction of the relationship between two variables from the sign of the correlation
coefficient.
Example: r = 0.85
Comment: there is a strong positive linear relationship between income and consumption.
Loans (y) and deposits (x)
$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$
$$r = \frac{(10 \times 5085975) - (17150 \times 2865)}{\sqrt{\left[(10 \times 30983750) - (17150)^2\right] \times \left[(10 \times 853423) - (2865)^2\right]}} = 0.76$$
Loans in million $ (y)   Deposits in million $ (x)   xy        x-square   y-square
245                      1400                        343000    1960000    60025
312                      1600                        499200    2560000    97344
279                      1700                        474300    2890000    77841
308                      1875                        577500    3515625    94864
199                      1100                        218900    1210000    39601
219                      1550                        339450    2402500    47961
405                      2350                        951750    5522500    164025
324                      2450                        793800    6002500    104976
319                      1425                        454575    2030625    101761
255                      1700                        433500    2890000    65025
Sum: 2865                17150                       5085975   30983750   853423
(sum of y)               (sum of x)                  (sum of xy)  (sum of x square)  (sum of y square)
Comment: there is a strong positive linear relationship between deposits and loans
(measured in million $).
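As a check on the arithmetic, the correlation formula can be applied directly to the ten loan and deposit values from the table. The function name `pearson_r` is a hypothetical helper.

```python
from math import sqrt

loans = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]                  # y, million $
deposits = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]     # x, million $


def pearson_r(x, y):
    """Correlation coefficient from the sums used in the working table."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = pearson_r(deposits, loans)
print(round(r, 2))   # 0.76
```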
Y = β0 + β1 X + U
Comment:
- On β1: when X increases by one unit (one hour), the mark will decrease on average by 10 marks.
- On β0: when X = 0, Y will be 50 marks on average.
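$$\hat{\beta}_1 = 0.109 \approx 0.11$$
$$\bar{y} = \frac{\sum y}{n} = \frac{2865}{10} = 286.5$$
$$\bar{x} = \frac{\sum x}{n} = \frac{17150}{10} = 1715$$
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 286.5 - (0.11 \times 1715) = 97.85$$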
Y = β0 + β1 X
loans = β0 + β1 deposits
Estimated equation: loans = 97.85 + 0.11 deposits
Comment:
- Slope = 0.11
When deposits increase by one unit (one million $), the loans will increase on
average by 0.11 million $.
- Intercept: when X (deposits) is equal to zero, the loans will on average be 97.85
million $.
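The slope and intercept can be reproduced from the loans and deposits data. The slope formula below is the standard least-squares expression, which is assumed here because the notes show only the result. Note that the notes round the slope to 0.11 before computing the intercept, so the unrounded intercept differs slightly from 97.85.

```python
loans = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]                  # y, million $
deposits = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]     # x, million $


def fit_line(x, y):
    """Least-squares slope and intercept (standard formulas, assumed)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    intercept = sum_y / n - slope * sum_x / n       # y-bar − slope × x-bar
    return slope, intercept


slope, intercept = fit_line(deposits, loans)
print(round(slope, 2))                      # slope rounded as in the notes
print(round(286.5 - 0.11 * 1715, 2))        # intercept using the rounded slope
print(round(intercept, 2))                  # intercept using the unrounded slope
```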
C. Hypothesis Testing and P-value
i. In simple linear regression model
Simple = one y, one x
Hypothesis Testing
Null hypothesis (H0): there is no linear relationship between the two variables.
  Correlation equal to zero: ρ = 0
Alternative hypothesis: there is a linear relationship between the two variables.
  Correlation not equal to zero: ρ ≠ 0
Level of significance:
α: the probability of rejecting the null hypothesis when it is true; by default α = 0.05
P-value: the probability, assuming the null hypothesis is true, of obtaining a result at least as extreme as the one observed in the sample. A large p-value means the sample gives little evidence against H0.
P-value > alpha: accept (do not reject) the null hypothesis (H0): no linear relationship
P-value ≤ alpha: reject the null hypothesis (H0)
If we reject the null hypothesis, we conclude that there is a linear relationship between the 2 variables.
Example:
Correlations
                        yield    Nearby stores    population
population              0.362    0.690
                        0.038    0.000
income of population    0.537    -0.032           0.166
                        0.001    0.862            0.357
Cell Contents
  Pearson correlation
  P-Value
Comment
The correlation coefficient between yield and population is 0.362, which indicates a direct,
moderate linear relationship between yield and population.
The p-value (0.038) is less than 0.05, so we reject H0: there is a linear relationship.
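The decision rule can be written as a small function. This is only a sketch: the function name `decide` is hypothetical, and the p-values used are the ones printed in the correlation table.

```python
def decide(p_value, alpha=0.05):
    """Compare a p-value with the level of significance (default 0.05)."""
    if p_value <= alpha:
        return "reject H0: there is a linear relationship"
    return "do not reject H0: no linear relationship"


print(decide(0.038))   # p-value printed next to the 0.362 correlation
print(decide(0.862))   # p-value printed next to the -0.032 correlation
```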
Regression equation
Y = β0 + β1 X + U
Regression equation
Y = β0 + β1 X1 + β2 X2 + … + U
yield = 94400 − 7494 Nearby stores + 0.2839 population + 1.718 income of population
Comment:
- On β1 (Nearby stores coefficient): when Nearby stores increases by one unit, the yield (revenue) is
expected to decrease by 7494 on average, holding other factors constant.
- On β2 (population coefficient): when the population increases by one unit (1,000 people), the yield is
expected to increase by 0.2839 units on average, holding the other factors constant.
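To illustrate "holding other factors constant", the fitted equation can be evaluated twice with only one input changed. The coefficients come from the equation in the notes; the function name and all input values are hypothetical.

```python
def predicted_yield(nearby_stores, population, income):
    """Fitted equation from the notes; the argument names are illustrative."""
    return 94400 - 7494 * nearby_stores + 0.2839 * population + 1.718 * income


# Hypothetical inputs: only the number of nearby stores changes.
base = predicted_yield(nearby_stores=2, population=5000, income=30000)
one_more_store = predicted_yield(nearby_stores=3, population=5000, income=30000)

print(round(one_more_store - base, 2))   # change in yield for one extra nearby store
```

The difference equals the Nearby stores coefficient, which is exactly what the comment on β1 describes.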
Hypothesis Testing
Analysis of Variance
Source   DF   Adj SS   Adj MS   F-Value   P-Value
Since the p-value is less than alpha (0.05), we reject H0, which means the model is significant overall; the
model is good.
Hypothesis Testing
Since the p-value (0.001) is less than alpha (0.05), we reject H0; this parameter is significant.
Coefficients
Term   Coef   SE Coef   T-Value   P-Value   VIF
F. Coefficient of Determination
Model Summary
S R-sq R-sq(adj) R-sq(pred)
Comment:
R-squared equals 57.25%, which indicates that the independent variables are able to explain 57.25%
of the variation in the dependent variable (yield); the rest (42.75%) is due to the error (U) (omitted
variables).
Forecasting and Time series
Types of data:
- Cross sectional: a type of data in which observations are collected for more than one cross section at
  the same time interval.
- Time series: a type of data in which observations are collected for one cross section over several time
  intervals.
- Panel data: a type of data in which observations are collected for more than one cross section over
  more than one time interval.
Approaches: Classical (Additive Model, Multiplicative Model) and Box-Jenkins
Time series components:
- Trend (T): a systematic behavior that happens at a certain point in time and has the same impact, in
  terms of behavior, on the following interval.
- Seasonality (S): a systematic behavior that happens at a certain point/interval in a specific year and
  gets repeated every year.
- Cycle (C): a systematic behavior that happens at a certain point/interval in a specific year and gets
  repeated every set of years.
- Shocks (I): incidents / unexpected events (= error).
Comment on the solved example in slides:
Based on the graph; Revenues go down in the 2nd interval of each year which indicates there is
seasonality
We can conclude from points of intersection that there is no effect of seasonality in the third
o For 2015 – I:
(1.11 − 1) × 100 = 11%
Comment:
Revenue in the first interval/trimester of 2015 is higher than the average by 11%.
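The same conversion works for any seasonal index: subtract 1 and multiply by 100. A minimal sketch, with a hypothetical function name and a second, made-up index value for contrast:

```python
def seasonal_deviation_percent(index):
    """Percentage above (+) or below (−) the average implied by a seasonal index."""
    return round((index - 1) * 100, 2)   # rounding hides floating-point noise


print(seasonal_deviation_percent(1.11))   # index from the 2015 – I example
print(seasonal_deviation_percent(0.90))   # hypothetical index below the average
```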