Professional Documents
Culture Documents
Practical Statistical Questions: Session 1
Practical Statistical Questions: Session 1
Practical Statistical Questions: Session 1
Questions
Session 1
Course outline
Statistics
(=state
arithmetic)
1
x=
N
x
i =1
E {x}
Sampling
error
Sampling
distribution
Salary
xx43x2x5x1
Salary
E {x}
N = 2000
Salary
Unbiased
Asymptotically unbiased
E {x}
N = 4000
Salary
E {x} = 0
lim E { x } = 0
Salary distribution
Sampling distribution
Salary
Sampling distribution: distribution of the statistic if all possible samples of size N were drawn
from a given population
7
95%
Salary
Confidence interval
x3
x2
x1
x4
Salary
2
X i N ,
i =1
N
2
N
Xi
2
N
i =1
1
X i N ( , )
N
2
f 2 ( x)
k
Sometimes the distribution of the statistic is NOT known, but still the mean is well behaved
E {Xi} =
Var { X i } = 2
1
lim
N N
2
X i N ,
i =1
N
N
Discrete Sex{Male,Female}
No. Sons{0,1,2,3,}
Continuous Temperature[0,)
Metric
or
Quantitative
Nonmetric
or
Qualitative
Nominal/
Categorical
Scale
0=Male
1=Female
Ordinal
Scale
0=Love
1=Like
2=Neither like nor dislike
3=Dislike
4=Hate
Interval
Scale
Years:
2006
2007
Ratio
Scale
Temperature:
0K
1K
10
No order
Peter: Black
Molly: Blond
Charles: Brown
Company size
{Small, Medium, Big}
Company A: Big
Company B: Small
Company C: Medium
Peter:
Molly:
Charles:
Implicit order
{0, 0,1, 0}
{0,1, 0, 0}
{1, 0, 0, 0}
xsize {0,1, 2}
Company A: 2
Company B: 0
Company C: 1
11
1
X i N ( , )
N
2
2
X i N ,
i =1
N
N
Hypothesis testing
Cannot use statistical tests based on any assumption about the distribution of the underlying
variable (t-test, F-tests, 2-tests, )
Solution:
discretize the data and use a test for categorical/ordinal data (non-parametric tests)
use randomized tests
12
(-) Very sensitive to large outliers, not too meaningful for certain distributions
(+) Unique, unbiased estimate of the population mean,
better suited for symmetric distributions
*
AM
1
=
N
x
i =1
1
x*AM = (5 + 5 + 5 5 5 5) = 0%
Property E { x*AM } =
6
1
x*AM = (1.05 + 1.05 + 1.05 + 0.95 + 0.95 + 0.95) = 1 = 0%
6
Geometric mean
*
GM
= N xi log x
i =1
*
GM
1
=
N
log x
i =1
*
xGM
= 6 1.05 1.05 1.05 0.95 0.95 0.95 = 0.9987 = 0.13%
*
HM
1
1
N
i =1 xi
N
1
*
xHM
1
=
N
i =1 xi
N
60km / h t = 100min
*
xHM
=
x*AM
100 km
40km / h t = 150min
= 48km / h
1 1
1
+
2 60 40
1
= ( 60 + 40 ) = 50km / h
2
14
*
*
xHM
xGM
x*AM
More affected by extreme large values
Less affected by extreme small values
Less affected by extreme values
More affected by extreme small values
1
N
p
x
i
i =1
1
p
p =
Minimum
Harmonic mean p = 1
Geometric mean p = 0
Arithmetic mean p = 1
Quadratic mean p = 2
p=
Maximum
15
x4
x(1)
x5 x6
x(4) x(6)
x* =
Median
1
1
x
+
x
+
x
+
x
=
( 3 + 5 + 6 + 7 ) = 5.25%
(
(2)
(3)
(4)
(5) )
4
4
Which is the central sorted value? (50% of the distribution is below that value) It is not unique
Any value between x(3) = 5% and x(4) = 6%
Winsorized mean:
Substitute p% of the extreme values on each side
x* =
1
1
x
+
x
+
x
+
x
+
x
+
x
=
( 3 + 3 + 5 + 6 + 7 + 7 ) = 5.16%
(
(2)
(2)
(3)
(4)
(5)
(5) )
6
6
16
1
x = arg min
N
x
*
( x x)
i =1
x*AM !!
R and L-estimators
Now in disuse
x2
x2
17
x* = arg max f X ( x)
If a variable is multimodal,
most central measures fail!
18
19
x = 738.6
y = 751.4
d* = y x
d * = median { yi x j }
If the measures are paired (for instance, the motors are first measured, the modified and
remeasured), then we should first compute the difference.
Difference: 23, 48, -14, -9, 16
d* = d
20
2 = E {( X ) 2 }
1
s =
N
2
N
(x x )
i =1
N 1 2
E {s } =
N
2
N
Subestimation
of the variance
1 N
( xi x ) 2
s =
N 1 i =1
2
E {s 2 } = 2
X i N ( , ) ( N 1)
2
s2
N2 1
Rentability=5.592.32%
21
s=
1 N
( xi x ) 2
N 1 i =1
Rentability=5.590.0232=5.5915.23%
Tchebychevs Inequality
X i N ( , 2 ) N 1
N 1
Pr { K X + K } = 1
1
K2
At least 50% of the values are within 2 standard deviations from the mean.
At least 75% of the values are within 2 standard deviations from the mean.
At least 89% of the values are within 3 standard deviations from the mean.
At least 94% of the values are within 4 standard deviations from the mean.
At least 96% of the values are within 5 standard deviations from the mean.
At least 97% of the values are within 6 standard deviations from the mean.
At least 98% of the values are within 7 standard deviations from the mean.
For any distribution!!!
22
Pr { X x* } = q
Someone has an IQ score of 115. Is he clever, very clever, or not clever at all?
q0.95 q0.05
Deciles
Quartiles
q0.90 q0.10
q0.75 q0.25
23
In which country you can have more progress along your career?
24
1 =
{( X ) }
3
Unbiased estimator
g1 < 0
g1 > 0
g1 =
m3
s3
N
m3 =
N ( xi x )3
i =1
( N 1)( N 2)
Education (0-10)
Education
Free-time
(hours/week)
Salary $
Salary
10
High
10
70K
High
High
15
75K
High
Medium
27
40K
Medium
Low
30
20K
Low
E {( X X )(Y Y )}
XY
Salary FreeTime
[ 1,1]
1 N
( xi x )( yi y )
N 1 i =1
r=
s X sY
Correlation
Education Salary
Negative
Positive
Small
0.3 to 0.1
0.1 to 0.3
Medium
0.5 to 0.3
0.3 to 0.5
Large
1.0 to 0.5
0.5 to 1.0
26
Education
Salary $
10
70K
75K
40K
20K
Person
Education
Salary $
1st
2nd
2nd
1st
3rd
3rd
4th
4th
P=Concordant pairs
Person A
Education: (A>B) (A>C) (A>D)
Salary:
(A>C) (A>D)
Person B
Education:
(B>C) (B>D) 2
Salary:
(B>A) (B>C) (B>D)
Person C
Education: (C>D)
Salary:
(C>D)
Person D
Education:
Salary:
P
N ( N 1)
2
2 + 2 +1+ 0
4(4 1)
2
5
= 0.83
6
27
Education
Salary $
10
70K
75K
40K
20K
Person
Education
Salary $
di
1st
2nd
-1
2nd
1st
3rd
3rd
4th
4th
= 1
6 di2
i =1
2
N ( N 1)
6((1) 2 + 12 + 02 + 02 )
= 0.81
= 1
2
4(4 1)
28
29
X N ( , 2 ) f X ( x) =
X N (, ) f X (x) =
1
2
1
2
1 x
1
( x- )t 1 ( x- )
2
Covariance matrix
Use: Normalization
X N ( , 2 )
N (0,1)
Z-score
145 100
=3
15
190 100
=
=6
15
z Napoleon =
z Kasparov
30
X N ( , 2 )
What is the probability of having an IQ between 100 and 115?
Pr {100 IQ 115} =
115
100
1
2 15
1 x 100
2 15
Normalization
dx =
0
1 12 x2
1 12 x2
1 12 x2
e dx =
e dx
e dx = 0.341
2
2
2
31
X N ( , 2 )
What is the probability of having an IQ larger than 115?
Pr {100 IQ 115} =
115
1
2 15
1 x 100
2 15
dx =
1
1 12 x2
1 12 x2
e dx = 1
e dx = 0.159
2
2
Normalization
32
1 12
e
2
1 12 x2
e dx = 75 q0.75 = 0.6745
2
0.75
0.6745
33
Xi N
But many others are not
aX 1 + bX 2 ; a + bX 1 N
X1
X i F Snedecor
Cauchy; e X LogNormal ;
X2
X 2j
2
Xi N
2
i
2;
2
i
; X 12 + X 22 Rayleigh
34
35
1 2
gt N
2
36
In general,
p ( B | A) = 0
Mutual exclusion is when two results are
impossible to happen at the same time.
p( A B) = 0
p ( A B ) = p ( A) p ( B)
37
Random sample: all samples of the same size have equal probability of being selected
Example 1: Study about child removal after abuse, 30% of the members were related to each other because
when a child is removed from a family, normally, the rest of his/her siblings are also removed. Answers for all the
siblings are correlated.
Example 2: Study about watching violent scenes at the University. If someone encourages his roommate to take
part in this study about violence, and the roommate accepts, he is already biased in his answers even if he is acting as
control watching non-violent scenes.
Consequence: The sampling distributions are not what they are expected to be, and all the confidence intervals
and hypothesis testing may be seriously compromised.
38
39
40