Week 1
● The 10% trimmed mean removes the smallest 10% and the largest 10%
of the data. With n = 10, that is 0.1*10 = 1 observation from each
end: the smallest (138) and the largest (230).
● x̄0.10 = (143 + 145 + … + 167 + 183) / (10-2) = 158.375
● The median is the average of the 5th and 6th ordered data
points.
● x͂ = (x(5) + x(6))/2 = (156 + 158)/2 = 157
● The median is 157 cm (again, units matter!)
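The two calculations above can be checked with a few lines of Python. This is a minimal sketch: `trimmed_mean` is a hypothetical helper written for this example (not a standard-library function), and it assumes the trimming proportion times n is a whole number, as it is here.

```python
import statistics

heights = [150, 156, 183, 230, 143, 138, 145, 165, 167, 158]


def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the observations from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)   # observations to drop per side
    kept = ordered[k:len(ordered) - k]   # middle of the ordered data
    return sum(kept) / len(kept)


print(trimmed_mean(heights, 0.10))   # 158.375
print(statistics.median(heights))    # 157.0
```

Dropping the single extreme value on each side pulls the result from the plain mean (163.5) towards the median, which is why the trimmed mean is less sensitive to the 230 cm observation.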
Central Tendencies: Mode
● The mode is the value (or values) that appears most often in your
data, or variable.
● If all values appear the same number of times, there is no
mode!
● If 2 or more values are tied for the highest number of
appearances, and that count is higher than for the other values,
then all of those tied values are modes.
● The mode is not typically used to describe continuous data, but
it is used for discrete and categorical data.
● The mode can be a useful way to scan your data for issues!
Central Tendencies: Mode Example
● Let’s look at “Heights” … again :
Heights = {150, 156, 183, 230, 143, 138, 145, 165, 167, 158}
○ n = 10
○ x1= 150, x2= 156, x3= 183, …, x9= 167, x10= 158
○ No value appears most frequently. In other words, all values
appear with the same frequency. The mode does not exist
here.
● There is no mode here.
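The rule above can be sketched in Python. `modes` is a hypothetical helper written for this example; it follows the convention on these slides that data in which every value is equally frequent has no mode. (Python's built-in `statistics.multimode` uses a different convention and would return all ten heights.)

```python
from collections import Counter

heights = [150, 156, 183, 230, 143, 138, 145, 165, 167, 158]


def modes(values):
    """Return the list of modes, or [] when every value is equally frequent."""
    counts = Counter(values)
    top = max(counts.values())
    # Several distinct values, all with the same count: no mode
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []
    return [value for value, c in counts.items() if c == top]


print(modes(heights))            # []
print(modes([1, 2, 2, 3, 3]))    # [2, 3]
```

The second call uses made-up numbers purely to show the tied case: 2 and 3 both appear twice, more often than 1, so both are modes.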
Estimates of Variability: Dispersion
● The “typical value” is one dimension of summarizing your
variable. The next is variability, or dispersion.
● Dispersion quantifies how tightly clustered or spread out
the data points are from one another.
● In an ideal world, we want to first measure variability, then
reduce it and finally explain why it occurs.
● This is a key concept that will come up again in future lectures.
● There are 3 common ways to measure dispersion:
○ variance, standard deviation, and interquartile range
(IQR)
Estimates of Variability: Variance & S.D.
● The variance is the average squared distance between your
points and your mean.
● Squaring will always give you a value that is 0 or greater.
● Again, variance is impacted by outliers.
● The formula for the sample variance is:
s² = Σ (xi − x̄)² / (n − 1), summing over i = 1, …, n
● The standard deviation s is the square root of the variance, so
it is in the same units as the data.
Note, we use “n-1” in the denominator because of something called degrees of freedom. The short answer is that
dividing by n-1 makes the sample variance an unbiased estimator of the population variance. There is a mathematical
proof to show this, but you do not need to know it.
Estimates of Variability: Variance & S.D. Example
● Yet again, let’s look at “Heights”:
Heights = {150, 156, 183, 230, 143, 138, 145, 165, 167, 158}
○ n = 10
○ x1= 150, x2= 156, x3= 183, …, x9= 167, x10= 158
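○ x̄ = 163.5
● s² = ((150 − 163.5)² + (156 − 163.5)² + … + (158 − 163.5)²) / (10 − 1)
= 6498.5 / 9 ≈ 722.056
● s = (722.056)^0.5
≈ 26.871 cm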
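As a check on the arithmetic, here is a minimal Python sketch that computes the variance and standard deviation of the Heights data step by step, once with n − 1 in the denominator (the sample version) and once with n (the population version) for comparison. The variable names are chosen for this example only.

```python
import statistics

heights = [150, 156, 183, 230, 143, 138, 145, 165, 167, 158]

n = len(heights)
mean = sum(heights) / n                            # 163.5
squared_devs = [(x - mean) ** 2 for x in heights]  # squared distances from the mean

sample_var = sum(squared_devs) / (n - 1)   # divide by n - 1
sample_sd = sample_var ** 0.5

population_var = sum(squared_devs) / n     # divide by n
population_sd = population_var ** 0.5

print(round(sample_var, 3), round(sample_sd, 3))          # 722.056 26.871
print(round(population_var, 3), round(population_sd, 3))  # 649.85 25.492

# The standard library agrees with the n - 1 version (up to floating-point rounding)
print(statistics.stdev(heights))
```

With only 10 observations the choice of denominator matters: the two standard deviations differ by more than a centimetre.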
Estimates of Variability: IQR
● Another way to measure variability is through ranges and
percentiles!
● One example that combines the two is the interquartile range
(IQR).
● The IQR is the range, or distance, between the first and third
quartiles of the data.
○ The first and third quartiles are just the 25th and 75th
percentiles of your data.
○ Technically, the median is the 50th percentile!
○ The range is just the difference between the two
values. In this case, the 75th percentile minus the 25th
percentile.
See how this tells us the spread or range of the middle 50% of the data?
If this is wide, then there is a lot of variability in the data!
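A minimal Python sketch of the IQR for the Heights data, using `statistics.quantiles` from the standard library (Python 3.8+). Percentiles can be interpolated in several ways; this example assumes the "inclusive" method, which is a choice made here and not something the slides specify, so other tools or methods may give slightly different quartiles.

```python
import statistics

heights = [150, 156, 183, 230, 143, 138, 145, 165, 167, 158]

# n=4 splits the data into quarters and returns the three cut points:
# the 25th, 50th and 75th percentiles
q1, q2, q3 = statistics.quantiles(heights, n=4, method="inclusive")

iqr = q3 - q1   # width of the middle 50% of the data

print(q1, q2, q3)   # 146.25 157.0 166.5
print(iqr)          # 20.25
```

Note that the 230 cm observation sits well outside the middle 50%, so it has no effect on the IQR, unlike the variance and standard deviation.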
BREAK!
● You just got a tonne of statistical concepts thrown at you. So we
will take a short break and install:
○ Python if you do not already have it.
○ And an IDE called PyCharm
Thank you