
BDM 2053

Big Data Algorithms and Statistics


Introduction!
Background:
⚫ Graduated from McMaster; Honours Mathematics and Statistics &
Masters in Applied Statistics
⚫ Analytics and Data Science Program at TD Bank; Wealth, Personal
Banking, and Audit Analytics
⚫ Currently work at TD Insurance as a Data Science Manager
Hobbies:
⚫ Love powerlifting, building tech and non-tech things, cooking and
teaching!
⚫ I have seen The Office at least 10 times now!
Weekly Course Objectives
● What is data?
● How do we measure data?
● Discuss statistical analysis and measures of central tendency
(mean, median, mode).
● Describe statistical concepts of dispersion, i.e. variance,
standard deviation and interquartile range (IQR).
● Download Python and PyCharm
● Do some examples in Python!
Data is information
● Data is facts, statistics or, in general, pieces of information
● Can be structured or unstructured
○ Structured data is organized and easy to decipher when
given to a machine. For example, your height, temperature,
classification of dog breeds, etc.
○ Unstructured data is unorganized and difficult to decipher
when given to a machine. For example, survey comments,
pictures (pixels), audio files, etc.

● When doing any analysis or predictive modelling, knowing the
type of data you have is important because it will dictate the
direction you will take.
Types of data: Basic data structure
● The most basic data structure is a rectangular matrix (more on
this later), with rows as observations and columns as variables
or features.

Name    Age  Weight  IQ    Eye colour  Hair colour  Height
Jack    29   182     999   Brown       Brown        5’10”
Jacob   24   163     120   Brown       Black        5’6”
Sasha   25   111     2000  Blue        Brown        5’7”
Jordan  28   98      2001  Green       Blonde       5’2”
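A minimal sketch of the same idea in Python, using only built-in types: each dictionary is one observation (row) and each key is a variable (column). The key names and the choice to include only four of the columns are assumptions made for illustration.

```python
# Each dict is one observation (row); each key is a variable (column).
# Only a subset of the table's columns is reproduced here.
people = [
    {"name": "Jack", "age": 29, "weight": 182, "eye_colour": "Brown"},
    {"name": "Jacob", "age": 24, "weight": 163, "eye_colour": "Brown"},
    {"name": "Sasha", "age": 25, "weight": 111, "eye_colour": "Blue"},
    {"name": "Jordan", "age": 28, "weight": 98, "eye_colour": "Green"},
]

n_rows = len(people)      # number of observations
n_cols = len(people[0])   # number of variables

# Pull out one variable (column) across all observations.
ages = [row["age"] for row in people]

print(n_rows, n_cols)
print(ages)
```

Reading down a single key gives a variable; reading across a single dictionary gives an observation.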


Types of structured data
Types of Data: Categorical
● Categorical data represents groups or categories.
○ Car brands: Audi, Pontiac, Honda, Fiat, etc.
○ Marital status: Single, Divorced, Widowed, “It’s Complicated” (as
seen on Facebook)

● Categorical data can be further broken down by the type of categories
present.
● There are two qualitative levels: nominal and ordinal.
○ Nominal data represents data that cannot be ranked or put in an
order
○ Ordinal data can be ordered
● For example;
○ Nominal: four seasons (Spring, Summer, Autumn, Winter), or
Colours (red, blue, yellow, etc).
○ Ordinal: Rating your experience (bad, neutral, good), or spice
level (low, medium, high)
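The practical difference between the two levels can be sketched in Python. The sample values below are made up for illustration: ordinal categories can be mapped to ranks and sorted, while nominal categories can only be counted or compared for equality.

```python
from collections import Counter

# Ordinal: the categories have a meaningful order, so we can rank them.
spice_rank = {"low": 0, "medium": 1, "high": 2}
orders = ["medium", "high", "low", "medium"]  # hypothetical sample
sorted_orders = sorted(orders, key=spice_rank.get)

# Nominal: there is no order, so counting is the natural summary.
colours = ["red", "blue", "red", "yellow"]    # hypothetical sample
colour_counts = Counter(colours)

print(sorted_orders)
print(colour_counts.most_common(1))
```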
Types of Data: Numerical
● Numerical data represents numbers. It can be discrete or
continuous.
● Continuous data can take any value within a range, so there are
infinitely many possible values, whereas discrete data takes
separate values that can be counted.
○ Discrete: Number of houses owned, SAT scores, etc.
○ Continuous: Height, weight, speed of a car, time taken to get to
your data science class, etc.
Central Tendencies: Typical Values
● For your numerical variables, a common question that arises is:
“What is the typical value?”
● In other words, what is an estimate of where most of the data is
located?
● The most common procedure to summarize the data with the
most typical value is to take the mean or the average.
○ There is also the mode and median.
○ Also other measures like trimmed means.
Central Tendencies: Means
● As mentioned before, the mean is just the average value of your
numerical variable.
● It is affected significantly by outliers, which are data points that
are different from the rest of the data.
○ For example, very tall or very short people when looking at
heights are outliers.
● The formula for the mean is:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

, where n is the number of data points, i represents a specific
data point from 1 to n, and Σ is a notation for summing.
Central Tendencies: Mean Example
● Say we had a variable called “Heights” in a data set in cm given
by the following:
Heights = {150, 156, 183, 230, 143, 138, 145, 165, 167, 158}
○ n = 10
○ x1= 150, x2= 156, x3= 183, …, x9= 167, x10= 158

● x̄ = (150 + 156 + 183 + … + 167 + 158) / 10 = 163.5

● The average height is 163.5cm (units matter!)
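As a quick check, the same calculation can be sketched in Python, once directly from the formula and once with the standard library's `statistics.mean`. The variable names are just illustrative.

```python
import statistics

heights = [150, 156, 183, 230, 143, 138, 145, 165, 167, 158]  # cm

# Straight from the formula: sum the values and divide by n.
mean_by_formula = sum(heights) / len(heights)

# The standard library gives the same answer.
mean_by_library = statistics.mean(heights)

print(mean_by_formula)  # 163.5
```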


Central Tendencies: Trimmed Mean
● The mean is susceptible to outliers.
● If we order the data and remove a portion of points from both
ends, we get something called the trimmed mean.
○ Far less susceptible to outliers!
● The formula for trimmed mean is:

$$\bar{x}_{\text{trim}} = \frac{1}{n - 2p} \sum_{i=p+1}^{n-p} x_{(i)}$$

, where p is the number of values omitted from each end and x(i) is
the i-th value of the ordered data.
[Figure: as the outliers get further from the center of the data, the mean (in red) shifts but the trimmed mean remains the same.]
Central Tendencies: Trimmed Mean Example
● Say we had a variable called “Heights” in a data set in cm given
by the following:
Heights = {150, 156, 183, 230, 143, 138, 145, 165, 167, 158}
○ n = 10
○ x1= 150, x2= 156, x3= 183, …, x9= 167, x10= 158
○ x(1)= 138, x(2)= 143, x(3)= 145, …, x(9)= 183, x(10)= 230

● The 10% trimmed mean removes the smallest 10% and the largest 10%
of the data. Here that is 0.1*10 = 1 observation from each end:
the smallest (138) and the largest (230).
● x̄0.10 = (143 + 145 + 150 + … + 167 + 183) / (10-2) = 1267 / 8 = 158.375

● The trimmed mean is 158.375cm (again, units matter!)
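A minimal sketch of the trimmed mean in Python. The function name `trimmed_mean` and its `proportion` parameter are hypothetical names chosen for this example, and rounding the number of dropped points down is an assumption about how to handle cases where the proportion does not give a whole number.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the points from EACH end."""
    ordered = sorted(values)
    p = int(proportion * len(ordered))  # points dropped per end (rounded down)
    kept = ordered[p:len(ordered) - p]
    if not kept:
        raise ValueError("proportion is too large: no data left after trimming")
    return sum(kept) / len(kept)


heights = [150, 156, 183, 230, 143, 138, 145, 165, 167, 158]  # cm

print(trimmed_mean(heights, 0.10))  # 158.375
```

With `proportion=0` nothing is dropped, so the function falls back to the ordinary mean.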


Central Tendencies: Median
● The median is simply the middle value of your data (the data
must be ordered for you to find this!)
● 50% of the data lies on each side of the median
● Since the median only looks at the middle value, it is far less
affected by outliers than the mean!

● The formula for median is:


x͂ = x((n+1)/2) , if n is odd
x͂ = (x(n/2)+ x(n/2 + 1))/2 , if n is even
Central Tendencies: Median Example
● Let’s look at “Heights” … again :
Heights = {150, 156, 183, 230, 143, 138, 145, 165, 167, 158}
○ n = 10
○ x1= 150, x2= 156, x3= 183, …, x9= 167, x10= 158
○ x(1)= 138, x(2)= 143, x(3)= 145, …, x(9)= 183, x(10)= 230
○ n is even here!

● The median is then the average of the 5th and 6th
ordered data points.
● x͂ = (x(5) + x(6))/2 = (156 + 158)/2 = 157
● The median is 157cm (again, units matter!)
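The odd/even rule can be sketched in Python and compared against the standard library's `statistics.median`. Note that Python lists are indexed from 0, so the k-th ordered value is `ordered[k - 1]`.

```python
import statistics

heights = [150, 156, 183, 230, 143, 138, 145, 165, 167, 158]  # cm

ordered = sorted(heights)
n = len(ordered)

if n % 2 == 1:
    # Odd n: the single middle value, x((n+1)/2).
    median_by_formula = ordered[(n + 1) // 2 - 1]
else:
    # Even n: average of x(n/2) and x(n/2 + 1).
    median_by_formula = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

median_by_library = statistics.median(heights)

print(median_by_formula)  # 157.0
```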
Central Tendencies: Mode
● The mode is the value (or values) that appears most often in your
data, or variable.
● If all values appear exactly the same number of times, there is no
mode!
● If 2 or more values are tied for the highest number of
appearances, then all of those values are modes.
● The mode is not typically used to describe continuous data, but
it is used for discrete and categorical data.
● The mode can be a useful way to scan your data for issues, for
example a placeholder value that appears suspiciously often!
Central Tendencies: Mode Example
● Let’s look at “Heights” … again :
Heights = {150, 156, 183, 230, 143, 138, 145, 165, 167, 158}
○ n = 10
○ x1= 150, x2= 156, x3= 183, …, x9= 167, x10= 158
○ No value appears most frequently. In other words, all values
appear with the same frequency. The mode does not exist
here.
● There is no mode here.
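A minimal sketch of these rules in Python. The function name `modes` is hypothetical, and returning an empty list for "no mode" is a convention chosen for this example. It assumes the input is not empty. (The standard library's `statistics.multimode` behaves differently in the tied case: it returns every value rather than nothing.)

```python
from collections import Counter


def modes(values):
    """Return the list of modes, or [] when every value is equally frequent."""
    counts = Counter(values)
    top = max(counts.values())
    # More than one distinct value, all tied: no mode under the rule above.
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []
    return [value for value, c in counts.items() if c == top]


heights = [150, 156, 183, 230, 143, 138, 145, 165, 167, 158]  # cm

print(modes(heights))          # []
print(modes([1, 2, 2, 3, 3]))  # [2, 3]
```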
Estimates of Variability: Dispersion
● The “typical value” is one dimension of summarizing your
variable. The next is variability or dispersion.
● In other words, dispersion quantifies how tightly clustered or
spread out the data points are.
● In an ideal world, we want to first measure variability, then
reduce it and finally explain why it occurs.
● This is a key concept that will come up again in future lectures.
● There are 3 common ways to measure dispersion:
○ variance, standard deviation, and interquartile range
(IQR)
Estimates of Variability: Variance & S.D.
● The variance is the average squared distance between all your
points and your mean.
● Squaring will always give you a value that is 0 or greater.
● Again, variance is impacted by outliers.
● The formula for the sample variance is:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

, where n is the number of data points, i represents a specific
data point from 1 to n, x̄ is the mean, and Σ is a notation for summing.
● The standard deviation (s.d.) is the square root of the sample
variance.

Note, we use “n-1” in the denominator because of something called degrees of freedom. The short answer is that dividing by n-1
makes the sample variance an unbiased estimator of the population variance. There is a mathematical proof to show this, but you do not need to know it.
Estimates of Variability: Variance & S.D. Example
● Yet again, let’s look at “Heights”:
Heights = {150, 156, 183, 230, 143, 138, 145, 165, 167, 158}
○ n = 10
○ x1= 150, x2= 156, x3= 183, …, x9= 167, x10= 158
○ x̄ = 163.5

● s² = ( (x1-x̄)² + (x2-x̄)² + … + (x9-x̄)² + (x10-x̄)² ) / (10-1)
= ( (150-163.5)² + (156-163.5)² + … + (158-163.5)² ) / 9
= 6498.5 / 9
≈ 722.06 cm²

● s = √722.06
≈ 26.871 cm
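A short sketch with Python's standard library makes the choice of denominator explicit: `statistics.variance` and `statistics.stdev` divide by n-1 (sample versions), while `statistics.pvariance` divides by n (population version). The variable names are illustrative.

```python
import statistics

heights = [150, 156, 183, 230, 143, 138, 145, 165, 167, 158]  # cm

mean = statistics.mean(heights)
squared_distances = sum((x - mean) ** 2 for x in heights)  # 6498.5

sample_var = statistics.variance(heights)        # divides by n - 1
population_var = statistics.pvariance(heights)   # divides by n
sample_sd = statistics.stdev(heights)            # square root of sample_var

print(round(sample_var, 2))      # 722.06
print(round(population_var, 2))  # 649.85
print(round(sample_sd, 3))       # 26.871
```

The two variances differ only in the denominator, so the gap between them shrinks as n grows.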
Estimates of Variability: IQR
● Another way to measure variability is through ranges and
percentiles!
● One measure that combines the two is the interquartile range (IQR).
● IQR is the range, or distance, between the first and third
quartiles of the data.
○ The first and third quartiles are just the 25th and 75th
percentiles of your data.
○ Technically, the median is the 50th percentile!
○ The range is just the difference between the two
values. In this case, the 75th percentile minus the 25th
percentile.
[Figure: the IQR shows the spread, or range, of the middle 50% of the data. If it is wide, then there is a lot of variability in the data!]
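A minimal sketch of the IQR using `statistics.quantiles` from Python's standard library, applied to the “Heights” data. One caveat: there are several conventions for computing percentiles on small samples. This example uses the function's default `method="exclusive"`; other conventions (for example `method="inclusive"`, or other tools) can give slightly different quartiles.

```python
import statistics

heights = [150, 156, 183, 230, 143, 138, 145, 165, 167, 158]  # cm

# n=4 splits the data into quarters and returns the three cut points:
# the 25th, 50th and 75th percentiles.
q1, q2, q3 = statistics.quantiles(heights, n=4)

iqr = q3 - q1  # spread of the middle 50% of the data

print(q1, q2, q3)
print(iqr)
```

The middle cut point `q2` is the median, which matches the 157 found earlier.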
BREAK!
● You just got a tonne of statistical concepts thrown at you. So we
will take a short break and install:
○ Python if you do not already have it.
○ And an IDE called PyCharm
Thank you
