Week1 Notes PDF


BDM 2053

Big Data Algorithms and Statistics


Weekly Course Objectives
● What is data?
● How do we measure data?
● Discuss statistical analysis and measures of central tendency (mean, median, mode).
● Describe statistical concepts of dispersion, i.e. variance, standard deviation and interquartile
range (IQR).
Data is information
● Data is facts, statistics or, in general, pieces of information
● Can be structured or unstructured

○ Structured data is organized and easy to decipher when given to a machine. For example,
your height, temperature, classification of dog breeds, etc.
○ Unstructured data is unorganized and difficult to decipher when given to a machine. For
example, survey comments, pictures (pixels), audio files, etc.

● When doing any analysis or predictive modelling, knowing the type of data you have is
important because it dictates the direction you will take.
Types of data: Basic data structure
● The most basic data structure is a rectangular matrix (more on this later), with rows as
observations and columns as variables or features.





Name    Age  Weight  IQ    Eye colour  Hair colour  Height
Jack    29   182     999   Brown       Brown        5’10”
Jacob   24   163     120   Brown       Black        5’6”
Sasha   25   111     2000  Blue        Brown        5’7”
Jordan  28   98      2001  Green       Blonde       5’2”
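
As a minimal sketch of the rows-as-observations layout, a few columns of the table above can be held in plain Python as a list of records, one per row. The names `people` and `column` are illustrative only and are not part of the course material.

```python
# Each dict is one observation (row); each key is one variable (column).
# Only a subset of the table's columns is copied here to keep the sketch short.
people = [
    {"Name": "Jack", "Age": 29, "Weight": 182, "IQ": 999},
    {"Name": "Jacob", "Age": 24, "Weight": 163, "IQ": 120},
    {"Name": "Sasha", "Age": 25, "Weight": 111, "IQ": 2000},
    {"Name": "Jordan", "Age": 28, "Weight": 98, "IQ": 2001},
]


def column(rows, name):
    """Return every value of one variable (column) across all observations."""
    return [row[name] for row in rows]


ages = column(people, "Age")
print(len(people), "observations,", len(people[0]), "variables")
print(ages)  # [29, 24, 25, 28]
```

Reading along a row gives everything known about one person; reading down a column gives one variable across all people, which is the shape the later summary statistics work on.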



Types of structured data


Types of Data: Categorical
● Categorical data represents groups or categories.
○ Car brands: Audi, Pontiac, Honda, Fiat, etc.
○ Marital status: Single, Divorced, Widowed, “It’s Complicated” (as seen on Facebook)

● Categorical data can be broken down further by the type of categories present.
● There are two qualitative levels: nominal and ordinal.
○ Nominal data cannot be ranked or put in an order.
○ Ordinal data can be ordered.
● For example:
○ Nominal: the four seasons (Spring, Summer, Autumn, Winter), or colours (red, blue, yellow, etc.).
○ Ordinal: rating your experience (bad, neutral, good), or spice level (low, medium, high).
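
The practical difference between the two levels can be sketched in a few lines of Python. The sample lists below are made up for illustration; only the category names come from the examples above.

```python
from collections import Counter

# Ordinal: the categories have a meaningful order, so ranking and sorting make sense.
SPICE_ORDER = ["low", "medium", "high"]        # order taken from the example above
orders = ["high", "low", "medium", "low"]      # made-up sample data
sorted_orders = sorted(orders, key=SPICE_ORDER.index)
print(sorted_orders)  # ['low', 'low', 'medium', 'high']

# Nominal: there is no inherent order, so only counting and equality are meaningful.
colours = ["red", "blue", "red", "yellow"]     # made-up sample data
colour_counts = Counter(colours)
print(colour_counts["red"])  # 2
```

Sorting the colours alphabetically would run without error, but the resulting order would carry no meaning, which is the point of calling them nominal.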
Types of Data: Numerical
● Numerical data represents numbers. It can be discrete or continuous.
● Continuous data can take any value within a range, so it has infinitely many possible values,
whereas discrete data takes values that can be counted and are usually finite.
○ Discrete: number of houses owned, SAT scores, etc.
○ Continuous: height, weight, speed of a car, time taken to get to your data science class, etc.


Central Tendencies: Typical Values
● For your numerical variables, a common question that arises is: “What is the typical value?”
● In other words, what is an estimate of where most of the data is located?
● The most common procedure to summarize the data with the most typical value is to take the
mean or the average.
○ Other options are the median and the mode.
○ There are also other measures, like the trimmed mean.


Central Tendencies: Means
● As mentioned before, the mean is just the average value of your numerical variable.
● It is affected significantly by outliers, which are data points that lie far from the rest of
the data.
○ For example, very tall or very short people when looking at heights are outliers.
● The formula for the mean is:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

where n is the number of data points, x_i is the i-th data point (i runs from 1 to n), and Σ is the
notation for summing.
Central Tendencies: Mean Example
● Say we had a variable called “Heights” in a data set in cm given by the following:
Heights = {150, 156, 183, 230, 143, 138, 145, 165, 167, 158}
○ n = 10
○ x1= 150, x2= 156, x3= 183, …, x9= 167, x10= 158

● x̄ = (150 + 156 + 183 + … + 167 + 158) / 10 = 163.5

● The average height is 163.5cm (units matter!)
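
The same calculation can be sketched in Python, once as a direct translation of the formula and once with the standard library's `statistics.mean`. The variable names are illustrative.

```python
from statistics import mean

heights_cm = [150, 156, 183, 230, 143, 138, 145, 165, 167, 158]

# Direct translation of the formula: add up the values, then divide by n.
n = len(heights_cm)
mean_by_formula = sum(heights_cm) / n

# The standard library gives the same result.
mean_by_library = mean(heights_cm)

print(mean_by_formula)  # 163.5
```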


Central Tendencies: Trimmed Mean
● The mean is susceptible to outliers.

● If we order the data and remove a fixed number of points from both ends before averaging, we
get the trimmed mean.
○ Far less susceptible to outliers!
● The formula for the trimmed mean is:

$$\bar{x}_{\text{trim}} = \frac{1}{n - 2p} \sum_{i=p+1}^{n-p} x_{(i)}$$

where p is the number of values omitted from each end and x_{(i)} is the i-th value of the ordered data.
See how when the outliers get further from the center of the data, the mean shifts
(in red) but the trimmed mean remains the same!

Central Tendencies: Trimmed Mean Example


● Say we had a variable called “Heights” in a data set in cm given by
the following:
Heights = {150, 156, 183, 230, 143, 138, 145, 165, 167, 158}
○ n = 10

○ x1= 150, x2= 156, x3= 183, …, x9= 167, x10= 158
○ x(1)= 138, x(2)= 143, x(3)= 145, …, x(9)= 183, x(10)= 230

● The 10% trimmed mean removes the smallest 10% and the largest 10% of the data. Here that is
0.1*10 = 1 observation from each end: the smallest (138) and the largest (230).
● x̄0.10 = (143 + 145 + 150 + … + 167 + 183) / (10-2) = 1267 / 8 = 158.375

● The trimmed mean is 158.375cm (again, units matter!)
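
A minimal Python sketch of this calculation follows. The function name `trimmed_mean` and the choice to round the number of trimmed points down with `int()` are assumptions made for illustration; other rounding conventions exist.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the points from each end.

    p = int(proportion * n) values are removed from the bottom and from the top.
    Rounding down is an assumption of this sketch, not a universal convention.
    """
    ordered = sorted(values)
    n = len(ordered)
    p = int(proportion * n)
    kept = ordered[p:n - p]
    return sum(kept) / len(kept)


heights_cm = [150, 156, 183, 230, 143, 138, 145, 165, 167, 158]
print(trimmed_mean(heights_cm, 0.10))  # 158.375

# Pushing the largest value further out moves the mean but not the trimmed mean.
extreme = [500 if h == 230 else h for h in heights_cm]
print(sum(extreme) / len(extreme))     # 190.5
print(trimmed_mean(extreme, 0.10))     # 158.375
```

Replacing 230 with 500 raises the ordinary mean from 163.5 to 190.5, while the 10% trimmed mean is unchanged because the extreme value is discarded either way.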


Central Tendencies: Median
● The median is simply the middle value of your data (the data must be ordered for you to find
this!)
● 50% of the data lies on each side of the median
● Since the median depends only on the middle value(s), it is not susceptible to outliers!

● The formula for the median is:

$$\tilde{x} = \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ is odd} \\ \dfrac{x_{(n/2)} + x_{(n/2+1)}}{2} & \text{if } n \text{ is even} \end{cases}$$


Central Tendencies: Median Example
● Let’s look at “Heights” … again :
Heights = {150, 156, 183, 230, 143, 138, 145, 165, 167, 158}
○ n = 10
○ x1= 150, x2= 156, x3= 183, …, x9= 167, x10= 158
○ x(1)= 138, x(2)= 143, x(3)= 145, …, x(9)= 183, x(10)= 230
○ n is even here!

● The median would then be the average between the 5th and 6th ordered data points.

● x͂ = (x(5) + x(6))/2 = (156 + 158)/2 = 157
● The median is 157cm (again, units matter!)
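
The odd and even cases of the median can be sketched in Python as follows. The helper name `median_by_formula` is illustrative; `statistics.median` from the standard library is used as a cross-check.

```python
from statistics import median


def median_by_formula(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # the ((n+1)/2)-th ordered value
    return (ordered[mid - 1] + ordered[mid]) / 2  # the (n/2)-th and (n/2 + 1)-th values


heights_cm = [150, 156, 183, 230, 143, 138, 145, 165, 167, 158]
print(median_by_formula(heights_cm))  # 157.0
print(median(heights_cm))             # 157.0
```

Python lists are indexed from 0, so the 5th and 6th ordered values sit at positions 4 and 5.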
Central Tendencies: Mode
● The mode is the value (or values) that appear most often in your data, or variable
● If all values appear the same number of times, there is no mode!
● If 2 or more values tie for the highest number of appearances, then all of those values are
modes.
● Mode is not typically used in continuous data for descriptive purposes, but is used for discrete
and categorical data.
● The mode can be a useful way to scan if your data has issues!
Central Tendencies: Mode Example
● Let’s look at “Heights” … again :
Heights = {150, 156, 183, 230, 143, 138, 145, 165, 167, 158}

○ n = 10
○ x1= 150, x2= 156, x3= 183, …, x9= 167, x10= 158
○ No value appears more often than the others: every value appears exactly once.
● There is no mode here.
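
The rules for the mode can be sketched with `collections.Counter`. The function `modes` is a hypothetical helper written to follow the convention used in these notes (no mode when all values tie). Python's own `statistics.multimode` behaves differently in that case and returns every value.

```python
from collections import Counter


def modes(values):
    """Return the list of modes, or [] when several distinct values all tie."""
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []  # every value is equally frequent: no mode under this convention
    return [value for value, c in counts.items() if c == top]


heights_cm = [150, 156, 183, 230, 143, 138, 145, 165, 167, 158]
print(modes(heights_cm))        # []
print(modes([1, 2, 2, 3, 3]))   # [2, 3]
```

A quick scan of the most frequent values is also a cheap data-quality check, for example to spot a placeholder value that was entered many times.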
Estimates of Variability: Dispersion
● The “typical value” is one dimension of summarizing your variable. The next is variability,
or dispersion.
● Dispersion quantifies how tightly clustered or spread out the data points are relative to one
another.
● In an ideal world, we want to first measure variability, then reduce it and finally explain why it
occurs.
● This is a key concept that will come up again in future lectures.
● There are 3 common ways to measure dispersion:

○ variance, standard deviation, and interquartile range (IQR)
Estimates of Variability: Variance & S.D.
● The variance is the average squared distance between all your points and your mean.
● Squaring will always give you a value that is 0 or greater.
● Again, variance is impacted by outliers.
● The formula for the sample variance is:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

where n is the number of data points, x_i is the i-th data point (i runs from 1 to n), x̄ is the mean,
and Σ is the notation for summing.
● The standard deviation (s.d.) is the square root of the sample variance.

Note: we use “n-1” in the denominator because of something called degrees of freedom. The short answer is that dividing by n-1 makes the sample variance an unbiased estimator of the population variance. There is a mathematical proof of
this, but you do not need to know it.

Estimates of Variability: Variance & S.D. Example
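● Yet again, let’s look at “Heights”:
Heights = {150, 156, 183, 230, 143, 138, 145, 165, 167, 158}
○ n = 10
○ x1= 150, x2= 156, x3= 183, …, x9= 167, x10= 158
○ x̄ = 163.5

● The sample variance is:

$$s^2 = \frac{(x_1-\bar{x})^2 + (x_2-\bar{x})^2 + \dots + (x_{10}-\bar{x})^2}{10-1} = \frac{(150-163.5)^2 + (156-163.5)^2 + \dots + (158-163.5)^2}{9} = \frac{6498.5}{9} \approx 722.06$$

● The standard deviation is:

$$s = \sqrt{722.06} \approx 26.871$$

● Dividing by n = 10 instead of n-1 = 9 would give 649.85, with a square root of 25.492.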
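
As a check on the arithmetic, here is a short Python sketch that computes the sum of squared deviations for the Heights data and divides it both ways. Dividing by n-1 = 9 gives the sample variance (about 722.06), while dividing by n = 10 gives 649.85. Variable names are illustrative.

```python
from math import sqrt
from statistics import pvariance, variance

heights_cm = [150, 156, 183, 230, 143, 138, 145, 165, 167, 158]

n = len(heights_cm)
x_bar = sum(heights_cm) / n                        # 163.5
sum_sq = sum((x - x_bar) ** 2 for x in heights_cm)  # 6498.5

sample_var = sum_sq / (n - 1)       # divides by n-1, about 722.06
population_var = sum_sq / n         # divides by n, about 649.85
sample_sd = sqrt(sample_var)        # about 26.87
population_sd = sqrt(population_var)  # about 25.49

# The standard library agrees: variance() uses n-1, pvariance() uses n.
print(round(sample_var, 2), round(variance(heights_cm), 2))
print(round(population_var, 2), round(pvariance(heights_cm), 2))
```

The single value of 230 contributes 4422.25 of the 6498.5 total, which illustrates how strongly the variance is affected by outliers.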
Estimates of Variability: IQR
● Another way to measure variability is through ranges and
percentiles!
● One example that combines the two is the interquartile range (IQR).
● IQR is the range, or distance, between the first and third quartiles of the data.
○ The first and third quartiles are just the 25th and 75th percentiles of your data.
○ Technically, the median is the 50th percentile!
○ The range is just the difference between the two values: in this case, the 75th percentile
minus the 25th percentile.
See how this tells us the spread or range of the middle 50% of the data? If this is
wide, then there is a lot of variability in the data!
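
A hedged sketch of the IQR for the Heights data, using `statistics.quantiles` from the Python standard library. Percentiles can be interpolated in more than one way, so the exact quartile values depend on the convention. The two methods shown are the ones this function offers and are not necessarily the convention used in this course.

```python
from statistics import quantiles

heights_cm = [150, 156, 183, 230, 143, 138, 145, 165, 167, 158]

# n=4 splits the data into quartiles and returns the three cut points Q1, Q2, Q3.
q1, q2, q3 = quantiles(heights_cm, n=4)  # default method is "exclusive"
iqr = q3 - q1
print("exclusive:", q1, q2, q3, "IQR =", iqr)

# The "inclusive" method treats the data as containing its own minimum and maximum.
q1_inc, q2_inc, q3_inc = quantiles(heights_cm, n=4, method="inclusive")
iqr_inc = q3_inc - q1_inc
print("inclusive:", q1_inc, q2_inc, q3_inc, "IQR =", iqr_inc)
```

Both methods give a middle cut point of 157, matching the median found earlier. Because the IQR only looks at the middle 50% of the data, the value of 230 barely influences it, unlike the variance.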






Thank you
