Week1 Notes PDF
● When doing any analysis or predictive modelling, knowing the type of data you have is
important because it dictates the direction you will take.
Types of Data: Basic Data Structure
● The most basic data structure is a rectangular matrix (more on this later), with rows as
observations and columns as variables or features.
Example columns: Name | Age | Weight | IQ | Eye colour | Hair colour | Height
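As a rough sketch of the rows-as-observations, columns-as-variables layout, the snippet below builds a tiny table in plain Python. The column names come from the slide; every value in `rows` is made up purely for illustration.

```python
# Column names from the slide; all row values are hypothetical.
columns = ["Name", "Age", "Weight", "IQ", "Eye colour", "Hair colour", "Height"]
rows = [
    ("Person A", 34, 70.5, 110, "brown", "black", 172.0),
    ("Person B", 27, 58.2, 125, "blue", "blond", 165.5),
    ("Person C", 45, 82.0, 98, "green", "brown", 180.3),
]

# Each row is one observation; pairing it with the column names gives a record.
records = [dict(zip(columns, row)) for row in rows]

# Selecting one column pulls the same variable out of every observation.
ages = [record["Age"] for record in records]

print(len(rows), "observations x", len(columns), "variables")
print("Age column:", ages)
```

Reading across a row describes one person; reading down a column gives every observed value of one variable.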
Types of Data: Categorical
● Categorical data represents groups or categories.
○ Car brands: Audi, Pontiac, Honda, Fiat, etc.
○ Marital status: Single, Divorced, Widowed, “It’s Complicated” (as seen on Facebook)
● Categorical data can be further broken down by the type of categories present.
● There are two qualitative levels: nominal and ordinal.
○ Nominal data cannot be ranked or put in an order
○ Ordinal data can be ordered
● For example:
○ Nominal: four seasons (Spring, Summer, Autumn, Winter), or colours (red, blue, yellow, etc.).
○ Ordinal: rating your experience (bad, neutral, good), or spice level (low, medium, high)
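One way to see the nominal/ordinal difference in practice is sketched below. The category names come from the examples above; the observed values in `orders` and `seasons` are invented for illustration.

```python
from collections import Counter

# Ordinal: the categories carry an order, so they can be mapped to ranks.
SPICE_ORDER = ["low", "medium", "high"]
spice_rank = {level: rank for rank, level in enumerate(SPICE_ORDER)}

orders = ["high", "low", "medium", "low"]  # hypothetical observations
sorted_orders = sorted(orders, key=spice_rank.__getitem__)
print(sorted_orders)  # ['low', 'low', 'medium', 'high']

# Nominal: there is no ranking to sort by, so counting is the natural summary.
seasons = ["Winter", "Spring", "Winter", "Autumn"]  # hypothetical observations
season_counts = Counter(seasons)
print(season_counts["Winter"])  # 2
```

Sorting only makes sense for the ordinal variable because the rank mapping encodes a real ordering; for the nominal variable any sort order would be arbitrary.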
Types of Data: Numerical
● Numerical data represents numbers. It can be discrete or continuous.
● Continuous data can take any value within a range (infinitely many possibilities), whereas
discrete data takes separate values that can be counted.
○ Discrete: number of houses owned, SAT scores, etc.
○ Continuous: height, weight, speed of a car, time taken to get to your data science class, etc.
Central Tendencies: Typical Values
● For your numerical variables, a common question that arises is: “What is the typical value?”
● In other words, what is an estimate of where most of the data is located?
● The most common way to summarize the data with a typical value is to take the
mean, also called the average.
○ There are also the median and the mode.
○ Other measures, such as trimmed means, exist too.
Central Tendencies: Means
● As mentioned before, the mean is just the average value of your numerical variable.
● It is affected significantly by outliers, which are data points that differ markedly from the
rest of the data.
○ For example, very tall or very short people when looking at heights are outliers.
● The formula for the mean is:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
where n is the number of data points, i indexes a specific data point from 1 to n, and Σ is
notation for summing.
Central Tendencies: Mean Example
● Say we had a variable called “Heights” in a data set in cm given by the following:
Heights = {150, 156, 183, 230, 143, 138, 145, 165, 167, 158}
○ n = 10
○ x1= 150, x2= 156, x3= 183, …, x9= 167, x10= 158
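The mean of this example can be checked with a few lines of Python. This is a minimal sketch: the list `heights` is copied from the slide, and the last step, dropping the 230 cm value, is an added illustration of how strongly a single outlier pulls the mean.

```python
from statistics import mean

# Heights in cm, copied from the example above.
heights = [150, 156, 183, 230, 143, 138, 145, 165, 167, 158]

n = len(heights)
x_bar = sum(heights) / n  # sum of all values divided by n
print(n, x_bar)  # 10 163.5

# The standard library gives the same answer.
assert x_bar == mean(heights)

# Removing the tallest value shows how much one outlier shifts the mean.
without_outlier = [h for h in heights if h != 230]
print(round(mean(without_outlier), 2))  # 156.11
```

So the mean height is 163.5 cm, and leaving out one extreme value moves it by more than 7 cm.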
Central Tendencies: Trimmed Mean
● The formula for the trimmed mean is:
$$\bar{x}_{\text{trim}} = \frac{1}{n - 2p}\sum_{i=p+1}^{n-p} x_{(i)}$$
where p is the number of values omitted from each end and x(i) is the i-th value of the ordered data.
● As the outliers move further from the center of the data, the mean shifts but the trimmed
mean remains the same!
● The 10% trimmed mean removes the smallest 10% and the largest 10% of the data. Here that is
0.1*10 = 1 observation from each end: the smallest (138) and the largest (230).
● x̄0.10 = (143 + 145 + … + 167 + 183) / (10-2) = 158.375
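A trimmed mean can be sketched as follows. The function name `trimmed_mean` and its `proportion` parameter are illustrative choices, not part of the notes, and the number of trimmed points is rounded down when it is not a whole number (an assumption; other conventions exist).

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the points from each end."""
    ordered = sorted(values)
    n = len(ordered)
    p = int(proportion * n)  # values omitted from each end, rounded down
    if 2 * p >= n:
        raise ValueError("proportion trims away every data point")
    kept = ordered[p:n - p]  # ordered values p+1 .. n-p
    return sum(kept) / len(kept)


heights = [150, 156, 183, 230, 143, 138, 145, 165, 167, 158]
print(trimmed_mean(heights, 0.10))  # 158.375

# Making the largest value even more extreme leaves the result unchanged,
# because that value is trimmed away before averaging.
more_extreme = [300 if h == 230 else h for h in heights]
print(trimmed_mean(more_extreme, 0.10))  # 158.375
```

With a proportion of 0 the function reduces to the ordinary mean.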
Central Tendencies: Median Example
● Let’s look at “Heights” again:
Heights = {150, 156, 183, 230, 143, 138, 145, 165, 167, 158}
○ n = 10
○ x1= 150, x2= 156, x3= 183, …, x9= 167, x10= 158
○ x(1)= 138, x(2)= 143, x(3)= 145, …, x(9)= 183, x(10)= 230
○ n is even here!
● Because n is even, the median is the average of the 5th and 6th ordered data points.
● x̃ = (x(5) + x(6))/2 = (156 + 158)/2 = 157
● The median is 157 cm (again, units matter!)
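The median of the Heights data can be reproduced with the short sketch below: sort the data, then take the middle value, or the average of the two middle values when n is even. The variable names are illustrative.

```python
from statistics import median

heights = [150, 156, 183, 230, 143, 138, 145, 165, 167, 158]

ordered = sorted(heights)
n = len(ordered)
mid = n // 2

if n % 2 == 0:
    # Even n: average the two middle ordered values (5th and 6th here).
    manual_median = (ordered[mid - 1] + ordered[mid]) / 2
else:
    # Odd n: the single middle ordered value.
    manual_median = ordered[mid]

print(manual_median)  # 157.0
assert manual_median == median(heights)
```

Note that the 230 cm outlier has no influence here: replacing it with any larger value leaves the median at 157 cm.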
Central Tendencies: Mode
● The mode is the value (or values) that appears most often in your data or variable.
● If all values appear the same number of times, there is no mode!
● If 2 or more values tie for the highest count, and that count is higher than the
others, then all of those values are modes.
● The mode is not typically used to describe continuous data, but it is used for discrete
and categorical data.
● The mode can be a useful way to check whether your data has issues!
Central Tendencies: Mode Example
● Let’s look at “Heights” again:
Heights = {150, 156, 183, 230, 143, 138, 145, 165, 167, 158}
○ n = 10
○ x1= 150, x2= 156, x3= 183, …, x9= 167, x10= 158
○ All values appear with the same frequency, so no value appears most often.
● There is no mode here.
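The mode rules can be sketched in Python as below. The helper `modes` is a hypothetical name; it follows the convention used in these notes that data in which every value is equally frequent has no mode, which differs from some library functions that would return every value in that case.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s), or [] if no mode exists."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    # Convention from the notes: all values equally frequent means no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)


heights = [150, 156, 183, 230, 143, 138, 145, 165, 167, 158]
print(modes(heights))          # []
print(modes([1, 2, 2, 3, 3]))  # [2, 3]
print(modes(["low", "high", "low"]))  # ['low']
```

The last call shows why the mode suits categorical data: it needs only counts, not arithmetic on the values.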
Estimates of Variability: Dispersion
● The “typical value” is one dimension of summarizing your variable. The next is variability,
or dispersion.
● Dispersion quantifies how tightly clustered or spread out the data points are relative to one
another.
● In an ideal world, we want to first measure variability, then reduce it and finally explain why it
occurs.
● This is a key concept that will come up again in future lectures.
● There are 3 common ways to measure dispersion:
○ variance, standard deviation, and interquartile range (IQR)
Estimates of Variability: Variance & S.D.
● The variance is the average squared distance between all your points and your mean.
● Squaring will always give you a value that is 0 or greater.
● Again, variance is impacted by outliers.
● The formula for the sample variance is:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
where n is the number of data points, i indexes a specific data point from 1 to n, x̄ is the mean, and Σ is
notation for summing.
● The standard deviation (s.d.) is the square root of the sample variance.
Note: we use “n-1” in the denominator because of something called degrees of freedom. In short, dividing by n-1 makes the sample variance an unbiased estimator of the population variance. There is a mathematical proof of
this, but you do not need to know it.
Estimates of Variability: Variance & S.D. Example
● Yet again, let’s look at “Heights”:
Heights = {150, 156, 183, 230, 143, 138, 145, 165, 167, 158}
○ n = 10
○ x1= 150, x2= 156, x3= 183, …, x9= 167, x10= 158
○ x̄ = 163.5
● s² = ((150 - 163.5)² + (156 - 163.5)² + … + (158 - 163.5)²) / (10 - 1) = 6498.5 / 9 ≈ 722.06
● s = (722.06)^0.5 ≈ 26.871 cm
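The variance and standard deviation of the Heights data can be checked with the sketch below. It computes the sum of squared deviations once and then divides by n-1 (sample variance) and, for comparison, by n (population variance). The variable names are illustrative; `variance`, `pvariance` and `stdev` are functions from Python's standard `statistics` module.

```python
from math import isclose, sqrt
from statistics import pvariance, stdev, variance

heights = [150, 156, 183, 230, 143, 138, 145, 165, 167, 158]

n = len(heights)
x_bar = sum(heights) / n  # 163.5
sum_sq = sum((x - x_bar) ** 2 for x in heights)  # 6498.5

sample_var = sum_sq / (n - 1)   # divide by n-1: sample variance
population_var = sum_sq / n     # divide by n: population variance
sample_sd = sqrt(sample_var)

print(round(sample_var, 2), round(sample_sd, 3))  # 722.06 26.871
print(round(population_var, 2), round(sqrt(population_var), 3))  # 649.85 25.492

# Cross-check against the standard library.
assert isclose(sample_var, variance(heights))
assert isclose(population_var, pvariance(heights))
assert isclose(sample_sd, stdev(heights))
```

The choice of denominator matters for small samples: with n = 10, dividing by n gives 649.85 while dividing by n-1 gives about 722.06.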
Estimates of Variability: IQR
● Another way to measure variability is through ranges and
percentiles!
● One example that combines the two is the interquartile range (IQR).
● The IQR is the range, or distance, between the quartiles of the data.
○ The quartiles used are the 25th and 75th percentiles of your data.
○ Technically, the median is the 50th percentile!
○ The range is just the difference between the two values: the 75th percentile
minus the 25th percentile.
See how this tells us the spread or range of the middle 50% of the data? If this is
wide, then there is a lot of variability in the data!
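As a hedged illustration, the IQR of the Heights data from the earlier examples can be computed with `statistics.quantiles`. Percentile conventions differ between textbooks and libraries; the choice of `method="inclusive"` (linear interpolation between ordered data points) is an assumption made here, and other methods give slightly different quartiles.

```python
from statistics import quantiles

heights = [150, 156, 183, 230, 143, 138, 145, 165, 167, 158]

# n=4 splits the data into quarters and returns the three cut points:
# the 25th, 50th and 75th percentiles.
q1, q2, q3 = quantiles(heights, n=4, method="inclusive")
iqr = q3 - q1

print(q1, q2, q3)  # 146.25 157.0 166.5
print(iqr)         # 20.25
```

The middle 50% of the heights spans 20.25 cm under this convention, and the 230 cm outlier does not affect it.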