
BDM 2053

Big Data Algorithms and Statistics


Weekly Course Objectives
● Why do we visualize data?
● Frequency tables and histograms.
● Box plots, scatterplots, barplots.
● Discuss the normal distribution.
● Explore the application of the normal distribution in data
science.
● Go over confidence intervals.
● How do we check if something is “Normally” distributed?
● Do some examples in Python!
Data Visualization
● A picture is worth a thousand words… or numbers!
● The process of taking data, and obtaining insights using charts
and graphs is data visualization.
● We use data visualization to tell stories.
○ We can use it to quickly get a feel of our data.
○ Sometimes central tendencies and measures of dispersion
are hard to picture, but with charts you can visually see
what is going on.
● Many times, you can identify patterns and trends just through
charts and visuals!
● Leads to actionable insights (insights that you can take action
on to guide a business problem).
Data Visualization cont.
● You may not understand the benefits of data visualization
because we give very basic examples in lectures.
○ Ex, Heights…
● In reality, you won’t work with just 1 variable and 10
observations. You will work with tens to hundreds of variables
with over 50k observations (often times millions of
observations)!
○ For example: telephony data, customer banking
transactions, click-stream data, etc.
Frequency Tables
● Frequency shows the number of times a particular event
occurs.
● Therefore, a frequency table is a table that shows the number
of times particular events occur in ascending or descending
order of events.
● They are a great way to understand the distribution of data
tabularly.
○ Whether you have clusters of points.
○ Outliers.
○ In general, story tell with your data!
● Procedure is to:
1) Select number of bins (grouped intervals).
2) Count the number of observations within each bin.
3) Record the counts in order in a table.
Frequency Tables Example
● Once again, let’s visit our favourite data of Heights.
● Heights = {150, 156, 183, 230, 143, 138, 145, 165, 167, 158}
● Say we wanted to make 4 bins here. Since the max is 230, and
the min is 138, we can make the bins have (230-138)/4 sized
intervals.
○ (230-138)/4 = 23
● If the bins are of size 23, that means the ranges are :
○ 138 - 161, 162 - 185, 186 - 209 and 210 - 233
Bins        Frequency   Relative Frequency
138 - 161   6           6/10 = 0.6
162 - 185   3           3/10 = 0.3
186 - 209   0           0/10 = 0
210 - 233   1           1/10 = 0.1
Total       10          1
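The table above can be rebuilt with a few lines of Python. This is a minimal sketch using only the standard library; the variable names (`num_bins`, `frequency_table`) are illustrative and not from the lecture, and the integer division assumes the range divides evenly by the number of bins, as it does here.

```python
heights = [150, 156, 183, 230, 143, 138, 145, 165, 167, 158]

num_bins = 4
# Bin width from the slide: (max - min) / number of bins = (230 - 138) / 4 = 23.
# Assumption: the range divides evenly, so integer division is safe here.
width = (max(heights) - min(heights)) // num_bins

# Build whole-centimetre bins: 138-161, 162-185, 186-209, 210-233.
# Each bin starts one past the previous bin's upper bound.
bins = []
low = min(heights)
for _ in range(num_bins):
    bins.append((low, low + width))
    low += width + 1

# Count the observations in each bin and compute the relative frequency.
frequency_table = []
for low, high in bins:
    count = sum(low <= h <= high for h in heights)
    frequency_table.append((f"{low} - {high}", count, count / len(heights)))

for label, count, relative in frequency_table:
    print(f"{label:<10} {count:>3} {relative:>6.1f}")
```

The relative frequencies should always sum to 1, which is a quick sanity check on any frequency table.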
Histograms & Barplots
● Histograms extend frequency tables 1 step further by showing
them visually.
● They convey the same story but in different ways.
● In some ways, they are better because if we had many bins, it
would be very hard to comprehend and see in a table format.
● Visually, we can spot trends and patterns a lot better than in
tables with numbers and various sections.
● Barplots are similar to histograms, except that we do not need to
bin numbers together; each category is simply drawn as its
own bar.
Histograms: Tables vs. Graphs

versus
Histograms Example
● Converting our frequency table for Heights into a chart, we get
the following:

One optimal bin size can be found here!
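As a rough stand-in for the chart, the same frequencies can be drawn as a text histogram. This is only a sketch: the bins are the ones from the Heights frequency table, and the `#` bars are an illustrative choice rather than the plotting approach used in the lecture.

```python
heights = [150, 156, 183, 230, 143, 138, 145, 165, 167, 158]

# Bins taken from the Heights frequency table (inclusive on both ends).
bins = [(138, 161), (162, 185), (186, 209), (210, 233)]

# Count how many heights fall in each bin.
counts = [sum(low <= h <= high for h in heights) for low, high in bins]

# One line per bin: the bar length equals the frequency.
histogram_lines = [
    f"{low}-{high} | {'#' * count} ({count})"
    for (low, high), count in zip(bins, counts)
]
print("\n".join(histogram_lines))
```

Even in this crude form, the gap at 186-209 and the lone value in the last bin stand out immediately.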


Boxplots
● Sometimes our goal is to not just understand the distribution
of data, but the overall symmetry and spread of our data.
● A boxplot is a way of plotting the five-number summary
(minimum, Q1, median, Q3, and maximum) of our data.
● Much like the histogram, it can help us see the distribution of
our data (is it symmetric, left skewed, right skewed?), and
identify outliers.
● Values that fall within the following ranges are considered
outliers:
○ Less than Q1 - 1.5 * IQR
○ Greater than Q3 + 1.5 * IQR
● You might be wondering where the minimum and maximum are
used here. They are checked against the criteria above, along
with every other point, to see if there are any outliers.
Boxplot Example
● Looking at the… you guessed it, Heights, we obtained the
following statistics from the last lecture:
○ Q1 = 147.5
○ Q2 (Median) = 157
○ Q3 = 166
○ Min = 138
○ Max = 230
○ IQR = Q3 - Q1 = 18.5
● From these values, we simply need to find the last 2 statistics
which would be Q1 - 1.5 * IQR and Q3 + 1.5 * IQR
○ Q1 - 1.5 * IQR = 147.5 - 1.5 * 18.5 = 119.75
○ Q3 + 1.5 * IQR = 193.75
● Values below and above the points above, respectively, are
outliers by this definition of outliers!
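The outlier rule can be checked with a short sketch. It takes Q1 and Q3 exactly as stated on the slide rather than recomputing them, because different quartile conventions can give slightly different values; the names `lower_fence` and `upper_fence` are illustrative.

```python
heights = [150, 156, 183, 230, 143, 138, 145, 165, 167, 158]

# Quartiles as given on the slide (assumption: we do not recompute them,
# since quartile conventions differ between tools).
q1, q3 = 147.5, 166
iqr = q3 - q1  # 18.5

# Thresholds beyond which a value counts as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [h for h in heights if h < lower_fence or h > upper_fence]
print(lower_fence, upper_fence, outliers)
```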
Boxplot Example

● We can see that the lower end and upper end of the box plot show values that are “typical”.
● There is 1 outlier identified here which is the height of 230 cm.
Scatterplots
● Often times looking at just one variable at a time isn’t
meaningful or what we are trying to get insights for.
● Scatterplots are a way of comparing pairs of values across your
entire data set simultaneously. This way you can draw the
relationship for two (or sometimes more) variables.
● When looking at 2 variables, the x-axis would represent one
variable and the y-axis would represent another.
○ Each point would represent one observation, positioned by
its values for the two variables.
Scatterplots Example
● FINALLY! Let’s look at another variable besides Height…
Weight!
● Heights = {150, 156, 183, 230, 143, 138, 145, 165, 167, 158}
Weights ={115, 110, 182, 210, 104, 100, 109, 121, 124, 131}
● In terms of data, the data for this might look something like
the following:

Name            Height  Weight  …
Sailor Moon     150     115     …
Sailor Venus    156     110     …
Sailor Jupiter  183     182     …


Scatterplots Example cont.

(Scatterplot of Height and Weight, with the points for Sailor Moon, Sailor Venus and Sailor Jupiter labelled.)
Scatterplots Example cont.

Say we had another variable that was categorical, like the strength of each person being
strong or weak
Scatterplots Example cont.

We can add that variable to our scatterplot, marking each point as Weak or Strong, to
enhance our insights. If it gives more information, it is worth showing!
Scatterplots Example cont.

Not as useful, but sometimes could be under the right settings.


More graphs!
● There exist so many more types of graphs!
More graphs!
● For more plots and how to make them, check out:
https://www.python-graph-gallery.com/
Break!!!
Probability Distributions
● In the last lecture and even in this one, we have quantified and
visualized how data is distributed.
● Much of this data follows assumptions and patterns that
resemble some classical distributions, called probability
distributions.
● More formally, probability distributions are functions that
tell you the likelihood of obtaining possible values from some
random event. For example, the probability that:
○ You will arrive in the first 5 minutes of class.
○ That someone’s height is between 150cm and 160cm.
○ That the average weight is greater than 200lbs.
○ Can be represented by a probability density function (pdf,
not the file format!)
● We can draw probabilities from such events because we have
data that often follows certain assumptions!
● All probabilities under the curve must add up to 1.
Probability Distributions Example

Normal Distribution: µ = 162cm, σ = 17cm


Continuous

Poisson Distribution: µ = 3, σ = 9

Discrete
Normal Distributions
● The normal distribution (Gaussian distribution) is one of the
most commonly used probability distributions.
● The following are key facts about the normal distribution:
○ The mean, median and mode are the exact same.
○ The distribution is symmetric around the mean, µ.
○ Exactly 50% of the data lies on each side of the mean.
○ About 68% of the data lies within one standard deviation of
the mean, and about 95% lies within two standard deviations.
● A big misconception is that most data in the world behaves
“normally” (is normally distributed). Actually, it is statistics
computed from the data, such as the sample mean, that tend to
follow a normal distribution.
○ Most data follows a long-tail distribution (data that is
skewed, typically to the right).
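The 50%, 68% and 95% figures listed above can be checked numerically. This is a small sketch using `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the variable names are illustrative.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0, sigma=1)

# Share of the distribution to the left of the mean.
left_of_mean = std_normal.cdf(0)

# Share within one and within two standard deviations of the mean.
within_one = std_normal.cdf(1) - std_normal.cdf(-1)
within_two = std_normal.cdf(2) - std_normal.cdf(-2)

print(round(left_of_mean, 4), round(within_one, 4), round(within_two, 4))
```

The last two values come out near 0.6827 and 0.9545, which is why 68% and 95% are quoted as approximations.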
Normal Distributions cont.

f(x) = (1 / (σ√(2π))) · e^(-(x - µ)² / (2σ²)), for -∞ < x < ∞
Normal Distributions: Standard Normal
● The standard normal distribution is when the mean is equal
to 0, and the standard deviation is equal to 1.
● You can standardize any sets of values by the following
equation:
Z = (x - µ)/σ

● Standardizing your data is not only useful for finding probabilities
for normal distributions, but also for scaling your data (more on this
later).
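To illustrate Z = (x - µ)/σ, here is a minimal sketch that standardizes the Heights data. It assumes µ and σ are estimated from the data itself and uses the population standard deviation (`pstdev`); both choices are illustrative and not prescribed by the lecture.

```python
from statistics import mean, pstdev

heights = [150, 156, 183, 230, 143, 138, 145, 165, 167, 158]

# Assumption: estimate mu and sigma from the data itself,
# using the population standard deviation.
mu = mean(heights)
sigma = pstdev(heights)

# Z = (x - mu) / sigma for every observation.
z_scores = [(x - mu) / sigma for x in heights]

for x, z in zip(heights, z_scores):
    print(f"{x:>4} -> {z:6.2f}")
```

After standardizing, the values have mean 0 and standard deviation 1, whatever units the original data were in.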
Normal Distributions: Finding probabilities
● The probability that your random variable, X, will be
less than some specific value, x, is written as follows:
P(X < x)
● With continuous distributions, this is not as simple as adding
probabilities as in the discrete case. For example:
○ If X is a random variable representing the number of rooms
in a rental unit, find the probability of a rental unit having
less than 3 rooms given the following:

P(X < 3) = P(X = 0) + P(X = 1) + P(X = 2)
= 0 + 0.009 + 0.017
= 0.026
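The discrete case really is just a sum, as this small sketch shows. The dictionary `pmf` holds only the three probabilities quoted in the calculation above; the probabilities for other room counts are not given there, so they are left out.

```python
# Probabilities quoted above for 0, 1 and 2 rooms.
# Values for 3 or more rooms are not given, so they are omitted.
pmf = {0: 0.0, 1: 0.009, 2: 0.017}

# P(X < 3) is the sum of the probabilities of every outcome below 3.
p_fewer_than_3 = sum(p for rooms, p in pmf.items() if rooms < 3)

print(round(p_fewer_than_3, 3))
```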
Normal Distributions: Finding probabilities cont.
● The probability of any single exact value in the continuous case is
0. I.e., if I asked you to find the probability that someone weighs
exactly 180 lbs, you would spend a very long time because
weights are continuous. Someone might be 179.9 lbs, or
180.0002 lbs, but the probability that someone is exactly 180 lbs
(180.0000000000000 lbs) is 0.
● However, the probability that someone weighs less than 180 lbs is
greater than 0 and can be calculated. Similarly, we could find the
probability of someone being between two weights (180 lbs and 190
lbs) or greater than 180 lbs.
● Since the area under a pdf must equal 1, we simply need to do
some math via integration… but integrating the pdf of the
normal is not easy!!!!!! The integral of the pdf up to a value x is
called the cdf (cumulative distribution function).
● Using software makes this extremely fast and easy!
Normal Distributions: Example
● For his brand new Banana phone, Atinder knows that the
length of time it takes the battery to recharge fully is normally
distributed with a mean of 3 hours and a standard deviation of
30 minutes. Atinder owns one of these phones and wants to
know the probability that the length of time will be between 2
and 2.5 hours.
● µ = 3, σ = 0.5

● P( 2 < X < 2.5 ) = ?


Normal Distributions: Example
● P( 2 < X < 2.5 ) = P( X < 2.5 ) - P( X < 2 )
= 0.1587 - 0.0228
= 0.1359
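The recharge-time probability can be reproduced with the normal cdf. This sketch uses `statistics.NormalDist` from the standard library as one possible tool; the lecture does not say which software it used, and the variable names are illustrative.

```python
from statistics import NormalDist

# Recharge time in hours: mean 3, standard deviation 0.5 (30 minutes).
recharge = NormalDist(mu=3, sigma=0.5)

# P(2 < X < 2.5) = P(X < 2.5) - P(X < 2)
p_between = recharge.cdf(2.5) - recharge.cdf(2)

print(round(p_between, 4))
```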
Normal Distributions: Example 2
● The lifetime of Atinder’s phone has a normal distribution with
a mean of 24 months and standard deviation of 4 months. Find
the probability that his phone will last more than 30 months
● µ = 24, σ = 4
● P( 30 < X ) = ?

● P( 30 < X ) = 1 - P( X < 30 ) = 1 - 0.9332 = 0.0668
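An upper-tail probability like this one is the complement of the cdf. The sketch below again assumes `statistics.NormalDist` as the tool, with illustrative variable names.

```python
from statistics import NormalDist

# Phone lifetime in months: mean 24, standard deviation 4.
lifetime = NormalDist(mu=24, sigma=4)

# P(X > 30) = 1 - P(X < 30)
p_more_than_30 = 1 - lifetime.cdf(30)

print(round(p_more_than_30, 4))
```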


Normal Distributions: Example 3
● Entry to a certain University is determined by a national test.
The scores on this test are normally distributed with a mean of
500 and a standard deviation of 100. Tom wants to be admitted
to this university and he knows that he must score better than
at least 70% of the students who took the test. Tom takes the
test and scores 585. Will he be admitted to this university?
● µ = 500, σ = 100

● P( X < x ) = 0.7
Normal Distributions: Example 3 cont.
● Here the tricky thing is we know what the probability is, we
just don’t know what value satisfies it within the parameters
provided.
● We can use software to find the inverse of this very easily!
● P( X < x ) = 0.7, so x must be 552.44, which we round up to 553.
● Since Tom scored greater than 553, he will be admitted! Yay!
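Finding the score that corresponds to a given probability means inverting the cdf. A minimal sketch, assuming `statistics.NormalDist.inv_cdf` as the software step and using illustrative names such as `cutoff`:

```python
from statistics import NormalDist

# Test scores: mean 500, standard deviation 100.
scores = NormalDist(mu=500, sigma=100)

# Score below which 70% of test takers fall: the x with P(X < x) = 0.7.
cutoff = scores.inv_cdf(0.7)

tom_score = 585
admitted = tom_score > cutoff

print(round(cutoff, 2), admitted)
```

As a cross-check, `scores.cdf(585)` gives the share of students Tom outscored, which is about 80%.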
Confidence Intervals
● Placing our trust in 1 value for an analysis is very risky.
● Say we wanted to forecast budgets and stated that we expect
the average savings to be $10,000 next month. When next
month's financial results come in, we find out we actually only
saved $8,000 - but the business trusted us so much that they had
already allocated the $2,000 we did not save to some other product.
Now we are in trouble!
● To avoid this issue without resorting to “I don’t know” or “maybe
we expect somewhere around $10,000”, we instead give a range of
what we can expect!
● So, a confidence interval is a range of values that we expect to
contain the true value with a stated level of confidence (for example, 90%).

Point estimate: mean        Confidence interval: mean ± margin of error


Confidence Intervals cont.

● We essentially want to find how likely we are to capture the
true population mean within an interval.
○ We could have gotten a bad sample.
○ Getting more observations for our sample can be
expensive.
● For the same sample, the wider the interval, the more confident
we are that it captures the true mean!
● We need to use a bit of math to derive the values needed to
calculate the confidence intervals.
Confidence Intervals Example
● Atinder has made his own chocolate called Atindies! Like
Smarties, but with an A on them. He would like to know the
average weight a box of chocolates can have. Say he took a
sample of 1000 chocolates and found the mean and standard
deviation to be 45 grams and 3.8 grams, respectively.
● What is the 90% confidence interval for the weight of the box?

● The 90% confidence interval extends 1.645 standard deviations of
the sample mean (standard errors) on either side of the mean!
○ We want the interval to be centred on the mean s.t. it
captures 90% of the sampling distribution of the mean.
○ How? Proof on next slide
Confidence Intervals Example: Proof
● We know that since the confidence interval is symmetric
around the mean, we must trim off a total of α from the tails of
the standard normal distribution, α/2 from each side.
○ 1-α = 90%
○ Therefore, here, α = 10%. This means 5% is trimmed off
each side of the normal distribution:

● Therefore, P(Z<z) = 0.95 -> z = 1.645.
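The critical value can be recovered by inverting the standard normal cdf. This sketch assumes `statistics.NormalDist` as the tool; `alpha` and `z_critical` are illustrative names.

```python
from statistics import NormalDist

confidence = 0.90
alpha = 1 - confidence  # total probability trimmed off, 5% per tail

# z such that P(Z < z) = 1 - alpha/2 = 0.95 on the standard normal.
z_critical = NormalDist().inv_cdf(1 - alpha / 2)

print(round(z_critical, 3))
```

Changing `confidence` to 0.95 gives the familiar value of about 1.96.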


Confidence Intervals Example: Proof
● Remember, to standardize your results we must subtract the
mean, and then divide by the standard deviation. So we get the
following:
Z = (X - µ)/σ
● However, the mean and standard deviation of the sample mean, X̄,
for a sample of size n are:
○ E(X̄) = µ
○ Var(X̄) = σ²/n
■ S.D. = σ / √(n)
● So, standardizing the observed sample mean x̄, we get as the
confidence interval for µ:
-zα/2 < Z < zα/2
-zα/2 < (x̄ - µ) / (σ / √(n)) < zα/2
x̄ - zα/2 (σ / √(n)) < µ < x̄ + zα/2 (σ / √(n)),
where zα/2 is the value that covers 1-α/2 probability
For proof of the expected value and variance of the sample mean, visit here! , and the CI proof here!
Confidence Intervals: Back to the example
x̄ - zα/2 (σ / √(n)) < µ < x̄ + zα/2 (σ / √(n))
45 - 1.645 * (3.8 / √(1000)) < µ < 45 + 1.645 * (3.8 / √(1000))
45 - 0.19767 < µ < 45 + 0.19767
44.802 < µ < 45.198
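The whole calculation fits in a few lines of Python. This is a sketch under the same assumptions as the example: the sample standard deviation of 3.8 grams stands in for σ, and the normal critical value is used. The variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n = 1000        # sample size
x_bar = 45      # sample mean in grams
s = 3.8         # sample standard deviation in grams, used in place of sigma
confidence = 0.90

# Critical value covering 1 - alpha/2 of the standard normal.
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

# Margin of error: z times the standard deviation of the sample mean.
margin = z * s / sqrt(n)
lower, upper = x_bar - margin, x_bar + margin

print(round(lower, 3), round(upper, 3))
```

Raising `confidence` widens the interval, while raising `n` narrows it.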
Is our data really Normal?
● We make all these assumptions, but one of the most important
things to check is if our data is even normally distributed to
begin with!
● We turn to QQ plots (QQ stands for quantile-quantile).
● As the name suggests, we want to compare each quantile of our
data to a distribution we think it is, and if it truly is distributed
according to that distribution, they should fit perfectly 1-1.
○ This means that the quantiles should form a line!
● But how do we do this?
QQ Plots Continued
● Begin by ranking your data in ascending order (smallest to
largest).
● Calculate the percentiles by taking the rank, subtracting 0.5
and then dividing by the total number of observations.
○ Percentile = (Rank - 0.5) / (n)
● Find the value according to the normal distribution that
achieves that percentile.
○ Ex: If a value is in the 10th percentile, we need to find what
value of the normal distribution (the standard normal
preferably) covers that probability.
● Standardize your data point to convert it to a z value.
● Compare your quantiles!
● An example will be done in Python!
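The steps above can be sketched without any plotting library by computing the pairs of quantiles that a QQ plot would display. This is an illustration on the Heights data, not the example used in class: it assumes the sample mean and sample standard deviation are used to standardize, and the name `qq_pairs` is made up here.

```python
from statistics import NormalDist, mean, stdev

heights = [150, 156, 183, 230, 143, 138, 145, 165, 167, 158]
n = len(heights)

# Assumption: standardize with the sample mean and sample standard deviation.
mu, sigma = mean(heights), stdev(heights)
std_normal = NormalDist()

qq_pairs = []
# Rank the data in ascending order, starting at rank 1.
for rank, value in enumerate(sorted(heights), start=1):
    percentile = (rank - 0.5) / n                   # Percentile = (Rank - 0.5) / n
    theoretical_z = std_normal.inv_cdf(percentile)  # normal value at that percentile
    sample_z = (value - mu) / sigma                 # standardized data point
    qq_pairs.append((theoretical_z, sample_z))

for theoretical_z, sample_z in qq_pairs:
    print(f"{theoretical_z:6.2f} {sample_z:6.2f}")
```

If the data were normally distributed, the two columns would be close to equal, so plotting one against the other would give roughly a straight line. Large gaps at either end suggest skew or outliers.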
Resources
● https://www.easycalculation.com/statistics/bell-curve-calculator.php
● http://r-statistics.co/Complete-Ggplot2-Tutorial-Part1-With-R-Code.html
Thank you
