Data Science Certification
Professional certifications:
PMP (Project Management Professional)
Deloitte – driven using US policies
Infosys – driven using Indian policies (large enterprises)
ITC Infotech – driven using Indian policies (SME)
HSBC – driven using UK policies
Data Mining – Supervised & Unsupervised (Machine Learning)
Text Mining & NLP
Data Visualization using Tableau
AGENDA
All Agenda Topics:
Domain Knowledge
Statistical Analysis
Data Mining
Data Visualization
Forecasting
Practice
https://www.techinasia.com/alibaba-crushes-records-brings-143-billion-singles-day
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Why Tableau?
Population → Sampling Frame → Simple Random Sampling (SRS) → Sample
Measures of Central Tendency
Central Tendency     Population               Sample
Mean / Average       µ = Σx / N               x̄ = Σx / n
Variance             σ² = Σ(x − µ)² / N       s² = Σ(x − x̄)² / (n − 1)
Standard Deviation   σ = √σ²                  s = √s²
The expected value intuitively refers to what one would find if they repeated the
experiment an infinite number of times and took the average of all of the outcomes
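The population vs. sample formulas above can be checked with Python's statistics module, which keeps the two variants separate; a minimal sketch on hypothetical data:

```python
import statistics

data = [4, 8, 6, 5, 3, 7, 9]  # hypothetical sample

mean = statistics.mean(data)          # same formula for population and sample
pop_var = statistics.pvariance(data)  # divides by N (population variance)
samp_var = statistics.variance(data)  # divides by n - 1 (sample variance)
pop_sd = statistics.pstdev(data)      # population standard deviation
samp_sd = statistics.stdev(data)      # sample standard deviation
```

Note that the sample variance is always slightly larger than the population variance on the same numbers, because of the n − 1 denominator.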
Inter-quartile Range

Box Plot: This graph shows the distribution of data by dividing the data into four groups with the same number of data points in each group. The box contains the middle 50% of the data points and each of the two whiskers contains 25% of the data points. It displays two common measures of the variability or spread in a data set.

Range: It is represented on a box plot by the distance between the smallest value and the largest value, including any outliers. If you ignore outliers, the range is illustrated by the distance between the opposite ends of the whiskers.
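The quantities behind a box plot — quartiles, the inter-quartile range, and the range — can be computed directly; a minimal sketch on hypothetical data:

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8]  # hypothetical sample

q1, q2, q3 = statistics.quantiles(data, n=4)  # quartile cut points
iqr = q3 - q1                                 # the "box": middle 50% of the data
full_range = max(data) - min(data)            # smallest to largest value
```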
The Probability associated with any single value of a random variable is always zero
• For every value (x) of the random variable X, we can calculate the Z score:

Z = (X − µ) / σ

• Interpretation: how many standard deviations away is the value from the mean?
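As a quick illustration (hypothetical numbers):

```python
def z_score(x, mu, sigma):
    """Number of standard deviations x lies from the mean."""
    return (x - mu) / sigma

z = z_score(180, 170, 5)  # a value 10 units above a mean of 170, with sigma = 5
```

Here z = 2: the value lies two standard deviations above the mean.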
(Figure: normal Q–Q plot — sample quantiles vs. theoretical quantiles)
The distribution of sample means will be approximately normal even if the distribution of data in the population is not normal, provided the "sample size" is fairly large.

Mean (X̄) = µ (the same as the population mean of the raw data)
Standard Deviation (X̄) = σ / √n, where σ is the population standard deviation and n is the sample size
- This is referred to as the standard error of the mean

The standard error of the mean estimates the variability between samples, whereas the standard deviation measures the variability within a single sample.

A sample size of 30 is considered large enough, but that may or may not be adequate.
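A quick simulation illustrates both claims: sample means of a skewed population centre on µ, and their spread shrinks like σ/√n. A sketch using Python's random module; the exponential population (mean 1, so σ = 1) is an arbitrary choice:

```python
import math
import random
import statistics

random.seed(42)

# Skewed, clearly non-normal population: exponential with mean 1 (sigma = 1)
n, num_samples = 50, 2000
sample_means = [
    statistics.mean(random.expovariate(1.0) for _ in range(n))
    for _ in range(num_samples)
]

grand_mean = statistics.mean(sample_means)    # close to the population mean 1.0
se_observed = statistics.stdev(sample_means)  # close to sigma / sqrt(n)
se_theory = 1.0 / math.sqrt(n)
```

Despite the skewed population, a histogram of sample_means would look roughly bell-shaped, and se_observed lands close to σ/√n ≈ 0.141.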
Interval Estimate = Point Estimate ± Margin of Error

X̄ ± Z1-α σ / √n, where Z1-α satisfies P(−Z1-α ≤ Z ≤ Z1-α) = 1 − α

Margin of error depends on the underlying uncertainty, confidence level and sample size.
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Calculate Z values for 90%, 95% & 99% confidence levels:
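These critical values can be read off statistics.NormalDist instead of a z-table; a minimal sketch:

```python
from statistics import NormalDist

def z_critical(confidence):
    """Two-sided critical value z such that P(-z <= Z <= z) = confidence."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)

z90 = z_critical(0.90)  # ~1.645
z95 = z_critical(0.95)  # ~1.960
z99 = z_critical(0.99)  # ~2.576
```

A 95% interval is then x̄ ± z95 · σ/√n.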
• Construct a 95% confidence interval for the mean card balance and interpret it.
• Construct a 90% confidence interval for the mean card balance and interpret it.
Interpretation 1: The mean of the population has a 95% chance of being in this range for a random sample.
Interpretation 2: The mean of the population will be in this range for 95% of the random samples.
We replace σ with our best guess (point estimate) s, which is the standard deviation of the sample.

• Research has shown that the t-distribution is fairly robust to deviations of the population from the normal model.

As n → ∞, t_n → N(0, 1)

Calculate: standard error of X̄ = s / √n = 2833 / √140 = 239.46
The factors that affect the power of a test include sample size, effect size, population variability, and α. Power and α are related: increasing α decreases β. Since power is 1 − β, if you increase α, you also increase the power of the test. The maximum power a test can have is 1, and the minimum is 0.
Hypothesis Testing
Example null hypothesis: our quality will not improve after the consulting project.
1. Normality Test — Stat > Basic Statistics > Graphical Summary (data was shown to be normal)
2. Population Standard Deviation Known or Not (population standard deviation is known = 4)
3. 1-Sample Z Test — Stat > Basic Statistics > 1-Sample Z
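With σ known, the same test can be sketched in Python; the scores below are hypothetical, and only σ = 4 comes from the slide:

```python
import math
import statistics
from statistics import NormalDist

# Hypothetical quality scores after the consulting project
data = [68, 72, 71, 69, 74, 70, 73, 72, 69, 71]
mu0, sigma = 68, 4  # historic mean; known population standard deviation

z = (statistics.mean(data) - mu0) / (sigma / math.sqrt(len(data)))
p_value = 1 - NormalDist().cdf(z)  # one-sided: has quality improved?
```

A p-value below α leads us to reject the null hypothesis that quality did not improve.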
1-Sample t Test
1. Normality Test — Stat > Basic Statistics > Graphical Summary (data was given to be normal)
2. Population Standard Deviation Known or Not (population standard deviation is NOT known)
3. 1-Sample t Test — Stat > Basic Statistics > 1-Sample t
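When σ is unknown, the statistic uses s instead; a minimal sketch computing the t statistic on hypothetical data (the p-value would then come from a t-table or software, with n − 1 degrees of freedom):

```python
import math
import statistics

# Hypothetical sample; sigma is unknown, so use the sample standard deviation s
data = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]
mu0 = 12.0

s = statistics.stdev(data)
t = (statistics.mean(data) - mu0) / (s / math.sqrt(len(data)))
# Compare |t| with the critical t value for n - 1 = 7 degrees of freedom
```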
1. Normality Test — Stat > Basic Statistics > Graphical Summary
3. 1-Sample Sign Test — Stat > Nonparametric > 1-Sample Sign

The scores of 20 students for the statistics exam are provided. Test if the current median is not equal to the historic median of 82. Assume α to be 0.05.
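The sign test uses only whether each score falls above or below the hypothesised median; a sketch with hypothetical scores for the 20 students:

```python
from math import comb

# Hypothetical exam scores for 20 students; H0: median = 82
scores = [79, 85, 88, 80, 90, 76, 83, 86, 81, 92,
          78, 84, 87, 75, 89, 82, 91, 77, 93, 80]
median0 = 82

above = sum(1 for s in scores if s > median0)
below = sum(1 for s in scores if s < median0)
n = above + below  # ties with the median are dropped

# Two-sided p-value from Binomial(n, 0.5)
k = min(above, below)
p_value = 2 * sum(comb(n, i) * 0.5 ** n for i in range(k + 1))
```

Here the p-value exceeds 0.05, so this particular hypothetical sample gives no evidence that the median has shifted from 82.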
1. Normality Test — Stat > Basic Statistics > Graphical Summary
2. Variance Test — Stat > Basic Statistics > 2 Variances
3. 2-Sample t Test — Stat > Basic Statistics > 2-Sample t

Example scenarios (left vs. right):
1a. Comparing the performance of machine A vs. machine B by feeding different raw materials to each machine
1b. Comparing the performance of machine A vs. machine B when the same raw material is fed to each machine
2a. Compare the power output of two wind mills next to each other simultaneously when you use motor A on one wind mill and motor B on another
2b. Compare the power output of a wind mill when you use motor A for 1 month and motor B for 1 month
3a. Identifying resistor defects and capacitor defects in the same PCB by collecting such data using 20 PCB units
3b. Identifying resistor defects on 20 PCBs and capacitor defects on 20 (different) PCBs
2-Sample t Test

Assume the same data was recorded if only 10 vehicles were chosen and mileage was recorded before and after adding the additive. What method will you choose to find the result?

Paired t test
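A paired t test reduces to a 1-sample t test on the per-vehicle differences; a sketch with hypothetical mileage figures:

```python
import math
import statistics

# Hypothetical mileage for 10 vehicles before and after the fuel additive
before = [21.0, 23.5, 19.8, 24.1, 22.3, 20.5, 23.0, 21.7, 22.8, 20.1]
after  = [21.8, 24.0, 20.5, 24.6, 22.1, 21.4, 23.9, 22.5, 23.6, 20.9]

diffs = [a - b for a, b in zip(after, before)]  # pair-wise differences
t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))
# A paired t test is a 1-sample t test on the differences (df = n - 1 = 9)
```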
1. Normality Test — Stat > Basic Statistics > Graphical Summary
2. Mann–Whitney Test for Medians — Stat > Nonparametric > Mann-Whitney
1. Normality Test — Stat > Basic Statistics > Graphical Summary
2. Variance Test — Stat > ANOVA > Test for Equal Variances
3. ANOVA — Stat > ANOVA > One-Way…
• Suppose a nutrition expert would like to do a comparative evaluation of three diet programs (Atkins, South Beach, GM)
• She randomly assigns an equal number of participants to each of these programs from a common pool of volunteers
• Suppose the average weight losses in each of the groups (arms) of the experiment are 4.5 kg, 7 kg, 5.3 kg
• Not every individual in each program will respond identically to the diet program
• It is easier to identify variations across programs if variations within programs are smaller
• Hence the method is called Analysis of Variance (ANOVA)
• Within-group variation = experimental error
• Hypothesis
- H0: µ1 = µ2 = µ3 = … = µr
- Ha: not all µi are equal
• Calculate
- Mean Square Treatment: MSTR = SSTR / (r − 1)
- Mean Square Error: MSE = SSE / (n − r)
- The ratio of the two mean squares: f = MSTR / MSE = between-group variation / within-group variation
- Strength of this evidence: p-value = Pr(F(r − 1, n − r) ≥ f)
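The MSTR/MSE computation can be sketched directly; the per-participant data below is hypothetical, chosen so the group means match the 4.5 kg, 7 kg and 5.3 kg above:

```python
import statistics

# Hypothetical weight losses (kg) in three diet-program arms
groups = {
    "Atkins":      [4.0, 5.0, 4.5, 4.8, 4.2],   # mean 4.5
    "South Beach": [6.8, 7.2, 7.0, 6.9, 7.1],   # mean 7.0
    "GM":          [5.0, 5.5, 5.3, 5.2, 5.5],   # mean 5.3
}

all_data = [x for g in groups.values() for x in g]
grand_mean = statistics.mean(all_data)
n, r = len(all_data), len(groups)

# Between-group (treatment) and within-group (error) sums of squares
sstr = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups.values())
sse = sum((x - statistics.mean(g)) ** 2 for g in groups.values() for x in g)

mstr = sstr / (r - 1)  # Mean Square Treatment
mse = sse / (n - r)    # Mean Square Error
f = mstr / mse         # between-group / within-group variation
```

A large f (here far above the F(2, 12) critical value) says the spread between program means dwarfs the spread within programs.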
– If there are two factors, we would use a two-way ANOVA. Example: one factor is the department and the second factor is the shift (day vs. night).

Interaction A*B: SS = SSAB, df = (nA − 1)(nB − 1), MSAB = SSAB / ((nA − 1)(nB − 1)), FAB = MSAB / MSE
Is the new branding program more effective in increasing profits?
Four types of machines are used. Is the weight of the Rugby ball dependent on the type of machine used?
• Mood’s median assigns the data from each population that is higher than
the overall median to one group, and all points that are equal or lower to
another group. It then uses a Chi-Square test to check if the observed
frequencies are close to expected frequencies
1. Mood's Median (handles outliers well) — Stat > Nonparametric > Mood's Median
2. Kruskal–Wallis (more powerful than Mood's Median) — Stat > Nonparametric > Kruskal-Wallis
Johnnie Talkers

Johnnie Talkers' soft drinks division sales manager has been planning to launch a new sales incentive program for their sales executives. The sales executives felt that adults (>40 yrs) won't buy, children will, & hence requested the sales manager not to launch the program. Analyze the data & determine whether there is evidence at the 5% significance level to support the hypothesis.
We can also determine whether one variable is dependent on another.
• Does the data suggest an association between the disease and exposure?

Exposure    Disease: Yes    Disease: No    Total
Yes         37              13             50
No          17              53             70
Total       54              66             120
Chi-Square Test
Calculate the number of individuals of the exposed and unexposed groups expected in each disease category (yes and no) if the probabilities were the same:

χ² = Σ (O − E)² / E = 29.1
• Calculate the degrees of freedom :
(Number of rows – 1) X (Number of columns – 1)
df = (2 – 1) X (2 – 1) = 1
• Interpretation: there is less than a 0.001 chance of obtaining such discrepancies between expected and observed values if there is no association.
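The expected counts and the χ² statistic above can be reproduced in a few lines, using the counts from the table:

```python
# Observed counts from the exposure-disease table
observed = [[37, 13],   # exposed:   disease yes / no
            [17, 53]]   # unexposed: disease yes / no

row_totals = [sum(row) for row in observed]        # 50, 70
col_totals = [sum(col) for col in zip(*observed)]  # 54, 66
total = sum(row_totals)                            # 120

# Expected counts under independence: row total * column total / grand total
expected = [[r * c / total for c in col_totals] for r in row_totals]

chi2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(2) for j in range(2))
df = (2 - 1) * (2 - 1)
```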
• They have lower power than the parametric tests and hence are
always given the second preference after the parametric tests
• Can model different distributions due to having shape, scale & location parameters
• Weibull & Lognormal are from the same family & both can be used to assess a dataset that contains values close to the average (not too high / low)
• However, Weibull is a better fit when the majority of data falls to the higher side
• Lognormal is a better fit when the majority of data falls to the lower side
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Probability Distributions
Weibull:
• β is the shape parameter, also called the slope; it determines the shape of the distribution
  - When β = 1, the shape of the distribution = exponential distribution
  - When β is 3 to 4, the shape of the distribution ≈ normal distribution
  - Several β values can approximate the lognormal distribution
• γ is the non-zero location parameter; it is the point below which there are no failures, and changing its value moves the distribution to the right or left
  - γ > 0: there is a period when no failures occur
  - γ < 0: failures have occurred before time equals zero, e.g., defective raw materials or failure during transportation
• When γ = 0, η (eta, the scale parameter) is called the characteristic life
• Regardless of the specific value of β, 63.2% of values fall below the characteristic life
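That 63.2% figure follows from the Weibull CDF: at x = γ + η the exponent is −1 for any β, so F = 1 − e⁻¹ ≈ 0.632. A minimal sketch:

```python
import math

def weibull_cdf(x, beta, eta, gamma=0.0):
    """CDF of the 3-parameter Weibull: shape beta, scale eta, location gamma."""
    if x <= gamma:
        return 0.0
    return 1 - math.exp(-(((x - gamma) / eta) ** beta))

# Regardless of beta, F(eta) = 1 - e^(-1) ~= 63.2%
fractions = [weibull_cdf(1000, beta=b, eta=1000) for b in (0.5, 1.0, 3.5)]
```

With β = 1 the formula collapses to the exponential CDF 1 − e^(−x/η), matching the bullet above.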
• Used when 2 variables are normally distributed & may be totally independent or may be correlated to some degree
• A joint distribution of two independent variables that simultaneously & jointly cross-classifies the data
• Can be discrete or continuous
• The 3D plot looks like mountain terrain
• The X & Y axes represent the independent variables
• The Z axis shows either the frequency (for discrete data) or the probability (for continuous data)
• The maximum or peak occurs when X1 = µ1 & X2 = µ2. You can take a "slice" anywhere along the distribution by fixing one of the variables. This is known as a conditional distribution.
Be careful: correlation does not guarantee causation. Correlation by itself does not imply a cause-and-effect relationship!

If the absolute value of the correlation coefficient is greater than 0.85, then we say there is a good relationship.
• Example: r = 0.87, r = −0.9, r = 0.9, r = −0.87 describe a good relationship
• Example: r = 0.5, r = −0.5, r = 0.28 describe a poor relationship
Correlation values of -1 or 1 imply an exact linear relationship. However, the real value of
correlation is in quantifying less than perfect relationships
We can perform regression analysis, which attempts to further describe this type of
relationship, if the correlation is good between the 2 variables
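The correlation coefficient behind these r values can be computed from first principles; a minimal sketch:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# A perfectly linear relationship gives r = 1; noisier data gives |r| < 1
r_exact = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
```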
y = β0 + β1x + ε, where β0 is the intercept, β1 is the slope, and ε is the error term. x0 denotes a specific value of x, the independent variable.
Regression Analysis
R-squared, also known as the coefficient of determination, represents the % variation in the output (dependent variable) explained by the input variable/s, or the percentage of response variable variation that is explained by its relationship with one or more predictor variables.
The higher the R², the better the model fits your data.
R² is always between 0 and 100%.
R² between 0.65 and 0.8 => moderate correlation
R² greater than 0.8 => strong correlation
Y = Continuous, X = Single & Continuous → Simple Linear Regression
Y = Continuous, X = Single & Discrete → Create Dummy Variable → Simple Linear Regression

Example dummy coding: Male = 1, Female = 0
• Studies have shown that individuals with excess Adipose tissue (AT) in the abdominal region have a higher risk of
cardio-vascular diseases
• Computed Tomography, commonly called the CT Scan is the only technique that allows for the precise and
reliable measurement of the AT (at any site in the body)
• Is there a simpler yet reasonably accurate way to predict the AT area, i.e., one that is:
  • Easily available
  • Risk free
  • Inexpensive
• A group of researchers conducted a study with the aim of predicting abdominal AT area using simple
anthropometric measurements, i.e., measurements on the human body
• The Waist Circumference – Adipose Tissue data is a part of this study wherein the aim is to study how well waist
circumference (WC) predicts the AT area
Simple Linear Regression – Data Set
Observation Waist AT Observation Waist AT Observation Waist AT
1 74.75 25.72 38 103 129 75 108 217
2 72.6 25.89 39 80 74.02 76 100 140
3 81.8 42.6 40 79 55.48 77 103 109
4 83.95 42.8 41 83.5 73.13 78 104 127
5 74.65 29.84 42 76 50.5 79 106 112
6 71.85 21.68 43 80.5 50.88 80 109 192
7 80.9 29.08 44 86.5 140 81 103.5 132
8 83.4 32.98 45 83 96.54 82 110 126
9 63.5 11.44 46 107.1 118 83 110 153
10 73.2 32.22 47 94.3 107 84 112 158
11 71.9 28.32 48 94.5 123 85 108.5 183
12 75 43.86 49 79.7 65.92 86 104 184
13 73.1 38.21 50 79.3 81.29 87 111 121
14 79 42.48 51 89.8 111 88 108.5 159
15 77 30.96 52 83.8 90.73 89 121 245
16 68.85 55.78 53 85.2 133 90 109 137
17 75.95 43.78 54 75.5 41.9 91 97.5 165
18 74.15 33.41 55 78.4 41.71 92 105.5 152
19 73.8 43.35 56 78.6 58.16 93 98 181
20 75.9 29.31 57 87.8 88.85 94 94.5 80.95
21 76.85 36.6 58 86.3 155 95 97 137
22 80.9 40.25 59 85.5 70.77 96 105 125
23 79.9 35.43 60 83.7 75.08 97 106 241
24 89.2 60.09 61 77.6 57.05 98 99 134
25 82 45.84 62 84.9 99.73 99 91 150
26 92 70.4 63 79.8 27.96 100 102.5 198
27 86.6 83.45 64 108.3 123 101 106 151
28 80.5 84.3 65 119.6 90.41 102 109.1 229
29 86 78.89 66 119.9 106 103 115 253
30 82.5 64.75 67 96.5 144 104 101 188
31 83.5 72.56 68 105.5 121 105 100.1 124
32 88.1 89.31 69 105 97.13 106 93.3 62.2
33 90.8 78.94 70 107 166 107 101.8 133
34 89.4 83.55 71 107 87.99 108 107.9 208
35 102 127 72 101 154 109 108.5 208
36 94.5 121 73 97 100
Simple Linear Regression – Transformation
reg <- lm(AT ~ Waist)               # Linear Regression of AT on Waist
summary(reg)                        # coefficients, R-squared, residual summary
confint(reg, level=0.95)            # 95% confidence intervals for coefficients
predict(reg, interval="predict")    # prediction intervals for the observations
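The deck fits this model in R; an equivalent least-squares sketch in Python, using the first five (Waist, AT) rows of the data set above:

```python
# First five (Waist, AT) observations from the data set
waist = [74.75, 72.6, 81.8, 83.95, 74.65]
at    = [25.72, 25.89, 42.6, 42.8, 29.84]

n = len(waist)
mx, my = sum(waist) / n, sum(at) / n

# Least-squares estimates: b1 = Sxy / Sxx, b0 = my - b1 * mx
b1 = sum((x - mx) * (y - my) for x, y in zip(waist, at)) / \
     sum((x - mx) ** 2 for x in waist)
b0 = my - b1 * mx

predicted_at = b0 + b1 * 80.0  # fitted AT at waist x = 80
```

On the full 109-row data set the estimates would differ; this subset only illustrates the formulas.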
Y = Continuous, X = Multiple & Continuous → Multiple Linear Regression
Y = Continuous, X = Multiple & Discrete → Create Dummy Variable → Multiple Linear Regression
• HP = engine horsepower
Our interest is to model the MPG of a car based on the other variables
2. Logit Analysis
3. Probit Analysis

Y = Discrete, X = Single & Continuous → Simple Logistic Regression
Y = Discrete, X = Single & Discrete → Create Dummy Variable → Simple Logistic Regression
Steps in Logistic Regression
Logistic Regression Example
Imagine that you are a Data Scientist at a very large scale integration circuit
manufacturing company. You want to know whether or not the time spent
inspecting each product impacts the quality assurance department’s ability to
detect a designing error in the circuit
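As a sketch of what such an analysis looks like, here is a tiny logistic regression fitted by gradient ascent on hypothetical inspection-time data; in practice a library such as statsmodels or scikit-learn would do the fitting, and all numbers below are made up:

```python
import math

# Hypothetical data: minutes spent inspecting each circuit, and whether a
# design error present in it was detected (1) or missed (0)
times    = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
detected = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Fit P(detect) = sigmoid(b0 + b1 * time) by gradient ascent on the
# log-likelihood (a toy fitting loop, not a production optimizer)
b0 = b1 = 0.0
lr = 0.01
for _ in range(5000):
    g0 = g1 = 0.0
    for x, y in zip(times, detected):
        err = y - sigmoid(b0 + b1 * x)  # residual on the probability scale
        g0 += err
        g1 += err * x
    b0 += lr * g0
    b1 += lr * g1

p_at_8 = sigmoid(b0 + b1 * 8)  # estimated detection probability at 8 minutes
```

A positive b1 would say that more inspection time raises the probability of catching a design error, which is exactly the question posed above.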