FDS L1 to L8 slides
Objectives:
CO1: Get introduced to the field of Data Science, and the roles, process and challenges involved therein
CO2: Explore and experience the steps involved in data preparation and exploratory data analysis
CO3: Learn to select and apply the proper analytics technique for various scenarios, assess the model's performance and interpret the results of the predictive model
CO4: Get familiar with the general deployment considerations of predictive models
CO5: Appreciate the importance of techniques like data visualization and storytelling with data for the effective presentation of outcomes to stakeholders
Evaluation Component:
EC1 – Experiential Learning Assignments 1 and 2: Take Home (Online), 25%, to be announced
Pravin Mhaske
Breaking it down:
• Data: Everything is data. Structured, unstructured.
• Scientific methods: a scientific approach of asking questions, collecting data, analyzing, interpreting and concluding
• Statistics: Patterns, trends, insights
• Domain expertise: SME, actionable and relevant insights
• Programming: Process and manipulate data
• Traffic lights
• Online dating
• Career opportunities
• Data is growing
• Demand-Supply gap
Business Analyst
Data Engineer
Data/BI Analyst
ML Engineer
Data Scientist
Source: https://techcommunity.microsoft.com/t5/azure-developer-community-blog/the-data-science-process-with-azure-machine-learning/ba-p/336162
1. Business Understanding
• What is the problem?
• What is the objective?
• What is causing the problem?
2. Data Understanding
• What data do we have?
• What data do we need?
3. Data Preparation
• Data Collection – Sources, format
• How to get the data?
• Where to store the data? In what format?
• Is the data clean and complete?
• Data Cleaning
• EDA – data to insights
• Feature Engineering
4. Modeling
• What kind of a problem?
• What kind of an algorithm?
1. Data Collection
• Hadoop Ecosystem (HDFS, Hive, Pig)
2. Data Preparation
• SQL
• Python and Python libraries - pandas
3. EDA
• Excel
• RStudio
• Power BI
• Tableau
• Python libraries – matplotlib, pandas, seaborn
4. Statistical Analysis
• RStudio
• MATLAB
• SAS
• SPSS
5. Model building
• Jupyter Notebook
• Python libraries – NumPy, SciPy, scikit-learn
• TensorFlow
• PyTorch
• AWS/Azure/GCP
Data types: Qualitative (Categorical) and Quantitative (Numeric)
Nominal
• No specific order
• All categories are equal
• Cannot be measured
• Gender, colors, divisions
Ordinal
• Natural order
• Categories can be compared
• High-Medium-Low, First-Second-Third, etc.
Discrete
• Only certain values
• Typically whole numbers
• Countable
• Runs, goals, marks
Continuous
• All possible values within a range
• Typically with fractions and decimals
• Measurable
• Height, weight, temperature
Formats
• Tabular – rows and columns (xls, csv)
• Web data – JSON, XML
• Time series dataset
• Image dataset
• Bivariate
• Multivariate
BITS Pilani, Pilani Campus
Why Data Quality?
1. Better decisions
2. Correct analysis and insights
3. Better problem-solving
4. Reliable results
5. Less ambiguity
6. Customer experience
7. Compliance
8. Cost
• Measures of Dispersion
– Range
– Standard Deviation/Variance
– IQR
– Coefficient of Variation
• Measures of Association
– Covariance
– Correlation
Variance/standard deviation
• Numerous applications in descriptive statistics, statistical inference, hypothesis testing, Monte Carlo
simulation, analysis of variance.
• Wide applications in physics, biology, chemistry, economics, and finance.
Sample standard deviation:
  s = √( Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1) )
Coefficient of Variation (CV)
• A relative measure of dispersion.
• It has enormous applications in quality assurance studies.
• Useful in comparing dispersions of two distributions having different measurement units.
  CV = s / x̄  (for sample data)
  CV = σ / μ  (for population data)
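As a quick illustration, the sample standard deviation and CV can be computed in plain Python (the data below is made up):

```python
import math

sample = [4.0, 8.0, 6.0, 5.0, 3.0, 7.0]  # hypothetical measurements
n = len(sample)
mean = sum(sample) / n

# Sample variance uses n - 1 in the denominator (Bessel's correction)
variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
s = math.sqrt(variance)

# Coefficient of variation: dispersion relative to the mean
cv = s / mean

print(round(s, 3), round(cv, 3))  # → 1.871 0.34
```

Because CV is unit-free, it lets you compare the spread of, say, heights in centimetres against weights in kilograms.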
Measures of Dispersion
• Range: difference between the largest and the smallest values in a dataset.
• Interquartile range: difference between the third (upper) quartile and first (lower)
quartile. IQR = Q3 – Q1
• An outlier (spurious data point) is an observation point that is distant from other
observations.
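A sketch of range and IQR using Python's standard library, on a hypothetical dataset; `statistics.quantiles` with `method="inclusive"` interpolates between data points, one of several common quartile conventions:

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # made-up sample

# Range: largest value minus smallest value
data_range = max(data) - min(data)

# Quartiles Q1, Q2 (median), Q3; IQR = Q3 - Q1
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(data_range, iqr)  # → 9 4.5
```

Points farther than 1.5 × IQR below Q1 or above Q3 are a common rule of thumb for flagging outliers.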
Sample covariance:
  s_xy = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)

Population covariance:
  σ_xy = Σᵢ₌₁ᴺ (xᵢ − μ_x)(yᵢ − μ_y) / N
Measures of Association
Correlation
• Correlation is a normalized covariance. It lies between −1 and +1.
• If two variables tend to show similar behaviour, the correlation is positive; otherwise it is negative.

Sample correlation:
  r_xy = s_xy / (s_x · s_y) = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / √( Σᵢ₌₁ⁿ (xᵢ − x̄)² · Σᵢ₌₁ⁿ (yᵢ − ȳ)² )

Population correlation:
  ρ_xy = σ_xy / (σ_x · σ_y)
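The covariance and correlation formulas, sketched in plain Python on made-up paired samples:

```python
# Hypothetical paired observations
x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 3.0, 5.0, 11.0]
n = len(x)
mx = sum(x) / n
my = sum(y) / n

# Sample covariance: average co-deviation, with n - 1 in the denominator
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

# Sample correlation: covariance normalized by both standard deviations
sx = (sum((a - mx) ** 2 for a in x) / (n - 1)) ** 0.5
sy = (sum((b - my) ** 2 for b in y) / (n - 1)) ** 0.5
rxy = sxy / (sx * sy)

print(round(sxy, 3), round(rxy, 3))  # → 10.667 0.956
```

Note that the covariance depends on the measurement units while the correlation does not, which is exactly why the normalization is useful.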
Objectives:
1. Explore the data to become familiar with it
2. Discover patterns, trends and relationships
3. Spot anomalies
4. Test hypotheses or assumptions
5. Summarize the data
6. Identify missing/null values
7. Explain the outcomes or results of the analysis
8. Tell a story with the data
Text:
• Simple text
• Tables
• Heatmap
• Graphs
  ◦ Points (Scatter)
  ◦ Lines
  ◦ SlopeGraph
  ◦ Bars: Horizontal / Vertical, Stacked, Waterfall
  ◦ Area
[Example charts: scatter, line, bar (vertical, horizontal, stacked, 100% stacked), waterfall and area charts showing sample series by category and region.]
Univariate - distribution
Bivariate - relationships
• Categorical Vs Categorical
• Continuous Vs Continuous
• Continuous Vs Categorical
Multivariate
• Excel
• Tableau
• Power BI
• R/RStudio
• QlikView/Qlik Sense…
M4 Predictive Modeling
Lecture 5 Linear Regression
What is Machine Learning?
According to Tom M. Mitchell, Chair of Machine Learning at Carnegie Mellon University and
author of the book Machine Learning (McGraw-Hill),
A computer program is said to learn from experience E with respect to some class of tasks T and
performance measure P, if its performance at tasks in T, as measured by P, improves with the experience
E.
We now have a set of objects to define machine learning:
Task (T), Experience (E), and Performance (P)
With a computer running a set of tasks, the experience should lead to performance increases (to satisfy the definition)
Many data mining tasks are executed successfully with the help of machine learning
The Advertising data set has 4 variables and 6 observations. The variable names are "TV", "Radio", "Paper" and "Sales".

#   TV     Radio  Paper  Sales
1   230.1  37.8   69.2   22.1
2   44.5   39.3   45.1   10.4
3   17.2   45.9   69.3   9.3
4   151.5  41.3   58.5   18.5
5   180.8  10.8   58.4   12.9
6   8.7    48.9   75.0   7.2

p = 3 (the number of independent variables)
n = 6 (the number of observations)
Y = β₀ + β₁X₁ + β₂X₂ + · · · + β_pX_p + ε
1. E(ε) = 0
2. The model adequately captures the relationship
3. Var(ε) = σ2 for all values of the independent variables (Homoscedasticity)
4. ε is normally distributed
5. The values of ε are independent (No Serial Correlation or Autocorrelation)
6. There is no (or little) multicollinearity among the independent variables
Solutions to multicollinearity
1. Drop unnecessary variables
2. Advanced techniques: Ridge / Lasso / Stepwise / Principal Components Regression
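One quick screen for multicollinearity is the pairwise correlation between predictors; below is a plain-Python sketch using the six Advertising rows above (a variance inflation factor check would be the more rigorous option):

```python
# The six observations of the Advertising example
tv    = [230.1, 44.5, 17.2, 151.5, 180.8, 8.7]
radio = [37.8, 39.3, 45.9, 41.3, 10.8, 48.9]
paper = [69.2, 45.1, 69.3, 58.5, 58.4, 75.0]

def corr(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# |r| close to 1 between two predictors is a red flag for multicollinearity
print(round(corr(tv, radio), 3))
print(round(corr(tv, paper), 3))
print(round(corr(radio, paper), 3))
```

If a pair of predictors is highly correlated, dropping one of them (solution 1 above) is the simplest remedy.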
Simple Regression: x (Education) → y (Income)
Multiple Regression: x1 (Education), …, x4 (Age) → y (Income)
b₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
b₀ = ȳ − b₁x̄
where:
xᵢ = value of the independent variable for the iᵗʰ observation
yᵢ = value of the dependent variable for the iᵗʰ observation
x̄ = mean value of the independent variable
ȳ = mean value of the dependent variable

• Slope for the Estimated Regression Equation: b₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² = 20/4 = 5
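The same least-squares formulas in Python. The data below is hypothetical, chosen so that Σ(xᵢ − x̄)(yᵢ − ȳ) = 20 and Σ(xᵢ − x̄)² = 4, reproducing the b₁ = 20/4 = 5 result above:

```python
# Hypothetical data whose sums match the worked example
x = [1, 3, 2, 1, 3]
y = [14, 24, 18, 17, 27]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# Least-squares slope and intercept
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
     / sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar

print(b1, b0)  # → 5.0 10.0
```

The fitted line is then ŷ = b₀ + b₁x, and predictions follow by plugging in new x values.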
M4 Predictive Modeling
Lecture 6 Classification, Logistic Regression
Classification
• Naïve Bayes
• K-nearest Neighbour
• Logistic Regression
• Discriminant Analysis
• Decision Trees
• Support Vector Machine
Test error rate = Ave( I(yᵢ ≠ ŷᵢ) ), the average of the indicator that a predicted class ŷᵢ differs from the true class yᵢ over the test observations.
A good classifier is one for which the test error rate is the smallest
• Regression gives a number. What if I want to identify a class or category and not a
number?
• Let’s say I want to identify genuine emails vs spam emails, or genuine transactions vs
fraudulent transactions. Here the outcomes are text values, but models can understand only
numbers.
• How do I handle this? I will replace the 2 classes by numbers. Say, one class as 1 and
another as 0 and train a model which can predict the outcome value that is 0 or 1.
• But models can’t give discrete values 0 and 1.
• We can rather make it give a continuous value between 0 and 1.
• If the value is closer to 1 (i.e. >= 0.5), I consider it as 1, otherwise 0.
For an event E:
• If we know P(E), then
  Odds(E) = P(E) / P(~E) = P(E) / (1 − P(E))
• If the odds of E are “x to y”, then P(E) = x / (x + y)
Logit function:
• logit(p) = ln( p / (1 − p) ), 0 ≤ p ≤ 1
• log Odds(E) = ln( P(E) / (1 − P(E)) )
• logit can be interpreted as the log odds of a success
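A minimal Python sketch of odds and logit:

```python
import math

def odds(p):
    """Odds of an event with probability p, i.e. p / (1 - p)."""
    return p / (1 - p)

def logit(p):
    """Log-odds of p, defined for 0 < p < 1."""
    return math.log(p / (1 - p))

print(odds(0.5), logit(0.5))  # → 1.0 0.0
print(round(odds(0.8), 2))    # → 4.0
```

Even odds (1 to 1) correspond to p = 0.5, where the logit is exactly 0; logistic regression models this logit as a linear function of the predictors.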
Confusion Matrix
TP = 10, FP = 0, TN = 85, FN = 5

                    True Class
Predicted Class     Target      Non-Target
Target              TP          FP
Non-Target          FN          TN
Also check*
1. Specificity (TNR or True Negative Rate = TN/N)
2. False Positive Rate (FPR) = FP/N = 1 – TNR
3. And others…
Refer https://en.wikipedia.org/wiki/Confusion_matrix
1 is same as Positive
0 is same as Negative
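The common metrics follow directly from these four counts; a plain-Python sketch using the TP/FP/TN/FN values above:

```python
# Counts from the confusion matrix above
TP, FP, TN, FN = 10, 0, 85, 5

accuracy    = (TP + TN) / (TP + FP + TN + FN)
sensitivity = TP / (TP + FN)   # recall, true positive rate
specificity = TN / (TN + FP)   # true negative rate
precision   = TP / (TP + FP)
fpr         = FP / (FP + TN)   # = 1 - specificity

print(accuracy, round(sensitivity, 3), specificity, precision, fpr)
# → 0.95 0.667 1.0 1.0 0.0
```

Note how the 95% accuracy hides the fact that a third of the actual targets (5 of 15) were missed, which is why sensitivity and specificity are worth checking separately.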
1. Under-sampling
2. Over-sampling
3. Optimize AUC
Under-sampling
Pros:
• Easy to implement
• Training becomes much more efficient (smaller training set)
• For some domains, can work very well
Cons:
• Throws away a lot of data/information

Over-sampling
Pros:
• Easy to implement
• Utilizes all of the training data
• Tends to perform well in a broader set of circumstances than subsampling
Cons:
• Computationally more expensive to train a classifier (larger training set)
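A minimal sketch of both resampling strategies in plain Python, on a hypothetical 90/10 imbalanced dataset:

```python
import random

random.seed(0)

# Hypothetical imbalanced dataset: 90 negatives (class 0), 10 positives (class 1)
data = [(i, 0) for i in range(90)] + [(i, 1) for i in range(10)]
minority = [d for d in data if d[1] == 1]
majority = [d for d in data if d[1] == 0]

# Under-sampling: randomly discard majority examples (without replacement)
under = minority + random.sample(majority, len(minority))

# Over-sampling: randomly duplicate minority examples (with replacement)
over = majority + random.choices(minority, k=len(majority))

print(len(under), len(over))  # → 20 180
```

Both resampled sets are perfectly balanced; the trade-off is exactly the pros/cons listed above — the under-sampled set is small, the over-sampled set is large.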
                    True Class
Predicted Class     A     B     C
A                   30    20    10
B                   50    60    10
C                   20    20    80
The more zeroes or smaller the numbers on all cells but the diagonal, the better the
classifier is doing.
True Positives for a class are the observations of that class that were classified correctly
False Positives are observations that were incorrectly mapped to that class
False Negatives are observations of that class that were classified incorrectly
True Negatives: directly applicable only in a two-class scenario
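Per-class precision and recall can be read off the matrix: a row sum gives everything predicted as a class, a column sum gives everything truly in that class. A plain-Python sketch using the matrix above:

```python
# Rows = predicted class, columns = true class (the matrix from the slide)
labels = ["A", "B", "C"]
cm = [[30, 20, 10],
      [50, 60, 10],
      [20, 20, 80]]

n_total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(3)) / n_total

for i, label in enumerate(labels):
    tp = cm[i][i]
    precision = tp / sum(cm[i])                    # row sum: all predicted as this class
    recall = tp / sum(cm[r][i] for r in range(3))  # column sum: all truly this class
    print(label, round(precision, 3), round(recall, 3))

print(round(accuracy, 3))  # → 0.567
```

Here the diagonal (30, 60, 80) carries the correct classifications, so larger off-diagonal cells mean worse per-class performance, as noted above.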