Data Science For Business: Business Analytics and Organizational Change
Ershad Gholamrezaie
Department of Informatics
Nov. 2020
UMEÅ UNIVERSITY
WHAT IS DATA SCIENCE?
• Science
o “Science (from the Latin word scientia, meaning "knowledge") is a
systematic enterprise that builds and organizes knowledge in the form
of testable explanations and predictions about the universe.” (Wikipedia)
• Data?
UMEÅ UNIVERSITY
DATA SCIENCE FOR BUSINESS ERSHAD.GHOLAMREZAIE@UMU.SE
WHAT IS DATA?
WHAT IS DATA SCIENCE?
WHAT IS DATA SCIENCE FOR BUSINESS?
✓ Healthcare
✓ Public sector
✓ Hazard and risk management
✓ Finance and retail
✓ IoT
✓ Literally EVERYWHERE
(https://thedatascientist.com)
QUALITATIVE VS QUANTITATIVE DATA
Quantitative                   Qualitative
• Objective                    • Subjective
• Conclusive                   • Interpretive
• Countable                    • Conceptual
• Measurable                   • Descriptive
• Categorized by numbers       • Categorized by characteristics
STRUCTURED VS UNSTRUCTURED DATA
• Structured data
o Well-defined format
o Mainly quantitative data
o Example: spreadsheets (easy to work with)
• Unstructured data
o No well-defined format
o Mainly qualitative data
o Hard to analyse
(https://learn.g2.com)
DISCRETE VS CONTINUOUS DATA
• Discrete data
o Disconnected (separate, countable values)
o The number of … (quantitative)
o Can be qualitative too (e.g. gender)
• Continuous data
o Not a fixed number (any value within a range)
o Measurements
o Range (time-dependent)
(https://learn.g2.com)
GENERAL FLOWCHART
(Provost & Fawcett: Fig. 2-2)
• Data acquisition (1st step and the most important one?!)
• Data preparation
• Data processing
• Visualization (Modeling)
• Model evaluation
• Deployment
SUMMARY
(LinkedIn, Steve Nouri)
SUMMARY
Overfitting!
(LinkedIn, Steve Nouri)
MODELLING TECHNIQUES
REGRESSION
• Regression fits a line or curve (a trend line) to a set of data points
o Linear (y = ax + b)
o Nonlinear:
1) Logarithmic
2) Exponential
3) Polynomial
4) Power
• Input:
o Predictors
• Output:
o Response (well suited to quantitative data)
REGRESSION
• Linear regression: find the unknown a and b in $y = ax + b$
o y = ?x + ?
o Through two points $(x_1, y_1)$ and $(x_2, y_2)$:
$a = \frac{y_2 - y_1}{x_2 - x_1}$
$y - y_2 = a (x - x_2)$
o For all $n$ points, calculate $\sum x$, $\sum y$, $\sum x^2$ and $\sum xy$:
$a = \frac{n \sum (xy) - \sum x \sum y}{n \sum (x^2) - (\sum x)^2}$
$b = \frac{\sum y - a \sum x}{n}$
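The least-squares sums for the slope a and intercept b can be checked with a few lines of Python. This is a minimal sketch, not the course's own implementation: the function name `fit_line` and the sample points are assumptions made up for illustration.

```python
def fit_line(xs, ys):
    """Fit y = a*x + b by least squares using the sums over all n points."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    a = (n * sum_xy - sum_x * sum_y) / denominator
    b = (sum_y - a * sum_x) / n
    return a, b


# Hypothetical points that lie exactly on y = 2x + 1.
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)  # 2.0 1.0
```

Because the sample points sit exactly on a line, the fit returns that line; with noisy data the same sums give the best-fitting trend line.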
CLASSIFICATION
• Classification assigns each observation to one of a set of known categories (supervised)
• Input:
o Earlier observations (labelled)
• Function:
o Classifier (maps unseen data to a class)
• Output:
o Class (known)
o Good for qualitative data
(Provost & Fawcett: Fig. 3-13)
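The input, function and output listed above can be illustrated with a very small classifier. The slide does not name a specific classifier, so this sketch swaps in a 1-nearest-neighbour rule purely for illustration; the labelled observations and the class names "low" and "high" are hypothetical.

```python
import math


def classify(labelled, point):
    """Return the class of the labelled observation closest to `point`."""
    _, nearest_label = min(labelled, key=lambda obs: math.dist(obs[0], point))
    return nearest_label


# Earlier observations (labelled): (features, class). Values are made up.
labelled = [
    ((1.0, 1.0), "low"),
    ((1.5, 2.0), "low"),
    ((8.0, 8.0), "high"),
    ((9.0, 7.5), "high"),
]

# Unseen data is mapped to one of the known classes.
print(classify(labelled, (2.0, 1.0)))  # low
print(classify(labelled, (7.0, 9.0)))  # high
```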
CLUSTERING
• Clustering groups data into categories that are not known in advance (unsupervised)
• Input:
o Unlabelled data
• Function:
o K-means / Hierarchical clustering
• Output:
o Cluster (unknown)
(Provost & Fawcett: Fig. 6-2)
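A rough sketch of the k-means idea: assign each point to its nearest centroid, then move each centroid to the mean of its points, and repeat. The function name `k_means`, the fixed iteration count, the naive initialisation (the first k points) and the sample data are all assumptions chosen to keep the example short; real tools use more careful initialisation and stopping rules.

```python
import math


def k_means(points, k, iterations=10):
    """Cluster `points` into k groups; returns (centroids, labels)."""
    # Assumption: start from the first k points (not how production tools initialise).
    centroids = list(points[:k])

    def nearest(p):
        return min(range(k), key=lambda c: math.dist(p, centroids[c]))

    for _ in range(iterations):
        # Assignment step: group points by nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p)].append(p)
        # Update step: move each centroid to the mean of its members.
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))

    labels = [nearest(p) for p in points]
    return centroids, labels


# Hypothetical unlabelled data with two visible groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 7.5), (8.5, 9)]
centroids, labels = k_means(points, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The cluster numbers themselves carry no meaning; only the grouping does, which is why the output class is described as unknown.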
K-MEANS METHOD
• Formula in Tableau
[Formula shown as an image on the slide; not recoverable]
STATISTICS IN DATA SCIENCE?
(https://thedatascientist.com)
TYPES OF STATISTICS
o Organizing
o Summarizing
▪ Mean, Median, …
(Provost & Fawcett: Fig. 3-13)
VARIABLE / POPULATION / SAMPLE
AVERAGE & MEAN & MEDIAN
• Mean (arithmetic mean)
➢ $\bar{x}$: sample
➢ $\mu$: population
➢ $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
• Average = Mean

Day         Temp (°C)
Monday      25
Tuesday     20
Wednesday   18
Thursday    23
Friday      28
Mean        22.8
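As a quick check of the arithmetic mean, the five temperatures from the table can be summed and divided by their count. The variable names are illustrative only.

```python
# Temperatures (°C) for Monday to Friday, taken from the table.
temps = [25, 20, 18, 23, 28]

# Arithmetic mean: sum of the values divided by how many there are.
mean_temp = sum(temps) / len(temps)
print(mean_temp)  # 22.8
```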
AVERAGE & MEAN & MEDIAN
• Median is the middle value of a set of ordered data
➢ $\tilde{x} = x_{med}$
➢ $med = \frac{n+1}{2}$

Day         Temp (°C)
Monday      25
Tuesday     20
Wednesday   18
Thursday    23
Friday      28
Median      23
AVERAGE & MEAN & MEDIAN
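• Median is the middle value of a set of ordered data
➢ $\tilde{x} = x_{med}$
➢ $med = \frac{n+1}{2}$
• With $n = 6$, $med = 3.5$, so the median is the mean of the 3rd and 4th ordered values: $(23 + 25) / 2 = 24$

Day         Temp (°C)   (rank)
Monday      25          (4)
Tuesday     20          (2)
Wednesday   18          (1)
Thursday    23          (3)
Friday      28          (5)
Saturday    33          (6)
Median      24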
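The two median cases (an odd and an even number of values) can be sketched in one small function. The name `median_of` is an assumption for this example; Python's standard `statistics.median` does the same job.

```python
def median_of(values):
    """Middle value of the ordered data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


five_days = [25, 20, 18, 23, 28]      # Monday to Friday
six_days = [25, 20, 18, 23, 28, 33]   # Monday to Saturday

print(median_of(five_days))  # 23
print(median_of(six_days))   # 24.0
```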
VARIANCE
• Variance summarizes a dataset as the average squared deviation of each value from the mean.
• Population:
➢ $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$
• Sample:
➢ $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

Day                     Temp (°C)
Monday                  -1
Tuesday                 -2
Wednesday               0
Thursday                1
Friday                  2
Saturday                -3
Variance ($\sigma^2$)   2.917
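The variance in the table can be reproduced by hand-coding both definitions. This is a sketch under one stated assumption: the sample version divides by n − 1 (the usual convention), while the population version divides by the number of values. The function names are illustrative.

```python
def population_variance(values):
    """Average squared deviation from the mean, dividing by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Squared deviations from the sample mean, dividing by n - 1 (assumed convention)."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


temps = [-1, -2, 0, 1, 2, -3]  # Monday to Saturday, from the table

print(round(population_variance(temps), 3))  # 2.917
print(sample_variance(temps))                # 3.5
```

The value 2.917 in the table matches the population formula; treating the same six values as a sample gives a slightly larger figure.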
STANDARD DEVIATION
• Standard deviation measures the spread of the data around the mean.
o The mean is taken as the centre of the dataset
o Central tendency
➢ $\sigma = \sqrt{\sigma^2}$ (Population)
➢ $s = \sqrt{s^2}$ (Sample)

Day         Temp (°C)
Monday      -1
Tuesday     -2
Wednesday   0
Thursday    1
Friday      2
Saturday    -3
$\sigma$    1.708
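The standard deviation is simply the square root of the variance, which can be cross-checked with Python's standard `statistics` module. A minimal sketch; `pstdev` is the population version and `stdev` the sample version.

```python
import math
import statistics

temps = [-1, -2, 0, 1, 2, -3]  # Monday to Saturday, from the table

# Population standard deviation: square root of the population variance.
sigma = math.sqrt(statistics.pvariance(temps))
print(round(sigma, 3))  # 1.708

# The library's direct functions for the population and sample versions.
print(round(statistics.pstdev(temps), 3))  # 1.708
print(round(statistics.stdev(temps), 3))   # 1.871
```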
PROBABILITY
• Probability is the relative frequency with which an event would occur in a series of repetitions.
• $P(A)$ and $P(B)$ are the probabilities of events A and B, respectively.
• $P(A=6) = \frac{1}{6}$ (event / total events)
• $P(B=6) = \frac{1}{6}$

Rolling die A   Rolling die B
1               1
2               2
3               3
4               4
5               5
6               6
PROBABILITY
• P(A ∩ B) is the probability of the intersection of events A and B.
o A ∩ B is the event that both A and B occur.
• P(A ∪ B) is the probability of the union of events A and B.
o A ∪ B is the event that either A or B (or both) occurs.
• P(A|B) is the conditional probability that event A occurs, given that event B has occurred.
o A|B is the event that A occurs when B has already occurred.

Rolling die A   Rolling die B
1               1
2               2
3               3
4               4
5               5
6               6
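These three quantities can be made concrete by listing all 36 outcomes of rolling two dice and counting. The sketch assumes fair, independent dice and takes A as "die A shows 6" and B as "die B shows 6"; the helper `prob` is a made-up name for this example.

```python
from fractions import Fraction
from itertools import product

# Every (die A, die B) outcome; all 36 are assumed equally likely.
outcomes = list(product(range(1, 7), repeat=2))


def prob(event):
    """Favourable outcomes divided by total outcomes."""
    favourable = sum(1 for o in outcomes if event(o))
    return Fraction(favourable, len(outcomes))


def a_is_6(o):
    return o[0] == 6


def b_is_6(o):
    return o[1] == 6


p_both = prob(lambda o: a_is_6(o) and b_is_6(o))    # P(A ∩ B)
p_either = prob(lambda o: a_is_6(o) or b_is_6(o))   # P(A ∪ B)
p_a_given_b = p_both / prob(b_is_6)                 # P(A|B)

print(p_both, p_either, p_a_given_b)  # 1/36 11/36 1/6
```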
• P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
• P(A ∩ B) = P(A) × P(B|A)
Let’s have fun ;-)
• What is the probability of a “6” on die A and any number other than “6” on die B?
$P(A=6) = \frac{1}{6}$
$P(B=6) = \frac{1}{6}$
$P(B \neq 6) = \frac{5}{6}$
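One way to work the exercise is to apply the multiplication rule and then confirm the result by counting outcomes. The sketch assumes the two dice are fair and independent, so that P(B≠6 | A=6) = P(B≠6); the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

p_a_is_6 = Fraction(1, 6)
p_b_not_6 = Fraction(5, 6)

# Multiplication rule: P(A ∩ B) = P(A) × P(B|A), with P(B|A) = P(B) under independence.
answer = p_a_is_6 * p_b_not_6
print(answer)  # 5/36

# Cross-check by counting the outcomes where die A is 6 and die B is not.
outcomes = list(product(range(1, 7), repeat=2))
favourable = sum(1 for a, b in outcomes if a == 6 and b != 6)
counted = Fraction(favourable, len(outcomes))
print(counted)  # 5/36
```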