
DATA SCIENCE FOR BUSINESS

Business analytics and organizational change

Ershad Gholamrezaie
Department of Informatics

Nov. 2020

UMEÅ UNIVERSITY
WHAT IS DATA SCIENCE?

• Science
o “Science (from the Latin word scientia, meaning "knowledge") is a
systematic enterprise that builds and organizes knowledge in the form
of testable explanations and predictions about the universe.” (Wikipedia)

• Data?

UMEÅ UNIVERSITY
DATA SCIENCE FOR BUSINESS ERSHAD.GHOLAMREZAIE@UMU.SE
WHAT IS DATA?

• In general, data are information
o Observations
• General types of data
o Quantitative vs qualitative
o Structured vs unstructured
o Discrete vs continuous
(https://commons.wikimedia.org)

WHAT IS DATA SCIENCE?

• “Data science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.” (Wikipedia)
• Data in data science
o Collected analytical information used to find specific answers to specific questions
(https://thedatascientist.com)

WHAT IS DATA SCIENCE FOR BUSINESS?

• Data science for business is a set of fundamental scientific principles for collecting, processing, and interpreting data in a business context.

o Healthcare
o Public sector
o Hazard and risk management
o Finance and retail
o IoT
o Literally EVERYWHERE
(https://thedatascientist.com)

QUALITATIVE VS QUANTITATIVE DATA

Quantitative                  Qualitative
• Objective                   • Subjective
• Conclusive                  • Interpretive
• Countable                   • Conceptual
• Measurable                  • Descriptive
• Categorized by numbers      • Categorized by characteristics

STRUCTURED VS UNSTRUCTURED DATA

• Structured data
o Well-defined format
o Mainly quantitative data
o E.g. spreadsheets (easy to work with)

• Unstructured data
o No well-defined format
o Mainly qualitative data
o Hard to analyse

(https://learn.g2.com)

DISCRETE VS CONTINUOUS DATA

• Discrete data
o Disconnected (separate, countable values)
o The number of … (quantitative)
o Can be qualitative too (e.g. gender)

• Continuous data
o Not restricted to fixed values
o Measurements
o Range (time-dependent)
(https://learn.g2.com)

GENERAL FLOWCHART
[Flowchart (Provost & Fawcett: Fig. 2-2) with the stages: Data acquisition, Data preparation, Data processing, Visualization (Modeling), Model evaluation, Deployment]
• Data acquisition: the 1st step and the most important one?!

SUMMARY

Overfitting!
(LinkedIn, Steve Nouri)

MODELLING TECHNIQUES

General Techniques
• Regression
• Classification
• Clustering

General Methodology
• Input (data)
• Function (Reg./Clas./Clus.)
• Output (model, new data)

REGRESSION (TREND LINE)
• Regression is a function that fits a line (or curve) to a set of data points
o Linear (y = ax + b)
o Nonlinear:
1) Logarithmic
2) Exponential
3) Polynomial
4) Power
• Input:
o Predictors
• Output:
o Response (well suited to quantitative data)

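REGRESSION (TREND LINE)

• Regression is a function that fits a line to a set of data points
o Linear (y = ax + b), where a and b are unknown: y = ?x + ?
o With two points $(x_1, y_1)$ and $(x_2, y_2)$:
$a = \dfrac{y_2 - y_1}{x_2 - x_1}$, $\quad y - y_2 = a\,(x - x_2)$
o For all n points, calculate $x^2$ and $xy$, then the sums $\sum x$, $\sum y$, $\sum x^2$, $\sum xy$:
$a = \dfrac{n \sum (xy) - \sum x \sum y}{n \sum (x^2) - (\sum x)^2}$
$b = \dfrac{\sum y - a \sum x}{n}$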
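
The least-squares formulas for a and b can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the function name `fit_line` and the data points are hypothetical and chosen so that they lie exactly on y = 2x + 1.

```python
def fit_line(xs, ys):
    """Return slope a and intercept b of the least-squares line y = a*x + b."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)               # sum of x squared
    sum_xy = sum(x * y for x, y in zip(xs, ys))   # sum of x times y
    a = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = (sum_y - a * sum_x) / n
    return a, b


# Hypothetical data points lying exactly on y = 2x + 1
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
a, b = fit_line(xs, ys)
print(a, b)  # slope and intercept of the fitted line
```

With real data the points will not sit exactly on a line, and a and b then describe the line that minimizes the squared vertical distances to the points.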

CLASSIFICATION
• Classification is a model that assigns observations to a specific category (supervised)
• Input:
o Earlier observations (labelled)
• Function:
o Classifier (maps unseen data to a class)
• Output:
o Class (known)
o Well suited to qualitative data
(Provost & Fawcett: Fig. 3-13)
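
The input, function, and output roles can be illustrated with a very small classifier. The sketch below uses a nearest-neighbour rule, which is only one possible classifier and is not named in the slides; the function name, the feature values, and the class labels "low" and "high" are all hypothetical.

```python
import math


def nearest_neighbour_class(labelled, point):
    """Return the class of the labelled observation closest to `point`."""
    closest = min(labelled, key=lambda obs: math.dist(obs[0], point))
    return closest[1]


# Input: earlier, labelled observations as (features, class) pairs (made-up values)
labelled = [
    ((1.0, 1.0), "low"),
    ((1.5, 2.0), "low"),
    ((8.0, 8.0), "high"),
    ((9.0, 7.5), "high"),
]

# Function: the classifier maps an unseen observation to a known class
predicted = nearest_neighbour_class(labelled, (7.0, 9.0))
print(predicted)  # Output: one of the known classes
```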

CLUSTERING
• Clustering is a model that assigns data to a specific category (unsupervised)
• Input:
o Unlabelled data
• Function:
o K-means / hierarchical clustering
• Output:
o Cluster (unknown)
(Provost & Fawcett: Fig. 6-2)

K-MEANS METHOD

• It is a simple and popular clustering method (Tableau)
• Each cluster groups similar data points around a centre point (centroid)
• “K” is the number of centroids
• Centroids are the centres of the clusters
• “Means” refers to the average of the data points assigned to each centroid
(Provost & Fawcett: Fig. 6-12)
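
The assign-then-average loop behind K-means can be sketched as follows. This is a bare-bones illustration of the standard iteration, not Tableau's implementation; the function name `k_means`, the sample points, and the two starting centroids (so K = 2) are hypothetical.

```python
import math


def k_means(points, centroids, iterations=10):
    """Repeat: assign each point to its nearest centroid, then move each centroid to the mean."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = mean of the assigned points (unchanged if a cluster is empty)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Made-up 2-D points forming two visible groups; K = 2 starting centroids
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, [(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

The result depends on the starting centroids, which is why practical tools usually choose them carefully or try several starts.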

K-MEANS METHOD

• Formula in Tableau
o $\dfrac{\mathrm{Var}_B}{\mathrm{Var}_W} \times \dfrac{N - k}{k - 1}$
o Var is variance, B is between, W is within, N is the number of cases, k is the number of clusters

(Provost & Fawcett: Fig. 6-12)

(Provost & Fawcett: Fig. 6-13)

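The between/within ratio used to judge a clustering can be computed by hand for a small one-dimensional example. This sketch assumes that Var_B and Var_W denote the between-cluster and within-cluster sums of squared deviations (the reading under which the ratio matches the Calinski-Harabasz index); that interpretation, the function name `variance_ratio`, and the data are assumptions, not taken from the slides.

```python
def variance_ratio(clusters):
    """(Var_B / Var_W) * (N - k) / (k - 1) for a list of 1-D clusters."""
    all_values = [v for cluster in clusters for v in cluster]
    n, k = len(all_values), len(clusters)
    grand_mean = sum(all_values) / n
    means = [sum(c) / len(c) for c in clusters]
    # Between: how far each cluster mean lies from the overall mean
    between = sum(len(c) * (m - grand_mean) ** 2 for c, m in zip(clusters, means))
    # Within: how far each value lies from its own cluster mean
    within = sum((v - m) ** 2 for c, m in zip(clusters, means) for v in c)
    return (between / within) * (n - k) / (k - 1)


# Made-up values: a well-separated split versus a mixed-up split of the same data
good_split = variance_ratio([[1, 2, 3], [7, 8, 9]])
poor_split = variance_ratio([[1, 2, 7], [3, 8, 9]])
print(good_split, poor_split)
```

A larger value indicates clusters that are tight internally and far apart from each other, so the well-separated split scores higher.
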
STATISTICS IN DATA SCIENCE?

• “Statistics consists of a body of methods for collecting and analysing data.” (Agresti & Finlay, 1997)

(https://thedatascientist.com)

TYPES OF STATISTICS

• Descriptive statistics (Expressive)
o Presenting
o Organizing
o Summarizing
o Limited to the boundaries of the dataset
o Mean, Median, Variance, Graphs, …
(Provost & Fawcett: Fig. 2-2)

TYPES OF STATISTICS

• Inferential statistics (Conclusive)
o Draws conclusions about a population based on one or more samples
o Goes beyond the boundaries of the dataset
o It may use descriptive statistics
  - Mean, Median, …
(Provost & Fawcett: Fig. 3-13)

VARIABLE / POPULATION / SAMPLE

• A variable is a property or characteristic of an event that can vary and take on different values.

• A population is the entire group of individuals.

• A sample is the part of the population that is considered for a study.
(Provost & Fawcett: Fig. 3-13)

AVERAGE & MEAN & MEDIAN
• Mean (arithmetic mean)
  - $\bar{x}$: sample mean
  - $\mu$: population mean
  - $\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$
• Average = Mean

Day        Temp (°C)
Monday     25
Tuesday    20
Wednesday  18
Thursday   23
Friday     28
Mean       22.8

AVERAGE & MEAN & MEDIAN
• Median is the middle value of a set of ordered data
  - $\tilde{x} = x_{med}$
  - $med = \dfrac{n + 1}{2}$

Day        Temp (°C)
Monday     25
Tuesday    20
Wednesday  18
Thursday   23
Friday     28
Median     23

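AVERAGE & MEAN & MEDIAN
• Median is the middle value of a set of ordered data
  - $\tilde{x} = x_{med}$
  - $med = \dfrac{n + 1}{2}$
  - With n = 6, med = 3.5, so the median is the mean of the 3rd and 4th ordered values: (23 + 25) / 2 = 24

Day        Temp (°C)   Rank
Monday     25          4
Tuesday    20          2
Wednesday  18          1
Thursday   23          3
Friday     28          5
Saturday   33          6
Median     24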
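
The mean and median values in the temperature tables can be checked with Python's standard `statistics` module. This is a small verification sketch; the variable names are illustrative only.

```python
import statistics

five_days = [25, 20, 18, 23, 28]   # Monday to Friday, in °C
six_days = five_days + [33]        # the same week plus Saturday

mean_5 = statistics.mean(five_days)      # arithmetic mean of the five values
median_5 = statistics.median(five_days)  # odd count: the single middle value
median_6 = statistics.median(six_days)   # even count: mean of the two middle values

print(mean_5, median_5, median_6)
```

Note that `statistics.median` sorts the values itself, so the data do not need to be ordered beforehand.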

VARIANCE
• Variance summarizes the values in a dataset as the average squared deviation of each value from the mean.
• Population:
  - $\sigma^2 = \dfrac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$
• Sample (dividing by n - 1 instead of n gives the unbiased estimate):
  - $s^2 = \dfrac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

Day                     Temp (°C)
Monday                  -1
Tuesday                 -2
Wednesday               0
Thursday                1
Friday                  2
Saturday                -3
Variance ($\sigma^2$)   2.917

STANDARD DEVIATION
• Standard deviation measures the spread of the data around the mean.
o The mean is taken as the centre of the dataset (its central tendency)
  - $\sigma = \sqrt{\sigma^2}$ (Population)
  - $s = \sqrt{s^2}$ (Sample)

Day        Temp (°C)
Monday     -1
Tuesday    -2
Wednesday  0
Thursday   1
Friday     2
Saturday   -3
$\sigma$   1.708
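
The variance and standard deviation of the six temperatures can be reproduced with Python's `statistics` module. In this sketch `pvariance` and `pstdev` treat the data as a whole population (dividing by N), while `variance` treats them as a sample and divides by n - 1; the variable names are illustrative.

```python
import statistics

temps = [-1, -2, 0, 1, 2, -3]   # Monday to Saturday, in °C

population_variance = statistics.pvariance(temps)  # divides by N
population_std = statistics.pstdev(temps)          # square root of the population variance
sample_variance = statistics.variance(temps)       # divides by n - 1

print(round(population_variance, 3), round(population_std, 3), sample_variance)
```

The population figures correspond to the variance and σ rows in the tables; the sample variance is larger because of the smaller divisor.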

PROBABILITY
• The probability of an event is the relative frequency with which it would occur in a long series of repetitions.
• $P(A)$ and $P(B)$ are the probabilities of events A and B, respectively.
• $P(A = 6) = \dfrac{1}{6}$ $\left(\dfrac{\text{event}}{\text{total events}}\right)$
• $P(B = 6) = \dfrac{1}{6}$

Dice A roll   Dice B roll
1             1
2             2
3             3
4             4
5             5
6             6

PROBABILITY
• P(A ∩ B) is the probability of the intersection of events A and B.
o A ∩ B is the event that both A and B occur.
• P(A ∪ B) is the probability of the union of events A and B.
o A ∪ B is the event that either A or B (or both) occurs.
• P(A|B) is the conditional probability that event A occurs, given that event B has occurred.
o A|B is the event that A occurs when B has already occurred.

Let’s have fun ;-)
• P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
• P(A ∩ B) = P(A) × P(B|A)

• What is the probability of double 6?
$P(A = 6) = \dfrac{1}{6}$, $\quad P(B = 6) = \dfrac{1}{6}$

• What is the probability of a “6” on dice A and any number other than “6” on dice B?
$P(A = 6) = \dfrac{1}{6}$, $\quad P(B \neq 6) = \dfrac{5}{6}$
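
Both dice questions can be answered by counting outcomes. The sketch below assumes two fair dice rolled independently, so that all 36 (A, B) pairs are equally likely; the helper name `probability` is hypothetical.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (A, B) outcomes for two fair, independent dice
outcomes = list(product(range(1, 7), repeat=2))


def probability(event):
    """Favourable outcomes divided by total outcomes, as an exact fraction."""
    favourable = sum(1 for a, b in outcomes if event(a, b))
    return Fraction(favourable, len(outcomes))


p_double_six = probability(lambda a, b: a == 6 and b == 6)
p_six_and_not_six = probability(lambda a, b: a == 6 and b != 6)
p_at_least_one_six = probability(lambda a, b: a == 6 or b == 6)

print(p_double_six, p_six_and_not_six, p_at_least_one_six)
```

Because the dice are independent, P(B|A) = P(B), so the multiplication rule gives the same results as the counts: 1/6 × 1/6 for double 6 and 1/6 × 5/6 for the second question. The last value illustrates the addition rule, 1/6 + 1/6 - 1/36.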