Download as pdf or txt
Download as pdf or txt
You are on page 1of 66

Introduction to

Machine Learning
Ratchainant Thammasudjarit, Ph.D.
Outline
• AL, ML and DL
• Taxonomies and Use Cases
AI, ML and DL

AI is the new electricity – Andrew Ng

• From AI to DL
• Problem Complexity
• Computational Power

Image Source: Nvidia


Why healthcare needs ML

How much data is being generated daily

• The growth of data


Why healthcare needs ML

World is changing

• Task complexity
• Predefining relationship between variables are
difficult in many real world cases
Benefits of ML in Healthcare

Better life

AI, ML, &


DL
Types of Machine Learning

Supervised Learning

• Learning from known samples

• Prediction
• Classification
ML • Regression
Types of Machine Learning

Supervised Learning

• Classification
Microbial Keratitis Classification
Is a given image either bacteria or fungal
infection?

Bacteria Keratitis Infection Fungal Keratitis Infection


• Predicting two possible class is called
binary classification problem
Types of Machine Learning

Supervised Learning

• Classification
Microbial Classification

What is the microbial type of a given image?

• Predicting more than two possible class is


called Multi-classification problem
Types of Machine Learning

Supervised Learning

• Regression

Early Detection
What is the conversion likelihood from
prediabetes to type-2 diabetes of a patient’s
clinical risk factors?
Types of Machine Learning

Supervised Learning

• Example: Given a patient with the following


parameters
• Age: 45 years old
• Sex: Male
• BMI: 35 kg/sq.m
• HDL: 44 mg/dL
• LDL: 125 mg/dL
• Prediction of conversion likelihood is 0.7
Note: This is just an example to illustrate use case only.
Types of Machine Learning

Supervised Learning

• Prediction of conversion likelihood is 0.7

• Treatment plan
• Taking Glipizide to stimulate insulin generation
• Taking Metformin to reduce sugar produced by
liver

• Follow in the next 4 months


Types of Machine Learning

Supervised Learning

• Discussions
• What is the difference between classification
and regression

Classification Regression
Types of Machine Learning

Unsupervised Learning

• Learning from samples

• Tasks
• Structure discovery
ML • Clustering
• Dimensionality Reduction
Types of Machine Learning

Unsupervised Learning

• Clustering
Diagnosis Related Group

What is the DRG of a given procedure?

ML
Types of Machine Learning

Unsupervised Learning

• Each procedure consumes resources

• Procedure clustering helps hospital


manage financial better
Types of Machine Learning

Reinforcement Learning

• Self-learning

• Tasks
• Optimization
ML • Decision Making
Types of Machine Learning

Reinforcement Learning

• Decision Making

Robotic Navigation
Exploring landscape and self-decide to
perform action, e.g. moving forward, stop,
turn left, turn right
Types of Machine Learning

Reinforcement Learning

• Distance from the Earth to the Mars is 225


million kilometers

• Light speed is 300,000 km per second

• Light takes about 14 minutes traveling from the


Earth to the Mars

• What if we would like to control the rover to


explore the Mars
Types of Machine Learning

Reinforcement Learning

• State is a consequence of action


• Optimize objective function
• Maximize explored area
Forward = 0.2
Forward = 0.8 Turn = 0.3

Clear Obstruct

Turn = 0.7
Types of Machine Learning
Supervised Learning
• Prediction
Summary • Classification
• Regression

Unsupervised Learning
• Structure Discovery
• Clustering
• Dimensionality Reduction

ML
Reinforcement Learning

• Optimization
• Decision Making
Machine Learning Components

Overall

Samples

Learning Trained
Algorithm Model

Model
Machine Learning Components

Samples

Thai people who has


hyperglycemia

Population

Samples
Ramathibodi’s patient
from endocrine clinic
who has hyperglycemia
Machine Learning Components

Sample Terminologies and Notations


Example of diabetes prediction

• Features are predictors or independent


Age Sex Diabetes variables
55 Male yes
37 Female no
• Standard notation is 𝐱 as feature and 𝑥 as
48 Male yes
value
60 Female no
• 𝐱 𝑎𝑔𝑒 = 𝑥 ∈ 𝐼 + | 0 ≤ 𝑥 ≤ 100
42 Male yes
• 𝐱 𝑠𝑒𝑥 = 𝑥 | 𝑥 ∈ 𝑀𝑎𝑙𝑒, 𝐹𝑒𝑚𝑎𝑙𝑒
• Multiple features are represented by 𝐗
Machine Learning Components

Sample Terminologies and Notations


Example of diabetes prediction

• Target class is a variable to be predicted


Age Sex Diabetes • Only required for supervised learning
55 Male yes
37 Female no
• Standard notation is 𝐲 as target class and 𝑦
48 Male yes
as class label
60 Female no
• 𝐲 = 𝑦 | 𝑦 ∈ {𝑦𝑒𝑠, 𝑛𝑜}
42 Male yes
Machine Learning Components

Sample Terminologies and Notations


Example of diabetes prediction

• Supervised learning
• The sample 𝑖 is a pair of feature vector and its
Age Sex Diabetes
class label
55 Male yes • The standard notation is (𝐗 (𝑖) , 𝑦 (𝑖) )
37 Female no (𝐗 (2) , 𝑦 (2) )
48 Male yes
60 Female no
42 Male yes • Others
• The sample 𝑖 is a feature vector
• The standard notation is 𝐗 (𝑖)
Machine Learning Components

Sample Terminologies and Notations


Example of diabetes prediction

• Supervised Learning
Age Sex Diabetes • Dataset is a collection of 𝑚 pairs of feature
vector and its class label
55 Male yes
• The standard notation is (𝐗, 𝐲)
37 Female no
48 Male yes
60 Female no
42 Male yes
Machine Learning Components

Sample Terminologies and Notations

• Summary
Age Sex Diabetes • Italic lowercase represents a scalar
55 Male yes • Non-italic lowercase represents a vector
37 Female no • Non-italic uppercase represents a matrix
48 Male yes
60 Female no
42 Male yes
Machine Learning Components

Model

6
• Expression of relationships between
𝜀4 variables
5 𝜀5
• Example: Linear relationship
4 𝜀2
3
𝐲 = 𝛽0 + 𝛽1 𝐱
𝜀3
𝜀1
2

1 • Goal
0 • Learn the optimal parameters 𝛉∗ = 𝛽0 , 𝛽1 from
0 1 2 3 4 5 dataset (𝐗, 𝐲) that minimize error
Machine Learning Components

Model

𝑓1

𝑓2

Error
𝑓1
𝑓2

𝛉∗
Machine Learning Components

Learning Algorithm

• A procedure to search for the optimum


parameters 𝛉∗
Error

𝛉0 𝛉1 𝛉∗
Machine Learning Components

Model and Learning Algorithm

• Summary
• Input
Samples • Data
• Model
Learning Trained
Algorithm Model
• Process
• Learning Algorithm
Model • Output
• Model parameters (Trained Model)
Build a Machine Learning Model

Data Splitting

• Classical Machine Learning


(𝐗 𝑡𝑟𝑛 , 𝐲𝑡𝑟𝑛 )
• Number of samples is not very large
• Training samples: Test samples
(𝐗, 𝐲) • 80:20
𝐲
Stratified • 70:30
𝐲 sampling w.r.t.
distribution of 𝐲
yes no

yes no
(𝐗 𝑡𝑠𝑡 , 𝐲𝑡𝑠𝑡 )
𝐲

yes no
Build a Machine Learning Model

Learning from data (Example)


Age

𝑎𝑔𝑒 < 40 𝑎𝑔𝑒 ≥ 40


Age Sex Diabetes (𝐗 𝑡𝑟𝑛 , 𝐲𝑡𝑟𝑛 )
55 Male 𝑑𝑚 𝐲 Learning
Algorithm ¬dm Sex
37 Female ¬𝑑𝑚
yes no 𝑀𝑎𝑙𝑒 𝐹𝑒𝑚𝑎𝑙𝑒
48 Male 𝑑𝑚
Training Samples
30 Male ¬𝑑𝑚
dm ¬dm
42 Female ¬𝑑𝑚

Model
Build a Machine Learning Model

Trained model

60

Age 50

𝑎𝑔𝑒 < 40 𝑎𝑔𝑒 ≥ 40 40

30
¬dm Sex
20
𝑀𝑎𝑙𝑒 𝐹𝑒𝑚𝑎𝑙𝑒

10
dm ¬dm
0
Male Female
Learning From Data

Classical Machine Learning

Age

𝑎𝑔𝑒 < 40 𝑎𝑔𝑒 ≥ 40


(𝐗 𝑡𝑠𝑡 , 𝐲𝑡𝑠𝑡 )
𝐲
¬dm Sex 𝐲ො

yes no 𝑀𝑎𝑙𝑒 𝐹𝑒𝑚𝑎𝑙𝑒

dm ¬dm
Age
Model Evaluation 𝑎𝑔𝑒 < 40 𝑎𝑔𝑒 ≥ 40

¬dm Sex

Evaluate with (𝐗 𝑡𝑠𝑡 , 𝐲𝑡𝑠𝑡 ) 𝑀𝑎𝑙𝑒 𝐹𝑒𝑚𝑎𝑙𝑒

𝐗 𝑡𝑠𝑡 , 𝐲𝑡𝑠𝑡 dm ¬dm

Age Sex 𝐲 𝐲ො
60
20 Male ¬𝑑𝑚 ¬𝑑𝑚
55 Female 𝑑𝑚 ¬𝑑𝑚 50

42 Male 𝑑𝑚 𝑑𝑚
40
48 Male ¬𝑑𝑚 𝑑𝑚
45 Male 𝑑𝑚 𝑑𝑚 30

58 Male ¬𝑑𝑚 𝑑𝑚
20
60 Male 𝑑𝑚 𝑑𝑚
28 Female ¬𝑑𝑚 ¬𝑑𝑚 10

50 Male 𝑑𝑚 𝑑𝑚
0
50 Female ¬𝑑𝑚 ¬𝑑𝑚 Male Female
Model Evaluation

Confusion Matrix
• Type of Errors
Age Sex 𝐲 𝐲ො Error-Type • 𝑓𝑝: Type-I Error
20 Male ¬𝑑𝑚 ¬𝑑𝑚 𝑡𝑛 • 𝑓𝑛: Type-II Error
55 Female 𝑑𝑚 ¬𝑑𝑚 𝑓𝑛
42 Male 𝑑𝑚 𝑑𝑚 𝑡𝑝
Actual Actual
48 Male ¬𝑑𝑚 𝑑𝑚 𝑓𝑝
𝑦 𝑦ො 𝑦 𝑦ො
45 Male 𝑑𝑚 𝑑𝑚 𝑡𝑝
58 Male ¬𝑑𝑚 𝑑𝑚 𝑓𝑝 𝑦 𝑡𝑝 𝑓𝑝 𝑦 4 2

Predict

Predict
60 Male 𝑑𝑚 𝑑𝑚 𝑡𝑝
28 Female ¬𝑑𝑚 ¬𝑑𝑚 𝑡𝑛
50 Male 𝑑𝑚 𝑑𝑚 𝑡𝑝 𝑦ො 𝑓𝑛 𝑡𝑛 𝑦ො 1 3
50 Female ¬𝑑𝑚 ¬𝑑𝑚 𝑡𝑛
Model Evaluation

Type of Errors
Model Evaluation • Sensitivity and Specificity
𝑡𝑝
Evaluation Metrics 𝑠𝑒𝑛𝑠𝑖𝑡𝑖𝑣𝑖𝑡𝑦 =
𝑡𝑝 + 𝑓𝑛

𝑡𝑛
𝑠𝑝𝑒𝑐𝑖𝑓𝑖𝑐𝑖𝑡𝑦 =
𝑡𝑛 + 𝑓𝑝
Model Evaluation

Receiver Operating Characteristics (ROC)

• Decision threshold produces ROC curve


Model Evaluation

𝑇𝑃𝑅, 𝐹𝑃𝑅, and threshold (𝜏)


Model Evaluation

What is a good model suggested by ROC

AUC = 0.5

AUC = 0.85 AUC = 0.99


Using Trained Model Unseen sample 𝐗 𝑛𝑒𝑤
Age Sex Diabetes
Applying a trained model to an unseen sample 45 Male ?

60

Age 50

𝐗 𝑛𝑒𝑤
𝑎𝑔𝑒 < 40 𝑎𝑔𝑒 ≥ 40 40

30
¬dm Sex
20
𝑀𝑎𝑙𝑒 𝐹𝑒𝑚𝑎𝑙𝑒

10
dm ¬dm
0
Male Female
ML Tools

Most frequently used tools

Programming Language IDE GUI

Anaconda Python R Spyder Jupyter Orange


Good for production Good for library development

Colab
Python R Jupyter
Good for training
Good for writing tutorial
Big Data Management
Ratchainant Thammasudjarit, Ph.D.
Big Data Management

Overview

• Input
• Acquire data
• Read data

• Process
BDM • Manipulate data

• Output
• Write data
Input

Acquiring data

Patient Information

Profile
HN A1234567
• Design of user interface is very important to
CID 1001020450634 data quality
Name คฤจภัคฐ์ คิศถฤงคิษแคธฎ์
Date of Birth 1978-01-01

Blood Type
Height
A
172
BMI

Weight
32
72
• What’s wrong with this interface
Address

100/10 Tiwanont Road, Bangpood, Pakkred,


Nonthaburi, 11120, Thailand

Cancel OK
Input
Patient Information

Profile
Acquiring data HN A1234567
CID 1001020450634

Patient Information Title นาย Other

First Name คฤจภัคฐ์


Profile
HN A1234567
Middle Name Name should be
CID 1001020450634
Last Name คิศถฤงคิษแคธฎ์ organized as
Name คฤจภัคฐ์ คิศถฤงคิษแคธฎ์ • Title
Date of Birth 1978-01-01
Date of Birth 1978-01-01 • First Name
Blood Type A BMI 32 • Middle Name
Blood Type A BMI 32
Height 172 Weight 72
• Last Name
Height 172 Weight 72
Address
Address
100/10 Tiwanont Road, Bangpood, Pakkred,
100/10 Tiwanont Road, Bangpood, Pakkred,
Nonthaburi, 11120, Thailand
Nonthaburi, 11120, Thailand

Cancel OK
Cancel OK
Input
Patient Information Patient Information

Profile Profile
HN A1234567 HN A1234567
CID 1001020450634 CID 1001020450634
Acquiring data
Title นาย Other Title นาย Other

First Name คฤจภัคฐ์ First Name คฤจภัคฐ์


Middle Name Middle Name
Last Name คิศถฤงคิษแคธฎ์ Last Name คิศถฤงคิษแคธฎ์
Date of birth or
Date of Birth 1978-01-01
Date of Birth 1978-01-01 Jan 1978
related date
1 2 3 4 5 6 7 information
8 9 10 11 12 13 14
Blood Type A BMI 32 Blood Type A 15 BMI
16 17 18 19 20 32
21 should be input by
22 23 24 25 26 27 28
Height 172 Weight 72 Height 172 29 30Weight
31 1 2 3 724 calendar widget
Address Address

100/10 Tiwanont Road, Bangpood, Pakkred, 100/10 Tiwanont Road, Bangpood, Pakkred,
Nonthaburi, 11120, Thailand Nonthaburi, 11120, Thailand

Cancel OK Cancel OK
Patient Information Patient Information

Profile Profile
HN A1234567 HN A1234567
CID 1001020450634 CID 1001020450634
Title นาย Other Title นาย Other

First Name คฤจภัคฐ์ First Name คฤจภัคฐ์


Middle Name Middle Name
Last Name คิศถฤงคิษแคธฎ์ Last Name คิศถฤงคิษแคธฎ์
BMI should not be
Date of Birth 1978-01-01 Date of Birth 1978-01-01
input manually
Blood Type A BMI 32 Blood Type A BMI 24.33 because it can be
Height 172 Weight 72 Height 172 Weight 72 derived from height
Address Address and weight
100/10 Tiwanont Road, Bangpood, Pakkred, 100/10 Tiwanont Road, Bangpood, Pakkred,
Nonthaburi, 11120, Thailand Nonthaburi, 11120, Thailand 72
𝐵𝑀𝐼 = = 24.33
1.722

Cancel OK Cancel OK
Patient Information Patient Information

Profile Profile
HN A1234567 HN A1234567
CID 1001020450634 CID 1001020450634
Title นาย Other Title นาย Other

First Name คฤจภัคฐ์ First Name คฤจภัคฐ์


Middle Name Middle Name
Last Name คิศถฤงคิษแคธฎ์ Last Name คิศถฤงคิษแคธฎ์

Date of Birth 1978-01-01 Date of Birth 1978-01-01 Unit of measures is


very important
Blood Type A BMI 24.33 Blood Type A BMI 24.33 kg/m2

Height 172 Weight 72 Height 172 cm Weight 72 kg

Address Address

100/10 Tiwanont Road, Bangpood, Pakkred, 100/10 Tiwanont Road, Bangpood, Pakkred,
Nonthaburi, 11120, Thailand Nonthaburi, 11120, Thailand

Cancel OK Cancel OK
Patient Information
• Location information
Patient Information Profile
should input from
largest to smallest
HN A1234567
Profile • Country
CID 1001020450634
HN A1234567 • Province
Title นาย Other • District
CID 1001020450634
Title นาย Other
First Name คฤจภัคฐ์
• Zip Code should be
Middle Name
First Name คฤจภัคฐ์ automatically
Last Name คิศถฤงคิษแคธฎ์
Middle Name generated from
Last Name คิศถฤงคิษแคธฎ์
Date of Birth 1978-01-01 country, province,
BMI 24.33 kg/m2
district
Date of Birth 1978-01-01 Blood Type A
Height 172 cm Weight 72 kg
Blood Type A BMI 24.33 kg/m2 • Country, Province,
cm
Address District, Subdistrict
Height 172 Weight 72 kg
Country Thailand should be selected
Address from drop down for
Province Nonthaburi
100/10 Tiwanont Road, Bangpood, Pakkred, District Pakkred data integrity
Nonthaburi, 11120, Thailand
Zip Code 11120
• Address Number,
Subdistrict Bangpood
Address Name, and
Address Number 100/10
Road are
Address Name
Cancel OK acceptable to be
Road Tiwanont free text input

Cancel OK
Patient Information
• Location information
Patient Information Profile
should support
geographical tag
HN A1234567
Profile from online map,
CID 1001020450634
HN A1234567 e.g. Google Map
Title นาย Other
CID 1001020450634
Title นาย Other
First Name คฤจภัคฐ์
Middle Name
First Name คฤจภัคฐ์
Last Name คิศถฤงคิษแคธฎ์
Middle Name
Date of Birth 1978-01-01
Last Name คิศถฤงคิษแคธฎ์
Blood Type A BMI 24.33 kg/m2
Date of Birth 1978-01-01
Height 172 cm Weight 72 kg
Blood Type A BMI 24.33 kg/m2
Address
Height 172 cm Weight 72 kg
Country Thailand
Address
Province Nonthaburi
100/10 Tiwanont Road, Bangpood, Pakkred, District Pakkred
Nonthaburi, 11120, Thailand
Zip Code 11120
Subdistrict Bangpood
Address Number 100/10
Address Name
Cancel OK
Road Tiwanont

Cancel OK
Input

File Requires: File location


Read Data

Requires: Connection Strings Database

Cloud Requires: Credentials


Input

Reading from a text or csv file with python

• The file can be read but it is not a good


practice
• Numerical string should be specified in data
type
• Date should be parsed

Should not be integer but object

Should not be object but date


Input

Reading from a text or csv file with python

• Good practice to read a file


• Requires to know column names to be read

• What if column names are unknown


• Read partially (rows) to explore data and
column names
Input

Read a text or csv file partially

• Read all columns but


some top 𝑁 rows
Input

Read a text or csv file partially

• Read all rows but some columns

• Find more information about reading a text


file here

• What if a file is very large


• Read chunk-by-chunk
Input

Reading a big text or csv file

• Read with progress bar is a good


practice
Input

Read data from database (MySQL)

• Required modules
• mysql-connector
• mysqlclient

• For simplicity, the module


mysqlutils.py has been prepared
already
• Place the file mysqlutils.py to the
same location as your python code

• Connection string is required


bmi

Data Processing

Column based processing

• Calculating BMI in one decimal


dob

Data Processing

Column based processing

• Calculating age from date of birth (dob)

This is not a good practice. Why?


dob

Data Processing

Column based processing

• Calculating age from date of birth (dob)

This is a good practice


Previous practice Good practice
Output

Write dataframe to a text or csv file

• Assume that a dataframe (df) is existed


Output

Write dataframe to a table in database

• Assume that a dataframe (df) is existed

• Mode
• replace
• append

You might also like