
The Beginner’s Guide on Machine Learning & Deep Learning
Part 1

Self-Study + Advanced Discussion Course

Department of IT and Communication Convergence Engineering


Kunsan National University
Textbook:
Self-Study Machine Learning + Deep Learning (혼자 공부하는 머신러닝+딥러닝)

Chapter 01 My First Machine Learning
01-1 Artificial Intelligence, Machine Learning, and Deep Learning

https://www.youtube.com/watch?v=J6wehCO_c58&list=PLJN246lAkhQjoU0C4v8FgtbjOIXxSs_4Q&index=1
Strong AI

http://muntermag.com/2016/09/her-y-el-amor-en-la-era-tecnologica/
Weak AI

Machine learning is a subfield of artificial intelligence.

https://scikit-learn.org/
Deep learning (i.e., artificial neural networks) is a subfield of machine learning.
Development Environment
Each student needs to create their own Google Drive account.

https://www.youtube.com/watch?v=0l0g7wk9wv4&list=PLJN246lAkhQjoU0C4v8FgtbjOIXxSs_4Q&index=2
01-3 The Market and Machine Learning
The first machine learning program

Conventional programming

Rule-based Programming

Bream vs. smelt

Two classes

Classification

Binary classification
Bream data

Features
Scatter plot
Smelt data
Combining the bream and smelt data

Data layout expected by scikit-learn

A 2D array: each row is one sample, each column is one feature.
List comprehension
Ground-truth preparation

bream smelt
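
The combining step can be sketched as follows. This is a minimal illustration with a handful of made-up measurements (not the textbook's dataset); the variable names are assumptions chosen for readability.

```python
# Made-up measurements for illustration (length in cm, weight in g).
bream_length = [25.4, 26.3, 29.0]
bream_weight = [242.0, 290.0, 363.0]
smelt_length = [9.8, 10.5]
smelt_weight = [6.7, 7.5]

length = bream_length + smelt_length
weight = bream_weight + smelt_weight

# List comprehension: one [length, weight] row per sample,
# i.e. the samples x features layout scikit-learn expects.
fish_data = [[l, w] for l, w in zip(length, weight)]

# Ground truth: 1 for bream, 0 for smelt.
fish_target = [1] * len(bream_length) + [0] * len(smelt_length)

print(fish_data)
print(fish_target)
```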
k-nearest neighbors

model

Training

evaluation
Prediction for a new sample

Length: 30 cm, weight: 600 g

bream

smelt

Ground Truth
Predicts bream no matter what
Hyperparameter settings are important.
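
A minimal sketch of the model, training, evaluation, and prediction steps, assuming scikit-learn's `KNeighborsClassifier` and a few made-up samples (not the textbook's data). It also shows why the `n_neighbors` hyperparameter matters: when it equals the number of training samples, every prediction is just the majority class.

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up [length, weight] samples: 4 bream (1) and 2 smelt (0).
fish_data = [[25.4, 242.0], [26.3, 290.0], [29.0, 363.0], [30.0, 450.0],
             [9.8, 6.7], [10.5, 7.5]]
fish_target = [1, 1, 1, 1, 0, 0]

kn = KNeighborsClassifier(n_neighbors=3)
kn.fit(fish_data, fish_target)            # training
print(kn.score(fish_data, fish_target))   # evaluation (accuracy)
print(kn.predict([[30, 600]]))            # new fish: 30 cm, 600 g

# Using every sample as a neighbor: the majority class (bream) always wins.
kn_all = KNeighborsClassifier(n_neighbors=len(fish_data))
kn_all.fit(fish_data, fish_target)
print(kn_all.predict([[10, 7]]))          # smelt-sized fish, still predicted as bream
```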
Task
• Run all the examples and submit the results.
Chapter 02 Working with Data
02-1 Training Set and Test Set
A perfect report

Supervised and unsupervised learning
K-nearest neighbor

Supervised learning

Training set and test set

Features
Separating data using Python slicing techniques

Samples
Training set

Test set
Evaluation with the test set
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html?highlight=kneighborsclassifier#sklearn.neighbors.KNeighborsClassifier

Train with the training set

Evaluate with the test set


Sampling bias

Biased training data vs. correctly sampled training data


Using NumPy

https://numpy.org/

Self-study
https://numpy.org/learn/
https://ml-ko.kr/homl2/tools_numpy.html
Data shuffling

Set a random seed to ensure reproducibility

data shuffling
Split and check data

Blue: training data


Orange: test data
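
A minimal sketch of shuffling before splitting, assuming NumPy and stand-in data (the array contents, the 7/3 split, and the seed value are illustrative assumptions). Shuffling an index array keeps each input row paired with its target.

```python
import numpy as np

# Stand-in data: 10 samples, 2 features; first 7 labelled 1, last 3 labelled 0.
input_arr = np.arange(20).reshape(10, 2)
target_arr = np.array([1] * 7 + [0] * 3)

# Taking the first 7 rows as-is would give a training set with only class 1
# (sampling bias), so shuffle the indices first.
rng = np.random.default_rng(42)     # fixed seed for reproducibility
index = rng.permutation(10)

train_input, test_input = input_arr[index[:7]], input_arr[index[7:]]
train_target, test_target = target_arr[index[:7]], target_arr[index[7:]]

print(train_input.shape, test_input.shape)
print(train_target, test_target)
```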
The second machine learning program
02-2 Data Preprocessing
박해선
Last time…

Who am I?

Am I a bream or a smelt?

150

Preparing data with NumPy

Please refer to the API documentation.


Splitting data with scikit-learn

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html?highlight=train_test_split#sklearn.model_selection.train_test_split
Please refer to the API documentation.
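
As a hedged illustration of the API linked above, the sketch below splits stand-in data with `train_test_split`. The data and the `random_state` value are assumptions; `stratify` is used so that both classes appear in the test set in roughly their original proportions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

fish_data = np.arange(40).reshape(20, 2)      # stand-in features
fish_target = np.array([1] * 15 + [0] * 5)    # imbalanced classes: 15 vs. 5

train_input, test_input, train_target, test_target = train_test_split(
    fish_data, fish_target, stratify=fish_target, random_state=42)

# By default 25% of the samples go to the test set.
print(train_input.shape, test_input.shape)
print(test_target)
```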
Suspicious bream

Caution: the x-axis and y-axis have very different scales


Normalizing the axis scales
Convert to standard scores
Show the suspicious bream again
Model training on preprocessed data
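
The standard-score conversion can be sketched with NumPy as below. The sample values are made up, and the new sample `[25, 150]` is a hypothetical stand-in for the suspicious fish. The key point is that new samples are scaled with the training set's mean and standard deviation.

```python
import numpy as np

# Made-up [length, weight] training samples.
train_input = np.array([[25.4, 242.0], [29.0, 363.0], [10.5, 7.5], [30.0, 450.0]])

mean = np.mean(train_input, axis=0)   # per-feature mean
std = np.std(train_input, axis=0)     # per-feature standard deviation
train_scaled = (train_input - mean) / std

# A new sample must be transformed with the training-set statistics.
new = ([25, 150] - mean) / std

print(train_scaled)
print(new)
```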
Chapter 03 Regression Algorithms and Model Regularization
03-1 k-Nearest Neighbors Regression
Last time…
Normalizing the axis scales

Predict the weight of a perch

Regression (predicting a value)

Regression

Regression toward the mean
k-nearest neighbors regression

k-nearest neighbors classification vs. k-nearest neighbors regression

What is the difference?

Use only the length of the perch
Prepare the training set

reshape(-1,1)
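
A short sketch of what `reshape(-1, 1)` does, using made-up lengths: a 1D array of lengths becomes a 2D array with one column, which is the samples x features shape scikit-learn expects. The `-1` lets NumPy infer the number of rows.

```python
import numpy as np

perch_length = np.array([8.4, 13.7, 15.0, 16.2])   # made-up values
train_input = perch_length.reshape(-1, 1)

print(perch_length.shape)   # (4,)
print(train_input.shape)    # (4, 1)
```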
Training a regression model
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html?highlight=kneighborsregressor#sklearn.neighbors.KNeighborsRegressor

Note: Meaning of R-squared (coefficient of determination) and how to calculate it
https://m.blog.naver.com/tlrror9496/222055889079
https://en.wikipedia.org/wiki/Coefficient_of_determination
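
For reference, the coefficient of determination can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The helper name `r_squared` is hypothetical; this is a sketch of the standard definition, not the library's implementation.

```python
import numpy as np

def r_squared(target, prediction):
    """R^2 = 1 - sum((target - prediction)^2) / sum((target - mean)^2)."""
    target = np.asarray(target, dtype=float)
    prediction = np.asarray(prediction, dtype=float)
    ss_res = np.sum((target - prediction) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return 1 - ss_res / ss_tot

print(r_squared([1, 2, 3], [1, 2, 3]))   # perfect predictions
print(r_squared([1, 2, 3], [2, 2, 2]))   # always predicting the mean
```

A perfect prediction gives 1, and always predicting the mean of the targets gives 0.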
Overfitting and underfitting

The current model is underfitted.


Fitting according to the number of neighbors

Fewer neighbors: overfitting. More neighbors: underfitting.
03-2 Linear Regression
Last time…

Fewer neighbors: overfitting. More neighbors: underfitting.


A very large perch

Well predicted? No… Why?


Find the neighbors of a 50 cm perch.
Linear regression

Find a line
LinearRegression
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html?highlight=linearregression#sklearn.linear_model.LinearRegression
perch weight = slope × perch length + y-intercept
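
A minimal sketch of fitting a line with `LinearRegression`, assuming made-up data that is exactly linear (slope 30, intercept -200). After fitting, `coef_` holds the slope and `intercept_` the y-intercept, and the model can extrapolate to a 50 cm perch, which a nearest-neighbor model cannot do beyond the range of its training data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

length = np.array([10.0, 20.0, 30.0, 40.0]).reshape(-1, 1)
weight = 30.0 * length.ravel() - 200.0    # made-up, exactly linear

lr = LinearRegression()
lr.fit(length, weight)

print(lr.coef_, lr.intercept_)   # slope and y-intercept
print(lr.predict([[50.0]]))      # prediction outside the training range
```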
Drawing the learned straight line
Polynomial regression
Retrain the model
Drawing the learned curve
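
Polynomial regression can be sketched by adding a squared-length column and fitting the same linear model. The data below is made up and exactly quadratic, and `np.column_stack` is one possible way to build the two-column input.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

length = np.array([10.0, 20.0, 30.0, 40.0])
weight = 1.0 * length**2 - 20.0 * length + 100.0   # made-up, exactly quadratic

# Two features per sample: length squared and length.
train_poly = np.column_stack((length**2, length))

lr = LinearRegression()
lr.fit(train_poly, weight)

print(lr.coef_, lr.intercept_)
print(lr.predict([[50.0**2, 50.0]]))   # the new sample needs the same two features
```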
03-3 Feature Engineering and Regularization
Last time…
Multiple regression

P151
Prepare data with pandas

pandas DataFrame → NumPy array

Tutorials
• NumPy tutorial: http://ml-ko.kr/homl2/tools_numpy.html
• pandas tutorial: http://ml-ko.kr/homl2/tools_pandas.html
P152
Create polynomial features

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html?highlight=polynomialfeatures#sklearn.preprocessing.PolynomialFeatures
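
A small sketch of `PolynomialFeatures` on a single made-up sample with features 2 and 3. With the default degree of 2 and `include_bias=False`, the output holds the original features, their squares, and their product.

```python
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(include_bias=False)   # drop the constant 1 column
poly.fit([[2, 3]])

# Columns: x0, x1, x0^2, x0*x1, x1^2
print(poly.transform([[2, 3]]))   # [[2. 3. 4. 6. 9.]]
```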
P154
LinearRegression

P156
Create more features

Overfitted

Regularization

Overfitted, bad results


P157
Standardize before regularization
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html?highlight=standardscaler#sklearn.preprocessing.StandardScaler

P159
Ridge regression (L2 regularization)
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html?highlight=ridge#sklearn.linear_model.Ridge

P160
Finding the right regularization strength

Train score

Test score
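
One way to search for the regularization strength is to loop over candidate `alpha` values and record the train and test scores for each. The sketch below assumes synthetic data and an arbitrary candidate list; the scaler is fitted on the training set only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data: 5 features, 2 of which are irrelevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=40)
train_X, test_X, train_y, test_y = X[:30], X[30:], y[:30], y[30:]

ss = StandardScaler().fit(train_X)          # statistics from the training set only
train_scaled, test_scaled = ss.transform(train_X), ss.transform(test_X)

alpha_list = [0.001, 0.01, 0.1, 1, 10, 100]
train_score, test_score, coef_norm = [], [], []
for alpha in alpha_list:
    ridge = Ridge(alpha=alpha).fit(train_scaled, train_y)
    train_score.append(ridge.score(train_scaled, train_y))
    test_score.append(ridge.score(test_scaled, test_y))
    coef_norm.append(np.linalg.norm(ridge.coef_))   # shrinks as alpha grows

for row in zip(alpha_list, train_score, test_score):
    print(row)
```

Plotting the two score lists against `alpha` gives the kind of train/test curves shown on the slide; a reasonable choice is the `alpha` where the test score peaks.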

P161
Lasso regression (L1 regularization)
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html?highlight=lasso#sklearn.linear_model.Lasso

Train score

Test score

P163
Chapter 04 Various Classification Algorithms
04-1 Logistic Regression
Last time…
Lucky bag

Probability of bream

probability of smelt

P176
Calculate probabilities

10 Samples of Neighbors

lucky bag fish


P177
Data preparation

Target Features

P178-179
Multiclass classification with k-nearest neighbors

alphabetical order

Probability for first class (bream)

first sample

5 samples Probability for the second class (parkki)

7 generated probabilities
P180-182
Logistic regression

z = a × weight + b × length + c × diagonal + d × height + e × width + f

Sigmoid function (logistic function)

Negative Positive
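
The sigmoid (logistic) function maps any z to a value between 0 and 1, which can be read as the probability of the positive class. A minimal NumPy sketch, with the function name `sigmoid` chosen for illustration:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: 1 / (1 + e^(-z))."""
    return 1 / (1 + np.exp(-z))

z = np.array([-5.0, 0.0, 5.0])
phi = sigmoid(z)
print(phi)

# In binary classification, phi > 0.5 (z > 0) means the positive class.
print(phi > 0.5)
```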
P183
Logistic regression (binary classification)

5 samples

Negative Positive
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logisticregression#sklearn.linear_model.LogisticRegression
P185-186
Check the logistic regression coefficients

z = −0.404 × weight − 0.576 × length − 0.663 × diagonal − 0.013 × height − 0.732 × width − 2.161


weight length diagonal height width

Computing z

https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.expit.html?highlight=expit#scipy.special.expit
P187
Logistic regression (multiclass classification)

5 samples

7 generated probabilities

P189-190
Softmax function

https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.softmax.html?highlight=softmax#scipy.special.softmax
P191
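
The softmax function turns several z values (one per class) into probabilities that sum to 1. The sketch below is a plain NumPy version with a hypothetical helper name and made-up z values; subtracting the maximum before exponentiating is a common trick to avoid overflow and does not change the result.

```python
import numpy as np

def softmax(z):
    """Exponentiate each z and divide by the sum, row by row."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Made-up z values for one sample and 7 classes.
z = np.array([[-6.5, 1.0, 5.2, -2.8, 3.3, -5.8, 5.6]])
proba = softmax(z)
print(np.round(proba, 3))
print(proba.sum(axis=1))   # each row sums to 1
```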
04-2 Stochastic Gradient Descent
Last time…

z = a × weight + b × length + c × diagonal + d × height + e × width + f
The lucky bag is a big hit!

P199
Optimization Methodology

Stochastic Gradient Descent

• Take samples out of the training set:
  • one at a time (stochastic gradient descent)
  • several at a time (mini-batch gradient descent)
  • all at once (batch gradient descent)
• Move down the slope little by little toward the optimal point. Are your legs too long? (learning rate)
• Iteration: have you used all the training samples?
  • No: take out the next sample(s).
  • Yes (1 epoch complete): the training set is empty, so fill it with all the samples again and start over.
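
The loop above can be sketched in NumPy as follows. This is an illustrative example on made-up noise-free linear data with a squared-error loss, not the classifier used later; `learning_rate`, `batch_size`, and the epoch count are assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = 3.0 * X[:, 0] + 1.0             # made-up target: slope 3, intercept 1

w, b = 0.0, 0.0
learning_rate = 0.1                  # step size ("leg length")
batch_size = 1                       # 1: stochastic, >1: mini-batch, len(X): batch

for epoch in range(20):
    order = rng.permutation(len(X))             # refill and reshuffle the training set
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]   # take sample(s) out
        error = (w * X[idx, 0] + b) - y[idx]
        # Gradient of the mean squared error with respect to w and b.
        w -= learning_rate * 2 * np.mean(error * X[idx, 0])
        b -= learning_rate * 2 * np.mean(error)
    # Reaching this point means every sample was used once: 1 epoch complete.

print(w, b)
```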
P201-202
Loss function
A function that measures how bad the model's predictions are

must be differentiable

P203-204
Logistic loss function
(binary cross-entropy loss function)
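
A minimal sketch of the logistic (binary cross-entropy) loss, assuming targets of 0 or 1 and predicted probabilities of the positive class. The helper name and the clipping constant are illustrative choices.

```python
import numpy as np

def logistic_loss(target, prob):
    """Mean of -[t*log(p) + (1-t)*log(1-p)] over all samples."""
    target = np.asarray(target, dtype=float)
    prob = np.clip(np.asarray(prob, dtype=float), 1e-15, 1 - 1e-15)  # avoid log(0)
    return -np.mean(target * np.log(prob) + (1 - target) * np.log(1 - prob))

# Confident correct predictions give a smaller loss than hesitant ones.
print(logistic_loss([1, 0], [0.9, 0.1]))
print(logistic_loss([1, 0], [0.6, 0.4]))
```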

P205-206
Data preprocessing

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html?highlight=standardscaler#sklearn.preprocessing.StandardScaler
P207
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html?highlight=sgdclassifier#sklearn.linear_model.SGDClassifier

SGDClassifier

logistic loss function, epoch

P208
Epochs and over/underfitting

[Figure: accuracy vs. epochs for the training set and the test set. Too few epochs: underfitting; too many epochs: overfitting; the optimal point lies in between.]

P209
Early stopping

P210-211
Chapter 05 Tree Algorithms
05-1 Decision Trees
Last time…
Red wine and white wine

P220
Prepare the data

P222-223
Logistic regression

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logisticregression#sklearn.linear_model.LogisticRegression
P224-225
Decision tree

Root node

Leaf node
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html?highlight=decisiontreeclassifier#sklearn.tree.DecisionTreeClassifier
P227
Decision tree analysis

Root node

P228-229
Gini impurity

gini impurity = 1 − (negative class ratio² + positive class ratio²)

Pure node
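
Gini impurity for a two-class node can be computed directly from the class counts. In this sketch (the function name is hypothetical), an evenly mixed node gives the worst value of 0.5 and a pure node gives 0.

```python
def gini(negative, positive):
    """1 - (negative ratio^2 + positive ratio^2) for one node."""
    total = negative + positive
    return 1 - ((negative / total) ** 2 + (positive / total) ** 2)

print(gini(50, 50))    # evenly mixed node: 0.5
print(gini(0, 100))    # pure node: 0.0
```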

P230
Pruning

P232
Using unscaled features


No preprocessing required

P234
05-2 Cross-Validation and Grid Search
Last time…
Validation set

test set
training set

parameter tuning

P243-244
model training
Cross-validation

model evaluation

Average of the validation scores

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html?highlight=cross_validate#sklearn.model_selection.cross_validate
P245-246
Cross-validation using a splitter

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html?highlight=stratifiedkfold#sklearn.model_selection.StratifiedKFold
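
A hedged sketch of cross-validation with an explicit splitter, assuming synthetic data and a decision tree as the model. `StratifiedKFold` with `shuffle=True` shuffles the samples before folding, and `cross_validate` returns one validation score per fold under the `test_score` key.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

splitter = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_validate(DecisionTreeClassifier(random_state=42), X, y, cv=splitter)

print(scores['test_score'])            # one validation score per fold
print(np.mean(scores['test_score']))   # average of the validation scores
```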
P247
Grid search
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html?highlight=gridsearchcv#sklearn.model_selection.GridSearchCV

P249-250
Probability distribution selection

P253
Random search
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html?highlight=randomizedsearchcv#sklearn.model_selection.RandomizedSearchCV

P254-255
05-3 Tree Ensembles
Last time…
Structured and unstructured data


Feature engineering Representation learning

P264
Random forest

[Figure: a random forest made up of decision trees]

P265
Random forest training method

[Figure: training set → random sampling → bootstrap samples → decision tree training, one tree per bootstrap sample]
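
A bootstrap sample is drawn from the training set with replacement, so some samples appear several times and others not at all. The sketch below assumes a training set of 1000 samples identified by index; on average roughly 63% of the distinct samples end up in each bootstrap sample.

```python
import numpy as np

rng = np.random.default_rng(42)
train_index = np.arange(1000)    # stand-in for 1000 training samples

# Sampling with replacement, same size as the training set.
bootstrap_index = rng.choice(train_index, size=len(train_index), replace=True)

unique_ratio = len(np.unique(bootstrap_index)) / len(train_index)
print(len(bootstrap_index))   # same size as the training set
print(unique_ratio)           # fraction of distinct samples that were drawn
```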

P266
Random forest training

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html?highlight=cross_validate#sklearn.model_selection.cross_validate

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html?highlight=randomforestclassifier#sklearn.ensemble.RandomForestClassifier
P268-269
Extra trees
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html?highlight=extratreesclassifier#sklearn.ensemble.ExtraTreesClassifier

P270
Gradient boosting

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html?highlight=gradientboostingclassifier#sklearn.ensemble.GradientBoostingClassifier
P271-272
Histogram-based gradient boosting

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html?highlight=histgradientboostingclassifier#sklearn.ensemble.HistGradientBoostingClassifier

P273
Assessing the importance of features

Permutation Importance

https://scikit-learn.org/stable/modules/generated/sklearn.inspection.permutation_importance.html?highlight=permutation_importance#sklearn.inspection.permutation_importance
P274
XGBoost vs LightGBM

P275
Ensemble report

P254-255
Chapter 06 Unsupervised Learning
06-1 Clustering Algorithms
Last time…
Unsupervised learning

What kind of fruit is this? Is it possible to automatically collect pictures of fruits like this?

P286
Preparing the fruit data

Executing shell commands in Colab

100x100 image

300 samples

P288
Checking samples

P289-290
Changing the sample dimensions

1D Array
2D Array
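
Flattening each 100x100 image into a 1D array of 10000 pixels can be sketched as below. The array of zeros is a stand-in for the real fruit images, and the variable names are assumptions.

```python
import numpy as np

fruits = np.zeros((300, 100, 100))          # stand-in: 300 images of 100x100 pixels
fruits_2d = fruits.reshape(-1, 100 * 100)   # one row of 10000 pixels per image

print(fruits.shape)      # (300, 100, 100)
print(fruits_2d.shape)   # (300, 10000)
```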

P292
Histogram of sample means

100 apple samples

10000 pixels
P293-294
Histogram of pixel means

apple pineapple banana

mean

pixels
P295
Drawing the average image

P296
Pick the photos closest to the average

P297
06-2 k-Means
Last time…
Clustering

What kind of fruit is this? Is it possible to automatically collect pictures of fruits like this?

P303
k-means

k=3

P304
Model training
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html?highlight=kmeans#sklearn.cluster.KMeans

P305-306
First cluster

91 pictures

P307
Second and third clusters

98 pictures 111 pictures
P308-309

Cluster centroid

P309-310
Find the best k

elbow method
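
The elbow method can be sketched by fitting `KMeans` for several values of k and recording the inertia (the sum of squared distances from each sample to its cluster center). The data below is synthetic, with three well-separated blobs, so the curve bends at k=3; the range of k and the `n_init` value are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three blobs of 50 points each.
rng = np.random.default_rng(42)
centers = np.array([[0, 0], [10, 10], [0, 10]])
X = np.vstack([c + rng.normal(size=(50, 2)) for c in centers])

inertia = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertia.append(km.inertia_)

# Inertia drops sharply up to k=3 and only slowly afterwards (the elbow).
for k, value in zip(range(1, 7), inertia):
    print(k, round(value, 1))
```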

P312
06-3 Principal Component Analysis
Last time…
Dimensionality reduction

axis 1

axis 0
dimension 5

axis 2

P318-319
Principal component

P320-321
PCA (principal component analysis)
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html?highlight=pca#sklearn.decomposition.PCA

Number of principal components

P322-323
Reconstruction

P325
Explained variance
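
A hedged sketch of dimensionality reduction, reconstruction, and explained variance with scikit-learn's `PCA`. Random data stands in for the flattened images and the component counts are arbitrary. Passing a float between 0 and 1 as `n_components` asks PCA to keep just enough components to explain that fraction of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 20))           # stand-in for 300 flattened samples

pca = PCA(n_components=5)
X_pca = pca.fit_transform(X)             # reduce: (300, 20) -> (300, 5)
X_back = pca.inverse_transform(X_pca)    # reconstruct: (300, 5) -> (300, 20)

print(X_pca.shape, X_back.shape)
print(np.sum(pca.explained_variance_ratio_))   # variance kept by 5 components

# Keep enough components to explain at least 50% of the variance.
pca_half = PCA(n_components=0.5).fit(X)
print(pca_half.n_components_)
```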

P326
Use with classifiers

explained variance: 50%

P327-328
Use with clustering

P329-330
Visualization

PCA 2

P331 PCA 1
End (ch01 ~ ch06)
