Ch01 06 Exercise

The Beginner’s Guide on Machine Learning & Deep Learning
Part 1
Self-Study + Advanced Discussion Course
Department of IT and Communication Convergence Engineering

Kunsan National University
Textbook: 혼자 공부하는 머신러닝+딥러닝 (Self-Study Machine Learning + Deep Learning)
Chapter 01 My First Machine Learning
01-1 Artificial Intelligence, Machine Learning, and Deep Learning
https://www.youtube.com/watch?v=J6wehCO_c58&list=PLJN246lAkhQjoU0C4v8FgtbjOIXxSs_4Q&index=1
Strong AI
http://muntermag.com/2016/09/her-y-el-amor-en-la-era-tecnologica/
Weak AI
Machine learning is a subfield of artificial intelligence.
https://scikit-learn.org/
Deep learning (that is, artificial neural networks) is a subfield of machine learning.
Development Environment
Each student needs to create their own Google Drive account.
https://www.youtube.com/watch?v=0l0g7wk9wv4&list=PLJN246lAkhQjoU0C4v8FgtbjOIXxSs_4Q&index=2
01-3 The Market and Machine Learning
The first machine learning program
Conventional programming
Rule-based programming
Bream vs. smelt
Two classes
Classification
Binary classification
Bream data
Features
Scatter plot
Smelt data
Combining bream and smelt data
Data format expected by scikit-learn
Figure labels: each row is a sample, each column is a feature.
List comprehension
Ground-truth (target) preparation
bream, smelt
k-nearest neighbors
Model
Training
Evaluation
Prediction for a new fish
Length: 30 cm, weight: 600 g
bream
smelt
Ground truth
Always bream
Hyperparameter settings are important.
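The Chapter 01 workflow above (combine the data, build a 2D list with a list comprehension, prepare the targets, train, evaluate, predict) can be sketched roughly as follows. The measurements are hypothetical stand-ins rather than the textbook's fish data, and the variable names are only illustrative.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical measurements (length in cm, weight in g); not the textbook data.
bream_length = [25.4, 26.3, 29.0, 30.0, 31.5, 32.0, 33.5, 35.0]
bream_weight = [242.0, 290.0, 363.0, 450.0, 500.0, 600.0, 650.0, 725.0]
smelt_length = [9.8, 10.5, 11.2, 12.0, 13.0]
smelt_weight = [6.7, 7.5, 9.8, 9.9, 12.2]

# Combine the two species and build the 2D list scikit-learn expects:
# one row per sample, one column per feature.
length = bream_length + smelt_length
weight = bream_weight + smelt_weight
fish_data = [[l, w] for l, w in zip(length, weight)]

# Ground truth: 1 = bream, 0 = smelt.
fish_target = [1] * len(bream_length) + [0] * len(smelt_length)

# Train and evaluate with the default number of neighbors (5).
kn = KNeighborsClassifier()
kn.fit(fish_data, fish_target)
score = kn.score(fish_data, fish_target)

# Predict a new fish: length 30 cm, weight 600 g.
pred = kn.predict([[30, 600]])

# If every sample is used as a neighbor, the majority class always wins,
# so the model predicts bream no matter what.
kn_all = KNeighborsClassifier(n_neighbors=len(fish_data))
kn_all.fit(fish_data, fish_target)
score_all = kn_all.score(fish_data, fish_target)
```

With this toy data, `score_all` equals the share of bream in the data, which is why the choice of `n_neighbors` matters.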
Task
• Run all the examples and submit the results.
Chapter 02 Working with Data
02-1 Training Set and Test Set
The perfect report
Supervised learning and unsupervised learning
K-nearest neighbor
Supervised learning
Training set and test set
Features
Separating data using Python slicing techniques
Samples
Training set
Test set
Evaluation with the test set
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html?highlight=kneighborsclassifier#sklearn.neighbors.KNeighborsClassifier
Train with the training set
Evaluate with the test set

Sampling bias
Bad training data vs. correct training data

Using NumPy
https://numpy.org/
Self-study
https://numpy.org/learn/
https://ml-ko.kr/homl2/tools_numpy.html
Data shuffling
Fix the random seed to ensure reproducibility.
Split and check data
Blue: training data
Orange: test data
The second machine learning program
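The sampling-bias problem and the shuffling fix above can be sketched with NumPy as follows. The arrays are hypothetical (10 samples of one class followed by 4 of the other) and the 10/4 split is an arbitrary choice for illustration.

```python
import numpy as np

# Hypothetical ordered data: 10 samples of class 1 followed by 4 of class 0.
input_arr = np.column_stack((np.arange(14, dtype=float),
                             np.arange(14, dtype=float) * 10))
target_arr = np.array([1] * 10 + [0] * 4)

# Naive slicing of ordered data produces sampling bias:
# the test portion contains only class 0.
biased_test_target = target_arr[10:]

# Shuffle an index array instead; fixing the seed makes the run reproducible.
np.random.seed(42)
index = np.arange(len(input_arr))
np.random.shuffle(index)

# Use the shuffled indices to select rows (array indexing).
train_input = input_arr[index[:10]]
train_target = target_arr[index[:10]]
test_input = input_arr[index[10:]]
test_target = target_arr[index[10:]]
```

Shuffling the indices rather than the arrays keeps each input row paired with its target.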
02-2 Data Preprocessing
Haesun Park (박해선)
Last time…
Who am I?
Am I a bream or a smelt?
150
Preparing data with NumPy
Please refer to the API documentation.

Splitting data with scikit-learn
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html?highlight=train_test_split#sklearn.model_selection.train_test_split
Please refer to the API documentation.
Suspicious bream
Caution: the x-axis and y-axis use different data scales.
Match the scales: normalize the axis scales
Convert to standard scores
Show the suspicious bream again
Model training on preprocessed data
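A minimal sketch of the preprocessing steps above: split with `train_test_split`, convert to standard scores using the training-set mean and standard deviation, then train on the scaled data. The fish measurements are randomly generated stand-ins, and the new sample `[25, 150]` is an assumed example rather than a value confirmed by the slides.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data: 35 bream-like and 14 smelt-like (length, weight) rows.
rng = np.random.default_rng(0)
bream = np.column_stack((rng.uniform(25, 41, 35), rng.uniform(250, 1000, 35)))
smelt = np.column_stack((rng.uniform(9, 15, 14), rng.uniform(6, 20, 14)))
fish_data = np.concatenate((bream, smelt))
fish_target = np.concatenate((np.ones(35), np.zeros(14)))

# stratify keeps the class ratio similar in the training and test sets.
train_input, test_input, train_target, test_target = train_test_split(
    fish_data, fish_target, stratify=fish_target, random_state=42)

# Standard score: subtract the mean and divide by the standard deviation,
# both computed per feature on the training set only.
mean = np.mean(train_input, axis=0)
std = np.std(train_input, axis=0)
train_scaled = (train_input - mean) / std
test_scaled = (test_input - mean) / std

# A new sample must be transformed with the same training statistics.
new = (np.array([25, 150]) - mean) / std

kn = KNeighborsClassifier()
kn.fit(train_scaled, train_target)
test_score = kn.score(test_scaled, test_target)
prediction = kn.predict([new])
```

The test set and any new sample reuse the training mean and standard deviation, so all points live on the same scale.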
Chapter 03 Regression Algorithms and Model Regularization
03-1 k-Nearest Neighbors Regression
Normalization on axis-scale
Predict the weight of a perch
Regression (predicting a value)
Regression toward the mean
k-nearest neighbors regression
k-nearest neighbors classification vs. k-nearest neighbors regression
What is the difference?
Use only the length of the perch
Prepare the training dataset
reshape(-1,1)
Training a regression model
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html?highlight=kneighborsregressor#sklearn.neighbors.KNeighborsRegressor
Note: Meaning of R-squared (the coefficient of determination) and how to calculate it
https://m.blog.naver.com/tlrror9496/222055889079
https://en.wikipedia.org/wiki/Coefficient_of_determination
Overfitting and underfitting
Now the model is underfitted!
Fitting according to the number of neighbors
Few neighbors: overfitting; many neighbors: underfitting
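As a rough illustration of how the number of neighbors affects fitting, the sketch below trains `KNeighborsRegressor` with several values of `n_neighbors` and records the training and test R-squared scores. The perch-like data is synthetic (a cubic trend plus noise), so the exact scores carry no meaning beyond the comparison.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for perch length (cm) and weight (g).
rng = np.random.default_rng(42)
length = rng.uniform(8, 44, size=56)
weight = 0.02 * length ** 3 + rng.normal(0, 20, size=56)

# scikit-learn expects a 2D input, hence reshape(-1, 1).
train_input, test_input, train_target, test_target = train_test_split(
    length.reshape(-1, 1), weight, random_state=42)

# (train R^2, test R^2) for each number of neighbors.
scores = {}
for k in (1, 5, 10):
    knr = KNeighborsRegressor(n_neighbors=k)
    knr.fit(train_input, train_target)
    scores[k] = (knr.score(train_input, train_target),
                 knr.score(test_input, test_target))

# R^2 computed by hand for the last model (k=10):
# 1 - sum((target - prediction)^2) / sum((target - mean)^2)
pred = knr.predict(test_input)
r2 = 1 - (np.sum((test_target - pred) ** 2)
          / np.sum((test_target - test_target.mean()) ** 2))
```

With `k=1` every training sample is its own nearest neighbor, so the training score is perfect, which is the overfitting extreme. Larger `k` smooths the prediction and moves toward underfitting.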
03-2 Linear Regression
A very large perch
Was it predicted well? No. Why not?
Neighbors of a 50 cm perch
Linear regression
Find a line
LinearRegression
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html?highlight=linearregression#sklearn.linear_model.LinearRegression
perch weight = a × perch length + b (a: slope of the straight line, b: y-intercept)
Draw the learned straight line
Polynomial regression
Retrain the model
Draw the learned line
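The following sketch contrasts the three models discussed above on a sample outside the training range: k-nearest neighbors regression, a straight line, and a second-degree polynomial. The data is synthetic and the choice of `n_neighbors=3` is an assumption made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Synthetic perch-like data; all lengths are below 44 cm.
rng = np.random.default_rng(42)
length = rng.uniform(8, 44, size=42)
weight = 0.02 * length ** 3 + rng.normal(0, 20, size=42)
train_input = length.reshape(-1, 1)

# k-NN can only average its neighbors, so any fish longer than the longest
# training sample gets the same prediction.
knr = KNeighborsRegressor(n_neighbors=3).fit(train_input, weight)
knn_50 = knr.predict([[50]])[0]
knn_100 = knr.predict([[100]])[0]

# Linear regression learns weight = slope * length + intercept.
lr = LinearRegression().fit(train_input, weight)
slope, intercept = lr.coef_[0], lr.intercept_
linear_50 = lr.predict([[50]])[0]

# Polynomial regression: add length^2 as a second feature and retrain.
train_poly = np.column_stack((train_input ** 2, train_input))
lr_poly = LinearRegression().fit(train_poly, weight)
poly_50 = lr_poly.predict([[50 ** 2, 50]])[0]
```

The polynomial model is still a linear model in its features (`length**2` and `length`), which is why `LinearRegression` can fit it.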
03-3 Feature Engineering and Regularization
Multiple regression
P151
Prepare data with pandas
pandas DataFrame → NumPy array
Tutorials
• NumPy tutorial: http://ml-ko.kr/homl2/tools_numpy.html
• pandas tutorial: http://ml-ko.kr/homl2/tools_pandas.html
P152
Create polynomial features
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html?highlight=polynomialfeatures#sklearn.preprocessing.PolynomialFeatures
P154
LinearRegression
P156
Create more features
Overfitted
Regularization
Overfitted, bad results
P157
Standardization before regularization
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html?highlight=standardscaler#sklearn.preprocessing.StandardScaler
P159
Ridge regression (L2 regularization)
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html?highlight=ridge#sklearn.linear_model.Ridge
P160
Finding the right regularization strength
Train score
Test score
P161
Lasso regression (L1 regularization)
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html?highlight=lasso#sklearn.linear_model.Lasso
Train score
Test score
P163
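The steps of this section (polynomial features, standardization, then ridge and lasso with different regularization strengths) might be sketched as below. The three perch features are synthetic, and the list of `alpha` values and the lasso `alpha=10` are illustrative choices, not results from the textbook.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for three perch features: length, height, width.
rng = np.random.default_rng(42)
length = rng.uniform(8, 44, size=56)
height = 0.25 * length + rng.normal(0, 0.5, size=56)
width = 0.15 * length + rng.normal(0, 0.3, size=56)
perch_full = np.column_stack((length, height, width))
perch_weight = 0.02 * length ** 3 + rng.normal(0, 20, size=56)

train_input, test_input, train_target, test_target = train_test_split(
    perch_full, perch_weight, random_state=42)

# Many polynomial features make overfitting easy.
poly = PolynomialFeatures(degree=5, include_bias=False)
train_poly = poly.fit_transform(train_input)
test_poly = poly.transform(test_input)

# Standardize before regularizing so every coefficient is penalized fairly.
ss = StandardScaler()
train_scaled = ss.fit_transform(train_poly)
test_scaled = ss.transform(test_poly)

# Ridge (L2): record train and test scores for each regularization strength.
alphas = [0.001, 0.01, 0.1, 1, 10, 100]
ridge_scores = []
for alpha in alphas:
    ridge = Ridge(alpha=alpha).fit(train_scaled, train_target)
    ridge_scores.append((alpha,
                         ridge.score(train_scaled, train_target),
                         ridge.score(test_scaled, test_target)))

# Lasso (L1) can drive some coefficients exactly to zero.
lasso = Lasso(alpha=10, max_iter=10000).fit(train_scaled, train_target)
n_unused = int(np.sum(lasso.coef_ == 0))
```

Plotting the train and test scores in `ridge_scores` against `alpha` is one way to look for the strength where the two scores are closest.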
Chapter 04 Various Classification Algorithms
04-1 Logistic Regression
Lucky bag
Probability of bream
Probability of smelt
P176
Calculate probabilities
10 neighboring samples
Lucky bag fish
P177
Data preparation
Target, features
P178-179
Multiclass classification with k-nearest neighbors
Alphabetical order
Probability for the first class (bream)
First sample
5 samples
Probability for the second class (parkki)
7 generated probabilities
P180-182
Logistic regression
𝑧 = 𝑎 × Weight + 𝑏 × Length + 𝑐 × Diagonal + 𝑑 × Height + 𝑒 × Width + 𝑓
Sigmoid function (logistic function)
Negative, positive
P183
Logistic regression (binary classification)
5 samples
Negative, positive
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logisticregression#sklearn.linear_model.LogisticRegression
P185-186
Check logistic regression coefficients
𝑧 = −0.404 × Weight − 0.576 × Length − 0.663 × Diagonal − 0.013 × Height − 0.732 × Width − 2.161
Calculating z
https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.expit.html?highlight=expit#scipy.special.expit
P187
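To make the link between z and the predicted probability concrete, here is a small sketch for the binary case. It uses synthetic two-class data from `make_blobs` instead of the fish data, so the learned coefficients will differ from those on the slide.

```python
import numpy as np
from scipy.special import expit
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data with five features (stand-in for two fish species).
X, y = make_blobs(n_samples=100, centers=2, n_features=5, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

lr = LogisticRegression().fit(X_scaled, y)

# z for the first 5 samples, from the model and by hand:
# z = coef . features + intercept
z = lr.decision_function(X_scaled[:5])
manual_z = X_scaled[:5] @ lr.coef_[0] + lr.intercept_[0]

# The sigmoid (expit) turns z into the probability of the positive class.
positive_proba = expit(z)
```

`positive_proba` should match the second column of `lr.predict_proba(X_scaled[:5])`, since the positive class is the second entry of `lr.classes_`.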
Logistic regression (multiclass classification)
5 samples
7 generated probabilities
P189-190
Softmax function
https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.softmax.html?highlight=softmax#scipy.special.softmax
P191
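For the multiclass case, the sketch below checks that applying softmax to the seven z values of each sample reproduces the predicted probabilities. It assumes a scikit-learn version in which `LogisticRegression` fits a multinomial model by default for multiclass data, and it uses synthetic seven-class data; `C=20` and `max_iter=1000` are illustrative settings.

```python
import numpy as np
from scipy.special import softmax
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data with 7 classes and 5 features (stand-in for 7 fish species).
X, y = make_blobs(n_samples=210, centers=7, n_features=5, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

lr = LogisticRegression(C=20, max_iter=1000).fit(X_scaled, y)

# One z value per class for each of the first 5 samples: shape (5, 7).
decision = lr.decision_function(X_scaled[:5])

# Softmax exponentiates each z and normalizes, so every row sums to 1.
proba = softmax(decision, axis=1)
```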
04-2 Stochastic Gradient Descent
𝑧 = 𝑎 × Weight + 𝑏 × Length + 𝑐 × Diagonal + 𝑑 × Height + 𝑒 × Width + 𝑓
The lucky bag is a big hit!
P199
Optimization methodology
Stochastic gradient descent
• Take samples out of the training set one at a time (stochastic gradient descent), several at a time (mini-batch gradient descent), or all at once (batch gradient descent).
• Move along the slope little by little. Are your legs too long? The step size is the learning rate.
• Iteration: have you used all of the training set? If no, keep taking samples. If yes, the training set is empty and 1 epoch is complete.
• Fill the training set with all the samples and start over, until you reach the optimal point.
P201-202
Loss function
A function that measures how bad the model's predictions are
Must be differentiable
P203-204
Logistic loss function (binary cross-entropy loss function)
P205-206
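As a small worked example of the logistic loss, the function below returns −log(p) when the target is 1 and −log(1 − p) when the target is 0, where p is the predicted probability of the positive class. The function name and the sample probabilities are illustrative.

```python
import numpy as np


def logistic_loss(p, target):
    """Binary cross-entropy for predicted positive-class probabilities p."""
    p = np.asarray(p, dtype=float)
    target = np.asarray(target)
    # Positive target: -log(p). Negative target: -log(1 - p).
    return np.where(target == 1, -np.log(p), -np.log(1 - p))


# Two positive samples (p = 0.9 and 0.3) and two negative samples (0.2 and 0.8).
losses = logistic_loss([0.9, 0.3, 0.2, 0.8], [1, 1, 0, 0])
```

A confident correct prediction (0.9 for a positive sample) gets a small loss, while a confident wrong prediction (0.8 for a negative sample) gets a large one.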
Data preprocessing
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html?highlight=standardscaler#sklearn.preprocessing.StandardScaler
P207
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html?highlight=sgdclassifier#sklearn.linear_model.SGDClassifier
SGDClassifier
Logistic loss function, epochs
P208
Epochs and over/underfitting
Figure: accuracy versus epochs for the training set and the test set; underfitting at few epochs, overfitting at many epochs, with the optimal point in between.
P209
Early stopping
P210-211
Chapter 05 Tree Algorithms
05-1 Decision Tree
Red wine and white wine
P220
Prepare your data
P222-223
Logistic regression
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logisticregression#sklearn.linear_model.LogisticRegression
P224-225
Decision tree
Root node
Leaf node
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html?highlight=decisiontreeclassifier#sklearn.tree.DecisionTreeClassifier
P227
Decision tree analysis
Root node
P228-229
Gini impurity
gini impurity = 1 − ((negative class ratio)² + (positive class ratio)²)
Pure node
P230
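A short worked example of Gini impurity for a two-class node, computed from the number of negative and positive samples in the node. The function name and the sample counts are illustrative.

```python
def gini_impurity(n_negative, n_positive):
    """Gini impurity of a node: 1 - (negative ratio^2 + positive ratio^2)."""
    total = n_negative + n_positive
    neg_ratio = n_negative / total
    pos_ratio = n_positive / total
    return 1 - (neg_ratio ** 2 + pos_ratio ** 2)


# A node split evenly between the classes is as impure as it can be.
balanced = gini_impurity(50, 50)

# A pure node contains only one class.
pure = gini_impurity(0, 100)
```

For two classes the impurity ranges from 0 (pure node) to 0.5 (an even split).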
Pruning
P232
Using unscaled features
No preprocessing required
P234
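The pruning idea can be sketched by comparing an unrestricted tree with one limited by `max_depth`, both trained on unscaled features. The data comes from `make_classification` as a stand-in for the wine data, and `max_depth=3` is an illustrative setting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data with three features (stand-in for the wine data).
X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, random_state=42)
train_input, test_input, train_target, test_target = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Unrestricted tree: grows until the leaves are pure, so it tends to overfit.
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(train_input, train_target)

# Pruned tree: limit the depth. No feature scaling is needed because each
# split compares a single feature against a threshold.
pruned_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
pruned_tree.fit(train_input, train_target)

scores = {
    "full": (full_tree.score(train_input, train_target),
             full_tree.score(test_input, test_target)),
    "pruned": (pruned_tree.score(train_input, train_target),
               pruned_tree.score(test_input, test_target)),
}

# Relative contribution of each feature to the splits.
importances = pruned_tree.feature_importances_
```

`sklearn.tree.plot_tree` can draw either tree for the node-by-node analysis shown on the slides.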
05-2 Cross-Validation and Grid Search
Validation set
test set
training set
parameter tuning
P243-244
Cross-validation
model training
model evaluation
Average validation score
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html?highlight=cross_validate#sklearn.model_selection.cross_validate
P245-246
Cross-validation using a splitter
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html?highlight=stratifiedkfold#sklearn.model_selection.StratifiedKFold
P247
Grid search
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html?highlight=gridsearchcv#sklearn.model_selection.GridSearchCV
P249-250
Choosing a probability distribution
P253
Random search
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html?highlight=randomizedsearchcv#sklearn.model_selection.RandomizedSearchCV
P254-255
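The tools in this section can be combined roughly as follows: cross-validation with a splitter, a grid search over fixed candidates, and a random search that samples parameters from probability distributions. The data is synthetic, and the parameter ranges are illustrative assumptions.

```python
import numpy as np
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.model_selection import (GridSearchCV, RandomizedSearchCV,
                                     StratifiedKFold, cross_validate,
                                     train_test_split)
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, random_state=42)
# The test set is held back until the very end.
train_input, test_input, train_target, test_target = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Cross-validation with an explicit splitter that shuffles and stratifies.
splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_result = cross_validate(DecisionTreeClassifier(random_state=42),
                           train_input, train_target, cv=splitter)
mean_cv_score = np.mean(cv_result["test_score"])

# Grid search: try every listed value with cross-validation.
gs = GridSearchCV(DecisionTreeClassifier(random_state=42),
                  {"max_depth": [2, 3, 5, 8]}, cv=splitter)
gs.fit(train_input, train_target)

# Random search: sample candidates from distributions instead of a grid.
param_dist = {
    "min_impurity_decrease": uniform(0.0001, 0.001),
    "max_depth": randint(2, 10),
}
rs = RandomizedSearchCV(DecisionTreeClassifier(random_state=42), param_dist,
                        n_iter=10, cv=splitter, random_state=42)
rs.fit(train_input, train_target)

# Only the final chosen model is scored on the test set.
final_score = rs.best_estimator_.score(test_input, test_target)
```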
05-3 Tree Ensembles
Structured and unstructured data
Feature engineering, representation learning
P264
Random forest
decision tree
P265
Random forest training method
Training set → random sampling → bootstrap sample → decision tree training
P266
Random forest training
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html?highlight=cross_validate#sklearn.model_selection.cross_validate
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html?highlight=randomforestclassifier#sklearn.ensemble.RandomForestClassifier
P268-269
Extra trees
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html?highlight=extratreesclassifier#sklearn.ensemble.ExtraTreesClassifier
P270
Gradient boosting
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html?highlight=gradientboostingclassifier#sklearn.ensemble.GradientBoostingClassifier
P271-272
Histogram-based gradient boosting
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html?highlight=histgradientboostingclassifier#sklearn.ensemble.HistGradientBoostingClassifier
P273
Assessing the importance of features
Permutation importance
https://scikit-learn.org/stable/modules/generated/sklearn.inspection.permutation_importance.html?highlight=permutation_importance#sklearn.inspection.permutation_importance
P274
XGBoost vs LightGBM
P275
Ensemble report
P254-255
Chapter 06 Unsupervised Learning
06-1 Clustering Algorithms
Unsupervised learning
What kind of fruit is this? Is it possible to automatically collect pictures of fruits like this?
P286
Preparing fruit data
Executing shell commands in Colab
100×100 images
300 samples
P288
Checking samples
P289-290
Changing the sample dimensions
1D array
2D array
P292
Histogram of sample means
100 apple samples
10,000 pixels
P293-294
Histogram of pixel means
apple, pineapple, banana
mean
pixels
P295
Drawing the average images
P296
Pick photos that are close to the average
P297
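The array operations of this section (flattening each 100×100 image, averaging per sample and per pixel, and picking the images closest to an average image) can be sketched as follows. The "fruit" images are random arrays with different brightness levels, standing in for the real pictures.

```python
import numpy as np

# Synthetic stand-in for 300 grayscale 100x100 images:
# 100 "apple", 100 "pineapple", 100 "banana" with different brightness.
rng = np.random.default_rng(42)
apple = rng.normal(60, 5, size=(100, 100, 100))
pineapple = rng.normal(120, 5, size=(100, 100, 100))
banana = rng.normal(30, 5, size=(100, 100, 100))
fruits = np.concatenate((apple, pineapple, banana))  # shape (300, 100, 100)

# Flatten each 2D image into a 1D array of 10,000 pixels.
fruits_2d = fruits.reshape(-1, 100 * 100)

# Mean per sample (axis=1): one value per image.
sample_means = fruits_2d.mean(axis=1)

# Mean per pixel (axis=0) over the 100 apple samples: one value per pixel.
apple_pixel_means = fruits_2d[:100].mean(axis=0)

# Reshape the pixel means back into an image to draw the average apple.
apple_mean_image = apple_pixel_means.reshape(100, 100)

# Mean absolute difference of every image from the average apple image,
# then the 100 images closest to it.
abs_diff = np.abs(fruits - apple_mean_image)
abs_mean = abs_diff.mean(axis=(1, 2))
closest = np.argsort(abs_mean)[:100]
```

With real photos, `plt.hist` on `sample_means` and `plt.imshow` on `apple_mean_image` would give the plots referenced on the slides.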
06-2 k-Means
Clustering
What kind of fruit is this? Is it possible to automatically collect pictures of fruits like this?
P303
k-means
k=3
P304
Model training
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html?highlight=kmeans#sklearn.cluster.KMeans
P305-306
First cluster
91 pictures
P307
Second and third clusters
98 pictures, 111 pictures
P308-309
Cluster centroids
P309-310
Find the best k
Elbow method
P312
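A minimal k-means sketch covering model training, cluster sizes, centroids, and the inertia values used by the elbow method. The data comes from `make_blobs` rather than the fruit images, and `n_init=10` and the range of k are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: 300 samples in 3 groups (stand-in for flattened images).
X, _ = make_blobs(n_samples=300, centers=3, n_features=20, random_state=42)

# Unsupervised training: only the inputs are passed to fit.
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# How many samples fell into each cluster.
labels, counts = np.unique(km.labels_, return_counts=True)

# One centroid per cluster, in the same feature space as the data.
centers = km.cluster_centers_

# transform gives the distance from a sample to every centroid;
# predict returns the index of the nearest one.
distances = km.transform(X[:1])
nearest = km.predict(X[:1])[0]

# Elbow method: inertia (sum of squared distances to the nearest centroid)
# for several values of k. Look for the k where the decrease flattens.
inertia = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
           for k in range(2, 7)]
```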
06-3 Principal Component Analysis
Dimensionality reduction
Figure labels: axis 0, axis 1, axis 2, dimension 5
P318-319
Principal component
P320-321
PCA (principal component analysis)
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html?highlight=pca#sklearn.decomposition.PCA
Number of principal components
P322-323
Reconstruction
P325
Explained variance
P326
Use with classifiers
Explained variance: 50%
P327-328
Use with clustering
P329-330
Visualization
Axes: PCA 1, PCA 2
P331
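The PCA steps above (choosing the number of components, reconstruction, explained variance, and a 2D projection for visualization) might look like this in code. The input is synthetic low-rank data with noise, not the fruit images, and the sizes are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 300 samples, 100 features driven by 5 hidden factors.
rng = np.random.default_rng(42)
latent = rng.normal(size=(300, 5))
mixing = rng.normal(size=(5, 100))
X = latent @ mixing + rng.normal(scale=0.1, size=(300, 100))

# Keep a fixed number of principal components.
pca = PCA(n_components=50).fit(X)
X_pca = pca.transform(X)              # reduced data, shape (300, 50)

# Reconstruction: map the reduced data back to the original feature space.
X_rec = pca.inverse_transform(X_pca)  # shape (300, 100)

# Share of the total variance kept by the 50 components.
explained = np.sum(pca.explained_variance_ratio_)

# A float in (0, 1) asks PCA to keep enough components to explain
# at least that share of the variance.
pca_half = PCA(n_components=0.5).fit(X)
n_kept = pca_half.n_components_

# Two components are convenient for a scatter plot (PCA 1 vs. PCA 2).
X_2d = PCA(n_components=2).fit_transform(X)
```

The reduced arrays (`X_pca` or `X_2d`) can be passed to a classifier or to k-means in place of the original features.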
End (ch01 ~ ch06)