COMPX310-19A Machine Learning: An Introduction Using Python, Scikit-Learn, Keras, and Tensorflow
Machine Learning
An introduction using Python, Scikit-Learn, Keras, and Tensorflow
Unless otherwise indicated, all images are from Hands-on Machine Learning with
Scikit-Learn, Keras, and TensorFlow by Aurélien Géron, Copyright © 2019 O’Reilly Media
C2: A first end-to-end application
Blueprint:
Big picture
Data
Visualize to understand
Preprocess data
Select model and train
Fine-tune model
Present
Launch, monitor, and maintain
03/08/2021 COMPX310 2
Many data sources
Open data:
UC Irvine Machine Learning repository
Kaggle
Amazon AWS datasets
Meta portals:
dataportals.org
opendatamonitor.eu
quandl.com
Other:
https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research
https://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public
https://www.reddit.com/r/datasets
California house prices, 1990 census
One cog in a larger system
Some performance measure: RMSE
MAE: mean absolute error
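As a rough illustration of the two measures: RMSE squares each error before averaging, so large errors dominate, while MAE averages the absolute errors and is less sensitive to outliers. The sketch below uses made-up target and prediction values, not the housing data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((prediction - target)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error: mean(|prediction - target|)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_pred - y_true)))

# Made-up values for illustration only; one prediction is off by 40.
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 300.0, 440.0]

rmse_value = rmse(y_true, y_pred)  # sqrt(450), pulled up by the single large error
mae_value = mae(y_true, y_pred)    # 15.0
print(rmse_value, mae_value)
```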
Inspect some more:
And some more:
What about ‘ocean_proximity’?
Some histograms
Some observations
Many plots have a long right tail
Scales are very different, e.g. 0-16 vs. 0-500000
Some data is preprocessed, e.g. a median_income of 3 means roughly $30k
Some data is capped, like housing_median_age, median_house_value,
and median_income
Capped values can be problematic
Maybe remove the capped examples
Maybe try to get the correct values
Manually splitting into train and test
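One possible way to split by hand is to shuffle the row positions and cut them at the desired ratio. This is a sketch only: the function name split_train_test, the seed, and the toy dataframe are assumptions for illustration, not the code from the original slide.

```python
import numpy as np
import pandas as pd

def split_train_test(data, test_ratio, seed=42):
    """Shuffle row positions with a fixed seed, then cut off the test part."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(data))
    test_size = int(len(data) * test_ratio)
    test_idx = shuffled[:test_size]
    train_idx = shuffled[test_size:]
    return data.iloc[train_idx], data.iloc[test_idx]

# Toy dataframe standing in for the housing data.
df = pd.DataFrame({"x": range(10), "y": range(10, 20)})
train, test = split_train_test(df, test_ratio=0.2)
print(len(train), len(test))
```

Fixing the seed keeps the split identical between runs, but only as long as the dataset itself does not change.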
More on splitting
Generally it is a better idea to use scikit-learn functions, e.g.
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2, random_state=42)
The textbook also explains how to use hashing to keep
splits stable, even when new data is added
And how to do stratified sampling, so that train and test
keep the same proportions of some attribute
Read this in your own time
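A minimal sketch of stratified sampling, assuming we stratify on a binned version of median_income. The bin edges, the income_cat column name, and the synthetic incomes are illustrative assumptions; the real housing data is not loaded here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic incomes standing in for the median_income column.
rng = np.random.default_rng(0)
df = pd.DataFrame({"median_income": rng.uniform(0.5, 15.0, size=1000)})

# Bin the continuous attribute into a few categories to stratify on.
df["income_cat"] = pd.cut(
    df["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)

# stratify= keeps the category proportions (almost) equal in train and test.
train, test = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df["income_cat"]
)

overall = df["income_cat"].value_counts(normalize=True).sort_index()
in_test = test["income_cat"].value_counts(normalize=True).sort_index()
max_gap = float(np.abs(overall.to_numpy() - in_test.to_numpy()).max())
print(max_gap)  # largest difference between overall and test proportions
```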
Visualising
More Visualising
Looking for correlations
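With pandas, one way to look for correlations is DataFrame.corr(), which computes pairwise Pearson coefficients, and then to rank the attributes against the target. The data below is synthetic (a linear relation plus noise, and one unrelated column named noise), so the numbers say nothing about the real housing data. Note that the coefficient only captures linear relationships.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: house value depends linearly on income plus noise.
rng = np.random.default_rng(0)
n = 500
median_income = rng.uniform(1.0, 10.0, size=n)
df = pd.DataFrame({
    "median_income": median_income,
    "median_house_value": 40000.0 * median_income + rng.normal(0.0, 30000.0, size=n),
    "noise": rng.normal(size=n),  # unrelated attribute
})

corr = df.corr()  # pairwise Pearson correlation matrix
ranked = corr["median_house_value"].sort_values(ascending=False)
print(ranked)
```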
Be careful with correlations
Focus
Derived attributes/features
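Derived attributes are often simple ratios of existing columns. The sketch below assumes the ratios rooms_per_household, bedrooms_per_room, and population_per_household, and uses a tiny made-up dataframe; the slide itself does not say which features were derived.

```python
import pandas as pd

# Made-up district totals, for illustration only.
df = pd.DataFrame({
    "total_rooms": [1000.0, 6000.0, 1500.0],
    "total_bedrooms": [200.0, 1200.0, 300.0],
    "population": [600.0, 2500.0, 900.0],
    "households": [200.0, 1000.0, 300.0],
})

# Ratios are usually more informative than raw per-district totals.
df["rooms_per_household"] = df["total_rooms"] / df["households"]
df["bedrooms_per_room"] = df["total_bedrooms"] / df["total_rooms"]
df["population_per_household"] = df["population"] / df["households"]
print(df)
```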
Preparing to train a model
Split the augmented dataframe into train and test
Then split train into input and output (or target): X and y
What about missing values
Most learners do not handle missing values; simple options are:
Drop examples with missing values
Drop features with missing values
Replace missing values somehow: 0, mean, median, smarter …
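The three options can be sketched in pandas as follows, using a tiny made-up dataframe with one missing total_bedrooms value (the column names are assumed, the values are invented).

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value.
df = pd.DataFrame({
    "total_rooms": [10.0, 20.0, 30.0, 40.0],
    "total_bedrooms": [2.0, np.nan, 6.0, 10.0],
})

# Option 1: drop examples (rows) with a missing value.
dropped_rows = df.dropna(subset=["total_bedrooms"])

# Option 2: drop the whole feature (column).
dropped_col = df.drop(columns=["total_bedrooms"])

# Option 3: replace missing values, here with the median of the column.
median = df["total_bedrooms"].median()  # median of 2, 6, 10
filled = df.assign(total_bedrooms=df["total_bedrooms"].fillna(median))
print(filled)
```

Whatever replacement value is chosen should be computed on the training set and then reused for the test set and for new data.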
Or the scikit-learn way
And applying it:
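A sketch of the scikit-learn approach, assuming SimpleImputer with the median strategy (the strategy is taken from the imputer.strategy example in these slides; the dataframes are made up). The imputer learns the medians from the training data and reuses them on other data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up numeric training and test data with missing values.
train = pd.DataFrame({
    "total_rooms": [10.0, 20.0, 30.0, 40.0],
    "total_bedrooms": [2.0, np.nan, 6.0, 10.0],
})
test = pd.DataFrame({"total_rooms": [50.0], "total_bedrooms": [np.nan]})

imputer = SimpleImputer(strategy="median")
imputer.fit(train)  # learns one median per column from the training data

# transform() returns a NumPy array, so wrap it back into a dataframe.
train_filled = pd.DataFrame(imputer.transform(train), columns=train.columns, index=train.index)
test_filled = pd.DataFrame(imputer.transform(test), columns=test.columns, index=test.index)
print(imputer.statistics_)  # the learned medians
```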
Scikit-learn design
Consistency:
Estimators: fit(dataset)
Transformers: transform(), fit_transform()
Predictors: predict(), score()
Inspection:
Hyperparameters are public instance variables:
imputer.strategy -> 'median'
Learned parameters are public instance variables with ‘_’ suffix:
imputer.statistics_
Sensible defaults
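To make these conventions concrete, here is a small sketch using StandardScaler as a transformer and LinearRegression as a predictor. Both classes are chosen for illustration only, and the data is made up so that y = 2x + 1 exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Made-up data following y = 2x + 1 exactly.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Transformer: fit() learns parameters, transform() applies them.
scaler = StandardScaler()             # sensible defaults, no arguments needed
X_scaled = scaler.fit_transform(X)    # same as fit(X) followed by transform(X)

# Predictor: fit() learns, predict() outputs, score() evaluates (R^2 here).
model = LinearRegression()
model.fit(X_scaled, y)
predictions = model.predict(X_scaled)
r2 = model.score(X_scaled, y)

# Inspection: hyperparameters are plain attributes,
# learned parameters carry a trailing underscore.
print(scaler.with_mean, scaler.mean_)
print(model.fit_intercept, model.coef_, model.intercept_)
```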
What about ‘ocean_proximity’?
Or use separate 0/1 feature for each value
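A sketch of this one-hot encoding with scikit-learn's OneHotEncoder. The four category strings below are illustrative values for ocean_proximity, not a full listing of the column.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Illustrative category values.
cat = pd.DataFrame({
    "ocean_proximity": ["NEAR BAY", "INLAND", "<1H OCEAN", "INLAND"]
})

encoder = OneHotEncoder()
onehot = encoder.fit_transform(cat)  # sparse matrix by default
dense = onehot.toarray()             # one 0/1 column per category

print(encoder.categories_[0])  # column order (sorted category values)
print(dense)
```

Each row contains exactly one 1, in the column of its category, so no artificial ordering between categories is introduced.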
Notes
OrdinalEncoder is perfect for ‘ordinal’ scales, e.g. ‘bad’,
‘average’, ‘good’, ‘excellent’
But make sure to define this order explicitly
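A short sketch of why the explicit order matters: by default OrdinalEncoder sorts the category values, which for strings is alphabetical, so 'average' would come before 'bad'. Passing the categories parameter fixes the intended order. The rating column is a made-up example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up ratings on an ordinal scale.
ratings = pd.DataFrame({"rating": ["good", "bad", "excellent", "average"]})

# Explicit order: bad < average < good < excellent.
order = ["bad", "average", "good", "excellent"]
encoder = OrdinalEncoder(categories=[order])
codes = encoder.fit_transform(ratings)

# Default behaviour for comparison: categories are sorted alphabetically.
default_codes = OrdinalEncoder().fit_transform(ratings)

print(codes.ravel())          # codes follow the intended order
print(default_codes.ravel())  # codes follow alphabetical order
```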
More notes
The textbook also covers:
Custom transformers
Scaling of numeric attributes
Transformation pipelines
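As a rough idea of what scaling and pipelines look like together, the sketch below chains an imputer and a scaler for numeric columns and one-hot encodes a categorical column via ColumnTransformer. The column names, step names, and data are assumptions for illustration; the textbook's own pipeline may differ.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data with one missing value and one categorical column.
df = pd.DataFrame({
    "median_income": [2.0, 4.0, 6.0, 8.0],
    "total_bedrooms": [100.0, np.nan, 300.0, 500.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR BAY"],
})
num_cols = ["median_income", "total_bedrooms"]
cat_cols = ["ocean_proximity"]

# Numeric columns: fill missing values, then standardise.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Apply a different transformation to each group of columns.
preprocess = ColumnTransformer([
    ("num", num_pipeline, num_cols),
    ("cat", OneHotEncoder(), cat_cols),
])

X = preprocess.fit_transform(df)
X = X.toarray() if hasattr(X, "toarray") else X  # densify if sparse
print(X)
```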
Preparing X
And y and a linear regression model
How well does it do?
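A minimal sketch of fitting a linear regression and measuring its RMSE on the training data. The data is synthetic (a known linear relation plus noise), so the error values are not those of the housing example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data: y = 3*x0 - 2*x1 + 5 plus Gaussian noise.
rng = np.random.default_rng(42)
X = rng.uniform(0.0, 10.0, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(0.0, 0.5, size=200)

lin_reg = LinearRegression()
lin_reg.fit(X, y)

predictions = lin_reg.predict(X)
train_rmse = float(np.sqrt(mean_squared_error(y, predictions)))
print(lin_reg.coef_, lin_reg.intercept_, train_rmse)
```

An error measured on the training data tends to be optimistic; it says little about how the model does on unseen data.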
Now try a decision tree:
Try cross-validation to get a better estimate
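The sketch below shows why cross-validation matters for a decision tree: a fully grown tree can memorise the training data and report a training RMSE of zero, while 10-fold cross-validation gives a more honest estimate. The data is synthetic, and the choice of 10 folds is an assumption.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: linear relation plus noise.
rng = np.random.default_rng(42)
X = rng.uniform(0.0, 10.0, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(0.0, 0.5, size=200)

tree = DecisionTreeRegressor(random_state=42)
tree.fit(X, y)
train_rmse = float(np.sqrt(mean_squared_error(y, tree.predict(X))))

# scikit-learn scorers follow "higher is better", hence the negated MSE.
scores = cross_val_score(tree, X, y, scoring="neg_mean_squared_error", cv=10)
cv_rmse = np.sqrt(-scores)

print(train_rmse)                     # zero: the tree memorised the training set
print(cv_rmse.mean(), cv_rmse.std())  # estimate on held-out folds
```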
CV for linear regression
Now try a RandomForest
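A sketch comparing cross-validated RMSE for linear regression and a random forest. The synthetic target is deliberately non-linear so that the forest has an advantage; the helper name cv_rmse, the fold count, and n_estimators=50 are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic, non-linear target: a linear model cannot fit this well.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(500, 2))
y = 5.0 * np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(0.0, 1.0, size=500)

def cv_rmse(model, X, y, folds=5):
    """Return the per-fold RMSE from k-fold cross-validation."""
    scores = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=folds)
    return np.sqrt(-scores)

lin_rmse = cv_rmse(LinearRegression(), X, y)
forest_rmse = cv_rmse(RandomForestRegressor(n_estimators=50, random_state=42), X, y)

print("linear:", lin_rmse.mean())
print("forest:", forest_rmse.mean())
```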
Plot all CV results
How well do we do on TEST data?
Plot predictions: linear regression
Plot predictions: Random Forest
More book stuff
Fine tuning the model:
Grid search
Random search
Analyze best model
More later
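As a preview of grid search, the sketch below tries every combination in a small hyperparameter grid for a random forest and keeps the one with the best cross-validated score. The grid values, the 3 folds, and the synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic, non-linear regression data.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(200, 2))
y = 5.0 * np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(0.0, 1.0, size=200)

# 2 x 2 = 4 combinations, each evaluated with 3-fold cross-validation.
param_grid = {"n_estimators": [10, 30], "max_features": [1, 2]}
grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
grid.fit(X, y)

best_rmse = float(np.sqrt(-grid.best_score_))
print(grid.best_params_, best_rmse)
```

Random search follows the same pattern with RandomizedSearchCV, which samples a fixed number of combinations instead of trying them all.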