COMPX310-19A Machine Learning: An Introduction Using Python, Scikit-Learn, Keras, and Tensorflow
Unless otherwise indicated, all images are from Hands-on Machine Learning with
Scikit-Learn, Keras, and TensorFlow by Aurélien Géron, Copyright © 2019 O’Reilly Media
C2: A first end-to-end application
 Blueprint:
 Big picture
 Data
 Visualize to understand
 Preprocess data
 Select model and train
 Fine-tune model
 Present
 Launch, monitor, and maintain
03/08/2021 COMPX310 2
Many data sources
 Open data:
 UC Irvine Machine Learning repository
 Kaggle
 Amazon AWS datasets
 Meta portals:
 dataportals.org
 opendatamonitor.eu
 quandl.com
 Other:
 https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research
 https://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public
 https://www.reddit.com/r/datasets
California house prices, 1990 census
One cog in a larger system
Some performance measure: RMSE
Root mean squared error:
$\mathrm{RMSE}(X, h) = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2}$
m .. Number of examples
x .. Input values for one example, e.g. latitude, longitude,
district size, median income
y .. Target value, e.g. median house price
h .. Our regression function for predicting this median price
Often used in regression, but may over-emphasise outliers
Corresponds to the L2 (Euclidean) norm of the error vector
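MAE: mean absolute error
$\mathrm{MAE}(X, h) = \frac{1}{m} \sum_{i=1}^{m} \left| h(x^{(i)}) - y^{(i)} \right|$
Corresponds to the L1 norm, also called Manhattan distance or city block distance
More robust to outliers
Both RMSE and MAE are instances of the Lk norm idea:
$\| v \|_k = \left( |v_1|^k + |v_2|^k + \dots + |v_n|^k \right)^{1/k}$
k can be any natural number
L0 counts the number of non-zero elements of the vector (at most n here)
Linfinity computes the max absolute value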
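A minimal sketch of these measures with NumPy. The numbers are made up for illustration, and `lk_norm` is a hypothetical helper name, not a library function. RMSE is the L2 norm of the error vector scaled by 1/sqrt(m), and MAE is the L1 norm scaled by 1/m.

```python
import numpy as np

def lk_norm(v, k):
    """L_k norm of a vector: (sum of |v_i|^k) to the power 1/k."""
    return float(np.sum(np.abs(v) ** k) ** (1.0 / k))

# Made-up targets and predictions; the last prediction is a large outlier.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 800.0])
errors = y_pred - y_true              # [10, -10, 10, 400]
m = len(errors)

rmse = np.sqrt(np.mean(errors ** 2))  # same as lk_norm(errors, 2) / sqrt(m)
mae = np.mean(np.abs(errors))         # same as lk_norm(errors, 1) / m
l0 = np.count_nonzero(errors)         # number of non-zero errors
linf = np.max(np.abs(errors))         # largest absolute error

print(f"RMSE={rmse:.1f}  MAE={mae:.1f}  L0={l0}  Linf={linf:.0f}")
```

The single outlier pulls RMSE far above MAE, which is the over-emphasis mentioned above.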
California housing is also on Kaggle
Inspect some more:
And some more:
What about ‘ocean_proximity’?
Some histograms
Notebook “magic” commands start with %
This time we use matplotlib, not seaborn
There are only histogram plots for numeric features,
so ocean_proximity will be missing
Have a look at the values and try to make sense of them
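A rough sketch of the histogram step. It assumes a small synthetic DataFrame in place of the real housing data, and it selects a non-interactive matplotlib backend so it also runs outside a notebook; inside a notebook the `%matplotlib inline` magic would be used instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; in a notebook use `%matplotlib inline`
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing DataFrame (values are made up).
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "median_income": rng.lognormal(mean=1.2, sigma=0.5, size=1000),
    "housing_median_age": rng.integers(1, 53, size=1000),
    "ocean_proximity": rng.choice(["INLAND", "NEAR BAY"], size=1000),
})

# One histogram per numeric column; the text column is skipped.
axes = df.hist(bins=50, figsize=(10, 4))
plt.tight_layout()
# plt.show() would display the figure in an interactive session
```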
Some observations
 Many plots have a long right tail
 Scales are very different, e.g. 0-16 vs. 0-500000
 Some data is preprocessed, e.g. median income 3 means $30k
 Some data is capped, like housing_median_age, median_house_value,
and median_income
 Can be problematic
 Maybe remove
 Maybe try to get correct values
Manually splitting into train and test
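One possible way to split by hand, sketched under assumptions: `split_train_test` is an illustrative helper name and the DataFrame is a toy stand-in. The idea is to shuffle the row positions once and cut off a fixed fraction for testing.

```python
import numpy as np
import pandas as pd

def split_train_test(data, test_ratio, seed=42):
    """Shuffle row positions and split them into train and test parts."""
    rng = np.random.default_rng(seed)      # fixed seed: same split on every run
    shuffled = rng.permutation(len(data))
    test_size = int(len(data) * test_ratio)
    test_idx = shuffled[:test_size]
    train_idx = shuffled[test_size:]
    return data.iloc[train_idx], data.iloc[test_idx]

# Toy data standing in for the housing DataFrame.
df = pd.DataFrame({"x": np.arange(100), "y": np.arange(100) * 2})
train, test = split_train_test(df, test_ratio=0.2)
print(len(train), "train +", len(test), "test")
```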
More on splitting
 Generally it is a better idea to use scikit-learn functions, e.g.
 from sklearn.model_selection import train_test_split
 train, test = train_test_split(df, test_size=0.2, random_state=42)
 The textbook then also explains how to use hashing to keep
splits stable, even when adding new data
 And how to stratify on some attribute, i.e. sample so that the
train and test sets preserve that attribute's distribution
 Read this in your own time
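A hedged sketch of stratified splitting. The income values are synthetic, and the `income_cat` column name and bin edges are assumptions chosen for illustration. The attribute is first binned into categories, then `train_test_split` is asked to preserve their proportions via `stratify`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic median incomes (in units of $10k), made up for illustration.
rng = np.random.default_rng(0)
df = pd.DataFrame({"median_income": rng.lognormal(1.2, 0.5, size=2000)})

# Bin the continuous attribute into five strata.
df["income_cat"] = pd.cut(df["median_income"],
                          bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                          labels=[1, 2, 3, 4, 5])

train, test = train_test_split(df, test_size=0.2, random_state=42,
                               stratify=df["income_cat"])

# Category proportions in the test set should match the full data closely.
overall = df["income_cat"].value_counts(normalize=True)
in_test = test["income_cat"].value_counts(normalize=True)
max_gap = (overall - in_test).abs().max()
print(f"largest proportion gap: {max_gap:.4f}")
```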
Visualising
More Visualising
Looking for correlations
Be careful with correlations
Linear correlations only: does y increase with x, or decrease?
-1 max decrease, +1 max increase, no linear relationship around 0
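A small illustration with made-up data of why a coefficient near 0 does not mean "no relationship": the quadratic target below is fully determined by x, yet its linear correlation with x is essentially zero.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)   # symmetric around 0
y_increasing = 3.0 * x + 1.0      # perfectly linear, increasing
y_decreasing = -2.0 * x           # perfectly linear, decreasing
y_quadratic = x ** 2              # depends on x, but not linearly

r_increasing = np.corrcoef(x, y_increasing)[0, 1]
r_decreasing = np.corrcoef(x, y_decreasing)[0, 1]
r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]

print(f"{r_increasing:.3f} {r_decreasing:.3f} {abs(r_quadratic):.3f}")
```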
Some scatter plot
Focus
Derived attributes/features
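As an illustration of derived features, ratios of existing columns can be more informative than the raw counts. The column names below follow the California housing data; the three rows and the new feature names are assumptions for this sketch.

```python
import pandas as pd

# Three illustrative districts (values are for demonstration only).
df = pd.DataFrame({
    "total_rooms": [880, 7099, 1467],
    "total_bedrooms": [129, 1106, 190],
    "population": [322, 2401, 496],
    "households": [126, 1138, 177],
})

# Per-household and per-room ratios as new columns.
df["rooms_per_household"] = df["total_rooms"] / df["households"]
df["bedrooms_per_room"] = df["total_bedrooms"] / df["total_rooms"]
df["population_per_household"] = df["population"] / df["households"]

print(df.round(2))
```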
Preparing to train a model
 Split the augmented dataframe into train and test
 And then split train into inputs and output (or target): X and y
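These two steps might look as follows. The ten rows are made up, and `median_house_value` is assumed to be the target column.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up rows standing in for the augmented housing DataFrame.
df = pd.DataFrame({
    "median_income": [8.3, 7.2, 5.6, 3.8, 4.0, 3.7, 3.1, 2.1, 2.0, 3.6],
    "rooms_per_household": [7.0, 6.2, 8.3, 5.8, 6.3, 4.8, 4.9, 4.8, 4.3, 5.0],
    "median_house_value": [452600, 358500, 352100, 341300, 342200,
                           269700, 299200, 241400, 226700, 261100],
})

train, test = train_test_split(df, test_size=0.2, random_state=42)

# Inputs X: every column except the target; output y: the target column.
X_train = train.drop(columns=["median_house_value"])
y_train = train["median_house_value"]
print(X_train.shape, y_train.shape)
```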
What about missing values
 Most learners do not handle missing values; simple options are:
 Drop examples with missing values
 Drop features with missing values
 Replace missing values somehow: 0, mean, median, smarter …
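The three options can be sketched with pandas on a toy DataFrame (values made up; `total_bedrooms` plays the role of the column with gaps):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "households": [126, 1138, 177, 219],
})

# Option 1: drop examples (rows) with a missing value.
dropped_rows = df.dropna(subset=["total_bedrooms"])

# Option 2: drop the feature (column) that has missing values.
dropped_col = df.drop(columns=["total_bedrooms"])

# Option 3: replace missing values, here with the column median.
median = df["total_bedrooms"].median()
filled = df.assign(total_bedrooms=df["total_bedrooms"].fillna(median))

print(len(dropped_rows), list(dropped_col.columns), median)
```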
Or the ‘scikit-learn’ way
And applying it:
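A minimal sketch of this using `SimpleImputer` on toy data (the DataFrames are made up). The medians are learned from the training data and then reused to fill gaps in both sets.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train = pd.DataFrame({"total_bedrooms": [129.0, np.nan, 190.0, 235.0],
                      "households": [126.0, 1138.0, 177.0, 219.0]})
test = pd.DataFrame({"total_bedrooms": [np.nan, 300.0],
                     "households": [500.0, np.nan]})

imputer = SimpleImputer(strategy="median")
imputer.fit(train)                      # learns one median per column
print(imputer.statistics_)              # the learned medians

# transform() returns a NumPy array, so wrap it back into a DataFrame.
train_filled = pd.DataFrame(imputer.transform(train), columns=train.columns)
test_filled = pd.DataFrame(imputer.transform(test), columns=test.columns)
```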
Scikit-learn design
 Consistency:
 Estimators: fit(dataset)
 Transformers: transform(), fit_transform()
 Predictors: predict(), score()
 Inspection:
 Hyperparameters are public instance variables:
 imputer.strategy -> median
 Learned parameters are public instance variables with ‘_’ suffix:
 imputer.statistics_
 Datasets are NumPy arrays or SciPy sparse matrices, hyperparameters are numbers and strings
 Composition: some transformers + estimator -> Pipeline estimator
 Sensible defaults
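These design principles can be seen together in a short sketch on toy data. The step names "imputer", "scaler" and "regressor" are arbitrary labels chosen here.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data with one missing input value; y is exactly 2 * x.
X = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0]])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Composition: two transformers followed by a final predictor.
model = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),
])
model.fit(X, y)                       # consistency: everything uses fit()

imputer = model.named_steps["imputer"]
print(imputer.strategy)               # hyperparameter, set by us
print(imputer.statistics_)            # learned parameter, note the '_' suffix

predictions = model.predict(X)        # predictors offer predict() ...
r2 = model.score(X, y)                # ... and score()
```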
What about ‘ocean_proximity’?
Or use separate 0/1 feature for each value
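A sketch of one-hot encoding with `OneHotEncoder`, using five made-up rows of `ocean_proximity`. Each distinct value becomes its own 0/1 column.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

cats = pd.DataFrame({"ocean_proximity": [
    "INLAND", "NEAR BAY", "INLAND", "<1H OCEAN", "NEAR OCEAN"]})

encoder = OneHotEncoder()
one_hot = encoder.fit_transform(cats)   # SciPy sparse matrix by default
dense = one_hot.toarray()               # dense copy, only for inspection

print(encoder.categories_[0])           # column order of the 0/1 features
print(dense)
```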
Notes
 OrdinalEncoder is perfect for ‘ordinal’ scales, e.g. ‘bad’,
‘average’, ‘good’, ‘excellent’
 But make sure to define this order explicitly
 One-hot encoding can generate too many features; then maybe:
 Replace with some numeric feature, e.g. distance from the sea
 Or one or more reasonable proxies, e.g. zip code with average
income, education, …
 Later we will learn about ‘embeddings’
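To illustrate the point about explicit ordering: by default `OrdinalEncoder` sorts categories alphabetically, which scrambles the scale. The sketch below uses a made-up `rating` column and passes the intended order via `categories`.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

ratings = pd.DataFrame({"rating": ["good", "bad", "excellent", "average", "good"]})

# Default: alphabetical order (average=0, bad=1, excellent=2, good=3).
default_codes = OrdinalEncoder().fit_transform(ratings)

# Explicit order: bad=0 < average=1 < good=2 < excellent=3.
order = ["bad", "average", "good", "excellent"]
encoder = OrdinalEncoder(categories=[order])
codes = encoder.fit_transform(ratings)

print(default_codes.ravel())
print(codes.ravel())
```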
More notes
 The textbook also covers:
 Custom transformers
 Scaling of numeric attributes
 Transformation pipelines
 General warning: always fit estimators and transformers on just
the training data, otherwise information will ‘leak’ and may
make your results look better than they are
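A sketch of the leak-free pattern for feature scaling, on synthetic data: the scaler's mean and standard deviation come from the training set only, and the test set is transformed with those same statistics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic single-feature data, made up for illustration.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 1))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics from training data only
X_test_scaled = scaler.transform(X_test)        # reuse them; no fit on test data

print(scaler.mean_, scaler.scale_)
```

Calling `fit` or `fit_transform` on the full data before splitting would let test-set statistics influence training.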
Preparing X
And y and a linear regression model
How well does it do?
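A hedged sketch of fitting a linear regression and measuring its RMSE on the training data. The data is synthetic (a noisy line), not the housing data, so the numbers only illustrate the mechanics.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data: y = 3x + 2 plus Gaussian noise with standard deviation 1.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0.0, 1.0, size=200)

lin_reg = LinearRegression()
lin_reg.fit(X, y)

# RMSE on the training data: square root of the mean squared error.
train_rmse = np.sqrt(mean_squared_error(y, lin_reg.predict(X)))
print(f"slope={lin_reg.coef_[0]:.2f}  train RMSE={train_rmse:.2f}")
```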
Now try a decision tree:
Try cross-validation to get a better estimate
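Why cross-validation gives a better estimate can be sketched on synthetic data: an unrestricted decision tree memorises its training set, so its training RMSE is zero, while 10-fold cross-validation shows the error on unseen folds.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy sine data, made up for illustration.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0.0, 0.3, size=300)

tree = DecisionTreeRegressor(random_state=42)
tree.fit(X, y)
train_rmse = np.sqrt(np.mean((tree.predict(X) - y) ** 2))  # tree memorises the data

# scikit-learn scores are "higher is better", hence the negated MSE.
scores = cross_val_score(tree, X, y, scoring="neg_mean_squared_error", cv=10)
cv_rmse = np.sqrt(-scores)

print(f"train RMSE={train_rmse:.3f}")
print(f"CV RMSE mean={cv_rmse.mean():.3f} std={cv_rmse.std():.3f}")
```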
CV for linear regression
Now try a RandomForest
Plot all cv results
How well do we do on TEST data?
Plot predictions: linear regression
Plot predictions: Random Forest
More book stuff
 Fine tuning the model:
 Grid search
 Random search
 Analyze best model
 More later
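A small sketch of grid search with `GridSearchCV`. The data is synthetic and the parameter grid is deliberately tiny; a real search would likely cover more values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic two-feature regression data, made up for illustration.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(200, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0.0, 0.3, size=200)

# Every combination (2 x 2 = 4) is evaluated with 3-fold cross-validation.
param_grid = {"n_estimators": [10, 30], "max_features": [1, 2]}
search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                      scoring="neg_mean_squared_error", cv=3)
search.fit(X, y)

best_rmse = np.sqrt(-search.best_score_)
print(search.best_params_, f"CV RMSE={best_rmse:.3f}")
```

`RandomizedSearchCV` has a similar interface and samples parameter combinations instead of trying them all.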