COMPX310-19A Machine Learning: An Introduction Using Python, Scikit-Learn, Keras, and Tensorflow

COMPX310-19A

Machine Learning
An introduction using Python, Scikit-Learn, Keras, and Tensorflow

Unless otherwise indicated, all images are from Hands-on Machine Learning with
Scikit-Learn, Keras, and TensorFlow by Aurélien Géron, Copyright © 2019 O’Reilly Media
C2: A first end-to-end application
 Blueprint:
 Big picture
 Data
 Visualize to understand
 Preprocess data
 Select model and train
 Fine-tune model
 Present
 Launch, monitor, and maintain

03/08/2021 COMPX310 2
Many data sources
 Open data:
 UC Irvine Machine Learning repository
 Kaggle
 Amazon AWS datasets

 Meta portals:
 dataportals.org
 opendatamonitor.eu
 quandl.com

 Other:
 https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research
 https://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public
 https://www.reddit.com/r/datasets

California house prices, 1990 census

One cog in a larger system

Some performance measure: RMSE

Root mean squared error:

$$\mathrm{RMSE}(X, h) = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2}$$

m .. Number of examples
x .. Input values for one example, e.g. latitude, longitude, district size, median income
y .. Target value, e.g. median house price
h .. Our regression function for predicting this median price

Often used in regression, but may over-emphasise outliers

Also called L2 norm
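
A minimal sketch of computing RMSE with NumPy. The function name rmse and the house prices are made up for illustration; the second call shows how a single large miss dominates the score.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))


# Made-up median house prices (targets y) and predictions (h)
y = [200_000, 150_000, 300_000, 250_000]
h = [210_000, 140_000, 310_000, 240_000]
print(rmse(y, h))  # every prediction is off by exactly 10,000

# Same predictions, except the last one is off by 200,000
h_outlier = [210_000, 140_000, 310_000, 450_000]
print(rmse(y, h_outlier))  # the one outlier pushes RMSE above 100,000
```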

MAE: mean absolute error

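Also called: L1 norm, Manhattan distance, city block distance

$$\mathrm{MAE}(X, h) = \frac{1}{m} \sum_{i=1}^{m} \left| h(x^{(i)}) - y^{(i)} \right|$$

More robust to outliers

Both RMSE and MAE are instances of the Lk norm idea:

$$\lVert v \rVert_k = \left( |v_1|^k + |v_2|^k + \dots + |v_n|^k \right)^{1/k}$$

k can be any natural number, v is a vector with n elements (here: the prediction errors)

L0 counts the number of non-zero elements
Linfinity computes the max absolute value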
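
A small sketch of how MAE and RMSE relate to the L1 and L2 norms of the error vector. The helper lk_norm and the error values are made up for illustration.

```python
import numpy as np


def lk_norm(v, k):
    """Lk norm: (sum of |v_i|^k) ** (1/k), for k >= 1."""
    v = np.abs(np.asarray(v, dtype=float))
    return float(np.sum(v ** k) ** (1.0 / k))


errors = np.array([3.0, -4.0, 0.0])  # made-up prediction errors h(x) - y
m = len(errors)

mae = lk_norm(errors, 1) / m             # L1 norm divided by m
rmse = lk_norm(errors, 2) / np.sqrt(m)   # L2 norm divided by sqrt(m)
l0 = int(np.count_nonzero(errors))       # L0: number of non-zero elements
linf = float(np.max(np.abs(errors)))     # L-infinity: max absolute value

print(mae, rmse, l0, linf)
```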
California housing is also on Kaggle

Inspect some more:

And some more:

What about ‘ocean_proximity’?

Some histograms

Notebook “magic” commands start with %

This time we use matplotlib, not seaborn
Histograms are only plotted for numeric features,
so ocean_proximity will be missing

Have a look at the values and try to make sense of them

Some observations
 Many plots have a long right tail
 Scales are very different, e.g. 0-16 vs. 0-500000
 Some data is preprocessed, e.g. median income 3 means $30k
 Some data is capped, like housing_median_age, median_house_value, and median_income
 Can be problematic
 Maybe remove
 Maybe try to get correct values

Manually splitting into train and test
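
A manual split could look roughly like the sketch below: shuffle the row indices, then cut them at the test ratio. The function name split_train_test, the seed, and the toy array are assumptions for illustration, not necessarily what the slide showed.

```python
import numpy as np


def split_train_test(data, test_ratio, seed=42):
    """Shuffle row indices and cut off the first test_ratio share as test set."""
    rng = np.random.default_rng(seed)  # fixed seed so the split is repeatable
    shuffled = rng.permutation(len(data))
    test_size = int(len(data) * test_ratio)
    test_idx = shuffled[:test_size]
    train_idx = shuffled[test_size:]
    return data[train_idx], data[test_idx]


data = np.arange(100).reshape(50, 2)  # toy dataset: 50 rows, 2 columns
train, test = split_train_test(data, test_ratio=0.2)
print(len(train), len(test))
```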

More on splitting
 Generally it is a better idea to use scikit-learn functions, e.g.
 from sklearn.model_selection import train_test_split
 train, test = train_test_split(df, test_size=0.2, random_state=42)

 The textbook then also explains how to use hashing to keep splits stable, even when adding new data
 And how to turn an attribute into strata and do stratified sampling with regard to that attribute
 Read this in your own time
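
A hedged sketch of both ideas on synthetic data. The column names id and income_cat, the bin edges, and the helper in_test_set are assumptions for illustration; the textbook's own code may differ in detail.

```python
from zlib import crc32

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def in_test_set(identifier, test_ratio):
    """Hash a stable row id; the same id always lands in the same set."""
    id_bytes = int(identifier).to_bytes(8, "little")
    return crc32(id_bytes) < test_ratio * 2 ** 32


rng = np.random.default_rng(0)
df = pd.DataFrame({
    "id": np.arange(1000),                              # hypothetical stable id
    "median_income": rng.uniform(0.5, 15.0, size=1000),  # synthetic values
})

# 1) Hash-based split: adding new rows never moves old rows between sets
is_test = df["id"].apply(lambda i: in_test_set(i, 0.2))
hash_train, hash_test = df[~is_test], df[is_test]

# 2) Stratified split: bin the attribute, then sample within each bin
df["income_cat"] = pd.cut(
    df["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)
strat_train, strat_test = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df["income_cat"]
)

# Category proportions in the test set should closely match the full data
print(df["income_cat"].value_counts(normalize=True).sort_index())
print(strat_test["income_cat"].value_counts(normalize=True).sort_index())
```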

Visualising

More Visualising

Looking for correlations

Be careful with correlations

The correlation coefficient measures linear relationships only: does y increase with x, or decrease?

-1 means max decrease, +1 means max increase, around 0 means no linear relationship (a non-linear one may still exist)
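
A quick illustration with made-up data: a perfect quadratic dependence has a correlation coefficient of about 0, because it is not linear.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)  # symmetric around 0

r_increasing = np.corrcoef(x, 2 * x + 1)[0, 1]   # perfect linear increase
r_decreasing = np.corrcoef(x, -3 * x)[0, 1]      # perfect linear decrease
r_quadratic = np.corrcoef(x, x ** 2)[0, 1]       # y depends on x, but not linearly

print(r_increasing, r_decreasing, r_quadratic)
```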
Some scatter plots

Focus

Derived attributes/features
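
Derived features are often simple ratios of existing columns. The sketch below assumes the usual California housing column names and uses a few made-up rows; the three ratios follow the textbook's example and may not match exactly what the slide showed.

```python
import pandas as pd

# Made-up sample rows, using the dataset's column names
housing = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, 1106.0, 190.0],
    "population": [322.0, 2401.0, 496.0],
    "households": [126.0, 1138.0, 177.0],
})

# Per-household and per-room ratios are usually more telling than raw totals
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]

print(housing[["rooms_per_household", "bedrooms_per_room", "population_per_household"]])
```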

Preparing to train a model
 Split the augmented dataframe into train and test

 And then split train into inputs and output (or target): X and y

What about missing values
 Most learners do not handle missing values; simple options are:
 Drop examples with missing values
 Drop features with missing values
 Replace missing values somehow: 0, mean, median, smarter …
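
The three options can be sketched in pandas as follows. The tiny DataFrame is made up; total_bedrooms stands in for a column with missing values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],  # one missing value
    "households": [126.0, 1138.0, 177.0, 219.0],
})

# Option 1: drop examples (rows) with missing values
dropped_rows = df.dropna(subset=["total_bedrooms"])

# Option 2: drop the feature (column) with missing values
dropped_col = df.drop(columns=["total_bedrooms"])

# Option 3: replace missing values, here with the median of the column
median = df["total_bedrooms"].median()
filled = df.fillna({"total_bedrooms": median})

print(len(dropped_rows), list(dropped_col.columns), median)
```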

Or the ‘scikit-learn’ way

And applying it:
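
Fitting and applying a median imputer might look like the sketch below. The DataFrame train_num is a made-up stand-in for the numeric training columns.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up numeric training data with one missing value
train_num = pd.DataFrame({
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "households": [126.0, 1138.0, 177.0, 219.0],
})

imputer = SimpleImputer(strategy="median")
imputer.fit(train_num)             # learns one median per column
X = imputer.transform(train_num)   # returns a NumPy array with the gaps filled

# Wrap the array back into a DataFrame to keep the column names
train_filled = pd.DataFrame(X, columns=train_num.columns, index=train_num.index)
print(imputer.statistics_)
print(train_filled)
```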

Scikit-learn design
 Consistency:
 Estimators: fit(dataset)
 Transformers: transform(), fit_transform()
 Predictors: predict(), score()

 Inspection:
 hyperparameters are public instance variables:
 imputer.strategy -> median
 Learned parameters are public instance variables with ‘_’ suffix:
 imputer.statistics_

 Datasets are NumPy arrays or SciPy sparse matrices, hyperparameters are numbers and strings

 Composition: some transformers + estimator -> Pipeline estimator

 Sensible defaults
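
A small sketch of these design principles on made-up data: a transformer and a predictor composed into a Pipeline, with a hyperparameter and a learned parameter inspected afterwards. The step names imputer and regressor are arbitrary choices.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Made-up data: one feature with a missing value, one target
X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

# Composition: transformer + final estimator -> one Pipeline estimator
model = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("regressor", LinearRegression()),
])
model.fit(X, y)  # same fit() interface as any single estimator

imputer = model.named_steps["imputer"]
print(imputer.strategy)      # hyperparameter: plain public attribute
print(imputer.statistics_)   # learned parameter: trailing underscore

prediction = model.predict(np.array([[3.0]]))  # predictor interface
print(prediction, model.score(X, y))
```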

What about ‘ocean_proximity’?

Or use separate 0/1 feature for each value
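
A minimal one-hot encoding sketch with scikit-learn. The four category values are sample rows made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up sample of the categorical column
cat = pd.DataFrame({
    "ocean_proximity": ["NEAR BAY", "INLAND", "<1H OCEAN", "INLAND"],
})

encoder = OneHotEncoder()
# fit_transform returns a SciPy sparse matrix by default; toarray() densifies it
one_hot = encoder.fit_transform(cat).toarray()
categories = list(encoder.categories_[0])

print(categories)  # one 0/1 column per category, in sorted order
print(one_hot)
```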

Notes
 OrdinalEncoder is perfect for ‘ordinal’ scales, e.g. ‘bad’, ‘average’, ‘good’, ‘excellent’
 But make sure to define this order explicitly

 OneHot can generate too many features, then maybe
 Replace with some numeric feature, e.g. distance from the sea
 Or one or more reasonable proxies, e.g. zip code with average income, education, …

 Later we will learn about ‘embeddings’
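
Why the order of an ordinal scale should be given explicitly, sketched on a made-up rating column: by default OrdinalEncoder sorts the categories, which for strings means alphabetically.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

ratings = np.array([["good"], ["bad"], ["excellent"], ["average"]])

# Default: categories are sorted alphabetically (average, bad, excellent, good)
default_codes = OrdinalEncoder().fit_transform(ratings)

# Explicit: pass the intended order, from worst to best
ordered = OrdinalEncoder(categories=[["bad", "average", "good", "excellent"]])
ordered_codes = ordered.fit_transform(ratings)

print(default_codes.ravel())  # 'good' ends up above 'excellent'
print(ordered_codes.ravel())  # codes now follow the intended scale
```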

More notes
 The textbook also covers:
 Custom transformers
 Scaling of numeric attributes
 Transformation pipelines

 General warning: always fit estimators and transformers on just the training data, otherwise information will ‘leak’ and may make your results look better than they are
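
A sketch of the leak-free pattern with a scaler on synthetic data: fit on the training set only, then reuse the learned statistics for the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 2))  # synthetic features
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # mean and std come from train only
X_test_scaled = scaler.transform(X_test)        # test data is transformed, never fitted

print(scaler.mean_)                 # equals the training means
print(X_test_scaled.mean(axis=0))   # close to 0, but usually not exactly 0
```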

Preparing X

And y and a linear regression model

How well does it do?

Now try a decision tree:

Try cross-validation to get a better estimate
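
A sketch of 10-fold cross-validation for a decision tree on synthetic data (not the housing data). scikit-learn scorers follow "higher is better", so the MSE comes back negated and is flipped before taking the root.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(200, 1))             # synthetic feature
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=200)   # noisy linear target

tree = DecisionTreeRegressor(random_state=42)
scores = cross_val_score(tree, X, y, scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)  # one RMSE per held-out fold

# For comparison: the error on the data the tree was trained on
tree.fit(X, y)
train_rmse = float(np.sqrt(np.mean((tree.predict(X) - y) ** 2)))

print(train_rmse)                              # overly optimistic
print(rmse_scores.mean(), rmse_scores.std())   # more honest estimate
```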

CV for linear regression

Now try a RandomForest

Plot all cv results

How well do we do on TEST data?

Plot predictions: linear regression

Plot predictions: Random Forest

More book stuff
 Fine tuning the model:
 Grid search
 Random search
 Analyze best model

 More later
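
As a preview, a grid search could be sketched as below. The data is synthetic and the parameter grid is an arbitrary small example, not a recommendation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(150, 2))                      # synthetic features
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=1.0, size=150)  # synthetic target

# Every combination is evaluated with cross-validation: 2 x 2 = 4 candidates
param_grid = {"n_estimators": [10, 30], "max_features": [1, 2]}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(np.sqrt(-search.best_score_))  # cross-validated RMSE of the best candidate
best_model = search.best_estimator_  # refitted on all of X, y; ready to analyse
```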
