
Machine Learning with Scikit-Learn

Andreas Mueller (NYU Center for Data Science, scikit-learn)

http://bit.ly/sklstrata
Me

2
Classification
Regression
Clustering
Semi-Supervised Learning
Feature Selection
Feature Extraction
Manifold Learning
Dimensionality Reduction
Kernel Approximation
Hyperparameter Optimization
Evaluation Metrics
Out-of-core learning
…

3
4
Get the notebooks!

http://bit.ly/sklstrata
5
Hi Andy,

I just received an email from the first tutorial speaker, presenting right before you, saying he's ill and won't be able to make it.

I know you have already committed yourself to two presentations, but is there any way you could increase your tutorial time slot, maybe just offer time to try out what you've taught? Otherwise I have to do some kind of modern dance interpretation of Python in data :-)

-Leah

Hi Andreas,

I am very interested in your Machine Learning background. I work for X Recruiting who have been engaged by Z, a worldwide leading supplier of Y. We are expanding the core engineering team and we are looking for really passionate engineers who want to create their own story and help millions of people.

Can we find a time for a call to chat for a few minutes about this?

Thanks

6
Supervised Machine Learning
Training Data

Model

Training Labels

8
Supervised Machine Learning
Training Data

Model

Training Labels

Test Data Prediction

9
Supervised Machine Learning
Training Data

Model

Training Labels

Test Data Prediction

Test Labels Evaluation

10
Supervised Machine Learning
Training Data
Training
Model

Training Labels

Test Data Prediction


Generalization

Test Labels Evaluation

11
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

Training Data + Training Labels → Model
Test Data → Prediction

12
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
clf.score(X_test, y_test)

Training Data + Training Labels → Model
Test Data → Prediction
Test Labels → Evaluation

13
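A minimal, self-contained sketch of the fit/predict/score pattern above (not from the slides; the dataset, split, and random_state are illustrative choices, and the imports assume a recent scikit-learn):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# load a small example dataset and hold out a test set
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier()        # build the model
clf.fit(X_train, y_train)             # learn from training data and labels
y_pred = clf.predict(X_test)          # predict labels for unseen test data
print(clf.score(X_test, y_test))      # mean accuracy on the test set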
IPython Notebook:
Chapter 1 - Introduction to Scikit-learn

14
Unsupervised Machine Learning

Training Data Model

15
Unsupervised Machine Learning

Training Data Model

Test Data New View

16
Unsupervised Transformations

pca = PCA()
pca.fit(X_train)
X_new = pca.transform(X_test)

Training Data → Model
Test Data → Transformation

17
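A runnable sketch of the unsupervised fit/transform pattern (the iris data and n_components=2 are illustrative, not from the slides):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)
X_train, X_test = train_test_split(X, random_state=0)

pca = PCA(n_components=2)       # keep the two directions of largest variance
pca.fit(X_train)                # learn the rotation from training data only
X_new = pca.transform(X_test)   # re-express test data in the learned components
print(X_new.shape)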
IPython Notebook:
Chapter 2 - Unsupervised Transformers

18
All Data

Training data Test data

19
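In code, a single call produces this split. A hedged sketch (dataset, random_state, and the default 75/25 split are illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# hold out 25% of the data as a test set (the default)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)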
All Data

Training data Test data

Fold 1 Fold 2 Fold 3 Fold 4 Fold 5

20
All Data

Training data Test data

Fold 1 Fold 2 Fold 3 Fold 4 Fold 5

Split 1 Fold 1 Fold 2 Fold 3 Fold 4 Fold 5

21
All Data

Training data Test data

Fold 1 Fold 2 Fold 3 Fold 4 Fold 5

Split 1 Fold 1 Fold 2 Fold 3 Fold 4 Fold 5

Split 2 Fold 1 Fold 2 Fold 3 Fold 4 Fold 5

22
All Data

Training data Test data

Fold 1 Fold 2 Fold 3 Fold 4 Fold 5

Split 1 Fold 1 Fold 2 Fold 3 Fold 4 Fold 5

Split 2 Fold 1 Fold 2 Fold 3 Fold 4 Fold 5

Split 3 Fold 1 Fold 2 Fold 3 Fold 4 Fold 5

Split 4 Fold 1 Fold 2 Fold 3 Fold 4 Fold 5

Split 5 Fold 1 Fold 2 Fold 3 Fold 4 Fold 5

23
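The fold diagram above corresponds to cross_val_score. A minimal sketch (the estimator and dataset are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
# five-fold cross-validation: one score per split
scores = cross_val_score(SVC(), X, y, cv=5)
print(scores, scores.mean())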
IPython Notebook:
Chapter 3 - Cross-validation

24
25
26
All Data

Training data Test data

27
All Data

Training data Test data

Fold 1 Fold 2 Fold 3 Fold 4 Fold 5

Split 1 Fold 1 Fold 2 Fold 3 Fold 4 Fold 5

Split 2 Fold 1 Fold 2 Fold 3 Fold 4 Fold 5

Split 3 Fold 1 Fold 2 Fold 3 Fold 4 Fold 5

Split 4 Fold 1 Fold 2 Fold 3 Fold 4 Fold 5

Split 5 Fold 1 Fold 2 Fold 3 Fold 4 Fold 5

Test data
28
All Data

Training data    Test data

Fold 1 Fold 2 Fold 3 Fold 4 Fold 5

Finding Parameters:
Split 1 Fold 1 Fold 2 Fold 3 Fold 4 Fold 5
Split 2 Fold 1 Fold 2 Fold 3 Fold 4 Fold 5
Split 3 Fold 1 Fold 2 Fold 3 Fold 4 Fold 5
Split 4 Fold 1 Fold 2 Fold 3 Fold 4 Fold 5
Split 5 Fold 1 Fold 2 Fold 3 Fold 4 Fold 5

Final evaluation: Test data

29
SVC(C=0.001,
gamma=0.001)

30
SVC(C=0.001, gamma=0.001)   SVC(C=0.01, gamma=0.001)   SVC(C=0.1, gamma=0.001)   SVC(C=1, gamma=0.001)   SVC(C=10, gamma=0.001)

31
SVC(C=0.001, gamma=0.001)   SVC(C=0.01, gamma=0.001)   SVC(C=0.1, gamma=0.001)   SVC(C=1, gamma=0.001)   SVC(C=10, gamma=0.001)

SVC(C=0.001, gamma=0.01)    SVC(C=0.01, gamma=0.01)    SVC(C=0.1, gamma=0.01)    SVC(C=1, gamma=0.01)    SVC(C=10, gamma=0.01)

32
SVC(C=0.001, gamma=0.001)   SVC(C=0.01, gamma=0.001)   SVC(C=0.1, gamma=0.001)   SVC(C=1, gamma=0.001)   SVC(C=10, gamma=0.001)

SVC(C=0.001, gamma=0.01)    SVC(C=0.01, gamma=0.01)    SVC(C=0.1, gamma=0.01)    SVC(C=1, gamma=0.01)    SVC(C=10, gamma=0.01)

SVC(C=0.001, gamma=0.1)     SVC(C=0.01, gamma=0.1)     SVC(C=0.1, gamma=0.1)     SVC(C=1, gamma=0.1)     SVC(C=10, gamma=0.1)

33
SVC(C=0.001, gamma=0.001)   SVC(C=0.01, gamma=0.001)   SVC(C=0.1, gamma=0.001)   SVC(C=1, gamma=0.001)   SVC(C=10, gamma=0.001)

SVC(C=0.001, gamma=0.01)    SVC(C=0.01, gamma=0.01)    SVC(C=0.1, gamma=0.01)    SVC(C=1, gamma=0.01)    SVC(C=10, gamma=0.01)

SVC(C=0.001, gamma=0.1)     SVC(C=0.01, gamma=0.1)     SVC(C=0.1, gamma=0.1)     SVC(C=1, gamma=0.1)     SVC(C=10, gamma=0.1)

SVC(C=0.001, gamma=1)       SVC(C=0.01, gamma=1)       SVC(C=0.1, gamma=1)       SVC(C=1, gamma=1)       SVC(C=10, gamma=1)

SVC(C=0.001, gamma=10)      SVC(C=0.01, gamma=10)      SVC(C=0.1, gamma=10)      SVC(C=1, gamma=10)      SVC(C=10, gamma=10)

34
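This grid is exactly what GridSearchCV enumerates. A hedged sketch of searching C and gamma with cross-validation (the dataset and split are illustrative; the parameter ranges mirror the slides):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10],
              'gamma': [0.001, 0.01, 0.1, 1, 10]}
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X_train, y_train)            # cross-validate every C/gamma combination
print(grid.best_params_)              # best combination found on the training folds
print(grid.score(X_test, y_test))     # final evaluation on the held-out test set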
IPython Notebook:
Chapter 4 - Grid Searches

35
Training Labels Training Data

Model

36
Training Labels Training Data

Model
37
Training Labels Training Data

Feature
Extraction

Model
38
Training Labels Training Data

Feature
Extraction

Scaling

Model
39
Training Labels Training Data

Feature
Extraction

Scaling

Feature
Selection

Model
40
Training Labels Training Data

Feature
Extraction

Scaling

Feature
Selection

Model
41
Cross Validation
Training Labels Training Data

Feature
Extraction

Scaling

Feature
Selection

Model
42
IPython Notebook:
Chapter 5 - Preprocessing and Pipelines

43
Do cross-validation over all steps jointly.
Keep a separate test set until the very end.

44
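One way to follow this advice is to wrap all steps in a Pipeline and grid-search over the whole thing. A hedged sketch (the scaler, classifier, parameter values, and dataset are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# the scaler is refit on the training folds inside each CV split,
# so no information leaks from the validation folds
pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {'svc__C': [0.1, 1, 10], 'svc__gamma': [0.01, 0.1, 1]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.score(X_test, y_test))   # the test set is only touched once, at the end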
Bag-of-Words Representations
CountVectorizer / TfidfVectorizer

45
Bag-of-Words Representations
CountVectorizer / TfidfVectorizer

“You better call Kenny Loggins”

46
Bag-of-Words Representations
CountVectorizer / TfidfVectorizer

“You better call Kenny Loggins”

tokenizer

['you', 'better', 'call', 'kenny', 'loggins']

47
Bag-of-Words Representations
CountVectorizer / TfidfVectorizer

“You better call Kenny Loggins”

tokenizer

['you', 'better', 'call', 'kenny', 'loggins']

Sparse matrix encoding

aardvark   better   call   you   zyxst

[0, …, 0, 1, 0, …, 0, 1, 0, …, 0, 1, 0, …, 0]

48
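A hedged sketch of the same encoding with CountVectorizer (the sentence is from the slide; get_feature_names_out assumes a recent scikit-learn, older releases call it get_feature_names):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["You better call Kenny Loggins"]
vect = CountVectorizer()
X = vect.fit_transform(docs)          # tokenize, build the vocabulary, count
print(vect.get_feature_names_out())   # ['better' 'call' 'kenny' 'loggins' 'you']
print(X.toarray())                    # one row per document, one column per word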
Application: Insult detection

49
Application: Insult detection

i really don't understand your point. It seems


that you are mixing apples and oranges.

50
Application: Insult detection

i really don't understand your point. It seems


that you are mixing apples and oranges.

Clearly you're a fucktard.

51
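A hedged sketch of how such a detector could be wired up: a bag-of-words vectorizer feeding a linear classifier. The tiny inline dataset and LogisticRegression are illustrative stand-ins, not the actual insult-detection data or model from the notebook:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["i really don't understand your point. It seems that you are mixing apples and oranges.",
         "Clearly you're a fucktard.",
         "thanks for the detailed explanation",
         "you are an idiot"]
labels = [0, 1, 0, 1]                 # 1 = insult, 0 = not an insult

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["what a thoughtful comment"]))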
IPython Notebook:
Chapter 6 - Working With Text Data

52
Overfitting and Underfitting

[Plot: accuracy vs. model complexity; training accuracy curve]

53
Overfitting and Underfitting

[Plot: accuracy vs. model complexity; training and generalization accuracy curves]

54
Overfitting and Underfitting

[Plot: accuracy vs. model complexity; generalization accuracy peaks at a sweet spot between underfitting and overfitting, while training accuracy keeps rising]

55
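The sweet spot can be located empirically. A hedged sketch using validation_curve to trace training vs. cross-validated accuracy as one complexity parameter varies (gamma of an RBF SVM and the iris data are purely illustrative):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_range = np.logspace(-3, 2, 6)
train_scores, test_scores = validation_curve(
    SVC(), X, y, param_name="gamma", param_range=param_range, cv=5)
# training accuracy keeps climbing; cross-validated accuracy peaks, then drops
print(train_scores.mean(axis=1))
print(test_scores.mean(axis=1))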
Linear SVM

56
Linear SVM

57
(RBF) Kernel SVM

58
(RBF) Kernel SVM

59
(RBF) Kernel SVM

60
(RBF) Kernel SVM

61
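A hedged sketch of fitting an RBF-kernel SVM and seeing how gamma controls the flexibility of the boundary shown in these plots (the make_moons dataset and the parameter values are illustrative):

from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.25, random_state=0)
for gamma in [0.1, 1, 10]:
    svm = SVC(kernel='rbf', gamma=gamma, C=1)
    svm.fit(X, y)
    # larger gamma -> more local, wigglier decision boundary
    print(gamma, svm.score(X, y))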
Decision Trees

62
Decision Trees

63
Decision Trees

64
Decision Trees

65
Decision Trees

66
Decision Trees

67
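A hedged sketch of how tree depth controls the complexity seen in these plots (the dataset and depth values are illustrative):

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=200, noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in [1, 3, 5, None]:          # None grows the tree until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(depth, tree.score(X_train, y_train), tree.score(X_test, y_test))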
Random Forests

68
Random Forests

69
Random Forests

70
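A hedged sketch of averaging many randomized trees (n_estimators and the dataset are illustrative choices):

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = make_moons(n_samples=200, noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# each tree sees a bootstrap sample and a random subset of features per split;
# averaging their votes smooths out the overfitting of individual trees
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))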
71
Thank you for your attention.

@t3kcit

@amueller

importamueller@gmail.com

72
