Konrad Banachewicz, Luca Massaron - The Kaggle Workbook: Self-learning exercises and valuable insights for Kaggle data science competitions - Packt Publishing
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK
The Kaggle Workbook: Self-learning exercises and valuable
insights for Kaggle data science competitions
Welcome to Packt Early Access. We're giving you an exclusive preview of this book before it goes on sale. It can take many months to write a book, but our authors have cutting-edge information to share with you today. Early Access gives you an insight into the latest developments by making chapter drafts available. The chapters may be a little rough around the edges right now, but our authors will update them over time.
Learning how to perform at the top of the leaderboard in any Kaggle competition requires patience, diligence, and many attempts in order to learn the best way to compete and achieve top results. For this reason, we have devised a workbook that can help you build those skills faster by leading you to try some past Kaggle competitions and to learn how to achieve top results in them by reading discussions, reusing notebooks, engineering features, and training various models.
In illustrating the key insights and technicalities necessary for cracking this competition, we will show you the necessary code and ask you to study and answer questions about topics to be found in the Kaggle Book itself. Therefore, without further ado, let's start this new learning path of yours immediately.
In this chapter, you will learn:
The competition aims at having Kagglers build a model that predicts the probability that a driver will initiate an auto insurance claim in the next year, which is quite a common kind of task (the sponsor mentions it as a "classical challenge for insurance"). For this purpose, the sponsor provided train and test sets, and the competition appears ideal for anyone, since the dataset is not very large and it seems very well prepared.
As stated on the page of the competition devoted to presenting the data (https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/data):
"features that belong to similar groupings are tagged as such in the feature names (e.g., ind, reg, car, calc). In addition, feature names include the postfix bin to indicate binary features and cat to indicate categorical features. Features without these designations are either continuous or ordinal. Values of -1 indicate that the feature was missing from the observation. The target column signifies whether or not a claim was filed for that policy holder".
The data preparation for the competition had been carried out quite carefully so as not to leak any information and, though secrecy is maintained about the meaning of the features, it is quite clear how to relate the different tags used to the specific kinds of features commonly used in insurance modeling:
As for the individual features, there has been much speculation during the competition. See for instance https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/41489 or https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/41488, both by Raddar, or again the attempts to attribute the features to Porto Seguro's online quote form: https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/41057. In spite of all such efforts, in the end the meaning of the features has remained a mystery up until now.
1. The data is real world, though the features are anonymous.
2. The data is really very well prepared, without leakages of any sort (no magic features here).
3. The test set not only holds the same categorical levels as the training set, but it also seems to be from the same distribution, though Yuya Yamamoto argues that pre-processing the data with t-SNE leads to a failing adversarial validation test (https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/44784).
An interesting post by Tilii (Mensur Dlakic, associate professor at Montana State University: https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/42197) demonstrates using t-SNE that "there are many people who are very similar in terms of their insurance parameters, yet some of them will file a claim and others will not". What Tilii mentions is quite typical of what happens in insurance, where the same probability of something happening is attached to certain priors (insurance parameters), but whether that event happens or not depends on how long we observe the situation.
When you submit an entry, the observations are sorted from "largest prediction" to "smallest prediction". This is the only step where your predictions come into play, so only the order determined by your predictions matters. Visualize the observations arranged from left to right, with the largest predictions on the left. We then move from left to right, asking "In the leftmost x% of the data, how much of the actual observed loss have you accumulated?" With no model, you can expect to accumulate 10% of the loss in 10% of the predictions, so no model (or a "null" model) achieves a straight line. We call the area between your curve and this straight line the Gini coefficient.
There is a maximum achievable area for a "perfect" model. We will use the normalized Gini coefficient by dividing the Gini coefficient of your model by the Gini coefficient of the perfect model.
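The procedure quoted above can be sketched directly in NumPy. This is an illustrative implementation of the description, not the competition's official scoring code:

```python
import numpy as np

def gini(actual, pred):
    # sort actual outcomes by the predictions, largest prediction first
    order = np.argsort(pred)[::-1]
    sorted_actual = np.asarray(actual, dtype=float)[order]
    # cumulative share of the total observed loss, sweeping left to right
    cum_loss = np.cumsum(sorted_actual) / sorted_actual.sum()
    # the "null" model accumulates loss proportionally to position
    null_line = np.arange(1, len(sorted_actual) + 1) / len(sorted_actual)
    # area between the model's curve and the straight line
    return (cum_loss - null_line).sum() / len(sorted_actual)

def normalized_gini(actual, pred):
    # divide by the Gini of the "perfect" model that sorts by the target itself
    return gini(actual, pred) / gini(actual, actual)
```

For a binary target, a perfect ordering scores 1.0, a random ordering about 0.0, and a fully reversed ordering -1.0.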
Another good explanation is provided in the competition by the notebook by Kilian Batzner: https://www.kaggle.com/code/batzner/gini-coefficient-an-intuitive-explanation, which, by means of plots and toy examples, tries to give a sense of a metric that is not so common outside the actuarial departments of insurance companies.
The metric can also be approximately expressed as the covariance of the scaled prediction rank and the scaled target value, resulting in a more understandable rank association measure (see Dmitriy Guller: https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/40576).
From the point of view of the objective function, you can optimize for the binary log loss (as you would do in a classification problem). Neither ROC-AUC nor the normalized Gini coefficient is differentiable, and they may be used only for metric evaluation on the validation set (for instance, for early stopping or for reducing the learning rate in a neural network). However, optimizing for the log loss doesn't always improve the ROC-AUC and the normalized Gini coefficient. There is actually a differentiable ROC-AUC approximation (Calders, Toon, and Szymon Jaroszewicz. "Efficient AUC optimization for classification." European conference on principles of data mining and knowledge discovery. Springer, Berlin, Heidelberg, 2007. https://link.springer.com/content/pdf/10.1007/978-3-540-74976-9_8.pdf). However, it seems that it is not necessary to use anything different from log loss as the objective function, with ROC-AUC or the normalized Gini coefficient as the evaluation metric, in this competition.
First, Michael explains that his solution is composed of a blend of six models (one LightGBM model and five neural networks). Moreover, since no advantage could be gained by weighting the contributions of each model to the blend (nor by doing linear or non-linear stacking), likely because of overfitting, he states that he resorted to just a plain blend (an arithmetic mean) of models that had been built from different seeds.
Such insight makes the task of replicating his approach much easier for us, also because he mentions that just blending together the LightGBM's results with those of one of the neural networks he built would have been enough to guarantee first place in the competition. That will limit our exercise to two good single models instead of a host of them. In addition, he mentioned having done little data processing, beyond dropping some columns and one-hot encoding categorical features.
We start by importing the key packages (NumPy, pandas, Optuna for hyperparameter optimization, LightGBM, and some utility functions). We also define a configuration class and instantiate it. We will discuss the parameters defined in the configuration class during the exploration of the code as we progress. What is important to remark here is that, by using a class containing all your parameters, it will be easier for you to modify them in a consistent way throughout the code. In the heat of a competition, it is easy to forget to update a parameter that is referred to in multiple places in the code, and it is always difficult to set parameters when they are dispersed among cells and functions. A configuration class can save you a lot of effort and spare you mistakes along the way.
import numpy as np
import pandas as pd
import optuna
import lightgbm as lgb
from path import Path
from sklearn.model_selection import StratifiedKFold

class Config:
    input_path = Path('../input/porto-seguro-safe-driver-prediction')
    optuna_lgb = False
    n_estimators = 1500
    early_stopping_round = 150
    cv_folds = 5
    random_state = 0
    params = {'objective': 'binary',
              'boosting_type': 'gbdt',
              'learning_rate': 0.01,
              'max_bin': 25,
              'num_leaves': 31,
              'min_child_samples': 1500,
              'colsample_bytree': 0.7,
              'subsample_freq': 1,
              'subsample': 0.7,
              'reg_alpha': 1.0,
              'reg_lambda': 1.0,
              'verbosity': 0,
              'random_state': 0}

config = Config()
Then, we just extract the target (a binary target of 0s and 1s) and remove it from the train dataset.
target = train["target"]
train = train.drop("target", axis="columns")
During the competition, Tilii tested feature elimination using Boruta (https://github.com/scikit-learn-contrib/boruta_py). You can find his kernel here: https://www.kaggle.com/code/tilii7/boruta-feature-elimination/notebook. As you can check, no calc feature is considered a confirmed feature by Boruta.
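The training code that follows passes a gini_lgb function as eval_metric, but its definition does not appear in this excerpt. A plausible sketch, consistent with LightGBM's scikit-learn eval-metric API (which expects a function returning a name, a value, and whether higher is better) and with the metric name the cross-validation loop reads back, could be:

```python
import numpy as np

def gini_sorted(actual, pred):
    # area between the cumulative loss curve and the diagonal, following the
    # metric description above; illustrative helper, not the official scorer
    order = np.argsort(pred)[::-1]
    a = np.asarray(actual, dtype=float)[order]
    cum = np.cumsum(a) / a.sum()
    base = np.arange(1, len(a) + 1) / len(a)
    return (cum - base).sum() / len(a)

def gini_lgb(y_true, y_pred):
    # LightGBM sklearn-API eval metric: (name, value, is_higher_better)
    score = gini_sorted(y_true, y_pred) / gini_sorted(y_true, y_true)
    return 'normalized_gini_coef', score, True
```

The name 'normalized_gini_coef' matches the key looked up later in model.best_score_['valid_0'].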
if config.optuna_lgb:

    def objective(trial):
        params = {
            'learning_rate': trial.suggest_float("learning_rate", 0.01, 1.0),
            'num_leaves': trial.suggest_int("num_leaves", 3, 255),
            'min_child_samples': trial.suggest_int("min_child_samples",
                                                   3, 3000),
            'colsample_bytree': trial.suggest_float("colsample_bytree",
                                                    0.1, 1.0),
            'subsample_freq': trial.suggest_int("subsample_freq", 0, 10),
            'subsample': trial.suggest_float("subsample", 0.1, 1.0),
            'reg_alpha': trial.suggest_loguniform("reg_alpha", 1e-9, 10.0),
            'reg_lambda': trial.suggest_loguniform("reg_lambda", 1e-9, 10.0),
        }
        score = list()
        skf = StratifiedKFold(n_splits=config.cv_folds, shuffle=True,
                              random_state=config.random_state)
        for train_idx, valid_idx in skf.split(train, target):
            X_train = train.iloc[train_idx]
            y_train = target.iloc[train_idx]
            X_valid = train.iloc[valid_idx]
            y_valid = target.iloc[valid_idx]
            model = lgb.LGBMClassifier(**params,
                                       n_estimators=1500,
                                       early_stopping_round=150,
                                       force_row_wise=True)
            callbacks = [lgb.early_stopping(stopping_rounds=150,
                                            verbose=False)]
            model.fit(X_train, y_train,
                      eval_set=[(X_valid, y_valid)],
                      eval_metric=gini_lgb, callbacks=callbacks)
            score.append(
                model.best_score_['valid_0']['normalized_gini_coef'])
        return np.mean(score)

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=300)
    print("Best Gini Normalized Score", study.best_value)
    print("Best parameters", study.best_params)
    params = config.params
    params.update(study.best_params)
else:
    params = config.params
In the Kaggle Book, we explain hyperparameter optimization (pages 241 onward) and provide some key hyperparameters for the LightGBM model. As an exercise, try to improve the hyperparameter search by reducing or increasing the explored parameters and by trying alternative methods such as random search or halving search from Scikit-learn (pages 245-246).
preds = np.zeros(len(test))
oof = np.zeros(len(train))
metric_evaluations = list()
skf = StratifiedKFold(n_splits=config.cv_folds, shuffle=True,
                      random_state=config.random_state)

for idx, (train_idx, valid_idx) in enumerate(skf.split(train, target)):
    print(f"CV fold {idx}")
    X_train, y_train = train.iloc[train_idx], target.iloc[train_idx]
    X_valid, y_valid = train.iloc[valid_idx], target.iloc[valid_idx]
    model = lgb.LGBMClassifier(**params,
                               n_estimators=config.n_estimators,
                               early_stopping_round=config.early_stopping_round,
                               force_row_wise=True)
    callbacks = [lgb.early_stopping(stopping_rounds=150),
                 lgb.log_evaluation(period=100, show_stdv=False)]
    model.fit(X_train, y_train,
              eval_set=[(X_valid, y_valid)],
              eval_metric=gini_lgb, callbacks=callbacks)
    metric_evaluations.append(
        model.best_score_['valid_0']['normalized_gini_coef'])
    preds += (model.predict_proba(test,
                                  num_iteration=model.best_iteration_)[:, 1]
              / skf.n_splits)
    oof[valid_idx] = model.predict_proba(X_valid,
                                         num_iteration=model.best_iteration_)[:, 1]

submission['target'] = preds
submission.to_csv('lgb_submission.csv', index=False)
oofs = pd.DataFrame({'id': train_index, 'target': oof})
oofs.to_csv('lgb_oof.csv', index=False)
You can read more about denoising autoencoders as used in Kaggle competitions in the Kaggle Book, on pages 226 and following.
Actually, there are no examples around reproducing Michael Jahrer's approach in the competition using DAEs, but we took example from a working TensorFlow implementation in another competition by OsciiArt (https://www.kaggle.com/code/osciiart/denoising-autoencoder).
Here we start by importing all the necessary packages, especially TensorFlow and Keras. Since we are going to create multiple neural networks, we instruct TensorFlow not to use all the available GPU memory by means of the experimental set_memory_growth command. This will avoid memory overflow problems along the way. We also register the LeakyReLU activation as a custom one, so we can refer to it just by a string in Keras layers.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from path import Path
import gc
import optuna
from sklearn.model_selection import StratifiedKFold
from scipy.special import erfinv

import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

from tensorflow import keras
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Input, Dense, BatchNormalization, Dropout
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.regularizers import l2
from tensorflow.keras.metrics import AUC
from tensorflow.keras.utils import get_custom_objects
from tensorflow.keras.layers import Activation, LeakyReLU

get_custom_objects().update({'leaky-relu': Activation(LeakyReLU(alpha=0.2))})

def gpu_cleanup(objects):
    if objects:
        del objects
    K.clear_session()
    gc.collect()
We also reconfigure the Config class in order to take into account multiple parameters related to the denoising autoencoder and the neural network. As previously stated for the LightGBM model, having all the parameters in a single place makes it simpler to modify them in a consistent way.
class Config:
    input_path = Path('../input/porto-seguro-safe-driver-prediction')
    dae_batch_size = 128
    dae_num_epoch = 50
    dae_architecture = [1500, 1500, 1500]
    reuse_autoencoder = False
    batch_size = 128
    num_epoch = 150
    units = [64, 32]
    input_dropout = 0.06
    dropout = 0.08
    regL2 = 0.09
    activation = 'selu'
    cv_folds = 5
    nas = False
    random_state = 0

config = Config()
As stated in some papers, such as the batch normalization paper (https://arxiv.org/pdf/1502.03167.pdf), neural networks perform even better if you provide them with Gaussian inputs. According to this NVIDIA blog post, https://developer.nvidia.com/blog/gauss-rank-transformation-is-100x-faster-with-rapids-and-cupy/, GaussRank works most of the time, except when features are already normally distributed or extremely asymmetrical (in such cases, applying the transformation may lead to worsened performance).
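The GaussRank transformation itself is not shown at this point of the excerpt; a minimal sketch of the idea (rank the values, rescale the ranks into the open interval (-1, 1), then apply the inverse error function) is:

```python
import numpy as np
from scipy.special import erfinv

def gauss_rank(x, epsilon=1e-6):
    # rank the values (0 .. n-1), then rescale ranks into (-1, 1)
    ranks = np.argsort(np.argsort(x)).astype(np.float64)
    scaled = ranks / (len(x) - 1) * 2 - 1
    # keep away from the asymptotes of erfinv at -1 and 1
    scaled = np.clip(scaled, -1 + epsilon, 1 - epsilon)
    # the inverse error function maps uniform ranks to a Gaussian shape
    return erfinv(scaled)
```

Applied column by column, this yields approximately normal inputs, while preserving the original ordering of the values.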
features = train.columns
train_index = train.index
test_index = test.index
train = train.values.astype(np.float32)
test = test.values.astype(np.float32)
from numba import jit

@jit
def eval_gini(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_true = y_true[np.argsort(y_pred)]
    ntrue = 0
    gini = 0
    delta = 0
    n = len(y_true)
    for i in range(n-1, -1, -1):
        y_i = y_true[i]
        ntrue += y_i
        gini += y_i * delta
        delta += 1 - y_i
    gini = 1 - 2 * gini / (ntrue * (n - ntrue))
    return gini
get_DAE is the function that builds the denoising autoencoder. It accepts a parameter for defining the architecture, which in our case has been set to three layers of 1,500 nodes each (as suggested by Michael Jahrer's solution). The first layer should act as an encoder, the second is a bottleneck layer ideally containing the latent features capable of expressing the information in the data, and the last is a decoding layer capable of reconstructing the initial input data. The three layers have a relu activation function and no bias, and each one is followed by a batch normalization layer. The final output with the reconstructed input data has a linear activation. Training uses an adam optimizer with standard settings (the optimized cost function is the mean squared error, mse).
def get_DAE(X, architecture=[1500, 1500, 1500]):
    features = X.shape[1]
    inputs = Input((features,))
    for i, nodes in enumerate(architecture):
        layer = Dense(nodes, activation='relu',
                      use_bias=False, name=f"code_{i+1}")
        if i == 0:
            x = layer(inputs)
        else:
            x = layer(x)
        x = BatchNormalization()(x)
    outputs = Dense(features, activation='linear')(x)
    model = Model(inputs=inputs, outputs=outputs)
    model.compile(optimizer='adam', loss='mse',
                  metrics=['mse', 'mae'])
    return model
Random normal initialization, since empirically it has been found to converge to better results in this problem
A dense layer with L2 regularization and a parametrizable activation function
An excludable and tunable dropout
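The classifier code below relies on a dense_blocks helper whose definition is not included in this excerpt. Based on the three points above, a sketch of such a helper might look like this (the initializer's standard deviation and the zero-means-no-dropout convention are assumptions):

```python
from tensorflow import keras
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.regularizers import l2

def dense_blocks(x, units, activation, regL2, dropout):
    # random normal initialization, found empirically to converge better here
    initializer = keras.initializers.RandomNormal(mean=0.0, stddev=0.05, seed=0)
    for nodes in units:
        # L2-regularized dense layer with a parametrizable activation
        x = Dense(nodes, activation=activation,
                  kernel_initializer=initializer,
                  kernel_regularizer=l2(regL2))(x)
        # dropout can be excluded by passing dropout=0
        if dropout > 0:
            x = Dropout(dropout)(x)
    return x
```

Each block stacks the dense layer and its optional dropout, so the units list fully determines the depth of the classifier head.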
    # tail of the classifier-building function (its header is elided in this excerpt)
    inputs = dae.get_layer("code_2").output
    if input_dropout > 0:
        x = Dropout(input_dropout)(inputs)
    else:
        x = tf.keras.layers.Layer()(inputs)
    x = dense_blocks(x, units, activation, regL2, dropout)
    outputs = Dense(1, activation='sigmoid')(x)
    model = Model(inputs=dae.input, outputs=outputs)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
                  loss=keras.losses.binary_crossentropy,
                  metrics=[AUC(name='auc')])
    return model
    if not suppress_output:
        plot_keras_history(history, measures=['loss', 'auc'])
    return model, history
In our experiments, we found that:
if config.nas is True:

    def evaluate():
        metric_evaluations = list()
        skf = StratifiedKFold(n_splits=config.cv_folds, shuffle=True,
                              random_state=config.random_state)
        for k, (train_idx, valid_idx) in enumerate(skf.split(train, target)):
            # per-fold autoencoder and classifier training elided in this excerpt
            gpu_cleanup([autoencoder, model])
        return np.mean(metric_evaluations)

    def objective(trial):
        params = {
            # the layer-size choice lists are truncated in the source excerpt
            'first_layer': trial.suggest_categorical("first_layer", [8, 16, 32, 64]),
            'second_layer': trial.suggest_categorical("second_layer", [0, 8, 16, 32]),
            'third_layer': trial.suggest_categorical("third_layer", [0, 8, 16, 32]),
            'input_dropout': trial.suggest_float("input_dropout", 0.0, 0.5),
            'dropout': trial.suggest_float("dropout", 0.0, 0.5),
            'regL2': trial.suggest_uniform("regL2", 0.0, 0.1),
            'activation': trial.suggest_categorical("activation",
                                                    ['relu', 'leaky-relu']),
        }
        # layers set to 0 are skipped (matching the zero option in the lists above)
        config.units = [nodes for nodes in [params['first_layer'],
                                            params['second_layer'],
                                            params['third_layer']]
                        if nodes > 0]
        config.input_dropout = params['input_dropout']
        config.dropout = params['dropout']
        config.regL2 = params['regL2']
        config.activation = params['activation']
        return evaluate()

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=60)
    print("Best Gini Normalized Score", study.best_value)
    print("Best parameters", study.best_params)
    config.units = [nodes for nodes in [study.best_params['first_layer'],
                                        study.best_params['second_layer'],
                                        study.best_params['third_layer']]
                    if nodes > 0]
    config.input_dropout = study.best_params['input_dropout']
    config.dropout = study.best_params['dropout']
    config.regL2 = study.best_params['regL2']
    config.activation = study.best_params['activation']
preds = np.zeros(len(test))
oof = np.zeros(len(train))
metric_evaluations = list()
skf = StratifiedKFold(n_splits=config.cv_folds, shuffle=True,
                      random_state=config.random_state)

for k, (train_idx, valid_idx) in enumerate(skf.split(train, target)):
    print(f"CV fold {k}")
    X_train, y_train = train[train_idx, :], target[train_idx]
    X_valid, y_valid = train[valid_idx, :], target[valid_idx]
    if config.reuse_autoencoder:
        print("restoring previously trained dae")
        autoencoder = load_model(f"./dae_fold_{k}")
    else:
        autoencoder = autoencoder_fitting(X_train, X_valid,
                                          filename=f'./dae_fold_{k}',
                                          random_state=config.random_state)
    # classifier training on top of the DAE elided in this excerpt
    metric_evaluations.append(best_score)
    preds += (model.predict(test, batch_size=128).ravel() /
              skf.n_splits)
    oof[valid_idx] = model.predict(X_valid, batch_size=128).ravel()
    gpu_cleanup([autoencoder, model])
submission['target'] = preds
submission.to_csv('dnn_submission.csv', index=False)
oofs = pd.DataFrame({'id':train_index, 'target':oof})
oofs.to_csv('dnn_oof.csv', index=False)
import pandas as pd
import numpy as np
from numba import jit

@jit
def eval_gini(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_true = y_true[np.argsort(y_pred)]
    ntrue = 0
    gini = 0
    delta = 0
    n = len(y_true)
    for i in range(n-1, -1, -1):
        y_i = y_true[i]
        ntrue += y_i
        gini += y_i * delta
        delta += 1 - y_i
    gini = 1 - 2 * gini / (ntrue * (n - ntrue))
    return gini
lgb_oof = pd.read_csv("../input/workbook-lgb/lgb_oof.csv")
dnn_oof = pd.read_csv("../input/workbook-dae/dnn_oof.csv")
target = pd.read_csv("../input/porto-seguro-safe-driver-prediction/train.csv",
                     usecols=['id', 'target'])
Now we just test whether, by combining the two models using different weights, we can get a better evaluation on the out-of-fold data.
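A simple way to carry out this test is a grid sweep over the blend weight, scoring each candidate blend on the out-of-fold predictions. Here we illustrate the idea with synthetic stand-ins for the target and the two oof prediction vectors:

```python
import numpy as np

def normalized_gini(actual, pred):
    # sorting-based normalized Gini, following the definition seen earlier
    def gini(a, s):
        a = np.asarray(a, dtype=float)[np.argsort(s)[::-1]]
        cum = np.cumsum(a) / a.sum()
        base = np.arange(1, len(a) + 1) / len(a)
        return (cum - base).sum() / len(a)
    return gini(actual, pred) / gini(actual, actual)

# synthetic stand-ins for the target and the two models' oof predictions
rng = np.random.default_rng(0)
y = (rng.random(5000) < 0.04).astype(float)        # rare positives, as in this dataset
lgb_oof_pred = 0.6 * y + 0.4 * rng.random(5000)    # hypothetical oof vectors
dnn_oof_pred = 0.5 * y + 0.5 * rng.random(5000)

# sweep the LightGBM weight from 0 to 1 and keep the best oof score
weights = np.round(np.arange(0.0, 1.01, 0.05), 2)
scores = {w: normalized_gini(y, w * lgb_oof_pred + (1 - w) * dnn_oof_pred)
          for w in weights}
best_w = max(scores, key=scores.get)
```

On the real data, you would replace the synthetic vectors with the lgb_oof and dnn_oof columns loaded above and reuse the eval_gini function already defined.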
lgb_submission = pd.read_csv("../input/workbook-lgb/lgb_submission.csv")
dnn_submission = pd.read_csv("../input/workbook-dae/dnn_submission.csv")
submission = pd.read_csv(
"../input/porto-seguro-safe-driver-prediction/sample_submission.csv")
As pointed out by CPMP in his solution (https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/44614), depending on how you build your cross-validation, you can experience a "huge variation of Gini scores among folds". For this reason, CPMP suggests decreasing the variance of the estimates by using many different seeds for multiple cross-validations and averaging the results.
Since 1982, Spyros Makridakis (https://mofc.unic.ac.cy/dr-spyros-makridakis/) has involved groups of researchers from all over the world in forecasting challenges, called M competitions, in order to compare the efficacy of existing and new forecasting methods on different forecasting problems. For this reason, M competitions have always been completely open to both academics and practitioners. The competitions are probably the most cited and referenced events in the forecasting community, and they have always highlighted the changing state of the art in forecasting methods. Each previous M competition has provided both researchers and practitioners not only with useful data to train and test their forecasting tools, but also with a series of discoveries and approaches that are revolutionizing the way forecasting is done.
Apart from the competition pages, we found a lot of information regarding the competition and its dynamics in the following papers from the International Journal of Forecasting:
Makridakis, Spyros, Evangelos Spiliotis, and Vassilios Assimakopoulos. "The M5 competition: Background, organization, and implementation." International Journal of Forecasting (2021).
Makridakis, Spyros, Evangelos Spiliotis, and Vassilios Assimakopoulos. "M5 accuracy competition: Results, findings, and conclusions." International Journal of Forecasting (2022).
Makridakis, Spyros, et al. "The M5 Uncertainty competition: Results, findings and conclusions." International Journal of Forecasting (2021).
Walmart provided the data. It consisted of 42,840 daily sales time series of items hierarchically arranged into departments, categories, and stores spread across three U.S. states (the series are somewhat correlated with each other). Along with the sales, Walmart also provided accompanying information (exogenous variables, not often provided in forecasting problems) such as the prices of items, some calendar information, and associated promotions or the presence of other events affecting the sales.
One interesting aspect of the competition is that it dealt with consumer goods sales, both fast moving and slow moving, with many examples of the latter presenting intermittent sales (sales are often zero but for some rare cases). Intermittent series, though common in many industries, are still a challenging case in forecasting for many practitioners.
The competition metric, the Weighted Root Mean Squared Scaled Error (WRMSSE), is a weighted average of each series' RMSSE, which scales the squared forecast error over the evaluation horizon by the average day-to-day squared variation of the series on those days:

RMSSE = sqrt( (1/h) * sum_{t=n+1..n+h} (Y_t - Yhat_t)^2 / ( (1/(n-1)) * sum_{t=2..n} (Y_t - Y_{t-1})^2 ) )

where:
n is the length of the training sample
h is the forecasting horizon
Y_t is the actual sales value at time t and Yhat_t the forecast
A good explanation of the underlying workings of this metric is provided in a post by Alexander Soare (https://www.kaggle.com/alexandersoare): https://www.kaggle.com/competitions/m5-forecasting-accuracy/discussion/148273. After having transformed the evaluation metric, Alexander attributes improving performance to improving the ratio between the error in the predictions and the day-to-day variation of the sales values. If the error is the same as the daily variations (ratio = 1), the model is likely not much better than a random guess based on historical variations. If your error is less than that, it is scored in a quadratic way (because of the square root) as it approaches zero. Consequently, a WRMSSE of 0.5 corresponds to a ratio of 0.7, and a WRMSSE of 0.25 corresponds to a ratio of 0.5.
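The RMSSE at the core of the competition metric can be sketched as follows; this is an illustrative implementation for a single series, not the official scoring script (which also applies series weights and aggregates across the hierarchy):

```python
import numpy as np

def rmsse(y_train, y_true, y_pred):
    # numerator: mean squared forecast error over the evaluation horizon
    num = np.mean((np.asarray(y_true, dtype=float)
                   - np.asarray(y_pred, dtype=float)) ** 2)
    # denominator: mean squared one-step (day-to-day) variation in training
    den = np.mean(np.diff(np.asarray(y_train, dtype=float)) ** 2)
    return np.sqrt(num / den)
```

A forecast whose error matches the series' average day-to-day variation scores exactly 1, matching the ratio interpretation discussed above.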
Noticeably, all the Kagglers that placed in the higher ranks of the competition used LightGBM, either as their unique model type or blended/stacked in ensembles, because its lower memory usage and speed of computation gave it an advantage given the large number of time series to process and predict. But there are also other reasons for its success. Contrary to classical methods based on ARIMA, it doesn't require relying on the analysis of auto-correlation and specifically figuring out the parameters for each single series in the problem. In addition, contrary to methods based on deep learning, it doesn't require searching for improved, complicated neural architectures or tuning large numbers of hyperparameters. The strength of gradient boosting methods in time series problems (and by extension of every other gradient boosting algorithm, such as XGBoost) is to rely on feature engineering, creating the right number of features based on time lags, moving averages, and averages from groupings of the series attributes. Then, choosing the right objective function and doing some hyperparameter tuning will suffice to obtain excellent results when the time series are long enough (for shorter series, classical statistical methods such as ARIMA or exponential smoothing are still the recommended choice).
Among all the available solutions, we found the one proposed by Monsaraida (Masanori Miyahara), a Japanese computer scientist, the most interesting. He proposed a simple and straightforward solution that ranked fourth on the private leaderboard with a score of 0.53583. The solution uses just general features without prior selection (such as sales statistics, calendar, prices, and identifiers). Moreover, it uses a limited number of models of the same kind, based on LightGBM gradient boosting, without resorting to any kind of blending, recursive modeling (where predictions feed other hierarchically related predictions), or multipliers (constants chosen to fit the test set better). Here is a scheme taken from his solution presentation to the M5 organizers (https://github.com/Mcompetitions/M5-methods/tree/master/Code%20of%20Winning%20Methods/A4), where it can be noted that he treats each of the ten stores separately for each of the four weeks to be looked into the future.
Given that Monsaraida kept his solution simple and practical, as in a real-world forecasting project, in this chapter we will try to replicate his example by refactoring his code so it runs in Kaggle notebooks (we will handle the memory and running time limitations by splitting the code into multiple notebooks). In this way, we intend to provide readers with a simple and effective way, based on gradient boosting, to approach forecasting problems.
Since we need to predict both for the public leaderboard and for the private one, it is necessary to repeat this process twice, stopping training at day 1,913 (predicting days from 1,914 to 1,941) for the public test set submission and at day 1,941 (predicting days from 1,942 to 1,969) for the private one. Given the current limitations for running CPU-based Kaggle notebooks, all these eight notebooks can be run in parallel (the whole process takes about six and a half hours). Each notebook can be distinguished from the others by its name, which contains the parameter values for the last training day and the look-ahead horizon in days. An example of one of these notebooks can be found at: https://www.kaggle.com/code/lucamassaron/m5-train-day-1941-horizon-7.
import numpy as np
import pandas as pd
import os
import random
import math
from decimal import Decimal as dec
import datetime
import time
import gc
import lightgbm as lgb
import pickle
import warnings
warnings.filterwarnings("ignore", category=UserWarning)
We keep defining functions for this solution, because it helps split the solution into smaller parts, and because it is easier to clean up all the used variables when you return from a function (you keep only what you saved to disk or returned from the function). Our next function helps us load all the available data and compress it:
def load_data():
    # File names completed from the M5 competition dataset
    train_df = reduce_mem_usage(pd.read_csv("../input/m5-forecasting-accuracy/sales_train_evaluation.csv"))
    prices_df = reduce_mem_usage(pd.read_csv("../input/m5-forecasting-accuracy/sell_prices.csv"))
    calendar_df = reduce_mem_usage(pd.read_csv("../input/m5-forecasting-accuracy/calendar.csv"))
    submission_df = reduce_mem_usage(pd.read_csv("../input/m5-forecasting-accuracy/sample_submission.csv"))
    return train_df, prices_df, calendar_df, submission_df
train_df, prices_df, calendar_df, submission_df = load_data()
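The load_data function relies on the reduce_mem_usage helper introduced earlier in the book. As a reminder, here is a minimal sketch of our own simplified version of this common Kaggle utility, which downcasts numeric columns to the smallest dtype that can hold their values:

```python
import numpy as np
import pandas as pd

def reduce_mem_usage(df, verbose=True):
    """Downcast numeric columns to save memory."""
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in df.columns:
        if np.issubdtype(df[col].dtype, np.integer):
            df[col] = pd.to_numeric(df[col], downcast="integer")
        elif np.issubdtype(df[col].dtype, np.floating):
            df[col] = pd.to_numeric(df[col], downcast="float")
    end_mem = df.memory_usage().sum() / 1024 ** 2
    if verbose:
        print(f"Memory usage: {start_mem:.2f} MB -> {end_mem:.2f} MB")
    return df

df = pd.DataFrame({"a": np.arange(1000, dtype="int64"),
                   "b": np.random.rand(1000)})
df = reduce_mem_usage(df, verbose=False)
print(df.dtypes.tolist())  # [dtype('int16'), dtype('float32')]
```

The book's actual version also handles float16 and reports compression ratios, but the principle is the same: shrink each column to the cheapest representation that preserves its values.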
After preparing the code to retrieve the data relative to prices, volumes, and calendar information, we proceed to prepare the first processing function, whose role is to create a basic table of information having item_id, dept_id, cat_id, state_id and store_id as row keys, a day column, and a values column containing the volumes. This is achieved, starting from rows having all the days' data as columns, by using the pandas command melt (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html). The command takes as reference the index of the DataFrame and then picks all the remaining features, placing their names in one column and their values in another (the var_name and value_name parameters help you define the names of these new columns). In this way, you can unfold a row representing the sales series of a certain item in a certain store into multiple rows, each one representing a single day. The fact that the positional order of the unfolded columns is preserved guarantees that your time series now spans the vertical axis (you can therefore apply further transformations to it, such as moving means).
Apart from this, the code snippet also separates a time-based holdout of rows (your validation set) from the training ones. To the training part it will also add the rows necessary for your predictions, based on the prediction horizon you provide.
del release_df
grid_df = reduce_mem_usage(grid_df, verbose=False)
gc.collect()
- the actual price (normalized by the maximum)
- the maximum price
- the minimum price
- the mean price
- the standard deviation of the price
- the number of different prices the item has taken
- the number of items in the store with the same price
Besides these basic descriptive price statistics, we also add some features describing price dynamics for each item in a store, based on different time granularities:
del calendar_prices
gc.collect()
original_columns = list(grid_df)
grid_df = grid_df.merge(prices_df, on=['store_id', 'item_id', 'wm_yr_wk'], how='left')
del(prices_df)
gc.collect()
grid_df = pd.read_feather(
    f"grid_df_{end_train_day_x}_to_{end_train_day_x + predict_horizon}.feather")
grid_df = grid_df[['id', 'd']]
gc.collect()
calendar_df['moon'] = calendar_df.date.apply(get_moon_phase)
# Merge calendar partly
icols = ['date',
'd',
'event_name_1',
'event_type_1',
'event_name_2',
'event_type_2',
'snap_CA',
'snap_TX',
'snap_WI',
'moon',
]
grid_df = grid_df.merge(calendar_df[icols], on=['d'], how='left')
icols = ['event_name_1',
'event_type_1',
'event_name_2',
'event_type_2',
'snap_CA',
'snap_TX',
'snap_WI']
del(grid_df['date'])
grid_df = reduce_mem_usage(grid_df, verbose=False)
grid_df.to_feather(f"grid_calendar_{end_train_day_x}_to_{end_train_day_x + predict_horizon}.feather")
del(grid_df)
del(calendar_df)
gc.collect()
del(grid_df)
gc.collect()
Our last two feature creation functions generate more sophisticated feature engineering for time series. The first function produces both lagged sales and their moving averages. First, using the shift method (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shift.html), it generates a range of lagged sales up to 15 days in the past. Then, using shift in conjunction with rolling (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html), it creates moving means with windows of 7, 14, 30, 60 and 180 days.
num_lag_day_list = []
num_lag_day = 15
for col in range(predict_horizon, predict_horizon + num_lag_day):
num_lag_day_list.append(col)
grid_df = grid_df.assign(**{
    '{}_lag_{}'.format('sales', l): grid_df.groupby(['id'])['sales'].transform(lambda x: x.shift(l))
    for l in num_lag_day_list
})
for col in list(grid_df):
    if 'lag' in col:
        grid_df[col] = grid_df[col].astype(np.float16)
num_rolling_day_list = [7, 14, 30, 60, 180]
for num_rolling_day in num_rolling_day_list:
    grid_df['rolling_mean_' + str(num_rolling_day)] = grid_df.groupby(['id'])['sales'].transform(
        lambda x: x.shift(predict_horizon).rolling(num_rolling_day).mean()).astype(np.float16)
    grid_df['rolling_std_' + str(num_rolling_day)] = grid_df.groupby(['id'])['sales'].transform(
        lambda x: x.shift(predict_horizon).rolling(num_rolling_day).std()).astype(np.float16)
grid_df = reduce_mem_usage(grid_df, verbose=False)
grid_df.to_feather(f"lag_feature_{end_train_day_x}_to_{end_train_day_x + predict_horizon}.feather")
del(grid_df)
gc.collect()
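A toy example of ours shows why the code shifts by predict_horizon before rolling: the feature computed for a given day only uses sales that are already known predict_horizon days earlier, avoiding leakage from the forecast window:

```python
import pandas as pd

sales = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
predict_horizon = 2

# For row t this averages sales[t-4], sales[t-3], sales[t-2]: only values
# that are already available predict_horizon days before the day being predicted
feature = sales.shift(predict_horizon).rolling(3).mean()
print(feature.tolist()[4:])  # [2.0, 3.0]
```

Without the shift, the rolling window would include the very days the model is asked to predict.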
As for the second advanced feature engineering function, it is an encoding function, taking specific groupings of variables among state, store, category, department, and sold item, and representing their mean and standard deviation. Such encodings are time independent (time is not part of the grouping), and their role is to help the training algorithm distinguish how items, categories, and stores (and their combinations) differ from each other.
del(grid_df)
gc.collect()
mean_features = [
'enc_cat_id_mean', 'enc_cat_id_std',
'enc_dept_id_mean', 'enc_dept_id_std',
'enc_item_id_mean', 'enc_item_id_std'
]
df2 = pd.read_feather(f"target_encoding_{end_train_day_x}_to_{end_train_day_x + predict_horizon}.feather")
for store_id in store_id_set_list:
    df = pd.read_feather(f"grid_full_store_{store_id}_{end_train_day_x}_to_{end_train_day_x + predict_horizon}.feather")
    df = pd.concat([df, df2[df2.index.isin(index_store[store_id])].reset_index(drop=True)], axis=1)
    df.to_feather(f"grid_full_store_{store_id}_{end_train_day_x}_to_{end_train_day_x + predict_horizon}.feather")
del(df2)
gc.collect()
Here is the complete function for training all the models for a specific prediction horizon:
lgb_params = {
'boosting_type': 'goss',
'objective': 'tweedie',
'tweedie_variance_power': 1.1,
'metric': 'rmse',
#'subsample': 0.5,
#'subsample_freq': 1,
'learning_rate': 0.03,
'num_leaves': 2 ** 11 - 1,
'min_data_in_leaf': 2 ** 12 - 1,
'feature_fraction': 0.5,
'max_bin': 100,
'boost_from_average': False,
'num_boost_round': 1400,
'verbose': -1,
'num_threads': os.cpu_count(),
'force_row_wise': True,
}
random.seed(seed)
np.random.seed(seed)
os.environ['PYTHONHASHSEED'] = str(seed)
lgb_params['seed'] = seed
store_id_set_list = list(train_df['store_id'].unique())
print(f"training stores: {store_id_set_list}")
feature_importance_all_df = pd.DataFrame()
for store_index, store_id in enumerate(store_id_set_list):
    print(f'now training {store_id} store')
    grid_df, enable_features = load_grid_by_store(end_train_day_x, predict_horizon, store_id)
    train_mask = grid_df['d'] <= end_train_day_x
    valid_mask = train_mask & (grid_df['d'] > (end_train_day_x - predict_horizon))
    preds_mask = grid_df['d'] > (end_train_day_x - 100)
train_data = lgb.Dataset(grid_df[train_mask][enable_features],
label=grid_df[train_mask]['sales'])
valid_data = lgb.Dataset(grid_df[valid_mask][enable_features],
label=grid_df[valid_mask]['sales'])
# Saving part of the dataset for later predictions
# Removing features that we need to calculate recursively
grid_df = grid_df[preds_mask].reset_index(drop=True)
grid_df.to_feather(f'test_{store_id}_{predict_horizon}.feather')
del(grid_df)
gc.collect()
estimator = lgb.train(lgb_params,
train_data,
valid_sets=[valid_data],
                          callbacks=[lgb.log_evaluation(period=100, show_stdv=False)],
                          )
model_name = str(f'lgb_model_{store_id}_{predict_horizon}.bin')
    feature_importance_store_df = pd.DataFrame(sorted(zip(enable_features, estimator.feature_importance())),
                                               columns=['feature_name', 'importance'])
    feature_importance_store_df = feature_importance_store_df.sort_values('importance', ascending=False)
    feature_importance_store_df['store_id'] = store_id
    feature_importance_store_df.to_csv(f'feature_importance_{store_id}_{predict_horizon}.csv', index=False)
    feature_importance_all_df = pd.concat([feature_importance_all_df, feature_importance_store_df])
pickle.dump(estimator, open(model_name, 'wb'))
del([train_data, valid_data, estimator])
gc.collect()
feature_importance_all_df.to_csv(f'feature_importance_all_{predict_horizon}.csv', index=False)
feature_importance_agg_df = feature_importance_all_df.groupby(
'feature_name')['importance'].agg(['mean', 'std']).reset_index()
feature_importance_agg_df.columns = ['feature_name', 'importance_mean', 'importance_std']
feature_importance_agg_df = feature_importance_agg_df.sort_values('importance_mean', ascending=False)
feature_importance_agg_df.to_csv(f'feature_importance_agg_{predict_horizon}.csv', index=False)
end_train_day_x_list = [1941]
prediction_horizon_list = [7]
seed = 42
train_pipeline(train_df, prices_df, calendar_df, end_train_day_x_list, prediction_horizon_list, seed)
end_train_day_x_list = [1941]
prediction_horizon_list = [14]
seed = 42
train_pipeline(train_df, prices_df, calendar_df, end_train_day_x_list, prediction_horizon_list, seed)
end_train_day_x_list = [1941]
prediction_horizon_list = [21]
seed = 42
train_pipeline(train_df, prices_df, calendar_df, end_train_day_x_list, prediction_horizon_list, seed)
end_train_day_x_list = [1941]
prediction_horizon_list = [28]
seed = 42
train_pipeline(train_df, prices_df, calendar_df, end_train_day_x_list, prediction_horizon_list, seed)
Public leaderboard example: https://www.kaggle.com/lucamassaron/m5-predict-public-leaderboard
Private leaderboard example: https://www.kaggle.com/code/lucamassaron/m5-predict-private-leaderboard
In this concluding code snippet, after loading the necessary packages, such as LightGBM, for every end-of-training day and for every prediction horizon, we recover the correct notebook with its data. Then, we iterate through all the stores and predict the sales for all the items in the period ranging from the previous prediction horizon up to the present one. In this way, every model predicts on a single week, the one it has been trained on.
import numpy as np
import pandas as pd
import os
import random
import math
from decimal import Decimal as dec
import datetime
import time
import gc
import lightgbm as lgb
import pickle
import warnings
warnings.filterwarnings("ignore", category=UserWarning)
store_id_set_list = ['CA_1', 'CA_2', 'CA_3', 'CA_4', 'TX_1', 'TX_2', 'TX_3', 'WI_1', 'WI_2', 'WI_3']
end_train_day_x_list = [1913, 1941]
prediction_horizon_list = [7, 14, 21, 28]
pred_v_all_df = list()
for end_train_day_x in end_train_day_x_list:
    previous_prediction_horizon = 0
    for prediction_horizon in prediction_horizon_list:
        notebook_name = f"../input/m5-train-day-{end_train_day_x}-horizon-{prediction_horizon}"
        pred_v_df = pd.DataFrame()
        for store_id in store_id_set_list:
            model_path = str(f'{notebook_name}/lgb_model_{store_id}_{prediction_horizon}.bin')
            print(f'loading {model_path}')
            estimator = pickle.load(open(model_path, 'rb'))
            base_test = pd.read_feather(f"{notebook_name}/test_{store_id}_{prediction_horizon}.feather")
            enable_features = [col for col in base_test.columns if col not in ['id', 'd', 'sales']]
            # generate the sales predictions for this store over the horizon window
            base_test['sales'] = estimator.predict(base_test[enable_features])
            temp_v_df = base_test[
                (base_test['d'] >= end_train_day_x + previous_prediction_horizon + 1) &
                (base_test['d'] < end_train_day_x + prediction_horizon + 1)
            ][['id', 'd', 'sales']]
            if len(pred_v_df) != 0:
                pred_v_df = pd.concat([pred_v_df, temp_v_df])
            else:
                pred_v_df = temp_v_df.copy()
            del(temp_v_df)
            gc.collect()
        previous_prediction_horizon = prediction_horizon
        pred_v_all_df.append(pred_v_df)
pred_v_all_df = pd.concat(pred_v_all_df)
submission = pd.read_csv("../input/m5-forecasting-accuracy/sample_submission.csv")
pred_v_all_df.d = pred_v_all_df.d - end_train_day_x_list
pred_h_all_df = pred_v_all_df.pivot(index='id', columns='d', values='sales')
pred_h_all_df = pred_h_all_df.reset_index()
pred_h_all_df.columns = submission.columns
submission = submission[['id']].merge(pred_h_all_df, on=['id'], how='left').fillna(0)
submission.to_csv("m5_predictions.csv", index=False)
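The long-to-wide pivot used to build the submission can be illustrated in isolation with a toy example of ours:

```python
import pandas as pd

# Predictions in long format: one row per (id, forecast day)
long = pd.DataFrame({
    "id": ["item_A", "item_A", "item_B", "item_B"],
    "d": [1, 2, 1, 2],
    "sales": [3.0, 4.0, 5.0, 6.0],
})
wide = long.pivot(index="id", columns="d", values="sales").reset_index()
wide.columns = ["id", "F1", "F2"]  # real submissions use F1..F28
print(wide.to_string(index=False))
```

pivot turns each distinct day value into a column, recreating the one-row-per-series layout the sample submission file expects.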
Exercise
Summary
In this second chapter, we took on quite a complex time series competition; hence, even the easiest top solution we tried is actually fairly complex, and it requires coding quite a lot of processing functions. After going through the chapter, you should have a better idea of how to process time series and predict them using gradient boosting. Favoring gradient boosting solutions over traditional methods, when you have enough data, as with this problem, should help you create strong solutions for complex problems with hierarchical correlations, intermittent series, and availability of covariates such as events, prices, or market conditions. In the following chapters, you will tackle even more complex Kaggle competitions, dealing with images and texts. You will be amazed at how much you can learn by recreating top-scoring solutions and understanding their inner workings.
03 Cassava Leaf Disease competition
Join our book community on Discord
https://www.kaggle.com/competitions/cassava-leaf-disease-classification
"As the second-largest provider of carbohydrates in Africa, cassava is a key food security crop grown by smallholder farmers because it can withstand harsh conditions. At least 80% of household farms in Sub-Saharan Africa grow this starchy root, but viral diseases are major sources of poor yields. With the help of data science, it may be possible to identify common diseases so they can be treated."
"Existing methods of disease detection require farmers to solicit the help of government-funded agricultural experts to visually inspect and diagnose the plants. This suffers from being labor-intensive, low-supply and costly. As an added challenge, effective solutions for farmers must perform well under significant constraints, since African farmers may only have access to mobile-quality cameras with low-bandwidth."
"Your task is to classify each cassava image into four disease categories or a fifth category indicating a healthy leaf. With your help, farmers may be able to quickly identify diseased plants, potentially saving their crops before they inflict irreparable damage."
This bit is rather important: it specifies that this is a classification competition, and the number of classes is small (5).
Figure 3.1: description of the Cassava competition dataset
What can we make of that?
The evaluation metric was chosen to be categorization accuracy: https://developers.google.com/machine-learning/crash-course/classification/accuracy. This metric takes discrete values as inputs, which means potential ensembling strategies become somewhat more involved. The loss function is used during training to optimize the learning process and, as long as we want to use methods based on gradient descent, it needs to be continuous; the evaluation metric, on the other hand, is used after training to measure overall performance and, as such, can be discrete.
Figure 3.2: the imports needed for our baseline solution
configuration class: a placeholder for all the parameters defining our learning process:
A good practice for coding a solution is to encapsulate the important bits in functions.
A few remarks around this step:
Exercise: Examine the possible choices for loss and metric - a useful guide is https://keras.io/api/losses/. What would the other reasonable options be?
The next step is the data:
Figure 3.5: setting up the data generator
Next, we set up the model - straightforward, thanks to the function we defined above:
constructing the submission dataframe
In this section, we have demonstrated how to start competing in a competition focused on image classification - you can use this approach to move quickly from basic EDA to a functional submission. However, a rudimentary approach like this is unlikely to produce very competitive results. For this reason, in the next section we discuss more specialized techniques that were utilized in top-scoring solutions.
Pretraining
https://www.kaggle.com/competitions/cassava-disease/overview
TTA
TTA was used extensively by the top solutions in the Cassava competition, an excellent example being the top-3 private LB result: https://www.kaggle.com/competitions/cassava-leaf-disease-classification/discussion/221150
Transformers
https://www.kaggle.com/competitions/cassava-leaf-disease-classification/discussion/221150
Ensembling
https://www.kaggle.com/competitions/cassava-leaf-disease-classification/discussion/220751
The winning solution involved a different approach (with fewer models in the final blend), but relying on the same core logic:
https://www.kaggle.com/competitions/cassava-leaf-disease-classification/discussion/221957
Summary
In this chapter, we have described how to get started with a baseline solution for an image classification competition and discussed a diverse set of possible extensions for moving into a competitive (medal) zone.