
The Kaggle Workbook

Copyright © 2022 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Early Access Publication: The Kaggle Workbook

Early Access Production Reference: B19265

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK

ISBN: 978-1-80461-121-0

www.packt.com


Table of Contents
1. The Kaggle Workbook: Self-learning exercises and valuable insights for Kaggle data science competitions
2. 01 The most renowned tabular competition: Porto Seguro's Safe Driver Prediction
   1. Join our book community on Discord
   2. Understanding the competition and the data
   3. Understanding the Evaluation Metric
   4. Examining the top solution's ideas from Michael Jahrer
   5. Building a LightGBM submission
   6. Setting up a Denoising Auto-encoder and a DNN
   7. Ensembling the results
   8. Summary
3. 02 The Makridakis Competitions: M5 on Kaggle for Accuracy and Uncertainty
   1. Join our book community on Discord
   2. Understanding the competition and the data
   3. Understanding the Evaluation Metric
   4. Examining the 4th place solution's ideas from Monsaraida
   5. Computing predictions for specific dates and time horizons
   6. Assembling public and private predictions
   7. Summary
4. 03 Cassava Leaf Disease competition
   1. Join our book community on Discord
   2. Understanding the data and metrics
   3. Building a baseline model
   4. Learnings from top solutions
      1. Pretraining
      2. TTA
      3. Transformers
      4. Ensembling
   5. Summary

The Kaggle Workbook: Self-learning exercises and valuable
insights for Kaggle data science competitions
Welcome to Packt Early Access. We're giving you an exclusive preview of this book before it goes on sale. It can take many months to write a book, but our authors have cutting-edge information to share with you today. Early Access gives you an insight into the latest developments by making chapter drafts available. The chapters may be a little rough around the edges right now, but our authors will update them over time.

You can dip in and out of this book or follow along from start to finish; Early Access is designed to be flexible. We hope you enjoy getting to know more about the process of writing a Packt book.

1. Chapter 1: The most renowned tabular competition: Porto Seguro's Safe Driver Prediction
2. Chapter 2: The Makridakis competitions: M5 on Kaggle for accuracy and uncertainty
3. Chapter 3: Vision competition: Cassava Leaf Disease classification
4. Chapter 4: NLP competition: Quora insincere questions classification
01 The most renowned tabular competition: Porto Seguro’s
Safe Driver Prediction
Join our book community on Discord

https://packt.link/EarlyAccessCommunity

Learning how to reach the top of the leaderboard in any Kaggle competition requires patience, diligence, and many attempts in order to learn the best way to compete and achieve top results. For this reason, we have designed a workbook that can help you build those skills faster by leading you to try some past Kaggle competitions and to learn how to craft top solutions for them by reading discussions, reusing notebooks, engineering features, and training various models.

We start the book with one of the most renowned tabular competitions, Porto Seguro's Safe Driver Prediction. In this competition, you are asked to solve a common problem in insurance: figuring out who is going to file an auto insurance claim in the next year. Such information is useful in order to increase the insurance fee for drivers more likely to file a claim and to lower it for those less likely to.

In illustrating the key insights and technicalities necessary for cracking this competition, we will show you the necessary code and ask you to study and answer questions about topics covered in the Kaggle Book itself. Therefore, without much more ado, let's start immediately on this new learning path of yours.

In this chapter, you will learn:

How to tune and train a LightGBM model.
How to build a denoising auto-encoder and how to use it to feed a neural network.
How to effectively blend models that are quite different from each other.

Understanding the competition and the data


Porto Seguro is the third largest insurance company in Brazil (it operates in Brazil and Uruguay), offering car insurance coverage as well as many other insurance products. It has used analytical methods and machine learning for the past 20 years in order to tailor its prices and make auto insurance coverage more accessible to more drivers. In order to explore new ways to achieve this task, it sponsored a competition (https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction), expecting Kagglers to come up with new and better methods of solving some of its core analytical problems.

The competition aims at having Kagglers build a model that predicts the probability that a driver will initiate an auto insurance claim in the next year, which is quite a common kind of task (the sponsor mentions it as a "classical challenge for insurance"). For this purpose, the sponsor provided train and test sets, and the competition appears ideal for anyone since the dataset is not very large and it seems very well prepared.
As stated on the page of the competition devoted to presenting the data (https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/data):

"features that belong to similar groupings are tagged as such in the feature names (e.g., ind, reg, car, calc). In addition, feature names include the postfix bin to indicate binary features and cat to indicate categorical features. Features without these designations are either continuous or ordinal. Values of -1 indicate that the feature was missing from the observation. The target column signifies whether or not a claim was filed for that policy holder".

The data preparation for the competition had been done quite carefully so as not to leak any information and, though secrecy about the meaning of the features is maintained, it is quite easy to relate the different tags used to the specific kinds of features commonly used in insurance modeling:

ind refers to "individual characteristics"
car refers to "car characteristics"
calc to "calculated features"
reg to "regional/geographic features"

As for the individual features, there has been much speculation during the competition. See for instance https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/41489 or https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/41488, both by Raddar, or again the attempts to attribute the features to Porto Seguro's online quote form: https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/41057. In spite of all such efforts, in the end the meaning of the features has remained a mystery up until now.

The interesting facts about this competition are that:

1. the data is real world, though the features are anonymized
2. the data is really very well prepared, without leakages of any sort (no magic features here)
3. the test set not only holds the same categorical levels as the train set, it also seems to be from the same distribution, though Yuya Yamamoto argues that pre-processing the data with t-SNE leads to a failing adversarial validation test (https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/44784).

As a first exercise, referring to the contents and the code in the Kaggle Book related to adversarial validation (starting from page 179), prove that train and test data most probably originated from the same data distribution.
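To get you started on the exercise, here is a minimal sketch of the adversarial validation idea, run on synthetic stand-in data rather than the competition files (the function name and the logistic regression choice are our illustrative assumptions, not code from the Kaggle Book): label each row by its origin (train or test) and cross-validate a classifier on that label. An ROC-AUC close to 0.5 means the classifier cannot tell the two sets apart, i.e. they most probably come from the same distribution.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def adversarial_validation_auc(train_df, test_df, seed=0):
    # Stack the two sets and label each row by its origin
    # (0 = train, 1 = test), then cross-validate a classifier on that label
    X = np.vstack([train_df, test_df])
    y = np.hstack([np.zeros(len(train_df)), np.ones(len(test_df))])
    model = LogisticRegression(max_iter=1000, random_state=seed)
    # An AUC near 0.5 means train and test are indistinguishable
    return cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()

rng = np.random.default_rng(0)
train_like = rng.normal(size=(1000, 10))
test_like = rng.normal(size=(1000, 10))  # drawn from the same distribution
print(adversarial_validation_auc(train_like, test_like))  # close to 0.5
```

On the real competition data you would pass the prepared train and test DataFrames instead; a stronger model (such as LightGBM itself) is a common choice, but a linear model already illustrates the mechanics.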

An interesting post by Tilii (Mensur Dlakic, associate professor at Montana State University: https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/42197) demonstrates using t-SNE that "there are many people who are very similar in terms of their insurance parameters, yet some of them will file a claim and others will not". What Tilii mentions is quite typical of what happens in insurance, where the same probability of something happening is attached to certain priors (insurance parameters), but whether that event happens or not depends on how long we observe the situation.

Take for instance IoT and telematics data in insurance. It is quite common to analyze a driver's behavior in order to predict if she or he will file a claim in the future. If your observation period is too short (for instance one year, as in the case of this competition), it may happen that even very bad drivers won't have a claim, because a not-too-high probability of the outcome can become a reality only after a certain amount of time. Similar ideas are discussed by Andy Harless (https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/42735), who argues instead that the real task of the competition is to guess "the value of a latent continuous variable that determines which drivers are more likely to have accidents" because actually "making a claim is not a characteristic of a driver; it's a result of chance".
Understanding the Evaluation Metric
The metric used in the competition is the "normalized Gini coefficient" (named after the similar Gini coefficient/index used in economics), which had previously been used in another competition, the Allstate Claim Prediction Challenge (https://www.kaggle.com/competitions/ClaimPredictionChallenge). From that competition, we can get a very clear explanation about what this metric is about:

When you submit an entry, the observations are sorted from "largest prediction" to "smallest prediction". This is the only step where your predictions come into play, so only the order determined by your predictions matters. Visualize the observations arranged from left to right, with the largest predictions on the left. We then move from left to right, asking "In the leftmost x% of the data, how much of the actual observed loss have you accumulated?" With no model, you can expect to accumulate 10% of the loss in 10% of the predictions, so no model (or a "null" model) achieves a straight line. We call the area between your curve and this straight line the Gini coefficient.

There is a maximum achievable area for a "perfect" model. We will use the normalized Gini coefficient by dividing the Gini coefficient of your model by the Gini coefficient of the perfect model.
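To make the quoted description concrete, here is a small pure-Python sketch that follows it literally: sort the outcomes by descending prediction, accumulate the observed losses while moving left to right, subtract the straight line of the null model, and finally normalize by the perfect model's Gini. The function names are ours, for illustration; a faster implementation is referenced later in the chapter.

```python
def gini(actual, pred):
    # Sort the actual outcomes from largest to smallest prediction
    ordered = [a for _, a in sorted(zip(pred, actual), key=lambda x: -x[0])]
    total_losses = sum(ordered)
    # Accumulate the observed loss while moving from left to right
    cum, gini_sum = 0.0, 0.0
    for a in ordered:
        cum += a
        gini_sum += cum
    # Subtract the straight line achieved by the "null" model
    return (gini_sum / total_losses - (len(actual) + 1) / 2.0) / len(actual)

def gini_normalized(actual, pred):
    # Divide by the Gini of the perfect model (predicting the actuals)
    return gini(actual, pred) / gini(actual, actual)

print(gini_normalized([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))  # 1.0: perfect order
print(gini_normalized([0, 0, 1, 1], [0.9, 0.8, 0.2, 0.1]))  # -1.0: reversed order
```

Note that only the ranking of the predictions matters: rescaling them (for example multiplying every prediction by 10) leaves the score unchanged.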

Another good explanation is provided in the competition by the notebook by Kilian Batzner: https://www.kaggle.com/code/batzner/gini-coefficient-an-intuitive-explanation. By means of plots and toy examples, it tries to give a sense of a metric that is not so common outside the actuarial departments of insurance companies.

In chapter 5 of the Kaggle Book (pages 95 onward), we explained how to deal with competition metrics, especially if they are new and generally unknown. As an exercise, can you find out how many competitions on Kaggle have used the normalized Gini coefficient as an evaluation metric?

The metric can be approximated by the Mann–Whitney U non-parametric statistical test and by the ROC-AUC score because it approximately corresponds to 2 * ROC-AUC - 1. Hence, maximizing the ROC-AUC is the same as maximizing the normalized Gini coefficient (for a reference, see "Relation to other statistical measures" in the Wikipedia entry: https://en.wikipedia.org/wiki/Gini_coefficient).

The metric can also be approximately expressed as the covariance of the scaled prediction rank and the scaled target value, resulting in a more understandable rank association measure (see Dmitriy Guller: https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/40576).
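Both relations can be checked numerically. The pure-Python sketch below (our own illustrative helpers, not code from the competition) computes the normalized Gini coefficient as a ratio of covariances with scaled ranks, and the ROC-AUC as a Mann–Whitney-style pairwise comparison; on tie-free binary data the two line up as 2 * ROC-AUC - 1.

```python
def scaled_rank(values):
    # Rank each value ascending (ties broken by position), scaled into (0, 1]
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r / len(values)
    return ranks

def cov(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def gini_from_covariance(y_true, y_pred):
    # Normalized Gini as a ratio of covariances with scaled ranks
    return cov(y_true, scaled_rank(y_pred)) / cov(y_true, scaled_rank(y_true))

def auc_mann_whitney(y_true, y_pred):
    # Probability that a random positive outranks a random negative
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == 0]
    wins = sum(1.0 if p > n else (0.5 if p == n else 0.0)
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y = [0, 1, 0, 1, 1, 0]
p = [0.3, 0.2, 0.1, 0.8, 0.65, 0.4]
print(gini_from_covariance(y, p))       # the two values match (5/9 here)
print(2 * auc_mann_whitney(y, p) - 1)
```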

From the point of view of the objective function, you can optimize for the binary log loss (as you would do in a classification problem). Neither ROC-AUC nor the normalized Gini coefficient is differentiable, and they may be used only for metric evaluation on the validation set (for instance for early stopping or for reducing the learning rate in a neural network). However, optimizing for the log loss doesn't always improve the ROC-AUC and the normalized Gini coefficient. There is actually a differentiable ROC-AUC approximation (Calders, Toon, and Szymon Jaroszewicz. "Efficient AUC optimization for classification." European conference on principles of data mining and knowledge discovery. Springer, Berlin, Heidelberg, 2007. https://link.springer.com/content/pdf/10.1007/978-3-540-74976-9_8.pdf). However, it seems that it is not necessary to use anything different from log loss as the objective function, with ROC-AUC or the normalized Gini coefficient as the evaluation metric, in this competition.

There are actually a few Python implementations among the notebooks. We have used here, and we suggest, the work by CPMP (https://www.kaggle.com/code/cpmpml/extremely-fast-gini-computation/notebook) that uses Numba for speeding up computations: it is both exact and fast.

Examining the top solution’s ideas from Michael Jahrer


Michael Jahrer (https://www.kaggle.com/mjahrer, competitions grandmaster and one of the winners of the Netflix Prize in the team "BellKor's Pragmatic Chaos") led the public leaderboard for a long time and by a fair margin during the competition and, when the private one was finally disclosed, was declared the winner.

Shortly after, in the discussion forum, he published a short summary of his solution that has become a reference for many Kagglers because of his smart usage of denoising autoencoders and neural networks (https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/44629). Although Michael hasn't accompanied his post with any Python code regarding his solution (he described it as an "old school" and "low level" one, being directly written in C++/CUDA with no Python), his writing is quite rich in references to what models he used as well as to their hyper-parameters and architectures.

First, Michael explains that his solution is composed of a blend of six models (one LightGBM model and five neural networks). Moreover, since no advantage could be gained by weighting the contributions of each model to the blend (nor by doing linear or non-linear stacking), likely because of overfitting, he states that he resorted to just a plain blend of models (an arithmetic mean) that had been built from different seeds.

Such insight makes the task much easier for us to replicate his approach, also because he mentions that just blending together the LightGBM's results with those from one of the neural networks he built would have been enough to guarantee first place in the competition. That will limit our exercise work to two good single models instead of a host of them. In addition, he mentioned having done little data processing, besides dropping some columns and one-hot encoding categorical features.
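As a preview of the approach, blending by arithmetic mean is as simple as it sounds. The numbers below are hypothetical predictions from three differently seeded runs of the same model, not actual competition outputs:

```python
import numpy as np

def blend_predictions(prediction_lists):
    # Plain arithmetic mean across models: Jahrer reported that weighted
    # or stacked combinations gave no advantage, likely due to overfitting
    return np.mean(np.vstack(prediction_lists), axis=0)

# Hypothetical predicted probabilities from three seeds of the same model
preds_by_seed = [
    np.array([0.10, 0.80, 0.30]),
    np.array([0.12, 0.76, 0.28]),
    np.array([0.08, 0.84, 0.32]),
]
blended = blend_predictions(preds_by_seed)
print(blended)  # ≈ [0.1 0.8 0.3]
```

Averaging over seeds mostly cancels the variance introduced by random initialization and sampling, which is why it often scores better than any single run.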

Building a LightGBM submission


Our exercise starts by working out a solution based on LightGBM. You can find the code already set for execution in the Kaggle Notebook at this address: https://www.kaggle.com/code/lucamassaron/workbook-lgb. Although we made the code readily available, we suggest you instead type or copy the code directly from the book and execute it cell by cell, understanding what each line of code does; you could furthermore personalize the solution and have it perform even better.

We start by importing the key packages (NumPy, pandas, Optuna for hyper-parameter optimization, LightGBM, and some utility functions). We also define a configuration class and instantiate it. We will discuss the parameters defined in the configuration class during the exploration of the code as we progress. What is important to remark here is that, by using a class containing all your parameters, it will be easier for you to modify them in a consistent way along the code. In the heat of a competition, it is easy to forget to update a parameter that is referred to in multiple places in the code, and it is always difficult to set the parameters when they are dispersed among cells and functions. A configuration class can save you a lot of effort and spare you mistakes along the way.

import numpy as np
import pandas as pd
import optuna
import lightgbm as lgb
from pathlib import Path
from sklearn.model_selection import StratifiedKFold

class Config:
    input_path = Path('../input/porto-seguro-safe-driver-prediction')
    optuna_lgb = False
    n_estimators = 1500
    early_stopping_round = 150
    cv_folds = 5
    random_state = 0
    params = {'objective': 'binary',
              'boosting_type': 'gbdt',
              'learning_rate': 0.01,
              'max_bin': 25,
              'num_leaves': 31,
              'min_child_samples': 1500,
              'colsample_bytree': 0.7,
              'subsample_freq': 1,
              'subsample': 0.7,
              'reg_alpha': 1.0,
              'reg_lambda': 1.0,
              'verbosity': 0,
              'random_state': 0}

config = Config()

The next step requires importing the train, test, and sample submission datasets. In doing so with the pandas CSV reading function, we also set the index of the uploaded DataFrames to the identifier (the 'id' column) of each data example.

Since features that belong to similar groupings are tagged (using the ind, reg, car, calc tags in their labels) and binary and categorical features are also easy to locate (they use the bin and cat tags, respectively, in their labels), we can enumerate them and record them into lists.

train = pd.read_csv(config.input_path / 'train.csv', index_col='id')


test = pd.read_csv(config.input_path / 'test.csv', index_col='id')
submission = pd.read_csv(config.input_path / 'sample_submission.csv', index_col='id')
calc_features = [feat for feat in train.columns if "_calc" in feat]
cat_features = [feat for feat in train.columns if "_cat" in feat]

Then, we just extract the target (a binary target of 0s and 1s) and remove it from the train dataset.

target = train["target"]
train = train.drop("target", axis="columns")

At this point, as pointed out by Michael Jahrer, we can drop the calc features. This idea recurred a lot during the competition (https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/41970), especially in notebooks, both because it could be empirically verified that dropping them improved the public leaderboard score and because they performed poorly in gradient boosting models (their importance is always below the average). We can argue that, since they are engineered features, they do not contain new information with respect to their origin features, but they just add noise to any model trained on them.

train = train.drop(calc_features, axis="columns")


test = test.drop(calc_features, axis="columns")

During the competition, Tilii tested feature elimination using Boruta (https://github.com/scikit-learn-contrib/boruta_py). You can find his kernel here: https://www.kaggle.com/code/tilii7/boruta-feature-elimination/notebook. As you can check, there is no calc_feature considered a confirmed feature by Boruta.

Exercise: based on the suggestions provided in the Kaggle Book on page 220 ("Using feature importance to evaluate your work"), code your own feature selection notebook for this competition and check what features should be kept and what should be discarded.
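As one possible starting point for the exercise, here is a sketch using permutation importance on synthetic stand-in data (the real notebook would run on the competition's train set, and could use the Gini metric instead of the ROC-AUC scorer assumed here): features whose shuffling barely moves the validation score are candidates for removal.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: feature 0 is informative, feature 1 is pure noise
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + 0.3 * rng.normal(size=1000) > 0).astype(int)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Permutation importance: how much the validation score drops when a
# feature's values are shuffled; a near-zero drop suggests a droppable feature
result = permutation_importance(model, X_val, y_val, scoring="roc_auc",
                                n_repeats=10, random_state=0)
print(result.importances_mean)  # feature 0 should dominate feature 1
```

Unlike the built-in gain-based importances of tree models, permutation importance is measured on held-out data, so it penalizes features that only help the model overfit.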

Categorical features are instead one-hot encoded. Since we want to retain their labels, and since the same levels are present both in train and test (the result of a careful train/test split arranged by the Porto Seguro team), instead of the usual Scikit-learn OneHotEncoder (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) we use the pandas get_dummies function (https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html). Since the pandas function may produce different encodings if the features and their levels differ from train to test set, we assert a check that the one-hot encoding results in the same columns for both.

train = pd.get_dummies(train, columns=cat_features)


test = pd.get_dummies(test, columns=cat_features)
assert((train.columns==test.columns).all())
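If the assertion ever failed, for instance because a category level appears in train but not in test, reindexing test on the train columns is a common fix. A small self-contained illustration on a toy column (not the competition's data):

```python
import pandas as pd

# Toy example: level "c" exists only in the train-side frame
train_d = pd.get_dummies(pd.DataFrame({"cat": ["a", "b", "c"]}), columns=["cat"])
test_d = pd.get_dummies(pd.DataFrame({"cat": ["a", "b"]}), columns=["cat"])

# Realign test on the train columns, filling missing dummies with zeros
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)
assert (train_d.columns == test_d.columns).all()
print(list(test_d.columns))  # ['cat_a', 'cat_b', 'cat_c']
```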

After one-hot encoding the categorical features, we have completed all the data processing. We proceed to define our evaluation metric, the normalized Gini coefficient, as previously discussed. Since we are going to use a LightGBM model, we have to add a suitable wrapper (gini_lgb) in order to return to the GBM algorithm the evaluation of the training and the validation sets in a form that works with it (see https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.Booster.html?highlight=higher_better#lightgbm.Booster.eval: "Each evaluation function should accept two parameters: preds, eval_data, and return (eval_name, eval_result, is_higher_better) or list of such tuples").

from numba import jit

@jit
def eval_gini(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_true = y_true[np.argsort(y_pred)]
    ntrue = 0
    gini = 0
    delta = 0
    n = len(y_true)
    for i in range(n-1, -1, -1):
        y_i = y_true[i]
        ntrue += y_i
        gini += y_i * delta
        delta += 1 - y_i
    gini = 1 - 2 * gini / (ntrue * (n - ntrue))
    return gini

def gini_lgb(y_true, y_pred):
    eval_name = 'normalized_gini_coef'
    eval_result = eval_gini(y_true, y_pred)
    is_higher_better = True
    return eval_name, eval_result, is_higher_better

As for the training parameters, we found that the parameters suggested by Michael Jahrer in his post (https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/44629) work perfectly. You may also try to come up with the same parameters, or with similarly performing ones, by running a search with Optuna (https://optuna.org/) if you set the optuna_lgb flag to True in the config class. Here the optimization tries to find the best values for key parameters, such as the learning rate and the regularization parameters, based on a 5-fold cross-validation test on the training data. In order to speed things up, early stopping on the validation set itself is taken into account (which, we are aware, could actually advantage some values that can better overfit the validation fold; a good alternative could be to remove the early stopping callback and keep a fixed number of rounds for the training).
if config.optuna_lgb:

    def objective(trial):
        params = {
            'learning_rate': trial.suggest_float("learning_rate", 0.01, 1.0),
            'num_leaves': trial.suggest_int("num_leaves", 3, 255),
            'min_child_samples': trial.suggest_int("min_child_samples",
                                                   3, 3000),
            'colsample_bytree': trial.suggest_float("colsample_bytree",
                                                    0.1, 1.0),
            'subsample_freq': trial.suggest_int("subsample_freq", 0, 10),
            'subsample': trial.suggest_float("subsample", 0.1, 1.0),
            'reg_alpha': trial.suggest_float("reg_alpha", 1e-9, 10.0, log=True),
            'reg_lambda': trial.suggest_float("reg_lambda", 1e-9, 10.0, log=True),
        }

        score = list()
        skf = StratifiedKFold(n_splits=config.cv_folds, shuffle=True,
                              random_state=config.random_state)
        for train_idx, valid_idx in skf.split(train, target):
            X_train = train.iloc[train_idx]
            y_train = target.iloc[train_idx]
            X_valid = train.iloc[valid_idx]
            y_valid = target.iloc[valid_idx]
            model = lgb.LGBMClassifier(**params,
                                       n_estimators=1500,
                                       early_stopping_round=150,
                                       force_row_wise=True)
            callbacks = [lgb.early_stopping(stopping_rounds=150,
                                            verbose=False)]
            model.fit(X_train, y_train,
                      eval_set=[(X_valid, y_valid)],
                      eval_metric=gini_lgb, callbacks=callbacks)

            score.append(
                model.best_score_['valid_0']['normalized_gini_coef'])
        return np.mean(score)

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=300)
    print("Best Gini Normalized Score", study.best_value)
    print("Best parameters", study.best_params)

    params = {'objective': 'binary',
              'boosting_type': 'gbdt',
              'verbosity': 0,
              'random_state': 0}
    params.update(study.best_params)

else:
    params = config.params


In the Kaggle Book, we explain hyper-parameter optimization (pages 241 onward) and provide some key hyper-parameters for the LightGBM model. As an exercise, try to improve the hyper-parameter search by reducing or increasing the explored parameters and by trying alternative methods, such as the random search or the halving search from Scikit-learn (pages 245-246).

Once we have our best parameters (or we simply use Jahrer's ones), we are ready to train and predict. Our strategy, as suggested by the best solution, is to train a model on each cross-validation fold and use that fold's model to contribute to an average of test predictions. The snippet of code will produce both the test predictions and the out-of-fold predictions on the train set, the latter useful for figuring out how to ensemble the results.

preds = np.zeros(len(test))
oof = np.zeros(len(train))
metric_evaluations = list()
skf = StratifiedKFold(n_splits=config.cv_folds, shuffle=True, random_state=config.random_state)
for idx, (train_idx, valid_idx) in enumerate(skf.split(train,
target)):
print(f"CV fold {idx}")
X_train, y_train = train.iloc[train_idx], target.iloc[train_idx]
X_valid, y_valid = train.iloc[valid_idx], target.iloc[valid_idx]

model = lgb.LGBMClassifier(**params,
n_estimators=config.n_estimators,
early_stopping_round=config.early_stopping_round,
force_row_wise=True)

callbacks=[lgb.early_stopping(stopping_rounds=150),
lgb.log_evaluation(period=100, show_stdv=False)]

model.fit(X_train, y_train,
eval_set=[(X_valid, y_valid)],
eval_metric=gini_lgb, callbacks=callbacks)
metric_evaluations.append(
model.best_score_['valid_0']['normalized_gini_coef'])
preds += (model.predict_proba(test,
num_iteration=model.best_iteration_)[:,1]
/ skf.n_splits)
oof[valid_idx] = model.predict_proba(X_valid,
num_iteration=model.best_iteration_)[:,1]

The model training shouldn't take too long. In the end, you will get a report of the normalized Gini coefficient obtained during the cross-validation procedure.

print(f"LightGBM CV normalized Gini coefficient: {np.mean(metric_evaluations):0.3f} ({np.std(metric_evaluations):0.3f})")

The results are quite encouraging, because the average score is 0.289 and the standard deviation of the values is quite small.

LightGBM CV Gini Normalized Score: 0.289 (0.015)


All that is left is to save the out-of-fold and the test predictions as a submission and to verify the results on the public and private leaderboards.

submission['target'] = preds
submission.to_csv('lgb_submission.csv')
oofs = pd.DataFrame({'id':train_index, 'target':oof})
oofs.to_csv('lgb_oof.csv', index=False)

The obtained public score should be around 0.28442. The associated private score is about 0.29121, placing you in the 30th position on the final leaderboard. A quite good result, but we still have to blend it with a different model, a neural network.

Bagging the training set (i.e., taking multiple bootstraps of the training data and training multiple models based on the bootstraps) should increase the performance, though, as Michael Jahrer himself noted in his post, not by all that much.

Setting up a Denoising Auto-encoder and a DNN


The next step is to set up a denoising auto-encoder (DAE) and a neural network that can learn and predict from it. You can find the running code in this notebook: https://www.kaggle.com/code/lucamassaron/workbook-dae. The notebook can be run in GPU mode (speedier), but it can also run on CPU with some slight modifications.

You can read more about denoising auto-encoders as used in Kaggle competitions in the Kaggle Book, on pages 226 and following.

Actually, there are no examples around reproducing Michael Jahrer's approach in this competition using DAEs, but we took example from a working TensorFlow implementation in another competition by OsciiArt (https://www.kaggle.com/code/osciiart/denoising-autoencoder).

Here we start by importing all the necessary packages, especially TensorFlow and Keras. Since we are going to create multiple neural networks, we tell TensorFlow not to use all the GPU memory available by means of the experimental set_memory_growth command. This will avoid memory overflow problems along the way. We also register the LeakyReLU activation as a custom one, so we can refer to it just by a string in Keras layers.

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from path import Path
import gc
import optuna
from sklearn.model_selection import StratifiedKFold
from scipy.special import erfinv
import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
tf.config.experimental.set_memory_growth(gpu, True)
from tensorflow import keras
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Input, Dense, BatchNormalization, Dropout
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.regularizers import l2
from tensorflow.keras.metrics import AUC
from tensorflow.keras.utils import get_custom_objects
from tensorflow.keras.layers import Activation, LeakyReLU
get_custom_objects().update({'leaky-relu': Activation(LeakyReLU(alpha=0.2))})

In line with our intention of creating multiple neural networks without running out of memory, we also define a simple function for cleaning up the GPU memory and removing models that are no longer needed.

def gpu_cleanup(objects):
if objects:
del(objects)
K.clear_session()
gc.collect()

We also reconfigure the Config class to take into account the multiple parameters related to the denoising auto-encoder and the neural network. As previously stated for the LightGBM, having all the parameters in a single place makes it simpler to modify them in a consistent way.

class Config:
input_path = Path('../input/porto-seguro-safe-driver-prediction')
dae_batch_size = 128
dae_num_epoch = 50
dae_architecture = [1500, 1500, 1500]
reuse_autoencoder = False
batch_size = 128
num_epoch = 150
units = [64, 32]
input_dropout=0.06
dropout=0.08
regL2=0.09
activation='selu'

cv_folds = 5
nas = False
random_state = 0

config = Config()

As previously done, we load the datasets and process the features by removing the calc features and one-hot encoding the categorical ones. We leave missing cases valued at -1, as Michael Jahrer pointed out in his solution.

train = pd.read_csv(config.input_path / 'train.csv', index_col='id')


test = pd.read_csv(config.input_path / 'test.csv', index_col='id')
submission = pd.read_csv(config.input_path / 'sample_submission.csv', index_col='id')
calc_features = [feat for feat in train.columns if "_calc" in feat]
cat_features = [feat for feat in train.columns if "_cat" in feat]
target = train["target"]
train = train.drop("target", axis="columns")
train = train.drop(calc_features, axis="columns")
test = test.drop(calc_features, axis="columns")
train = pd.get_dummies(train, columns=cat_features)
test = pd.get_dummies(test, columns=cat_features)
assert((train.columns==test.columns).all())
However, differently from our previous approach, we have to rescale all of the features that are not binary or one-hot encoded. Rescaling allows the optimization algorithms of both the auto-encoder and the neural network to converge to a good solution faster, because they work with values set into a comparable and predefined range. Instead of using statistical normalization, we apply GaussRank, a procedure that also transforms the distribution of the rescaled variables into a Gaussian one.

As stated in some papers, such as the batch normalization paper (https://arxiv.org/pdf/1502.03167.pdf), neural networks perform even better if you provide them with a Gaussian input. According to this NVIDIA blog post, https://developer.nvidia.com/blog/gauss-rank-transformation-is-100x-faster-with-rapids-and-cupy/, GaussRank works most of the time, except when features are already normally distributed or extremely asymmetrical (in such cases, applying the transformation may lead to worsened performance).

print("Applying GaussRank to columns: ", end='')


to_normalize = list()
for k, col in enumerate(train.columns):
if '_bin' not in col and '_cat' not in col and '_missing' not in col:
to_normalize.append(col)
print(to_normalize)
def to_gauss(x): return np.sqrt(2) * erfinv(x)
def normalize(data, norm_cols):
n = data.shape[0]
for col in norm_cols:
sorted_idx = data[col].sort_values().index.tolist()
uniform = np.linspace(start=-0.99, stop=0.99, num=n)
normal = to_gauss(uniform)
normalized_col = pd.Series(index=sorted_idx, data=normal)
data[col] = normalized_col
return data
train = normalize(train, to_normalize)
test = normalize(test, to_normalize)
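To see the effect of the transformation in isolation, the same rank-based mapping can be applied to a single skewed synthetic column; the output distribution is approximately standard normal. The check below is a small, self-contained mirror of the normalize logic above, on made-up data:

```python
import numpy as np
import pandas as pd
from scipy.special import erfinv

# A heavily skewed input column
rng = np.random.default_rng(0)
skewed = pd.Series(rng.exponential(scale=2.0, size=1000))

# GaussRank: map the sorted values onto Gaussian quantiles
uniform = np.linspace(start=-0.99, stop=0.99, num=len(skewed))
normal = np.sqrt(2) * erfinv(uniform)
transformed = pd.Series(index=skewed.sort_values().index,
                        data=normal).sort_index()

# Roughly zero mean and (slightly truncated) unit standard deviation
print(transformed.mean(), transformed.std())
```

The standard deviation comes out a little below 1 because the grid stops at ±0.99, which truncates the extreme Gaussian tails.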

We apply the GaussRank transformation separately to the train and test features, on all the numeric features of our dataset:

Applying GaussRank to columns: ['ps_ind_01', 'ps_ind_03', 'ps_ind_14', 'ps_ind_15

Once the features are normalized, we simply turn our data into a NumPy array of float32 values, the ideal input for a GPU.

features = train.columns
train_index = train.index
test_index = test.index
train = train.values.astype(np.float32)
test = test.values.astype(np.float32)

Next, we prepare some useful functions: the evaluation function (the normalized Gini coefficient) and a plotting function helpful for representing a Keras model's history of fitting on both the training and validation sets.

def plot_keras_history(history, measures):


rows = len(measures) // 2 + len(measures) % 2
fig, panels = plt.subplots(rows, 2, figsize=(15, 5))
plt.subplots_adjust(top = 0.99, bottom=0.01,
hspace=0.4, wspace=0.2)
try:
panels = [item for sublist in panels for item in sublist]
except:
pass
for k, measure in enumerate(measures):
panel = panels[k]
panel.set_title(measure + ' history')
panel.plot(history.epoch, history.history[measure],
label="Train "+measure)
try:
panel.plot(history.epoch,
history.history["val_"+measure],
label="Validation "+measure)
except:
pass
panel.set(xlabel='epochs', ylabel=measure)
panel.legend()

plt.show(fig)
from numba import jit
@jit
def eval_gini(y_true, y_pred):
y_true = np.asarray(y_true)
y_true = y_true[np.argsort(y_pred)]
ntrue = 0
gini = 0
delta = 0
n = len(y_true)
for i in range(n-1, -1, -1):
y_i = y_true[i]
ntrue += y_i
gini += y_i * delta
delta += 1 - y_i
gini = 1 - 2 * gini / (ntrue * (n - ntrue))
return gini
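Since the normalized Gini coefficient equals 2 × AUC − 1, the pairwise-counting implementation above can be cross-checked against Scikit-Learn's ROC-AUC. The check below re-states the function without the numba @jit decorator so that it is self-contained, and uses random data for the comparison:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def eval_gini(y_true, y_pred):
    # Same algorithm as above, minus the @jit decorator
    y_true = np.asarray(y_true)
    y_true = y_true[np.argsort(y_pred)]
    ntrue, gini, delta = 0, 0, 0
    n = len(y_true)
    for i in range(n - 1, -1, -1):
        y_i = y_true[i]
        ntrue += y_i
        gini += y_i * delta   # count discordant (positive, negative) pairs
        delta += 1 - y_i
    return 1 - 2 * gini / (ntrue * (n - ntrue))

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
p = rng.random(size=500)
print(eval_gini(y, p), 2 * roc_auc_score(y, p) - 1)
```

The two values coincide because the loop counts exactly the discordant pairs that ROC-AUC penalizes (continuous random predictions guarantee there are no ties).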

The next functions are a bit more complex and are more closely related to the functioning of both the denoising auto-encoder and the supervised neural network. The batch_generator is a function that creates a generator providing shuffled chunks of the data based on a batch size. It isn't actually used as a stand-alone generator but as part of a more complex batch generator that we are soon going to describe, the mixup_generator.

def batch_generator(x, batch_size, shuffle=True, random_state=None):


batch_index = 0
n = x.shape[0]
while True:
if batch_index == 0:
index_array = np.arange(n)
if shuffle:
np.random.seed(seed=random_state)
index_array = np.random.permutation(n)
current_index = (batch_index * batch_size) % n
if n >= current_index + batch_size:
current_batch_size = batch_size
batch_index += 1
else:
current_batch_size = n - current_index
batch_index = 0
batch = x[index_array[current_index: current_index + current_batch_size]]
yield batch
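A quick sanity check of the epoch logic: with ten rows and a batch size of four, one epoch consists of three batches (4 + 4 + 2 rows), and every row is yielded exactly once. The check re-states the generator above so that it is self-contained:

```python
import numpy as np

def batch_generator(x, batch_size, shuffle=True, random_state=None):
    # Same logic as the snippet above
    batch_index = 0
    n = x.shape[0]
    while True:
        if batch_index == 0:
            index_array = np.arange(n)
            if shuffle:
                np.random.seed(seed=random_state)
                index_array = np.random.permutation(n)
        current_index = (batch_index * batch_size) % n
        if n >= current_index + batch_size:
            current_batch_size = batch_size
            batch_index += 1
        else:
            # Last, smaller batch of the epoch; reset for a reshuffle
            current_batch_size = n - current_index
            batch_index = 0
        yield x[index_array[current_index: current_index + current_batch_size]]

X = np.arange(10).reshape(10, 1)
gen = batch_generator(X, batch_size=4, shuffle=True, random_state=0)
epoch = np.vstack([next(gen) for _ in range(3)])  # 4 + 4 + 2 rows
print(sorted(epoch.ravel().tolist()))
```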

The mixup_generator, another generator, returns batches of data whose values have been partially swapped, in order to create some noise and augment the data so that the denoising auto-encoder (DAE) does not overfit the train set. It works based on a swap rate, fixed at 15% of the features, as suggested by Michael Jahrer.

The function generates two distinct batches of data: one to be released to the model and another to be used as a source of values to be swapped into the released batch. At each batch, based on a random choice whose base probability is the swap rate, a certain number of features is chosen to be swapped between the two batches.

This ensures that the DAE cannot always rely on the same features (since they can be randomly swapped from time to time) but has to pay attention to all of the features (something similar to dropout, in a certain sense) in order to find relations between them and correctly reconstruct the data at the end of the process.

def mixup_generator(X, batch_size, swaprate=0.15, shuffle=True, random_state=None):


if random_state is None:
random_state = np.random.randint(0, 999)
num_features = X.shape[1]
num_swaps = int(num_features * swaprate)
generator_a = batch_generator(X, batch_size, shuffle,
random_state)
generator_b = batch_generator(X, batch_size, shuffle,
random_state + 1)
while True:
batch = next(generator_a)
mixed_batch = batch.copy()
effective_batch_size = batch.shape[0]
alternative_batch = next(generator_b)
assert((batch != alternative_batch).any())
for i in range(effective_batch_size):
swap_idx = np.random.choice(num_features, num_swaps,
replace=False)
mixed_batch[i, swap_idx] = alternative_batch[i, swap_idx]
yield (mixed_batch, batch)
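The swapping mechanics can be verified in isolation: with a 15% swap rate over 100 features, exactly 15 entries per row differ between the released batch and the original. The stand-alone check below replays the inner loop of the generator on made-up batches:

```python
import numpy as np

rng = np.random.default_rng(0)
num_features, swaprate = 100, 0.15
batch = rng.random((32, num_features))
alternative_batch = rng.random((32, num_features))
num_swaps = int(num_features * swaprate)

# Same per-row swapping as in mixup_generator
mixed_batch = batch.copy()
for i in range(len(batch)):
    swap_idx = rng.choice(num_features, num_swaps, replace=False)
    mixed_batch[i, swap_idx] = alternative_batch[i, swap_idx]

# Fraction of entries that changed: exactly the swap rate
changed_fraction = (mixed_batch != batch).mean()
print(changed_fraction)
```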

get_DAE is the function that builds the denoising auto-encoder. It accepts a parameter defining the architecture, which in our case has been set to three layers of 1,500 nodes each (as suggested by Michael Jahrer's solution). The first layer should act as an encoder; the second is a bottleneck layer ideally containing the latent features capable of expressing the information in the data; the last is a decoding layer capable of reconstructing the initial input data. The three layers have a relu activation function and no bias, and each one is followed by a batch normalization layer. The final output with the reconstructed input data has a linear activation. The training uses an adam optimizer with standard settings (the optimized cost function is the mean squared error, mse).
def get_DAE(X, architecture=[1500, 1500, 1500]):
features = X.shape[1]
inputs = Input((features,))
for i, nodes in enumerate(architecture):
layer = Dense(nodes, activation='relu',
use_bias=False, name=f"code_{i+1}")
if i==0:
x = layer(inputs)
else:
x = layer(x)
x = BatchNormalization()(x)
outputs = Dense(features, activation='linear')(x)
model = Model(inputs=inputs, outputs=outputs)
model.compile(optimizer='adam', loss='mse',
metrics=['mse', 'mae'])
return model

The extract_dae_features function is reported here only for educational purposes. The function helps in extracting the values of specific layers of the trained denoising auto-encoder. The extraction works by building a new model combining the DAE input layer and the desired output one. A simple predict will then extract the values we need (the predict also allows fixing the preferred batch size in order to fit any memory requirement).

In the case of the competition, given the number of observations and the number of features to be taken out of the auto-encoder, if we were to use this function, the resulting dense matrix would be too large to be handled by the memory of a Kaggle notebook. For this reason, our strategy won't be to transform the original data into the auto-encoder node values of the bottleneck layer, but to fuse the auto-encoder, with its layers frozen up to the bottleneck, with the supervised neural network, as we will discuss soon.

def extract_dae_features(autoencoder, X, layers=[3]):


data = []
for layer in layers:
if layer==0:
data.append(X)
else:
get_layer_output = Model([autoencoder.layers[0].input],
[autoencoder.layers[layer].output])
layer_output = get_layer_output.predict(X,
batch_size=128)
data.append(layer_output)
data = np.hstack(data)
return data

To complete the work with the DAE, we have a final function wrapping all the previous ones into an unsupervised training procedure (at least partially unsupervised, since there is an early-stop monitor set on a validation set). The function sets up the mix-up generator, creates the denoising auto-encoder architecture, and then trains it, monitoring its fit on a validation set for an early stop if there are signs of overfitting. Finally, before returning the trained DAE, it plots a graph of the training and validation fit and stores the model on disk.

Even if we try to fix a seed on this model, contrary to the LightGBM model, the results are extremely variable, and they may influence the final ensemble results. Though the result will be a high-scoring one, it may land higher or lower on the private leaderboard. However, the results obtained in cross-validation are strongly correlated with the public leaderboard, and it will be easy for you to always pick the best final submission based on its public results.

def autoencoder_fitting(X_train, X_valid, filename='dae',
random_state=None, suppress_output=False):
if suppress_output:
verbose = 0
else:
verbose = 2
print("Fitting a denoising autoencoder")
tf.random.set_seed(seed=random_state)
generator = mixup_generator(X_train,
batch_size=config.dae_batch_size,
swaprate=0.15,
random_state=config.random_state)

dae = get_DAE(X_train, architecture=config.dae_architecture)


steps_per_epoch = np.ceil(X_train.shape[0] /
config.dae_batch_size)
early_stopping = EarlyStopping(monitor='val_mse',
mode='min',
patience=5,
restore_best_weights=True,
verbose=0)
history = dae.fit(generator,
steps_per_epoch=steps_per_epoch,
epochs=config.dae_num_epoch,
validation_data=(X_valid, X_valid),
callbacks=[early_stopping],
verbose=verbose)
if not suppress_output: plot_keras_history(history,
measures=['mse', 'mae'])
dae.save(filename)
return dae

Having dealt with the DAE, we take the chance to also define the supervised neural model down the line that should predict our claim expectations. As a first step, we define a function building a single layer of the network:

Random normal initialization, since empirically it has been found to converge to better results in this problem
A dense layer with L2 regularization and a parametrizable activation function
An excludable and tunable dropout

Here is the code for creating the dense blocks:

def dense_blocks(x, units, activation, regL2, dropout):


kernel_initializer = keras.initializers.RandomNormal(mean=0.0,
stddev=0.1, seed=config.random_state)
for k, layer_units in enumerate(units):
if regL2 > 0:
x = Dense(layer_units, activation=activation,
kernel_initializer=kernel_initializer,
kernel_regularizer=l2(regL2))(x)
else:
x = Dense(layer_units,
kernel_initializer=kernel_initializer,
activation=activation)(x)
if dropout > 0:
x = Dropout(dropout)(x)
return x

As you may have already noticed, the function defining the single layer is quite customizable. The same goes for the wrapper architecture function, taking inputs for the number of layers and the units in them, dropout probabilities, regularization, and activation type. The idea is to be able to run a Neural Architecture Search (NAS) and figure out what configuration should perform better on our problem.

As a final note on the function, among its inputs you are required to provide the trained DAE, because the DAE's inputs are used as the neural network model's inputs, while the network's first layers are connected to the DAE's outputs. In this way, we are de facto concatenating the two models into one (the DAE weights are frozen and not trainable, though). This solution has been devised to avoid having to transform all your training data: only the single batches that the neural network is processing are transformed, thus saving memory in the system.

def dnn_model(dae, units=[4500, 1000, 1000],
input_dropout=0.1, dropout=0.5,
regL2=0.05,
activation='relu'):

inputs = dae.get_layer("code_2").output
if input_dropout > 0:
x = Dropout(input_dropout)(inputs)
else:
x = tf.keras.layers.Layer()(inputs)
x = dense_blocks(x, units, activation, regL2, dropout)
outputs = Dense(1, activation='sigmoid')(x)
model = Model(inputs=dae.input, outputs=outputs)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
loss=keras.losses.binary_crossentropy,
metrics=[AUC(name='auc')])
return model

We conclude with a wrapper for the training process, including all the steps needed to train the entire pipeline on a cross-validation fold.

def model_fitting(X_train, y_train, X_valid, y_valid, autoencoder,
filename, random_state=None, suppress_output=False):
if suppress_output:
verbose = 0
else:
verbose = 2
print("Fitting model")
early_stopping = EarlyStopping(monitor='val_auc',
mode='max',
patience=10,
restore_best_weights=True,
verbose=0)
rlrop = ReduceLROnPlateau(monitor='val_auc',
mode='max',
patience=2,
factor=0.75,
verbose=0)
tf.random.set_seed(seed=random_state)
model = dnn_model(autoencoder,
units=config.units,
input_dropout=config.input_dropout,
dropout=config.dropout,
regL2=config.regL2,
activation=config.activation)

history = model.fit(X_train, y_train,


epochs=config.num_epoch,
batch_size=config.batch_size,
validation_data=(X_valid, y_valid),
callbacks=[early_stopping, rlrop],
shuffle=True,
verbose=verbose)
model.save(filename)

if not suppress_output:
plot_keras_history(history, measures=['loss', 'auc'])
return model, history

Since our DAE implementation is surely different from Jahrer's, although the idea behind it is the same, we cannot rely completely on his indications for the architecture of the supervised neural network, and we have to look for the ideal one, just as we looked for the best hyper-parameters of the LightGBM model. Using Optuna and leveraging the multiple parameters that we left open for configuring the network's architecture, we can run this code snippet for some hours and get an idea about what could work better.

In our experiments, we found that:

We should use a two-layer network with few nodes, 64 and 32 respectively.
Input dropout, dropout between layers, and some L2 regularization do help.
It is better to use the SELU activation function.

Here is the code snippet for running the entire optimization experiments:

if config.nas is True:
def evaluate():
metric_evaluations = list()
skf = StratifiedKFold(n_splits=config.cv_folds, shuffle=True, random_state=config.random_state)
for k, (train_idx, valid_idx) in enumerate(skf.split(train, target)):

X_train, y_train = train[train_idx, :], target[train_idx]


X_valid, y_valid = train[valid_idx, :], target[valid_idx]
if config.reuse_autoencoder:
autoencoder = load_model(f"./dae_fold_{k}")
else:
autoencoder = autoencoder_fitting(X_train, X_valid,
filename=f'./dae_fold_{k}',
random_state=config.random_state,
suppress_output=True)

model, _ = model_fitting(X_train, y_train, X_valid, y_valid,


autoencoder=autoencoder,
filename=f"dnn_model_fold_{k}",
random_state=config.random_state,
suppress_output=True)

val_preds = model.predict(X_valid, batch_size=128, verbose=0)


best_score = eval_gini(y_true=y_valid, y_pred=np.ravel(val_preds))
metric_evaluations.append(best_score)

gpu_cleanup([autoencoder, model])

return np.mean(metric_evaluations)
def objective(trial):
params = {
'first_layer': trial.suggest_categorical("first_layer", [8, 16, 32, 64, 128]),
'second_layer': trial.suggest_categorical("second_layer", [0, 8, 16, 32, 64]),
'third_layer': trial.suggest_categorical("third_layer", [0, 8, 16, 32, 64]),
'input_dropout': trial.suggest_float("input_dropout", 0.0, 0.5),
'dropout': trial.suggest_float("dropout", 0.0, 0.5),
'regL2': trial.suggest_uniform("regL2", 0.0, 0.1),
'activation': trial.suggest_categorical("activation", ['relu', 'leaky-relu', 'selu'])
}
config.units = [nodes for nodes in [params['first_layer'], params['second_layer'], params['third_layer']] if nodes > 0]
config.input_dropout = params['input_dropout']
config.dropout = params['dropout']
config.regL2 = params['regL2']
config.activation = params['activation']

return evaluate()
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=60)
print("Best Gini Normalized Score", study.best_value)
print("Best parameters", study.best_params)
config.units = [nodes for nodes in [study.best_params['first_layer'], study.best_params['second_layer'], study.best_params['third_layer']] if nodes > 0]
config.input_dropout = study.best_params['input_dropout']
config.dropout = study.best_params['dropout']
config.regL2 = study.best_params['regL2']
config.activation = study.best_params['activation']

If you are looking for more information on Neural Architecture Search (NAS), you can have a look at the Kaggle Book, from page 276 onward. In the case of the DAE and the supervised neural network, it is critical to look for the best architecture, since we are implementing something surely different from Michael Jahrer's solution.

As an exercise, try to improve the hyper-parameter search by using KerasTuner (to be found on pages 285 onward in the Kaggle Book), a fast solution for optimizing neural networks that has seen the important contribution of François Chollet, the creator of Keras.

Having finally set everything ready, we can start the training. In about one hour, on a Kaggle notebook with GPU, you can obtain complete test and out-of-fold predictions.

preds = np.zeros(len(test))
oof = np.zeros(len(train))
metric_evaluations = list()
skf = StratifiedKFold(n_splits=config.cv_folds, shuffle=True, random_state=config.random_state)
for k, (train_idx, valid_idx) in enumerate(skf.split(train, target)):
print(f"CV fold {k}")
X_train, y_train = train[train_idx, :], target[train_idx]
X_valid, y_valid = train[valid_idx, :], target[valid_idx]
if config.reuse_autoencoder:
print("restoring previously trained dae")
autoencoder = load_model(f"./dae_fold_{k}")
else:
autoencoder = autoencoder_fitting(X_train, X_valid,
filename=f'./dae_fold_{k}',
random_state=config.random_state)

model, history = model_fitting(X_train, y_train, X_valid, y_valid,


autoencoder=autoencoder,
filename=f"dnn_model_fold_{k}",
random_state=config.random_state)

val_preds = model.predict(X_valid, batch_size=128)


best_score = eval_gini(y_true=y_valid,
y_pred=np.ravel(val_preds))
best_epoch = np.argmax(history.history['val_auc']) + 1
print(f"[best epoch is {best_epoch}]\tvalidation_0-gini_dnn: {best_score:0.5f}\n")

metric_evaluations.append(best_score)
preds += (model.predict(test, batch_size=128).ravel() /
skf.n_splits)
oof[valid_idx] = model.predict(X_valid, batch_size=128).ravel()
gpu_cleanup([autoencoder, model])

As done with the LightGBM model, we can get an idea of the results by looking at the average fold normalized Gini coefficient.

print(f"DNN CV normalized Gini coefficient: {np.mean(metric_evaluations):0.3f} ({np.std(metric_evaluations):0.3f})")

The results won't be quite in line with what we previously obtained using the LightGBM.

DNN CV Gini Normalized Score: 0.276 (0.015)

Producing the submission and submitting it will result in a public score of about 0.27737 and a private score of about 0.28471 (results may vary wildly, as we previously mentioned), not quite a high score.

submission['target'] = preds
submission.to_csv('dnn_submission.csv')
oofs = pd.DataFrame({'id':train_index, 'target':oof})
oofs.to_csv('dnn_oof.csv', index=False)

The poor results from the neural network seem to confirm the adage that neural networks underperform on tabular problems. As Kagglers, anyway, we know that all models can be useful for a successful placing on the leaderboard; we just need to figure out how best to use them. Surely, a neural network fed with an auto-encoder has worked out a solution less affected by noise in the data and has elaborated the information in a different way than a GBM.
Ensembling the results
Now, having two models, what's left is to mix them together and see if we can improve the results. As suggested by Jahrer, we go straight for a blend of them, but we do not limit ourselves to producing just an average of the two (since our approach has, in the end, slightly differed from Jahrer's); we will also try to find optimal weights for the blend. We start by importing the out-of-fold predictions and getting our evaluation function ready.

import pandas as pd
import numpy as np
from numba import jit
@jit
def eval_gini(y_true, y_pred):
y_true = np.asarray(y_true)
y_true = y_true[np.argsort(y_pred)]
ntrue = 0
gini = 0
delta = 0
n = len(y_true)
for i in range(n-1, -1, -1):
y_i = y_true[i]
ntrue += y_i
gini += y_i * delta
delta += 1 - y_i
gini = 1 - 2 * gini / (ntrue * (n - ntrue))
return gini
lgb_oof = pd.read_csv("../input/workbook-lgb/lgb_oof.csv")
dnn_oof = pd.read_csv("../input/workbook-dae/dnn_oof.csv")
target = pd.read_csv("../input/porto-seguro-safe-driver-prediction/train.csv", usecols=['id', 'target'])

Once done, we turn the out-of-fold predictions of the LightGBM and the neural network into ranks, since the normalized Gini coefficient is sensitive to rankings (as a ROC-AUC evaluation would be).

lgb_oof_ranks = (lgb_oof.target.rank() / len(lgb_oof))
dnn_oof_ranks = (dnn_oof.target.rank() / len(dnn_oof))
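To see why the rank transformation is safe here, remember that the normalized Gini coefficient (like ROC-AUC) depends only on the ordering of the predictions, so any strictly monotonic transformation, ranking included, leaves the score unchanged. A minimal sketch with synthetic data (a plain-NumPy version of the chapter's `eval_gini`, without the numba decoration):

```python
import numpy as np
import pandas as pd

def eval_gini(y_true, y_pred):
    # Pure-NumPy version of the normalized Gini used in the chapter
    y_true = np.asarray(y_true)[np.argsort(y_pred)]
    ntrue, gini, delta = 0, 0.0, 0
    n = len(y_true)
    for i in range(n - 1, -1, -1):
        y_i = y_true[i]
        ntrue += y_i
        gini += y_i * delta
        delta += 1 - y_i
    return 1 - 2 * gini / (ntrue * (n - ntrue))

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_pred = rng.random(1000)

# Rank-transform exactly as done for the blend above
y_rank = (pd.Series(y_pred).rank() / len(y_pred)).to_numpy()

# Identical ordering implies identical Gini
assert np.isclose(eval_gini(y_true, y_pred), eval_gini(y_true, y_rank))
```

This is also why averaging ranks (instead of raw scores) is a sound way to blend models whose outputs live on different scales.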

Now we just test if, by combining the two models using different weights, we can get a better evaluation on the out-of-fold data.

baseline = eval_gini(y_true=target.target, y_pred=lgb_oof_ranks)
print(f"starting from a oof lgb baseline {baseline:0.5f}\n")
best_alpha = 1.0
for alpha in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]:
    ensemble = alpha * lgb_oof_ranks + (1.0 - alpha) * dnn_oof_ranks
    score = eval_gini(y_true=target.target, y_pred=ensemble)
    print(f"lgd={alpha:0.1f} dnn={(1.0 - alpha):0.1f} -> {score:0.5f}")
    if score > baseline:
        baseline = score
        best_alpha = alpha
print(f"\nBest alpha is {best_alpha:0.1f}")


When ready, by running the snippet, we can get interesting results:

starting from a oof lgb baseline 0.28850

lgd=0.1 dnn=0.9 -> 0.27352
lgd=0.2 dnn=0.8 -> 0.27744
lgd=0.3 dnn=0.7 -> 0.28084
lgd=0.4 dnn=0.6 -> 0.28368
lgd=0.5 dnn=0.5 -> 0.28595
lgd=0.6 dnn=0.4 -> 0.28763
lgd=0.7 dnn=0.3 -> 0.28873
lgd=0.8 dnn=0.2 -> 0.28923
lgd=0.9 dnn=0.1 -> 0.28916

Best alpha is 0.8

It seems that blending using a strong weight (0.8) on the LightGBM model and a weaker one (0.2) on the neural network may lead to a better-performing model. We immediately try this hypothesis by preparing one blend with equal weights for the models and another with the ideal weights that we have found.
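The 0.1-step grid above could also be refined with a continuous optimizer. This is a sketch of one possible refinement (not part of the original solution), using synthetic stand-ins for the real out-of-fold rank vectors and a rank-based AUC as a proxy for the Gini (Gini = 2*AUC - 1):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(42)
# Hypothetical stand-ins for the real target and oof rank vectors
y_true = rng.integers(0, 2, size=2000)
lgb_oof_ranks = y_true * 0.6 + rng.random(2000) * 0.4
dnn_oof_ranks = y_true * 0.5 + rng.random(2000) * 0.5

def auc(y_true, y_score):
    # Rank-based AUC (Mann-Whitney); the normalized Gini is 2*AUC - 1
    order = np.argsort(y_score)
    ranks = np.empty_like(order, dtype=float)
    ranks[order] = np.arange(1, len(y_score) + 1)
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Minimize the negative blended score over alpha in [0, 1]
res = minimize_scalar(
    lambda a: -auc(y_true, a * lgb_oof_ranks + (1 - a) * dnn_oof_ranks),
    bounds=(0.0, 1.0), method="bounded")
best_alpha = res.x
```

On the real competition data, the two approaches should land on very similar weights; the grid search has the advantage of printing the full score profile.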

lgb_submission = pd.read_csv("../input/workbook-lgb/lgb_submission.csv")
dnn_submission = pd.read_csv("../input/workbook-dae/dnn_submission.csv")
submission = pd.read_csv(
"../input/porto-seguro-safe-driver-prediction/sample_submission.csv")

First we try the equal weights solution:

lgb_ranks = (lgb_submission.target.rank() / len(lgb_submission))
dnn_ranks = (dnn_submission.target.rank() / len(dnn_submission))
submission.target = lgb_ranks * 0.5 + dnn_ranks * 0.5
submission.to_csv("equal_blend_rank.csv", index=False)

It leads to a public score of 0.28393 and a private score of 0.29093, which is around the 50th position on the final leaderboard, a bit far from our expectations. Now let's try using the weights that the out-of-fold predictions helped us to find:

lgb_ranks = (lgb_submission.target.rank() / len(lgb_submission))
dnn_ranks = (dnn_submission.target.rank() / len(dnn_submission))
submission.target = lgb_ranks * best_alpha + dnn_ranks * (1.0 - best_alpha)
submission.to_csv("blend_rank.csv", index=False)

Here the results lead to a public score of 0.28502 and a private score of 0.29192, which turns out to be around the seventh position on the final leaderboard. A much better result indeed: the LightGBM is a good model, but it is probably missing some nuances in the data that can be supplied, as a favorable correction, by adding some information from the neural network trained on the denoised data.

As pointed out by CPMP in his solution (https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/44614), depending on how you build your cross-validation, you can experience a "huge variation of Gini scores among folds". For this reason, CPMP suggests decreasing the variance of the estimates by using many different seeds for multiple cross-validations and averaging the results.
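CPMP's advice can be outlined in a few lines. The following is a hypothetical sketch (the dataset and model are placeholders, not the chapter's code) that repeats a stratified cross-validation under several seeds and averages the resulting Gini estimates:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Placeholder data standing in for the competition training set
X, y = make_classification(n_samples=600, random_state=0)

scores = []
for seed in [0, 1, 2, 3, 4]:
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    oof = np.zeros(len(y))
    for trn_idx, val_idx in skf.split(X, y):
        model = LogisticRegression(max_iter=1000).fit(X[trn_idx], y[trn_idx])
        oof[val_idx] = model.predict_proba(X[val_idx])[:, 1]
    # normalized Gini = 2 * AUC - 1
    scores.append(2 * roc_auc_score(y, oof) - 1)

# Averaging across seeds reduces the variance of the estimate
mean_gini, std_gini = np.mean(scores), np.std(scores)
```

The mean across seeds is a steadier estimate of out-of-sample performance than any single cross-validation run, at the cost of proportionally more training time.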

Exercise: try to modify the code we used in order to create more stable predictions, especially for the denoising auto-encoder.
Summary
In this first chapter, you have dealt with a classical tabular competition. By reading the notebooks and discussions of the competition, we have come up with a simple solution involving just two models that can be easily blended. In particular, we have offered an example of how to use a denoising auto-encoder to produce an alternative data processing, particularly useful when operating with neural networks on tabular data. By understanding and replicating solutions to past competitions, you can quickly build up your core competencies in Kaggle competitions and soon be able to perform consistently and strongly in more recent competitions and challenges.

In the next chapter, we will explore another tabular competition from Kaggle, this time revolving around a complex prediction problem with time series.
2 The Makridakis Competitions: M5 on Kaggle for Accuracy
and Uncertainty
Join our book community on Discord

h t t ps://pa ckt .lin k/Ea r ly A ccessCom m u n it y

Since 1982, Spyros Makridakis (https://mofc.unic.ac.cy/dr-spyros-makridakis/) has involved groups of researchers from all over the world in forecasting challenges, called M competitions, in order to compare the efficacy of existing and new forecasting methods on different forecasting problems. For this reason, M competitions have always been completely open to both academics and practitioners. The competitions are probably the most cited and referenced events in the forecasting community, and they have always highlighted the changing state of the art in forecasting methods. Each previous M competition has provided both researchers and practitioners not only with useful data to train and test their forecasting tools, but also with a series of discoveries and approaches that are revolutionizing the way forecasts are done.

The recent M5 competition (the M6 is running as this chapter is being written) was held on Kaggle, and it proved particularly significant in demonstrating the usefulness of gradient boosting methods when trying to solve a host of volume forecasts for retail products. In this chapter, focusing on the accuracy track, we deal with a time series problem from a Kaggle competition, and by replicating one of the top, yet simplest and clearest, solutions, we intend to provide our readers with code and ideas to successfully handle any future forecasting competition that may appear on Kaggle.

Apart from the competition pages, we found a lot of information regarding the competition and its dynamics in the following papers from the International Journal of Forecasting:
Makridakis, Spyros, Evangelos Spiliotis, and Vassilios Assimakopoulos. "The M5 competition: Background, organization, and implementation." International Journal of Forecasting (2021).
Makridakis, Spyros, Evangelos Spiliotis, and Vassilios Assimakopoulos. "M5 accuracy competition: Results, findings, and conclusions." International Journal of Forecasting (2022).
Makridakis, Spyros, et al. "The M5 Uncertainty competition: Results, findings and conclusions." International Journal of Forecasting (2021).

Understanding the competition and the data


The competition ran from March to June 2020, and over 7,000 participants took part in it on Kaggle. The organizers arranged it into two separate tracks, one for point-wise prediction (the accuracy track) and another one for estimating reliable values at different confidence intervals (the uncertainty track).

Walmart provided the data. It consisted of 42,840 daily sales time series of items hierarchically arranged into departments, categories, and stores spread across three U.S. states (the series are somewhat correlated with each other). Along with the sales, Walmart also provided accompanying information (exogenous variables, not often provided in forecasting problems) such as the prices of items, some calendar information, associated promotions, and the presence of other events affecting the sales.

Apart from Kaggle, the data is available, together with the datasets from previous M competitions, at the address https://forecasters.org/resources/time-series-data/.

One interesting aspect of the competition is that it dealt with consumer goods sales, both fast moving and slow moving, with many examples of the latter presenting intermittent sales (sales are often zero but for some rare cases). Intermittent series, though common in many industries, are still a challenging case in forecasting for many practitioners.

The competition timeline was arranged in two parts. In the first, from the beginning of March 2020 to June 1st, competitors could train models on the range of days up to day 1,913 and score their submissions on the public test set (ranging from day 1,914 to 1,941). After that date, until the end of the competition on July 1st, the public test set was made available as part of the training set, allowing participants to tune their models in order to predict from day 1,942 to 1,969 (a time window of 28 days, i.e., four weeks). In that period, submissions were not scored on the leaderboard.

The rationale behind such an arrangement of the competition was to initially allow teams to test their models on the leaderboard and to have grounds to share their best performing methods in notebooks and discussions. After the first phase, the organizers wanted to avoid having the leaderboard used for overfitting purposes or hyperparameter tuning of the models, and they wanted to resemble a forecasting situation as it would happen in the real world. In addition, the requirement to choose only one submission as the final one mirrored the same necessity for realism (in the real world, you cannot use two distinct models' predictions and choose the one that suits you best afterwards).

As for the data, we mentioned that it was provided by Walmart and represents the USA market: it originated from 10 stores in California, Wisconsin, and Texas. Specifically, the data is made up of the sales of 3,049 products, organized into three categories (hobbies, food, and household) that can be divided further into 7 departments each. Such a hierarchical structure is certainly a challenge, because you can model sales dynamics at the level of the USA market, state market, single store, product category, category department, and finally specific product. All these levels can also combine into different aggregates, which are required to be predicted in the second track, the uncertainty track:
Level id  Level description                                       Aggregation level  Number of series
1         All products, aggregated for all stores and states      Total              1
2         All products, aggregated for each state                 State              3
3         All products, aggregated for each store                 Store              10
4         All products, aggregated for each category              Category           3
5         All products, aggregated for each department            Department         7
6         All products, aggregated for each state and category    State-Category     9
7         All products, aggregated for each state and department  State-Department   21
8         All products, aggregated for each store and category    Store-Category     30
9         All products, aggregated for each store and department  Store-Department   70
10        Each product, aggregated for all stores/states          Product            3,049
11        Each product, aggregated for each state                 Product-State      9,147
12        Each product, aggregated for each store                 Product-Store      30,490
                                                                  Total              42,840

From the point of view of time, the granularity is daily sales records, and the covered period spans from 29 January 2011 to 19 June 2016, which equals 1,969 days in total: 1,913 for training, 28 for validation (public leaderboard), and 28 for test (private leaderboard). A forecasting horizon of 28 days is actually recognized in the retail sector as the proper horizon for handling stocks and re-ordering operations for most goods.

Let's examine the different data you receive for the competition. You get sales_train_evaluation.csv, sell_prices.csv, and calendar.csv. The one keeping the time series is sales_train_evaluation.csv. It is composed of fields that act as identifiers (item_id, dept_id, cat_id, store_id, and state_id) and columns from d_1 to d_1941 representing the sales of those days:

Figure 2.1: The sales_train_evaluation.csv data

sell_prices.csv contains instead information about the price of the items. The difficulty here is in joining the wm_yr_wk (the id of the week) with the columns in the training data:

Figure 2.2: The sell_prices.csv data


The last file, calendar.csv, contains data relative to events that could have affected the sales:

Figure 2.3: The calendar.csv data

Again, the main difficulty seems to be in joining the data with the columns in the training table. Anyway, here you have an easy key (the d field) to connect those columns with the wm_yr_wk one. In addition, the table represents different events that may have occurred on particular days, as well as SNAP days, special days when the nutrition assistance benefits of the Supplemental Nutrition Assistance Program (SNAP) can be used.
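To make the joins concrete, here is a small sketch with hand-made miniature frames standing in for the real files (all values invented): melted sales rows connect to calendar.csv through the d field, and from there to sell_prices.csv through wm_yr_wk:

```python
import pandas as pd

# Miniature stand-ins for the competition files (values are invented)
sales = pd.DataFrame({
    "item_id": ["HOBBIES_1_001", "HOBBIES_1_001"],
    "store_id": ["CA_1", "CA_1"],
    "d": ["d_1", "d_2"],
    "sales": [3, 0]})
calendar = pd.DataFrame({
    "d": ["d_1", "d_2"],
    "date": ["2011-01-29", "2011-01-30"],
    "wm_yr_wk": [11101, 11101]})
prices = pd.DataFrame({
    "store_id": ["CA_1"],
    "item_id": ["HOBBIES_1_001"],
    "wm_yr_wk": [11101],
    "sell_price": [9.58]})

# d links the sales rows to the calendar; wm_yr_wk then links to weekly prices
grid = sales.merge(calendar, on="d", how="left")
grid = grid.merge(prices, on=["store_id", "item_id", "wm_yr_wk"], how="left")
```

The same two left merges, at full scale, attach the calendar and price information to every melted sales row.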

Understanding the Evaluation Metric


The accuracy competition introduced a new evaluation metric: the Weighted Root Mean Squared Scaled Error (WRMSSE). The metric evaluates the deviation of the point forecasts around the mean of the realized values of the series being predicted. For a single series, the RMSSE is defined as:

RMSSE = sqrt( (1/h) * Σ_{t=n+1..n+h} (Y_t − Ŷ_t)² / [ (1/(n−1)) * Σ_{t=2..n} (Y_t − Y_{t−1})² ] )

where:

* n is the length of the training sample
* h is the forecasting horizon (in our case, h=28)
* Y_t is the sales value at time t and Ŷ_t is the predicted value at time t

The WRMSSE is then the weighted sum of the RMSSE values computed over all the series.
In the competition guidelines (https://mofc.unic.ac.cy/m5-competition/), with regard to the WRMSSE, it is remarked that:

* The denominator of the RMSSE is computed only for the time periods during which the examined product(s) are actively sold, i.e., the periods following the first non-zero demand observed for the series under evaluation.
* The measure is scale independent, meaning that it can be effectively used to compare forecasts across series with different scales.
* In contrast to other measures, it can be safely computed as it does not rely on divisions with values that could be equal or close to zero (e.g., as done in percentage errors when Y_t = 0, or in relative errors when the error of the benchmark used for scaling is zero).
* The measure penalizes positive and negative forecast errors, as well as large and small forecasts, equally, thus being symmetric.
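Following those definitions, a minimal NumPy sketch of the (unweighted) RMSSE for one series might look like this; note that, for brevity, the denominator here uses the whole training sample instead of only the periods after the first non-zero demand:

```python
import numpy as np

def rmsse(train, actual, forecast):
    """Root Mean Squared Scaled Error for a single series.

    train:    the n in-sample values used for scaling
    actual:   the h realized values over the forecast horizon
    forecast: the h predicted values
    """
    train = np.asarray(train, dtype=float)
    num = np.mean((np.asarray(actual) - np.asarray(forecast)) ** 2)
    # Scale by the mean squared one-step naive error on the training sample
    den = np.mean(np.diff(train) ** 2)
    return np.sqrt(num / den)

train = [0, 2, 1, 3, 2, 4]
print(rmsse(train, [3, 3], [3, 3]))  # a perfect forecast scores zero
```

The competition's WRMSSE then weights and sums such per-series values across all 42,840 series.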

A good explanation of the underlying workings of this metric is provided by this post from Alexander Soare (https://www.kaggle.com/alexandersoare): https://www.kaggle.com/competitions/m5-forecasting-accuracy/discussion/148273. After having transformed the evaluation metric, Alexander attributes improving performances to improving the ratio between the error in the predictions and the day-to-day variation of sales values. If the error is the same as the daily variations (ratio=1), it is likely that the model is not much better than a random guess based on historical variations. If your error is less than that, the score improves in a quadratic way (because of the square root) as it approaches zero. Consequently, a WRMSSE of 0.5 corresponds to a ratio of 0.7, and a WRMSSE of 0.25 corresponds to a ratio of 0.5.

During the competition, many attempts were made at using the metric not only for evaluation beyond the leaderboard, but also as an objective function. First of all, the Tweedie loss (implemented both in XGBoost and LightGBM) worked quite well for the problem because it could handle the skewed distributions of sales of most products (a lot of them also had intermittent sales, which is handled finely by the Tweedie loss, too). The Poisson and Gamma distributions can be considered extreme cases of the Tweedie distribution: based on the power parameter, p, with p=1 you get a Poisson distribution and with p=2 a Gamma one. Such a power parameter is actually the glue that keeps the mean and the variance of the distribution connected by the formula variance = k*mean**p. Using a power value between 1 and 2, you actually get a mix of Poisson and Gamma distributions, which can fit the competition problem very well. Most of the Kagglers involved in the competition and using a GBM solution actually resorted to the Tweedie loss.

In spite of the Tweedie success, some other Kagglers found interesting ways to implement an objective loss more similar to the WRMSSE for their models:

* Martin Kovacevic Buvinic with his asymmetric loss: https://www.kaggle.com/code/ragnar123/simple-lgbm-groupkfold-cv/notebook
* Timetraveller using PyTorch Autograd to get the gradient and hessian of any differentiable continuous loss function to be implemented in LightGBM: https://www.kaggle.com/competitions/m5-forecasting-accuracy/discussion/152837

Examining the 4th place solution’s ideas from Monsaraida


There are many solutions available for the competition, mostly to be found on the competition's Kaggle discussion pages. The top five methods of both challenges have also been gathered and published (except one, because of proprietary rights) by the competition organizers themselves: https://github.com/Mcompetitions/M5-methods (by the way, reproducing the results of the winning submissions was a prerequisite for collecting a competition prize).

Noticeably, all the Kagglers placed in the higher ranks of the competitions used LightGBM, either as their unique model type or blended/stacked in ensembles, because its low memory usage and computation speed gave it an advantage, given the large number of time series to process and predict. But there are also other reasons for its success. Contrary to classical methods based on ARIMA, it doesn't require relying on the analysis of auto-correlation and on specifically figuring out the parameters for each single series in the problem. In addition, contrary to methods based on deep learning, it doesn't require searching for improved, complicated neural architectures or tuning large numbers of hyperparameters. The strength of gradient boosting methods in time series problems (and by extension of every other gradient boosting algorithm, such as, for instance, XGBoost) is to rely on feature engineering, creating the right number of features based on time lags, moving averages, and averages from groupings of the series attributes. Then, choosing the right objective function and doing some hyperparameter tuning will suffice to obtain excellent results when the time series are long enough (for shorter series, classical statistical methods such as ARIMA or exponential smoothing are still the recommended choice).

Another advantage of LightGBM and XGBoost against deep learning solutions in the competition was the Tweedie loss, which doesn't require any feature scaling (deep learning networks are particularly sensitive to the scaling you use), and the speed of training, which allowed faster iterations while testing feature engineering.

Among all these available solutions, we found the one proposed by Monsaraida (Masanori Miyahara), a Japanese computer scientist, the most interesting one. He proposed a simple and straightforward solution that ranked fourth on the private leaderboard with a score of 0.53583. The solution uses just general features without prior selection (such as sales statistics, calendar, prices, and identifiers). Moreover, it uses a limited number of models of the same kind, using LightGBM gradient boosting, without resorting to any kind of blending, recursive modeling (where predictions feed other hierarchically related predictions), or multipliers (that is, choosing constants to fit the test set better). Here is a scheme taken from his solution presentation (https://github.com/Mcompetitions/M5-methods/tree/master/Code%20of%20Winning%20Methods/A4), where it can be noted that he treats each of the ten stores by each of the four weeks to be looked into the future, which in the end corresponds to producing 40 models:


Figure 2.4: Explanation by Monsaraida of the structure of his solution

Given that Monsaraida kept his solution simple and practical, like in a real-world forecasting project, in this chapter we will try to replicate his example by refactoring his code so that it runs in Kaggle notebooks (we will handle the memory and running time limitations by splitting the code into multiple notebooks). In this way, we intend to provide the readers with a simple and effective way, based on gradient boosting, to approach forecasting problems.

Computing predictions for specific dates and time horizons


The plan for replicating Monsaraida's solution is to create a notebook, customizable by input parameters, that produces the necessary processed data for train and test and the LightGBM models for predictions. The models, given data from the past, will be trained to learn to predict values a specific number of days in the future. The best results can be obtained by having each model learn to predict the values in a specific week range in the future. Since we have to predict up to 28 days ahead in the future, we need a model predicting from day +1 to day +7 in the future, then another one able to predict from day +8 to day +14, another from day +15 to day +21, and finally a last one capable of handling predictions from day +22 to day +28. We will need a Kaggle notebook for each of these time ranges, thus we need four notebooks. Each of these notebooks will be trained to predict that future time span for each of the ten stores that are part of the competition. In total, each notebook will produce ten models. All together, the notebooks will then produce forty models covering all the future ranges and all the stores.
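The arrangement just described can be made concrete by enumerating the (store, horizon) pairs, confirming that four weekly look-ahead windows times ten stores yields the forty models:

```python
# The ten M5 stores and the four weekly look-ahead windows (days ahead)
stores = ([f"CA_{i}" for i in range(1, 5)]
          + [f"TX_{i}" for i in range(1, 4)]
          + [f"WI_{i}" for i in range(1, 4)])
horizons = [(1, 7), (8, 14), (15, 21), (22, 28)]  # +1..+7, +8..+14, ...

# One LightGBM model per store per weekly window -> 40 models in total
models = [(store, first, last)
          for (first, last) in horizons
          for store in stores]
print(len(models))
```

Each of the four notebooks owns one horizon tuple and loops over the ten stores.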

Since we need to predict both for the public leaderboard and for the private one, it is necessary to repeat this process twice, stopping training at day 1,913 (predicting days from 1,914 to 1,941) for the public test set submission and at day 1,941 (predicting days from 1,942 to 1,969) for the private one.

Given the current limitations for running Kaggle notebooks based on CPU, all these eight notebooks can be run in parallel (the whole process takes almost six and a half hours). Each notebook can be distinguished from the others by its name, which contains the parameter values relative to the last training day and the look-ahead horizon in days. An example of one of these notebooks can be found at: https://www.kaggle.com/code/lucamassaron/m5-train-day-1941-horizon-7.

Let's now examine together how the code has been arranged and what we can learn from Monsaraida's solution.

We simply start by importing the necessary packages. You can notice how, apart from NumPy and pandas, the only specialized data science package is LightGBM. You may also notice that we are going to use gc (garbage collection): that's because we need to limit the amount of memory used by the script, so we frequently collect and recycle the unused memory. As part of this strategy, we also frequently store models and data structures away on disk, instead of keeping them in memory:

import numpy as np
import pandas as pd
import os
import random
import math
from decimal import Decimal as dec
import datetime
import time
import gc
import lightgbm as lgb
import pickle
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

As part of the strategy to limit memory usage, we resort to the function to reduce the size of a pandas DataFrame described in The Kaggle Book and initially developed by Arjan Groen during the Zillow competition (read the discussion at https://www.kaggle.com/competitions/tabular-playground-series-dec-2021/discussion/291844):

def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose:
        print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(
            end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

We keep on defining functions for this solution, because it helps to split the solution into smaller parts and because it is easier to clean up all the used variables when you return from a function (you keep only what you saved to disk and what you returned from the function). Our next function helps us to load all the available data and compress it:

def load_data():
    train_df = reduce_mem_usage(pd.read_csv("../input/m5-forecasting-accuracy/sales_train_evaluation.csv"))
    prices_df = reduce_mem_usage(pd.read_csv("../input/m5-forecasting-accuracy/sell_prices.csv"))
    calendar_df = reduce_mem_usage(pd.read_csv("../input/m5-forecasting-accuracy/calendar.csv"))
    submission_df = reduce_mem_usage(pd.read_csv("../input/m5-forecasting-accuracy/sample_submission.csv"))
    return train_df, prices_df, calendar_df, submission_df

train_df, prices_df, calendar_df, submission_df = load_data()

After preparing the code to retrieve the data relative to prices, volumes, and calendar information, we proceed to prepare the first processing function, whose role is to create a basic table of information having item_id, dept_id, cat_id, state_id, and store_id as row keys, a day column, and a values column containing the volumes. This is achieved starting from rows containing all the days' data columns by using the pandas command melt (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html). The command takes as reference the index of the DataFrame and then picks all the remaining features, placing their names in one column and their values in another (the var_name and value_name parameters let you define the names of these new columns). In this way, you can unfold a row representing the sales series of a certain item in a certain store into multiple rows, each one representing a single day. The fact that the positional order of the unfolded columns is preserved guarantees that your time series now spans the vertical axis (you can therefore apply further transformations to it, such as moving means).
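To make the mechanics of melt concrete, here is a minimal sketch on a toy DataFrame (the IDs and sales values are invented for illustration and are not taken from the competition data):

```python
import pandas as pd

# A toy version of train_df: one row per item/store series,
# one column per day (values invented for illustration)
toy = pd.DataFrame({
    'id': ['FOODS_1_001_CA_1', 'FOODS_1_002_CA_1'],
    'd_1': [3, 0],
    'd_2': [1, 5],
})
# Unfold the day columns into rows: one row per (id, day) pair
grid = pd.melt(toy, id_vars=['id'], var_name='d', value_name='sales')
# The day columns are stacked in their original order: first d_1 for
# both ids, then d_2 for both ids, so each series now runs vertically
assert list(grid.columns) == ['id', 'd', 'sales']
assert list(grid['d']) == ['d_1', 'd_1', 'd_2', 'd_2']
assert list(grid['sales']) == [3, 0, 1, 5]
```

Stripping the `d_` prefix and casting to integer, as the solution code does next, then turns the d column into a numeric day feature.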

To give you an idea of what is happening, here is the train_df before the transformation with pd.melt. Notice how the volumes of the distinct days are column features:

Figure 2.5: The training DataFrame

After the transformation, you obtain a grid_df where the days have been distributed on separate rows:

Figure 2.6: Applying pd.melt to the training DataFrame

The feature d contains the names of the columns that were not part of the index; in essence, all the features from d_1 to d_1941. By simply removing the 'd_' prefix from its values and converting them to integers, you now have a day feature.

Apart from this, the code snippet also separates a holdout of rows (your validation set) from the training ones based on time. To the training part, it will also add the rows necessary for your predictions, based on the prediction horizon you provide.

Here is the function that creates our basic feature template. As input, it takes the train_df DataFrame, the day the training ends, and the prediction horizon (the number of days you want to predict into the future):

def generate_base_grid(train_df, end_train_day_x, predict_horizon):
    index_columns = ['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id']
    grid_df = pd.melt(train_df, id_vars=index_columns, var_name='d', value_name='sales')
    grid_df = reduce_mem_usage(grid_df, verbose=False)
    grid_df['d_org'] = grid_df['d']
    grid_df['d'] = grid_df['d'].apply(lambda x: x[2:]).astype(np.int16)
    time_mask = (grid_df['d'] > end_train_day_x) & (grid_df['d'] <= end_train_day_x + predict_horizon)
    holdout_df = grid_df.loc[time_mask, ["id", "d", "sales"]].reset_index(drop=True)
    holdout_df.to_feather(f"holdout_df_{end_train_day_x}_to_{end_train_day_x + predict_horizon}.feather")
    del(holdout_df)
    gc.collect()
    grid_df = grid_df[grid_df['d'] <= end_train_day_x]
    grid_df['d'] = grid_df['d_org']
    grid_df = grid_df.drop('d_org', axis=1)
    add_grid = pd.DataFrame()
    for i in range(predict_horizon):
        temp_df = train_df[index_columns]
        temp_df = temp_df.drop_duplicates()
        temp_df['d'] = 'd_' + str(end_train_day_x + i + 1)
        temp_df['sales'] = np.nan
        add_grid = pd.concat([add_grid, temp_df])
    grid_df = pd.concat([grid_df, add_grid])
    grid_df = grid_df.reset_index(drop=True)

    for col in index_columns:
        grid_df[col] = grid_df[col].astype('category')

    grid_df = reduce_mem_usage(grid_df, verbose=False)
    grid_df.to_feather(f"grid_df_{end_train_day_x}_to_{end_train_day_x + predict_horizon}.feather")
    del(grid_df)
    gc.collect()

After defining the function that creates the basic feature template, we prepare a merge function for pandas DataFrames that will help save memory and avoid memory errors when handling large sets of data. Given two DataFrames, df1 and df2, and the set of foreign keys on which we need them to be merged, the function applies a left outer join between df1 and df2 without creating a new merged object but simply expanding the existing df1 DataFrame.

The function works first by extracting the foreign keys from df1, then merging the extracted keys with df2. In this way, the function creates a new DataFrame, called merged_gf, which is ordered as df1. At this point, we just assign the merged_gf columns to df1. Internally, df1 will pick up the references to the internal data structures of merged_gf. Such an approach helps minimize memory usage because only the necessary data is created at any time (there are no duplicates that can fill up the memory). When the function returns df1, merged_gf is discarded, except for the data now used by df1.

Her e is t h e code for t h is u t ilit y fu n ct ion :

def merge_by_concat(df1, df2, merge_on):
    merged_gf = df1[merge_on]
    merged_gf = merged_gf.merge(df2, on=merge_on, how='left')
    new_columns = [col for col in list(merged_gf)
                   if col not in merge_on]
    df1[new_columns] = merged_gf[new_columns]
    return df1
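Here is a quick standalone check of how the function behaves (the function body is repeated so the snippet runs on its own; the item IDs and prices are invented for illustration):

```python
import pandas as pd

def merge_by_concat(df1, df2, merge_on):
    # repeated from the solution code above so this check runs standalone
    merged_gf = df1[merge_on]
    merged_gf = merged_gf.merge(df2, on=merge_on, how='left')
    new_columns = [col for col in list(merged_gf) if col not in merge_on]
    df1[new_columns] = merged_gf[new_columns]
    return df1

left = pd.DataFrame({'item_id': ['A', 'B', 'A'], 'sales': [1, 2, 3]})
prices = pd.DataFrame({'item_id': ['A', 'B'], 'price': [9.99, 4.50]})
left = merge_by_concat(left, prices, ['item_id'])
# The price column is grafted onto left, preserving its row order,
# without ever materializing a second full copy of left
assert left['price'].tolist() == [9.99, 4.5, 9.99]
```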

After this necessary step, we proceed to program a new function to process the data. This time we handle the prices data, a set of data containing the price of each item in each store for every week. Since it is important to figure out whether we are dealing with a new product appearing in a store or not, the function picks the first date of price availability (using the wm_yr_wk feature in the price table, representing the ID of the week) and copies it to our feature template.

Here is the code for processing the release dates:

def calc_release_week(prices_df, end_train_day_x, predict_horizon):
    index_columns = ['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id']

    grid_df = pd.read_feather(f"grid_df_{end_train_day_x}_to_{end_train_day_x + predict_horizon}.feather")

    release_df = prices_df.groupby(['store_id', 'item_id'])['wm_yr_wk'].agg(['min']).reset_index()
    release_df.columns = ['store_id', 'item_id', 'release']

    grid_df = merge_by_concat(grid_df, release_df, ['store_id', 'item_id'])

    del release_df
    grid_df = reduce_mem_usage(grid_df, verbose=False)
    gc.collect()

    # calendar_df is read from the enclosing scope here
    grid_df = merge_by_concat(grid_df, calendar_df[['wm_yr_wk', 'd']], ['d'])

    grid_df = grid_df.reset_index(drop=True)
    grid_df['release'] = grid_df['release'] - grid_df['release'].min()
    grid_df['release'] = grid_df['release'].astype(np.int16)

    grid_df = reduce_mem_usage(grid_df, verbose=False)

    grid_df.to_feather(f"grid_df_{end_train_day_x}_to_{end_train_day_x + predict_horizon}.feather")
    del(grid_df)
    gc.collect()

After having handled the day of the product's appearance in a store, we proceed to deal with the prices themselves. For each item, in each shop, we prepare basic price features telling:

* the actual price (normalized by the maximum)
* the maximum price
* the minimum price
* the mean price
* the standard deviation of the price
* the number of different prices the item has taken
* the number of items in the store with the same price

Besides these basic descriptive statistics of prices, we also add some features to describe their dynamics for each item in a store, based on different time granularities:

* the day momentum, i.e., the ratio between the current price and its price on the previous day
* the month momentum, i.e., the ratio between the current price and its average price in the same month
* the year momentum, i.e., the ratio between the current price and its average price in the same year

Here we use two interesting and essential pandas methods for time series feature processing:

* shift: moves the index forward or backward by n steps (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shift.html)
* transform: applied to a groupby, fills a like-indexed feature with the transformed values (https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.transform.html)
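A tiny standalone sketch of how shift and transform combine to build the momentum features described above (item IDs and prices are invented for illustration):

```python
import pandas as pd

prices = pd.DataFrame({
    'item_id': ['A', 'A', 'A', 'B', 'B'],
    'sell_price': [2.0, 2.0, 1.0, 4.0, 5.0],
})
# Ratio between the current price and the previous observation,
# computed independently inside each item's series: transform keeps
# the original index, and shift(1) looks one step back per group
prices['momentum'] = (prices['sell_price'] /
                      prices.groupby('item_id')['sell_price'].transform(lambda x: x.shift(1)))
# The first observation of each item has no predecessor, hence NaN
assert pd.isna(prices['momentum'].iloc[0]) and pd.isna(prices['momentum'].iloc[3])
assert prices['momentum'].iloc[2] == 0.5   # price halved: 1.0 / 2.0
assert prices['momentum'].iloc[4] == 1.25  # price raised: 5.0 / 4.0
```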

In addition, the decimal part of the price is processed as a feature, in order to reveal situations where the item is sold at psychological pricing thresholds (e.g., $19.99 or £2.98 – see this discussion: https://www.kaggle.com/competitions/m5-forecasting-accuracy/discussion/145011). The function math.modf (https://docs.python.org/3.8/library/math.html#math.modf) helps in doing so because it splits any floating-point number into its fractional and integer parts (a two-item tuple).
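For instance, on a typical psychological price point:

```python
import math

# math.modf returns (fractional part, integer part)
frac, whole = math.modf(19.99)
assert whole == 19.0
# binary floating point stores 19.99 inexactly, so the fractional
# part comes back as 0.9899..., hence the rounding before comparing
assert round(frac, 2) == 0.99
```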

Finally, the resulting table is saved to disk.

Here is the function doing all the feature engineering on prices:

def generate_grid_price(prices_df, calendar_df, end_train_day_x, predict_horizon):
    grid_df = pd.read_feather(f"grid_df_{end_train_day_x}_to_{end_train_day_x + predict_horizon}.feather")
    prices_df['price_max'] = prices_df.groupby(['store_id', 'item_id'])['sell_price'].transform('max')
    prices_df['price_min'] = prices_df.groupby(['store_id', 'item_id'])['sell_price'].transform('min')
    prices_df['price_std'] = prices_df.groupby(['store_id', 'item_id'])['sell_price'].transform('std')
    prices_df['price_mean'] = prices_df.groupby(['store_id', 'item_id'])['sell_price'].transform('mean')
    prices_df['price_norm'] = prices_df['sell_price'] / prices_df['price_max']
    prices_df['price_nunique'] = prices_df.groupby(['store_id', 'item_id'])['sell_price'].transform('nunique')
    prices_df['item_nunique'] = prices_df.groupby(['store_id', 'sell_price'])['item_id'].transform('nunique')
    calendar_prices = calendar_df[['wm_yr_wk', 'month', 'year']]
    calendar_prices = calendar_prices.drop_duplicates(subset=['wm_yr_wk'])
    prices_df = prices_df.merge(calendar_prices[['wm_yr_wk', 'month', 'year']], on=['wm_yr_wk'], how='left')

    del calendar_prices
    gc.collect()

    prices_df['price_momentum'] = prices_df['sell_price'] / prices_df.groupby(['store_id', 'item_id'])[
        'sell_price'].transform(lambda x: x.shift(1))
    prices_df['price_momentum_m'] = prices_df['sell_price'] / prices_df.groupby(['store_id', 'item_id', 'month'])[
        'sell_price'].transform('mean')
    prices_df['price_momentum_y'] = prices_df['sell_price'] / prices_df.groupby(['store_id', 'item_id', 'year'])[
        'sell_price'].transform('mean')
    prices_df['sell_price_cent'] = [math.modf(p)[0] for p in prices_df['sell_price']]
    prices_df['price_max_cent'] = [math.modf(p)[0] for p in prices_df['price_max']]
    prices_df['price_min_cent'] = [math.modf(p)[0] for p in prices_df['price_min']]
    del prices_df['month'], prices_df['year']
    prices_df = reduce_mem_usage(prices_df, verbose=False)
    gc.collect()

    original_columns = list(grid_df)
    grid_df = grid_df.merge(prices_df, on=['store_id', 'item_id', 'wm_yr_wk'], how='left')
    del(prices_df)
    gc.collect()

    keep_columns = [col for col in list(grid_df) if col not in original_columns]
    grid_df = grid_df[['id', 'd'] + keep_columns]
    grid_df = reduce_mem_usage(grid_df, verbose=False)
    grid_df.to_feather(f"grid_price_{end_train_day_x}_to_{end_train_day_x + predict_horizon}.feather")
    del(grid_df)
    gc.collect()

The next function computes the moon phase, giving back one of its eight phases (from new moon to waning crescent). Although moon phases shouldn't directly influence any sales (weather conditions do instead, but we have no weather information in the data), they represent a periodic cycle of 29 and a half days, which can suit periodic shopping behaviors well. There is an interesting discussion, with different hypotheses regarding why moon phases may work as a predictor, in this competition post: https://www.kaggle.com/competitions/m5-forecasting-accuracy/discussion/154776:

def get_moon_phase(d):  # 0=new, 4=full; 4 days/phase
    # dec is decimal.Decimal (from decimal import Decimal as dec)
    diff = datetime.datetime.strptime(d, '%Y-%m-%d') - datetime.datetime(2001, 1, 1)
    days = dec(diff.days) + (dec(diff.seconds) / dec(86400))
    lunations = dec("0.20439731") + (days * dec("0.03386319269"))
    phase_index = math.floor((lunations % dec(1) * dec(8)) + dec('0.5'))
    return int(phase_index) & 7

The moon phase function is part of a general function for creating time-based features. The function takes the calendar dataset information and places it among the features. Such information contains events and their types, as well as indications of the SNAP periods (a nutrition assistance benefit called the Supplemental Nutrition Assistance Program – hence SNAP – that helps lower-income families) that could drive further sales of basic goods. The function also generates numeric features such as the day, the month, the year, the day of the week, the week in the month, and whether it is the weekend. Here is the code:
def generate_grid_calendar(calendar_df, end_train_day_x, predict_horizon):
    grid_df = pd.read_feather(
        f"grid_df_{end_train_day_x}_to_{end_train_day_x + predict_horizon}.feather")
    grid_df = grid_df[['id', 'd']]
    gc.collect()
    calendar_df['moon'] = calendar_df.date.apply(get_moon_phase)
    # Merge calendar partly
    icols = ['date',
             'd',
             'event_name_1',
             'event_type_1',
             'event_name_2',
             'event_type_2',
             'snap_CA',
             'snap_TX',
             'snap_WI',
             'moon',
             ]
    grid_df = grid_df.merge(calendar_df[icols], on=['d'], how='left')
    icols = ['event_name_1',
             'event_type_1',
             'event_name_2',
             'event_type_2',
             'snap_CA',
             'snap_TX',
             'snap_WI']

    for col in icols:
        grid_df[col] = grid_df[col].astype('category')
    grid_df['date'] = pd.to_datetime(grid_df['date'])
    grid_df['tm_d'] = grid_df['date'].dt.day.astype(np.int8)
    grid_df['tm_w'] = grid_df['date'].dt.isocalendar().week.astype(np.int8)
    grid_df['tm_m'] = grid_df['date'].dt.month.astype(np.int8)
    grid_df['tm_y'] = grid_df['date'].dt.year
    grid_df['tm_y'] = (grid_df['tm_y'] - grid_df['tm_y'].min()).astype(np.int8)
    grid_df['tm_wm'] = grid_df['tm_d'].apply(lambda x: math.ceil(x / 7)).astype(np.int8)
    grid_df['tm_dw'] = grid_df['date'].dt.dayofweek.astype(np.int8)
    grid_df['tm_w_end'] = (grid_df['tm_dw'] >= 5).astype(np.int8)

    del(grid_df['date'])
    grid_df = reduce_mem_usage(grid_df, verbose=False)
    grid_df.to_feather(f"grid_calendar_{end_train_day_x}_to_{end_train_day_x + predict_horizon}.feather")

    del(grid_df)
    del(calendar_df)
    gc.collect()

The following function instead just removes the wm_yr_wk feature and transforms the d (day) feature into a numeric one. It is a necessary step for the feature transformation functions that follow.

def modify_grid_base(end_train_day_x, predict_horizon):
    grid_df = pd.read_feather(f"grid_df_{end_train_day_x}_to_{end_train_day_x + predict_horizon}.feather")
    grid_df['d'] = grid_df['d'].apply(lambda x: x[2:]).astype(np.int16)
    del grid_df['wm_yr_wk']

    grid_df = reduce_mem_usage(grid_df, verbose=False)
    grid_df.to_feather(f"grid_df_{end_train_day_x}_to_{end_train_day_x + predict_horizon}.feather")

    del(grid_df)
    gc.collect()

Our last two feature creation functions generate more sophisticated feature engineering for time series. The first function produces both lagged sales and their moving averages. First, using the shift method (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shift.html), it generates a range of lagged sales up to 15 days in the past. Then, using shift in conjunction with rolling (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html), it creates moving means with windows of 7, 14, 30, 60, and 180 days.

The shift command is necessary because it moves the index so that you always consider only data available for your calculations. Hence, if your prediction horizon goes up to seven days, the calculations will consider only the data available seven days before. The rolling command then creates a moving window of observations that can be summarized (in this case by the mean). Having a mean over a period (the moving window) and following its evolution will help you better detect any changes in trends, because patterns not repeating across the time windows are leveled off. This is a common strategy in time series analysis to remove noise and uninteresting patterns. For instance, with a rolling mean of seven days, you cancel all the daily patterns and just represent what happens to your sales on a weekly basis.
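The shift-then-roll pattern can be sketched on a toy series (the sales values are invented; horizon and window are deliberately small):

```python
import pandas as pd

sales = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
horizon = 2
# Shift by the prediction horizon first, so each rolling window only
# contains data that would already be known at prediction time
feat = sales.shift(horizon).rolling(3).mean()
# The first horizon + window - 1 = 4 positions cannot be computed
assert feat.iloc[:4].isna().all()
assert feat.iloc[4] == 2.0  # mean of sales[0:3] = (1 + 2 + 3) / 3
assert feat.iloc[7] == 5.0  # mean of sales[3:6] = (4 + 5 + 6) / 3
```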

Can you experiment with different moving average windows? Trying different strategies may also help. For instance, by exploring the Tabular Playground of January 2022 (https://www.kaggle.com/competitions/tabular-playground-series-jan-2022), devoted to time series, you may find further ideas, since most solutions are built using gradient boosting.

Here is the code to generate the lag and rolling mean features:

def generate_lag_feature(end_train_day_x, predict_horizon):
    grid_df = pd.read_feather(f"grid_df_{end_train_day_x}_to_{end_train_day_x + predict_horizon}.feather")
    grid_df = grid_df[['id', 'd', 'sales']]

    num_lag_day_list = []
    num_lag_day = 15
    for col in range(predict_horizon, predict_horizon + num_lag_day):
        num_lag_day_list.append(col)

    grid_df = grid_df.assign(**{
        '{}_lag_{}'.format(col, l): grid_df.groupby(['id'])['sales'].transform(lambda x: x.shift(l))
        for l in num_lag_day_list
    })
    for col in list(grid_df):
        if 'lag' in col:
            grid_df[col] = grid_df[col].astype(np.float16)
    num_rolling_day_list = [7, 14, 30, 60, 180]
    for num_rolling_day in num_rolling_day_list:
        grid_df['rolling_mean_' + str(num_rolling_day)] = grid_df.groupby(['id'])['sales'].transform(
            lambda x: x.shift(predict_horizon).rolling(num_rolling_day).mean()).astype(np.float16)
        grid_df['rolling_std_' + str(num_rolling_day)] = grid_df.groupby(['id'])['sales'].transform(
            lambda x: x.shift(predict_horizon).rolling(num_rolling_day).std()).astype(np.float16)
    grid_df = reduce_mem_usage(grid_df, verbose=False)
    grid_df.to_feather(f"lag_feature_{end_train_day_x}_to_{end_train_day_x + predict_horizon}.feather")

    del(grid_df)
    gc.collect()

As for the second advanced feature engineering function, it is an encoding function, taking specific groupings of variables among state, store, category, department, and sold item, and representing their mean and standard deviation. Such embeddings are time independent (time is not part of the grouping), and their role is to help the training algorithm distinguish how items, categories, and stores (and their combinations) differentiate among themselves.

The proposed embeddings are quite easy to compute using target encoding, as described in The Kaggle Book on page 216. Can you obtain better results, and how?

The code works by grouping the features, computing their descriptive statistics (the mean or the standard deviation in our case), and then applying the results to the dataset using the transform (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transform.html) method that we discussed before:

def generate_target_encoding_feature(end_train_day_x, predict_horizon):
    grid_df = pd.read_feather(f"grid_df_{end_train_day_x}_to_{end_train_day_x + predict_horizon}.feather")

    grid_df.loc[grid_df['d'] > (end_train_day_x - predict_horizon), 'sales'] = np.nan

    base_cols = list(grid_df)
    icols = [
        ['state_id'],
        ['store_id'],
        ['cat_id'],
        ['dept_id'],
        ['state_id', 'cat_id'],
        ['state_id', 'dept_id'],
        ['store_id', 'cat_id'],
        ['store_id', 'dept_id'],
        ['item_id'],
        ['item_id', 'state_id'],
        ['item_id', 'store_id']
    ]
    for col in icols:
        col_name = '_' + '_'.join(col) + '_'
        grid_df['enc' + col_name + 'mean'] = grid_df.groupby(col)['sales'].transform('mean').astype(
            np.float16)
        grid_df['enc' + col_name + 'std'] = grid_df.groupby(col)['sales'].transform('std').astype(
            np.float16)
    keep_cols = [col for col in list(grid_df) if col not in base_cols]
    grid_df = grid_df[['id', 'd'] + keep_cols]
    grid_df = reduce_mem_usage(grid_df, verbose=False)
    grid_df.to_feather(f"target_encoding_{end_train_day_x}_to_{end_train_day_x + predict_horizon}.feather")

    del(grid_df)
    gc.collect()
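As a standalone illustration of this groupby-plus-transform broadcasting (store IDs and sales values are invented for illustration):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'store_id': ['S1', 'S1', 'S2', 'S2'],
    'sales': [10.0, 20.0, 5.0, 5.0],
})
# transform broadcasts the per-group statistic back onto every row,
# keeping the original index (unlike agg, which returns one row per group)
df['enc_store_id_mean'] = df.groupby('store_id')['sales'].transform('mean').astype(np.float16)
df['enc_store_id_std'] = df.groupby('store_id')['sales'].transform('std').astype(np.float16)
assert df['enc_store_id_mean'].tolist() == [15.0, 15.0, 5.0, 5.0]
assert df['enc_store_id_std'].iloc[2] == 0.0  # constant sales within S2
```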

Having completed the feature engineering part, we now proceed to put together all the files we have stored on disk while generating the features. The following function loads the different datasets of basic features, price features, calendar features, lag/rolling features, and embedded features, and concatenates them all together. The code then filters only the rows relative to a specific shop, to be saved as a separate dataset. Such an approach matches the strategy of having a model trained on a specific store aimed at predicting for a specific time interval:

def assemble_grid_by_store(train_df, end_train_day_x, predict_horizon):
    # the price and calendar grids repeat the id/d key columns, so we skip them
    grid_df = pd.concat([pd.read_feather(f"grid_df_{end_train_day_x}_to_{end_train_day_x + predict_horizon}.feather"),
                         pd.read_feather(f"grid_price_{end_train_day_x}_to_{end_train_day_x + predict_horizon}.feather").iloc[:, 2:],
                         pd.read_feather(f"grid_calendar_{end_train_day_x}_to_{end_train_day_x + predict_horizon}.feather").iloc[:, 2:]],
                        axis=1)
    gc.collect()
    store_id_set_list = list(train_df['store_id'].unique())
    index_store = dict()
    for store_id in store_id_set_list:
        extract = grid_df[grid_df['store_id'] == store_id]
        index_store[store_id] = extract.index.to_numpy()
        extract = extract.reset_index(drop=True)
        extract.to_feather(f"grid_full_store_{store_id}_{end_train_day_x}_to_{end_train_day_x + predict_horizon}.feather")
    del(grid_df)
    gc.collect()

    mean_features = [
        'enc_cat_id_mean', 'enc_cat_id_std',
        'enc_dept_id_mean', 'enc_dept_id_std',
        'enc_item_id_mean', 'enc_item_id_std'
    ]
    df2 = pd.read_feather(f"target_encoding_{end_train_day_x}_to_{end_train_day_x + predict_horizon}.feather")[mean_features]
    for store_id in store_id_set_list:
        df = pd.read_feather(f"grid_full_store_{store_id}_{end_train_day_x}_to_{end_train_day_x + predict_horizon}.feather")
        df = pd.concat([df, df2[df2.index.isin(index_store[store_id])].reset_index(drop=True)], axis=1)
        df.to_feather(f"grid_full_store_{store_id}_{end_train_day_x}_to_{end_train_day_x + predict_horizon}.feather")
    del(df2)
    gc.collect()

    # the lag features file repeats the id/d/sales columns, so we skip them
    df3 = pd.read_feather(f"lag_feature_{end_train_day_x}_to_{end_train_day_x + predict_horizon}.feather").iloc[:, 3:]

    for store_id in store_id_set_list:
        df = pd.read_feather(f"grid_full_store_{store_id}_{end_train_day_x}_to_{end_train_day_x + predict_horizon}.feather")
        df = pd.concat([df, df3[df3.index.isin(index_store[store_id])].reset_index(drop=True)], axis=1)
        df.to_feather(f"grid_full_store_{store_id}_{end_train_day_x}_to_{end_train_day_x + predict_horizon}.feather")
    del(df3)
    del(store_id_set_list)
    gc.collect()

The following function, instead, further processes the selection from the previous one by removing unused features and reordering the columns, and it returns the data for a model to be trained on:

def load_grid_by_store(end_train_day_x, predict_horizon, store_id):
    df = pd.read_feather(f"grid_full_store_{store_id}_{end_train_day_x}_to_{end_train_day_x + predict_horizon}.feather")

    remove_features = ['id', 'state_id', 'store_id', 'date', 'wm_yr_wk', 'd', 'sales']

    enable_features = [col for col in list(df) if col not in remove_features]
    df = df[['id', 'd', 'sales'] + enable_features]
    df = reduce_mem_usage(df, verbose=False)
    gc.collect()

    return df, enable_features


Finally, we can now deal with the training phase. The following code snippet starts by defining the training parameters that Monsaraida found to be the most effective on the problem. For training time reasons, we just modified the boosting type, choosing goss instead of gbdt, because it can really speed up training without much loss in terms of performance. A good speed-up for the model is also provided by the subsample parameter and the feature fraction: at each learning step of the gradient boosting, only half of the examples and half of the features will be considered.

Compiling LightGBM on your machine with the right compiler options may also increase your speed, as explained in this interesting competition discussion: https://www.kaggle.com/competitions/m5-forecasting-accuracy/discussion/148273

The Tweedie loss, with a power value of 1.1 (hence with an underlying distribution quite similar to a Poisson), seems particularly effective in modeling intermittent series (where zero sales prevail). The metric used is just the root mean squared error (there is no necessity to use a custom metric representing the competition metric). We also use the force_row_wise parameter in order to save memory in the Kaggle notebook. All the other parameters are exactly the ones presented by Monsaraida in his solution (apart from the subsampling parameter, which has been disabled because of its incompatibility with the goss boosting type).
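The following sketch (not part of the solution code) builds some intuition about the Tweedie loss: for 1 < power < 2, its unit deviance interpolates between the Poisson (power → 1) and gamma (power → 2) deviances and, unlike the Poisson log term, stays finite when the observed value is zero – exactly the intermittent-sales case:

```python
import numpy as np

def tweedie_deviance(y, mu, p):
    # Unit deviance of the Tweedie family for 1 < p < 2
    # (same formula as scikit-learn's mean_tweedie_deviance)
    return 2 * (np.power(y, 2 - p) / ((1 - p) * (2 - p))
                - y * np.power(mu, 1 - p) / (1 - p)
                + np.power(mu, 2 - p) / (2 - p))

def poisson_deviance(y, mu):
    return 2 * (y * np.log(y / mu) - y + mu)

# As p approaches 1, the Tweedie deviance converges to the Poisson one
assert abs(tweedie_deviance(3.0, 2.0, 1.001) - poisson_deviance(3.0, 2.0)) < 0.02
# A zero observation (no sales that day) still gives a finite loss
assert np.isfinite(tweedie_deviance(0.0, 2.0, 1.1))
```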

In what other Kaggle competitions has the Tweedie loss proven useful? Can you find useful discussions about this loss and its usage in Meta Kaggle by exploring the ForumTopics and ForumMessages tables (https://www.kaggle.com/datasets/kaggle/meta-kaggle)?

After defining the training parameters, we just iterate over the stores, each time loading the training data of a single store and training the LightGBM model. Each model is then pickled. We also extract the feature importances from each model in order to consolidate them into a file and then aggregate them, resulting, for each feature, in the mean importance across all the stores for that prediction horizon.

Here is the complete function for training all the models for a specific prediction horizon:

def train(train_df, seed, end_train_day_x, predict_horizon):

    lgb_params = {
        'boosting_type': 'goss',
        'objective': 'tweedie',
        'tweedie_variance_power': 1.1,
        'metric': 'rmse',
        #'subsample': 0.5,
        #'subsample_freq': 1,
        'learning_rate': 0.03,
        'num_leaves': 2 ** 11 - 1,
        'min_data_in_leaf': 2 ** 12 - 1,
        'feature_fraction': 0.5,
        'max_bin': 100,
        'boost_from_average': False,
        'num_boost_round': 1400,
        'verbose': -1,
        'num_threads': os.cpu_count(),
        'force_row_wise': True,
    }
    random.seed(seed)
    np.random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)

    lgb_params['seed'] = seed
    store_id_set_list = list(train_df['store_id'].unique())
    print(f"training stores: {store_id_set_list}")

    feature_importance_all_df = pd.DataFrame()
    for store_index, store_id in enumerate(store_id_set_list):
        print(f'now training {store_id} store')
        grid_df, enable_features = load_grid_by_store(end_train_day_x, predict_horizon, store_id)
        train_mask = grid_df['d'] <= end_train_day_x
        valid_mask = train_mask & (grid_df['d'] > (end_train_day_x - predict_horizon))
        preds_mask = grid_df['d'] > (end_train_day_x - 100)
        train_data = lgb.Dataset(grid_df[train_mask][enable_features],
                                 label=grid_df[train_mask]['sales'])
        valid_data = lgb.Dataset(grid_df[valid_mask][enable_features],
                                 label=grid_df[valid_mask]['sales'])
        # Saving part of the dataset for later predictions
        # Removing features that we need to calculate recursively
        grid_df = grid_df[preds_mask].reset_index(drop=True)
        grid_df.to_feather(f'test_{store_id}_{predict_horizon}.feather')
        del(grid_df)
        gc.collect()

        estimator = lgb.train(lgb_params,
                              train_data,
                              valid_sets=[valid_data],
                              callbacks=[lgb.log_evaluation(period=100, show_stdv=False)],
                              )
        model_name = str(f'lgb_model_{store_id}_{predict_horizon}.bin')
        feature_importance_store_df = pd.DataFrame(sorted(zip(enable_features, estimator.feature_importance())),
                                                   columns=['feature_name', 'importance'])
        feature_importance_store_df = feature_importance_store_df.sort_values('importance', ascending=False)
        feature_importance_store_df['store_id'] = store_id
        feature_importance_store_df.to_csv(f'feature_importance_{store_id}_{predict_horizon}.csv', index=False)
        feature_importance_all_df = pd.concat([feature_importance_all_df, feature_importance_store_df])
        pickle.dump(estimator, open(model_name, 'wb'))
        del([train_data, valid_data, estimator])
        gc.collect()
    feature_importance_all_df.to_csv(f'feature_importance_all_{predict_horizon}.csv', index=False)
    feature_importance_agg_df = feature_importance_all_df.groupby(
        'feature_name')['importance'].agg(['mean', 'std']).reset_index()
    feature_importance_agg_df.columns = ['feature_name', 'importance_mean', 'importance_std']
    feature_importance_agg_df = feature_importance_agg_df.sort_values('importance_mean', ascending=False)
    feature_importance_agg_df.to_csv(f'feature_importance_agg_{predict_horizon}.csv', index=False)

With the last function prepared, we have all the necessary code for our pipeline to work. For the function wrapping all the operations together, we need the input datasets (the time series dataset, the price dataset, and the calendar information) together with the last training day (1,913 for predicting on the public leaderboard, 1,941 for the private one) and the prediction horizon (which could be 7, 14, 21, or 28 days).

def train_pipeline(train_df, prices_df, calendar_df,
                   end_train_day_x_list, prediction_horizon_list):

    for end_train_day_x in end_train_day_x_list:

        for predict_horizon in prediction_horizon_list:

            print(f"end training point day: {end_train_day_x} - prediction horizon: {predict_horizon}")
            # Data preparation
            generate_base_grid(train_df, end_train_day_x, predict_horizon)
            calc_release_week(prices_df, end_train_day_x, predict_horizon)
            generate_grid_price(prices_df, calendar_df, end_train_day_x, predict_horizon)
            generate_grid_calendar(calendar_df, end_train_day_x, predict_horizon)
            modify_grid_base(end_train_day_x, predict_horizon)
            generate_lag_feature(end_train_day_x, predict_horizon)
            generate_target_encoding_feature(end_train_day_x, predict_horizon)
            assemble_grid_by_store(train_df, end_train_day_x, predict_horizon)
            # Modelling
            train(train_df, seed, end_train_day_x, predict_horizon)

Since Kaggle notebooks have a limited running time and a limited amount of both memory and disk space, our suggested strategy is to replicate four notebooks with the code presented here and train them with different prediction horizon parameters. Using the same name for the notebooks, apart from a part containing the value of the prediction horizon parameter, will make it easier to gather and handle the models later as external datasets in another notebook.

Here is the first notebook, m5-train-day-1941-horizon-7 (https://www.kaggle.com/code/lucamassaron/m5-train-day-1941-horizon-7):

end_train_day_x_list = [1941]
prediction_horizon_list = [7]
seed = 42
train_pipeline(train_df, prices_df, calendar_df, end_train_day_x_list, prediction_horizon_list)

The second notebook, m5-train-day-1941-horizon-14 (https://www.kaggle.com/code/lucamassaron/m5-train-day-1941-horizon-14):

end_train_day_x_list = [1941]
prediction_horizon_list = [14]
seed = 42
train_pipeline(train_df, prices_df, calendar_df, end_train_day_x_list, prediction_horizon_list)

The third notebook, m5-train-day-1941-horizon-21 (https://www.kaggle.com/code/lucamassaron/m5-train-day-1941-horizon-21):

end_train_day_x_list = [1941]
prediction_horizon_list = [21]
seed = 42
train_pipeline(train_df, prices_df, calendar_df, end_train_day_x_list, prediction_horizon_list)

Finally, the last one, m5-train-day-1941-horizon-28 (https://www.kaggle.com/code/lucamassaron/m5-train-day-1941-horizon-28):

end_train_day_x_list = [1941]
prediction_horizon_list = [28]
seed = 42
train_pipeline(train_df, prices_df, calendar_df, end_train_day_x_list, prediction_horizon_list)

If you are working on a local computer with enough disk space and memory resources, you can simply run all four prediction horizons together, using as input the list containing them all: [7, 14, 21, 28]. Now the last step before being able to submit our predictions is assembling them.

Assembling public and private predictions

You can see examples of how we assembled the predictions for both the public and private leaderboards here:

Public leaderboard example: https://www.kaggle.com/lucamassaron/m5-predict-public-leaderboard
Private leaderboard example: https://www.kaggle.com/code/lucamassaron/m5-predict-private-leaderboard

What changes between the public and private submissions is just the last training day: it determines which days we are going to predict.

In this concluding code snippet, after loading the necessary packages, such as LightGBM, for every end-of-training day and every prediction horizon we recover the correct notebook with its data. Then, we iterate through all the stores and predict the sales for all the items in the time range going from the previous prediction horizon up to the present one. In this way, every model will predict on a single week, the one it has been trained on.

import numpy as np
import pandas as pd
import os
import random
import math
from decimal import Decimal as dec
import datetime
import time
import gc
import lightgbm as lgb
import pickle
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

store_id_set_list = ['CA_1', 'CA_2', 'CA_3', 'CA_4', 'TX_1', 'TX_2', 'TX_3', 'WI_1', 'WI_2', 'WI_3']
end_train_day_x_list = [1913, 1941]
prediction_horizon_list = [7, 14, 21, 28]
pred_v_all_df = list()

for end_train_day_x in end_train_day_x_list:
    previous_prediction_horizon = 0
    for prediction_horizon in prediction_horizon_list:
        notebook_name = f"../input/m5-train-day-{end_train_day_x}-horizon-{prediction_horizon}"
        pred_v_df = pd.DataFrame()

        for store_index, store_id in enumerate(store_id_set_list):
            model_path = str(f'{notebook_name}/lgb_model_{store_id}_{prediction_horizon}.bin')
            print(f'loading {model_path}')
            estimator = pickle.load(open(model_path, 'rb'))
            base_test = pd.read_feather(f"{notebook_name}/test_{store_id}_{prediction_horizon}.feather")
            enable_features = [col for col in base_test.columns if col not in ['id', 'd', 'sales']]

            for predict_day in range(previous_prediction_horizon + 1, prediction_horizon + 1):
                print('[{3} -> {4}] predict {0}/{1} {2} day {5}'.format(
                    store_index + 1, len(store_id_set_list), store_id,
                    previous_prediction_horizon + 1, prediction_horizon, predict_day))
                mask = base_test['d'] == (end_train_day_x + predict_day)
                base_test.loc[mask, 'sales'] = estimator.predict(base_test[mask][enable_features])

            temp_v_df = base_test[
                (base_test['d'] >= end_train_day_x + previous_prediction_horizon + 1) &
                (base_test['d'] < end_train_day_x + prediction_horizon + 1)
            ][['id', 'd', 'sales']]

            if len(pred_v_df) != 0:
                pred_v_df = pd.concat([pred_v_df, temp_v_df])
            else:
                pred_v_df = temp_v_df.copy()

            del(temp_v_df)
            gc.collect()

        previous_prediction_horizon = prediction_horizon
        pred_v_all_df.append(pred_v_df)

pred_v_all_df = pd.concat(pred_v_all_df)

When all the predictions have been gathered, we merge them using the sample submission file as a reference, both for the required rows to be predicted and for the column format (Kaggle expects distinct rows for items in the validation and testing periods, with daily sales in progressive columns).

submission = pd.read_csv("../input/m5-forecasting-accuracy/sample_submission.csv")
# Normalize the absolute day numbers to a 1-28 range relative to each
# training cutoff, so the pivot below lines up with the F1...F28 columns
for end_train_day_x in end_train_day_x_list:
    mask = (pred_v_all_df.d > end_train_day_x) & (pred_v_all_df.d <= end_train_day_x + 28)
    pred_v_all_df.loc[mask, 'd'] = pred_v_all_df.loc[mask, 'd'] - end_train_day_x
pred_h_all_df = pred_v_all_df.pivot(index='id', columns='d', values='sales')
pred_h_all_df = pred_h_all_df.reset_index()
pred_h_all_df.columns = submission.columns
submission = submission[['id']].merge(pred_h_all_df, on=['id'], how='left').fillna(0)
submission.to_csv("m5_predictions.csv", index=False)

The solution can reach around 0.54907 on the private leaderboard, corresponding to 12th position, in the gold medal area. Reverting to Monsaraida's LightGBM parameters (for instance, using gbdt instead of goss for the boosting parameter) should result in even higher performance (but you would need to run the code on a local computer or on the Google Cloud Platform).

Exercise

As an exercise, try comparing a training of LightGBM using the same number of iterations, with the boosting parameter set to gbdt instead of goss. How large is the difference in performance and training time (you may need to use a local or cloud machine, because the training may exceed the 12-hour limit)?

Summary
In this second chapter, we took on quite a complex time series competition: even the easiest top solution we tried is actually fairly complex, and it requires coding quite a lot of processing functions. After going through the chapter, you should have a better idea of how to process time series and predict them using gradient boosting. Favoring gradient boosting solutions over traditional methods, when you have enough data, as with this problem, should help you create strong solutions for complex problems with hierarchical correlations, intermittent series and availability of covariates such as events, prices or market conditions. In the following chapters, you will tackle even more complex Kaggle competitions, dealing with images and texts. You will be amazed at how much you can learn by recreating top-scoring solutions and understanding their inner workings.
03 Cassava Leaf Disease competition
Join our book community on Discord

h t t ps://pa ckt .lin k/Ea r ly A ccessCom m u n it y

In this chapter we will leave the domain of tabular data and focus on image processing. In order to demonstrate the steps necessary to do well in classification competitions, we will use the data from the Cassava Leaf Disease contest:

https://www.kaggle.com/competitions/cassava-leaf-disease-classification

The first thing to do upon starting a Kaggle competition is to read the descriptions properly:

“As the second-largest provider of carbohydrates in Africa, cassava is a key food security crop grown by smallholder farmers because it can withstand harsh conditions. At least 80% of household farms in Sub-Saharan Africa grow this starchy root, but viral diseases are major sources of poor yields. With the help of data science, it may be possible to identify common diseases so they can be treated.”

So this competition relates to a genuinely important real-life problem; your mileage may vary, but in general it is useful to keep that in mind.

“Existing methods of disease detection require farmers to solicit the help of government-funded agricultural experts to visually inspect and diagnose the plants. This suffers from being labor-intensive, low-supply and costly. As an added challenge, effective solutions for farmers must perform well under significant constraints, since African farmers may only have access to mobile-quality cameras with low-bandwidth.”

This paragraph - especially the last sentence - sets the expectations: since the data is coming from diverse sources, we are likely to have some challenges related to the quality of the images and (possibly) distribution shift.

“Your task is to classify each cassava image into four disease categories or a fifth category indicating a healthy leaf. With your help, farmers may be able to quickly identify diseased plants, potentially saving their crops before they inflict irreparable damage.”

This bit is rather important: it specifies that this is a classification competition, and the number of classes is small (5).

With the introductory footwork out of the way, let us have a look at the data.

Understanding the data and metrics

Upon entering the “Data” tab for this competition, we see the summary of the provided data:

Figure 3.1: description of the Cassava competition dataset

What can we make of that?

The data is in a fairly straightforward format, where the organisers even provided the mapping between disease names and numerical codes
We have the data in TFRecord format, which is good news for anyone interested in using a TPU
The provided test set is only a small subset and will be substituted with the full dataset at evaluation time. This suggests that loading a previously trained model at evaluation time and using it for inference is a preferred strategy.

The evaluation metric was chosen to be categorization accuracy: https://developers.google.com/machine-learning/crash-course/classification/accuracy. This metric takes discrete values as inputs, which means potential ensembling strategies become somewhat more involved. The loss function is used during training to optimise learning and, as long as we want to use methods based on gradient descent, it needs to be continuous; the evaluation metric, on the other hand, is used after training to measure overall performance and, as such, can be discrete.
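The distinction can be illustrated numerically (a small sketch with made-up labels and probabilities): cross-entropy changes continuously as the predicted probabilities move, while accuracy only changes when an argmax flips.

```python
import numpy as np

# Three samples, three classes; labels and probabilities are made up
y_true = np.array([0, 2, 1])
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.2, 0.5, 0.3]])

# Cross-entropy: continuous in the probabilities, usable as a training loss
cross_entropy = -np.mean(np.log(probs[np.arange(len(y_true)), y_true]))

# Accuracy: depends only on the discrete argmax, so it is flat almost
# everywhere and unusable for gradient descent
accuracy = np.mean(np.argmax(probs, axis=1) == y_true)
```

Here every argmax already matches the label, so accuracy is 1.0; sharpening the probabilities further would still lower the cross-entropy without moving the accuracy at all.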

Exercise: without building a model, write code to conduct a basic EDA.

Compare the cardinality of the classes in our classification problem.

Normally, this would also be the moment to check for distribution shift: if the images are very different in the train and test sets, it is something that most certainly needs to be taken into account. However, since we do not have access to the complete dataset in this case, the step is omitted - please check Chapter 1 for a discussion of adversarial validation, which is a popular technique for detecting concept drift between datasets.
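As a reminder of the idea (a minimal sketch, not the Chapter 1 implementation): train a classifier to distinguish train rows from test rows; a cross-validated AUC close to 0.5 means the two sets are hard to tell apart.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def adversarial_auc(train_features, test_features):
    # Label train rows 0 and test rows 1, then measure how well a
    # classifier separates them; AUC near 0.5 suggests no obvious shift
    X = np.vstack([train_features, test_features])
    y = np.r_[np.zeros(len(train_features)), np.ones(len(test_features))]
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    return cross_val_score(clf, X, y, cv=3, scoring="roc_auc").mean()
```

For images, the features fed to such a check would typically be embeddings or simple image statistics (brightness, resolution, and so on) rather than raw pixels.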

Building a baseline model

We start our approach by building a baseline solution. The notebook running an end-to-end solution is available at:

While hopefully useful as a starting point for other competitions you might want to try, it is more educational to follow the flow described in this section, i.e. copying the code cell by cell, so that you can understand it better (and of course improve on it - it is called a baseline solution for a reason).

Figure 3.2: the imports needed for our baseline solution

We begin by importing the necessary packages - while personal differences in style are a natural thing, it is our opinion that gathering the imports in one place makes the code easier to maintain as the competition progresses and you move towards more elaborate solutions. In addition, we create a configuration class: a placeholder for all the parameters defining our learning process:

Figure 3.3: configuration class for our baseline solution

The components include:

The data folder is mostly useful if you sometimes train models outside of Kaggle (e.g. in Google Colab or on your local machine)
BATCH_SIZE is a parameter that sometimes needs adjusting if you want to optimise your training process (or make it possible at all, for large images in a constrained memory environment)
Modifying EPOCHS is useful for debugging: start with a small number of epochs to verify that your solution is running smoothly end-to-end, and increase it as you move towards a proper solution
TARGET_SIZE defines the size to which we want to rescale our images
NCLASSES corresponds to the number of possible classes in our classification problem
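Since Figure 3.3 is shown as an image, here is a hypothetical sketch of what such a configuration class can look like (the values are illustrative assumptions, not the notebook's exact settings):

```python
class CFG:
    # All values here are illustrative placeholders
    DATA_FOLDER = "../input/cassava-leaf-disease-classification/"
    BATCH_SIZE = 16       # lower it for larger images or tighter memory
    EPOCHS = 12           # keep small while debugging end-to-end
    TARGET_SIZE = 512     # images rescaled to TARGET_SIZE x TARGET_SIZE
    NCLASSES = 5          # four diseases plus the healthy class
```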

A good practice for coding a solution is to encapsulate the important bits in functions - and creating our trainable model certainly qualifies as important:

Figure 3.4: function to create our model

A few remarks around this step:

While more expressive options are available, it is practical to begin with a fast model that can be quickly iterated upon; the EfficientNet (https://paperswithcode.com/method/efficientnet) architecture fits the bill quite well
We add a pooling layer for regularisation purposes
We add a classification head - a Dense layer with CFG.NCLASSES indicating the number of possible results for the classifier
Finally, we compile the model with the loss and metric corresponding to the requirements for this competition.
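The model-building function of Figure 3.4 can be sketched as below (a hypothetical Keras version assuming an EfficientNetB0 backbone; the notebook may use a different variant, pretrained weights, and other compile settings):

```python
import tensorflow as tf

def create_model(target_size=512, n_classes=5):
    # Backbone without the top classification layer; weights=None avoids a
    # download here, but in practice you would start from pretrained weights
    backbone = tf.keras.applications.EfficientNetB0(
        include_top=False, weights=None,
        input_shape=(target_size, target_size, 3))
    model = tf.keras.Sequential([
        backbone,
        tf.keras.layers.GlobalAveragePooling2D(),                # pooling layer
        tf.keras.layers.Dense(n_classes, activation="softmax"),  # classification head
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Sparse categorical cross-entropy pairs with integer labels, while the accuracy metric matches the competition's evaluation.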

Exercise: examine the possible choices for loss and metric - a useful guide is https://keras.io/api/losses/. What would the other reasonable options be?

The next step is the data:

Figure 3.5: setting up the data generator

Next we set up the model - straightforward, thanks to the function we defined above:

Figure 3.6: instantiating the model

Before we proceed with training the model, we should dedicate some attention to callbacks:

Figure 3.7: Model callbacks

Some points worth mentioning:

ModelCheckpoint is used to ensure we keep the weights for the best model only, where optimality is decided on the basis of the monitored metric (validation loss in this instance)
EarlyStopping helps us control the risk of overfitting by ensuring that training is stopped if the validation loss does not improve (decrease) for a given number of epochs
ReduceLROnPlateau is a schedule for the learning rate
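A hypothetical sketch of the three callbacks from Figure 3.7 (the monitored quantity, patience, and factor values are assumptions, not the notebook's exact settings):

```python
import tensorflow as tf

callbacks = [
    # Keep only the weights of the best epoch, judged by validation loss
    tf.keras.callbacks.ModelCheckpoint(
        "best_model.h5", monitor="val_loss", save_best_only=True),
    # Stop training if validation loss has not decreased for 3 epochs
    tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=3, restore_best_weights=True),
    # Halve the learning rate when validation loss plateaus
    tf.keras.callbacks.ReduceLROnPlateau(
        monitor="val_loss", factor=0.5, patience=2),
]
```

The list is then passed to model.fit(..., callbacks=callbacks).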

Exercise: what parameters would it make sense to modify in the above setup, and which ones can be left at their default values?

With this setup, we can fit the model:

Figure 3.8: Fitting the model

Once the training is complete, we can use the model to build the prediction of the image class for each image in the test set. Recall that in this competition the public (visible) test set consisted of a single image and the size of the full one was unknown - hence the need for a slightly convoluted manner of constructing the submission dataframe.

Figure 3.9: Generating a submission
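The submission step can be sketched as follows (a hypothetical version: the probability function and file name are placeholders; the key point is iterating over whatever test images are present at evaluation time and keeping the argmax class per image):

```python
import numpy as np
import pandas as pd

def build_submission(image_ids, predict_probs):
    # predict_probs(image_id) is assumed to return the class probabilities
    # for one image; we keep the argmax as the predicted label
    labels = [int(np.argmax(predict_probs(image_id))) for image_id in image_ids]
    return pd.DataFrame({"image_id": image_ids, "label": labels})

# A dummy probability function standing in for model inference
dummy = lambda image_id: np.array([0.1, 0.6, 0.1, 0.1, 0.1])
submission = build_submission(["example.jpg"], dummy)
```

In the real notebook, the per-image loop would load and preprocess each file before calling the trained model, and the dataframe would be written out with submission.to_csv("submission.csv", index=False).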

In this section we have demonstrated how to start competing in a competition focused on image classification - you can use this approach to move quickly from basic EDA to a functional submission. However, a rudimentary approach like this is unlikely to produce very competitive results. For this reason, in the next section we discuss more specialised techniques that were utilised in top-scoring solutions.

Learnings from top solutions

In this section we gather aspects from the top solutions that could allow us to rise above the level of the baseline solution. Keep in mind that the leaderboards (both public and private) in this competition were quite tight; this was a combination of a few factors:

The noisy data - it was easy to get to 0.89 accuracy by correctly identifying a large part of the train data, and then each new correct prediction allowed for a tiny move upward
The metric - accuracy can be tricky to ensemble
The limited size of the data

Pretraining

The first and most obvious remedy to the issue of limited data size was pretraining: using more data. The Cassava competition was held a year before as well:

https://www.kaggle.com/competitions/cassava-disease/overview

With minimal adjustments, the data from the 2019 edition could be leveraged in the context of the current one. Several competitors addressed the topic:

A combined 2019 + 2020 dataset in TFRecords format was released in the forum: https://www.kaggle.com/competitions/cassava-leaf-disease-classification/discussion/199131
The winning solution from the 2019 edition served as a useful starting point: https://www.kaggle.com/competitions/cassava-leaf-disease-classification/discussion/216985
Generating predictions on the 2019 data and using the pseudo-labels to augment the dataset was reported to yield some (minor) improvements: https://www.kaggle.com/competitions/cassava-leaf-disease-classification/discussion/203594

TTA

The idea behind Test Time Augmentation (TTA) is to apply different transformations to the test image: rotations, flipping and translations. This creates a few different versions of the test image, and we generate a prediction for each of them. The resulting class probabilities are then averaged to get a more confident answer. An excellent demonstration of this technique is given in a notebook by Andrew Khael: https://www.kaggle.com/code/andrewkh/test-time-augmentation-tta-worth-it

TTA was used extensively by the top solutions in the Cassava competition, an excellent example being the top-3 private LB result: https://www.kaggle.com/competitions/cassava-leaf-disease-classification/discussion/221150
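The mechanics can be sketched in a few lines (a minimal NumPy sketch: model_predict and the list of transforms are placeholders; real pipelines usually rely on the framework's augmentation utilities):

```python
import numpy as np

def tta_predict(model_predict, image, transforms):
    # Run the model on several transformed copies of the image and
    # average the class probabilities for a more stable prediction
    probs = np.stack([model_predict(t(image)) for t in transforms])
    return probs.mean(axis=0)

# Simple geometric transforms that keep a cassava leaf recognisable
transforms = [
    lambda x: x,                 # identity
    np.fliplr,                   # horizontal flip
    np.flipud,                   # vertical flip
    lambda x: np.rot90(x, k=1),  # 90-degree rotation
]
```

Since the averaged output is still a probability vector, it can be fed directly into any downstream ensembling step before taking the argmax.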

Transformers

While better-known architectures like ResNext and EfficientNet were used a lot in the course of the competition, it was the addition of more novel ones that provided the extra edge to many competitors yearning for progress in a tightly packed leaderboard. Transformers emerged in 2017 as a revolutionary architecture for NLP (if somehow you missed the paper that started it all, here it is: https://arxiv.org/abs/1706.03762) and were such a spectacular success that inevitably many people started wondering if they could be applied to other modalities as well - vision being an obvious candidate. The aptly named Vision Transformer made one of its first appearances in a Kaggle competition in the Cassava contest. An excellent tutorial for ViT has been made public:

https://www.kaggle.com/code/abhinand05/vision-transformer-vit-tutorial-baseline

Vision Transformer was an important component of the winning solution:

https://www.kaggle.com/competitions/cassava-leaf-disease-classification/discussion/221150

Ensembling

The core idea of ensembling is very popular on Kaggle (see Chapter 9 of The Kaggle Book for a more elaborate description) and the Cassava competition was no exception. As it turned out, combining diverse architectures was very beneficial (by averaging the class probabilities): EfficientNet, ResNext and ViT are sufficiently different from each other that their predictions complement each other. Another important angle was stacking, i.e. using models in two stages: the 3rd place solution combined predictions from multiple models and then applied LightGBM as a blender:

https://www.kaggle.com/competitions/cassava-leaf-disease-classification/discussion/220751

The winning solution involved a different approach (with fewer models in the final blend), but relying on the same core logic:

https://www.kaggle.com/competitions/cassava-leaf-disease-classification/discussion/221957
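Averaging class probabilities across architectures can be sketched as follows (a minimal sketch; the probability arrays below are placeholders standing in for, say, EfficientNet, ResNext, and ViT outputs):

```python
import numpy as np

def ensemble_predict(prob_list, weights=None):
    # prob_list: one (n_samples, n_classes) probability array per model;
    # average them (optionally weighted) and take the argmax class
    avg = np.average(np.stack(prob_list), axis=0, weights=weights)
    return np.argmax(avg, axis=1)

# Two dummy "models" disagreeing on the second sample
probs_a = np.array([[0.8, 0.2], [0.4, 0.6]])
probs_b = np.array([[0.7, 0.3], [0.9, 0.1]])
labels = ensemble_predict([probs_a, probs_b])
```

Averaging before the argmax is what sidesteps the discreteness of the accuracy metric; averaging the final labels instead would throw away the models' confidence.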

Summary
In this chapter, we have described how to get started with a baseline solution for an image classification competition and discussed a diverse set of possible extensions for moving to a competitive (medal) zone.
