3rd ISSTEC 2021 Template (Oral Presentation)
30 November 2021
1 Department of Statistics, IPB University, Bogor, West Java, Indonesia
2 Department of Mathematics, University of Jember, East Java, Indonesia
3 Department of Statistics, Syiah Kuala University, Aceh, Indonesia
isstec.uii.ac.id
VIRTUAL CONFERENCE
Outline
1 Introduction
2 Methodology
3 Result
4 Conclusion
Introduction
Machine learning is a set of computational methods that help in making and improving predictions and in modelling big data.
Big data is characterized by volume, data size, inconsistency, and data complexity. This complexity makes big data challenging to handle with traditional learning methods.
Black Box
The various components of a machine learning model are difficult to understand and interpret.
SHAP
SHapley Additive exPlanations
SHAP is able to interpret a black box, turning it into a white box.
The resulting values proved to be local and inconsistent for different observations.
SHAP is not equipped with an algorithm to filter many variables down to a finite set of variables.
Boruta-SHAP
Boruta-SHAP combines the Boruta variable selection algorithm with Shapley values.
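To make the Shapley values behind SHAP concrete, the sketch below computes exact Shapley values for one observation by enumerating every coalition of features. The toy model, the all-zero baseline, and the function names are illustrative assumptions and are not taken from the study; SHAP libraries rely on much faster algorithms for tree models such as XGBoost.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of `predict` at `x` relative to `baseline`.

    Features outside a coalition are replaced by their baseline value.
    The cost grows as 2**n, so this only suits a handful of features.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight of a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Hypothetical toy model with an interaction between features 1 and 2.
def toy_model(z):
    return 2.0 * z[0] + z[1] * z[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
# Additivity: the contributions sum to f(x) - f(baseline).
print(sum(phi), toy_model(x) - toy_model(baseline))
```

Because the values are computed for a single observation against a chosen baseline, they are local: a different observation generally yields different contributions.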
Introduction
The Boruta-SHAP algorithm will be used to analyze the incidence of household food insecurity in West Java Province using the XGBoost method.
To see if there is a difference in the performance of Boruta-SHAP and SHAP, the SHAP algorithm will be applied to data containing all variables (complete data) and to Boruta-SHAP filtered data.
Household food insecurity questions:
• You were worried you would not have enough food to eat?
• You were unable to eat healthy and nutritious food?
• You had to skip a meal?
• You were hungry but did not eat?
• You ate only a few kinds of foods?
• You ate less than you thought you should?
• Your household ran out of food?
• You went without eating for a whole day?
Methodology
Variable  Name of Variable
Y         Food Insecurity Status
X1        Education of Household Head (6 levels)
X2        Vulnerable Household Head (1=yes, 0=no)
X3        Number of Illiterate Family Members (1, 2, etc.)
X4        Access to Outpatient Treatment (1=yes, 0=no)
X5        Number of Family Members Having Saving Account (0, 1, 2, etc.)
X6        Income from Transferee (1=yes, 0=no)
X7        Ownership of Land (1=yes, 0=no)
X8        Grantee of Non-Cash Social Assistance (1=yes, 0=no)
X9        Grantee of Hopeful Family Program (1=yes, 0=no)
X10       Grantee of Prosperous Family Program (1=yes, 0=no)
X11       Grantee of Social Assistance from Local Government (1=yes, 0=no)
X12       Grantee of Health Insurance National Program (1=yes, 0=no)
X13       Grantee of Health Insurance Local Program (1=yes, 0=no)
X14       Grantee of Scholarship Social Program (1=yes, 0=no)
X15       Roof Types (roofing, asbestos, roof tile, concrete, other)
X16       Floor Types (cement, tiles/terrazzo, parquet/carpet, marble/ceramic, other)
X17       Wall Types (wood/board, wall/plaster/woven, other)
X18       House Size (square meter)
X19       Internet Access (1=yes, 0=no)
X20       Electricity (1=yes, 0=no)
X21       Types of Cooking Fuel (firewood, kerosene, LPG 3kg, not cooking, blue gas)
X22       Drinking-Water Source (borehole/pump, plumbing, bottled water/refill, other)
X23       Decent Drinking Water (1=yes, 0=no)
X24       Decent Sanitation (1=yes, 0=no)
Research Process
Start
Pre-Processing Data
Data Exploration
Branch 1 (complete data):
XGBoost Model
• Hyperparameter optimization
• Model Fitting
• Model Evaluation
SHAP
Feature Importance
Branch 2 (filtered data):
Boruta-SHAP
Boruta-SHAP Data
XGBoost Model
• Hyperparameter optimization
• Model Fitting
• Model Evaluation
SHAP
Feature Importance Boruta-SHAP
End
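The filtering idea behind Boruta-SHAP can be sketched as follows: add shuffled "shadow" copies of every variable, score all columns by mean absolute SHAP value, and keep the real variables that beat the best shadow in most trials. This is a dependency-free illustration under loud assumptions, not the study's pipeline: it swaps XGBoost for a least-squares linear model (whose SHAP value for feature j at row i is w_j * (x_ij - mean_j) when features are treated as independent), replaces Boruta's statistical test with a fixed hit-rate threshold, and uses synthetic data. All function names and parameters are hypothetical.

```python
import random
from statistics import fmean


def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    w = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][k] * w[k] for k in range(r + 1, n))
        w[r] = (M[r][n] - tail) / M[r][r]
    return w


def shap_importance(cols, y):
    """Mean |SHAP| per column for a least-squares linear model."""
    centered = []
    for c in cols:
        m = fmean(c)
        centered.append([v - m for v in c])
    y_mean = fmean(y)
    yc = [v - y_mean for v in y]
    # Normal equations on centered data (the intercept drops out).
    gram = [[sum(a * b for a, b in zip(ci, cj)) for cj in centered]
            for ci in centered]
    rhs = [sum(a * b for a, b in zip(ci, yc)) for ci in centered]
    w = solve(gram, rhs)
    return [abs(wj) * fmean([abs(v) for v in cj])
            for wj, cj in zip(w, centered)]


def boruta_shap_select(cols, y, n_trials=20, min_hit_rate=0.9, seed=0):
    """Keep columns whose importance beats the best shadow in most trials."""
    rng = random.Random(seed)
    hits = [0] * len(cols)
    for _ in range(n_trials):
        # Shadow features: shuffled copies, so any link to y is broken.
        shadows = [rng.sample(c, len(c)) for c in cols]
        imp = shap_importance(cols + shadows, y)
        threshold = max(imp[len(cols):])
        for j in range(len(cols)):
            if imp[j] > threshold:
                hits[j] += 1
    return [j for j, h in enumerate(hits) if h / n_trials >= min_hit_rate]


# Synthetic data: columns 0 and 1 drive y, columns 2 to 4 are pure noise.
data_rng = random.Random(42)
n = 200
cols = [[data_rng.gauss(0, 1) for _ in range(n)] for _ in range(5)]
y = [2.0 * a - 1.5 * b + data_rng.gauss(0, 1)
     for a, b in zip(cols[0], cols[1])]
selected = boruta_shap_select(cols, y)
print(selected)
```

The two informative columns should be retained; noise columns are usually, though not always, rejected, which is why the real algorithm uses a statistical test over many trials rather than a fixed threshold.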
Result
The proportion of vulnerable households is greater than the proportion of households that are not vulnerable.
Result
The SHAP values and the variable order of model 1 and model 2 are different. However, the seven top-ranked variables of model 1 all appear in model 2. It can therefore be concluded that the orders of importance obtained from the complete data and from the Boruta-SHAP filtered data tend to be the same and converge.
[Figure: bar charts of SHAP scores for model 1 and model 2]
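The comparison described on this slide can be sketched as a check that the top-ranked variables of one model coincide with the variables of the other, plus a rank correlation over the shared variables. The scores below are hypothetical placeholders, not the study's results, and the helper names are illustrative.

```python
def top_k(scores, k):
    """Names of the k highest-scoring variables, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def spearman(scores_a, scores_b):
    """Spearman rank correlation over shared variables (assumes no ties)."""
    shared = [v for v in scores_a if v in scores_b]
    order_a = sorted(shared, key=scores_a.get, reverse=True)
    order_b = sorted(shared, key=scores_b.get, reverse=True)
    rank_a = {v: r for r, v in enumerate(order_a)}
    rank_b = {v: r for r, v in enumerate(order_b)}
    n = len(shared)
    d2 = sum((rank_a[v] - rank_b[v]) ** 2 for v in shared)
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical placeholder scores, not the study's results.
model_1 = {"a": 0.9, "b": 0.7, "c": 0.5, "d": 0.2, "e": 0.1}
model_2 = {"b": 0.8, "a": 0.6, "c": 0.4}

k = len(model_2)
same_set = set(top_k(model_1, k)) == set(model_2)
rho = spearman(model_1, model_2)
print(same_set, rho)
```

A matching set with a rank correlation below 1 corresponds to the situation reported here: the same variables are selected, but their order differs between the two models.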
Conclusion
• The order of the variables differed between the SHAP values of the complete data model and those of the model fitted to the Boruta-SHAP filtered data.
• All seven variables selected by Boruta-SHAP are also the most important variables in the complete data SHAP model, so it can be concluded that the SHAP values of the two models are in line.
• The seven variables that contribute the most are ownership of land, house size, education of household head, number of family members having a saving account, decent drinking water, internet access, and drinking-water source.
• When the analysis only aims to see the importance of all variables, use the SHAP model without variable filtering; if the research instead aims to see which variables contribute the most to the response, applying the Boruta-SHAP variable filtering algorithm before modelling is recommended.
Acknowledgements
Research reported in this publication was supported by the Ministry of Research
and Technology/National Agency of Research and Innovation - Republic of
Indonesia under Award Number 1/E1/KP.PTNBH/2021. The content is solely the
responsibility of the authors and does not necessarily represent the official views
of the National Agency of Research and Innovation.
Thank you