
Sequence mining for searching interpretable demographic patterns

Anna Muratova
Robiul Islam
Ekaterina Mitrofanova
Outline

• Demographic data
• Previous methods
o Decision trees
o Variants of a custom kernel in the SVM method
o Recurrent neural networks
• Interpretable method
o The SHAP method
o Applying SHAP to the SVM model
o Tuning the number of features using SHAP
o Applying SHAP to sequences of events with a neural network
• Using a convolutional neural network instead of a recurrent one

2
Demographic data
The data for this work were obtained from the scientific laboratory of
socio-demographic policy at HSE and contain the results of a survey of
6626 people (3314 men and 3312 women)
Features:
• Gender (male, female);
• Generation (Soviet 1930-1969 and modern 1970-1986);
• Type of education (general, higher, professional);
• Location (city, town, country);
• Religion (yes, no);
• Frequency of church attendance (several times a week, once a week, at least once a month, several times a year or less)

For each person, the date of birth and the dates of significant life events
are also given, such as: partner, marriage, break-up, divorce,
education, work, separation from parents, and birth of a child.
We classified people by gender based on the other features and events.

3
Secondary feature coding
•Binary coding
0 – event happened
1 – event didn't happen

•Pair-of-events coding

< – if the first event happened before the second, or the second event
hasn't happened yet
> – otherwise
= – if both events happened at the same time
n – if neither event has happened yet

4
Comparison of classification methods

Methods                                          By sequences   By features   By sequences
                                                 only           only          and features
SVM
  Custom kernel function CP                      0.648          0.615         0.678
  Custom kernel function ACS                     0.659          0.615         0.670
  Custom kernel function LCS                     0.490          0.615         0.612
  Sequences transformed into features            0.675          0.615         0.716
Recurrent neural networks (Keras, Tensorflow)
  SimpleRNN, Dense                               0.676          0.626         0.754
Decision trees, time coding                      0.661

5
The SHAP method for interpretation
•SHAP is based on Shapley values from cooperative game theory,
developed by Lloyd Shapley
•They allow calculating the contribution of each player to the result

•Here, instead of players, we have features, and we calculate the
impact of each feature on the classification result

•Formula for calculating the Shapley value of feature i:

φ_i = Σ_{S ⊆ N \ {i}} [ |S|! · (n − |S| − 1)! / n! ] · (p(S ∪ {i}) − p(S))

n – number of features
N – the set of all features
S – a feature subset without feature i
p(S) – model prediction without feature i
p(S ∪ {i}) – model prediction with feature i
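For a small number of features the formula above can be evaluated exactly by enumerating all subsets S. A minimal sketch (the function name and the toy additive model are illustrative, not from the slides):

```python
# Exact Shapley value of feature i: weighted average of the marginal
# contribution p(S ∪ {i}) − p(S) over all subsets S of the other features.
from itertools import combinations
from math import factorial

def shapley_value(predict, features, i):
    others = [f for f in features if f != i]
    n = len(features)
    total = 0.0
    for k in range(len(others) + 1):
        for subset in combinations(others, k):
            S = frozenset(subset)
            # Shapley weight: |S|! (n − |S| − 1)! / n!
            weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            total += weight * (predict(S | {i}) - predict(S))
    return total

# Toy model: the prediction is just the sum of per-feature payoffs,
# so each feature's Shapley value must equal its own payoff.
payoffs = {"gender": 0.3, "generation": 0.2, "education": 0.1}
predict = lambda S: sum(payoffs[f] for f in S)
print(round(shapley_value(predict, list(payoffs), "gender"), 3))  # 0.3
```

This brute force is exponential in the number of features; the SHAP library on the next slides approximates these values efficiently.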

6
The SHAP method for interpretation
•Shapley values calculate the importance of a feature by
comparing what a model predicts with and without the feature.
However, since the order in which a model sees features can affect
its predictions, this is done in every possible order, so that the
features are compared fairly.

•There is a SHAP library for Python, developed by Scott M. Lundberg
et al.: https://github.com/slundberg/shap
•For SVM, the KernelExplainer module was used
•For neural networks, the DeepExplainer module was used

7
Applying SHAP to the SVM model
•Dataset with secondary features (binary and pair-of-events coding)
•Results of applying the SHAP method to the SVM model for gender classification

8
Tuning the number of features using SHAP
•Comparison of classification accuracy with the full feature set and
with the set reduced to the top 20 most influencing features

(Both configurations classify by sequences and features.)

Parameter                                            All binary-coded    Only most influencing
                                                     events and          event pairs
                                                     event pairs
Number of initial features                           5                   5
Number of additional features from sequences         1                   1
Number of binary-encoded events as features          8                   0
Number of event pairs (additional features)          28                  14
Total number of features (initial and additional)    42                  20
Accuracy                                             0.688               0.706
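The trimming step in the table can be sketched as: rank features by mean absolute SHAP value across samples and keep the top k (function name and toy numbers are illustrative):

```python
# Sketch: pick the k most influencing features by mean |SHAP value|.

def top_features(shap_values, names, k):
    """shap_values: one row per sample, one SHAP value per feature.
    Returns the k feature names with the largest mean |SHAP value|."""
    n = len(names)
    mean_abs = [sum(abs(row[j]) for row in shap_values) / len(shap_values)
                for j in range(n)]
    ranked = sorted(range(n), key=lambda j: mean_abs[j], reverse=True)
    return [names[j] for j in ranked[:k]]

# Toy SHAP values for two samples over three features:
rows = [[0.10, -0.40, 0.05],
        [-0.20, 0.30, 0.00]]
print(top_features(rows, ["religion", "generation", "location"], 2))
# ['generation', 'religion']
```

The model is then retrained on the reduced feature set, which in the table above raises the accuracy from 0.688 to 0.706 while dropping 22 features.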

9
Applying SHAP to the neural network model for
classification by sequences of events
•Influence of the events in the sequences on the classification

10
Using a convolutional neural network instead
of a recurrent one
•A network with a CNN channel for the sequences and a dense channel
for the features was created
•This network achieves the best classification accuracy, 0.78, surpassing
the previous best result of 0.75 obtained with an RNN for the sequences

Parameter                     Sequences channel   Features channel

Input size                    8                   5
Output size                   100                 100
Merged layer size             200
Intermediate layers sizes     200, 32
Model output size (binary)    1
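The two-channel architecture in the table can be sketched in Keras (requires TensorFlow). Only the layer sizes come from the slides; the Conv1D kernel size, pooling, and activations are illustrative assumptions:

```python
# Sketch of the two-channel network: CNN channel for the event sequences,
# dense channel for the features, merged into a shared classifier.
# Layer sizes follow the table; other choices are assumptions.
from tensorflow.keras import layers, Model

seq_in = layers.Input(shape=(8, 1), name="sequences")    # input size 8
x = layers.Conv1D(64, kernel_size=3, activation="relu")(seq_in)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(100, activation="relu")(x)              # channel output 100

feat_in = layers.Input(shape=(5,), name="features")      # input size 5
y = layers.Dense(100, activation="relu")(feat_in)        # channel output 100

z = layers.concatenate([x, y])                           # merged layer, 200
z = layers.Dense(200, activation="relu")(z)              # intermediate layers
z = layers.Dense(32, activation="relu")(z)
out = layers.Dense(1, activation="sigmoid")(z)           # binary output

model = Model([seq_in, feat_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy")
```

Replacing the SimpleRNN sequence channel with this convolutional one is the change credited with raising accuracy from 0.75 to 0.78.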

11
Conclusion

•In this work, the SHAP method was used to interpret the black-box
models developed in previous work.
•SHAP makes it possible to identify the most influencing features and events.
•It was shown how to adjust the model, taking the significance of the
features into account, and improve the classification accuracy.
•A new neural network model was developed, in which the recurrent
neural network was replaced by a convolutional one, improving the
classification accuracy to 0.78.

12
Thanks!
