Age and Gender Prediction in Open Domain Text

ScienceDirect
Available online at www.sciencedirect.com
Procedia Computer Science 170 (2020) 563–570
www.elsevier.com/locate/procedia
The 11th International Conference on Ambient Systems, Networks and Technologies (ANT)
April 6-9, 2020, Warsaw, Poland
Emad E. Abdallah a*, Jamil R. Alzghoul b, Muath Alzghool c

a Department of Computer Information System, Faculty of Information Technology, Hashemite University, Zarqa, Jordan
b University College of Surveying and Geospatial Sciences, Regional Center for Space Science and Technology Education for Western Asia, Amman, Jordan
c Department of Electrical Engineering and Computer Science, York University, 4700 Keele Street, Toronto, Ontario M3J 1P3, Canada
Abstract
The massive use of social media and the huge number of messages shared on the internet create a pressing need to
automatically detect the age and gender of the people who write these messages. Several sites and platforms attempt to
mislead and cheat their visitors by providing deceptive information about the age and gender of their customers. The
traditional way to detect deceivers was human judgment, but this is no longer suitable since many interviews are not
conducted face to face. This paper presents an automated tool with a unique set of features used to analyze a given text.
The features include unigrams, part of speech, and production rules. The accuracy of the proposed method outperforms
existing techniques. The best results were achieved using the production rules features.
© 2020 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the Conference Program Chairs.
Keywords: Open domain text; Deceptive information; Age prediction; Gender prediction.
* Corresponding author. Tel.: +0096-795673231; fax: +00962(05) 3826625
E-mail address: emad@hu.edu.jo
1877-0509 © 2020 The Authors. Published by Elsevier B.V.
10.1016/j.procs.2020.03.126
Emad E. Abdallah et al. / Procedia Computer Science 170 (2020) 563–570
1. Introduction
Nowadays, with the rapid spread in the use of internet resources and the huge amount of data shared on the
Internet, most people are using a variety of domains, forums, online dating sites, social network sites and consumer
reviews. For these reasons, many problems have arisen, such as gender and age deception. Recent studies
show that demographic information about deceivers can be useful [1]; it has been shown that internet users often lie
about their sex, appearance, age and education level.
This study aims to build a classifier using a machine learning technique to predict the gender and age of the
deceiver. This will help decrease the amount of deception on the internet in many fields, such as
online dating and job interviews. The target of this study is deceptive behavior where the domain and context are
not specific. The open source machine learning tool Weka [2] was used to classify and analyze the text,
where we applied different individual and combined classifier algorithms, including the support vector machine,
naïve Bayes, and decision tree (J48).
The remainder of this paper is organized as follows: In Section 2, we provide a brief review of the literature on age
and gender detection systems. In Section 3, we introduce the deception detection methodology adopted in this research.
In Section 4, we present some experimental results. Finally, we conclude and point out future directions in Section 5.
2. Related Work
Deception has been used in e-mails, social networking sites, and instant messaging such as SMS, and there is a
large number of research studies on deception particularly in social networks and online dating [3] [4].
Deception detection in an open domain is presented in [5]. The authors created an open domain deception
dataset, where the domain and context were not available. Demographic data, including age and gender, were
collected to create the dataset. The authors used several linguistic features. The age and gender experiment on short
open domain deceptive texts showed moderate improvements over the baseline for gender classification. However,
the age prediction results were poor. Generally, the results showed that the classifier did not perform very well when
trying to predict age and gender, so age and gender prediction are challenging tasks when trying to detect deception
in open domain data.
In an earlier study, Tilley and colleagues [6] showed that gender differences influence the communication
process and deception detection when interviewing through many electronic media. Throughout communication
actions, males are concerned about the communication process more than females. Moreover, males tend to
focus on situation and on freedom or independence, whereas females try to focus on knowledge, public connection
and communications, collaboration, and avoiding direct confrontation. The results showed that females were less
successful at deception than males and female receivers were more successful at detecting deception. This indicates
that companies may encounter difficulties when trying to hire the most competent employees via online interviews.
Earlier, Koppel and colleagues examined the prediction of gender in two types of formal documents (fiction and
non-fiction genres) written in British English. They used three sets of features: function words, POS, and a
combination of POS and function words. By comparing the two different types of document, it was found that the
use of these features resulted in very good prediction of gender [7].
A study to detect deception according to the deceiver’s age is presented in [8]. The article concentrated on
children aged between four and seven years old. Five classifiers, including SVM, random forest, and neural networks,
were used. The experiments evaluated the effectiveness of a new set of features, namely syntactic units, in identifying
the age of deceivers. The results demonstrated that sentence complexity (sentence length) has a relationship with the
deceiver’s age.
The above-mentioned research shows the effectiveness of using features that are derived from text analysis, such
as the unigram, shallow syntax features, deep syntax features that are derived from context-free grammar
(CFG), sentence count statistics, and POS. All of these features are very effective for detecting deception in text [9]
[10] [11].
Machine learning transforms input into output on a computer by using an algorithm that parses the data, learns
from them, and then produces a prediction. Such algorithms give computers the capability to learn without a set of
explicit instructions. The algorithm trains itself on past data and can adapt when the system is in a changing
environment or when it is exposed to new data [12].
Machine learning techniques can be categorized into supervised learning and unsupervised learning. Supervised
learning generates a new model from what was learned in the past (from old data), whereas unsupervised learning
makes deductions from the data set. There are many open source machine learning frameworks, such as Weka [2],
which supports many classifiers such as the support vector machine (SVM), decision trees, naïve Bayes, and rules.
3. Methodology
In this section, we present the methodology of the proposed age and gender detection system. The first step is to
input the data. The second step is tokenization, which chops the text into words, and extraction of the feature sets
that are later used to build the classifier. The third step is applying the string to word vector filter, which is very
important as it cleans the data by removing unnecessary information in order to improve system performance. The
fourth step is applying feature selection to the data. The fifth step is applying the classifier using different
algorithms, namely support vector machine (SVM), naïve Bayes, decision tree, logistic, random forest, stacking, and
multiclass classifier. The last step is producing the output class and evaluating the performance of the model. The
classes used in the experiments are the gender classes (male, female) and the age classes (younger, older), which
predict the gender and age of the participant, respectively. Fig. 1 describes the age detection methodology. The first
class, A, is where the age of the participants is less than or equal to 35 years, and the second class, B, is where the
participants’ age is more than 35 years.
[Figure: flowchart. The input open domain text (data) is split into training data to build the system and testing data to check the correctness of the system. Each branch passes through tokenization and feature set extraction, string to word vector, and feature selection, and ends in Class A or Class B.]
Fig. 1. The proposed age and gender methodology
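The tokenization, string to word vector, and class labeling steps described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the Weka filter used in the study: the helper names (`tokenize`, `build_vocabulary`, `to_word_vector`, `age_class`) and the two sample sentences are hypothetical, and the text does not state whether word counts or binary presence values were used.

```python
import re
from collections import Counter

def tokenize(text):
    """Chop a text into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_vocabulary(token_lists):
    """Map each distinct token to a column index, in sorted order."""
    vocab = sorted({tok for toks in token_lists for tok in toks})
    return {tok: i for i, tok in enumerate(vocab)}

def to_word_vector(tokens, vocab):
    """Turn one token list into a vector of word counts over the vocabulary."""
    counts = Counter(tokens)
    return [counts.get(tok, 0) for tok in vocab]

def age_class(age):
    """Class A: 35 or younger; class B: older than 35 (threshold from the text)."""
    return "A" if age <= 35 else "B"

# Hypothetical sample texts, for illustration only.
texts = ["I love music", "Music is life, I think"]
token_lists = [tokenize(t) for t in texts]
vocab = build_vocabulary(token_lists)
vectors = [to_word_vector(toks, vocab) for toks in token_lists]
```

Each text becomes one row of numbers with one column per vocabulary word, which is the form a classifier can consume.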
4. Experiments and Results
The experiments were designed to evaluate the performance of the proposed classifier in predicting gender and age.
Eight different experiments were applied to predict the gender of the writers of open domain texts, one for each of the
following feature sets: unigram; part of speech (POS); production rules; unigram + lexicalized production rules + POS;
POS + lexicalized production rules; POS + production rules; unigram + production rules; and unigram + POS.
Different classifiers were used to evaluate the features, including single and combined classifiers, namely SVM;
naïve Bayes; decision tree; SVM + J48; SVM + logistic + random forest + multiclass classifier; multiclass
classifier + random forest; and SVM + stacking + multiclass classifier.
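To make the production rule features concrete, the following sketch extracts rules from a bracketed constituency parse. It rests on several assumptions: the paper does not say which parser produced the trees, the functions `parse_tree` and `production_rules` and the example tree are hypothetical, and "lexicalized" is taken here in its common sense of also keeping rules that rewrite a POS tag to a word.

```python
def parse_tree(s):
    """Parse a bracketed tree string into nested (label, children) tuples."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1                     # consume "("
            label = tokens[pos]
            pos += 1
            children = []
            while tokens[pos] != ")":
                children.append(read())
            pos += 1                     # consume ")"
            return (label, children)
        word = tokens[pos]               # a bare token is a word (leaf)
        pos += 1
        return word

    return read()

def production_rules(tree, lexicalized=False):
    """Collect rules of the form 'PARENT -> CHILD ...' from a parsed tree."""
    label, children = tree
    rules = []
    if all(isinstance(c, str) for c in children):
        # Pre-terminal such as (NN music): kept only in the lexicalized variant.
        if lexicalized:
            rules.append(f"{label} -> {' '.join(children)}")
        return rules
    rhs = [c if isinstance(c, str) else c[0] for c in children]
    rules.append(f"{label} -> {' '.join(rhs)}")
    for child in children:
        if not isinstance(child, str):
            rules.extend(production_rules(child, lexicalized))
    return rules

# Hypothetical parse of the sentence "I love music".
tree = parse_tree("(S (NP (PRP I)) (VP (VBP love) (NP (NN music))))")
unlex = production_rules(tree)
lex = production_rules(tree, lexicalized=True)
```

Each distinct rule string can then be treated like a word in the unigram representation, so the same vectorization and feature selection steps apply.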
A meta classifier is also known as an ensemble method. The idea of a meta classifier is to combine a set of
classifiers that vote on the classification, instead of utilizing each classifier separately. Meta classifiers often surpass
other techniques in terms of accuracy. This is because an ensemble misclassifies an instance only when the combined
vote of its members is wrong, so the errors of individual classifiers can offset one another. A meta classifier uses a
weighted vote in the classification step, where each classifier in the ensemble predicts a class and gives a value to its
prediction; the meta classifier then sums these values and chooses the class with the largest total [2] [13].
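The weighted vote can be illustrated with a short sketch. The function name `weighted_vote` and the example weights are hypothetical; how the weights were set in the experiments is not stated in the text.

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Return the class with the largest total weight.

    predictions: one predicted class label per base classifier
    weights: one non-negative weight per base classifier
    """
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)

# Three hypothetical base classifiers: one confident vote outweighs two weak ones.
result = weighted_vote(["male", "female", "female"], [0.7, 0.2, 0.3])
```

With equal weights the rule reduces to a simple majority vote.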
4.1 Gender Prediction Experiments
The first experiment was designed to evaluate the performance of the proposed classifier in predicting the
participant’s gender from the short text. The procedure for testing the gender starts with submitting the data set,
then applying the string to word vector filter, and then applying the classifier algorithms. Fivefold cross-validation was
performed for all experiments. The best accuracy for the SVM classifier was achieved by combining the features
(unigram + CFG), achieving 69.33% accuracy. For naïve Bayes, the best performance was achieved by the
(unigram + lexicalized CFG) feature set, which achieved an accuracy of 68.16%. The third algorithm applied to the
data was the decision tree. The classification results show that the best performance was achieved by the combined
feature set (unigram + lexicalized CFG), with an accuracy of 62.89%. Fig. 2 shows the detailed results for the
three classifiers. Clearly, there are no significant differences between the feature set experiments.
Fig. 2. Gender prediction experiment (Accuracy Vs Features) for different classifiers. 1. Unigram, 2. POS, 3. Production rules, 4. Unigram + Lex
CFG+POS, 5. POS + Lex CFG, 6. POS +CFG, 7. Unigram + CFG, 8. Unigram + POS.
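Fivefold cross-validation, used for all experiments, can be sketched as follows. This is a simplified illustration with hypothetical function names (`k_fold_indices`, `cross_validated_accuracy`): it uses contiguous folds without shuffling or stratification, which may differ from what the Weka tool does internally. The sample count of 512 is taken from the age class distribution reported in Section 4.2 (319 + 193).

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

def cross_validated_accuracy(fit_predict, X, y, k=5):
    """Average accuracy over k folds.

    fit_predict(train_X, train_y, test_X) must return one predicted label
    per test instance; it stands in for any classifier.
    """
    accuracies = []
    for train, test in k_fold_indices(len(X), k):
        preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                            [X[i] for i in test])
        correct = sum(p == y[i] for p, i in zip(preds, test))
        accuracies.append(correct / len(test))
    return sum(accuracies) / k

folds = list(k_fold_indices(512, k=5))
```

Every instance is used for testing exactly once, and the reported accuracy is the mean over the five held-out folds.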
To improve the classification accuracy, we applied feature selection using the information gain filter method.
Again, fivefold cross-validation was performed for all experiments. Fig. 3 shows the detailed results for
the three classifiers. Evidently, all the results of the experiments after feature selection outperformed those of
the experiments before feature selection. The two feature sets that achieved the highest accuracy are (CFG),
which achieved an accuracy of 82.81%, and (unigram + CFG), which achieved an accuracy of 82.51%.
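Information gain scores a feature by how much knowing its value reduces the entropy of the class label. The sketch below is a minimal illustration, assuming binary presence features and hypothetical names (`entropy`, `information_gain`, `rank_features`, and the toy columns); the number of features kept after ranking is not stated in the text.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy after splitting on a feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

def rank_features(columns, labels):
    """Return feature names sorted by information gain, highest first."""
    gains = {name: information_gain(col, labels) for name, col in columns.items()}
    return sorted(gains, key=gains.get, reverse=True)

# Toy data, for illustration only.
labels = ["male", "male", "female", "female"]
columns = {
    "rule_present": [1, 1, 0, 0],   # perfectly separates the two classes
    "word_present": [1, 0, 1, 0],   # carries no information about the class
}
ranking = rank_features(columns, labels)
```

A filter method of this kind ranks features independently of any classifier and keeps only the highest-scoring ones before training.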
After identifying the best two of the eight feature sets, we applied the three classifiers (SVM,
naïve Bayes, decision tree) to these feature sets to determine the best classifier. The results of the experiment are
shown in Fig. 4, from which it can be seen that the best feature set is CFG and the best classifier is SVM, with an
accuracy of 82.81%. Clearly, the SVM is the best classifier for a two-class problem with a balanced data set,
and small modifications to the extracted feature data do not affect its results. The combined classifiers did not
achieve better accuracy than the single classifier.
Fig. 3. Gender prediction experiment (Accuracy Vs Features) with feature selection using the information gain filter. 1. Unigram, 2. POS, 3.
Production rules, 4. Unigram+ Lex CFG+POS, 5. POS+ Lex CFG, 6. POS +CFG, 7. Unigram + CFG, 8. Unigram+ POS.
Fig. 4. Experiment results for the best two feature sets using SVM, naïve Bayes, and Decision Tree with feature selection for gender prediction.
Combinations of classifiers on the best two sets of features are shown in Fig. 5. The combined classifiers (SVM
+ Decision Tree) that are applied on (unigram + CFG) features achieved accuracy of 81.05%. Thus, according to
these results, the combined classifier did not achieve better accuracy than the SVM classifier for the gender
prediction experiments.
The maximum accuracy that could be achieved by our research for gender prediction was 82.81%, which was in
an experiment conducted on the feature set (production rules) using the support vector machine classifier. This result
was also better than that of a previous study that reported a result of 62.26%.
Fig. 5. Experiment results for the best two feature sets using combined classifiers. 1. SVM, 2. Naive bayes, 3. Decision tree, 4. SVM + j48, 5.
SVM+ logistic + random forest+ Multi class classifier, 6. Multi class classifier + Random forest, 7. SVM+ stacking + Multiclass classifier.
4.2 Age prediction experiments
The second experiment was designed to evaluate the performance of the proposed classifier in predicting the
age of the writer of open domain texts, with different sets of features, using single and combined classifiers. We
divided the participants into two age groups. The first group is named “younger”, where the age of the
participants is less than or equal to 35 years old (<= 35), and the second group is named “elder”, where the
participants’ age is more than 35 years old (> 35). The class distribution of the age groups is 319 participants
aged <= 35 and 193 participants aged > 35.
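Because the two age groups are unequal in size, it helps to know the accuracy of a trivial classifier that always predicts the larger group. This figure is not reported in the text; the short calculation below derives it from the stated class distribution, and the function name `majority_baseline` is hypothetical.

```python
def majority_baseline(class_counts):
    """Accuracy of always predicting the most frequent class."""
    return max(class_counts.values()) / sum(class_counts.values())

# Class distribution as stated in the text.
age_counts = {"younger (<= 35)": 319, "elder (> 35)": 193}
baseline = majority_baseline(age_counts)
print(f"{baseline:.2%}")  # prints 62.30%
```

Accuracies for age prediction can be read against this value of 319/512, roughly 62.3%.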
First, the SVM results are shown in Fig. 6. The best performance was achieved by the feature set (POS +
CFG), with an accuracy of 70.01%. For naïve Bayes, the best performance was achieved by
(unigram + CFG), with an accuracy of 62.69%. The decision tree results show that the best performance was achieved
by the combined feature set (unigram + POS), with an accuracy of 69.92%. Clearly, there are no significant
differences between the feature set experiments.
To improve the classification accuracy, several experiments were conducted with feature selection using the
information gain filter method. Fig. 7 shows the detailed results for the three
classifiers. Clearly, all the results of the experiments after feature selection were better than those of the experiments
before feature selection. The feature sets that achieved the highest accuracy are (CFG), with 83.20%;
(unigram + CFG), with 82.61%; and (POS + lexicalized production rules), with 82.51%.
The maximum accuracy that could be achieved for age prediction was 83.20%, which was in an experiment that
was conducted on a combined feature set (part of speech + lexicalized production rules) using combined classifiers
(multiclass classifier + random forest). This result is also better than that in a previous study, which reported a result
of 69.92%.
Fig. 6. Age prediction experiment (Accuracy Vs Features) for different classifiers. 1. Unigram, 2. POS, 3. Production rules, 4. Unigram+ Lex
CFG+POS, 5. POS+ Lex CFG, 6. POS +CFG, 7. Unigram + CFG, 8. Unigram+ POS.
Fig. 7. Age prediction experiment (Accuracy Vs Features) with feature selection using the information gain filter. 1. Unigram, 2. POS, 3.
Production rules, 4. Unigram+ Lex CFG+POS, 5. POS+ Lex CFG, 6. POS +CFG, 7. Unigram + CFG, 8. Unigram+ POS.
5. Conclusion
The goal of this research was to predict the age and gender of the writers of deceptive text. A complete model
was trained and tested with several classifiers on a data set to determine the gender and age of the writers of deceptive
text. In order to achieve the highest accuracy possible, information gain-based feature selection was applied to remove
irrelevant and redundant features that do not have a high impact. The experimental results outperform the existing
techniques that deal with open domain text. The best results for gender prediction were achieved using the CFG features,
with an accuracy of 82.81% via the SVM classifier. The best prediction for age was likewise achieved using
CFG, with an accuracy of 83.2% via the SVM classifier.
References
[1] Warkentin, Darcy, Michael Woodworth, Jeffrey T. Hancock, and Nicole Cormier. (2010) "Warrants and deception in computer mediated
communication." In Proceedings of the ACM conference on Computer supported cooperative work, pp. 9-12.
[2] Witten, Ian H., Eibe Frank, Mark A. Hall, and Christopher J. Pal. (2016) Data Mining: Practical machine learning tools and techniques.
[3] Lu, Hung-Yi. (2008) "Sensation-seeking, Internet dependency, and online interpersonal deception" CyberPsychology & Behavior 11, no. 2:
227-231.
[4] Naquin, Charles E., Terri R. Kurtzberg, and Liuba Y. Belkin. (2010) "The finer points of lying online: E-mail versus pen and paper." Journal
of Applied Psychology 95, no. 2: 387.
[5] Pérez-Rosas, Verónica, and Rada Mihalcea. (2015) "Experiments in open domain deception detection." In Proceedings of the 2015
conference on empirical methods in natural language processing, pp. 1120-1125.
[6] Tilley, Patti, Joey F. George, and Kent Marett. (2005) "Gender differences in deception and its detection under varying electronic media
conditions." In Proceedings of the 38th Annual Hawaii International Conference on System Sciences, pp. 24b-24b. IEEE, 2005.
[7] Koppel, Moshe, Shlomo Argamon, and Anat Rachel Shimoni. (2002) "Automatically categorizing written texts by author gender." Literary
and linguistic computing 17, no. 4: 401-412.
[8] Yancheva, Maria, and Frank Rudzicz. (2013) "Automatic detection of deception in child-produced speech using syntactic complexity
features." In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 944-
953.
[9] Mihalcea, Rada, and Carlo Strapparava. (2009) "The lie detector: Explorations in the automatic recognition of deceptive language."
In Proceedings of the ACL-IJCNLP 2009 Conference, pp. 309-312. Association for Computational Linguistics, 2009.
[10] Ott, Myle, Claire Cardie, and Jeffrey T. Hancock. (2013) "Negative deceptive opinion spam." In Proceedings of the north American chapter
of the association for computational linguistics: human language technologies, pp. 497-501.
[11] Feng, Song, Ritwik Banerjee, and Yejin Choi. (2012) "Syntactic stylometry for deception detection." In Proceedings of the 50th Annual
Meeting of the Association for Computational Linguistics. Volume 2, pp. 171-175. Association for Computational Linguistics.
[12] Sutton, Richard S., and Andrew G. Barto. (2011) "Reinforcement learning: An introduction".
[13] Han, Jiawei, Jian Pei, and Micheline Kamber. (2011) Data mining: concepts and techniques. Elsevier.
