Author Profiling Using Machine Learning Techniques
The count of documents for each category varies from 96 to 776 files each. On further inspection, the text format provided in each folder apart from the 3rd folder differs from the training corpus; the folders contain offline texts, Facebook, Twitter, product and online reviews, and gender imitation corpora, in order of their folder number, as shown in Table 1-2.

We have also taken statistics of the complete vocabulary size obtained from a grid search over the attributes n-gram_range and min_df count. For each combination the respective corpus size is found, and the statistics are tabulated in Table 3.

Table 3: VOCABULARY SIZE for each (min, n-gram) combination

(min, n-gram)  Training Dataset  DS1      DS2     DS3     DS4     DS5
(1,1)          183119            96544    52098   16077   9184    6797
(1,2)          852380            390683   234284  80991   39732   25139
(1,3)          1698948           746445   451658  164853  75790   46435
(1,4)          2583433           1114714  672974  252602  111942  68167
(2,1)          45223             22482    17278   5989    3438    2151
(2,2)          91646             38869    29643   13529   6528    3562
(2,3)          106192            43324    31965   16190   7291    3898
(2,4)          109234            44206    32232   17074   7565    3987
(3,1)          28365             13188    9973    3801    2190    1233
(3,2)          106192            19702    15086   7202    3381    1767
(3,3)          52940             20923    15708   8098    3526    1864
(3,4)          53646             21065    15745   8381    3545    1886
(4,1)          20677             9228     6910    2805    1594    821
(4,2)          32403             12953    9931    4903    2305    1101
(4,3)          34876             13551    10226   5380    2376    1139
(4,4)          35184             13604    10237   5512    2383    1145

3 METHODOLOGY

Figure 1 gives a rudimentary picture of the architecture we implemented for our 3 submissions. In all of these models we mainly focused on data pre-processing methods to incorporate various features, building upon each of them to improve the feature representation. We started from a simple count-based model; the same methods are discussed next.

3.1 Feature Selection

Feature selection was a process in which we started by building a baseline model [1] and improved on the accuracy of the model with a step-by-step empirical procedure of combining and modifying the existing feature representation [10].

• Count Based Matrix:
The 1st approach from the dataset was to form simple count-based Term Document (TD) and Term Frequency Inverse Document Frequency (TFIDF) matrices, which became the baseline for our accuracy; we then went further by adding general features to the previous representation.

• Feature Extraction:
With knowledge of the social media network "Twitter", we essentially narrowed our search for features down to tags: '@', which is mainly used to address people or gatherings, and the hash tag '#', which is based on the context or the image of the adjoining content. Moving on, we found that URLs linking to various internet sources were used widely across most of the dataset, so we incorporated these as a feature into the earlier feature representation, which showed a slight improvement on all of the classification algorithms, captured in Figure 3-4.

• Data Normalization:
On further analysis we found that individual URLs in themselves seemed fruitless as features; thus, considering only the hyperlink itself, we focused on normalizing these across the dataset and used the counts of URLs and tags as features to represent a document. This increased the accuracy a little more. This further led to normalizing various emoticons, each represented by a keyword, and various other punctuation marks like the exclamation mark '!', period '.' and hyphen '-': occurrences repeated multiple times or in continuous runs were converted to a single instance of each.

• Word Average:
As we were not familiar with the language, we considered the average word length, i.e. the total number of characters per document divided by the total number of feature instances in that document, and appended that list as an average word length feature.
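The normalization steps described above can be sketched with a few regular-expression substitutions. This is a hypothetical helper, not our exact implementation: URLs collapse to the keyword 'https', mentions and hash tags reduce to their bare tag symbols, and repeated '!', '.' or '-' collapse to a single instance.

```python
import re

def normalize(text):
    """Sketch of the normalization pipeline (illustrative only)."""
    text = re.sub(r"https?://\S+", "https", text)  # every URL -> one keyword
    text = re.sub(r"@\w+", "@", text)              # mentions -> bare '@'
    text = re.sub(r"#\w+", "#", text)              # hash tags -> bare '#'
    text = re.sub(r"([!.\-])\1+", r"\1", text)     # '!!!', '...', '---' -> single char
    return text
```

Applied to a tweet-like string, each noisy token becomes a single normalized symbol, so the downstream count-based features see one feature per URL or tag rather than an open-ended vocabulary of them.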
AmritaNLP@PAN-RusProfiling
(1) A simple count-based matrix is taken to achieve baseline accuracy; for this we considered both TD and TFIDF matrix representations, from which we set a baseline accuracy of 80.5% (we randomly initialized a few attributes, n-gram_range = 2 and min_df = 3, and used a linear SVM classifier to get the baseline).
(2) Counts of 'http' and 'https' are taken and converted to a single keyword 'https'; this adds a feature capturing the usage of URLs between the 2 classes, distinguishing which gender might have used a greater number of hyperlinks within their tweets.
(3) The count of '#' tags was further added to the previous representation.
(4) Emoticons were replaced with keywords.
(5) The average word length in a document, i.e. the count of characters divided by the number of feature instances, was taken; as the language was unfamiliar to us, we chose this as a preferred method.

5 FEATURE REPRESENTATION MODELS

We submitted a total of 3 models/runs, and for each individual run the following pre-processing methods were followed:

• Submission 1: We considered feature representations 2, 3, 4 and 5, and also the normalization of '@' followed by content tags to a simple keyword (splitting tags from their context otherwise, to preserve the word content in particular, did not show much difference in validation accuracy), and used an SVM classifier for classification. Based on the model learned from the training corpus, predictions for the test corpora were taken.

• Submission 2: The same feature representation as the 1st run was considered, but with a different classifier: we took an Adaboost-based training model, and predictions for the test corpora were taken.

• Submission 3: In this run we mostly considered the other test datasets 1, 2, 4 and 5, where the content is longer and in paragraph form rather than the shorter version, with little to no use of tags or hyperlinks. To normalize this, we disregarded the above-used tags and reduced any extended repetition of punctuation to a single instance (e.g. '...' is shortened to a single '.'). A sample of this is shown in Figure 2.

6 RESULTS

As per the global ranking published for the shared task by the organizers [6], our team secured 2nd position overall (concatenating all datasets). From the rankings, our 3rd submission performed the best, beating our team's previous 2 submissions by margins of 1% and 2% respectively, whereas we trailed the leading team by a margin of 6%. This was consistent with the facts we mentioned for submission 3, and we also obtained better validation accuracy with the submission 3 model on datasets 1, 2, 4 and 5. Individually, submission 3 gained the 2nd best accuracy on the "offline texts (picture descriptions, letter to a friend etc.) from RusProfiling corpus", whereas submission 1 gained our team 2nd place for the "gender imitation corpus" and 3rd in the "product and service online reviews corpus".

7 EXTENDED WORK

In our earlier experiments we randomly initialized our attributes max feature length, n_gram and min_df with 10000, 2 and 3 respectively. To increase validation accuracy, we performed a grid search for the hyper-parameters, namely word count, n_gram and min_df, with the TFIDF model, where we considered the following range of values for each:

Word count : 10000 - 50000
ngram-range : 2 - 6
Min-df : 1 - 4

After applying grid search we pushed the baseline accuracy to 82.5% when initializing max_feature with 10,000, n_gram with 2 and min_df with 1 and applying a linear SVM classifier. We further pushed our validation accuracy to 86.49% by applying an Adaboost classifier. Overall, we found that the accuracy of the TD feature representation model decreased with an increase in all of the attributes, while the accuracy of the TFIDF feature representation model increased but saturated after the n_gram value exceeded 6 and the min_df value exceeded 4; the same is shown in Figure 5, where the best attribute and feature combinations were taken for each TD and TFIDF representation.

Figure 5: Hyper-parameter tuning for feature representation

8 CONCLUSION & FUTURE WORK

The challenge we faced in this shared task was that we were working on a language corpus non-native to us; thus we mainly focused on pre-processing and normalizing the data corpus to obtain an improved feature representation. We built from a basic count representation, incorporated simple modifications while iterating on the feature representation, and observed the various accuracy changes involved with those features. Based on the experimental analysis and the further discussion on optimizing the various attributes in the extended work, we infer that the baseline can be further increased, which could better improve the prediction, fetching us a better gender identification model.
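For reference, the attribute sweep described in the extended work can be sketched as a plain exhaustive search. The evaluate function below is a toy stand-in, not our actual validation run; it is chosen only so that its optimum coincides with the settings we report (n_gram = 2, min_df = 1).

```python
from itertools import product

def grid_search(evaluate, word_counts, ngram_sizes, min_dfs):
    """Try every (word count, n_gram, min_df) combination and keep
    the attribute setting with the highest validation accuracy."""
    best_score, best_params = None, None
    for wc, ng, md in product(word_counts, ngram_sizes, min_dfs):
        score = evaluate(wc, ng, md)
        if best_score is None or score > best_score:
            best_score = score
            best_params = {"max_features": wc, "n_gram": ng, "min_df": md}
    return best_score, best_params

# Toy scoring function peaking at n_gram = 2, min_df = 1 (illustration only).
toy = lambda wc, ng, md: 0.80 + 0.02 * (ng == 2) + 0.005 * (md == 1)
score, params = grid_search(toy, range(10000, 50001, 10000),
                            range(2, 7), range(1, 5))
```

In practice the search would be driven by cross-validated accuracy of the TFIDF + classifier pipeline over the same ranges, rather than a fixed formula.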
REFERENCES
[1] H.B. Barathi Ganesh, M. Anand Kumar, and K.P. Soman. 2016. Statistical semantics
in context space: Amrita CEN@author profiling. CEUR Workshop Proceedings
1609 (2016), 881–889.
[2] H.B. Barathi Ganesh, U. Reshma, and M. Anand Kumar. 2015. Author identifica-
tion based on word distribution in word space. 2015 International Conference on
Advances in Computing, Communications and Informatics, ICACCI 2015 (2015),
1519–1523. https://doi.org/10.1109/ICACCI.2015.7275828
[3] Fabio Celli, Bruno Lepri, Joan-Isaac Biel, Daniel Gatica-Perez, Giuseppe Ric-
cardi, and Fabio Pianesi. 2014. The Workshop on Computational Personality
Recognition. (2014).
[4] Tatiana Litvinova, Olga Litvinova, Olga Zagorovskaya, Pavel Seredin, Aleksandr
Sboev, and Olga Romanchenko. 2016. "RusPersonality": A Russian corpus for
authorship profiling and deception detection. In Intelligence, Social Media and
Web (ISMW FRUCT), 2016 International FRUCT Conference on. IEEE, 1–7.
[5] Tatiana Litvinova and Olga Litvinova. 2016. Authorship Profiling in Russian-
Language Texts. In Proceedings of 13th International Conference on Statistical
Analysis of Textual Data (JADT 2016), University Nice Sophia Antipolis, Nice.
793–798.
[6] Tatiana Litvinova, Francisco Rangel, Paolo Rosso, Pavel Seredin, and Olga Litvi-
nova. 2017. Overview of the RUSProfiling PAN at FIRE Track on Cross-genre
Gender Identification in Russian. In Notebook Papers of FIRE 2017, FIRE-2017,
Bangalore, India, December 8-10, CEUR Workshop Proceedings. CEUR-WS.org.
[7] Francisco Rangel, Paolo Rosso, Moshe Koppel, Efstathios Stamatatos, and Gia-
como Inches. 2013. Overview of the author profiling task at PAN 2013. Notebook
Papers of CLEF (2013), 23–26.
[8] Francisco Rangel, Paolo Rosso, Ben Verhoeven, Walter Daelemans, Martin Pot-
thast, and Benno Stein. 2016. Overview of the 4th author profiling task at PAN
2016: cross-genre evaluations. Working Notes Papers of the CLEF (2016).
[9] Francisco Manuel Rangel Pardo, Fabio Celli, Paolo Rosso, Martin Potthast, Benno
Stein, and Walter Daelemans. 2015. Overview of the 3rd Author Profiling Task
at PAN 2015. In CLEF 2015 Evaluation Labs and Workshop Working Notes Papers.
1–8.
[10] Aleksandr Sboev, Tatiana Litvinova, Dmitry Gudovskikh, Roman Rybka, and
Ivan Moloshnikov. 2016. Machine Learning Models of Text Categorization by
Author Gender Using Topic-independent Features. Procedia Computer Science
101 (2016), 135–142.
[11] Aleksandr Sboev, Tatiana Litvinova, Irina Voronina, Dmitry Gudovskikh, and
Roman Rybka. 2016. Deep Learning Network Models to Categorize Texts Ac-
cording to Author’s Gender and to Identify Text Sentiment. In Computational
Science and Computational Intelligence (CSCI), 2016 International Conference on.
IEEE, 1101–1106.