Professional Documents
Culture Documents
2022.dravidianlangtech 1.44
2022.dravidianlangtech 1.44
2022.dravidianlangtech 1.44
292
Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages, pages 292 - 298
May 26, 2022 ©2022 Association for Computational Linguistics
the comments from the Tamil-English code mixed Class balancing of the data has also been attempted
language. as the distribution of the class labels in the given
The goal of this task is to identify whether a training dataset.
given comment contains abusive comment. A com-
ment/post within the corpus may contain more than 4.2 Participant’s Systems
one sentence but the average sentence length of Term Frequency- Inverse Document Frequency
the corpora is 1. The annotations in the corpus (TF- IDF) and BERT embeddings have been used
are made at a comment/post level. The partici- to extract and represent the features in the feature
pants were provided development, training and test extraction phase. The participants have used a wide
dataset in Tamil and Tamil-English languages. The variety of machine learning algorithms, deep learn-
dataset is tagged using various classes namely, Ho- ing models, and transformers. Logistic Regres-
mophobia, Misandry, Counter-speech, Misogyny, sion, Linear Support Vector Machines, Gradient
Xenophobia, Transphobic and Hope Speech. To the Boost classifier, and K neighbor classifier have
best of our knowledge, this is the first shared task been used as machine learning algorithms. En-
on abusive detection in Tamil at this fine-grained semble models attempted composed of a mixture
level. 11 teams participated for detecting abusive of these machine learning models. Multi-layered
comments in Tamil language and Tamil-English perceptron, Recurrent Neural Networks (RNN),
language tasks. Vanilla LSTM (Schuster and Paliwal, 1997) were
opted as deep learning models. On the transformers
2 Task Description
front, mBERT(Devlin et al., 2018), MuRIL BERT
The task is primarily a comment/post-level classi- (Khanuja et al., 2021), XLM RoBERTa (Liu et al.,
fication task. Given a YouTube comment, the sys- 2019), and ULMFit (Howard and Ruder, 2018)
tems submitted by the participants should classify it models have been opted. The MuRIL BERT mod-
abusive categories. The participants were provided els have shown the best performance compared
with development, training and test dataset in Tamil to the other models. This is primarily because it
and Tamil-English. The dataset is tagged using is trained exclusively for Indian languages. The
various classes namely, Homophobia, Misandry, ranking of the teams for both of the language tasks
Counter-speech, Misogyny, Xenophobia, Trans- is shown in Tables 2 and 3. The ranking is given
phobic and hope speech. 10 teams participated based on their f1 score and how intense their system
for detecting abusive comment in Tamil language is, which counts their pre-processing techniques
and 11 teams participated for the Tamil-English and the number of models used to prove their per-
language. formance.
293
Comment category Count in the datasets
None of the above 5011
Misandry 1276
Counter-speech 497
Xenophobia 392
Misogyny 336
Hope Speech 299
Homophobia 207
Transphobic 163
Table 2: Rank list based on weighted average F1-score along with other evaluation metrics (Precision and Recall)
for Tamil Language
Table 3: Rank list based on weighted average F1-score along with other evaluation metrics (Precision and Recall)
for Tamil-English Language
294
L
X
Pweighted = (P recisionof i × W eightof i)
i=1 in spelling mistakes. For example, the text, This
(4) has lead to the misclassification.
, where i is the test sample size. Scenario 4: Certain comments were too short
and had references that were not captured by the
L
X systems. For example, the comment give below is
Rweighted = (Recallof i × W eighti) (5)
i=1
L
X supposed to be classified as “Misandry.” It is in-
F −Scoreweighted = (F −Scoreof i×W eighti) stead classified as “None of the above.” Apart from
i=1
(6) these scenarios, the systems could never classify
The participants have also used accuracy, Macro- incomplete comments and double entendre com-
Precision, Macro-Recall, and Macro-F-scores to ments correctly. Specific comments had hyperlinks
evaluate the system. It can be observed that the that had the main content, which was missed by the
highest F-score achieved by the systems is 0.41. systems.
This is primarily due to the inability of the tech-
6 Conclusion
niques to handle the errors observed consistently
in all the systems during the classification. The This shared task aims at detecting the categories of
various scenarios of errors are explained below. abusive comments that are posted on social media.
Scenario 1: The systems fail to classify the sen- This kind of analysis would quantify the negativity
tences whenever the sentences do not contain even that is spread in the society, which in turn should
a single Tamil word. In other words, the sentences be turned into positivity either by enacting laws
contain only the English transliterated words. For to enforce restrictions on posting comments on so-
example, the comment, “World health enda ilukara cial media. This has been the motivation behind
ara kora nayae, “ is classified as “Xenophobia” hosting this shared task which has attempted to
by all the systems while the actual label is “None aggregate the comments from social media in two
of the above. The comment is actually against a languages, namely, Tamil and in code mixed lan-
xenophobic person. On the other comment, “sor- guage containing Tamil and English scripts. These
nam lakshmi mudiyathu mooditu” is classified as comments were trained by various machine learn-
“Misandry” by all the systems while the actual ing, deep learning, and transfer learning models.
class is “Misogyny.” The name “sornam lakshhmi ” 11 teams participated in Tamil and Tamil-English
refers to a woman but none of the systems labeled languages tasks. 7 categories of abusive categories
this right. were tagged in the collected comments. The rank-
Scenario 2: The comments contain spelling mis- ing of the teams was done based on the perfor-
takes and could not be handled during the pre- mance shown by the systems that were used by
processing step. For example, This is classified the participants and the in-depth analysis done by
them. It was observed that the transformer models
showed better performance when compared to that
of the rest of the systems.
295
International Conference on Natural Language Pro- Gayathri G L, Krithika S, Divyasri K, Thenmozhi Du-
cessing, pages 18–25. rairaj, and B. Bharathi. 2022. Pandas@tamilnlp-
acl2022: Abusive comment detection in tamil code-
B Bharathi, Bharathi Raja Chakravarthi, Subalalitha mixed data using custom embeddings with labse. In
Chinnaudayar Navaneethakrishnan, N Sripriya, Proceedings of the Second Workshop on Speech and
Arunaggiri Pandian, and Swetha Valli. 2022. Find- Language Technologies for Dravidian Languages.
ings of the shared task on Speech Recognition for Association for Computational Linguistics.
Vulnerable Individuals in Tamil. In Proceedings of
the Second Workshop on Language Technology for Nikhil Ghanghor, Parameswari Krishnamurthy, Sajeetha
Equality, Diversity and Inclusion. Association for Thavareesan, Ruba Priyadharshini, and Bharathi Raja
Computational Linguistics. Chakravarthi. 2021a. IIITK@DravidianLangTech-
EACL2021: Offensive language identification and
Bharathi Raja Chakravarthi. 2020. HopeEDI: A mul- meme classification in Tamil, Malayalam and Kan-
tilingual hope speech detection dataset for equality, nada. In Proceedings of the First Workshop on
diversity, and inclusion. In Proceedings of the Third Speech and Language Technologies for Dravidian
Workshop on Computational Modeling of People’s Languages, pages 222–229, Kyiv. Association for
Opinions, Personality, and Emotion’s in Social Me- Computational Linguistics.
dia, pages 41–53, Barcelona, Spain (Online). Associ-
ation for Computational Linguistics. Nikhil Ghanghor, Rahul Ponnusamy, Prasanna Kumar
Kumaresan, Ruba Priyadharshini, Sajeetha Thava-
Bharathi Raja Chakravarthi and Vigneshwaran Mural- reesan, and Bharathi Raja Chakravarthi. 2021b.
idaran. 2021. Findings of the shared task on hope IIITK@LT-EDI-EACL2021: Hope speech detection
speech detection for equality, diversity, and inclu- for equality, diversity, and inclusion in Tamil , Malay-
sion. In Proceedings of the First Workshop on Lan- alam and English. In Proceedings of the First Work-
guage Technology for Equality, Diversity and Inclu- shop on Language Technology for Equality, Diversity
sion, pages 61–72, Kyiv. Association for Computa- and Inclusion, pages 197–203, Kyiv. Association for
tional Linguistics. Computational Linguistics.
Bharathi Raja Chakravarthi, Ruba Priyadharshini, Then-
Alon Halevy, Cristian Canton-Ferrer, Hao Ma, Umut
mozhi Durairaj, John Phillip McCrae, Paul Buitaleer,
Ozertem, Patrick Pantel, Marzieh Saeidi, Fabrizio Sil-
Prasanna Kumar Kumaresan, and Rahul Ponnusamy.
vestri, and Ves Stoyanov. 2022. Preserving integrity
2022. Findings of the shared task on Homophobia
in online social networks. Communications of the
Transphobia Detection in Social Media Comments.
ACM, 65(2):92–98.
In Proceedings of the Second Workshop on Language
Technology for Equality, Diversity and Inclusion. As-
Jeremy Howard and Sebastian Ruder. 2018. Universal
sociation for Computational Linguistics.
language model fine-tuning for text classification.
Bharathi Raja Chakravarthi, Ruba Priyadharshini, arXiv preprint arXiv:1801.06146.
Vigneshwaran Muralidaran, Shardul Suryawanshi,
Navya Jose, Elizabeth Sherly, and John P McCrae. Simran Khanuja, Diksha Bansal, Sarvesh Mehtani,
2020. Overview of the track on sentiment analysis for Savya Khosla, Atreyee Dey, Balaji Gopalan,
Dravidian languages in code-mixed text. In Forum Dilip Kumar Margam, Pooja Aggarwal, Rajiv Teja
for Information Retrieval Evaluation, pages 21–24. Nagipogu, Shachi Dave, et al. 2021. Muril: Multi-
lingual representations for indian languages. arXiv
Bharathi Raja Chakravarthi, Ruba Priyadharshini, preprint arXiv:2103.10730.
Rahul Ponnusamy, Prasanna Kumar Kumaresan,
Kayalvizhi Sampath, Durairaj Thenmozhi, Sathi- Prasanna Kumar Kumaresan, Ratnasingam Sakuntharaj,
yaraj Thangasamy, Rajendran Nallathambi, and Sajeetha Thavareesan, Subalalitha Navaneethakr-
John Phillip McCrae. 2021. Dataset for identi- ishnan, Anand Kumar Madasamy, Bharathi Raja
fication of homophobia and transophobia in mul- Chakravarthi, and John P McCrae. 2021. Findings
tilingual YouTube comments. arXiv preprint of shared task on offensive language identification
arXiv:2109.00227. in tamil and malayalam. In Forum for Information
Retrieval Evaluation, pages 16–18.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2018. Bert: Pre-training of deep Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man-
bidirectional transformers for language understand- dar Joshi, Danqi Chen, Omer Levy, Mike Lewis,
ing. arXiv preprint arXiv:1810.04805. Luke Zettlemoyer, and Veselin Stoyanov. 2019.
Roberta: A robustly optimized bert pretraining ap-
Ankita Diraphe, Ratnavel Rajalakshmi, and Antonette proach. arXiv preprint arXiv:1907.11692.
Shibani. 2022. Dlrg@dravidianlangtech-acl2022:
Abusive comment detection intamil using multilin- Anitha Narasimhan, Aarthy Anandan, Madhan Karky,
gual transformer models. In Proceedings of the Sec- and CN Subalalitha. 2018. Porul: Option generation
ond Workshop on Speech and Language Technologies and selection and scoring algorithms for a tamil flash
for Dravidian Languages. Association for Computa- card game. International Journal of Cognitive and
tional Linguistics. Language Sciences, 12(2):225–228.
296
Bhavish Pahwa. 2022. Bphigh@tamilnlp-acl2022: Aug- Ratnasingam Sakuntharaj and Sinnathamby Mahesan.
mentation strategies for indic transformer-based abu- 2017. Use of a novel hash-table for speeding-up sug-
sive comment detection in tamil. In Proceedings of gestions for misspelt Tamil words. In 2017 IEEE
the Second Workshop on Speech and Language Tech- International Conference on Industrial and Informa-
nologies for Dravidian Languages. Association for tion Systems (ICIIS), pages 1–5.
Computational Linguistics.
Ratnasingam Sakuntharaj and Sinnathamby Mahesan.
Vasanth Palanikrmar, Sean Benhur, Adeep Hande, 2021. Missing word detection and correction based
and Bharathi Raja Chakravarthi. 2022. De- on context of Tamil sentences using n-grams. In
abuse@tamilnlp-acl 2022: Transliteration as data 2021 10th International Conference on Information
augmentation for abuse detection in tamil. In Pro- and Automation for Sustainability (ICIAfS), pages
ceedings of the Second Workshop on Speech and 42–47.
Language Technologies for Dravidian Languages.
Association for Computational Linguistics. Anbukkarasi Sampath, Thenmozhi Durairaj,
Bharathi Raja Chakravarthi, Ruba Priyadharshini,
Shantanu Patankar, Omkar Gokhale, Onkar Litake, Subalalitha Chinnaudayar Navaneethakrishnan,
Aditya Mandke, and Dipali Kandam. 2022. Opti- Kogilavani Shanmugavadivel, Sajeetha Thavareesan,
mize prime@tamilnlp-acl2022: Abusive comment Sathiyaraj Thangasamy, Parameswari Krishna-
detection in tamil. In Proceedings of the Second murthy, Adeep Hande, Sean Benhur, Kishor Kumar
Workshop on Speech and Language Technologies Ponnusamy, and Santhiya Pandiyan. 2022. Findings
for Dravidian Languages. Association for Computa- of the shared task on Emotion Analysis in Tamil. In
tional Linguistics. Proceedings of the Second Workshop on Speech and
Language Technologies for Dravidian Languages.
Ruba Priyadharshini, Bharathi Raja Chakravarthi, Sub- Association for Computational Linguistics.
alalitha Chinnaudayar Navaneethakrishnan, Then-
mozhi Durairaj, Malliga Subramanian, Kogila- Mike Schuster and Kuldip K Paliwal. 1997. Bidirec-
vani Shanmugavadivel, Siddhanth U Hegde, and tional recurrent neural networks. IEEE transactions
Prasanna Kumar Kumaresan. 2022. Findings of on Signal Processing, 45(11):2673–2681.
the shared task on Abusive Comment Detection in
Tamil. In Proceedings of the Second Workshop on R Srinivasan and CN Subalalitha. 2019. Automated
Speech and Language Technologies for Dravidian named entity recognition from tamil documents. In
Languages. Association for Computational Linguis- 2019 IEEE 1st International Conference on Energy,
tics. Systems and Information Processing (ICESIP), pages
1–5. IEEE.
Ruba Priyadharshini, Bharathi Raja Chakravarthi,
Sajeetha Thavareesan, Dhivya Chinnappa, Durairaj C. N. Subalalitha. 2019. Information extraction frame-
Thenmozhi, and Rahul Ponnusamy. 2021. Overview work for Kurunthogai. Sādhanā, 44(7):156.
of the DravidianCodeMix 2021 shared task on senti-
ment detection in Tamil, Malayalam, and Kannada. CN Subalalitha and E Poovammal. 2018. Automatic
In Forum for Information Retrieval Evaluation, pages bilingual dictionary construction for Tirukural. Ap-
4–6. plied Artificial Intelligence, 32(6):558–567.
Manikandan Ravikiran, Bharathi Raja Chakravarthi, Sajeetha Thavareesan and Sinnathamby Mahesan. 2019.
Anand Kumar Madasamy, Sangeetha Sivanesan, Rat- Sentiment analysis in Tamil texts: A study on ma-
navel Rajalakshmi, Sajeetha Thavareesan, Rahul Pon- chine learning techniques and feature representation.
nusamy, and Shankar Mahadevan. 2022. Findings In 2019 14th Conference on Industrial and Informa-
of the shared task on Offensive Span Identification tion Systems (ICIIS), pages 320–325.
in code-mixed Tamil-English comments. In Pro-
ceedings of the Second Workshop on Speech and Sajeetha Thavareesan and Sinnathamby Mahesan.
Language Technologies for Dravidian Languages. 2020a. Sentiment lexicon expansion using Word2vec
Association for Computational Linguistics. and fastText for sentiment prediction in Tamil texts.
In 2020 Moratuwa Engineering Research Conference
Prasanth S N, R Aswin Raj, Adhithan P, Premjith B, and (MERCon), pages 272–276.
Soman K P. 2022. Cen-tamil@dravidianlangtech-
acl2022: Abusive comment detection in tamil using Sajeetha Thavareesan and Sinnathamby Mahesan.
tf-idf and random kitchen sink algorithm. In Pro- 2020b. Word embedding-based part of speech tag-
ceedings of the Second Workshop on Speech and ging in Tamil texts. In 2020 IEEE 15th International
Language Technologies for Dravidian Languages. Conference on Industrial and Information Systems
Association for Computational Linguistics. (ICIIS), pages 478–482.
Ratnasingam Sakuntharaj and Sinnathamby Mahesan. Sajeetha Thavareesan and Sinnathamby Mahesan. 2021.
2016. A novel hybrid approach to detect and correct Sentiment analysis in Tamil texts using k-means and
spelling in Tamil text. In 2016 IEEE International k-nearest neighbour. In 2021 10th International Con-
Conference on Information and Automation for Sus- ference on Information and Automation for Sustain-
tainability (ICIAfS), pages 1–6. ability (ICIAfS), pages 48–53.
297
Josephine Varsha and B. Bharathi. 2022. Ssncse
nlp@tamilnlp-acl2022: Transformer based approach
for detection of abusive comment for tamil lan-
guage. In Proceedings of the Second Workshop on
Speech and Language Technologies for Dravidian
Languages. Association for Computational Linguis-
tics.
Rita Kirk Whillock and David Slayden. 1995. Hate
speech. ERIC.
Michael Wiegand, Josef Ruppenhofer, and Elisabeth
Eder. 2021. Implicitly abusive language–what does it
actually look like and why are we not getting there?
In Proceedings of the 2021 Conference of the North
American Chapter of the Association for Computa-
tional Linguistics: Human Language Technologies,
pages 576–587. Association for Computational Lin-
guistics.
Konthala Yasaswini, Karthik Puranik, Adeep
Hande, Ruba Priyadharshini, Sajeetha Thava-
reesan, and Bharathi Raja Chakravarthi. 2021.
IIITT@DravidianLangTech-EACL2021: Transfer
learning for offensive language detection in Dravid-
ian languages. In Proceedings of the First Workshop
on Speech and Language Technologies for Dravidian
Languages, pages 187–194, Kyiv. Association for
Computational Linguistics.
298