Investigating Hostile Post Detection in Gujarati: A Machine Learning Approach

Rameshbhai, B. J.; Rana, K.

doi:10.5829/ije.2024.37.07a.08

Document Type: Original Article

Authors

B. J. Rameshbhai ¹
K. Rana ²

¹ Department of Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India

² Department of Computer Engineering, Sarvajanik College of Engineering and Technology, Gujarat Technological University, Ahmedabad, Gujarat, India

Abstract

Hostile posts on social media are a crucial issue for individuals, governments and organizations. There is a critical need for an automated system that can investigate and identify hostile posts in large-scale data. Gujarati is the sixth most spoken language in India. In this work, we construct a major hostile post dataset in the Gujarati language. The data are collected from Twitter, Instagram and Facebook. Our dataset consists of 1,51,000 (151,000) distinct comments, of which 10,000 posts are manually annotated. These posts are labeled as either Hostile or Non-Hostile. We use the dataset in two forms: (i) the original Gujarati text and (ii) English text translated from the Gujarati text. We also compare performance with and without pre-processing, where pre-processing removes extra symbols and substitutes emojis with their text descriptions. We conduct experiments using supervised machine learning models (Support Vector Machine, Decision Tree, Random Forest, Gaussian Naive Bayes, Logistic Regression and K-Nearest Neighbor) and an unsupervised model (k-means clustering). We evaluate the performance of these models with Bag-of-Words and TF-IDF feature extraction methods. We observe that classification using TF-IDF features is efficient. Among these models, Logistic Regression performs best, with an accuracy of 0.68 and an F1-score of 0.67. The purpose of this research is to create a benchmark dataset and provide baseline results for detecting hostile posts in the Gujarati language.
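The pre-processing and TF-IDF steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the emoji table `EMOJI_DESCRIPTIONS`, the function names `preprocess` and `tfidf`, and the unsmoothed textbook weighting (term frequency times log(N / document frequency)) are all assumptions made for illustration. Symbols are removed by Unicode category rather than with a `\w` regular expression, because Gujarati vowel signs are combining marks and would otherwise be stripped.

```python
import math
import unicodedata
from collections import Counter

# Hypothetical emoji-to-description table; the paper's actual mapping is not given here.
EMOJI_DESCRIPTIONS = {"😡": "angry face", "🙂": "smiling face"}


def preprocess(text):
    """Substitute emoji descriptions, then replace remaining symbols with spaces."""
    for emoji, description in EMOJI_DESCRIPTIONS.items():
        text = text.replace(emoji, f" {description} ")
    kept = []
    for ch in text:
        major = unicodedata.category(ch)[0]
        # Keep letters (L), combining marks (M, e.g. Gujarati vowel signs) and numbers (N).
        kept.append(ch if major in "LMN" else " ")
    # Collapse runs of whitespace into single spaces.
    return " ".join("".join(kept).split())


def tfidf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors
```

With this weighting, a term that occurs in every document receives weight zero, while terms concentrated in a few documents are weighted up. Library implementations often apply smoothed variants, so their values will differ from this sketch. The resulting vectors would then be passed to a classifier such as Logistic Regression.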

Main Subjects

Computer Engineering

Volume 37, Issue 7
TRANSACTIONS A: Basics
July 2024
Pages 1284-1295
