Investigating Hostile Post Detection in Gujarati: A Machine Learning Approach

Document Type: Original Article

Authors

1 Department of Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India

2 Department of Computer Engineering, Sarvajanik College of Engineering and Technology, Gujarat Technological University, Ahmedabad, Gujarat, India

Abstract

Hostile posts on social media are a serious problem for individuals, governments, and organizations. There is a critical need for an automated system that can identify hostile posts in large-scale data. Gujarati is the sixth most spoken language in India. In this work, we construct a large hostile post dataset in the Gujarati language, with data collected from Twitter, Instagram, and Facebook. The dataset consists of 151,000 distinct comments, of which 10,000 posts are manually annotated as Hostile or Non-Hostile. We use the dataset in two forms: (i) the original Gujarati text and (ii) English text translated from the Gujarati. We also compare model performance with and without pre-processing, where pre-processing removes extra symbols and substitutes emoji with their text descriptions. We conduct experiments with supervised machine learning models (Support Vector Machine, Decision Tree, Random Forest, Gaussian Naive Bayes, Logistic Regression, and K-Nearest Neighbor) and with an unsupervised model (k-means clustering). We evaluate these models with Bag-of-Words and TF-IDF feature extraction methods. Classification with TF-IDF features is observed to be effective. Among the models, Logistic Regression performs best, with an accuracy of 0.68 and an F1-score of 0.67. The purpose of this research is to create a benchmark dataset and to provide baseline results for detecting hostile posts in the Gujarati language.
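The pre-processing and TF-IDF steps named in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `preprocess` and `tfidf` are hypothetical, emoji descriptions are taken from Unicode character names, tokens are split on whitespace, and the weighting is the plain tf × log(N/df) variant, since the abstract does not state which variant was used.

```python
import math
import re
import unicodedata
from collections import Counter


def preprocess(text: str) -> str:
    """Remove extra symbols and replace emoji with a text description.

    Assumption: an emoji's description is its Unicode character name in
    lowercase; the source does not say which emoji lexicon was used.
    """
    pieces = []
    for ch in text:
        category = unicodedata.category(ch)
        if category == "So":
            # "Symbol, other" covers most emoji; substitute the name.
            name = unicodedata.name(ch, "")
            pieces.append(f" {name.lower()} " if name else " ")
        elif category.startswith("P") or category in ("Sm", "Sc", "Sk"):
            # Punctuation and remaining symbol classes are dropped.
            pieces.append(" ")
        else:
            # Letters, combining marks (Gujarati vowel signs), digits, spaces.
            pieces.append(ch)
    return re.sub(r"\s+", " ", "".join(pieces)).strip()


def tfidf(docs: list[str]) -> list[dict[str, float]]:
    """Return one sparse {term: weight} vector per document.

    Weight = (term count / document length) * log(N / document frequency).
    """
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


if __name__ == "__main__":
    raw_posts = ["સરસ!!! 😀", "સરસ કામ"]  # illustrative strings, not dataset samples
    cleaned = [preprocess(post) for post in raw_posts]
    for post, vector in zip(cleaned, tfidf(cleaned)):
        print(post, vector)
```

With this weighting, a term that occurs in every document receives a weight of zero, so only terms that distinguish posts contribute to the feature vector. The resulting vectors would then be passed to a classifier such as Logistic Regression.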



