A Signal Processing Method for Text Language Identification

Document Type: Original Article

Authors

1 Image Processing & Data Mining Lab, Shahrood University of Technology, Shahrood, Iran

2 Department of Mathematics, University of Science and Technology of Mazandaran, Behshahr, Iran

3 Department of Information Technology, College of Engineering and Computer Science, Lebanese French University, KR-Iraq

Abstract

Language identification is a critical step prior to any natural language processing. This paper proposes a signal processing method for text language identification. The sequence of characters in a word and the order of words in a stream identify the language, so the character sequence of a stream provides a signature for recognizing the language without understanding its meaning. This signature can be extracted with signal processing techniques by converting texts into time series. Although several research and commercial tools have been developed to identify text language, they need a standard dictionary for each language. We propose a dictionary-independent method consisting of three main steps: I) preprocessing, II) clustering, and III) classification. First, the texts are converted to time series using their UTF-8 codes. Second, the obtained series are clustered to group similar languages. Third, each cluster is decomposed into 32 sub-bands using a wavelet packet, and 32 features are extracted from each sub-band. A multilayer perceptron neural network is then used to classify the extracted features. The proposed method was tested on our dataset of 31,000 texts from 31 different languages and achieved 72.20% accuracy for language identification.
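
The preprocessing and decomposition steps can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions that the abstract does not specify: the "UTF-8 codes" are taken to be the raw UTF-8 byte values, the wavelet is Haar, the series is zero-padded to a multiple of 32, and each sub-band is summarized by a single energy value. The function names (`text_to_series`, `haar_step`, `wavelet_packet`, `subband_energies`) are hypothetical. A five-level wavelet packet tree yields 2^5 = 32 sub-bands.

```python
import math


def text_to_series(text):
    """Map text to a numeric series, one value per UTF-8 byte (assumption)."""
    return [float(b) for b in text.encode("utf-8")]


def haar_step(signal):
    """One orthonormal Haar analysis step: (approximation, detail) halves."""
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def wavelet_packet(signal, levels=5):
    """Full Haar wavelet packet tree; returns the 2**levels leaf sub-bands."""
    block = 2 ** levels
    # Zero-pad (assumption) so the length is a non-zero multiple of 2**levels.
    padded_len = max(block, -(-len(signal) // block) * block)
    bands = [list(signal) + [0.0] * (padded_len - len(signal))]
    for _ in range(levels):
        next_bands = []
        for band in bands:
            # Unlike a plain wavelet transform, both halves are split again.
            next_bands.extend(haar_step(band))
        bands = next_bands
    return bands


def subband_energies(text, levels=5):
    """One energy feature per sub-band (an assumed, illustrative feature)."""
    bands = wavelet_packet(text_to_series(text), levels)
    return [sum(x * x for x in band) for band in bands]


sample = "Language identification is a critical step."
features = subband_energies(sample)
print(len(features))
```

In a full pipeline, a feature vector of this kind would be computed for every text and passed to a classifier such as a multilayer perceptron. The clustering step and the exact feature definitions are not covered by this sketch.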
