A Signal Processing Method for Text Language Identification

Hassanpour, H.; AlyanNezhadi, M. M.; Mohammadi, M.

doi:10.5829/ije.2021.34.06c.04

Document Type: Original Article

Authors

¹ Image Processing & Data Mining Lab, Shahrood University of Technology, Shahrood, Iran

² Department of Mathematics, University of Science and Technology of Mazandaran, Behshahr, Iran

³ Department of Information Technology, College of Engineering and Computer Science, Lebanese French University, KR-Iraq

Abstract

Language identification is a critical step prior to any natural language processing. In this paper, a signal processing method for language identification is proposed. The sequence of characters in a word and the order of words in a text stream characterize the language. The sequence of characters in a stream therefore provides a signature that allows the language to be recognized without understanding the meaning of the text. This signature can be extracted with signal processing techniques by converting texts into time series. Although several research and commercial tools have been developed to identify the language of a text, they need a standard dictionary for each language. We propose a dictionary-independent method consisting of three main steps: I) preprocessing, II) clustering, and III) classification. First, the texts are converted to time series using UTF-8 codes. Second, the obtained series are clustered to group similar languages. Third, each cluster is decomposed into 32 sub-bands using a wavelet packet, 32 features are extracted from each sub-band, and a multilayer perceptron neural network is used to classify the extracted features. The proposed method was tested on our dataset of 31,000 texts from 31 different languages and achieved 72.20% accuracy for language identification.
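The conversion and decomposition steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, and every specific choice in it is an assumption made for illustration: "UTF-8 codes" is read as the byte values of the UTF-8 encoding, the wavelet is Haar, the packet tree has 5 levels (giving 2^5 = 32 sub-bands), and a single energy value per sub-band stands in for the features, which the abstract does not specify. The clustering and multilayer perceptron stages are omitted. The function names are hypothetical.

```python
import math

def text_to_series(text):
    """Map text to a numeric series.
    Assumption: 'UTF-8 codes' means the byte values of the UTF-8 encoding."""
    return [float(b) for b in text.encode("utf-8")]

def haar_step(signal):
    """One orthonormal Haar analysis step on an even-length signal.
    Returns the (approximation, detail) halves."""
    s = math.sqrt(2.0)
    pairs = range(0, len(signal) - 1, 2)
    approx = [(signal[i] + signal[i + 1]) / s for i in pairs]
    detail = [(signal[i] - signal[i + 1]) / s for i in pairs]
    return approx, detail

def wavelet_packet(signal, levels=5):
    """Full wavelet packet tree: every band is split at every level,
    so the result holds 2**levels sub-bands (in natural tree order,
    not sorted by frequency)."""
    bands = [signal]
    for _ in range(levels):
        next_bands = []
        for band in bands:
            approx, detail = haar_step(band)
            next_bands.extend([approx, detail])
        bands = next_bands
    return bands

def subband_energies(text, levels=5):
    """Return one energy value per sub-band (an assumed, illustrative feature)."""
    series = text_to_series(text)
    # Zero-pad to a multiple of 2**levels so every split sees an even length.
    block = 2 ** levels
    series = series + [0.0] * ((-len(series)) % block)
    return [sum(x * x for x in band) for band in wavelet_packet(series, levels)]

if __name__ == "__main__":
    features = subband_energies("Language identification is a critical step.")
    print(len(features))
```

Because the Haar step used here is orthonormal, the sub-band energies sum to the energy of the original series, which gives a quick sanity check on the decomposition. A feature vector of this kind could then be passed to any classifier; the abstract names a multilayer perceptron for that role.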

Volume 34, Issue 6
TRANSACTIONS C: Aspects
June 2021
Pages 1413-1418
