A Clustering-Based Approach for Feature Extraction in the Spectro-Temporal Domain Using an Artificial Neural Network

Document Type : Original Article


1 Department of Electrical Engineering, Qaemshahr Branch, Islamic Azad University, Qaemshahr, Iran

2 Department of Artificial Intelligence and Robotics, Aryan Institute of Higher Education and Technology, Babol, Iran


In this paper, a new feature extraction method based on the spectro-temporal representation of the speech signal is presented for phoneme classification. In the proposed method, an artificial neural network approach is used to cluster the spectro-temporal domain: a self-organizing map (SOM) is applied to cluster the feature space. Scale, rate, and frequency are used as the spatial information of each point, and the magnitude component serves as the similarity attribute in the clustering algorithm. Three mechanisms are considered for selecting attributes in the spectro-temporal feature space: the spatial information of the clusters, the magnitude component of the samples in the spectro-temporal domain, and the average magnitude of the points in each cluster are taken as secondary features. The proposed feature vectors are then used for phoneme classification. The results demonstrate a significant improvement in the classification rate for different sets of phonemes in comparison to previous clustering-based methods. The results for the new features also indicate that the system error is reduced for all vowel and consonant subsets compared to weighted K-means clustering.
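The pipeline described above can be sketched in code. This is a minimal illustration, not the authors' implementation: it assumes each spectro-temporal sample is a 4-D vector (scale, rate, frequency, magnitude), trains a small SOM with a decaying Gaussian neighborhood, and builds per-cluster secondary features from the cluster's spatial centroid and its average magnitude. All function names, grid sizes, and schedules here are illustrative assumptions.

```python
import numpy as np

def train_som(points, grid_shape=(4, 4), n_iters=200, seed=0):
    """Train a small self-organizing map on spectro-temporal samples.

    Assumption for illustration: each row of `points` is
    (scale, rate, frequency, magnitude), the attributes named
    in the abstract.
    """
    rng = np.random.default_rng(seed)
    n_units = grid_shape[0] * grid_shape[1]
    # Unit weights start as random samples drawn from the data.
    weights = points[rng.choice(len(points), n_units)].astype(float)
    # Grid coordinates of each unit, for the neighborhood function.
    grid = np.array([(i, j) for i in range(grid_shape[0])
                            for j in range(grid_shape[1])], dtype=float)
    for t in range(n_iters):
        lr = 0.5 * (1.0 - t / n_iters)               # decaying learning rate
        sigma = max(1.0 * (1.0 - t / n_iters), 0.3)  # shrinking neighborhood
        x = points[rng.integers(len(points))]
        bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
        # Gaussian neighborhood around the best-matching unit on the grid.
        d2 = np.sum((grid - grid[bmu]) ** 2, axis=1)
        h = np.exp(-d2 / (2 * sigma ** 2))
        weights += lr * h[:, None] * (x - weights)
    return weights

def assign_clusters(points, weights):
    """Label each sample with its nearest SOM unit (cluster)."""
    d = np.linalg.norm(points[:, None, :] - weights[None, :, :], axis=2)
    return np.argmin(d, axis=1)

def secondary_features(points, labels, n_units):
    """Per-cluster secondary features: centroid of the spatial part
    (scale, rate, frequency) plus the mean magnitude of the cluster."""
    feats = []
    for k in range(n_units):
        members = points[labels == k]
        if len(members) == 0:
            feats.append(np.zeros(points.shape[1]))
            continue
        centroid = members[:, :3].mean(axis=0)  # spatial information
        mean_mag = members[:, 3].mean()         # average magnitude
        feats.append(np.append(centroid, mean_mag))
    return np.vstack(feats)
```

The resulting per-cluster feature matrix would then feed a downstream phoneme classifier (e.g. an SVM), replacing the much larger raw spectro-temporal representation.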
