A Voice Activity Detection Algorithm Using Sparse Non-negative Matrix Factorization-based Model Learning in Spectro-Temporal Domain

Document Type: Original Article


Faculty of Engineering and Technology, University of Mazandaran, Babolsar, Iran


Voice activity detectors separate the speech segments of a signal from silence so that different background noise signals can be suppressed. This paper proposes a novel voice activity detector based on spectro-temporal features extracted from an auditory model of the speech signal. After the scale, rate, and frequency features are extracted from this feature space, a sparse structured principal component analysis algorithm identifies the principal components of these features and reduces the dimension of the training data. The resulting feature vectors are then used to learn models with a sparse non-negative matrix factorization algorithm, which represents each feature vector at an appropriate sparsity level over the selected atoms. Voice activity detection of an input frame is performed by computing the energy of its sparse representation over the composite model. If this energy exceeds a specified threshold, the frame has a structure similar to the atoms of the learned models and is declared to contain voice activity. The proposed detector was compared with baseline methods and classifiers in this field. Results in the presence of stationary, non-stationary, and periodic noises show that the proposed method, based on model learning with spectro-temporal features, correctly detects silence/speech activities.


