Voice-based Age and Gender Recognition using Training Generative Sparse Model


Department of Engineering and Technology, University of Mazandaran, Babolsar, Iran


Abstract: Gender recognition and age detection are important problems in telephone speech processing, where the identity of an individual is investigated from voice characteristics. This paper introduces a new gender and age recognition system based on generative incoherent models learned using sparse non-negative matrix factorization and an atom correction post-processing method. As in the general signal classification scheme, the proposed algorithm includes a training step, which provides the atoms related to each signal class, and a test phase, which assesses classification performance. Since classification accuracy depends strongly on the selected features, we employ Mel-frequency cepstral coefficients to train the bases for a better representation of speech structure. These bases are learned over data from male and female speakers using non-negative matrix factorization with a sparsity constraint. Atom correction is then carried out using an energy-based algorithm to decrease the coherence between the trained dictionaries of different categories. In the sparse representation of each data class, the atoms related to other sets that have the highest energy are replaced with the lowest-energy bases, provided that the reconstruction error does not exceed a specified limit. The experimental results show that the proposed algorithm performs better than earlier methods in this context, especially in the presence of background noise.
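To make the notion of coherence between class dictionaries concrete, the following minimal sketch computes one common measure: the largest absolute inner product between unit-norm atoms drawn from two dictionaries. The abstract does not state the exact coherence measure or the energy-based correction rule, so this definition, the function names (normalize, cross_coherence), and the toy atoms are illustrative assumptions rather than the authors' implementation.

```python
import math


def normalize(atom):
    """Scale an atom (a list of floats) to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in atom))
    if norm == 0.0:
        raise ValueError("a zero atom cannot be normalized")
    return [x / norm for x in atom]


def cross_coherence(dict_a, dict_b):
    """Largest absolute inner product between unit-norm atoms of two dictionaries.

    Each dictionary is a list of atoms and each atom is a list of floats.
    This is an assumed, commonly used measure; the paper may use another.
    """
    unit_a = [normalize(a) for a in dict_a]
    unit_b = [normalize(b) for b in dict_b]
    return max(
        abs(sum(x * y for x, y in zip(a, b)))
        for a in unit_a
        for b in unit_b
    )


# Toy, made-up non-negative atoms (hypothetical, not data from the paper).
male_atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
female_atoms = [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]

coherence = cross_coherence(male_atoms, female_atoms)
print(f"cross-coherence between the two toy dictionaries: {coherence:.4f}")
```

A value near 0 means the two dictionaries share almost no structure, while a value near 1 means that at least one atom of one class closely matches an atom of the other, which is the situation an atom correction step would aim to reduce.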

