A Voice Activity Detection Algorithm Using Sparse Non-negative Matrix Factorization-based Model Learning in Spectro-Temporal Domain

Voice activity detectors are presented to extract silence/speech segments of the speech signal to eliminate different background noise signals. A novel voice activity detector is proposed in this paper using spectro-temporal features extracted from the auditory model of the speech signal. After extracting the scale, rate, and frequency features from this feature space, a sparse structured principal component analysis algorithm is used to consider the basic components of these features and reduce the dimension of learning data. Then these feature vectors are employed to learn the models by the sparse non-negative matrix factorization algorithm. The model learning procedure is performed to represent each feature vector with a proper sparse rate based on the selected atoms. Voice activity detection of the input frames is performed by computing the energy of the sparse representation for each input frame over the composite model. If the calculated energy exceeds a specified threshold, it indicates that the input frame has a structure similar to the atoms of the learned models and concludes that the observed frame has voice content. The results of the proposed detector were compared with other baseline methods and classifiers in this processing field. These results in the presence of stationary, non-stationary and periodic noises were investigated and they are shown that the proposed method based on model learning with spectro-temporal features can correctly detect


INTRODUCTION 1
One of the research fields in the speech signal processing is detection of silence/speech areas of the speech signal performed by a voice activity detector (VAD). The VAD block has an important role to eliminate the background noise from the speech signals. So far, different feature domains have been used to determine voice activities since the performance of VAD is closely related to the type of these extracted features. In these methods, an attempt is made to separate the speech frames from the silent sections of the speech signal. The energy of speech signal frames and the calculation of the zero-crossing rate (ZCR) are the most advanced features in this processing area [1]. Since various detectors have been introduced in many fields, this paper only deals with the methods presented based on the model learning technique. Ahmadi, and Joneidi [2] proposed a VAD algorithm *Corresponding Author Email: s.mavaddati@umz.ac.ir (S. Mavaddati) based on the sparse representation technique using an orthogonal matching pursuit algorithm (OMP) followed by the K-singular value decomposition (K-SVD) dictionary learning method. The detection criterion of voice activity was based on the energy in the sparse representation of the input frame over the learned voice dictionary. You et al. [3] proposed a VAD algorithm based on the sparse representation technique using the Bergman iteration method and online dictionary learning. In this algorithm, the sparse power spectrum criterion was defined to calculate two types of features and decide on the label of the input frames. This criterion was achieved by averaging over the different signal segments that include the short segment average spectrum and long segment average spectrum. The labels of the different parts of the input frame are determined by calculating the energy in these frames. You et al. [4] optimized algorithm for learning speech and noise dictionaries. The goal of this optimization procedure was to reduce the coherence value between the learned dictionaries to obtain a robust VAD algorithm in the different noise conditions. The features used in this method were the modified versions of the features presented by You et al. [3] and include the long-time average energy and the long-time dynamic threshold. Also, Teng and Jia [5] designed a VAD algorithm using a non-negative sparse coding method with a noise reduction procedure. In this method, the input noise signal is first represented in the combined dictionary which contains the atoms associated with the speech and noise signals. The coefficients related to voice segments are then used as the desirable features in the conditional random field (CRF) method to model the correlation between the feature sequences and detect the speech and noise labels for each input frame. Mavaddaty et al. [6] used the spectro features of speech signal spectrograms to learn the models using the concepts of sparse representation and the K-SVD algorithm. In this work, two supervised and semisupervised methods were presented to eliminate the background noise from the speech signal. The main part of each method was the presented voice activity detector in the wavelet packet transform domain.
The purpose of this paper is to increase the detection accuracy as much as possible based on the proposed model-based method by applying the spectro-temporal features. In this paper, scale, rate, and frequency characteristics extracted from the auditory model of the speech signal were used to learn models that show the structure of active parts of speech signals. In the following, the dimension of the mentioned features was reduced by the parse structured principal component analysis (SSPCA) algorithm and then the sparse nonnegative matrix factorization (SNMF) algorithm is employed to learn the dimensionless feature sets.
In the second part of this paper, the auditory model and its extractive features are introduced. Section 3 introduces the SNMF model and the proposed VAD algorithm. In section 4, the performance of the proposed method is evaluated and the paper is concluded in section 5.

Spectro-Temporal Representation Using Auditory Model
As stated, the recognition process to detect speech areas of the speech signal and separate these frames from the silence frames has a great importance in many speech processing applications. In this paper, the spectrotemporal features are used to identify the speech segments of the speech signals that can be described using the auditory model. In this model, the auditory spectrum related to each speech is calculated. Then, the spectro-temporal features are extracted using this spectrogram and the auditory cortex model [7]. The features of the auditory cortex model have four dimensions: scale Ω, speech rate ω, frequency f, and time or frame number t. The auditory part of the cortical model is implemented by a time-frequency filter bank. Each filter can operate at different rates and scales to simulate the cochlear of the human ear and the first layer of the auditory brainstem. This procedure of filtering at different rates and scales is performed linearly in the spectro-temporal space by the wavelet transform function or the two-dimensional Gabor filter [7][8][9].
The block diagram of the auditory cortex model is shown in Figure 1. Initially, the acoustic signal enters the filter bank that consist of 128 uniformly distributed bandpass filters along the frequency-logarithmic axis that models the performance of the outer membrane of the human ear. The output of this filter bank with a timefrequency structure passes from three steps: a derivative high pass filter, a nonlinear compressor, and a low pass filter to simulate the inner portion of the human ear. In the following, the auditory spectrogram of the speech signal is obtained by the first-order derivative, half-wave rectifier, and integrator. Then, the spectro-temporal content of the auditory spectrum is achieved by a filter bank consisting of a two-dimensional Gabor filter. Then, a four-dimensional speech cortical signal including Ω scale in cycles/octave, speech velocity or rate ω in Hertz, frequency f, and the frame number of the input speech signal t is yielded .

THE PROPOSED VOICE ACTIVITY DETECTOR
In this section, the proposed VAD algorithm is presented using the extracted spectro-temporal features and SNMFbased model learning. The proposed method employs model learning technique to represent the structure of the input frame. Model learning in this paper is performed by the sparse non-negative matrix factorization algorithm, which is the non-negative matrix factorization (NMF) procedure that has been added to the nonlinearity constraint.
The combination of the sparse and NMF coding algorithms results in a model learning method called SNMF [10][11][12]. This technique results in a sparser representation than the NMF algorithm to apply the sparse constraints. In the SNMF algorithm, which is more robust than the NMF algorithm, the generalized Kolbeck-Leibler divergence method used to determine lower approximation error in the data representation. In the sparse encoding technique, each input signal frame can be represented as a linear combination of the dictionary atoms. In this procedure, it is determined which set of atoms and coefficients represent the data frame with the least approximation error. These sparse coefficients for all input signal frames constitute the H sparse coefficient matrix, which is one of the outputs of the SNMF algorithm. Many coefficients in the sparse matrix H have a zero value and indicate that each data frame can be represented only by a limited number of dictionary atoms. The sparsity or cardinality parameter determines the number of atoms in each representation procedure. The data matrix containing signal frames S can be modeled as follows by sparse coding: . Also, the K-sparse coefficient matrix H with L≫K includes the representation coefficients related to the input data matrix [13]. The sparse representation problem that consists of the approximation error and sparse constraint parts is formulated as follows [13]: where C represents the sparse rate or the number of nonzero coefficients in each row of the sparse matrix H. This parameter must be set correctly to avoid massive coding. If the high value is selected for this parameter, the large numbers of atoms participate in the representation of the input data frame that is improper. On the other hand, if the low value is selected for this parameter, the atoms are not enough to represent the data structure, and then the approximation error increases. The NMF algorithm performs a linear analysis on the observed data matrix with non-negative values, which L is smaller than M and N [13]. These matrices are employed to solve the following optimization problem: The optimization of this cost function is based on the generalized Kullback-Leibler divergence method. However, solving this problem with other cost functions yields different versions of the NMF algorithm.
As stated, the SNMF algorithm will obtain a sparser representation to consider a specified constraint than the NMF algorithm [11][12][13]. The generalized Kullback-Leibler divergence algorithm is then used to determine the approximation error in the analysis of non-negative coefficients, which results in the following optimization problem : The α parameter determines the weight coefficient of the sparsity part. The update of atoms in the W dictionary matrix is as follows: The NMF algorithm is obtained when the α parameter is omitted in Equation (5) [11]. Then, the dimensionality of the data matrix is reduced by the SSPCA algorithm to learn comprehensive models for the representation of the input data structure. The principal component analysis algorithm (PCA) is a commonly used statistical method to reduce data dimension and is used to convert the input data sets into a new set of the independent variables that include the maximum changes in the original data [13]. This algorithm presented to develop the SSPCA method which is used to estimate the basic components by applying a sparsity constraint [14]. The benefits of using this method include reducing computation time, extracting the components with more variance, and obtaining appropriate values for important variables of each problem. Further, by generalizing this algorithm, the SSPCA algorithm is obtained, which can extract the data with more variance using the sparsity and some structural constraints [15]. The non-convex form of the SSPCA algorithm is presented by Jenatton et al. [16] to solve the problem of structured sparse dictionary learning. The SSPCA is a robust algorithm to solve the occlusion problem using the block-coordinate descent algorithm for better data analysis. The block diagram of the proposed method to determine the labels of input speech frames using spectro-temporal properties is shown in Figure 2.

DETAILS OF SIMULATION
In this paper, the TIMIT dataset is used to determine the efficiency of the proposed method. This comprehensive speech dataset contains a large number of speakers and expressions that is suitable to consider the performance of a VAD algorithm [17]. The sampling rate of speech signals is set to 16kHz. The train and test scenarios contain 200 and 100 spoken expressions, respectively. In the training step, the phrases are uttered by 10 female and 10 male speakers. In the test step, the phrases uttered by 3 male and 3 female speakers were employed in the speaker-independent test. The data frame length is equal to 20 ms and the frame overlap is 50%. The parameter settings in the learning procedure are the same for all spoken data sources in the train and test scenarios.

1. Simulation Results
In the proposed method, the model learning procedure using the SNMF algorithm was used to identify silence/speech speech frames. The sparsity rate for the dictionary and the coefficients matrices in the SNMF algorithm are set to 0.9 and 0.7, respectively. These parameter values are achieved based on the experimental simulations to result in a lower approximation error. Also, the number of iterations and the sparsity parameter in the SSPCA as employed in dimension reduction are 250 and 0.6, which leads to stability in the solving procedure. The performance evaluation of the algorithms is determined by the classification accuracy rate, which is calculated by the percentage of voice and silence frames that have the correct labels for the entire test data. In the first step of the proposed algorithm, 100 speech signals with silence/speech labels selected from the TIMIT dataset are used to learn the model of scale, rate, and frequency features extracted from the auditory cortex model. The auditory model of these signals is computed and then applied to the model learning after employing the SSPCA dimension reduction algorithm. Finally, the the learned models that represent the structural features of the silence/speech segments are considered in the representation of the test input signal. The sparsity parameter in this algorithm means that each input data frame can only be represented by a linear combination of a small number of learned atoms. This parameter value is determined by the cardinality rate. Input data classification in this paper is not performed by the usual classifiers such as neural networks, support vector machine or decision trees, but it is suggested to design and use a model-based classifier based on the calculated energy of the extracted features from the sparse coefficients matrix. In the proposed detection procedure, the input signal is sparsely represented by the SNMF algorithm on the combinational dictionary D= where L represents the length of the frame and s E , r E , H are sparse coefficient matrices related to scale, rate and frequency features of the speech signal. The sum of energies is calculated and if this energy is more than half the energy of the input frame then it can result that the input frame contains the voice structure. If the difference between the calculated energy in the sparse coding procedure over the speech model and the energy of the input frame is less than a specified value of ε1=0.04, then the average energy of the SNMF coding coefficients for one frame before and one frame after the input frame is calculated as the short-term energy. The value ε1 has been obtained experimentally in various simulations. If the short-term energy is higher than half the energy of the input frame, the input frame has a speech label otherwise it will have a silence label. The time-frequency energy plot of the learned atoms based on the proposed SNMF-based VAD algorithm using the elliptical plots presented by Jafari and Plumbley [18] which is shown in Figure 3. This procedure determines how much of the time-frequency energy of frames is sparsely represented by the learned atoms. These plots show that the learned atoms according to the proposed method have been able to cover the entire timefrequency space of the considered speech signals. The elliptical plot of the proposed method based on the frequency features is concentrated in the center of the time axis and it does not include the entire frequency content at different times. The proposed method consists of a wide time-frequency range caused by a proper matching with the content of observation signal and the dictionary atoms.
The spectrogram plots of the atoms learned by the proposed method, the frequency features and the spectrotemporal features are shown in Figure 4. These plots show that the learned atoms according to the proposed method have the highest energy coverage in the timefrequency space and can precisely display the structure of the speech and silence frames.The parameters setting procedure was done according to the experimental simulations to have a proper decision on the input frame label. Since the input data frame with voice content has more energy in the sparse representation on the related dictionary so the energy criterion of the resulting sparse coefficients is used to determine the appropriate label. As a result, there is no need to use other classifiers, and the labeling procedure of the input frame can only be estimated using the SNMF algorithm. The results of the proposed method to detect the silence/speech frames are reported in Tables 1 and 2 for the speaker-independent and speaker-dependent detection scenarios, respectively. It is noteworthy that this paper has tried to evaluate the performance of the proposed VAD algorithm with the methods presented in the field of sparse representation technique. The results show that the proposed method has the ability to correctly identify the input area by applying the comprehensive learning models based on the structural content of the input frames. These results are slightly higher in the speaker-dependent scenario than in the speaker-independent scenario, which may be due to the overlap between the train and test data speakers.
The results of the proposed algorithm were compared with the other voice activity detection methods introduced in this processing field. These methods include the algorithm presented by Sharma1 and Rajpoot [19] that utilizes a zero-crossing rate and clustering procedure and also the VAD method which uses a clustering method based on the Gaussian mixture model. Mavaddaty et al. [6] presented a VAD algorithm based on the energy of the sparse coefficient matrices extracted from the wavelet packet transform features of speech and noise signals.    The sparse coding procedure was based on a combination of orthogonal matching pursuit algorithm (OMP) and coherence criterion.
A VAD algorithm with a combination of convolutional recurrent neural network and a recurrent neural network was proposed by Wang and Zhang [20]. Also, a speech enhancement module was designed to improve the performance of VAD system in low signalnoise ratio conditions. Jordán et al. [21] introduced a VAD system to identify correctly the speech frames based on recurrent neural networks. The model defined in this paper was learned using bidirectional long shortterm memory.
A comparison is also made with the method presented by Ahmadi and Joneidi [2], which is based on a sparse representation using the orthogonal matching pursuit (OMP) algorithm and K-SVD dictionary learning algorithm. As mentioned before the presented VAD algorithm employed SNMF learning method with a sparse-based statistical structure as a model learning method that has been widely used in signal processing in recent [22,23].
These results are presented in Tables 3 and 4. The results show that the proposed method correctly identifies the voice and silent regions of the input speech signal. This success and superiority over other methods can be due to the use of appropriate learned models and the dimension reduction algorithm to eliminate the outlier data during the training step. In these simulations, the results of the speaker-dependent scenario are better than the speaker-independent test, which can be due to the similarity between the speakers in the train and test steps. The results show that employing spectro-temporal features and speech signal processing through the auditory model is a desirable approach to identify the speech frames. The combination of these two techniques has many applications as a pre-processing step in speech signal analysis. The first two rows in Tables 3 and 4 are the same since the methods proposed by Sharma1 and Rajpoot [19], they did not employ the learning-based technique and the detection procedure for them occurs in one step, not in the different scenarios.
To investigate more the performance of the proposed method, the ROC curve obtained from the results of the proposed method and other comparable methods in the speaker-independent and speaker-dependent scenarios are shown in Figures 5 and 6, respectively, which emphasize the capability of the proposed method to achieve high accuracy in detection procedure .

2. Simulation Results in The Presence of Different Noise Signals
The quality of the speech signal can be significantly reduced in the presence of environmental noise signals and lead to the malfunction of hearing aids, automatic speech recognition systems, cell phones, etc. In this paper, a single-channel speech    processing corrupted by additive noise is considered. When the speech signal is exposed to non-stationary noise signals, the performance of the VAD algorithm decreases. This is especially for speech-like noise signals, that have fundamental overlap between the components in the spectro-temporal domain. Although the evaluation of the VAD algorithm is usually not performed in the presence of noise signals, in this paper, the performance of the proposed VAD algorithm in different noise conditions is investigated. In none of the references in which the proposed method has been compared with them, such as [2,[19][20][21][22][23], the performance evaluation in the presence of noise has not been done, so the results of this method have not been reviewed in the noise conditions and only the proposed method has been evaluated.
In this paper, a variety of noise signals consisting of white and babble noises from Noisex92 [24] car and train noises from Aurora2 [25] as well as piano noise from the piano society website 1 have been considered to have a proper investigation about the performance of the proposed method.
The block diagram of the proposed VAD algorithm to determine the labels of the input frames in the presence of the mentioned noise signals is shown in Figure 7. The learning procedures for speech and different noise signals were carried out with the same parameters in the SNMF coding algorithm and the dimension reduction technique.
In recent years, the use of sparse representation techniques for voice activity detector algorithms in a noisy condition has increased. An ideal VAD is used to acquire the data frames needed to learn the noise signal dictionary as reported by Sigg et al. [26]. The data frames obtained by the non-speech frames of the noisy signal are not usually enough to learn a dictionary with low approximation error. The noise dictionary learning algorithm in this approach is performed in the speech enhancement step and leads to a significant increase in the computation time. Also, Sigg et al. [27] presented a generative coherence-based dictionary learning method using the pure noise data to train noise dictionary models. The offline learning process was performed with enough noise signals. In this paper, the advantages of the SNMF technique were utilized to learn the dictionaries for scale, rate, and frequency features. The model learning procedure for the noise signals is done without any problems since adequate noise data is available. This learning process for speech and noise signals is carried . In this test step, the sum of the energies calculated from the sparse coefficient matrices for speech and noise signals is considered. The total energy calculated based on speech and noise model determines the label of the input noisy frames. If this calculated energy over the speech signal model is greater then the calculated energy over the noise model, the input frame is detected as speech frame, otherwise, a noise label is assigned to this frame. Also, if the energy difference calculated on the speech and noise models is less than a certain limit of ε2=0.08, then the total energy of the sparse coding coefficients for one frame before and one frame after the input frame is calculated over the speech and noise models to obtain the short-term energy of this representation. If the average of these calculated energies on the speech model is higher than the noise model, the speech label is assigned to the input frame, otherwise the noise label.
The average results of the proposed method to assign the proper labels in a speaker-independent scenario in the presence of various noise signals with 10dB SNR are shown in Table 5. Also, these results in a speakerdependent scenario are reported in Table 6. For more evaluation of the performance of the proposed VAD in different conditions, the average results of the proposed VAD in the speaker-independent and speaker-dependent scenarios in the presence of various noise signals with 5dB SNR are shown in Tables 7 and 8, respectively.   From the reported values in Tables 5-8, it can be concluded that the accuracy of the proposed method decreases as the SNR value decreases. Also, the accuracy of labeling to silence/speech in the presence of noise signals with stationary content such as white noise is higher than other noise signals. The accuracy in the presence of periodic piano noise signal with harmonic structure is more accurate than in other conditions. In    addition, the results of the proposed VAD algorithm has been considered in the presence of the car noise signal that has a stationary structure. But in the presence of babble noise, which is very similar to the speech signal, accuracy is greatly reduced. It should be noted that in the speaker-dependent scenario, the results are slightly higher than in the speaker-independent scenario in different situations because there is an overlap between the speakers in the train and test steps. Therefore, it can be concluded that the best results are obtained in the high SNR value, in the presence of white and piano noise signals, and the speaker-dependent scenario. Also, the performance of labeling in the case of speech frames that consist of consonant letters such as fricatives that have a similar structure to the noise signal may be decreased. The average accuracy values in the speaker-dependent and independent scenarios evaluated at two SNR values of 10dB and 5dB are represented in Figure 8. These results are obtained for a clean speech signal case and five stationary, non-stationary and periodic noises: white, car, train, babble, and piano signals. In general, it can be Figure 8. Performance comparison of the proposed method in terms of average accuracy in speaker-dependent and independent cases in 10dB and 5dB SNRs concluded that the reported results emphasize that the proposed VAD has an appropriate performance in different noisy conditions.

CONCLUSION
Voice activity detection methods are very effective in the various fields of signal analysis and speech processing as a pre-processing block. In this paper, this detection procedure is performed in the space of spectro-temporal features. The features extracted from this space are used to learn comprehensive models of the input data structure. The dimension of these feature matrices is reduced by the SSPCA algorithm. Then the resulted data are used to learn models using the SNMF method which has a sparse-based statistical structure. In the following, by computing the energy derived from the representation of the input frame features on the composite model, the label of the input frame is identified. Also, these results have been examined for an extensive range of noise types including the stationary, non-stationary, and periodic noise signals in two SNR values of 5dB and 10dB. The simulation results in both speaker-independent and speaker-dependent scenarios indicate the superior performance of the proposed method compared to the other methods presented in this processing field.

CONFLICT OF INTEREST
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.