A New VAD Algorithm using Sparse Representation in Spectro-Temporal Domain
Subject area: Speech Processing
Mohadeseh Eshaghi 1, Farbod Razzazi 2, Alireza Behrad 3
1 - Islamic Azad University, Science and Research Branch
2 - Islamic Azad University, Science and Research Branch
3 - Shahed University
Keywords: Speech Processing, Voice Activity Detector (VAD), Spectro-Temporal Domain Representation, Sparse Representation, NMF, K-SVD
Abstract:
This paper proposes two algorithms for Voice Activity Detection (VAD) based on sparse representation in the spectro-temporal domain. The first algorithm operates on a two-dimensional STRF (Spectro-Temporal Response Field) space and represents each frame sparsely over learned dictionaries; dictionaries with different atom sizes and two dictionary learning methods were investigated in this approach. This algorithm yields good results at high SNRs (signal-to-noise ratios). The second, more elaborate algorithm detects speech using sparse representation in the full four-dimensional STRF space. Because the four-dimensional STRF space is very large, it was partitioned into cubes, and a separate dictionary was learned for each cube with the NMF (non-negative matrix factorization) algorithm. Simulation results illustrate the effectiveness of the proposed VAD algorithms: the achieved performance was 90.11% and 91.75% at -5 dB SNR in white and car noise, respectively, outperforming most state-of-the-art VAD algorithms.
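As a rough illustration of the dictionary-based detection idea described above (not the authors' exact pipeline), the sketch below learns a speech dictionary with NMF on non-negative spectro-temporal features and labels a frame as speech when its non-negative coding over that dictionary reconstructs it with low relative error. The feature layout, the dictionary size `n_atoms`, the threshold `tau`, and the residual-based decision rule are illustrative assumptions; the paper's STRF features, K-SVD variant, and per-cube dictionaries are not reproduced here.

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.decomposition import NMF

def learn_speech_dictionary(speech_feats, n_atoms=64):
    """speech_feats: (n_frames, n_features) non-negative spectro-temporal
    features (e.g. magnitudes of a 2-D STRF slice, one row per frame)."""
    model = NMF(n_components=n_atoms, init="nndsvda", max_iter=300)
    model.fit(speech_feats)
    # Each column of D is one dictionary atom in the feature space.
    return model.components_.T               # shape: (n_features, n_atoms)

def is_speech(frame, D, tau=0.35):
    """Code one frame non-negatively over D and threshold the relative
    reconstruction error: speech frames should be well explained by D."""
    coeffs, _ = nnls(D, frame)               # non-negative (typically sparse) activations
    residual = np.linalg.norm(frame - D @ coeffs) / (np.linalg.norm(frame) + 1e-12)
    return residual < tau

# Hypothetical usage:
# D = learn_speech_dictionary(train_speech_feats, n_atoms=64)
# decisions = [is_speech(f, D) for f in noisy_feats]   # noisy_feats: iterable of frames
```

The threshold comparison here stands in for whatever classifier or likelihood test the paper actually applies to the sparse coefficients; it only shows how learned non-negative dictionaries can separate speech-like frames from noise-like ones.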