Theses

Implicit and explicit phase modeling in deep learning-based source separation

Manuel Pariente 1
1 MULTISPEECH - Speech Modeling for Facilitating Oral-Based Communication
Inria Nancy - Grand Est, LORIA - NLPKD - Department of Natural Language Processing & Knowledge Discovery
Abstract : Whether processed by humans or machines, speech occupies a central place in our daily lives, yet distortions such as noise or competing speakers reduce both human understanding and machine performance. Audio source separation and speech enhancement aim at solving this problem. To perform separation and enhancement, most traditional approaches rely on the magnitude short-time Fourier transform (STFT), thus discarding the phase. Thanks to their increased representational power, deep neural networks (DNNs) have recently made it possible to break that assumption and exploit the fine-grained spectro-temporal information provided by the phase. In this thesis, we study the impact of implicit and explicit phase modeling in deep discriminative and generative models with application to source separation and speech enhancement.
In a first stage, we consider the task of discriminative source separation based on the encoder-masker-decoder framework popularized by TasNet. We propose a unified view of learned and fixed filterbanks and extend two previously proposed learnable filterbanks by making them analytic, thus enabling the computation of the magnitude and phase of the resulting representation. We study the amount of information provided by the magnitude and phase components as a function of the window size. Results on the WHAM dataset show that for all filterbanks the best performance is achieved for short 2 ms windows and that, for such short windows, phase modeling is indeed crucial. Interestingly, this also holds for STFT-based models, which even surpass the performance of oracle magnitude masking. This work has formed the basis of Asteroid, the PyTorch-based audio source separation toolkit for researchers, of which we then present the main features as well as example results obtained with it.
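As an illustration of the analytic filterbank idea mentioned above, the following is a minimal NumPy sketch, not the thesis implementation. It assumes that a real filter is made analytic by pairing it with its Hilbert transform (computed here through the FFT), and that the encoder is a strided correlation of the signal with the filters. The function names `analytic_filters` and `magnitude_and_phase` are hypothetical and chosen only for this example.

```python
import numpy as np


def analytic_filters(real_filters):
    """Return complex filters whose real part is the input filters and whose
    imaginary part is their Hilbert transform (assumed construction)."""
    n = real_filters.shape[-1]
    spectrum = np.fft.fft(real_filters, axis=-1)
    # Keep DC (and Nyquist for even n), double positive frequencies,
    # and zero out negative frequencies.
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        weights[1:n // 2] = 2.0
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * weights, axis=-1)


def magnitude_and_phase(signal, filters, stride):
    """Apply a complex filterbank to a real 1-D signal with the given stride
    and return the magnitude and phase of the resulting representation."""
    window = filters.shape[-1]
    # One row per analysis frame: shape (n_frames, window).
    frames = np.lib.stride_tricks.sliding_window_view(signal, window)[::stride]
    representation = frames @ filters.T  # shape (n_frames, n_filters)
    return np.abs(representation), np.angle(representation)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    learned = rng.standard_normal((4, 16))  # stand-in for learned real filters
    complex_bank = analytic_filters(learned)
    mag, phase = magnitude_and_phase(rng.standard_normal(64), complex_bank, stride=8)
    print(mag.shape, phase.shape)
```

With this construction, the real part of the complex representation is exactly the output of the original real filterbank, so the magnitude and phase carry the original information plus its quadrature counterpart.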
Second, we tackle the speech enhancement task with an approach based on a popular deep generative model, the variational autoencoder (VAE), which models the complex STFT coefficients in a given time frame as independent zero-mean complex Gaussian variables whose variances depend on a latent representation. By combining a VAE model for the speech variances and a nonnegative matrix factorization (NMF) model for the noise variances, we propose a variational inference algorithm to iteratively infer these variances and derive an estimate of the clean speech signal. In particular, the encoder of the pretrained VAE can be used to estimate the variational approximation of the true posterior distribution, using the very same assumption made to train VAEs. Experiments show that the proposed method produces results on par with other VAE-based methods, while decreasing the computational cost by a factor of 36.
Building on this study, we integrate time-frequency dependency and phase modeling capabilities into the VAE-based generative model by relaxing the time-frequency independence assumption and assuming a multivariate zero-mean Gaussian model over the entire complex STFT conditional on the latent representation. The covariance matrix of that model is parameterized by its sparse Cholesky factor, which constitutes the VAE’s output. The sparsity pattern is chosen so that local time and frequency dependencies can be expressed. We evaluate the proposed method for speech separation on the WSJ0 dataset as a function of the chosen dependency pattern.
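The two generative models described in the abstract can be written compactly as follows. This is a sketch under assumed notation, not the thesis's own: $s_{ft}$, $b_{ft}$ and $x_{ft}$ denote the speech, noise and mixture STFT coefficients at frequency $f$ and frame $t$, $z_t$ is the latent variable, $\sigma_f^2(\cdot)$ is the VAE decoder output, $W$ and $H$ are the NMF factors, and $q(z_t)$ is the variational posterior. Gain terms and other refinements are omitted.

```latex
% Independent model: VAE speech variances, NMF noise variances
s_{ft} \mid z_t \sim \mathcal{N}_c\bigl(0, \sigma_f^2(z_t)\bigr), \qquad
b_{ft} \sim \mathcal{N}_c\bigl(0, (WH)_{ft}\bigr), \qquad
x_{ft} = s_{ft} + b_{ft}

% Wiener-like estimate of the clean speech, averaged over the latent posterior
\hat{s}_{ft} = \mathbb{E}_{q(z_t)}\!\left[
  \frac{\sigma_f^2(z_t)}{\sigma_f^2(z_t) + (WH)_{ft}}
\right] x_{ft}

% Relaxed model: full covariance over the vectorized STFT \mathbf{s},
% parameterized by a sparse lower-triangular Cholesky factor L(z)
\mathbf{s} \mid z \sim \mathcal{N}_c\bigl(\mathbf{0}, \Sigma(z)\bigr), \qquad
\Sigma(z) = L(z)\, L(z)^{\mathsf{H}}
```

In the independent model the estimate only rescales each mixture coefficient, so the mixture phase is kept unchanged; the full-covariance model is what allows dependencies between neighboring time-frequency coefficients, and hence phase relationships, to be expressed.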

https://hal.univ-lorraine.fr/tel-03395953
Contributor : Thèses Ul
Submitted on : Friday, October 22, 2021 - 3:11:39 PM
Last modification on : Saturday, October 23, 2021 - 4:09:51 AM

File

DDOC_T_2021_0150_PARIENTE.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : tel-03395953, version 1

Citation

Manuel Pariente. Implicit and explicit phase modeling in deep learning-based source separation. Machine Learning [stat.ML]. Université de Lorraine, 2021. English. ⟨NNT : 2021LORR0150⟩. ⟨tel-03395953⟩
