This paper presents an approach for improving the perceptual quality of speech separated from background noise at low signal-to-noise ratios. Our approach uses two stages of deep neural networks, where the first stage estimates the ideal ratio mask that separates speech from noise, and the second stage maps the ratio-masked speech to the clean speech activation matrices that are used for nonnegative matrix factorization (NMF).
Supervised NMF systems make assumptions about the relationship between the activation and basis matrices that do not always hold. Other two-stage approaches that combine masking with NMF reconstruction do not account for mask estimation errors. We show that the proposed algorithm achieves higher objective speech quality and intelligibility than these related methods.
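The two-stage pipeline can be sketched as follows. This is a minimal illustration with randomly generated placeholders, not the paper's implementation: the ratio mask would come from the first DNN, and the mapping from masked speech to activations would be performed by the second DNN rather than the multiplicative NMF updates substituted here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions: F frequency bins, T frames, K NMF basis vectors.
F, T, K = 64, 100, 16

# Hypothetical stand-ins for trained components: a noisy magnitude
# spectrogram and a stage-1 ratio mask in [0, 1] (from the first DNN).
noisy_mag = np.abs(rng.standard_normal((F, T)))
ratio_mask = rng.uniform(0.0, 1.0, size=(F, T))

# Stage 1: apply the estimated ratio mask to suppress noise.
masked_speech = ratio_mask * noisy_mag

# Pre-learned clean-speech NMF basis matrix W (nonnegative, F x K).
W = np.abs(rng.standard_normal((F, K)))

# Stage 2 (placeholder): estimate clean-speech activations H so that
# W @ H approximates the masked speech; the paper uses a DNN mapping,
# here we use standard multiplicative NMF updates instead.
H = np.abs(rng.standard_normal((K, T)))
for _ in range(50):
    H *= (W.T @ masked_speech) / (W.T @ (W @ H) + 1e-12)

# Reconstruct the enhanced speech magnitude from the NMF model.
enhanced_mag = W @ H
```

The multiplicative update keeps `H` nonnegative by construction, which is why it is a common stand-in when a learned activation estimator is unavailable.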