SenSE

Semantic-Aware High-Fidelity Universal Speech Enhancement

Xingchen Li¹, Hanke Xie¹, Ziqian Wang¹, Zihan Zhang², Longshuai Xiao², Lei Xie¹
¹Northwestern Polytechnical University, ²Huawei Technologies Co., Ltd.

Abstract

Generative universal speech enhancement (USE) methods aim to leverage generative models to improve speech quality under various types of distortions. Diffusion- or flow-based generative models can produce enhanced speech with high quality and fidelity. However, they typically achieve speech enhancement by learning an acoustic feature mapping from degraded speech to clean speech, without awareness of high-level semantic information. This deficiency tends to cause semantic ambiguity and acoustic discontinuities in the enhanced speech. In contrast, humans can often comprehend heavily corrupted speech by relying on semantic priors, suggesting that semantics play a crucial role in speech enhancement. Therefore, in this paper, we propose SenSE, which leverages a language model to capture the semantic information of distorted speech and integrates it into a flow-matching-based speech enhancement framework. Specifically, we introduce a semantic-aware speech language model that captures the semantics of degraded speech and generates semantic tokens. We then design a semantic guidance mechanism that incorporates this semantic information into the flow-matching-based speech enhancement process, mitigating semantic ambiguity. In addition, we propose a prompt guidance mechanism, which leverages a short reference utterance to alleviate the loss of speaker similarity under severe distortion conditions. Results on several benchmark datasets demonstrate that SenSE not only ensures high perceptual quality but also substantially improves speech fidelity while maintaining strong robustness under severe distortions.
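
The abstract does not state the training objective or the sampler, so the following is only a minimal sketch of generic conditional flow matching, assuming the common straight-path formulation x_t = (1 - t) x_0 + t x_1 with regression target x_1 - x_0. It is not taken from SenSE itself. The names `cfm_training_pair`, `euler_sample`, `toy_velocity`, and the `cond` dictionary are hypothetical. In a real system the velocity would come from a trained network conditioned on degraded-speech features, semantic tokens, and a speaker prompt; here a closed-form toy field stands in for it.

```python
import random


def cfm_training_pair(x1, rng):
    """Build one straight-path flow-matching example for clean features x1."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]           # noise sample at t = 0
    t = rng.random()                                  # random time in [0, 1)
    xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]  # point on the path
    target_velocity = [b - a for a, b in zip(x0, x1)]     # regression target
    return xt, t, target_velocity


def euler_sample(velocity_fn, x0, cond, num_steps=32):
    """Integrate dx/dt = v(x, t, cond) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x, t, cond):
    """Hypothetical stand-in for a trained, conditioned velocity network.

    It returns the exact velocity of the straight path that ends at
    cond["target"]; a real model would have to predict this from the
    conditioning signals instead of being handed the answer.
    """
    return [(c - xi) / (1.0 - t) for xi, c in zip(x, cond["target"])]
```

For example, calling `euler_sample(toy_velocity, x0, {"target": x1})` with a noise vector `x0` moves it along the straight path to `x1`. The conditioning argument is where semantic and prompt guidance would enter in a model of the kind the abstract describes.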

SenSE Overview

Demo video

Demonstration of SenSE's semantic-aware speech enhancement capabilities

Video samples

Sample 1

Before Processing After Processing

Sample 2

Before Processing After Processing

Sample 3

Before Processing After Processing

Comparison with other models in real-recording scenarios

Real-recording Sample 1

Degraded Speech: audio (0:08) and spectrogram
Enhanced by PGUSE: audio (0:08) and spectrogram
Enhanced by LLaSE-G1: audio (0:08) and spectrogram
Enhanced by SenSE: audio (0:08) and spectrogram

Real-recording Sample 2

Degraded Speech: audio (0:08) and spectrogram
Enhanced by PGUSE: audio (0:08) and spectrogram
Enhanced by LLaSE-G1: audio (0:08) and spectrogram
Enhanced by SenSE: audio (0:08) and spectrogram

Real-recording Sample 3

Degraded Speech: audio (0:08) and spectrogram
Enhanced by PGUSE: audio (0:08) and spectrogram
Enhanced by LLaSE-G1: audio (0:08) and spectrogram
Enhanced by SenSE: audio (0:08) and spectrogram

Performance under various distortion types

Sample 1

Distortion type: Cicada chirping, broadband noise
Clean Speech: audio (0:07) and spectrogram
Degraded Speech: audio (0:08) and spectrogram
Enhanced Speech: audio (0:08) and spectrogram

Sample 2

Distortion type: Musical noise
Clean Speech: audio (0:08) and spectrogram
Degraded Speech: audio (0:09) and spectrogram
Enhanced Speech: audio (0:09) and spectrogram

Sample 3

Distortion type: Baby crying
Clean Speech: audio (0:07) and spectrogram
Degraded Speech: audio (0:08) and spectrogram
Enhanced Speech: audio (0:08) and spectrogram

Sample 4

Distortion type: Clipping, machine noise
Clean Speech: audio (0:09) and spectrogram
Degraded Speech: audio (0:10) and spectrogram
Enhanced Speech: audio (0:10) and spectrogram

Sample 5

Distortion type: Reverberation, machine noise
Clean Speech: audio (0:07) and spectrogram
Degraded Speech: audio (0:08) and spectrogram
Enhanced Speech: audio (0:08) and spectrogram

Sample 6

Distortion type: Reverberation, background noise
Clean Speech: audio (0:07) and spectrogram
Degraded Speech: audio (0:08) and spectrogram
Enhanced Speech: audio (0:08) and spectrogram

Sample 7

Distortion type: Bandwidth limitation, strong reverberation, birdsong
Clean Speech: audio (0:07) and spectrogram
Degraded Speech: audio (0:08) and spectrogram
Enhanced Speech: audio (0:08) and spectrogram