SenSE

Semantic-Aware High-Fidelity Universal Speech Enhancement

Xingchen Li1, Hanke Xie1, Ziqian Wang1, Zihan Zhang2, Longshuai Xiao2, Shuai Wang3, Lei Xie1
1Northwestern Polytechnical University, 2Huawei Technologies Co., Ltd., 3Nanjing University

Abstract

Generative Universal Speech Enhancement (USE) methods aim to leverage generative models to improve speech quality under various types of distortion. However, existing generative speech enhancement methods often suffer from semantic inconsistency in the generated outputs. We therefore propose SenSE, a novel two-stage generative universal speech enhancement framework. By modeling semantic priors with a language model, SenSE guides the flow-matching-based speech enhancement process to generate semantically faithful speech, thereby improving context fidelity. In addition, we introduce a dual-path masked conditioning training strategy that enables the flow-matching-based enhancement model to integrate conditioning signals from multiple sources (degraded speech, semantic tokens, and reference speech), improving its flexibility and adaptability. Experimental results demonstrate that SenSE achieves state-of-the-art performance among generative speech enhancement models and exhibits a high performance ceiling, particularly under challenging distortion conditions.
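The abstract does not spell out the training procedure, so the following is only a rough illustration of the two ingredients it names: a flow-matching training pair and random masking of conditioning streams. It is a generic NumPy sketch, not the SenSE implementation. The straight-line (rectified-flow-style) path, the independent per-stream dropout, the function names, the feature shapes, and the `p_drop` value are all assumptions made for illustration, and the sketch does not attempt to reproduce the paper's dual-path scheme.

```python
import numpy as np

def mask_conditions(degraded, semantic, reference, rng, p_drop=0.3):
    """Independently zero out each conditioning stream with probability p_drop.

    Hypothetical helper: the stream names follow the abstract, but the
    masking rule here is an assumption, not the authors' method.
    """
    streams = {"degraded": degraded, "semantic": semantic, "reference": reference}
    masked, kept = {}, {}
    for name, feat in streams.items():
        keep = bool(rng.random() >= p_drop)
        kept[name] = keep
        masked[name] = feat if keep else np.zeros_like(feat)
    return masked, kept

def flow_matching_pair(clean, rng):
    """Build one training example on a straight path from noise to clean features.

    Returns the interpolated point x_t, the time t in [0, 1), and the
    target velocity (clean - noise) that a model would regress.
    """
    noise = rng.standard_normal(clean.shape)
    t = float(rng.random())
    x_t = (1.0 - t) * noise + t * clean
    v_target = clean - noise
    return x_t, t, v_target

def fm_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((v_pred - v_target) ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.standard_normal((100, 80))      # assumed (frames, mel bins)
    conds, kept = mask_conditions(
        degraded=rng.standard_normal((100, 80)),
        semantic=rng.standard_normal((100, 16)),  # assumed token embeddings
        reference=rng.standard_normal((50, 80)),
        rng=rng,
    )
    x_t, t, v_target = flow_matching_pair(clean, rng)
    # A real model would predict the velocity from (x_t, t, conds);
    # a zero prediction stands in for it here.
    print(kept, round(t, 3), fm_loss(np.zeros_like(v_target), v_target))
```

Training with some streams masked is one common way to let a single model run with whatever conditioning is available at inference time, which is the kind of flexibility the abstract describes.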

SenSE Overview

Demo video

Demonstration of SenSE's semantic-aware speech enhancement capabilities

Video samples

Sample 1

Before Processing | After Processing

Sample 2

Before Processing | After Processing

Sample 3

Before Processing | After Processing

Comparison with other models in real-recording scenarios

Real-recording Sample 1

Degraded Speech | Enhanced by PGUSE | Enhanced by LLaSE-G1 | Enhanced by SenSE
Audio 0:08, Degraded Speech Spectrogram | Audio 0:08, PGUSE Enhanced Spectrogram | Audio 0:08, LLaSE-G1 Enhanced Spectrogram | Audio 0:08, SenSE Enhanced Spectrogram

Real-recording Sample 2

Degraded Speech | Enhanced by PGUSE | Enhanced by LLaSE-G1 | Enhanced by SenSE
Audio 0:08, Degraded Speech Spectrogram | Audio 0:08, PGUSE Enhanced Spectrogram | Audio 0:08, LLaSE-G1 Enhanced Spectrogram | Audio 0:08, SenSE Enhanced Spectrogram

Real-recording Sample 3

Degraded Speech | Enhanced by PGUSE | Enhanced by LLaSE-G1 | Enhanced by SenSE
Audio 0:08, Degraded Speech Spectrogram | Audio 0:08, PGUSE Enhanced Spectrogram | Audio 0:08, LLaSE-G1 Enhanced Spectrogram | Audio 0:08, SenSE Enhanced Spectrogram

Performance under various distortion types

Sample 1

Distortion Type | Clean Speech | Degraded Speech | Enhanced Speech
Cicada chirping, broadband noise | Audio 0:07, Clean Speech Spectrogram | Audio 0:08, Noisy Speech Spectrogram | Audio 0:08, Enhanced Speech Spectrogram

Sample 2

Distortion Type | Clean Speech | Degraded Speech | Enhanced Speech
Musical noise | Audio 0:08, Clean Speech Spectrogram | Audio 0:09, Noisy Speech Spectrogram | Audio 0:09, Enhanced Speech Spectrogram

Sample 3

Distortion Type | Clean Speech | Degraded Speech | Enhanced Speech
Baby crying | Audio 0:07, Clean Speech Spectrogram | Audio 0:08, Noisy Speech Spectrogram | Audio 0:08, Enhanced Speech Spectrogram

Sample 4

Distortion Type | Clean Speech | Degraded Speech | Enhanced Speech
Clipping, machine noise | Audio 0:09, Clean Speech Spectrogram | Audio 0:10, Noisy Speech Spectrogram | Audio 0:10, Enhanced Speech Spectrogram

Sample 5

Distortion Type | Clean Speech | Degraded Speech | Enhanced Speech
Reverberation, machine noise | Audio 0:07, Clean Speech Spectrogram | Audio 0:08, Noisy Speech Spectrogram | Audio 0:08, Enhanced Speech Spectrogram

Sample 6

Distortion Type | Clean Speech | Degraded Speech | Enhanced Speech
Reverberation, background noise | Audio 0:07, Clean Speech Spectrogram | Audio 0:08, Noisy Speech Spectrogram | Audio 0:08, Enhanced Speech Spectrogram

Sample 7

Distortion Type | Clean Speech | Degraded Speech | Enhanced Speech
Bandwidth limitation, strong reverberation, birdsong | Audio 0:07, Clean Speech Spectrogram | Audio 0:08, Noisy Speech Spectrogram | Audio 0:08, Enhanced Speech Spectrogram