SenSE

Semantic-Aware High-Fidelity Universal Speech Enhancement

Xingchen Li¹, Hanke Xie¹, Ziqian Wang¹, Zihan Zhang², Longshuai Xiao², Shuai Wang³, Lei Xie¹

¹Northwestern Polytechnical University, ²Huawei Technologies Co., Ltd., ³Nanjing University

Abstract

Generative Universal Speech Enhancement (USE) methods aim to leverage generative models to improve speech quality under various types of distortions. However, existing generative speech enhancement methods often suffer from semantic inconsistency in the generated outputs. To address this, we propose SenSE, a novel two-stage generative universal speech enhancement framework. By modeling semantic priors with a language model, SenSE guides the flow-matching-based speech enhancement process to generate semantically faithful speech, thereby improving context fidelity. In addition, we introduce a dual-path masked conditioning training strategy that enables flow-matching-based enhancement to flexibly integrate multi-source conditioning signals from degraded speech, semantic tokens, and reference speech, thereby improving model flexibility and adaptability. Experimental results demonstrate that SenSE achieves state-of-the-art performance among generative speech enhancement models and exhibits a high performance ceiling, particularly under challenging distortion conditions.
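To make the idea of training with multi-source conditioning more concrete, here is a minimal sketch of generic condition masking: each conditioning stream is independently replaced with zeros during training, so that a model learns to work with any subset of sources. This is an illustrative assumption, not SenSE's actual dual-path strategy, whose details are not given on this page. The function `mask_conditions`, the drop probabilities, and the toy float features (standing in for degraded speech, semantic tokens, and reference speech) are all hypothetical.

```python
import random


def mask_conditions(conditions, drop_prob, rng):
    """Independently zero out each conditioning stream (hypothetical helper).

    conditions: dict mapping a source name to a list of floats (toy features).
    drop_prob:  dict mapping the same names to a drop probability in [0, 1].
    rng:        a random.Random instance, passed in for reproducibility.
    Returns (masked_conditions, keep_flags).
    """
    masked, kept = {}, {}
    for name, feats in conditions.items():
        # rng.random() is in [0, 1): probability 0.0 always keeps the stream,
        # probability 1.0 always drops it.
        keep = rng.random() >= drop_prob[name]
        kept[name] = keep
        masked[name] = list(feats) if keep else [0.0] * len(feats)
    return masked, kept


# Toy stand-ins for the three conditioning sources named in the abstract.
conditions = {
    "degraded_speech": [0.2, -0.1, 0.4],
    "semantic_tokens": [1.0, 3.0, 2.0],
    "reference_speech": [0.5, 0.5, 0.5],
}
# Assumed probabilities, chosen only to show the three cases.
drop_prob = {
    "degraded_speech": 0.0,   # always kept
    "semantic_tokens": 0.5,   # kept about half the time
    "reference_speech": 1.0,  # always dropped
}

masked, kept = mask_conditions(conditions, drop_prob, random.Random(0))
```

In a real training loop the masked streams would be fed to the enhancement model in place of the originals; at inference time, whichever sources are available can then be supplied and the rest left zeroed.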

SenSE Overview

Demo video

Demonstration of SenSE's semantic-aware speech enhancement capabilities

Video samples

Sample 1

Before Processing	After Processing

Sample 2

Before Processing	After Processing

Sample 3

Before Processing	After Processing

Comparison with other models in real-recording scenarios

Real-recording Sample 1

Degraded Speech	Enhanced by PGUSE	Enhanced by LLaSE-G1	Enhanced by SenSE
[audio]	[audio]	[audio]	[audio]

Real-recording Sample 2

Degraded Speech	Enhanced by PGUSE	Enhanced by LLaSE-G1	Enhanced by SenSE
[audio]	[audio]	[audio]	[audio]

Real-recording Sample 3

Degraded Speech	Enhanced by PGUSE	Enhanced by LLaSE-G1	Enhanced by SenSE
[audio]	[audio]	[audio]	[audio]

Performance under various distortion types

Sample 1

Distortion Type	Clean Speech	Degraded Speech	Enhanced Speech
Cicada chirping, broadband noise	[audio]	[audio]	[audio]

Sample 2

Distortion Type	Clean Speech	Degraded Speech	Enhanced Speech
Musical noise	[audio]	[audio]	[audio]

Sample 3

Distortion Type	Clean Speech	Degraded Speech	Enhanced Speech
Baby crying	[audio]	[audio]	[audio]

Sample 4

Distortion Type	Clean Speech	Degraded Speech	Enhanced Speech
Clipping, machine noise	[audio]	[audio]	[audio]

Sample 5

Distortion Type	Clean Speech	Degraded Speech	Enhanced Speech
Reverberation, machine noise	[audio]	[audio]	[audio]

Sample 6

Distortion Type	Clean Speech	Degraded Speech	Enhanced Speech
Reverberation, background noise	[audio]	[audio]	[audio]

Sample 7

Distortion Type	Clean Speech	Degraded Speech	Enhanced Speech
Bandwidth limitation, strong reverberation, birdsong	[audio]	[audio]	[audio]