Generative universal speech enhancement (USE) methods leverage generative models to improve speech quality under diverse types of distortion. However, existing generative speech enhancement methods often suffer from semantic inconsistency in their generated outputs. We therefore propose SenSE, a novel two-stage generative universal speech enhancement framework. By modeling semantic priors with a language model, SenSE guides the flow-matching-based enhancement process to generate semantically faithful speech, thereby improving context fidelity. In addition, we introduce a dual-path masked conditioning training strategy that enables the flow-matching-based enhancement to flexibly integrate multi-source conditioning signals from degraded speech, semantic tokens, and reference speech, improving the model's flexibility and adaptability. Experimental results demonstrate that SenSE achieves state-of-the-art performance among generative speech enhancement models and exhibits a high performance ceiling, particularly under challenging distortion conditions.
Demonstration of SenSE's semantic-aware speech enhancement capabilities
| Before Processing | After Processing |
|---|---|
| Degraded Speech | Enhanced by PGUSE | Enhanced by LLaSE-G1 | Enhanced by SenSE |
|---|---|---|---|
| [audio, 0:08] | [audio, 0:08] | [audio, 0:08] | [audio, 0:08] |
| [audio, 0:08] | [audio, 0:08] | [audio, 0:08] | [audio, 0:08] |
| [audio, 0:08] | [audio, 0:08] | [audio, 0:08] | [audio, 0:08] |
| Distortion Type | Clean Speech | Degraded Speech | Enhanced Speech |
|---|---|---|---|
| Cicada chirping, broadband noise | [audio, 0:07] | [audio, 0:08] | [audio, 0:08] |
| Musical noise | [audio, 0:08] | [audio, 0:09] | [audio, 0:09] |
| Baby crying | [audio, 0:07] | [audio, 0:08] | [audio, 0:08] |
| Clipping, machine noise | [audio, 0:09] | [audio, 0:10] | [audio, 0:10] |
| Reverberation, machine noise | [audio, 0:07] | [audio, 0:08] | [audio, 0:08] |
| Reverberation, background noise | [audio, 0:07] | [audio, 0:08] | [audio, 0:08] |
| Bandwidth limitation, strong reverberation, birdsong | [audio, 0:07] | [audio, 0:08] | [audio, 0:08] |