SynParaSpeech

Automated Synthesis of Paralinguistic Datasets for Speech Generation and Understanding

Bingsong Bai1*, Qihang Lu1*, Wenbing Yang1*, Zihan Sun2, Yueran Hou2, Peilei Jia2, Songbai Pu2, Ruibo Fu3, Yingming Gao1, Ya Li1†, Jun Gao2
1School of Artificial Intelligence, Beijing University of Posts and Telecommunications, China
2Hello Group Inc., China
3Institute of Automation, Chinese Academy of Sciences, China

Abstract

Paralinguistic sounds, such as laughter and sighs, are crucial for realistic speech synthesis: they make natural conversation more engaging and authentic. However, current methods depend heavily on proprietary datasets, while publicly available resources often lack accompanying speech, omit timestamp annotations, or fail to reflect real-world usage. To mitigate these issues, we propose an automated framework for synthesizing large-scale paralinguistic datasets and introduce SynParaSpeech, which comprises 6 paralinguistic categories and 118.75 hours of Chinese speech with precise timestamp annotations. Our work contributes the first automated synthesis method for such datasets, the release of SynParaSpeech, paralinguistic speech synthesis models improved via fine-tuning, and enhanced paralinguistic event detection through prompt tuning.

Automated Synthesis Pipeline

  1. Labeled Text Synthesis: ASR models (Whisper, Paraformer) generate transcriptions with VAD-based timestamp correction. LLMs insert paralinguistic tags at appropriate positions.
  2. Audio Synthesis: Paralinguistic audio clips are converted to match speech timbre using SeedVC, then inserted into speech segments at annotated timestamps.
  3. Verification: Manual checks ensure naturalness, timbre consistency, audio quality, and timing alignment.
[Figure: Overview of the SynParaSpeech pipeline]
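The audio synthesis step (step 2) can be sketched as follows. This is a minimal illustration under assumed conventions, not the released pipeline code: the function name `splice_event`, the sample-list representation, and the example values are all hypothetical, and the real pipeline operates on SeedVC-converted audio rather than raw numbers.

```python
# Sketch of step 2 (audio synthesis): splice a timbre-converted
# paralinguistic clip into a speech waveform at an annotated timestamp.
# All names and values here are illustrative.

def splice_event(speech, event, timestamp_s, sample_rate=16000):
    """Insert `event` samples into `speech` at `timestamp_s` seconds."""
    idx = int(timestamp_s * sample_rate)
    if idx < 0 or idx > len(speech):
        raise ValueError("timestamp falls outside the utterance")
    # Concatenate: speech before the event, the event, speech after.
    return speech[:idx] + event + speech[idx:]

# Toy example: a 4-sample utterance at 4 samples/s, with a 2-sample
# "laugh" clip inserted at the 0.5 s mark.
speech = [0.0, 0.1, 0.2, 0.3]
laugh = [0.9, 0.8]
mixed = splice_event(speech, laugh, timestamp_s=0.5, sample_rate=4)
# mixed is [0.0, 0.1, 0.9, 0.8, 0.2, 0.3]
```

In practice the inserted clip's timbre has already been matched to the speaker via SeedVC, so the splice point is the only seam that manual verification (step 3) needs to check for timing alignment.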

Dataset Overview

SynParaSpeech covers 6 paralinguistic categories (laughter, sigh, throat clearing, gasp, tsk, pause), with 118.75 hours of audio and precise timestamp annotations. The dataset is constructed via an automated pipeline combining ASR transcription, LLM-based paralinguistic tagging, voice conversion, and manual verification.

Category         Hours    Clips   Avg. (s)   Share
Sigh             28.22   17,706       5.74   23.76%
Throat Clearing  25.45   18,827       4.87   21.43%
Laugh            20.84   13,023       5.76   17.55%
Pause            18.30    9,643       6.83   15.41%
Tsk              14.82   11,941       4.47   12.48%
Gasp             11.11    8,846       4.52    9.36%
Total           118.75   79,986       5.34  100.00%
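The derived columns in the table above (average clip length and duration share) follow directly from the hours and clip counts. A small consistency check, using only the figures reported in this section:

```python
# Recompute the per-category statistics from the reported hours and clip
# counts. This is only a sanity check on the table, not release code.

stats = {  # category: (hours, clips)
    "Sigh": (28.22, 17706),
    "Throat Clearing": (25.45, 18827),
    "Laugh": (20.84, 13023),
    "Pause": (18.30, 9643),
    "Tsk": (14.82, 11941),
    "Gasp": (11.11, 8846),
}

total_hours = 118.75  # reported total duration of the dataset

for name, (hours, clips) in stats.items():
    avg_s = hours * 3600 / clips       # average clip length in seconds
    share = hours / total_hours * 100  # share of total duration
    print(f"{name}: avg {avg_s:.2f} s, share {share:.2f}%")
```

For example, the Sigh category yields 28.22 * 3600 / 17,706 ≈ 5.74 s per clip and 28.22 / 118.75 ≈ 23.76% of the total duration, matching the table.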

Paralinguistic TTS Improvement

Fine-tuning on SynParaSpeech improves the paralinguistic generation quality of both CosyVoice2 and F5-TTS.

BibTeX

@article{bai2025synparaspeech,
    title={SynParaSpeech: Automated Synthesis of Paralinguistic Datasets for Speech Generation and Understanding},
    author={Bingsong Bai and Qihang Lu and Wenbing Yang and Zihan Sun and Yueran Hou and Peilei Jia and Songbai Pu and Ruibo Fu and Yingming Gao and Ya Li and Jun Gao},
    journal={arXiv preprint arXiv:2509.14946},
    year={2025}
}