SynParaSpeech

Automated Synthesis of Paralinguistic Datasets for Speech Generation and Understanding

Bingsong Bai1*, Qihang Lu1*, Wenbing Yang1*, Zihan Sun2, Yueran Hou2, Peilei Jia2, Songbai Pu2, Ruibo Fu3, Yingming Gao1, Ya Li1†, Jun Gao2
1School of Artificial Intelligence, Beijing University of Posts and Telecommunications, China
2Hello Group Inc., China
3Institute of Automation, Chinese Academy of Sciences, China

Abstract

Paralinguistic sounds, such as laughter and sighs, are crucial for realistic speech synthesis: they make natural conversation more engaging and authentic. However, current methods depend heavily on proprietary datasets, while publicly available resources often lack accompanying speech, omit timestamp annotations, or fail to reflect real-world usage. To mitigate these issues, we propose an automated framework for synthesizing large-scale paralinguistic datasets and introduce SynParaSpeech, which comprises 6 paralinguistic categories and 118.75 hours of Chinese speech with precise timestamp annotations. Our work contributes the first automated synthesis method for such datasets, the release of SynParaSpeech, paralinguistic speech synthesis models improved via fine-tuning, and enhanced paralinguistic event detection through prompt tuning.

Automated Synthesis Pipeline

  1. Labeled Text Synthesis: ASR models (Whisper, Paraformer) generate transcriptions with VAD-based timestamp correction. LLMs insert paralinguistic tags at appropriate positions.
  2. Audio Synthesis: Paralinguistic audio clips are converted to match speech timbre using SeedVC, then inserted into speech segments at annotated timestamps.
  3. Verification: Manual checks ensure naturalness, timbre consistency, audio quality, and timing alignment.
[Figure: Overview of the SynParaSpeech pipeline]
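The audio synthesis step (step 2) can be sketched as follows. This is a minimal illustration under assumed conventions, not the released pipeline code: the function name `splice_event`, the sample-list representation, and the example values are all hypothetical, and the real pipeline operates on SeedVC-converted audio rather than raw numbers.

```python
# Sketch of step 2 (audio synthesis): splice a timbre-converted
# paralinguistic clip into a speech waveform at an annotated timestamp.
# All names and values here are illustrative.

def splice_event(speech, event, timestamp_s, sample_rate=16000):
    """Insert `event` samples into `speech` at `timestamp_s` seconds."""
    idx = int(timestamp_s * sample_rate)
    if idx < 0 or idx > len(speech):
        raise ValueError("timestamp falls outside the utterance")
    # Concatenate: speech before the event, the event, speech after.
    return speech[:idx] + event + speech[idx:]

# Toy example: a 4-sample utterance at 4 samples/s, with a 2-sample
# "laugh" clip inserted at the 0.5 s mark.
speech = [0.0, 0.1, 0.2, 0.3]
laugh = [0.9, 0.8]
mixed = splice_event(speech, laugh, timestamp_s=0.5, sample_rate=4)
# mixed is [0.0, 0.1, 0.9, 0.8, 0.2, 0.3]
```

In practice the inserted clip's timbre has already been matched to the speaker via SeedVC, so the splice point is the only seam that manual verification (step 3) needs to check for timing alignment.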

Dataset Overview

SynParaSpeech covers 6 paralinguistic categories (laughter, sigh, throat clearing, gasp, tsk, pause), with 118.75 hours of audio and precise timestamp annotations. The dataset is constructed via an automated pipeline combining ASR transcription, LLM-based paralinguistic tagging, voice conversion, and manual verification.

Category         Hours    Clips   Avg. (s)   Share
Sigh             28.22   17,706       5.74   23.76%
Throat Clearing  25.45   18,827       4.87   21.43%
Laugh            20.84   13,023       5.76   17.55%
Pause            18.30    9,643       6.83   15.41%
Tsk              14.82   11,941       4.47   12.48%
Gasp             11.11    8,846       4.52    9.36%
Total           118.75   79,986       5.34  100.00%
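The derived columns in the table above (average clip length and duration share) follow directly from the hours and clip counts. A small consistency check, using only the figures reported in this section:

```python
# Recompute the per-category statistics from the reported hours and clip
# counts. This is only a sanity check on the table, not release code.

stats = {  # category: (hours, clips)
    "Sigh": (28.22, 17706),
    "Throat Clearing": (25.45, 18827),
    "Laugh": (20.84, 13023),
    "Pause": (18.30, 9643),
    "Tsk": (14.82, 11941),
    "Gasp": (11.11, 8846),
}

total_hours = 118.75  # reported total duration of the dataset

for name, (hours, clips) in stats.items():
    avg_s = hours * 3600 / clips       # average clip length in seconds
    share = hours / total_hours * 100  # share of total duration
    print(f"{name}: avg {avg_s:.2f} s, share {share:.2f}%")
```

For example, the Sigh category yields 28.22 * 3600 / 17,706 ≈ 5.74 s per clip and 28.22 / 118.75 ≈ 23.76% of the total duration, matching the table.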

Paralinguistic TTS Improvement

Fine-tuning on SynParaSpeech improves the paralinguistic generation quality of both CosyVoice2 and F5-TTS.

BibTeX

@article{bai2025synparaspeech,
    title={SynParaSpeech: Automated Synthesis of Paralinguistic Datasets for Speech Generation and Understanding},
    author={Bingsong Bai and Qihang Lu and Wenbing Yang and Zihan Sun and Yueran Hou and Peilei Jia and Songbai Pu and Ruibo Fu and Yingming Gao and Ya Li and Jun Gao},
    journal={arXiv preprint arXiv:2509.14946},
    year={2025}
}