StyleTTS 2
Overview:
StyleTTS 2 is a text-to-speech (TTS) model that uses style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. It models style as a latent random variable sampled through diffusion, generating the most appropriate style for the given text without requiring reference speech. It also employs large pre-trained SLMs, such as WavLM, as discriminators, together with differentiable duration modeling, for end-to-end training that improves the naturalness of the synthesized speech. As judged by native English-speaking evaluators, StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches them on the multi-speaker VCTK dataset. When trained on the LibriTTS dataset, it also outperforms previous publicly available models for zero-shot speaker adaptation. By demonstrating the potential of style diffusion and adversarial training with large SLMs, this work achieves human-level TTS synthesis on both single- and multi-speaker datasets.
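For intuition, below is a minimal, self-contained PyTorch sketch of the style diffusion idea: a denoiser conditioned on a text embedding iteratively refines Gaussian noise into a style vector, so no reference audio is needed at inference. The dimensions, network, step count, and update rule here are illustrative assumptions, not the model's actual EDM-based sampler.

import torch
import torch.nn as nn

STYLE_DIM, TEXT_DIM, STEPS = 128, 512, 30  # hypothetical sizes

class StyleDenoiser(nn.Module):
    # Predicts a denoised style vector from a noisy style, a timestep,
    # and the text embedding that conditions the diffusion.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STYLE_DIM + TEXT_DIM + 1, 256),
            nn.SiLU(),
            nn.Linear(256, STYLE_DIM),
        )

    def forward(self, noisy_style, t, text_emb):
        t_feat = t.expand(noisy_style.size(0), 1)
        return self.net(torch.cat([noisy_style, t_feat, text_emb], dim=-1))

@torch.no_grad()
def sample_style(denoiser, text_emb):
    # Start from pure noise and repeatedly blend in the denoiser's
    # estimate (a simple DDIM-like loop, purely for illustration).
    style = torch.randn(text_emb.size(0), STYLE_DIM)
    for step in reversed(range(STEPS)):
        t = torch.tensor([[step / STEPS]])
        pred = denoiser(style, t, text_emb)
        style = style + (pred - style) / (step + 1)  # move toward estimate
    return style

denoiser = StyleDenoiser()
text_emb = torch.randn(1, TEXT_DIM)  # stand-in for a real text encoder output
style = sample_style(denoiser, text_emb)
print(style.shape)  # torch.Size([1, 128])

The point of sampling rather than regressing a single style is that the same text can yield many plausible styles, which is what lets the model produce varied, natural prosody without a voice reference.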
Target Users:
Developers and researchers working on text-to-speech synthesis tasks
Total Visits: 474.6M
Top Region: US (19.34%)
Website Views: 213.9K
Features
Generates the most suitable style for the input text through style diffusion
Utilizes large pre-trained SLMs, such as WavLM, as discriminators (see the sketch after this list)
Incorporates differentiable duration modeling for end-to-end training
Achieves human-level TTS synthesis on both single- and multi-speaker datasets
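To illustrate the SLM-as-discriminator idea, here is a hedged sketch: a frozen pre-trained WavLM (loaded via Hugging Face transformers) extracts features from raw 16 kHz audio, and a small trainable head scores whether a clip is real or synthesized. The head architecture and mean-pooling are assumptions made for this example; the paper's discriminator design differs in detail.

import torch
import torch.nn as nn
from transformers import WavLMModel

class SLMDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.wavlm = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")
        self.wavlm.requires_grad_(False)  # keep the pre-trained SLM frozen
        hidden = self.wavlm.config.hidden_size
        self.head = nn.Sequential(
            nn.Linear(hidden, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1)
        )

    def forward(self, waveform):
        # waveform: (batch, samples) at 16 kHz
        feats = self.wavlm(waveform).last_hidden_state  # (batch, frames, hidden)
        return self.head(feats.mean(dim=1))  # one realness score per clip

disc = SLMDiscriminator()
fake_audio = torch.randn(1, 16000)  # one second of stand-in audio
score = disc(fake_audio)
print(score.shape)  # torch.Size([1, 1])

Because the SLM has already learned rich speech representations, a discriminator built on it pushes the generator toward perceptually natural speech rather than merely matching spectrogram statistics.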