StyleTTS 2
Overview:
StyleTTS 2 is a text-to-speech (TTS) model that uses style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. It models style as a latent random variable sampled through diffusion, generating the most appropriate style for the given text without requiring reference speech. It also employs large pre-trained SLMs, such as WavLM, as discriminators, together with differentiable duration modeling, for end-to-end training that improves the naturalness of the synthesized speech. As judged by native English-speaking evaluators, StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches them on the multi-speaker VCTK dataset. When trained on the LibriTTS dataset, it also outperforms previous publicly available models for zero-shot speaker adaptation. By demonstrating the potential of style diffusion and adversarial training with large SLMs, this work achieves human-level TTS synthesis on both single- and multi-speaker datasets.
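For intuition, below is a minimal, self-contained PyTorch sketch of the style diffusion idea: a denoiser conditioned on a text embedding iteratively refines Gaussian noise into a style vector, so no reference audio is needed at inference. The dimensions, network, step count, and update rule here are illustrative assumptions, not the model's actual EDM-based sampler.

import torch
import torch.nn as nn

STYLE_DIM, TEXT_DIM, STEPS = 128, 512, 30  # hypothetical sizes

class StyleDenoiser(nn.Module):
    # Predicts a denoised style vector from a noisy style, a timestep,
    # and the text embedding that conditions the diffusion.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STYLE_DIM + TEXT_DIM + 1, 256),
            nn.SiLU(),
            nn.Linear(256, STYLE_DIM),
        )

    def forward(self, noisy_style, t, text_emb):
        t_feat = t.expand(noisy_style.size(0), 1)
        return self.net(torch.cat([noisy_style, t_feat, text_emb], dim=-1))

@torch.no_grad()
def sample_style(denoiser, text_emb):
    # Start from pure noise and repeatedly blend in the denoiser's
    # estimate (a simple DDIM-like loop, purely for illustration).
    style = torch.randn(text_emb.size(0), STYLE_DIM)
    for step in reversed(range(STEPS)):
        t = torch.tensor([[step / STEPS]])
        pred = denoiser(style, t, text_emb)
        style = style + (pred - style) / (step + 1)  # move toward estimate
    return style

denoiser = StyleDenoiser()
text_emb = torch.randn(1, TEXT_DIM)  # stand-in for a real text encoder output
style = sample_style(denoiser, text_emb)
print(style.shape)  # torch.Size([1, 128])

The point of sampling rather than regressing a single style is that the same text can yield many plausible styles, which is what lets the model produce varied, natural prosody without a voice reference.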
Target Users:
Developers and researchers working on text-to-speech synthesis tasks
Total Visits: 474.6M
Top Region: US (19.34%)
Website Views: 213.9K
Features
Generates the most suitable style for the input text through style diffusion
Utilizes large pre-trained SLMs, such as WavLM, as discriminators (see the sketch after this list)
Incorporates differentiable duration modeling for end-to-end training
Achieves human-level TTS synthesis on both single- and multi-speaker datasets
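To illustrate the SLM-as-discriminator idea, here is a hedged sketch: a frozen pre-trained WavLM (loaded via Hugging Face transformers) extracts features from raw 16 kHz audio, and a small trainable head scores whether a clip is real or synthesized. The head architecture and mean-pooling are assumptions made for this example; the paper's discriminator design differs in detail.

import torch
import torch.nn as nn
from transformers import WavLMModel

class SLMDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.wavlm = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")
        self.wavlm.requires_grad_(False)  # keep the pre-trained SLM frozen
        hidden = self.wavlm.config.hidden_size
        self.head = nn.Sequential(
            nn.Linear(hidden, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1)
        )

    def forward(self, waveform):
        # waveform: (batch, samples) at 16 kHz
        feats = self.wavlm(waveform).last_hidden_state  # (batch, frames, hidden)
        return self.head(feats.mean(dim=1))  # one realness score per clip

disc = SLMDiscriminator()
fake_audio = torch.randn(1, 16000)  # one second of stand-in audio
score = disc(fake_audio)
print(score.shape)  # torch.Size([1, 1])

Because the SLM has already learned rich speech representations, a discriminator built on it pushes the generator toward perceptually natural speech rather than merely matching spectrogram statistics.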