Make-An-Audio 2
Overview
Make-An-Audio 2 is a text-to-audio generation technology based on diffusion models, co-developed by researchers from Zhejiang University, ByteDance, and the Chinese University of Hong Kong. It uses pre-trained large language models (LLMs) to parse input text, optimizing for semantic alignment and temporal consistency and thereby improving the quality of the generated audio. It also incorporates a feed-forward Transformer-based diffusion denoiser, which improves variable-length audio generation and strengthens the extraction of temporal information. In addition, it leverages LLMs to convert abundant audio-label data into audio-text datasets, addressing the scarcity of temporally annotated data.
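To make the text-parsing idea concrete, below is a minimal Python sketch of how an LLM could convert a free-form caption into structured (event, temporal order) pairs for the downstream encoder. The prompt wording, the `call_llm` placeholder, and the `<event>@<order>` output format are illustrative assumptions, not the paper's exact interface.

```python
# Minimal sketch of the LLM-based caption parsing described above.
# `call_llm`, the prompt, and the "<event>@<order>" format are assumptions.
from typing import List, Tuple

PARSE_PROMPT = (
    "Extract the sound events and their temporal order from this caption. "
    "Answer as lines of the form <event>@<start|mid|end|all>.\n"
    "Caption: {caption}"
)

def call_llm(prompt: str) -> str:
    """Placeholder for any instruction-following LLM client."""
    raise NotImplementedError("plug in your own LLM client here")

def parse_caption(caption: str) -> List[Tuple[str, str]]:
    """Turn a free-form caption into (event, order) pairs for the structured text encoder."""
    raw = call_llm(PARSE_PROMPT.format(caption=caption))
    pairs = []
    for line in raw.strip().splitlines():
        if "@" in line:
            event, order = line.split("@", 1)
            pairs.append((event.strip(), order.strip()))
    return pairs

# e.g. "A dog barks, then a car passes by" might yield
# [("a dog barks", "start"), ("a car passes by", "end")]
```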
Target Users
This technology is aimed at researchers and developers in the field of audio synthesis, as well as applications that require high-quality text-to-audio generation, such as automatic dubbing and audiobook production. Make-An-Audio 2 can generate high-quality audio that is semantically aligned with the input text and temporally consistent, meeting the needs of these users.
Use Cases
Automatic generation of background sound effects and dialogues for audiobooks.
Automatic addition of narration and sound effects to video content.
Creation of virtual character voices for games or animations.
Features
Uses pre-trained large language models (LLMs) to parse text, improving the capture of temporal information.
Introduces a structured text encoder to aid semantic-alignment learning during the diffusion denoising process.
Designs a feed-forward Transformer-based diffusion denoiser to improve variable-length audio generation (see the sketch after this list).
Uses LLMs to augment and convert audio-label data, alleviating the scarcity of temporally annotated data.
Outperforms baseline models on both objective and subjective metrics, with clear gains in temporal understanding, semantic consistency, and sound quality.
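The following PyTorch sketch illustrates the feed-forward Transformer denoiser idea: a stack of Transformer encoder layers predicting noise over variable-length 1D audio latents, conditioned on a diffusion timestep embedding and text features. The dimensions, the sinusoidal timestep embedding, and the simple additive text conditioning are assumptions for illustration, not the paper's exact architecture.

```python
# Minimal PyTorch sketch of a feed-forward Transformer diffusion denoiser over
# variable-length 1D audio latents. Sizes and conditioning are illustrative assumptions.
import math
import torch
import torch.nn as nn

class FeedForwardTransformerDenoiser(nn.Module):
    def __init__(self, latent_dim=64, model_dim=384, text_dim=768, n_layers=6, n_heads=6):
        super().__init__()
        self.in_proj = nn.Linear(latent_dim, model_dim)
        self.text_proj = nn.Linear(text_dim, model_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=model_dim, nhead=n_heads, dim_feedforward=4 * model_dim,
            batch_first=True, norm_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out_proj = nn.Linear(model_dim, latent_dim)
        self.model_dim = model_dim

    def timestep_embedding(self, t: torch.Tensor) -> torch.Tensor:
        """Standard sinusoidal embedding of the diffusion timestep."""
        half = self.model_dim // 2
        freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / half)
        args = t[:, None].float() * freqs[None]
        return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

    def forward(self, noisy_latents, t, text_emb):
        # noisy_latents: (B, T, latent_dim); T can vary, giving variable-length audio
        # text_emb:      (B, L, text_dim), pooled here for simplicity
        h = self.in_proj(noisy_latents)
        h = h + self.timestep_embedding(t)[:, None, :]
        h = h + self.text_proj(text_emb).mean(dim=1, keepdim=True)
        h = self.backbone(h)
        return self.out_proj(h)  # predicted noise, same shape as the input latents

# Usage: x = torch.randn(2, 250, 64); t = torch.randint(0, 1000, (2,))
# cond = torch.randn(2, 20, 768); eps = FeedForwardTransformerDenoiser()(x, t, cond)
```

Because the backbone operates on a 1D latent sequence rather than a fixed-size 2D spectrogram grid, the number of latent frames directly controls the duration of the generated audio.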
How to Use
Step 1: Prepare natural language text as input.
Step 2: Parse the text using Make-An-Audio 2's Text Encoder.
Step 3: Utilize the structured text encoder to assist in learning semantic alignment.
Step 4: Generate audio using the diffusion denoiser.
Step 5: Adjust the length and temporal controls of the generated audio.
Step 6: Modify the structured input as needed for precise temporal control.
Step 7: Generate the final audio output (a minimal wiring sketch follows these steps).
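As a rough illustration of how these steps fit together, the hypothetical sketch below wires a caption parser, structured text encoder, diffusion denoiser, and latent decoder into a single generation call. Every loader and the `ddim_sample` function are placeholders rather than the project's actual API, and `parse_caption` refers to the earlier parsing sketch.

```python
# Hypothetical end-to-end wiring of the numbered steps above.
# All loaders and the sampler are placeholders, not the project's actual API.
import torch

def load_structured_text_encoder():
    raise NotImplementedError("placeholder: returns a structured-caption encoder")

def load_denoiser():
    raise NotImplementedError("placeholder: returns the trained diffusion denoiser")

def load_latent_decoder():
    raise NotImplementedError("placeholder: returns the latent-to-waveform decoder")

def ddim_sample(denoiser, latents, cond, steps=50):
    raise NotImplementedError("placeholder: iterative denoising loop")

def generate_audio(caption: str, duration_s: float = 10.0, latent_rate: int = 25):
    pairs = parse_caption(caption)                  # Steps 1-3: parse and encode the text
    cond = load_structured_text_encoder()(pairs)
    n_frames = int(duration_s * latent_rate)        # Steps 5-6: length set via latent frames
    latents = torch.randn(1, n_frames, 64)
    latents = ddim_sample(load_denoiser(), latents, cond)  # Step 4: diffusion denoising
    return load_latent_decoder()(latents)           # Step 7: decode latents to a waveform
```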