

Make-An-Audio 2
Overview:
Make-An-Audio 2 is a text-to-audio generation technology based on diffusion models, co-developed by researchers from Zhejiang University, ByteDance, and the Chinese University of Hong Kong. It uses pre-trained large language models (LLMs) to parse input text, improving semantic alignment and temporal consistency and thereby the quality of the generated audio. It also incorporates a feed-forward Transformer-based diffusion denoiser, which improves performance on variable-length audio generation and strengthens the extraction of temporal information. Finally, it uses LLMs to convert abundant audio-label data into audio-text datasets, alleviating the scarcity of temporally annotated audio data.
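The benefit of a sequence-based (feed-forward Transformer) denoiser for variable-length audio can be sketched with a toy example: the latent is a 1D sequence of frames whose length scales with the target duration, and shorter clips in a batch are simply padded and masked. The frame rate below is an assumed example value, not the model's real configuration.

```python
# Toy illustration (assumed numbers, not Make-An-Audio 2's actual
# configuration) of variable-length latent sequences.

FRAMES_PER_SECOND = 25  # assumed latent frame rate


def latent_lengths(durations_s):
    """Number of latent frames for each clip duration in seconds."""
    return [int(d * FRAMES_PER_SECOND) for d in durations_s]


def pad_and_mask(lengths):
    """Pad a batch of sequence lengths to the max; 1 marks real frames."""
    max_len = max(lengths)
    return [[1] * n + [0] * (max_len - n) for n in lengths]


lengths = latent_lengths([4.0, 10.0])  # [100, 250]
masks = pad_and_mask(lengths)          # two rows, each 250 entries long
```

A Transformer attends over whatever sequence length it is given, so clips of different durations become sequences of different lengths rather than requiring a fixed-size 2D grid.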
Target Users:
This technology is aimed at researchers and developers in the field of audio synthesis, as well as at applications that require high-quality text-to-audio generation, such as automatic dubbing and audiobook production. Through these techniques, Make-An-Audio 2 can generate high-quality audio that is semantically aligned with the text and temporally consistent, meeting the needs of these users.
Use Cases
Automatic generation of background sound effects and dialogues for audiobooks.
Automatic addition of narration and sound effects to video content.
Creation of virtual character voices for games or animations.
Features
Uses pre-trained large language models (LLMs) to parse text, improving the capture of temporal information.
Introduces a structured text encoder that aids semantic-alignment learning during the diffusion denoising process.
Employs a feed-forward Transformer-based diffusion denoiser, improving performance on variable-length audio generation.
Uses LLMs to augment and convert audio-label data, alleviating the scarcity of temporally annotated audio data.
Outperforms baseline models on both objective and subjective metrics, with clear gains in temporal understanding, semantic consistency, and sound quality.
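The LLM-based parsing in the first feature can be pictured as turning a free-form caption into structured pairs of events and coarse order labels. The toy below imitates that output with simple string splitting; it is an illustrative sketch, not the project's actual parser, and the order labels are assumed examples.

```python
import re

# Toy illustration (not Make-An-Audio 2's code) of converting a caption
# into structured <event & order> pairs, the kind of representation the
# LLM-based parser described above produces.


def parse_caption(caption):
    # Split on common temporal connectives; assign coarse order labels.
    parts = re.split(r"\s*(?:,\s*)?(?:and then|followed by|then)\s+", caption)
    order = ["start", "mid", "end"]
    return [(event.strip().rstrip("."), order[min(i, len(order) - 1)])
            for i, event in enumerate(parts)]


pairs = parse_caption("A dog barks, then a car passes by")
# -> [('A dog barks', 'start'), ('a car passes by', 'mid')]
```

In the real system an LLM performs this step, which handles far more varied phrasing than a fixed list of connectives can.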
How to Use
Step 1: Prepare natural language text as input.
Step 2: Parse the text using Make-An-Audio 2's Text Encoder.
Step 3: Utilize the structured text encoder to assist in learning semantic alignment.
Step 4: Generate audio using the diffusion denoiser.
Step 5: Adjust the duration and temporal controls of the generated audio.
Step 6: Modify the structured input as needed for precise time control.
Step 7: Generate the final audio output.
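The steps above can be sketched end to end as follows. Every function name here is an illustrative stub, not Make-An-Audio 2's documented API; a real run would load the released model checkpoints instead of these placeholders.

```python
# Hypothetical end-to-end sketch of the workflow above; all names and
# numbers are assumptions for illustration only.

FRAMES_PER_SECOND = 25  # assumed latent frame rate


def parse_with_llm(caption):
    # Steps 1-2: parse the caption (stubbed as one untimed event).
    return [(caption, "all")]


def encode_structured(pairs):
    # Step 3: structured text encoder producing conditioning features.
    return {"events": pairs}


def diffusion_denoise(cond, duration_s):
    # Steps 4-5: denoise a latent sequence sized by the target duration.
    return {"cond": cond, "frames": int(duration_s * FRAMES_PER_SECOND)}


def decode_to_waveform(latent):
    # Step 7: decode the latent into a waveform (stubbed as silence).
    return [0.0] * latent["frames"]


def generate_audio(caption, duration_s=10.0, structured_override=None):
    # Step 6: a caller may pass structured_override for precise timing.
    structured = structured_override or parse_with_llm(caption)
    cond = encode_structured(structured)
    latent = diffusion_denoise(cond, duration_s)
    return decode_to_waveform(latent)


wav = generate_audio("birds chirping, then distant thunder", duration_s=2.0)
```

The `structured_override` parameter mirrors step 6: editing the structured input directly gives precise control over event timing without re-parsing the caption.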