

FLOAT
Overview
FLOAT is an audio-driven talking avatar video generation method built on a flow matching generative model. It shifts generative modeling from a pixel-based latent space to a learned motion latent space, enabling efficient, temporally consistent motion design. The method introduces a transformer-based vector field predictor with a simple yet effective frame-wise conditioning mechanism. FLOAT also supports speech-driven emotion enhancement, allowing expressive motion to be incorporated naturally. Extensive experiments demonstrate that FLOAT outperforms existing audio-driven avatar methods in visual quality, motion fidelity, and efficiency.
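To make the core idea concrete, here is a minimal sketch of a conditional flow matching objective of the kind described above, written in PyTorch. The `VectorFieldTransformer` class, tensor shapes, and conditioning interface are illustrative assumptions, not FLOAT's actual implementation; the real model is a full transformer operating on per-frame motion latents.

```python
# Sketch of optimal-transport conditional flow matching in a motion latent
# space. All names and shapes are assumptions for illustration.
import torch
import torch.nn as nn

class VectorFieldTransformer(nn.Module):
    """Hypothetical stand-in for FLOAT's transformer vector-field predictor."""
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        # A single linear layer stands in for the full transformer stack.
        self.net = nn.Linear(dim + cond_dim + 1, dim)

    def forward(self, x_t, t, cond):
        # Frame-wise conditioning: the time scalar and per-frame audio
        # features are concatenated onto every frame's motion latent.
        t = t.expand(x_t.shape[0], x_t.shape[1], 1)
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def flow_matching_loss(model, x1, cond):
    """Optimal-transport (straight-line) conditional flow matching.

    x1:   target motion latents, shape (batch, frames, dim)
    cond: per-frame conditioning features, shape (batch, frames, cond_dim)
    """
    x0 = torch.randn_like(x1)               # noise endpoint of the path
    t = torch.rand(x1.shape[0], 1, 1)       # one time value per sample
    x_t = (1 - t) * x0 + t * x1             # linear OT interpolation path
    target_velocity = x1 - x0               # constant velocity along the path
    pred_velocity = model(x_t, t, cond)
    return ((pred_velocity - target_velocity) ** 2).mean()
```

Training on this objective teaches the network the velocity field that transports noise to realistic motion latents; sampling then reduces to integrating that field, as sketched later under Features.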
Target Users
FLOAT is designed for developers, researchers, and content creators seeking to generate realistic talking avatar videos. With its efficient motion design and emotion enhancement features, FLOAT is particularly suited for professionals looking to incorporate natural expressions and emotions into their videos.
Use Cases
1. Generate a public speaking video with specific emotional expressions using FLOAT.
2. Utilize FLOAT technology to create realistic dialogue scenes for movies.
3. Create virtual characters with natural expressions for virtual reality applications using FLOAT.
Features
- Audio-driven avatar video generation: Synthesize talking avatar videos using a single avatar image and driving audio.
- Motion latent space encoding: Encode a given avatar image into an identity-motion latent representation via a motion latent autoencoder.
- Flow matching generation: Generate the talking avatar motion latent conditioned on audio through flow matching (with optimal transport trajectories).
- Emotion enhancement: Supports speech-driven emotion labels, providing a natural approach to generating emotionally aware talking avatar motions.
- Emotion redirection: Redirect the talking avatar's emotion at inference time using simple one-hot emotion labels (see the sampling sketch after this list).
- Comparison with state-of-the-art methods: Demonstrate FLOAT's advantages over both non-diffusion-based and diffusion-based approaches.
- Ablation studies: Ablate frame-wise AdaLN (and gating) and flow matching to validate their effectiveness.
- Varying numbers of function evaluations (NFEs): Show how small NFE counts affect temporal consistency and demonstrate that FLOAT generates reasonable videos with roughly 10 NFEs.
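The sketch below, continuing from the training sketch under Overview, shows inference-time sampling: a forward Euler integration of the learned vector field from noise to motion latents, with a one-hot emotion label appended to the per-frame conditioning and the NFE count exposed as a knob. The emotion vocabulary, `MOTION_DIM`, and all function names are assumptions for illustration, not FLOAT's actual API.

```python
import torch

MOTION_DIM = 512  # assumed motion-latent width; FLOAT's real size may differ
# Assumed emotion vocabulary; the actual label set may differ.
EMOTIONS = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]

def one_hot_emotion(name: str) -> torch.Tensor:
    label = torch.zeros(len(EMOTIONS))
    label[EMOTIONS.index(name)] = 1.0
    return label

@torch.no_grad()
def sample_motion(model, audio_feats: torch.Tensor, emotion: str,
                  nfe: int = 10) -> torch.Tensor:
    """Euler-integrate the learned vector field from noise to motion latents.

    audio_feats: per-frame audio features, shape (batch, frames, audio_dim).
    nfe: number of function evaluations; per the feature list, roughly 10
         steps already yield reasonable, temporally consistent output.
    """
    b, f, _ = audio_feats.shape
    emo = one_hot_emotion(emotion).expand(b, f, -1)  # same label every frame
    cond = torch.cat([audio_feats, emo], dim=-1)     # per-frame conditioning
    x = torch.randn(b, f, MOTION_DIM)                # start from noise
    dt = 1.0 / nfe
    for i in range(nfe):
        t = torch.full((b, 1, 1), i * dt)
        x = x + dt * model(x, t, cond)               # forward Euler step
    return x
```

Emotion redirection here is just swapping the one-hot label at sampling time; raising `nfe` trades compute for smoother trajectories, while values around 10 remain usable per the feature list above.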
How to Use
1. Visit the FLOAT project page and download the relevant code.
2. Prepare a single avatar image and the corresponding driving audio.
3. Configure audio conditions and emotion labels according to the documentation.
4. Run the FLOAT model to generate the talking avatar's motion latents via flow matching.
5. Decode the motion latents into a temporally consistent video.
6. Adjust emotion redirection and NFEs to optimize video output.
7. Export and view the generated realistic talking avatar video.
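To tie these steps together, here is a hypothetical end-to-end driver reusing `sample_motion` from the sketch above. The stub functions stand in for components the actual FLOAT release provides (audio feature extraction, motion-latent decoding); their names and signatures are assumptions, so consult the project's own documentation for the real entry point and options.

```python
import torch

def extract_audio_features(audio_path: str) -> torch.Tensor:
    """Stub: per-frame audio features, shape (1, frames, audio_dim)."""
    raise NotImplementedError("provided by the actual FLOAT pipeline")

def decode_to_video(avatar_image_path: str, motion_latents: torch.Tensor,
                    out_path: str) -> None:
    """Stub: the motion latent autoencoder's decoder renders the frames."""
    raise NotImplementedError("provided by the actual FLOAT pipeline")

def generate_talking_avatar(model, avatar_image_path: str, audio_path: str,
                            emotion: str = "happy",  # step 3: emotion label
                            nfe: int = 10,           # step 6: NFE knob
                            out_path: str = "result.mp4") -> None:
    audio_feats = extract_audio_features(audio_path)          # step 2 inputs
    motion = sample_motion(model, audio_feats, emotion, nfe)  # step 4
    decode_to_video(avatar_image_path, motion, out_path)      # steps 5 and 7
```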