

FLOAT
Overview
FLOAT is an audio-driven talking avatar video generation method built on a flow matching generative model. It shifts generative modeling from a pixel-based latent space to a learned motion latent space, enabling efficient, temporally consistent motion design. The method introduces a transformer-based vector field predictor with a simple yet effective frame-wise conditioning mechanism. FLOAT also supports speech-driven emotion enhancement, allowing expressive motion to be incorporated naturally. Extensive experiments demonstrate that FLOAT outperforms existing audio-driven avatar methods in visual quality, motion fidelity, and efficiency.
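To make the core idea concrete, here is a minimal sketch of a conditional flow matching objective of the kind described above, written in PyTorch. The `VectorFieldTransformer` class, tensor shapes, and conditioning interface are illustrative assumptions, not FLOAT's actual implementation; the real model is a full transformer operating on per-frame motion latents.

```python
# Sketch of optimal-transport conditional flow matching in a motion latent
# space. All names and shapes are assumptions for illustration.
import torch
import torch.nn as nn

class VectorFieldTransformer(nn.Module):
    """Hypothetical stand-in for FLOAT's transformer vector-field predictor."""
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        # A single linear layer stands in for the full transformer stack.
        self.net = nn.Linear(dim + cond_dim + 1, dim)

    def forward(self, x_t, t, cond):
        # Frame-wise conditioning: the time scalar and per-frame audio
        # features are concatenated onto every frame's motion latent.
        t = t.expand(x_t.shape[0], x_t.shape[1], 1)
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def flow_matching_loss(model, x1, cond):
    """Optimal-transport (straight-line) conditional flow matching.

    x1:   target motion latents, shape (batch, frames, dim)
    cond: per-frame conditioning features, shape (batch, frames, cond_dim)
    """
    x0 = torch.randn_like(x1)               # noise endpoint of the path
    t = torch.rand(x1.shape[0], 1, 1)       # one time value per sample
    x_t = (1 - t) * x0 + t * x1             # linear OT interpolation path
    target_velocity = x1 - x0               # constant velocity along the path
    pred_velocity = model(x_t, t, cond)
    return ((pred_velocity - target_velocity) ** 2).mean()
```

Training on this objective teaches the network the velocity field that transports noise to realistic motion latents; sampling then reduces to integrating that field, as sketched later under Features.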
Target Users
FLOAT is designed for developers, researchers, and content creators seeking to generate realistic talking avatar videos. With its efficient motion design and emotion enhancement features, FLOAT is particularly suited for professionals looking to incorporate natural expressions and emotions into their videos.
Use Cases
1. Generate a public speaking video with specific emotional expressions using FLOAT.
2. Utilize FLOAT technology to create realistic dialogue scenes for movies.
3. Create virtual characters with natural expressions for virtual reality applications using FLOAT.
Features
- Audio-driven avatar video generation: Synthesize talking avatar videos using a single avatar image and driving audio.
- Motion latent space encoding: Encode a given avatar image into an identity-motion latent representation via a motion latent autoencoder.
- Flow matching generation: Generate the talking avatar motion latent conditioned on audio through flow matching (with optimal transport trajectories).
- Emotion enhancement: Supports speech-driven emotion labels, providing a natural approach to generating emotionally aware talking avatar motions.
- Emotion redirection: Redirect the talking avatar's emotion at inference time using simple one-hot emotion labels (see the sampling sketch after this list).
- Comparison with state-of-the-art methods: Demonstrate FLOAT's advantages over both non-diffusion-based and diffusion-based approaches.
- Ablation studies: Ablate frame-wise AdaLN (and gating) and flow matching to validate their effectiveness.
- Varying numbers of function evaluations (NFEs): Show how small NFE counts affect temporal consistency and demonstrate that FLOAT generates reasonable videos with roughly 10 NFEs.
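The sketch below, continuing from the training sketch under Overview, shows inference-time sampling: a forward Euler integration of the learned vector field from noise to motion latents, with a one-hot emotion label appended to the per-frame conditioning and the NFE count exposed as a knob. The emotion vocabulary, `MOTION_DIM`, and all function names are assumptions for illustration, not FLOAT's actual API.

```python
import torch

MOTION_DIM = 512  # assumed motion-latent width; FLOAT's real size may differ
# Assumed emotion vocabulary; the actual label set may differ.
EMOTIONS = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]

def one_hot_emotion(name: str) -> torch.Tensor:
    label = torch.zeros(len(EMOTIONS))
    label[EMOTIONS.index(name)] = 1.0
    return label

@torch.no_grad()
def sample_motion(model, audio_feats: torch.Tensor, emotion: str,
                  nfe: int = 10) -> torch.Tensor:
    """Euler-integrate the learned vector field from noise to motion latents.

    audio_feats: per-frame audio features, shape (batch, frames, audio_dim).
    nfe: number of function evaluations; per the feature list, roughly 10
         steps already yield reasonable, temporally consistent output.
    """
    b, f, _ = audio_feats.shape
    emo = one_hot_emotion(emotion).expand(b, f, -1)  # same label every frame
    cond = torch.cat([audio_feats, emo], dim=-1)     # per-frame conditioning
    x = torch.randn(b, f, MOTION_DIM)                # start from noise
    dt = 1.0 / nfe
    for i in range(nfe):
        t = torch.full((b, 1, 1), i * dt)
        x = x + dt * model(x, t, cond)               # forward Euler step
    return x
```

Emotion redirection here is just swapping the one-hot label at sampling time; raising `nfe` trades compute for smoother trajectories, while values around 10 remain usable per the feature list above.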
How to Use
1. Visit the FLOAT project page and download the relevant code.
2. Prepare a single avatar image and the corresponding driving audio.
3. Configure audio conditions and emotion labels according to the documentation.
4. Run the FLOAT model to generate the talking avatar's motion latents via flow matching.
5. Decode the motion latents into a temporally consistent video.
6. Adjust emotion redirection and NFEs to optimize video output.
7. Export and view the generated realistic talking avatar video.
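To tie these steps together, here is a hypothetical end-to-end driver reusing `sample_motion` from the sketch above. The stub functions stand in for components the actual FLOAT release provides (audio feature extraction, motion-latent decoding); their names and signatures are assumptions, so consult the project's own documentation for the real entry point and options.

```python
import torch

def extract_audio_features(audio_path: str) -> torch.Tensor:
    """Stub: per-frame audio features, shape (1, frames, audio_dim)."""
    raise NotImplementedError("provided by the actual FLOAT pipeline")

def decode_to_video(avatar_image_path: str, motion_latents: torch.Tensor,
                    out_path: str) -> None:
    """Stub: the motion latent autoencoder's decoder renders the frames."""
    raise NotImplementedError("provided by the actual FLOAT pipeline")

def generate_talking_avatar(model, avatar_image_path: str, audio_path: str,
                            emotion: str = "happy",  # step 3: emotion label
                            nfe: int = 10,           # step 6: NFE knob
                            out_path: str = "result.mp4") -> None:
    audio_feats = extract_audio_features(audio_path)          # step 2 inputs
    motion = sample_motion(model, audio_feats, emotion, nfe)  # step 4
    decode_to_video(avatar_image_path, motion, out_path)      # steps 5 and 7
```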