

VLOGGER
Overview:
VLOGGER is a method for generating text- and audio-driven talking-human videos from a single input portrait image, building on recent generative diffusion models. The method consists of 1) a stochastic human-to-3D-motion diffusion model, and 2) a novel diffusion-based architecture that augments text-to-image models with temporal and spatial controls. This approach enables the generation of high-quality videos of variable length, easily controllable through high-level representations of human faces and bodies. Unlike previous work, the method does not require per-person training, nor does it rely on face detection and cropping. It generates complete images (rather than just faces or lips) and accounts for the broad range of scenarios needed to correctly synthesize human communication (e.g., visible torsos or diverse subject identities).
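The two-stage pipeline described above can be sketched as follows. This is a minimal, hypothetical illustration only: VLOGGER has no public API, so the function names, shapes, and stub bodies here are assumptions standing in for the real audio-to-motion diffusion model and the temporally conditioned video diffusion model.

```python
import numpy as np

def audio_to_motion(audio_features: np.ndarray, n_params: int = 64) -> np.ndarray:
    """Stage 1 (stub): stochastic audio-to-3D-motion diffusion model.
    Maps per-frame audio features to per-frame face/body motion parameters.
    A real model would iteratively denoise; this returns placeholder params."""
    rng = np.random.default_rng(0)
    n_frames = audio_features.shape[0]
    return rng.standard_normal((n_frames, n_params))

def render_video(reference_image: np.ndarray, motion: np.ndarray) -> np.ndarray:
    """Stage 2 (stub): temporally and spatially controlled image diffusion.
    Generates one full frame per motion vector, conditioned on the single
    reference portrait (no face cropping, whole image is synthesized)."""
    n_frames = motion.shape[0]
    h, w, c = reference_image.shape
    # Placeholder: broadcast the reference image across the frame dimension.
    return np.broadcast_to(reference_image, (n_frames, h, w, c)).copy()

# Usage: one portrait plus 5 frames of audio features -> 5 video frames.
portrait = np.zeros((256, 256, 3))
audio = np.zeros((5, 128))  # e.g. 5 frames of mel-spectrogram features
frames = render_video(portrait, audio_to_motion(audio))
print(frames.shape)  # (5, 256, 256, 3)
```

Because the video length is set by the number of audio frames, this structure naturally yields variable-length output, matching the behavior described above.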
Target Users:
Suited to scenarios that require generating dynamic video from a single static image, such as video editing and subject replacement.
Use Cases:
Generate realistic human videos
Edit existing video content
Video translation
Features:
Text- and audio-driven video generation
High-quality video generation
High controllability
Body motion simulation
Face and pose control