Deep Floyd: A highly realistic text-to-image model

Overview:

Deep Floyd is an open-source text-to-image model that combines a high degree of realism with strong language understanding. It consists of a frozen text encoder and three cascaded pixel diffusion modules: a base model that generates 64x64 pixel images from text prompts, and two super-resolution models that raise the resolution first to 256x256 pixels and then to 1024x1024 pixels. Every stage uses a frozen T5 transformer-based text encoder to extract text embeddings, which are then fed into a UNet architecture enhanced with cross-attention and attention pooling. According to its authors, the model surpasses current state-of-the-art models, achieving a zero-shot FID score of 6.66 on the COCO dataset. The work highlights the potential of larger UNet architectures in the first stage of cascaded diffusion models and points to a promising future for text-to-image synthesis.
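
The data flow described above (one text embedding, a 64x64 base image, then two upscaling steps to 256x256 and 1024x1024) can be illustrated with a toy sketch. Nothing here is the real model: encode_prompt, base_stage, super_resolution_stage and run_cascade are hypothetical names invented for this example, the "encoder" is a simple character sum standing in for the frozen T5 encoder, and nearest-neighbour upscaling stands in for the super-resolution diffusion modules. Only the three resolutions come from the overview.

```python
from dataclasses import dataclass
from typing import List, Tuple

Image = List[List[float]]  # toy grayscale image: rows of pixel values


@dataclass(frozen=True)
class Stage:
    name: str
    out_size: int  # output width and height in pixels


# Resolutions taken from the overview; the stage names are illustrative.
CASCADE = [
    Stage("base", 64),
    Stage("super_resolution_1", 256),
    Stage("super_resolution_2", 1024),
]


def encode_prompt(prompt: str, dim: int = 8) -> List[float]:
    """Toy stand-in for the frozen text encoder (NOT T5)."""
    vec = [0.0] * dim
    for i, ch in enumerate(prompt):
        vec[i % dim] += ord(ch) / 255.0
    return vec


def base_stage(embedding: List[float], size: int) -> Image:
    """Toy stand-in for the base model: paints a pattern from the embedding."""
    dim = len(embedding)
    return [[embedding[(r + c) % dim] for c in range(size)] for r in range(size)]


def super_resolution_stage(image: Image, embedding: List[float], out_size: int) -> Image:
    """Toy stand-in for a super-resolution module: nearest-neighbour upscaling.

    The real stages are also conditioned on the text embedding; this toy
    accepts the embedding to mirror that interface but does not use it.
    """
    factor = out_size // len(image)
    return [
        [image[r // factor][c // factor] for c in range(out_size)]
        for r in range(out_size)
    ]


def run_cascade(prompt: str) -> Tuple[Image, List[int]]:
    """Run all three stages and return the final image plus each stage's size."""
    embedding = encode_prompt(prompt)  # computed once, shared by every stage
    image = base_stage(embedding, CASCADE[0].out_size)
    sizes = [len(image)]
    for stage in CASCADE[1:]:
        image = super_resolution_stage(image, embedding, stage.out_size)
        sizes.append(len(image))
    return image, sizes


if __name__ == "__main__":
    final_image, stage_sizes = run_cascade("a red fox in the snow")
    print(stage_sizes, len(final_image), len(final_image[0]))
```

The point of the sketch is the structure rather than the image quality: the prompt is encoded a single time, the same embedding is handed to every stage, and each stage only has to bridge a 4x jump in resolution instead of producing a 1024x1024 image directly.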

Target Users:

Anyone working on text-to-image synthesis and image generation tasks

Total Visits: 474.6M

Top Region: US (19.34%)

Website Views: 50.5K

Features

Generate highly realistic images

Understand text prompts and generate corresponding images

Support super-resolution image generation