UNIMO-G
Overview:
UNIMO-G is a simple multimodal conditional diffusion framework that operates on prompts with interleaved textual and visual inputs. It comprises two core components: a multimodal large language model (MLLM) for encoding multimodal prompts, and a conditional denoising diffusion network for generating images based on the encoded multimodal input. The framework is trained with a two-stage strategy: first, pre-training on a large-scale text-image pair dataset to develop conditional image generation capabilities; then, fine-tuning with multimodal prompts to achieve unified image generation. A carefully designed data processing pipeline, including language grounding and image segmentation, is used to construct the multimodal prompts. UNIMO-G performs strongly on both text-to-image generation and zero-shot subject-driven synthesis, and is particularly effective at generating high-fidelity images from complex multimodal prompts involving multiple image entities.
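As a rough illustration of this two-component design, the sketch below wires a stand-in multimodal prompt encoder to a denoiser conditioned via cross-attention. All class names, dimensions, and the choice of cross-attention conditioning are assumptions for illustration only; none of this is taken from the released UNIMO-G code.

```python
# Minimal sketch of the two-component pipeline described above.
# Everything here (names, sizes, cross-attention conditioning) is assumed.
import torch
import torch.nn as nn

class MultimodalPromptEncoder(nn.Module):
    """Stand-in for the MLLM that encodes interleaved text/image inputs."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.text_embed = nn.Embedding(32000, dim)   # hypothetical vocab size
        self.image_proj = nn.Linear(1024, dim)       # hypothetical vision feature size

    def forward(self, text_ids: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # Concatenate text and image tokens into one conditioning sequence.
        return torch.cat([self.text_embed(text_ids), self.image_proj(image_feats)], dim=1)

class ConditionalDenoiser(nn.Module):
    """Stand-in for the denoising diffusion network conditioned on the prompt."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, noisy_latents: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # Cross-attend from noisy image latents to the multimodal prompt sequence.
        out, _ = self.attn(noisy_latents, cond, cond)
        return out  # predicted noise (illustrative)

encoder, denoiser = MultimodalPromptEncoder(), ConditionalDenoiser()
cond = encoder(torch.randint(0, 32000, (1, 12)), torch.randn(1, 4, 1024))
noise_pred = denoiser(torch.randn(1, 64, 768), cond)
print(noise_pred.shape)  # torch.Size([1, 64, 768])
```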
Target Users:
UNIMO-G can be used in scenarios such as text-to-image generation and zero-shot subject-driven synthesis.
Total Visits: 29.7M
Top Region: US (17.94%)
Website Views: 114.8K
Use Cases
Generate high-fidelity images from complex multimodal prompts containing multiple image entities.
Use UNIMO-G for text-to-image generation.
Perform zero-shot subject-driven synthesis.
Features
Processes interwoven text and visual inputs
Generates images conditioned on encoded multimodal inputs
Two-stage training strategy (pre-training and guided fine-tuning)
Data processing pipeline including language grounding and image segmentation (see the sketch after this list)
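To make the last feature concrete, here is a hedged sketch of how a multimodal prompt might be constructed by grounding entity phrases in a caption and pairing each with a segmented image region. Every helper here (Entity, ground_phrases, segment, build_multimodal_prompt) is a hypothetical stand-in for illustration, not UNIMO-G's actual tooling.

```python
# Hedged sketch of multimodal-prompt construction via language grounding
# plus image segmentation. All helpers are hypothetical stand-ins.
from dataclasses import dataclass

@dataclass
class Entity:
    phrase: str   # grounded noun phrase from the caption
    crop: bytes   # segmented image region for that phrase (placeholder)

def ground_phrases(caption: str) -> list[str]:
    # Hypothetical language grounding: pick out entity phrases.
    return [w for w in caption.split() if w.istitle()]  # toy heuristic

def segment(image: bytes, phrase: str) -> bytes:
    # Hypothetical segmentation: crop the region matching the phrase.
    return image  # placeholder crop

def build_multimodal_prompt(caption: str, image: bytes) -> list[str | Entity]:
    grounded = set(ground_phrases(caption))
    prompt: list[str | Entity] = []
    for token in caption.split():
        if token in grounded:
            # Replace a grounded phrase with an interleaved text+image entity,
            # yielding the interwoven prompt the model is fine-tuned on.
            prompt.append(Entity(token, segment(image, token)))
        else:
            prompt.append(token)
    return prompt

print(build_multimodal_prompt("a Corgi wearing a Hat", b"raw-image-bytes"))
```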