Mini-Gemini: A multimodal large language model capable of understanding and generating images


Tags: AI image generation, AI model, #Multimodal, #Visual Language Model, #Large Language Model, #Image Understanding, #Image Generation, Standard Picks, Open Source

Overview:

Mini-Gemini is a multimodal vision-language model that supports a series of dense and MoE large language models ranging from 2B to 34B parameters. It is capable of image understanding, reasoning, and generation. Built on LLaVA, it uses dual vision encoders: one provides low-resolution visual embeddings and the other provides high-resolution candidate regions. Patch-level information mining then matches each low-resolution visual query against the high-resolution regions, and the resulting visual tokens are fused with text in the large language model for understanding and generation tasks. It works with several visual understanding datasets and benchmarks, including COCO, GQA, OCR-VQA, and VisualGenome.
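The patch-level mining step described above can be sketched as a small cross-attention in which each low-resolution query attends to its own block of high-resolution features. This is a minimal illustration, not the project's implementation: the function names, the grid layout, the residual addition, and the omission of learned projections are all assumptions made to keep the example short.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def group_regions(high_res, grid, m):
    """Reshape a (grid*m, grid*m, C) feature map into (grid*grid, m*m, C).

    Region n = i*grid + j holds the m x m block of high-resolution features
    that covers low-resolution patch (i, j). This layout is an assumption.
    """
    c = high_res.shape[-1]
    x = high_res.reshape(grid, m, grid, m, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (grid, grid, m, m, C)
    return x.reshape(grid * grid, m * m, c)

def patch_info_mining(queries, high_res, grid, m):
    """Hypothetical helper: each low-res query attends to its own region.

    queries:  (grid*grid, C) low-resolution visual embeddings
    high_res: (grid*m, grid*m, C) high-resolution feature map
    Learned projections and any output MLP are left out (assumption).
    """
    regions = group_regions(high_res, grid, m)            # (N, m*m, C)
    scale = np.sqrt(queries.shape[-1])
    scores = np.einsum("nc,nkc->nk", queries, regions) / scale
    weights = softmax(scores, axis=-1)                    # one distribution per query
    mined = np.einsum("nk,nkc->nc", weights, regions)     # weighted region summary
    return queries + mined, weights

rng = np.random.default_rng(0)
grid, m, c = 2, 2, 4
low = rng.normal(size=(grid * grid, c))
high = rng.normal(size=(grid * m, grid * m, c))
tokens, attn = patch_info_mining(low, high, grid, m)
print(tokens.shape, attn.shape)
```

Restricting each query to its own region keeps the number of visual tokens equal to the low-resolution patch count, while still letting high-resolution detail flow into every token.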

Target Users:

Mini-Gemini suits scenarios that require processing text and images together, such as visual question answering, image captioning, and image editing.

Total Visits: 1.0K

Top Region: US (100.00%)

Website Views: 153.7K

Use Cases

Answer relevant questions based on the content of a given image

Generate textual descriptions of images

Edit images and generate new images according to instructions

Features

Low-resolution/High-resolution Dual Vision Encoders

Patch-level Information Mining

Large Language Model-based Text-Image Fusion

Support for Visual Understanding and Generation Tasks
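As an illustration of the text-image fusion feature, the sketch below splices projected visual tokens into a sequence of text embeddings at an image placeholder position, in the style commonly used by LLaVA-like models. The placeholder id, the function name, and the single linear projector are hypothetical choices for this example, not the project's actual API.

```python
import numpy as np

IMAGE_PLACEHOLDER = -1  # hypothetical id marking where the image sits in the prompt

def fuse_text_and_image(token_ids, embed_table, visual_tokens, projector):
    """Build one embedding sequence for the language model.

    token_ids:     list of ints, with IMAGE_PLACEHOLDER where the image goes
    embed_table:   (vocab, D) text embedding matrix
    visual_tokens: (N, C) visual features from the vision side
    projector:     (C, D) linear map into the text embedding space
                   (assumed linear here for brevity)
    """
    projected = visual_tokens @ projector  # (N, D)
    pieces = []
    for tid in token_ids:
        if tid == IMAGE_PLACEHOLDER:
            pieces.append(projected)                  # insert all visual tokens
        else:
            pieces.append(embed_table[tid][None, :])  # one text token embedding
    return np.concatenate(pieces, axis=0)

rng = np.random.default_rng(0)
embed_table = rng.normal(size=(5, 3))
visual_tokens = rng.normal(size=(4, 2))
projector = rng.normal(size=(2, 3))
sequence = fuse_text_and_image([0, IMAGE_PLACEHOLDER, 2], embed_table, visual_tokens, projector)
print(sequence.shape)
```

Because the visual tokens end up in the same embedding space as the text, the language model can process both with its ordinary attention layers.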