LLaVA
Overview:
LLaVA is a novel end-to-end trained large multimodal model that connects a vision encoder with the Vicuna language model. It delivers impressive chat capabilities in the spirit of multimodal GPT-4 and sets a new state-of-the-art accuracy on science question answering. Its use cases include multimodal chat in everyday user applications and multimodal reasoning in the scientific domain. LLaVA's data, code, and checkpoints are intended for research use only and are subject to the licenses of CLIP, LLaMA, Vicuna, and GPT-4.
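The core architectural idea is a lightweight connector that maps visual features from the vision encoder into the word-embedding space of the language model. The sketch below illustrates this in PyTorch under assumed dimensions (CLIP ViT-L/14 patch features of width 1024, a Vicuna-7B embedding width of 4096); the class and variable names are illustrative, not the project's own code.

```python
import torch
import torch.nn as nn

class VisionLanguageProjector(nn.Module):
    """Illustrative connector: map frozen vision-encoder patch features
    into the language model's embedding space (LLaVA trains a projection
    for this purpose; the exact layer shape here is an assumption)."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        return self.proj(patch_features)  # (batch, num_patches, llm_dim)

# The projected "visual tokens" are concatenated with the text token
# embeddings and consumed by the language model as a single sequence, e.g.:
#   inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
```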
Target Users:
LLaVA is suited to scenarios that require multimodal chat and scientific question answering, such as everyday user applications and scientific reasoning.
Total Visits: 81.0K
Top Region: US (22.84%)
Website Views: 176.9K
Use Cases
LLaVA can answer questions about the Mona Lisa, including the artist, characteristics of the painting, and its location.
LLaVA can perform optical character recognition (OCR) and provide detailed descriptions of the recognized results.
LLaVA can perform visual reasoning, such as on the two challenging examples from the OpenAI GPT-4 technical report (see the inference sketch after this list).
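A minimal inference sketch of the multimodal-chat use case, assuming the community llava-hf/llava-1.5-7b-hf checkpoint packaged for the Hugging Face transformers integration (the official repository ships its own CLI and weights); the image path and prompt are placeholders:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed community checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# LLaVA-1.5 conversation format: the <image> placeholder marks where
# the visual tokens are inserted into the prompt.
prompt = "USER: <image>\nWho painted this, and what is notable about it?\nASSISTANT:"
image = Image.open("painting.jpg")  # placeholder path

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output[0], skip_special_tokens=True))
```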
Features
Combines a vision encoder and Vicuna to achieve multimodal chat and scientific question answering
Uses language-only GPT-4 to generate multimodal language-image instruction-following data
Trains in two stages: feature-alignment pre-training followed by end-to-end visual instruction tuning (sketched after this list)
Demonstrates impressive performance in visual chat and scientific question answering
Provides open-source data, code, and checkpoints
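To make the two-stage recipe concrete, here is a hedged sketch of how the parameter groups are typically frozen or trained in each stage; the names vision_encoder, projector, and llm are placeholders, and the learning rates are illustrative rather than taken from the official training script.

```python
import torch.nn as nn

def configure_stage(vision_encoder: nn.Module, projector: nn.Module,
                    llm: nn.Module, stage: int) -> list:
    """Illustrative two-stage setup (assumed grouping, not the official script).

    Stage 1 (feature-alignment pre-training): only the projector is trained.
    Stage 2 (visual instruction tuning): projector and LLM are trained;
    the vision encoder stays frozen in both stages.
    """
    for p in vision_encoder.parameters():
        p.requires_grad = False
    for p in projector.parameters():
        p.requires_grad = True
    for p in llm.parameters():
        p.requires_grad = (stage == 2)
    return [p for module in (projector, llm)
            for p in module.parameters() if p.requires_grad]

# Usage sketch:
#   trainable = configure_stage(vision_encoder, projector, llm, stage=1)
#   optimizer = torch.optim.AdamW(trainable, lr=1e-3)  # illustrative value
```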