Cola
Overview:
Cola is a method that uses a language model (LM) to aggregate the outputs of two or more vision-language models (VLMs). The model-assembly method is called Cola (COordinative LAnguage model for visual reasoning). Cola performs best when the LM is fine-tuned on the aggregation task (Cola-FT), and it is also effective in zero-shot or few-shot in-context learning (Cola-Zero). Beyond improving accuracy, Cola is more robust to VLM errors. The authors demonstrate that Cola can be applied to various VLMs (including large multimodal models such as InstructBLIP) and seven datasets (VQA v2, OK-VQA, A-OKVQA, e-SNLI-VE, VSR, CLEVR, GQA), and that it consistently improves performance.
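The coordination step can be pictured as simple prompt construction: each VLM proposes an answer, and the LM reads all the proposals and decides. The sketch below is a minimal illustration of that pattern, not the published Cola implementation; `vlms` and `lm` are hypothetical callables standing in for real models (e.g. VLMs such as InstructBLIP and an instruction-tuned LM).

```python
def cola_zero(image, question, vlms, lm):
    """Coordinate several VLMs with a language model (zero-shot sketch)."""
    # 1. Query each VLM independently for a candidate answer.
    candidates = [vlm(image, question) for vlm in vlms]

    # 2. Lay the candidates out in a plain-text prompt so the LM can
    #    weigh them against one another.
    lines = [f"Question: {question}"]
    lines += [f"VLM {i + 1} answer: {ans}" for i, ans in enumerate(candidates)]
    lines.append("Considering the candidate answers above, the best answer is:")
    prompt = "\n".join(lines)

    # 3. The LM acts as the coordinator and returns the final answer.
    return lm(prompt)
```

Cola-FT fine-tunes the LM on prompts of this shape, while Cola-Zero uses the LM as-is. Because the LM sees every candidate, a single erroneous VLM output can be outvoted, which is the source of the robustness noted above.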
Target Users:
Suited to a range of vision-language tasks, such as visual question answering and image description.
Total Visits: 474.6M
Top Region: US (19.34%)
Website Views: 54.4K
Use Cases
Performing visual question answering using Cola-Zero (see the sketch after this list)
Performing image description using Cola-FT
Using Cola to improve VLM performance
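For the visual question answering use case, Cola-Zero can also be run few-shot by prepending solved examples to the same prompt format. The sketch below extends the hypothetical `cola_zero` pattern from the overview; `exemplars` is an assumed list of (question, candidate answers, final answer) triples and is not part of any published Cola API.

```python
def cola_few_shot(image, question, vlms, lm, exemplars):
    """Few-shot variant: prepend worked examples to the coordination prompt."""
    # Format each exemplar exactly like the live query, ending with its answer.
    demos = []
    for ex_question, ex_candidates, ex_answer in exemplars:
        block = [f"Question: {ex_question}"]
        block += [f"VLM {i + 1} answer: {a}" for i, a in enumerate(ex_candidates)]
        block.append(f"Best answer: {ex_answer}")
        demos.append("\n".join(block))

    # Current query, in the same format as the demonstrations.
    candidates = [vlm(image, question) for vlm in vlms]
    query = [f"Question: {question}"]
    query += [f"VLM {i + 1} answer: {a}" for i, a in enumerate(candidates)]
    query.append("Best answer:")

    prompt = "\n\n".join(demos + ["\n".join(query)])
    return lm(prompt)
```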
Features
Aggregates the outputs of multiple vision-language models using a language model
Supports LM fine-tuning (Cola-FT) as well as zero-shot and few-shot in-context learning (Cola-Zero)
Improves performance and enhances robustness to VLM errors