Cola
Overview:
Cola is a method that uses a language model (LM) to aggregate the outputs of two or more vision-language models (VLMs). The model-assembly method is called Cola (COordinative LAnguage model for visual reasoning). Cola performs best when the LM is fine-tuned on the aggregation task (Cola-FT), and it is also effective in zero-shot or few-shot in-context learning (Cola-Zero). Beyond improving accuracy, Cola is more robust to VLM errors. The authors demonstrate that Cola can be applied to various VLMs (including large multimodal models such as InstructBLIP) and seven datasets (VQA v2, OK-VQA, A-OKVQA, e-SNLI-VE, VSR, CLEVR, GQA), and that it consistently improves performance.
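The coordination step can be pictured as simple prompt construction: each VLM proposes an answer, and the LM reads all the proposals and decides. The sketch below is a minimal illustration of that pattern, not the published Cola implementation; `vlms` and `lm` are hypothetical callables standing in for real models (e.g. VLMs such as InstructBLIP and an instruction-tuned LM).

```python
def cola_zero(image, question, vlms, lm):
    """Coordinate several VLMs with a language model (zero-shot sketch)."""
    # 1. Query each VLM independently for a candidate answer.
    candidates = [vlm(image, question) for vlm in vlms]

    # 2. Lay the candidates out in a plain-text prompt so the LM can
    #    weigh them against one another.
    lines = [f"Question: {question}"]
    lines += [f"VLM {i + 1} answer: {ans}" for i, ans in enumerate(candidates)]
    lines.append("Considering the candidate answers above, the best answer is:")
    prompt = "\n".join(lines)

    # 3. The LM acts as the coordinator and returns the final answer.
    return lm(prompt)
```

Cola-FT fine-tunes the LM on prompts of this shape, while Cola-Zero uses the LM as-is. Because the LM sees every candidate, a single erroneous VLM output can be outvoted, which is the source of the robustness noted above.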
Target Users:
Suited to a range of vision-language tasks, such as visual question answering and image description.
Total Visits: 474.6M
Top Region: US (19.34%)
Website Views: 54.4K
Use Cases
Performing visual question answering using Cola-Zero (see the sketch after this list)
Performing image description using Cola-FT
Using Cola to improve VLM performance
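For the visual question answering use case, Cola-Zero can also be run few-shot by prepending solved examples to the same prompt format. The sketch below extends the hypothetical `cola_zero` pattern from the overview; `exemplars` is an assumed list of (question, candidate answers, final answer) triples and is not part of any published Cola API.

```python
def cola_few_shot(image, question, vlms, lm, exemplars):
    """Few-shot variant: prepend worked examples to the coordination prompt."""
    # Format each exemplar exactly like the live query, ending with its answer.
    demos = []
    for ex_question, ex_candidates, ex_answer in exemplars:
        block = [f"Question: {ex_question}"]
        block += [f"VLM {i + 1} answer: {a}" for i, a in enumerate(ex_candidates)]
        block.append(f"Best answer: {ex_answer}")
        demos.append("\n".join(block))

    # Current query, in the same format as the demonstrations.
    candidates = [vlm(image, question) for vlm in vlms]
    query = [f"Question: {question}"]
    query += [f"VLM {i + 1} answer: {a}" for i, a in enumerate(candidates)]
    query.append("Best answer:")

    prompt = "\n\n".join(demos + ["\n".join(query)])
    return lm(prompt)
```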
Features
Aggregates the outputs of multiple vision-language models using a language model
Supports LM fine-tuning (Cola-FT) as well as zero-shot and few-shot in-context learning (Cola-Zero)
Improves performance and enhances robustness to VLM errors