

Video-MME
Overview
Video-MME is a benchmark for evaluating the performance of Multi-Modal Large Language Models (MLLMs) in video analysis. It fills a gap in existing evaluation methods, which rarely test how well MLLMs process continuous visual data, and provides researchers with a high-quality, comprehensive evaluation platform. The benchmark covers videos of different lengths and evaluates core MLLM capabilities.
Target Users
Video-MME is designed for researchers and developers in artificial intelligence, particularly those specializing in video understanding and multi-modal interaction. It gives these users a standardized testing platform on which to evaluate and improve their MLLMs.
Use Cases
Accuracy scores of Gemini 1.5 Pro across different video lengths and subcategories
Performance comparison of GPT-4o and GPT-4V on video analysis tasks
Scores of the LLaVA-NeXT-Video model across different video tasks
Features
Provides accuracy scores for short, medium, and long videos (see the sketch after this list)
Includes 6 main categories and 30 subcategories of video types
Comprehensively covers a range of video lengths and task types
Uses newly collected, manually annotated data rather than samples drawn from existing video datasets
Provides statistics on the video category hierarchy, video durations, and task type distribution
Supports comparison with other benchmarks, highlighting the unique advantages of Video-MME
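As a rough illustration of how such per-length and per-subcategory accuracy scores can be aggregated, the sketch below groups per-question correctness by duration split and by subcategory. The record fields (duration, subcategory, correct) and the input format are assumptions made for this example, not the official Video-MME result schema.

```python
from collections import defaultdict

def accuracy_breakdown(results):
    """Aggregate per-question results into accuracy by duration split and subcategory.

    `results` is assumed to be a list of dicts such as
    {"duration": "short", "subcategory": "Sports", "correct": True};
    the real Video-MME result format may differ.
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [num_correct, num_total]
    for r in results:
        for key in (r["duration"], r["subcategory"]):
            totals[key][0] += int(r["correct"])
            totals[key][1] += 1
    return {k: correct / total for k, (correct, total) in totals.items()}

# Example with two answered questions from different splits
example = [
    {"duration": "short", "subcategory": "Sports", "correct": True},
    {"duration": "long", "subcategory": "Documentary", "correct": False},
]
print(accuracy_breakdown(example))
# {'short': 1.0, 'Sports': 1.0, 'long': 0.0, 'Documentary': 0.0}
```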
How to Use
Visit the official Video-MME website
Understand the evaluation standards for different video lengths and task types
Select the desired MLLM model for performance testing
Submit the model and obtain performance results across the different video subcategories
Analyze the results and compare them with other models or benchmarks (see the comparison sketch below)
Use the evaluation results to optimize and improve your MLLM
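To make the comparison step concrete, here is a minimal sketch that prints two models' per-split accuracy scores side by side. The model names and score values are placeholders for illustration, not published Video-MME results.

```python
def compare_models(scores_a, scores_b, name_a="Model A", name_b="Model B"):
    """Print a side-by-side comparison of per-split accuracy scores.

    Both inputs are dicts mapping a split name (e.g. "short", "medium",
    "long") to an accuracy in [0, 1]; all numbers here are placeholders.
    """
    print(f"{'split':<10}{name_a:>10}{name_b:>10}{'delta':>10}")
    for split in sorted(set(scores_a) | set(scores_b)):
        a = scores_a.get(split, float("nan"))
        b = scores_b.get(split, float("nan"))
        print(f"{split:<10}{a:>10.3f}{b:>10.3f}{a - b:>10.3f}")

# Placeholder scores purely for illustration
compare_models(
    {"short": 0.80, "medium": 0.72, "long": 0.65},
    {"short": 0.75, "medium": 0.70, "long": 0.60},
    name_a="model_1",
    name_b="model_2",
)
```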