InternViT-300M-448px-V2_5
Overview
InternViT-300M-448px-V2_5 is an enhanced version of InternViT-300M-448px that uses ViT incremental learning with next-token prediction (NTP) loss (Stage 1.5) to strengthen the visual encoder's feature extraction. It is particularly effective in domains underrepresented in large-scale web datasets, such as multilingual OCR data and mathematical charts. The model is part of the InternViT 2.5 series and retains the same 'ViT-MLP-LLM' architecture as its predecessors, combining the incrementally pre-trained InternViT with various pre-trained LLMs, such as InternLM 2.5 and Qwen 2.5, through randomly initialized MLP projectors.
Target Users
The target audience includes researchers and developers, particularly those who need high-performance vision models for tasks such as image recognition, multilingual OCR, and mathematical chart analysis. The enhanced visual encoder gives them a robust tool for handling and interpreting complex visual data.
Use Cases
- Use InternViT-300M-448px-V2_5 for image classification tasks to identify and categorize image contents.
- Apply the model to multilingual OCR data to improve the accuracy and efficiency of text recognition.
- Leverage the model to analyze mathematical charts, extracting key visual and structural information to support education and research.
Features
- Visual Feature Extraction: Enhances the model's capability in extracting visual features, particularly in underrepresented domains within large-scale datasets.
- Incremental Learning with NTP Loss: Improves the model's ability to handle data from rare domains through ViT incremental learning and NTP loss.
- Model Architecture: Maintains the same 'ViT-MLP-LLM' architecture as predecessors to ensure consistency and performance.
- Multimodal Data Support: Introduces support for multiple images and video data, expanding the model's application range.
- Dynamic High-Resolution Training: Enhances the model's capability to process multiple images and video datasets through dynamic high-resolution training methods.
- Cross-Modal Alignment: Ensures stability and robustness of the model during multimodal training.
- Multi-Stage Training: Includes MLP warm-up, ViT incremental learning, and full model instruction tuning to comprehensively improve model performance.
How to Use
1. Import necessary libraries, such as torch and transformers.
2. Load the InternViT-300M-448px-V2_5 model from the Hugging Face model hub.
3. Use the PIL library to open and convert the image to RGB format.
4. Load a CLIPImageProcessor from transformers to preprocess the image.
5. Use the image_processor to process the image and obtain pixel values.
6. Convert pixel values to the data type required by the model and transfer them to GPU.
7. Input the processed pixel values into the model to get the output.
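The steps above can be sketched in Python. This is a minimal example, assuming the model is published on the Hugging Face hub under the repo id `OpenGVLab/InternViT-300M-448px-V2_5` and that a CUDA GPU is available; the function name `extract_features` and the image path are illustrative.

```python
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

# Assumed Hugging Face repo id for this model.
MODEL_ID = "OpenGVLab/InternViT-300M-448px-V2_5"

def extract_features(image_path: str):
    """Run the visual encoder on a single image and return its outputs."""
    # Step 2: load the model; trust_remote_code is needed for the custom ViT code.
    model = AutoModel.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True
    ).cuda().eval()

    # Step 3: open the image and convert it to RGB.
    image = Image.open(image_path).convert("RGB")

    # Steps 4-5: preprocess the image into pixel values.
    processor = CLIPImageProcessor.from_pretrained(MODEL_ID)
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    # Step 6: match the model's dtype and move the tensor to the GPU.
    pixel_values = pixel_values.to(torch.bfloat16).cuda()

    # Step 7: forward pass through the visual encoder.
    with torch.no_grad():
        outputs = model(pixel_values)
    return outputs

if __name__ == "__main__":
    feats = extract_features("example.jpg")
    print(feats.last_hidden_state.shape)
```

Loading in `bfloat16` roughly halves memory use relative to `float32`; drop the `.cuda()` calls to run on CPU at the cost of speed.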