

ViTLP
Overview
ViTLP is a visually guided generative text-layout pre-trained model for intelligent document processing. It combines OCR text localization and recognition, so it can both detect and read text on document images. The released checkpoint, ViTLP-medium (380M parameters), was sized to fit the available computational resources and pre-training data, trading off accuracy against inference speed and memory usage. Inference typically takes 5 to 10 seconds per page on an Nvidia 4090, which is competitive with most OCR engines.
Target Users
ViTLP is aimed at businesses and research institutions that process document images, particularly for automated document handling and archive digitization. Its inference speed and accuracy make it a good fit for these scenarios.
Use Cases
Example 1: Digitizing historical texts by automatically extracting the text from scanned documents.
Example 2: In the legal field, automating the processing of large volumes of case documents and extracting information from them.
Example 3: In the financial sector, analyzing contract documents to extract key terms.
Features
- Native OCR text localization and recognition: ViTLP locates and recognizes text directly on document images.
- Pre-trained model ViTLP-medium: A 380M-parameter checkpoint that performs well on limited computational resources.
- Fast inference: On an Nvidia 4090, ViTLP processes one page in 5 to 10 seconds.
- Available on Hugging Face: The pre-trained ViTLP weights are hosted on Hugging Face for easy download.
- Easy to integrate: The provided code and instructions let users integrate ViTLP into their own projects.
- Batch decoding: The provided `decode.sh` script runs batch decoding over document images.
- Suited to intelligent document processing: ViTLP is a good fit for tasks that require text detection and recognition in document images, such as automated document processing and archive digitization.
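To put the 5-to-10-second figure in context, the sketch below estimates how long a batch of pages might take. It assumes pages are processed one after another on a single GPU and that the per-page range holds for every page; both are simplifying assumptions, and the function names are illustrative rather than part of ViTLP.

```python
def estimate_batch_seconds(pages, sec_per_page=(5.0, 10.0)):
    """Return (best_case, worst_case) total seconds for `pages` pages.

    Assumes strictly sequential processing at the per-page range
    quoted above (5 to 10 seconds on an Nvidia 4090).
    """
    if pages < 0:
        raise ValueError("pages must be non-negative")
    low, high = sec_per_page
    return pages * low, pages * high


def format_duration(seconds):
    """Format a number of seconds as 'Hh MMm SSs'."""
    hours, rem = divmod(int(round(seconds)), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


if __name__ == "__main__":
    best, worst = estimate_batch_seconds(1000)
    print(f"1000 pages: {format_duration(best)} to {format_duration(worst)}")
```

Under these assumptions a 1,000-page archive would take roughly one and a half to three hours, which can help decide whether a single GPU is enough for a digitization job.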
How to Use
1. Visit the ViTLP GitHub page and clone the project to your local machine.
2. Install the required dependencies by running `pip install -r requirements.txt`.
3. Clone the pre-trained ViTLP model weights to the specified directory using `git clone https://huggingface.co/veason/ViTLP-medium ckpts/ViTLP-medium`.
4. Run the demo by executing `python ocr.py` and upload a document image for testing.
5. Review `decode.py` for the detailed inference code, and run batch decoding with `bash decode.sh`.
6. For fine-tuning ViTLP, refer to the guidelines in the `./finetuning` directory.
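Steps 1 to 3 above can be collected into a small helper that prints the commands and, optionally, runs them. This is a minimal sketch: the repository URL is not given here, so `REPO_URL` is a placeholder to fill in, and the `ViTLP` working directory name and the choice to clone the weights from inside the repository root are assumptions. The helper defaults to a dry run so nothing is executed by accident.

```python
import shlex
import subprocess

# Placeholder: the repository URL is not stated above; supply it yourself.
REPO_URL = "<ViTLP GitHub repository URL>"
# Weights location as given in step 3.
WEIGHTS_URL = "https://huggingface.co/veason/ViTLP-medium"


def setup_commands(repo_url, workdir="ViTLP"):
    """Return (command, cwd) pairs mirroring steps 1 to 3.

    `workdir` is a hypothetical local directory name for the clone.
    """
    return [
        (["git", "clone", repo_url, workdir], "."),
        (["pip", "install", "-r", "requirements.txt"], workdir),
        (["git", "clone", WEIGHTS_URL, "ckpts/ViTLP-medium"], workdir),
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    printed = []
    for cmd, cwd in commands:
        line = f"(cd {cwd} && {shlex.join(cmd)})"
        print(line)
        printed.append(line)
        if not dry_run:
            subprocess.run(cmd, cwd=cwd, check=True)
    return printed


if __name__ == "__main__":
    run_all(setup_commands(REPO_URL))
```

Once the printed commands look right, call `run_all(..., dry_run=False)`, then continue with `python ocr.py` or `bash decode.sh` from the repository directory. Hugging Face model repositories usually store weights with Git LFS, so installing `git-lfs` beforehand may be necessary for the clone in step 3 to fetch the actual weight files.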