ReaderLM v2
Overview
ReaderLM v2, introduced by Jina AI, is a small language model with 1.5 billion parameters, designed for converting HTML to Markdown and HTML to JSON with high accuracy. The model supports 29 languages and handles up to 512,000 tokens of combined input and output. It uses a new training paradigm and higher-quality training data, which improve on its predecessor in long-text handling and in Markdown generation, including complex elements such as code fences, nested lists, tables, and LaTeX equations. ReaderLM v2 can also generate JSON directly from HTML: given a JSON schema, it extracts the specified information from raw HTML without an intermediate Markdown conversion.
Target Users
The target audience includes developers, content creators, data analysts, and researchers who need to convert web content into Markdown or extract structured data from web pages. Developers can use ReaderLM v2 to quickly convert web content into formats suitable for further processing. Content creators can organize web content into Markdown for sharing or archiving. Data analysts and researchers can use its HTML to JSON functionality to extract key information from web pages for analysis and research.
Total Visits: 539.8K
Top Region: CN(18.57%)
Website Views: 56.9K
Use Cases
A developer uses ReaderLM v2 to convert collected web news into Markdown format for sharing on a tech blog.
A corporate data analyst utilizes its HTML to JSON function to extract product information from web pages for a market analysis report.
Researchers extract paper information from academic websites using the model, storing it in JSON format for subsequent data organization.
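The product-extraction use case above could start from a schema like the one sketched below. The field names (`name`, `price`, `currency`) and the `missing_required` helper are hypothetical and chosen only for illustration; the check covers required keys only and is not a full JSON Schema validator.

```python
import json

# Hypothetical schema for the product-extraction use case; field names are illustrative.
PRODUCT_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "currency": {"type": "string"},
    },
    "required": ["name", "price"],
}


def missing_required(model_output: str, schema: dict) -> list:
    """Return the required keys that are absent from the model's JSON output."""
    data = json.loads(model_output)  # raises json.JSONDecodeError on invalid JSON
    return [key for key in schema.get("required", []) if key not in data]


# A hand-written string stands in for real model output here.
sample_output = '{"name": "Desk lamp", "price": 24.5}'
print(missing_required(sample_output, PRODUCT_SCHEMA))  # prints []
```

A check like this can run before extracted records are stored, so incomplete outputs are caught early.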
Features
Converts HTML to Markdown while preserving the source information and making full use of Markdown syntax to structure the content.
Processes up to 512,000 tokens of combined input and output, addressing the degradation issues that arise when handling long text.
Generates JSON directly from HTML according to a user-defined JSON schema, making data cleaning and extraction more efficient.
Supports 29 languages, including English, Chinese, and Japanese, making it widely applicable.
Outperforms several larger models on quantitative and qualitative benchmarks despite having significantly fewer parameters.
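Because the 512,000-token limit covers input and output together, a long HTML page leaves less room for generated text. A minimal sketch of that arithmetic follows; `max_output_tokens` is a hypothetical helper, and the input token count is assumed to come from the model's tokenizer.

```python
MAX_COMBINED_TOKENS = 512_000  # combined input/output limit stated in the feature list


def max_output_tokens(input_tokens: int, limit: int = MAX_COMBINED_TOKENS) -> int:
    """Return how many tokens remain for generation once the input is counted."""
    if input_tokens < 0:
        raise ValueError("input_tokens must be non-negative")
    return max(limit - input_tokens, 0)


print(max_output_tokens(500_000))  # prints 12000
```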
How to Use
1. Use via Reader API: Specify `x-engine: readerlm-v2` in the request headers and enable streaming responses with `-H 'Accept: text/event-stream'`.
2. Use on Google Colab: Run HTML to Markdown conversion, JSON extraction, and instruction-following tests in a Colab notebook.
3. Use in production: Deploy ReaderLM v2 through AWS SageMaker, Azure, or the GCP Marketplace.
4. For HTML to Markdown conversion, use the `create_prompt` helper function to create prompts, then call the model to generate results.
5. For HTML to JSON extraction, first define a JSON schema, then create a prompt that includes it and call the model to generate JSON output.
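Steps 1, 4, and 5 can be sketched roughly as follows. This is an illustration under stated assumptions, not the official client or the official `create_prompt` helper: the `https://r.jina.ai/` base URL, the `Bearer` authorization scheme, the function names, and the instruction wording are all assumptions. Only the `x-engine` and `Accept` header values come from the steps above. The request is prepared but not sent.

```python
import json
import urllib.request
from typing import Optional

READER_ENDPOINT = "https://r.jina.ai/"  # assumed Reader API base URL, not given above


def build_reader_request(page_url: str, api_key: Optional[str] = None) -> urllib.request.Request:
    """Prepare (but do not send) a Reader API request that selects ReaderLM v2."""
    headers = {
        "x-engine": "readerlm-v2",      # from step 1
        "Accept": "text/event-stream",  # from step 1: streaming responses
    }
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"  # assumed auth scheme
    return urllib.request.Request(READER_ENDPOINT + page_url, headers=headers)


def build_prompt_text(html: str, schema: Optional[dict] = None) -> str:
    """Assemble a user message: instruction, fenced HTML, and optionally a JSON schema.

    Hypothetical stand-in for the `create_prompt` helper; the wording is illustrative.
    """
    fence = "`" * 3  # built at runtime so this example contains no nested fences
    if schema is None:
        instruction = "Extract the main content from the given HTML and convert it to Markdown format."
        return f"{instruction}\n{fence}html\n{html}\n{fence}"
    instruction = "Extract the information described by the JSON schema from the given HTML."
    schema_text = json.dumps(schema, indent=2)
    return (
        f"{instruction}\n{fence}html\n{html}\n{fence}\n"
        f"The JSON schema is as follows:\n{fence}json\n{schema_text}\n{fence}"
    )


request = build_reader_request("https://example.com")
print(request.full_url)
print(build_prompt_text("<h1>Hello</h1>"))
```

To run generation locally, the prompt text would typically be wrapped as a single user message and passed through the tokenizer's chat template before calling the model; those calls depend on the serving stack and are left out here.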