RULER
Overview:
RULER is a synthetic benchmark that provides a more comprehensive evaluation of long-context language models. It extends the standard retrieval test with variations in the type and quantity of information to be retrieved. RULER also introduces new task categories, such as multi-hop tracking and aggregation, to test behaviors beyond retrieving from context. Ten long-context language models were evaluated on 13 representative RULER tasks. Despite achieving near-perfect accuracy on the standard retrieval test, these models showed large performance drops as context length increased. Only four models (GPT-4, Command-R, Yi-34B, and Mixtral) performed reasonably well at a context length of 32K tokens. We make RULER publicly available to promote comprehensive evaluation of long-context language models.
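To make the idea of a retrieval test with a variable number of information points concrete, here is a minimal Python sketch. It is not RULER's actual code: the function names, the filler text, and the sentence template are all assumptions chosen for illustration. It hides several key-value facts in repetitive filler text and scores a model response by the fraction of hidden values it reproduces.

```python
import random

# Hypothetical filler text; a real benchmark would use much longer contexts.
FILLER = "The grass is green. The sky is blue. The sun is yellow."


def build_retrieval_example(num_needles, num_filler, seed=0):
    """Hide `num_needles` key-value facts among `num_filler` filler chunks."""
    rng = random.Random(seed)
    # Each information point is a made-up key paired with a random 7-digit number.
    needles = {
        f"key-{i}": str(rng.randint(1000000, 9999999)) for i in range(num_needles)
    }
    chunks = [FILLER] * num_filler
    for key, value in needles.items():
        # Insert each fact at a random position in the context.
        position = rng.randint(0, len(chunks))
        chunks.insert(position, f"The special magic number for {key} is {value}.")
    context = " ".join(chunks)
    question = "What are the special magic numbers for " + ", ".join(needles) + "?"
    return context, question, needles


def score_response(response, needles):
    """Return the fraction of hidden values that appear in the response."""
    if not needles:
        return 0.0
    found = sum(1 for value in needles.values() if value in response)
    return found / len(needles)


if __name__ == "__main__":
    context, question, needles = build_retrieval_example(num_needles=4, num_filler=50)
    print(len(context.split()), "words of context")
    print(question)
```

Raising `num_filler` lengthens the context and raising `num_needles` increases the quantity of information to retrieve, which are the two dimensions the overview describes.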
Target Users:
Education, Research
Total Visits: 29.7M
Top Region: US (17.94%)
Website Views: 65.4K
Use Cases
Finding information in long text
Tracking information across multiple hops in long text
Aggregating information across long text
Features
Long-context language model testing
Multi-hop tracking
Aggregation