DCLM
Overview
DataComp-LM (DCLM) is a comprehensive framework for building and training large language models (LLMs). It provides standardized corpora, efficient pre-training recipes based on the open_lm framework, and a suite of over 50 evaluations. DCLM lets researchers experiment with different dataset construction strategies at computational scales ranging from 411M- to 7B-parameter models. Through optimized dataset design, DCLM has already facilitated the creation of multiple high-quality datasets that outperform other open datasets across scales.
Target Users
DCLM is designed for researchers and developers building and training large language models, particularly those seeking to improve model performance through optimized dataset design. It suits workflows that process large volumes of data and experiment with dataset construction strategies across computational scales.
Use Cases
Researchers have used DCLM to create the DCLM-BASELINE dataset and train models on it, demonstrating performance competitive with models trained on closed-source data and superior to those trained on other open datasets (see the sketch after this list).
DCLM supports training models at multiple scales, such as 400M-1x and 7B-2x, to accommodate different computational budgets.
Community members demonstrate model performance across datasets and scales by submitting models to DCLM's leaderboard.
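To make the DCLM-BASELINE use case concrete, here is a minimal sketch of inspecting a few documents from the released dataset via the Hugging Face hub. The repo id "mlfoundations/dclm-baseline-1.0" and the "text" field are assumptions about how the data is published; adjust them to the actual release you want to sample.

```python
# Minimal sketch: stream a few DCLM-Baseline documents from the Hugging Face hub.
# Assumes the dataset is published as "mlfoundations/dclm-baseline-1.0" with a
# "text" field per record; both are assumptions, verify against the release.
from itertools import islice

from datasets import load_dataset

# Stream instead of downloading: the corpus is far too large to pull locally.
ds = load_dataset("mlfoundations/dclm-baseline-1.0", split="train", streaming=True)

for doc in islice(ds, 3):
    print(doc["text"][:200])  # print the first 200 characters of each document
```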
Features
Provides a standardized corpus of over 300T unfiltered tokens from CommonCrawl
Provides efficient pre-training recipes based on the open_lm framework
Offers over 50 evaluation tasks for assessing model performance
Supports computational scales from 411M- to 7B-parameter models
Enables researchers to experiment with different dataset construction strategies
Improves model performance through optimized dataset design
How to Use
Clone the DCLM repository locally
Install the required dependencies
Set up AWS storage and a Ray distributed processing environment
Choose a raw data source and create a reference JSON (see the first sketch after this list)
Define the data processing steps in a pipeline configuration file
Set up a Ray cluster and run the data processing scripts (second sketch below)
Tokenize and shuffle the processed data
Run the model training scripts on the tokenized dataset (third sketch below)
Evaluate the trained models and submit results to DCLM's leaderboard (fourth sketch below)
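Steps 4 and 5 revolve around two small config files. The sketch below writes a hypothetical reference JSON and pipeline config; every field name (name, dataset_url, steps, func, params) is an assumption made for illustration, so consult the example configs in the DCLM repository for the real schemas.

```python
# Hedged sketch of steps 4-5: register a raw source and define a pipeline.
# All field names below are illustrative assumptions, not DCLM's real schema.
import json

import yaml  # pip install pyyaml

# Hypothetical reference JSON pointing at untokenized data in S3.
source_ref = {
    "name": "my_cc_subset",
    "dataset_url": "s3://my-bucket/raw/my_cc_subset/",  # assumed S3 layout
}
with open("my_cc_subset.json", "w") as f:
    json.dump(source_ref, f, indent=2)

# Hypothetical pipeline: an English-language filter followed by deduplication.
pipeline = {
    "steps": [
        {"func": "language_filter", "params": {"lang": "en", "threshold": 0.8}},
        {"func": "dedup_filter"},
    ]
}
with open("pipeline.yaml", "w") as f:
    yaml.safe_dump(pipeline, f, sort_keys=False)
```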
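Steps 6 and 7 run on the Ray cluster. The following sketch invokes the processing and tokenize/shuffle entry points via subprocess; the script paths (ray_processing/process.py, ray_processing/tokenize_shuffle.py) and flags are assumptions based on the described workflow, so verify them against the repository before use.

```python
# Hedged sketch of steps 6-7: process raw data on Ray, then tokenize and shuffle.
# Script paths and flags are assumptions; check the DCLM repo for the real CLI.
import subprocess

# Apply the pipeline config to the registered raw source on the Ray cluster.
subprocess.run(
    [
        "python", "ray_processing/process.py",        # assumed entry point
        "--source_ref_paths", "my_cc_subset.json",
        "--config_path", "pipeline.yaml",
        "--output_dir", "s3://my-bucket/processed/my_cc_subset/",
    ],
    check=True,
)

# Tokenize and shuffle the processed documents for training.
subprocess.run(
    [
        "python", "ray_processing/tokenize_shuffle.py",  # assumed entry point
        "--input", "s3://my-bucket/processed/my_cc_subset/",
        "--output", "s3://my-bucket/tokenized/my_cc_subset/",
    ],
    check=True,
)
```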
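Step 8 launches training on the tokenized data. Because DCLM's training recipes build on open_lm, a launch plausibly goes through torchrun; the module path, scale name ("411m_1x"), and flags in this sketch are assumptions for illustration rather than verified CLI options.

```python
# Hedged sketch of step 8: launch distributed training on the tokenized data.
# Entry point, scale identifier, and flags are assumptions; consult the repo.
import subprocess

subprocess.run(
    [
        "torchrun", "--nproc-per-node", "8",
        "-m", "training.train",                  # assumed training entry point
        "--scale", "411m_1x",                    # assumed scale identifier
        "--data-config", "s3://my-bucket/tokenized/my_cc_subset/manifest.jsonl",
        "--logs", "./logs",
    ],
    check=True,
)
```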
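Step 9 scores a trained checkpoint against the evaluation suite before submitting to the leaderboard. The script path and flags below are assumptions about the evaluation harness's interface; the repository's eval directory documents the real invocation and the 50+ supported tasks.

```python
# Hedged sketch of step 9: evaluate a trained checkpoint.
# Script path and flags are assumptions; verify against the DCLM eval docs.
import subprocess

subprocess.run(
    [
        "python", "eval/eval_openlm_ckpt.py",    # assumed evaluation script
        "--checkpoint", "logs/my_run/checkpoints/epoch_1.pt",
        "--eval-yaml", "eval/light.yaml",        # assumed task-suite config
    ],
    check=True,
)
```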