DCLM
Overview
DataComp-LM (DCLM) is a comprehensive framework for building and training large language models (LLMs). It provides standardized corpora, efficient pre-training recipes based on the open_lm framework, and a suite of over 50 evaluations. DCLM lets researchers experiment with different dataset construction strategies at computational scales ranging from 411M- to 7B-parameter models. Through optimized dataset design, DCLM has already facilitated the creation of multiple high-quality datasets that outperform other open datasets across scales.
Target Users
DCLM is designed for researchers and developers building and training large language models, particularly those seeking to improve model performance through optimized dataset design. It suits workflows that process large volumes of data and experiment with dataset construction strategies across computational scales.
Use Cases
Researchers have used DCLM to create the DCLM-BASELINE dataset and train models on it, demonstrating performance competitive with models trained on closed-source data and superior to those trained on other open datasets (see the sketch after this list).
DCLM supports training models at multiple scales, such as 400M-1x and 7B-2x, to accommodate different computational budgets.
Community members demonstrate model performance across datasets and scales by submitting models to DCLM's leaderboard.
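To make the DCLM-BASELINE use case concrete, here is a minimal sketch of inspecting a few documents from the released dataset via the Hugging Face hub. The repo id "mlfoundations/dclm-baseline-1.0" and the "text" field are assumptions about how the data is published; adjust them to the actual release you want to sample.

```python
# Minimal sketch: stream a few DCLM-Baseline documents from the Hugging Face hub.
# Assumes the dataset is published as "mlfoundations/dclm-baseline-1.0" with a
# "text" field per record; both are assumptions, verify against the release.
from itertools import islice

from datasets import load_dataset

# Stream instead of downloading: the corpus is far too large to pull locally.
ds = load_dataset("mlfoundations/dclm-baseline-1.0", split="train", streaming=True)

for doc in islice(ds, 3):
    print(doc["text"][:200])  # print the first 200 characters of each document
```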
Features
Provides a standardized corpus of over 300T unfiltered tokens from CommonCrawl
Provides efficient pre-training recipes based on the open_lm framework
Offers over 50 evaluation tasks for assessing model performance
Supports computational scales from 411M- to 7B-parameter models
Enables researchers to experiment with different dataset construction strategies
Improves model performance through optimized dataset design
How to Use
Clone the DCLM repository locally
Install the required dependencies
Set up AWS storage and a Ray distributed processing environment
Choose a raw data source and create a reference JSON (see the first sketch after this list)
Define the data processing steps in a pipeline configuration file
Set up a Ray cluster and run the data processing scripts (second sketch below)
Tokenize and shuffle the processed data
Run the model training scripts on the tokenized dataset (third sketch below)
Evaluate the trained models and submit results to DCLM's leaderboard (fourth sketch below)
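Steps 4 and 5 revolve around two small config files. The sketch below writes a hypothetical reference JSON and pipeline config; every field name (name, dataset_url, steps, func, params) is an assumption made for illustration, so consult the example configs in the DCLM repository for the real schemas.

```python
# Hedged sketch of steps 4-5: register a raw source and define a pipeline.
# All field names below are illustrative assumptions, not DCLM's real schema.
import json

import yaml  # pip install pyyaml

# Hypothetical reference JSON pointing at untokenized data in S3.
source_ref = {
    "name": "my_cc_subset",
    "dataset_url": "s3://my-bucket/raw/my_cc_subset/",  # assumed S3 layout
}
with open("my_cc_subset.json", "w") as f:
    json.dump(source_ref, f, indent=2)

# Hypothetical pipeline: an English-language filter followed by deduplication.
pipeline = {
    "steps": [
        {"func": "language_filter", "params": {"lang": "en", "threshold": 0.8}},
        {"func": "dedup_filter"},
    ]
}
with open("pipeline.yaml", "w") as f:
    yaml.safe_dump(pipeline, f, sort_keys=False)
```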
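Steps 6 and 7 run on the Ray cluster. The following sketch invokes the processing and tokenize/shuffle entry points via subprocess; the script paths (ray_processing/process.py, ray_processing/tokenize_shuffle.py) and flags are assumptions based on the described workflow, so verify them against the repository before use.

```python
# Hedged sketch of steps 6-7: process raw data on Ray, then tokenize and shuffle.
# Script paths and flags are assumptions; check the DCLM repo for the real CLI.
import subprocess

# Apply the pipeline config to the registered raw source on the Ray cluster.
subprocess.run(
    [
        "python", "ray_processing/process.py",        # assumed entry point
        "--source_ref_paths", "my_cc_subset.json",
        "--config_path", "pipeline.yaml",
        "--output_dir", "s3://my-bucket/processed/my_cc_subset/",
    ],
    check=True,
)

# Tokenize and shuffle the processed documents for training.
subprocess.run(
    [
        "python", "ray_processing/tokenize_shuffle.py",  # assumed entry point
        "--input", "s3://my-bucket/processed/my_cc_subset/",
        "--output", "s3://my-bucket/tokenized/my_cc_subset/",
    ],
    check=True,
)
```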
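Step 8 launches training on the tokenized data. Because DCLM's training recipes build on open_lm, a launch plausibly goes through torchrun; the module path, scale name ("411m_1x"), and flags in this sketch are assumptions for illustration rather than verified CLI options.

```python
# Hedged sketch of step 8: launch distributed training on the tokenized data.
# Entry point, scale identifier, and flags are assumptions; consult the repo.
import subprocess

subprocess.run(
    [
        "torchrun", "--nproc-per-node", "8",
        "-m", "training.train",                  # assumed training entry point
        "--scale", "411m_1x",                    # assumed scale identifier
        "--data-config", "s3://my-bucket/tokenized/my_cc_subset/manifest.jsonl",
        "--logs", "./logs",
    ],
    check=True,
)
```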
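Step 9 scores a trained checkpoint against the evaluation suite before submitting to the leaderboard. The script path and flags below are assumptions about the evaluation harness's interface; the repository's eval directory documents the real invocation and the 50+ supported tasks.

```python
# Hedged sketch of step 9: evaluate a trained checkpoint.
# Script path and flags are assumptions; verify against the DCLM eval docs.
import subprocess

subprocess.run(
    [
        "python", "eval/eval_openlm_ckpt.py",    # assumed evaluation script
        "--checkpoint", "logs/my_run/checkpoints/epoch_1.pt",
        "--eval-yaml", "eval/light.yaml",        # assumed task-suite config
    ],
    check=True,
)
```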