Nemotron-CC
Overview:
Nemotron-CC is a 6.3-trillion-token English pre-training dataset derived from Common Crawl. It combines model-based classifiers, synthetic data rewriting, and reduced reliance on heuristic filters to transform English Common Crawl into a long-horizon pre-training corpus: 4.4 trillion of the tokens are globally de-duplicated original tokens, and 1.9 trillion are synthetically generated. The dataset strikes a better trade-off between benchmark accuracy and data quantity, making it significant for training large language models.
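For intuition, "globally de-duplicated" means each document is kept at most once across the entire corpus, not just within a shard. The sketch below illustrates the idea with a simple content-hash pass; it is a conceptual illustration only, as the real pipeline operates at corpus scale with more elaborate matching.

```python
import hashlib

def global_dedup(documents):
    """Keep the first occurrence of each exact document text.

    Conceptual illustration of global de-duplication; the actual
    Nemotron-CC pipeline is more sophisticated than exact hashing.
    """
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = ["the cat sat", "a dog ran", "the cat sat"]
print(global_dedup(docs))  # ['the cat sat', 'a dog ran']
```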
Target Users:
The primary audience is professionals engaged in artificial intelligence research and development, particularly scientists and engineers focused on natural language processing and large language model training. Nemotron-CC provides them with a high-quality, large-scale dataset for training more accurate and capable models, advancing natural language processing technology.
Total Visits: 21.5K
Top Region: US (33.87%)
Website Views: 50.5K
Use Cases
Training an 8B-parameter model on the Nemotron-CC dataset improved the MMLU score by 5.6 points over DCLM
An 8B-parameter model trained for 15T tokens on this dataset outperformed the Llama 3.1 8B model across multiple tasks
Researchers can leverage the different quality-level partitions for targeted model training and analysis
Features
Offers a dataset of 6.3 trillion tokens, including both raw and synthetic tokens
Optimizes data quality through classifiers and synthetic data rewriting to enhance model training effectiveness
Supports long-horizon pre-training at multi-trillion-token budgets (e.g., 15T tokens)
The dataset includes partitions of varying quality levels and types to meet diverse needs
Provides data in both JSONL and Parquet formats for convenience across different scenarios (see the reading sketch after this list)
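As a minimal sketch of working with the two formats, the snippet below reads one shard of each kind. The file names and the "quality" column are hypothetical placeholders; the actual schema should be checked against the dataset's documentation.

```python
import gzip
import json

import pandas as pd  # reads Parquet via pyarrow or fastparquet

# Read a gzipped JSONL shard line by line (file name is a hypothetical placeholder).
records = []
with gzip.open("nemotron-cc-shard-00000.jsonl.gz", "rt", encoding="utf-8") as f:
    for line in f:
        records.append(json.loads(line))

# Read a Parquet shard into a DataFrame (file name is a hypothetical placeholder).
df = pd.read_parquet("nemotron-cc-shard-00000.parquet")

# Keep only documents from a high-quality partition; the column name
# "quality" is an assumed placeholder for whatever label the shards carry.
high_quality = df[df["quality"] == "high"]
print(len(records), len(high_quality))
```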
How to Use
1. Visit the official Nemotron-CC website to learn about the dataset and its download options
2. Choose the appropriate data partition and format for download according to research needs
3. Use the downloaded dataset for pre-training language models (a minimal data-loading sketch follows these steps)
4. During the pre-training process, adjust training parameters and strategies based on model performance
5. Fine-tune and apply the pre-trained model for specific tasks
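As a rough sketch of step 3, the loop below turns shard text into fixed-length token sequences of the kind a pre-training loop consumes. The shard file name, the "text" field, and the GPT-2 tokenizer are illustrative assumptions, not the dataset's documented schema or NVIDIA's pipeline, and a real run would use a proper distributed data loader.

```python
import gzip
import json

from transformers import AutoTokenizer

SEQ_LEN = 4096  # target context length; an arbitrary example value

# Any tokenizer works for the sketch; GPT-2's is small and public.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def training_sequences(shard_path):
    """Yield fixed-length token chunks from one gzipped JSONL shard."""
    buffer = []
    with gzip.open(shard_path, "rt", encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            # "text" is an assumed field name for the document body.
            buffer.extend(tokenizer.encode(doc["text"]))
            while len(buffer) >= SEQ_LEN:
                yield buffer[:SEQ_LEN]
                buffer = buffer[SEQ_LEN:]

# Peek at the first couple of chunks from a (hypothetical) shard.
for i, seq in enumerate(training_sequences("nemotron-cc-shard-00000.jsonl.gz")):
    if i >= 2:
        break
    print(len(seq))
```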