

Nemotron-CC
Overview
Nemotron-CC is a 6.3-trillion-token pre-training dataset built from English Common Crawl. It combines classifier ensembling and synthetic data rephrasing with reduced reliance on heuristic filters, making the data suitable for long-horizon pre-training. Of the 6.3 trillion tokens, 4.4 trillion are globally deduplicated original tokens and 1.9 trillion are synthetically generated. The dataset offers a better trade-off between model accuracy and data quantity, which matters when training large language models.
Target Users
The primary target audience consists of professionals engaged in artificial intelligence research and development, particularly scientists and engineers focusing on natural language processing and large language model training. Nemotron-CC provides them with a high-quality, large-scale dataset that helps train more accurate and powerful models, advancing the development of natural language processing technology.
Use Cases
An 8B-parameter model trained on Nemotron-CC scored 5.6 points higher on MMLU than one trained on DCLM
An 8B-parameter model trained for 15T tokens using this dataset outperformed the Llama 3.1 8B model across multiple tasks
Researchers can leverage different quality-level partitions for targeted model training and study
Features
Offers a dataset of 6.3 trillion tokens: 4.4 trillion globally deduplicated original tokens and 1.9 trillion synthetic tokens
Improves data quality through classifier ensembling and synthetic rephrasing rather than heavy heuristic filtering
Supports long-horizon pre-training, where a model is trained over many trillions of tokens
Includes partitions of varying quality levels and types to meet diverse needs
Provides data in both JSONL and Parquet formats
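As a rough illustration of working with the JSONL format, the sketch below streams records one line at a time instead of loading a whole file into memory. The `text` field name and the sample records are assumptions made for the example, not the dataset's documented schema; check the actual files before relying on them. Parquet files would typically need a third-party library such as pyarrow, which is not shown here.

```python
import io
import json
from typing import Iterable, Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSONL stream."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as err:
            raise ValueError(f"line {line_number}: invalid JSON") from err


def count_whitespace_tokens(records: Iterable[dict], text_field: str = "text") -> int:
    """Crude size estimate: whitespace-separated words, not model tokens.

    `text_field` is a hypothetical field name used for illustration.
    """
    return sum(len(record.get(text_field, "").split()) for record in records)


# Tiny in-memory stand-in for a downloaded .jsonl file (made-up content).
sample = io.StringIO('{"text": "a b c"}\n\n{"text": "d e"}\n')
print(count_whitespace_tokens(iter_jsonl(sample)))
```

For a real file, the same functions can be applied to `open(path, encoding="utf-8")`; any decompression step depends on how the downloaded files are packaged.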
How to Use
1. Visit the official Nemotron-CC website to learn about the dataset and its download options
2. Choose the appropriate data partition and format for download according to research needs
3. Use the downloaded dataset for pre-training language models
4. During the pre-training process, adjust training parameters and strategies based on model performance
5. Fine-tune and apply the pre-trained model for specific tasks
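Steps 2 and 3 above can be sketched as a weighted mix over quality partitions. This is a minimal sketch under stated assumptions: the partition names (`high`, `medium`, `low`), the weights, and the in-memory document lists are all hypothetical placeholders, and a real pre-training pipeline would stream tokenized shards rather than Python lists.

```python
import random
from typing import Dict, Iterator, List, Tuple


def sample_mixture(
    partitions: Dict[str, List[str]],
    weights: Dict[str, float],
    n: int,
    seed: int = 0,
) -> Iterator[Tuple[str, str]]:
    """Yield n (partition_name, document) pairs drawn according to weights.

    Partition names and weights are illustrative, not part of the dataset.
    """
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    names = sorted(partitions)
    name_weights = [weights[name] for name in names]
    for _ in range(n):
        # Pick a partition in proportion to its weight, then a document from it.
        name = rng.choices(names, weights=name_weights, k=1)[0]
        yield name, rng.choice(partitions[name])


# Hypothetical partitions holding placeholder documents.
partitions = {
    "high": ["doc-h1", "doc-h2"],
    "medium": ["doc-m1"],
    "low": ["doc-l1"],
}
# Example choice: favor higher-quality data and skip the lowest tier entirely.
weights = {"high": 0.7, "medium": 0.3, "low": 0.0}

batch = list(sample_mixture(partitions, weights, n=10))
print(batch[:3])
```

Adjusting the weights is one simple way to act on step 4: if evaluation results stall, the mix can be shifted toward a different quality tier without changing the rest of the pipeline.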