Nemotron-CC: Transforms Common Crawl into a refined long-horizon pre-training dataset.

Overview:

Nemotron-CC is a 6.3-trillion-token English pre-training dataset built from Common Crawl. It combines classifier ensembling and synthetic data rephrasing with reduced reliance on heuristic filters to turn English Common Crawl into a dataset suited to long-horizon pre-training. Of the 6.3 trillion tokens, 4.4 trillion are globally de-duplicated original tokens and 1.9 trillion are synthetically generated. The dataset achieves a better trade-off between accuracy and data quantity, which matters when training large language models.
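As a quick check on the figures above, the following minimal Python sketch recomputes the total and the share of each token type. Only the two token counts come from the overview; the function and variable names are illustrative.

```python
# Token counts in trillions, as stated in the overview.
RAW_TOKENS_T = 4.4        # globally de-duplicated original tokens
SYNTHETIC_TOKENS_T = 1.9  # synthetically generated tokens


def composition(raw: float, synthetic: float) -> dict:
    """Return the total size and the percentage share of each token type."""
    total = raw + synthetic
    return {
        "total_T": round(total, 1),
        "raw_share_pct": round(100 * raw / total, 1),
        "synthetic_share_pct": round(100 * synthetic / total, 1),
    }


summary = composition(RAW_TOKENS_T, SYNTHETIC_TOKENS_T)
print(summary)
```

Roughly seven in ten tokens are original web text and three in ten are synthetic.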

Target Users:

The primary audience is AI researchers and engineers, particularly those working on natural language processing and large language model training. Nemotron-CC gives them a large-scale, quality-filtered dataset for training more accurate and capable models.

Use Cases

An 8B-parameter model trained on Nemotron-CC scored 5.6 points higher on MMLU than one trained on DCLM

An 8B-parameter model trained for 15T tokens, with this dataset in the training mix, outperformed the Llama 3.1 8B model across multiple tasks

Researchers can use the separate quality-level partitions for targeted model training and data-quality studies

Features

Offers a dataset of 6.3 trillion tokens: 4.4 trillion original and 1.9 trillion synthetic

Improves data quality through classifier ensembling, synthetic data rephrasing, and reduced reliance on heuristic filters

Supports long-horizon pre-training, where the token budget reaches many trillions of tokens

Includes partitions of varying quality levels and types to meet diverse needs

Provides data in both JSONL and Parquet formats for convenience across different scenarios
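To illustrate working with the JSONL format and the quality partitions, here is a minimal sketch that reads JSONL records and keeps only those with a wanted quality label. The field names "text" and "quality" and the label values are assumptions made for this example, not the dataset's documented schema; check the dataset documentation for the real field names. An in-memory string stands in for a downloaded shard.

```python
import io
import json
from typing import Iterable, Iterator, List, Set, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def filter_by_label(records: Iterable[dict], field: str, wanted: Set[str]) -> List[dict]:
    """Keep records whose `field` value is one of the wanted labels."""
    return [r for r in records if r.get(field) in wanted]


# Hypothetical records; "text" and "quality" are assumed field names.
sample = io.StringIO(
    '{"text": "A well written explainer.", "quality": "high"}\n'
    '{"text": "click here click here", "quality": "low"}\n'
    '\n'
    '{"text": "A rephrased passage.", "quality": "high"}\n'
)

high = filter_by_label(iter_jsonl(sample), "quality", {"high"})
print(len(high))
```

For a real shard, the StringIO object would be replaced by an open file handle. The same filter could be applied to Parquet data after loading it with a Parquet reader such as pandas.read_parquet.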

How to Use

1. Visit the official Nemotron-CC website to learn about the dataset and its download options

2. Choose the data partition and format that fit your research needs, then download them

3. Use the downloaded dataset for pre-training language models

4. During the pre-training process, adjust training parameters and strategies based on model performance

5. Fine-tune and apply the pre-trained model for specific tasks
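Step 3 above usually involves turning downloaded documents into fixed-length training sequences. The sketch below shows one common approach, concatenating documents with an end-of-document marker and slicing the stream into equal-length sequences. It is an illustration under stated assumptions, not the procedure used by the Nemotron-CC authors: the whitespace tokenizer and the "<eod>" marker are hypothetical stand-ins for a real subword tokenizer and its special token.

```python
from typing import Iterable, Iterator, List

EOD = "<eod>"  # hypothetical end-of-document marker


def toy_tokenize(text: str) -> List[str]:
    """Whitespace tokenizer used only as a stand-in for a real tokenizer."""
    return text.split()


def pack_sequences(docs: Iterable[str], seq_len: int) -> Iterator[List[str]]:
    """Concatenate tokenized documents and yield sequences of exactly seq_len tokens.

    A trailing partial sequence is dropped.
    """
    buffer: List[str] = []
    for doc in docs:
        buffer.extend(toy_tokenize(doc))
        buffer.append(EOD)
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]


docs = ["a b c", "d e", "f g h i"]
batches = list(pack_sequences(docs, seq_len=4))
for seq in batches:
    print(seq)
```

Packing keeps every sequence full, so no compute is spent on padding. The trade-off is that a sequence may span more than one document.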
