

Chinese Internet Corpus Resource Platform
Overview:
The Chinese Internet Corpus Resource Platform is a professional website hosted by the China Cybersecurity Association. It aims to provide high-quality, compliant Chinese corpus resources for pre-training large AI models. The platform draws on the combined strengths of enterprises, universities, and research units and uses a 'co-build and share' mechanism to assemble several high-quality corpora, including Chinese Internet Basic Corpus 2.0, People's Daily Mainstream Value Dataset, and National Library Qing and Ming Literature Corpus. Each corpus undergoes strict data source validation, format cleansing, language filtering, deduplication, content filtering, and privacy filtering to ensure that the data is legal, authentic, accurate, and objective. The platform's resources are intended to support national AI technology innovation and industrial development, help large models understand and generate Chinese content more effectively, and strengthen their knowledge capability and value alignment.
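The platform does not publish its processing code, so the following is only a minimal Python sketch of how four of the stages named above (format cleansing, language filtering, deduplication, and privacy filtering) might be chained. Every function name, regular expression, and threshold here is an illustrative assumption, not the platform's actual implementation. Data source validation and content filtering are omitted because they depend on policies the overview does not describe.

```python
import hashlib
import re
import unicodedata

# Hypothetical patterns for illustration; a real privacy filter covers far more cases.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")  # shape of a mainland China mobile number
CJK_RE = re.compile(r"[\u4e00-\u9fff]")  # CJK Unified Ideographs (basic block only)


def clean_format(text: str) -> str:
    """Format cleansing: normalize Unicode (NFKC) and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def is_mostly_chinese(text: str, min_ratio: float = 0.5) -> bool:
    """Language filtering: keep text whose non-space characters are mostly CJK."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    cjk = sum(1 for c in chars if CJK_RE.match(c))
    return cjk / len(chars) >= min_ratio


def mask_private_data(text: str) -> str:
    """Privacy filtering: replace e-mail addresses and phone numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def process(docs):
    """Run the stages in order and return the documents that survive."""
    seen = set()  # hashes of documents already kept (exact-match deduplication)
    kept = []
    for raw in docs:
        text = clean_format(raw)
        if not is_mostly_chinese(text):
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(mask_private_data(text))
    return kept


if __name__ == "__main__":
    sample = [
        "今天天气很好。",
        "今天天气很好。 ",  # duplicate once whitespace is cleaned
        "The weather is nice today.",  # dropped by the language filter
        "请联系我们获取更多中文语料资料和相关信息，电话13812345678，邮箱a@b.com",
    ]
    for doc in process(sample):
        print(doc)
```

Exact-hash deduplication is the simplest choice for a sketch; large-scale pipelines commonly add near-duplicate detection as well, which is not shown here.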
Target Users:
The primary audience is researchers and developers at enterprises, universities, and research institutions who build large AI models. For them, the platform offers a rich repository of rigorously selected and processed Chinese corpus resources that can improve the training results of large models and help address issues of ideological security, knowledge competency, and value alignment, thereby promoting AI innovation and development in the Chinese-language environment.
Use Cases
An AI company trained its natural language processing model on the Chinese Internet Basic Corpus 2.0, significantly improving the model's ability to understand and generate Chinese text.
A university research team utilized the People's Daily Mainstream Value Dataset to carry out research on knowledge graph construction for specific fields, providing strong support for the application of AI in that area.
Research institutions employed the National Library Qing and Ming Literature Corpus to conduct digitization studies of ancient literature, promoting the integration of traditional culture and modern technology.
Features
Provide various high-quality Chinese corpora to meet different pre-training needs.
Apply a strict data-processing pipeline to ensure corpus safety and compliance.
Cover multiple fields such as culture, politics, and economics, highlighting comprehensiveness.
Support a co-build and share mechanism to facilitate continuous updating and enrichment of corpus resources.
Standardize corpus formats so that users can download and use them easily.
Regularly release new corpora to continually empower AI development.
Provide policy information to help users understand industry dynamics.
Showcase co-build and share outcomes to promote collaboration between academia and industry.
How to Use
1. Visit the platform website at https://corpus.cybersac.cn/#/home.
2. Register and log in to access more resources and services.
3. Browse and select the desired corpus on the homepage or dataset page.
4. Click on the corpus of interest to view detailed information and data samples.
5. Download the corpus as needed and use it according to the formats and guidelines provided by the platform.
6. Refer to the policy information page to stay informed about industry trends and relevant regulations, so that your research and development work complies with requirements.
7. Participate in collaborative sharing activities, contribute your own data or research findings, and collectively promote the platform's development.
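As an illustration of step 5, here is a minimal Python sketch of reading a downloaded corpus. It assumes the file is in JSON Lines format with the document body stored under a "text" key. Both the format and the field name are assumptions made for this example; check the format description and guidelines on the platform's dataset page before relying on them.

```python
import json
from typing import Iterable, Iterator


def iter_texts(lines: Iterable[str], field: str = "text") -> Iterator[str]:
    """Yield the chosen field from each JSON Lines record, skipping blank lines.

    The default field name "text" is hypothetical; adjust it to the real schema.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        value = record.get(field)
        if isinstance(value, str) and value:
            yield value


if __name__ == "__main__":
    # In-memory stand-in for a downloaded file; with a real file, pass the
    # object returned by open(path, encoding="utf-8") instead.
    sample_lines = ['{"text": "你好，世界"}', "", '{"id": 1}']
    for text in iter_texts(sample_lines):
        print(text)
```

Reading line by line keeps memory use flat, which matters for corpora too large to load at once.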