

Crawl4LLM
Overview:
Crawl4LLM is an open-source web crawling project designed to provide an efficient data crawling solution for the pre-training of Large Language Models (LLMs). It helps researchers and developers obtain high-quality training corpora through intelligent selection and crawling of web data. The tool supports multiple document scoring methods and lets users adjust crawling strategies through configuration to meet different pre-training needs. Written in Python, the project is extensible and easy to use, making it suitable for both academic research and industrial applications.
Target Users:
This product is primarily designed for researchers and developers who need to efficiently crawl web data for LLM pre-training. It is suitable for users who want to obtain high-quality training corpora with limited resources, especially professionals in natural language processing and artificial intelligence.
Use Cases
Researchers use Crawl4LLM to crawl high-quality documents from the ClueWeb22 dataset for LLM pre-training.
Developers leverage Crawl4LLM's flexible configuration to customize crawling strategies to meet specific project pre-training needs.
Teams efficiently crawl data using Crawl4LLM and combine it with the DCLM framework for model evaluation and optimization.
Features
Supports multiple document scoring methods, such as length-based and fastText model-based scoring.
Offers flexible configuration options, allowing users to customize crawling strategies and parameters.
Enables efficient data crawling with support for multi-threading and large-scale data processing.
Integrates with the DCLM framework for seamless LLM pre-training and evaluation.
Supports crawling data from large-scale datasets like ClueWeb22.
Provides logging and state-saving functionalities for easy monitoring and resumption of the crawling process.
Supports various baseline crawling strategies, including random and in-degree-based strategies.
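The pluggable scoring described above can be sketched as follows. This is a minimal illustration of the idea, assuming a simple scorer interface; the class names, the capped length formula, and the averaging scheme are hypothetical and are not Crawl4LLM's actual API.

```python
# Illustrative sketch of pluggable document scorers (length-based scoring
# plus score combination). Names and formulas are hypothetical, not
# Crawl4LLM's actual API.
from abc import ABC, abstractmethod


class DocumentScorer(ABC):
    """Common interface so scoring methods can be swapped via configuration."""

    @abstractmethod
    def score(self, text: str) -> float:
        ...


class LengthScorer(DocumentScorer):
    """Score documents by word count, capped so very long pages don't dominate."""

    def __init__(self, cap: int = 2048) -> None:
        self.cap = cap

    def score(self, text: str) -> float:
        # Normalize to [0, 1]: cap the raw count, then divide by the cap.
        return min(len(text.split()), self.cap) / self.cap


def combined_score(text: str, scorers: list[DocumentScorer]) -> float:
    """Average the normalized scores from each configured method."""
    return sum(s.score(text) for s in scorers) / len(scorers)
```

A fastText-based scorer would implement the same interface, returning the classifier's probability that a document is high-quality pre-training text.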
How to Use
1. Request access to the ClueWeb22 dataset and prepare a Python virtual environment.
2. Install project dependencies, including numpy, tqdm, fasttext, etc.
3. Download the DCLM fastText classifier to the specified directory.
4. Create a configuration file to set crawling parameters and strategies.
5. Run the crawl.py script to start crawling data.
6. Use fetch_docs.py to retrieve the text content of the crawled documents.
7. Integrate with the DCLM framework for LLM pre-training and evaluation.
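The crawl loop that the steps above configure (score candidates, then crawl the best ones each iteration) can be sketched with a priority-queue frontier. This is an illustrative sketch only; the `CrawlFrontier` class and its methods are assumptions for exposition, not code from `crawl.py`.

```python
# Minimal sketch of a score-driven crawl frontier: each iteration pops the
# top-scored candidate documents, mirroring a "select then crawl" loop.
# All names here are illustrative, not Crawl4LLM's actual API.
import heapq


class CrawlFrontier:
    def __init__(self) -> None:
        # Store (-score, doc_id) so heapq's min-heap behaves as a max-heap.
        self._heap: list[tuple[float, str]] = []
        self._seen: set[str] = set()

    def add(self, doc_id: str, score: float) -> None:
        # Skip documents we have already queued or crawled.
        if doc_id not in self._seen:
            self._seen.add(doc_id)
            heapq.heappush(self._heap, (-score, doc_id))

    def next_batch(self, k: int) -> list[str]:
        """Return up to k document IDs with the highest scores."""
        batch = []
        while self._heap and len(batch) < k:
            _, doc_id = heapq.heappop(self._heap)
            batch.append(doc_id)
        return batch
```

Under this sketch, a random baseline strategy would simply assign uniform random scores, and an in-degree baseline would score each document by its link in-degree, leaving the rest of the loop unchanged.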