

Crawl4LLM
Overview:
Crawl4LLM is an open-source web crawling project designed to provide an efficient data crawling solution for the pre-training of Large Language Models (LLMs). It helps researchers and developers obtain high-quality training corpora through intelligent selection and crawling of web data. The tool supports multiple document scoring methods and lets users adjust crawling strategies through configuration to meet different pre-training needs. Written in Python, the project is extensible and easy to use, making it suitable for both academic research and industrial applications.
Target Users:
This product is primarily designed for researchers and developers who need to efficiently crawl web data for LLM pre-training. It is suitable for users who want to obtain high-quality training corpora with limited resources, especially professionals in natural language processing and artificial intelligence.
Use Cases
Researchers use Crawl4LLM to crawl high-quality documents from the ClueWeb22 dataset for LLM pre-training.
Developers leverage Crawl4LLM's flexible configuration to customize crawling strategies to meet specific project pre-training needs.
Teams efficiently crawl data using Crawl4LLM and combine it with the DCLM framework for model evaluation and optimization.
Features
Supports multiple document scoring methods, such as length-based and fastText model-based scoring.
Offers flexible configuration options, allowing users to customize crawling strategies and parameters.
Enables efficient data crawling with support for multi-threading and large-scale data processing.
Integrates with the DCLM framework for seamless LLM pre-training and evaluation.
Supports crawling data from large-scale datasets like ClueWeb22.
Provides logging and state-saving functionalities for easy monitoring and resumption of the crawling process.
Supports various baseline crawling strategies, including random and in-degree-based strategies.
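The pluggable scoring described above can be sketched as follows. This is a minimal illustration of the idea, assuming a simple scorer interface; the class names, the capped length formula, and the averaging scheme are hypothetical and are not Crawl4LLM's actual API.

```python
# Illustrative sketch of pluggable document scorers (length-based scoring
# plus score combination). Names and formulas are hypothetical, not
# Crawl4LLM's actual API.
from abc import ABC, abstractmethod


class DocumentScorer(ABC):
    """Common interface so scoring methods can be swapped via configuration."""

    @abstractmethod
    def score(self, text: str) -> float:
        ...


class LengthScorer(DocumentScorer):
    """Score documents by word count, capped so very long pages don't dominate."""

    def __init__(self, cap: int = 2048) -> None:
        self.cap = cap

    def score(self, text: str) -> float:
        # Normalize to [0, 1]: cap the raw count, then divide by the cap.
        return min(len(text.split()), self.cap) / self.cap


def combined_score(text: str, scorers: list[DocumentScorer]) -> float:
    """Average the normalized scores from each configured method."""
    return sum(s.score(text) for s in scorers) / len(scorers)
```

A fastText-based scorer would implement the same interface, returning the classifier's probability that a document is high-quality pre-training text.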
How to Use
1. Request access to the ClueWeb22 dataset and prepare a Python virtual environment.
2. Install project dependencies, including numpy, tqdm, fasttext, etc.
3. Download the DCLM fastText classifier to the specified directory.
4. Create a configuration file to set crawling parameters and strategies.
5. Run the crawl.py script to start crawling data.
6. Use fetch_docs.py to retrieve the text content of the crawled documents.
7. Integrate with the DCLM framework for LLM pre-training and evaluation.
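The crawl loop that the steps above configure (score candidates, then crawl the best ones each iteration) can be sketched with a priority-queue frontier. This is an illustrative sketch only; the `CrawlFrontier` class and its methods are assumptions for exposition, not code from `crawl.py`.

```python
# Minimal sketch of a score-driven crawl frontier: each iteration pops the
# top-scored candidate documents, mirroring a "select then crawl" loop.
# All names here are illustrative, not Crawl4LLM's actual API.
import heapq


class CrawlFrontier:
    def __init__(self) -> None:
        # Store (-score, doc_id) so heapq's min-heap behaves as a max-heap.
        self._heap: list[tuple[float, str]] = []
        self._seen: set[str] = set()

    def add(self, doc_id: str, score: float) -> None:
        # Skip documents we have already queued or crawled.
        if doc_id not in self._seen:
            self._seen.add(doc_id)
            heapq.heappush(self._heap, (-score, doc_id))

    def next_batch(self, k: int) -> list[str]:
        """Return up to k document IDs with the highest scores."""
        batch = []
        while self._heap and len(batch) < k:
            _, doc_id = heapq.heappop(self._heap)
            batch.append(doc_id)
        return batch
```

Under this sketch, a random baseline strategy would simply assign uniform random scores, and an in-degree baseline would score each document by its link in-degree, leaving the rest of the loop unchanged.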