YaFSDP
Overview
YaFSDP is a distributed data parallelism framework designed for transformer-like neural network architectures. It is 20% faster than standard FSDP when pre-training large language models (LLMs) and performs better under high memory pressure. YaFSDP achieves this by reducing the overhead of communication and memory operations.
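To see why communication overhead matters at this scale, here is a back-of-the-envelope sketch (illustrative only, not YaFSDP's actual accounting): in fully sharded data parallelism, parameters are all-gathered for the forward pass and again for the backward pass, and gradients are reduce-scattered once per step.

```python
# Back-of-the-envelope communication volume per device per training step for
# FSDP-style (ZeRO-3) sharding. Illustrative sketch; assumes bf16 (2-byte)
# parameters and gradients.
def per_step_comm_bytes(n_params: int, bytes_per_elem: int = 2) -> int:
    """Two parameter all-gathers (forward + backward) plus one gradient
    reduce-scatter means each device moves roughly 3x the full parameter
    volume per step."""
    param_bytes = n_params * bytes_per_elem
    return 3 * param_bytes

# A 7B-parameter model in bf16 moves on the order of 42 GB per device per step.
print(per_step_comm_bytes(7_000_000_000) / 1e9)  # → 42.0
```

Shaving even a fraction of this traffic, or overlapping it better with compute, translates directly into wall-clock savings.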
Target Users
The YaFSDP framework is aimed at machine learning researchers and engineers who work with large-scale data and models. It is particularly well suited to deep learning training under high memory pressure, such as the pre-training and fine-tuning of large language models.
Use Cases
Use YaFSDP to pre-train language models ranging from 7B to 70B parameters.
Apply YaFSDP to train models on 64 to 256 devices to improve efficiency.
Train models with sequences ranging from 2048 to 8192 tokens using YaFSDP.
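The per-device memory these configurations imply can be estimated with standard ZeRO-3-style accounting. This is an illustrative sketch under assumed conventions (bf16 parameters and gradients, fp32 Adam master weights and moments), not YaFSDP's exact memory layout:

```python
# Rough per-device memory for fully sharded model state, assuming bf16
# parameters/gradients and fp32 Adam state (master params + two moments).
# Activations and framework overhead are deliberately ignored here.
def per_device_state_gb(n_params: int, n_devices: int) -> float:
    bf16 = 2  # bytes per bf16 element (sharded params and grads)
    fp32 = 4  # bytes per fp32 element (master params, Adam m and v)
    bytes_per_param = 2 * bf16 + 3 * fp32  # = 16 bytes per parameter
    return n_params * bytes_per_param / n_devices / 2**30

# 70B parameters sharded over 256 devices: about 4.1 GiB of model state each.
print(round(per_device_state_gb(70_000_000_000, 256), 1))  # → 4.1
```

The same model on 64 devices would need roughly four times that, which is why sharded frameworks like YaFSDP matter most at the small-cluster end of these ranges.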
Features
Supports efficient pre-training of large language models.
Optimizes memory and communication operations to improve training efficiency.
Provides detailed usage examples, including causal pre-training and supervised fine-tuning.
Built on the NVIDIA PyTorch build, with the required library patches integrated.
Supports custom event notifications, so developers can receive the updates they need.
Performance benchmarked on a cluster of A100 80 GB GPUs.
How to Use
1. Clone the YaFSDP GitHub repository to your local environment.
2. Set up the Docker environment by following the guidance in the examples folder.
3. Run the docker/build.sh script to build the required Docker image.
4. Choose a suitable example script based on your specific training needs to perform model training.
5. Monitor the memory and communication overhead during the training process to ensure stable system operation.
6. Adjust the YaFSDP configuration parameters as needed to optimize model training performance.
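For step 6, it helps to keep the tunable knobs in one validated place. The sketch below is purely hypothetical: the field names are generic training knobs chosen for illustration (drawing on the device counts and sequence lengths from the use cases above), not YaFSDP's actual configuration API.

```python
from dataclasses import dataclass

# Hypothetical config sketch for step 6. These field names are illustrative
# generic training knobs, NOT YaFSDP's real configuration parameters.
@dataclass
class TrainConfig:
    n_devices: int = 64        # use cases above mention 64-256 devices
    seq_len: int = 2048        # use cases above mention 2048-8192 tokens
    micro_batch_size: int = 1  # per-device batch size
    grad_accum_steps: int = 8  # gradient accumulation steps

    def global_tokens_per_step(self) -> int:
        """Total tokens consumed per optimizer step across the cluster."""
        return (self.n_devices * self.micro_batch_size
                * self.grad_accum_steps * self.seq_len)

cfg = TrainConfig(n_devices=256, seq_len=8192)
print(cfg.global_tokens_per_step())  # → 16777216
```

Tracking a derived quantity like tokens per step makes it easy to see how changing one knob (say, halving `grad_accum_steps`) shifts the overall training throughput target.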
© 2025 AIbase