SliceGPT
Overview
SliceGPT is a new post-training sparsification approach that reduces the network's embedding dimension by replacing each weight matrix with a smaller, dense matrix. Through extensive experiments, the authors demonstrate that SliceGPT can remove up to 25% of the model parameters (including embeddings) from the LLAMA2-70B, OPT 66B, and Phi-2 models while maintaining 99%, 99%, and 90% of their zero-shot task performance, respectively. Sliced models run on fewer GPUs and execute faster without any additional code optimization: on a 24GB consumer-grade GPU, the total inference computation of LLAMA2-70B is reduced to 64% of that of the dense model; on a 40GB A100 GPU, it is reduced to 66%. The work offers a new insight, computational invariance in transformer networks, which makes SliceGPT possible and may open new avenues for reducing the memory and computational requirements of pre-trained models.
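The core idea can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, only an assumed toy version of the rotate-then-slice step: an orthogonal matrix `Q` (here computed by PCA of stand-in calibration activations) leaves the network's function unchanged (computational invariance, since `Q @ Q.T = I`), after which low-variance directions are dropped, shrinking the weight matrix. The dimensions, variable names, and use of plain PCA are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 6                        # original and sliced embedding dimensions (k = 75% of d)
W = rng.standard_normal((d, d))    # stand-in for one transformer weight matrix
X = rng.standard_normal((100, d))  # stand-in calibration activations

# PCA of the activations yields an orthogonal basis Q ordered by variance.
_, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
Q = Vt.T                           # (d, d), orthogonal: Q @ Q.T == I

# Rotate the weights into the PCA basis, then slice off the low-variance
# rows/columns, leaving a smaller dense (k, k) matrix.
W_sliced = (Q.T @ W @ Q)[:k, :k]
print(W_sliced.shape)              # smaller dense matrix replaces the original
```

Because every weight matrix in the network shrinks this way, the saving applies to both memory and compute, with no sparse kernels required.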
Target Users
SliceGPT is suitable for scenarios that require improved model computational efficiency and reduced memory usage.
Use Cases
SliceGPT can be used to reduce the memory consumption of large language models.
SliceGPT can be used to accelerate the inference process of large language models.
SliceGPT can be used to improve the computational efficiency of pre-trained models.
Features
Post-Training Sparsification
Model Parameter Compression
Improved Model Computational Efficiency
AIbase
Empowering the Future, Your AI Solution Knowledge Base
© 2025 AIbase