T-MAC
Overview
T-MAC is a kernel library that accelerates low-bit large language model inference on CPUs by computing mixed-precision matrix multiplication (mpGEMM) directly through table lookups, eliminating the dequantization step. It supports a range of low-bit models, including W4A16 (GPTQ/gguf), W2A16 (BitDistiller/EfficientQAT), and W1(.58)A8 (BitNet), on ARM and Intel CPUs across OSX, Linux, and Windows. For the 3B BitNet model on the Surface Laptop 7, T-MAC reaches a token generation throughput of 20 tokens/s on a single core and 48 tokens/s on four cores, a 4-5x speedup over existing state-of-the-art low-bit CPU frameworks such as llama.cpp.
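To make the lookup-table idea concrete, here is a minimal sketch of LUT-based mixed-precision GEMV for 1-bit weights, in the spirit of T-MAC but not its actual kernel code: the group size G, the helper names build_tables and lut_gemv, and the sign-based weight encoding are all illustrative assumptions. For each group of G activations, the partial dot product with every possible G-bit weight pattern is precomputed once; the matrix multiplication then reduces to table lookups and additions.

```python
import numpy as np

# Toy sketch of lookup-table-based mixed-precision GEMV (illustrative
# only; not the T-MAC kernel API). Weights are 1-bit, in {-1, +1}.
G = 4  # activations per lookup group; 2**G table entries per group

def build_tables(x):
    """For each group of G activations, precompute the dot product with
    every possible G-bit sign pattern (2**G patterns per group)."""
    K = x.size
    assert K % G == 0
    num_groups = K // G
    # patterns[i, j] = +1 if bit j of index i is set, else -1
    idx = np.arange(2 ** G)
    patterns = np.where((idx[:, None] >> np.arange(G)) & 1, 1.0, -1.0)
    groups = x.reshape(num_groups, G)
    # tables[g, i] = dot(pattern_i, activation_group_g)
    return groups @ patterns.T  # shape (num_groups, 2**G)

def lut_gemv(w_idx, x):
    """y[m] = sum_k w[m, k] * x[k], with each row of w encoded as one
    G-bit table index per group: multiply-adds become table lookups."""
    tables = build_tables(x)  # (num_groups, 2**G)
    num_groups = tables.shape[0]
    return sum(tables[g, w_idx[:, g]] for g in range(num_groups))

# Reference check against a dense matmul.
rng = np.random.default_rng(0)
M, K = 8, 16
w = rng.choice([-1.0, 1.0], size=(M, K))
x = rng.standard_normal(K)
# Pack each group of G signs into a G-bit table index.
bits = (w.reshape(M, K // G, G) > 0).astype(np.int64)
w_idx = (bits << np.arange(G)).sum(axis=-1)
assert np.allclose(lut_gemv(w_idx, x), w @ x)
```

Higher bit widths (2/3/4-bit) are handled by decomposing the weight matrix into one-bit planes that reuse the same tables, which is what lets one kernel design serve W1 through W4 models.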
Target Users
T-MAC is primarily designed for developers and enterprises that need to run large language model inference on CPUs, particularly those targeting real-time or near-real-time performance on edge devices. It suits scenarios where energy consumption and compute resources must be optimized, such as mobile devices, embedded systems, and other resource-constrained environments.
Use Cases
Achieved a 4-5x speedup over existing low-bit CPU frameworks for 3B BitNet model inference on the Surface Laptop 7.
Matched NPU-level performance on the Snapdragon X Elite chip while reducing model size.
Demonstrated better energy efficiency than the CUDA GPU for specific tasks on the Jetson AGX Orin.
Features
Inference support for Llama models with 1/2/3/4-bit quantization in GPTQ format.
On the latest Snapdragon X Elite chip, T-MAC's token generation speed even surpasses that of the NPU.
Native deployment support on Windows on ARM, demonstrating a significant 5x speedup on the Surface Laptop 7.
By replacing multiply-accumulate operations with table lookups, T-MAC significantly reduces CPU core usage while lowering power consumption (see the bit-serial sketch after this list).
Outperforms Qualcomm's Snapdragon Neural Processing Engine (NPE) on the Snapdragon X Elite chip.
On the Jetson AGX Orin, T-MAC's 2-bit mpGEMM performance is comparable to that of CUDA GPUs.
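The multi-bit support above rests on splitting an n-bit weight matrix into n one-bit planes, computing each plane's product (the LUT-friendly part, as sketched earlier), and recombining the partial results with powers of two. Below is a toy sketch of that decomposition; the function name and unsigned-weight encoding are illustrative assumptions, not T-MAC's API.

```python
import numpy as np

def bit_serial_gemv(w_uint, x, nbits):
    """y = W @ x for unsigned nbits-bit integer weights, computed one bit
    plane at a time: W = sum_b 2**b * plane_b, with plane_b in {0, 1}."""
    y = np.zeros(w_uint.shape[0])
    for b in range(nbits):
        plane = (w_uint >> b) & 1    # one-bit matrix, values in {0, 1}
        y += (2 ** b) * (plane @ x)  # LUT-friendly one-bit product
    return y

# Reference check with 2-bit weights.
rng = np.random.default_rng(1)
w = rng.integers(0, 4, size=(8, 16))  # 2-bit weights in [0, 3]
x = rng.standard_normal(16)
assert np.allclose(bit_serial_gemv(w, x, nbits=2), w @ x)
```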
How to Use
Install Python 3.8 to meet TVM's requirements.
Install cmake>=3.22 and the other dependencies for your operating system.
Install T-MAC with pip inside a virtual environment, then activate the environment variables it generates.
Use the provided tool scripts for end-to-end inference, or integrate with llama.cpp for task-specific inference (see the sketch after this list).
Adjust parameters as needed; for example, pass -fa to enable fast aggregation for an additional speed boost.
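As one way to script the end-to-end step, a minimal Python wrapper around the repository's pipeline script might look like the following. The checkout path, model directory, and exact flags are assumptions to adapt to your setup; -fa is the fast-aggregation flag mentioned in the tip above.

```python
import subprocess
from pathlib import Path

# Minimal sketch of driving the end-to-end pipeline script from Python.
# TMAC_ROOT and MODEL_DIR are placeholders (assumptions), and the exact
# flags of tools/run_pipeline.py should be checked against your checkout.
TMAC_ROOT = Path("T-MAC")             # path to your T-MAC checkout (assumed)
MODEL_DIR = Path("models/bitnet-3b")  # your local model directory (assumed)

subprocess.run(
    [
        "python", str(TMAC_ROOT / "tools" / "run_pipeline.py"),
        "-o", str(MODEL_DIR),         # model/output directory (assumed flag)
        "-fa",                        # fast aggregation, per the tip above
    ],
    check=True,
)
```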