

T3
Overview:
Large language models increasingly rely on distributed techniques for training and inference. These techniques require communication between devices, which can degrade scaling efficiency as the number of devices grows. While some distributed techniques can overlap, and thus hide, communication with independent computation, techniques like tensor parallelism (TP) inherently serialize communication with model execution. One way to hide this serialized communication is to interleave it in a fine-grained manner with the producer operation that generates its data. However, implementing such fine-grained interleaving of communication and computation in software can be challenging. Furthermore, like any concurrent execution, it requires sharing compute and memory resources between computation and communication, leading to resource contention and reduced overlap efficiency. To overcome these challenges, we propose T3, which uses hardware-software co-design to transparently overlap serialized communication while minimizing resource contention with compute. Through simple configuration of the producer's output address space, T3 transparently fuses producer operations with the subsequent communication, requiring minimal software changes. At the hardware level, T3 adds lightweight tracking and triggering mechanisms to orchestrate the producer's computation and communication, and it further leverages compute-enhanced memory for the computation associated with communication. Consequently, T3 reduces resource contention and effectively overlaps serialized communication with computation. For important Transformer models like T-NLG, T3 achieves a geometric mean speedup of 30% (up to 47%) for communication-heavy sublayers and a geometric mean reduction of 22% (up to 36%) in data movement. Furthermore, T3's benefits persist as models scale, achieving a geometric mean sublayer speedup of 29% for ~500-billion parameter models, PALM and MT-NLG.
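To make the fine-grained interleaving described above concrete, the sketch below shows the software-only baseline that T3 automates in hardware: the producer matrix multiply of a tensor-parallel layer is split into chunks, and each chunk's all-reduce is issued asynchronously so it overlaps with the next chunk's compute. This is an illustration of the idea, not T3's mechanism; it assumes PyTorch with an initialized torch.distributed process group (one process per device), and the chunk count is an arbitrary choice.

```python
# Software-only sketch of fine-grained compute/communication overlap for a
# tensor-parallel producer GEMM followed by an all-reduce.
# Assumes torch.distributed is already initialized (e.g. NCCL, one process per GPU).
import torch
import torch.distributed as dist

def tp_matmul_allreduce_overlapped(x, w, num_chunks=4):
    """Compute (x @ w) followed by an all-reduce, overlapping the two by
    processing the rows of x in chunks."""
    handles, outputs = [], []
    for x_chunk in x.chunk(num_chunks, dim=0):
        y_chunk = x_chunk @ w                            # producer compute for this chunk
        work = dist.all_reduce(y_chunk, async_op=True)   # communication starts immediately
        handles.append(work)
        outputs.append(y_chunk)
    for work in handles:                                 # wait for all in-flight reductions
        work.wait()
    return torch.cat(outputs, dim=0)
```

Implementing and tuning this kind of interleaving by hand, and the contention it creates for compute and memory resources, is precisely the burden T3 aims to remove.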
Target Users:
Practitioners using distributed techniques to train and serve large language models
Use Cases
Accelerate the training process of the large language model T-NLG
Improve communication efficiency in the inference of models like PALM and MT-NLG
Suitable for scenarios requiring maximum overlap of computation and communication
Features
Transparently overlap serialized communication and computation
Minimize resource contention with computation
Simple configuration of producer output address spaces
Lightweight tracking and triggering mechanisms (see the sketch after this list)
Leverage enhanced compute memory for computation related to communication
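To illustrate the track-and-trigger idea listed above, here is a toy, software-level model of a tracker that counts producer writes into fixed-size chunks of the output address space and fires a communication callback as soon as a chunk is fully written, so communication can begin before the whole producer finishes. The class name, chunk granularity, and callback are illustrative assumptions, not T3's hardware interface.

```python
# Conceptual model (not T3's actual hardware) of a lightweight track-and-trigger
# mechanism over a producer's output address space.
class ChunkTracker:
    def __init__(self, total_elements, chunk_elements, on_chunk_ready):
        self.chunk_elements = chunk_elements
        self.num_chunks = (total_elements + chunk_elements - 1) // chunk_elements
        # Remaining writes expected for each chunk.
        self.pending = [min(chunk_elements, total_elements - i * chunk_elements)
                        for i in range(self.num_chunks)]
        self.on_chunk_ready = on_chunk_ready  # e.g. enqueue a DMA / reduction for the chunk

    def record_write(self, element_index):
        """Called once per producer write; triggers communication when a chunk's
        last element arrives."""
        chunk_id = element_index // self.chunk_elements
        self.pending[chunk_id] -= 1
        if self.pending[chunk_id] == 0:
            self.on_chunk_ready(chunk_id)

# Usage: 1024-element output, 256-element chunks; communication for each chunk
# is "triggered" (here, just printed) the moment that chunk is complete.
tracker = ChunkTracker(1024, 256, lambda c: print(f"chunk {c} ready, start comm"))
for i in range(1024):
    tracker.record_write(i)
```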