Tensor and pipeline parallelism: scaling LLMs beyond one GPU
How TP and PP work, when to use them, and how they combine with data parallelism.
This technical article explains two core techniques for scaling transformer models across multiple GPUs: tensor parallelism and pipeline parallelism. It is written for engineers building on-prem inference or training systems.
1) Why parallelism is needed
Large models can exceed the memory of a single GPU. Parallelism splits computation and memory across multiple devices, allowing larger models or longer contexts while keeping throughput acceptable.
2) Tensor parallelism (TP)
Tensor parallelism splits the math inside a layer across GPUs. For example, a large weight matrix is partitioned by columns or rows, each GPU multiplies against its shard, and the partial results are combined via collective communication (typically an all-reduce or all-gather).
In Megatron-LM style implementations, attention and MLP layers are split across the tensor-parallel group, so each GPU holds a portion of the weights and activations. This reduces per-GPU memory and spreads compute across devices.
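As a minimal single-process sketch (a tensor-parallel degree of 2 is assumed, and the shards are simulated with plain tensors rather than real GPUs or collective ops), the column- and row-parallel splits look like this:

```python
# Single-process sketch of tensor-parallel matmul (simulated, no real GPUs).
# Column-parallel: each "rank" holds a column shard of W; outputs are
# concatenated, playing the role of an all-gather. Row-parallel: each rank
# holds a row shard of W and a matching column shard of x; partial outputs
# are summed, playing the role of an all-reduce.
import torch

tp = 2                      # tensor-parallel degree (assumed)
x = torch.randn(4, 8)       # batch of activations
W = torch.randn(8, 16)      # full weight matrix

# Column parallelism: shard W along the output dimension.
col_shards = W.chunk(tp, dim=1)
y_col = torch.cat([x @ w for w in col_shards], dim=1)         # "all-gather"

# Row parallelism: shard W along the input dimension, x along features.
row_shards = W.chunk(tp, dim=0)
x_shards = x.chunk(tp, dim=1)
y_row = sum(xs @ ws for xs, ws in zip(x_shards, row_shards))  # "all-reduce"

# Both shardings reproduce the unsharded result.
assert torch.allclose(y_col, x @ W, atol=1e-5)
assert torch.allclose(y_row, x @ W, atol=1e-5)
```

In a real deployment the concatenation and sum become collective ops over the tensor-parallel group, which is why interconnect bandwidth dominates TP performance.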
Key properties
- Good for large matrix multiplies and dense layers.
- Requires fast interconnects for collective ops.
- Works well for both training and inference when kernels are optimized.
3) Pipeline parallelism (PP)
Pipeline parallelism splits the model by layers into stages. Each GPU (or GPU group) holds a contiguous block of layers. Micro-batches flow through the pipeline so all stages stay busy.
A common training schedule is 1F1B (one forward, one backward), which reduces pipeline bubbles compared to naive fill-then-drain scheduling. For inference, pipeline parallelism can still be useful when the layers do not fit in memory on a single device.
Key properties
- Reduces per-GPU memory footprint by splitting layers.
- Adds pipeline latency; needs micro-batching to fill the pipeline.
- Works best when stages are balanced to avoid idle time.
Simple pipeline fill diagram (colored blocks): forward micro-batches F1-F4 flowing through three stages
$$ \begin{array}{rcccc} \text{Stage} & t_1 & t_2 & t_3 & t_4 \\ S_1 & \htmlStyle{background:#DCFCE7;border:1px solid #86EFAC;border-radius:0.6em;padding:0.12em 0.55em;}{\vphantom{F_1}F_1} & \htmlStyle{background:#DCFCE7;border:1px solid #86EFAC;border-radius:0.6em;padding:0.12em 0.55em;}{\vphantom{F_1}F_2} & \htmlStyle{background:#DCFCE7;border:1px solid #86EFAC;border-radius:0.6em;padding:0.12em 0.55em;}{\vphantom{F_1}F_3} & \htmlStyle{background:#DCFCE7;border:1px solid #86EFAC;border-radius:0.6em;padding:0.12em 0.55em;}{\vphantom{F_1}F_4} \\ S_2 & & \htmlStyle{background:#FEF9C3;border:1px solid #FDE047;border-radius:0.6em;padding:0.12em 0.55em;}{\vphantom{F_1}F_1} & \htmlStyle{background:#FEF9C3;border:1px solid #FDE047;border-radius:0.6em;padding:0.12em 0.55em;}{\vphantom{F_1}F_2} & \htmlStyle{background:#FEF9C3;border:1px solid #FDE047;border-radius:0.6em;padding:0.12em 0.55em;}{\vphantom{F_1}F_3} \\ S_3 & & & \htmlStyle{background:#CFFAFE;border:1px solid #67E8F9;border-radius:0.6em;padding:0.12em 0.55em;}{\vphantom{F_1}F_1} & \htmlStyle{background:#CFFAFE;border:1px solid #67E8F9;border-radius:0.6em;padding:0.12em 0.55em;}{\vphantom{F_1}F_2} \\ \end{array} $$
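The same pattern can be generated programmatically. Here is a tiny forward-only scheduling sketch (three stages and four micro-batches, matching the diagram above); it also prints the drain phase that the diagram truncates:

```python
# Minimal sketch of a forward-only pipeline schedule (assumed 3 stages,
# 4 micro-batches): stage s starts micro-batch m at tick s + m.
stages, micro_batches = 3, 4

for t in range(stages + micro_batches - 1):
    row = []
    for s in range(stages):
        m = t - s
        row.append(f"F{m + 1}" if 0 <= m < micro_batches else "  ")
    print(f"t{t + 1}: " + " ".join(row))
```

The blank cells at the start and end are the pipeline bubbles; schedules like 1F1B exist precisely to shrink their share of total runtime.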
4) Combining TP, PP, and data parallelism
In practice, large systems combine:
- Tensor parallelism for intra-layer scaling.
- Pipeline parallelism for inter-layer scaling.
- Data parallelism to scale batch size across replicas.
This combination is common in large-scale training stacks such as Megatron-LM and DeepSpeed.
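For intuition, here is a small sketch of how a 3D layout might assign ranks. The sizes and the TP-innermost ordering are illustrative assumptions; real frameworks construct these groups through their own APIs.

```python
# Sketch of a 3D parallel layout: world = TP x PP x DP.
# Ranks are grouped with TP varying fastest (one common convention),
# so tensor-parallel peers are adjacent and can share the fastest links.
TP, PP, DP = 4, 2, 2
world_size = TP * PP * DP

for rank in range(world_size):
    tp_rank = rank % TP
    pp_rank = (rank // TP) % PP
    dp_rank = rank // (TP * PP)
    print(f"rank {rank:2d}: tp={tp_rank} pp={pp_rank} dp={dp_rank}")
```

Placing tensor-parallel peers on adjacent ranks matters because TP issues the most frequent collectives and benefits most from NVLink-class bandwidth.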
5) Inference vs training considerations
- Training benefits from PP + TP + data parallel with careful scheduling.
- Inference often uses TP for throughput, and PP when model size requires it.
- KV cache lives with the layers that produce it, so PP splits cache ownership across stages.
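As a back-of-the-envelope illustration of that last point, here is a KV cache sizing sketch. All model parameters below are assumed for illustration, not taken from any specific model.

```python
# Back-of-the-envelope KV cache sizing per pipeline stage
# (illustrative parameters; adjust to your model and workload).
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    # 2x for keys and values, one entry per layer/head/position.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

total_layers, pp = 32, 4
per_stage = kv_cache_bytes(total_layers // pp, kv_heads=8, head_dim=128,
                           seq_len=8192, batch=16, dtype_bytes=2)
print(f"KV cache per stage: {per_stage / 2**30:.2f} GiB")  # ~8 GiB here
```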
6) Practical guidance
- Use TP first when you have high-bandwidth interconnect (NVLink, InfiniBand).
- Add PP when the model still does not fit in GPU memory (a rough sizing check follows after this list).
- Keep PP stages balanced to avoid bubbles and stalls.
- Measure end-to-end latency; PP can increase latency for single requests.
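A quick way to sanity-check the first two points is a weight-memory estimate per GPU. This sketch uses illustrative numbers and counts weights only; activations, KV cache, and optimizer state all add on top.

```python
# Rough per-GPU weight-memory check (weights only, illustrative numbers).
params_billion = 70          # assumed model size
dtype_bytes = 2              # fp16/bf16 weights
gpu_mem_gib = 80             # assumed GPU memory

def fits(tp, pp):
    per_gpu_gib = params_billion * 1e9 * dtype_bytes / (tp * pp) / 2**30
    return per_gpu_gib, per_gpu_gib < 0.9 * gpu_mem_gib  # keep headroom

for tp, pp in [(1, 1), (2, 1), (4, 1), (4, 2)]:
    gib, ok = fits(tp, pp)
    print(f"TP={tp} PP={pp}: {gib:6.1f} GiB/GPU -> {'fits' if ok else 'too big'}")
```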
Sources
- NVIDIA Megatron-Core developer guide, parallelisms overview: https://docs.nvidia.com/megatron-core/developer-guide/latest/parallelisms.html
- Microsoft Research blog, "DeepSpeed: Extreme-scale model training for everyone": https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/
- Shoeybi et al., "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism": https://arxiv.org/abs/1909.08053