Introduction to Context Parallel


What you will learn

Prerequisites

PyTorch 2.7 or later

Introduction

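Context Parallel is an approach used in large language model training to reduce peak activation size by sharding long input sequences across multiple devices. It lifts the limit on input sequence length that is otherwise imposed by the peak memory needed to store activations in Transformer blocks.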
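To make the memory argument concrete, the following back-of-the-envelope sketch estimates the size of a single (batch, nheads, seq_len, head_dim) tensor before and after the sequence dimension is split evenly across devices. The helper names are illustrative only and are not part of PyTorch; the shapes reuse the values from the SDPA example in this tutorial, and bf16 (2 bytes per element) is assumed.

```python
def tensor_bytes(batch, nheads, seq_len, head_dim, bytes_per_elem=2):
    """Bytes needed to store one (batch, nheads, seq_len, head_dim) tensor."""
    return batch * nheads * seq_len * head_dim * bytes_per_elem


def per_device_bytes(batch, nheads, seq_len, head_dim, cp_degree, bytes_per_elem=2):
    """Bytes per device when seq_len is split evenly across cp_degree devices."""
    if seq_len % cp_degree != 0:
        raise ValueError("this sketch assumes seq_len is divisible by cp_degree")
    return tensor_bytes(batch, nheads, seq_len // cp_degree, head_dim, bytes_per_elem)


# Shapes from the single-GPU SDPA example; bf16 is an assumption of this sketch.
full = tensor_bytes(8, 8, 8192, 32)
sharded = per_device_bytes(8, 8, 8192, 32, cp_degree=4)
print(f"{full // 2**20} MiB unsharded -> {sharded // 2**20} MiB per device")
```

This only counts one tensor; a real Transformer block stores many activations of comparable shape, and each of them shrinks by roughly the same factor when it is sharded along the sequence dimension.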

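Ring Attention, a novel parallel implementation of the Attention layer, is critical to performant Context Parallel. Ring Attention shuffles the KV shards and computes partial attention scores, repeating until every KV shard has been used on every device. Two Ring Attention variants have been implemented, the all-gather based pass-KV and the all-to-all based pass-KV: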

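The all-gather based pass-KV algorithm is used in Llama3 training. It first performs an all-gather on the key and value tensors, then computes the attention output for the local query tensor chunk. Our modified all-gather based pass-KV algorithm all-gathers the KV shards and, concurrently, computes the attention output for the local query tensor chunk using the local key and value tensor chunks, followed by a final computation of the attention output for the local query tensor and the remaining KV shards. This allows some degree of overlap between the attention computation and the all-gather collective. For example, in the case of Llama3 training, we also shard freq_cis over the sequence dimension.
The all-to-all approach uses interleaved all-to-all collectives to ring-shuffle the KV shards, overlapping the SDPA (Scaled Dot Product Attention) computation with the all-to-all communication needed for the next SDPA.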
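The idea that partial attention results over individual KV shards can be combined into the exact full result can be sketched in a single process. The code below is a simplified illustration, not the PyTorch implementation: it is non-causal, single-head, runs the "devices" sequentially in pure Python, and merges partial results with a running maximum and normalizer (the online-softmax trick). All function names are hypothetical.

```python
import math
import random


def scores_for(qi, keys):
    """Scaled dot products of one query row against every key row."""
    scale = 1.0 / math.sqrt(len(qi))
    return [scale * sum(a * b for a, b in zip(qi, kj)) for kj in keys]


def full_attention(q, k, v):
    """Reference: ordinary softmax attention over the whole sequence."""
    out = []
    for qi in q:
        s = scores_for(qi, k)
        m = max(s)
        w = [math.exp(x - m) for x in s]
        out.append([sum(wj * vj[d] for wj, vj in zip(w, v)) / sum(w)
                    for d in range(len(v[0]))])
    return out


def ring_attention(q_shards, k_shards, v_shards):
    """Each simulated rank keeps its Q shard and sees one KV shard per step."""
    n = len(q_shards)
    dim_v = len(v_shards[0][0])
    result = []
    for rank, q in enumerate(q_shards):
        run_max = [-math.inf] * len(q)   # running max of scores per query row
        run_sum = [0.0] * len(q)         # running softmax normalizer
        acc = [[0.0] * dim_v for _ in q]  # running weighted sum of values
        for step in range(n):
            src = (rank - step) % n      # KV shard visiting this rank at this step
            k, v = k_shards[src], v_shards[src]
            for i, qi in enumerate(q):
                s = scores_for(qi, k)
                new_max = max(run_max[i], max(s))
                rescale = math.exp(run_max[i] - new_max)  # rescale old partials
                w = [math.exp(x - new_max) for x in s]
                run_sum[i] = run_sum[i] * rescale + sum(w)
                acc[i] = [a * rescale + sum(wj * vj[d] for wj, vj in zip(w, v))
                          for d, a in enumerate(acc[i])]
                run_max[i] = new_max
        result.extend([a / run_sum[i] for a in acc[i]] for i in range(len(q)))
    return result


def split(rows, n):
    """Split a list of rows into n contiguous, equally sized shards."""
    size = len(rows) // n
    return [rows[i * size:(i + 1) * size] for i in range(n)]


random.seed(0)
seq_len, dim, world_size = 8, 4, 4
q, k, v = ([[random.random() for _ in range(dim)] for _ in range(seq_len)]
           for _ in range(3))
ring_out = ring_attention(split(q, world_size), split(k, world_size), split(v, world_size))
ref_out = full_attention(q, k, v)
max_err = max(abs(a - b) for r, f in zip(ring_out, ref_out) for a, b in zip(r, f))
print(f"max abs difference: {max_err:.2e}")
```

The two variants described above differ in how the KV shards reach each device (all-gather versus interleaved all-to-all), not in this merging arithmetic. Causal masking is deliberately left out of the sketch.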

The Context Parallel APIs consist of two parts:

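context_parallel() allows users to create a Python context in which the SDPA function (torch.nn.functional.scaled_dot_product_attention) is automatically replaced with Ring Attention. To shard tensors along a dimension, pass the tensors and their sharding dimensions to the arguments buffers and buffer_seq_dims, respectively. We recommend adding every tensor that is computed along the sequence dimension to buffers and sharding it along that dimension. Taking Llama3 training as an example, leaving freq_cis out of buffers results in a miscalculated rotary embedding.
set_rotate_method() allows users to choose between the all-gather based pass-KV approach and the all-to-all based pass-KV approach.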
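The following sketch illustrates why a per-position buffer such as freq_cis has to be sharded together with the tokens. It uses plain Python lists and a hypothetical helper, shard_seq, that splits a sequence into contiguous chunks. This is a simplification: the real API operates on tensors and may assign chunks to ranks in a different order.

```python
def shard_seq(seq, rank, world_size):
    """Return the contiguous chunk of `seq` that `rank` holds in this sketch."""
    if len(seq) % world_size != 0:
        raise ValueError("this sketch assumes the length divides evenly")
    chunk = len(seq) // world_size
    return seq[rank * chunk:(rank + 1) * chunk]


seq_len, world_size = 8, 4
tokens = [f"tok{i}" for i in range(seq_len)]
positions = list(range(seq_len))  # stand-in for a per-position buffer like freq_cis

for rank in range(world_size):
    local_tokens = shard_seq(tokens, rank, world_size)
    local_positions = shard_seq(positions, rank, world_size)
    print(rank, local_tokens, local_positions)

# If the positional buffer were NOT sharded, every rank would read its first
# entries, so rank 1 would pair tok2/tok3 with positions 0/1 instead of 2/3.
rank1_tokens = shard_seq(tokens, 1, world_size)
unsharded_view = positions[:len(rank1_tokens)]
correct_view = shard_seq(positions, 1, world_size)
print(rank1_tokens, "wrong:", unsharded_view, "correct:", correct_view)
```

In the real API the same pairing is expressed by listing each tensor in buffers and its sequence dimension at the matching index of buffer_seq_dims.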

Setup

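With torch.distributed.tensor.experimental.context_parallel(), users can easily shard the tensor input and parallelize the execution of the SDPA function. To better demonstrate the usage of this API, we start with a simple code snippet that runs SDPA and then parallelize it using the API: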

import torch
import torch.nn.functional as F

from torch.nn.attention import sdpa_kernel, SDPBackend


def sdpa_example():
    assert torch.cuda.is_available()
    torch.cuda.set_device("cuda:0")
    torch.cuda.manual_seed(0)

    batch = 8
    nheads = 8
    qkv_len = 8192
    dim = 32
    backend = SDPBackend.FLASH_ATTENTION
    dtype = (
        torch.bfloat16
        if backend == SDPBackend.FLASH_ATTENTION
        or backend == SDPBackend.CUDNN_ATTENTION
        else torch.float32
    )

    qkv = [
        torch.rand(
            (batch, nheads, qkv_len, dim),
            dtype=dtype,
            requires_grad=True,
            device='cuda',
        )
        for _ in range(3)
    ]
    # specify the SDPBackend to use
    with sdpa_kernel(backend):
        out = F.scaled_dot_product_attention(*qkv, is_causal=True)


if __name__ == "__main__":
    sdpa_example()

Enable Context Parallel

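Now, let's first adapt it into a distributed program in which every rank has the same tensor input. Then we apply the context parallel API to shard the input and distribute the computation across ranks: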

# file: cp_sdpa_example.py
import os

import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.experimental import context_parallel
from torch.distributed.tensor.experimental._attention import context_parallel_unshard
from torch.nn.attention import sdpa_kernel, SDPBackend


def context_parallel_sdpa_example(world_size: int, rank: int):
    assert torch.cuda.is_available()
    assert dist.is_nccl_available()
    torch.cuda.set_device(f"cuda:{rank}")
    torch.cuda.manual_seed(0)

    dist.init_process_group(
        backend="nccl",
        init_method="env://",
        world_size=world_size,
        rank=rank,
    )
    device_mesh = init_device_mesh(
        device_type="cuda", mesh_shape=(world_size,), mesh_dim_names=("cp",)
    )

    batch = 8
    nheads = 8
    qkv_len = 64
    dim = 32
    backend = SDPBackend.FLASH_ATTENTION
    dtype = (
        torch.bfloat16
        if backend == SDPBackend.FLASH_ATTENTION
        or backend == SDPBackend.CUDNN_ATTENTION
        else torch.float32
    )

    qkv = [
        torch.rand(
            (batch, nheads, qkv_len, dim),
            dtype=dtype,
            requires_grad=True,
            device='cuda',
        )
        for _ in range(3)
    ]
    # specify the SDPBackend to use
    with sdpa_kernel(backend):
        out = F.scaled_dot_product_attention(*qkv, is_causal=True)

    # make a clean copy of QKV for output comparison
    cp_qkv = [t.detach().clone() for t in qkv]

    with sdpa_kernel(backend):
        # This `context_parallel()` performs two actions:
        # 1. Shard the tensor objects in `buffers` in-place along the dimensions
        #    specified in `buffer_seq_dims`. The tensors in `buffers` and their
        #    sharding dims in `buffer_seq_dims` are listed in the same order.
        # 2. Replace the execution of `F.scaled_dot_product_attention` with a
        #    context-parallel-enabled Ring Attention.
        with context_parallel(
            device_mesh, buffers=tuple(cp_qkv), buffer_seq_dims=(2, 2, 2)
        ):
            cp_out = F.scaled_dot_product_attention(*cp_qkv, is_causal=True)

        # The output `cp_out` is still sharded in the same way as QKV.
        # The `context_parallel_unshard` API lets users unshard it
        # to obtain the full tensor.
        (cp_out,) = context_parallel_unshard(device_mesh, [cp_out], [2])

    assert torch.allclose(
        cp_out,
        out,
        atol=(1e-08 if dtype == torch.float32 else 1e-03 * world_size),
    )


if __name__ == "__main__":
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    try:
        context_parallel_sdpa_example(world_size, rank)
    finally:
        dist.barrier()
        dist.destroy_process_group()

You can use the command torchrun --standalone --nnodes=1 --nproc-per-node=4 cp_sdpa_example.py to launch the above context parallel SDPA on 4 GPUs. We demonstrate numerical correctness by comparing the output of Ring Attention with that of SDPA on a single GPU.

Select Rotation Approach

You can choose the desired shard rotation approach in Ring Attention by using torch.distributed.tensor.experimental._attention.set_rotate_method():

# file: cp_sdpa_example.py
from torch.distributed.tensor.experimental._attention import set_rotate_method

set_rotate_method("alltoall")  # rotate shards using all-to-all

with sdpa_kernel(backend):
    with context_parallel(
        device_mesh, buffers=tuple(cp_qkv), buffer_seq_dims=(2, 2, 2)
    ):
        cp_out = F.scaled_dot_product_attention(*cp_qkv, is_causal=True)

The default rotation approach is the all-gather based pass-KV.
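When comparing the two approaches, it can be convenient to pick the rotation method at launch time rather than editing the script. The sketch below reads it from an environment variable and validates it before handing it to PyTorch. The variable name CP_ROTATE_METHOD and both helper functions are hypothetical, and the string "allgather" for the default approach is an assumption based on the current experimental API; only "alltoall" appears in this tutorial.

```python
import os

# "alltoall" is shown in this tutorial; "allgather" is assumed to name the default.
VALID_ROTATE_METHODS = ("allgather", "alltoall")


def resolve_rotate_method(default="allgather"):
    """Read the rotation method from the hypothetical CP_ROTATE_METHOD variable."""
    name = os.environ.get("CP_ROTATE_METHOD", default).strip().lower()
    if name not in VALID_ROTATE_METHODS:
        raise ValueError(
            f"unknown rotate method {name!r}; expected one of {VALID_ROTATE_METHODS}"
        )
    return name


def apply_rotate_method(name):
    """Forward a validated name to PyTorch (needs a build with this experimental API)."""
    # Imported lazily so the validation helper above works without PyTorch.
    from torch.distributed.tensor.experimental._attention import set_rotate_method

    set_rotate_method(name)
```

In cp_sdpa_example.py, calling apply_rotate_method(resolve_rotate_method()) before entering context_parallel() would then let a launch such as CP_ROTATE_METHOD=alltoall torchrun ... switch approaches without a code change.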

Conclusion

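In this tutorial, we have learned how to parallelize the SDPA computation along the sequence dimension with the Context Parallel APIs. For design and implementation details, performance analysis, and an end-to-end training example in TorchTitan, see our post on PyTorch native long-context training.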
