Getting Started with Nested Tensors

Created On: Aug 02, 2022 | Last Updated: May 23, 2025 | Last Verified: Nov 05, 2024

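Nested tensors generalize the shape of regular dense tensors, allowing for representation of ragged-sized data.

For a regular tensor, each dimension is regular and has a size
For a nested tensor, not all dimensions have regular sizes; some of them are ragged

Nested tensors are a natural solution for representing sequential data within various domains:

In NLP, sentences can have variable lengths, so a batch of sentences forms a nested tensor
In CV, images can have variable shapes, so a batch of images forms a nested tensor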

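In this tutorial, we will demonstrate basic usage of nested tensors and motivate their usefulness for operating on sequential data of varying lengths with a real-world example. In particular, they are invaluable for building transformers that can efficiently operate on ragged sequential inputs. Below, we present an implementation of multi-head attention using nested tensors that, combined with torch.compile, outperforms operating naively on tensors with padding.

Nested tensors are currently a prototype feature and are subject to change.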

import numpy as np
import timeit
import torch
import torch.nn.functional as F

from torch import nn

torch.manual_seed(1)
np.random.seed(1)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

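Nested tensor initialization

From the Python frontend, a nested tensor can be created from a list of tensors. We denote nt[i] as the ith tensor component of a nested tensor.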

nt = torch.nested.nested_tensor([torch.arange(12).reshape(
    2, 6), torch.arange(18).reshape(3, 6)], dtype=torch.float, device=device)
print(f"{nt=}")

By padding every underlying tensor to the same shape, a nested tensor can be converted to a regular tensor.

padded_out_tensor = torch.nested.to_padded_tensor(nt, padding=0.0)
print(f"{padded_out_tensor=}")
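To make the padding step concrete, here is a minimal pure-Python sketch of the same idea, independent of PyTorch. The helper name pad_components is hypothetical, and each component is modeled as a list of rows. It illustrates the concept only and is not how torch.nested.to_padded_tensor is implemented.

```python
def pad_components(components, padding=0.0):
    """Pad a list of 2-D components (lists of rows) to a common shape.

    Hypothetical helper for illustration only; not part of PyTorch.
    """
    max_rows = max(len(c) for c in components)
    max_cols = max(len(row) for c in components for row in c)
    padded = []
    for c in components:
        # Pad every existing row on the right, then append all-padding rows.
        rows = [list(row) + [padding] * (max_cols - len(row)) for row in c]
        rows += [[padding] * max_cols for _ in range(max_rows - len(c))]
        padded.append(rows)
    return padded


# Same shapes as nt above: one 2x6 component and one 3x6 component.
a = [[float(6 * r + c) for c in range(6)] for r in range(2)]
b = [[float(6 * r + c) for c in range(6)] for r in range(3)]
dense = pad_components([a, b])
print(len(dense), len(dense[0]), len(dense[0][0]))  # 2 3 6
print(dense[0][2])  # the appended row consists only of padding values
```

The shorter component gains one extra row of padding so that both components share the shape 3x6, which is the memory overhead the rest of the tutorial tries to avoid.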

All tensors possess an attribute for determining if they are nested:

print(f"nt is nested: {nt.is_nested}")
print(f"padded_out_tensor is nested: {padded_out_tensor.is_nested}")

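It is common to construct nested tensors from batches of irregularly shaped tensors; that is, dimension 0 is assumed to be the batch dimension. Indexing dimension 0 with i gives back the ith underlying tensor component.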

print("First underlying tensor component:", nt[0], sep='\n')
print("last column of 2nd underlying tensor component:", nt[1, :, -1], sep='\n')

# When indexing a nested tensor's 0th dimension, the result is a regular tensor.
print(f"First underlying tensor component is nested: {nt[0].is_nested}")

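An important caveat is that slicing in dimension 0 is not yet supported. This means it is currently not possible to construct a view that combines the underlying tensor components.

Nested tensor operations

Since each operation must be explicitly implemented for nested tensors, operation coverage for nested tensors is currently narrower than that of regular tensors. For now, only basic operations such as index, dropout, softmax, transpose, reshape, linear, and bmm are covered. However, coverage is being expanded. If you need certain operations, please open an issue to help us prioritize coverage.

reshape

The reshape op is for changing the shape of a tensor. For its full semantics for regular tensors, see the torch.reshape documentation. For regular tensors, when specifying the new shape, a single dimension may be -1, in which case it is inferred from the remaining dimensions and the number of elements.

The semantics for nested tensors are similar, except that -1 no longer infers. Instead, it inherits the old size (here 2 for nt[0] and 3 for nt[1]). -1 is the only legal size to specify for a ragged dimension.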

nt_reshaped = nt.reshape(2, -1, 2, 3)
print(f"{nt_reshaped=}")
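The "-1 inherits the old size" rule can be modeled with plain shape arithmetic. The following is a minimal sketch, assuming the ragged dimension is the first dimension of every component. The helper reshaped_component_shapes is hypothetical and only mimics the rule described above, not PyTorch's implementation.

```python
from math import prod


def reshaped_component_shapes(component_shapes, new_shape):
    """Model the per-component shapes produced by reshaping a nested tensor.

    Hypothetical helper for illustration only; it assumes the ragged
    dimension is the first dimension of every component.
    """
    batch_size, *rest = new_shape
    if batch_size != len(component_shapes):
        raise ValueError("dimension 0 must equal the number of components")
    result = []
    for shape in component_shapes:
        # -1 is not inferred: it inherits this component's ragged size.
        target = tuple(shape[0] if size == -1 else size for size in rest)
        if prod(target) != prod(shape):
            raise ValueError(f"cannot reshape {shape} into {target}")
        result.append(target)
    return result


# nt has components of shape (2, 6) and (3, 6); this models nt.reshape(2, -1, 2, 3).
print(reshaped_component_shapes([(2, 6), (3, 6)], (2, -1, 2, 3)))
# [(2, 2, 3), (3, 2, 3)]
```

Each component keeps its own ragged size (2 and 3), while the regular trailing dimension of size 6 is split into 2 x 3.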

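transpose

The transpose op is for swapping two dimensions of a tensor. For its full semantics, see the torch.transpose documentation. Note that for nested tensors dimension 0 is special; it is assumed to be the batch dimension, so transposes involving nested tensor dimension 0 are not supported.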

nt_transposed = nt_reshaped.transpose(1, 2)
print(f"{nt_transposed=}")
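As a rough mental model, transposing nested dimensions d0 and d1 swaps dimensions d0 - 1 and d1 - 1 inside every component, because dimension 0 of the nested tensor is the batch of components. The sketch below follows that assumption; transposed_component_shapes is a hypothetical helper that accepts only non-negative dimension indices.

```python
def transposed_component_shapes(component_shapes, dim0, dim1):
    """Model the per-component shapes after transposing a nested tensor.

    Hypothetical helper for illustration only. Dimensions are given at the
    nested-tensor level, so nested dim d maps to component dim d - 1.
    """
    if dim0 == 0 or dim1 == 0:
        raise ValueError("transposing the batch dimension (dim 0) is not supported")
    result = []
    for shape in component_shapes:
        dims = list(shape)
        dims[dim0 - 1], dims[dim1 - 1] = dims[dim1 - 1], dims[dim0 - 1]
        result.append(tuple(dims))
    return result


# Components of nt_reshaped have shapes (2, 2, 3) and (3, 2, 3).
print(transposed_component_shapes([(2, 2, 3), (3, 2, 3)], 1, 2))
# [(2, 2, 3), (2, 3, 3)]
```

After the transpose, the ragged size sits in the second component dimension, which is what allows the batched matrix multiplication in the next section to line up.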

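others

Other operations have the same semantics as for regular tensors. Applying an operation to a nested tensor is equivalent to applying it to the underlying tensor components, with the result being a nested tensor as well.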

nt_mm = torch.nested.nested_tensor([torch.randn((2, 3, 4)), torch.randn((2, 3, 5))], device=device)
nt3 = torch.matmul(nt_transposed, nt_mm)
print(f"Result of Matmul:\n {nt3}")

nt4 = F.dropout(nt3, 0.1)
print(f"Result of Dropout:\n {nt4}")

nt5 = F.softmax(nt4, -1)
print(f"Result of Softmax:\n {nt5}")

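Why Nested Tensor

When data is sequential, it is often the case that each sample has a different length. For example, in a batch of sentences, each sentence has a different number of words. A common technique for handling varying sequences is to manually pad each data tensor to the same shape in order to form a batch. For example, we have 2 sentences with different lengths and a vocabulary. In order to represent this as a single tensor, we pad with 0 to the max length in the batch.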

sentences = [["goodbye", "padding"],
             ["embrace", "nested", "tensor"]]
vocabulary = {"goodbye": 1.0, "padding": 2.0,
              "embrace": 3.0, "nested": 4.0, "tensor": 5.0}
padded_sentences = torch.tensor([[1.0, 2.0, 0.0],
                                 [3.0, 4.0, 5.0]])
nested_sentences = torch.nested.nested_tensor([torch.tensor([1.0, 2.0]),
                                               torch.tensor([3.0, 4.0, 5.0])])
print(f"{padded_sentences=}")
print(f"{nested_sentences=}")

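This technique of padding a batch of data to its max length is not optimal. The padded data is not needed for computation and wastes memory by allocating larger tensors than necessary. Further, not every operation has the same semantics when applied to padded data. For matrix multiplications, in order to ignore the padded entries, one needs to pad with 0, while for softmax one has to pad with -inf to ignore specific entries. The primary objective of nested tensors is to facilitate operations on ragged data using the standard PyTorch tensor UX, thereby eliminating the need for inefficient and complex padding and masking.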

padded_sentences_for_softmax = torch.tensor([[1.0, 2.0, float("-inf")],
                                             [3.0, 4.0, 5.0]])
print(F.softmax(padded_sentences_for_softmax, -1))
print(F.softmax(nested_sentences, -1))
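The difference between padding values can be checked without PyTorch. The following is a small sketch using a hand-written softmax (a local helper, not a library function): padding with -inf leaves the real entries untouched, while padding with 0 silently changes them.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats (illustrative helper)."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


unpadded = softmax([1.0, 2.0])
zero_padded = softmax([1.0, 2.0, 0.0])
inf_padded = softmax([1.0, 2.0, float("-inf")])

print(unpadded)
print(inf_padded)   # first two entries match the unpadded result; the last is 0.0
print(zero_padded)  # the padded slot receives probability mass, shifting the rest
```

With -inf the padded slot contributes exp(-inf) = 0 to the normalizer, so the real entries keep their probabilities. With 0 it contributes exp(0 - max), which is nonzero.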

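Let us take a look at a practical example: the multi-head attention component utilized in Transformers. We can implement this in such a way that it can operate on either padded or nested tensors.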

class MultiHeadAttention(nn.Module):
    """
    Computes multi-head attention. Supports nested or padded tensors.

    Args:
        E_q (int): Size of embedding dim for query
        E_k (int): Size of embedding dim for key
        E_v (int): Size of embedding dim for value
        E_total (int): Total embedding dim of combined heads post input projection. Each head
            has dim E_total // nheads
        nheads (int): Number of heads
        dropout_p (float, optional): Dropout probability. Default: 0.0
    """
    def __init__(self, E_q: int, E_k: int, E_v: int, E_total: int,
                 nheads: int, dropout_p: float = 0.0):
        super().__init__()
        self.nheads = nheads
        self.dropout_p = dropout_p
        self.query_proj = nn.Linear(E_q, E_total)
        self.key_proj = nn.Linear(E_k, E_total)
        self.value_proj = nn.Linear(E_v, E_total)
        E_out = E_q
        self.out_proj = nn.Linear(E_total, E_out)
        assert E_total % nheads == 0, "Embedding dim is not divisible by nheads"
        self.E_head = E_total // nheads

    def forward(self, query: torch.Tensor, key: torch.Tensor, value: torch.Tensor) -> torch.Tensor:
        """
        Forward pass; runs the following process:
            1. Apply input projection
            2. Split heads and prepare for SDPA
            3. Run SDPA
            4. Apply output projection

        Args:
            query (torch.Tensor): query of shape (N, L_t, E_q)
            key (torch.Tensor): key of shape (N, L_s, E_k)
            value (torch.Tensor): value of shape (N, L_s, E_v)

        Returns:
            attn_output (torch.Tensor): output of shape (N, L_t, E_q)
        """
        # Step 1. Apply input projection
        # TODO: demonstrate packed projection
        query = self.query_proj(query)
        key = self.key_proj(key)
        value = self.value_proj(value)

        # Step 2. Split heads and prepare for SDPA
        # reshape query, key, value to separate by head
        # (N, L_t, E_total) -> (N, L_t, nheads, E_head) -> (N, nheads, L_t, E_head)
        query = query.unflatten(-1, [self.nheads, self.E_head]).transpose(1, 2)
        # (N, L_s, E_total) -> (N, L_s, nheads, E_head) -> (N, nheads, L_s, E_head)
        key = key.unflatten(-1, [self.nheads, self.E_head]).transpose(1, 2)
        # (N, L_s, E_total) -> (N, L_s, nheads, E_head) -> (N, nheads, L_s, E_head)
        value = value.unflatten(-1, [self.nheads, self.E_head]).transpose(1, 2)

        # Step 3. Run SDPA
        # (N, nheads, L_t, E_head)
        # Use the module's own dropout setting rather than a global variable
        attn_output = F.scaled_dot_product_attention(
            query, key, value, dropout_p=self.dropout_p, is_causal=True)
        # (N, nheads, L_t, E_head) -> (N, L_t, nheads, E_head) -> (N, L_t, E_total)
        attn_output = attn_output.transpose(1, 2).flatten(-2)

        # Step 4. Apply output projection
        # (N, L_t, E_total) -> (N, L_t, E_out)
        attn_output = self.out_proj(attn_output)

        return attn_output

Set hyperparameters following the Transformer paper

N = 512
E_q, E_k, E_v, E_total = 512, 512, 512, 512
E_out = E_q
nheads = 8

Except for the dropout probability: set to 0 for the correctness check

dropout_p = 0.0

Let us generate some realistic fake data from Zipf's law.

def zipf_sentence_lengths(alpha: float, batch_size: int) -> torch.Tensor:
    # generate fake corpus by unigram Zipf distribution
    # from wikitext-2 corpus, we get rank "." = 3, "!" = 386, "?" = 858
    sentence_lengths = np.empty(batch_size, dtype=int)
    for ibatch in range(batch_size):
        sentence_lengths[ibatch] = 1
        word = np.random.zipf(alpha)
        while word != 3 and word != 386 and word != 858:
            sentence_lengths[ibatch] += 1
            word = np.random.zipf(alpha)
    return torch.tensor(sentence_lengths)

Create nested tensor batch inputs

def gen_batch(N, E_q, E_k, E_v, device):
    # generate semi-realistic data using Zipf distribution for sentence lengths
    sentence_lengths = zipf_sentence_lengths(alpha=1.2, batch_size=N)

    # Note: the torch.jagged layout is a nested tensor layout that supports a single ragged
    # dimension and works with torch.compile. The resulting nested tensors each have shape
    # (B, S*, D) where B = batch size, S* = ragged sequence length, and D = embedding dimension.
    query = torch.nested.nested_tensor([
        torch.randn(l.item(), E_q, device=device)
        for l in sentence_lengths
    ], layout=torch.jagged)

    key = torch.nested.nested_tensor([
        torch.randn(s.item(), E_k, device=device)
        for s in sentence_lengths
    ], layout=torch.jagged)

    value = torch.nested.nested_tensor([
        torch.randn(s.item(), E_v, device=device)
        for s in sentence_lengths
    ], layout=torch.jagged)

    return query, key, value, sentence_lengths

query, key, value, sentence_lengths = gen_batch(N, E_q, E_k, E_v, device)
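One common way to picture a layout with a single ragged dimension is a packed buffer of values plus an array of offsets marking where each sequence starts. The sketch below is a conceptual model under that assumption and is not taken from PyTorch's implementation. The names pack_jagged and component are hypothetical.

```python
def pack_jagged(sequences):
    """Pack variable-length sequences into one flat buffer plus offsets.

    Hypothetical helper for illustration only; not part of PyTorch.
    """
    values, offsets = [], [0]
    for seq in sequences:
        values.extend(seq)
        # Each offset records where the next sequence begins.
        offsets.append(len(values))
    return values, offsets


def component(values, offsets, i):
    """Recover the ith sequence from the packed representation."""
    return values[offsets[i]:offsets[i + 1]]


values, offsets = pack_jagged([[1.0, 2.0], [3.0, 4.0, 5.0]])
print(values)                         # [1.0, 2.0, 3.0, 4.0, 5.0]
print(offsets)                        # [0, 2, 5]
print(component(values, offsets, 1))  # [3.0, 4.0, 5.0]
```

No padding slots are stored in this representation: the buffer holds exactly as many entries as there are real tokens.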

Generate padded forms of query, key, value for comparison

def jagged_to_padded(jt, padding_val):
    # TODO: do jagged -> padded directly when this is supported
    return torch.nested.to_padded_tensor(
        torch.nested.nested_tensor(list(jt.unbind())),
        padding_val)

padded_query, padded_key, padded_value = (
    jagged_to_padded(t, 0.0) for t in (query, key, value)
)

Construct the model

mha = MultiHeadAttention(E_q, E_k, E_v, E_total, nheads, dropout_p).to(device=device)

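Check correctness and performance

def benchmark(func, *args, **kwargs):
    # Synchronize around the timed call so pending CUDA kernels are counted;
    # skip synchronization on CPU-only machines, where the call would fail.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    begin = timeit.default_timer()
    output = func(*args, **kwargs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    end = timeit.default_timer()
    return output, (end - begin)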

output_nested, time_nested = benchmark(mha, query, key, value)
output_padded, time_padded = benchmark(mha, padded_query, padded_key, padded_value)

# padding-specific step: remove output projection bias from padded entries for fair comparison
for i, entry_length in enumerate(sentence_lengths):
    output_padded[i, entry_length:] = 0.0

print("=== without torch.compile ===")
print("nested and padded calculations differ by", (jagged_to_padded(output_nested, 0.0) - output_padded).abs().max().item())
print("nested tensor multi-head attention takes", time_nested, "seconds")
print("padded tensor multi-head attention takes", time_padded, "seconds")

# warm up compile first...
compiled_mha = torch.compile(mha)
compiled_mha(query, key, value)
# ...now benchmark
compiled_output_nested, compiled_time_nested = benchmark(
    compiled_mha, query, key, value)

# warm up compile first...
compiled_mha(padded_query, padded_key, padded_value)
# ...now benchmark
compiled_output_padded, compiled_time_padded = benchmark(
    compiled_mha, padded_query, padded_key, padded_value)

# padding-specific step: remove output projection bias from padded entries for fair comparison
for i, entry_length in enumerate(sentence_lengths):
    compiled_output_padded[i, entry_length:] = 0.0

print("=== with torch.compile ===")
print("nested and padded calculations differ by", (jagged_to_padded(compiled_output_nested, 0.0) - compiled_output_padded).abs().max().item())
print("nested tensor multi-head attention takes", compiled_time_nested, "seconds")
print("padded tensor multi-head attention takes", compiled_time_padded, "seconds")
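A single timed call can be noisy. One possible refinement, sketched here under the assumption that the callable is safe to run repeatedly, is to warm up first and report the median over several runs. The helper benchmark_repeated is hypothetical and is not part of the tutorial's benchmark.

```python
import statistics
import timeit


def benchmark_repeated(func, *args, warmup=1, repeats=5, **kwargs):
    """Return the last output and the median wall-clock time over several runs.

    Hypothetical helper; it does not synchronize any accelerator, so for
    CUDA work the caller would still need to synchronize around each call.
    """
    for _ in range(warmup):
        func(*args, **kwargs)
    timings = []
    output = None
    for _ in range(repeats):
        begin = timeit.default_timer()
        output = func(*args, **kwargs)
        timings.append(timeit.default_timer() - begin)
    return output, statistics.median(timings)


# Stand-in workload; in the tutorial this would be the attention module.
output, seconds = benchmark_repeated(sum, range(100_000))
print(output, seconds)
```

The median is less sensitive than a single measurement to one-off stalls such as lazy initialization or recompilation.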

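Note that without torch.compile, the overhead of the Python subclass nested tensor can make it slower than the equivalent computation on padded tensors. However, once torch.compile is enabled, operating on nested tensors can give a multiple-x speedup. Avoiding wasted computation on padding becomes only more valuable as the percentage of padding in the batch increases.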

print(f"Nested speedup: {compiled_time_padded / compiled_time_nested:.3f}")
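The padding percentage mentioned above can be estimated from the sequence lengths alone. This is a minimal sketch with a hypothetical helper padding_fraction. With the tutorial's data it could be called as padding_fraction(sentence_lengths.tolist()).

```python
def padding_fraction(lengths):
    """Fraction of slots that are padding when padding to the batch maximum.

    Hypothetical helper for illustration only.
    """
    total_slots = len(lengths) * max(lengths)
    return 1.0 - sum(lengths) / total_slots


print(padding_fraction([2, 3]))        # one padded slot out of six
print(padding_fraction([1, 1, 1, 9]))  # 24 of 36 slots are padding
```

A single long sequence in a batch of short ones drives the fraction up quickly, which is the regime where skipping padded entries pays off most.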

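Conclusion

In this tutorial, we have learned how to perform basic operations with nested tensors and how to implement multi-head attention for transformers in a way that avoids computation on padding. For more information, check out the docs for the torch.nested namespace: https://pytorch.org/docs/stable/nested.html

See Also

Accelerating PyTorch Transformers by replacing nn.Transformer with Nested Tensors and torch.compile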
