(Prototype) Efficiently writing "sparse" semantics for Adagrad with MaskedTensor

Created On: Oct 28, 2022 | Last Updated: Oct 28, 2022 | Last Verified: Not Verified

Before working through this tutorial, please review the MaskedTensor `Overview <https://pytorch.org/tutorials/prototype/maskedtensor_overview.html>`_ and `Sparsity <https://pytorch.org/tutorials/prototype/maskedtensor_sparsity.html>`_ tutorials.

Introduction and Motivation

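`Issue 1369 <https://github.com/pytorch/pytorch/issues/1369>`_ discussed the additional lines of code that were introduced while writing "sparse" semantics for Adagrad. In reality, that code uses sparsity as a proxy for masked semantics rather than for the intended use case of sparsity: a compression and optimization technique. Previously, we worked around the lack of formal masked semantics by introducing one-off semantics and operators, while forcing users to be aware of storage details such as indices and values.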

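Now that we have masked semantics, we are better equipped to point out when sparsity is being used as a semantic extension. We will also compare and contrast this with equivalent code written using MaskedTensor. The code snippets are shown both with and without additional comments to show the difference in brevity.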

Preparation

import torch
import warnings

# Disable prototype warnings and such
warnings.filterwarnings(action='ignore', category=UserWarning)

# Some hyperparameters
eps = 1e-10
clr = 0.1

i = torch.tensor([[0, 1, 1], [2, 0, 2]])
v = torch.tensor([3, 4, 5], dtype=torch.float32)
grad = torch.sparse_coo_tensor(i, v, [2, 4])
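
To make the rest of the tutorial easier to follow, it can help to see what this gradient looks like once densified. The short sketch below rebuilds the same ``grad`` and prints its dense form; the names ``dense_grad`` and ``specified`` are introduced here only for illustration and are not part of the tutorial code.

```python
import torch

# Same indices and values as the tutorial's gradient.
i = torch.tensor([[0, 1, 1], [2, 0, 2]])
v = torch.tensor([3, 4, 5], dtype=torch.float32)
grad = torch.sparse_coo_tensor(i, v, [2, 4])

# Densify only for inspection: three of the eight positions are specified.
dense_grad = grad.to_dense()
print(dense_grad)
# tensor([[0., 0., 3., 0.],
#         [4., 0., 5., 0.]])

# Boolean view of which positions carry a nonzero value.
specified = dense_grad != 0
print(int(specified.sum()))  # 3
```

Only positions (0, 2), (1, 0), and (1, 2) carry a gradient; the rest of the tutorial is about treating the other five positions as unspecified rather than as zeros.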

Simpler Code with MaskedTensor

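Before we get too far into the details, let's introduce the problem more concretely. We will look at the `Adagrad (functional) <https://github.com/pytorch/pytorch/blob/6c2f235d368b697072699e5ca9485fd97d0b9bcc/torch/optim/_functional.py#L16-L51>`_ implementation in PyTorch, with the ultimate goal of simplifying it and representing the masked approach more faithfully.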

For reference, this is the regular, dense code path without masked gradients or sparsity:

state_sum.addcmul_(grad, grad, value=1)
std = state_sum.sqrt().add_(eps)
param.addcdiv_(grad, std, value=-clr)
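
The three lines above reference ``state_sum``, ``param``, and ``grad`` without defining them. As a minimal runnable sketch, assuming the same initial values used later in this tutorial and a dense stand-in for the gradient, one dense Adagrad step looks like this:

```python
import torch

eps = 1e-10
clr = 0.1

# Assumed setup: same initial values the tutorial uses later on.
param = torch.arange(8).reshape(2, 4).float()
state_sum = torch.full_like(param, 0.5)
# Dense stand-in for the tutorial's sparse gradient.
grad = torch.tensor([[0.0, 0.0, 3.0, 0.0],
                     [4.0, 0.0, 5.0, 0.0]])

state_sum.addcmul_(grad, grad, value=1)  # state_sum += grad * grad
std = state_sum.sqrt().add_(eps)         # eps is added to every element
param.addcdiv_(grad, std, value=-clr)    # param -= clr * grad / std

print(state_sum)
print(param)
```

In this dense path ``eps`` touches all eight elements of ``std``, and positions where the gradient is zero leave ``param`` unchanged only because ``0 / std`` happens to be zero.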

The vanilla tensor implementation for sparse is:

def _make_sparse(grad, grad_indices, values):
    size = grad.size()
    if grad_indices.numel() == 0 or values.numel() == 0:
        return torch.empty_like(grad)
    return torch.sparse_coo_tensor(grad_indices, values, size)

grad = grad.coalesce()  # the update is non-linear so indices must be unique
grad_indices = grad._indices()
grad_values = grad._values()

state_sum.add_(_make_sparse(grad, grad_indices, grad_values.pow(2)))   # a different _make_sparse per layout
std = state_sum.sparse_mask(grad)
std_values = std._values().sqrt_().add_(eps)
param.add_(_make_sparse(grad, grad_indices, grad_values / std_values), alpha=-clr)

MaskedTensor, by contrast, reduces the code to the following snippet:

state_sum2 = state_sum2 + masked_grad.pow(2).get_data()
std2 = masked_tensor(state_sum2.to_sparse(), mask)
std2 = std2.sqrt().add(eps)
param2 = param2.add((masked_grad / std2).get_data(), alpha=-clr)

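In this tutorial we will go through each implementation line by line. At first glance, though, we can already notice (1) how much shorter the MaskedTensor implementation is, and (2) how it avoids conversions between dense and sparse tensors.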

Original Sparse Implementation

Now, let's break down the code with some inline comments:

def _make_sparse(grad, grad_indices, values):
    size = grad.size()
    if grad_indices.numel() == 0 or values.numel() == 0:
        return torch.empty_like(grad)
    return torch.sparse_coo_tensor(grad_indices, values, size)

# We don't support sparse gradients
param = torch.arange(8).reshape(2, 4).float()
state_sum = torch.full_like(param, 0.5)  # initial value for state sum

grad = grad.coalesce()  # the update is non-linear so indices must be unique
grad_indices = grad._indices()
grad_values = grad._values()
# pow(2) has the same semantics for both sparse and dense memory layouts since 0^2 is zero
state_sum.add_(_make_sparse(grad, grad_indices, grad_values.pow(2)))

# We take care to make std sparse, even though state_sum clearly is not.
# This means that we're only applying the gradient to parts of the state_sum
# for which it is specified. This further drives the point home that the passed gradient is not sparse, but masked.
# We currently dodge all these concerns using the private method `_values`.
std = state_sum.sparse_mask(grad)
std_values = std._values().sqrt_().add_(eps)

# Note here that we currently don't support div for sparse Tensors because zero / zero is not well defined,
# so we're forced to perform `grad_values / std_values` outside the sparse semantic and then convert back to a
# sparse tensor with `_make_sparse`.
# We'll later see that MaskedTensor will actually handle these operations for us as well as properly denote
# undefined / undefined = undefined!
param.add_(_make_sparse(grad, grad_indices, grad_values / std_values), alpha=-clr)

The third-to-last line, ``std = state_sum.sparse_mask(grad)``, is where we have a very important divergence.

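Technically, ``eps`` should be applied to all values, but here it is applied only to the specified values. We are using sparsity as a semantic extension to enforce a certain pattern of defined and undefined values. If some values of the gradient are zero, they are still included when they are materialized, even though other sparse storage layouts could compress them away. This is theoretically quite brittle! That said, one could argue that ``eps`` is always very small, so it might not matter much in practice.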
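
The brittleness described above can be sketched in a few lines. The example below is illustrative only: it reuses the tutorial's indices but swaps one stored value for an explicit zero (an assumption made here to show the effect), and the names ``n_specified`` and ``n_nonzero`` are hypothetical.

```python
import torch

eps = 1e-10
state_sum = torch.full((2, 4), 0.5)

# Same indices as the tutorial's grad, but one stored value is an explicit zero.
i = torch.tensor([[0, 1, 1], [2, 0, 2]])
v = torch.tensor([3.0, 0.0, 5.0])
grad = torch.sparse_coo_tensor(i, v, [2, 4]).coalesce()

# coalesce() merges duplicate indices; it does not drop stored zeros.
n_specified = grad.values().numel()
n_nonzero = int((grad.to_dense() != 0).sum())
print(n_specified, n_nonzero)

# sparse_mask keeps only the entries of state_sum at grad's indices,
# so eps is added to the specified values rather than to all 8 elements.
std = state_sum.sparse_mask(grad).coalesce()
std_values = std.values().sqrt().add(eps)
print(std_values.numel())
```

The set of "defined" positions is whatever happens to be stored in the COO indices, which is a property of the storage rather than of the values themselves.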

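Moreover, an implementation of ``add_`` that treats sparsity as a storage layout and compression scheme should cause densification, but we force it not to for the sake of performance. For this one-off case that is fine... until we want to introduce new compression schemes, such as `CSC <https://pytorch.org/docs/master/sparse.html#sparse-csc-docs>`__, `BSR <https://pytorch.org/docs/master/sparse.html#sparse-bsr-docs>`__, or BSC. At that point we would need to introduce a separate Tensor type for each and write variations for gradients compressed with different storage formats, which is inconvenient, does not scale well, and is not clean.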

MaskedTensor Sparse Implementation

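We have been conflating sparsity as an optimization with sparsity as a semantic extension to PyTorch. MaskedTensor proposes to disentangle the sparsity optimization from the semantic extension; for example, we currently cannot have dense semantics with sparse storage, or masked semantics with dense storage. MaskedTensor enables these ideas by purposefully separating the storage from the semantics.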
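
As a small sketch of "masked semantics with dense storage", the example below builds a MaskedTensor from an ordinary strided tensor and a boolean mask, then divides it by itself. The data values are chosen here only for illustration; the point is that the masked-out position stays undefined instead of surfacing ``0 / 0``.

```python
import warnings

import torch
from torch.masked import masked_tensor

# MaskedTensor is a prototype feature and emits UserWarnings.
warnings.filterwarnings(action="ignore", category=UserWarning)

# Dense (strided) storage with a dense boolean mask: True means "specified".
data = torch.tensor([0.0, 2.0, 4.0])
mask = torch.tensor([False, True, True])
mt = masked_tensor(data, mask)

# Both operands share the same mask, so the result keeps that mask.
ratio = mt / mt
result_mask = ratio.get_mask()

# Fill unspecified positions with 0.0 to get a plain tensor back.
filled = ratio.to_tensor(0.0)
print(result_mask)
print(filled)
```

No sparse layout is involved here at all, which is the separation of storage from semantics that the paragraph above describes.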

Consider the example above, now written with a masked gradient:

# Let's now import MaskedTensor!
from torch.masked import masked_tensor

# Create an entirely new set of parameters to avoid errors
param2 = torch.arange(8).reshape(2, 4).float()
state_sum2 = torch.full_like(param2, 0.5)  # initial value for state sum

mask = (grad.to_dense() != 0).to_sparse()
masked_grad = masked_tensor(grad, mask)

state_sum2 = state_sum2 + masked_grad.pow(2).get_data()
std2 = masked_tensor(state_sum2.to_sparse(), mask)

# We can add support for in-place operations later. Notice how this doesn't
# need to access any storage internals and is in general a lot shorter
std2 = std2.sqrt().add(eps)

param2 = param2.add((masked_grad / std2).get_data(), alpha=-clr)

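Note that the implementations look quite similar, but the MaskedTensor implementation is shorter and simpler. In particular, much of the boilerplate code around ``_make_sparse`` (and the need for a separate implementation per layout) is handled for the user by :class:`MaskedTensor`.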

At this point, let's print both this version and the original version for easier comparison:

print("state_sum:\n", state_sum)
print("state_sum2:\n", state_sum2)

print("std:\n", std)
print("std2:\n", std2)

print("param:\n", param)
print("param2:\n", param2)
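
Beyond eyeballing the printed tensors, one option is to compare them against a reference computed with plain dense operations and a boolean mask. The sketch below is self-contained and is not part of the original tutorial; ``expected_param`` and ``step`` are hypothetical names. If it is run after the code above, ``param`` and ``param2`` should agree with ``expected_param`` up to floating-point tolerance (for example via ``torch.allclose``).

```python
import torch

eps = 1e-10
clr = 0.1

# Rebuild the tutorial's gradient and densify it for the reference computation.
i = torch.tensor([[0, 1, 1], [2, 0, 2]])
v = torch.tensor([3, 4, 5], dtype=torch.float32)
grad = torch.sparse_coo_tensor(i, v, [2, 4]).to_dense()
mask = grad != 0  # True where the gradient is specified

param = torch.arange(8).reshape(2, 4).float()
state_sum = torch.full_like(param, 0.5)

# Reference update: only specified positions contribute a step.
state_sum = state_sum + grad.pow(2)
std = state_sum.sqrt() + eps
step = torch.where(mask, grad / std, torch.zeros_like(grad))
expected_param = param - clr * step

print(expected_param)
```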

Conclusion

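In this tutorial, we discussed how native masked semantics can enable a cleaner developer experience for Adagrad's existing implementation in PyTorch, which used sparsity as a proxy for writing masked semantics. More importantly, making masked semantics a first-class citizen through MaskedTensor removes the reliance on sparsity or on unreliable hacks to mimic masking. This allows masking and sparsity to be developed independently, while still enabling sparse semantics such as the one shown here.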

Further Reading

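To continue learning more, you can find our final review (for now) in `MaskedTensor Advanced Semantics <https://pytorch.org/tutorials/prototype/maskedtensor_advanced_semantics.html>`__, which covers some of the differences in design decisions between :class:`MaskedTensor` and NumPy's MaskedArray, as well as reduction semantics.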
