{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# For tips on running notebooks in Google Colab, see\n# https://codelin.vip/beginner/colab\n%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "(Prototype) Efficiently writing \\\"sparse\\\" semantics for Adagrad with MaskedTensor\n==================================================================================\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Before working through this tutorial, please review the MaskedTensor\n[Overview](https://pytorch.org/tutorials/prototype/maskedtensor_overview.html)\nand\n[Sparsity](https://pytorch.org/tutorials/prototype/maskedtensor_sparsity.html)\ntutorials.\n\nIntroduction and Motivation\n===========================\n\n[Issue 1369](https://github.com/pytorch/pytorch/issues/1369) discussed\nthe additional lines of code that were introduced while writing\n\\\"sparse\\\" semantics for Adagrad, but really, the code uses sparsity as\na proxy for masked semantics rather than the intended use case of\nsparsity: a compression and optimization technique. Previously, we\nworked around the lack of formal masked semantics by introducing one-off\nsemantics and operators while forcing users to be aware of storage\ndetails such as indices and values.\n\nNow that we have masked semantics, we are better equipped to point out\nwhen sparsity is used as a semantic extension. We\\'ll also compare and\ncontrast this with equivalent code written using MaskedTensor. In the\nend the code snippets are repeated without additional comments to show\nthe difference in brevity.\n\nPreparation\n===========\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import torch\nimport warnings\n\n# Disable prototype warnings and such\nwarnings.filterwarnings(action='ignore', category=UserWarning)\n\n# Some hyperparameters\neps = 1e-10\nclr = 0.1\n\ni = torch.tensor([[0, 1, 1], [2, 0, 2]])\nv = torch.tensor([3, 4, 5], dtype=torch.float32)\ngrad = torch.sparse_coo_tensor(i, v, [2, 4])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Simpler Code with MaskedTensor\n==============================\n\nBefore we get too far in the weeds, let\\'s introduce the problem a bit\nmore concretely. We will be taking a look into the [Adagrad\n(functional)](https://github.com/pytorch/pytorch/blob/6c2f235d368b697072699e5ca9485fd97d0b9bcc/torch/optim/_functional.py#L16-L51)\nimplementation in PyTorch with the ultimate goal of simplifying and more\nfaithfully representing the masked approach.\n\nFor reference, this is the regular, dense code path without masked\ngradients or sparsity:\n\n``` {.python}\nstate_sum.addcmul_(grad, grad, value=1)\nstd = state_sum.sqrt().add_(eps)\nparam.addcdiv_(grad, std, value=-clr)\n```\n\nThe vanilla tensor implementation for sparse is:\n\n``` {.python}\ndef _make_sparse(grad, grad_indices, values):\n    size = grad.size()\n    if grad_indices.numel() == 0 or values.numel() == 0:\n        return torch.empty_like(grad)\n    return torch.sparse_coo_tensor(grad_indices, values, size)\n\ngrad = grad.coalesce()  # the update is non-linear so indices must be unique\ngrad_indices = grad._indices()\ngrad_values = grad._values()\n\nstate_sum.add_(_make_sparse(grad, grad_indices, grad_values.pow(2)))   # a different _make_sparse per layout\nstd = state_sum.sparse_mask(grad)\nstd_values = std._values().sqrt_().add_(eps)\nparam.add_(_make_sparse(grad, grad_indices, grad_values / std_values), alpha=-clr)\n```\n\nwhile `MaskedTensor`{.interpreted-text role=\"class\"} minimizes the code\nto the snippet:\n\n``` {.python}\nstate_sum2 = state_sum2 + masked_grad.pow(2).get_data()\nstd2 = masked_tensor(state_sum2.to_sparse(), mask)\nstd2 = std2.sqrt().add(eps)\nparam2 = param2.add((masked_grad / std2).get_data(), alpha=-clr)\n```\n\nIn this tutorial, we will go through each implementation line by line,\nbut at first glance, we can notice (1) how much shorter the MaskedTensor\nimplementation is, and (2) how it avoids conversions between dense and\nsparse tensors.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Original Sparse Implementation\n==============================\n\nNow, let\\'s break down the code with some inline comments:\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def _make_sparse(grad, grad_indices, values):\n    size = grad.size()\n    if grad_indices.numel() == 0 or values.numel() == 0:\n        return torch.empty_like(grad)\n    return torch.sparse_coo_tensor(grad_indices, values, size)\n\n# We don't support sparse gradients\nparam = torch.arange(8).reshape(2, 4).float()\nstate_sum = torch.full_like(param, 0.5)  # initial value for state sum\n\ngrad = grad.coalesce()  # the update is non-linear so indices must be unique\ngrad_indices = grad._indices()\ngrad_values = grad._values()\n# pow(2) has the same semantics for both sparse and dense memory layouts since 0^2 is zero\nstate_sum.add_(_make_sparse(grad, grad_indices, grad_values.pow(2)))\n\n# We take care to make std sparse, even though state_sum clearly is not.\n# This means that we're only applying the gradient to parts of the state_sum\n# for which it is specified. This further drives the point home that the passed gradient is not sparse, but masked.\n# We currently dodge all these concerns using the private method `_values`.\nstd = state_sum.sparse_mask(grad)\nstd_values = std._values().sqrt_().add_(eps)\n\n# Note here that we currently don't support div for sparse Tensors because zero / zero is not well defined,\n# so we're forced to perform `grad_values / std_values` outside the sparse semantic and then convert back to a\n# sparse tensor with `make_sparse`.\n# We'll later see that MaskedTensor will actually handle these operations for us as well as properly denote\n# undefined / undefined = undefined!\nparam.add_(_make_sparse(grad, grad_indices, grad_values / std_values), alpha=-clr)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The third to last line \\-- [std =\nstate\\_sum.sparse\\_mask(grad)]{.title-ref} \\-- is where we have a very\nimportant divergence.\n\nThe addition of eps should technically be applied to all values but\ninstead is only applied to specified values. Here we\\'re using sparsity\nas a semantic extension and to enforce a certain pattern of defined and\nundefined values. If parts of the values of the gradient are zero, they\nare still included if materialized even though they could be compressed\nby other sparse storage layouts. This is theoretically quite brittle!\nThat said, one could argue that eps is always very small, so it might\nnot matter so much in practice.\n\nMoreover, an implementation [add\\_]{.title-ref} for sparsity as a\nstorage layout and compression scheme should cause densification, but we\nforce it not to for performance. For this one-off case it is fine..\nuntil we want to introduce new compression scheme, such as\n[CSC](https://pytorch.org/docs/master/sparse.html#sparse-csc-docs),\n[BSR](https://pytorch.org/docs/master/sparse.html#sparse-bsr-docs), or\n[BSC](https://pytorch.org/docs/master/sparse.html#sparse-bsc-docs). We\nwill then need to introduce separate Tensor types for each and write\nvariations for gradients compressed using different storage formats,\nwhich is inconvenient and not quite scalable nor clean.\n\nMaskedTensor Sparse Implementation\n==================================\n\nWe\\'ve been conflating sparsity as an optimization with sparsity as a\nsemantic extension to PyTorch. MaskedTensor proposes to disentangle the\nsparsity optimization from the semantic extension; for example,\ncurrently we can\\'t have dense semantics with sparse storage or masked\nsemantics with dense storage. MaskedTensor enables these ideas by\npurposefully separating the storage from the semantics.\n\nConsider the above example using a masked gradient:\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Let's now import MaskedTensor!\nfrom torch.masked import masked_tensor\n\n# Create an entirely new set of parameters to avoid errors\nparam2 = torch.arange(8).reshape(2, 4).float()\nstate_sum2 = torch.full_like(param, 0.5)  # initial value for state sum\n\nmask = (grad.to_dense() != 0).to_sparse()\nmasked_grad = masked_tensor(grad, mask)\n\nstate_sum2 = state_sum2 + masked_grad.pow(2).get_data()\nstd2 = masked_tensor(state_sum2.to_sparse(), mask)\n\n# We can add support for in-place operations later. Notice how this doesn't\n# need to access any storage internals and is in general a lot shorter\nstd2 = std2.sqrt().add(eps)\n\nparam2 = param2.add((masked_grad / std2).get_data(), alpha=-clr)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Note that the implementations look quite similar, but the MaskedTensor\nimplementation is shorter and simpler. In particular, much of the\nboilerplate code around `_make_sparse` (and needing to have a separate\nimplementation per layout) is handled for the user with\n`MaskedTensor`{.interpreted-text role=\"class\"}.\n\nAt this point, let\\'s print both this version and original version for\neasier comparison:\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(\"state_sum:\\n\", state_sum)\nprint(\"state_sum2:\\n\", state_sum2)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(\"std:\\n\", std)\nprint(\"std2:\\n\", std2)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(\"param:\\n\", param)\nprint(\"param2:\\n\", param2)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Conclusion\n==========\n\nIn this tutorial, we\\'ve discussed how native masked semantics can\nenable a cleaner developer experience for Adagrad\\'s existing\nimplementation in PyTorch, which used sparsity as a proxy for writing\nmasked semantics. But more importantly, allowing masked semantics to be\na first class citizen through MaskedTensor removes the reliance on\nsparsity or unreliable hacks to mimic masking, thereby allowing for\nproper independence and development, while enabling sparse semantics,\nsuch as this one.\n\nFurther Reading\n===============\n\nTo continue learning more, you can find our final review (for now) on\n[MaskedTensor Advanced\nSemantics](https://pytorch.org/tutorials/prototype/maskedtensor_advanced_semantics.html)\nto see some of the differences in design decisions between\n`MaskedTensor`{.interpreted-text role=\"class\"} and NumPy\\'s MaskedArray,\nas well as reduction semantics.\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.12"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}