{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# For tips on running notebooks in Google Colab, see\n# https://pytorch.org/tutorials/beginner/colab\n%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Introduction](introyt1_tutorial.html) \\|\\|\n[Tensors](tensors_deeper_tutorial.html) \\|\\| **Autograd** \\|\\| [Building\nModels](modelsyt_tutorial.html) \\|\\| [TensorBoard\nSupport](tensorboardyt_tutorial.html) \\|\\| [Training\nModels](trainingyt.html) \\|\\| [Model Understanding](captumyt.html)\n\nThe Fundamentals of Autograd\n============================\n\nFollow along with the video below or on\n[youtube](https://www.youtube.com/watch?v=M0fX15_-xrY).\n\n``` {.python .jupyter-code-cell}\nfrom IPython.display import display, HTML\nhtml_code = \"\"\"\n
<div style=\"margin-top:10px; margin-bottom:10px;\">\n  <iframe width=\"560\" height=\"315\" src=\"https://www.youtube.com/embed/M0fX15_-xrY\" frameborder=\"0\" allow=\"accelerometer; encrypted-media; gyroscope; picture-in-picture\" allowfullscreen></iframe>\n</div>
\n\"\"\"\ndisplay(HTML(html_code))\n```\n\nPyTorch's *Autograd* feature is part of what make PyTorch flexible and\nfast for building machine learning projects. It allows for the rapid and\neasy computation of multiple partial derivatives (also referred to as\n*gradients)* over a complex computation. This operation is central to\nbackpropagation-based neural network learning.\n\nThe power of autograd comes from the fact that it traces your\ncomputation dynamically *at runtime,* meaning that if your model has\ndecision branches, or loops whose lengths are not known until runtime,\nthe computation will still be traced correctly, and you'll get correct\ngradients to drive learning. This, combined with the fact that your\nmodels are built in Python, offers far more flexibility than frameworks\nthat rely on static analysis of a more rigidly-structured model for\ncomputing gradients.\n\nWhat Do We Need Autograd For?\n-----------------------------\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A machine learning model is a *function*, with inputs and outputs. For\nthis discussion, we'll treat the inputs as an *i*-dimensional vector\n$\\vec{x}$, with elements $x_{i}$. We can then express the model, *M*, as\na vector-valued function of the input: $\\vec{y} =\n\\vec{M}(\\vec{x})$. (We treat the value of M's output as a vector because\nin general, a model may have any number of outputs.)\n\nSince we'll mostly be discussing autograd in the context of training,\nour output of interest will be the model's loss. The *loss function*\nL($\\vec{y}$) = L($\\vec{M}$($\\vec{x}$)) is a single-valued scalar\nfunction of the model's output. This function expresses how far off our\nmodel's prediction was from a particular input's *ideal* output. *Note:\nAfter this point, we will often omit the vector sign where it should be\ncontextually clear - e.g.,* $y$ instead of $\\vec y$.\n\nIn training a model, we want to minimize the loss. In the idealized case\nof a perfect model, that means adjusting its learning weights - that is,\nthe adjustable parameters of the function - such that loss is zero for\nall inputs. In the real world, it means an iterative process of nudging\nthe learning weights until we see that we get a tolerable loss for a\nwide variety of inputs.\n\nHow do we decide how far and in which direction to nudge the weights? We\nwant to *minimize* the loss, which means making its first derivative\nwith respect to the input equal to 0:\n$\\frac{\\partial L}{\\partial x} = 0$.\n\nRecall, though, that the loss is not *directly* derived from the input,\nbut a function of the model's output (which is a function of the input\ndirectly), $\\frac{\\partial L}{\\partial x}$ =\n$\\frac{\\partial {L({\\vec y})}}{\\partial x}$. By the chain rule of\ndifferential calculus, we have\n$\\frac{\\partial {L({\\vec y})}}{\\partial x}$ =\n$\\frac{\\partial L}{\\partial y}\\frac{\\partial y}{\\partial x}$ =\n$\\frac{\\partial L}{\\partial y}\\frac{\\partial M(x)}{\\partial x}$.\n\n$\\frac{\\partial M(x)}{\\partial x}$ is where things get complex. The\npartial derivatives of the model's outputs with respect to its inputs,\nif we were to expand the expression using the chain rule again, would\ninvolve many local partial derivatives over every multiplied learning\nweight, every activation function, and every other mathematical\ntransformation in the model. 
The full expression for each such partial\nderivative is a sum over *every possible path* through the computation\ngraph that ends with the variable whose gradient we are trying to\nmeasure, where each path contributes the product of the local\nderivatives along it.\n\nIn particular, the gradients over the learning weights are of interest\nto us - they tell us *what direction to change each weight* to get the\nloss function closer to zero.\n\nThe number of such local derivatives (each corresponding to a separate\npath through the model's computation graph) tends to go up exponentially\nwith the depth of a neural network, and so does the complexity of\ncomputing them. This is where autograd comes in: It tracks the\nhistory of every computation. Every computed tensor in your PyTorch\nmodel carries a history of its input tensors and the function used to\ncreate it. Combined with the fact that PyTorch functions meant to act on\ntensors each have a built-in implementation for computing their own\nderivatives, this greatly speeds the computation of the local\nderivatives needed for learning.\n\nA Simple Example\n================\n\nThat was a lot of theory - but what does it look like to use autograd in\npractice?\n\nLet's start with a straightforward example. First, we'll do some imports\nto let us graph our results:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# %matplotlib inline\n\nimport torch\n\nimport matplotlib.pyplot as plt\nimport matplotlib.ticker as ticker\nimport math" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we'll create an input tensor full of evenly spaced values on the\ninterval $[0, 2{\pi}]$, and specify `requires_grad=True`. (Like most\nfunctions that create tensors, `torch.linspace()` accepts an optional\n`requires_grad` argument.) Setting this flag means that in every\ncomputation that follows, autograd will be accumulating the history of\nthe computation in the output tensors of that computation.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "a = torch.linspace(0., 2. * math.pi, steps=25, requires_grad=True)\nprint(a)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we'll perform a computation, and plot its output in terms of its\ninputs:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "b = torch.sin(a)\nplt.plot(a.detach(), b.detach())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's have a closer look at the tensor `b`. When we print it, we see an\nindicator that it is tracking its computation history:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(b)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This `grad_fn` gives us a hint that when we execute the backpropagation\nstep and compute gradients, we'll need to compute the derivative of\n$\sin(x)$ for all this tensor's inputs.\n\nLet's perform some more computations:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "c = 2 * b\nprint(c)\n\nd = c + 1\nprint(d)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, let's compute a single-element output. 
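In terms of the computation above, that single element will be\n$out = \sum_{i} d_{i} = \sum_{i} (2 \sin(a_{i}) + 1)$, so each partial\nderivative $\frac{\partial out}{\partial a_{i}}$ should come out to\n$2 \cos(a_{i})$ - a prediction we can check against autograd shortly. 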
When you call\n`.backward()` on a tensor with no arguments, it expects the calling\ntensor to contain only a single element, as is the case when computing a\nloss function.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "out = d.sum()\nprint(out)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each `grad_fn` stored with our tensors allows you to walk the\ncomputation all the way back to its inputs with its `next_functions`\nproperty. We can see below that drilling down on this property on `d`\nshows us the gradient functions for all the prior tensors. Note that\n`a.grad_fn` is reported as `None`, indicating that this was an input to\nthe function with no history of its own.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print('d:')\nprint(d.grad_fn)\nprint(d.grad_fn.next_functions)\nprint(d.grad_fn.next_functions[0][0].next_functions)\nprint(d.grad_fn.next_functions[0][0].next_functions[0][0].next_functions)\nprint(d.grad_fn.next_functions[0][0].next_functions[0][0].next_functions[0][0].next_functions)\nprint('\\nc:')\nprint(c.grad_fn)\nprint('\\nb:')\nprint(b.grad_fn)\nprint('\\na:')\nprint(a.grad_fn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With all this machinery in place, how do we get derivatives out? You\ncall the `backward()` method on the output, and check the input's `grad`\nproperty to inspect the gradients:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "out.backward()\nprint(a.grad)\nplt.plot(a.detach(), a.grad.detach())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall the computation steps we took to get here:\n\n``` {.python}\na = torch.linspace(0., 2. * math.pi, steps=25, requires_grad=True)\nb = torch.sin(a)\nc = 2 * b\nd = c + 1\nout = d.sum()\n```\n\nAdding a constant, as we did to compute `d`, does not change the\nderivative. That leaves $c = 2 * b = 2 * \\sin(a)$, the derivative of\nwhich should be $2 * \\cos(a)$. Looking at the graph above, that's just\nwhat we see.\n\nBe aware that only *leaf nodes* of the computation have their gradients\ncomputed. If you tried, for example, `print(c.grad)` you'd get back\n`None`. In this simple example, only the input is a leaf node, so only\nit has gradients computed.\n\nAutograd in Training\n====================\n\nWe've had a brief look at how autograd works, but how does it look when\nit's used for its intended purpose? Let's define a small model and\nexamine how it changes after a single training batch. 
First, define a\nfew constants, our model, and some stand-ins for inputs and outputs:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "BATCH_SIZE = 16\nDIM_IN = 1000\nHIDDEN_SIZE = 100\nDIM_OUT = 10\n\nclass TinyModel(torch.nn.Module):\n\n def __init__(self):\n super(TinyModel, self).__init__()\n \n self.layer1 = torch.nn.Linear(DIM_IN, HIDDEN_SIZE)\n self.relu = torch.nn.ReLU()\n self.layer2 = torch.nn.Linear(HIDDEN_SIZE, DIM_OUT)\n \n def forward(self, x):\n x = self.layer1(x)\n x = self.relu(x)\n x = self.layer2(x)\n return x\n \nsome_input = torch.randn(BATCH_SIZE, DIM_IN, requires_grad=False)\nideal_output = torch.randn(BATCH_SIZE, DIM_OUT, requires_grad=False)\n\nmodel = TinyModel()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One thing you might notice is that we never specify `requires_grad=True`\nfor the model's layers. Within a subclass of `torch.nn.Module`, it's\nassumed that we want to track gradients on the layers' weights for\nlearning.\n\nIf we look at the layers of the model, we can examine the values of the\nweights, and verify that no gradients have been computed yet:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(model.layer2.weight[0][0:10]) # just a small slice\nprint(model.layer2.weight.grad)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see how this changes when we run through one training batch. For a\nloss function, we'll just use the square of the Euclidean distance\nbetween our `prediction` and the `ideal_output`, and we'll use a basic\nstochastic gradient descent optimizer.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "optimizer = torch.optim.SGD(model.parameters(), lr=0.001)\n\nprediction = model(some_input)\n\nloss = (ideal_output - prediction).pow(2).sum()\nprint(loss)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's call `loss.backward()` and see what happens:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "loss.backward()\nprint(model.layer2.weight[0][0:10])\nprint(model.layer2.weight.grad[0][0:10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that the gradients have been computed for each learning\nweight, but the weights remain unchanged, because we haven't run the\noptimizer yet. 
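For the plain stochastic gradient descent optimizer we constructed\nabove, with learning rate $lr = 0.001$ and no momentum, the update about\nto be applied to each learning weight $w$ is the standard SGD step:\n\n$$w \leftarrow w - lr \cdot \frac{\partial loss}{\partial w}$$\n\n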
The optimizer is responsible for updating model weights\nbased on the computed gradients.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "optimizer.step()\nprint(model.layer2.weight[0][0:10])\nprint(model.layer2.weight.grad[0][0:10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should see that `layer2`'s weights have changed.\n\nOne important thing about the process: After calling `optimizer.step()`,\nyou need to call `optimizer.zero_grad()`, or else every time you run\n`loss.backward()`, the gradients on the learning weights will\naccumulate:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(model.layer2.weight.grad[0][0:10])\n\nfor i in range(0, 5):\n    prediction = model(some_input)\n    loss = (ideal_output - prediction).pow(2).sum()\n    loss.backward()\n\nprint(model.layer2.weight.grad[0][0:10])\n\noptimizer.zero_grad(set_to_none=False)\n\nprint(model.layer2.weight.grad[0][0:10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After running the cell above, you should see that after running\n`loss.backward()` multiple times, the magnitudes of most of the\ngradients will be much larger. Failing to zero the gradients before\nrunning your next training batch will cause the gradients to blow up in\nthis manner, causing incorrect and unpredictable learning results.\n\nTurning Autograd Off and On\n===========================\n\nThere are situations where you will need fine-grained control over\nwhether autograd is enabled. There are multiple ways to do this,\ndepending on the situation.\n\nThe simplest is to change the `requires_grad` flag on a tensor directly:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "a = torch.ones(2, 3, requires_grad=True)\nprint(a)\n\nb1 = 2 * a\nprint(b1)\n\na.requires_grad = False\nb2 = 2 * a\nprint(b2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the cell above, we see that `b1` has a `grad_fn` (i.e., a traced\ncomputation history), which is what we expect, since it was derived from\na tensor, `a`, that had autograd turned on. When we turn off autograd\nexplicitly with `a.requires_grad = False`, computation history is no\nlonger tracked, as we see when we compute `b2`.\n\nIf you only need autograd turned off temporarily, a better way is to use\nthe `torch.no_grad()` context manager:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "a = torch.ones(2, 3, requires_grad=True) * 2\nb = torch.ones(2, 3, requires_grad=True) * 3\n\nc1 = a + b\nprint(c1)\n\nwith torch.no_grad():\n    c2 = a + b\n\nprint(c2)\n\nc3 = a * b\nprint(c3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`torch.no_grad()` can also be used as a function or method decorator:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def add_tensors1(x, y):\n    return x + y\n\n@torch.no_grad()\ndef add_tensors2(x, y):\n    return x + y\n\n\na = torch.ones(2, 3, requires_grad=True) * 2\nb = torch.ones(2, 3, requires_grad=True) * 3\n\nc1 = add_tensors1(a, b)\nprint(c1)\n\nc2 = add_tensors2(a, b)\nprint(c2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There's a corresponding context manager, `torch.enable_grad()`, for\nturning autograd on when it isn't already. 
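As a minimal sketch of how the two interact (reusing the `a` and `b`\ntensors from the cell above - the variable names `c4` and `c5` are new,\nfor illustration only), the two context managers can be nested:\n\n``` {.python}\nwith torch.no_grad():\n    c4 = a + b  # gradient tracking is suspended here\n    with torch.enable_grad():\n        c5 = a + b  # gradient tracking is re-enabled here\n\nprint(c4.requires_grad)  # False\nprint(c5.requires_grad)  # True\n```\n\n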
`torch.enable_grad()` may also be used as a\ndecorator.\n\nFinally, you may have a tensor that requires gradient tracking, but you\nwant a copy that does not. For this we have the `Tensor` object's\n`detach()` method - it creates a copy of the tensor that is *detached*\nfrom the computation history:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "x = torch.rand(5, requires_grad=True)\ny = x.detach()\n\nprint(x)\nprint(y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We did this above when we wanted to graph some of our tensors. This is\nbecause `matplotlib` expects a NumPy array as input, and the implicit\nconversion from a PyTorch tensor to a NumPy array is not enabled for\ntensors with `requires_grad=True`. Making a detached copy lets us move\nforward.\n\nAutograd and In-place Operations\n================================\n\nIn every example in this notebook so far, we've used variables to\ncapture the intermediate values of a computation. Autograd needs these\nintermediate values to perform gradient computations. *For this reason,\nyou must be careful about using in-place operations when using\nautograd.* Doing so can destroy information you need to compute\nderivatives in the `backward()` call. PyTorch will even stop you if you\nattempt an in-place operation on a leaf variable that requires autograd,\nas shown below.\n\n```{=html}\n
<div class=\"alert alert-info\"><h4>Note</h4><p>The following code cell throws a runtime error. This is expected.</p><pre>a = torch.linspace(0., 2. * math.pi, steps=25, requires_grad=True)\ntorch.sin_(a)</pre></div>
\n```\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Autograd Profiler\n=================\n\nAutograd tracks every step of your computation in detail. Such a\ncomputation history, combined with timing information, would make a\nhandy profiler - and autograd has that feature baked in. Here's a quick\nexample usage:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "device = torch.device('cpu')\nrun_on_gpu = False\nif torch.cuda.is_available():\n device = torch.device('cuda')\n run_on_gpu = True\n \nx = torch.randn(2, 3, requires_grad=True)\ny = torch.rand(2, 3, requires_grad=True)\nz = torch.ones(2, 3, requires_grad=True)\n\nwith torch.autograd.profiler.profile(use_cuda=run_on_gpu) as prf:\n for _ in range(1000):\n z = (z / x) * y\n \nprint(prf.key_averages().table(sort_by='self_cpu_time_total'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The profiler can also label individual sub-blocks of code, break out the\ndata by input tensor shape, and export data as a Chrome tracing tools\nfile. For full details of the API, see the\n[documentation](https://pytorch.org/docs/stable/autograd.html#profiler).\n\nAdvanced Topic: More Autograd Detail and the High-Level API\n===========================================================\n\nIf you have a function with an n-dimensional input and m-dimensional\noutput, $\\vec{y}=f(\\vec{x})$, the complete gradient is a matrix of the\nderivative of every output with respect to every input, called the\n*Jacobian:*\n\n$$\\begin{aligned}\nJ\n=\n\\left(\\begin{array}{ccc}\n\\frac{\\partial y_{1}}{\\partial x_{1}} & \\cdots & \\frac{\\partial y_{1}}{\\partial x_{n}}\\\\\n\\vdots & \\ddots & \\vdots\\\\\n\\frac{\\partial y_{m}}{\\partial x_{1}} & \\cdots & \\frac{\\partial y_{m}}{\\partial x_{n}}\n\\end{array}\\right)\n\\end{aligned}$$\n\nIf you have a second function, $l=g\\left(\\vec{y}\\right)$ that takes\nm-dimensional input (that is, the same dimensionality as the output\nabove), and returns a scalar output, you can express its gradients with\nrespect to $\\vec{y}$ as a column vector,\n$v=\\left(\\begin{array}{ccc}\\frac{\\partial l}{\\partial y_{1}} & \\cdots & \\frac{\\partial l}{\\partial y_{m}}\\end{array}\\right)^{T}$\n- which is really just a one-column Jacobian.\n\nMore concretely, imagine the first function as your PyTorch model (with\npotentially many inputs and many outputs) and the second function as a\nloss function (with the model's output as input, and the loss value as\nthe scalar output).\n\nIf we multiply the first function's Jacobian by the gradient of the\nsecond function, and apply the chain rule, we get:\n\n$$\\begin{aligned}\nJ^{T}\\cdot v=\\left(\\begin{array}{ccc}\n\\frac{\\partial y_{1}}{\\partial x_{1}} & \\cdots & \\frac{\\partial y_{m}}{\\partial x_{1}}\\\\\n\\vdots & \\ddots & \\vdots\\\\\n\\frac{\\partial y_{1}}{\\partial x_{n}} & \\cdots & \\frac{\\partial y_{m}}{\\partial x_{n}}\n\\end{array}\\right)\\left(\\begin{array}{c}\n\\frac{\\partial l}{\\partial y_{1}}\\\\\n\\vdots\\\\\n\\frac{\\partial l}{\\partial y_{m}}\n\\end{array}\\right)=\\left(\\begin{array}{c}\n\\frac{\\partial l}{\\partial x_{1}}\\\\\n\\vdots\\\\\n\\frac{\\partial l}{\\partial x_{n}}\n\\end{array}\\right)\n\\end{aligned}$$\n\nNote: You could also use the equivalent operation $v^{T}\\cdot J$, and\nget back a row vector.\n\nThe resulting column vector is the *gradient of the second function with\nrespect to the inputs of the first* - or in the case of our model 
and\nloss function, the gradient of the loss with respect to the model\ninputs.\n\n**`torch.autograd` is an engine for computing these products.**\nThis is how we accumulate the gradients over the learning weights during\nthe backward pass.\n\nFor this reason, the `backward()` call can *also* take an optional\nvector input. This vector represents a set of gradients over the tensor,\nwhich are multiplied by the Jacobian of the autograd-traced tensor that\nprecedes it. Let's try a specific example with a small vector:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "x = torch.randn(3, requires_grad=True)\n\ny = x * 2\nwhile y.data.norm() < 1000:\n    y = y * 2\n\nprint(y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we tried to call `y.backward()` now, we'd get a runtime error and a\nmessage that gradients can only be *implicitly* computed for scalar\noutputs. For a multi-dimensional output, autograd expects us to provide\ngradients for those three outputs that it can multiply into the\nJacobian:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "v = torch.tensor([0.1, 1.0, 0.0001], dtype=torch.float) # stand-in for gradients\ny.backward(v)\n\nprint(x.grad)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(Note that the output gradients are all related to powers of two - which\nwe'd expect from a repeated doubling operation.)\n\nThe High-Level API\n==================\n\nThere is an API on autograd that gives you direct access to important\ndifferential matrix and vector operations. In particular, it allows you\nto calculate the Jacobian and the *Hessian* matrices of a particular\nfunction for particular inputs. (The Hessian is like the Jacobian, but\nexpresses all partial *second* derivatives.) 
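Concretely, for a scalar function $l(\vec{x})$, the Hessian's entries\nare the second partials $H_{ij} = \frac{\partial^{2} l}{\partial x_{i} \partial x_{j}}$. 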
It also provides methods\nfor taking vector products with these matrices.\n\nLet's take the Jacobian of a simple function, evaluated for 2\nsingle-element inputs:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def exp_adder(x, y):\n    return 2 * x.exp() + 3 * y\n\ninputs = (torch.rand(1), torch.rand(1)) # arguments for the function\nprint(inputs)\ntorch.autograd.functional.jacobian(exp_adder, inputs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you look closely, the first output should equal $2e^x$ (since the\nderivative of $e^x$ is $e^x$), and the second value should be 3.\n\nYou can, of course, do this with higher-order tensors:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "inputs = (torch.rand(3), torch.rand(3)) # arguments for the function\nprint(inputs)\ntorch.autograd.functional.jacobian(exp_adder, inputs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `torch.autograd.functional.hessian()` method works identically\n(assuming your function is twice differentiable), but returns a matrix\nof all second derivatives.\n\nThere is also a function to directly compute the vector-Jacobian\nproduct, if you provide the vector:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def do_some_doubling(x):\n    y = x * 2\n    while y.data.norm() < 1000:\n        y = y * 2\n    return y\n\ninputs = torch.randn(3)\nmy_gradients = torch.tensor([0.1, 1.0, 0.0001])\ntorch.autograd.functional.vjp(do_some_doubling, inputs, v=my_gradients)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `torch.autograd.functional.jvp()` method performs the same matrix\nmultiplication as `vjp()` with the operands reversed. The `vhp()` and\n`hvp()` methods do the same for a vector-Hessian product.\n\nFor more information, including performance notes, see the [docs for the\nfunctional\nAPI](https://pytorch.org/docs/stable/autograd.html#functional-higher-level-api).\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 0 }