{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# For tips on running notebooks in Google Colab, see\n# https://pytorch.org/tutorials/beginner/colab\n%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A Gentle Introduction to `torch.autograd`\n=========================================\n\n`torch.autograd` is PyTorch's automatic differentiation engine that\npowers neural network training. In this section, you will get a\nconceptual understanding of how autograd helps a neural network train.\n\nBackground\n----------\n\nNeural networks (NNs) are a collection of nested functions that are\nexecuted on some input data. These functions are defined by *parameters*\n(consisting of weights and biases), which in PyTorch are stored in\ntensors.\n\nTraining a NN happens in two steps:\n\n**Forward Propagation**: In forward prop, the NN makes its best guess\nabout the correct output. It runs the input data through each of its\nfunctions to make this guess.\n\n**Backward Propagation**: In backprop, the NN adjusts its parameters\nproportionate to the error in its guess. It does this by traversing\nbackwards from the output, collecting the derivatives of the error with\nrespect to the parameters of the functions (*gradients*), and optimizing\nthe parameters using gradient descent. For a more detailed walkthrough\nof backprop, check out this [video from\n3Blue1Brown](https://www.youtube.com/watch?v=tIeHLnjs5U8).\n\nUsage in PyTorch\n----------------\n\nLet\\'s take a look at a single training step. For this example, we load\na pretrained resnet18 model from `torchvision`. We create a random data\ntensor to represent a single image with 3 channels, and height & width\nof 64, and its corresponding `label` initialized to some random values.\nLabel in pretrained models has shape (1,1000).\n\n```{=html}\n
<div class=\"alert alert-info\"><h4>NOTE</h4><p>This tutorial works only on the CPU and will not work on GPU devices (even if tensors are moved to CUDA).</p></div>
\n```\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import torch\nfrom torchvision.models import resnet18, ResNet18_Weights\nmodel = resnet18(weights=ResNet18_Weights.DEFAULT)\ndata = torch.rand(1, 3, 64, 64)\nlabels = torch.rand(1, 1000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we run the input data through the model through each of its layers\nto make a prediction. This is the **forward pass**.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "prediction = model(data) # forward pass" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use the model\\'s prediction and the corresponding label to calculate\nthe error (`loss`). The next step is to backpropagate this error through\nthe network. Backward propagation is kicked off when we call\n`.backward()` on the error tensor. Autograd then calculates and stores\nthe gradients for each model parameter in the parameter\\'s `.grad`\nattribute.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "loss = (prediction - labels).sum()\nloss.backward() # backward pass" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we load an optimizer, in this case SGD with a learning rate of\n0.01 and\n[momentum](https://towardsdatascience.com/stochastic-gradient-descent-with-momentum-a84097641a5d)\nof 0.9. We register all the parameters of the model in the optimizer.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "optim = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we call `.step()` to initiate gradient descent. The optimizer\nadjusts each parameter by its gradient stored in `.grad`.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "optim.step() #gradient descent" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At this point, you have everything you need to train your neural\nnetwork. The below sections detail the workings of autograd - feel free\nto skip them.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "------------------------------------------------------------------------\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Differentiation in Autograd\n===========================\n\nLet\\'s take a look at how `autograd` collects gradients. We create two\ntensors `a` and `b` with `requires_grad=True`. This signals to\n`autograd` that every operation on them should be tracked.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import torch\n\na = torch.tensor([2., 3.], requires_grad=True)\nb = torch.tensor([6., 4.], requires_grad=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We create another tensor `Q` from `a` and `b`.\n\n$$Q = 3a^3 - b^2$$\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "Q = 3*a**3 - b**2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let\\'s assume `a` and `b` to be parameters of an NN, and `Q` to be the\nerror. In NN training, we want gradients of the error w.r.t. 
parameters,\ni.e.\n\n$$\\frac{\\partial Q}{\\partial a} = 9a^2$$\n\n$$\\frac{\\partial Q}{\\partial b} = -2b$$\n\nWhen we call `.backward()` on `Q`, autograd calculates these gradients\nand stores them in the respective tensors\\' `.grad` attribute.\n\nWe need to explicitly pass a `gradient` argument in `Q.backward()`\nbecause it is a vector. `gradient` is a tensor of the same shape as `Q`,\nand it represents the gradient of Q w.r.t. itself, i.e.\n\n$$\\frac{dQ}{dQ} = 1$$\n\nEquivalently, we can also aggregate Q into a scalar and call backward\nimplicitly, like `Q.sum().backward()`.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "external_grad = torch.tensor([1., 1.])\nQ.backward(gradient=external_grad)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Gradients are now deposited in `a.grad` and `b.grad`\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# check if collected gradients are correct\nprint(9*a**2 == a.grad)\nprint(-2*b == b.grad)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Optional Reading - Vector Calculus using `autograd`\n===================================================\n\nMathematically, if you have a vector valued function\n$\\vec{y}=f(\\vec{x})$, then the gradient of $\\vec{y}$ with respect to\n$\\vec{x}$ is a Jacobian matrix $J$:\n\n$$\\begin{aligned}\nJ\n=\n \\left(\\begin{array}{cc}\n \\frac{\\partial \\bf{y}}{\\partial x_{1}} &\n ... &\n \\frac{\\partial \\bf{y}}{\\partial x_{n}}\n \\end{array}\\right)\n=\n\\left(\\begin{array}{ccc}\n \\frac{\\partial y_{1}}{\\partial x_{1}} & \\cdots & \\frac{\\partial y_{1}}{\\partial x_{n}}\\\\\n \\vdots & \\ddots & \\vdots\\\\\n \\frac{\\partial y_{m}}{\\partial x_{1}} & \\cdots & \\frac{\\partial y_{m}}{\\partial x_{n}}\n \\end{array}\\right)\n\\end{aligned}$$\n\nGenerally speaking, `torch.autograd` is an engine for computing\nvector-Jacobian product. 
That is, given any vector $\\vec{v}$, compute\nthe product $J^{T}\\cdot \\vec{v}$\n\nIf $\\vec{v}$ happens to be the gradient of a scalar function\n$l=g\\left(\\vec{y}\\right)$:\n\n$$\\vec{v}\n =\n \\left(\\begin{array}{ccc}\\frac{\\partial l}{\\partial y_{1}} & \\cdots & \\frac{\\partial l}{\\partial y_{m}}\\end{array}\\right)^{T}$$\n\nthen by the chain rule, the vector-Jacobian product would be the\ngradient of $l$ with respect to $\\vec{x}$:\n\n$$\\begin{aligned}\nJ^{T}\\cdot \\vec{v} = \\left(\\begin{array}{ccc}\n \\frac{\\partial y_{1}}{\\partial x_{1}} & \\cdots & \\frac{\\partial y_{m}}{\\partial x_{1}}\\\\\n \\vdots & \\ddots & \\vdots\\\\\n \\frac{\\partial y_{1}}{\\partial x_{n}} & \\cdots & \\frac{\\partial y_{m}}{\\partial x_{n}}\n \\end{array}\\right)\\left(\\begin{array}{c}\n \\frac{\\partial l}{\\partial y_{1}}\\\\\n \\vdots\\\\\n \\frac{\\partial l}{\\partial y_{m}}\n \\end{array}\\right) = \\left(\\begin{array}{c}\n \\frac{\\partial l}{\\partial x_{1}}\\\\\n \\vdots\\\\\n \\frac{\\partial l}{\\partial x_{n}}\n \\end{array}\\right)\n\\end{aligned}$$\n\nThis characteristic of vector-Jacobian product is what we use in the\nabove example; `external_grad` represents $\\vec{v}$.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Computational Graph\n===================\n\nConceptually, autograd keeps a record of data (tensors) & all executed\noperations (along with the resulting new tensors) in a directed acyclic\ngraph (DAG) consisting of\n[Function](https://pytorch.org/docs/stable/autograd.html#torch.autograd.Function)\nobjects. In this DAG, leaves are the input tensors, roots are the output\ntensors. By tracing this graph from roots to leaves, you can\nautomatically compute the gradients using the chain rule.\n\nIn a forward pass, autograd does two things simultaneously:\n\n- run the requested operation to compute a resulting tensor, and\n- maintain the operation's *gradient function* in the DAG.\n\nThe backward pass kicks off when `.backward()` is called on the DAG\nroot. `autograd` then:\n\n- computes the gradients from each `.grad_fn`,\n- accumulates them in the respective tensor's `.grad` attribute, and\n- using the chain rule, propagates all the way to the leaf tensors.\n\nBelow is a visual representation of the DAG in our example. In the\ngraph, the arrows are in the direction of the forward pass. The nodes\nrepresent the backward functions of each operation in the forward pass.\nThe leaf nodes in blue represent our leaf tensors `a` and `b`.\n\n![](https://pytorch.org/tutorials/_static/img/dag_autograd.png)\n\n```{=html}\n
<div class=\"alert alert-info\"><h4>NOTE</h4><p>An important thing to note is that the graph is recreated from scratch; after each <code>.backward()</code> call, autograd starts populating a new graph. This is exactly what allows you to use control flow statements in your model; you can change the shape, size and operations at every iteration if needed.</p></div>
\n```\nExclusion from the DAG\n----------------------\n\n`torch.autograd` tracks operations on all tensors which have their\n`requires_grad` flag set to `True`. For tensors that don't require\ngradients, setting this attribute to `False` excludes it from the\ngradient computation DAG.\n\nThe output tensor of an operation will require gradients even if only a\nsingle input tensor has `requires_grad=True`.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "x = torch.rand(5, 5)\ny = torch.rand(5, 5)\nz = torch.rand((5, 5), requires_grad=True)\n\na = x + y\nprint(f\"Does `a` require gradients?: {a.requires_grad}\")\nb = x + z\nprint(f\"Does `b` require gradients?: {b.requires_grad}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In a NN, parameters that don\\'t compute gradients are usually called\n**frozen parameters**. It is useful to \\\"freeze\\\" part of your model if\nyou know in advance that you won\\'t need the gradients of those\nparameters (this offers some performance benefits by reducing autograd\ncomputations).\n\nIn finetuning, we freeze most of the model and typically only modify the\nclassifier layers to make predictions on new labels. Let\\'s walk through\na small example to demonstrate this. As before, we load a pretrained\nresnet18 model, and freeze all the parameters.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from torch import nn, optim\n\nmodel = resnet18(weights=ResNet18_Weights.DEFAULT)\n\n# Freeze all the parameters in the network\nfor param in model.parameters():\n param.requires_grad = False" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let\\'s say we want to finetune the model on a new dataset with 10\nlabels. In resnet, the classifier is the last linear layer `model.fc`.\nWe can simply replace it with a new linear layer (unfrozen by default)\nthat acts as our classifier.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "model.fc = nn.Linear(512, 10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now all parameters in the model, except the parameters of `model.fc`,\nare frozen. 
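One quick way to verify this (purely as a sanity check, not a required training step) is to print the parameters that still require gradients using `model.named_parameters()`:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Sanity check: list the parameters that still require gradients.\n# Only the weight and bias of the newly added model.fc are expected here.\nfor name, param in model.named_parameters():\n    if param.requires_grad:\n        print(name, param.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "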
The only parameters that compute gradients are the weights\nand bias of `model.fc`.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Optimize only the classifier\noptimizer = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that although we register all the parameters in the optimizer, the\nonly parameters that are computing gradients (and hence updated in\ngradient descent) are the weights and bias of the classifier.\n\nThe same exclusionary functionality is available as a context manager in\n[torch.no\\_grad()](https://pytorch.org/docs/stable/generated/torch.no_grad.html).\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "------------------------------------------------------------------------\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Further readings:\n=================\n\n- [In-place operations & Multithreaded\n Autograd](https://pytorch.org/docs/stable/notes/autograd.html)\n- [Example implementation of reverse-mode\n autodiff](https://colab.research.google.com/drive/1VpeE6UvEPRz9HmsHh1KS0XxXjYu533EC)\n- [Video: PyTorch Autograd Explained - In-depth\n Tutorial](https://www.youtube.com/watch?v=MswxJw-8PvE)\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 0 }