{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# For tips on running notebooks in Google Colab, see\n# https://codelin.vip/beginner/colab\n%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Profiling your PyTorch Module\n=============================\n\n**Author:** [Suraj Subramanian](https://github.com/suraj813)\n\nPyTorch includes a profiler API that is useful to identify the time and\nmemory costs of various PyTorch operations in your code. Profiler can be\neasily integrated in your code, and the results can be printed as a\ntable or returned in a JSON trace file.\n\n```{=html}\n<div style=\"background-color: #54c7ec; color: #fff; font-weight: 700; padding-left: 10px; padding-top: 5px; padding-bottom: 5px\"><strong>NOTE:</strong></div>\n```\n```{=html}\n<div style=\"background-color: #f3f4f7; padding-left: 10px; padding-top: 10px; padding-bottom: 10px; padding-right: 10px\">\n```\n```{=html}\n<p>Profiler supports multithreaded models. Profiler runs in thesame thread as the operation but it will also profile child operatorsthat might run in another thread. Concurrently-running profilers will bescoped to their own thread to prevent mixing of results.</p>\n```\n```{=html}\n</div>\n```\n```{=html}\n<div style=\"background-color: #54c7ec; color: #fff; font-weight: 700; padding-left: 10px; padding-top: 5px; padding-bottom: 5px\"><strong>NOTE:</strong></div>\n```\n```{=html}\n<div style=\"background-color: #f3f4f7; padding-left: 10px; padding-top: 10px; padding-bottom: 10px; padding-right: 10px\">\n```\n```{=html}\n<p>PyTorch 1.8 introduces the new API that will replace the older profiler APIin the future releases. Check the new API at <a href=\"https://pytorch.org/docs/master/profiler.html\">this page</a>.</p>\n```\n```{=html}\n</div>\n```\nHead on over to [this\nrecipe](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html)\nfor a quicker walkthrough of Profiler API usage.\n\n------------------------------------------------------------------------\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import torch\nimport numpy as np\nfrom torch import nn\nimport torch.autograd.profiler as profiler"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Performance debugging using Profiler\n====================================\n\nProfiler can be useful to identify performance bottlenecks in your\nmodels. In this example, we build a custom module that performs two\nsub-tasks:\n\n-   a linear transformation on the input, and\n-   use the transformation result to get indices on a mask tensor.\n\nWe wrap the code for each sub-task in separate labelled context managers\nusing `profiler.record_function(\"label\")`. In the profiler output, the\naggregate performance metrics of all operations in the sub-task will\nshow up under its corresponding label.\n\nNote that using Profiler incurs some overhead, and is best used only for\ninvestigating code. Remember to remove it if you are benchmarking\nruntimes.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "class MyModule(nn.Module):\n    def __init__(self, in_features: int, out_features: int, bias: bool = True):\n        super(MyModule, self).__init__()\n        self.linear = nn.Linear(in_features, out_features, bias)\n\n    def forward(self, input, mask):\n        with profiler.record_function(\"LINEAR PASS\"):\n            out = self.linear(input)\n\n        with profiler.record_function(\"MASK INDICES\"):\n            threshold = out.sum(axis=1).mean().item()\n            hi_idx = np.argwhere(mask.cpu().numpy() > threshold)\n            hi_idx = torch.from_numpy(hi_idx).cuda()\n\n        return out, hi_idx"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Profile the forward pass\n========================\n\nWe initialize random input and mask tensors, and the model.\n\nBefore we run the profiler, we warm-up CUDA to ensure accurate\nperformance benchmarking. We wrap the forward pass of our module in the\n`profiler.profile` context manager. The `with_stack=True` parameter\nappends the file and line number of the operation in the trace.\n\n```{=html}\n<div style=\"background-color: #e94f3b; color: #fff; font-weight: 700; padding-left: 10px; padding-top: 5px; padding-bottom: 5px\"><strong>WARNING:</strong></div>\n```\n```{=html}\n<div style=\"background-color: #f3f4f7; padding-left: 10px; padding-top: 10px; padding-bottom: 10px; padding-right: 10px\">\n```\n```{=html}\n<p><code>with_stack=True</code> incurs an additional overhead, and is better suited for investigating code.Remember to remove it if you are benchmarking performance.</p>\n```\n```{=html}\n</div>\n```\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "model = MyModule(500, 10).cuda()\ninput = torch.rand(128, 500).cuda()\nmask = torch.rand((500, 500, 500), dtype=torch.double).cuda()\n\n# warm-up\nmodel(input, mask)\n\nwith profiler.profile(with_stack=True, profile_memory=True) as prof:\n    out, idx = model(input, mask)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Print profiler results\n======================\n\nFinally, we print the profiler results. `profiler.key_averages`\naggregates the results by operator name, and optionally by input shapes\nand/or stack trace events. Grouping by input shapes is useful to\nidentify which tensor shapes are utilized by the model.\n\nHere, we use `group_by_stack_n=5` which aggregates runtimes by the\noperation and its traceback (truncated to the most recent 5 events), and\ndisplay the events in the order they are registered. The table can also\nbe sorted by passing a `sort_by` argument (refer to the\n[docs](https://pytorch.org/docs/stable/autograd.html#profiler) for valid\nsorting keys).\n\n```{=html}\n<div style=\"background-color: #54c7ec; color: #fff; font-weight: 700; padding-left: 10px; padding-top: 5px; padding-bottom: 5px\"><strong>NOTE:</strong></div>\n```\n```{=html}\n<div style=\"background-color: #f3f4f7; padding-left: 10px; padding-top: 10px; padding-bottom: 10px; padding-right: 10px\">\n```\n```{=html}\n<p>When running profiler in a notebook, you might see entries like <code>&lt;ipython-input-18-193a910735e8&gt;(13): forward</code>instead of filenames in the stacktrace. These correspond to <code>&lt;notebook-cell&gt;(line number): calling-function</code>.</p>\n```\n```{=html}\n</div>\n```\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(prof.key_averages(group_by_stack_n=5).table(sort_by='self_cpu_time_total', row_limit=5))\n\n\"\"\"\n(Some columns are omitted)\n\n-------------  ------------  ------------  ------------  ---------------------------------\n         Name    Self CPU %      Self CPU  Self CPU Mem   Source Location\n-------------  ------------  ------------  ------------  ---------------------------------\n MASK INDICES        87.88%        5.212s    -953.67 Mb  /mnt/xarfuse/.../torch/au\n                                                         <ipython-input-...>(10): forward\n                                                         /mnt/xarfuse/.../torch/nn\n                                                         <ipython-input-...>(9): <module>\n                                                         /mnt/xarfuse/.../IPython/\n\n  aten::copy_        12.07%     715.848ms           0 b  <ipython-input-...>(12): forward\n                                                         /mnt/xarfuse/.../torch/nn\n                                                         <ipython-input-...>(9): <module>\n                                                         /mnt/xarfuse/.../IPython/\n                                                         /mnt/xarfuse/.../IPython/\n\n  LINEAR PASS         0.01%     350.151us         -20 b  /mnt/xarfuse/.../torch/au\n                                                         <ipython-input-...>(7): forward\n                                                         /mnt/xarfuse/.../torch/nn\n                                                         <ipython-input-...>(9): <module>\n                                                         /mnt/xarfuse/.../IPython/\n\n  aten::addmm         0.00%     293.342us           0 b  /mnt/xarfuse/.../torch/nn\n                                                         /mnt/xarfuse/.../torch/nn\n                                                         /mnt/xarfuse/.../torch/nn\n                                                         <ipython-input-...>(8): forward\n                                                         /mnt/xarfuse/.../torch/nn\n\n   aten::mean         0.00%     235.095us           0 b  <ipython-input-...>(11): forward\n                                                         /mnt/xarfuse/.../torch/nn\n                                                         <ipython-input-...>(9): <module>\n                                                         /mnt/xarfuse/.../IPython/\n                                                         /mnt/xarfuse/.../IPython/\n\n-----------------------------  ------------  ---------- ----------------------------------\nSelf CPU time total: 5.931s\n\n\"\"\""
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Improve memory performance\n==========================\n\nNote that the most expensive operations - in terms of memory and time\n-are at `forward (10)` representing the operations within MASK INDICES.\nLet's try to tackle the memory consumption first. We can see that the\n`.to()` operation at line 12 consumes 953.67 Mb. This operation copies\n`mask` to the CPU. `mask` is initialized with a `torch.double` datatype.\nCan we reduce the memory footprint by casting it to `torch.float`\ninstead?\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "model = MyModule(500, 10).cuda()\ninput = torch.rand(128, 500).cuda()\nmask = torch.rand((500, 500, 500), dtype=torch.float).cuda()\n\n# warm-up\nmodel(input, mask)\n\nwith profiler.profile(with_stack=True, profile_memory=True) as prof:\n    out, idx = model(input, mask)\n\nprint(prof.key_averages(group_by_stack_n=5).table(sort_by='self_cpu_time_total', row_limit=5))\n\n\"\"\"\n(Some columns are omitted)\n\n-----------------  ------------  ------------  ------------  --------------------------------\n             Name    Self CPU %      Self CPU  Self CPU Mem   Source Location\n-----------------  ------------  ------------  ------------  --------------------------------\n     MASK INDICES        93.61%        5.006s    -476.84 Mb  /mnt/xarfuse/.../torch/au\n                                                             <ipython-input-...>(10): forward\n                                                             /mnt/xarfuse/  /torch/nn\n                                                             <ipython-input-...>(9): <module>\n                                                             /mnt/xarfuse/.../IPython/\n\n      aten::copy_         6.34%     338.759ms           0 b  <ipython-input-...>(12): forward\n                                                             /mnt/xarfuse/.../torch/nn\n                                                             <ipython-input-...>(9): <module>\n                                                             /mnt/xarfuse/.../IPython/\n                                                             /mnt/xarfuse/.../IPython/\n\n aten::as_strided         0.01%     281.808us           0 b  <ipython-input-...>(11): forward\n                                                             /mnt/xarfuse/.../torch/nn\n                                                             <ipython-input-...>(9): <module>\n                                                             /mnt/xarfuse/.../IPython/\n                                                             /mnt/xarfuse/.../IPython/\n\n      aten::addmm         0.01%     275.721us           0 b  /mnt/xarfuse/.../torch/nn\n                                                             /mnt/xarfuse/.../torch/nn\n                                                             /mnt/xarfuse/.../torch/nn\n                                                             <ipython-input-...>(8): forward\n                                                             /mnt/xarfuse/.../torch/nn\n\n      aten::_local        0.01%     268.650us           0 b  <ipython-input-...>(11): forward\n      _scalar_dense                                          /mnt/xarfuse/.../torch/nn\n                                                             <ipython-input-...>(9): <module>\n                                                             /mnt/xarfuse/.../IPython/\n                                                             /mnt/xarfuse/.../IPython/\n\n-----------------  ------------  ------------  ------------  --------------------------------\nSelf CPU time total: 5.347s\n\n\"\"\""
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The CPU memory footprint for this operation has halved.\n\nImprove time performance\n========================\n\nWhile the time consumed has also reduced a bit, it's still too high.\nTurns out copying a matrix from CUDA to CPU is pretty expensive! The\n`aten::copy_` operator in `forward (12)` copies `mask` to CPU so that it\ncan use the NumPy `argwhere` function. `aten::copy_` at `forward(13)`\ncopies the array back to CUDA as a tensor. We could eliminate both of\nthese if we use a `torch` function `nonzero()` here instead.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "class MyModule(nn.Module):\n    def __init__(self, in_features: int, out_features: int, bias: bool = True):\n        super(MyModule, self).__init__()\n        self.linear = nn.Linear(in_features, out_features, bias)\n\n    def forward(self, input, mask):\n        with profiler.record_function(\"LINEAR PASS\"):\n            out = self.linear(input)\n\n        with profiler.record_function(\"MASK INDICES\"):\n            threshold = out.sum(axis=1).mean()\n            hi_idx = (mask > threshold).nonzero(as_tuple=True)\n\n        return out, hi_idx\n\n\nmodel = MyModule(500, 10).cuda()\ninput = torch.rand(128, 500).cuda()\nmask = torch.rand((500, 500, 500), dtype=torch.float).cuda()\n\n# warm-up\nmodel(input, mask)\n\nwith profiler.profile(with_stack=True, profile_memory=True) as prof:\n    out, idx = model(input, mask)\n\nprint(prof.key_averages(group_by_stack_n=5).table(sort_by='self_cpu_time_total', row_limit=5))\n\n\"\"\"\n(Some columns are omitted)\n\n--------------  ------------  ------------  ------------  ---------------------------------\n          Name    Self CPU %      Self CPU  Self CPU Mem   Source Location\n--------------  ------------  ------------  ------------  ---------------------------------\n      aten::gt        57.17%     129.089ms           0 b  <ipython-input-...>(12): forward\n                                                          /mnt/xarfuse/.../torch/nn\n                                                          <ipython-input-...>(25): <module>\n                                                          /mnt/xarfuse/.../IPython/\n                                                          /mnt/xarfuse/.../IPython/\n\n aten::nonzero        37.38%      84.402ms           0 b  <ipython-input-...>(12): forward\n                                                          /mnt/xarfuse/.../torch/nn\n                                                          <ipython-input-...>(25): <module>\n                                                          /mnt/xarfuse/.../IPython/\n                                                          /mnt/xarfuse/.../IPython/\n\n   INDEX SCORE         3.32%       7.491ms    -119.21 Mb  /mnt/xarfuse/.../torch/au\n                                                          <ipython-input-...>(10): forward\n                                                          /mnt/xarfuse/.../torch/nn\n                                                          <ipython-input-...>(25): <module>\n                                                          /mnt/xarfuse/.../IPython/\n\naten::as_strided         0.20%    441.587us          0 b  <ipython-input-...>(12): forward\n                                                          /mnt/xarfuse/.../torch/nn\n                                                          <ipython-input-...>(25): <module>\n                                                          /mnt/xarfuse/.../IPython/\n                                                          /mnt/xarfuse/.../IPython/\n\n aten::nonzero\n     _numpy             0.18%     395.602us           0 b  <ipython-input-...>(12): forward\n                                                          /mnt/xarfuse/.../torch/nn\n                                                          <ipython-input-...>(25): <module>\n                                                          /mnt/xarfuse/.../IPython/\n                                                          /mnt/xarfuse/.../IPython/\n--------------  ------------  ------------  ------------  ---------------------------------\nSelf CPU time total: 225.801ms\n\n\"\"\""
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Further Reading\n===============\n\nWe have seen how Profiler can be used to investigate time and memory\nbottlenecks in PyTorch models. Read more about Profiler here:\n\n-   [Profiler Usage\n    Recipe](https://pytorch.org/tutorials/recipes/recipes/profiler.html)\n-   [Profiling RPC-Based\n    Workloads](https://pytorch.org/tutorials/recipes/distributed_rpc_profiling.html)\n-   [Profiler API\n    Docs](https://pytorch.org/docs/stable/autograd.html?highlight=profiler#profiler)\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.12"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}