{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# For tips on running notebooks in Google Colab, see\n# https://codelin.vip/beginner/colab\n%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Timer quick start\n=================\n\nIn this tutorial, we\\'re going to cover the primary APIs of\n[torch.utils.benchmark.Timer]{.title-ref}. The PyTorch Timer is based on\nthe\n[timeit.Timer](https://docs.python.org/3/library/timeit.html#timeit.Timer)\nAPI, with several PyTorch specific modifications. Familiarity with the\nbuiltin [Timer]{.title-ref} class is not required for this tutorial,\nhowever we assume that the reader is familiar with the fundamentals of\nperformance work.\n\nFor a more comprehensive performance tuning tutorial, see [PyTorch\nBenchmark](https://pytorch.org/tutorials/recipes/recipes/benchmark.html).\n\n**Contents:**\n\n: 1. [Defining a Timer](#defining-a-timer)\n 2. [Wall time:\n Timer.blocked\\_autorange(\\...)](#wall-time-timer-blocked-autorange)\n 3. [C++ snippets](#c-snippets)\n 4. [Instruction counts:\n Timer.collect\\_callgrind(\\...)](#instruction-counts-timer-collect-callgrind)\n 5. [Instruction counts: Delving\n deeper](#instruction-counts-delving-deeper)\n 6. [A/B testing with Callgrind](#a-b-testing-with-callgrind)\n 7. [Wrapping up](#wrapping-up)\n 8. [Footnotes](#footnotes)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Defining a Timer\n===================\n\nA [Timer]{.title-ref} serves as a task definition.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from torch.utils.benchmark import Timer\n\ntimer = Timer(\n # The computation which will be run in a loop and timed.\n stmt=\"x * y\",\n\n # `setup` will be run before calling the measurement loop, and is used to\n # populate any state which is needed by `stmt`\n setup=\"\"\"\n x = torch.ones((128,))\n y = torch.ones((128,))\n \"\"\",\n\n # Alternatively, ``globals`` can be used to pass variables from the outer scope.\n # \n # globals={\n # \"x\": torch.ones((128,)),\n # \"y\": torch.ones((128,)),\n # },\n\n # Control the number of threads that PyTorch uses. (Default: 1)\n num_threads=1,\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. Wall time: `Timer.blocked_autorange(...)`\n============================================\n\nThis method will handle details such as picking a suitable number if\nrepeats, fixing the number of threads, and providing a convenient\nrepresentation of the results.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Measurement objects store the results of multiple repeats, and provide\n# various utility features.\nfrom torch.utils.benchmark import Measurement\n\nm: Measurement = timer.blocked_autorange(min_run_time=1)\nprint(m)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "``` {.none}\n\nx * y\nsetup:\n x = torch.ones((128,))\n y = torch.ones((128,))\n\n Median: 2.34 us\n IQR: 0.07 us (2.31 to 2.38)\n 424 measurements, 1000 runs per measurement, 1 thread\n```\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. 
,
{ "cell_type": "markdown", "metadata": {}, "source": [ "3. C++ snippets\n===============\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from torch.utils.benchmark import Language\n\ncpp_timer = Timer(\n    \"x * y;\",\n    \"\"\"\n        auto x = torch::ones({128});\n        auto y = torch::ones({128});\n    \"\"\",\n    language=Language.CPP,\n)\n\nprint(cpp_timer.blocked_autorange(min_run_time=1))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "``` {.none}\n\nx * y;\nsetup:\n  auto x = torch::ones({128});\n  auto y = torch::ones({128});\n\n  Median: 1.21 us\n  IQR:    0.03 us (1.20 to 1.23)\n  83 measurements, 10000 runs per measurement, 1 thread\n```\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Unsurprisingly, the C++ snippet is both faster and less variable.\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "4. Instruction counts: `Timer.collect_callgrind(...)`\n=====================================================\n\nFor deep-dive investigations, `Timer.collect_callgrind` wraps\n[Callgrind](https://valgrind.org/docs/manual/cl-manual.html) in order to\ncollect instruction counts. These are useful as they offer fine-grained\nand deterministic (or very low noise in the case of Python) insights\ninto how a snippet is run.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from torch.utils.benchmark import CallgrindStats, FunctionCounts\n\nstats: CallgrindStats = cpp_timer.collect_callgrind()\nprint(stats)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "``` {.none}\n\nx * y;\nsetup:\n  auto x = torch::ones({128});\n  auto y = torch::ones({128});\n\n                       All          Noisy symbols removed\n  Instructions:     563600                 563600\n  Baseline:              0                      0\n100 runs per measurement, 1 thread\n```\n" ] }
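,
{ "cell_type": "markdown", "metadata": {}, "source": [ "The totals in that table can also be queried programmatically. The cell\nbelow is an illustrative sketch rather than part of the original recipe;\nit assumes the `stats` object collected above and uses\n`CallgrindStats.counts`, where `denoise=True` drops symbols known to be\nnoisy in the CPython interpreter.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Illustrative only: total instruction counts as plain integers.\nprint(stats.counts())              # all symbols\nprint(stats.counts(denoise=True))  # noisy CPython symbols removed" ] }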
,
{ "cell_type": "markdown", "metadata": {}, "source": [ "5. Instruction counts: Delving deeper\n=====================================\n\nThe string representation of `CallgrindStats` is similar to that of\n`Measurement`. [Noisy symbols]{.title-ref} are a Python concept (removing\ncalls in the CPython interpreter which are known to be noisy).\n\nFor more detailed analysis, however, we will want to look at specific\ncalls. `CallgrindStats.stats()` returns a `FunctionCounts` object to\nmake this easier. Conceptually, `FunctionCounts` can be thought of as a\ntuple of pairs with some utility methods, where each pair is [(number of\ninstructions, file path and function name)]{.title-ref}.\n\nA note on paths:\n\n:   One generally doesn\\'t care about the absolute path. For instance,\n    the full path and function name for a multiply call is something\n    like:\n\n``` {.sh}\n/the/prefix/to/your/pytorch/install/dir/pytorch/build/aten/src/ATen/core/TensorMethods.cpp:at::Tensor::mul(at::Tensor const&) const [/the/path/to/your/conda/install/miniconda3/envs/ab_ref/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so]\n```\n\nwhen in reality, all of the information that we\\'re interested in can be\nrepresented in:\n\n``` {.sh}\nbuild/aten/src/ATen/core/TensorMethods.cpp:at::Tensor::mul(at::Tensor const&) const\n```\n\n`CallgrindStats.as_standardized()` makes a best effort to strip\nlow-signal portions of the file path, as well as the shared object, and\nis generally recommended.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Note: despite the variable name, `inclusive=False` requests exclusive\n# (self) instruction counts, which attribute work to individual functions.\ninclusive_stats = stats.as_standardized().stats(inclusive=False)\nprint(inclusive_stats[:10])" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "``` {.none}\n<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.FunctionCounts object at 0x7f192a6dfd90>\n  47264  ???:_int_free\n  25963  ???:_int_malloc\n  19900  build/../aten/src/ATen/TensorIter ... (at::TensorIteratorConfig const&)\n  18000  ???:__tls_get_addr\n  13500  ???:malloc\n  11300  build/../c10/util/SmallVector.h:a ... (at::TensorIteratorConfig const&)\n  10345  ???:_int_memalign\n  10000  build/../aten/src/ATen/TensorIter ... (at::TensorIteratorConfig const&)\n   9200  ???:free\n   8000  build/../c10/util/SmallVector.h:a ... IteratorBase::get_strides() const\n\nTotal: 173472\n```\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "That\\'s still quite a lot to digest. Let\\'s use the\n[FunctionCounts.transform]{.title-ref} method to trim some of the\nfunction path and discard the function name. When we do, the counts of\nany collisions (e.g. [foo.h:a()]{.title-ref} and [foo.h:b()]{.title-ref}\nwill both map to [foo.h]{.title-ref}) are added together.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import os\nimport re\n\ndef group_by_file(fn_name: str):\n    if fn_name.startswith(\"???\"):\n        fn_dir, fn_file = fn_name.split(\":\")[:2]\n    else:\n        fn_dir, fn_file = os.path.split(fn_name.split(\":\")[0])\n        fn_dir = re.sub(\"^.*build/../\", \"\", fn_dir)\n        fn_dir = re.sub(\"^.*torch/\", \"torch/\", fn_dir)\n\n    return f\"{fn_dir:<15} {fn_file}\"\n\nprint(inclusive_stats.transform(group_by_file)[:10])" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "``` {.none}\n\n 118200  aten/src/ATen   TensorIterator.cpp\n  65000  c10/util        SmallVector.h\n  47264  ???             _int_free\n  25963  ???             _int_malloc\n  20900  c10/util        intrusive_ptr.h\n  18000  ???             __tls_get_addr\n  15900  c10/core        TensorImpl.h\n  15100  c10/core        CPUAllocator.cpp\n  13500  ???             malloc\n  12500  c10/core        TensorImpl.cpp\n\nTotal: 352327\n```\n" ] }
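,
{ "cell_type": "markdown", "metadata": {}, "source": [ "As described above, a `FunctionCounts` behaves like a tuple of [(number\nof instructions, function name)]{.title-ref} pairs, so it can also be\niterated directly. The cell below is an illustrative sketch, not part of\nthe original recipe; it assumes the `inclusive_stats` object from above.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Illustrative only: each entry unpacks as a (count, function name) pair.\nfor count, fn_name in inclusive_stats[:3]:\n    print(f\"{count:>8}  {fn_name}\")" ] }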
,
{ "cell_type": "markdown", "metadata": {}, "source": [ "6. A/B testing with `Callgrind`\n===============================\n\nOne of the most useful features of instruction counts is that they allow\nfine-grained comparison of computation, which is critical when analyzing\nperformance.\n\nTo see this in action, let\\'s compare our multiplication of two size 128\nTensors with a {128} x {1} multiplication, which will broadcast the\nsecond Tensor: result = {a0 \\* b0, a1 \\* b0, \\..., a127 \\* b0}\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "broadcasting_stats = Timer(\n    \"x * y;\",\n    \"\"\"\n        auto x = torch::ones({128});\n        auto y = torch::ones({1});\n    \"\"\",\n    language=Language.CPP,\n).collect_callgrind().as_standardized().stats(inclusive=False)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Often we want to A/B test two different environments (e.g. testing a PR,\nor experimenting with compile flags). This is quite simple, as\n`CallgrindStats`, `FunctionCounts`, and `Measurement` are all\npickleable. Simply save measurements from each environment, and load\nthem in a single process for analysis, as sketched below.\n" ] }
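,
{ "cell_type": "markdown", "metadata": {}, "source": [ "For example, an illustrative sketch of the two halves of that workflow;\nthe file name is hypothetical, and in practice the save and load would\nhappen in separate processes or machines:\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pickle\n\n# In environment A: save the collected counts to disk. The path\n# \"env_a_counts.pkl\" is just an illustrative name.\nwith open(\"env_a_counts.pkl\", \"wb\") as f:\n    pickle.dump(broadcasting_stats, f)\n\n# Later, in a single analysis process: load the counts back and diff\n# against a second run with `-`, as done in the next cell.\nwith open(\"env_a_counts.pkl\", \"rb\") as f:\n    env_a_counts = pickle.load(f)" ] }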
,
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pickle\n\n# Let's round trip `broadcasting_stats` just to show that we can.\nbroadcasting_stats = pickle.loads(pickle.dumps(broadcasting_stats))\n\n\n# And now to diff the two tasks:\ndelta = broadcasting_stats - inclusive_stats\n\ndef extract_fn_name(fn: str):\n    \"\"\"Trim everything except the function name.\"\"\"\n    fn = \":\".join(fn.split(\":\")[1:])\n    return re.sub(r\"\\(.+\\)\", \"(...)\", fn)\n\n# We use `.transform` to make the diff readable:\nprint(delta.transform(extract_fn_name))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "``` {.none}\n\n  17600  at::TensorIteratorBase::compute_strides(...)\n  12700  at::TensorIteratorBase::allocate_or_resize_outputs()\n  10200  c10::SmallVectorImpl<long>::operator=(...)\n   7400  at::infer_size(...)\n   6200  at::TensorIteratorBase::invert_perm(...) const\n   6064  _int_free\n   5100  at::TensorIteratorBase::reorder_dimensions()\n   4300  malloc\n   4300  at::TensorIteratorBase::compatible_stride(...) const\n    ...\n    -28  _int_memalign\n   -100  c10::impl::check_tensor_options_and_extract_memory_format(...)\n   -300  __memcmp_avx2_movbe\n   -400  at::detail::empty_cpu(...)\n  -1100  at::TensorIteratorBase::numel() const\n  -1300  void at::native::(...)\n  -2400  c10::TensorImpl::is_contiguous(...) const\n  -6100  at::TensorIteratorBase::compute_fast_setup_type(...)\n -22600  at::TensorIteratorBase::fast_set_up(...)\n\nTotal: 58091\n```\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "So the broadcasting version takes an extra 580 instructions per call\n(recall that we\\'re collecting 100 runs per measurement), or about 10%.\nThere are quite a few `TensorIterator` calls, so let\\'s drill down to\nthose. `FunctionCounts.filter` makes this easy.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(delta.transform(extract_fn_name).filter(lambda fn: \"TensorIterator\" in fn))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "``` {.none}\n\n  17600  at::TensorIteratorBase::compute_strides(...)\n  12700  at::TensorIteratorBase::allocate_or_resize_outputs()\n   6200  at::TensorIteratorBase::invert_perm(...) const\n   5100  at::TensorIteratorBase::reorder_dimensions()\n   4300  at::TensorIteratorBase::compatible_stride(...) const\n   4000  at::TensorIteratorBase::compute_shape(...)\n   2300  at::TensorIteratorBase::coalesce_dimensions()\n   1600  at::TensorIteratorBase::build(...)\n  -1100  at::TensorIteratorBase::numel() const\n  -6100  at::TensorIteratorBase::compute_fast_setup_type(...)\n -22600  at::TensorIteratorBase::fast_set_up(...)\n\nTotal: 24000\n```\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "This makes plain what is going on: there is a fast path in\n`TensorIterator` setup, but in the {128} x {1} case we miss it and have\nto do a more general analysis, which is more expensive. The most\nprominent call omitted by the filter is\n[c10::SmallVectorImpl\\<long\\>::operator=(\\...)]{.title-ref}, which is\nalso part of the more general setup.\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "7. Wrapping up\n==============\n\nIn summary, use [Timer.blocked\\_autorange]{.title-ref} to collect wall\ntimes. If timing variation is too high, increase\n[min\\_run\\_time]{.title-ref}, or move to C++ snippets if convenient.\n\nFor fine-grained analysis, use [Timer.collect\\_callgrind]{.title-ref} to\nmeasure instruction counts and [FunctionCounts.(\\_\\_add\\_\\_ /\n\\_\\_sub\\_\\_ / transform / filter)]{.title-ref} to slice-and-dice them.\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "8. Footnotes\n============\n\n- Implied `import torch`: If [globals]{.title-ref} does not contain\n  \\\"torch\\\", Timer will automatically populate it. This means that\n  `Timer(\"torch.empty(())\")` will work. (Though other imports should be\n  placed in [setup]{.title-ref}, e.g.\n  `Timer(\"np.zeros(())\", \"import numpy as np\")`)\n\n- `REL_WITH_DEB_INFO`: In order to provide full information about the\n  PyTorch internals which are executed, `Callgrind` needs access to C++\n  debug symbols. This is accomplished by setting `REL_WITH_DEB_INFO=1`\n  when building PyTorch. Otherwise function calls will be opaque. (The\n  resultant `CallgrindStats` will warn if debug symbols are missing.)\n" ] }
], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 0 }