{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# For tips on running notebooks in Google Colab, see\n# https://codelin.vip/beginner/colab\n%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Timer quick start\n=================\n\nIn this tutorial, we\\'re going to cover the primary APIs of\n[torch.utils.benchmark.Timer]{.title-ref}. The PyTorch Timer is based on\nthe\n[timeit.Timer](https://docs.python.org/3/library/timeit.html#timeit.Timer)\nAPI, with several PyTorch specific modifications. Familiarity with the\nbuiltin [Timer]{.title-ref} class is not required for this tutorial,\nhowever we assume that the reader is familiar with the fundamentals of\nperformance work.\n\nFor a more comprehensive performance tuning tutorial, see [PyTorch\nBenchmark](https://pytorch.org/tutorials/recipes/recipes/benchmark.html).\n\n**Contents:**\n\n: 1. [Defining a Timer](#defining-a-timer)\n 2. [Wall time:\n Timer.blocked\\_autorange(\\...)](#wall-time-timer-blocked-autorange)\n 3. [C++ snippets](#c-snippets)\n 4. [Instruction counts:\n Timer.collect\\_callgrind(\\...)](#instruction-counts-timer-collect-callgrind)\n 5. [Instruction counts: Delving\n deeper](#instruction-counts-delving-deeper)\n 6. [A/B testing with Callgrind](#a-b-testing-with-callgrind)\n 7. [Wrapping up](#wrapping-up)\n 8. [Footnotes](#footnotes)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Defining a Timer\n===================\n\nA [Timer]{.title-ref} serves as a task definition.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from torch.utils.benchmark import Timer\n\ntimer = Timer(\n # The computation which will be run in a loop and timed.\n stmt=\"x * y\",\n\n # `setup` will be run before calling the measurement loop, and is used to\n # populate any state which is needed by `stmt`\n setup=\"\"\"\n x = torch.ones((128,))\n y = torch.ones((128,))\n \"\"\",\n\n # Alternatively, ``globals`` can be used to pass variables from the outer scope.\n # \n # globals={\n # \"x\": torch.ones((128,)),\n # \"y\": torch.ones((128,)),\n # },\n\n # Control the number of threads that PyTorch uses. (Default: 1)\n num_threads=1,\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. Wall time: `Timer.blocked_autorange(...)`\n============================================\n\nThis method will handle details such as picking a suitable number if\nrepeats, fixing the number of threads, and providing a convenient\nrepresentation of the results.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Measurement objects store the results of multiple repeats, and provide\n# various utility features.\nfrom torch.utils.benchmark import Measurement\n\nm: Measurement = timer.blocked_autorange(min_run_time=1)\nprint(m)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "``` {.none}\n\nx * y\nsetup:\n x = torch.ones((128,))\n y = torch.ones((128,))\n\n Median: 2.34 us\n IQR: 0.07 us (2.31 to 2.38)\n 424 measurements, 1000 runs per measurement, 1 thread\n```\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. 
,
{ "cell_type": "markdown", "metadata": {}, "source": [ "3. C++ snippets\n===============\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from torch.utils.benchmark import Language\n\ncpp_timer = Timer(\n    \"x * y;\",\n    \"\"\"\n        auto x = torch::ones({128});\n        auto y = torch::ones({128});\n    \"\"\",\n    language=Language.CPP,\n)\n\nprint(cpp_timer.blocked_autorange(min_run_time=1))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "``` {.none}\n\nx * y;\nsetup:\n  auto x = torch::ones({128});\n  auto y = torch::ones({128});\n\n  Median: 1.21 us\n  IQR:    0.03 us (1.20 to 1.23)\n  83 measurements, 10000 runs per measurement, 1 thread\n```\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Unsurprisingly, the C++ snippet is both faster and less variable.\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "4. Instruction counts: `Timer.collect_callgrind(...)`\n=====================================================\n\nFor deep-dive investigations, `Timer.collect_callgrind` wraps\n[Callgrind](https://valgrind.org/docs/manual/cl-manual.html) in order to\ncollect instruction counts. These are useful as they offer fine-grained\nand deterministic (or very low noise in the case of Python) insights\ninto how a snippet is run.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from torch.utils.benchmark import CallgrindStats, FunctionCounts\n\nstats: CallgrindStats = cpp_timer.collect_callgrind()\nprint(stats)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "``` {.none}\n\nx * y;\nsetup:\n  auto x = torch::ones({128});\n  auto y = torch::ones({128});\n\n                       All          Noisy symbols removed\n  Instructions:     563600                 563600\n  Baseline:              0                      0\n100 runs per measurement, 1 thread\n```\n" ] }
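,
{ "cell_type": "markdown", "metadata": {}, "source": [ "The totals in that table can also be queried programmatically. The cell\nbelow is an illustrative sketch rather than part of the original recipe;\nit assumes the `stats` object collected above and uses\n`CallgrindStats.counts`, where `denoise=True` drops symbols known to be\nnoisy in the CPython interpreter.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Illustrative only: total instruction counts as plain integers.\nprint(stats.counts())              # all symbols\nprint(stats.counts(denoise=True))  # noisy CPython symbols removed" ] }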
,
{ "cell_type": "markdown", "metadata": {}, "source": [ "5. Instruction counts: Delving deeper\n=====================================\n\nThe string representation of `CallgrindStats` is similar to that of\n`Measurement`. [Noisy symbols]{.title-ref} are a Python concept (removing\ncalls in the CPython interpreter which are known to be noisy).\n\nFor more detailed analysis, however, we will want to look at specific\ncalls. `CallgrindStats.stats()` returns a `FunctionCounts` object to\nmake this easier. Conceptually, `FunctionCounts` can be thought of as a\ntuple of pairs with some utility methods, where each pair is [(number of\ninstructions, file path and function name)]{.title-ref}.\n\nA note on paths:\n\n:   One generally doesn\\'t care about the absolute path. For instance,\n    the full path and function name for a multiply call is something\n    like:\n\n``` {.sh}\n/the/prefix/to/your/pytorch/install/dir/pytorch/build/aten/src/ATen/core/TensorMethods.cpp:at::Tensor::mul(at::Tensor const&) const [/the/path/to/your/conda/install/miniconda3/envs/ab_ref/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so]\n```\n\nwhen in reality, all of the information that we\\'re interested in can be\nrepresented in:\n\n``` {.sh}\nbuild/aten/src/ATen/core/TensorMethods.cpp:at::Tensor::mul(at::Tensor const&) const\n```\n\n`CallgrindStats.as_standardized()` makes a best effort to strip\nlow-signal portions of the file path, as well as the shared object, and\nis generally recommended.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Note: despite the variable name, `inclusive=False` requests exclusive\n# (self) instruction counts, which attribute work to individual functions.\ninclusive_stats = stats.as_standardized().stats(inclusive=False)\nprint(inclusive_stats[:10])" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "``` {.none}\n<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.FunctionCounts object at 0x7f192a6dfd90>\n  47264  ???:_int_free\n  25963  ???:_int_malloc\n  19900  build/../aten/src/ATen/TensorIter ... (at::TensorIteratorConfig const&)\n  18000  ???:__tls_get_addr\n  13500  ???:malloc\n  11300  build/../c10/util/SmallVector.h:a ... (at::TensorIteratorConfig const&)\n  10345  ???:_int_memalign\n  10000  build/../aten/src/ATen/TensorIter ... (at::TensorIteratorConfig const&)\n   9200  ???:free\n   8000  build/../c10/util/SmallVector.h:a ... IteratorBase::get_strides() const\n\nTotal: 173472\n```\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "That\\'s still quite a lot to digest. Let\\'s use the\n[FunctionCounts.transform]{.title-ref} method to trim some of the\nfunction path and discard the function name. When we do, the counts of\nany collisions (e.g. [foo.h:a()]{.title-ref} and [foo.h:b()]{.title-ref}\nwill both map to [foo.h]{.title-ref}) are added together.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import os\nimport re\n\ndef group_by_file(fn_name: str):\n    if fn_name.startswith(\"???\"):\n        fn_dir, fn_file = fn_name.split(\":\")[:2]\n    else:\n        fn_dir, fn_file = os.path.split(fn_name.split(\":\")[0])\n        fn_dir = re.sub(\"^.*build/../\", \"\", fn_dir)\n        fn_dir = re.sub(\"^.*torch/\", \"torch/\", fn_dir)\n\n    return f\"{fn_dir:<15} {fn_file}\"\n\nprint(inclusive_stats.transform(group_by_file)[:10])" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "``` {.none}\n\n 118200  aten/src/ATen   TensorIterator.cpp\n  65000  c10/util        SmallVector.h\n  47264  ???             _int_free\n  25963  ???             _int_malloc\n  20900  c10/util        intrusive_ptr.h\n  18000  ???             __tls_get_addr\n  15900  c10/core        TensorImpl.h\n  15100  c10/core        CPUAllocator.cpp\n  13500  ???             malloc\n  12500  c10/core        TensorImpl.cpp\n\nTotal: 352327\n```\n" ] }
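,
{ "cell_type": "markdown", "metadata": {}, "source": [ "As described above, a `FunctionCounts` behaves like a tuple of [(number\nof instructions, function name)]{.title-ref} pairs, so it can also be\niterated directly. The cell below is an illustrative sketch, not part of\nthe original recipe; it assumes the `inclusive_stats` object from above.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Illustrative only: each entry unpacks as a (count, function name) pair.\nfor count, fn_name in inclusive_stats[:3]:\n    print(f\"{count:>8}  {fn_name}\")" ] }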
,
{ "cell_type": "markdown", "metadata": {}, "source": [ "6. A/B testing with `Callgrind`\n===============================\n\nOne of the most useful features of instruction counts is that they allow\nfine-grained comparison of computation, which is critical when analyzing\nperformance.\n\nTo see this in action, let\\'s compare our multiplication of two size 128\nTensors with a {128} x {1} multiplication, which will broadcast the\nsecond Tensor: result = {a0 \\* b0, a1 \\* b0, \\..., a127 \\* b0}\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "broadcasting_stats = Timer(\n    \"x * y;\",\n    \"\"\"\n        auto x = torch::ones({128});\n        auto y = torch::ones({1});\n    \"\"\",\n    language=Language.CPP,\n).collect_callgrind().as_standardized().stats(inclusive=False)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Often we want to A/B test two different environments (e.g. testing a PR,\nor experimenting with compile flags). This is quite simple, as\n`CallgrindStats`, `FunctionCounts`, and `Measurement` are all\npickleable. Simply save measurements from each environment, and load\nthem in a single process for analysis, as sketched below.\n" ] }
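,
{ "cell_type": "markdown", "metadata": {}, "source": [ "For example, an illustrative sketch of the two halves of that workflow;\nthe file name is hypothetical, and in practice the save and load would\nhappen in separate processes or machines:\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pickle\n\n# In environment A: save the collected counts to disk. The path\n# \"env_a_counts.pkl\" is just an illustrative name.\nwith open(\"env_a_counts.pkl\", \"wb\") as f:\n    pickle.dump(broadcasting_stats, f)\n\n# Later, in a single analysis process: load the counts back and diff\n# against a second run with `-`, as done in the next cell.\nwith open(\"env_a_counts.pkl\", \"rb\") as f:\n    env_a_counts = pickle.load(f)" ] }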
,
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pickle\n\n# Let's round trip `broadcasting_stats` just to show that we can.\nbroadcasting_stats = pickle.loads(pickle.dumps(broadcasting_stats))\n\n\n# And now to diff the two tasks:\ndelta = broadcasting_stats - inclusive_stats\n\ndef extract_fn_name(fn: str):\n    \"\"\"Trim everything except the function name.\"\"\"\n    fn = \":\".join(fn.split(\":\")[1:])\n    return re.sub(r\"\\(.+\\)\", \"(...)\", fn)\n\n# We use `.transform` to make the diff readable:\nprint(delta.transform(extract_fn_name))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "``` {.none}\n\n  17600  at::TensorIteratorBase::compute_strides(...)\n  12700  at::TensorIteratorBase::allocate_or_resize_outputs()\n  10200  c10::SmallVectorImpl<long>::operator=(...)\n   7400  at::infer_size(...)\n   6200  at::TensorIteratorBase::invert_perm(...) const\n   6064  _int_free\n   5100  at::TensorIteratorBase::reorder_dimensions()\n   4300  malloc\n   4300  at::TensorIteratorBase::compatible_stride(...) const\n    ...\n    -28  _int_memalign\n   -100  c10::impl::check_tensor_options_and_extract_memory_format(...)\n   -300  __memcmp_avx2_movbe\n   -400  at::detail::empty_cpu(...)\n  -1100  at::TensorIteratorBase::numel() const\n  -1300  void at::native::(...)\n  -2400  c10::TensorImpl::is_contiguous(...) const\n  -6100  at::TensorIteratorBase::compute_fast_setup_type(...)\n -22600  at::TensorIteratorBase::fast_set_up(...)\n\nTotal: 58091\n```\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "So the broadcasting version takes an extra 580 instructions per call\n(recall that we\\'re collecting 100 runs per measurement), or about 10%.\nThere are quite a few `TensorIterator` calls, so let\\'s drill down to\nthose. `FunctionCounts.filter` makes this easy.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(delta.transform(extract_fn_name).filter(lambda fn: \"TensorIterator\" in fn))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "``` {.none}\n\n  17600  at::TensorIteratorBase::compute_strides(...)\n  12700  at::TensorIteratorBase::allocate_or_resize_outputs()\n   6200  at::TensorIteratorBase::invert_perm(...) const\n   5100  at::TensorIteratorBase::reorder_dimensions()\n   4300  at::TensorIteratorBase::compatible_stride(...) const\n   4000  at::TensorIteratorBase::compute_shape(...)\n   2300  at::TensorIteratorBase::coalesce_dimensions()\n   1600  at::TensorIteratorBase::build(...)\n  -1100  at::TensorIteratorBase::numel() const\n  -6100  at::TensorIteratorBase::compute_fast_setup_type(...)\n -22600  at::TensorIteratorBase::fast_set_up(...)\n\nTotal: 24000\n```\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "This makes plain what is going on: there is a fast path in\n`TensorIterator` setup, but in the {128} x {1} case we miss it and have\nto do a more general analysis, which is more expensive. The most\nprominent call omitted by the filter is\n[c10::SmallVectorImpl\\<long\\>::operator=(\\...)]{.title-ref}, which is\nalso part of the more general setup.\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "7. Wrapping up\n==============\n\nIn summary, use [Timer.blocked\\_autorange]{.title-ref} to collect wall\ntimes. If timing variation is too high, increase\n[min\\_run\\_time]{.title-ref}, or move to C++ snippets if convenient.\n\nFor fine-grained analysis, use [Timer.collect\\_callgrind]{.title-ref} to\nmeasure instruction counts and [FunctionCounts.(\\_\\_add\\_\\_ /\n\\_\\_sub\\_\\_ / transform / filter)]{.title-ref} to slice-and-dice them.\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "8. Footnotes\n============\n\n- Implied `import torch`: If [globals]{.title-ref} does not contain\n  \\\"torch\\\", Timer will automatically populate it. This means that\n  `Timer(\"torch.empty(())\")` will work. (Though other imports should be\n  placed in [setup]{.title-ref}, e.g.\n  `Timer(\"np.zeros(())\", \"import numpy as np\")`)\n\n- `REL_WITH_DEB_INFO`: In order to provide full information about the\n  PyTorch internals which are executed, `Callgrind` needs access to C++\n  debug symbols. This is accomplished by setting `REL_WITH_DEB_INFO=1`\n  when building PyTorch. Otherwise function calls will be opaque. (The\n  resultant `CallgrindStats` will warn if debug symbols are missing.)\n" ] }
], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 0 }