{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# For tips on running notebooks in Google Colab, see\n# https://codelin.vip/beginner/colab\n%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Inductor CPU backend debugging and profiling\n============================================\n\n**Authors**: [Xuan Liao](https://github.com/Valentine233), [Haozhe\nZhu](https://github.com/zhuhaozhe), [Jiong\nGong](https://github.com/jgong5), [Weihan\nWang](https://github.com/EikanWang)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Overview\n========\n\nPyTorch 2.0 introduced the compilation API called `torch.compile`. This\nnew feature offers a significant speedup over eager mode execution\nthrough graph-level optimization powered by the default Inductor\nbackend.\n\nThis tutorial is intended to provide an in-depth introduction to\ndebugging and performance profiling on the Inductor CPU backend by\ndelving into the intricacies of `torch.compile`.\n\nMeanwhile, you may also find related tutorials about `torch.compile`\ncovering [basic\nusage](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html),\ncomprehensive\n[troubleshooting](https://pytorch.org/docs/stable/torch.compiler_troubleshooting.html)\nand GPU-specific knowledge like [GPU performance\nprofiling](https://pytorch.org/docs/stable/torch.compiler_inductor_profiling.html).\n\nWe will start with a motivating example that triggers compilation issues\nand accuracy problems, and demonstrate the process of debugging to\npinpoint them.\n\nBy enabling logging and exploring the underlying generated code, you can\nlearn how to narrow down the failure step by step and finally figure out\nthe root cause.\n\nFollowing that, we will proceed to discuss how to profile the compiled\ncode and, through a performance comparison with eager mode, elaborate on\nthe reasons why `torch.compile` can provide an additional performance\nboost compared to its eager counterpart.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Debugging\n=========\n\nHere is a simple example that runs `torch.compile` with the Inductor\nbackend and compares its result with eager mode:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import torch\n\ndef foo1(x1, x2):\n    a = torch.neg(x1)\n    b = torch.maximum(x2, a)\n    y = torch.cat([b], dim=0)\n    return y\n\nx1 = torch.randint(256, (1, 8), dtype=torch.uint8)\nx2 = torch.randint(256, (8390, 8), dtype=torch.uint8)\n\ncompiled_foo1 = torch.compile(foo1)\nresult = compiled_foo1(x1, x2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The correct implementation of `neg` in the `cpp` codegen is as follows:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def neg1(x):\n    return f\"decltype({x})(-{x})\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To demonstrate the debugging process, we will later modify this function\ninto an incorrect one.\n\nGet more logging information\n============================\n\nBy default, no debugging information is provided if you run this simple\nexample. 
In order to get more useful debugging and logging\ninformation, we usually add a `TORCH_COMPILE_DEBUG` environment variable\nlike below:\n\n``` {.shell}\nTORCH_COMPILE_DEBUG=1 python xx.py\n```\n\nThis would print more debug information in the output logs and also dump\nthe intermediate IRs generated during the codegen process. You can find\nthe dumped file paths in the log like below:\n\n``` {.shell}\ntorch._inductor.debug: [WARNING] model___20 debug trace: /tmp/torchinductor_root/rx/crxfi2ybd7yp5sbj2pnhw33wfhtdw7wumvrobyp5sjvdui5ktjc2.debug\n```\n\nIn this directory, the following files are saved for debugging purposes:\n\n  ----------------------------------------------------------------------------\n  File                        Description\n  --------------------------- ------------------------------------------------\n  `fx_graph_runnable.py`      Executable FX graph, after decomposition, before\n                              pattern match\n\n  `fx_graph_transformed.py`   Transformed FX graph, after pattern match\n\n  `ir_pre_fusion.txt`         Inductor IR before fusion\n\n  `ir_post_fusion.txt`        Inductor IR after fusion\n\n  `output_code.py`            Generated Python code for graph, with C++/Triton\n                              kernels\n  ----------------------------------------------------------------------------\n\nNote that `fx_graph_runnable.py` and `output_code.py` are both runnable\nand editable in order to make debugging easier. Here are the main parts\nof code extracted from the files, with the generated C++ lines\ncorrelated to the corresponding FX code lines.\n\n`fx_graph_runnable`:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def forward1(self, arg0_1, arg1_1):\n    neg = torch.ops.aten.neg.default(arg0_1); arg0_1 = None\n    maximum = torch.ops.aten.maximum.default(arg1_1, neg); arg1_1 = neg = None\n    clone = torch.ops.aten.clone.default(maximum); maximum = None\n    return (clone,)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "C++ kernel in `output_code`:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import torch\nfrom torch._inductor.async_compile import AsyncCompile\nasync_compile = AsyncCompile()\n\ncpp_fused_cat_maximum_neg_0 = async_compile.cpp('''\n#include \"/tmp/torchinductor_root/gv/cgv6n5aotqjo5w4vknjibhengeycuattfto532hkxpozszcgxr3x.h\"\nextern \"C\" void kernel(const unsigned char* in_ptr0,\n                       const unsigned char* in_ptr1,\n                       unsigned char* out_ptr0)\n{\n    {\n        #pragma GCC ivdep\n        for(long i0=static_cast<long>(0L); i0<static_cast<long>(8390L); i0+=static_cast<long>(1L))\n        {\n            #pragma GCC ivdep\n            for(long i1=static_cast<long>(0L); i1<static_cast<long>(8L); i1+=static_cast<long>(1L))\n            {\n                auto tmp0 = in_ptr0[static_cast<long>(i1 + (8L*i0))];\n                auto tmp1 = in_ptr1[static_cast<long>(i1)];\n                // Corresponding FX code line: neg = torch.ops.aten.neg.default(arg0_1); arg0_1 = None\n                auto tmp2 = decltype(tmp1)(-tmp1);\n                // Corresponding FX code line: maximum = torch.ops.aten.maximum.default(arg1_1, neg); arg1_1 = neg = None\n                auto tmp3 = max_propagate_nan(tmp0, tmp2);\n                // Corresponding FX code line: clone = torch.ops.aten.clone.default(maximum); maximum = None\n                out_ptr0[static_cast<long>(i1 + (8L*i0))] = tmp3;\n            }\n        }\n    }\n}''')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Determine component of error\n============================\n\nWhen encountering errors or accuracy problems, a straightforward\nsolution to find the bug is to narrow down the problem. The first thing\nto do is to determine the component where the error occurs. 
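Luckily, this can be done simply by changing the backend of\n`torch.compile`. As a minimal sketch (reusing `foo1`, `x1` and `x2` from\nthe example above), you can rerun the function with progressively heavier\nbackends and observe which one first hits the problem:\n\n``` {.python}\nimport torch._dynamo\n\n# Try progressively heavier backends: \"eager\" runs only Dynamo,\n# \"aot_eager\" adds AOT Autograd, and \"inductor\" adds code generation.\nfor backend in [\"eager\", \"aot_eager\", \"inductor\"]:\n    torch._dynamo.reset()  # clear compilation caches between runs\n    try:\n        torch.compile(foo1, backend=backend)(x1, x2)\n        print(f\"{backend}: OK\")\n    except Exception as e:\n        print(f\"{backend}: failed with {type(e).__name__}\")\n```\n\n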
The available backends and the components they enable are summarized\nbelow:\n\n  -----------------------------------------------------------------------------\n  Code                                       Description\n  ------------------------------------------ ----------------------------------\n  `torch.compile(fn, backend=\"eager\")`       Enable Dynamo\n\n  `torch.compile(fn, backend=\"aot_eager\")`   Enable Dynamo + AOT Autograd\n\n  `torch.compile(fn, backend=\"inductor\")`    Enable Dynamo + AOT Autograd +\n                                             Inductor\n  -----------------------------------------------------------------------------\n\nIf the model can successfully run when the backend is set to `eager` or\n`aot_eager` while it fails with `inductor`, we can narrow down the\nfailure to Inductor.\n\nCompilation error\n=================\n\nAs we know, the chain of graph-level optimization looks like this:\n\n``` {.sh}\ntorch.neg (Python) -> torch.ops.aten.neg.default (within FX graph) -> ops.neg (within IR node) -> tmp2 = -tmp1 (within C++ kernel)\n```\n\nIf you encounter a compilation error, there is something wrong when\ncompiling C++ kernels in the output code. This type of error indicates\nthat bugs are introduced when lowering IR nodes to output code. The root\ncause of a compilation error is usually shown in the traceback log.\n\nFor example, the `neg` function is modified like this:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def neg2(x):\n    return f\"-{x}\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The logging gives the following compile error with a rather clear\nreason:\n\n``` {.}\ntorch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:\nCppCompileError: C++ compile error\n/tmp/torchinductor_root/xg/cxga5tk3b4lkwoxyigrtocjp5s7vc5cg2ikuscf6bk6pjqip2bhx.cpp: In function \u2018void kernel(const unsigned char*, const unsigned char*, unsigned char*)\u2019:\n/tmp/torchinductor_root/xg/cxga5tk3b4lkwoxyigrtocjp5s7vc5cg2ikuscf6bk6pjqip2bhx.cpp:17:57: error: no matching function for call to \u2018max_propagate_nan(unsigned char&, int&)\u2019\n   17 |             auto tmp3 = max_propagate_nan(tmp0, tmp2);\n      |                                                       ^\nIn file included from /tmp/torchinductor_root/xg/cxga5tk3b4lkwoxyigrtocjp5s7vc5cg2ikuscf6bk6pjqip2bhx.cpp:2:\n/tmp/torchinductor_root/gv/cgv6n5aotqjo5w4vknjibhengeycuattfto532hkxpozszcgxr3x.h:27:17: note: candidate: \u2018template<class scalar_t> scalar_t max_propagate_nan(scalar_t, scalar_t)\u2019\n   27 | inline scalar_t max_propagate_nan(scalar_t a, scalar_t b) {\n      |                 ^~~~~~~~~~~~~~~~~\n/tmp/torchinductor_root/gv/cgv6n5aotqjo5w4vknjibhengeycuattfto532hkxpozszcgxr3x.h:27:17: note:   template argument deduction/substitution failed:\n/tmp/torchinductor_root/xg/cxga5tk3b4lkwoxyigrtocjp5s7vc5cg2ikuscf6bk6pjqip2bhx.cpp:17:57: note:   deduced conflicting types for parameter \u2018scalar_t\u2019 (\u2018unsigned char\u2019 and \u2018int\u2019)\n   17 |             auto tmp3 = max_propagate_nan(tmp0, tmp2);\n      |                                                       ^\n```\n\nLet us also see the corresponding C++ kernel in output code and IR node.\n\nC++ kernel:\n\n``` {.c}\n#include \"/tmp/torchinductor_root/gv/cgv6n5aotqjo5w4vknjibhengeycuattfto532hkxpozszcgxr3x.h\"\nextern \"C\" void kernel(const unsigned char* in_ptr0,\n                       const unsigned char* in_ptr1,\n                       unsigned char* out_ptr0)\n{\n    {\n        #pragma GCC ivdep\n        for(long i0=static_cast<long>(0L); i0<static_cast<long>(8390L); i0+=static_cast<long>(1L))\n        {\n            #pragma GCC ivdep\n            for(long i1=static_cast<long>(0L); i1<static_cast<long>(8L); i1+=static_cast<long>(1L))\n            {\n                auto tmp0 = in_ptr0[static_cast<long>(i1 + (8L*i0))];\n                auto tmp1 = in_ptr1[static_cast<long>(i1)];\n                auto tmp2 = -tmp1;\n                auto tmp3 
= max_propagate_nan(tmp0, tmp2);\n                out_ptr0[static_cast<long>(i1 + (8L*i0))] = tmp3;\n            }\n        }\n    }\n}\n```\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "IR node:\n\n``` {.sh}\nbuf0: SchedulerNode(ComputedBuffer)\nbuf0.writes = [MemoryDep('buf0', c0, {c0: 67120})]\nbuf0.unmet_dependencies = []\nbuf0.met_dependencies = \n    [   MemoryDep('arg0_1', c1, {c0: 8390, c1: 8}),\n        MemoryDep('arg1_1', c0, {c0: 67120})]\nbuf0.users = [NodeUser(node=OUTPUT, can_inplace=False)]\nbuf0.group.device = cpu\nbuf0.group.iteration = ((8390, 8), ())\nbuf0.sizes = ([8390, 8], [])\nclass buf0_loop_body:\n    var_ranges = {z0: 8390, z1: 8}\n    index0 = 8*z0 + z1\n    index1 = z1\n    def body(self, ops):\n        get_index = self.get_index('index0')\n        load = ops.load('arg1_1', get_index)\n        get_index_1 = self.get_index('index1')\n        load_1 = ops.load('arg0_1', get_index_1)\n        neg = ops.neg(load_1)\n        maximum = ops.maximum(load, neg)\n        get_index_2 = self.get_index('index0')\n        store = ops.store('buf0', get_index_2, maximum, None)\n        return store\n```\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "According to the traceback logging, the compilation error is caused by\nthe data type inconsistency of `max_propagate_nan`\\'s inputs. By\nchecking the C++ kernel, we know that `tmp2` is no longer\n`unsigned char` after doing `-` (the plain negation promotes it to\n`int`), while `tmp0` is still `unsigned char`. We can easily match `-`\nand `max_propagate_nan` in the C++ kernel with `ops.neg` and\n`ops.maximum` in the IR node respectively.\n\nNow we successfully find that the root cause is the implementation of\n`ops.neg` in the `cpp` codegen, which silently changes the data type\nwhen doing `neg`.\n\nAccuracy debugging\n==================\n\nOtherwise, if the model runs into other errors or an accuracy problem,\nyou can use the PyTorch debugging tool called\n[Minifier](https://pytorch.org/functorch/stable/notebooks/minifier.html).\n\nThe core idea of `Minifier` is to keep removing the nodes and inputs of\nthe graph until it finds the minimal graph with the problem. It helps to\nautomatically generate a minified problematic graph through 4\nstrategies: truncating suffix, delta debugging, eliminating dead code\nand removing unused inputs.\n\nWe will now show the debugging process for the accuracy problem with the\nhelp of `Minifier`. 
The accuracy problem refers to the case where the\noutputs of the eager and inductor backends are different.\n\nFor instance, we modify the example like this:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from torch._dynamo.utils import same\n\ndef foo2(x1, x2):\n    a = torch.neg(x1)\n    b = torch.maximum(x2, a)\n    y = torch.cat([b], dim=0)\n    return y\n\nx1 = torch.randn((1, 8), dtype=torch.float32)\nx2 = torch.randn((8390, 8), dtype=torch.float32)\n\nexpected_result = foo2(x1, x2)\n\ncompiled_foo2 = torch.compile(foo2)\nactual_result = compiled_foo2(x1, x2)\n\nassert same(expected_result, actual_result) == True" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And also modify the `neg` function:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def neg3(x):\n    return f\"decltype({x})(2 * {x})\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An accuracy problem would be raised as follows:\n\n``` {.sh}\ntorch._dynamo.utils: [ERROR] Accuracy failed: allclose not within tol=0.0001\nTraceback (most recent call last):\n  File \"test_script.py\", line 18, in <module>\n    assert same(expected_result, actual_result) == True\nAssertionError\n```\n\nTo debug an accuracy problem with Minifier, two environment variables\nare needed:\n\n``` {.sh}\nTORCHDYNAMO_REPRO_AFTER=\"aot\" TORCHDYNAMO_REPRO_LEVEL=4 python xx.py\n```\n\nThis gives us logging information that demonstrates the steps of\nminifying:\n\n``` {.sh}\nStarted off with 6 nodes\n\nTrying granularity 2\nStrategy: Truncate suffix (G: 2) (6 nodes, 2 inputs)\nSUCCESS: Went from 6 to 4 nodes\n\nTrying granularity 4\nStrategy: Remove unused inputs (G: 4) (4 nodes, 2 inputs)\nSUCCESS: Went from 4 to 3 nodes\n```\n\nAfter running, we get the final minified graph with the target node\n`neg`:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def forward2(self, arg0_1):\n    neg = torch.ops.aten.neg.default(arg0_1); arg0_1 = None\n    return (neg,)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For more usage details about Minifier, please refer to\n[Troubleshooting](https://pytorch.org/docs/stable/torch.compiler_troubleshooting.html).\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Performance profiling\n=====================\n\nWithin this section, we will demonstrate the process of conducting\nperformance analysis for a model that has been compiled using the\nInductor CPU backend. In the example below, we benchmark a Hugging Face\nTransformer model `MobileBertForQuestionAnswering` with both the eager\nmode and the Inductor graph mode. The execution time and the speedup\nratio of Inductor are printed after the benchmark. We use Intel(R)\nXeon(R) Platinum 8358 CPU @ 2.60GHz and run the benchmark on the first\nsocket to demonstrate the optimization within this section. 
We set\nthe following environment variables as a best practice to benchmark on\nIntel(R) CPU.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "``` {.shell}\nexport KMP_BLOCKTIME=1\nexport KMP_SETTINGS=1\nexport KMP_AFFINITY=granularity=fine,compact,1,0\nexport LD_PRELOAD=${CONDA_PREFIX:-\"$(dirname $(which conda))/../\"}/lib/libiomp5.so:${CONDA_PREFIX:-\"$(dirname $(which conda))/../\"}/lib/libjemalloc.so\nexport MALLOC_CONF=\"oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1\"\nnumactl -C 0-31 -m 0 python bench.py\n```\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# bench.py\nfrom transformers import MobileBertForQuestionAnswering\nimport torch\n# Initialize an eager model\nmodel = MobileBertForQuestionAnswering.from_pretrained(\"csarron/mobilebert-uncased-squad-v2\")\nseq_length = 128\nbs = 128\nvocab_size = model.config.vocab_size\ninput = torch.randint(0, vocab_size, (bs, seq_length), dtype=torch.int64)\ninput_dict = {\"input_ids\": input}\n\n# Initialize the inductor model\ncompiled_model = torch.compile(model)\nwith torch.no_grad():\n    compiled_model(**input_dict)\n\nNUM_ITERS=50\nimport timeit\nwith torch.no_grad():\n    # warmup\n    for _ in range(10):\n        model(**input_dict)\n    eager_t = timeit.timeit(\"model(**input_dict)\", number=NUM_ITERS, globals=globals())\n\nwith torch.no_grad():\n    # warmup\n    for _ in range(10):\n        compiled_model(**input_dict)\n    inductor_t = timeit.timeit(\"compiled_model(**input_dict)\", number=NUM_ITERS, globals=globals())\n# print(f\"eager use: {eager_t * 1000 / NUM_ITERS} ms/iter\")\n# print(f\"inductor use: {inductor_t * 1000 / NUM_ITERS} ms/iter\")\n# print(f\"speed up ratio: {eager_t / inductor_t}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Output:\n\n``` {.shell}\neager use: 802.1023553796113 ms/iter\ninductor use: 339.95180135127157 ms/iter\nspeed up ratio: 2.359459053287382\n```\n\nIn our own testing, we find that the Inductor CPU backend speeds up the\nmodel by around 2.355x.\n\nNext, let\\'s dive deep into the performance at the operation level to\nunderstand where the speed-up comes from. [Pytorch\nProfiler](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html)\nis a good tool to help us. 
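In its simplest form (a minimal sketch, independent of the\nschedule-based setup shown later, and reusing `compiled_model` and\n`input_dict` from `bench.py` above), the profiler can be used as a\ncontext manager:\n\n``` {.python}\nfrom torch.profiler import profile, ProfilerActivity\n\nwith torch.no_grad(), profile(activities=[ProfilerActivity.CPU]) as p:\n    compiled_model(**input_dict)\nprint(p.key_averages().table(sort_by=\"self_cpu_time_total\", row_limit=10))\n```\n\n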
The Inductor CPU backend can also report the time of the fused kernels\nto the profiler via the `enable_kernel_profile` configuration option:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from torch._inductor import config\nconfig.cpp.enable_kernel_profile = True" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Following the steps in [Pytorch\nProfiler](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html),\nwe are able to get the profiling table and trace files.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# bench.py\nfrom torch.profiler import profile, schedule, ProfilerActivity\nRESULT_DIR = \"./prof_trace\"\nmy_schedule = schedule(\n    skip_first=10,\n    wait=5,\n    warmup=5,\n    active=1,\n    repeat=5)\n\ndef trace_handler(p):\n    output = p.key_averages().table(sort_by=\"self_cpu_time_total\", row_limit=20)\n    # print(output)\n    p.export_chrome_trace(f\"{RESULT_DIR}/{p.step_num}.json\")\n\nfor _ in range(10):\n    model(**input_dict) # compiled_model(**input_dict) to get inductor model profiling\n\ntotal = 0\nwith profile(\n    activities=[ProfilerActivity.CPU],\n    schedule=my_schedule,\n    on_trace_ready=trace_handler\n) as p:\n    for _ in range(50):\n        model(**input_dict) # compiled_model(**input_dict) to get inductor model profiling\n        p.step()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We get the following performance profiling table for the eager-mode\nmodel (omitting some columns):\n\n``` {.shell}\n------------------------- ------------ ------------ ------------\n Name                      CPU total %    CPU total   # of Calls\n------------------------- ------------ ------------ ------------\n aten::addmm                    45.73%    370.814ms          362\n aten::add                      19.89%    161.276ms          363\n aten::copy_                    14.97%    121.416ms          488\n aten::mul                       9.02%     73.154ms          194\n aten::clamp_min                 8.81%     71.444ms           96\n aten::bmm                       5.46%     44.258ms           48\n ProfilerStep*                 100.00%    810.920ms            1\n aten::div                       2.89%     23.447ms           24\n aten::_softmax                  1.00%      8.087ms           24\n aten::linear                   46.48%    376.888ms          362\n aten::clone                     2.77%     22.430ms           98\n aten::t                         0.31%      2.502ms          362\n aten::view                      0.14%      1.161ms          850\n aten::transpose                 0.17%      1.377ms          386\n aten::index_select              0.12%    952.000us            3\n aten::expand                    0.12%    986.000us          458\n aten::matmul                    8.31%     67.420ms           48\n aten::cat                       0.09%    703.000us            1\n aten::as_strided                0.08%    656.000us          963\n aten::relu                      8.86%     71.864ms           96\n------------------------- ------------ ------------ ------------\nSelf CPU time total: 810.920ms\n```\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly, we also get the table for the compiled model with Inductor\n(omitting some columns):\n\n``` {.shell}\n----------------------------------------------- ------------ ------------ ------------\n Name                                            CPU total %    CPU total   # of Calls\n----------------------------------------------- ------------ ------------ ------------\n mkl::_mkl_linear                                     68.79%    231.573ms          362\n aten::bmm                                             8.02%     26.992ms           48\n ProfilerStep*                                       100.00%    336.642ms            1\n graph_0_cpp_fused_constant_pad_nd_embedding_0         0.27%    915.000us            1\n aten::empty                                           0.27%    911.000us          362\n graph_0_cpp_fused__mkl_linear_add_mul_relu_151        0.27%    901.000us            1\n graph_0_cpp_fused__mkl_linear_add_mul_relu_226        0.27%    899.000us            1\n graph_0_cpp_fused__mkl_linear_add_mul_relu_361        0.27%    898.000us            1\n graph_0_cpp_fused__mkl_linear_add_mul_relu_121        0.27%    895.000us            1\n graph_0_cpp_fused__mkl_linear_add_mul_relu_31         0.27%    893.000us            1\n graph_0_cpp_fused__mkl_linear_add_mul_relu_76         0.26%    892.000us            1\n 
graph_0_cpp_fused__mkl_linear_add_mul_relu_256        0.26%    892.000us            1\n graph_0_cpp_fused__mkl_linear_add_mul_relu_346        0.26%    892.000us            1\n graph_0_cpp_fused__mkl_linear_add_mul_relu_241        0.26%    891.000us            1\n graph_0_cpp_fused__mkl_linear_add_mul_relu_316        0.26%    891.000us            1\n graph_0_cpp_fused__mkl_linear_add_mul_relu_91         0.26%    890.000us            1\n graph_0_cpp_fused__mkl_linear_add_mul_relu_106        0.26%    890.000us            1\n graph_0_cpp_fused__mkl_linear_add_mul_relu_211        0.26%    890.000us            1\n graph_0_cpp_fused__mkl_linear_add_mul_relu_61         0.26%    889.000us            1\n graph_0_cpp_fused__mkl_linear_add_mul_relu_286        0.26%    889.000us            1\n----------------------------------------------- ------------ ------------ ------------\nSelf CPU time total: 336.642ms\n```\n\nFrom the profiling table of the eager model, we can see that the most\ntime-consuming ops are \\[`aten::addmm`, `aten::add`, `aten::copy_`,\n`aten::mul`, `aten::clamp_min`, `aten::bmm`\\]. Comparing it with the\ninductor model profiling table, we notice an `mkl::_mkl_linear` entry\nand multiple fused kernels of the form `graph_0_cpp_fused_*`. They are\nthe major optimizations that the inductor model is doing. Let us discuss\nthem separately.\n\n\\(1\\) Regarding `mkl::_mkl_linear`: You may notice the number of calls\nto this kernel is 362, which is exactly the same as `aten::linear` in\nthe eager model profiling table. The CPU total of `aten::linear` is\n376.888ms, while it is 231.573ms for `mkl::_mkl_linear`. This suggests a\n\\~1.63x speedup for the \\\"linear\\\" part. The speedup mainly comes from\n[packing the weight tensor to block memory\nformat](https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-c/2023-1/cblas-gemm-pack-002.html)\nand invoking\n[cblas\\_sgemm\\_compute](https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-c/2023-1/cblas-gemm-compute-002.html)\nwithin the Inductor CPU backend to achieve better cache behavior during\nGEMM computation.\n\n\\(2\\) Regarding other memory-intensive ops: The end-to-end latency of\nthe eager/inductor model is 802/339ms in our testing. So, by subtracting\nthe \\\"linear\\\" part from the end-to-end latency and comparing the\nremainders, we can roughly infer that the speedup for the other\nmemory-intensive ops is around 3.94x. Let\\'s read the generated code to\nunderstand how Inductor achieves this impressive optimization. 
You can find the generated code\nby searching `cpp_fused__mkl_linear_add_mul_relu_151` in\n`output_code.py`.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "cpp_fused__mkl_linear_add_mul_relu_151 = async_compile.cpp('''\n#include <ATen/record_function.h>\n#include \"/tmp/torchinductor_root/lr/clrlgu27q4ggd472umdzwsu6qcpqxcuusjxqvx2hwitjbujiiz7z.h\"\nextern \"C\" void kernel(float* in_out_ptr0,\n                       const float* in_ptr0,\n                       const float* in_ptr1,\n                       const float* in_ptr2,\n                       const float* in_ptr3)\n{\n    RECORD_FUNCTION(\"graph_0_cpp_fused__mkl_linear_add_mul_relu_151\", c10::ArrayRef<c10::IValue>({}));\n    #pragma omp parallel num_threads(32)\n    {\n        {\n            #pragma omp for\n            for(long i0=static_cast<long>(0L); i0<static_cast<long>(16384L); i0+=static_cast<long>(1L))\n            {\n                for(long i1=static_cast<long>(0L); i1<static_cast<long>(512L); i1+=static_cast<long>(8L))\n                {\n                    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(i1 + (512L*i0)));\n                    auto tmp1 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(i1));\n                    auto tmp3 = at::vec::Vectorized<float>::loadu(in_out_ptr0 + static_cast<long>(i1 + (512L*i0)));\n                    auto tmp5 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(i1));\n                    auto tmp7 = at::vec::Vectorized<float>::loadu(in_ptr3 + static_cast<long>(i1));\n                    auto tmp2 = tmp0 + tmp1;\n                    auto tmp4 = tmp2 + tmp3;\n                    auto tmp6 = tmp4 * tmp5;\n                    auto tmp8 = tmp6 + tmp7;\n                    tmp8.store(in_out_ptr0 + static_cast<long>(i1 + (512L*i0)));\n                }\n            }\n        }\n    }\n}''')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the generated code above, we can see this kernel has done a typical\n[Loop Fusion](https://en.wikipedia.org/wiki/Loop_fission_and_fusion) on\n`[add, add, mul, add]`. This pattern is a memory-bound bottleneck that\nprevents good performance in eager mode. To get a more intuitive feeling\nabout this optimization, we can infer the sizes and strides of the\ninputs and further benchmark this `[add, add, mul, add]` pattern.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# bench.py\ndef func(arg_0, arg_1, arg_2, arg_3, arg_4):\n    add_0 = arg_0 + arg_1\n    add_1 = add_0 + arg_2\n    mul_1 = add_1 * arg_3\n    add_2 = mul_1 + arg_4\n    arg_2 = add_2\n    return arg_2\n\narg_0 = torch.rand(16384, 512)\narg_1 = torch.rand(1, 512)\narg_2 = torch.zeros(16384, 512)\narg_3 = torch.rand(1, 512)\narg_4 = torch.rand(1, 512)\n\ninput = (arg_0, arg_1, arg_2, arg_3, arg_4)\ninductor_func = torch.compile(func)\nwith torch.no_grad():\n    inductor_func(*input)\n\nimport timeit\nNUM_ITERS=100\nwith torch.no_grad():\n    # warmup\n    for _ in range(10):\n        func(*input)\n    eager_t = timeit.timeit(\"func(*input)\", number=NUM_ITERS, globals=globals())\n\nwith torch.no_grad():\n    # warmup\n    for _ in range(10):\n        inductor_func(*input)\n    inductor_t = timeit.timeit(\"inductor_func(*input)\", number=NUM_ITERS, globals=globals())\n# print(f\"eager use: {eager_t * 1000 / NUM_ITERS} ms/iter\")\n# print(f\"inductor use: {inductor_t * 1000 / NUM_ITERS} ms/iter\")\n# print(f\"speed up ratio: {eager_t / inductor_t}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Output:\n\n``` {.shell}\neager use: 5.780875144992024 ms/iter\ninductor use: 0.9588955780491233 ms/iter\nspeed up ratio: 6.0286805751604735\n```\n\nThis is just an example. The profiling table shows that all element-wise\nops are fused automatically by Inductor in this model. 
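Besides the `TORCH_COMPILE_DEBUG` dump described earlier, recent PyTorch\nreleases can also print the generated kernels directly to the log through\nthe `TORCH_LOGS` facility (availability depends on your PyTorch version):\n\n``` {.shell}\nTORCH_LOGS=\"output_code\" python bench.py\n```\n\n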
You can read more kernels in [output\\_code.py]{.title-ref}.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Conclusion\n==========\n\nThis document gives an in-depth tutorial for the Inductor CPU backend.\n\nWith motivating examples, we walk through the process of debugging and\nprofiling. The main idea is to narrow down the problem.\n\nWe demonstrate step by step how to delve deeper into an issue and find\nthe root cause of failures, with the help of debug logging and the\nMinifier tool: first determine which component the failure occurs in,\nand then try to generate the smallest snippet of code that can reproduce\nthe failure.\n\nFor the case where the performance with Inductor is better than that of\neager mode, we provide a solid analytical method for performance\nprofiling. We show how to find the time-consuming hotspots with PyTorch\nProfiler and figure out the operator-level or kernel-level reasons that\nexplain the phenomenon.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 0 }