{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# For tips on running notebooks in Google Colab, see\n# https://codelin.vip/beginner/colab\n%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reducing torch.compile cold start compilation time with regional compilation\n============================================================================\n\n**Author:** [Animesh Jain](https://github.com/anijain2305)\n\nAs deep learning models get larger, the compilation time of these models\nalso increases. This extended compilation time can result in a large\nstartup time in inference services or wasted resources in large-scale\ntraining. This recipe shows an example of how to reduce the cold start\ncompilation time by choosing to compile a repeated region of the model\ninstead of the entire model.\n\nPrerequisites\n-------------\n\n- Pytorch 2.5 or later\n\nSetup\n-----\n\nBefore we begin, we need to install `torch` if it is not already\navailable.\n\n``` {.sh}\npip install torch\n```\n\n```{=html}\n
<div class="alert alert-info"><h4>Note</h4><p>This feature is available starting with the 2.5 release. If you are using version 2.4, you can enable the configuration flag <code>torch._dynamo.config.inline_inbuilt_nn_modules=True</code> to prevent recompilations during regional compilation. In version 2.5, this flag is enabled by default.</p></div>
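<p>As a minimal sketch for PyTorch 2.4 users, the flag named above can be set once, before any call to <code>torch.compile</code>:</p>
<pre><code>import torch

# Workaround for PyTorch 2.4 only; later releases enable this flag by default.
torch._dynamo.config.inline_inbuilt_nn_modules = True
</code></pre>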
\n```\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from time import perf_counter" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Steps\n=====\n\nIn this recipe, we will follow these steps:\n\n1. Import all necessary libraries.\n2. Define and initialize a neural network with repeated regions.\n3. Understand the difference between the full model and the regional\n compilation.\n4. Measure the compilation time of the full model and the regional\n compilation.\n\nFirst, let\\'s import the necessary libraries for loading our data:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import torch\nimport torch.nn as nn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, let\\'s define and initialize a neural network with repeated\nregions.\n\nTypically, neural networks are composed of repeated layers. For example,\na large language model is composed of many Transformer blocks. In this\nrecipe, we will create a `Layer` using the `nn.Module` class as a proxy\nfor a repeated region. We will then create a `Model` which is composed\nof 64 instances of this `Layer` class.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "class Layer(torch.nn.Module):\n def __init__(self):\n super().__init__()\n self.linear1 = torch.nn.Linear(10, 10)\n self.relu1 = torch.nn.ReLU()\n self.linear2 = torch.nn.Linear(10, 10)\n self.relu2 = torch.nn.ReLU()\n\n def forward(self, x):\n a = self.linear1(x)\n a = self.relu1(a)\n a = torch.sigmoid(a)\n b = self.linear2(a)\n b = self.relu2(b)\n return b\n\n\nclass Model(torch.nn.Module):\n def __init__(self, apply_regional_compilation):\n super().__init__()\n self.linear = torch.nn.Linear(10, 10)\n # Apply compile only to the repeated layers.\n if apply_regional_compilation:\n self.layers = torch.nn.ModuleList(\n [torch.compile(Layer()) for _ in range(64)]\n )\n else:\n self.layers = torch.nn.ModuleList([Layer() for _ in range(64)])\n\n def forward(self, x):\n # In regional compilation, the self.linear is outside of the scope of `torch.compile`.\n x = self.linear(x)\n for layer in self.layers:\n x = layer(x)\n return x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, let\\'s review the difference between the full model and the\nregional compilation.\n\nIn full model compilation, the entire model is compiled as a whole. This\nis the common approach most users take with `torch.compile`. In this\nexample, we apply `torch.compile` to the `Model` object. This will\neffectively inline the 64 layers, producing a large graph to compile.\nYou can look at the full graph by running this recipe with\n`TORCH_LOGS=graph_code`.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "model = Model(apply_regional_compilation=False).cuda()\nfull_compiled_model = torch.compile(model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The regional compilation, on the other hand, compiles a region of the\nmodel. By strategically choosing to compile a repeated region of the\nmodel, we can compile a much smaller graph and then reuse the compiled\ngraph for all the regions. 
In the example, `torch.compile` is applied\nonly to the `layers` and not the full model.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "regional_compiled_model = Model(apply_regional_compilation=True).cuda()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Applying compilation to a repeated region, instead of the full model,\nleads to large savings in compile time. Here, we will just compile a layer\ninstance and then reuse it 64 times in the `Model` object.\n\nNote that with repeated regions, some part of the model might not be\ncompiled. For example, the `self.linear` in the `Model` is outside of\nthe scope of regional compilation.\n\nAlso, note that there is a tradeoff between performance speedup and\ncompile time. Full model compilation involves a larger graph and,\ntheoretically, offers more scope for optimizations. However, for\npractical purposes and depending on the model, we have observed many\ncases with minimal speedup differences between the full model and\nregional compilation.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, let\\'s measure the compilation time of the full model and the\nregional compilation.\n\n`torch.compile` is a JIT compiler, which means that it compiles on the\nfirst invocation. In the code below, we measure the total time spent in\nthe first invocation. While this method is not precise, it provides a\ngood estimate since the majority of the time is spent in compilation.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def measure_latency(fn, input):\n    # Reset the compiler caches to ensure no reuse between different runs\n    torch.compiler.reset()\n    with torch._inductor.utils.fresh_inductor_cache():\n        start = perf_counter()\n        fn(input)\n        torch.cuda.synchronize()\n        end = perf_counter()\n        return end - start\n\n\ninput = torch.randn(10, 10, device=\"cuda\")\nfull_model_compilation_latency = measure_latency(full_compiled_model, input)\nprint(f\"Full model compilation time = {full_model_compilation_latency:.2f} seconds\")\n\nregional_compilation_latency = measure_latency(regional_compiled_model, input)\nprint(f\"Regional compilation time = {regional_compilation_latency:.2f} seconds\")\n\nassert regional_compilation_latency < full_model_compilation_latency" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Conclusion\n==========\n\nThis recipe shows how to control the cold start compilation time if your\nmodel has repeated regions. This approach requires user modifications to\napply `torch.compile` to the repeated regions instead of the more\ncommonly used full model compilation. We are continually working on\nreducing cold start compilation time.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 0 }