{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# For tips on running notebooks in Google Colab, see\n# https://codelin.vip/beginner/colab\n%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Automatic Mixed Precision\n=========================\n\n**Author**: [Michael Carilli](https://github.com/mcarilli)\n\n[torch.cuda.amp](https://pytorch.org/docs/stable/amp.html) provides\nconvenience methods for mixed precision, where some operations use the\n`torch.float32` (`float`) datatype and other operations use\n`torch.float16` (`half`). Some ops, like linear layers and convolutions,\nare much faster in `float16` or `bfloat16`. Other ops, like reductions,\noften require the dynamic range of `float32`. Mixed precision tries to\nmatch each op to its appropriate datatype, which can reduce your\nnetwork\\'s runtime and memory footprint.\n\nOrdinarily, \\\"automatic mixed precision training\\\" uses\n[torch.autocast](https://pytorch.org/docs/stable/amp.html#torch.autocast)\nand\n[torch.cuda.amp.GradScaler](https://pytorch.org/docs/stable/amp.html#torch.cuda.amp.GradScaler)\ntogether.\n\nThis recipe measures the performance of a simple network in default\nprecision, then walks through adding `autocast` and `GradScaler` to run\nthe same network in mixed precision with improved performance.\n\nYou may download and run this recipe as a standalone Python script. The\nonly requirements are PyTorch 1.6 or later and a CUDA-capable GPU.\n\nMixed precision primarily benefits Tensor Core-enabled architectures\n(Volta, Turing, Ampere). This recipe should show significant (2-3X)\nspeedup on those architectures. On earlier architectures (Kepler,\nMaxwell, Pascal), you may observe a modest speedup. Run `nvidia-smi` to\ndisplay your GPU\\'s architecture.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import torch, time, gc\n\n# Timing utilities\nstart_time = None\n\ndef start_timer():\n global start_time\n gc.collect()\n torch.cuda.empty_cache()\n torch.cuda.reset_max_memory_allocated()\n torch.cuda.synchronize()\n start_time = time.time()\n\ndef end_timer_and_print(local_msg):\n torch.cuda.synchronize()\n end_time = time.time()\n print(\"\\n\" + local_msg)\n print(\"Total execution time = {:.3f} sec\".format(end_time - start_time))\n print(\"Max memory used by tensors = {} bytes\".format(torch.cuda.max_memory_allocated()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A simple network\n================\n\nThe following sequence of linear layers and ReLUs should show a speedup\nwith mixed precision.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def make_model(in_size, out_size, num_layers):\n layers = []\n for _ in range(num_layers - 1):\n layers.append(torch.nn.Linear(in_size, in_size))\n layers.append(torch.nn.ReLU())\n layers.append(torch.nn.Linear(in_size, out_size))\n return torch.nn.Sequential(*tuple(layers)).cuda()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`batch_size`, `in_size`, `out_size`, and `num_layers` are chosen to be\nlarge enough to saturate the GPU with work. Typically, mixed precision\nprovides the greatest speedup when the GPU is saturated. Small networks\nmay be CPU bound, in which case mixed precision won\\'t improve\nperformance. 
Sizes are also chosen such that linear layers\\'\nparticipating dimensions are multiples of 8, to permit Tensor Core usage\non Tensor Core-capable GPUs (see\n`Troubleshooting`{.interpreted-text role=\"ref\"} below).\n\nExercise: Vary participating sizes and see how the mixed precision\nspeedup changes.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "batch_size = 512 # Try, for example, 128, 256, 513.\nin_size = 4096\nout_size = 4096\nnum_layers = 3\nnum_batches = 50\nepochs = 3\n\ndevice = 'cuda' if torch.cuda.is_available() else 'cpu'\ntorch.set_default_device(device)\n\n# Creates data in default precision.\n# The same data is used for both default and mixed precision trials below.\n# You don't need to manually change inputs' ``dtype`` when enabling mixed precision.\ndata = [torch.randn(batch_size, in_size) for _ in range(num_batches)]\ntargets = [torch.randn(batch_size, out_size) for _ in range(num_batches)]\n\nloss_fn = torch.nn.MSELoss().cuda()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Default Precision\n=================\n\nWithout `torch.cuda.amp`, the following simple network executes all ops\nin default precision (`torch.float32`):\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "net = make_model(in_size, out_size, num_layers)\nopt = torch.optim.SGD(net.parameters(), lr=0.001)\n\nstart_timer()\nfor epoch in range(epochs):\n for input, target in zip(data, targets):\n output = net(input)\n loss = loss_fn(output, target)\n loss.backward()\n opt.step()\n opt.zero_grad() # set_to_none=True here can modestly improve performance\nend_timer_and_print(\"Default precision:\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Adding `torch.autocast`\n=======================\n\nInstances of\n[torch.autocast](https://pytorch.org/docs/stable/amp.html#autocasting)\nserve as context managers that allow regions of your script to run in\nmixed precision.\n\nIn these regions, CUDA ops run in a `dtype` chosen by `autocast` to\nimprove performance while maintaining accuracy. 
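\n\nFor example, here is a small standalone illustration of that choice (a sketch, separate\nfrom the timed runs in this recipe, assuming a CUDA device is available):\n\n``` {.python}\na = torch.randn(8, 8, device=\"cuda\")          # float32 inputs\nb = torch.randn(8, 8, device=\"cuda\")\nwith torch.autocast(device_type=\"cuda\", dtype=torch.float16):\n    c = a @ b                                  # matmul autocasts to float16\n    loss = torch.nn.functional.mse_loss(c, b)  # mse_loss autocasts to float32\nprint(c.dtype, loss.dtype)                     # torch.float16 torch.float32\n```\n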
See the [Autocast Op\nReference](https://pytorch.org/docs/stable/amp.html#autocast-op-reference)\nfor details on what precision `autocast` chooses for each op, and under\nwhat circumstances.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "for epoch in range(0): # 0 epochs, this section is for illustration only\n for input, target in zip(data, targets):\n # Runs the forward pass under ``autocast``.\n with torch.autocast(device_type=device, dtype=torch.float16):\n output = net(input)\n # output is float16 because linear layers ``autocast`` to float16.\n assert output.dtype is torch.float16\n\n loss = loss_fn(output, target)\n # loss is float32 because ``mse_loss`` layers ``autocast`` to float32.\n assert loss.dtype is torch.float32\n\n # Exits ``autocast`` before backward().\n # Backward passes under ``autocast`` are not recommended.\n # Backward ops run in the same ``dtype`` ``autocast`` chose for corresponding forward ops.\n loss.backward()\n opt.step()\n opt.zero_grad() # set_to_none=True here can modestly improve performance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Adding `GradScaler`\n===================\n\n[Gradient\nscaling](https://pytorch.org/docs/stable/amp.html#gradient-scaling)\nhelps prevent gradients with small magnitudes from flushing to zero\n(\\\"underflowing\\\") when training with mixed precision.\n\n[torch.cuda.amp.GradScaler](https://pytorch.org/docs/stable/amp.html#torch.cuda.amp.GradScaler)\nperforms the steps of gradient scaling conveniently.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Constructs a ``scaler`` once, at the beginning of the convergence run, using default arguments.\n# If your network fails to converge with default ``GradScaler`` arguments, please file an issue.\n# The same ``GradScaler`` instance should be used for the entire convergence run.\n# If you perform multiple convergence runs in the same script, each run should use\n# a dedicated fresh ``GradScaler`` instance. ``GradScaler`` instances are lightweight.\nscaler = torch.amp.GradScaler(\"cuda\")\n\nfor epoch in range(0): # 0 epochs, this section is for illustration only\n for input, target in zip(data, targets):\n with torch.autocast(device_type=device, dtype=torch.float16):\n output = net(input)\n loss = loss_fn(output, target)\n\n # Scales loss. Calls ``backward()`` on scaled loss to create scaled gradients.\n scaler.scale(loss).backward()\n\n # ``scaler.step()`` first unscales the gradients of the optimizer's assigned parameters.\n # If these gradients do not contain ``inf``s or ``NaN``s, optimizer.step() is then called,\n # otherwise, optimizer.step() is skipped.\n scaler.step(opt)\n\n # Updates the scale for next iteration.\n scaler.update()\n\n opt.zero_grad() # set_to_none=True here can modestly improve performance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All together: \\\"Automatic Mixed Precision\\\"\n===========================================\n\n(The following also demonstrates `enabled`, an optional convenience\nargument to `autocast` and `GradScaler`. If False, `autocast` and\n`GradScaler`\\'s calls become no-ops. 
This allows switching between\ndefault precision and mixed precision without if/else statements.)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "use_amp = True\n\nnet = make_model(in_size, out_size, num_layers)\nopt = torch.optim.SGD(net.parameters(), lr=0.001)\nscaler = torch.amp.GradScaler(\"cuda\" ,enabled=use_amp)\n\nstart_timer()\nfor epoch in range(epochs):\n for input, target in zip(data, targets):\n with torch.autocast(device_type=device, dtype=torch.float16, enabled=use_amp):\n output = net(input)\n loss = loss_fn(output, target)\n scaler.scale(loss).backward()\n scaler.step(opt)\n scaler.update()\n opt.zero_grad() # set_to_none=True here can modestly improve performance\nend_timer_and_print(\"Mixed precision:\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Inspecting/modifying gradients (e.g., clipping)\n===============================================\n\nAll gradients produced by `scaler.scale(loss).backward()` are scaled. If\nyou wish to modify or inspect the parameters\\' `.grad` attributes\nbetween `backward()` and `scaler.step(optimizer)`, you should unscale\nthem first using\n[scaler.unscale\\_(optimizer)](https://pytorch.org/docs/stable/amp.html#torch.cuda.amp.GradScaler.unscale_).\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "for epoch in range(0): # 0 epochs, this section is for illustration only\n for input, target in zip(data, targets):\n with torch.autocast(device_type=device, dtype=torch.float16):\n output = net(input)\n loss = loss_fn(output, target)\n scaler.scale(loss).backward()\n\n # Unscales the gradients of optimizer's assigned parameters in-place\n scaler.unscale_(opt)\n\n # Since the gradients of optimizer's assigned parameters are now unscaled, clips as usual.\n # You may use the same value for max_norm here as you would without gradient scaling.\n torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=0.1)\n\n scaler.step(opt)\n scaler.update()\n opt.zero_grad() # set_to_none=True here can modestly improve performance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Saving/Resuming\n===============\n\nTo save/resume Amp-enabled runs with bitwise accuracy, use\n[scaler.state\\_dict](https://pytorch.org/docs/stable/amp.html#torch.cuda.amp.GradScaler.state_dict)\nand\n[scaler.load\\_state\\_dict](https://pytorch.org/docs/stable/amp.html#torch.cuda.amp.GradScaler.load_state_dict).\n\nWhen saving, save the `scaler` state dict alongside the usual model and\noptimizer state `dicts`. Do this either at the beginning of an iteration\nbefore any forward passes, or at the end of an iteration after\n`scaler.update()`.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "checkpoint = {\"model\": net.state_dict(),\n \"optimizer\": opt.state_dict(),\n \"scaler\": scaler.state_dict()}\n# Write checkpoint as desired, e.g.,\n# torch.save(checkpoint, \"filename\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When resuming, load the `scaler` state dict alongside the model and\noptimizer state `dicts`. 
Read checkpoint as desired, for example:\n\n``` {.}\ndev = torch.cuda.current_device()\ncheckpoint = torch.load(\"filename\",\n map_location = lambda storage, loc: storage.cuda(dev))\n```\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "net.load_state_dict(checkpoint[\"model\"])\nopt.load_state_dict(checkpoint[\"optimizer\"])\nscaler.load_state_dict(checkpoint[\"scaler\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If a checkpoint was created from a run *without* Amp, and you want to\nresume training *with* Amp, load model and optimizer states from the\ncheckpoint as usual. The checkpoint won\\'t contain a saved `scaler`\nstate, so use a fresh instance of `GradScaler`.\n\nIf a checkpoint was created from a run *with* Amp and you want to resume\ntraining *without* `Amp`, load model and optimizer states from the\ncheckpoint as usual, and ignore the saved `scaler` state.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Inference/Evaluation\n====================\n\n`autocast` may be used by itself to wrap inference or evaluation forward\npasses. `GradScaler` is not necessary.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Advanced topics\n===============\n\nSee the [Automatic Mixed Precision\nExamples](https://pytorch.org/docs/stable/notes/amp_examples.html) for\nadvanced use cases including:\n\n- Gradient accumulation\n- Gradient penalty/double backward\n- Networks with multiple models, optimizers, or losses\n- Multiple GPUs (`torch.nn.DataParallel` or\n `torch.nn.parallel.DistributedDataParallel`)\n- Custom autograd functions (subclasses of `torch.autograd.Function`)\n\nIf you perform multiple convergence runs in the same script, each run\nshould use a dedicated fresh `GradScaler` instance. `GradScaler`\ninstances are lightweight.\n\nIf you\\'re registering a custom C++ op with the dispatcher, see the\n[autocast\nsection](https://pytorch.org/tutorials/advanced/dispatcher.html#autocast)\nof the dispatcher tutorial.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Troubleshooting\n===============\n\nSpeedup with Amp is minor\n-------------------------\n\n1. Your network may fail to saturate the GPU(s) with work, and is\n therefore CPU bound. Amp\\'s effect on GPU performance won\\'t matter.\n - A rough rule of thumb to saturate the GPU is to increase batch\n and/or network size(s) as much as you can without running OOM.\n - Try to avoid excessive CPU-GPU synchronization (`.item()` calls,\n or printing values from CUDA tensors).\n - Try to avoid sequences of many small CUDA ops (coalesce these\n into a few large CUDA ops if you can).\n2. Your network may be GPU compute bound (lots of\n `matmuls`/convolutions) but your GPU does not have Tensor Cores. In\n this case a reduced speedup is expected.\n3. The `matmul` dimensions are not Tensor Core-friendly. Make sure\n `matmuls` participating sizes are multiples of 8. (For NLP models\n with encoders/decoders, this can be subtle. Also, convolutions used\n to have similar size constraints for Tensor Core use, but for CuDNN\n versions 7.3 and later, no such constraints exist. See\n [here](https://github.com/NVIDIA/apex/issues/221#issuecomment-478084841)\n for guidance.)\n\nLoss is inf/NaN\n---------------\n\nFirst, check if your network fits an\n`advanced use case`{.interpreted-text role=\"ref\"}. 
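\n\nOne way to localize the problem (expanded on in the numbered list below) is to force a\nsuspect subregion to run in `float32` by locally disabling `autocast` and casting its\ninputs. A minimal sketch, where `suspect_fn` is a hypothetical stand-in for the\nnumerically sensitive code:\n\n``` {.python}\nwith torch.autocast(device_type=\"cuda\", dtype=torch.float16):\n    output = net(input)\n    # Locally disable autocast and run the suspect region in float32.\n    with torch.autocast(device_type=\"cuda\", enabled=False):\n        stable = suspect_fn(output.float())\n```\n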
See\nalso [Prefer binary\\_cross\\_entropy\\_with\\_logits over\nbinary\\_cross\\_entropy](https://pytorch.org/docs/stable/amp.html#prefer-binary-cross-entropy-with-logits-over-binary-cross-entropy).\n\nIf you\\'re confident your Amp usage is correct, you may need to file an\nissue, but before doing so, it\\'s helpful to gather the following\ninformation:\n\n1. Disable `autocast` or `GradScaler` individually (by passing\n `enabled=False` to its constructor) and see if `infs`/`NaNs`\n persist.\n2. If you suspect part of your network (e.g., a complicated loss\n function) overflows, run that forward region in `float32` and see\n if `infs`/`NaNs` persist. [The autocast\n docstring](https://pytorch.org/docs/stable/amp.html#torch.autocast)\\'s\n last code snippet shows forcing a subregion to run in `float32` (by\n locally disabling `autocast` and casting the subregion\\'s inputs).\n\nType mismatch error (may manifest as `CUDNN_STATUS_BAD_PARAM`)\n--------------------------------------------------------------\n\n`Autocast` tries to cover all ops that benefit from or require casting.\n[Ops that receive explicit\ncoverage](https://pytorch.org/docs/stable/amp.html#autocast-op-reference)\nare chosen based on numerical properties, but also on experience. If you\nsee a type mismatch error in an `autocast`-enabled forward region or a\nbackward pass following that region, it\\'s possible `autocast` missed an\nop.\n\nPlease file an issue with the error backtrace.\nRunning `export TORCH_SHOW_CPP_STACKTRACES=1` before launching your script\nprovides fine-grained information on which backend op is failing.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 0 }