{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# For tips on running notebooks in Google Colab, see\n# https://codelin.vip/beginner/colab\n%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(beta) Channels Last Memory Format in PyTorch\n=============================================\n\n**Author**: [Vitaly Fedyunin](https://github.com/VitalyFedyunin)\n\nWhat is Channels Last\n---------------------\n\nChannels last memory format is an alternative way of ordering NCHW\ntensors in memory while preserving the order of dimensions. Channels last tensors\nare ordered in such a way that channels become the densest dimension (that is,\nimages are stored pixel by pixel).\n\nFor example, classic (contiguous) storage of an NCHW tensor (in our case,\ntwo 4x4 images with 3 color channels) looks like this:\n\n![](https://pytorch.org/tutorials/_static/img/classic_memory_format.png)\n\nChannels last memory format orders data differently:\n\n![](https://pytorch.org/tutorials/_static/img/channels_last_memory_format.png)\n\nPyTorch supports memory formats (and provides backward compatibility with\nexisting models, including eager, JIT, and TorchScript) by utilizing the\nexisting strides structure. 
For example, 10x3x16x16 batch in Channels\nlast format will have strides equal to (768, 1, 48, 3).\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Channels last memory format is implemented for 4D NCHW Tensors only.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Memory Format API\n=================\n\nHere is how to convert tensors between contiguous and channels last\nmemory formats.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Classic PyTorch contiguous tensor\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import torch\n\nN, C, H, W = 10, 3, 32, 32\nx = torch.empty(N, C, H, W)\nprint(x.stride()) # Outputs: (3072, 1024, 32, 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Conversion operator\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "x = x.to(memory_format=torch.channels_last)\nprint(x.shape) # Outputs: (10, 3, 32, 32) as dimensions order preserved\nprint(x.stride()) # Outputs: (3072, 1, 96, 3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Back to contiguous\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "x = x.to(memory_format=torch.contiguous_format)\nprint(x.stride()) # Outputs: (3072, 1024, 32, 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternative option\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "x = x.contiguous(memory_format=torch.channels_last)\nprint(x.stride()) # Outputs: (3072, 1, 96, 3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Format checks\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(x.is_contiguous(memory_format=torch.channels_last)) # Outputs: True" ] 
}, { "cell_type": "markdown", "metadata": {}, "source": [ "There is a minor difference between the two APIs `to` and `contiguous`.\nWe suggest sticking with `to` when explicitly converting the memory format\nof a tensor.\n\nIn general, the two APIs behave the same. However, in special cases\nfor a 4D tensor with size `NCHW` when either `C==1` or `H==1 && W==1`,\nonly `to` generates a proper stride to represent channels last\nmemory format.\n\nThis is because in either of the two cases above, the memory format of a\ntensor is ambiguous, i.e. a contiguous tensor with size `N1HW` is both\n`contiguous` and channels last in memory storage. Such tensors are\nalready considered `is_contiguous` for the given memory format, so the\n`contiguous` call becomes a no-op and does not update the stride.\nIn contrast, `to` restrides the tensor with a meaningful stride on\ndimensions whose sizes are 1 in order to properly represent the intended\nmemory format.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "special_x = torch.empty(4, 1, 4, 4)\nprint(special_x.is_contiguous(memory_format=torch.channels_last)) # Outputs: True\nprint(special_x.is_contiguous(memory_format=torch.contiguous_format)) # Outputs: True" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The same applies to the explicit permutation API `permute`. In the special\ncase where ambiguity could occur, `permute` is not guaranteed to\nproduce a stride that properly carries the intended memory format. 
We\nsuggest using `to` with an explicit memory format to avoid unintended\nbehavior.\n\nAs a side note, in the extreme case where all three non-batch\ndimensions are equal to `1` (`C==1 && H==1 && W==1`), the current\nimplementation cannot mark a tensor as channels last memory format.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create as channels last\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "x = torch.empty(N, C, H, W, memory_format=torch.channels_last)\nprint(x.stride()) # Outputs: (3072, 1, 96, 3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`clone` preserves memory format\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "y = x.clone()\nprint(y.stride()) # Outputs: (3072, 1, 96, 3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`to`, `cuda`, `float` \\... preserve memory format\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "if torch.cuda.is_available():\n    y = x.cuda()\n    print(y.stride()) # Outputs: (3072, 1, 96, 3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`empty_like`, `*_like` operators preserve memory format\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "y = torch.empty_like(x)\nprint(y.stride()) # Outputs: (3072, 1, 96, 3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pointwise operators preserve memory format\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "z = x + y\nprint(z.stride()) # Outputs: (3072, 1, 96, 3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`Conv`, `Batchnorm` modules using `cudnn` backends support channels last\n(only works for cuDNN \\>= 7.6). 
Convolution modules, unlike binary\npointwise operators, have channels last as the dominating memory format. If\nall inputs are in contiguous memory format, the operator produces output\nin contiguous memory format. Otherwise, output will be in channels last\nmemory format.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "if torch.backends.cudnn.is_available() and torch.backends.cudnn.version() >= 7603:\n    model = torch.nn.Conv2d(8, 4, 3).cuda().half()\n    model = model.to(memory_format=torch.channels_last) # Module parameters need to be channels last\n\n    input = torch.randint(1, 10, (2, 8, 4, 4), dtype=torch.float32, requires_grad=True)\n    input = input.to(device=\"cuda\", memory_format=torch.channels_last, dtype=torch.float16)\n\n    out = model(input)\n    print(out.is_contiguous(memory_format=torch.channels_last)) # Outputs: True" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When an input tensor reaches an operator without channels last support, a\npermutation is automatically applied in the kernel to restore the\ncontiguous format of the input tensor. This introduces overhead and stops the\nchannels last memory format propagation. Nevertheless, it guarantees\ncorrect output.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Performance Gains\n=================\n\nChannels last memory format optimizations are available on both GPU and\nCPU. On GPU, the most significant performance gains are observed on\nNVIDIA's hardware with Tensor Cores support running on reduced\nprecision (`torch.float16`). We were able to achieve over 22%\nperformance gains with channels last compared to contiguous format,\nboth while utilizing 'AMP (Automated Mixed Precision)' training\nscripts. 
Our scripts use AMP supplied by NVIDIA.\n\n`python main_amp.py -a resnet50 --b 200 --workers 16 --opt-level O2 ./data`\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# opt_level = O2\n# keep_batchnorm_fp32 = None \n# loss_scale = None \n# CUDNN VERSION: 7603\n# => creating model 'resnet50'\n# Selected optimization level O2: FP16 training with FP32 batchnorm and FP32 master weights.\n# Defaults for this optimization level are:\n# enabled : True\n# opt_level : O2\n# cast_model_type : torch.float16\n# patch_torch_functions : False\n# keep_batchnorm_fp32 : True\n# master_weights : True\n# loss_scale : dynamic\n# Processing user overrides (additional kwargs that are not None)...\n# After processing overrides, optimization options are:\n# enabled : True\n# opt_level : O2\n# cast_model_type : torch.float16\n# patch_torch_functions : False\n# keep_batchnorm_fp32 : True\n# master_weights : True\n# loss_scale : dynamic\n# Epoch: [0][10/125] Time 0.866 (0.866) Speed 230.949 (230.949) Loss 0.6735125184 (0.6735) Prec@1 61.000 (61.000) Prec@5 100.000 (100.000)\n# Epoch: [0][20/125] Time 0.259 (0.562) Speed 773.481 (355.693) Loss 0.6968704462 (0.6852) Prec@1 55.000 (58.000) Prec@5 100.000 (100.000)\n# Epoch: [0][30/125] Time 0.258 (0.461) Speed 775.089 (433.965) Loss 0.7877287269 (0.7194) Prec@1 51.500 (55.833) Prec@5 100.000 (100.000)\n# Epoch: [0][40/125] Time 0.259 (0.410) Speed 771.710 (487.281) Loss 0.8285319805 (0.7467) Prec@1 48.500 (54.000) Prec@5 100.000 (100.000)\n# Epoch: [0][50/125] Time 0.260 (0.380) Speed 770.090 (525.908) Loss 0.7370464802 (0.7447) Prec@1 56.500 (54.500) Prec@5 100.000 (100.000)\n# Epoch: [0][60/125] Time 0.258 (0.360) Speed 775.623 (555.728) Loss 0.7592862844 (0.7472) Prec@1 51.000 (53.917) Prec@5 100.000 (100.000)\n# Epoch: [0][70/125] Time 0.258 (0.345) Speed 774.746 (579.115) Loss 1.9698858261 (0.9218) Prec@1 49.500 (53.286) Prec@5 100.000 (100.000)\n# Epoch: 
[0][80/125] Time 0.260 (0.335) Speed 770.324 (597.659) Loss 2.2505953312 (1.0879) Prec@1 50.500 (52.938) Prec@5 100.000 (100.000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Passing `--channels-last true` allows running a model in Channels last\nformat with observed 22% performance gain.\n\n`python main_amp.py -a resnet50 --b 200 --workers 16 --opt-level O2 --channels-last true ./data`\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# opt_level = O2\n# keep_batchnorm_fp32 = None \n# loss_scale = None \n#\n# CUDNN VERSION: 7603\n#\n# => creating model 'resnet50'\n# Selected optimization level O2: FP16 training with FP32 batchnorm and FP32 master weights.\n#\n# Defaults for this optimization level are:\n# enabled : True\n# opt_level : O2\n# cast_model_type : torch.float16\n# patch_torch_functions : False\n# keep_batchnorm_fp32 : True\n# master_weights : True\n# loss_scale : dynamic\n# Processing user overrides (additional kwargs that are not None)...\n# After processing overrides, optimization options are:\n# enabled : True\n# opt_level : O2\n# cast_model_type : torch.float16\n# patch_torch_functions : False\n# keep_batchnorm_fp32 : True\n# master_weights : True\n# loss_scale : dynamic\n#\n# Epoch: [0][10/125] Time 0.767 (0.767) Speed 260.785 (260.785) Loss 0.7579724789 (0.7580) Prec@1 53.500 (53.500) Prec@5 100.000 (100.000)\n# Epoch: [0][20/125] Time 0.198 (0.482) Speed 1012.135 (414.716) Loss 0.7007197738 (0.7293) Prec@1 49.000 (51.250) Prec@5 100.000 (100.000)\n# Epoch: [0][30/125] Time 0.198 (0.387) Speed 1010.977 (516.198) Loss 0.7113101482 (0.7233) Prec@1 55.500 (52.667) Prec@5 100.000 (100.000)\n# Epoch: [0][40/125] Time 0.197 (0.340) Speed 1013.023 (588.333) Loss 0.8943189979 (0.7661) Prec@1 54.000 (53.000) Prec@5 100.000 (100.000)\n# Epoch: [0][50/125] Time 0.198 (0.312) Speed 1010.541 (641.977) Loss 1.7113249302 (0.9551) Prec@1 51.000 (52.600) Prec@5 100.000 
(100.000)\n# Epoch: [0][60/125] Time 0.198 (0.293) Speed 1011.163 (683.574) Loss 5.8537774086 (1.7716) Prec@1 50.500 (52.250) Prec@5 100.000 (100.000)\n# Epoch: [0][70/125] Time 0.198 (0.279) Speed 1011.453 (716.767) Loss 5.7595844269 (2.3413) Prec@1 46.500 (51.429) Prec@5 100.000 (100.000)\n# Epoch: [0][80/125] Time 0.198 (0.269) Speed 1011.827 (743.883) Loss 2.8196096420 (2.4011) Prec@1 47.500 (50.938) Prec@5 100.000 (100.000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following models have full support for channels last and\nshow 8%-35% performance gains on Volta devices: `alexnet`,\n`mnasnet0_5`, `mnasnet0_75`, `mnasnet1_0`, `mnasnet1_3`, `mobilenet_v2`,\n`resnet101`, `resnet152`, `resnet18`, `resnet34`, `resnet50`,\n`resnext50_32x4d`, `shufflenet_v2_x0_5`, `shufflenet_v2_x1_0`,\n`shufflenet_v2_x1_5`, `shufflenet_v2_x2_0`, `squeezenet1_0`,\n`squeezenet1_1`, `vgg11`, `vgg11_bn`, `vgg13`, `vgg13_bn`, `vgg16`,\n`vgg16_bn`, `vgg19`, `vgg19_bn`, `wide_resnet101_2`, `wide_resnet50_2`\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following models have full support for channels last and\nshow 26%-76% performance gains on Intel(R) Xeon(R) Ice Lake (or\nnewer) CPUs: `alexnet`, `densenet121`, `densenet161`, `densenet169`,\n`googlenet`, `inception_v3`, `mnasnet0_5`, `mnasnet1_0`, `resnet101`,\n`resnet152`, `resnet18`, `resnet34`, `resnet50`, `resnext101_32x8d`,\n`resnext50_32x4d`, `shufflenet_v2_x0_5`, `shufflenet_v2_x1_0`,\n`squeezenet1_0`, `squeezenet1_1`, `vgg11`, `vgg11_bn`, `vgg13`,\n`vgg13_bn`, `vgg16`, `vgg16_bn`, `vgg19`, `vgg19_bn`,\n`wide_resnet101_2`, `wide_resnet50_2`\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Converting existing models\n==========================\n\nChannels last support is not limited to existing models, as any model\ncan be converted to channels last and propagate the format through the graph\nas soon as the input (or certain weight) is formatted 
correctly.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Need to be done once, after model initialization (or load)\nmodel = model.to(memory_format=torch.channels_last) # Replace with your model\n\n# Need to be done for every input\ninput = input.to(memory_format=torch.channels_last) # Replace with your input\noutput = model(input)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, not all operators are fully converted to support channels last\n(they usually return contiguous output instead). In the example posted\nabove, layers that do not support channels last will stop the memory\nformat propagation. In spite of that, because we have converted the model to\nchannels last format, each convolution layer, which has its 4\ndimensional weight in channels last memory format, will restore channels\nlast memory format and benefit from faster kernels.\n\nBut operators that do not support channels last do introduce\noverhead by permutation. 
Optionally, you can investigate and identify\noperators in your model that do not support channels last, if you want\nto improve the performance of the converted model.\n\nThat means you need to verify the list of used operators against the\nsupported operators list,\nor introduce memory format checks into eager execution mode and run your\nmodel.\n\nAfter running the code below, operators will raise an exception if the\noutput of the operator doesn't match the memory format of the input.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def contains_cl(args):\n    # True if any tensor in args (searched recursively) is channels last only\n    for t in args:\n        if isinstance(t, torch.Tensor):\n            if t.is_contiguous(memory_format=torch.channels_last) and not t.is_contiguous():\n                return True\n        elif isinstance(t, (list, tuple)):\n            if contains_cl(list(t)):\n                return True\n    return False\n\n\ndef print_inputs(args, indent=\"\"):\n    for t in args:\n        if isinstance(t, torch.Tensor):\n            print(indent, t.stride(), t.shape, t.device, t.dtype)\n        elif isinstance(t, (list, tuple)):\n            print(indent, type(t))\n            print_inputs(list(t), indent=indent + \"    \")\n        else:\n            print(indent, t)\n\n\ndef check_wrapper(fn):\n    name = fn.__name__\n\n    def check_cl(*args, **kwargs):\n        was_cl = contains_cl(args)\n        try:\n            result = fn(*args, **kwargs)\n        except Exception:\n            print(\"`{}` inputs are:\".format(name))\n            print_inputs(args)\n            print(\"-------------------\")\n            raise\n        failed = False\n        if was_cl:\n            if isinstance(result, torch.Tensor):\n                if result.dim() == 4 and not result.is_contiguous(memory_format=torch.channels_last):\n                    print(\n                        \"`{}` got channels_last input, but output is not channels_last:\".format(name),\n                        result.shape,\n                        result.stride(),\n                        result.device,\n                        result.dtype,\n                    )\n                    failed = True\n        if failed:\n            print(\"`{}` inputs are:\".format(name))\n            print_inputs(args)\n            raise Exception(\"Operator `{}` lost channels_last property\".format(name))\n        return result\n\n    return check_cl\n\n\nold_attrs = dict()\n\n\ndef attribute(m):\n    # Wrap every public callable of m, keeping the originals in old_attrs\n    old_attrs[m] = dict()\n    for i in dir(m):\n        e = getattr(m, i)\n        exclude_functions = [\"is_cuda\", \"has_names\", \"numel\", \"stride\", \"Tensor\", \"is_contiguous\", \"__class__\"]\n        if i not in exclude_functions and not i.startswith(\"_\") and \"__call__\" in dir(e):\n            try:\n                old_attrs[m][i] = e\n                setattr(m, i, check_wrapper(e))\n            except Exception as exc:\n                print(i)\n                print(exc)\n\n\nattribute(torch.Tensor)\nattribute(torch.nn.functional)\nattribute(torch)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you find an operator that doesn't support channels last tensors and\nyou want to contribute, feel free to use the following developers guide.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code below restores the original attributes of torch.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "for (m, attrs) in old_attrs.items():\n    for (k, v) in attrs.items():\n        setattr(m, k, v)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Work to do\n==========\n\nThere are still many things to do, such as:\n\n- Resolving ambiguity of `N1HW` and `NC11` Tensors;\n- Testing of Distributed Training support;\n- Improving operator coverage.\n\nIf you have feedback and/or suggestions for improvement, please let us\nknow by creating [an issue](https://github.com/pytorch/pytorch/issues).\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 0 }