{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# For tips on running notebooks in Google Colab, see\n# https://codelin.vip/beginner/colab\n%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "A guide on good usage of `non_blocking` and `pin_memory()` in PyTorch\n=====================================================================\n\n**Author**: [Vincent Moens](https://github.com/vmoens)\n\nIntroduction\n------------\n\nTransferring data from the CPU to the GPU is fundamental in many PyTorch\napplications. It\\'s crucial for users to understand the most effective\ntools and options available for moving data between devices. This\ntutorial examines two key methods for device-to-device data transfer in\nPyTorch: `~torch.Tensor.pin_memory`{.interpreted-text role=\"meth\"} and\n`~torch.Tensor.to`{.interpreted-text role=\"meth\"} with the\n`non_blocking=True` option.\n\n### What you will learn\n\nOptimizing the transfer of tensors from the CPU to the GPU can be\nachieved through asynchronous transfers and memory pinning. However,\nthere are important considerations:\n\n-   Using `tensor.pin_memory().to(device, non_blocking=True)` can be up\n    to twice as slow as a straightforward `tensor.to(device)`.\n-   Generally, `tensor.to(device, non_blocking=True)` is an effective\n    choice for enhancing transfer speed.\n-   While `cpu_tensor.to(\"cuda\", non_blocking=True).mean()` executes\n    correctly, attempting\n    `cuda_tensor.to(\"cpu\", non_blocking=True).mean()` will result in\n    erroneous outputs.\n\n### Preamble\n\nThe performance reported in this tutorial are conditioned on the system\nused to build the tutorial. Although the conclusions are applicable\nacross different systems, the specific observations may vary slightly\ndepending on the hardware available, especially on older hardware. The\nprimary objective of this tutorial is to offer a theoretical framework\nfor understanding CPU to GPU data transfers. However, any design\ndecisions should be tailored to individual cases and guided by\nbenchmarked throughput measurements, as well as the specific\nrequirements of the task at hand.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import torch\n\nassert torch.cuda.is_available(), \"A cuda device is required to run this tutorial\""
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "This tutorial requires tensordict to be installed. If you don\\'t have\ntensordict in your environment yet, install it by running the following\ncommand in a separate cell:\n\n``` {.bash}\n# Install tensordict with the following command\n!pip3 install tensordict\n```\n\nWe start by outlining the theory surrounding these concepts, and then\nmove to concrete test examples of the features.\n\nBackground\n==========\n\nMemory management basics\n------------------------\n\nWhen one creates a CPU tensor in PyTorch, the content of this tensor\nneeds to be placed in memory. The memory we talk about here is a rather\ncomplex concept worth looking at carefully. We distinguish two types of\nmemory that are handled by the Memory Management Unit: the RAM (for\nsimplicity) and the swap space on disk (which may or may not be the hard\ndrive). Together, the available space in disk and RAM (physical memory)\nmake up the virtual memory, which is an abstraction of the total\nresources available. In short, the virtual memory makes it so that the\navailable space is larger than what can be found on RAM in isolation and\ncreates the illusion that the main memory is larger than it actually is.\n\nIn normal circumstances, a regular CPU tensor is pageable which means\nthat it is divided in blocks called pages that can live anywhere in the\nvirtual memory (both in RAM or on disk). As mentioned earlier, this has\nthe advantage that the memory seems larger than what the main memory\nactually is.\n\nTypically, when a program accesses a page that is not in RAM, a \\\"page\nfault\\\" occurs and the operating system (OS) then brings back this page\ninto RAM (\\\"swap in\\\" or \\\"page in\\\"). In turn, the OS may have to swap\nout (or \\\"page out\\\") another page to make room for the new page.\n\nIn contrast to pageable memory, a pinned (or page-locked or\nnon-pageable) memory is a type of memory that cannot be swapped out to\ndisk. It allows for faster and more predictable access times, but has\nthe downside that it is more limited than the pageable memory (aka the\nmain memory).\n\n![](https://pytorch.org/tutorials/_static/img/pinmem/pinmem.png)\n\nCUDA and (non-)pageable memory\n------------------------------\n\nTo understand how CUDA copies a tensor from CPU to CUDA, let\\'s consider\nthe two scenarios above:\n\n-   If the memory is page-locked, the device can access the memory\n    directly in the main memory. The memory addresses are well defined\n    and functions that need to read these data can be significantly\n    accelerated.\n-   If the memory is pageable, all the pages will have to be brought to\n    the main memory before being sent to the GPU. This operation may\n    take time and is less predictable than when executed on page-locked\n    tensors.\n\nMore precisely, when CUDA sends pageable data from CPU to GPU, it must\nfirst create a page-locked copy of that data before making the transfer.\n\nAsynchronous vs. Synchronous Operations with `non_blocking=True` (CUDA `cudaMemcpyAsync`)\n-----------------------------------------------------------------------------------------\n\nWhen executing a copy from a host (such as, CPU) to a device (such as,\nGPU), the CUDA toolkit offers modalities to do these operations\nsynchronously or asynchronously with respect to the host.\n\nIn practice, when calling `~torch.Tensor.to`{.interpreted-text\nrole=\"meth\"}, PyTorch always makes a call to\n[cudaMemcpyAsync](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1g85073372f776b4c4d5f89f7124b7bf79).\nIf `non_blocking=False` (default), a `cudaStreamSynchronize` will be\ncalled after each and every `cudaMemcpyAsync`, making the call to\n`~torch.Tensor.to`{.interpreted-text role=\"meth\"} blocking in the main\nthread. If `non_blocking=True`, no synchronization is triggered, and the\nmain thread on the host is not blocked. Therefore, from the host\nperspective, multiple tensors can be sent to the device simultaneously,\nas the thread does not need to wait for one transfer to be completed to\ninitiate the other.\n\n```{=html}\n<div style=\"background-color: #54c7ec; color: #fff; font-weight: 700; padding-left: 10px; padding-top: 5px; padding-bottom: 5px\"><strong>NOTE:</strong></div>\n```\n```{=html}\n<div style=\"background-color: #f3f4f7; padding-left: 10px; padding-top: 10px; padding-bottom: 10px; padding-right: 10px\">\n```\n```{=html}\n<p>In general, the transfer is blocking on the device side (even if it isn't on the host side):the copy on the device cannot occur while another operation is being executed.However, in some advanced scenarios, a copy and a kernel execution can be done simultaneously on the GPU side.As the following example will show, three requirements must be met to enable this:We demonstrate this by running profiles on the following script.</p>\n```\n```{=html}\n</div>\n```\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import contextlib\n\nfrom torch.cuda import Stream\n\n\ns = Stream()\n\ntorch.manual_seed(42)\nt1_cpu_pinned = torch.randn(1024**2 * 5, pin_memory=True)\nt2_cpu_paged = torch.randn(1024**2 * 5, pin_memory=False)\nt3_cuda = torch.randn(1024**2 * 5, device=\"cuda:0\")\n\nassert torch.cuda.is_available()\ndevice = torch.device(\"cuda\", torch.cuda.current_device())\n\n\n# The function we want to profile\ndef inner(pinned: bool, streamed: bool):\n    with torch.cuda.stream(s) if streamed else contextlib.nullcontext():\n        if pinned:\n            t1_cuda = t1_cpu_pinned.to(device, non_blocking=True)\n        else:\n            t2_cuda = t2_cpu_paged.to(device, non_blocking=True)\n        t_star_cuda_h2d_event = s.record_event()\n    # This operation can be executed during the CPU to GPU copy if and only if the tensor is pinned and the copy is\n    #  done in the other stream\n    t3_cuda_mul = t3_cuda * t3_cuda * t3_cuda\n    t3_cuda_h2d_event = torch.cuda.current_stream().record_event()\n    t_star_cuda_h2d_event.synchronize()\n    t3_cuda_h2d_event.synchronize()\n\n\n# Our profiler: profiles the `inner` function and stores the results in a .json file\ndef benchmark_with_profiler(\n    pinned,\n    streamed,\n) -> None:\n    torch._C._profiler._set_cuda_sync_enabled_val(True)\n    wait, warmup, active = 1, 1, 2\n    num_steps = wait + warmup + active\n    rank = 0\n    with torch.profiler.profile(\n        activities=[\n            torch.profiler.ProfilerActivity.CPU,\n            torch.profiler.ProfilerActivity.CUDA,\n        ],\n        schedule=torch.profiler.schedule(\n            wait=wait, warmup=warmup, active=active, repeat=1, skip_first=1\n        ),\n    ) as prof:\n        for step_idx in range(1, num_steps + 1):\n            inner(streamed=streamed, pinned=pinned)\n            if rank is None or rank == 0:\n                prof.step()\n    prof.export_chrome_trace(f\"trace_streamed{int(streamed)}_pinned{int(pinned)}.json\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Loading these profile traces in chrome (`chrome://tracing`) shows the\nfollowing results: first, let\\'s see what happens if both the arithmetic\noperation on `t3_cuda` is executed after the pageable tensor is sent to\nGPU in the main stream:\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "benchmark_with_profiler(streamed=False, pinned=False)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "![](https://pytorch.org/tutorials/_static/img/pinmem/trace_streamed0_pinned0.png)\n\nUsing a pinned tensor doesn\\'t change the trace much, both operations\nare still executed consecutively:\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "benchmark_with_profiler(streamed=False, pinned=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "![](https://pytorch.org/tutorials/_static/img/pinmem/trace_streamed0_pinned1.png)\n\nSending a pageable tensor to GPU on a separate stream is also a blocking\noperation:\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "benchmark_with_profiler(streamed=True, pinned=False)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "![](https://pytorch.org/tutorials/_static/img/pinmem/trace_streamed1_pinned0.png)\n\nOnly pinned tensors copies to GPU on a separate stream overlap with\nanother cuda kernel executed on the main stream:\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "benchmark_with_profiler(streamed=True, pinned=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "![](https://pytorch.org/tutorials/_static/img/pinmem/trace_streamed1_pinned1.png)\n\nA PyTorch perspective\n=====================\n\n`pin_memory()`\n--------------\n\nPyTorch offers the possibility to create and send tensors to page-locked\nmemory through the `~torch.Tensor.pin_memory`{.interpreted-text\nrole=\"meth\"} method and constructor arguments. CPU tensors on a machine\nwhere CUDA is initialized can be cast to pinned memory through the\n`~torch.Tensor.pin_memory`{.interpreted-text role=\"meth\"} method.\nImportantly, `pin_memory` is blocking on the main thread of the host: it\nwill wait for the tensor to be copied to page-locked memory before\nexecuting the next operation. New tensors can be directly created in\npinned memory with functions like `~torch.zeros`{.interpreted-text\nrole=\"func\"}, `~torch.ones`{.interpreted-text role=\"func\"} and other\nconstructors.\n\nLet us check the speed of pinning memory and sending tensors to CUDA:\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import torch\nimport gc\nfrom torch.utils.benchmark import Timer\nimport matplotlib.pyplot as plt\n\n\ndef timer(cmd):\n    median = (\n        Timer(cmd, globals=globals())\n        .adaptive_autorange(min_run_time=1.0, max_run_time=20.0)\n        .median\n        * 1000\n    )\n    print(f\"{cmd}: {median: 4.4f} ms\")\n    return median\n\n\n# A tensor in pageable memory\npageable_tensor = torch.randn(1_000_000)\n\n# A tensor in page-locked (pinned) memory\npinned_tensor = torch.randn(1_000_000, pin_memory=True)\n\n# Runtimes:\npageable_to_device = timer(\"pageable_tensor.to('cuda:0')\")\npinned_to_device = timer(\"pinned_tensor.to('cuda:0')\")\npin_mem = timer(\"pageable_tensor.pin_memory()\")\npin_mem_to_device = timer(\"pageable_tensor.pin_memory().to('cuda:0')\")\n\n# Ratios:\nr1 = pinned_to_device / pageable_to_device\nr2 = pin_mem_to_device / pageable_to_device\n\n# Create a figure with the results\nfig, ax = plt.subplots()\n\nxlabels = [0, 1, 2]\nbar_labels = [\n    \"pageable_tensor.to(device) (1x)\",\n    f\"pinned_tensor.to(device) ({r1:4.2f}x)\",\n    f\"pageable_tensor.pin_memory().to(device) ({r2:4.2f}x)\"\n    f\"\\npin_memory()={100*pin_mem/pin_mem_to_device:.2f}% of runtime.\",\n]\nvalues = [pageable_to_device, pinned_to_device, pin_mem_to_device]\ncolors = [\"tab:blue\", \"tab:red\", \"tab:orange\"]\nax.bar(xlabels, values, label=bar_labels, color=colors)\n\nax.set_ylabel(\"Runtime (ms)\")\nax.set_title(\"Device casting runtime (pin-memory)\")\nax.set_xticks([])\nax.legend()\n\nplt.show()\n\n# Clear tensors\ndel pageable_tensor, pinned_tensor\n_ = gc.collect()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can observe that casting a pinned-memory tensor to GPU is indeed much\nfaster than a pageable tensor, because under the hood, a pageable tensor\nmust be copied to pinned memory before being sent to GPU.\n\nHowever, contrary to a somewhat common belief, calling\n`~torch.Tensor.pin_memory()`{.interpreted-text role=\"meth\"} on a\npageable tensor before casting it to GPU should not bring any\nsignificant speed-up, on the contrary this call is usually slower than\njust executing the transfer. This makes sense, since we\\'re actually\nasking Python to execute an operation that CUDA will perform anyway\nbefore copying the data from host to device.\n\n```{=html}\n<div style=\"background-color: #54c7ec; color: #fff; font-weight: 700; padding-left: 10px; padding-top: 5px; padding-bottom: 5px\"><strong>NOTE:</strong></div>\n```\n```{=html}\n<div style=\"background-color: #f3f4f7; padding-left: 10px; padding-top: 10px; padding-bottom: 10px; padding-right: 10px\">\n```\n```{=html}\n<p>The PyTorch implementation of<a href=\"https://github.com/pytorch/pytorch/blob/5298acb5c76855bc5a99ae10016efc86b27949bd/aten/src/ATen/native/Memory.cpp#L58\">pin_memory</a>which relies on creating a brand new storage in pinned memory through <a href=\"https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1gb65da58f444e7230d3322b6126bb4902\">cudaHostAlloc</a>could be, in rare cases, faster than transitioning data in chunks as <code>cudaMemcpy</code> does.Here too, the observation may vary depending on the available hardware, the size of the tensors being sent orthe amount of available RAM.</p>\n```\n```{=html}\n</div>\n```\n`non_blocking=True`\n===================\n\nAs mentioned earlier, many PyTorch operations have the option of being\nexecuted asynchronously with respect to the host through the\n`non_blocking` argument.\n\nHere, to account accurately of the benefits of using `non_blocking`, we\nwill design a slightly more complex experiment since we want to assess\nhow fast it is to send multiple tensors to GPU with and without calling\n`non_blocking`.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# A simple loop that copies all tensors to cuda\ndef copy_to_device(*tensors):\n    result = []\n    for tensor in tensors:\n        result.append(tensor.to(\"cuda:0\"))\n    return result\n\n\n# A loop that copies all tensors to cuda asynchronously\ndef copy_to_device_nonblocking(*tensors):\n    result = []\n    for tensor in tensors:\n        result.append(tensor.to(\"cuda:0\", non_blocking=True))\n    # We need to synchronize\n    torch.cuda.synchronize()\n    return result\n\n\n# Create a list of tensors\ntensors = [torch.randn(1000) for _ in range(1000)]\nto_device = timer(\"copy_to_device(*tensors)\")\nto_device_nonblocking = timer(\"copy_to_device_nonblocking(*tensors)\")\n\n# Ratio\nr1 = to_device_nonblocking / to_device\n\n# Plot the results\nfig, ax = plt.subplots()\n\nxlabels = [0, 1]\nbar_labels = [f\"to(device) (1x)\", f\"to(device, non_blocking=True) ({r1:4.2f}x)\"]\ncolors = [\"tab:blue\", \"tab:red\"]\nvalues = [to_device, to_device_nonblocking]\n\nax.bar(xlabels, values, label=bar_labels, color=colors)\n\nax.set_ylabel(\"Runtime (ms)\")\nax.set_title(\"Device casting runtime (non-blocking)\")\nax.set_xticks([])\nax.legend()\n\nplt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To get a better sense of what is happening here, let us profile these\ntwo functions:\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from torch.profiler import profile, ProfilerActivity\n\n\ndef profile_mem(cmd):\n    with profile(activities=[ProfilerActivity.CPU]) as prof:\n        exec(cmd)\n    print(cmd)\n    print(prof.key_averages().table(row_limit=10))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let\\'s see the call stack with a regular `to(device)` first:\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(\"Call to `to(device)`\", profile_mem(\"copy_to_device(*tensors)\"))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "and now the `non_blocking` version:\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(\n    \"Call to `to(device, non_blocking=True)`\",\n    profile_mem(\"copy_to_device_nonblocking(*tensors)\"),\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The results are without any doubt better when using `non_blocking=True`,\nas all transfers are initiated simultaneously on the host side and only\none synchronization is done.\n\nThe benefit will vary depending on the number and the size of the\ntensors as well as depending on the hardware being used.\n\n```{=html}\n<div style=\"background-color: #54c7ec; color: #fff; font-weight: 700; padding-left: 10px; padding-top: 5px; padding-bottom: 5px\"><strong>NOTE:</strong></div>\n```\n```{=html}\n<div style=\"background-color: #f3f4f7; padding-left: 10px; padding-top: 10px; padding-bottom: 10px; padding-right: 10px\">\n```\n```{=html}\n<p>Interestingly, the blocking <code>to(&quot;cuda&quot;)</code> actually performs the same asynchronous device casting operation(<code>cudaMemcpyAsync</code>) as the one with <code>non_blocking=True</code> with a synchronization point after each copy.</p>\n```\n```{=html}\n</div>\n```\nSynergies\n=========\n\nNow that we have made the point that data transfer of tensors already in\npinned memory to GPU is faster than from pageable memory, and that we\nknow that doing these transfers asynchronously is also faster than\nsynchronously, we can benchmark combinations of these approaches. First,\nlet\\'s write a couple of new functions that will call `pin_memory` and\n`to(device)` on each tensor:\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def pin_copy_to_device(*tensors):\n    result = []\n    for tensor in tensors:\n        result.append(tensor.pin_memory().to(\"cuda:0\"))\n    return result\n\n\ndef pin_copy_to_device_nonblocking(*tensors):\n    result = []\n    for tensor in tensors:\n        result.append(tensor.pin_memory().to(\"cuda:0\", non_blocking=True))\n    # We need to synchronize\n    torch.cuda.synchronize()\n    return result"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The benefits of using `~torch.Tensor.pin_memory`{.interpreted-text\nrole=\"meth\"} are more pronounced for somewhat large batches of large\ntensors:\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "tensors = [torch.randn(1_000_000) for _ in range(1000)]\npage_copy = timer(\"copy_to_device(*tensors)\")\npage_copy_nb = timer(\"copy_to_device_nonblocking(*tensors)\")\n\ntensors_pinned = [torch.randn(1_000_000, pin_memory=True) for _ in range(1000)]\npinned_copy = timer(\"copy_to_device(*tensors_pinned)\")\npinned_copy_nb = timer(\"copy_to_device_nonblocking(*tensors_pinned)\")\n\npin_and_copy = timer(\"pin_copy_to_device(*tensors)\")\npin_and_copy_nb = timer(\"pin_copy_to_device_nonblocking(*tensors)\")\n\n# Plot\nstrategies = (\"pageable copy\", \"pinned copy\", \"pin and copy\")\nblocking = {\n    \"blocking\": [page_copy, pinned_copy, pin_and_copy],\n    \"non-blocking\": [page_copy_nb, pinned_copy_nb, pin_and_copy_nb],\n}\n\nx = torch.arange(3)\nwidth = 0.25\nmultiplier = 0\n\n\nfig, ax = plt.subplots(layout=\"constrained\")\n\nfor attribute, runtimes in blocking.items():\n    offset = width * multiplier\n    rects = ax.bar(x + offset, runtimes, width, label=attribute)\n    ax.bar_label(rects, padding=3, fmt=\"%.2f\")\n    multiplier += 1\n\n# Add some text for labels, title and custom x-axis tick labels, etc.\nax.set_ylabel(\"Runtime (ms)\")\nax.set_title(\"Runtime (pin-mem and non-blocking)\")\nax.set_xticks([0, 1, 2])\nax.set_xticklabels(strategies)\nplt.setp(ax.get_xticklabels(), rotation=45, ha=\"right\", rotation_mode=\"anchor\")\nax.legend(loc=\"upper left\", ncols=3)\n\nplt.show()\n\ndel tensors, tensors_pinned\n_ = gc.collect()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Other copy directions (GPU -\\> CPU, CPU -\\> MPS)\n================================================\n\nUntil now, we have operated under the assumption that asynchronous\ncopies from the CPU to the GPU are safe. This is generally true because\nCUDA automatically handles synchronization to ensure that the data being\naccessed is valid at read time \\_\\_whenever the tensor is in pageable\nmemory\\_\\_.\n\nHowever, in other cases we cannot make the same assumption: when a\ntensor is placed in pinned memory, mutating the original copy after\ncalling the host-to-device transfer may corrupt the data received on\nGPU. Similarly, when a transfer is achieved in the opposite direction,\nfrom GPU to CPU, or from any device that is not CPU or GPU to any device\nthat is not a CUDA-handled GPU (such as, MPS), there is no guarantee\nthat the data read on GPU is valid without explicit synchronization.\n\nIn these scenarios, these transfers offer no assurance that the copy\nwill be complete at the time of data access. Consequently, the data on\nthe host might be incomplete or incorrect, effectively rendering it\ngarbage.\n\nLet\\'s first demonstrate this with a pinned-memory tensor:\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "DELAY = 100000000\ntry:\n    i = -1\n    for i in range(100):\n        # Create a tensor in pin-memory\n        cpu_tensor = torch.ones(1024, 1024, pin_memory=True)\n        torch.cuda.synchronize()\n        # Send the tensor to CUDA\n        cuda_tensor = cpu_tensor.to(\"cuda\", non_blocking=True)\n        torch.cuda._sleep(DELAY)\n        # Corrupt the original tensor\n        cpu_tensor.zero_()\n        assert (cuda_tensor == 1).all()\n    print(\"No test failed with non_blocking and pinned tensor\")\nexcept AssertionError:\n    print(f\"{i}th test failed with non_blocking and pinned tensor. Skipping remaining tests\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Using a pageable tensor always works:\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "i = -1\nfor i in range(100):\n    # Create a tensor in pageable memory\n    cpu_tensor = torch.ones(1024, 1024)\n    torch.cuda.synchronize()\n    # Send the tensor to CUDA\n    cuda_tensor = cpu_tensor.to(\"cuda\", non_blocking=True)\n    torch.cuda._sleep(DELAY)\n    # Corrupt the original tensor\n    cpu_tensor.zero_()\n    assert (cuda_tensor == 1).all()\nprint(\"No test failed with non_blocking and pageable tensor\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now let\\'s demonstrate that CUDA to CPU also fails to produce reliable\noutputs without synchronization:\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "tensor = (\n    torch.arange(1, 1_000_000, dtype=torch.double, device=\"cuda\")\n    .expand(100, 999999)\n    .clone()\n)\ntorch.testing.assert_close(\n    tensor.mean(), torch.tensor(500_000, dtype=torch.double, device=\"cuda\")\n), tensor.mean()\ntry:\n    i = -1\n    for i in range(100):\n        cpu_tensor = tensor.to(\"cpu\", non_blocking=True)\n        torch.testing.assert_close(\n            cpu_tensor.mean(), torch.tensor(500_000, dtype=torch.double)\n        )\n    print(\"No test failed with non_blocking\")\nexcept AssertionError:\n    print(f\"{i}th test failed with non_blocking. Skipping remaining tests\")\ntry:\n    i = -1\n    for i in range(100):\n        cpu_tensor = tensor.to(\"cpu\", non_blocking=True)\n        torch.cuda.synchronize()\n        torch.testing.assert_close(\n            cpu_tensor.mean(), torch.tensor(500_000, dtype=torch.double)\n        )\n    print(\"No test failed with synchronize\")\nexcept AssertionError:\n    print(f\"One test failed with synchronize: {i}th assertion!\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Generally, asynchronous copies to a device are safe without explicit\nsynchronization only when the target is a CUDA-enabled device and the\noriginal tensor is in pageable memory.\n\nIn summary, copying data from CPU to GPU is safe when using\n`non_blocking=True`, but for any other direction, `non_blocking=True`\ncan still be used but the user must make sure that a device\nsynchronization is executed before the data is accessed.\n\nPractical recommendations\n=========================\n\nWe can now wrap up some early recommendations based on our observations:\n\nIn general, `non_blocking=True` will provide good throughput, regardless\nof whether the original tensor is or isn\\'t in pinned memory. If the\ntensor is already in pinned memory, the transfer can be accelerated, but\nsending it to pin memory manually from python main thread is a blocking\noperation on the host, and hence will annihilate much of the benefit of\nusing `non_blocking=True` (as CUDA does the [pin\\_memory]{.title-ref}\ntransfer anyway).\n\nOne might now legitimately ask what use there is for the\n`~torch.Tensor.pin_memory`{.interpreted-text role=\"meth\"} method. In the\nfollowing section, we will explore further how this can be used to\naccelerate the data transfer even more.\n\nAdditional considerations\n=========================\n\nPyTorch notoriously provides a\n`~torch.utils.data.DataLoader`{.interpreted-text role=\"class\"} class\nwhose constructor accepts a `pin_memory` argument. Considering our\nprevious discussion on `pin_memory`, you might wonder how the\n`DataLoader` manages to accelerate data transfers if memory pinning is\ninherently blocking.\n\nThe key lies in the DataLoader\\'s use of a separate thread to handle the\ntransfer of data from pageable to pinned memory, thus preventing any\nblockage in the main thread.\n\nTo illustrate this, we will use the TensorDict primitive from the\nhomonymous library. When invoking\n`~tensordict.TensorDict.to`{.interpreted-text role=\"meth\"}, the default\nbehavior is to send tensors to the device asynchronously, followed by a\nsingle call to `torch.device.synchronize()` afterwards.\n\nAdditionally, `TensorDict.to()` includes a `non_blocking_pin` option\nwhich initiates multiple threads to execute `pin_memory()` before\nproceeding with to `to(device)`. This approach can further accelerate\ndata transfers, as demonstrated in the following example.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from tensordict import TensorDict\nimport torch\nfrom torch.utils.benchmark import Timer\nimport matplotlib.pyplot as plt\n\n# Create the dataset\ntd = TensorDict({str(i): torch.randn(1_000_000) for i in range(1000)})\n\n# Runtimes\ncopy_blocking = timer(\"td.to('cuda:0', non_blocking=False)\")\ncopy_non_blocking = timer(\"td.to('cuda:0')\")\ncopy_pin_nb = timer(\"td.to('cuda:0', non_blocking_pin=True, num_threads=0)\")\ncopy_pin_multithread_nb = timer(\"td.to('cuda:0', non_blocking_pin=True, num_threads=4)\")\n\n# Rations\nr1 = copy_non_blocking / copy_blocking\nr2 = copy_pin_nb / copy_blocking\nr3 = copy_pin_multithread_nb / copy_blocking\n\n# Figure\nfig, ax = plt.subplots()\n\nxlabels = [0, 1, 2, 3]\nbar_labels = [\n    \"Blocking copy (1x)\",\n    f\"Non-blocking copy ({r1:4.2f}x)\",\n    f\"Blocking pin, non-blocking copy ({r2:4.2f}x)\",\n    f\"Non-blocking pin, non-blocking copy ({r3:4.2f}x)\",\n]\nvalues = [copy_blocking, copy_non_blocking, copy_pin_nb, copy_pin_multithread_nb]\ncolors = [\"tab:blue\", \"tab:red\", \"tab:orange\", \"tab:green\"]\n\nax.bar(xlabels, values, label=bar_labels, color=colors)\n\nax.set_ylabel(\"Runtime (ms)\")\nax.set_title(\"Device casting runtime\")\nax.set_xticks([])\nax.legend()\n\nplt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In this example, we are transferring many large tensors from the CPU to\nthe GPU. This scenario is ideal for utilizing multithreaded\n`pin_memory()`, which can significantly enhance performance. However, if\nthe tensors are small, the overhead associated with multithreading may\noutweigh the benefits. Similarly, if there are only a few tensors, the\nadvantages of pinning tensors on separate threads become limited.\n\nAs an additional note, while it might seem advantageous to create\npermanent buffers in pinned memory to shuttle tensors from pageable\nmemory before transferring them to the GPU, this strategy does not\nnecessarily expedite computation. The inherent bottleneck caused by\ncopying data into pinned memory remains a limiting factor.\n\nMoreover, transferring data that resides on disk (whether in shared\nmemory or files) to the GPU typically requires an intermediate step of\ncopying the data into pinned memory (located in RAM). Utilizing\nnon\\_blocking for large data transfers in this context can significantly\nincrease RAM consumption, potentially leading to adverse effects.\n\nIn practice, there is no one-size-fits-all solution. The effectiveness\nof using multithreaded `pin_memory` combined with `non_blocking`\ntransfers depends on a variety of factors, including the specific\nsystem, operating system, hardware, and the nature of the tasks being\nexecuted. Here is a list of factors to check when trying to speed-up\ndata transfers between CPU and GPU, or comparing throughput\\'s across\nscenarios:\n\n-   **Number of available cores**\n\n    How many CPU cores are available? Is the system shared with other\n    users or processes that might compete for resources?\n\n-   **Core utilization**\n\n    Are the CPU cores heavily utilized by other processes? Does the\n    application perform other CPU-intensive tasks concurrently with data\n    transfers?\n\n-   **Memory utilization**\n\n    How much pageable and page-locked memory is currently being used? Is\n    there sufficient free memory to allocate additional pinned memory\n    without affecting system performance? Remember that nothing comes\n    for free, for instance `pin_memory` will consume RAM and may impact\n    other tasks.\n\n-   **CUDA Device Capabilities**\n\n    Does the GPU support multiple DMA engines for concurrent data\n    transfers? What are the specific capabilities and limitations of the\n    CUDA device being used?\n\n-   **Number of tensors to be sent**\n\n    How many tensors are transferred in a typical operation?\n\n-   **Size of the tensors to be sent**\n\n    What is the size of the tensors being transferred? A few large\n    tensors or many small tensors may not benefit from the same transfer\n    program.\n\n-   **System Architecture**\n\n    How is the system\\'s architecture influencing data transfer speeds\n    (for example, bus speeds, network latency)?\n\nAdditionally, allocating a large number of tensors or sizable tensors in\npinned memory can monopolize a substantial portion of RAM. This reduces\nthe available memory for other critical operations, such as paging,\nwhich can negatively impact the overall performance of an algorithm.\n\nConclusion\n==========\n\nThroughout this tutorial, we have explored several critical factors that\ninfluence transfer speeds and memory management when sending tensors\nfrom the host to the device. We\\'ve learned that using\n`non_blocking=True` generally accelerates data transfers, and that\n`~torch.Tensor.pin_memory`{.interpreted-text role=\"meth\"} can also\nenhance performance if implemented correctly. However, these techniques\nrequire careful design and calibration to be effective.\n\nRemember that profiling your code and keeping an eye on the memory\nconsumption are essential to optimize resource usage and achieve the\nbest possible performance.\n\nAdditional resources\n====================\n\nIf you are dealing with issues with memory copies when using CUDA\ndevices or want to learn more about what was discussed in this tutorial,\ncheck the following references:\n\n-   [CUDA toolkit memory management\n    doc](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html);\n-   [CUDA pin-memory\n    note](https://forums.developer.nvidia.com/t/pinned-memory/268474);\n-   [How to Optimize Data Transfers in CUDA\n    C/C++](https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/);\n-   [tensordict doc](https://pytorch.org/tensordict/stable/index.html)\n    and [repo](https://github.com/pytorch/tensordict).\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.12"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}