{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# For tips on running notebooks in Google Colab, see\n# https://codelin.vip/beginner/colab\n%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A guide on good usage of `non_blocking` and `pin_memory()` in PyTorch\n=====================================================================\n\n**Author**: [Vincent Moens](https://github.com/vmoens)\n\nIntroduction\n------------\n\nTransferring data from the CPU to the GPU is fundamental in many PyTorch\napplications. It\\'s crucial for users to understand the most effective\ntools and options available for moving data between devices. This\ntutorial examines two key methods for device-to-device data transfer in\nPyTorch: `~torch.Tensor.pin_memory`{.interpreted-text role=\"meth\"} and\n`~torch.Tensor.to`{.interpreted-text role=\"meth\"} with the\n`non_blocking=True` option.\n\n### What you will learn\n\nOptimizing the transfer of tensors from the CPU to the GPU can be\nachieved through asynchronous transfers and memory pinning. However,\nthere are important considerations:\n\n- Using `tensor.pin_memory().to(device, non_blocking=True)` can be up\n to twice as slow as a straightforward `tensor.to(device)`.\n- Generally, `tensor.to(device, non_blocking=True)` is an effective\n choice for enhancing transfer speed.\n- While `cpu_tensor.to(\"cuda\", non_blocking=True).mean()` executes\n correctly, attempting\n `cuda_tensor.to(\"cpu\", non_blocking=True).mean()` will result in\n erroneous outputs.\n\n### Preamble\n\nThe performance numbers reported in this tutorial are conditioned on the\nsystem used to build the tutorial. Although the conclusions are applicable\nacross different systems, the specific observations may vary slightly\ndepending on the hardware available, especially on older hardware. The\nprimary objective of this tutorial is to offer a theoretical framework\nfor understanding CPU to GPU data transfers. However, any design\ndecisions should be tailored to individual cases and guided by\nbenchmarked throughput measurements, as well as the specific\nrequirements of the task at hand.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import torch\n\nassert torch.cuda.is_available(), \"A cuda device is required to run this tutorial\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tutorial requires tensordict to be installed. If you don\\'t have\ntensordict in your environment yet, install it by running the following\ncommand in a separate cell:\n\n``` {.bash}\n# Install tensordict with the following command\n!pip3 install tensordict\n```\n\nWe start by outlining the theory surrounding these concepts, and then\nmove to concrete test examples of the features.\n\nBackground\n==========\n\nMemory management basics\n------------------------\n\nWhen one creates a CPU tensor in PyTorch, the content of this tensor\nneeds to be placed in memory. The memory we talk about here is a rather\ncomplex concept worth looking at carefully. We distinguish two types of\nmemory that are handled by the Memory Management Unit: the RAM (for\nsimplicity) and the swap space on disk (which may or may not be the hard\ndrive). Together, the available space on disk and in RAM (physical memory)\nmakes up the virtual memory, which is an abstraction of the total\nresources available. 
In short, the virtual memory makes it so that the\navailable space is larger than what can be found on RAM in isolation and\ncreates the illusion that the main memory is larger than it actually is.\n\nIn normal circumstances, a regular CPU tensor is pageable, which means\nthat it is divided into blocks called pages that can live anywhere in the\nvirtual memory (either in RAM or on disk). As mentioned earlier, this has\nthe advantage that the memory seems larger than what the main memory\nactually is.\n\nTypically, when a program accesses a page that is not in RAM, a \\"page\nfault\\" occurs and the operating system (OS) then brings back this page\ninto RAM (\\"swap in\\" or \\"page in\\"). In turn, the OS may have to swap\nout (or \\"page out\\") another page to make room for the new page.\n\nIn contrast to pageable memory, pinned (or page-locked or\nnon-pageable) memory is a type of memory that cannot be swapped out to\ndisk. It allows for faster and more predictable access times, but has\nthe downside that it is more limited than the pageable memory (aka the\nmain memory).\n\nCUDA and (non-)pageable memory\n------------------------------\n\nTo understand how CUDA copies a tensor from CPU to CUDA, let\\'s consider\nthe two scenarios above:\n\n- If the memory is page-locked, the device can access the memory\n directly in the main memory. The memory addresses are well defined\n and functions that need to read these data can be significantly\n accelerated.\n- If the memory is pageable, all the pages will have to be brought to\n the main memory before being sent to the GPU. This operation may\n take time and is less predictable than when executed on page-locked\n tensors.\n\nMore precisely, when CUDA sends pageable data from CPU to GPU, it must\nfirst create a page-locked copy of that data before making the transfer.\n\nAsynchronous vs. Synchronous Operations with `non_blocking=True` (CUDA `cudaMemcpyAsync`)\n-----------------------------------------------------------------------------------------\n\nWhen executing a copy from a host (such as a CPU) to a device (such as a\nGPU), the CUDA toolkit offers ways to perform these operations\nsynchronously or asynchronously with respect to the host.\n\nIn practice, when calling `~torch.Tensor.to`{.interpreted-text\nrole=\"meth\"}, PyTorch always makes a call to\n[cudaMemcpyAsync](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1g85073372f776b4c4d5f89f7124b7bf79).\nIf `non_blocking=False` (default), a `cudaStreamSynchronize` will be\ncalled after each and every `cudaMemcpyAsync`, making the call to\n`~torch.Tensor.to`{.interpreted-text role=\"meth\"} blocking in the main\nthread. If `non_blocking=True`, no synchronization is triggered, and the\nmain thread on the host is not blocked. Therefore, from the host\nperspective, multiple tensors can be sent to the device simultaneously,\nas the thread does not need to wait for one transfer to be completed to\ninitiate another.\n\n```{=html}\n
<div class="alert alert-info"><h4>Note</h4><p>In general, the transfer is blocking on the device side (even if it isn't on the host side): the copy on the device cannot occur while another operation is being executed. However, in some advanced scenarios, a copy and a kernel execution can be done simultaneously on the GPU side. As the following example will show, three requirements must be met to enable this:</p><ul><li>The device must have at least one free DMA (Direct Memory Access) engine to carry out the copy.</li><li>The transfer must be issued on a separate, non-default CUDA stream.</li><li>The source data must be allocated in pinned memory.</li></ul></div>
```

We demonstrate this by running profiles on the following script.
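Below is a minimal sketch of such a profiling script (the tensor sizes,
iteration count, and stream choice are illustrative assumptions): a kernel
runs on the default stream while a copy from pinned memory is issued with
`non_blocking=True` on a side stream, and a trace is collected with
`torch.profiler`.

``` {.python}
import torch
from torch.profiler import profile, ProfilerActivity

side_stream = torch.cuda.Stream()
pinned = torch.randn(1024, 1024).pin_memory()        # page-locked source
gpu_work = torch.randn(1024, 1024, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        # Kernel executing on the default stream...
        gpu_work = gpu_work @ gpu_work
        # ...while the copy is issued on a separate stream from pinned memory.
        with torch.cuda.stream(side_stream):
            on_device = pinned.to("cuda", non_blocking=True)
    torch.cuda.synchronize()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```

In the resulting trace, the host-to-device copies can overlap with the matrix
multiplications because all three conditions above are satisfied; dropping any
of them (for example, using a pageable source tensor) typically serializes the
two.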
```{=html}
<div class="alert alert-info"><h4>Note</h4><p>The PyTorch implementation of <code>pin_memory</code>, which relies on creating a brand new storage in pinned memory through <code>cudaHostAlloc</code>, could be, in rare cases, faster than transitioning data in chunks as <code>cudaMemcpy</code> does. Here too, the observation may vary depending on the available hardware, the size of the tensors being sent, or the amount of available RAM.</p></div>
```
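To see the cost of the extra page-locked copy in practice, here is a small
sketch, assuming a tensor of roughly 1 GiB, that times a direct transfer
against an explicit `pin_memory()` followed by the copy. As noted in the
introduction, the second pattern can be noticeably slower because the pinned
copy is created on the main thread before the transfer starts.

``` {.python}
import torch
from torch.utils.benchmark import Timer

t = torch.randn(1024, 1024, 256)  # ~1 GiB of float32 data

direct = Timer(
    stmt="t.to('cuda', non_blocking=True); torch.cuda.synchronize()",
    globals={"t": t, "torch": torch},
)
pin_then_copy = Timer(
    stmt="t.pin_memory().to('cuda', non_blocking=True); torch.cuda.synchronize()",
    globals={"t": t, "torch": torch},
)

print(direct.blocked_autorange())
print(pin_then_copy.blocked_autorange())
```

Absolute numbers depend heavily on the machine, so treat this comparison as a
methodology rather than a conclusion.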
```{=html}
<div class="alert alert-info"><h4>Note</h4><p>Interestingly, the blocking <code>to("cuda")</code> actually performs the same asynchronous device casting operation (<code>cudaMemcpyAsync</code>) as the one with <code>non_blocking=True</code>, with a synchronization point after each copy.</p></div>
```
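A short sketch of what this implies in practice is shown below; the shapes and
the number of tensors are arbitrary. With `non_blocking=True` the
synchronization point becomes the user's responsibility, which is also why the
device-to-host direction mentioned in the introduction requires an explicit
`torch.cuda.synchronize()` before the result is read.

``` {.python}
import torch

cpu_tensors = [torch.randn(1024, 1024).pin_memory() for _ in range(8)]

# Blocking: each copy is a cudaMemcpyAsync followed by a synchronization.
on_gpu_blocking = [t.to("cuda") for t in cpu_tensors]

# Non-blocking: all copies are queued first, then we synchronize once
# before the tensors are used.
on_gpu = [t.to("cuda", non_blocking=True) for t in cpu_tensors]
torch.cuda.synchronize()

# Device-to-host is the risky direction: synchronize before reading the
# CPU tensor, otherwise its content may not be populated yet.
back_on_cpu = on_gpu[0].to("cpu", non_blocking=True)
torch.cuda.synchronize()
print(back_on_cpu.mean())
```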