{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# For tips on running notebooks in Google Colab, see\n# https://codelin.vip/beginner/colab\n%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Reinforcement Learning (PPO) with TorchRL Tutorial\n==================================================\n\n**Author**: [Vincent Moens](https://github.com/vmoens)\n\nThis tutorial demonstrates how to use PyTorch and\n:py`torchrl`{.interpreted-text role=\"mod\"} to train a parametric policy\nnetwork to solve the Inverted Pendulum task from the\n[OpenAI-Gym/Farama-Gymnasium control\nlibrary](https://github.com/Farama-Foundation/Gymnasium).\n\n![Inverted\npendulum](https://pytorch.org/tutorials/_static/img/invpendulum.gif)\n\nKey learnings:\n\n-   How to create an environment in TorchRL, transform its outputs, and\n    collect data from this environment;\n-   How to make your classes talk to each other using\n    `~tensordict.TensorDict`{.interpreted-text role=\"class\"};\n-   The basics of building your training loop with TorchRL:\n    -   How to compute the advantage signal for policy gradient methods;\n    -   How to create a stochastic policy using a probabilistic neural\n        network;\n    -   How to create a dynamic replay buffer and sample from it without\n        repetition.\n\nWe will cover six crucial components of TorchRL:\n\n-   [environments](https://pytorch.org/rl/reference/envs.html)\n-   [transforms](https://pytorch.org/rl/reference/envs.html#transforms)\n-   [models (policy and value\n    function)](https://pytorch.org/rl/reference/modules.html)\n-   [loss modules](https://pytorch.org/rl/reference/objectives.html)\n-   [data collectors](https://pytorch.org/rl/reference/collectors.html)\n-   [replay\n    buffers](https://pytorch.org/rl/reference/data.html#replay-buffers)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "If you are running this in Google Colab, make sure you install the\nfollowing dependencies:\n\n``` {.bash}\n!pip3 install torchrl\n!pip3 install gym[mujoco]\n!pip3 install tqdm\n```\n\nProximal Policy Optimization (PPO) is a policy-gradient algorithm where\na batch of data is being collected and directly consumed to train the\npolicy to maximise the expected return given some proximality\nconstraints. You can think of it as a sophisticated version of\n[REINFORCE](https://link.springer.com/content/pdf/10.1007/BF00992696.pdf),\nthe foundational policy-optimization algorithm. For more information,\nsee the [Proximal Policy Optimization\nAlgorithms](https://arxiv.org/abs/1707.06347) paper.\n\nPPO is usually regarded as a fast and efficient method for online,\non-policy reinforcement algorithm. TorchRL provides a loss-module that\ndoes all the work for you, so that you can rely on this implementation\nand focus on solving your problem rather than re-inventing the wheel\nevery time you want to train a policy.\n\nFor completeness, here is a brief overview of what the loss computes,\neven though this is taken care of by our\n`~torchrl.objectives.ClipPPOLoss`{.interpreted-text role=\"class\"}\nmodule---the algorithm works as follows: 1. we will sample a batch of\ndata by playing the policy in the environment for a given number of\nsteps. 2. Then, we will perform a given number of optimization steps\nwith random sub-samples of this batch using a clipped version of the\nREINFORCE loss. 3. The clipping will put a pessimistic bound on our\nloss: lower return estimates will be favored compared to higher ones.\nThe precise formula of the loss is:\n\n$$L(s,a,\\theta_k,\\theta) = \\min\\left(\n\\frac{\\pi_{\\theta}(a|s)}{\\pi_{\\theta_k}(a|s)}  A^{\\pi_{\\theta_k}}(s,a), \\;\\;\ng(\\epsilon, A^{\\pi_{\\theta_k}}(s,a))\n\\right),$$\n\nThere are two components in that loss: in the first part of the minimum\noperator, we simply compute an importance-weighted version of the\nREINFORCE loss (for example, a REINFORCE loss that we have corrected for\nthe fact that the current policy configuration lags the one that was\nused for the data collection). The second part of that minimum operator\nis a similar loss where we have clipped the ratios when they exceeded or\nwere below a given pair of thresholds.\n\nThis loss ensures that whether the advantage is positive or negative,\npolicy updates that would produce significant shifts from the previous\nconfiguration are being discouraged.\n\nThis tutorial is structured as follows:\n\n1.  First, we will define a set of hyperparameters we will be using for\n    training.\n2.  Next, we will focus on creating our environment, or simulator, using\n    TorchRL\\'s wrappers and transforms.\n3.  Next, we will design the policy network and the value model, which\n    is indispensable to the loss function. These modules will be used to\n    configure our loss module.\n4.  Next, we will create the replay buffer and data loader.\n5.  Finally, we will run our training loop and analyze the results.\n\nThroughout this tutorial, we\\'ll be using the\n`tensordict`{.interpreted-text role=\"mod\"} library.\n`~tensordict.TensorDict`{.interpreted-text role=\"class\"} is the lingua\nfranca of TorchRL: it helps us abstract what a module reads and writes\nand care less about the specific data description and more about the\nalgorithm itself.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import warnings\nwarnings.filterwarnings(\"ignore\")\nfrom torch import multiprocessing\n\n\nfrom collections import defaultdict\n\nimport matplotlib.pyplot as plt\nimport torch\nfrom tensordict.nn import TensorDictModule\nfrom tensordict.nn.distributions import NormalParamExtractor\nfrom torch import nn\nfrom torchrl.collectors import SyncDataCollector\nfrom torchrl.data.replay_buffers import ReplayBuffer\nfrom torchrl.data.replay_buffers.samplers import SamplerWithoutReplacement\nfrom torchrl.data.replay_buffers.storages import LazyTensorStorage\nfrom torchrl.envs import (Compose, DoubleToFloat, ObservationNorm, StepCounter,\n                          TransformedEnv)\nfrom torchrl.envs.libs.gym import GymEnv\nfrom torchrl.envs.utils import check_env_specs, ExplorationType, set_exploration_type\nfrom torchrl.modules import ProbabilisticActor, TanhNormal, ValueOperator\nfrom torchrl.objectives import ClipPPOLoss\nfrom torchrl.objectives.value import GAE\nfrom tqdm import tqdm"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Define Hyperparameters\n======================\n\nWe set the hyperparameters for our algorithm. Depending on the resources\navailable, one may choose to execute the policy on GPU or on another\ndevice. The `frame_skip` will control how for how many frames is a\nsingle action being executed. The rest of the arguments that count\nframes must be corrected for this value (since one environment step will\nactually return `frame_skip` frames).\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "is_fork = multiprocessing.get_start_method() == \"fork\"\ndevice = (\n    torch.device(0)\n    if torch.cuda.is_available() and not is_fork\n    else torch.device(\"cpu\")\n)\nnum_cells = 256  # number of cells in each layer i.e. output dim.\nlr = 3e-4\nmax_grad_norm = 1.0"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Data collection parameters\n==========================\n\nWhen collecting data, we will be able to choose how big each batch will\nbe by defining a `frames_per_batch` parameter. We will also define how\nmany frames (such as the number of interactions with the simulator) we\nwill allow ourselves to use. In general, the goal of an RL algorithm is\nto learn to solve the task as fast as it can in terms of environment\ninteractions: the lower the `total_frames` the better.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "frames_per_batch = 1000\n# For a complete training, bring the number of frames up to 1M\ntotal_frames = 50_000"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "PPO parameters\n==============\n\nAt each data collection (or batch collection) we will run the\noptimization over a certain number of *epochs*, each time consuming the\nentire data we just acquired in a nested training loop. Here, the\n`sub_batch_size` is different from the `frames_per_batch` here above:\nrecall that we are working with a \\\"batch of data\\\" coming from our\ncollector, which size is defined by `frames_per_batch`, and that we will\nfurther split in smaller sub-batches during the inner training loop. The\nsize of these sub-batches is controlled by `sub_batch_size`.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "sub_batch_size = 64  # cardinality of the sub-samples gathered from the current data in the inner loop\nnum_epochs = 10  # optimization steps per batch of data collected\nclip_epsilon = (\n    0.2  # clip value for PPO loss: see the equation in the intro for more context.\n)\ngamma = 0.99\nlmbda = 0.95\nentropy_eps = 1e-4"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Define an environment\n=====================\n\nIn RL, an *environment* is usually the way we refer to a simulator or a\ncontrol system. Various libraries provide simulation environments for\nreinforcement learning, including Gymnasium (previously OpenAI Gym),\nDeepMind control suite, and many others. As a general library,\nTorchRL\\'s goal is to provide an interchangeable interface to a large\npanel of RL simulators, allowing you to easily swap one environment with\nanother. For example, creating a wrapped gym environment can be achieved\nwith few characters:\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "base_env = GymEnv(\"InvertedDoublePendulum-v4\", device=device)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "There are a few things to notice in this code: first, we created the\nenvironment by calling the `GymEnv` wrapper. If extra keyword arguments\nare passed, they will be transmitted to the `gym.make` method, hence\ncovering the most common environment construction commands.\nAlternatively, one could also directly create a gym environment using\n`gym.make(env_name, **kwargs)` and wrap it in a [GymWrapper]{.title-ref}\nclass.\n\nAlso the `device` argument: for gym, this only controls the device where\ninput action and observed states will be stored, but the execution will\nalways be done on CPU. The reason for this is simply that gym does not\nsupport on-device execution, unless specified otherwise. For other\nlibraries, we have control over the execution device and, as much as we\ncan, we try to stay consistent in terms of storing and execution\nbackends.\n\nTransforms\n==========\n\nWe will append some transforms to our environments to prepare the data\nfor the policy. In Gym, this is usually achieved via wrappers. TorchRL\ntakes a different approach, more similar to other pytorch domain\nlibraries, through the use of transforms. To add transforms to an\nenvironment, one should simply wrap it in a\n`~torchrl.envs.transforms.TransformedEnv`{.interpreted-text\nrole=\"class\"} instance and append the sequence of transforms to it. The\ntransformed environment will inherit the device and meta-data of the\nwrapped environment, and transform these depending on the sequence of\ntransforms it contains.\n\nNormalization\n=============\n\nThe first to encode is a normalization transform. As a rule of thumbs,\nit is preferable to have data that loosely match a unit Gaussian\ndistribution: to obtain this, we will run a certain number of random\nsteps in the environment and compute the summary statistics of these\nobservations.\n\nWe\\'ll append two other transforms: the\n`~torchrl.envs.transforms.DoubleToFloat`{.interpreted-text role=\"class\"}\ntransform will convert double entries to single-precision numbers, ready\nto be read by the policy. The\n`~torchrl.envs.transforms.StepCounter`{.interpreted-text role=\"class\"}\ntransform will be used to count the steps before the environment is\nterminated. We will use this measure as a supplementary measure of\nperformance.\n\nAs we will see later, many of the TorchRL\\'s classes rely on\n`~tensordict.TensorDict`{.interpreted-text role=\"class\"} to communicate.\nYou could think of it as a python dictionary with some extra tensor\nfeatures. In practice, this means that many modules we will be working\nwith need to be told what key to read (`in_keys`) and what key to write\n(`out_keys`) in the `tensordict` they will receive. Usually, if\n`out_keys` is omitted, it is assumed that the `in_keys` entries will be\nupdated in-place. For our transforms, the only entry we are interested\nin is referred to as `\"observation\"` and our transform layers will be\ntold to modify this entry and this entry only:\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "env = TransformedEnv(\n    base_env,\n    Compose(\n        # normalize observations\n        ObservationNorm(in_keys=[\"observation\"]),\n        DoubleToFloat(),\n        StepCounter(),\n    ),\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As you may have noticed, we have created a normalization layer but we\ndid not set its normalization parameters. To do this,\n`~torchrl.envs.transforms.ObservationNorm`{.interpreted-text\nrole=\"class\"} can automatically gather the summary statistics of our\nenvironment:\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "env.transform[0].init_stats(num_iter=1000, reduce_dim=0, cat_dim=0)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The `~torchrl.envs.transforms.ObservationNorm`{.interpreted-text\nrole=\"class\"} transform has now been populated with a location and a\nscale that will be used to normalize the data.\n\nLet us do a little sanity check for the shape of our summary stats:\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(\"normalization constant shape:\", env.transform[0].loc.shape)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "An environment is not only defined by its simulator and transforms, but\nalso by a series of metadata that describe what can be expected during\nits execution. For efficiency purposes, TorchRL is quite stringent when\nit comes to environment specs, but you can easily check that your\nenvironment specs are adequate. In our example, the\n`~torchrl.envs.libs.gym.GymWrapper`{.interpreted-text role=\"class\"} and\n`~torchrl.envs.libs.gym.GymEnv`{.interpreted-text role=\"class\"} that\ninherits from it already take care of setting the proper specs for your\nenvironment so you should not have to care about this.\n\nNevertheless, let\\'s see a concrete example using our transformed\nenvironment by looking at its specs. There are three specs to look at:\n`observation_spec` which defines what is to be expected when executing\nan action in the environment, `reward_spec` which indicates the reward\ndomain and finally the `input_spec` (which contains the `action_spec`)\nand which represents everything an environment requires to execute a\nsingle step.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(\"observation_spec:\", env.observation_spec)\nprint(\"reward_spec:\", env.reward_spec)\nprint(\"input_spec:\", env.input_spec)\nprint(\"action_spec (as defined by input_spec):\", env.action_spec)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "the `check_env_specs`{.interpreted-text role=\"func\"} function runs a\nsmall rollout and compares its output against the environment specs. If\nno error is raised, we can be confident that the specs are properly\ndefined:\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "check_env_specs(env)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "For fun, let\\'s see what a simple random rollout looks like. You can\ncall [env.rollout(n\\_steps)]{.title-ref} and get an overview of what the\nenvironment inputs and outputs look like. Actions will automatically be\ndrawn from the action spec domain, so you don\\'t need to care about\ndesigning a random sampler.\n\nTypically, at each step, an RL environment receives an action as input,\nand outputs an observation, a reward and a done state. The observation\nmay be composite, meaning that it could be composed of more than one\ntensor. This is not a problem for TorchRL, since the whole set of\nobservations is automatically packed in the output\n`~tensordict.TensorDict`{.interpreted-text role=\"class\"}. After\nexecuting a rollout (for example, a sequence of environment steps and\nrandom action generations) over a given number of steps, we will\nretrieve a `~tensordict.TensorDict`{.interpreted-text role=\"class\"}\ninstance with a shape that matches this trajectory length:\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "rollout = env.rollout(3)\nprint(\"rollout of three steps:\", rollout)\nprint(\"Shape of the rollout TensorDict:\", rollout.batch_size)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Our rollout data has a shape of `torch.Size([3])`, which matches the\nnumber of steps we ran it for. The `\"next\"` entry points to the data\ncoming after the current step. In most cases, the `\"next\"` data at time\n[t]{.title-ref} matches the data at `t+1`, but this may not be the case\nif we are using some specific transformations (for example, multi-step).\n\nPolicy\n======\n\nPPO utilizes a stochastic policy to handle exploration. This means that\nour neural network will have to output the parameters of a distribution,\nrather than a single value corresponding to the action taken.\n\nAs the data is continuous, we use a Tanh-Normal distribution to respect\nthe action space boundaries. TorchRL provides such distribution, and the\nonly thing we need to care about is to build a neural network that\noutputs the right number of parameters for the policy to work with (a\nlocation, or mean, and a scale):\n\n$$f_{\\theta}(\\text{observation}) = \\mu_{\\theta}(\\text{observation}), \\sigma^{+}_{\\theta}(\\text{observation})$$\n\nThe only extra-difficulty that is brought up here is to split our output\nin two equal parts and map the second to a strictly positive space.\n\nWe design the policy in three steps:\n\n1.  Define a neural network `D_obs` -\\> `2 * D_action`. Indeed, our\n    `loc` (mu) and `scale` (sigma) both have dimension `D_action`.\n2.  Append a\n    `~tensordict.nn.distributions.NormalParamExtractor`{.interpreted-text\n    role=\"class\"} to extract a location and a scale (for example, splits\n    the input in two equal parts and applies a positive transformation\n    to the scale parameter).\n3.  Create a probabilistic\n    `~tensordict.nn.TensorDictModule`{.interpreted-text role=\"class\"}\n    that can generate this distribution and sample from it.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "actor_net = nn.Sequential(\n    nn.LazyLinear(num_cells, device=device),\n    nn.Tanh(),\n    nn.LazyLinear(num_cells, device=device),\n    nn.Tanh(),\n    nn.LazyLinear(num_cells, device=device),\n    nn.Tanh(),\n    nn.LazyLinear(2 * env.action_spec.shape[-1], device=device),\n    NormalParamExtractor(),\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To enable the policy to \\\"talk\\\" with the environment through the\n`tensordict` data carrier, we wrap the `nn.Module` in a\n`~tensordict.nn.TensorDictModule`{.interpreted-text role=\"class\"}. This\nclass will simply ready the `in_keys` it is provided with and write the\noutputs in-place at the registered `out_keys`.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "policy_module = TensorDictModule(\n    actor_net, in_keys=[\"observation\"], out_keys=[\"loc\", \"scale\"]\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We now need to build a distribution out of the location and scale of our\nnormal distribution. To do so, we instruct the\n`~torchrl.modules.tensordict_module.ProbabilisticActor`{.interpreted-text\nrole=\"class\"} class to build a\n`~torchrl.modules.TanhNormal`{.interpreted-text role=\"class\"} out of the\nlocation and scale parameters. We also provide the minimum and maximum\nvalues of this distribution, which we gather from the environment specs.\n\nThe name of the `in_keys` (and hence the name of the `out_keys` from the\n`~tensordict.nn.TensorDictModule`{.interpreted-text role=\"class\"} above)\ncannot be set to any value one may like, as the\n`~torchrl.modules.TanhNormal`{.interpreted-text role=\"class\"}\ndistribution constructor will expect the `loc` and `scale` keyword\narguments. That being said,\n`~torchrl.modules.tensordict_module.ProbabilisticActor`{.interpreted-text\nrole=\"class\"} also accepts `Dict[str, str]` typed `in_keys` where the\nkey-value pair indicates what `in_key` string should be used for every\nkeyword argument that is to be used.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "policy_module = ProbabilisticActor(\n    module=policy_module,\n    spec=env.action_spec,\n    in_keys=[\"loc\", \"scale\"],\n    distribution_class=TanhNormal,\n    distribution_kwargs={\n        \"low\": env.action_spec.space.low,\n        \"high\": env.action_spec.space.high,\n    },\n    return_log_prob=True,\n    # we'll need the log-prob for the numerator of the importance weights\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Value network\n=============\n\nThe value network is a crucial component of the PPO algorithm, even\nthough it won\\'t be used at inference time. This module will read the\nobservations and return an estimation of the discounted return for the\nfollowing trajectory. This allows us to amortize learning by relying on\nthe some utility estimation that is learned on-the-fly during training.\nOur value network share the same structure as the policy, but for\nsimplicity we assign it its own set of parameters.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "value_net = nn.Sequential(\n    nn.LazyLinear(num_cells, device=device),\n    nn.Tanh(),\n    nn.LazyLinear(num_cells, device=device),\n    nn.Tanh(),\n    nn.LazyLinear(num_cells, device=device),\n    nn.Tanh(),\n    nn.LazyLinear(1, device=device),\n)\n\nvalue_module = ValueOperator(\n    module=value_net,\n    in_keys=[\"observation\"],\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "let\\'s try our policy and value modules. As we said earlier, the usage\nof `~tensordict.nn.TensorDictModule`{.interpreted-text role=\"class\"}\nmakes it possible to directly read the output of the environment to run\nthese modules, as they know what information to read and where to write\nit:\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(\"Running policy:\", policy_module(env.reset()))\nprint(\"Running value:\", value_module(env.reset()))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Data collector\n==============\n\nTorchRL provides a set of [DataCollector\nclasses](https://pytorch.org/rl/reference/collectors.html). Briefly,\nthese classes execute three operations: reset an environment, compute an\naction given the latest observation, execute a step in the environment,\nand repeat the last two steps until the environment signals a stop (or\nreaches a done state).\n\nThey allow you to control how many frames to collect at each iteration\n(through the `frames_per_batch` parameter), when to reset the\nenvironment (through the `max_frames_per_traj` argument), on which\n`device` the policy should be executed, etc. They are also designed to\nwork efficiently with batched and multiprocessed environments.\n\nThe simplest data collector is the\n`~torchrl.collectors.collectors.SyncDataCollector`{.interpreted-text\nrole=\"class\"}: it is an iterator that you can use to get batches of data\nof a given length, and that will stop once a total number of frames\n(`total_frames`) have been collected. Other data collectors\n(`~torchrl.collectors.collectors.MultiSyncDataCollector`{.interpreted-text\nrole=\"class\"} and\n`~torchrl.collectors.collectors.MultiaSyncDataCollector`{.interpreted-text\nrole=\"class\"}) will execute the same operations in synchronous and\nasynchronous manner over a set of multiprocessed workers.\n\nAs for the policy and environment before, the data collector will return\n`~tensordict.TensorDict`{.interpreted-text role=\"class\"} instances with\na total number of elements that will match `frames_per_batch`. Using\n`~tensordict.TensorDict`{.interpreted-text role=\"class\"} to pass data to\nthe training loop allows you to write data loading pipelines that are\n100% oblivious to the actual specificities of the rollout content.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "collector = SyncDataCollector(\n    env,\n    policy_module,\n    frames_per_batch=frames_per_batch,\n    total_frames=total_frames,\n    split_trajs=False,\n    device=device,\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Replay buffer\n=============\n\nReplay buffers are a common building piece of off-policy RL algorithms.\nIn on-policy contexts, a replay buffer is refilled every time a batch of\ndata is collected, and its data is repeatedly consumed for a certain\nnumber of epochs.\n\nTorchRL\\'s replay buffers are built using a common container\n`~torchrl.data.ReplayBuffer`{.interpreted-text role=\"class\"} which takes\nas argument the components of the buffer: a storage, a writer, a sampler\nand possibly some transforms. Only the storage (which indicates the\nreplay buffer capacity) is mandatory. We also specify a sampler without\nrepetition to avoid sampling multiple times the same item in one epoch.\nUsing a replay buffer for PPO is not mandatory and we could simply\nsample the sub-batches from the collected batch, but using these classes\nmake it easy for us to build the inner training loop in a reproducible\nway.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "replay_buffer = ReplayBuffer(\n    storage=LazyTensorStorage(max_size=frames_per_batch),\n    sampler=SamplerWithoutReplacement(),\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Loss function\n=============\n\nThe PPO loss can be directly imported from TorchRL for convenience using\nthe `~torchrl.objectives.ClipPPOLoss`{.interpreted-text role=\"class\"}\nclass. This is the easiest way of utilizing PPO: it hides away the\nmathematical operations of PPO and the control flow that goes with it.\n\nPPO requires some \\\"advantage estimation\\\" to be computed. In short, an\nadvantage is a value that reflects an expectancy over the return value\nwhile dealing with the bias / variance tradeoff. To compute the\nadvantage, one just needs to (1) build the advantage module, which\nutilizes our value operator, and (2) pass each batch of data through it\nbefore each epoch. The GAE module will update the input `tensordict`\nwith new `\"advantage\"` and `\"value_target\"` entries. The\n`\"value_target\"` is a gradient-free tensor that represents the empirical\nvalue that the value network should represent with the input\nobservation. Both of these will be used by\n`~torchrl.objectives.ClipPPOLoss`{.interpreted-text role=\"class\"} to\nreturn the policy and value losses.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "advantage_module = GAE(\n    gamma=gamma, lmbda=lmbda, value_network=value_module, average_gae=True, device=device,\n)\n\nloss_module = ClipPPOLoss(\n    actor_network=policy_module,\n    critic_network=value_module,\n    clip_epsilon=clip_epsilon,\n    entropy_bonus=bool(entropy_eps),\n    entropy_coef=entropy_eps,\n    # these keys match by default but we set this for completeness\n    critic_coef=1.0,\n    loss_critic_type=\"smooth_l1\",\n)\n\noptim = torch.optim.Adam(loss_module.parameters(), lr)\nscheduler = torch.optim.lr_scheduler.CosineAnnealingLR(\n    optim, total_frames // frames_per_batch, 0.0\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Training loop\n=============\n\nWe now have all the pieces needed to code our training loop. The steps\ninclude:\n\n-   Collect data\n    -   Compute advantage\n        -   Loop over the collected to compute loss values\n        -   Back propagate\n        -   Optimize\n        -   Repeat\n    -   Repeat\n-   Repeat\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "logs = defaultdict(list)\npbar = tqdm(total=total_frames)\neval_str = \"\"\n\n# We iterate over the collector until it reaches the total number of frames it was\n# designed to collect:\nfor i, tensordict_data in enumerate(collector):\n    # we now have a batch of data to work with. Let's learn something from it.\n    for _ in range(num_epochs):\n        # We'll need an \"advantage\" signal to make PPO work.\n        # We re-compute it at each epoch as its value depends on the value\n        # network which is updated in the inner loop.\n        advantage_module(tensordict_data)\n        data_view = tensordict_data.reshape(-1)\n        replay_buffer.extend(data_view.cpu())\n        for _ in range(frames_per_batch // sub_batch_size):\n            subdata = replay_buffer.sample(sub_batch_size)\n            loss_vals = loss_module(subdata.to(device))\n            loss_value = (\n                loss_vals[\"loss_objective\"]\n                + loss_vals[\"loss_critic\"]\n                + loss_vals[\"loss_entropy\"]\n            )\n\n            # Optimization: backward, grad clipping and optimization step\n            loss_value.backward()\n            # this is not strictly mandatory but it's good practice to keep\n            # your gradient norm bounded\n            torch.nn.utils.clip_grad_norm_(loss_module.parameters(), max_grad_norm)\n            optim.step()\n            optim.zero_grad()\n\n    logs[\"reward\"].append(tensordict_data[\"next\", \"reward\"].mean().item())\n    pbar.update(tensordict_data.numel())\n    cum_reward_str = (\n        f\"average reward={logs['reward'][-1]: 4.4f} (init={logs['reward'][0]: 4.4f})\"\n    )\n    logs[\"step_count\"].append(tensordict_data[\"step_count\"].max().item())\n    stepcount_str = f\"step count (max): {logs['step_count'][-1]}\"\n    logs[\"lr\"].append(optim.param_groups[0][\"lr\"])\n    lr_str = f\"lr policy: {logs['lr'][-1]: 4.4f}\"\n    if i % 10 == 0:\n        # We evaluate the policy once every 10 batches of data.\n        # Evaluation is rather simple: execute the policy without exploration\n        # (take the expected value of the action distribution) for a given\n        # number of steps (1000, which is our ``env`` horizon).\n        # The ``rollout`` method of the ``env`` can take a policy as argument:\n        # it will then execute this policy at each step.\n        with set_exploration_type(ExplorationType.DETERMINISTIC), torch.no_grad():\n            # execute a rollout with the trained policy\n            eval_rollout = env.rollout(1000, policy_module)\n            logs[\"eval reward\"].append(eval_rollout[\"next\", \"reward\"].mean().item())\n            logs[\"eval reward (sum)\"].append(\n                eval_rollout[\"next\", \"reward\"].sum().item()\n            )\n            logs[\"eval step_count\"].append(eval_rollout[\"step_count\"].max().item())\n            eval_str = (\n                f\"eval cumulative reward: {logs['eval reward (sum)'][-1]: 4.4f} \"\n                f\"(init: {logs['eval reward (sum)'][0]: 4.4f}), \"\n                f\"eval step-count: {logs['eval step_count'][-1]}\"\n            )\n            del eval_rollout\n    pbar.set_description(\", \".join([eval_str, cum_reward_str, stepcount_str, lr_str]))\n\n    # We're also using a learning rate scheduler. Like the gradient clipping,\n    # this is a nice-to-have but nothing necessary for PPO to work.\n    scheduler.step()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Results\n=======\n\nBefore the 1M step cap is reached, the algorithm should have reached a\nmax step count of 1000 steps, which is the maximum number of steps\nbefore the trajectory is truncated.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "plt.figure(figsize=(10, 10))\nplt.subplot(2, 2, 1)\nplt.plot(logs[\"reward\"])\nplt.title(\"training rewards (average)\")\nplt.subplot(2, 2, 2)\nplt.plot(logs[\"step_count\"])\nplt.title(\"Max step count (training)\")\nplt.subplot(2, 2, 3)\nplt.plot(logs[\"eval reward (sum)\"])\nplt.title(\"Return (test)\")\nplt.subplot(2, 2, 4)\nplt.plot(logs[\"eval step_count\"])\nplt.title(\"Max step count (test)\")\nplt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Conclusion and next steps\n=========================\n\nIn this tutorial, we have learned:\n\n1.  How to create and customize an environment with\n    :py`torchrl`{.interpreted-text role=\"mod\"};\n2.  How to write a model and a loss function;\n3.  How to set up a typical training loop.\n\nIf you want to experiment with this tutorial a bit more, you can apply\nthe following modifications:\n\n-   From an efficiency perspective, we could run several simulations in\n    parallel to speed up data collection. Check\n    `~torchrl.envs.ParallelEnv`{.interpreted-text role=\"class\"} for\n    further information.\n-   From a logging perspective, one could add a\n    `torchrl.record.VideoRecorder`{.interpreted-text role=\"class\"}\n    transform to the environment after asking for rendering to get a\n    visual rendering of the inverted pendulum in action. Check\n    :py`torchrl.record`{.interpreted-text role=\"mod\"} to know more.\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.12"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}