{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# For tips on running notebooks in Google Colab, see\n# https://codelin.vip/beginner/colab\n%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Train a Mario-playing RL Agent\n==============================\n\n**Authors:** [Yuansong Feng](https://github.com/YuansongFeng), [Suraj\nSubramanian](https://github.com/suraj813), [Howard\nWang](https://github.com/hw26), [Steven\nGuo](https://github.com/GuoYuzhang).\n\nThis tutorial walks you through the fundamentals of Deep Reinforcement\nLearning. At the end, you will implement an AI-powered Mario (using\n[Double Deep Q-Networks](https://arxiv.org/pdf/1509.06461.pdf)) that can\nplay the game by itself.\n\nAlthough no prior knowledge of RL is necessary for this tutorial, you\ncan familiarize yourself with these RL\n[concepts](https://spinningup.openai.com/en/latest/spinningup/rl_intro.html),\nand have this handy\n[cheatsheet](https://colab.research.google.com/drive/1eN33dPVtdPViiS1njTW_-r-IYCDTFU7N)\nas your companion. The full code is available\n[here](https://github.com/yuansongFeng/MadMario/).\n\n![](https://pytorch.org/tutorials/_static/img/mario.gif)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "``` {.bash}\n%%bash\npip install gym-super-mario-bros==7.4.0\npip install tensordict==0.3.0\npip install torchrl==0.3.0\n```\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import torch\nfrom torch import nn\nfrom torchvision import transforms as T\nfrom PIL import Image\nimport numpy as np\nfrom pathlib import Path\nfrom collections import deque\nimport random, datetime, os\n\n# Gym is an OpenAI toolkit for RL\nimport gym\nfrom gym.spaces import Box\nfrom gym.wrappers import FrameStack\n\n# NES Emulator for OpenAI Gym\nfrom nes_py.wrappers import JoypadSpace\n\n# Super Mario environment for OpenAI Gym\nimport gym_super_mario_bros\n\nfrom tensordict import TensorDict\nfrom torchrl.data import TensorDictReplayBuffer, LazyMemmapStorage" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "RL Definitions\n==============\n\n**Environment** The world that an agent interacts with and learns from.\n\n**Action** $a$ : How the Agent responds to the Environment. The set of\nall possible Actions is called *action-space*.\n\n**State** $s$ : The current characteristic of the Environment. The set\nof all possible States the Environment can be in is called\n*state-space*.\n\n**Reward** $r$ : Reward is the key feedback from Environment to Agent.\nIt is what drives the Agent to learn and to change its future action. An\naggregation of rewards over multiple time steps is called **Return**.\n\n**Optimal Action-Value function** $Q^*(s,a)$ : Gives the expected return\nif you start in state $s$, take an arbitrary action $a$, and then for\neach future time step take the action that maximizes returns. $Q$ can be\nsaid to stand for the \"quality\" of the action in a state. 
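\n\nFormally, $Q^*$ satisfies a recursive (Bellman) relationship: the value\nof taking action $a$ in state $s$ is the expected immediate reward plus\nthe discounted value of acting optimally from the next state $s'$\nonward ($\\gamma$ is the *discount factor*, which appears later as\n`self.gamma` in the `Learn` section):\n\n$$Q^*(s,a) = \\mathbb{E}\\left[r + \\gamma \\max_{a'} Q^*(s', a')\\right]$$\n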
We try to\napproximate this function.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Environment\n===========\n\nInitialize Environment\n----------------------\n\nIn Mario, the environment consists of tubes, mushrooms and other\ncomponents.\n\nWhen Mario makes an action, the environment responds with the changed\n(next) state, reward and other info.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Initialize Super Mario environment (in v0.26 change render mode to 'human' to see results on the screen)\nif gym.__version__ < '0.26':\n env = gym_super_mario_bros.make(\"SuperMarioBros-1-1-v0\", new_step_api=True)\nelse:\n env = gym_super_mario_bros.make(\"SuperMarioBros-1-1-v0\", render_mode='rgb', apply_api_compatibility=True)\n\n# Limit the action-space to\n# 0. walk right\n# 1. jump right\nenv = JoypadSpace(env, [[\"right\"], [\"right\", \"A\"]])\n\nenv.reset()\nnext_state, reward, done, trunc, info = env.step(action=0)\nprint(f\"{next_state.shape},\\n {reward},\\n {done},\\n {info}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Preprocess Environment\n======================\n\nEnvironment data is returned to the agent in `next_state`. As you saw\nabove, each state is represented by a `[3, 240, 256]` size array. Often\nthat is more information than our agent needs; for instance, Mario's\nactions do not depend on the color of the pipes or the sky!\n\nWe use **Wrappers** to preprocess environment data before sending it to\nthe agent.\n\n`GrayScaleObservation` is a common wrapper to transform an RGB image to\ngrayscale; doing so reduces the size of the state representation without\nlosing useful information. Now the size of each state: `[1, 240, 256]`\n\n`ResizeObservation` downsamples each observation into a square image.\nNew size: `[1, 84, 84]`\n\n`SkipFrame` is a custom wrapper that inherits from `gym.Wrapper` and\nimplements the `step()` function. Because consecutive frames don't vary\nmuch, we can skip n-intermediate frames without losing much information.\nThe n-th frame aggregates rewards accumulated over each skipped frame.\n\n`FrameStack` is a wrapper that allows us to squash consecutive frames of\nthe environment into a single observation point to feed to our learning\nmodel. 
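\n\nConceptually, frame stacking just keeps a rolling window of the most\nrecent observations and hands them to the model as one array. The real\nwork is done by `gym.wrappers.FrameStack` below; this toy class\n(hypothetical, for illustration only) sketches the idea:\n\n``` {.python}\nfrom collections import deque\n\nimport numpy as np\n\n\nclass RollingStack:\n    \"\"\"Toy illustration of frame stacking; not the gym wrapper itself.\"\"\"\n\n    def __init__(self, num_stack=4):\n        self.frames = deque(maxlen=num_stack)\n\n    def observe(self, frame):\n        # frame: a single [84, 84] grayscale image\n        if not self.frames:\n            # First observation: fill the whole window with copies\n            self.frames.extend([frame] * self.frames.maxlen)\n        else:\n            self.frames.append(frame)\n        return np.stack(self.frames)  # shape: [num_stack, 84, 84]\n```\n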
This way, we can identify if Mario was landing or jumping based\non the direction of his movement in the previous several frames.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "class SkipFrame(gym.Wrapper):\n def __init__(self, env, skip):\n \"\"\"Return only every `skip`-th frame\"\"\"\n super().__init__(env)\n self._skip = skip\n\n def step(self, action):\n \"\"\"Repeat action, and sum reward\"\"\"\n total_reward = 0.0\n for i in range(self._skip):\n # Accumulate reward and repeat the same action\n obs, reward, done, trunk, info = self.env.step(action)\n total_reward += reward\n if done:\n break\n return obs, total_reward, done, trunk, info\n\n\nclass GrayScaleObservation(gym.ObservationWrapper):\n def __init__(self, env):\n super().__init__(env)\n obs_shape = self.observation_space.shape[:2]\n self.observation_space = Box(low=0, high=255, shape=obs_shape, dtype=np.uint8)\n\n def permute_orientation(self, observation):\n # permute [H, W, C] array to [C, H, W] tensor\n observation = np.transpose(observation, (2, 0, 1))\n observation = torch.tensor(observation.copy(), dtype=torch.float)\n return observation\n\n def observation(self, observation):\n observation = self.permute_orientation(observation)\n transform = T.Grayscale()\n observation = transform(observation)\n return observation\n\n\nclass ResizeObservation(gym.ObservationWrapper):\n def __init__(self, env, shape):\n super().__init__(env)\n if isinstance(shape, int):\n self.shape = (shape, shape)\n else:\n self.shape = tuple(shape)\n\n obs_shape = self.shape + self.observation_space.shape[2:]\n self.observation_space = Box(low=0, high=255, shape=obs_shape, dtype=np.uint8)\n\n def observation(self, observation):\n transforms = T.Compose(\n [T.Resize(self.shape, antialias=True), T.Normalize(0, 255)]\n )\n observation = transforms(observation).squeeze(0)\n return observation\n\n\n# Apply Wrappers to environment\nenv = SkipFrame(env, skip=4)\nenv = GrayScaleObservation(env)\nenv = ResizeObservation(env, shape=84)\nif gym.__version__ < '0.26':\n env = FrameStack(env, num_stack=4, new_step_api=True)\nelse:\n env = FrameStack(env, num_stack=4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After applying the above wrappers to the environment, the final wrapped\nstate consists of 4 gray-scaled consecutive frames stacked together, as\nshown above in the image on the left. Each time Mario makes an action,\nthe environment responds with a state of this structure. The structure\nis represented by a 3-D array of size `[4, 84, 84]`.\n\n![](https://pytorch.org/tutorials/_static/img/mario_env.png)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Agent\n=====\n\nWe create a class `Mario` to represent our agent in the game. Mario\nshould be able to:\n\n- **Act** according to the optimal action policy based on the current\n state (of the environment).\n- **Remember** experiences. Experience = (current state, current\n action, reward, next state). 
Mario *caches* and later *recalls* his\n    experiences to update his action policy.\n-   **Learn** a better action policy over time.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "class Mario:\n    def __init__(self):\n        pass\n\n    def act(self, state):\n        \"\"\"Given a state, choose an epsilon-greedy action\"\"\"\n        pass\n\n    def cache(self, experience):\n        \"\"\"Add the experience to memory\"\"\"\n        pass\n\n    def recall(self):\n        \"\"\"Sample experiences from memory\"\"\"\n        pass\n\n    def learn(self):\n        \"\"\"Update online action value (Q) function with a batch of experiences\"\"\"\n        pass" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the following sections, we will populate Mario's parameters and\ndefine his functions.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Act\n===\n\nFor any given state, an agent can choose to take the optimal action\n(**exploit**) or a random action (**explore**).\n\nMario randomly explores with a chance of `self.exploration_rate`; when\nhe chooses to exploit, he relies on `MarioNet` (implemented in the\n`Learn` section) to provide the optimal action.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "class Mario:\n    def __init__(self, state_dim, action_dim, save_dir):\n        self.state_dim = state_dim\n        self.action_dim = action_dim\n        self.save_dir = save_dir\n\n        self.device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n\n        # Mario's DNN to predict the optimal action - we implement this in the Learn section\n        self.net = MarioNet(self.state_dim, self.action_dim).float()\n        self.net = self.net.to(device=self.device)\n\n        self.exploration_rate = 1\n        self.exploration_rate_decay = 0.99999975\n        self.exploration_rate_min = 0.1\n        self.curr_step = 0\n\n        self.save_every = 5e5  # no. of experiences between saving Mario Net\n\n    def act(self, state):\n        \"\"\"\n        Given a state, choose an epsilon-greedy action and update value of step.\n\n        Inputs:\n            state (``LazyFrame``): A single observation of the current state, dimension is (state_dim)\n        Outputs:\n            ``action_idx`` (``int``): An integer representing which action Mario will perform\n        \"\"\"\n        # EXPLORE\n        if np.random.rand() < self.exploration_rate:\n            action_idx = np.random.randint(self.action_dim)\n\n        # EXPLOIT\n        else:\n            state = state[0].__array__() if isinstance(state, tuple) else state.__array__()\n            state = torch.tensor(state, device=self.device).unsqueeze(0)\n            action_values = self.net(state, model=\"online\")\n            action_idx = torch.argmax(action_values, axis=1).item()\n\n        # decrease exploration_rate\n        self.exploration_rate *= self.exploration_rate_decay\n        self.exploration_rate = max(self.exploration_rate_min, self.exploration_rate)\n\n        # increment step\n        self.curr_step += 1\n        return action_idx" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Cache and Recall\n================\n\nThese two functions serve as Mario's \"memory\" process.\n\n`cache()`: Each time Mario performs an action, he stores the\n`experience` to his memory. 
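\n\nAs the next paragraph spells out, an experience bundles together\neverything Mario needs in order to learn from a single step. Stored as a\n`TensorDict`, one cached transition looks roughly like this (an\nillustrative sketch only; the real construction happens inside `cache()`\nbelow, and the shapes assume the `[4, 84, 84]` observations produced by\nour wrappers):\n\n``` {.python}\nimport torch\nfrom tensordict import TensorDict\n\n# Hypothetical example of a single transition, for illustration\ntransition = TensorDict({\n    \"state\": torch.zeros(4, 84, 84),       # stacked frames before the action\n    \"next_state\": torch.zeros(4, 84, 84),  # stacked frames after the action\n    \"action\": torch.tensor([1]),           # index into the action-space\n    \"reward\": torch.tensor([0.0]),\n    \"done\": torch.tensor([False]),\n}, batch_size=[])\n```\n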
His experience includes the current *state*,\n*action* performed, *reward* from the action, the *next state*, and\nwhether the game is *done*.\n\n`recall()`: Mario randomly samples a batch of experiences from his\nmemory and uses it to learn the game.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "class Mario(Mario):  # subclassing for continuity\n    def __init__(self, state_dim, action_dim, save_dir):\n        super().__init__(state_dim, action_dim, save_dir)\n        self.memory = TensorDictReplayBuffer(storage=LazyMemmapStorage(100000, device=torch.device(\"cpu\")))\n        self.batch_size = 32\n\n    def cache(self, state, next_state, action, reward, done):\n        \"\"\"\n        Store the experience to self.memory (replay buffer)\n\n        Inputs:\n            state (``LazyFrame``),\n            next_state (``LazyFrame``),\n            action (``int``),\n            reward (``float``),\n            done (``bool``)\n        \"\"\"\n        def first_if_tuple(x):\n            return x[0] if isinstance(x, tuple) else x\n        state = first_if_tuple(state).__array__()\n        next_state = first_if_tuple(next_state).__array__()\n\n        state = torch.tensor(state)\n        next_state = torch.tensor(next_state)\n        action = torch.tensor([action])\n        reward = torch.tensor([reward])\n        done = torch.tensor([done])\n\n        # self.memory.append((state, next_state, action, reward, done,))\n        self.memory.add(TensorDict({\"state\": state, \"next_state\": next_state, \"action\": action, \"reward\": reward, \"done\": done}, batch_size=[]))\n\n    def recall(self):\n        \"\"\"\n        Retrieve a batch of experiences from memory\n        \"\"\"\n        batch = self.memory.sample(self.batch_size).to(self.device)\n        state, next_state, action, reward, done = (batch.get(key) for key in (\"state\", \"next_state\", \"action\", \"reward\", \"done\"))\n        return state, next_state, action.squeeze(), reward.squeeze(), done.squeeze()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Learn\n=====\n\nMario uses the [DDQN algorithm](https://arxiv.org/pdf/1509.06461) under\nthe hood. DDQN uses two ConvNets - $Q_{online}$ and $Q_{target}$ - that\nindependently approximate the optimal action-value function.\n\nIn our implementation, $Q_{online}$ and $Q_{target}$ are two copies of\nthe same CNN architecture inside `MarioNet`, with $Q_{target}$\ninitialized from $Q_{online}$'s weights. $\\theta_{target}$ (the\nparameters of $Q_{target}$) is frozen so that backpropagation never\nupdates it. 
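\n\nFor reference, what makes this algorithm *Double* DQN is how the TD\ntarget is formed: the online network *selects* the next action while the\ntarget network *evaluates* it, whereas vanilla DQN lets the target\nnetwork do both:\n\n$$y^{DQN} = r + \\gamma \\max_{a'} Q_{target}(s', a')$$\n\n$$y^{DoubleDQN} = r + \\gamma Q_{target}(s', argmax_{a'} Q_{online}(s', a'))$$\n\nThis decoupling (implemented in the `TD Estimate & TD Target` section\nbelow) reduces DQN's tendency to over-estimate action values;\n$\\theta_{target}$ itself is never trained by gradient descent.\n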
Instead, it is periodically synced with\n$\\theta_{online}$ (more on this later).\n\nNeural Network\n--------------\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "class MarioNet(nn.Module):\n    \"\"\"mini CNN structure\n    input -> (conv2d + relu) x 3 -> flatten -> (dense + relu) x 2 -> output\n    \"\"\"\n\n    def __init__(self, input_dim, output_dim):\n        super().__init__()\n        c, h, w = input_dim\n\n        if h != 84:\n            raise ValueError(f\"Expecting input height: 84, got: {h}\")\n        if w != 84:\n            raise ValueError(f\"Expecting input width: 84, got: {w}\")\n\n        self.online = self.__build_cnn(c, output_dim)\n\n        self.target = self.__build_cnn(c, output_dim)\n        self.target.load_state_dict(self.online.state_dict())\n\n        # Q_target parameters are frozen.\n        for p in self.target.parameters():\n            p.requires_grad = False\n\n    def forward(self, input, model):\n        if model == \"online\":\n            return self.online(input)\n        elif model == \"target\":\n            return self.target(input)\n\n    def __build_cnn(self, c, output_dim):\n        return nn.Sequential(\n            nn.Conv2d(in_channels=c, out_channels=32, kernel_size=8, stride=4),\n            nn.ReLU(),\n            nn.Conv2d(in_channels=32, out_channels=64, kernel_size=4, stride=2),\n            nn.ReLU(),\n            nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, stride=1),\n            nn.ReLU(),\n            nn.Flatten(),\n            nn.Linear(3136, 512),\n            nn.ReLU(),\n            nn.Linear(512, output_dim),\n        )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "TD Estimate & TD Target\n=======================\n\nTwo values are involved in learning:\n\n**TD Estimate** - the predicted optimal $Q^*$ for a given state $s$ and\naction $a$\n\n$${TD}_e = Q_{online}^*(s,a)$$\n\n**TD Target** - aggregation of the current reward and the estimated\n$Q^*$ in the next state $s'$\n\n$$a' = argmax_{a} Q_{online}(s', a)$$\n\n$${TD}_t = r + \\gamma Q_{target}^*(s',a')$$\n\nBecause we don't know what the next action $a'$ will be, we use the\naction $a'$ that maximizes $Q_{online}$ in the next state $s'$.\n\nNotice we use the\n[\\@torch.no\\_grad()](https://pytorch.org/docs/stable/generated/torch.no_grad.html#no-grad)\ndecorator on `td_target()` to disable gradient calculations here\n(because we don't need to backpropagate on $\\theta_{target}$).\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "class Mario(Mario):\n    def __init__(self, state_dim, action_dim, save_dir):\n        super().__init__(state_dim, action_dim, save_dir)\n        self.gamma = 0.9\n\n    def td_estimate(self, state, action):\n        current_Q = self.net(state, model=\"online\")[\n            np.arange(0, self.batch_size), action\n        ]  # Q_online(s,a)\n        return current_Q\n\n    @torch.no_grad()\n    def td_target(self, reward, next_state, done):\n        next_state_Q = self.net(next_state, model=\"online\")\n        best_action = torch.argmax(next_state_Q, axis=1)\n        next_Q = self.net(next_state, model=\"target\")[\n            np.arange(0, self.batch_size), best_action\n        ]\n        return (reward + (1 - done.float()) * self.gamma * next_Q).float()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Updating the model\n==================\n\nAs Mario samples inputs from his replay buffer, we compute $TD_e$ and\n$TD_t$, take the loss between them (the smooth L1 loss,\n`torch.nn.SmoothL1Loss` in the code below), and backpropagate it through\n$Q_{online}$ to update its parameters $\\theta_{online}$ ($\\alpha$ is the\nlearning rate `lr` passed to the `optimizer`):\n\n$$\\theta_{online} \\leftarrow \\theta_{online} - \\alpha \\nabla_{\\theta_{online}} \\text{loss}(TD_e, TD_t)$$\n\n$\\theta_{target}$ does not update through backpropagation. 
Instead, we\nperiodically copy $\\theta_{online}$ to $\\theta_{target}$\n\n$$\\theta_{target} \\leftarrow \\theta_{online}$$\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "class Mario(Mario):\n def __init__(self, state_dim, action_dim, save_dir):\n super().__init__(state_dim, action_dim, save_dir)\n self.optimizer = torch.optim.Adam(self.net.parameters(), lr=0.00025)\n self.loss_fn = torch.nn.SmoothL1Loss()\n\n def update_Q_online(self, td_estimate, td_target):\n loss = self.loss_fn(td_estimate, td_target)\n self.optimizer.zero_grad()\n loss.backward()\n self.optimizer.step()\n return loss.item()\n\n def sync_Q_target(self):\n self.net.target.load_state_dict(self.net.online.state_dict())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Save checkpoint\n===============\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "class Mario(Mario):\n def save(self):\n save_path = (\n self.save_dir / f\"mario_net_{int(self.curr_step // self.save_every)}.chkpt\"\n )\n torch.save(\n dict(model=self.net.state_dict(), exploration_rate=self.exploration_rate),\n save_path,\n )\n print(f\"MarioNet saved to {save_path} at step {self.curr_step}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Putting it all together\n=======================\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "class Mario(Mario):\n def __init__(self, state_dim, action_dim, save_dir):\n super().__init__(state_dim, action_dim, save_dir)\n self.burnin = 1e4 # min. experiences before training\n self.learn_every = 3 # no. of experiences between updates to Q_online\n self.sync_every = 1e4 # no. 
of experiences between Q_target & Q_online sync\n\n def learn(self):\n if self.curr_step % self.sync_every == 0:\n self.sync_Q_target()\n\n if self.curr_step % self.save_every == 0:\n self.save()\n\n if self.curr_step < self.burnin:\n return None, None\n\n if self.curr_step % self.learn_every != 0:\n return None, None\n\n # Sample from memory\n state, next_state, action, reward, done = self.recall()\n\n # Get TD Estimate\n td_est = self.td_estimate(state, action)\n\n # Get TD Target\n td_tgt = self.td_target(reward, next_state, done)\n\n # Backpropagate loss through Q_online\n loss = self.update_Q_online(td_est, td_tgt)\n\n return (td_est.mean().item(), loss)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Logging\n=======\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\nimport time, datetime\nimport matplotlib.pyplot as plt\n\n\nclass MetricLogger:\n def __init__(self, save_dir):\n self.save_log = save_dir / \"log\"\n with open(self.save_log, \"w\") as f:\n f.write(\n f\"{'Episode':>8}{'Step':>8}{'Epsilon':>10}{'MeanReward':>15}\"\n f\"{'MeanLength':>15}{'MeanLoss':>15}{'MeanQValue':>15}\"\n f\"{'TimeDelta':>15}{'Time':>20}\\n\"\n )\n self.ep_rewards_plot = save_dir / \"reward_plot.jpg\"\n self.ep_lengths_plot = save_dir / \"length_plot.jpg\"\n self.ep_avg_losses_plot = save_dir / \"loss_plot.jpg\"\n self.ep_avg_qs_plot = save_dir / \"q_plot.jpg\"\n\n # History metrics\n self.ep_rewards = []\n self.ep_lengths = []\n self.ep_avg_losses = []\n self.ep_avg_qs = []\n\n # Moving averages, added for every call to record()\n self.moving_avg_ep_rewards = []\n self.moving_avg_ep_lengths = []\n self.moving_avg_ep_avg_losses = []\n self.moving_avg_ep_avg_qs = []\n\n # Current episode metric\n self.init_episode()\n\n # Timing\n self.record_time = time.time()\n\n def log_step(self, reward, loss, q):\n self.curr_ep_reward += reward\n self.curr_ep_length += 1\n if loss:\n self.curr_ep_loss += loss\n self.curr_ep_q += q\n self.curr_ep_loss_length += 1\n\n def log_episode(self):\n \"Mark end of episode\"\n self.ep_rewards.append(self.curr_ep_reward)\n self.ep_lengths.append(self.curr_ep_length)\n if self.curr_ep_loss_length == 0:\n ep_avg_loss = 0\n ep_avg_q = 0\n else:\n ep_avg_loss = np.round(self.curr_ep_loss / self.curr_ep_loss_length, 5)\n ep_avg_q = np.round(self.curr_ep_q / self.curr_ep_loss_length, 5)\n self.ep_avg_losses.append(ep_avg_loss)\n self.ep_avg_qs.append(ep_avg_q)\n\n self.init_episode()\n\n def init_episode(self):\n self.curr_ep_reward = 0.0\n self.curr_ep_length = 0\n self.curr_ep_loss = 0.0\n self.curr_ep_q = 0.0\n self.curr_ep_loss_length = 0\n\n def record(self, episode, epsilon, step):\n mean_ep_reward = np.round(np.mean(self.ep_rewards[-100:]), 3)\n mean_ep_length = np.round(np.mean(self.ep_lengths[-100:]), 3)\n mean_ep_loss = np.round(np.mean(self.ep_avg_losses[-100:]), 3)\n mean_ep_q = np.round(np.mean(self.ep_avg_qs[-100:]), 3)\n self.moving_avg_ep_rewards.append(mean_ep_reward)\n self.moving_avg_ep_lengths.append(mean_ep_length)\n self.moving_avg_ep_avg_losses.append(mean_ep_loss)\n self.moving_avg_ep_avg_qs.append(mean_ep_q)\n\n last_record_time = self.record_time\n self.record_time = time.time()\n time_since_last_record = np.round(self.record_time - last_record_time, 3)\n\n print(\n f\"Episode {episode} - \"\n f\"Step {step} - \"\n f\"Epsilon {epsilon} - \"\n f\"Mean Reward {mean_ep_reward} - \"\n f\"Mean Length {mean_ep_length} - \"\n f\"Mean Loss {mean_ep_loss} - \"\n 
f\"Mean Q Value {mean_ep_q} - \"\n f\"Time Delta {time_since_last_record} - \"\n f\"Time {datetime.datetime.now().strftime('%Y-%m-%dT%H:%M:%S')}\"\n )\n\n with open(self.save_log, \"a\") as f:\n f.write(\n f\"{episode:8d}{step:8d}{epsilon:10.3f}\"\n f\"{mean_ep_reward:15.3f}{mean_ep_length:15.3f}{mean_ep_loss:15.3f}{mean_ep_q:15.3f}\"\n f\"{time_since_last_record:15.3f}\"\n f\"{datetime.datetime.now().strftime('%Y-%m-%dT%H:%M:%S'):>20}\\n\"\n )\n\n for metric in [\"ep_lengths\", \"ep_avg_losses\", \"ep_avg_qs\", \"ep_rewards\"]:\n plt.clf()\n plt.plot(getattr(self, f\"moving_avg_{metric}\"), label=f\"moving_avg_{metric}\")\n plt.legend()\n plt.savefig(getattr(self, f\"{metric}_plot\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's play!\n===========\n\nIn this example we run the training loop for 40 episodes, but for Mario\nto truly learn the ways of his world, we suggest running the loop for at\nleast 40,000 episodes!\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "use_cuda = torch.cuda.is_available()\nprint(f\"Using CUDA: {use_cuda}\")\nprint()\n\nsave_dir = Path(\"checkpoints\") / datetime.datetime.now().strftime(\"%Y-%m-%dT%H-%M-%S\")\nsave_dir.mkdir(parents=True)\n\nmario = Mario(state_dim=(4, 84, 84), action_dim=env.action_space.n, save_dir=save_dir)\n\nlogger = MetricLogger(save_dir)\n\nepisodes = 40\nfor e in range(episodes):\n\n state = env.reset()\n\n # Play the game!\n while True:\n\n # Run agent on the state\n action = mario.act(state)\n\n # Agent performs action\n next_state, reward, done, trunc, info = env.step(action)\n\n # Remember\n mario.cache(state, next_state, action, reward, done)\n\n # Learn\n q, loss = mario.learn()\n\n # Logging\n logger.log_step(reward, loss, q)\n\n # Update state\n state = next_state\n\n # Check if end of game\n if done or info[\"flag_get\"]:\n break\n\n logger.log_episode()\n\n if (e % 20 == 0) or (e == episodes - 1):\n logger.record(episode=e, epsilon=mario.exploration_rate, step=mario.curr_step)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Conclusion\n==========\n\nIn this tutorial, we saw how we can use PyTorch to train a game-playing\nAI. You can use the same methods to train an AI to play any of the games\nat the [OpenAI gym](https://gym.openai.com/). Hope you enjoyed this\ntutorial, feel free to reach us at [our\ngithub](https://github.com/yuansongFeng/MadMario/)!\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 0 }