diff --git a/notebooks/bonus-unit1/bonus-unit1.ipynb b/notebooks/bonus-unit1/bonus-unit1.ipynb
index 93db85a..5725765 100644
--- a/notebooks/bonus-unit1/bonus-unit1.ipynb
+++ b/notebooks/bonus-unit1/bonus-unit1.ipynb
@@ -199,9 +199,17 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 1,
"metadata": {},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Python 3.10.11\n"
+ ]
+ }
+ ],
"source": [
"# Colab's Current Python Version (Incompatible with ML-Agents)\n",
"!python --version"
@@ -600,7 +608,7 @@
},
"outputs": [],
"source": [
- "!mlagents-push-to-hf --run-id=\"HuggyTraining\" --local-dir=\"./results/Huggy2\" --repo-id=\"ThomasSimonini/ppo-Huggy\" --commit-message=\"Huggy\""
+ "!mlagents-push-to-hf --run-id=\"HuggyTraining\" --local-dir=\"./results/Huggy\" --repo-id=\"turbo-maikol/rl-course-bu1\" --commit-message=\"Huggy\""
]
},
{
@@ -691,11 +699,21 @@
},
"gpuClass": "standard",
"kernelspec": {
- "display_name": "Python 3",
+ "display_name": "rl-env-bu1",
+ "language": "python",
"name": "python3"
},
"language_info": {
- "name": "python"
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.11"
}
},
"nbformat": 4,
diff --git a/notebooks/bonus-unit1/bonus_unit1.ipynb b/notebooks/bonus-unit1/bonus_unit1.ipynb
deleted file mode 100644
index a85452b..0000000
--- a/notebooks/bonus-unit1/bonus_unit1.ipynb
+++ /dev/null
@@ -1,695 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "view-in-github"
- },
- "source": [
- "
"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "2D3NL_e4crQv"
- },
- "source": [
- "# Bonus Unit 1: Let's train Huggy the Dog 🐶 to fetch a stick"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "FMYrDriDujzX"
- },
- "source": [
- "
\n",
- "\n",
- "In this notebook, we'll reinforce what we learned in the first Unit by **teaching Huggy the Dog to fetch the stick and then play with it directly in your browser**\n",
- "\n",
- "⬇️ Here is an example of what **you will achieve at the end of the unit.** ⬇️ (launch ▶ to see)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "PnVhs1yYNyUF"
- },
- "outputs": [],
- "source": [
- "%%html\n",
- ""
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "x7oR6R-ZIbeS"
- },
- "source": [
- "### The environment 🎮\n",
- "\n",
- "- Huggy the Dog, an environment created by [Thomas Simonini](https://twitter.com/ThomasSimonini) based on [Puppo The Corgi](https://blog.unity.com/technology/puppo-the-corgi-cuteness-overload-with-the-unity-ml-agents-toolkit)\n",
- "\n",
- "### The library used 📚\n",
- "\n",
- "- [MLAgents](https://github.com/Unity-Technologies/ml-agents)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "60yACvZwO0Cy"
- },
- "source": [
- "We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the Github Repo](https://github.com/huggingface/deep-rl-class/issues)."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "Oks-ETYdO2Dc"
- },
- "source": [
- "## Objectives of this notebook 🏆\n",
- "\n",
- "At the end of the notebook, you will:\n",
- "\n",
- "- Understand **the state space, action space and reward function used to train Huggy**.\n",
- "- **Train your own Huggy** to fetch the stick.\n",
- "- Be able to play **with your trained Huggy directly in your browser**.\n",
- "\n",
- "\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "mUlVrqnBv2o1"
- },
- "source": [
- "## This notebook is from Deep Reinforcement Learning Course\n",
- "
"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "pAMjaQpHwB_s"
- },
- "source": [
- "In this free course, you will:\n",
- "\n",
- "- 📖 Study Deep Reinforcement Learning in **theory and practice**.\n",
- "- 🧑💻 Learn to **use famous Deep RL libraries** such as Stable Baselines3, RL Baselines3 Zoo, CleanRL and Sample Factory 2.0.\n",
- "- 🤖 Train **agents in unique environments**\n",
- "\n",
- "And more check 📚 the syllabus 👉 https://simoninithomas.github.io/deep-rl-course\n",
- "\n",
- "Don’t forget to **sign up to the course** (we are collecting your email to be able to **send you the links when each Unit is published and give you information about the challenges and updates).**\n",
- "\n",
- "\n",
- "The best way to keep in touch is to join our discord server to exchange with the community and with us 👉🏻 https://discord.gg/ydHrjt3WP5"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "6r7Hl0uywFSO"
- },
- "source": [
- "## Prerequisites 🏗️\n",
- "\n",
- "Before diving into the notebook, you need to:\n",
- "\n",
- "🔲 📚 **Develop an understanding of the foundations of Reinforcement learning** (MC, TD, Rewards hypothesis...) by doing Unit 1\n",
- "\n",
- "🔲 📚 **Read the introduction to Huggy** by doing Bonus Unit 1"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "DssdIjk_8vZE"
- },
- "source": [
- "## Set the GPU 💪\n",
- "- To **accelerate the agent's training, we'll use a GPU**. To do that, go to `Runtime > Change Runtime type`\n",
- "\n",
- "
"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "sTfCXHy68xBv"
- },
- "source": [
- "- `Hardware Accelerator > GPU`\n",
- "\n",
- "
"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Clone the repository 🔽\n",
- "\n",
- "- We need to clone the repository, that contains **ML-Agents.**"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "%%capture\n",
- "# Clone the repository (can take 3min)\n",
- "!git clone --depth 1 https://github.com/Unity-Technologies/ml-agents"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Setup the Virtual Environment 🔽\n",
- "- In order for the **ML-Agents** to run successfully in Colab, Colab's Python version must meet the library's Python requirements.\n",
- "\n",
- "- We can check for the supported Python version under the `python_requires` parameter in the `setup.py` files. These files are required to set up the **ML-Agents** library for use and can be found in the following locations:\n",
- " - `/content/ml-agents/ml-agents/setup.py`\n",
- " - `/content/ml-agents/ml-agents-envs/setup.py`\n",
- "\n",
- "- Colab's Current Python version(can be checked using `!python --version`) doesn't match the library's `python_requires` parameter, as a result installation may silently fail and lead to errors like these, when executing the same commands later:\n",
- " - `/bin/bash: line 1: mlagents-learn: command not found`\n",
- " - `/bin/bash: line 1: mlagents-push-to-hf: command not found`\n",
- "\n",
- "- To resolve this, we'll create a virtual environment with a Python version compatible with the **ML-Agents** library.\n",
- "\n",
- "`Note:` *For future compatibility, always check the `python_requires` parameter in the installation files and set your virtual environment to the maximum supported Python version in the given below script if the Colab's Python version is not compatible*"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Colab's Current Python Version (Incompatible with ML-Agents)\n",
- "!python --version"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Install virtualenv and create a virtual environment\n",
- "!pip install virtualenv\n",
- "!virtualenv myenv\n",
- "\n",
- "# Download and install Miniconda\n",
- "!wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh\n",
- "!chmod +x Miniconda3-latest-Linux-x86_64.sh\n",
- "!./Miniconda3-latest-Linux-x86_64.sh -b -f -p /usr/local\n",
- "\n",
- "# Activate Miniconda and install Python ver 3.10.12\n",
- "!source /usr/local/bin/activate\n",
- "!conda install -q -y --prefix /usr/local python=3.10.12 ujson # Specify the version here\n",
- "\n",
- "# Set environment variables for Python and conda paths\n",
- "!export PYTHONPATH=/usr/local/lib/python3.10/site-packages/\n",
- "!export CONDA_PREFIX=/usr/local/envs/myenv"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Python Version in New Virtual Environment (Compatible with ML-Agents)\n",
- "!python --version"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Installing the dependencies 🔽"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "%%capture\n",
- "# Go inside the repository and install the package (can take 3min)\n",
- "%cd ml-agents\n",
- "!pip3 install -e ./ml-agents-envs\n",
- "!pip3 install -e ./ml-agents"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "HRY5ufKUKfhI"
- },
- "source": [
- "## Download and move the environment zip file in `./trained-envs-executables/linux/`\n",
- "\n",
- "- Our environment executable is in a zip file.\n",
- "- We need to download it and place it to `./trained-envs-executables/linux/`"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "C9Ls6_6eOKiA"
- },
- "outputs": [],
- "source": [
- "!mkdir ./trained-envs-executables\n",
- "!mkdir ./trained-envs-executables/linux"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "IHh_LXsRrrbM"
- },
- "source": [
- "We downloaded the file Huggy.zip from https://github.com/huggingface/Huggy using `wget`"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "8xNAD1tRpy0_"
- },
- "outputs": [],
- "source": [
- "!wget \"https://github.com/huggingface/Huggy/raw/main/Huggy.zip\" -O ./trained-envs-executables/linux/Huggy.zip"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "8FPx0an9IAwO"
- },
- "outputs": [],
- "source": [
- "%%capture\n",
- "!unzip -d ./trained-envs-executables/linux/ ./trained-envs-executables/linux/Huggy.zip"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "nyumV5XfPKzu"
- },
- "source": [
- "Make sure your file is accessible"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "EdFsLJ11JvQf"
- },
- "outputs": [],
- "source": [
- "!chmod -R 755 ./trained-envs-executables/linux/Huggy"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "dYKVj8yUvj55"
- },
- "source": [
- "## Let's recap how this environment works\n",
- "\n",
- "### The State Space: what Huggy \"perceives.\"\n",
- "\n",
- "Huggy doesn't \"see\" his environment. Instead, we provide him information about the environment:\n",
- "\n",
- "- The target (stick) position\n",
- "- The relative position between himself and the target\n",
- "- The orientation of his legs.\n",
- "\n",
- "Given all this information, Huggy **can decide which action to take next to fulfill his goal**.\n",
- "\n",
- "
\n",
- "\n",
- "\n",
- "### The Action Space: what moves Huggy can do\n",
- "
\n",
- "\n",
- "**Joint motors drive huggy legs**. It means that to get the target, Huggy needs to **learn to rotate the joint motors of each of his legs correctly so he can move**.\n",
- "\n",
- "### The Reward Function\n",
- "\n",
- "The reward function is designed so that **Huggy will fulfill his goal** : fetch the stick.\n",
- "\n",
- "Remember that one of the foundations of Reinforcement Learning is the *reward hypothesis*: a goal can be described as the **maximization of the expected cumulative reward**.\n",
- "\n",
- "Here, our goal is that Huggy **goes towards the stick but without spinning too much**. Hence, our reward function must translate this goal.\n",
- "\n",
- "Our reward function:\n",
- "\n",
- "
\n",
- "\n",
- "- *Orientation bonus*: we **reward him for getting close to the target**.\n",
- "- *Time penalty*: a fixed-time penalty given at every action to **force him to get to the stick as fast as possible**.\n",
- "- *Rotation penalty*: we penalize Huggy if **he spins too much and turns too quickly**.\n",
- "- *Getting to the target reward*: we reward Huggy for **reaching the target**."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "NAuEq32Mwvtz"
- },
- "source": [
- "## Create the Huggy config file\n",
- "\n",
- "- In ML-Agents, you define the **training hyperparameters into config.yaml files.**\n",
- "\n",
- "- For the scope of this notebook, we're not going to modify the hyperparameters, but if you want to try as an experiment, you should also try to modify some other hyperparameters, Unity provides very [good documentation explaining each of them here](https://github.com/Unity-Technologies/ml-agents/blob/main/docs/Training-Configuration-File.md).\n",
- "\n",
- "- But we need to create a config file for Huggy.\n",
- "\n",
- " - To do that click on Folder logo on the left of your screen.\n",
- "\n",
- "
\n",
- "\n",
- " - Go to `/content/ml-agents/config/ppo`\n",
- " - Right mouse click and create a new file called `Huggy.yaml`\n",
- "\n",
- "
\n",
- "\n",
- "- Copy and paste the content below 🔽"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "loQ0N5jhXW71"
- },
- "outputs": [],
- "source": [
- "behaviors:\n",
- " Huggy:\n",
- " trainer_type: ppo\n",
- " hyperparameters:\n",
- " batch_size: 2048\n",
- " buffer_size: 20480\n",
- " learning_rate: 0.0003\n",
- " beta: 0.005\n",
- " epsilon: 0.2\n",
- " lambd: 0.95\n",
- " num_epoch: 3\n",
- " learning_rate_schedule: linear\n",
- " network_settings:\n",
- " normalize: true\n",
- " hidden_units: 512\n",
- " num_layers: 3\n",
- " vis_encode_type: simple\n",
- " reward_signals:\n",
- " extrinsic:\n",
- " gamma: 0.995\n",
- " strength: 1.0\n",
- " checkpoint_interval: 200000\n",
- " keep_checkpoints: 15\n",
- " max_steps: 2e6\n",
- " time_horizon: 1000\n",
- " summary_freq: 50000"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "oakN7UHwXdCX"
- },
- "source": [
- "- Don't forget to save the file!"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "r9wv5NYGw-05"
- },
- "source": [
- "- **In the case you want to modify the hyperparameters**, in Google Colab notebook, you can click here to open the config.yaml: `/content/ml-agents/config/ppo/Huggy.yaml`\n",
- "\n",
- "- For instance **if you want to save more models during the training** (for now, we save every 200,000 training timesteps). You need to modify:\n",
- " - `checkpoint_interval`: The number of training timesteps collected between each checkpoint.\n",
- " - `keep_checkpoints`: The maximum number of model checkpoints to keep.\n",
- "\n",
- "=> Just keep in mind that **decreasing the `checkpoint_interval` means more models to upload to the Hub and so a longer uploading time**\n",
- "We’re now ready to train our agent 🔥."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "f9fI555bO12v"
- },
- "source": [
- "## Train our agent\n",
- "\n",
- "To train our agent, we just need to **launch mlagents-learn and select the executable containing the environment.**\n",
- "\n",
- "
\n",
- "\n",
- "With ML Agents, we run a training script. We define four parameters:\n",
- "\n",
- "1. `mlagents-learn `: the path where the hyperparameter config file is.\n",
- "2. `--env`: where the environment executable is.\n",
- "3. `--run-id`: the name you want to give to your training run id.\n",
- "4. `--no-graphics`: to not launch the visualization during the training.\n",
- "\n",
- "Train the model and use the `--resume` flag to continue training in case of interruption.\n",
- "\n",
- "> It will fail first time when you use `--resume`, try running the block again to bypass the error.\n",
- "\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "lN32oWF8zPjs"
- },
- "source": [
- "The training will take 30 to 45min depending on your machine (don't forget to **set up a GPU**), go take a ☕️you deserve it 🤗."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "bS-Yh1UdHfzy"
- },
- "outputs": [],
- "source": [
- "!mlagents-learn ./config/ppo/Huggy.yaml --env=./trained-envs-executables/linux/Huggy/Huggy --run-id=\"Huggy2\" --no-graphics"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "5Vue94AzPy1t"
- },
- "source": [
- "## Push the agent to the 🤗 Hub\n",
- "\n",
- "- Now that we trained our agent, we’re **ready to push it to the Hub to be able to play with Huggy on your browser🔥.**"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "izT6FpgNzZ6R"
- },
- "source": [
- "To be able to share your model with the community there are three more steps to follow:\n",
- "\n",
- "1️⃣ (If it's not already done) create an account to HF ➡ https://huggingface.co/join\n",
- "\n",
- "2️⃣ Sign in and then, you need to store your authentication token from the Hugging Face website.\n",
- "- Create a new token (https://huggingface.co/settings/tokens) **with write role**\n",
- "\n",
- "
\n",
- "\n",
- "- Copy the token\n",
- "- Run the cell below and paste the token"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "rKt2vsYoK56o"
- },
- "outputs": [],
- "source": [
- "from huggingface_hub import notebook_login\n",
- "notebook_login()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "ew59mK19zjtN"
- },
- "source": [
- "If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "Xi0y_VASRzJU"
- },
- "source": [
- "Then, we simply need to run `mlagents-push-to-hf`.\n",
- "\n",
- "
"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "KK4fPfnczunT"
- },
- "source": [
- "And we define 4 parameters:\n",
- "\n",
- "1. `--run-id`: the name of the training run id.\n",
- "2. `--local-dir`: where the agent was saved, it’s results/, so in my case results/First Training.\n",
- "3. `--repo-id`: the name of the Hugging Face repo you want to create or update. It’s always /\n",
- "If the repo does not exist **it will be created automatically**\n",
- "4. `--commit-message`: since HF repos are git repository you need to define a commit message."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "dGEFAIboLVc6"
- },
- "outputs": [],
- "source": [
- "!mlagents-push-to-hf --run-id=\"HuggyTraining\" --local-dir=\"./results/Huggy2\" --repo-id=\"ThomasSimonini/ppo-Huggy\" --commit-message=\"Huggy\""
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "yborB0850FTM"
- },
- "source": [
- "Else, if everything worked you should have this at the end of the process(but with a different url 😆) :\n",
- "\n",
- "\n",
- "\n",
- "```\n",
- "Your model is pushed to the hub. You can view your model here: https://huggingface.co/ThomasSimonini/ppo-Huggy\n",
- "```\n",
- "\n",
- "It’s the link to your model repository. The repository contains a model card that explains how to use the model, your Tensorboard logs and your config file. **What’s awesome is that it’s a git repository, which means you can have different commits, update your repository with a new push, open Pull Requests, etc.**\n",
- "\n",
- "
"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "5Uaon2cg0NrL"
- },
- "source": [
- "But now comes the best: **being able to play with Huggy online 👀.**"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "VMc4oOsE0QiZ"
- },
- "source": [
- "## Play with your Huggy 🐕\n",
- "\n",
- "This step is the simplest:\n",
- "\n",
- "- Open the game Huggy in your browser: https://huggingface.co/spaces/ThomasSimonini/Huggy\n",
- "\n",
- "- Click on Play with my Huggy model\n",
- "\n",
- "
"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "Djs8c5rR0Z8a"
- },
- "source": [
- "1. In step 1, choose your model repository which is the model id (in my case ThomasSimonini/ppo-Huggy).\n",
- "\n",
- "2. In step 2, **choose what model you want to replay**:\n",
- " - I have multiple ones, since we saved a model every 500000 timesteps.\n",
- " - But since I want the more recent, I choose `Huggy.onnx`\n",
- "\n",
- "👉 What’s nice **is to try with different models steps to see the improvement of the agent.**"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "PI6dPWmh064H"
- },
- "source": [
- "Congrats on finishing this bonus unit!\n",
- "\n",
- "You can now sit and enjoy playing with your Huggy 🐶. And don't **forget to spread the love by sharing Huggy with your friends 🤗**. And if you share about it on social media, **please tag us @huggingface and me @simoninithomas**\n",
- "\n",
- "
\n",
- "\n",
- "\n",
- "## Keep Learning, Stay awesome 🤗"
- ]
- }
- ],
- "metadata": {
- "accelerator": "GPU",
- "colab": {
- "include_colab_link": true,
- "private_outputs": true,
- "provenance": []
- },
- "gpuClass": "standard",
- "kernelspec": {
- "display_name": "Python 3",
- "name": "python3"
- },
- "language_info": {
- "name": "python"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 0
-}
diff --git a/notebooks/unit1/unit1.ipynb b/notebooks/unit1/unit1.ipynb
index 06d62b0..3605d63 100644
--- a/notebooks/unit1/unit1.ipynb
+++ b/notebooks/unit1/unit1.ipynb
@@ -284,11 +284,21 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 1,
"metadata": {
"id": "BE5JWP5rQIKf"
},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "\n",
+ "KeyboardInterrupt\n",
+ "\n"
+ ]
+ }
+ ],
"source": [
"# Virtual display\n",
"from pyvirtualdisplay import Display\n",
@@ -316,11 +326,24 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 2,
"metadata": {
"id": "cygWLPGsEQ0m"
},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Gym has been unmaintained since 2022 and does not support NumPy 2.0 amongst other critical functionality.\n",
+ "Please upgrade to Gymnasium, the maintained drop-in replacement of Gym, or contact the authors of your software and request that they upgrade.\n",
+ "Users of this version of Gym should be able to simply replace 'import gym' with 'import gymnasium as gym' in the vast majority of cases.\n",
+ "See the migration guide at https://gymnasium.farama.org/introduction/migration_guide/ for additional information.\n",
+ "/home/mique/Desktop/Code/deep-rl-class/notebooks/unit1/venv-u1/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
+ " from .autonotebook import tqdm as notebook_tqdm\n"
+ ]
+ }
+ ],
"source": [
"import gymnasium\n",
"\n",
@@ -353,7 +376,7 @@
"\n",
"Let's look at an example, but first let's recall the RL loop.\n",
"\n",
- "
"
+ "
"
]
},
{
@@ -396,11 +419,59 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 3,
"metadata": {
"id": "w7vOFlpA_ONz"
},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Action taken: 2\n",
+ " - reward: 3.0692112001439513\n",
+ "Action taken: 3\n",
+ " - reward: -2.0283326535021318\n",
+ "Action taken: 0\n",
+ " - reward: -2.013109629392062\n",
+ "Action taken: 1\n",
+ " - reward: -1.8387614642986694\n",
+ "Action taken: 3\n",
+ " - reward: -1.9646071472228346\n",
+ "Action taken: 2\n",
+ " - reward: 1.724712789874087\n",
+ "Action taken: 3\n",
+ " - reward: -2.0772821745045733\n",
+ "Action taken: 3\n",
+ " - reward: -2.263394443942046\n",
+ "Action taken: 2\n",
+ " - reward: 1.03422570110298\n",
+ "Action taken: 1\n",
+ " - reward: -1.9686919634781634\n",
+ "Action taken: 0\n",
+ " - reward: -1.880365204866706\n",
+ "Action taken: 3\n",
+ " - reward: -2.1378125038369533\n",
+ "Action taken: 2\n",
+ " - reward: 0.23407670781683693\n",
+ "Action taken: 0\n",
+ " - reward: -2.0440816329574147\n",
+ "Action taken: 0\n",
+ " - reward: -1.9836184981424765\n",
+ "Action taken: 2\n",
+ " - reward: 1.1548347711850055\n",
+ "Action taken: 1\n",
+ " - reward: -1.7956347801317054\n",
+ "Action taken: 0\n",
+ " - reward: -1.7729850216231284\n",
+ "Action taken: 2\n",
+ " - reward: 1.9191545079788284\n",
+ "Action taken: 3\n",
+ " - reward: -2.0884827451743875\n",
+ "Total reward: -18.720944184971565\n"
+ ]
+ }
+ ],
"source": [
"import gymnasium as gym\n",
"\n",
@@ -410,6 +481,7 @@
"# Then we reset this environment\n",
"observation, info = env.reset()\n",
"\n",
+ "total_reward = 0\n",
"for _ in range(20):\n",
" # Take a random action\n",
" action = env.action_space.sample()\n",
@@ -418,13 +490,16 @@
" # Do this action in the environment and get\n",
" # next_state, reward, terminated, truncated and info\n",
" observation, reward, terminated, truncated, info = env.step(action)\n",
- "\n",
+ " print(f\" - reward: {reward}\")\n",
+ " total_reward += reward\n",
" # If the game is terminated (in our case we land, crashed) or truncated (timeout)\n",
+ " \n",
" if terminated or truncated:\n",
" # Reset the environment\n",
" print(\"Environment is reset\")\n",
" observation, info = env.reset()\n",
"\n",
+ "print(\"Total reward:\", total_reward)\n",
"env.close()"
]
},
@@ -450,6 +525,29 @@
"---\n"
]
},
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The state is an 8-dimensional vector: the coordinates of the lander in x & y, its linear velocities in x & y, its angle, its angular velocity, and two booleans that represent whether each leg is in contact with the ground or not.\n",
+ "\n",
+ "```\n",
+ "Box([ -2.5 -2.5 -10. -10. -6.2831855 -10. -0. -0. ], [ 2.5 2.5 10. 10. 6.2831855 10. 1. 1. ], (8,), float32)\n",
+ "Box(            # annotated: low .. high per dimension\n",
+ "[\n",
+ "  x   -2.5 .. 2.5     y   -2.5 .. 2.5\n",
+ "  vx -10.  .. 10.     vy -10.  .. 10.\n",
+ "  angle        -6.2831855 .. 6.2831855\n",
+ "  angular vel. -10.       .. 10.\n",
+ "  left leg contact   -0. .. 1.\n",
+ "  right leg contact  -0. .. 1.\n",
+ "],\n",
+ "shape (8,)\n",
+ ", float32)\n",
+ "```\n",
+ "\n"
+ ]
+ },
{
"cell_type": "markdown",
"metadata": {
@@ -461,11 +559,23 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 4,
"metadata": {
"id": "ZNPG0g_UGCfh"
},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "_____OBSERVATION SPACE_____ \n",
+ "\n",
+ "Observation Space Shape (8,)\n",
+ "Sample observation [ 53.21532 -87.118256 -0.84611297 3.4404945 0.7532178\n",
+ " 2.645675 0.9980984 0.40649492]\n"
+ ]
+ }
+ ],
"source": [
"# We create our environment with gym.make(\"\")\n",
"env = gym.make(\"LunarLander-v2\")\n",
@@ -494,11 +604,23 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 5,
"metadata": {
"id": "We5WqOBGLoSm"
},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n",
+ " _____ACTION SPACE_____ \n",
+ "\n",
+ "Action Space Shape 4\n",
+ "Action Space Sample 3\n"
+ ]
+ }
+ ],
"source": [
"print(\"\\n _____ACTION SPACE_____ \\n\")\n",
"print(\"Action Space Shape\", env.action_space.n)\n",
@@ -549,7 +671,7 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 6,
"metadata": {
"id": "99hqQ_etEy1N"
},
@@ -629,16 +751,24 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 7,
"metadata": {
"id": "nxI6hT1GE4-A"
},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Using cuda device\n"
+ ]
+ }
+ ],
"source": [
"# TODO: Define a PPO MlpPolicy architecture\n",
"# We use MultiLayerPerceptron (MLPPolicy) because the input is a vector,\n",
"# if we had frames as input we would use CnnPolicy\n",
- "model ="
+ "model = PPO(\"MlpPolicy\", env=env, verbose=1)"
]
},
{
@@ -652,11 +782,19 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 8,
"metadata": {
"id": "543OHYDfcjK4"
},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Using cuda device\n"
+ ]
+ }
+ ],
"source": [
"# SOLUTION\n",
"# We added some parameters to accelerate the training\n",
@@ -669,7 +807,9 @@
" gamma = 0.999,\n",
" gae_lambda = 0.98,\n",
" ent_coef = 0.01,\n",
- " verbose=1)"
+ " verbose=1\n",
+ ")\n",
+ "model_name = \"ppo-LunarLander-v2\"\n"
]
},
{
@@ -685,16 +825,16 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 14,
"metadata": {
"id": "qKnYkNiVp89p"
},
"outputs": [],
"source": [
"# TODO: Train it for 1,000,000 timesteps\n",
- "\n",
+ "model.learn(total_timesteps=1_000_000)\n",
"# TODO: Specify file name for model and save the model to file\n",
- "model_name = \"ppo-LunarLander-v2\"\n"
+ "model.save(model_name)"
]
},
{
@@ -741,21 +881,238 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 9,
"metadata": {
"id": "yRpno0glsADy"
},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Mean reward: 252.266642 +/- 34.661009936952354\n"
+ ]
+ }
+ ],
"source": [
"# TODO: Evaluate the agent\n",
"# Create a new environment for evaluation\n",
- "eval_env =\n",
+ "eval_env = Monitor(gym.make(\"LunarLander-v2\", render_mode='rgb_array'))\n",
"\n",
+ "# Load model\n",
+ "model = PPO.load(model_name, env=eval_env)\n",
"# Evaluate the model with 10 evaluation episodes and deterministic=True\n",
- "mean_reward, std_reward =\n",
+ "mean_reward, std_reward = evaluate_policy(model, model.get_env(), n_eval_episodes=10, deterministic=True)\n",
"\n",
"# Print the results\n",
- "\n"
+ "print(f\"Mean reward: {mean_reward} +/- {std_reward}\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "IMAGEIO FFMPEG_WRITER WARNING: input image is not divisible by macro_block_size=16, resizing from (600, 400) to (608, 400) to ensure video compatibility with most codecs and players. To prevent resizing, make your input image divisible by the macro_block_size or set the macro_block_size to 1 (risking incompatibility).\n",
+ "IMAGEIO FFMPEG_WRITER WARNING: input image is not divisible by macro_block_size=16, resizing from (600, 400) to (608, 400) to ensure video compatibility with most codecs and players. To prevent resizing, make your input image divisible by the macro_block_size or set the macro_block_size to 1 (risking incompatibility).\n",
+ "IMAGEIO FFMPEG_WRITER WARNING: input image is not divisible by macro_block_size=16, resizing from (600, 400) to (608, 400) to ensure video compatibility with most codecs and players. To prevent resizing, make your input image divisible by the macro_block_size or set the macro_block_size to 1 (risking incompatibility).\n",
+ "IMAGEIO FFMPEG_WRITER WARNING: input image is not divisible by macro_block_size=16, resizing from (600, 400) to (608, 400) to ensure video compatibility with most codecs and players. To prevent resizing, make your input image divisible by the macro_block_size or set the macro_block_size to 1 (risking incompatibility).\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ " - frame: 250\n",
+ " - frame: 500\n",
+ " - frame: 750\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "IMAGEIO FFMPEG_WRITER WARNING: input image is not divisible by macro_block_size=16, resizing from (600, 400) to (608, 400) to ensure video compatibility with most codecs and players. To prevent resizing, make your input image divisible by the macro_block_size or set the macro_block_size to 1 (risking incompatibility).\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ " - frame: 1000\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "IMAGEIO FFMPEG_WRITER WARNING: input image is not divisible by macro_block_size=16, resizing from (600, 400) to (608, 400) to ensure video compatibility with most codecs and players. To prevent resizing, make your input image divisible by the macro_block_size or set the macro_block_size to 1 (risking incompatibility).\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ " - frame: 250\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "IMAGEIO FFMPEG_WRITER WARNING: input image is not divisible by macro_block_size=16, resizing from (600, 400) to (608, 400) to ensure video compatibility with most codecs and players. To prevent resizing, make your input image divisible by the macro_block_size or set the macro_block_size to 1 (risking incompatibility).\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ " - frame: 250\n",
+ " - frame: 500\n",
+ " - frame: 750\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "IMAGEIO FFMPEG_WRITER WARNING: input image is not divisible by macro_block_size=16, resizing from (600, 400) to (608, 400) to ensure video compatibility with most codecs and players. To prevent resizing, make your input image divisible by the macro_block_size or set the macro_block_size to 1 (risking incompatibility).\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ " - frame: 1000\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "IMAGEIO FFMPEG_WRITER WARNING: input image is not divisible by macro_block_size=16, resizing from (600, 400) to (608, 400) to ensure video compatibility with most codecs and players. To prevent resizing, make your input image divisible by the macro_block_size or set the macro_block_size to 1 (risking incompatibility).\n",
+ "IMAGEIO FFMPEG_WRITER WARNING: input image is not divisible by macro_block_size=16, resizing from (600, 400) to (608, 400) to ensure video compatibility with most codecs and players. To prevent resizing, make your input image divisible by the macro_block_size or set the macro_block_size to 1 (risking incompatibility).\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ " - frame: 250\n",
+ " - frame: 250\n",
+ " - frame: 500\n",
+ " - frame: 750\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "IMAGEIO FFMPEG_WRITER WARNING: input image is not divisible by macro_block_size=16, resizing from (600, 400) to (608, 400) to ensure video compatibility with most codecs and players. To prevent resizing, make your input image divisible by the macro_block_size or set the macro_block_size to 1 (risking incompatibility).\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ " - frame: 1000\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "IMAGEIO FFMPEG_WRITER WARNING: input image is not divisible by macro_block_size=16, resizing from (600, 400) to (608, 400) to ensure video compatibility with most codecs and players. To prevent resizing, make your input image divisible by the macro_block_size or set the macro_block_size to 1 (risking incompatibility).\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ " - frame: 250\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "IMAGEIO FFMPEG_WRITER WARNING: input image is not divisible by macro_block_size=16, resizing from (600, 400) to (608, 400) to ensure video compatibility with most codecs and players. To prevent resizing, make your input image divisible by the macro_block_size or set the macro_block_size to 1 (risking incompatibility).\n"
+ ]
+ }
+ ],
+ "source": [
+ "from stable_baselines3.common.vec_env import DummyVecEnv\n",
+ "from stable_baselines3.common.monitor import Monitor\n",
+ "import imageio\n",
+ "import gym\n",
+ "import numpy as np\n",
+ "np.bool8 = np.bool_  # compat shim: older gym code references np.bool8, which newer NumPy releases removed\n",
+ "\n",
+ "for i in range(30):\n",
+ " eval_env = DummyVecEnv([lambda: Monitor(gym.make(\"LunarLander-v2\", render_mode=\"rgb_array\"))])\n",
+ "\n",
+ " frames = []\n",
+ " obs = eval_env.reset()\n",
+ " done = False\n",
+ " while not done:\n",
+ " action, _ = model.predict(obs, deterministic=False)\n",
+ " obs, reward, done, info = eval_env.step(action)\n",
+ " done = done[0] # VecEnv returns an array of dones, one per env\n",
+ " img = eval_env.envs[0].render() # returns RGB array\n",
+ " frames.append(img)\n",
+ " if len(frames) % 250 == 0:\n",
+ " print(f\" - frame: {len(frames)}\")\n",
+ "\n",
+ " imageio.mimsave(f'lunarlander_run-{i}.mp4', frames, fps=30)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 29,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "IMAGEIO FFMPEG_WRITER WARNING: input image is not divisible by macro_block_size=16, resizing from (600, 400) to (608, 400) to ensure video compatibility with most codecs and players. To prevent resizing, make your input image divisible by the macro_block_size or set the macro_block_size to 1 (risking incompatibility).\n"
+ ]
+ },
+ {
+ "ename": "",
+ "evalue": "",
+ "output_type": "error",
+ "traceback": [
+ "\u001b[1;31mThe Kernel crashed while executing code in the current cell or a previous cell. \n",
+ "\u001b[1;31mPlease review the code in the cell(s) to identify a possible cause of the failure. \n",
+ "\u001b[1;31mClick here for more info. \n",
+ "\u001b[1;31mView Jupyter log for further details."
+ ]
+ }
+ ],
+ "source": [
+ "imageio.mimsave('lunarlander_run.mp4', frames, fps=30)\n"
]
},
{
@@ -777,7 +1134,7 @@
"source": [
"#@title\n",
"eval_env = Monitor(gym.make(\"LunarLander-v2\", render_mode='rgb_array'))\n",
- "mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)\n",
+ "mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)\n",
"print(f\"mean_reward={mean_reward:.2f} +/- {std_reward}\")"
]
},
@@ -889,12 +1246,248 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 19,
"metadata": {
"id": "JPG7ofdGIHN8"
},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\u001b[38;5;4mℹ This function will save, evaluate, generate a video of your agent,\n",
+ "create a model card and push everything to the hub. It might take up to 1min.\n",
+ "This is a work in progress: if you encounter a bug, please open an issue.\u001b[0m\n",
+ "Saving video to /tmp/tmpztlifguu/-step-0-to-step-1000.mp4\n",
+ "MoviePy - Building video /tmp/tmpztlifguu/-step-0-to-step-1000.mp4.\n",
+ "MoviePy - Writing video /tmp/tmpztlifguu/-step-0-to-step-1000.mp4\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "ffmpeg version 6.1.1-3ubuntu5 Copyright (c) 2000-2023 the FFmpeg developers\n",
+ " built with gcc 13 (Ubuntu 13.2.0-23ubuntu3)\n",
+ " configuration: --prefix=/usr --extra-version=3ubuntu5 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --disable-omx --enable-gnutls --enable-libaom --enable-libass --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libglslang --enable-libgme --enable-libgsm --enable-libharfbuzz --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-openal --enable-opencl --enable-opengl --disable-sndio --enable-libvpl --disable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-ladspa --enable-libbluray --enable-libjack --enable-libpulse --enable-librabbitmq --enable-librist --enable-libsrt --enable-libssh --enable-libsvtav1 --enable-libx264 --enable-libzmq --enable-libzvbi --enable-lv2 --enable-sdl2 --enable-libplacebo --enable-librav1e --enable-pocketsphinx --enable-librsvg --enable-libjxl --enable-shared\n",
+ " libavutil 58. 29.100 / 58. 29.100\n",
+ " libavcodec 60. 31.102 / 60. 31.102\n",
+ " libavformat 60. 16.100 / 60. 16.100\n",
+ " libavdevice 60. 3.100 / 60. 3.100\n",
+ " libavfilter 9. 12.100 / 9. 12.100\n",
+ " libswscale 7. 5.100 / 7. 5.100\n",
+ " libswresample 4. 12.100 / 4. 12.100\n",
+ " libpostproc 57. 3.100 / 57. 3.100\n",
+ "Input #0, mov,mp4,m4a,3gp,3g2,mj2, from '/tmp/tmpztlifguu/-step-0-to-step-1000.mp4':\n",
+ " Metadata:\n",
+ " major_brand : isom\n",
+ " minor_version : 512\n",
+ " compatible_brands: isomiso2avc1mp41\n",
+ " encoder : Lavf61.1.100\n",
+ " Duration: 00:00:20.00, start: 0.000000, bitrate: 51 kb/s\n",
+ " Stream #0:0[0x1](und): Video: h264 (High) (avc1 / 0x31637661), yuv420p(progressive), 600x400, 46 kb/s, 50 fps, 50 tbr, 12800 tbn (default)\n",
+ " Metadata:\n",
+ " handler_name : VideoHandler\n",
+ " vendor_id : [0][0][0][0]\n",
+ " encoder : Lavc61.3.100 libx264\n",
+ "Stream mapping:\n",
+ " Stream #0:0 -> #0:0 (h264 (native) -> h264 (libx264))\n",
+ "Press [q] to stop, [?] for help\n",
+ "[libx264 @ 0x55b3ab1fa980] using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2\n",
+ "[libx264 @ 0x55b3ab1fa980] profile High, level 3.1, 4:2:0, 8-bit\n",
+ "[libx264 @ 0x55b3ab1fa980] 264 - core 164 r3108 31e19f9 - H.264/MPEG-4 AVC codec - Copyleft 2003-2023 - http://www.videolan.org/x264.html - options: cabac=1 ref=3 deblock=1:0:0 analyse=0x3:0x113 me=hex subme=7 psy=1 psy_rd=1.00:0.00 mixed_ref=1 me_range=16 chroma_me=1 trellis=1 8x8dct=1 cqm=0 deadzone=21,11 fast_pskip=1 chroma_qp_offset=-2 threads=12 lookahead_threads=2 sliced_threads=0 nr=0 decimate=1 interlaced=0 bluray_compat=0 constrained_intra=0 bframes=3 b_pyramid=2 b_adapt=1 b_bias=0 direct=1 weightb=1 open_gop=0 weightp=2 keyint=250 keyint_min=25 scenecut=40 intra_refresh=0 rc_lookahead=40 rc=crf mbtree=1 crf=23.0 qcomp=0.60 qpmin=0 qpmax=69 qpstep=4 ip_ratio=1.40 aq=1:1.00\n",
+ "Output #0, mp4, to '/tmp/tmp2etu86el/replay.mp4':\n",
+ " Metadata:\n",
+ " major_brand : isom\n",
+ " minor_version : 512\n",
+ " compatible_brands: isomiso2avc1mp41\n",
+ " encoder : Lavf60.16.100\n",
+ " Stream #0:0(und): Video: h264 (avc1 / 0x31637661), yuv420p(progressive), 600x400, q=2-31, 50 fps, 12800 tbn (default)\n",
+ " Metadata:\n",
+ " handler_name : VideoHandler\n",
+ " vendor_id : [0][0][0][0]\n",
+ " encoder : Lavc60.31.102 libx264\n",
+ " Side data:\n",
+ " cpb: bitrate max/min/avg: 0/0/0 buffer size: 0 vbv_delay: N/A\n",
+ "frame= 0 fps=0.0 q=0.0 size= 0kB time=N/A bitrate=N/A speed=N/A \r"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "MoviePy - Done !\n",
+ "MoviePy - video ready /tmp/tmpztlifguu/-step-0-to-step-1000.mp4\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "[out#0/mp4 @ 0x55b3ab12b880] video:110kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 11.302154%\n",
+ "frame= 1000 fps=0.0 q=-1.0 Lsize= 122kB time=00:00:19.94 bitrate= 50.1kbits/s speed=22.1x \n",
+ "[libx264 @ 0x55b3ab1fa980] frame I:4 Avg QP: 9.42 size: 2199\n",
+ "[libx264 @ 0x55b3ab1fa980] frame P:268 Avg QP:18.69 size: 158\n",
+ "[libx264 @ 0x55b3ab1fa980] frame B:728 Avg QP:20.11 size: 83\n",
+ "[libx264 @ 0x55b3ab1fa980] consecutive B-frames: 0.8% 4.2% 6.6% 88.4%\n",
+ "[libx264 @ 0x55b3ab1fa980] mb I I16..4: 92.1% 1.7% 6.2%\n",
+ "[libx264 @ 0x55b3ab1fa980] mb P I16..4: 0.1% 0.3% 0.1% P16..4: 1.1% 0.2% 0.1% 0.0% 0.0% skip:98.1%\n",
+ "[libx264 @ 0x55b3ab1fa980] mb B I16..4: 0.0% 0.0% 0.0% B16..8: 1.7% 0.2% 0.0% direct: 0.0% skip:98.0% L0:55.7% L1:43.8% BI: 0.6%\n",
+ "[libx264 @ 0x55b3ab1fa980] 8x8 transform intra:15.7% inter:16.2%\n",
+ "[libx264 @ 0x55b3ab1fa980] coded y,uvDC,uvAC intra: 7.0% 9.7% 8.7% inter: 0.1% 0.2% 0.1%\n",
+ "[libx264 @ 0x55b3ab1fa980] i16 v,h,dc,p: 90% 5% 5% 0%\n",
+ "[libx264 @ 0x55b3ab1fa980] i8 v,h,dc,ddl,ddr,vr,hd,vl,hu: 11% 4% 84% 0% 0% 0% 0% 0% 0%\n",
+ "[libx264 @ 0x55b3ab1fa980] i4 v,h,dc,ddl,ddr,vr,hd,vl,hu: 15% 15% 58% 2% 3% 1% 3% 1% 3%\n",
+ "[libx264 @ 0x55b3ab1fa980] i8c dc,h,v,p: 93% 3% 3% 0%\n",
+ "[libx264 @ 0x55b3ab1fa980] Weighted P-Frames: Y:0.0% UV:0.0%\n",
+ "[libx264 @ 0x55b3ab1fa980] ref P L0: 66.6% 1.8% 20.4% 11.1%\n",
+ "[libx264 @ 0x55b3ab1fa980] ref B L0: 68.4% 27.3% 4.2%\n",
+ "[libx264 @ 0x55b3ab1fa980] ref B L1: 93.3% 6.7%\n",
+ "[libx264 @ 0x55b3ab1fa980] kb/s:44.60\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\u001b[38;5;4mℹ Pushing repo turbo-maikol/rl-course-unit1 to the Hugging Face Hub\u001b[0m\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Processing Files (0 / 0) : | | 0.00B / 0.00B \n",
+ "Processing Files (5 / 5) : 100%|██████████| 408kB / 408kB, 185kB/s \n",
+ "New Data Upload : 100%|██████████| 406kB / 406kB, 185kB/s \n",
+ " ...unarLander-v2/pytorch_variables.pth: 100%|██████████| 1.26kB / 1.26kB \n",
+ " ...LunarLander-v2/policy.optimizer.pth: 100%|██████████| 88.9kB / 88.9kB \n",
+ " ...u86el/ppo-LunarLander-v2/policy.pth: 100%|██████████| 44.1kB / 44.1kB \n",
+ " .../tmp2etu86el/ppo-LunarLander-v2.zip: 100%|██████████| 149kB / 149kB \n",
+ " /tmp/tmp2etu86el/replay.mp4 : 100%|██████████| 125kB / 125kB \n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\u001b[38;5;4mℹ Your model is pushed to the Hub. You can view your model here:\n",
+ "https://huggingface.co/turbo-maikol/rl-course-unit1/tree/main/\u001b[0m\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "CommitInfo(commit_url='https://huggingface.co/turbo-maikol/rl-course-unit1/commit/3de80d180623b404c50319ea857ba782dccad4c9', commit_message='Model trained with PPO on LunarLander-v2 for the DEEP RL huggingface course', commit_description='', oid='3de80d180623b404c50319ea857ba782dccad4c9', pr_url=None, repo_url=RepoUrl('https://huggingface.co/turbo-maikol/rl-course-unit1', endpoint='https://huggingface.co', repo_type='model', repo_id='turbo-maikol/rl-course-unit1'), pr_revision=None, pr_num=None)"
+ ]
+ },
+ "execution_count": 19,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
"source": [
+ "import os\n",
+ "from dotenv import load_dotenv\n",
+ "load_dotenv()\n",
+ "\n",
"import gymnasium as gym\n",
"from stable_baselines3.common.vec_env import DummyVecEnv\n",
"from stable_baselines3.common.env_util import make_vec_env\n",
@@ -903,29 +1496,32 @@
"\n",
"## TODO: Define a repo_id\n",
"## repo_id is the id of the model repository from the Hugging Face Hub (repo_id = {organization}/{repo_name} for instance ThomasSimonini/ppo-LunarLander-v2\n",
- "repo_id =\n",
+ "repo_id = \"turbo-maikol/rl-course-unit1\"\n",
"\n",
"# TODO: Define the name of the environment\n",
- "env_id =\n",
+ "env_id = \"LunarLander-v2\"\n",
"\n",
"# Create the evaluation env and set the render_mode=\"rgb_array\"\n",
"eval_env = DummyVecEnv([lambda: Monitor(gym.make(env_id, render_mode=\"rgb_array\"))])\n",
"\n",
"\n",
"# TODO: Define the model architecture we used\n",
- "model_architecture = \"\"\n",
+ "model_architecture = \"PPO\"\n",
"\n",
"## TODO: Define the commit message\n",
- "commit_message = \"\"\n",
+ "commit_message = \"Model trained with PPO on LunarLander-v2 for the DEEP RL huggingface course\"\n",
"\n",
"# method save, evaluate, generate a model card and record a replay video of your agent before pushing the repo to the hub\n",
- "package_to_hub(model=model, # Our trained model\n",
- " model_name=model_name, # The name of our trained model\n",
- " model_architecture=model_architecture, # The model architecture we used: in our case PPO\n",
- " env_id=env_id, # Name of the environment\n",
- " eval_env=eval_env, # Evaluation Environment\n",
- " repo_id=repo_id, # id of the model repository from the Hugging Face Hub (repo_id = {organization}/{repo_name} for instance ThomasSimonini/ppo-LunarLander-v2\n",
- " commit_message=commit_message)"
+ "package_to_hub(\n",
+ " model=model, # Our trained model\n",
+ " model_name=model_name, # The name of our trained model\n",
+ " model_architecture=model_architecture, # The model architecture we used: in our case PPO\n",
+ " env_id=env_id, # Name of the environment\n",
+ " eval_env=eval_env, # Evaluation Environment\n",
+ " repo_id=repo_id, # id of the model repository from the Hugging Face Hub (repo_id = {organization}/{repo_name} for instance ThomasSimonini/ppo-LunarLander-v2\n",
+ " commit_message=commit_message,\n",
+ " token=os.getenv(\"HF_HUB_TOKEN\")\n",
+ ")"
]
},
{
@@ -1066,9 +1662,9 @@
"# 1. Install pickle5 (we done it at the beginning of the colab)\n",
"# 2. Create a custom empty object we pass as parameter to PPO.load()\n",
"custom_objects = {\n",
- " \"learning_rate\": 0.0,\n",
- " \"lr_schedule\": lambda _: 0.0,\n",
- " \"clip_range\": lambda _: 0.0,\n",
+ " \"learning_rate\": 0.0,\n",
+ " \"lr_schedule\": lambda _: 0.0, \n",
+ " \"clip_range\": lambda _: 0.0,\n",
"}\n",
"\n",
"checkpoint = load_from_hub(repo_id, filename)\n",
@@ -1163,18 +1759,21 @@
},
"gpuClass": "standard",
"kernelspec": {
- "display_name": "Python 3.9.7",
+ "display_name": "venv-u1",
"language": "python",
"name": "python3"
},
"language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
"name": "python",
- "version": "3.9.7"
- },
- "vscode": {
- "interpreter": {
- "hash": "ed7f8024e43d3b8f5ca3c5e1a8151ab4d136b3ecee1e3fd59e0766ccc55e1b10"
- }
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.12.3"
}
},
"nbformat": 4,
diff --git a/notebooks/unit2/unit2.ipynb b/notebooks/unit2/unit2.ipynb
index e9ae624..5df36f4 100644
--- a/notebooks/unit2/unit2.ipynb
+++ b/notebooks/unit2/unit2.ipynb
@@ -3,8 +3,8 @@
{
"cell_type": "markdown",
"metadata": {
- "id": "view-in-github",
- "colab_type": "text"
+ "colab_type": "text",
+ "id": "view-in-github"
},
"source": [
"
"
@@ -36,6 +36,9 @@
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "DPTBOv9HYLZ2"
+ },
"source": [
"###🎮 Environments:\n",
"\n",
@@ -48,10 +51,7 @@
"- [Gymnasium](https://gymnasium.farama.org/)\n",
"\n",
"We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the GitHub Repo](https://github.com/huggingface/deep-rl-class/issues)."
- ],
- "metadata": {
- "id": "DPTBOv9HYLZ2"
- }
+ ]
},
{
"cell_type": "markdown",
@@ -72,14 +72,14 @@
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "viNzVbVaYvY3"
+ },
"source": [
"## This notebook is from the Deep Reinforcement Learning Course\n",
"\n",
"
"
- ],
- "metadata": {
- "id": "viNzVbVaYvY3"
- }
+ ]
},
{
"cell_type": "markdown",
@@ -156,28 +156,31 @@
},
{
"cell_type": "markdown",
- "source": [
- "# Let's code our first Reinforcement Learning algorithm 🚀"
- ],
"metadata": {
"id": "HEtx8Y8MqKfH"
- }
+ },
+ "source": [
+ "# Let's code our first Reinforcement Learning algorithm 🚀"
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "Kdxb1IhzTn0v"
+ },
"source": [
"To validate this hands-on for the [certification process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process), you need to push your trained Taxi model to the Hub and **get a result of >= 4.5**.\n",
"\n",
"To find your result, go to the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and find your model, **the result = mean_reward - std of reward**\n",
"\n",
"For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process"
- ],
- "metadata": {
- "id": "Kdxb1IhzTn0v"
- }
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "4gpxC1_kqUYe"
+ },
"source": [
"## Install dependencies and create a virtual display 🔽\n",
"\n",
@@ -194,10 +197,7 @@
"The Hugging Face Hub 🤗 works as a central place where anyone can share and explore models and datasets. It has versioning, metrics, visualizations and other features that will allow you to easily collaborate with others.\n",
"\n",
"You can see here all the Deep RL models available (if they use Q Learning) here 👉 https://huggingface.co/models?other=q-learning"
- ],
- "metadata": {
- "id": "4gpxC1_kqUYe"
- }
+ ]
},
{
"cell_type": "code",
@@ -212,53 +212,53 @@
},
{
"cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "n71uTX7qqzz2"
+ },
+ "outputs": [],
"source": [
"!sudo apt-get update\n",
"!sudo apt-get install -y python3-opengl\n",
"!apt install ffmpeg xvfb\n",
"!pip3 install pyvirtualdisplay"
- ],
- "metadata": {
- "id": "n71uTX7qqzz2"
- },
- "execution_count": null,
- "outputs": []
+ ]
},
{
"cell_type": "markdown",
- "source": [
- "To make sure the new installed libraries are used, **sometimes it's required to restart the notebook runtime**. The next cell will force the **runtime to crash, so you'll need to connect again and run the code starting from here**. Thanks to this trick, **we will be able to run our virtual screen.**"
- ],
"metadata": {
"id": "K6XC13pTfFiD"
- }
+ },
+ "source": [
+ "To make sure the newly installed libraries are used, **sometimes it's required to restart the notebook runtime**. The next cell will force the **runtime to crash, so you'll need to connect again and run the code starting from here**. Thanks to this trick, **we will be able to run our virtual screen.**"
+ ]
},
{
"cell_type": "code",
- "source": [
- "import os\n",
- "os.kill(os.getpid(), 9)"
- ],
+ "execution_count": null,
"metadata": {
"id": "3kuZbWAkfHdg"
},
- "execution_count": null,
- "outputs": []
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "os.kill(os.getpid(), 9)"
+ ]
},
{
"cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "DaY1N4dBrabi"
+ },
+ "outputs": [],
"source": [
"# Virtual display\n",
"from pyvirtualdisplay import Display\n",
"\n",
"virtual_display = Display(visible=0, size=(1400, 900))\n",
"virtual_display.start()"
- ],
- "metadata": {
- "id": "DaY1N4dBrabi"
- },
- "execution_count": null,
- "outputs": []
+ ]
},
{
"cell_type": "markdown",
@@ -276,7 +276,7 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 7,
"metadata": {
"id": "VcNvOAQlysBJ"
},
@@ -287,10 +287,8 @@
"import random\n",
"import imageio\n",
"import os\n",
- "import tqdm\n",
"\n",
- "import pickle5 as pickle\n",
- "from tqdm.notebook import tqdm"
+ "import pickle5 as pickle"
]
},
{
@@ -354,14 +352,21 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 35,
"metadata": {
"id": "IzJnb8O3y8up"
},
"outputs": [],
"source": [
"# Create the FrozenLake-v1 environment using 4x4 map and non-slippery version and render_mode=\"rgb_array\"\n",
- "env = gym.make() # TODO use the correct parameters"
+ "\n",
+ "# desc spells out the default 4x4 map (S=start, F=frozen, H=hole, G=goal)\n",
+ "desc = [\n",
+ " \"SFFF\",\n",
+ " \"FHFH\",\n",
+ " \"FFFH\",\n",
+ " \"HFFG\"\n",
+ "]\n",
+ "env = gym.make(\"FrozenLake-v1\", map_name=\"4x4\", desc=desc, is_slippery=False, render_mode=\"rgb_array\")"
]
},
{
@@ -411,11 +416,22 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 34,
"metadata": {
"id": "ZNPG0g_UGCfh"
},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "_____OBSERVATION SPACE_____ \n",
+ "\n",
+ "Observation Space Discrete(16)\n",
+ "Sample observation 0\n"
+ ]
+ }
+ ],
"source": [
"# We create our environment with gym.make(\"\")- `is_slippery=False`: The agent always moves in the intended direction due to the non-slippery nature of the frozen lake (deterministic).\n",
"print(\"_____OBSERVATION SPACE_____ \\n\")\n",
@@ -441,11 +457,23 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 10,
"metadata": {
"id": "We5WqOBGLoSm"
},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n",
+ " _____ACTION SPACE_____ \n",
+ "\n",
+ "Action Space Shape 4\n",
+ "Action Space Sample 2\n"
+ ]
+ }
+ ],
"source": [
"print(\"\\n _____ACTION SPACE_____ \\n\")\n",
"print(\"Action Space Shape\", env.action_space.n)\n",
@@ -488,22 +516,31 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 11,
"metadata": {
"id": "y3ZCdluj3k0l"
},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "There are 16 possible states\n",
+ "There are 4 possible actions\n"
+ ]
+ }
+ ],
"source": [
- "state_space =\n",
+ "state_space = env.observation_space.n\n",
"print(\"There are \", state_space, \" possible states\")\n",
"\n",
- "action_space =\n",
+ "action_space = env.action_space.n\n",
"print(\"There are \", action_space, \" possible actions\")"
]
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 12,
"metadata": {
"id": "rCddoOXM3UQH"
},
@@ -511,19 +548,47 @@
"source": [
"# Let's create our Qtable of size (state_space, action_space) and initialized each values at 0 using np.zeros. np.zeros needs a tuple (a,b)\n",
"def initialize_q_table(state_space, action_space):\n",
- " Qtable =\n",
+ " \"\"\"Is not a matrix, is an array and we can locate each game cell later with `current_row * ncols + current_col`\"\"\"\n",
+ " Qtable = np.zeros((state_space, action_space))\n",
" return Qtable"
]
},
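The flat state encoding mentioned in the docstring above can be sketched outside the notebook (the grid values here are illustrative):

```python
import numpy as np

def initialize_q_table(state_space, action_space):
    # One row per state, one column per action, all values start at 0
    return np.zeros((state_space, action_space))

# FrozenLake 4x4: the discrete state index is current_row * ncols + current_col
ncols = 4
state = 2 * ncols + 3  # row 2, col 3 -> state 11

qtable = initialize_q_table(16, 4)
print(qtable.shape)  # (16, 4)
print(state)         # 11
```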
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 13,
"metadata": {
"id": "9YfvrqRt3jdR"
},
- "outputs": [],
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([[0., 0., 0., 0.],\n",
+ " [0., 0., 0., 0.],\n",
+ " [0., 0., 0., 0.],\n",
+ " [0., 0., 0., 0.],\n",
+ " [0., 0., 0., 0.],\n",
+ " [0., 0., 0., 0.],\n",
+ " [0., 0., 0., 0.],\n",
+ " [0., 0., 0., 0.],\n",
+ " [0., 0., 0., 0.],\n",
+ " [0., 0., 0., 0.],\n",
+ " [0., 0., 0., 0.],\n",
+ " [0., 0., 0., 0.],\n",
+ " [0., 0., 0., 0.],\n",
+ " [0., 0., 0., 0.],\n",
+ " [0., 0., 0., 0.],\n",
+ " [0., 0., 0., 0.]])"
+ ]
+ },
+ "execution_count": 13,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
"source": [
- "Qtable_frozenlake = initialize_q_table(state_space, action_space)"
+ "Qtable_frozenlake = initialize_q_table(state_space, action_space)\n",
+ "Qtable_frozenlake"
]
},
{
@@ -595,17 +660,30 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 14,
"metadata": {
"id": "E3SCLmLX5bWG"
},
- "outputs": [],
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "np.int64(0)"
+ ]
+ },
+ "execution_count": 14,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
"source": [
"def greedy_policy(Qtable, state):\n",
" # Exploitation: take the action with the highest state, action value\n",
- " action =\n",
+ " action = np.argmax(Qtable[state])\n",
"\n",
- " return action"
+ " return action\n",
+ "\n",
+ "greedy_policy(Qtable_frozenlake, 2)"
]
},
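As a quick sanity check of the `greedy_policy` defined above: `np.argmax` returns the first index on ties, which is why an all-zero Q-table yields action 0.

```python
import numpy as np

def greedy_policy(Qtable, state):
    # Exploitation: take the action with the highest Q-value for this state
    return np.argmax(Qtable[state])

Qtable = np.zeros((16, 4))
print(greedy_policy(Qtable, 2))  # 0: all values tie, argmax returns the first index

Qtable[2, 3] = 1.0
print(greedy_policy(Qtable, 2))  # 3: action 3 now dominates in state 2
```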
{
@@ -638,7 +716,7 @@
"id": "flILKhBU3yZ7"
},
"source": [
- "##Define the epsilon-greedy policy 🤖\n",
+ "## Define the epsilon-greedy policy 🤖\n",
"\n",
"Epsilon-greedy is the training policy that handles the exploration/exploitation trade-off.\n",
"\n",
@@ -655,7 +733,7 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 15,
"metadata": {
"id": "6Bj7x3in3_Pq"
},
@@ -663,15 +741,15 @@
"source": [
"def epsilon_greedy_policy(Qtable, state, epsilon):\n",
" # Randomly generate a number between 0 and 1\n",
- " random_num =\n",
+ " random_num = np.random.random()\n",
" # if random_num > greater than epsilon --> exploitation\n",
" if random_num > epsilon:\n",
" # Take the action with the highest value given a state\n",
" # np.argmax can be useful here\n",
- " action =\n",
+ " action = greedy_policy(Qtable, state)\n",
" # else --> exploration\n",
" else:\n",
- " action = # Take a random action\n",
+ " action = env.action_space.sample() # np.random.randint(0, Qtable[state].size) # Take a random action\n",
"\n",
" return action"
]
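A self-contained variant of the same epsilon-greedy logic, using a NumPy generator instead of `env.action_space.sample()` so it runs without Gymnasium (a sketch; the fixed seed is only for reproducibility):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy_policy(Qtable, state, epsilon):
    if rng.random() > epsilon:
        # Exploitation: best known action for this state
        return int(np.argmax(Qtable[state]))
    # Exploration: uniform random action
    return int(rng.integers(0, Qtable.shape[1]))

Qtable = np.zeros((16, 4))
Qtable[0, 1] = 1.0
# epsilon=0 -> always exploit; epsilon=1 -> always explore
print(epsilon_greedy_policy(Qtable, 0, epsilon=0.0))  # 1
actions = {epsilon_greedy_policy(Qtable, 0, epsilon=1.0) for _ in range(200)}
print(sorted(actions))  # all four actions get sampled under pure exploration
```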
@@ -724,7 +802,7 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 16,
"metadata": {
"id": "Y1tWn0tycWZ1"
},
@@ -778,12 +856,13 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 17,
"metadata": {
"id": "paOynXy3aoJW"
},
"outputs": [],
"source": [
+ "from tqdm import tqdm\n",
"def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable):\n",
" for episode in tqdm(range(n_training_episodes)):\n",
" # Reduce epsilon (because we need less and less exploration)\n",
@@ -796,15 +875,16 @@
"\n",
" # repeat\n",
" for step in range(max_steps):\n",
- " # Choose the action At using epsilon greedy policy\n",
- " action =\n",
+ " # TODO: Choose the action At using epsilon greedy policy\n",
+ " action = epsilon_greedy_policy(Qtable, state, epsilon)\n",
"\n",
- " # Take action At and observe Rt+1 and St+1\n",
- " # Take the action (a) and observe the outcome state(s') and reward (r)\n",
- " new_state, reward, terminated, truncated, info =\n",
+ " # TODO: Take action At and observe Rt+1 and St+1\n",
+ " # TODO: Take the action (a) and observe the outcome state(s') and reward (r)\n",
+ " new_state, reward, terminated, truncated, info = env.step(action)\n",
"\n",
- " # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]\n",
- " Qtable[state][action] =\n",
+ " # TODO: Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]\n",
+ " old_Qsa = Qtable[state][action]\n",
+ " Qtable[state][action] = old_Qsa + learning_rate * (reward + gamma * np.max(Qtable[new_state]) - old_Qsa)\n",
"\n",
" # If terminated or truncated finish the episode\n",
" if terminated or truncated:\n",
@@ -874,11 +954,19 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 18,
"metadata": {
"id": "DPBxfjJdTCOH"
},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 10000/10000 [00:00<00:00, 11230.14it/s]\n"
+ ]
+ }
+ ],
"source": [
"Qtable_frozenlake = train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable_frozenlake)"
]
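The TD update filled in inside `train` can be verified on a single hand-computed step (a sketch with illustrative state/action numbers):

```python
import numpy as np

learning_rate, gamma = 0.7, 0.95

Qtable = np.zeros((16, 4))
state, action, reward, new_state = 14, 2, 1.0, 15

# Q(s,a) <- Q(s,a) + lr * (r + gamma * max_a' Q(s',a') - Q(s,a))
old_Qsa = Qtable[state][action]
Qtable[state][action] = old_Qsa + learning_rate * (
    reward + gamma * np.max(Qtable[new_state]) - old_Qsa
)

# With Q(s',.) = 0 and Q(s,a) = 0: new value = 0.7 * 1.0 = 0.7
print(Qtable[state][action])  # 0.7
```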
@@ -894,11 +982,37 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 19,
"metadata": {
"id": "nmfchsTITw4q"
},
- "outputs": [],
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([[0.73509189, 0.77378094, 0.77378094, 0.73509189],\n",
+ " [0.73509189, 0. , 0.81450625, 0.77378094],\n",
+ " [0.77378094, 0.857375 , 0.77378094, 0.81450625],\n",
+ " [0.81450625, 0. , 0.77378094, 0.77378094],\n",
+ " [0.77378094, 0.81450625, 0. , 0.73509189],\n",
+ " [0. , 0. , 0. , 0. ],\n",
+ " [0. , 0.9025 , 0. , 0.81450625],\n",
+ " [0. , 0. , 0. , 0. ],\n",
+ " [0.81450625, 0. , 0.857375 , 0.77378094],\n",
+ " [0.81450625, 0.9025 , 0.9025 , 0. ],\n",
+ " [0.857375 , 0.95 , 0. , 0.857375 ],\n",
+ " [0. , 0. , 0. , 0. ],\n",
+ " [0. , 0. , 0. , 0. ],\n",
+ " [0. , 0.9025 , 0.95 , 0.857375 ],\n",
+ " [0.9025 , 0.95 , 1. , 0.9025 ],\n",
+ " [0. , 0. , 0. , 0. ]])"
+ ]
+ },
+ "execution_count": 19,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
"source": [
"Qtable_frozenlake"
]
@@ -916,7 +1030,7 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 20,
"metadata": {
"id": "jNl0_JO2cbkm"
},
@@ -972,11 +1086,33 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 21,
"metadata": {
"id": "fAgB7s0HEFMm"
},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 100/100 [00:00<00:00, 12881.37it/s]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Mean_reward=1.00 +/- 0.00\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "\n"
+ ]
+ }
+ ],
"source": [
"# Evaluate our Agent\n",
"mean_reward, std_reward = evaluate_agent(env, max_steps, n_eval_episodes, Qtable_frozenlake, eval_seed)\n",
@@ -1018,7 +1154,7 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 23,
"metadata": {
"id": "Jex3i9lZ8ksX"
},
@@ -1034,7 +1170,7 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 24,
"metadata": {
"id": "Qo57HBn3W74O"
},
@@ -1065,6 +1201,11 @@
},
{
"cell_type": "code",
+ "execution_count": 26,
+ "metadata": {
+ "id": "U4mdUTKkGnUd"
+ },
+ "outputs": [],
"source": [
"def push_to_hub(\n",
" repo_id, model, env, video_fps=1, local_repo_path=\"hub\"\n",
@@ -1194,12 +1335,7 @@
" )\n",
"\n",
" print(\"Your model is pushed to the Hub. You can view your model here: \", repo_url)"
- ],
- "metadata": {
- "id": "U4mdUTKkGnUd"
- },
- "execution_count": null,
- "outputs": []
+ ]
},
{
"cell_type": "markdown",
@@ -1269,7 +1405,7 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 27,
"metadata": {
"id": "FiMqxqVHg0I4"
},
@@ -1311,24 +1447,153 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 28,
"metadata": {
"id": "5sBo2umnXpPd"
},
- "outputs": [],
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "{'env_id': 'FrozenLake-v1',\n",
+ " 'max_steps': 99,\n",
+ " 'n_training_episodes': 10000,\n",
+ " 'n_eval_episodes': 100,\n",
+ " 'eval_seed': [],\n",
+ " 'learning_rate': 0.7,\n",
+ " 'gamma': 0.95,\n",
+ " 'max_epsilon': 1.0,\n",
+ " 'min_epsilon': 0.05,\n",
+ " 'decay_rate': 0.0005,\n",
+ " 'qtable': array([[0.73509189, 0.77378094, 0.77378094, 0.73509189],\n",
+ " [0.73509189, 0. , 0.81450625, 0.77378094],\n",
+ " [0.77378094, 0.857375 , 0.77378094, 0.81450625],\n",
+ " [0.81450625, 0. , 0.77378094, 0.77378094],\n",
+ " [0.77378094, 0.81450625, 0. , 0.73509189],\n",
+ " [0. , 0. , 0. , 0. ],\n",
+ " [0. , 0.9025 , 0. , 0.81450625],\n",
+ " [0. , 0. , 0. , 0. ],\n",
+ " [0.81450625, 0. , 0.857375 , 0.77378094],\n",
+ " [0.81450625, 0.9025 , 0.9025 , 0. ],\n",
+ " [0.857375 , 0.95 , 0. , 0.857375 ],\n",
+ " [0. , 0. , 0. , 0. ],\n",
+ " [0. , 0. , 0. , 0. ],\n",
+ " [0. , 0.9025 , 0.95 , 0.857375 ],\n",
+ " [0.9025 , 0.95 , 1. , 0.9025 ],\n",
+ " [0. , 0. , 0. , 0. ]])}"
+ ]
+ },
+ "execution_count": 28,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
"source": [
"model"
]
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 29,
"metadata": {
"id": "RpOTtSt83kPZ"
},
- "outputs": [],
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "e4d5c292dab14baa940d2ed46f0dd484",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Fetching 1 files: 0%| | 0/1 [00:00, ?it/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "d936b75185204c77bf43f7e60d55d730",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ ".gitattributes: 0.00B [00:00, ?B/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 100/100 [00:00<00:00, 15035.50it/s]\n",
+ "100%|██████████| 100/100 [00:00<00:00, 20026.28it/s]\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "False\n"
+ ]
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "b7cec9324f194da5a02d6aa4ff9f8bf0",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Processing Files (0 / 0) : | | 0.00B / 0.00B "
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "735207614ab44f32b930ea8f15c2196e",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "New Data Upload : | | 0.00B / 0.00B "
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "939cd7ae71654dd0b4ef8a2be93050a9",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ " ...d1efee86aa38377f5cc6/q-learning.pkl: 100%|##########| 915B / 915B "
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Your model is pushed to the Hub. You can view your model here: https://huggingface.co/turbo-maikol/q-FrozenLake-v1-4x4-noSlippery\n"
+ ]
+ }
+ ],
"source": [
- "username = \"\" # FILL THIS\n",
+ "username = \"turbo-maikol\" # FILL THIS\n",
"repo_name = \"q-FrozenLake-v1-4x4-noSlippery\"\n",
"push_to_hub(\n",
" repo_id=f\"{username}/{repo_name}\",\n",
@@ -1373,7 +1638,7 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 56,
"metadata": {
"id": "gL0wpeO8gpej"
},
@@ -1393,11 +1658,19 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 37,
"metadata": {
"id": "_TPNaGSZrgqA"
},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "There are 500 possible states\n"
+ ]
+ }
+ ],
"source": [
"state_space = env.observation_space.n\n",
"print(\"There are \", state_space, \" possible states\")"
@@ -1405,11 +1678,19 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 38,
"metadata": {
"id": "CdeeZuokrhit"
},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "There are 6 possible actions\n"
+ ]
+ }
+ ],
"source": [
"action_space = env.action_space.n\n",
"print(\"There are \", action_space, \" possible actions\")"
@@ -1439,11 +1720,26 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 39,
"metadata": {
"id": "US3yDXnEtY9I"
},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[[0. 0. 0. 0. 0. 0.]\n",
+ " [0. 0. 0. 0. 0. 0.]\n",
+ " [0. 0. 0. 0. 0. 0.]\n",
+ " ...\n",
+ " [0. 0. 0. 0. 0. 0.]\n",
+ " [0. 0. 0. 0. 0. 0.]\n",
+ " [0. 0. 0. 0. 0. 0.]]\n",
+ "Q-table shape: (500, 6)\n"
+ ]
+ }
+ ],
"source": [
"# Create our Q table with state_size rows and action_size columns (500x6)\n",
"Qtable_taxi = initialize_q_table(state_space, action_space)\n",
@@ -1464,14 +1760,14 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 46,
"metadata": {
"id": "AB6n__hhg7YS"
},
"outputs": [],
"source": [
"# Training parameters\n",
- "n_training_episodes = 25000 # Total training episodes\n",
+ "n_training_episodes = 1_000_000 # Total training episodes\n",
"learning_rate = 0.7 # Learning rate\n",
"\n",
"# Evaluation parameters\n",
@@ -1484,14 +1780,14 @@
" # Each seed has a specific starting state\n",
"\n",
"# Environment parameters\n",
- "env_id = \"Taxi-v3\" # Name of the environment\n",
- "max_steps = 99 # Max steps per episode\n",
- "gamma = 0.95 # Discounting rate\n",
+ "env_id = \"Taxi-v3\" # Name of the environment\n",
+ "max_steps = 1000 # Max steps per episode\n",
+ "gamma = 0.90 # Discounting rate\n",
"\n",
"# Exploration parameters\n",
"max_epsilon = 1.0 # Exploration probability at start\n",
- "min_epsilon = 0.05 # Minimum exploration probability\n",
- "decay_rate = 0.005 # Exponential decay rate for exploration prob\n"
+ "min_epsilon = 0.05 # Minimum exploration probability\n",
+ "decay_rate = 0.001 # Exponential decay rate for exploration prob\n"
]
},
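The exploration schedule implied by these parameters decays epsilon exponentially from `max_epsilon` toward `min_epsilon`; this sketch assumes the usual exponential-decay formula from the course's `train` loop:

```python
import numpy as np

max_epsilon, min_epsilon, decay_rate = 1.0, 0.05, 0.001

def epsilon_at(episode):
    # Exponential decay from max_epsilon toward min_epsilon
    return min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)

print(round(epsilon_at(0), 3))       # 1.0
print(round(epsilon_at(10_000), 3))  # 0.05: essentially pure exploitation by then
```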
{
@@ -1505,11 +1801,41 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 47,
"metadata": {
"id": "WwP3Y2z2eS-K"
},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 1000000/1000000 [04:31<00:00, 3685.46it/s]\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "array([[ 0. , 0. , 0. , 0. , 0. ,\n",
+ " 0. ],\n",
+ " [-0.58568212, 0.4603532 , -0.58568212, 0.4603532 , 1.62261467,\n",
+ " -8.5396468 ],\n",
+ " [ 4.348907 , 5.94323 , 4.348907 , 5.94323 , 7.7147 ,\n",
+ " -3.05677 ],\n",
+ " ...,\n",
+ " [ 7.7147 , 9.683 , 7.7147 , 5.94323 , -1.28529994,\n",
+ " -1.28529997],\n",
+ " [ 1.62261485, 2.9140163 , 1.62261467, 2.9140163 , -7.37738434,\n",
+ " -7.37738533],\n",
+ " [14.3 , 11.87 , 14.3 , 17. , 5.3 ,\n",
+ " 5.3 ]], shape=(500, 6))"
+ ]
+ },
+ "execution_count": 47,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
"source": [
"Qtable_taxi = train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable_taxi)\n",
"Qtable_taxi"
@@ -1528,7 +1854,7 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 62,
"metadata": {
"id": "0a1FpE_3hNYr"
},
@@ -1554,14 +1880,108 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 57,
"metadata": {
"id": "dhQtiQozhOn1"
},
- "outputs": [],
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "9af9e29c591e4d18b2095650f25420f1",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Fetching 5 files: 0%| | 0/5 [00:00, ?it/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 100/100 [00:00<00:00, 5536.60it/s]\n",
+ "100%|██████████| 100/100 [00:00<00:00, 3498.34it/s]\n",
+ "IMAGEIO FFMPEG_WRITER WARNING: input image is not divisible by macro_block_size=16, resizing from (550, 350) to (560, 352) to ensure video compatibility with most codecs and players. To prevent resizing, make your input image divisible by the macro_block_size or set the macro_block_size to 1 (risking incompatibility).\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "True\n"
+ ]
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "086883edd0094686bbdc96d55b16e223",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Processing Files (0 / 0) : | | 0.00B / 0.00B "
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "f7558465d2ba49ac921286f9b8bf3488",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "New Data Upload : | | 0.00B / 0.00B "
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "e66c844d130a4596a3998d30f0af6754",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ " ...01fbfd9e2ed75397e4c7/q-learning.pkl: 100%|##########| 24.6kB / 24.6kB "
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "b8c2768b25794da980f7d1264b32ba1f",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ " ...59e101fbfd9e2ed75397e4c7/replay.mp4: 100%|##########| 117kB / 117kB "
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Your model is pushed to the Hub. You can view your model here: https://huggingface.co/turbo-maikol/rl-course-unit2\n"
+ ]
+ }
+ ],
"source": [
- "username = \"\" # FILL THIS\n",
- "repo_name = \"\" # FILL THIS\n",
+ "username = \"turbo-maikol\" # FILL THIS\n",
+ "repo_name = \"rl-course-unit2\" # FILL THIS\n",
"push_to_hub(\n",
" repo_id=f\"{username}/{repo_name}\",\n",
" model=model,\n",
@@ -1620,7 +2040,7 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 60,
"metadata": {
"id": "Eo8qEzNtCaVI"
},
@@ -1660,13 +2080,50 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 61,
"metadata": {
"id": "JUm9lz2gCQcU"
},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "{'env_id': 'Taxi-v3', 'max_steps': 1000, 'n_training_episodes': 1000000, 'n_eval_episodes': 100, 'eval_seed': [16, 54, 165, 177, 191, 191, 120, 80, 149, 178, 48, 38, 6, 125, 174, 73, 50, 172, 100, 148, 146, 6, 25, 40, 68, 148, 49, 167, 9, 97, 164, 176, 61, 7, 54, 55, 161, 131, 184, 51, 170, 12, 120, 113, 95, 126, 51, 98, 36, 135, 54, 82, 45, 95, 89, 59, 95, 124, 9, 113, 58, 85, 51, 134, 121, 169, 105, 21, 30, 11, 50, 65, 12, 43, 82, 145, 152, 97, 106, 55, 31, 85, 38, 112, 102, 168, 123, 97, 21, 83, 158, 26, 80, 63, 5, 81, 32, 11, 28, 148], 'learning_rate': 0.7, 'gamma': 0.9, 'max_epsilon': 1.0, 'min_epsilon': 0.05, 'decay_rate': 0.001, 'qtable': array([[ 0. , 0. , 0. , 0. , 0. ,\n",
+ " 0. ],\n",
+ " [-0.58568212, 0.4603532 , -0.58568212, 0.4603532 , 1.62261467,\n",
+ " -8.5396468 ],\n",
+ " [ 4.348907 , 5.94323 , 4.348907 , 5.94323 , 7.7147 ,\n",
+ " -3.05677 ],\n",
+ " ...,\n",
+ " [ 7.7147 , 9.683 , 7.7147 , 5.94323 , -1.28529994,\n",
+ " -1.28529997],\n",
+ " [ 1.62261485, 2.9140163 , 1.62261467, 2.9140163 , -7.37738434,\n",
+ " -7.37738533],\n",
+ " [14.3 , 11.87 , 14.3 , 17. , 5.3 ,\n",
+ " 5.3 ]], shape=(500, 6))}\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 100/100 [00:00<00:00, 2779.93it/s]\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "(np.float64(7.56), np.float64(2.706732347314747))"
+ ]
+ },
+ "execution_count": 61,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
"source": [
- "model = load_from_hub(repo_id=\"ThomasSimonini/q-Taxi-v3\", filename=\"q-learning.pkl\") # Try to use another model\n",
+ "model = load_from_hub(repo_id=\"turbo-maikol/rl-course-unit2\", filename=\"q-learning.pkl\") # Try to use another model\n",
"\n",
"print(model)\n",
"env = gym.make(model[\"env_id\"])\n",
@@ -1675,18 +2132,10 @@
]
},
{
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "O7pL8rg1MulN"
- },
- "outputs": [],
+ "cell_type": "markdown",
+ "metadata": {},
"source": [
- "model = load_from_hub(repo_id=\"ThomasSimonini/q-FrozenLake-v1-no-slippery\", filename=\"q-learning.pkl\") # Try to use another model\n",
- "\n",
- "env = gym.make(model[\"env_id\"], is_slippery=False)\n",
- "\n",
- "evaluate_agent(env, model[\"max_steps\"], model[\"n_eval_episodes\"], model[\"qtable\"], model[\"eval_seed\"])"
+ "np.float64(7.56), np.float64(2.706732347314747)"
]
},
{
@@ -1748,25 +2197,35 @@
],
"metadata": {
"colab": {
- "private_outputs": true,
- "provenance": [],
"collapsed_sections": [
"67OdoKL63eDD",
"B2_-8b8z5k54",
"8R5ej1fS4P2V",
"Pnpk2ePoem3r"
],
- "include_colab_link": true
+ "include_colab_link": true,
+ "private_outputs": true,
+ "provenance": []
},
"gpuClass": "standard",
"kernelspec": {
- "display_name": "Python 3",
- "name": "python3"
+ "display_name": "Python (venv-u2)",
+ "language": "python",
+ "name": "venv-u2"
},
"language_info": {
- "name": "python"
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.18"
}
},
"nbformat": 4,
"nbformat_minor": 0
-}
\ No newline at end of file
+}
diff --git a/notebooks/unit3/unit3.ipynb b/notebooks/unit3/unit3.ipynb
index bcd3410..43265af 100644
--- a/notebooks/unit3/unit3.ipynb
+++ b/notebooks/unit3/unit3.ipynb
@@ -3,8 +3,8 @@
{
"cell_type": "markdown",
"metadata": {
- "id": "view-in-github",
- "colab_type": "text"
+ "colab_type": "text",
+ "id": "view-in-github"
},
"source": [
"
"
@@ -41,6 +41,9 @@
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "ykJiGevCMVc5"
+ },
"source": [
"### 🎮 Environments:\n",
"\n",
@@ -51,10 +54,7 @@
"### 📚 RL-Library:\n",
"\n",
"- [RL-Baselines3-Zoo](https://github.com/DLR-RM/rl-baselines3-zoo)"
- ],
- "metadata": {
- "id": "ykJiGevCMVc5"
- }
+ ]
},
{
"cell_type": "markdown",
@@ -72,13 +72,13 @@
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "TsnP0rjxMn1e"
+ },
"source": [
"## This notebook is from Deep Reinforcement Learning Course\n",
"
"
- ],
- "metadata": {
- "id": "TsnP0rjxMn1e"
- }
+ ]
},
{
"cell_type": "markdown",
@@ -114,12 +114,12 @@
},
{
"cell_type": "markdown",
- "source": [
- "We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the Github Repo](https://github.com/huggingface/deep-rl-class/issues)."
- ],
"metadata": {
"id": "7kszpGFaRVhq"
- }
+ },
+ "source": [
+ "We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the Github Repo](https://github.com/huggingface/deep-rl-class/issues)."
+ ]
},
{
"cell_type": "markdown",
@@ -142,6 +142,9 @@
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "Nc8BnyVEc3Ys"
+ },
"source": [
"## An advice 💡\n",
"It's better to run this colab in a copy on your Google Drive, so that **if it timeouts** you still have the saved notebook on your Google Drive and do not need to fill everything from scratch.\n",
@@ -151,66 +154,63 @@
"Also, we're going to **train it for 90 minutes with 1M timesteps**. By typing `!nvidia-smi` will tell you what GPU you're using.\n",
"\n",
"And if you want to train more such 10 million steps, this will take about 9 hours, potentially resulting in Colab timing out. In that case, I recommend running this on your local computer (or somewhere else). Just click on: `File>Download`."
- ],
- "metadata": {
- "id": "Nc8BnyVEc3Ys"
- }
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "PU4FVzaoM6fC"
+ },
"source": [
"## Set the GPU 💪\n",
"- To **accelerate the agent's training, we'll use a GPU**. To do that, go to `Runtime > Change Runtime type`\n",
"\n",
"
"
- ],
- "metadata": {
- "id": "PU4FVzaoM6fC"
- }
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "KV0NyFdQM9ZG"
+ },
"source": [
"- `Hardware Accelerator > GPU`\n",
"\n",
"
"
- ],
- "metadata": {
- "id": "KV0NyFdQM9ZG"
- }
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "wS_cVefO-aYg"
+ },
"source": [
"# Install RL-Baselines3 Zoo and its dependencies 📚\n",
"\n",
"If you see `ERROR: pip's dependency resolver does not currently take into account all the packages that are installed.` **this is normal and it's not a critical error** there's a conflict of version. But the packages we need are installed."
- ],
- "metadata": {
- "id": "wS_cVefO-aYg"
- }
+ ]
},
{
"cell_type": "code",
- "source": [
- "!pip install git+https://github.com/DLR-RM/rl-baselines3-zoo"
- ],
+ "execution_count": null,
"metadata": {
"id": "S1A_E4z3awa_"
},
- "execution_count": null,
- "outputs": []
+ "outputs": [],
+ "source": [
+ "!pip install git+https://github.com/DLR-RM/rl-baselines3-zoo"
+ ]
},
{
"cell_type": "code",
- "source": [
- "!apt-get install swig cmake ffmpeg"
- ],
+ "execution_count": null,
"metadata": {
"id": "8_MllY6Om1eI"
},
- "execution_count": null,
- "outputs": []
+ "outputs": [],
+ "source": [
+ "!apt-get install swig cmake ffmpeg"
+ ]
},
{
"cell_type": "markdown",
@@ -223,28 +223,28 @@
},
{
"cell_type": "code",
- "source": [
- "!pip install gymnasium[atari]\n",
- "!pip install gymnasium[accept-rom-license]"
- ],
+ "execution_count": null,
"metadata": {
"id": "NsRP-lX1_2fC"
},
- "execution_count": null,
- "outputs": []
+ "outputs": [],
+ "source": [
+ "!pip install gymnasium[atari]\n",
+ "!pip install gymnasium[accept-rom-license]"
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "bTpYcVZVMzUI"
+ },
"source": [
"## Create a virtual display 🔽\n",
"\n",
"During the notebook, we'll need to generate a replay video. To do so, with colab, **we need to have a virtual screen to be able to render the environment** (and thus record the frames).\n",
"\n",
"Hence the following cell will install the librairies and create and run a virtual screen 🖥"
- ],
- "metadata": {
- "id": "bTpYcVZVMzUI"
- }
+ ]
},
{
"cell_type": "code",
@@ -262,18 +262,18 @@
},
{
"cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "BE5JWP5rQIKf"
+ },
+ "outputs": [],
"source": [
"# Virtual display\n",
"from pyvirtualdisplay import Display\n",
"\n",
"virtual_display = Display(visible=0, size=(1400, 900))\n",
"virtual_display.start()"
- ],
- "metadata": {
- "id": "BE5JWP5rQIKf"
- },
- "execution_count": null,
- "outputs": []
+ ]
},
{
"cell_type": "markdown",
@@ -360,7 +360,7 @@
},
"outputs": [],
"source": [
- "!python -m rl_zoo3.train --algo ________ --env SpaceInvadersNoFrameskip-v4 -f _________ -c _________"
+ "!python -m rl_zoo3.train --algo dqn --env SpaceInvadersNoFrameskip-v4 -f logs/ -c dqn.yml"
]
},
{
@@ -396,13 +396,185 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 1,
"metadata": {
"id": "co5um_KeKbBJ"
},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Loading latest experiment, id=2\n",
+ "Loading logs/dqn/SpaceInvadersNoFrameskip-v4_2/SpaceInvadersNoFrameskip-v4.zip\n",
+ "A.L.E: Arcade Learning Environment (version 0.11.2+ecc1138)\n",
+ "[Powered by Stella]\n",
+ "Stacking 4 frames\n",
+ "Atari Episode Score: 5.00\n",
+ "Atari Episode Length 1973\n",
+ "Atari Episode Score: 5.00\n",
+ "Atari Episode Length 2771\n",
+ "Atari Episode Score: 0.00\n",
+ "Atari Episode Length 2769\n",
+ "Atari Episode Score: 25.00\n",
+ "Atari Episode Length 1973\n",
+ "Atari Episode Score: 0.00\n",
+ "Atari Episode Length 2769\n",
+ "Atari Episode Score: 0.00\n",
+ "Atari Episode Length 2769\n",
+ "Atari Episode Score: 0.00\n",
+ "Atari Episode Length 2769\n",
+ "Atari Episode Score: 5.00\n",
+ "Atari Episode Length 2709\n",
+ "Atari Episode Score: 5.00\n",
+ "Atari Episode Length 2709\n",
+ "Atari Episode Score: 5.00\n",
+ "Atari Episode Length 1943\n",
+ "Atari Episode Score: 0.00\n",
+ "Atari Episode Length 2769\n",
+ "Atari Episode Score: 0.00\n",
+ "Atari Episode Length 2769\n",
+ "Atari Episode Score: 35.00\n",
+ "Atari Episode Length 1891\n",
+ "Atari Episode Score: 15.00\n",
+ "Atari Episode Length 2727\n",
+ "Atari Episode Score: 5.00\n",
+ "Atari Episode Length 2749\n",
+ "Atari Episode Score: 0.00\n",
+ "Atari Episode Length 2769\n",
+ "Atari Episode Score: 5.00\n",
+ "Atari Episode Length 1985\n",
+ "Atari Episode Score: 0.00\n",
+ "Atari Episode Length 2769\n",
+ "Atari Episode Score: 15.00\n",
+ "Atari Episode Length 2769\n",
+ "Atari Episode Score: 30.00\n",
+ "Atari Episode Length 2727\n",
+ "Atari Episode Score: 0.00\n",
+ "Atari Episode Length 2769\n",
+ "Atari Episode Score: 5.00\n",
+ "Atari Episode Length 2709\n",
+ "Atari Episode Score: 5.00\n",
+ "Atari Episode Length 2787\n",
+ "Atari Episode Score: 5.00\n",
+ "Atari Episode Length 2787\n",
+ "Atari Episode Score: 0.00\n",
+ "Atari Episode Length 2769\n",
+ "Atari Episode Score: 5.00\n",
+ "Atari Episode Length 1927\n",
+ "Atari Episode Score: 0.00\n",
+ "Atari Episode Length 2769\n",
+ "Atari Episode Score: 30.00\n",
+ "Atari Episode Length 2077\n",
+ "Atari Episode Score: 35.00\n",
+ "Atari Episode Length 1973\n",
+ "Atari Episode Score: 5.00\n",
+ "Atari Episode Length 2709\n",
+ "Atari Episode Score: 30.00\n",
+ "Atari Episode Length 2749\n",
+ "Atari Episode Score: 0.00\n",
+ "Atari Episode Length 2769\n",
+ "Atari Episode Score: 15.00\n",
+ "Atari Episode Length 2699\n",
+ "Atari Episode Score: 5.00\n",
+ "Atari Episode Length 2769\n",
+ "Atari Episode Score: 5.00\n",
+ "Atari Episode Length 2709\n",
+ "Atari Episode Score: 5.00\n",
+ "Atari Episode Length 1973\n",
+ "Atari Episode Score: 25.00\n",
+ "Atari Episode Length 2001\n",
+ "Atari Episode Score: 5.00\n",
+ "Atari Episode Length 2069\n",
+ "Atari Episode Score: 5.00\n",
+ "Atari Episode Length 2863\n",
+ "Atari Episode Score: 5.00\n",
+ "Atari Episode Length 1973\n",
+ "Atari Episode Score: 0.00\n",
+ "Atari Episode Length 2769\n",
+ "Atari Episode Score: 5.00\n",
+ "Atari Episode Length 2769\n",
+ "Atari Episode Score: 5.00\n",
+ "Atari Episode Length 1943\n",
+ "Atari Episode Score: 10.00\n",
+ "Atari Episode Length 2675\n",
+ "Atari Episode Score: 0.00\n",
+ "Atari Episode Length 2769\n",
+ "Atari Episode Score: 0.00\n",
+ "Atari Episode Length 2775\n",
+ "Atari Episode Score: 5.00\n",
+ "Atari Episode Length 2025\n",
+ "Atari Episode Score: 5.00\n",
+ "Atari Episode Length 2709\n",
+ "Atari Episode Score: 15.00\n",
+ "Atari Episode Length 2709\n",
+ "Atari Episode Score: 35.00\n",
+ "Atari Episode Length 2787\n",
+ "Atari Episode Score: 5.00\n",
+ "Atari Episode Length 1925\n",
+ "Atari Episode Score: 15.00\n",
+ "Atari Episode Length 2699\n",
+ "Atari Episode Score: 10.00\n",
+ "Atari Episode Length 2769\n",
+ "Atari Episode Score: 0.00\n",
+ "Atari Episode Length 2769\n",
+ "Atari Episode Score: 35.00\n",
+ "Atari Episode Length 2709\n",
+ "Atari Episode Score: 5.00\n",
+ "Atari Episode Length 1973\n",
+ "Atari Episode Score: 5.00\n",
+ "Atari Episode Length 2769\n",
+ "Atari Episode Score: 0.00\n",
+ "Atari Episode Length 2769\n",
+ "Atari Episode Score: 5.00\n",
+ "Atari Episode Length 2771\n",
+ "Atari Episode Score: 15.00\n",
+ "Atari Episode Length 2709\n",
+ "Atari Episode Score: 20.00\n",
+ "Atari Episode Length 1943\n",
+ "Atari Episode Score: 0.00\n",
+ "Atari Episode Length 2769\n",
+ "Atari Episode Score: 0.00\n",
+ "Atari Episode Length 2769\n",
+ "Atari Episode Score: 5.00\n",
+ "Atari Episode Length 1973\n",
+ "Atari Episode Score: 0.00\n",
+ "Atari Episode Length 2769\n",
+ "Atari Episode Score: 5.00\n",
+ "Atari Episode Length 1973\n",
+ "Atari Episode Score: 0.00\n",
+ "Atari Episode Length 2769\n",
+ "Atari Episode Score: 15.00\n",
+ "Atari Episode Length 2749\n",
+ "Atari Episode Score: 0.00\n",
+ "Atari Episode Length 2769\n",
+ "Atari Episode Score: 5.00\n",
+ "Atari Episode Length 1973\n",
+ "Atari Episode Score: 5.00\n",
+ "Atari Episode Length 1943\n",
+ "Atari Episode Score: 0.00\n",
+ "Atari Episode Length 2769\n",
+ "Atari Episode Score: 15.00\n",
+ "Atari Episode Length 2025\n",
+ "Atari Episode Score: 5.00\n",
+ "Atari Episode Length 2749\n",
+ "Atari Episode Score: 0.00\n",
+ "Atari Episode Length 2769\n",
+ "Atari Episode Score: 0.00\n",
+ "Atari Episode Length 2769\n",
+ "Atari Episode Score: 0.00\n",
+ "Atari Episode Length 2769\n",
+ "Atari Episode Score: 30.00\n",
+ "Atari Episode Length 2769\n",
+ "Atari Episode Score: 30.00\n",
+ "Atari Episode Length 2769\n",
+ "Atari Episode Score: 5.00\n",
+ "Atari Episode Length 2769\n"
+ ]
+ }
+ ],
"source": [
- "!python -m rl_zoo3.enjoy --algo dqn --env SpaceInvadersNoFrameskip-v4 --no-render --n-timesteps _________ --folder logs/"
+ "!python -m rl_zoo3.enjoy --algo dqn --env SpaceInvadersNoFrameskip-v4 --no-render --n-timesteps 50000 --folder logs/"
]
},
{
@@ -534,7 +706,7 @@
},
"outputs": [],
"source": [
- "!python -m rl_zoo3.push_to_hub --algo dqn --env SpaceInvadersNoFrameskip-v4 --repo-name _____________________ -orga _____________________ -f logs/"
+ "!python -m rl_zoo3.push_to_hub --algo dqn --env SpaceInvadersNoFrameskip-v4 --repo-name rl-course-unit3 -orga turbo-maikol -f logs/"
]
},
{
@@ -627,11 +799,26 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 1,
"metadata": {
"id": "OdBNZHy0NGTR"
},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Downloading from https://huggingface.co/sb3/dqn-BeamRiderNoFrameskip-v4\n",
+ "dqn-BeamRiderNoFrameskip-v4.zip: 100%|█████| 27.2M/27.2M [00:02<00:00, 12.6MB/s]\n",
+ "config.yml: 100%|██████████████████████████████| 548/548 [00:00<00:00, 4.99MB/s]\n",
+ "No normalization file\n",
+ "args.yml: 100%|████████████████████████████████| 887/887 [00:00<00:00, 4.13MB/s]\n",
+ "env_kwargs.yml: 100%|████████████████████████| 3.00/3.00 [00:00<00:00, 9.20kB/s]\n",
+ "train_eval_metrics.zip: 100%|████████████████| 244k/244k [00:00<00:00, 12.7MB/s]\n",
+ "Saving to rl_trained/dqn/BeamRiderNoFrameskip-v4_1\n"
+ ]
+ }
+ ],
"source": [
"# Download model and save it into the logs/ folder\n",
"!python -m rl_zoo3.load_from_hub --algo dqn --env BeamRiderNoFrameskip-v4 -orga sb3 -f rl_trained/"
@@ -648,11 +835,35 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 3,
"metadata": {
"id": "aOxs0rNuN0uS"
},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Gym has been unmaintained since 2022 and does not support NumPy 2.0 amongst other critical functionality.\n",
+ "Please upgrade to Gymnasium, the maintained drop-in replacement of Gym, or contact the authors of your software and request that they upgrade.\n",
+ "Users of this version of Gym should be able to simply replace 'import gym' with 'import gymnasium as gym' in the vast majority of cases.\n",
+ "See the migration guide at https://gymnasium.farama.org/introduction/migration_guide/ for additional information.\n",
+ "Loading latest experiment, id=1\n",
+ "Loading rl_trained/dqn/BeamRiderNoFrameskip-v4_1/BeamRiderNoFrameskip-v4.zip\n",
+ "A.L.E: Arcade Learning Environment (version 0.11.2+ecc1138)\n",
+ "[Powered by Stella]\n",
+ "Stacking 4 frames\n",
+ "/home/mique/Desktop/Code/deep-rl-class/notebooks/unit2/venv-u2/lib/python3.10/site-packages/stable_baselines3/common/save_util.py:167: UserWarning: Could not deserialize object exploration_schedule. Consider using `custom_objects` argument to replace this object.\n",
+ "Exception: 'bytes' object cannot be interpreted as an integer\n",
+ " warnings.warn(\n",
+ "/home/mique/Desktop/Code/deep-rl-class/notebooks/unit2/venv-u2/lib/python3.10/site-packages/stable_baselines3/common/vec_env/patch_gym.py:95: UserWarning: You loaded a model that was trained using OpenAI Gym. We strongly recommend transitioning to Gymnasium by saving that model again.\n",
+ " warnings.warn(\n",
+ "/home/mique/Desktop/Code/deep-rl-class/notebooks/unit2/venv-u2/lib/python3.10/site-packages/stable_baselines3/common/base_class.py:773: UserWarning: You are probably loading a DQN model saved with SB3 < 2.4.0, we truncated the optimizer state so you can save the model again to avoid issues in the future (see https://github.com/DLR-RM/stable-baselines3/pull/1963 for more info). Original error: loaded state dict contains a parameter group that doesn't match the size of optimizer's group \n",
+ "Note: the model should still work fine, this only a warning.\n",
+ " warnings.warn(\n"
+ ]
+ }
+ ],
"source": [
"!python -m rl_zoo3.enjoy --algo dqn --env BeamRiderNoFrameskip-v4 -n 5000 -f rl_trained/ --no-render"
]
@@ -734,12 +945,12 @@
},
{
"cell_type": "markdown",
- "source": [
- "See you on Bonus unit 2! 🔥"
- ],
"metadata": {
"id": "Kc3udPT-RcXc"
- }
+ },
+ "source": [
+ "See you on Bonus unit 2! 🔥"
+ ]
},
{
"cell_type": "markdown",
@@ -752,13 +963,15 @@
}
],
"metadata": {
+ "accelerator": "GPU",
"colab": {
+ "include_colab_link": true,
"private_outputs": true,
- "provenance": [],
- "include_colab_link": true
+ "provenance": []
},
+ "gpuClass": "standard",
"kernelspec": {
- "display_name": "Python 3 (ipykernel)",
+ "display_name": "venv-u2",
"language": "python",
"name": "python3"
},
@@ -772,7 +985,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.10.6"
+ "version": "3.10.18"
},
"varInspector": {
"cols": {
@@ -802,10 +1015,8 @@
"_Feature"
],
"window_display": false
- },
- "accelerator": "GPU",
- "gpuClass": "standard"
+ }
},
"nbformat": 4,
"nbformat_minor": 0
-}
\ No newline at end of file
+}
diff --git a/notebooks/unit4/unit4.ipynb b/notebooks/unit4/unit4.ipynb
index 884eddd..afa1d5c 100644
--- a/notebooks/unit4/unit4.ipynb
+++ b/notebooks/unit4/unit4.ipynb
@@ -3,8 +3,8 @@
{
"cell_type": "markdown",
"metadata": {
- "id": "view-in-github",
- "colab_type": "text"
+ "colab_type": "text",
+ "id": "view-in-github"
},
"source": [
"
"
@@ -36,15 +36,18 @@
},
{
"cell_type": "markdown",
- "source": [
- "
\n"
- ],
"metadata": {
"id": "s4rBom2sbo7S"
- }
+ },
+ "source": [
+ "
\n"
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "BPLwsPajb1f8"
+ },
"source": [
"### 🎮 Environments: \n",
"\n",
@@ -58,10 +61,7 @@
"\n",
"\n",
"We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the GitHub Repo](https://github.com/huggingface/deep-rl-class/issues)."
- ],
- "metadata": {
- "id": "BPLwsPajb1f8"
- }
+ ]
},
{
"cell_type": "markdown",
@@ -120,6 +120,9 @@
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "Bsh4ZAamchSl"
+ },
"source": [
"# Let's code Reinforce algorithm from scratch 🔥\n",
"\n",
@@ -132,58 +135,55 @@
"To find your result, go to the leaderboard and find your model, **the result = mean_reward - std of reward**. **If you don't see your model on the leaderboard, go at the bottom of the leaderboard page and click on the refresh button**.\n",
"\n",
"For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process\n"
- ],
- "metadata": {
- "id": "Bsh4ZAamchSl"
- }
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "JoTC9o2SczNn"
+ },
"source": [
"## An advice 💡\n",
"It's better to run this colab in a copy on your Google Drive, so that **if it timeouts** you still have the saved notebook on your Google Drive and do not need to fill everything from scratch.\n",
"\n",
"To do that you can either do `Ctrl + S` or `File > Save a copy in Google Drive.`"
- ],
- "metadata": {
- "id": "JoTC9o2SczNn"
- }
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "PU4FVzaoM6fC"
+ },
"source": [
"## Set the GPU 💪\n",
"- To **accelerate the agent's training, we'll use a GPU**. To do that, go to `Runtime > Change Runtime type`\n",
"\n",
"
"
- ],
- "metadata": {
- "id": "PU4FVzaoM6fC"
- }
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "KV0NyFdQM9ZG"
+ },
"source": [
"- `Hardware Accelerator > GPU`\n",
"\n",
"
"
- ],
- "metadata": {
- "id": "KV0NyFdQM9ZG"
- }
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "bTpYcVZVMzUI"
+ },
"source": [
"## Create a virtual display 🖥\n",
"\n",
"During the notebook, we'll need to generate a replay video. To do so, with colab, **we need to have a virtual screen to be able to render the environment** (and thus record the frames). \n",
"\n",
"Hence the following cell will install the librairies and create and run a virtual screen 🖥"
- ],
- "metadata": {
- "id": "bTpYcVZVMzUI"
- }
+ ]
},
{
"cell_type": "code",
@@ -203,18 +203,18 @@
},
{
"cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "Sr-Nuyb1dBm0"
+ },
+ "outputs": [],
"source": [
"# Virtual display\n",
"from pyvirtualdisplay import Display\n",
"\n",
"virtual_display = Display(visible=0, size=(1400, 900))\n",
"virtual_display.start()"
- ],
- "metadata": {
- "id": "Sr-Nuyb1dBm0"
- },
- "execution_count": null,
- "outputs": []
+ ]
},
{
"cell_type": "markdown",
@@ -245,14 +245,14 @@
},
{
"cell_type": "code",
- "source": [
- "!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit4/requirements-unit4.txt"
- ],
+ "execution_count": null,
"metadata": {
"id": "e8ZVi-uydpgL"
},
- "execution_count": null,
- "outputs": []
+ "outputs": [],
+ "source": [
+ "!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit4/requirements-unit4.txt"
+ ]
},
{
"cell_type": "markdown",
@@ -269,7 +269,7 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 25,
"metadata": {
"id": "V8oadoJSWp7C"
},
@@ -290,46 +290,47 @@
"from torch.distributions import Categorical\n",
"\n",
"# Gym\n",
- "import gym\n",
- "import gym_pygame\n",
+ "import gymnasium as gym\n",
+ "# import gym_pygame\n",
"\n",
"# Hugging Face Hub\n",
"from huggingface_hub import notebook_login # To log to our Hugging Face account to be able to upload models to the Hub.\n",
- "import imageio"
+ "import imageio\n",
+ "\n",
+ "%load_ext autoreload\n",
+ "%autoreload 2"
]
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "RfxJYdMeeVgv"
+ },
"source": [
"## Check if we have a GPU\n",
"\n",
"- Let's check if we have a GPU\n",
"- If it's the case you should see `device:cuda0`"
- ],
- "metadata": {
- "id": "RfxJYdMeeVgv"
- }
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "kaJu5FeZxXGY"
- },
- "outputs": [],
- "source": [
- "device = torch.device(\"cuda:0\" if torch.cuda.is_available() else \"cpu\")"
]
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 5,
"metadata": {
- "id": "U5TNYa14aRav"
+ "id": "kaJu5FeZxXGY"
},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "cuda:0\n"
+ ]
+ }
+ ],
"source": [
- "print(device)"
+ "device = torch.device(\"cuda:0\" if torch.cuda.is_available() else \"cpu\")\n",
+ "print(device)\n"
]
},
{
@@ -393,7 +394,7 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 6,
"metadata": {
"id": "POOOk15_K6KA"
},
@@ -404,7 +405,7 @@
"env = gym.make(env_id)\n",
"\n",
"# Create the evaluation env\n",
- "eval_env = gym.make(env_id)\n",
+ "eval_env = gym.make(env_id, render_mode=\"rgb_array\")\n",
"\n",
"# Get the state space and action space\n",
"s_size = env.observation_space.shape[0]\n",
@@ -413,11 +414,22 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 7,
"metadata": {
"id": "FMLFrjiBNLYJ"
},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "_____OBSERVATION SPACE_____ \n",
+ "\n",
+ "The State Space is: 4\n",
+ "Sample observation [-0.92062986 -0.65902454 0.2579916 -0.6175645 ]\n"
+ ]
+ }
+ ],
"source": [
"print(\"_____OBSERVATION SPACE_____ \\n\")\n",
"print(\"The State Space is: \", s_size)\n",
@@ -426,11 +438,23 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 8,
"metadata": {
"id": "Lu6t4sRNNWkN"
},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n",
+ " _____ACTION SPACE_____ \n",
+ "\n",
+ "The Action Space is: 2\n",
+ "Action Space Sample 1\n"
+ ]
+ }
+ ],
"source": [
"print(\"\\n _____ACTION SPACE_____ \\n\")\n",
"print(\"The Action Space is: \", a_size)\n",
@@ -466,27 +490,43 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 21,
"metadata": {
"id": "w2LHcHhVZvPZ"
},
- "outputs": [],
+ "outputs": [
+ {
+ "ename": "NameError",
+ "evalue": "name 'nn' is not defined",
+ "output_type": "error",
+ "traceback": [
+ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+ "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
+ "Cell \u001b[0;32mIn[21], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[38;5;28;01mclass\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21;01mPolicy\u001b[39;00m(\u001b[43mnn\u001b[49m\u001b[38;5;241m.\u001b[39mModule):\n\u001b[1;32m 2\u001b[0m \u001b[38;5;66;03m# State # Action # hidden\u001b[39;00m\n\u001b[1;32m 3\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21m__init__\u001b[39m(\u001b[38;5;28mself\u001b[39m, s_size, a_size, h_size):\n\u001b[1;32m 4\u001b[0m \u001b[38;5;28msuper\u001b[39m(Policy, \u001b[38;5;28mself\u001b[39m)\u001b[38;5;241m.\u001b[39m\u001b[38;5;21m__init__\u001b[39m()\n",
+ "\u001b[0;31mNameError\u001b[0m: name 'nn' is not defined"
+ ]
+ }
+ ],
"source": [
"class Policy(nn.Module):\n",
+ " # State # Action # hidden\n",
" def __init__(self, s_size, a_size, h_size):\n",
" super(Policy, self).__init__()\n",
" # Create two fully connected layers\n",
- "\n",
- "\n",
+ " self.fc1 = nn.Linear(s_size, h_size)\n",
+ " self.fc2 = nn.Linear(h_size, a_size)\n",
+ " self.relu = nn.ReLU()\n",
"\n",
" def forward(self, x):\n",
" # Define the forward pass\n",
" # state goes to fc1 then we apply ReLU activation function\n",
- "\n",
+ " x = self.relu(self.fc1(x))\n",
" # fc1 outputs goes to fc2\n",
+ " x = self.fc2(x)\n",
"\n",
" # We output the softmax\n",
- " \n",
+ " return F.softmax(x, dim=1)\n",
+ "\n",
" def act(self, state):\n",
" \"\"\"\n",
" Given a state, take action\n",
@@ -494,7 +534,7 @@
" state = torch.from_numpy(state).float().unsqueeze(0).to(device)\n",
" probs = self.forward(state).cpu()\n",
" m = Categorical(probs)\n",
- " action = np.argmax(m)\n",
+ " action = m.sample() # sample from the distribution instead of taking the argmax\n",
" return action.item(), m.log_prob(action)"
]
},
@@ -554,7 +594,7 @@
"outputs": [],
"source": [
"debug_policy = Policy(s_size, a_size, 64).to(device)\n",
- "debug_policy.act(env.reset())"
+ "debug_policy.act(env.reset()[0])"
]
},
{
@@ -619,14 +659,14 @@
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "c-20i7Pk0l1T"
+ },
"source": [
"- Since **we want to sample an action from the probability distribution over actions**, we can't use `action = np.argmax(m)` since it will always output the action that have the highest probability.\n",
"\n",
"- We need to replace with `action = m.sample()` that will sample an action from the probability distribution P(.|s)"
- ],
- "metadata": {
- "id": "c-20i7Pk0l1T"
- }
+ ]
},
{
"cell_type": "markdown",
@@ -643,6 +683,9 @@
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "QmcXG-9i2Qu2"
+ },
"source": [
"- When we calculate the return Gt (line 6) we see that we calculate the sum of discounted rewards **starting at timestep t**.\n",
"\n",
@@ -652,10 +695,7 @@
"\n",
"We use an interesting technique coded by [Chris1nexus](https://github.com/Chris1nexus) to **compute the return at each timestep efficiently**. The comments explained the procedure. Don't hesitate also [to check the PR explanation](https://github.com/huggingface/deep-rl-class/pull/95)\n",
"But overall the idea is to **compute the return at each timestep efficiently**."
- ],
- "metadata": {
- "id": "QmcXG-9i2Qu2"
- }
+ ]
},
{
"cell_type": "markdown",
@@ -676,38 +716,72 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "state = array([-0.01473051, 0.02841404, 0.0272485 , -0.03844116], dtype=float32)\n",
+ "state = array([-0.01416223, 0.22313488, 0.02647968, -0.3224039 ], dtype=float32)\n",
+ "reward = 1.0\n",
+ "terminated = False\n",
+ "truncated = False\n",
+ "_ = {}\n"
+ ]
+ }
+ ],
+ "source": [
+ "state, _ = env.reset()\n",
+ "print(f\"{state = }\")\n",
+ "state, reward, terminated, truncated, _ = env.step(1)\n",
+ "\n",
+ "print(f\"{state = }\")\n",
+ "print(f\"{reward = }\")\n",
+ "print(f\"{terminated = }\")\n",
+ "print(f\"{truncated = }\")\n",
+ "print(f\"{_ = }\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
"metadata": {
"id": "iOdv8Q9NfLK7"
},
"outputs": [],
"source": [
- "def reinforce(policy, optimizer, n_training_episodes, max_t, gamma, print_every):\n",
+ "def reinforce(policy, optimizer, n_training_episodes, max_t, gamma, print_every, max_pacience = 50):\n",
" # Help us to calculate the score during the training\n",
- " scores_deque = deque(maxlen=100)\n",
- " scores = []\n",
+ " scores = deque(maxlen=100)\n",
+ "\n",
+ " last_max = -np.inf # best rolling average so far (np.mean of an empty deque is NaN)\n",
+ " pacience = 0\n",
" # Line 3 of pseudocode\n",
" for i_episode in range(1, n_training_episodes+1):\n",
- " saved_log_probs = []\n",
- " rewards = []\n",
- " state = # TODO: reset the environment\n",
- " # Line 4 of pseudocode\n",
+ " rewards, saved_log_probs = [], []\n",
+ " state, _ = env.reset()\n",
+ "\n",
+ " # ========= Line 4 of pseudocode =========\n",
" for t in range(max_t):\n",
- " action, log_prob = # TODO get the action\n",
+ " action, log_prob = policy.act(state)\n",
" saved_log_probs.append(log_prob)\n",
- " state, reward, done, _ = # TODO: take an env step\n",
+ " state, reward, terminated, truncated, _ = env.step(action)\n",
" rewards.append(reward)\n",
- " if done:\n",
+ " if terminated or truncated:\n",
" break \n",
- " scores_deque.append(sum(rewards))\n",
+ "\n",
" scores.append(sum(rewards))\n",
" \n",
- " # Line 6 of pseudocode: calculate the return\n",
+ " # ========= Line 6 of pseudocode: calculate the return =========\n",
" returns = deque(maxlen=max_t) \n",
" n_steps = len(rewards) \n",
+ " \n",
+ " \"\"\"# ================ EXPLANATION ================\n",
" # Compute the discounted returns at each timestep,\n",
" # as the sum of the gamma-discounted return at time t (G_t) + the reward at time t\n",
- " \n",
+ "\n",
" # In O(N) time, where N is the number of time steps\n",
" # (this definition of the discounted return G_t follows the definition of this quantity \n",
" # shown at page 44 of Sutton&Barto 2017 2nd draft)\n",
@@ -723,7 +797,6 @@
" # This is correct since the above is equivalent to (see also page 46 of Sutton&Barto 2017 2nd draft)\n",
" # G_(t-1) = r_t + gamma*r_(t+1) + gamma*gamma*r_(t+2) + ...\n",
" \n",
- " \n",
" ## Given the above, we calculate the returns at timestep t as: \n",
" # gamma[t] * return[t] + reward[t]\n",
" #\n",
@@ -733,10 +806,11 @@
" \n",
" ## Hence, the queue \"returns\" will hold the returns in chronological order, from t=0 to t=n_steps\n",
" ## thanks to the appendleft() function which allows to append to the position 0 in constant time O(1)\n",
- " ## a normal python list would instead require O(N) to do this.\n",
+ " ## a normal python list would instead require O(N) to do this.\"\"\"\n",
+ " disc_return_t = 0\n",
" for t in range(n_steps)[::-1]:\n",
- " disc_return_t = (returns[0] if len(returns)>0 else 0)\n",
- " returns.appendleft( ) # TODO: complete here \n",
+ " returns.appendleft(disc_return_t * gamma + rewards[t]) \n",
+ " disc_return_t = returns[0]\n",
" \n",
" ## standardization of the returns is employed to make training more stable\n",
" eps = np.finfo(np.float32).eps.item()\n",
@@ -746,21 +820,30 @@
" returns = torch.tensor(returns)\n",
" returns = (returns - returns.mean()) / (returns.std() + eps)\n",
" \n",
- " # Line 7:\n",
+ " # ========= Line 7=========\n",
" policy_loss = []\n",
" for log_prob, disc_return in zip(saved_log_probs, returns):\n",
" policy_loss.append(-log_prob * disc_return)\n",
" policy_loss = torch.cat(policy_loss).sum()\n",
" \n",
- " # Line 8: PyTorch prefers gradient descent \n",
+ " # ========= Line 8: PyTorch prefers gradient descent =========\n",
" optimizer.zero_grad()\n",
" policy_loss.backward()\n",
" optimizer.step()\n",
" \n",
+ " mean = np.mean(scores)\n",
" if i_episode % print_every == 0:\n",
- " print('Episode {}\\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_deque)))\n",
+ " print('Episode {}\\tAverage Score: {:.2f}'.format(i_episode, mean))\n",
+ "\n",
+ " if last_max >= mean:\n",
+ " pacience += 1\n",
+ " if pacience >= max_pacience:\n",
+ " print(' - Breaking at Episode {}\\t with average Score: {:.2f} (patience exhausted, best rolling average {:.2f})'.format(i_episode, mean, last_max))\n",
+ " break\n",
+ " else:\n",
+ " last_max, pacience = mean, 0\n",
" \n",
- " return scores"
+ " return list(scores)"
]
},
{
@@ -788,7 +871,7 @@
" for i_episode in range(1, n_training_episodes+1):\n",
" saved_log_probs = []\n",
" rewards = []\n",
- " state = env.reset()\n",
+ " state, _ = env.reset()\n",
" # Line 4 of pseudocode\n",
" for t in range(max_t):\n",
" action, log_prob = policy.act(state)\n",
@@ -875,7 +958,7 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 12,
"metadata": {
"id": "utRe1NgtVBYF"
},
@@ -886,17 +969,17 @@
" \"n_training_episodes\": 1000,\n",
" \"n_evaluation_episodes\": 10,\n",
" \"max_t\": 1000,\n",
- " \"gamma\": 1.0,\n",
+ " \"gamma\": 0.99,\n",
" \"lr\": 1e-2,\n",
" \"env_id\": env_id,\n",
- " \"state_space\": s_size,\n",
- " \"action_space\": a_size,\n",
+ " \"state_space\": int(s_size),\n",
+ " \"action_space\": int(a_size),\n",
"}"
]
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 13,
"metadata": {
"id": "D3lWyVXBVfl6"
},
@@ -909,18 +992,31 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 32,
"metadata": {
"id": "uGf-hQCnfouB"
},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Episode 25\tAverage Score: 500.00\n",
+ "Episode 50\tAverage Score: 500.00\n",
+ " - Breaking at Episode 51\t with average Score: 500.00 (patience exhausted, best rolling average 500.00)\n"
+ ]
+ }
+ ],
"source": [
- "scores = reinforce(cartpole_policy,\n",
- " cartpole_optimizer,\n",
- " cartpole_hyperparameters[\"n_training_episodes\"], \n",
- " cartpole_hyperparameters[\"max_t\"],\n",
- " cartpole_hyperparameters[\"gamma\"], \n",
- " 100)"
+ "scores = reinforce(\n",
+ " cartpole_policy,\n",
+ " cartpole_optimizer,\n",
+ " cartpole_hyperparameters[\"n_training_episodes\"], \n",
+ " cartpole_hyperparameters[\"max_t\"],\n",
+ " cartpole_hyperparameters[\"gamma\"], \n",
+ " print_every=25,\n",
+ " max_pacience=50\n",
+ ")"
]
},
{
@@ -935,7 +1031,7 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 20,
"metadata": {
"id": "3FamHmxyhBEU"
},
@@ -950,20 +1046,60 @@
" \"\"\"\n",
" episode_rewards = []\n",
" for episode in range(n_eval_episodes):\n",
- " state = env.reset()\n",
- " step = 0\n",
- " done = False\n",
+ " state, _ = env.reset()\n",
" total_rewards_ep = 0\n",
" \n",
- " for step in range(max_steps):\n",
+ " for _ in range(max_steps):\n",
" action, _ = policy.act(state)\n",
- " new_state, reward, done, info = env.step(action)\n",
+ " new_state, reward, terminated, truncated, _ = env.step(action)\n",
" total_rewards_ep += reward\n",
" \n",
- " if done:\n",
+ " if terminated or truncated:\n",
+ " break\n",
+ "\n",
+ " state = new_state\n",
+ " episode_rewards.append(total_rewards_ep)\n",
+ "\n",
+ " if episode % 100 == 0:\n",
+ " print(f\"Episode: {episode}, mean reward: {np.mean(episode_rewards):.4f}\")\n",
+ "\n",
+ " mean_reward = np.mean(episode_rewards)\n",
+ " std_reward = np.std(episode_rewards)\n",
+ "\n",
+ " return mean_reward, std_reward\n",
+ "\n",
+ "\n",
+ "def evaluate_agent_pygame(env, max_steps, n_eval_episodes, policy, game_p):\n",
+ " \"\"\"\n",
+ " Evaluate the agent for ``n_eval_episodes`` episodes and return the average reward and std of reward.\n",
+ " :param env: The evaluation environment\n",
+ " :param n_eval_episodes: Number of episode to evaluate the agent\n",
+ " :param policy: The Reinforce agent\n",
+ " \"\"\"\n",
+ " episode_rewards = []\n",
+ " game_p.init()\n",
+ " actions_set = game_p.getActionSet()\n",
+ " for episode in range(n_eval_episodes):\n",
+ " game_p.reset_game()\n",
+ " state = np.array(list(game_p.getGameState().values()), dtype=np.float32)\n",
+ " \n",
+ " total_rewards_ep = 0\n",
+ " \n",
+ " for _ in range(max_steps):\n",
+ " action, _ = policy.act(state)\n",
+ " action = actions_set[action]\n",
+ " reward = game_p.act(action)\n",
+ " total_rewards_ep += reward\n",
+ " new_state = np.array(list(game_p.getGameState().values()), dtype=np.float32) \n",
+ " if game_p.game_over():\n",
" break\n",
+ "\n",
" state = new_state\n",
" episode_rewards.append(total_rewards_ep)\n",
+ "\n",
+ " if episode % 100 == 0:\n",
+ " print(f\"Episode: {episode}, mean reward: {np.mean(episode_rewards):.4f}\")\n",
+ "\n",
" mean_reward = np.mean(episode_rewards)\n",
" std_reward = np.std(episode_rewards)\n",
"\n",
@@ -981,16 +1117,36 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 34,
"metadata": {
"id": "ohGSXDyHh0xx"
},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Episode: 0.0000, mean reward: 500.0000\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "(np.float64(500.0), np.float64(0.0))"
+ ]
+ },
+ "execution_count": 34,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
"source": [
- "evaluate_agent(eval_env, \n",
- " cartpole_hyperparameters[\"max_t\"], \n",
- " cartpole_hyperparameters[\"n_evaluation_episodes\"],\n",
- " cartpole_policy)"
+ "evaluate_agent(\n",
+ " eval_env, \n",
+ " cartpole_hyperparameters[\"max_t\"], \n",
+ " cartpole_hyperparameters[\"n_evaluation_episodes\"],\n",
+ " cartpole_policy\n",
+ ")"
]
},
{
@@ -1019,6 +1175,11 @@
},
{
"cell_type": "code",
+ "execution_count": 16,
+ "metadata": {
+ "id": "LIVsvlW_8tcw"
+ },
+ "outputs": [],
"source": [
"from huggingface_hub import HfApi, snapshot_download\n",
"from huggingface_hub.repocard import metadata_eval_result, metadata_save\n",
@@ -1031,21 +1192,18 @@
"import tempfile\n",
"\n",
"import os"
- ],
- "metadata": {
- "id": "LIVsvlW_8tcw"
- },
- "execution_count": null,
- "outputs": []
+ ]
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 17,
"metadata": {
"id": "Lo4JH45if81z"
},
"outputs": [],
"source": [
+ "import pygame\n",
+ "\n",
"def record_video(env, policy, out_directory, fps=30):\n",
" \"\"\"\n",
" Generate a replay video of the agent\n",
@@ -1056,27 +1214,75 @@
" \"\"\"\n",
" images = [] \n",
" done = False\n",
- " state = env.reset()\n",
- " img = env.render(mode='rgb_array')\n",
+ " state, _ = env.reset()\n",
+ " img = env.render()\n",
" images.append(img)\n",
- " while not done:\n",
+ " for frame in range(fps*100):\n",
" # Take the action (index) that have the maximum expected future reward given that state\n",
" action, _ = policy.act(state)\n",
- " state, reward, done, info = env.step(action) # We directly put next_state = state for recording logic\n",
- " img = env.render(mode='rgb_array')\n",
+ " state, reward, terminated, truncated, _ = env.step(action) # We directly put next_state = state for recording logic\n",
+ " img = env.render()\n",
" images.append(img)\n",
+ "\n",
+ " if terminated or truncated:\n",
+ " break\n",
+ "\n",
+ "\n",
+ " print(\" - Terminated video loop, mimsave...\")\n",
+ " imageio.mimsave(out_directory, [np.array(img) for i, img in enumerate(images)], fps=fps)\n",
+ "\n",
+ "\n",
+ "def record_video_pygame(env, policy, out_directory, game_p, fps=30):\n",
+ " \"\"\"\n",
+ " Generate a replay video of the agent\n",
+ " :param env\n",
+ " :param Qtable: Qtable of our agent\n",
+ " :param out_directory\n",
+ " :param fps: how many frame per seconds (with taxi-v3 and frozenlake-v1 we use 1)\n",
+ " \"\"\"\n",
+ " images = [] \n",
+ " game_p.init()\n",
+ " actions_set = game_p.getActionSet()\n",
+ " game_p.reset_game()\n",
+ " state = np.array(list(game_p.getGameState().values()), dtype=np.float32) # build the state vector from the game state\n",
+ " \n",
+ " for frame in range(fps*100):\n",
+ " # Take the action (index) that have the maximum expected future reward given that state\n",
+ " action, _ = policy.act(state)\n",
+ " action = actions_set[action]\n",
+ " reward = game_p.act(action) # We directly put next_state = state for recording logic\n",
+ "\n",
+ "\n",
+ " surface = pygame.display.get_surface()\n",
+ " if surface is not None:\n",
+ " img = pygame.surfarray.array3d(surface) # shape (W,H,3)\n",
+ " img = np.transpose(img, (1, 0, 2)) # (H,W,3)\n",
+ " images.append(img)\n",
+ "\n",
+ " state = np.array(list(game_p.getGameState().values()), dtype=np.float32) \n",
+ " if game_p.game_over():\n",
+ " break\n",
+ "\n",
+ "\n",
+ " print(\" - Terminated video loop, mimsave...\")\n",
" imageio.mimsave(out_directory, [np.array(img) for i, img in enumerate(images)], fps=fps)"
]
},
{
"cell_type": "code",
+ "execution_count": 18,
+ "metadata": {
+ "id": "_TPdq47D7_f_"
+ },
+ "outputs": [],
"source": [
- "def push_to_hub(repo_id, \n",
- " model,\n",
- " hyperparameters,\n",
- " eval_env,\n",
- " video_fps=30\n",
- " ):\n",
+ "def push_to_hub(\n",
+ " repo_id, \n",
+ " model,\n",
+ " hyperparameters,\n",
+ " eval_env,\n",
+ " video_fps=30\n",
+ "):\n",
" \"\"\"\n",
" Evaluate, Generate a video and Upload a model to Hugging Face Hub.\n",
" This method does the complete pipeline:\n",
@@ -1180,9 +1386,11 @@
"\n",
" # Step 6: Record a video\n",
" video_path = local_directory / \"replay.mp4\"\n",
- " record_video(env, model, video_path, video_fps)\n",
+ " print(\"Recording replay video...\")\n",
+ " record_video(eval_env, model, video_path, video_fps)\n",
"\n",
" # Step 7. Push everything to the Hub\n",
+ " print(\"Pushing files to the Hub...\")\n",
" api.upload_folder(\n",
" repo_id=repo_id,\n",
" folder_path=local_directory,\n",
@@ -1190,26 +1398,155 @@
" )\n",
"\n",
" print(f\"Your model is pushed to the Hub. You can view your model here: {repo_url}\")"
- ],
- "metadata": {
- "id": "_TPdq47D7_f_"
- },
- "execution_count": null,
- "outputs": []
+ ]
},
{
- "cell_type": "markdown",
- "metadata": {
- "id": "w17w8CxzoURM"
- },
+ "cell_type": "code",
+ "execution_count": 19,
+ "metadata": {},
+ "outputs": [],
"source": [
- "### .\n",
- "\n",
- "By using `push_to_hub` **you evaluate, record a replay, generate a model card of your agent and push it to the Hub**.\n",
+ "def push_to_hub_pygame(\n",
+ " repo_id, \n",
+ " model,\n",
+ " hyperparameters,\n",
+ " eval_env,\n",
+ " game_p,\n",
+ " video_fps=30,\n",
+ "):\n",
+ " \"\"\"\n",
+ " Evaluate, Generate a video and Upload a model to Hugging Face Hub.\n",
+ " This method does the complete pipeline:\n",
+ " - It evaluates the model\n",
+ " - It generates the model card\n",
+ " - It generates a replay video of the agent\n",
+ " - It pushes everything to the Hub\n",
"\n",
- "This way:\n",
- "- You can **showcase our work** 🔥\n",
- "- You can **visualize your agent playing** 👀\n",
+ " :param repo_id: repo_id: id of the model repository from the Hugging Face Hub\n",
+ " :param model: the pytorch model we want to save\n",
+ " :param hyperparameters: training hyperparameters\n",
+ " :param eval_env: evaluation environment\n",
+ " :param video_fps: how many frame per seconds to record our video replay \n",
+ " \"\"\"\n",
+ "\n",
+ " _, repo_name = repo_id.split(\"/\")\n",
+ " api = HfApi()\n",
+ " \n",
+ " # Step 1: Create the repo\n",
+ " repo_url = api.create_repo(\n",
+ " repo_id=repo_id,\n",
+ " exist_ok=True,\n",
+ " )\n",
+ "\n",
+ " with tempfile.TemporaryDirectory() as tmpdirname:\n",
+ " local_directory = Path(tmpdirname)\n",
+ " \n",
+ " # Step 2: Save the model\n",
+ " torch.save(model, local_directory / \"model.pt\")\n",
+ "\n",
+ " # Step 3: Save the hyperparameters to JSON\n",
+ " with open(local_directory / \"hyperparameters.json\", \"w\") as outfile:\n",
+ " json.dump(hyperparameters, outfile)\n",
+ " \n",
+ " # Step 4: Evaluate the model and build JSON\n",
+ " mean_reward, std_reward = evaluate_agent_pygame(\n",
+ " eval_env, \n",
+ " hyperparameters[\"max_t\"],\n",
+ " hyperparameters[\"n_evaluation_episodes\"], \n",
+ " model,\n",
+ " game_p\n",
+ " )\n",
+ " # Get datetime\n",
+ " eval_datetime = datetime.datetime.now()\n",
+ " eval_form_datetime = eval_datetime.isoformat()\n",
+ "\n",
+ " evaluate_data = {\n",
+ " \"env_id\": hyperparameters[\"env_id\"], \n",
+ " \"mean_reward\": mean_reward,\n",
+ " \"n_evaluation_episodes\": hyperparameters[\"n_evaluation_episodes\"],\n",
+ " \"eval_datetime\": eval_form_datetime,\n",
+ " }\n",
+ "\n",
+ " # Write a JSON file\n",
+ " with open(local_directory / \"results.json\", \"w\") as outfile:\n",
+ " json.dump(evaluate_data, outfile)\n",
+ "\n",
+ " # Step 5: Create the model card\n",
+ " env_name = hyperparameters[\"env_id\"]\n",
+ " \n",
+ " metadata = {}\n",
+ " metadata[\"tags\"] = [\n",
+ " env_name,\n",
+ " \"reinforce\",\n",
+ " \"reinforcement-learning\",\n",
+ " \"custom-implementation\",\n",
+ " \"deep-rl-class\"\n",
+ " ]\n",
+ "\n",
+ " # Add metrics\n",
+ " eval = metadata_eval_result(\n",
+ " model_pretty_name=repo_name,\n",
+ " task_pretty_name=\"reinforcement-learning\",\n",
+ " task_id=\"reinforcement-learning\",\n",
+ " metrics_pretty_name=\"mean_reward\",\n",
+ " metrics_id=\"mean_reward\",\n",
+ " metrics_value=f\"{mean_reward:.2f} +/- {std_reward:.2f}\",\n",
+ " dataset_pretty_name=env_name,\n",
+ " dataset_id=env_name,\n",
+ " )\n",
+ "\n",
+ " # Merges both dictionaries\n",
+ " metadata = {**metadata, **eval}\n",
+ "\n",
+ " model_card = f\"\"\"\n",
+ "    # **Reinforce** Agent playing **{env_name}**\n",
+ "    This is a trained model of a **Reinforce** agent playing **{env_name}**.\n",
+ "    To learn how to use this model and train your own, check Unit 4 of the Deep Reinforcement Learning Course: https://huggingface.co/deep-rl-course/unit4/introduction\n",
+ " \"\"\"\n",
+ "\n",
+ " readme_path = local_directory / \"README.md\"\n",
+ " readme = \"\"\n",
+ " if readme_path.exists():\n",
+ " with readme_path.open(\"r\", encoding=\"utf8\") as f:\n",
+ " readme = f.read()\n",
+ " else:\n",
+ " readme = model_card\n",
+ "\n",
+ " with readme_path.open(\"w\", encoding=\"utf-8\") as f:\n",
+ " f.write(readme)\n",
+ "\n",
+ " # Save our metrics to Readme metadata\n",
+ " metadata_save(readme_path, metadata)\n",
+ "\n",
+ " # Step 6: Record a video\n",
+ " video_path = local_directory / \"replay.mp4\"\n",
+ " print(\"VIDEO\")\n",
+ " record_video_pygame(eval_env, model, video_path, game_p, video_fps)\n",
+ "\n",
+ " # Step 7. Push everything to the Hub\n",
+ " print(\"PUSH\")\n",
+ " api.upload_folder(\n",
+ " repo_id=repo_id,\n",
+ " folder_path=local_directory,\n",
+ " path_in_repo=\".\",\n",
+ " )\n",
+ "\n",
+ " print(f\"Your model is pushed to the Hub. You can view your model here: {repo_url}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "w17w8CxzoURM"
+ },
+ "source": [
+ "### .\n",
+ "\n",
+ "By using `push_to_hub` **you evaluate, record a replay, generate a model card of your agent and push it to the Hub**.\n",
+ "\n",
+ "This way:\n",
+ "- You can **showcase your work** 🔥\n",
+ "- You can **visualize your agent playing** 👀\n",
"- You can **share with the community an agent that others can use** 💾\n",
"- You can **access a leaderboard 🏆 to see how well your agent is performing compared to your classmates** 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard\n"
]
@@ -1262,19 +1599,32 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 37,
"metadata": {
"id": "UNwkTS65Uq3Q"
},
- "outputs": [],
+ "outputs": [
+ {
+ "ename": "NameError",
+ "evalue": "name 'cartpole_policy' is not defined",
+ "output_type": "error",
+ "traceback": [
+ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+ "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
+ "Cell \u001b[0;32mIn[37], line 4\u001b[0m\n\u001b[1;32m 1\u001b[0m repo_id \u001b[38;5;241m=\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mturbo-maikol/Reinforce-rl-course-unit4-cartpole\u001b[39m\u001b[38;5;124m\"\u001b[39m \u001b[38;5;66;03m#TODO Define your repo id {username/Reinforce-{model-id}}\u001b[39;00m\n\u001b[1;32m 2\u001b[0m push_to_hub(\n\u001b[1;32m 3\u001b[0m repo_id,\n\u001b[0;32m----> 4\u001b[0m \u001b[43mcartpole_policy\u001b[49m, \u001b[38;5;66;03m# The model we want to save\u001b[39;00m\n\u001b[1;32m 5\u001b[0m cartpole_hyperparameters, \u001b[38;5;66;03m# Hyperparameters\u001b[39;00m\n\u001b[1;32m 6\u001b[0m eval_env, \u001b[38;5;66;03m# Evaluation environment\u001b[39;00m\n\u001b[1;32m 7\u001b[0m video_fps\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m30\u001b[39m\n\u001b[1;32m 8\u001b[0m )\n",
+ "\u001b[0;31mNameError\u001b[0m: name 'cartpole_policy' is not defined"
+ ]
+ }
+ ],
"source": [
- "repo_id = \"\" #TODO Define your repo id {username/Reinforce-{model-id}}\n",
- "push_to_hub(repo_id,\n",
- " cartpole_policy, # The model we want to save\n",
- " cartpole_hyperparameters, # Hyperparameters\n",
- " eval_env, # Evaluation environment\n",
- " video_fps=30\n",
- " )"
+ "repo_id = \"turbo-maikol/Reinforce-rl-course-unit4-cartpole\" #TODO Define your repo id {username/Reinforce-{model-id}}\n",
+ "push_to_hub(\n",
+ " repo_id,\n",
+ " cartpole_policy, # The model we want to save\n",
+ " cartpole_hyperparameters, # Hyperparameters\n",
+ " eval_env, # Evaluation environment\n",
+ " video_fps=30\n",
+ ")"
]
},
{
@@ -1290,56 +1640,145 @@
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "JNLVmKKVKA6j"
+ },
"source": [
"## Second agent: PixelCopter 🚁\n",
"\n",
"### Study the PixelCopter environment 👀\n",
"- [The Environment documentation](https://pygame-learning-environment.readthedocs.io/en/latest/user/games/pixelcopter.html)\n"
- ],
- "metadata": {
- "id": "JNLVmKKVKA6j"
- }
+ ]
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from ple.games.pixelcopter import Pixelcopter\n",
+ "from ple import PLE\n",
+ "import numpy as np"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 35,
"metadata": {
"id": "JBSc8mlfyin3"
},
- "outputs": [],
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[119, None]"
+ ]
+ },
+ "execution_count": 35,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
"source": [
- "env_id = \"Pixelcopter-PLE-v0\"\n",
- "env = gym.make(env_id)\n",
- "eval_env = gym.make(env_id)\n",
- "s_size = env.observation_space.shape[0]\n",
- "a_size = env.action_space.n"
+ "# env_id = \"Pixelcopter-PLE-v0\"\n",
+ "# env = gym.make(env_id)\n",
+ "env = Pixelcopter()\n",
+ "p = PLE(env, fps=30, display_screen=True)\n",
+ "\n",
+ "p.init()\n",
+ "reward = 0.0\n",
+ "\n",
+ "actions = p.getActionSet()\n",
+ "# for i in range(10_000):\n",
+ " # if p.game_over():\n",
+ " # print(f\"{p.reset_game()}\")\n",
+ " # print(f\"{i: }\")\n",
+ "\n",
+ " # print(f\" - {np.array(list(env.getGameState().values()), dtype=np.float32)}\")\n",
+ " # reward = p.act(np.random.randint(0,2))\n",
+ " # print(f\" - {reward = }\")\n",
+ "actions"
]
},
{
"cell_type": "code",
- "source": [
- "print(\"_____OBSERVATION SPACE_____ \\n\")\n",
- "print(\"The State Space is: \", s_size)\n",
- "print(\"Sample observation\", env.observation_space.sample()) # Get a random observation"
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ " - [24. 0. 7. 17. 48. 22. 31.]\n",
+ " - reward = 0.0\n"
+ ]
+ }
],
+ "source": [
+ "if p.game_over():\n",
+ " print(f\"{p.reset_game()}\")\n",
+ "\n",
+ "print(f\" - {np.array(list(env.getGameState().values()), dtype=np.float32)}\")\n",
+ "reward = p.act(actions[0])\n",
+ "print(f\" - {reward = }\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "s_size = len(env.getGameState())\n",
+ "a_size = 2  # getActionSet() returned [119, None]: press \"up\" or do nothing"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
"metadata": {
"id": "L5u_zAHsKBy7"
},
- "execution_count": null,
- "outputs": []
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "_____OBSERVATION SPACE_____ \n",
+ "\n",
+ "The State Space is: 7\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(\"_____OBSERVATION SPACE_____ \\n\")\n",
+ "print(\"The State Space is: \", s_size)\n",
+ "# print(\"Sample observation\", env.observation_space.sample()) # Get a random observation"
+ ]
},
{
"cell_type": "code",
- "source": [
- "print(\"\\n _____ACTION SPACE_____ \\n\")\n",
- "print(\"The Action Space is: \", a_size)\n",
- "print(\"Action Space Sample\", env.action_space.sample()) # Take a random action"
- ],
+ "execution_count": 11,
"metadata": {
"id": "D7yJM9YXKNbq"
},
- "execution_count": null,
- "outputs": []
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n",
+ " _____ACTION SPACE_____ \n",
+ "\n",
+ "The Action Space is: 2\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(\"\\n _____ACTION SPACE_____ \\n\")\n",
+ "print(\"The Action Space is: \", a_size)\n",
+ "# print(\"Action Space Sample\", env.action_space.sample()) # Take a random action"
+ ]
},
{
"cell_type": "markdown",
@@ -1366,17 +1805,17 @@
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "aV1466QP8crz"
+ },
"source": [
"### Define the new Policy 🧠\n",
"- We need to have a deeper neural network since the environment is more complex"
- ],
- "metadata": {
- "id": "aV1466QP8crz"
- }
+ ]
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 27,
"metadata": {
"id": "I1eBkCiX2X_S"
},
@@ -1386,9 +1825,20 @@
" def __init__(self, s_size, a_size, h_size):\n",
" super(Policy, self).__init__()\n",
" # Define the three layers here\n",
+ " self.fc1 = nn.Linear(s_size, h_size)\n",
+ " self.fc2 = nn.Linear(h_size, h_size*2)\n",
+ " self.fc3 = nn.Linear(h_size*2, a_size)\n",
+ " self.relu = nn.ReLU()\n",
"\n",
" def forward(self, x):\n",
" # Define the forward process here\n",
+ " x = self.relu(self.fc1(x))\n",
+ " x = self.relu(self.fc2(x))\n",
+ "        x = self.fc3(x)\n",
+ "\n",
" return F.softmax(x, dim=1)\n",
" \n",
" def act(self, state):\n",
@@ -1401,15 +1851,20 @@
},
{
"cell_type": "markdown",
- "source": [
- "#### Solution"
- ],
"metadata": {
"id": "47iuAFqV8Ws-"
- }
+ },
+ "source": [
+ "#### Solution"
+ ]
},
{
"cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "wrNuVcHC8Xu7"
+ },
+ "outputs": [],
"source": [
"class Policy(nn.Module):\n",
" def __init__(self, s_size, a_size, h_size):\n",
@@ -1430,12 +1885,7 @@
" m = Categorical(probs)\n",
" action = m.sample()\n",
" return action.item(), m.log_prob(action)"
- ],
- "metadata": {
- "id": "wrNuVcHC8Xu7"
- },
- "execution_count": null,
- "outputs": []
+ ]
},
{
"cell_type": "markdown",
@@ -1449,16 +1899,135 @@
]
},
{
- "cell_type": "code",
- "execution_count": null,
+ "cell_type": "markdown",
"metadata": {
- "id": "y0uujOR_ypB6"
+ "id": "wyvXTJWm9GJG"
},
+ "source": [
+ "### Train it\n",
+ "- We're now ready to train our agent 🔥."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 359,
+ "metadata": {},
"outputs": [],
"source": [
+ "from collections import Counter\n",
+ "\n",
+ "def reinforce_pygame(game_p, policy, optimizer, n_training_episodes, max_t, gamma, print_every, max_patience=None):\n",
+ "    # Rolling window of the last 100 episode scores\n",
+ "    scores = deque(maxlen=100)\n",
+ "    game_p.init()\n",
+ "    actions_set = game_p.getActionSet()\n",
+ "\n",
+ "    last_max = -np.inf  # best rolling average so far (np.mean of the empty deque would be NaN)\n",
+ "    patience = 0\n",
+ "    # Line 3 of pseudocode\n",
+ "    for i_episode in range(1, n_training_episodes+1):\n",
+ "        rewards, saved_log_probs, actions = [], [], []\n",
+ "        game_p.reset_game()\n",
+ "        state = np.array(list(game_p.getGameState().values()), dtype=np.float32)  # reset the environment\n",
+ "\n",
+ "        # ========= Line 4 of pseudocode =========\n",
+ "        for t in range(max_t):\n",
+ "            action, log_prob = policy.act(state)  # get the action\n",
+ "            action = actions_set[action]\n",
+ "            actions.append(action)\n",
+ "\n",
+ "            saved_log_probs.append(log_prob)\n",
+ "            reward = game_p.act(action)  # take an env step\n",
+ "            rewards.append(reward)\n",
+ "\n",
+ "            state = np.array(list(game_p.getGameState().values()), dtype=np.float32)\n",
+ "            if game_p.game_over():\n",
+ "                break\n",
+ "\n",
+ "        scores.append(sum(rewards))\n",
+ "\n",
+ "        # ========= Line 6 of pseudocode: calculate the return =========\n",
+ "        returns = deque(maxlen=max_t)\n",
+ "        n_steps = len(rewards)\n",
+ "\n",
+ "        \"\"\"# ================ EXPLANATION ================\n",
+ "        # Compute the discounted returns at each timestep,\n",
+ "        # as the sum of the gamma-discounted return at time t (G_t) + the reward at time t\n",
+ "\n",
+ "        # In O(N) time, where N is the number of time steps\n",
+ "        # (this definition of the discounted return G_t follows the definition of this quantity\n",
+ "        # shown at page 44 of Sutton&Barto 2017 2nd draft)\n",
+ "        # G_t = r_(t+1) + gamma*r_(t+2) + gamma^2*r_(t+3) + ...\n",
+ "\n",
+ "        # Given this formulation, the returns at each timestep t can be computed\n",
+ "        # by re-using the computed future returns G_(t+1) to compute the current return G_t\n",
+ "        # G_t = r_(t+1) + gamma*G_(t+1)\n",
+ "        # G_(t-1) = r_t + gamma*G_t\n",
+ "        # (this follows a dynamic programming approach, with which we memorize solutions in order\n",
+ "        # to avoid computing them multiple times)\n",
+ "\n",
+ "        # This is correct since the above is equivalent to (see also page 46 of Sutton&Barto 2017 2nd draft)\n",
+ "        # G_(t-1) = r_t + gamma*r_(t+1) + gamma*gamma*r_(t+2) + ...\n",
+ "\n",
+ "        ## Given the above, we calculate the returns at timestep t as:\n",
+ "        #    gamma[t] * return[t] + reward[t]\n",
+ "        #\n",
+ "        ## We compute this starting from the last timestep to the first, in order\n",
+ "        ## to employ the formula presented above and avoid redundant computations that would be needed\n",
+ "        ## if we were to do it from first to last.\n",
+ "\n",
+ "        ## Hence, the deque \"returns\" will hold the returns in chronological order, from t=0 to t=n_steps,\n",
+ "        ## thanks to the appendleft() function, which appends to position 0 in constant time O(1);\n",
+ "        ## a normal python list would instead require O(N) to do this.\"\"\"\n",
+ "        disc_return_t = 0\n",
+ "        for t in range(n_steps)[::-1]:\n",
+ "            returns.appendleft(disc_return_t * gamma + rewards[t])\n",
+ "            disc_return_t = returns[0]\n",
+ "\n",
+ "        ## standardization of the returns is employed to make training more stable\n",
+ "        eps = np.finfo(np.float32).eps.item()\n",
+ "\n",
+ "        ## eps (machine epsilon) is added to the standard deviation\n",
+ "        # of the returns to avoid numerical instabilities\n",
+ "        returns = torch.tensor(returns)\n",
+ "        returns = (returns - returns.mean()) / (returns.std() + eps)\n",
+ "\n",
+ "        # ========= Line 7 =========\n",
+ "        policy_loss = []\n",
+ "        for log_prob, disc_return in zip(saved_log_probs, returns):\n",
+ "            policy_loss.append(-log_prob * disc_return)\n",
+ "        policy_loss = torch.cat(policy_loss).sum()\n",
+ "\n",
+ "        # ========= Line 8: PyTorch prefers gradient descent =========\n",
+ "        optimizer.zero_grad()\n",
+ "        policy_loss.backward()\n",
+ "        optimizer.step()\n",
+ "\n",
+ "        mean = np.mean(scores)\n",
+ "        if i_episode % print_every == 0:\n",
+ "            print(f'Episode {i_episode} Average Score: {mean:.2f}. Action counts: {Counter(actions)}')\n",
+ "\n",
+ "        if last_max >= mean:\n",
+ "            patience += 1\n",
+ "            if max_patience is not None and patience >= max_patience:\n",
+ "                print(f' - Breaking at Episode {i_episode} with average Score: {mean:.2f}; no improvement over best {last_max:.2f}')\n",
+ "                break\n",
+ "        else:\n",
+ "            last_max, patience = mean, 0\n",
+ "\n",
+ "    return list(scores)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 30,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "env_id = \"Pixelcopter-PLE-v0\"\n",
"pixelcopter_hyperparameters = {\n",
" \"h_size\": 64,\n",
- " \"n_training_episodes\": 50000,\n",
+ " \"n_training_episodes\": 100_000,\n",
" \"n_evaluation_episodes\": 10,\n",
" \"max_t\": 10000,\n",
" \"gamma\": 0.99,\n",
@@ -1466,74 +2035,135 @@
" \"env_id\": env_id,\n",
" \"state_space\": s_size,\n",
" \"action_space\": a_size,\n",
- "}"
+ "}\n",
+ "\n",
+ "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
+ "# Create policy and place it to the device\n",
+ "# torch.manual_seed(50)\n",
+ "pixelcopter_policy = Policy(pixelcopter_hyperparameters[\"state_space\"], pixelcopter_hyperparameters[\"action_space\"], pixelcopter_hyperparameters[\"h_size\"]).to(device)\n",
+ "pixelcopter_optimizer = optim.Adam(pixelcopter_policy.parameters(), lr=pixelcopter_hyperparameters[\"lr\"])\n",
+ "\n",
+ "env = Pixelcopter()\n",
+ "game_p = PLE(env, fps=30, display_screen=True)\n",
+ "\n",
+ "# scores = reinforce_pygame(\n",
+ "# game_p,\n",
+ "# pixelcopter_policy,\n",
+ "# pixelcopter_optimizer,\n",
+ "# pixelcopter_hyperparameters[\"n_training_episodes\"], \n",
+ "# pixelcopter_hyperparameters[\"max_t\"],\n",
+ "# pixelcopter_hyperparameters[\"gamma\"], \n",
+ "# 100,\n",
+ "# )"
]
},
{
"cell_type": "markdown",
- "source": [
- "### Train it\n",
- "- We're now ready to train our agent 🔥."
- ],
"metadata": {
- "id": "wyvXTJWm9GJG"
- }
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "7mM2P_ckysFE"
+ "id": "8kwFQ-Ip85BE"
},
- "outputs": [],
"source": [
- "# Create policy and place it to the device\n",
- "# torch.manual_seed(50)\n",
- "pixelcopter_policy = Policy(pixelcopter_hyperparameters[\"state_space\"], pixelcopter_hyperparameters[\"action_space\"], pixelcopter_hyperparameters[\"h_size\"]).to(device)\n",
- "pixelcopter_optimizer = optim.Adam(pixelcopter_policy.parameters(), lr=pixelcopter_hyperparameters[\"lr\"])"
+ "### Publish our trained model on the Hub 🔥"
]
},
{
"cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "v1HEqP-fy-Rf"
- },
+ "execution_count": 363,
+ "metadata": {},
"outputs": [],
"source": [
- "scores = reinforce(pixelcopter_policy,\n",
- " pixelcopter_optimizer,\n",
- " pixelcopter_hyperparameters[\"n_training_episodes\"], \n",
- " pixelcopter_hyperparameters[\"max_t\"],\n",
- " pixelcopter_hyperparameters[\"gamma\"], \n",
- " 1000)"
+ "torch.save(pixelcopter_policy.state_dict(), \"./saved_model.pth\")"
]
},
{
- "cell_type": "markdown",
- "source": [
- "### Publish our trained model on the Hub 🔥"
+ "cell_type": "code",
+ "execution_count": 39,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "<All keys matched successfully>"
+ ]
+ },
+ "execution_count": 39,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
],
- "metadata": {
- "id": "8kwFQ-Ip85BE"
- }
+ "source": [
+ "pixelcopter_hyperparameters = {\n",
+ " \"h_size\": 64,\n",
+ " \"n_training_episodes\": 100_000,\n",
+ " \"n_evaluation_episodes\": 10,\n",
+ " \"max_t\": 10000,\n",
+ " \"gamma\": 0.99,\n",
+ " \"lr\": 1e-4,\n",
+ " \"env_id\": env_id,\n",
+ " \"state_space\": s_size,\n",
+ " \"action_space\": a_size,\n",
+ "}\n",
+ "\n",
+ "# Create policy and place it to the device\n",
+ "# torch.manual_seed(50)\n",
+ "pixelcopter_policy = Policy(pixelcopter_hyperparameters[\"state_space\"], pixelcopter_hyperparameters[\"action_space\"], pixelcopter_hyperparameters[\"h_size\"]).to(device)\n",
+ "\n",
+ "pixelcopter_policy.load_state_dict(torch.load(\"./saved_model.pth\", map_location=device))\n"
+ ]
},
{
"cell_type": "code",
- "source": [
- "repo_id = \"\" #TODO Define your repo id {username/Reinforce-{model-id}}\n",
- "push_to_hub(repo_id,\n",
- " pixelcopter_policy, # The model we want to save\n",
- " pixelcopter_hyperparameters, # Hyperparameters\n",
- " eval_env, # Evaluation environment\n",
- " video_fps=30\n",
- " )"
- ],
+ "execution_count": 40,
"metadata": {
"id": "6PtB7LRbTKWK"
},
- "execution_count": null,
- "outputs": []
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Episode: 0.0000, mean reward: 18.0000\n",
+ "VIDEO\n",
+ " - Teminated video loop, mimsave...\n",
+ "PUSH\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Processing Files (0 / 0) : | | 0.00B / 0.00B \n",
+ "\u001b[A\n",
+ "Processing Files (1 / 1) : 100%|██████████| 40.3kB / 40.3kB, ???B/s \n",
+ "\u001b[A\n",
+ "\u001b[A\n",
+ "Processing Files (1 / 1) : 100%|██████████| 40.3kB / 40.3kB, 0.00B/s \n",
+ "New Data Upload : | | 0.00B / 0.00B, 0.00B/s \n",
+ " /tmp/tmpec45byqc/model.pt : 100%|██████████| 40.3kB / 40.3kB \n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Your model is pushed to the Hub. You can view your model here: https://huggingface.co/turbo-maikol/Reinforce-rl-course-unit4-pixelcopter\n"
+ ]
+ }
+ ],
+ "source": [
+ "repo_id = \"turbo-maikol/Reinforce-rl-course-unit4-pixelcopter\" #TODO Define your repo id {username/Reinforce-{model-id}}\n",
+ "\n",
+ "env = Pixelcopter()\n",
+ "game_p = PLE(env, fps=30, display_screen=True)\n",
+ "push_to_hub_pygame(\n",
+ " repo_id,\n",
+ " pixelcopter_policy, # The model we want to save\n",
+ " pixelcopter_hyperparameters, # Hyperparameters\n",
+ " env, # Evaluation environment\n",
+ " game_p,\n",
+ " video_fps=30,\n",
+ ")"
+ ]
},
{
"cell_type": "markdown",
@@ -1585,8 +2215,6 @@
"metadata": {
"accelerator": "GPU",
"colab": {
- "private_outputs": true,
- "provenance": [],
"collapsed_sections": [
"BPLwsPajb1f8",
"L_WSo0VUV99t",
@@ -1597,11 +2225,13 @@
"47iuAFqV8Ws-",
"x62pP0PHdA-y"
],
- "include_colab_link": true
+ "include_colab_link": true,
+ "private_outputs": true,
+ "provenance": []
},
"gpuClass": "standard",
"kernelspec": {
- "display_name": "Python 3 (ipykernel)",
+ "display_name": ".venv",
"language": "python",
"name": "python3"
},
@@ -1615,7 +2245,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.8.10"
+ "version": "3.10.18"
}
},
"nbformat": 4,
diff --git a/notebooks/unit5/unit5.ipynb b/notebooks/unit5/unit5.ipynb
index cb9ec8b..580dffc 100644
--- a/notebooks/unit5/unit5.ipynb
+++ b/notebooks/unit5/unit5.ipynb
@@ -277,7 +277,10 @@
"# Go inside the repository and install the package (can take 3min)\n",
"%cd ml-agents\n",
"!pip3 install -e ./ml-agents-envs\n",
- "!pip3 install -e ./ml-agents"
+ "!pip3 install -e ./ml-agents\n",
+ "\n",
+ "# Alternatively, with uv (run from the directory containing the ml-agents checkout):\n",
+ "!uv pip install -e ./ml-agents/ml-agents-envs\n",
+ "!uv pip install -e ./ml-agents/ml-agents\n"
]
},
{
@@ -584,7 +587,7 @@
},
"outputs": [],
"source": [
- "!mlagents-push-to-hf --run-id= # Add your run id --local-dir= # Your local dir --repo-id= # Your repo id --commit-message= # Your commit message"
+ "!mlagents-push-to-hf --run-id=\"SnowballTarget1\" --local-dir=\"./results/SnowballTarget1\" --repo-id=\"turbo-maikol/rl-course-unit5-snowball\" --commit-message=\"First Push\""
]
},
{
@@ -796,7 +799,9 @@
},
"outputs": [],
"source": [
- "!mlagents-push-to-hf --run-id= # Add your run id --local-dir= # Your local dir --repo-id= # Your repo id --commit-message= # Your commit message"
+ "# Template: mlagents-push-to-hf --run-id= # Add your run id --local-dir= # Your local dir --repo-id= # Your repo id --commit-message= # Your commit message\n",
+ "\n",
+ "!mlagents-push-to-hf --run-id=\"Pyramids Training\" --local-dir=\"./results/Pyramids Training\" --repo-id=\"turbo-maikol/rl-course-unit5-pyramids\" --commit-message=\"First Push\"\n"
]
},
{
diff --git a/notebooks/unit6/unit6.ipynb b/notebooks/unit6/unit6.ipynb
index e5d0081..2584c5a 100644
--- a/notebooks/unit6/unit6.ipynb
+++ b/notebooks/unit6/unit6.ipynb
@@ -1,31 +1,10 @@
{
- "nbformat": 4,
- "nbformat_minor": 0,
- "metadata": {
- "colab": {
- "provenance": [],
- "private_outputs": true,
- "collapsed_sections": [
- "tF42HvI7-gs5"
- ],
- "include_colab_link": true
- },
- "kernelspec": {
- "name": "python3",
- "display_name": "Python 3"
- },
- "language_info": {
- "name": "python"
- },
- "accelerator": "GPU",
- "gpuClass": "standard"
- },
"cells": [
{
"cell_type": "markdown",
"metadata": {
- "id": "view-in-github",
- "colab_type": "text"
+ "colab_type": "text",
+ "id": "view-in-github"
},
"source": [
"
"
@@ -33,6 +12,9 @@
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "-PTReiOw-RAN"
+ },
"source": [
"# Unit 6: Advantage Actor Critic (A2C) using Robotics Simulations with Panda-Gym 🤖\n",
"\n",
@@ -43,37 +25,37 @@
"- `Reach`: the robot must place its end-effector at a target position.\n",
"\n",
"After that, you'll be able **to train in other robotics tasks**.\n"
- ],
- "metadata": {
- "id": "-PTReiOw-RAN"
- }
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "QInFitfWno1Q"
+ },
"source": [
"### 🎮 Environments:\n",
"\n",
"- [Panda-Gym](https://github.com/qgallouedec/panda-gym)\n",
"\n",
- "###📚 RL-Library:\n",
+ "### 📚 RL-Library:\n",
"\n",
"- [Stable-Baselines3](https://stable-baselines3.readthedocs.io/)"
- ],
- "metadata": {
- "id": "QInFitfWno1Q"
- }
+ ]
},
{
"cell_type": "markdown",
- "source": [
- "We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the GitHub Repo](https://github.com/huggingface/deep-rl-class/issues)."
- ],
"metadata": {
"id": "2CcdX4g3oFlp"
- }
+ },
+ "source": [
+ "We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the GitHub Repo](https://github.com/huggingface/deep-rl-class/issues)."
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "MoubJX20oKaQ"
+ },
"source": [
"## Objectives of this notebook 🏆\n",
"\n",
@@ -85,13 +67,13 @@
"- Be able to **push your trained agent and the code to the Hub** with a nice video replay and an evaluation score 🔥.\n",
"\n",
"\n"
- ],
- "metadata": {
- "id": "MoubJX20oKaQ"
- }
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "DoUNkTExoUED"
+ },
"source": [
"## This notebook is from the Deep Reinforcement Learning Course\n",
"
\n",
@@ -108,34 +90,34 @@
"\n",
"\n",
"The best way to keep in touch is to join our discord server to exchange with the community and with us 👉🏻 https://discord.gg/ydHrjt3WP5"
- ],
- "metadata": {
- "id": "DoUNkTExoUED"
- }
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "BTuQAUAPoa5E"
+ },
"source": [
"## Prerequisites 🏗️\n",
"Before diving into the notebook, you need to:\n",
"\n",
"🔲 📚 Study [Actor-Critic methods by reading Unit 6](https://huggingface.co/deep-rl-course/unit6/introduction) 🤗 "
- ],
- "metadata": {
- "id": "BTuQAUAPoa5E"
- }
+ ]
},
{
"cell_type": "markdown",
- "source": [
- "# Let's train our first robots 🤖"
- ],
"metadata": {
"id": "iajHvVDWoo01"
- }
+ },
+ "source": [
+ "# Let's train our first robots 🤖"
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "zbOENTE2os_D"
+ },
"source": [
"To validate this hands-on for the [certification process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process), you need to push your trained model to the Hub and get the following results:\n",
"\n",
@@ -144,46 +126,43 @@
"To find your result, go to the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and find your model, **the result = mean_reward - std of reward**\n",
"\n",
"For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process"
- ],
- "metadata": {
- "id": "zbOENTE2os_D"
- }
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "PU4FVzaoM6fC"
+ },
"source": [
"## Set the GPU 💪\n",
"- To **accelerate the agent's training, we'll use a GPU**. To do that, go to `Runtime > Change Runtime type`\n",
"\n",
"
"
- ],
- "metadata": {
- "id": "PU4FVzaoM6fC"
- }
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "KV0NyFdQM9ZG"
+ },
"source": [
"- `Hardware Accelerator > GPU`\n",
"\n",
"
"
- ],
- "metadata": {
- "id": "KV0NyFdQM9ZG"
- }
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "bTpYcVZVMzUI"
+ },
"source": [
"## Create a virtual display 🔽\n",
"\n",
"During the notebook, we'll need to generate a replay video. To do so, with colab, **we need to have a virtual screen to be able to render the environment** (and thus record the frames).\n",
"\n",
"Hence the following cell will install the librairies and create and run a virtual screen 🖥"
- ],
- "metadata": {
- "id": "bTpYcVZVMzUI"
- }
+ ]
},
{
"cell_type": "code",
@@ -202,21 +181,24 @@
},
{
"cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "ww5PQH1gNLI4"
+ },
+ "outputs": [],
"source": [
"# Virtual display\n",
"from pyvirtualdisplay import Display\n",
"\n",
"virtual_display = Display(visible=0, size=(1400, 900))\n",
"virtual_display.start()"
- ],
- "metadata": {
- "id": "ww5PQH1gNLI4"
- },
- "execution_count": null,
- "outputs": []
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "e1obkbdJ_KnG"
+ },
"source": [
"### Install dependencies 🔽\n",
"\n",
@@ -228,48 +210,60 @@
"- `huggingface_hub`: Library allowing anyone to work with the Hub repositories.\n",
"\n",
"⏲ The installation can **take 10 minutes**."
- ],
- "metadata": {
- "id": "e1obkbdJ_KnG"
- }
+ ]
},
{
"cell_type": "code",
- "source": [
- "!pip install stable-baselines3[extra]\n",
- "!pip install gymnasium"
- ],
+ "execution_count": null,
"metadata": {
"id": "TgZUkjKYSgvn"
},
- "execution_count": null,
- "outputs": []
+ "outputs": [],
+ "source": [
+ "!pip install stable-baselines3[extra]\n",
+ "!pip install gymnasium"
+ ]
},
{
"cell_type": "code",
- "source": [
- "!pip install huggingface_sb3\n",
- "!pip install huggingface_hub\n",
- "!pip install panda_gym"
- ],
+ "execution_count": null,
"metadata": {
"id": "ABneW6tOSpyU"
},
- "execution_count": null,
- "outputs": []
+ "outputs": [],
+ "source": [
+ "!uv pip install 'stable-baselines3[extra]' gymnasium huggingface_sb3 huggingface_hub panda_gym"
+ ]
},
{
"cell_type": "markdown",
- "source": [
- "## Import the packages 📦"
- ],
"metadata": {
"id": "QTep3PQQABLr"
- }
+ },
+ "source": [
+ "## Import the packages 📦"
+ ]
},
{
"cell_type": "code",
+ "execution_count": 1,
+ "metadata": {
+ "id": "HpiB8VdnQ7Bk"
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/home/mique/Desktop/Code/deep-rl-class/notebooks/unit6/venv-u6/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
+ " from .autonotebook import tqdm as notebook_tqdm\n"
+ ]
+ }
+ ],
"source": [
+ "%load_ext autoreload\n",
+ "%autoreload 2\n",
+ "\n",
"import os\n",
"\n",
"import gymnasium as gym\n",
@@ -283,15 +277,13 @@
"from stable_baselines3.common.env_util import make_vec_env\n",
"\n",
"from huggingface_hub import notebook_login"
- ],
- "metadata": {
- "id": "HpiB8VdnQ7Bk"
- },
- "execution_count": null,
- "outputs": []
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "lfBwIS_oAVXI"
+ },
"source": [
"## PandaReachDense-v3 🦾\n",
"\n",
@@ -310,26 +302,38 @@
"\n",
"This way **the training will be easier**.\n",
"\n"
- ],
- "metadata": {
- "id": "lfBwIS_oAVXI"
- }
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "frVXOrnlBerQ"
+ },
"source": [
"### Create the environment\n",
"\n",
"#### The environment 🎮\n",
"\n",
"In `PandaReachDense-v3` the robotic arm must place its end-effector at a target position (green ball)."
- ],
- "metadata": {
- "id": "frVXOrnlBerQ"
- }
+ ]
},
{
"cell_type": "code",
+ "execution_count": 6,
+ "metadata": {
+ "id": "zXzAu3HYF1WD"
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "argv[0]=--background_color_red=0.8745098114013672\n",
+ "argv[1]=--background_color_green=0.21176470816135406\n",
+ "argv[2]=--background_color_blue=0.1764705926179886\n"
+ ]
+ }
+ ],
"source": [
"env_id = \"PandaReachDense-v3\"\n",
"\n",
@@ -339,28 +343,47 @@
"# Get the state space and action space\n",
"s_size = env.observation_space.shape\n",
"a_size = env.action_space"
- ],
- "metadata": {
- "id": "zXzAu3HYF1WD"
- },
- "execution_count": null,
- "outputs": []
+ ]
},
{
"cell_type": "code",
+ "execution_count": 15,
+ "metadata": {},
+ "outputs": [],
"source": [
- "print(\"_____OBSERVATION SPACE_____ \\n\")\n",
- "print(\"The State Space is: \", s_size)\n",
- "print(\"Sample observation\", env.observation_space.sample()) # Get a random observation"
- ],
+ "s_size = env.observation_space[\"observation\"].shape[0]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
"metadata": {
"id": "E-U9dexcF-FB"
},
- "execution_count": null,
- "outputs": []
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "_____OBSERVATION SPACE_____ \n",
+ "\n",
+ "The State Space is: 6\n",
+ "Sample observation OrderedDict([('achieved_goal', array([-5.6249957, 3.2377138, 9.631121 ], dtype=float32)), ('desired_goal', array([-5.9595466, 4.739131 , -3.3849702], dtype=float32)), ('observation', array([-3.4746149 , -1.6921669 , -9.1196995 , 1.4088092 , 0.84349155,\n",
+ " -9.425635 ], dtype=float32))])\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(\"_____OBSERVATION SPACE_____ \\n\")\n",
+ "print(\"The State Space is: \", s_size)\n",
+ "print(\"Sample observation\", env.observation_space.sample()) # Get a random observation"
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "g_JClfElGFnF"
+ },
"source": [
"The observation space **is a dictionary with 3 different elements**:\n",
"- `achieved_goal`: (x,y,z) the current position of the end-effector.\n",
@@ -368,45 +391,57 @@
"- `observation`: position (x,y,z) and velocity of the end-effector (vx, vy, vz).\n",
"\n",
"Given it's a dictionary as observation, **we will need to use a MultiInputPolicy policy instead of MlpPolicy**."
- ],
- "metadata": {
- "id": "g_JClfElGFnF"
- }
+ ]
},
{
"cell_type": "code",
+ "execution_count": 18,
+ "metadata": {
+ "id": "ib1Kxy4AF-FC"
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n",
+ " _____ACTION SPACE_____ \n",
+ "\n",
+ "The Action Space is: Box(-1.0, 1.0, (3,), float32)\n",
+ "Action Space Sample [-0.28385562 -0.9789819 -0.80975497]\n"
+ ]
+ }
+ ],
"source": [
"print(\"\\n _____ACTION SPACE_____ \\n\")\n",
"print(\"The Action Space is: \", a_size)\n",
"print(\"Action Space Sample\", env.action_space.sample()) # Take a random action"
- ],
- "metadata": {
- "id": "ib1Kxy4AF-FC"
- },
- "execution_count": null,
- "outputs": []
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "5MHTHEHZS4yp"
+ },
"source": [
"The action space is a vector with 3 values:\n",
"- Control x, y, z movement"
- ],
- "metadata": {
- "id": "5MHTHEHZS4yp"
- }
+ ]
},
{
"cell_type": "markdown",
- "source": [
- "### Normalize observation and rewards"
- ],
"metadata": {
"id": "S5sXcg469ysB"
- }
+ },
+ "source": [
+ "### Normalize observation and rewards"
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "1ZyX6qf3Zva9"
+ },
"source": [
"A good practice in reinforcement learning is to [normalize input features](https://stable-baselines3.readthedocs.io/en/master/guide/rl_tips.html).\n",
"\n",
@@ -415,140 +450,9205 @@
"We also normalize rewards with this same wrapper by adding `norm_reward = True`\n",
"\n",
"[You should check the documentation to fill this cell](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecnormalize)"
- ],
- "metadata": {
- "id": "1ZyX6qf3Zva9"
- }
+ ]
},
{
"cell_type": "code",
+ "execution_count": 20,
+ "metadata": {
+ "id": "1RsDtHHAQ9Ie"
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "argv[0]=--background_color_red=0.8745098114013672\n",
+ "argv[1]=--background_color_green=0.21176470816135406\n",
+ "argv[2]=--background_color_blue=0.1764705926179886\n",
+ "argv[0]=--background_color_red=0.8745098114013672\n",
+ "argv[1]=--background_color_green=0.21176470816135406\n",
+ "argv[2]=--background_color_blue=0.1764705926179886\n",
+ "argv[0]=--background_color_red=0.8745098114013672\n",
+ "argv[1]=--background_color_green=0.21176470816135406\n",
+ "argv[2]=--background_color_blue=0.1764705926179886\n",
+ "argv[0]=--background_color_red=0.8745098114013672\n",
+ "argv[1]=--background_color_green=0.21176470816135406\n",
+ "argv[2]=--background_color_blue=0.1764705926179886\n"
+ ]
+ }
+ ],
"source": [
"env = make_vec_env(env_id, n_envs=4)\n",
"\n",
"# Adding this wrapper to normalize the observation and the reward\n",
- "env = # TODO: Add the wrapper"
- ],
- "metadata": {
- "id": "1RsDtHHAQ9Ie"
- },
- "execution_count": null,
- "outputs": []
+ "env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.)"
+ ]
},
{
"cell_type": "markdown",
- "source": [
- "#### Solution"
- ],
"metadata": {
"id": "tF42HvI7-gs5"
- }
+ },
+ "source": [
+ "#### Solution"
+ ]
},
{
"cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "2O67mqgC-hol"
+ },
+ "outputs": [],
"source": [
"env = make_vec_env(env_id, n_envs=4)\n",
"\n",
"env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.)"
- ],
- "metadata": {
- "id": "2O67mqgC-hol"
- },
- "execution_count": null,
- "outputs": []
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "4JmEVU6z1ZA-"
+ },
"source": [
"### Create the A2C Model 🤖\n",
"\n",
"For more information about A2C implementation with StableBaselines3 check: https://stable-baselines3.readthedocs.io/en/master/modules/a2c.html#notes\n",
"\n",
"To find the best parameters I checked the [official trained agents by Stable-Baselines3 team](https://huggingface.co/sb3)."
- ],
- "metadata": {
- "id": "4JmEVU6z1ZA-"
- }
+ ]
},
{
"cell_type": "code",
- "source": [
- "model = # Create the A2C model and try to find the best parameters"
- ],
+ "execution_count": 26,
"metadata": {
"id": "vR3T4qFt164I"
},
- "execution_count": null,
- "outputs": []
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Using cuda device\n"
+ ]
+ }
+ ],
+ "source": [
+ "model = A2C(\"MultiInputPolicy\", env, verbose=1)"
+ ]
},
{
"cell_type": "markdown",
- "source": [
- "#### Solution"
- ],
"metadata": {
"id": "nWAuOOLh-oQf"
- }
+ },
+ "source": [
+ "#### Solution"
+ ]
},
{
"cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "FKFLY54T-pU1"
+ },
+ "outputs": [],
"source": [
"model = A2C(policy = \"MultiInputPolicy\",\n",
" env = env,\n",
" verbose=1)"
- ],
- "metadata": {
- "id": "FKFLY54T-pU1"
- },
- "execution_count": null,
- "outputs": []
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "opyK3mpJ1-m9"
+ },
"source": [
"### Train the A2C agent 🏃\n",
"- Let's train our agent for 1,000,000 timesteps, don't forget to use GPU on Colab. It will take approximately ~25-40min"
- ],
- "metadata": {
- "id": "opyK3mpJ1-m9"
- }
+ ]
},
{
"cell_type": "code",
- "source": [
- "model.learn(1_000_000)"
- ],
+ "execution_count": 27,
"metadata": {
"id": "4TuGHZD7RF1G"
},
- "execution_count": null,
- "outputs": []
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 44.5 |\n",
+ "| ep_rew_mean | -12.4 |\n",
+ "| time/ | |\n",
+ "| fps | 313 |\n",
+ "| iterations | 100 |\n",
+ "| time_elapsed | 6 |\n",
+ "| total_timesteps | 2000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.22 |\n",
+ "| explained_variance | 0.9545538 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 99 |\n",
+ "| policy_loss | -0.349 |\n",
+ "| std | 0.988 |\n",
+ "| value_loss | 0.322 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 45.4 |\n",
+ "| ep_rew_mean | -13 |\n",
+ "| time/ | |\n",
+ "| fps | 316 |\n",
+ "| iterations | 200 |\n",
+ "| time_elapsed | 12 |\n",
+ "| total_timesteps | 4000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.25 |\n",
+ "| explained_variance | 0.97950953 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 199 |\n",
+ "| policy_loss | -1.26 |\n",
+ "| std | 0.998 |\n",
+ "| value_loss | 0.118 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 44.8 |\n",
+ "| ep_rew_mean | -13.4 |\n",
+ "| time/ | |\n",
+ "| fps | 326 |\n",
+ "| iterations | 300 |\n",
+ "| time_elapsed | 18 |\n",
+ "| total_timesteps | 6000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.25 |\n",
+ "| explained_variance | 0.9330401 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 299 |\n",
+ "| policy_loss | 0.0686 |\n",
+ "| std | 0.998 |\n",
+ "| value_loss | 0.38 |\n",
+ "-------------------------------------\n",
+ "------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 43.2 |\n",
+ "| ep_rew_mean | -13.1 |\n",
+ "| time/ | |\n",
+ "| fps | 284 |\n",
+ "| iterations | 400 |\n",
+ "| time_elapsed | 28 |\n",
+ "| total_timesteps | 8000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.25 |\n",
+ "| explained_variance | 0.521664 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 399 |\n",
+ "| policy_loss | 0.192 |\n",
+ "| std | 0.999 |\n",
+ "| value_loss | 0.0753 |\n",
+ "------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 43.8 |\n",
+ "| ep_rew_mean | -12.6 |\n",
+ "| time/ | |\n",
+ "| fps | 295 |\n",
+ "| iterations | 500 |\n",
+ "| time_elapsed | 33 |\n",
+ "| total_timesteps | 10000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.26 |\n",
+ "| explained_variance | 0.9645154 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 499 |\n",
+ "| policy_loss | 0.284 |\n",
+ "| std | 1 |\n",
+ "| value_loss | 0.0268 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 44.8 |\n",
+ "| ep_rew_mean | -12.5 |\n",
+ "| time/ | |\n",
+ "| fps | 299 |\n",
+ "| iterations | 600 |\n",
+ "| time_elapsed | 40 |\n",
+ "| total_timesteps | 12000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.27 |\n",
+ "| explained_variance | 0.9733915 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 599 |\n",
+ "| policy_loss | -0.31 |\n",
+ "| std | 1.01 |\n",
+ "| value_loss | 0.0512 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 43.1 |\n",
+ "| ep_rew_mean | -11.9 |\n",
+ "| time/ | |\n",
+ "| fps | 301 |\n",
+ "| iterations | 700 |\n",
+ "| time_elapsed | 46 |\n",
+ "| total_timesteps | 14000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.27 |\n",
+ "| explained_variance | 0.9599183 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 699 |\n",
+ "| policy_loss | -0.199 |\n",
+ "| std | 1.01 |\n",
+ "| value_loss | 0.0319 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 38.5 |\n",
+ "| ep_rew_mean | -9.69 |\n",
+ "| time/ | |\n",
+ "| fps | 306 |\n",
+ "| iterations | 800 |\n",
+ "| time_elapsed | 52 |\n",
+ "| total_timesteps | 16000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.26 |\n",
+ "| explained_variance | 0.9946942 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 799 |\n",
+ "| policy_loss | -0.0755 |\n",
+ "| std | 1 |\n",
+ "| value_loss | 0.0182 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 36.8 |\n",
+ "| ep_rew_mean | -8.98 |\n",
+ "| time/ | |\n",
+ "| fps | 289 |\n",
+ "| iterations | 900 |\n",
+ "| time_elapsed | 62 |\n",
+ "| total_timesteps | 18000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.29 |\n",
+ "| explained_variance | 0.9358697 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 899 |\n",
+ "| policy_loss | 0.139 |\n",
+ "| std | 1.01 |\n",
+ "| value_loss | 0.0267 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 39 |\n",
+ "| ep_rew_mean | -9.08 |\n",
+ "| time/ | |\n",
+ "| fps | 294 |\n",
+ "| iterations | 1000 |\n",
+ "| time_elapsed | 68 |\n",
+ "| total_timesteps | 20000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.3 |\n",
+ "| explained_variance | 0.6885923 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 999 |\n",
+ "| policy_loss | 0.508 |\n",
+ "| std | 1.02 |\n",
+ "| value_loss | 0.108 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 37 |\n",
+ "| ep_rew_mean | -8.26 |\n",
+ "| time/ | |\n",
+ "| fps | 295 |\n",
+ "| iterations | 1100 |\n",
+ "| time_elapsed | 74 |\n",
+ "| total_timesteps | 22000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.32 |\n",
+ "| explained_variance | 0.97540057 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 1099 |\n",
+ "| policy_loss | -0.747 |\n",
+ "| std | 1.02 |\n",
+ "| value_loss | 0.0566 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 38.1 |\n",
+ "| ep_rew_mean | -8.9 |\n",
+ "| time/ | |\n",
+ "| fps | 299 |\n",
+ "| iterations | 1200 |\n",
+ "| time_elapsed | 80 |\n",
+ "| total_timesteps | 24000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.32 |\n",
+ "| explained_variance | 0.9357179 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 1199 |\n",
+ "| policy_loss | -0.0341 |\n",
+ "| std | 1.02 |\n",
+ "| value_loss | 0.0184 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 40.9 |\n",
+ "| ep_rew_mean | -10.1 |\n",
+ "| time/ | |\n",
+ "| fps | 303 |\n",
+ "| iterations | 1300 |\n",
+ "| time_elapsed | 85 |\n",
+ "| total_timesteps | 26000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.33 |\n",
+ "| explained_variance | 0.9974439 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 1299 |\n",
+ "| policy_loss | -0.35 |\n",
+ "| std | 1.03 |\n",
+ "| value_loss | 0.0159 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 39.5 |\n",
+ "| ep_rew_mean | -9.84 |\n",
+ "| time/ | |\n",
+ "| fps | 306 |\n",
+ "| iterations | 1400 |\n",
+ "| time_elapsed | 91 |\n",
+ "| total_timesteps | 28000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.33 |\n",
+ "| explained_variance | 0.99562174 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 1399 |\n",
+ "| policy_loss | -0.294 |\n",
+ "| std | 1.03 |\n",
+ "| value_loss | 0.0101 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 37.8 |\n",
+ "| ep_rew_mean | -8.64 |\n",
+ "| time/ | |\n",
+ "| fps | 298 |\n",
+ "| iterations | 1500 |\n",
+ "| time_elapsed | 100 |\n",
+ "| total_timesteps | 30000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.34 |\n",
+ "| explained_variance | 0.9823143 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 1499 |\n",
+ "| policy_loss | 0.104 |\n",
+ "| std | 1.03 |\n",
+ "| value_loss | 0.0163 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 33.4 |\n",
+ "| ep_rew_mean | -6.76 |\n",
+ "| time/ | |\n",
+ "| fps | 302 |\n",
+ "| iterations | 1600 |\n",
+ "| time_elapsed | 105 |\n",
+ "| total_timesteps | 32000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.34 |\n",
+ "| explained_variance | 0.68713135 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 1599 |\n",
+ "| policy_loss | -0.544 |\n",
+ "| std | 1.03 |\n",
+ "| value_loss | 0.0491 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 32.2 |\n",
+ "| ep_rew_mean | -5.76 |\n",
+ "| time/ | |\n",
+ "| fps | 307 |\n",
+ "| iterations | 1700 |\n",
+ "| time_elapsed | 110 |\n",
+ "| total_timesteps | 34000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.33 |\n",
+ "| explained_variance | 0.8737136 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 1699 |\n",
+ "| policy_loss | -0.729 |\n",
+ "| std | 1.02 |\n",
+ "| value_loss | 0.0799 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 22.7 |\n",
+ "| ep_rew_mean | -3.62 |\n",
+ "| time/ | |\n",
+ "| fps | 307 |\n",
+ "| iterations | 1800 |\n",
+ "| time_elapsed | 116 |\n",
+ "| total_timesteps | 36000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.33 |\n",
+ "| explained_variance | 0.9746877 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 1799 |\n",
+ "| policy_loss | -0.063 |\n",
+ "| std | 1.03 |\n",
+ "| value_loss | 0.00838 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 20.6 |\n",
+ "| ep_rew_mean | -2.93 |\n",
+ "| time/ | |\n",
+ "| fps | 309 |\n",
+ "| iterations | 1900 |\n",
+ "| time_elapsed | 122 |\n",
+ "| total_timesteps | 38000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.32 |\n",
+ "| explained_variance | 0.87348247 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 1899 |\n",
+ "| policy_loss | -0.449 |\n",
+ "| std | 1.02 |\n",
+ "| value_loss | 0.0365 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 18.6 |\n",
+ "| ep_rew_mean | -2.43 |\n",
+ "| time/ | |\n",
+ "| fps | 301 |\n",
+ "| iterations | 2000 |\n",
+ "| time_elapsed | 132 |\n",
+ "| total_timesteps | 40000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.3 |\n",
+ "| explained_variance | 0.9903151 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 1999 |\n",
+ "| policy_loss | -0.0678 |\n",
+ "| std | 1.01 |\n",
+ "| value_loss | 0.00914 |\n",
+ "-------------------------------------\n",
+ "------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 14.8 |\n",
+ "| ep_rew_mean | -1.72 |\n",
+ "| time/ | |\n",
+ "| fps | 302 |\n",
+ "| iterations | 2100 |\n",
+ "| time_elapsed | 139 |\n",
+ "| total_timesteps | 42000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.28 |\n",
+ "| explained_variance | -1.61925 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 2099 |\n",
+ "| policy_loss | 1.93 |\n",
+ "| std | 1.01 |\n",
+ "| value_loss | 0.395 |\n",
+ "------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 8.37 |\n",
+ "| ep_rew_mean | -0.83 |\n",
+ "| time/ | |\n",
+ "| fps | 302 |\n",
+ "| iterations | 2200 |\n",
+ "| time_elapsed | 145 |\n",
+ "| total_timesteps | 44000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.25 |\n",
+ "| explained_variance | 0.24723881 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 2199 |\n",
+ "| policy_loss | -0.415 |\n",
+ "| std | 0.998 |\n",
+ "| value_loss | 0.0231 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 6.63 |\n",
+ "| ep_rew_mean | -0.601 |\n",
+ "| time/ | |\n",
+ "| fps | 303 |\n",
+ "| iterations | 2300 |\n",
+ "| time_elapsed | 151 |\n",
+ "| total_timesteps | 46000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.18 |\n",
+ "| explained_variance | 0.5324587 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 2299 |\n",
+ "| policy_loss | 0.143 |\n",
+ "| std | 0.977 |\n",
+ "| value_loss | 0.00811 |\n",
+ "-------------------------------------\n",
+ "---------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 4.89 |\n",
+ "| ep_rew_mean | -0.413 |\n",
+ "| time/ | |\n",
+ "| fps | 304 |\n",
+ "| iterations | 2400 |\n",
+ "| time_elapsed | 157 |\n",
+ "| total_timesteps | 48000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.14 |\n",
+ "| explained_variance | -0.40441775 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 2399 |\n",
+ "| policy_loss | 0.153 |\n",
+ "| std | 0.962 |\n",
+ "| value_loss | 0.00267 |\n",
+ "---------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 3.79 |\n",
+ "| ep_rew_mean | -0.294 |\n",
+ "| time/ | |\n",
+ "| fps | 297 |\n",
+ "| iterations | 2500 |\n",
+ "| time_elapsed | 168 |\n",
+ "| total_timesteps | 50000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.11 |\n",
+ "| explained_variance | -0.5201168 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 2499 |\n",
+ "| policy_loss | 0.119 |\n",
+ "| std | 0.955 |\n",
+ "| value_loss | 0.0067 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 3.5 |\n",
+ "| ep_rew_mean | -0.273 |\n",
+ "| time/ | |\n",
+ "| fps | 297 |\n",
+ "| iterations | 2600 |\n",
+ "| time_elapsed | 175 |\n",
+ "| total_timesteps | 52000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.06 |\n",
+ "| explained_variance | 0.44135898 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 2599 |\n",
+ "| policy_loss | -0.151 |\n",
+ "| std | 0.938 |\n",
+ "| value_loss | 0.00198 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 3.36 |\n",
+ "| ep_rew_mean | -0.265 |\n",
+ "| time/ | |\n",
+ "| fps | 297 |\n",
+ "| iterations | 2700 |\n",
+ "| time_elapsed | 181 |\n",
+ "| total_timesteps | 54000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.04 |\n",
+ "| explained_variance | 0.22025794 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 2699 |\n",
+ "| policy_loss | 0.0714 |\n",
+ "| std | 0.931 |\n",
+ "| value_loss | 0.00233 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 3.41 |\n",
+ "| ep_rew_mean | -0.266 |\n",
+ "| time/ | |\n",
+ "| fps | 296 |\n",
+ "| iterations | 2800 |\n",
+ "| time_elapsed | 188 |\n",
+ "| total_timesteps | 56000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.99 |\n",
+ "| explained_variance | 0.51552105 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 2799 |\n",
+ "| policy_loss | -0.101 |\n",
+ "| std | 0.917 |\n",
+ "| value_loss | 0.0015 |\n",
+ "--------------------------------------\n",
+ "------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 3.39 |\n",
+ "| ep_rew_mean | -0.275 |\n",
+ "| time/ | |\n",
+ "| fps | 296 |\n",
+ "| iterations | 2900 |\n",
+ "| time_elapsed | 195 |\n",
+ "| total_timesteps | 58000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.97 |\n",
+ "| explained_variance | 0.455292 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 2899 |\n",
+ "| policy_loss | -0.0549 |\n",
+ "| std | 0.908 |\n",
+ "| value_loss | 0.0009 |\n",
+ "------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 3.07 |\n",
+ "| ep_rew_mean | -0.243 |\n",
+ "| time/ | |\n",
+ "| fps | 292 |\n",
+ "| iterations | 3000 |\n",
+ "| time_elapsed | 205 |\n",
+ "| total_timesteps | 60000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.94 |\n",
+ "| explained_variance | 0.7100135 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 2999 |\n",
+ "| policy_loss | 0.0327 |\n",
+ "| std | 0.9 |\n",
+ "| value_loss | 0.000537 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 3.12 |\n",
+ "| ep_rew_mean | -0.249 |\n",
+ "| time/ | |\n",
+ "| fps | 293 |\n",
+ "| iterations | 3100 |\n",
+ "| time_elapsed | 210 |\n",
+ "| total_timesteps | 62000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.9 |\n",
+ "| explained_variance | 0.5587412 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 3099 |\n",
+ "| policy_loss | -0.0118 |\n",
+ "| std | 0.889 |\n",
+ "| value_loss | 0.000493 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 3.08 |\n",
+ "| ep_rew_mean | -0.247 |\n",
+ "| time/ | |\n",
+ "| fps | 295 |\n",
+ "| iterations | 3200 |\n",
+ "| time_elapsed | 216 |\n",
+ "| total_timesteps | 64000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.87 |\n",
+ "| explained_variance | 0.12363589 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 3199 |\n",
+ "| policy_loss | 0.0663 |\n",
+ "| std | 0.878 |\n",
+ "| value_loss | 0.0016 |\n",
+ "--------------------------------------\n",
+ "---------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 3.44 |\n",
+ "| ep_rew_mean | -0.273 |\n",
+ "| time/ | |\n",
+ "| fps | 297 |\n",
+ "| iterations | 3300 |\n",
+ "| time_elapsed | 222 |\n",
+ "| total_timesteps | 66000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.84 |\n",
+ "| explained_variance | -0.74497736 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 3299 |\n",
+ "| policy_loss | 0.111 |\n",
+ "| std | 0.873 |\n",
+ "| value_loss | 0.00291 |\n",
+ "---------------------------------------\n",
+ "---------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 3.25 |\n",
+ "| ep_rew_mean | -0.265 |\n",
+ "| time/ | |\n",
+ "| fps | 298 |\n",
+ "| iterations | 3400 |\n",
+ "| time_elapsed | 228 |\n",
+ "| total_timesteps | 68000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.8 |\n",
+ "| explained_variance | -0.30396366 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 3399 |\n",
+ "| policy_loss | 0.216 |\n",
+ "| std | 0.86 |\n",
+ "| value_loss | 0.00472 |\n",
+ "---------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 3.13 |\n",
+ "| ep_rew_mean | -0.247 |\n",
+ "| time/ | |\n",
+ "| fps | 298 |\n",
+ "| iterations | 3500 |\n",
+ "| time_elapsed | 234 |\n",
+ "| total_timesteps | 70000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.78 |\n",
+ "| explained_variance | 0.67658997 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 3499 |\n",
+ "| policy_loss | -0.0138 |\n",
+ "| std | 0.854 |\n",
+ "| value_loss | 0.00042 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 3.33 |\n",
+ "| ep_rew_mean | -0.263 |\n",
+ "| time/ | |\n",
+ "| fps | 294 |\n",
+ "| iterations | 3600 |\n",
+ "| time_elapsed | 244 |\n",
+ "| total_timesteps | 72000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.74 |\n",
+ "| explained_variance | -0.9163362 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 3599 |\n",
+ "| policy_loss | 0.134 |\n",
+ "| std | 0.842 |\n",
+ "| value_loss | 0.00447 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.85 |\n",
+ "| ep_rew_mean | -0.232 |\n",
+ "| time/ | |\n",
+ "| fps | 296 |\n",
+ "| iterations | 3700 |\n",
+ "| time_elapsed | 249 |\n",
+ "| total_timesteps | 74000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.7 |\n",
+ "| explained_variance | -0.2427007 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 3699 |\n",
+ "| policy_loss | -0.0458 |\n",
+ "| std | 0.832 |\n",
+ "| value_loss | 0.000685 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.78 |\n",
+ "| ep_rew_mean | -0.215 |\n",
+ "| time/ | |\n",
+ "| fps | 297 |\n",
+ "| iterations | 3800 |\n",
+ "| time_elapsed | 255 |\n",
+ "| total_timesteps | 76000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.69 |\n",
+ "| explained_variance | 0.70643455 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 3799 |\n",
+ "| policy_loss | 0.087 |\n",
+ "| std | 0.827 |\n",
+ "| value_loss | 0.00111 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.92 |\n",
+ "| ep_rew_mean | -0.237 |\n",
+ "| time/ | |\n",
+ "| fps | 298 |\n",
+ "| iterations | 3900 |\n",
+ "| time_elapsed | 260 |\n",
+ "| total_timesteps | 78000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.64 |\n",
+ "| explained_variance | 0.3595901 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 3899 |\n",
+ "| policy_loss | -0.207 |\n",
+ "| std | 0.815 |\n",
+ "| value_loss | 0.00451 |\n",
+ "-------------------------------------\n",
+ "---------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 3.01 |\n",
+ "| ep_rew_mean | -0.232 |\n",
+ "| time/ | |\n",
+ "| fps | 300 |\n",
+ "| iterations | 4000 |\n",
+ "| time_elapsed | 266 |\n",
+ "| total_timesteps | 80000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.62 |\n",
+ "| explained_variance | -0.26341498 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 3999 |\n",
+ "| policy_loss | 0.0463 |\n",
+ "| std | 0.807 |\n",
+ "| value_loss | 0.00229 |\n",
+ "---------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.86 |\n",
+ "| ep_rew_mean | -0.218 |\n",
+ "| time/ | |\n",
+ "| fps | 297 |\n",
+ "| iterations | 4100 |\n",
+ "| time_elapsed | 275 |\n",
+ "| total_timesteps | 82000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.6 |\n",
+ "| explained_variance | 0.66822636 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 4099 |\n",
+ "| policy_loss | -0.00514 |\n",
+ "| std | 0.804 |\n",
+ "| value_loss | 0.000157 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 3.03 |\n",
+ "| ep_rew_mean | -0.23 |\n",
+ "| time/ | |\n",
+ "| fps | 297 |\n",
+ "| iterations | 4200 |\n",
+ "| time_elapsed | 282 |\n",
+ "| total_timesteps | 84000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.56 |\n",
+ "| explained_variance | 0.62520474 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 4199 |\n",
+ "| policy_loss | 0.0369 |\n",
+ "| std | 0.793 |\n",
+ "| value_loss | 0.00071 |\n",
+ "--------------------------------------\n",
+ "... [output truncated: the same rollout/time/train metric table repeats every 100 iterations (2000 timesteps) through iteration 10600 / 212000 total timesteps at ~297-300 fps] ...\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.75 |\n",
+ "| ep_rew_mean | -0.214 |\n",
+ "| time/ | |\n",
+ "| fps | 297 |\n",
+ "| iterations | 10700 |\n",
+ "| time_elapsed | 718 |\n",
+ "| total_timesteps | 214000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.55 |\n",
+ "| explained_variance | 0.4303016 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 10699 |\n",
+ "| policy_loss | -0.0253 |\n",
+ "| std | 0.566 |\n",
+ "| value_loss | 0.00045 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.68 |\n",
+ "| ep_rew_mean | -0.209 |\n",
+ "| time/ | |\n",
+ "| fps | 298 |\n",
+ "| iterations | 10800 |\n",
+ "| time_elapsed | 724 |\n",
+ "| total_timesteps | 216000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.53 |\n",
+ "| explained_variance | 0.94789743 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 10799 |\n",
+ "| policy_loss | 0.00158 |\n",
+ "| std | 0.564 |\n",
+ "| value_loss | 0.000105 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.91 |\n",
+ "| ep_rew_mean | -0.22 |\n",
+ "| time/ | |\n",
+ "| fps | 298 |\n",
+ "| iterations | 10900 |\n",
+ "| time_elapsed | 730 |\n",
+ "| total_timesteps | 218000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.52 |\n",
+ "| explained_variance | 0.89148146 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 10899 |\n",
+ "| policy_loss | -0.00433 |\n",
+ "| std | 0.561 |\n",
+ "| value_loss | 0.000233 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.88 |\n",
+ "| ep_rew_mean | -0.224 |\n",
+ "| time/ | |\n",
+ "| fps | 298 |\n",
+ "| iterations | 11000 |\n",
+ "| time_elapsed | 736 |\n",
+ "| total_timesteps | 220000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.52 |\n",
+ "| explained_variance | 0.6723757 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 10999 |\n",
+ "| policy_loss | 0.0337 |\n",
+ "| std | 0.561 |\n",
+ "| value_loss | 0.000452 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.82 |\n",
+ "| ep_rew_mean | -0.221 |\n",
+ "| time/ | |\n",
+ "| fps | 297 |\n",
+ "| iterations | 11100 |\n",
+ "| time_elapsed | 745 |\n",
+ "| total_timesteps | 222000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.51 |\n",
+ "| explained_variance | 0.9861437 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 11099 |\n",
+ "| policy_loss | 0.00968 |\n",
+ "| std | 0.56 |\n",
+ "| value_loss | 6.46e-05 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.81 |\n",
+ "| ep_rew_mean | -0.218 |\n",
+ "| time/ | |\n",
+ "| fps | 298 |\n",
+ "| iterations | 11200 |\n",
+ "| time_elapsed | 751 |\n",
+ "| total_timesteps | 224000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.52 |\n",
+ "| explained_variance | 0.9299464 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 11199 |\n",
+ "| policy_loss | -0.00854 |\n",
+ "| std | 0.56 |\n",
+ "| value_loss | 8.38e-05 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.96 |\n",
+ "| ep_rew_mean | -0.236 |\n",
+ "| time/ | |\n",
+ "| fps | 298 |\n",
+ "| iterations | 11300 |\n",
+ "| time_elapsed | 757 |\n",
+ "| total_timesteps | 226000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.5 |\n",
+ "| explained_variance | 0.8100773 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 11299 |\n",
+ "| policy_loss | 0.0223 |\n",
+ "| std | 0.558 |\n",
+ "| value_loss | 0.000133 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.88 |\n",
+ "| ep_rew_mean | -0.223 |\n",
+ "| time/ | |\n",
+ "| fps | 298 |\n",
+ "| iterations | 11400 |\n",
+ "| time_elapsed | 762 |\n",
+ "| total_timesteps | 228000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.48 |\n",
+ "| explained_variance | 0.9284025 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 11399 |\n",
+ "| policy_loss | 0.0199 |\n",
+ "| std | 0.553 |\n",
+ "| value_loss | 0.000181 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.71 |\n",
+ "| ep_rew_mean | -0.204 |\n",
+ "| time/ | |\n",
+ "| fps | 299 |\n",
+ "| iterations | 11500 |\n",
+ "| time_elapsed | 768 |\n",
+ "| total_timesteps | 230000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.46 |\n",
+ "| explained_variance | 0.91747606 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 11499 |\n",
+ "| policy_loss | 0.00503 |\n",
+ "| std | 0.55 |\n",
+ "| value_loss | 9.29e-05 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.61 |\n",
+ "| ep_rew_mean | -0.21 |\n",
+ "| time/ | |\n",
+ "| fps | 298 |\n",
+ "| iterations | 11600 |\n",
+ "| time_elapsed | 777 |\n",
+ "| total_timesteps | 232000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.44 |\n",
+ "| explained_variance | 0.94396347 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 11599 |\n",
+ "| policy_loss | 0.0116 |\n",
+ "| std | 0.546 |\n",
+ "| value_loss | 0.000102 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.74 |\n",
+ "| ep_rew_mean | -0.223 |\n",
+ "| time/ | |\n",
+ "| fps | 298 |\n",
+ "| iterations | 11700 |\n",
+ "| time_elapsed | 783 |\n",
+ "| total_timesteps | 234000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.41 |\n",
+ "| explained_variance | 0.90888155 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 11699 |\n",
+ "| policy_loss | -0.00239 |\n",
+ "| std | 0.542 |\n",
+ "| value_loss | 0.00012 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.79 |\n",
+ "| ep_rew_mean | -0.223 |\n",
+ "| time/ | |\n",
+ "| fps | 299 |\n",
+ "| iterations | 11800 |\n",
+ "| time_elapsed | 789 |\n",
+ "| total_timesteps | 236000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.41 |\n",
+ "| explained_variance | 0.8832614 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 11799 |\n",
+ "| policy_loss | 0.00491 |\n",
+ "| std | 0.542 |\n",
+ "| value_loss | 0.000124 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.86 |\n",
+ "| ep_rew_mean | -0.228 |\n",
+ "| time/ | |\n",
+ "| fps | 299 |\n",
+ "| iterations | 11900 |\n",
+ "| time_elapsed | 795 |\n",
+ "| total_timesteps | 238000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.41 |\n",
+ "| explained_variance | 0.74971235 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 11899 |\n",
+ "| policy_loss | -0.0369 |\n",
+ "| std | 0.542 |\n",
+ "| value_loss | 0.000599 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.76 |\n",
+ "| ep_rew_mean | -0.213 |\n",
+ "| time/ | |\n",
+ "| fps | 299 |\n",
+ "| iterations | 12000 |\n",
+ "| time_elapsed | 800 |\n",
+ "| total_timesteps | 240000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.41 |\n",
+ "| explained_variance | 0.9652594 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 11999 |\n",
+ "| policy_loss | 0.0374 |\n",
+ "| std | 0.542 |\n",
+ "| value_loss | 0.00025 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.83 |\n",
+ "| ep_rew_mean | -0.234 |\n",
+ "| time/ | |\n",
+ "| fps | 300 |\n",
+ "| iterations | 12100 |\n",
+ "| time_elapsed | 806 |\n",
+ "| total_timesteps | 242000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.4 |\n",
+ "| explained_variance | 0.7710167 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 12099 |\n",
+ "| policy_loss | 0.0269 |\n",
+ "| std | 0.54 |\n",
+ "| value_loss | 0.000402 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.77 |\n",
+ "| ep_rew_mean | -0.206 |\n",
+ "| time/ | |\n",
+ "| fps | 299 |\n",
+ "| iterations | 12200 |\n",
+ "| time_elapsed | 815 |\n",
+ "| total_timesteps | 244000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.39 |\n",
+ "| explained_variance | 0.91825235 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 12199 |\n",
+ "| policy_loss | 0.0442 |\n",
+ "| std | 0.539 |\n",
+ "| value_loss | 0.000402 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.92 |\n",
+ "| ep_rew_mean | -0.229 |\n",
+ "| time/ | |\n",
+ "| fps | 299 |\n",
+ "| iterations | 12300 |\n",
+ "| time_elapsed | 821 |\n",
+ "| total_timesteps | 246000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.39 |\n",
+ "| explained_variance | 0.9738185 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 12299 |\n",
+ "| policy_loss | 0.0144 |\n",
+ "| std | 0.538 |\n",
+ "| value_loss | 9.16e-05 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.79 |\n",
+ "| ep_rew_mean | -0.22 |\n",
+ "| time/ | |\n",
+ "| fps | 299 |\n",
+ "| iterations | 12400 |\n",
+ "| time_elapsed | 826 |\n",
+ "| total_timesteps | 248000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.4 |\n",
+ "| explained_variance | 0.89304084 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 12399 |\n",
+ "| policy_loss | 0.0163 |\n",
+ "| std | 0.539 |\n",
+ "| value_loss | 0.000238 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.75 |\n",
+ "| ep_rew_mean | -0.213 |\n",
+ "| time/ | |\n",
+ "| fps | 300 |\n",
+ "| iterations | 12500 |\n",
+ "| time_elapsed | 832 |\n",
+ "| total_timesteps | 250000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.4 |\n",
+ "| explained_variance | 0.9213255 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 12499 |\n",
+ "| policy_loss | -0.00162 |\n",
+ "| std | 0.538 |\n",
+ "| value_loss | 7.67e-05 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.68 |\n",
+ "| ep_rew_mean | -0.203 |\n",
+ "| time/ | |\n",
+ "| fps | 300 |\n",
+ "| iterations | 12600 |\n",
+ "| time_elapsed | 837 |\n",
+ "| total_timesteps | 252000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.37 |\n",
+ "| explained_variance | 0.8798297 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 12599 |\n",
+ "| policy_loss | -0.0144 |\n",
+ "| std | 0.534 |\n",
+ "| value_loss | 0.000193 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.83 |\n",
+ "| ep_rew_mean | -0.231 |\n",
+ "| time/ | |\n",
+ "| fps | 301 |\n",
+ "| iterations | 12700 |\n",
+ "| time_elapsed | 843 |\n",
+ "| total_timesteps | 254000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.35 |\n",
+ "| explained_variance | 0.9373393 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 12699 |\n",
+ "| policy_loss | -0.0111 |\n",
+ "| std | 0.53 |\n",
+ "| value_loss | 9.91e-05 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.78 |\n",
+ "| ep_rew_mean | -0.219 |\n",
+ "| time/ | |\n",
+ "| fps | 300 |\n",
+ "| iterations | 12800 |\n",
+ "| time_elapsed | 853 |\n",
+ "| total_timesteps | 256000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.33 |\n",
+ "| explained_variance | 0.9766309 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 12799 |\n",
+ "| policy_loss | -0.0042 |\n",
+ "| std | 0.527 |\n",
+ "| value_loss | 7.05e-05 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.67 |\n",
+ "| ep_rew_mean | -0.212 |\n",
+ "| time/ | |\n",
+ "| fps | 300 |\n",
+ "| iterations | 12900 |\n",
+ "| time_elapsed | 858 |\n",
+ "| total_timesteps | 258000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.34 |\n",
+ "| explained_variance | 0.9444415 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 12899 |\n",
+ "| policy_loss | 0.0116 |\n",
+ "| std | 0.528 |\n",
+ "| value_loss | 0.000126 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.77 |\n",
+ "| ep_rew_mean | -0.219 |\n",
+ "| time/ | |\n",
+ "| fps | 300 |\n",
+ "| iterations | 13000 |\n",
+ "| time_elapsed | 864 |\n",
+ "| total_timesteps | 260000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.32 |\n",
+ "| explained_variance | 0.8382063 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 12999 |\n",
+ "| policy_loss | -0.0395 |\n",
+ "| std | 0.524 |\n",
+ "| value_loss | 0.000643 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.83 |\n",
+ "| ep_rew_mean | -0.219 |\n",
+ "| time/ | |\n",
+ "| fps | 300 |\n",
+ "| iterations | 13100 |\n",
+ "| time_elapsed | 870 |\n",
+ "| total_timesteps | 262000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.32 |\n",
+ "| explained_variance | 0.9722576 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 13099 |\n",
+ "| policy_loss | 0.0103 |\n",
+ "| std | 0.525 |\n",
+ "| value_loss | 7.43e-05 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.8 |\n",
+ "| ep_rew_mean | -0.216 |\n",
+ "| time/ | |\n",
+ "| fps | 301 |\n",
+ "| iterations | 13200 |\n",
+ "| time_elapsed | 875 |\n",
+ "| total_timesteps | 264000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.3 |\n",
+ "| explained_variance | 0.9009595 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 13199 |\n",
+ "| policy_loss | -0.00122 |\n",
+ "| std | 0.522 |\n",
+ "| value_loss | 9.14e-05 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.74 |\n",
+ "| ep_rew_mean | -0.21 |\n",
+ "| time/ | |\n",
+ "| fps | 300 |\n",
+ "| iterations | 13300 |\n",
+ "| time_elapsed | 885 |\n",
+ "| total_timesteps | 266000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.3 |\n",
+ "| explained_variance | 0.95411175 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 13299 |\n",
+ "| policy_loss | -0.00844 |\n",
+ "| std | 0.522 |\n",
+ "| value_loss | 8.08e-05 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.87 |\n",
+ "| ep_rew_mean | -0.235 |\n",
+ "| time/ | |\n",
+ "| fps | 300 |\n",
+ "| iterations | 13400 |\n",
+ "| time_elapsed | 890 |\n",
+ "| total_timesteps | 268000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.3 |\n",
+ "| explained_variance | 0.9836456 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 13399 |\n",
+ "| policy_loss | 0.0253 |\n",
+ "| std | 0.522 |\n",
+ "| value_loss | 0.000132 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.68 |\n",
+ "| ep_rew_mean | -0.216 |\n",
+ "| time/ | |\n",
+ "| fps | 301 |\n",
+ "| iterations | 13500 |\n",
+ "| time_elapsed | 896 |\n",
+ "| total_timesteps | 270000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.3 |\n",
+ "| explained_variance | 0.9624939 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 13499 |\n",
+ "| policy_loss | -0.000755 |\n",
+ "| std | 0.522 |\n",
+ "| value_loss | 7.38e-05 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.67 |\n",
+ "| ep_rew_mean | -0.202 |\n",
+ "| time/ | |\n",
+ "| fps | 301 |\n",
+ "| iterations | 13600 |\n",
+ "| time_elapsed | 901 |\n",
+ "| total_timesteps | 272000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.27 |\n",
+ "| explained_variance | 0.0788098 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 13599 |\n",
+ "| policy_loss | -0.0734 |\n",
+ "| std | 0.516 |\n",
+ "| value_loss | 0.00111 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.91 |\n",
+ "| ep_rew_mean | -0.24 |\n",
+ "| time/ | |\n",
+ "| fps | 302 |\n",
+ "| iterations | 13700 |\n",
+ "| time_elapsed | 907 |\n",
+ "| total_timesteps | 274000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.25 |\n",
+ "| explained_variance | 0.94273585 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 13699 |\n",
+ "| policy_loss | 0.00141 |\n",
+ "| std | 0.513 |\n",
+ "| value_loss | 0.000132 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.77 |\n",
+ "| ep_rew_mean | -0.216 |\n",
+ "| time/ | |\n",
+ "| fps | 302 |\n",
+ "| iterations | 13800 |\n",
+ "| time_elapsed | 912 |\n",
+ "| total_timesteps | 276000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.25 |\n",
+ "| explained_variance | 0.8948201 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 13799 |\n",
+ "| policy_loss | 0.0349 |\n",
+ "| std | 0.514 |\n",
+ "| value_loss | 0.000385 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.81 |\n",
+ "| ep_rew_mean | -0.217 |\n",
+ "| time/ | |\n",
+ "| fps | 301 |\n",
+ "| iterations | 13900 |\n",
+ "| time_elapsed | 921 |\n",
+ "| total_timesteps | 278000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.25 |\n",
+ "| explained_variance | 0.69086885 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 13899 |\n",
+ "| policy_loss | -0.014 |\n",
+ "| std | 0.513 |\n",
+ "| value_loss | 0.000472 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.86 |\n",
+ "| ep_rew_mean | -0.226 |\n",
+ "| time/ | |\n",
+ "| fps | 301 |\n",
+ "| iterations | 14000 |\n",
+ "| time_elapsed | 927 |\n",
+ "| total_timesteps | 280000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.24 |\n",
+ "| explained_variance | 0.92939675 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 13999 |\n",
+ "| policy_loss | 0.00778 |\n",
+ "| std | 0.511 |\n",
+ "| value_loss | 0.000158 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.91 |\n",
+ "| ep_rew_mean | -0.23 |\n",
+ "| time/ | |\n",
+ "| fps | 302 |\n",
+ "| iterations | 14100 |\n",
+ "| time_elapsed | 933 |\n",
+ "| total_timesteps | 282000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.23 |\n",
+ "| explained_variance | 0.97067803 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 14099 |\n",
+ "| policy_loss | -0.00249 |\n",
+ "| std | 0.51 |\n",
+ "| value_loss | 5.35e-05 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.78 |\n",
+ "| ep_rew_mean | -0.216 |\n",
+ "| time/ | |\n",
+ "| fps | 302 |\n",
+ "| iterations | 14200 |\n",
+ "| time_elapsed | 938 |\n",
+ "| total_timesteps | 284000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.23 |\n",
+ "| explained_variance | 0.9633729 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 14199 |\n",
+ "| policy_loss | 0.000154 |\n",
+ "| std | 0.51 |\n",
+ "| value_loss | 5.12e-05 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.87 |\n",
+ "| ep_rew_mean | -0.231 |\n",
+ "| time/ | |\n",
+ "| fps | 302 |\n",
+ "| iterations | 14300 |\n",
+ "| time_elapsed | 944 |\n",
+ "| total_timesteps | 286000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.23 |\n",
+ "| explained_variance | 0.97925717 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 14299 |\n",
+ "| policy_loss | -0.0138 |\n",
+ "| std | 0.51 |\n",
+ "| value_loss | 8.23e-05 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.97 |\n",
+ "| ep_rew_mean | -0.241 |\n",
+ "| time/ | |\n",
+ "| fps | 302 |\n",
+ "| iterations | 14400 |\n",
+ "| time_elapsed | 950 |\n",
+ "| total_timesteps | 288000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.23 |\n",
+ "| explained_variance | 0.8687136 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 14399 |\n",
+ "| policy_loss | -0.0248 |\n",
+ "| std | 0.511 |\n",
+ "| value_loss | 0.000259 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.79 |\n",
+ "| ep_rew_mean | -0.223 |\n",
+ "| time/ | |\n",
+ "| fps | 302 |\n",
+ "| iterations | 14500 |\n",
+ "| time_elapsed | 959 |\n",
+ "| total_timesteps | 290000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.21 |\n",
+ "| explained_variance | 0.98206425 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 14499 |\n",
+ "| policy_loss | -0.0198 |\n",
+ "| std | 0.508 |\n",
+ "| value_loss | 0.000166 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.76 |\n",
+ "| ep_rew_mean | -0.211 |\n",
+ "| time/ | |\n",
+ "| fps | 302 |\n",
+ "| iterations | 14600 |\n",
+ "| time_elapsed | 965 |\n",
+ "| total_timesteps | 292000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.19 |\n",
+ "| explained_variance | 0.98284197 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 14599 |\n",
+ "| policy_loss | 0.0095 |\n",
+ "| std | 0.505 |\n",
+ "| value_loss | 4.55e-05 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.85 |\n",
+ "| ep_rew_mean | -0.231 |\n",
+ "| time/ | |\n",
+ "| fps | 302 |\n",
+ "| iterations | 14700 |\n",
+ "| time_elapsed | 970 |\n",
+ "| total_timesteps | 294000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.16 |\n",
+ "| explained_variance | 0.7622324 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 14699 |\n",
+ "| policy_loss | -0.0373 |\n",
+ "| std | 0.499 |\n",
+ "| value_loss | 0.000854 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.65 |\n",
+ "| ep_rew_mean | -0.206 |\n",
+ "| time/ | |\n",
+ "| fps | 303 |\n",
+ "| iterations | 14800 |\n",
+ "| time_elapsed | 976 |\n",
+ "| total_timesteps | 296000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.15 |\n",
+ "| explained_variance | 0.94090515 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 14799 |\n",
+ "| policy_loss | -0.0101 |\n",
+ "| std | 0.497 |\n",
+ "| value_loss | 0.000149 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.99 |\n",
+ "| ep_rew_mean | -0.242 |\n",
+ "| time/ | |\n",
+ "| fps | 303 |\n",
+ "| iterations | 14900 |\n",
+ "| time_elapsed | 982 |\n",
+ "| total_timesteps | 298000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.15 |\n",
+ "| explained_variance | 0.94472414 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 14899 |\n",
+ "| policy_loss | 0.0115 |\n",
+ "| std | 0.498 |\n",
+ "| value_loss | 0.000171 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 3.04 |\n",
+ "| ep_rew_mean | -0.237 |\n",
+ "| time/ | |\n",
+ "| fps | 303 |\n",
+ "| iterations | 15000 |\n",
+ "| time_elapsed | 988 |\n",
+ "| total_timesteps | 300000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.16 |\n",
+ "| explained_variance | 0.93526465 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 14999 |\n",
+ "| policy_loss | 0.0374 |\n",
+ "| std | 0.499 |\n",
+ "| value_loss | 0.000519 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.6 |\n",
+ "| ep_rew_mean | -0.21 |\n",
+ "| time/ | |\n",
+ "| fps | 302 |\n",
+ "| iterations | 15100 |\n",
+ "| time_elapsed | 997 |\n",
+ "| total_timesteps | 302000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.16 |\n",
+ "| explained_variance | 0.9759287 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 15099 |\n",
+ "| policy_loss | 0.0122 |\n",
+ "| std | 0.499 |\n",
+ "| value_loss | 7.45e-05 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.62 |\n",
+ "| ep_rew_mean | -0.195 |\n",
+ "| time/ | |\n",
+ "| fps | 303 |\n",
+ "| iterations | 15200 |\n",
+ "| time_elapsed | 1003 |\n",
+ "| total_timesteps | 304000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.14 |\n",
+ "| explained_variance | 0.96417016 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 15199 |\n",
+ "| policy_loss | 0.0111 |\n",
+ "| std | 0.497 |\n",
+ "| value_loss | 7.27e-05 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.84 |\n",
+ "| ep_rew_mean | -0.225 |\n",
+ "| time/ | |\n",
+ "| fps | 303 |\n",
+ "| iterations | 15300 |\n",
+ "| time_elapsed | 1009 |\n",
+ "| total_timesteps | 306000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.15 |\n",
+ "| explained_variance | 0.96453744 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 15299 |\n",
+ "| policy_loss | 0.011 |\n",
+ "| std | 0.499 |\n",
+ "| value_loss | 0.00012 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.81 |\n",
+ "| ep_rew_mean | -0.219 |\n",
+ "| time/ | |\n",
+ "| fps | 303 |\n",
+ "| iterations | 15400 |\n",
+ "| time_elapsed | 1014 |\n",
+ "| total_timesteps | 308000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.13 |\n",
+ "| explained_variance | 0.6340892 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 15399 |\n",
+ "| policy_loss | -0.0265 |\n",
+ "| std | 0.496 |\n",
+ "| value_loss | 0.000917 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.89 |\n",
+ "| ep_rew_mean | -0.23 |\n",
+ "| time/ | |\n",
+ "| fps | 303 |\n",
+ "| iterations | 15500 |\n",
+ "| time_elapsed | 1020 |\n",
+ "| total_timesteps | 310000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.12 |\n",
+ "| explained_variance | 0.9757865 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 15499 |\n",
+ "| policy_loss | 0.00557 |\n",
+ "| std | 0.493 |\n",
+ "| value_loss | 5.41e-05 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.84 |\n",
+ "| ep_rew_mean | -0.234 |\n",
+ "| time/ | |\n",
+ "| fps | 303 |\n",
+ "| iterations | 15600 |\n",
+ "| time_elapsed | 1029 |\n",
+ "| total_timesteps | 312000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.1 |\n",
+ "| explained_variance | 0.9816367 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 15599 |\n",
+ "| policy_loss | -0.00155 |\n",
+ "| std | 0.49 |\n",
+ "| value_loss | 4.53e-05 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.87 |\n",
+ "| ep_rew_mean | -0.228 |\n",
+ "| time/ | |\n",
+ "| fps | 303 |\n",
+ "| iterations | 15700 |\n",
+ "| time_elapsed | 1034 |\n",
+ "| total_timesteps | 314000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.07 |\n",
+ "| explained_variance | 0.59498584 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 15699 |\n",
+ "| policy_loss | 0.00891 |\n",
+ "| std | 0.484 |\n",
+ "| value_loss | 0.0011 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.74 |\n",
+ "| ep_rew_mean | -0.213 |\n",
+ "| time/ | |\n",
+ "| fps | 303 |\n",
+ "| iterations | 15800 |\n",
+ "| time_elapsed | 1039 |\n",
+ "| total_timesteps | 316000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2.04 |\n",
+ "| explained_variance | 0.9804266 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 15799 |\n",
+ "| policy_loss | -0.0119 |\n",
+ "| std | 0.48 |\n",
+ "| value_loss | 6.58e-05 |\n",
+ "-------------------------------------\n",
+ "------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.69 |\n",
+ "| ep_rew_mean | -0.209 |\n",
+ "| time/ | |\n",
+ "| fps | 304 |\n",
+ "| iterations | 15900 |\n",
+ "| time_elapsed | 1045 |\n",
+ "| total_timesteps | 318000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -2 |\n",
+ "| explained_variance | 0.974797 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 15899 |\n",
+ "| policy_loss | -0.0222 |\n",
+ "| std | 0.475 |\n",
+ "| value_loss | 0.000213 |\n",
+ "------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.8 |\n",
+ "| ep_rew_mean | -0.227 |\n",
+ "| time/ | |\n",
+ "| fps | 304 |\n",
+ "| iterations | 16000 |\n",
+ "| time_elapsed | 1050 |\n",
+ "| total_timesteps | 320000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.98 |\n",
+ "| explained_variance | 0.90541655 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 15999 |\n",
+ "| policy_loss | 0.0202 |\n",
+ "| std | 0.47 |\n",
+ "| value_loss | 0.000274 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.86 |\n",
+ "| ep_rew_mean | -0.232 |\n",
+ "| time/ | |\n",
+ "| fps | 304 |\n",
+ "| iterations | 16100 |\n",
+ "| time_elapsed | 1056 |\n",
+ "| total_timesteps | 322000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.98 |\n",
+ "| explained_variance | 0.9099645 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 16099 |\n",
+ "| policy_loss | 0.00779 |\n",
+ "| std | 0.471 |\n",
+ "| value_loss | 0.000138 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.77 |\n",
+ "| ep_rew_mean | -0.209 |\n",
+ "| time/ | |\n",
+ "| fps | 304 |\n",
+ "| iterations | 16200 |\n",
+ "| time_elapsed | 1065 |\n",
+ "| total_timesteps | 324000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.98 |\n",
+ "| explained_variance | 0.90357727 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 16199 |\n",
+ "| policy_loss | -0.0167 |\n",
+ "| std | 0.471 |\n",
+ "| value_loss | 0.000208 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.93 |\n",
+ "| ep_rew_mean | -0.232 |\n",
+ "| time/ | |\n",
+ "| fps | 304 |\n",
+ "| iterations | 16300 |\n",
+ "| time_elapsed | 1071 |\n",
+ "| total_timesteps | 326000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.97 |\n",
+ "| explained_variance | 0.87062764 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 16299 |\n",
+ "| policy_loss | -0.00104 |\n",
+ "| std | 0.469 |\n",
+ "| value_loss | 0.000104 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.78 |\n",
+ "| ep_rew_mean | -0.221 |\n",
+ "| time/ | |\n",
+ "| fps | 304 |\n",
+ "| iterations | 16400 |\n",
+ "| time_elapsed | 1076 |\n",
+ "| total_timesteps | 328000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.95 |\n",
+ "| explained_variance | 0.96559095 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 16399 |\n",
+ "| policy_loss | 0.00231 |\n",
+ "| std | 0.467 |\n",
+ "| value_loss | 5.08e-05 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.81 |\n",
+ "| ep_rew_mean | -0.22 |\n",
+ "| time/ | |\n",
+ "| fps | 304 |\n",
+ "| iterations | 16500 |\n",
+ "| time_elapsed | 1082 |\n",
+ "| total_timesteps | 330000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.97 |\n",
+ "| explained_variance | 0.9584811 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 16499 |\n",
+ "| policy_loss | 0.00624 |\n",
+ "| std | 0.469 |\n",
+ "| value_loss | 0.000136 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.71 |\n",
+ "| ep_rew_mean | -0.208 |\n",
+ "| time/ | |\n",
+ "| fps | 305 |\n",
+ "| iterations | 16600 |\n",
+ "| time_elapsed | 1087 |\n",
+ "| total_timesteps | 332000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.95 |\n",
+ "| explained_variance | 0.9770625 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 16599 |\n",
+ "| policy_loss | 0.00544 |\n",
+ "| std | 0.467 |\n",
+ "| value_loss | 6.65e-05 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.68 |\n",
+ "| ep_rew_mean | -0.21 |\n",
+ "| time/ | |\n",
+ "| fps | 305 |\n",
+ "| iterations | 16700 |\n",
+ "| time_elapsed | 1093 |\n",
+ "| total_timesteps | 334000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.95 |\n",
+ "| explained_variance | 0.63326836 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 16699 |\n",
+ "| policy_loss | -0.0177 |\n",
+ "| std | 0.467 |\n",
+ "| value_loss | 0.00115 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.81 |\n",
+ "| ep_rew_mean | -0.221 |\n",
+ "| time/ | |\n",
+ "| fps | 304 |\n",
+ "| iterations | 16800 |\n",
+ "| time_elapsed | 1102 |\n",
+ "| total_timesteps | 336000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.93 |\n",
+ "| explained_variance | 0.98614395 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 16799 |\n",
+ "| policy_loss | 0.00795 |\n",
+ "| std | 0.463 |\n",
+ "| value_loss | 4.68e-05 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.86 |\n",
+ "| ep_rew_mean | -0.231 |\n",
+ "| time/ | |\n",
+ "| fps | 305 |\n",
+ "| iterations | 16900 |\n",
+ "| time_elapsed | 1108 |\n",
+ "| total_timesteps | 338000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.89 |\n",
+ "| explained_variance | 0.9440542 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 16899 |\n",
+ "| policy_loss | -0.0238 |\n",
+ "| std | 0.458 |\n",
+ "| value_loss | 0.000362 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.8 |\n",
+ "| ep_rew_mean | -0.219 |\n",
+ "| time/ | |\n",
+ "| fps | 305 |\n",
+ "| iterations | 17000 |\n",
+ "| time_elapsed | 1113 |\n",
+ "| total_timesteps | 340000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.88 |\n",
+ "| explained_variance | 0.9288571 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 16999 |\n",
+ "| policy_loss | 0.0118 |\n",
+ "| std | 0.456 |\n",
+ "| value_loss | 0.00026 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.63 |\n",
+ "| ep_rew_mean | -0.208 |\n",
+ "| time/ | |\n",
+ "| fps | 305 |\n",
+ "| iterations | 17100 |\n",
+ "| time_elapsed | 1119 |\n",
+ "| total_timesteps | 342000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.88 |\n",
+ "| explained_variance | 0.9744407 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 17099 |\n",
+ "| policy_loss | -0.0129 |\n",
+ "| std | 0.455 |\n",
+ "| value_loss | 7.61e-05 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.85 |\n",
+ "| ep_rew_mean | -0.234 |\n",
+ "| time/ | |\n",
+ "| fps | 305 |\n",
+ "| iterations | 17200 |\n",
+ "| time_elapsed | 1125 |\n",
+ "| total_timesteps | 344000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.87 |\n",
+ "| explained_variance | 0.9596539 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 17199 |\n",
+ "| policy_loss | -0.0125 |\n",
+ "| std | 0.455 |\n",
+ "| value_loss | 0.000129 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.75 |\n",
+ "| ep_rew_mean | -0.216 |\n",
+ "| time/ | |\n",
+ "| fps | 306 |\n",
+ "| iterations | 17300 |\n",
+ "| time_elapsed | 1130 |\n",
+ "| total_timesteps | 346000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.86 |\n",
+ "| explained_variance | 0.97371745 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 17299 |\n",
+ "| policy_loss | 0.0135 |\n",
+ "| std | 0.453 |\n",
+ "| value_loss | 0.000125 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.71 |\n",
+ "| ep_rew_mean | -0.211 |\n",
+ "| time/ | |\n",
+ "| fps | 305 |\n",
+ "| iterations | 17400 |\n",
+ "| time_elapsed | 1139 |\n",
+ "| total_timesteps | 348000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.84 |\n",
+ "| explained_variance | 0.97407156 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 17399 |\n",
+ "| policy_loss | -0.000319 |\n",
+ "| std | 0.451 |\n",
+ "| value_loss | 7.55e-05 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.84 |\n",
+ "| ep_rew_mean | -0.227 |\n",
+ "| time/ | |\n",
+ "| fps | 305 |\n",
+ "| iterations | 17500 |\n",
+ "| time_elapsed | 1145 |\n",
+ "| total_timesteps | 350000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.83 |\n",
+ "| explained_variance | 0.9800369 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 17499 |\n",
+ "| policy_loss | -0.00272 |\n",
+ "| std | 0.449 |\n",
+ "| value_loss | 7.61e-05 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.57 |\n",
+ "| ep_rew_mean | -0.192 |\n",
+ "| time/ | |\n",
+ "| fps | 305 |\n",
+ "| iterations | 17600 |\n",
+ "| time_elapsed | 1151 |\n",
+ "| total_timesteps | 352000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.83 |\n",
+ "| explained_variance | 0.9593339 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 17599 |\n",
+ "| policy_loss | -0.0111 |\n",
+ "| std | 0.449 |\n",
+ "| value_loss | 9e-05 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.67 |\n",
+ "| ep_rew_mean | -0.214 |\n",
+ "| time/ | |\n",
+ "| fps | 306 |\n",
+ "| iterations | 17700 |\n",
+ "| time_elapsed | 1156 |\n",
+ "| total_timesteps | 354000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.81 |\n",
+ "| explained_variance | 0.9596555 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 17699 |\n",
+ "| policy_loss | -0.00942 |\n",
+ "| std | 0.447 |\n",
+ "| value_loss | 6.95e-05 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.74 |\n",
+ "| ep_rew_mean | -0.217 |\n",
+ "| time/ | |\n",
+ "| fps | 306 |\n",
+ "| iterations | 17800 |\n",
+ "| time_elapsed | 1162 |\n",
+ "| total_timesteps | 356000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.81 |\n",
+ "| explained_variance | 0.98978764 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 17799 |\n",
+ "| policy_loss | -0.00202 |\n",
+ "| std | 0.446 |\n",
+ "| value_loss | 3.37e-05 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.71 |\n",
+ "| ep_rew_mean | -0.209 |\n",
+ "| time/ | |\n",
+ "| fps | 305 |\n",
+ "| iterations | 17900 |\n",
+ "| time_elapsed | 1171 |\n",
+ "| total_timesteps | 358000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.81 |\n",
+ "| explained_variance | 0.7599305 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 17899 |\n",
+ "| policy_loss | -0.0287 |\n",
+ "| std | 0.446 |\n",
+ "| value_loss | 0.000693 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.84 |\n",
+ "| ep_rew_mean | -0.22 |\n",
+ "| time/ | |\n",
+ "| fps | 305 |\n",
+ "| iterations | 18000 |\n",
+ "| time_elapsed | 1177 |\n",
+ "| total_timesteps | 360000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.8 |\n",
+ "| explained_variance | 0.98177564 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 17999 |\n",
+ "| policy_loss | 0.0091 |\n",
+ "| std | 0.446 |\n",
+ "| value_loss | 0.00011 |\n",
+ "--------------------------------------\n",
+ "------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.6 |\n",
+ "| ep_rew_mean | -0.202 |\n",
+ "| time/ | |\n",
+ "| fps | 306 |\n",
+ "| iterations | 18100 |\n",
+ "| time_elapsed | 1182 |\n",
+ "| total_timesteps | 362000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.79 |\n",
+ "| explained_variance | 0.934681 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 18099 |\n",
+ "| policy_loss | -0.0094 |\n",
+ "| std | 0.444 |\n",
+ "| value_loss | 0.000113 |\n",
+ "------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.68 |\n",
+ "| ep_rew_mean | -0.206 |\n",
+ "| time/ | |\n",
+ "| fps | 306 |\n",
+ "| iterations | 18200 |\n",
+ "| time_elapsed | 1188 |\n",
+ "| total_timesteps | 364000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.79 |\n",
+ "| explained_variance | 0.95457464 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 18199 |\n",
+ "| policy_loss | -0.00085 |\n",
+ "| std | 0.444 |\n",
+ "| value_loss | 6.5e-05 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.72 |\n",
+ "| ep_rew_mean | -0.205 |\n",
+ "| time/ | |\n",
+ "| fps | 306 |\n",
+ "| iterations | 18300 |\n",
+ "| time_elapsed | 1194 |\n",
+ "| total_timesteps | 366000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.77 |\n",
+ "| explained_variance | 0.95278776 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 18299 |\n",
+ "| policy_loss | -0.00984 |\n",
+ "| std | 0.44 |\n",
+ "| value_loss | 8.43e-05 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.76 |\n",
+ "| ep_rew_mean | -0.211 |\n",
+ "| time/ | |\n",
+ "| fps | 306 |\n",
+ "| iterations | 18400 |\n",
+ "| time_elapsed | 1199 |\n",
+ "| total_timesteps | 368000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.76 |\n",
+ "| explained_variance | 0.94221777 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 18399 |\n",
+ "| policy_loss | 0.000339 |\n",
+ "| std | 0.439 |\n",
+ "| value_loss | 0.000151 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.78 |\n",
+ "| ep_rew_mean | -0.214 |\n",
+ "| time/ | |\n",
+ "| fps | 306 |\n",
+ "| iterations | 18500 |\n",
+ "| time_elapsed | 1208 |\n",
+ "| total_timesteps | 370000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.75 |\n",
+ "| explained_variance | 0.80928534 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 18499 |\n",
+ "| policy_loss | 0.00918 |\n",
+ "| std | 0.438 |\n",
+ "| value_loss | 0.000224 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.79 |\n",
+ "| ep_rew_mean | -0.226 |\n",
+ "| time/ | |\n",
+ "| fps | 306 |\n",
+ "| iterations | 18600 |\n",
+ "| time_elapsed | 1214 |\n",
+ "| total_timesteps | 372000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.74 |\n",
+ "| explained_variance | 0.9688626 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 18599 |\n",
+ "| policy_loss | -8.22e-06 |\n",
+ "| std | 0.437 |\n",
+ "| value_loss | 0.000102 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.82 |\n",
+ "| ep_rew_mean | -0.226 |\n",
+ "| time/ | |\n",
+ "| fps | 306 |\n",
+ "| iterations | 18700 |\n",
+ "| time_elapsed | 1219 |\n",
+ "| total_timesteps | 374000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.72 |\n",
+ "| explained_variance | 0.9825928 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 18699 |\n",
+ "| policy_loss | -0.00274 |\n",
+ "| std | 0.434 |\n",
+ "| value_loss | 5.74e-05 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.76 |\n",
+ "| ep_rew_mean | -0.212 |\n",
+ "| time/ | |\n",
+ "| fps | 306 |\n",
+ "| iterations | 18800 |\n",
+ "| time_elapsed | 1225 |\n",
+ "| total_timesteps | 376000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.7 |\n",
+ "| explained_variance | 0.9257292 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 18799 |\n",
+ "| policy_loss | 0.0254 |\n",
+ "| std | 0.431 |\n",
+ "| value_loss | 0.000312 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.83 |\n",
+ "| ep_rew_mean | -0.217 |\n",
+ "| time/ | |\n",
+ "| fps | 307 |\n",
+ "| iterations | 18900 |\n",
+ "| time_elapsed | 1230 |\n",
+ "| total_timesteps | 378000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.67 |\n",
+ "| explained_variance | 0.62272656 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 18899 |\n",
+ "| policy_loss | 0.0324 |\n",
+ "| std | 0.428 |\n",
+ "| value_loss | 0.00136 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.82 |\n",
+ "| ep_rew_mean | -0.222 |\n",
+ "| time/ | |\n",
+ "| fps | 307 |\n",
+ "| iterations | 19000 |\n",
+ "| time_elapsed | 1236 |\n",
+ "| total_timesteps | 380000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.67 |\n",
+ "| explained_variance | 0.8762253 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 18999 |\n",
+ "| policy_loss | 0.0126 |\n",
+ "| std | 0.427 |\n",
+ "| value_loss | 0.000196 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.73 |\n",
+ "| ep_rew_mean | -0.209 |\n",
+ "| time/ | |\n",
+ "| fps | 306 |\n",
+ "| iterations | 19100 |\n",
+ "| time_elapsed | 1245 |\n",
+ "| total_timesteps | 382000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.67 |\n",
+ "| explained_variance | 0.9610008 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 19099 |\n",
+ "| policy_loss | -0.00729 |\n",
+ "| std | 0.428 |\n",
+ "| value_loss | 9.97e-05 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.74 |\n",
+ "| ep_rew_mean | -0.213 |\n",
+ "| time/ | |\n",
+ "| fps | 307 |\n",
+ "| iterations | 19200 |\n",
+ "| time_elapsed | 1250 |\n",
+ "| total_timesteps | 384000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.65 |\n",
+ "| explained_variance | 0.9789909 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 19199 |\n",
+ "| policy_loss | 0.00905 |\n",
+ "| std | 0.425 |\n",
+ "| value_loss | 8.39e-05 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.76 |\n",
+ "| ep_rew_mean | -0.213 |\n",
+ "| time/ | |\n",
+ "| fps | 307 |\n",
+ "| iterations | 19300 |\n",
+ "| time_elapsed | 1256 |\n",
+ "| total_timesteps | 386000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.65 |\n",
+ "| explained_variance | 0.9824363 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 19299 |\n",
+ "| policy_loss | 0.0189 |\n",
+ "| std | 0.425 |\n",
+ "| value_loss | 0.000169 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.76 |\n",
+ "| ep_rew_mean | -0.212 |\n",
+ "| time/ | |\n",
+ "| fps | 307 |\n",
+ "| iterations | 19400 |\n",
+ "| time_elapsed | 1262 |\n",
+ "| total_timesteps | 388000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.64 |\n",
+ "| explained_variance | 0.86109364 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 19399 |\n",
+ "| policy_loss | -0.0201 |\n",
+ "| std | 0.424 |\n",
+ "| value_loss | 0.000426 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.9 |\n",
+ "| ep_rew_mean | -0.23 |\n",
+ "| time/ | |\n",
+ "| fps | 307 |\n",
+ "| iterations | 19500 |\n",
+ "| time_elapsed | 1267 |\n",
+ "| total_timesteps | 390000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.63 |\n",
+ "| explained_variance | 0.70620143 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 19499 |\n",
+ "| policy_loss | 0.0118 |\n",
+ "| std | 0.423 |\n",
+ "| value_loss | 0.000389 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.85 |\n",
+ "| ep_rew_mean | -0.223 |\n",
+ "| time/ | |\n",
+ "| fps | 307 |\n",
+ "| iterations | 19600 |\n",
+ "| time_elapsed | 1273 |\n",
+ "| total_timesteps | 392000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.61 |\n",
+ "| explained_variance | 0.39207745 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 19599 |\n",
+ "| policy_loss | 0.0517 |\n",
+ "| std | 0.42 |\n",
+ "| value_loss | 0.00316 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.85 |\n",
+ "| ep_rew_mean | -0.233 |\n",
+ "| time/ | |\n",
+ "| fps | 307 |\n",
+ "| iterations | 19700 |\n",
+ "| time_elapsed | 1282 |\n",
+ "| total_timesteps | 394000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.6 |\n",
+ "| explained_variance | 0.96949595 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 19699 |\n",
+ "| policy_loss | 0.0161 |\n",
+ "| std | 0.419 |\n",
+ "| value_loss | 0.000128 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.79 |\n",
+ "| ep_rew_mean | -0.213 |\n",
+ "| time/ | |\n",
+ "| fps | 307 |\n",
+ "| iterations | 19800 |\n",
+ "| time_elapsed | 1288 |\n",
+ "| total_timesteps | 396000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.59 |\n",
+ "| explained_variance | 0.9210366 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 19799 |\n",
+ "| policy_loss | -0.0121 |\n",
+ "| std | 0.419 |\n",
+ "| value_loss | 0.000303 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.83 |\n",
+ "| ep_rew_mean | -0.227 |\n",
+ "| time/ | |\n",
+ "| fps | 307 |\n",
+ "| iterations | 19900 |\n",
+ "| time_elapsed | 1294 |\n",
+ "| total_timesteps | 398000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.59 |\n",
+ "| explained_variance | 0.91853833 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 19899 |\n",
+ "| policy_loss | -0.0153 |\n",
+ "| std | 0.42 |\n",
+ "| value_loss | 0.000214 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.64 |\n",
+ "| ep_rew_mean | -0.204 |\n",
+ "| time/ | |\n",
+ "| fps | 307 |\n",
+ "| iterations | 20000 |\n",
+ "| time_elapsed | 1300 |\n",
+ "| total_timesteps | 400000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.58 |\n",
+ "| explained_variance | 0.8020137 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 19999 |\n",
+ "| policy_loss | -0.0282 |\n",
+ "| std | 0.418 |\n",
+ "| value_loss | 0.000849 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.79 |\n",
+ "| ep_rew_mean | -0.219 |\n",
+ "| time/ | |\n",
+ "| fps | 307 |\n",
+ "| iterations | 20100 |\n",
+ "| time_elapsed | 1306 |\n",
+ "| total_timesteps | 402000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.56 |\n",
+ "| explained_variance | 0.95373213 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 20099 |\n",
+ "| policy_loss | 0.00944 |\n",
+ "| std | 0.416 |\n",
+ "| value_loss | 0.000117 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.76 |\n",
+ "| ep_rew_mean | -0.219 |\n",
+ "| time/ | |\n",
+ "| fps | 307 |\n",
+ "| iterations | 20200 |\n",
+ "| time_elapsed | 1315 |\n",
+ "| total_timesteps | 404000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.55 |\n",
+ "| explained_variance | 0.9797992 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 20199 |\n",
+ "| policy_loss | -0.0116 |\n",
+ "| std | 0.414 |\n",
+ "| value_loss | 0.000106 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.83 |\n",
+ "| ep_rew_mean | -0.216 |\n",
+ "| time/ | |\n",
+ "| fps | 307 |\n",
+ "| iterations | 20300 |\n",
+ "| time_elapsed | 1320 |\n",
+ "| total_timesteps | 406000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.54 |\n",
+ "| explained_variance | 0.9802046 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 20299 |\n",
+ "| policy_loss | -0.04 |\n",
+ "| std | 0.412 |\n",
+ "| value_loss | 0.000222 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.51 |\n",
+ "| ep_rew_mean | -0.185 |\n",
+ "| time/ | |\n",
+ "| fps | 307 |\n",
+ "| iterations | 20400 |\n",
+ "| time_elapsed | 1326 |\n",
+ "| total_timesteps | 408000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.53 |\n",
+ "| explained_variance | 0.97018176 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 20399 |\n",
+ "| policy_loss | -0.000202 |\n",
+ "| std | 0.412 |\n",
+ "| value_loss | 4.38e-05 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.88 |\n",
+ "| ep_rew_mean | -0.235 |\n",
+ "| time/ | |\n",
+ "| fps | 307 |\n",
+ "| iterations | 20500 |\n",
+ "| time_elapsed | 1332 |\n",
+ "| total_timesteps | 410000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.5 |\n",
+ "| explained_variance | 0.957062 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 20499 |\n",
+ "| policy_loss | -0.000156 |\n",
+ "| std | 0.407 |\n",
+ "| value_loss | 8.57e-05 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.78 |\n",
+ "| ep_rew_mean | -0.22 |\n",
+ "| time/ | |\n",
+ "| fps | 307 |\n",
+ "| iterations | 20600 |\n",
+ "| time_elapsed | 1338 |\n",
+ "| total_timesteps | 412000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.49 |\n",
+ "| explained_variance | 0.94187564 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 20599 |\n",
+ "| policy_loss | 0.00251 |\n",
+ "| std | 0.405 |\n",
+ "| value_loss | 8.7e-05 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.9 |\n",
+ "| ep_rew_mean | -0.228 |\n",
+ "| time/ | |\n",
+ "| fps | 307 |\n",
+ "| iterations | 20700 |\n",
+ "| time_elapsed | 1345 |\n",
+ "| total_timesteps | 414000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.49 |\n",
+ "| explained_variance | 0.88869613 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 20699 |\n",
+ "| policy_loss | 0.00114 |\n",
+ "| std | 0.407 |\n",
+ "| value_loss | 0.000132 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.86 |\n",
+ "| ep_rew_mean | -0.233 |\n",
+ "| time/ | |\n",
+ "| fps | 307 |\n",
+ "| iterations | 20800 |\n",
+ "| time_elapsed | 1354 |\n",
+ "| total_timesteps | 416000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.48 |\n",
+ "| explained_variance | 0.92743963 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 20799 |\n",
+ "| policy_loss | 0.00954 |\n",
+ "| std | 0.406 |\n",
+ "| value_loss | 0.000317 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.93 |\n",
+ "| ep_rew_mean | -0.227 |\n",
+ "| time/ | |\n",
+ "| fps | 307 |\n",
+ "| iterations | 20900 |\n",
+ "| time_elapsed | 1360 |\n",
+ "| total_timesteps | 418000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.46 |\n",
+ "| explained_variance | 0.90706563 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 20899 |\n",
+ "| policy_loss | 0.0194 |\n",
+ "| std | 0.403 |\n",
+ "| value_loss | 0.000414 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 3.02 |\n",
+ "| ep_rew_mean | -0.237 |\n",
+ "| time/ | |\n",
+ "| fps | 307 |\n",
+ "| iterations | 21000 |\n",
+ "| time_elapsed | 1364 |\n",
+ "| total_timesteps | 420000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.44 |\n",
+ "| explained_variance | -3.516026 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 20999 |\n",
+ "| policy_loss | 0.0564 |\n",
+ "| std | 0.401 |\n",
+ "| value_loss | 0.0189 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.61 |\n",
+ "| ep_rew_mean | -0.204 |\n",
+ "| time/ | |\n",
+ "| fps | 308 |\n",
+ "| iterations | 21100 |\n",
+ "| time_elapsed | 1370 |\n",
+ "| total_timesteps | 422000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.42 |\n",
+ "| explained_variance | 0.9308317 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 21099 |\n",
+ "| policy_loss | -0.00902 |\n",
+ "| std | 0.398 |\n",
+ "| value_loss | 0.000172 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.72 |\n",
+ "| ep_rew_mean | -0.217 |\n",
+ "| time/ | |\n",
+ "| fps | 308 |\n",
+ "| iterations | 21200 |\n",
+ "| time_elapsed | 1376 |\n",
+ "| total_timesteps | 424000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.4 |\n",
+ "| explained_variance | 0.9731085 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 21199 |\n",
+ "| policy_loss | 0.00145 |\n",
+ "| std | 0.396 |\n",
+ "| value_loss | 6.53e-05 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.75 |\n",
+ "| ep_rew_mean | -0.209 |\n",
+ "| time/ | |\n",
+ "| fps | 308 |\n",
+ "| iterations | 21300 |\n",
+ "| time_elapsed | 1381 |\n",
+ "| total_timesteps | 426000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.4 |\n",
+ "| explained_variance | 0.32687587 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 21299 |\n",
+ "| policy_loss | -0.0387 |\n",
+ "| std | 0.396 |\n",
+ "| value_loss | 0.0013 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.73 |\n",
+ "| ep_rew_mean | -0.215 |\n",
+ "| time/ | |\n",
+ "| fps | 307 |\n",
+ "| iterations | 21400 |\n",
+ "| time_elapsed | 1391 |\n",
+ "| total_timesteps | 428000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.39 |\n",
+ "| explained_variance | 0.9532345 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 21399 |\n",
+ "| policy_loss | -0.00503 |\n",
+ "| std | 0.395 |\n",
+ "| value_loss | 9.31e-05 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.81 |\n",
+ "| ep_rew_mean | -0.209 |\n",
+ "| time/ | |\n",
+ "| fps | 307 |\n",
+ "| iterations | 21500 |\n",
+ "| time_elapsed | 1396 |\n",
+ "| total_timesteps | 430000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.4 |\n",
+ "| explained_variance | 0.87219936 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 21499 |\n",
+ "| policy_loss | -0.00404 |\n",
+ "| std | 0.396 |\n",
+ "| value_loss | 0.000149 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.71 |\n",
+ "| ep_rew_mean | -0.212 |\n",
+ "| time/ | |\n",
+ "| fps | 320 |\n",
+ "| iterations | 25000 |\n",
+ "| time_elapsed | 1557 |\n",
+ "| total_timesteps | 500000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -1.06 |\n",
+ "| explained_variance | 0.97420067 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 24999 |\n",
+ "| policy_loss | 0.00337 |\n",
+ "| std | 0.358 |\n",
+ "| value_loss | 7.87e-05 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.7 |\n",
+ "| ep_rew_mean | -0.21 |\n",
+ "| time/ | |\n",
+ "| fps | 323 |\n",
+ "| iterations | 27800 |\n",
+ "| time_elapsed | 1720 |\n",
+ "| total_timesteps | 556000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.797 |\n",
+ "| explained_variance | 0.97323596 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 27799 |\n",
+ "| policy_loss | 0.00182 |\n",
+ "| std | 0.328 |\n",
+ "| value_loss | 0.000173 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.81 |\n",
+ "| ep_rew_mean | -0.213 |\n",
+ "| time/ | |\n",
+ "| fps | 323 |\n",
+ "| iterations | 27900 |\n",
+ "| time_elapsed | 1725 |\n",
+ "| total_timesteps | 558000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.774 |\n",
+ "| explained_variance | 0.22033525 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 27899 |\n",
+ "| policy_loss | 0.00906 |\n",
+ "| std | 0.325 |\n",
+ "| value_loss | 0.00211 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.7 |\n",
+ "| ep_rew_mean | -0.214 |\n",
+ "| time/ | |\n",
+ "| fps | 323 |\n",
+ "| iterations | 28000 |\n",
+ "| time_elapsed | 1730 |\n",
+ "| total_timesteps | 560000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.764 |\n",
+ "| explained_variance | 0.9667121 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 27999 |\n",
+ "| policy_loss | -0.00939 |\n",
+ "| std | 0.324 |\n",
+ "| value_loss | 0.000226 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.71 |\n",
+ "| ep_rew_mean | -0.206 |\n",
+ "| time/ | |\n",
+ "| fps | 323 |\n",
+ "| iterations | 28100 |\n",
+ "| time_elapsed | 1735 |\n",
+ "| total_timesteps | 562000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.76 |\n",
+ "| explained_variance | 0.9180963 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 28099 |\n",
+ "| policy_loss | 0.000922 |\n",
+ "| std | 0.324 |\n",
+ "| value_loss | 0.00013 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.73 |\n",
+ "| ep_rew_mean | -0.21 |\n",
+ "| time/ | |\n",
+ "| fps | 324 |\n",
+ "| iterations | 28200 |\n",
+ "| time_elapsed | 1740 |\n",
+ "| total_timesteps | 564000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.758 |\n",
+ "| explained_variance | 0.58245325 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 28199 |\n",
+ "| policy_loss | -0.0188 |\n",
+ "| std | 0.324 |\n",
+ "| value_loss | 0.00212 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.72 |\n",
+ "| ep_rew_mean | -0.212 |\n",
+ "| time/ | |\n",
+ "| fps | 323 |\n",
+ "| iterations | 28300 |\n",
+ "| time_elapsed | 1748 |\n",
+ "| total_timesteps | 566000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.747 |\n",
+ "| explained_variance | 0.98212785 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 28299 |\n",
+ "| policy_loss | -0.00883 |\n",
+ "| std | 0.322 |\n",
+ "| value_loss | 0.000125 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.68 |\n",
+ "| ep_rew_mean | -0.208 |\n",
+ "| time/ | |\n",
+ "| fps | 323 |\n",
+ "| iterations | 28400 |\n",
+ "| time_elapsed | 1753 |\n",
+ "| total_timesteps | 568000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.77 |\n",
+ "| explained_variance | 0.9311479 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 28399 |\n",
+ "| policy_loss | -0.00106 |\n",
+ "| std | 0.325 |\n",
+ "| value_loss | 8.91e-05 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.63 |\n",
+ "| ep_rew_mean | -0.215 |\n",
+ "| time/ | |\n",
+ "| fps | 324 |\n",
+ "| iterations | 28500 |\n",
+ "| time_elapsed | 1757 |\n",
+ "| total_timesteps | 570000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.724 |\n",
+ "| explained_variance | 0.99005914 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 28499 |\n",
+ "| policy_loss | 0.000466 |\n",
+ "| std | 0.32 |\n",
+ "| value_loss | 2.75e-05 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.65 |\n",
+ "| ep_rew_mean | -0.194 |\n",
+ "| time/ | |\n",
+ "| fps | 324 |\n",
+ "| iterations | 28600 |\n",
+ "| time_elapsed | 1762 |\n",
+ "| total_timesteps | 572000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.688 |\n",
+ "| explained_variance | 0.97127336 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 28599 |\n",
+ "| policy_loss | -0.00282 |\n",
+ "| std | 0.316 |\n",
+ "| value_loss | 0.000111 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.88 |\n",
+ "| ep_rew_mean | -0.229 |\n",
+ "| time/ | |\n",
+ "| fps | 324 |\n",
+ "| iterations | 28700 |\n",
+ "| time_elapsed | 1767 |\n",
+ "| total_timesteps | 574000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.673 |\n",
+ "| explained_variance | 0.97094136 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 28699 |\n",
+ "| policy_loss | -0.00301 |\n",
+ "| std | 0.315 |\n",
+ "| value_loss | 0.00014 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.76 |\n",
+ "| ep_rew_mean | -0.214 |\n",
+ "| time/ | |\n",
+ "| fps | 325 |\n",
+ "| iterations | 28800 |\n",
+ "| time_elapsed | 1772 |\n",
+ "| total_timesteps | 576000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.658 |\n",
+ "| explained_variance | 0.9512483 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 28799 |\n",
+ "| policy_loss | -0.00559 |\n",
+ "| std | 0.313 |\n",
+ "| value_loss | 0.000173 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.7 |\n",
+ "| ep_rew_mean | -0.212 |\n",
+ "| time/ | |\n",
+ "| fps | 324 |\n",
+ "| iterations | 28900 |\n",
+ "| time_elapsed | 1780 |\n",
+ "| total_timesteps | 578000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.65 |\n",
+ "| explained_variance | 0.97945994 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 28899 |\n",
+ "| policy_loss | -0.00472 |\n",
+ "| std | 0.312 |\n",
+ "| value_loss | 0.000126 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.81 |\n",
+ "| ep_rew_mean | -0.225 |\n",
+ "| time/ | |\n",
+ "| fps | 324 |\n",
+ "| iterations | 29000 |\n",
+ "| time_elapsed | 1785 |\n",
+ "| total_timesteps | 580000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.631 |\n",
+ "| explained_variance | 0.9418761 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 28999 |\n",
+ "| policy_loss | -2.69e-05 |\n",
+ "| std | 0.311 |\n",
+ "| value_loss | 0.000187 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.77 |\n",
+ "| ep_rew_mean | -0.214 |\n",
+ "| time/ | |\n",
+ "| fps | 325 |\n",
+ "| iterations | 29100 |\n",
+ "| time_elapsed | 1790 |\n",
+ "| total_timesteps | 582000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.62 |\n",
+ "| explained_variance | 0.9561486 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 29099 |\n",
+ "| policy_loss | 0.0037 |\n",
+ "| std | 0.309 |\n",
+ "| value_loss | 0.0002 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.71 |\n",
+ "| ep_rew_mean | -0.211 |\n",
+ "| time/ | |\n",
+ "| fps | 325 |\n",
+ "| iterations | 29200 |\n",
+ "| time_elapsed | 1795 |\n",
+ "| total_timesteps | 584000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.618 |\n",
+ "| explained_variance | 0.9853928 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 29199 |\n",
+ "| policy_loss | -0.00218 |\n",
+ "| std | 0.309 |\n",
+ "| value_loss | 5.64e-05 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.7 |\n",
+ "| ep_rew_mean | -0.199 |\n",
+ "| time/ | |\n",
+ "| fps | 325 |\n",
+ "| iterations | 29300 |\n",
+ "| time_elapsed | 1800 |\n",
+ "| total_timesteps | 586000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.611 |\n",
+ "| explained_variance | 0.97477955 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 29299 |\n",
+ "| policy_loss | 0.00242 |\n",
+ "| std | 0.309 |\n",
+ "| value_loss | 0.000135 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.73 |\n",
+ "| ep_rew_mean | -0.208 |\n",
+ "| time/ | |\n",
+ "| fps | 325 |\n",
+ "| iterations | 29400 |\n",
+ "| time_elapsed | 1804 |\n",
+ "| total_timesteps | 588000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.613 |\n",
+ "| explained_variance | 0.6755737 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 29399 |\n",
+ "| policy_loss | -0.0265 |\n",
+ "| std | 0.309 |\n",
+ "| value_loss | 0.00202 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.65 |\n",
+ "| ep_rew_mean | -0.203 |\n",
+ "| time/ | |\n",
+ "| fps | 326 |\n",
+ "| iterations | 29500 |\n",
+ "| time_elapsed | 1809 |\n",
+ "| total_timesteps | 590000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.596 |\n",
+ "| explained_variance | 0.9880968 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 29499 |\n",
+ "| policy_loss | 0.00439 |\n",
+ "| std | 0.307 |\n",
+ "| value_loss | 5.99e-05 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.67 |\n",
+ "| ep_rew_mean | -0.203 |\n",
+ "| time/ | |\n",
+ "| fps | 325 |\n",
+ "| iterations | 29600 |\n",
+ "| time_elapsed | 1818 |\n",
+ "| total_timesteps | 592000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.589 |\n",
+ "| explained_variance | 0.93901527 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 29599 |\n",
+ "| policy_loss | 0.00187 |\n",
+ "| std | 0.306 |\n",
+ "| value_loss | 0.000172 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 3.4 |\n",
+ "| ep_rew_mean | -0.275 |\n",
+ "| time/ | |\n",
+ "| fps | 325 |\n",
+ "| iterations | 29700 |\n",
+ "| time_elapsed | 1823 |\n",
+ "| total_timesteps | 594000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.584 |\n",
+ "| explained_variance | -1.5040169 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 29699 |\n",
+ "| policy_loss | -0.0121 |\n",
+ "| std | 0.306 |\n",
+ "| value_loss | 0.00608 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 3.08 |\n",
+ "| ep_rew_mean | -0.234 |\n",
+ "| time/ | |\n",
+ "| fps | 325 |\n",
+ "| iterations | 29800 |\n",
+ "| time_elapsed | 1828 |\n",
+ "| total_timesteps | 596000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.602 |\n",
+ "| explained_variance | 0.35909313 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 29799 |\n",
+ "| policy_loss | -0.00559 |\n",
+ "| std | 0.308 |\n",
+ "| value_loss | 0.00806 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.95 |\n",
+ "| ep_rew_mean | -0.23 |\n",
+ "| time/ | |\n",
+ "| fps | 326 |\n",
+ "| iterations | 29900 |\n",
+ "| time_elapsed | 1833 |\n",
+ "| total_timesteps | 598000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.617 |\n",
+ "| explained_variance | -10.621065 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 29899 |\n",
+ "| policy_loss | -0.0245 |\n",
+ "| std | 0.309 |\n",
+ "| value_loss | 0.0672 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.85 |\n",
+ "| ep_rew_mean | -0.222 |\n",
+ "| time/ | |\n",
+ "| fps | 326 |\n",
+ "| iterations | 30000 |\n",
+ "| time_elapsed | 1838 |\n",
+ "| total_timesteps | 600000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.585 |\n",
+ "| explained_variance | 0.41773236 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 29999 |\n",
+ "| policy_loss | -0.0285 |\n",
+ "| std | 0.305 |\n",
+ "| value_loss | 0.00465 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.77 |\n",
+ "| ep_rew_mean | -0.216 |\n",
+ "| time/ | |\n",
+ "| fps | 326 |\n",
+ "| iterations | 30100 |\n",
+ "| time_elapsed | 1843 |\n",
+ "| total_timesteps | 602000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.596 |\n",
+ "| explained_variance | 0.9414502 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 30099 |\n",
+ "| policy_loss | 0.00102 |\n",
+ "| std | 0.307 |\n",
+ "| value_loss | 0.000178 |\n",
+ "-------------------------------------\n",
+ "------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.88 |\n",
+ "| ep_rew_mean | -0.224 |\n",
+ "| time/ | |\n",
+ "| fps | 326 |\n",
+ "| iterations | 30200 |\n",
+ "| time_elapsed | 1852 |\n",
+ "| total_timesteps | 604000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.586 |\n",
+ "| explained_variance | 0.598702 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 30199 |\n",
+ "| policy_loss | -0.0509 |\n",
+ "| std | 0.306 |\n",
+ "| value_loss | 0.00411 |\n",
+ "------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.94 |\n",
+ "| ep_rew_mean | -0.233 |\n",
+ "| time/ | |\n",
+ "| fps | 326 |\n",
+ "| iterations | 30300 |\n",
+ "| time_elapsed | 1857 |\n",
+ "| total_timesteps | 606000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.568 |\n",
+ "| explained_variance | 0.9546901 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 30299 |\n",
+ "| policy_loss | 0.00376 |\n",
+ "| std | 0.304 |\n",
+ "| value_loss | 0.000196 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.68 |\n",
+ "| ep_rew_mean | -0.211 |\n",
+ "| time/ | |\n",
+ "| fps | 326 |\n",
+ "| iterations | 30400 |\n",
+ "| time_elapsed | 1861 |\n",
+ "| total_timesteps | 608000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.556 |\n",
+ "| explained_variance | 0.9634257 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 30399 |\n",
+ "| policy_loss | -0.00117 |\n",
+ "| std | 0.304 |\n",
+ "| value_loss | 0.000128 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.59 |\n",
+ "| ep_rew_mean | -0.192 |\n",
+ "| time/ | |\n",
+ "| fps | 326 |\n",
+ "| iterations | 30500 |\n",
+ "| time_elapsed | 1866 |\n",
+ "| total_timesteps | 610000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.559 |\n",
+ "| explained_variance | 0.9729128 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 30499 |\n",
+ "| policy_loss | -0.0041 |\n",
+ "| std | 0.304 |\n",
+ "| value_loss | 8.32e-05 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.77 |\n",
+ "| ep_rew_mean | -0.216 |\n",
+ "| time/ | |\n",
+ "| fps | 326 |\n",
+ "| iterations | 30600 |\n",
+ "| time_elapsed | 1871 |\n",
+ "| total_timesteps | 612000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.542 |\n",
+ "| explained_variance | 0.98225373 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 30599 |\n",
+ "| policy_loss | -0.00189 |\n",
+ "| std | 0.303 |\n",
+ "| value_loss | 9.61e-05 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.8 |\n",
+ "| ep_rew_mean | -0.218 |\n",
+ "| time/ | |\n",
+ "| fps | 327 |\n",
+ "| iterations | 30700 |\n",
+ "| time_elapsed | 1876 |\n",
+ "| total_timesteps | 614000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.539 |\n",
+ "| explained_variance | 0.92065257 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 30699 |\n",
+ "| policy_loss | -0.00712 |\n",
+ "| std | 0.302 |\n",
+ "| value_loss | 0.000221 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.68 |\n",
+ "| ep_rew_mean | -0.204 |\n",
+ "| time/ | |\n",
+ "| fps | 327 |\n",
+ "| iterations | 30800 |\n",
+ "| time_elapsed | 1881 |\n",
+ "| total_timesteps | 616000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.539 |\n",
+ "| explained_variance | 0.9718508 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 30799 |\n",
+ "| policy_loss | -0.00467 |\n",
+ "| std | 0.303 |\n",
+ "| value_loss | 9.48e-05 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.67 |\n",
+ "| ep_rew_mean | -0.206 |\n",
+ "| time/ | |\n",
+ "| fps | 327 |\n",
+ "| iterations | 30900 |\n",
+ "| time_elapsed | 1889 |\n",
+ "| total_timesteps | 618000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.526 |\n",
+ "| explained_variance | 0.97893435 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 30899 |\n",
+ "| policy_loss | 0.00621 |\n",
+ "| std | 0.302 |\n",
+ "| value_loss | 0.000222 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.82 |\n",
+ "| ep_rew_mean | -0.226 |\n",
+ "| time/ | |\n",
+ "| fps | 327 |\n",
+ "| iterations | 31000 |\n",
+ "| time_elapsed | 1893 |\n",
+ "| total_timesteps | 620000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.512 |\n",
+ "| explained_variance | 0.95006245 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 30999 |\n",
+ "| policy_loss | 0.00233 |\n",
+ "| std | 0.301 |\n",
+ "| value_loss | 0.000139 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.84 |\n",
+ "| ep_rew_mean | -0.224 |\n",
+ "| time/ | |\n",
+ "| fps | 327 |\n",
+ "| iterations | 31100 |\n",
+ "| time_elapsed | 1898 |\n",
+ "| total_timesteps | 622000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.512 |\n",
+ "| explained_variance | 0.93657434 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 31099 |\n",
+ "| policy_loss | -0.000242 |\n",
+ "| std | 0.301 |\n",
+ "| value_loss | 0.000198 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.65 |\n",
+ "| ep_rew_mean | -0.206 |\n",
+ "| time/ | |\n",
+ "| fps | 327 |\n",
+ "| iterations | 31200 |\n",
+ "| time_elapsed | 1905 |\n",
+ "| total_timesteps | 624000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.466 |\n",
+ "| explained_variance | 0.9877893 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 31199 |\n",
+ "| policy_loss | 0.00217 |\n",
+ "| std | 0.297 |\n",
+ "| value_loss | 4.42e-05 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.73 |\n",
+ "| ep_rew_mean | -0.216 |\n",
+ "| time/ | |\n",
+ "| fps | 327 |\n",
+ "| iterations | 31300 |\n",
+ "| time_elapsed | 1911 |\n",
+ "| total_timesteps | 626000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.437 |\n",
+ "| explained_variance | 0.98232895 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 31299 |\n",
+ "| policy_loss | -0.00259 |\n",
+ "| std | 0.294 |\n",
+ "| value_loss | 6.34e-05 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.75 |\n",
+ "| ep_rew_mean | -0.214 |\n",
+ "| time/ | |\n",
+ "| fps | 327 |\n",
+ "| iterations | 31400 |\n",
+ "| time_elapsed | 1916 |\n",
+ "| total_timesteps | 628000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.447 |\n",
+ "| explained_variance | 0.97575915 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 31399 |\n",
+ "| policy_loss | 0.000886 |\n",
+ "| std | 0.295 |\n",
+ "| value_loss | 8.11e-05 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.71 |\n",
+ "| ep_rew_mean | -0.208 |\n",
+ "| time/ | |\n",
+ "| fps | 327 |\n",
+ "| iterations | 31500 |\n",
+ "| time_elapsed | 1925 |\n",
+ "| total_timesteps | 630000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.438 |\n",
+ "| explained_variance | 0.97809935 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 31499 |\n",
+ "| policy_loss | 0.00624 |\n",
+ "| std | 0.294 |\n",
+ "| value_loss | 7.33e-05 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.73 |\n",
+ "| ep_rew_mean | -0.209 |\n",
+ "| time/ | |\n",
+ "| fps | 327 |\n",
+ "| iterations | 31600 |\n",
+ "| time_elapsed | 1930 |\n",
+ "| total_timesteps | 632000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.411 |\n",
+ "| explained_variance | 0.95562655 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 31599 |\n",
+ "| policy_loss | -0.00718 |\n",
+ "| std | 0.291 |\n",
+ "| value_loss | 0.00027 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.75 |\n",
+ "| ep_rew_mean | -0.218 |\n",
+ "| time/ | |\n",
+ "| fps | 327 |\n",
+ "| iterations | 31700 |\n",
+ "| time_elapsed | 1934 |\n",
+ "| total_timesteps | 634000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.402 |\n",
+ "| explained_variance | 0.98256177 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 31699 |\n",
+ "| policy_loss | -0.000849 |\n",
+ "| std | 0.29 |\n",
+ "| value_loss | 5.61e-05 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.71 |\n",
+ "| ep_rew_mean | -0.213 |\n",
+ "| time/ | |\n",
+ "| fps | 328 |\n",
+ "| iterations | 31800 |\n",
+ "| time_elapsed | 1938 |\n",
+ "| total_timesteps | 636000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.398 |\n",
+ "| explained_variance | 0.49550873 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 31799 |\n",
+ "| policy_loss | -0.0132 |\n",
+ "| std | 0.289 |\n",
+ "| value_loss | 0.00197 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.71 |\n",
+ "| ep_rew_mean | -0.207 |\n",
+ "| time/ | |\n",
+ "| fps | 328 |\n",
+ "| iterations | 31900 |\n",
+ "| time_elapsed | 1942 |\n",
+ "| total_timesteps | 638000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.37 |\n",
+ "| explained_variance | 0.9492866 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 31899 |\n",
+ "| policy_loss | 0.00263 |\n",
+ "| std | 0.287 |\n",
+ "| value_loss | 0.000154 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.85 |\n",
+ "| ep_rew_mean | -0.228 |\n",
+ "| time/ | |\n",
+ "| fps | 328 |\n",
+ "| iterations | 32000 |\n",
+ "| time_elapsed | 1946 |\n",
+ "| total_timesteps | 640000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.369 |\n",
+ "| explained_variance | 0.94642025 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 31999 |\n",
+ "| policy_loss | 0.00381 |\n",
+ "| std | 0.287 |\n",
+ "| value_loss | 0.000234 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.82 |\n",
+ "| ep_rew_mean | -0.225 |\n",
+ "| time/ | |\n",
+ "| fps | 328 |\n",
+ "| iterations | 32100 |\n",
+ "| time_elapsed | 1951 |\n",
+ "| total_timesteps | 642000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.361 |\n",
+ "| explained_variance | 0.98188424 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 32099 |\n",
+ "| policy_loss | 0.00164 |\n",
+ "| std | 0.286 |\n",
+ "| value_loss | 6.05e-05 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.85 |\n",
+ "| ep_rew_mean | -0.224 |\n",
+ "| time/ | |\n",
+ "| fps | 329 |\n",
+ "| iterations | 32200 |\n",
+ "| time_elapsed | 1955 |\n",
+ "| total_timesteps | 644000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.343 |\n",
+ "| explained_variance | 0.9386951 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 32199 |\n",
+ "| policy_loss | -0.00134 |\n",
+ "| std | 0.285 |\n",
+ "| value_loss | 0.00101 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.7 |\n",
+ "| ep_rew_mean | -0.207 |\n",
+ "| time/ | |\n",
+ "| fps | 328 |\n",
+ "| iterations | 32300 |\n",
+ "| time_elapsed | 1963 |\n",
+ "| total_timesteps | 646000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.338 |\n",
+ "| explained_variance | 0.9384046 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 32299 |\n",
+ "| policy_loss | 0.00246 |\n",
+ "| std | 0.285 |\n",
+ "| value_loss | 0.000127 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.5 |\n",
+ "| ep_rew_mean | -0.19 |\n",
+ "| time/ | |\n",
+ "| fps | 329 |\n",
+ "| iterations | 32400 |\n",
+ "| time_elapsed | 1968 |\n",
+ "| total_timesteps | 648000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.325 |\n",
+ "| explained_variance | 0.9613883 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 32399 |\n",
+ "| policy_loss | 0.00174 |\n",
+ "| std | 0.284 |\n",
+ "| value_loss | 0.00013 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.78 |\n",
+ "| ep_rew_mean | -0.219 |\n",
+ "| time/ | |\n",
+ "| fps | 329 |\n",
+ "| iterations | 32500 |\n",
+ "| time_elapsed | 1972 |\n",
+ "| total_timesteps | 650000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.317 |\n",
+ "| explained_variance | 0.9388133 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 32499 |\n",
+ "| policy_loss | -0.00116 |\n",
+ "| std | 0.283 |\n",
+ "| value_loss | 0.000238 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.82 |\n",
+ "| ep_rew_mean | -0.218 |\n",
+ "| time/ | |\n",
+ "| fps | 329 |\n",
+ "| iterations | 32600 |\n",
+ "| time_elapsed | 1976 |\n",
+ "| total_timesteps | 652000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.301 |\n",
+ "| explained_variance | 0.97595453 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 32599 |\n",
+ "| policy_loss | 0.0013 |\n",
+ "| std | 0.281 |\n",
+ "| value_loss | 7.5e-05 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.79 |\n",
+ "| ep_rew_mean | -0.218 |\n",
+ "| time/ | |\n",
+ "| fps | 329 |\n",
+ "| iterations | 32700 |\n",
+ "| time_elapsed | 1981 |\n",
+ "| total_timesteps | 654000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.303 |\n",
+ "| explained_variance | 0.9642559 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 32699 |\n",
+ "| policy_loss | -0.00392 |\n",
+ "| std | 0.281 |\n",
+ "| value_loss | 0.000116 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.65 |\n",
+ "| ep_rew_mean | -0.205 |\n",
+ "| time/ | |\n",
+ "| fps | 330 |\n",
+ "| iterations | 32800 |\n",
+ "| time_elapsed | 1986 |\n",
+ "| total_timesteps | 656000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.308 |\n",
+ "| explained_variance | 0.96840066 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 32799 |\n",
+ "| policy_loss | -0.00165 |\n",
+ "| std | 0.281 |\n",
+ "| value_loss | 0.000215 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.76 |\n",
+ "| ep_rew_mean | -0.209 |\n",
+ "| time/ | |\n",
+ "| fps | 330 |\n",
+ "| iterations | 32900 |\n",
+ "| time_elapsed | 1990 |\n",
+ "| total_timesteps | 658000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -0.28 |\n",
+ "| explained_variance | 0.8290812 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 32899 |\n",
+ "| policy_loss | -0.00931 |\n",
+ "| std | 0.279 |\n",
+ "| value_loss | 0.000542 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.93 |\n",
+ "| ep_rew_mean | -0.228 |\n",
+ "| time/ | |\n",
+ "| fps | 338 |\n",
+ "| iterations | 39300 |\n",
+ "| time_elapsed | 2321 |\n",
+ "| total_timesteps | 786000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.386 |\n",
+ "| explained_variance | 0.90208817 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 39299 |\n",
+ "| policy_loss | -0.00257 |\n",
+ "| std | 0.226 |\n",
+ "| value_loss | 0.000814 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.7 |\n",
+ "| ep_rew_mean | -0.205 |\n",
+ "| time/ | |\n",
+ "| fps | 338 |\n",
+ "| iterations | 39400 |\n",
+ "| time_elapsed | 2328 |\n",
+ "| total_timesteps | 788000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.402 |\n",
+ "| explained_variance | 0.9760422 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 39399 |\n",
+ "| policy_loss | 0.00142 |\n",
+ "| std | 0.225 |\n",
+ "| value_loss | 0.000173 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.6 |\n",
+ "| ep_rew_mean | -0.2 |\n",
+ "| time/ | |\n",
+ "| fps | 338 |\n",
+ "| iterations | 39500 |\n",
+ "| time_elapsed | 2334 |\n",
+ "| total_timesteps | 790000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.415 |\n",
+ "| explained_variance | 0.97800255 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 39499 |\n",
+ "| policy_loss | -0.00893 |\n",
+ "| std | 0.224 |\n",
+ "| value_loss | 0.000148 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.62 |\n",
+ "| ep_rew_mean | -0.197 |\n",
+ "| time/ | |\n",
+ "| fps | 338 |\n",
+ "| iterations | 39600 |\n",
+ "| time_elapsed | 2339 |\n",
+ "| total_timesteps | 792000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.409 |\n",
+ "| explained_variance | 0.5413128 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 39599 |\n",
+ "| policy_loss | 0.00744 |\n",
+ "| std | 0.224 |\n",
+ "| value_loss | 0.00194 |\n",
+ "-------------------------------------\n",
+ "------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.69 |\n",
+ "| ep_rew_mean | -0.214 |\n",
+ "| time/ | |\n",
+ "| fps | 338 |\n",
+ "| iterations | 39700 |\n",
+ "| time_elapsed | 2344 |\n",
+ "| total_timesteps | 794000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.399 |\n",
+ "| explained_variance | 0.984776 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 39699 |\n",
+ "| policy_loss | -0.00425 |\n",
+ "| std | 0.225 |\n",
+ "| value_loss | 7.3e-05 |\n",
+ "------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.79 |\n",
+ "| ep_rew_mean | -0.224 |\n",
+ "| time/ | |\n",
+ "| fps | 338 |\n",
+ "| iterations | 39800 |\n",
+ "| time_elapsed | 2349 |\n",
+ "| total_timesteps | 796000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.399 |\n",
+ "| explained_variance | 0.96791893 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 39799 |\n",
+ "| policy_loss | 0.00499 |\n",
+ "| std | 0.225 |\n",
+ "| value_loss | 0.000191 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.74 |\n",
+ "| ep_rew_mean | -0.218 |\n",
+ "| time/ | |\n",
+ "| fps | 338 |\n",
+ "| iterations | 39900 |\n",
+ "| time_elapsed | 2358 |\n",
+ "| total_timesteps | 798000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.383 |\n",
+ "| explained_variance | -2.428947 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 39899 |\n",
+ "| policy_loss | -0.0136 |\n",
+ "| std | 0.226 |\n",
+ "| value_loss | 0.0102 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.65 |\n",
+ "| ep_rew_mean | -0.206 |\n",
+ "| time/ | |\n",
+ "| fps | 338 |\n",
+ "| iterations | 40000 |\n",
+ "| time_elapsed | 2363 |\n",
+ "| total_timesteps | 800000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.369 |\n",
+ "| explained_variance | 0.98433495 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 39999 |\n",
+ "| policy_loss | -0.00361 |\n",
+ "| std | 0.228 |\n",
+ "| value_loss | 9.41e-05 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.7 |\n",
+ "| ep_rew_mean | -0.209 |\n",
+ "| time/ | |\n",
+ "| fps | 338 |\n",
+ "| iterations | 40100 |\n",
+ "| time_elapsed | 2368 |\n",
+ "| total_timesteps | 802000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.365 |\n",
+ "| explained_variance | 0.9824995 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 40099 |\n",
+ "| policy_loss | 0.000342 |\n",
+ "| std | 0.228 |\n",
+ "| value_loss | 5.56e-05 |\n",
+ "-------------------------------------\n",
+ "------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.55 |\n",
+ "| ep_rew_mean | -0.193 |\n",
+ "| time/ | |\n",
+ "| fps | 338 |\n",
+ "| iterations | 40200 |\n",
+ "| time_elapsed | 2373 |\n",
+ "| total_timesteps | 804000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.402 |\n",
+ "| explained_variance | 0.985323 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 40199 |\n",
+ "| policy_loss | 0.00106 |\n",
+ "| std | 0.226 |\n",
+ "| value_loss | 5.57e-05 |\n",
+ "------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 7.44 |\n",
+ "| ep_rew_mean | -0.644 |\n",
+ "| time/ | |\n",
+ "| fps | 338 |\n",
+ "| iterations | 40300 |\n",
+ "| time_elapsed | 2378 |\n",
+ "| total_timesteps | 806000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.396 |\n",
+ "| explained_variance | 0.8747574 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 40299 |\n",
+ "| policy_loss | -0.0966 |\n",
+ "| std | 0.225 |\n",
+ "| value_loss | 0.127 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 3.99 |\n",
+ "| ep_rew_mean | -0.314 |\n",
+ "| time/ | |\n",
+ "| fps | 338 |\n",
+ "| iterations | 40400 |\n",
+ "| time_elapsed | 2384 |\n",
+ "| total_timesteps | 808000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.4 |\n",
+ "| explained_variance | -3.9502196 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 40399 |\n",
+ "| policy_loss | -0.000899 |\n",
+ "| std | 0.224 |\n",
+ "| value_loss | 0.0228 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 3.68 |\n",
+ "| ep_rew_mean | -0.293 |\n",
+ "| time/ | |\n",
+ "| fps | 338 |\n",
+ "| iterations | 40500 |\n",
+ "| time_elapsed | 2392 |\n",
+ "| total_timesteps | 810000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.379 |\n",
+ "| explained_variance | 0.84740096 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 40499 |\n",
+ "| policy_loss | -0.000565 |\n",
+ "| std | 0.226 |\n",
+ "| value_loss | 0.0476 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.7 |\n",
+ "| ep_rew_mean | -0.212 |\n",
+ "| time/ | |\n",
+ "| fps | 338 |\n",
+ "| iterations | 40600 |\n",
+ "| time_elapsed | 2398 |\n",
+ "| total_timesteps | 812000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.395 |\n",
+ "| explained_variance | 0.80924624 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 40599 |\n",
+ "| policy_loss | -0.00864 |\n",
+ "| std | 0.225 |\n",
+ "| value_loss | 0.00108 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 3.04 |\n",
+ "| ep_rew_mean | -0.236 |\n",
+ "| time/ | |\n",
+ "| fps | 338 |\n",
+ "| iterations | 40700 |\n",
+ "| time_elapsed | 2402 |\n",
+ "| total_timesteps | 814000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.417 |\n",
+ "| explained_variance | 0.73390794 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 40699 |\n",
+ "| policy_loss | 0.0243 |\n",
+ "| std | 0.224 |\n",
+ "| value_loss | 0.0164 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.61 |\n",
+ "| ep_rew_mean | -0.205 |\n",
+ "| time/ | |\n",
+ "| fps | 338 |\n",
+ "| iterations | 40800 |\n",
+ "| time_elapsed | 2407 |\n",
+ "| total_timesteps | 816000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.406 |\n",
+ "| explained_variance | 0.97868705 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 40799 |\n",
+ "| policy_loss | -0.00146 |\n",
+ "| std | 0.225 |\n",
+ "| value_loss | 0.000133 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.71 |\n",
+ "| ep_rew_mean | -0.213 |\n",
+ "| time/ | |\n",
+ "| fps | 339 |\n",
+ "| iterations | 40900 |\n",
+ "| time_elapsed | 2412 |\n",
+ "| total_timesteps | 818000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.429 |\n",
+ "| explained_variance | 0.8363371 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 40899 |\n",
+ "| policy_loss | -0.00689 |\n",
+ "| std | 0.223 |\n",
+ "| value_loss | 0.000464 |\n",
+ "-------------------------------------\n",
+ "------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.59 |\n",
+ "| ep_rew_mean | -0.198 |\n",
+ "| time/ | |\n",
+ "| fps | 339 |\n",
+ "| iterations | 41000 |\n",
+ "| time_elapsed | 2417 |\n",
+ "| total_timesteps | 820000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.445 |\n",
+ "| explained_variance | 0.977923 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 40999 |\n",
+ "| policy_loss | -0.00173 |\n",
+ "| std | 0.222 |\n",
+ "| value_loss | 0.000178 |\n",
+ "------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.73 |\n",
+ "| ep_rew_mean | -0.208 |\n",
+ "| time/ | |\n",
+ "| fps | 338 |\n",
+ "| iterations | 41100 |\n",
+ "| time_elapsed | 2426 |\n",
+ "| total_timesteps | 822000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.458 |\n",
+ "| explained_variance | 0.63355607 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 41099 |\n",
+ "| policy_loss | 0.0177 |\n",
+ "| std | 0.22 |\n",
+ "| value_loss | 0.00102 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.72 |\n",
+ "| ep_rew_mean | -0.214 |\n",
+ "| time/ | |\n",
+ "| fps | 338 |\n",
+ "| iterations | 41200 |\n",
+ "| time_elapsed | 2431 |\n",
+ "| total_timesteps | 824000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.453 |\n",
+ "| explained_variance | 0.9759229 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 41199 |\n",
+ "| policy_loss | -0.0158 |\n",
+ "| std | 0.22 |\n",
+ "| value_loss | 0.000228 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.68 |\n",
+ "| ep_rew_mean | -0.211 |\n",
+ "| time/ | |\n",
+ "| fps | 339 |\n",
+ "| iterations | 41300 |\n",
+ "| time_elapsed | 2436 |\n",
+ "| total_timesteps | 826000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.445 |\n",
+ "| explained_variance | 0.99040455 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 41299 |\n",
+ "| policy_loss | 0.00678 |\n",
+ "| std | 0.221 |\n",
+ "| value_loss | 7.98e-05 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.62 |\n",
+ "| ep_rew_mean | -0.207 |\n",
+ "| time/ | |\n",
+ "| fps | 339 |\n",
+ "| iterations | 41400 |\n",
+ "| time_elapsed | 2441 |\n",
+ "| total_timesteps | 828000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.458 |\n",
+ "| explained_variance | 0.9926231 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 41399 |\n",
+ "| policy_loss | 0.00175 |\n",
+ "| std | 0.22 |\n",
+ "| value_loss | 2.89e-05 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.58 |\n",
+ "| ep_rew_mean | -0.196 |\n",
+ "| time/ | |\n",
+ "| fps | 339 |\n",
+ "| iterations | 41500 |\n",
+ "| time_elapsed | 2447 |\n",
+ "| total_timesteps | 830000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.47 |\n",
+ "| explained_variance | 0.97897565 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 41499 |\n",
+ "| policy_loss | 0.0038 |\n",
+ "| std | 0.219 |\n",
+ "| value_loss | 9.45e-05 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.61 |\n",
+ "| ep_rew_mean | -0.203 |\n",
+ "| time/ | |\n",
+ "| fps | 339 |\n",
+ "| iterations | 41600 |\n",
+ "| time_elapsed | 2452 |\n",
+ "| total_timesteps | 832000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.446 |\n",
+ "| explained_variance | 0.9452324 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 41599 |\n",
+ "| policy_loss | -0.011 |\n",
+ "| std | 0.22 |\n",
+ "| value_loss | 0.000302 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.62 |\n",
+ "| ep_rew_mean | -0.202 |\n",
+ "| time/ | |\n",
+ "| fps | 339 |\n",
+ "| iterations | 41700 |\n",
+ "| time_elapsed | 2457 |\n",
+ "| total_timesteps | 834000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.471 |\n",
+ "| explained_variance | 0.9743598 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 41699 |\n",
+ "| policy_loss | 0.00613 |\n",
+ "| std | 0.218 |\n",
+ "| value_loss | 0.000198 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.78 |\n",
+ "| ep_rew_mean | -0.212 |\n",
+ "| time/ | |\n",
+ "| fps | 339 |\n",
+ "| iterations | 41800 |\n",
+ "| time_elapsed | 2465 |\n",
+ "| total_timesteps | 836000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.465 |\n",
+ "| explained_variance | 0.6682483 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 41799 |\n",
+ "| policy_loss | -0.0067 |\n",
+ "| std | 0.219 |\n",
+ "| value_loss | 0.00284 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.72 |\n",
+ "| ep_rew_mean | -0.211 |\n",
+ "| time/ | |\n",
+ "| fps | 339 |\n",
+ "| iterations | 41900 |\n",
+ "| time_elapsed | 2469 |\n",
+ "| total_timesteps | 838000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.488 |\n",
+ "| explained_variance | 0.9824863 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 41899 |\n",
+ "| policy_loss | -0.00377 |\n",
+ "| std | 0.217 |\n",
+ "| value_loss | 9.89e-05 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.63 |\n",
+ "| ep_rew_mean | -0.205 |\n",
+ "| time/ | |\n",
+ "| fps | 339 |\n",
+ "| iterations | 42000 |\n",
+ "| time_elapsed | 2473 |\n",
+ "| total_timesteps | 840000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.508 |\n",
+ "| explained_variance | 0.97226715 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 41999 |\n",
+ "| policy_loss | -0.0114 |\n",
+ "| std | 0.216 |\n",
+ "| value_loss | 0.000727 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.77 |\n",
+ "| ep_rew_mean | -0.218 |\n",
+ "| time/ | |\n",
+ "| fps | 339 |\n",
+ "| iterations | 42100 |\n",
+ "| time_elapsed | 2477 |\n",
+ "| total_timesteps | 842000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.504 |\n",
+ "| explained_variance | 0.98028255 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 42099 |\n",
+ "| policy_loss | 0.00354 |\n",
+ "| std | 0.217 |\n",
+ "| value_loss | 0.000129 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.57 |\n",
+ "| ep_rew_mean | -0.199 |\n",
+ "| time/ | |\n",
+ "| fps | 339 |\n",
+ "| iterations | 42200 |\n",
+ "| time_elapsed | 2482 |\n",
+ "| total_timesteps | 844000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.489 |\n",
+ "| explained_variance | 0.96648335 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 42199 |\n",
+ "| policy_loss | -0.00583 |\n",
+ "| std | 0.218 |\n",
+ "| value_loss | 0.00013 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.68 |\n",
+ "| ep_rew_mean | -0.198 |\n",
+ "| time/ | |\n",
+ "| fps | 340 |\n",
+ "| iterations | 42300 |\n",
+ "| time_elapsed | 2487 |\n",
+ "| total_timesteps | 846000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.503 |\n",
+ "| explained_variance | 0.9749611 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 42299 |\n",
+ "| policy_loss | 0.00564 |\n",
+ "| std | 0.218 |\n",
+ "| value_loss | 0.000109 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.75 |\n",
+ "| ep_rew_mean | -0.218 |\n",
+ "| time/ | |\n",
+ "| fps | 340 |\n",
+ "| iterations | 42400 |\n",
+ "| time_elapsed | 2491 |\n",
+ "| total_timesteps | 848000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.524 |\n",
+ "| explained_variance | 0.99248254 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 42399 |\n",
+ "| policy_loss | -0.000632 |\n",
+ "| std | 0.216 |\n",
+ "| value_loss | 4.3e-05 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.61 |\n",
+ "| ep_rew_mean | -0.199 |\n",
+ "| time/ | |\n",
+ "| fps | 340 |\n",
+ "| iterations | 42500 |\n",
+ "| time_elapsed | 2499 |\n",
+ "| total_timesteps | 850000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.546 |\n",
+ "| explained_variance | 0.98732716 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 42499 |\n",
+ "| policy_loss | 0.000381 |\n",
+ "| std | 0.214 |\n",
+ "| value_loss | 4.16e-05 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.67 |\n",
+ "| ep_rew_mean | -0.205 |\n",
+ "| time/ | |\n",
+ "| fps | 340 |\n",
+ "| iterations | 42600 |\n",
+ "| time_elapsed | 2504 |\n",
+ "| total_timesteps | 852000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.548 |\n",
+ "| explained_variance | 0.9690981 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 42599 |\n",
+ "| policy_loss | -0.0125 |\n",
+ "| std | 0.213 |\n",
+ "| value_loss | 0.000401 |\n",
+ "-------------------------------------\n",
+ "------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.58 |\n",
+ "| ep_rew_mean | -0.194 |\n",
+ "| time/ | |\n",
+ "| fps | 340 |\n",
+ "| iterations | 42700 |\n",
+ "| time_elapsed | 2510 |\n",
+ "| total_timesteps | 854000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.548 |\n",
+ "| explained_variance | 0.948852 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 42699 |\n",
+ "| policy_loss | 0.00354 |\n",
+ "| std | 0.214 |\n",
+ "| value_loss | 9.49e-05 |\n",
+ "------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.72 |\n",
+ "| ep_rew_mean | -0.21 |\n",
+ "| time/ | |\n",
+ "| fps | 340 |\n",
+ "| iterations | 42800 |\n",
+ "| time_elapsed | 2514 |\n",
+ "| total_timesteps | 856000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.541 |\n",
+ "| explained_variance | 0.9658161 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 42799 |\n",
+ "| policy_loss | -3.41e-05 |\n",
+ "| std | 0.214 |\n",
+ "| value_loss | 0.000158 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.7 |\n",
+ "| ep_rew_mean | -0.209 |\n",
+ "| time/ | |\n",
+ "| fps | 340 |\n",
+ "| iterations | 42900 |\n",
+ "| time_elapsed | 2519 |\n",
+ "| total_timesteps | 858000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.531 |\n",
+ "| explained_variance | 0.9794916 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 42899 |\n",
+ "| policy_loss | 0.00583 |\n",
+ "| std | 0.214 |\n",
+ "| value_loss | 9.42e-05 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.47 |\n",
+ "| ep_rew_mean | -0.187 |\n",
+ "| time/ | |\n",
+ "| fps | 340 |\n",
+ "| iterations | 43000 |\n",
+ "| time_elapsed | 2524 |\n",
+ "| total_timesteps | 860000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.527 |\n",
+ "| explained_variance | 0.98890656 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 42999 |\n",
+ "| policy_loss | -0.00859 |\n",
+ "| std | 0.214 |\n",
+ "| value_loss | 0.000106 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.7 |\n",
+ "| ep_rew_mean | -0.204 |\n",
+ "| time/ | |\n",
+ "| fps | 340 |\n",
+ "| iterations | 43100 |\n",
+ "| time_elapsed | 2528 |\n",
+ "| total_timesteps | 862000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.519 |\n",
+ "| explained_variance | 0.97553414 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 43099 |\n",
+ "| policy_loss | -0.00251 |\n",
+ "| std | 0.215 |\n",
+ "| value_loss | 6.28e-05 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.69 |\n",
+ "| ep_rew_mean | -0.218 |\n",
+ "| time/ | |\n",
+ "| fps | 340 |\n",
+ "| iterations | 43200 |\n",
+ "| time_elapsed | 2537 |\n",
+ "| total_timesteps | 864000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.553 |\n",
+ "| explained_variance | 0.9948936 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 43199 |\n",
+ "| policy_loss | 0.000498 |\n",
+ "| std | 0.212 |\n",
+ "| value_loss | 3.8e-05 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.7 |\n",
+ "| ep_rew_mean | -0.21 |\n",
+ "| time/ | |\n",
+ "| fps | 340 |\n",
+ "| iterations | 43300 |\n",
+ "| time_elapsed | 2542 |\n",
+ "| total_timesteps | 866000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.561 |\n",
+ "| explained_variance | 0.9760311 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 43299 |\n",
+ "| policy_loss | -0.00212 |\n",
+ "| std | 0.212 |\n",
+ "| value_loss | 0.000129 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.85 |\n",
+ "| ep_rew_mean | -0.231 |\n",
+ "| time/ | |\n",
+ "| fps | 340 |\n",
+ "| iterations | 43400 |\n",
+ "| time_elapsed | 2547 |\n",
+ "| total_timesteps | 868000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.558 |\n",
+ "| explained_variance | 0.9611102 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 43399 |\n",
+ "| policy_loss | -0.000699 |\n",
+ "| std | 0.212 |\n",
+ "| value_loss | 0.000191 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.67 |\n",
+ "| ep_rew_mean | -0.2 |\n",
+ "| time/ | |\n",
+ "| fps | 340 |\n",
+ "| iterations | 43500 |\n",
+ "| time_elapsed | 2551 |\n",
+ "| total_timesteps | 870000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.573 |\n",
+ "| explained_variance | 0.98930174 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 43499 |\n",
+ "| policy_loss | -0.0037 |\n",
+ "| std | 0.211 |\n",
+ "| value_loss | 4.31e-05 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.62 |\n",
+ "| ep_rew_mean | -0.201 |\n",
+ "| time/ | |\n",
+ "| fps | 341 |\n",
+ "| iterations | 43600 |\n",
+ "| time_elapsed | 2556 |\n",
+ "| total_timesteps | 872000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.586 |\n",
+ "| explained_variance | 0.98348564 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 43599 |\n",
+ "| policy_loss | 0.00287 |\n",
+ "| std | 0.21 |\n",
+ "| value_loss | 0.000106 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.72 |\n",
+ "| ep_rew_mean | -0.199 |\n",
+ "| time/ | |\n",
+ "| fps | 341 |\n",
+ "| iterations | 43700 |\n",
+ "| time_elapsed | 2562 |\n",
+ "| total_timesteps | 874000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.616 |\n",
+ "| explained_variance | 0.69001275 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 43699 |\n",
+ "| policy_loss | 0.0157 |\n",
+ "| std | 0.208 |\n",
+ "| value_loss | 0.00192 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.74 |\n",
+ "| ep_rew_mean | -0.208 |\n",
+ "| time/ | |\n",
+ "| fps | 340 |\n",
+ "| iterations | 43800 |\n",
+ "| time_elapsed | 2571 |\n",
+ "| total_timesteps | 876000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.615 |\n",
+ "| explained_variance | 0.97150284 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 43799 |\n",
+ "| policy_loss | 0.000559 |\n",
+ "| std | 0.208 |\n",
+ "| value_loss | 0.000119 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.85 |\n",
+ "| ep_rew_mean | -0.218 |\n",
+ "| time/ | |\n",
+ "| fps | 340 |\n",
+ "| iterations | 43900 |\n",
+ "| time_elapsed | 2576 |\n",
+ "| total_timesteps | 878000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.633 |\n",
+ "| explained_variance | 0.98416793 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 43899 |\n",
+ "| policy_loss | 0.00093 |\n",
+ "| std | 0.206 |\n",
+ "| value_loss | 0.000116 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.64 |\n",
+ "| ep_rew_mean | -0.206 |\n",
+ "| time/ | |\n",
+ "| fps | 340 |\n",
+ "| iterations | 44000 |\n",
+ "| time_elapsed | 2582 |\n",
+ "| total_timesteps | 880000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.618 |\n",
+ "| explained_variance | 0.9784064 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 43999 |\n",
+ "| policy_loss | 0.00364 |\n",
+ "| std | 0.208 |\n",
+ "| value_loss | 0.000147 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.74 |\n",
+ "| ep_rew_mean | -0.211 |\n",
+ "| time/ | |\n",
+ "| fps | 340 |\n",
+ "| iterations | 44100 |\n",
+ "| time_elapsed | 2586 |\n",
+ "| total_timesteps | 882000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.635 |\n",
+ "| explained_variance | 0.9662014 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 44099 |\n",
+ "| policy_loss | -0.00547 |\n",
+ "| std | 0.207 |\n",
+ "| value_loss | 0.000243 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.89 |\n",
+ "| ep_rew_mean | -0.241 |\n",
+ "| time/ | |\n",
+ "| fps | 341 |\n",
+ "| iterations | 44200 |\n",
+ "| time_elapsed | 2591 |\n",
+ "| total_timesteps | 884000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.656 |\n",
+ "| explained_variance | 0.7693075 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 44199 |\n",
+ "| policy_loss | 0.0143 |\n",
+ "| std | 0.206 |\n",
+ "| value_loss | 0.00191 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.75 |\n",
+ "| ep_rew_mean | -0.206 |\n",
+ "| time/ | |\n",
+ "| fps | 341 |\n",
+ "| iterations | 44300 |\n",
+ "| time_elapsed | 2595 |\n",
+ "| total_timesteps | 886000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.646 |\n",
+ "| explained_variance | 0.9649852 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 44299 |\n",
+ "| policy_loss | -0.00818 |\n",
+ "| std | 0.206 |\n",
+ "| value_loss | 0.000203 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.73 |\n",
+ "| ep_rew_mean | -0.213 |\n",
+ "| time/ | |\n",
+ "| fps | 341 |\n",
+ "| iterations | 44400 |\n",
+ "| time_elapsed | 2600 |\n",
+ "| total_timesteps | 888000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.677 |\n",
+ "| explained_variance | 0.9866615 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 44399 |\n",
+ "| policy_loss | -0.00452 |\n",
+ "| std | 0.204 |\n",
+ "| value_loss | 0.000119 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.75 |\n",
+ "| ep_rew_mean | -0.219 |\n",
+ "| time/ | |\n",
+ "| fps | 341 |\n",
+ "| iterations | 44500 |\n",
+ "| time_elapsed | 2608 |\n",
+ "| total_timesteps | 890000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.688 |\n",
+ "| explained_variance | 0.98133665 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 44499 |\n",
+ "| policy_loss | 0.00382 |\n",
+ "| std | 0.204 |\n",
+ "| value_loss | 0.000157 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.6 |\n",
+ "| ep_rew_mean | -0.189 |\n",
+ "| time/ | |\n",
+ "| fps | 341 |\n",
+ "| iterations | 44600 |\n",
+ "| time_elapsed | 2612 |\n",
+ "| total_timesteps | 892000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.697 |\n",
+ "| explained_variance | 0.9878949 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 44599 |\n",
+ "| policy_loss | -0.000211 |\n",
+ "| std | 0.203 |\n",
+ "| value_loss | 6.87e-05 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.7 |\n",
+ "| ep_rew_mean | -0.21 |\n",
+ "| time/ | |\n",
+ "| fps | 341 |\n",
+ "| iterations | 44700 |\n",
+ "| time_elapsed | 2617 |\n",
+ "| total_timesteps | 894000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.71 |\n",
+ "| explained_variance | 0.9808317 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 44699 |\n",
+ "| policy_loss | -0.000497 |\n",
+ "| std | 0.202 |\n",
+ "| value_loss | 0.000117 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.69 |\n",
+ "| ep_rew_mean | -0.207 |\n",
+ "| time/ | |\n",
+ "| fps | 341 |\n",
+ "| iterations | 44800 |\n",
+ "| time_elapsed | 2622 |\n",
+ "| total_timesteps | 896000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.737 |\n",
+ "| explained_variance | 0.9543187 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 44799 |\n",
+ "| policy_loss | -0.0002 |\n",
+ "| std | 0.201 |\n",
+ "| value_loss | 0.00014 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.67 |\n",
+ "| ep_rew_mean | -0.203 |\n",
+ "| time/ | |\n",
+ "| fps | 341 |\n",
+ "| iterations | 44900 |\n",
+ "| time_elapsed | 2626 |\n",
+ "| total_timesteps | 898000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.736 |\n",
+ "| explained_variance | 0.9573474 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 44899 |\n",
+ "| policy_loss | -0.016 |\n",
+ "| std | 0.2 |\n",
+ "| value_loss | 0.000362 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.47 |\n",
+ "| ep_rew_mean | -0.181 |\n",
+ "| time/ | |\n",
+ "| fps | 342 |\n",
+ "| iterations | 45000 |\n",
+ "| time_elapsed | 2631 |\n",
+ "| total_timesteps | 900000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.706 |\n",
+ "| explained_variance | 0.9849114 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 44999 |\n",
+ "| policy_loss | 0.00167 |\n",
+ "| std | 0.203 |\n",
+ "| value_loss | 0.000118 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.68 |\n",
+ "| ep_rew_mean | -0.207 |\n",
+ "| time/ | |\n",
+ "| fps | 342 |\n",
+ "| iterations | 45100 |\n",
+ "| time_elapsed | 2636 |\n",
+ "| total_timesteps | 902000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.696 |\n",
+ "| explained_variance | 0.80178624 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 45099 |\n",
+ "| policy_loss | 0.00337 |\n",
+ "| std | 0.203 |\n",
+ "| value_loss | 0.00101 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.64 |\n",
+ "| ep_rew_mean | -0.208 |\n",
+ "| time/ | |\n",
+ "| fps | 341 |\n",
+ "| iterations | 45200 |\n",
+ "| time_elapsed | 2644 |\n",
+ "| total_timesteps | 904000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.717 |\n",
+ "| explained_variance | 0.9752399 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 45199 |\n",
+ "| policy_loss | -0.00655 |\n",
+ "| std | 0.203 |\n",
+ "| value_loss | 0.00014 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.68 |\n",
+ "| ep_rew_mean | -0.204 |\n",
+ "| time/ | |\n",
+ "| fps | 342 |\n",
+ "| iterations | 45300 |\n",
+ "| time_elapsed | 2649 |\n",
+ "| total_timesteps | 906000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.709 |\n",
+ "| explained_variance | 0.98351824 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 45299 |\n",
+ "| policy_loss | -0.00476 |\n",
+ "| std | 0.203 |\n",
+ "| value_loss | 8.73e-05 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.63 |\n",
+ "| ep_rew_mean | -0.197 |\n",
+ "| time/ | |\n",
+ "| fps | 342 |\n",
+ "| iterations | 45400 |\n",
+ "| time_elapsed | 2653 |\n",
+ "| total_timesteps | 908000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.71 |\n",
+ "| explained_variance | 0.99506396 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 45399 |\n",
+ "| policy_loss | -0.00952 |\n",
+ "| std | 0.203 |\n",
+ "| value_loss | 0.000135 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.69 |\n",
+ "| ep_rew_mean | -0.208 |\n",
+ "| time/ | |\n",
+ "| fps | 342 |\n",
+ "| iterations | 45500 |\n",
+ "| time_elapsed | 2658 |\n",
+ "| total_timesteps | 910000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.711 |\n",
+ "| explained_variance | 0.9815719 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 45499 |\n",
+ "| policy_loss | 0.00707 |\n",
+ "| std | 0.203 |\n",
+ "| value_loss | 0.000139 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.45 |\n",
+ "| ep_rew_mean | -0.171 |\n",
+ "| time/ | |\n",
+ "| fps | 342 |\n",
+ "| iterations | 45600 |\n",
+ "| time_elapsed | 2665 |\n",
+ "| total_timesteps | 912000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.718 |\n",
+ "| explained_variance | 0.93521357 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 45599 |\n",
+ "| policy_loss | 0.016 |\n",
+ "| std | 0.203 |\n",
+ "| value_loss | 0.00045 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.93 |\n",
+ "| ep_rew_mean | -0.235 |\n",
+ "| time/ | |\n",
+ "| fps | 341 |\n",
+ "| iterations | 45700 |\n",
+ "| time_elapsed | 2676 |\n",
+ "| total_timesteps | 914000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.724 |\n",
+ "| explained_variance | 0.9420419 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 45699 |\n",
+ "| policy_loss | -0.00632 |\n",
+ "| std | 0.202 |\n",
+ "| value_loss | 0.000384 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.83 |\n",
+ "| ep_rew_mean | -0.22 |\n",
+ "| time/ | |\n",
+ "| fps | 341 |\n",
+ "| iterations | 45800 |\n",
+ "| time_elapsed | 2682 |\n",
+ "| total_timesteps | 916000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.739 |\n",
+ "| explained_variance | 0.95885766 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 45799 |\n",
+ "| policy_loss | 0.00537 |\n",
+ "| std | 0.202 |\n",
+ "| value_loss | 0.000219 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.62 |\n",
+ "| ep_rew_mean | -0.2 |\n",
+ "| time/ | |\n",
+ "| fps | 341 |\n",
+ "| iterations | 45900 |\n",
+ "| time_elapsed | 2689 |\n",
+ "| total_timesteps | 918000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.743 |\n",
+ "| explained_variance | 0.94796073 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 45899 |\n",
+ "| policy_loss | -6.43e-05 |\n",
+ "| std | 0.202 |\n",
+ "| value_loss | 0.000202 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.59 |\n",
+ "| ep_rew_mean | -0.197 |\n",
+ "| time/ | |\n",
+ "| fps | 341 |\n",
+ "| iterations | 46000 |\n",
+ "| time_elapsed | 2695 |\n",
+ "| total_timesteps | 920000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.734 |\n",
+ "| explained_variance | 0.9870241 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 45999 |\n",
+ "| policy_loss | 0.00297 |\n",
+ "| std | 0.202 |\n",
+ "| value_loss | 6.71e-05 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.74 |\n",
+ "| ep_rew_mean | -0.214 |\n",
+ "| time/ | |\n",
+ "| fps | 341 |\n",
+ "| iterations | 46100 |\n",
+ "| time_elapsed | 2701 |\n",
+ "| total_timesteps | 922000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.735 |\n",
+ "| explained_variance | 0.97447634 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 46099 |\n",
+ "| policy_loss | -0.00677 |\n",
+ "| std | 0.202 |\n",
+ "| value_loss | 0.000132 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.75 |\n",
+ "| ep_rew_mean | -0.22 |\n",
+ "| time/ | |\n",
+ "| fps | 341 |\n",
+ "| iterations | 46200 |\n",
+ "| time_elapsed | 2707 |\n",
+ "| total_timesteps | 924000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.748 |\n",
+ "| explained_variance | 0.9668383 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 46199 |\n",
+ "| policy_loss | -0.00958 |\n",
+ "| std | 0.201 |\n",
+ "| value_loss | 0.00026 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.83 |\n",
+ "| ep_rew_mean | -0.22 |\n",
+ "| time/ | |\n",
+ "| fps | 340 |\n",
+ "| iterations | 46300 |\n",
+ "| time_elapsed | 2717 |\n",
+ "| total_timesteps | 926000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.753 |\n",
+ "| explained_variance | 0.9713844 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 46299 |\n",
+ "| policy_loss | -0.00175 |\n",
+ "| std | 0.202 |\n",
+ "| value_loss | 0.000267 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.73 |\n",
+ "| ep_rew_mean | -0.216 |\n",
+ "| time/ | |\n",
+ "| fps | 340 |\n",
+ "| iterations | 46400 |\n",
+ "| time_elapsed | 2724 |\n",
+ "| total_timesteps | 928000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.744 |\n",
+ "| explained_variance | 0.9690941 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 46399 |\n",
+ "| policy_loss | 0.00198 |\n",
+ "| std | 0.202 |\n",
+ "| value_loss | 9.71e-05 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.71 |\n",
+ "| ep_rew_mean | -0.211 |\n",
+ "| time/ | |\n",
+ "| fps | 340 |\n",
+ "| iterations | 46500 |\n",
+ "| time_elapsed | 2730 |\n",
+ "| total_timesteps | 930000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.752 |\n",
+ "| explained_variance | 0.98169756 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 46499 |\n",
+ "| policy_loss | -0.00182 |\n",
+ "| std | 0.201 |\n",
+ "| value_loss | 7.57e-05 |\n",
+ "--------------------------------------\n",
+ "------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.73 |\n",
+ "| ep_rew_mean | -0.216 |\n",
+ "| time/ | |\n",
+ "| fps | 340 |\n",
+ "| iterations | 46600 |\n",
+ "| time_elapsed | 2736 |\n",
+ "| total_timesteps | 932000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.768 |\n",
+ "| explained_variance | 0.958521 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 46599 |\n",
+ "| policy_loss | 0.00796 |\n",
+ "| std | 0.2 |\n",
+ "| value_loss | 0.000321 |\n",
+ "------------------------------------\n",
+ "------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.54 |\n",
+ "| ep_rew_mean | -0.194 |\n",
+ "| time/ | |\n",
+ "| fps | 340 |\n",
+ "| iterations | 46700 |\n",
+ "| time_elapsed | 2743 |\n",
+ "| total_timesteps | 934000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.771 |\n",
+ "| explained_variance | 0.9603 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 46699 |\n",
+ "| policy_loss | 0.00811 |\n",
+ "| std | 0.2 |\n",
+ "| value_loss | 0.000171 |\n",
+ "------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.6 |\n",
+ "| ep_rew_mean | -0.198 |\n",
+ "| time/ | |\n",
+ "| fps | 339 |\n",
+ "| iterations | 46800 |\n",
+ "| time_elapsed | 2753 |\n",
+ "| total_timesteps | 936000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.768 |\n",
+ "| explained_variance | 0.9908145 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 46799 |\n",
+ "| policy_loss | 0.00219 |\n",
+ "| std | 0.199 |\n",
+ "| value_loss | 7.84e-05 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.66 |\n",
+ "| ep_rew_mean | -0.198 |\n",
+ "| time/ | |\n",
+ "| fps | 340 |\n",
+ "| iterations | 46900 |\n",
+ "| time_elapsed | 2757 |\n",
+ "| total_timesteps | 938000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.781 |\n",
+ "| explained_variance | 0.9639614 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 46899 |\n",
+ "| policy_loss | 0.0111 |\n",
+ "| std | 0.199 |\n",
+ "| value_loss | 0.000552 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.69 |\n",
+ "| ep_rew_mean | -0.213 |\n",
+ "| time/ | |\n",
+ "| fps | 340 |\n",
+ "| iterations | 47000 |\n",
+ "| time_elapsed | 2762 |\n",
+ "| total_timesteps | 940000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.787 |\n",
+ "| explained_variance | 0.97391367 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 46999 |\n",
+ "| policy_loss | 0.00372 |\n",
+ "| std | 0.199 |\n",
+ "| value_loss | 0.000137 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.81 |\n",
+ "| ep_rew_mean | -0.22 |\n",
+ "| time/ | |\n",
+ "| fps | 340 |\n",
+ "| iterations | 47100 |\n",
+ "| time_elapsed | 2766 |\n",
+ "| total_timesteps | 942000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.788 |\n",
+ "| explained_variance | 0.97501403 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 47099 |\n",
+ "| policy_loss | -0.0168 |\n",
+ "| std | 0.198 |\n",
+ "| value_loss | 0.000357 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.59 |\n",
+ "| ep_rew_mean | -0.2 |\n",
+ "| time/ | |\n",
+ "| fps | 340 |\n",
+ "| iterations | 47200 |\n",
+ "| time_elapsed | 2771 |\n",
+ "| total_timesteps | 944000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.786 |\n",
+ "| explained_variance | 0.7917006 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 47199 |\n",
+ "| policy_loss | 0.0273 |\n",
+ "| std | 0.199 |\n",
+ "| value_loss | 0.00183 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.75 |\n",
+ "| ep_rew_mean | -0.217 |\n",
+ "| time/ | |\n",
+ "| fps | 340 |\n",
+ "| iterations | 47300 |\n",
+ "| time_elapsed | 2775 |\n",
+ "| total_timesteps | 946000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.784 |\n",
+ "| explained_variance | 0.9474554 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 47299 |\n",
+ "| policy_loss | 0.0125 |\n",
+ "| std | 0.199 |\n",
+ "| value_loss | 0.000405 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.74 |\n",
+ "| ep_rew_mean | -0.219 |\n",
+ "| time/ | |\n",
+ "| fps | 341 |\n",
+ "| iterations | 47400 |\n",
+ "| time_elapsed | 2779 |\n",
+ "| total_timesteps | 948000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.786 |\n",
+ "| explained_variance | 0.98800665 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 47399 |\n",
+ "| policy_loss | 0.00237 |\n",
+ "| std | 0.198 |\n",
+ "| value_loss | 8.46e-05 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.76 |\n",
+ "| ep_rew_mean | -0.211 |\n",
+ "| time/ | |\n",
+ "| fps | 340 |\n",
+ "| iterations | 47500 |\n",
+ "| time_elapsed | 2787 |\n",
+ "| total_timesteps | 950000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.781 |\n",
+ "| explained_variance | 0.9724678 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 47499 |\n",
+ "| policy_loss | 0.0024 |\n",
+ "| std | 0.199 |\n",
+ "| value_loss | 0.000164 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.77 |\n",
+ "| ep_rew_mean | -0.216 |\n",
+ "| time/ | |\n",
+ "| fps | 340 |\n",
+ "| iterations | 47600 |\n",
+ "| time_elapsed | 2792 |\n",
+ "| total_timesteps | 952000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.785 |\n",
+ "| explained_variance | 0.99027014 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 47599 |\n",
+ "| policy_loss | -0.00857 |\n",
+ "| std | 0.198 |\n",
+ "| value_loss | 0.000118 |\n",
+ "--------------------------------------\n",
+ "------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.8 |\n",
+ "| ep_rew_mean | -0.223 |\n",
+ "| time/ | |\n",
+ "| fps | 341 |\n",
+ "| iterations | 47700 |\n",
+ "| time_elapsed | 2796 |\n",
+ "| total_timesteps | 954000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.789 |\n",
+ "| explained_variance | 0.990915 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 47699 |\n",
+ "| policy_loss | -0.00501 |\n",
+ "| std | 0.198 |\n",
+ "| value_loss | 6.49e-05 |\n",
+ "------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.69 |\n",
+ "| ep_rew_mean | -0.209 |\n",
+ "| time/ | |\n",
+ "| fps | 341 |\n",
+ "| iterations | 47800 |\n",
+ "| time_elapsed | 2800 |\n",
+ "| total_timesteps | 956000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.792 |\n",
+ "| explained_variance | 0.9743882 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 47799 |\n",
+ "| policy_loss | -0.00831 |\n",
+ "| std | 0.198 |\n",
+ "| value_loss | 0.00032 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.68 |\n",
+ "| ep_rew_mean | -0.203 |\n",
+ "| time/ | |\n",
+ "| fps | 341 |\n",
+ "| iterations | 47900 |\n",
+ "| time_elapsed | 2806 |\n",
+ "| total_timesteps | 958000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.814 |\n",
+ "| explained_variance | 0.9837645 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 47899 |\n",
+ "| policy_loss | 0.00121 |\n",
+ "| std | 0.196 |\n",
+ "| value_loss | 7.55e-05 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.67 |\n",
+ "| ep_rew_mean | -0.203 |\n",
+ "| time/ | |\n",
+ "| fps | 341 |\n",
+ "| iterations | 48000 |\n",
+ "| time_elapsed | 2810 |\n",
+ "| total_timesteps | 960000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.816 |\n",
+ "| explained_variance | 0.9931013 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 47999 |\n",
+ "| policy_loss | -0.00325 |\n",
+ "| std | 0.196 |\n",
+ "| value_loss | 5.91e-05 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.58 |\n",
+ "| ep_rew_mean | -0.197 |\n",
+ "| time/ | |\n",
+ "| fps | 341 |\n",
+ "| iterations | 48100 |\n",
+ "| time_elapsed | 2815 |\n",
+ "| total_timesteps | 962000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.838 |\n",
+ "| explained_variance | 0.97392714 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 48099 |\n",
+ "| policy_loss | -0.00442 |\n",
+ "| std | 0.195 |\n",
+ "| value_loss | 7.03e-05 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.66 |\n",
+ "| ep_rew_mean | -0.205 |\n",
+ "| time/ | |\n",
+ "| fps | 341 |\n",
+ "| iterations | 48200 |\n",
+ "| time_elapsed | 2823 |\n",
+ "| total_timesteps | 964000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.853 |\n",
+ "| explained_variance | 0.9561405 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 48199 |\n",
+ "| policy_loss | 0.0005 |\n",
+ "| std | 0.194 |\n",
+ "| value_loss | 0.000178 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.63 |\n",
+ "| ep_rew_mean | -0.195 |\n",
+ "| time/ | |\n",
+ "| fps | 341 |\n",
+ "| iterations | 48300 |\n",
+ "| time_elapsed | 2827 |\n",
+ "| total_timesteps | 966000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.871 |\n",
+ "| explained_variance | 0.98603976 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 48299 |\n",
+ "| policy_loss | 0.0108 |\n",
+ "| std | 0.193 |\n",
+ "| value_loss | 0.00013 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.66 |\n",
+ "| ep_rew_mean | -0.206 |\n",
+ "| time/ | |\n",
+ "| fps | 341 |\n",
+ "| iterations | 48400 |\n",
+ "| time_elapsed | 2832 |\n",
+ "| total_timesteps | 968000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.876 |\n",
+ "| explained_variance | 0.8844548 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 48399 |\n",
+ "| policy_loss | -0.00342 |\n",
+ "| std | 0.192 |\n",
+ "| value_loss | 0.000601 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.61 |\n",
+ "| ep_rew_mean | -0.203 |\n",
+ "| time/ | |\n",
+ "| fps | 341 |\n",
+ "| iterations | 48500 |\n",
+ "| time_elapsed | 2837 |\n",
+ "| total_timesteps | 970000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.892 |\n",
+ "| explained_variance | 0.97415155 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 48499 |\n",
+ "| policy_loss | 0.00535 |\n",
+ "| std | 0.191 |\n",
+ "| value_loss | 0.00015 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.75 |\n",
+ "| ep_rew_mean | -0.215 |\n",
+ "| time/ | |\n",
+ "| fps | 342 |\n",
+ "| iterations | 48600 |\n",
+ "| time_elapsed | 2841 |\n",
+ "| total_timesteps | 972000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.903 |\n",
+ "| explained_variance | 0.9617835 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 48599 |\n",
+ "| policy_loss | 0.0132 |\n",
+ "| std | 0.191 |\n",
+ "| value_loss | 0.000383 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.72 |\n",
+ "| ep_rew_mean | -0.203 |\n",
+ "| time/ | |\n",
+ "| fps | 342 |\n",
+ "| iterations | 48700 |\n",
+ "| time_elapsed | 2846 |\n",
+ "| total_timesteps | 974000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.89 |\n",
+ "| explained_variance | 0.9830824 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 48699 |\n",
+ "| policy_loss | 0.00573 |\n",
+ "| std | 0.191 |\n",
+ "| value_loss | 0.000103 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.73 |\n",
+ "| ep_rew_mean | -0.214 |\n",
+ "| time/ | |\n",
+ "| fps | 342 |\n",
+ "| iterations | 48800 |\n",
+ "| time_elapsed | 2850 |\n",
+ "| total_timesteps | 976000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.887 |\n",
+ "| explained_variance | 0.9614461 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 48799 |\n",
+ "| policy_loss | -0.0139 |\n",
+ "| std | 0.192 |\n",
+ "| value_loss | 0.000635 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.76 |\n",
+ "| ep_rew_mean | -0.21 |\n",
+ "| time/ | |\n",
+ "| fps | 342 |\n",
+ "| iterations | 48900 |\n",
+ "| time_elapsed | 2858 |\n",
+ "| total_timesteps | 978000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.88 |\n",
+ "| explained_variance | 0.9425929 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 48899 |\n",
+ "| policy_loss | 4.7e-05 |\n",
+ "| std | 0.192 |\n",
+ "| value_loss | 0.000399 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.62 |\n",
+ "| ep_rew_mean | -0.208 |\n",
+ "| time/ | |\n",
+ "| fps | 342 |\n",
+ "| iterations | 49000 |\n",
+ "| time_elapsed | 2862 |\n",
+ "| total_timesteps | 980000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.873 |\n",
+ "| explained_variance | 0.9772742 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 48999 |\n",
+ "| policy_loss | 0.014 |\n",
+ "| std | 0.193 |\n",
+ "| value_loss | 0.000161 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.5 |\n",
+ "| ep_rew_mean | -0.185 |\n",
+ "| time/ | |\n",
+ "| fps | 342 |\n",
+ "| iterations | 49100 |\n",
+ "| time_elapsed | 2868 |\n",
+ "| total_timesteps | 982000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.878 |\n",
+ "| explained_variance | 0.97900677 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 49099 |\n",
+ "| policy_loss | -0.00167 |\n",
+ "| std | 0.191 |\n",
+ "| value_loss | 0.000136 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.77 |\n",
+ "| ep_rew_mean | -0.216 |\n",
+ "| time/ | |\n",
+ "| fps | 342 |\n",
+ "| iterations | 49200 |\n",
+ "| time_elapsed | 2872 |\n",
+ "| total_timesteps | 984000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.883 |\n",
+ "| explained_variance | 0.96298355 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 49199 |\n",
+ "| policy_loss | -0.000752 |\n",
+ "| std | 0.191 |\n",
+ "| value_loss | 9.23e-05 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.79 |\n",
+ "| ep_rew_mean | -0.222 |\n",
+ "| time/ | |\n",
+ "| fps | 342 |\n",
+ "| iterations | 49300 |\n",
+ "| time_elapsed | 2876 |\n",
+ "| total_timesteps | 986000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.911 |\n",
+ "| explained_variance | 0.98365396 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 49299 |\n",
+ "| policy_loss | -0.00456 |\n",
+ "| std | 0.189 |\n",
+ "| value_loss | 0.000266 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.68 |\n",
+ "| ep_rew_mean | -0.212 |\n",
+ "| time/ | |\n",
+ "| fps | 342 |\n",
+ "| iterations | 49400 |\n",
+ "| time_elapsed | 2881 |\n",
+ "| total_timesteps | 988000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.931 |\n",
+ "| explained_variance | 0.95916724 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 49399 |\n",
+ "| policy_loss | -0.00675 |\n",
+ "| std | 0.187 |\n",
+ "| value_loss | 0.000311 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.73 |\n",
+ "| ep_rew_mean | -0.219 |\n",
+ "| time/ | |\n",
+ "| fps | 343 |\n",
+ "| iterations | 49500 |\n",
+ "| time_elapsed | 2885 |\n",
+ "| total_timesteps | 990000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.918 |\n",
+ "| explained_variance | 0.99461377 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 49499 |\n",
+ "| policy_loss | 0.0052 |\n",
+ "| std | 0.188 |\n",
+ "| value_loss | 5.61e-05 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.67 |\n",
+ "| ep_rew_mean | -0.203 |\n",
+ "| time/ | |\n",
+ "| fps | 342 |\n",
+ "| iterations | 49600 |\n",
+ "| time_elapsed | 2893 |\n",
+ "| total_timesteps | 992000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.942 |\n",
+ "| explained_variance | 0.9908923 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 49599 |\n",
+ "| policy_loss | 0.00297 |\n",
+ "| std | 0.187 |\n",
+ "| value_loss | 9.09e-05 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.51 |\n",
+ "| ep_rew_mean | -0.186 |\n",
+ "| time/ | |\n",
+ "| fps | 342 |\n",
+ "| iterations | 49700 |\n",
+ "| time_elapsed | 2898 |\n",
+ "| total_timesteps | 994000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.961 |\n",
+ "| explained_variance | 0.97241545 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 49699 |\n",
+ "| policy_loss | -0.00635 |\n",
+ "| std | 0.186 |\n",
+ "| value_loss | 0.000121 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.71 |\n",
+ "| ep_rew_mean | -0.21 |\n",
+ "| time/ | |\n",
+ "| fps | 343 |\n",
+ "| iterations | 49800 |\n",
+ "| time_elapsed | 2902 |\n",
+ "| total_timesteps | 996000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.958 |\n",
+ "| explained_variance | 0.99666107 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 49799 |\n",
+ "| policy_loss | -0.00314 |\n",
+ "| std | 0.186 |\n",
+ "| value_loss | 4e-05 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.66 |\n",
+ "| ep_rew_mean | -0.2 |\n",
+ "| time/ | |\n",
+ "| fps | 343 |\n",
+ "| iterations | 49900 |\n",
+ "| time_elapsed | 2906 |\n",
+ "| total_timesteps | 998000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.943 |\n",
+ "| explained_variance | 0.9459344 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 49899 |\n",
+ "| policy_loss | -0.00787 |\n",
+ "| std | 0.187 |\n",
+ "| value_loss | 0.000279 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 2.86 |\n",
+ "| ep_rew_mean | -0.22 |\n",
+ "| time/ | |\n",
+ "| fps | 343 |\n",
+ "| iterations | 50000 |\n",
+ "| time_elapsed | 2911 |\n",
+ "| total_timesteps | 1000000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | 0.965 |\n",
+ "| explained_variance | 0.93244386 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 49999 |\n",
+ "| policy_loss | -0.0262 |\n",
+ "| std | 0.186 |\n",
+ "| value_loss | 0.00066 |\n",
+ "--------------------------------------\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "execution_count": 27,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "model.learn(1_000_000)"
+ ]
},
{
"cell_type": "code",
+ "execution_count": 28,
+ "metadata": {
+ "id": "MfYtjj19cKFr"
+ },
+ "outputs": [],
"source": [
"# Save the model and VecNormalize statistics when saving the agent\n",
"model.save(\"a2c-PandaReachDense-v3\")\n",
"env.save(\"vec_normalize.pkl\")"
- ],
- "metadata": {
- "id": "MfYtjj19cKFr"
- },
- "execution_count": null,
- "outputs": []
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "01M9GCd32Ig-"
+ },
"source": [
"### Evaluate the agent 📈\n",
"- Now that's our agent is trained, we need to **check its performance**.\n",
"- Stable-Baselines3 provides a method to do that: `evaluate_policy`"
- ],
- "metadata": {
- "id": "01M9GCd32Ig-"
- }
+ ]
},
{
"cell_type": "code",
+ "execution_count": 29,
+ "metadata": {
+ "id": "liirTVoDkHq3"
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "argv[0]=--background_color_red=0.8745098114013672\n",
+ "argv[1]=--background_color_green=0.21176470816135406\n",
+ "argv[2]=--background_color_blue=0.1764705926179886\n",
+ "Mean reward = -0.26 +/- 0.09\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/home/mique/Desktop/Code/deep-rl-class/notebooks/unit6/venv-u6/lib/python3.10/site-packages/stable_baselines3/common/evaluation.py:67: UserWarning: Evaluation environment is not wrapped with a ``Monitor`` wrapper. This may result in reporting modified episode lengths and rewards, if other wrappers happen to modify these. Consider wrapping environment first with ``Monitor`` wrapper.\n",
+ " warnings.warn(\n"
+ ]
+ }
+ ],
"source": [
"from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize\n",
"\n",
@@ -570,27 +9670,25 @@
"mean_reward, std_reward = evaluate_policy(model, eval_env)\n",
"\n",
"print(f\"Mean reward = {mean_reward:.2f} +/- {std_reward:.2f}\")"
- ],
- "metadata": {
- "id": "liirTVoDkHq3"
- },
- "execution_count": null,
- "outputs": []
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "44L9LVQaavR8"
+ },
"source": [
"### Publish your trained model on the Hub 🔥\n",
"Now that we saw we got good results after the training, we can publish our trained model on the Hub with one line of code.\n",
"\n",
"📚 The libraries documentation 👉 https://github.com/huggingface/huggingface_sb3/tree/main#hugging-face--x-stable-baselines3-v20\n"
- ],
- "metadata": {
- "id": "44L9LVQaavR8"
- }
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "MkMk99m8bgaQ"
+ },
"source": [
"By using `package_to_hub`, as we already mentioned in the previous units, **you evaluate, record a replay, generate a model card of your agent, and push it to the hub**.\n",
"\n",
@@ -599,10 +9697,7 @@
"- You can **visualize your agent playing** 👀\n",
"- You can **share with the community an agent that others can use** 💾\n",
"- You can **access a leaderboard 🏆 to see how well your agent is performing compared to your classmates** 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard\n"
- ],
- "metadata": {
- "id": "MkMk99m8bgaQ"
- }
+ ]
},
{
"cell_type": "markdown",
@@ -655,15 +9750,350 @@
},
{
"cell_type": "markdown",
- "source": [
- "For this environment, **running this cell can take approximately 10min**"
- ],
"metadata": {
"id": "juxItTNf1W74"
- }
+ },
+ "source": [
+ "For this environment, **running this cell can take approximately 10min**"
+ ]
},
{
"cell_type": "code",
+ "execution_count": 31,
+ "metadata": {
+ "id": "V1N8r8QVwcCE"
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\u001b[38;5;4mℹ This function will save, evaluate, generate a video of your agent,\n",
+ "create a model card and push everything to the hub. It might take up to 1min.\n",
+ "This is a work in progress: if you encounter a bug, please open an issue.\u001b[0m\n",
+ "Saving video to /tmp/tmpdcvmxwip/-step-0-to-step-1000.mp4\n",
+ "MoviePy - Building video /tmp/tmpdcvmxwip/-step-0-to-step-1000.mp4.\n",
+ "MoviePy - Writing video /tmp/tmpdcvmxwip/-step-0-to-step-1000.mp4\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ " \r"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "MoviePy - Done !\n",
+ "MoviePy - video ready /tmp/tmpdcvmxwip/-step-0-to-step-1000.mp4\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "ffmpeg version 6.1.1-3ubuntu5 Copyright (c) 2000-2023 the FFmpeg developers\n",
+ " built with gcc 13 (Ubuntu 13.2.0-23ubuntu3)\n",
+ " configuration: --prefix=/usr --extra-version=3ubuntu5 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --disable-omx --enable-gnutls --enable-libaom --enable-libass --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libglslang --enable-libgme --enable-libgsm --enable-libharfbuzz --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-openal --enable-opencl --enable-opengl --disable-sndio --enable-libvpl --disable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-ladspa --enable-libbluray --enable-libjack --enable-libpulse --enable-librabbitmq --enable-librist --enable-libsrt --enable-libssh --enable-libsvtav1 --enable-libx264 --enable-libzmq --enable-libzvbi --enable-lv2 --enable-sdl2 --enable-libplacebo --enable-librav1e --enable-pocketsphinx --enable-librsvg --enable-libjxl --enable-shared\n",
+ " libavutil 58. 29.100 / 58. 29.100\n",
+ " libavcodec 60. 31.102 / 60. 31.102\n",
+ " libavformat 60. 16.100 / 60. 16.100\n",
+ " libavdevice 60. 3.100 / 60. 3.100\n",
+ " libavfilter 9. 12.100 / 9. 12.100\n",
+ " libswscale 7. 5.100 / 7. 5.100\n",
+ " libswresample 4. 12.100 / 4. 12.100\n",
+ " libpostproc 57. 3.100 / 57. 3.100\n",
+ "Input #0, mov,mp4,m4a,3gp,3g2,mj2, from '/tmp/tmpdcvmxwip/-step-0-to-step-1000.mp4':\n",
+ " Metadata:\n",
+ " major_brand : isom\n",
+ " minor_version : 512\n",
+ " compatible_brands: isomiso2avc1mp41\n",
+ " encoder : Lavf61.1.100\n",
+ " Duration: 00:00:40.00, start: 0.000000, bitrate: 118 kb/s\n",
+ " Stream #0:0[0x1](und): Video: h264 (High) (avc1 / 0x31637661), yuv420p(progressive), 720x480, 116 kb/s, 25 fps, 25 tbr, 12800 tbn (default)\n",
+ " Metadata:\n",
+ " handler_name : VideoHandler\n",
+ " vendor_id : [0][0][0][0]\n",
+ " encoder : Lavc61.3.100 libx264\n",
+ "Stream mapping:\n",
+ " Stream #0:0 -> #0:0 (h264 (native) -> h264 (libx264))\n",
+ "Press [q] to stop, [?] for help\n",
+ "[libx264 @ 0x55777995a9c0] using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2\n",
+ "[libx264 @ 0x55777995a9c0] profile High, level 3.0, 4:2:0, 8-bit\n",
+ "[libx264 @ 0x55777995a9c0] 264 - core 164 r3108 31e19f9 - H.264/MPEG-4 AVC codec - Copyleft 2003-2023 - http://www.videolan.org/x264.html - options: cabac=1 ref=3 deblock=1:0:0 analyse=0x3:0x113 me=hex subme=7 psy=1 psy_rd=1.00:0.00 mixed_ref=1 me_range=16 chroma_me=1 trellis=1 8x8dct=1 cqm=0 deadzone=21,11 fast_pskip=1 chroma_qp_offset=-2 threads=15 lookahead_threads=2 sliced_threads=0 nr=0 decimate=1 interlaced=0 bluray_compat=0 constrained_intra=0 bframes=3 b_pyramid=2 b_adapt=1 b_bias=0 direct=1 weightb=1 open_gop=0 weightp=2 keyint=250 keyint_min=25 scenecut=40 intra_refresh=0 rc_lookahead=40 rc=crf mbtree=1 crf=23.0 qcomp=0.60 qpmin=0 qpmax=69 qpstep=4 ip_ratio=1.40 aq=1:1.00\n",
+ "Output #0, mp4, to '/tmp/tmpo6y5pqyw/replay.mp4':\n",
+ " Metadata:\n",
+ " major_brand : isom\n",
+ " minor_version : 512\n",
+ " compatible_brands: isomiso2avc1mp41\n",
+ " encoder : Lavf60.16.100\n",
+ " Stream #0:0(und): Video: h264 (avc1 / 0x31637661), yuv420p(progressive), 720x480, q=2-31, 25 fps, 12800 tbn (default)\n",
+ " Metadata:\n",
+ " handler_name : VideoHandler\n",
+ " vendor_id : [0][0][0][0]\n",
+ " encoder : Lavc60.31.102 libx264\n",
+ " Side data:\n",
+ " cpb: bitrate max/min/avg: 0/0/0 buffer size: 0 vbv_delay: N/A\n",
+ "[out#0/mp4 @ 0x5577798d6080] video:551kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 2.198598%\n",
+ "frame= 1000 fps=870 q=-1.0 Lsize= 564kB time=00:00:39.88 bitrate= 115.8kbits/s speed=34.7x \n",
+ "[libx264 @ 0x55777995a9c0] frame I:4 Avg QP:14.60 size: 7429\n",
+ "[libx264 @ 0x55777995a9c0] frame P:297 Avg QP:23.56 size: 727\n",
+ "[libx264 @ 0x55777995a9c0] frame B:699 Avg QP:23.15 size: 455\n",
+ "[libx264 @ 0x55777995a9c0] consecutive B-frames: 1.9% 9.2% 16.5% 72.4%\n",
+ "[libx264 @ 0x55777995a9c0] mb I I16..4: 24.4% 58.9% 16.6%\n",
+ "[libx264 @ 0x55777995a9c0] mb P I16..4: 0.1% 0.5% 0.3% P16..4: 2.5% 1.1% 0.7% 0.0% 0.0% skip:94.7%\n",
+ "[libx264 @ 0x55777995a9c0] mb B I16..4: 0.1% 0.1% 0.2% B16..8: 3.2% 1.1% 0.4% direct: 0.1% skip:94.9% L0:55.3% L1:43.6% BI: 1.1%\n",
+ "[libx264 @ 0x55777995a9c0] 8x8 transform intra:47.8% inter:8.5%\n",
+ "[libx264 @ 0x55777995a9c0] coded y,uvDC,uvAC intra: 18.0% 5.8% 4.5% inter: 0.7% 0.0% 0.0%\n",
+ "[libx264 @ 0x55777995a9c0] i16 v,h,dc,p: 66% 15% 18% 0%\n",
+ "[libx264 @ 0x55777995a9c0] i8 v,h,dc,ddl,ddr,vr,hd,vl,hu: 38% 11% 49% 0% 0% 0% 0% 0% 0%\n",
+ "[libx264 @ 0x55777995a9c0] i4 v,h,dc,ddl,ddr,vr,hd,vl,hu: 23% 17% 34% 3% 5% 5% 6% 3% 5%\n",
+ "[libx264 @ 0x55777995a9c0] i8c dc,h,v,p: 94% 3% 3% 0%\n",
+ "[libx264 @ 0x55777995a9c0] Weighted P-Frames: Y:0.0% UV:0.0%\n",
+ "[libx264 @ 0x55777995a9c0] ref P L0: 44.0% 2.7% 36.4% 16.9%\n",
+ "[libx264 @ 0x55777995a9c0] ref B L0: 67.2% 24.2% 8.7%\n",
+ "[libx264 @ 0x55777995a9c0] ref B L1: 94.0% 6.0%\n",
+ "[libx264 @ 0x55777995a9c0] kb/s:112.80\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\u001b[38;5;4mℹ Pushing repo turbo-maikol/a2c-PandaReachDense-v3 to the Hugging Face\n",
+ "Hub\u001b[0m\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Processing Files (0 / 0) : | | 0.00B / 0.00B \n",
+ "Processing Files (6 / 6) : 100%|██████████| 789kB / 789kB, 394kB/s \n",
+ "New Data Upload : 100%|██████████| 788kB / 788kB, 394kB/s \n",
+ " ...ReachDense-v3/pytorch_variables.pth: 100%|██████████| 1.26kB / 1.26kB \n",
+ " ...aReachDense-v3/policy.optimizer.pth: 100%|██████████| 48.9kB / 48.9kB \n",
+ " ...w/a2c-PandaReachDense-v3/policy.pth: 100%|██████████| 46.8kB / 46.8kB \n",
+ " ...o6y5pqyw/a2c-PandaReachDense-v3.zip: 100%|██████████| 113kB / 113kB \n",
+ " /tmp/tmpo6y5pqyw/replay.mp4 : 100%|██████████| 577kB / 577kB \n",
+ " /tmp/tmpo6y5pqyw/vec_normalize.pkl : 100%|██████████| 2.61kB / 2.61kB \n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\u001b[38;5;4mℹ Your model is pushed to the Hub. You can view your model here:\n",
+ "https://huggingface.co/turbo-maikol/a2c-PandaReachDense-v3/tree/main/\u001b[0m\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "CommitInfo(commit_url='https://huggingface.co/turbo-maikol/a2c-PandaReachDense-v3/commit/ca3b9e054bb58644bb45ae278b3f9887e1f7081d', commit_message='Initial commit', commit_description='', oid='ca3b9e054bb58644bb45ae278b3f9887e1f7081d', pr_url=None, repo_url=RepoUrl('https://huggingface.co/turbo-maikol/a2c-PandaReachDense-v3', endpoint='https://huggingface.co', repo_type='model', repo_id='turbo-maikol/a2c-PandaReachDense-v3'), pr_revision=None, pr_num=None)"
+ ]
+ },
+ "execution_count": 31,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
"source": [
"from huggingface_sb3 import package_to_hub\n",
"\n",
@@ -673,18 +10103,16 @@
" model_architecture=\"A2C\",\n",
" env_id=env_id,\n",
" eval_env=eval_env,\n",
- " repo_id=f\"ThomasSimonini/a2c-{env_id}\", # Change the username\n",
+ " repo_id=f\"turbo-maikol/a2c-{env_id}\", # Change the username\n",
" commit_message=\"Initial commit\",\n",
")"
- ],
- "metadata": {
- "id": "V1N8r8QVwcCE"
- },
- "execution_count": null,
- "outputs": []
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "G3xy3Nf3c2O1"
+ },
"source": [
"## Some additional challenges 🏆\n",
"The best way to learn **is to try things on your own**! Why not try `PandaPickAndPlace-v3`?\n",
@@ -705,22 +10133,9436 @@
"6. Save the model and VecNormalize statistics when saving the agent\n",
"7. Evaluate your agent\n",
"8. Publish your trained model on the Hub 🔥 with `package_to_hub`\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 32,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "argv[0]=--background_color_red=0.8745098114013672\n",
+ "argv[1]=--background_color_green=0.21176470816135406\n",
+ "argv[2]=--background_color_blue=0.1764705926179886\n",
+ "argv[0]=--background_color_red=0.8745098114013672\n",
+ "argv[1]=--background_color_green=0.21176470816135406\n",
+ "argv[2]=--background_color_blue=0.1764705926179886\n",
+ "argv[0]=--background_color_red=0.8745098114013672\n",
+ "argv[1]=--background_color_green=0.21176470816135406\n",
+ "argv[2]=--background_color_blue=0.1764705926179886\n",
+ "Using cuda device\n",
+ "argv[0]=--background_color_red=0.8745098114013672\n",
+ "argv[1]=--background_color_green=0.21176470816135406\n",
+ "argv[2]=--background_color_blue=0.1764705926179886\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.5 |\n",
+ "| ep_rew_mean | -47.5 |\n",
+ "| time/ | |\n",
+ "| fps | 247 |\n",
+ "| iterations | 100 |\n",
+ "| time_elapsed | 8 |\n",
+ "| total_timesteps | 2000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.68 |\n",
+ "| explained_variance | 0.92173636 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 99 |\n",
+ "| policy_loss | -0.453 |\n",
+ "| std | 1 |\n",
+ "| value_loss | 0.0769 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.8 |\n",
+ "| ep_rew_mean | -48.8 |\n",
+ "| time/ | |\n",
+ "| fps | 269 |\n",
+ "| iterations | 200 |\n",
+ "| time_elapsed | 14 |\n",
+ "| total_timesteps | 4000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.67 |\n",
+ "| explained_variance | 0.9866529 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 199 |\n",
+ "| policy_loss | -1.13 |\n",
+ "| std | 0.999 |\n",
+ "| value_loss | 0.0935 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 50 |\n",
+ "| ep_rew_mean | -50 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 300 |\n",
+ "| time_elapsed | 21 |\n",
+ "| total_timesteps | 6000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.66 |\n",
+ "| explained_variance | 0.91406125 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 299 |\n",
+ "| policy_loss | -1.29 |\n",
+ "| std | 0.997 |\n",
+ "| value_loss | 0.106 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 50 |\n",
+ "| ep_rew_mean | -50 |\n",
+ "| time/ | |\n",
+ "| fps | 254 |\n",
+ "| iterations | 400 |\n",
+ "| time_elapsed | 31 |\n",
+ "| total_timesteps | 8000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.68 |\n",
+ "| explained_variance | 0.97533536 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 399 |\n",
+ "| policy_loss | 0.149 |\n",
+ "| std | 1 |\n",
+ "| value_loss | 0.0134 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 50 |\n",
+ "| ep_rew_mean | -50 |\n",
+ "| time/ | |\n",
+ "| fps | 264 |\n",
+ "| iterations | 500 |\n",
+ "| time_elapsed | 37 |\n",
+ "| total_timesteps | 10000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.71 |\n",
+ "| explained_variance | 0.97877157 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 499 |\n",
+ "| policy_loss | 0.671 |\n",
+ "| std | 1.01 |\n",
+ "| value_loss | 0.0334 |\n",
+ "--------------------------------------\n",
+ "------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 50 |\n",
+ "| ep_rew_mean | -50 |\n",
+ "| time/ | |\n",
+ "| fps | 270 |\n",
+ "| iterations | 600 |\n",
+ "| time_elapsed | 44 |\n",
+ "| total_timesteps | 12000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.72 |\n",
+ "| explained_variance | 0.941841 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 599 |\n",
+ "| policy_loss | -0.656 |\n",
+ "| std | 1.01 |\n",
+ "| value_loss | 0.0444 |\n",
+ "------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 50 |\n",
+ "| ep_rew_mean | -50 |\n",
+ "| time/ | |\n",
+ "| fps | 277 |\n",
+ "| iterations | 700 |\n",
+ "| time_elapsed | 50 |\n",
+ "| total_timesteps | 14000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.72 |\n",
+ "| explained_variance | 0.7981684 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 699 |\n",
+ "| policy_loss | -0.123 |\n",
+ "| std | 1.01 |\n",
+ "| value_loss | 0.0345 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 50 |\n",
+ "| ep_rew_mean | -50 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 800 |\n",
+ "| time_elapsed | 57 |\n",
+ "| total_timesteps | 16000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.71 |\n",
+ "| explained_variance | 0.7265997 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 799 |\n",
+ "| policy_loss | 0.379 |\n",
+ "| std | 1.01 |\n",
+ "| value_loss | 0.0249 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49.5 |\n",
+ "| ep_rew_mean | -49.5 |\n",
+ "| time/ | |\n",
+ "| fps | 267 |\n",
+ "| iterations | 900 |\n",
+ "| time_elapsed | 67 |\n",
+ "| total_timesteps | 18000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.69 |\n",
+ "| explained_variance | 0.89900863 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 899 |\n",
+ "| policy_loss | -0.36 |\n",
+ "| std | 1 |\n",
+ "| value_loss | 0.0117 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -49 |\n",
+ "| time/ | |\n",
+ "| fps | 269 |\n",
+ "| iterations | 1000 |\n",
+ "| time_elapsed | 74 |\n",
+ "| total_timesteps | 20000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.7 |\n",
+ "| explained_variance | 0.9879093 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 999 |\n",
+ "| policy_loss | -0.238 |\n",
+ "| std | 1.01 |\n",
+ "| value_loss | 0.0122 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.5 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 271 |\n",
+ "| iterations | 1100 |\n",
+ "| time_elapsed | 81 |\n",
+ "| total_timesteps | 22000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.67 |\n",
+ "| explained_variance | 0.96510875 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 1099 |\n",
+ "| policy_loss | -0.0184 |\n",
+ "| std | 0.998 |\n",
+ "| value_loss | 0.0225 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.5 |\n",
+ "| ep_rew_mean | -47.5 |\n",
+ "| time/ | |\n",
+ "| fps | 272 |\n",
+ "| iterations | 1200 |\n",
+ "| time_elapsed | 87 |\n",
+ "| total_timesteps | 24000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.68 |\n",
+ "| explained_variance | 0.98142165 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 1199 |\n",
+ "| policy_loss | -0.408 |\n",
+ "| std | 1 |\n",
+ "| value_loss | 0.0161 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -49 |\n",
+ "| time/ | |\n",
+ "| fps | 275 |\n",
+ "| iterations | 1300 |\n",
+ "| time_elapsed | 94 |\n",
+ "| total_timesteps | 26000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.7 |\n",
+ "| explained_variance | 0.8481641 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 1299 |\n",
+ "| policy_loss | -0.445 |\n",
+ "| std | 1.01 |\n",
+ "| value_loss | 0.00926 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.5 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 266 |\n",
+ "| iterations | 1400 |\n",
+ "| time_elapsed | 105 |\n",
+ "| total_timesteps | 28000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.7 |\n",
+ "| explained_variance | 0.24699801 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 1399 |\n",
+ "| policy_loss | -0.0865 |\n",
+ "| std | 1.01 |\n",
+ "| value_loss | 0.00277 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48 |\n",
+ "| ep_rew_mean | -48 |\n",
+ "| time/ | |\n",
+ "| fps | 266 |\n",
+ "| iterations | 1500 |\n",
+ "| time_elapsed | 112 |\n",
+ "| total_timesteps | 30000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.69 |\n",
+ "| explained_variance | 0.98543787 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 1499 |\n",
+ "| policy_loss | -0.122 |\n",
+ "| std | 1 |\n",
+ "| value_loss | 0.00198 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.5 |\n",
+ "| ep_rew_mean | -47.5 |\n",
+ "| time/ | |\n",
+ "| fps | 268 |\n",
+ "| iterations | 1600 |\n",
+ "| time_elapsed | 119 |\n",
+ "| total_timesteps | 32000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.7 |\n",
+ "| explained_variance | 0.97692937 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 1599 |\n",
+ "| policy_loss | 0.102 |\n",
+ "| std | 1 |\n",
+ "| value_loss | 0.00283 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.5 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 269 |\n",
+ "| iterations | 1700 |\n",
+ "| time_elapsed | 126 |\n",
+ "| total_timesteps | 34000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.68 |\n",
+ "| explained_variance | 0.9177654 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 1699 |\n",
+ "| policy_loss | -0.247 |\n",
+ "| std | 1 |\n",
+ "| value_loss | 0.00606 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -49 |\n",
+ "| time/ | |\n",
+ "| fps | 264 |\n",
+ "| iterations | 1800 |\n",
+ "| time_elapsed | 135 |\n",
+ "| total_timesteps | 36000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.68 |\n",
+ "| explained_variance | 0.88942945 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 1799 |\n",
+ "| policy_loss | -0.0544 |\n",
+ "| std | 1 |\n",
+ "| value_loss | 0.00655 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48 |\n",
+ "| ep_rew_mean | -48 |\n",
+ "| time/ | |\n",
+ "| fps | 268 |\n",
+ "| iterations | 1900 |\n",
+ "| time_elapsed | 141 |\n",
+ "| total_timesteps | 38000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.68 |\n",
+ "| explained_variance | 0.9895952 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 1899 |\n",
+ "| policy_loss | 0.179 |\n",
+ "| std | 1 |\n",
+ "| value_loss | 0.00177 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.5 |\n",
+ "| ep_rew_mean | -47.5 |\n",
+ "| time/ | |\n",
+ "| fps | 271 |\n",
+ "| iterations | 2000 |\n",
+ "| time_elapsed | 147 |\n",
+ "| total_timesteps | 40000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.69 |\n",
+ "| explained_variance | 0.7657582 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 1999 |\n",
+ "| policy_loss | 0.267 |\n",
+ "| std | 1 |\n",
+ "| value_loss | 0.00616 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48 |\n",
+ "| ep_rew_mean | -48 |\n",
+ "| time/ | |\n",
+ "| fps | 273 |\n",
+ "| iterations | 2100 |\n",
+ "| time_elapsed | 153 |\n",
+ "| total_timesteps | 42000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.67 |\n",
+ "| explained_variance | 0.9649579 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 2099 |\n",
+ "| policy_loss | -0.0232 |\n",
+ "| std | 0.998 |\n",
+ "| value_loss | 0.000706 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.5 |\n",
+ "| ep_rew_mean | -47.5 |\n",
+ "| time/ | |\n",
+ "| fps | 275 |\n",
+ "| iterations | 2200 |\n",
+ "| time_elapsed | 159 |\n",
+ "| total_timesteps | 44000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.67 |\n",
+ "| explained_variance | 0.9855432 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 2199 |\n",
+ "| policy_loss | -0.388 |\n",
+ "| std | 0.999 |\n",
+ "| value_loss | 0.0113 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48 |\n",
+ "| ep_rew_mean | -48 |\n",
+ "| time/ | |\n",
+ "| fps | 276 |\n",
+ "| iterations | 2300 |\n",
+ "| time_elapsed | 166 |\n",
+ "| total_timesteps | 46000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.69 |\n",
+ "| explained_variance | 0.7222178 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 2299 |\n",
+ "| policy_loss | -0.0183 |\n",
+ "| std | 1 |\n",
+ "| value_loss | 0.00072 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.5 |\n",
+ "| ep_rew_mean | -47.5 |\n",
+ "| time/ | |\n",
+ "| fps | 272 |\n",
+ "| iterations | 2400 |\n",
+ "| time_elapsed | 175 |\n",
+ "| total_timesteps | 48000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.68 |\n",
+ "| explained_variance | 0.98888546 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 2399 |\n",
+ "| policy_loss | -0.238 |\n",
+ "| std | 1 |\n",
+ "| value_loss | 0.00958 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.1 |\n",
+ "| ep_rew_mean | -47 |\n",
+ "| time/ | |\n",
+ "| fps | 274 |\n",
+ "| iterations | 2500 |\n",
+ "| time_elapsed | 182 |\n",
+ "| total_timesteps | 50000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.69 |\n",
+ "| explained_variance | 0.96954125 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 2499 |\n",
+ "| policy_loss | -0.0431 |\n",
+ "| std | 1.01 |\n",
+ "| value_loss | 0.000864 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.5 |\n",
+ "| ep_rew_mean | -47.5 |\n",
+ "| time/ | |\n",
+ "| fps | 275 |\n",
+ "| iterations | 2600 |\n",
+ "| time_elapsed | 188 |\n",
+ "| total_timesteps | 52000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.69 |\n",
+ "| explained_variance | 0.96610194 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 2599 |\n",
+ "| policy_loss | -0.105 |\n",
+ "| std | 1 |\n",
+ "| value_loss | 0.00381 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 46.6 |\n",
+ "| ep_rew_mean | -46.5 |\n",
+ "| time/ | |\n",
+ "| fps | 276 |\n",
+ "| iterations | 2700 |\n",
+ "| time_elapsed | 195 |\n",
+ "| total_timesteps | 54000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.7 |\n",
+ "| explained_variance | 0.9916272 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 2699 |\n",
+ "| policy_loss | 0.0748 |\n",
+ "| std | 1.01 |\n",
+ "| value_loss | 0.00139 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.5 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 278 |\n",
+ "| iterations | 2800 |\n",
+ "| time_elapsed | 201 |\n",
+ "| total_timesteps | 56000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.7 |\n",
+ "| explained_variance | 0.96441084 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 2799 |\n",
+ "| policy_loss | 0.1 |\n",
+ "| std | 1.01 |\n",
+ "| value_loss | 0.00154 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.5 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 274 |\n",
+ "| iterations | 2900 |\n",
+ "| time_elapsed | 211 |\n",
+ "| total_timesteps | 58000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.68 |\n",
+ "| explained_variance | 0.9759128 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 2899 |\n",
+ "| policy_loss | -0.0517 |\n",
+ "| std | 1 |\n",
+ "| value_loss | 0.00165 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -49 |\n",
+ "| time/ | |\n",
+ "| fps | 275 |\n",
+ "| iterations | 3000 |\n",
+ "| time_elapsed | 217 |\n",
+ "| total_timesteps | 60000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.66 |\n",
+ "| explained_variance | 0.9729539 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 2999 |\n",
+ "| policy_loss | 0.032 |\n",
+ "| std | 0.997 |\n",
+ "| value_loss | 0.000864 |\n",
+ "-------------------------------------\n",
+    "[... ~57 similar A2C logging tables elided: iterations 3100-9200, total_timesteps 62,000-184,000; ep_rew_mean stays near -48, ep_len_mean near 48, fps 272-278, learning_rate 0.0007 throughout ...]\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48 |\n",
+ "| ep_rew_mean | -48 |\n",
+ "| time/ | |\n",
+ "| fps | 272 |\n",
+ "| iterations | 9300 |\n",
+ "| time_elapsed | 681 |\n",
+ "| total_timesteps | 186000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.44 |\n",
+ "| explained_variance | 0.6690495 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 9299 |\n",
+ "| policy_loss | -0.00437 |\n",
+ "| std | 0.944 |\n",
+ "| value_loss | 4.59e-05 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.5 |\n",
+ "| ep_rew_mean | -47.5 |\n",
+ "| time/ | |\n",
+ "| fps | 273 |\n",
+ "| iterations | 9400 |\n",
+ "| time_elapsed | 688 |\n",
+ "| total_timesteps | 188000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.43 |\n",
+ "| explained_variance | 0.7185387 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 9399 |\n",
+ "| policy_loss | 0.00195 |\n",
+ "| std | 0.941 |\n",
+ "| value_loss | 6.26e-06 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.5 |\n",
+ "| ep_rew_mean | -47.5 |\n",
+ "| time/ | |\n",
+ "| fps | 273 |\n",
+ "| iterations | 9500 |\n",
+ "| time_elapsed | 694 |\n",
+ "| total_timesteps | 190000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.43 |\n",
+ "| explained_variance | -0.9873673 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 9499 |\n",
+ "| policy_loss | -0.0856 |\n",
+ "| std | 0.94 |\n",
+ "| value_loss | 0.000481 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.2 |\n",
+ "| ep_rew_mean | -48.2 |\n",
+ "| time/ | |\n",
+ "| fps | 273 |\n",
+ "| iterations | 9600 |\n",
+ "| time_elapsed | 701 |\n",
+ "| total_timesteps | 192000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.4 |\n",
+ "| explained_variance | 0.78831863 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 9599 |\n",
+ "| policy_loss | -0.0148 |\n",
+ "| std | 0.934 |\n",
+ "| value_loss | 3.55e-05 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.2 |\n",
+ "| ep_rew_mean | -47.2 |\n",
+ "| time/ | |\n",
+ "| fps | 272 |\n",
+ "| iterations | 9700 |\n",
+ "| time_elapsed | 711 |\n",
+ "| total_timesteps | 194000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.4 |\n",
+ "| explained_variance | 0.95177764 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 9699 |\n",
+ "| policy_loss | 0.0071 |\n",
+ "| std | 0.935 |\n",
+ "| value_loss | 2.33e-05 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.1 |\n",
+ "| ep_rew_mean | -47 |\n",
+ "| time/ | |\n",
+ "| fps | 273 |\n",
+ "| iterations | 9800 |\n",
+ "| time_elapsed | 717 |\n",
+ "| total_timesteps | 196000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.37 |\n",
+ "| explained_variance | 0.9991041 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 9799 |\n",
+ "| policy_loss | -0.0138 |\n",
+ "| std | 0.928 |\n",
+ "| value_loss | 1.32e-05 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.5 |\n",
+ "| ep_rew_mean | -47.5 |\n",
+ "| time/ | |\n",
+ "| fps | 273 |\n",
+ "| iterations | 9900 |\n",
+ "| time_elapsed | 724 |\n",
+ "| total_timesteps | 198000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.39 |\n",
+ "| explained_variance | 0.98675823 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 9899 |\n",
+ "| policy_loss | -0.00244 |\n",
+ "| std | 0.933 |\n",
+ "| value_loss | 2.88e-06 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.5 |\n",
+ "| ep_rew_mean | -47.5 |\n",
+ "| time/ | |\n",
+ "| fps | 273 |\n",
+ "| iterations | 10000 |\n",
+ "| time_elapsed | 731 |\n",
+ "| total_timesteps | 200000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.39 |\n",
+ "| explained_variance | 0.8423076 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 9999 |\n",
+ "| policy_loss | 0.0111 |\n",
+ "| std | 0.933 |\n",
+ "| value_loss | 0.000101 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48 |\n",
+ "| ep_rew_mean | -48 |\n",
+ "| time/ | |\n",
+ "| fps | 273 |\n",
+ "| iterations | 10100 |\n",
+ "| time_elapsed | 737 |\n",
+ "| total_timesteps | 202000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.38 |\n",
+ "| explained_variance | 0.9848204 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 10099 |\n",
+ "| policy_loss | -0.0111 |\n",
+ "| std | 0.929 |\n",
+ "| value_loss | 8.33e-06 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.5 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 272 |\n",
+ "| iterations | 10200 |\n",
+ "| time_elapsed | 748 |\n",
+ "| total_timesteps | 204000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.35 |\n",
+ "| explained_variance | 0.9231719 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 10199 |\n",
+ "| policy_loss | -0.00118 |\n",
+ "| std | 0.923 |\n",
+ "| value_loss | 1.12e-05 |\n",
+ "-------------------------------------\n",
+ "---------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.1 |\n",
+ "| ep_rew_mean | -48 |\n",
+ "| time/ | |\n",
+ "| fps | 273 |\n",
+ "| iterations | 10300 |\n",
+ "| time_elapsed | 754 |\n",
+ "| total_timesteps | 206000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.35 |\n",
+ "| explained_variance | 0.056429803 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 10299 |\n",
+ "| policy_loss | 0.0131 |\n",
+ "| std | 0.923 |\n",
+ "| value_loss | 0.000336 |\n",
+ "---------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.6 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 273 |\n",
+ "| iterations | 10400 |\n",
+ "| time_elapsed | 761 |\n",
+ "| total_timesteps | 208000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.35 |\n",
+ "| explained_variance | 0.9514538 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 10399 |\n",
+ "| policy_loss | -0.00282 |\n",
+ "| std | 0.923 |\n",
+ "| value_loss | 4.15e-06 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.6 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 273 |\n",
+ "| iterations | 10500 |\n",
+ "| time_elapsed | 768 |\n",
+ "| total_timesteps | 210000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.34 |\n",
+ "| explained_variance | 0.94718796 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 10499 |\n",
+ "| policy_loss | -0.00272 |\n",
+ "| std | 0.921 |\n",
+ "| value_loss | 5.42e-06 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.1 |\n",
+ "| ep_rew_mean | -47 |\n",
+ "| time/ | |\n",
+ "| fps | 273 |\n",
+ "| iterations | 10600 |\n",
+ "| time_elapsed | 774 |\n",
+ "| total_timesteps | 212000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.35 |\n",
+ "| explained_variance | 0.8666384 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 10599 |\n",
+ "| policy_loss | -0.0275 |\n",
+ "| std | 0.923 |\n",
+ "| value_loss | 5.03e-05 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.5 |\n",
+ "| ep_rew_mean | -47.5 |\n",
+ "| time/ | |\n",
+ "| fps | 272 |\n",
+ "| iterations | 10700 |\n",
+ "| time_elapsed | 784 |\n",
+ "| total_timesteps | 214000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.35 |\n",
+ "| explained_variance | -0.4472072 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 10699 |\n",
+ "| policy_loss | -0.00202 |\n",
+ "| std | 0.923 |\n",
+ "| value_loss | 3e-05 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.5 |\n",
+ "| ep_rew_mean | -47.5 |\n",
+ "| time/ | |\n",
+ "| fps | 273 |\n",
+ "| iterations | 10800 |\n",
+ "| time_elapsed | 790 |\n",
+ "| total_timesteps | 216000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.36 |\n",
+ "| explained_variance | 0.5958526 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 10799 |\n",
+ "| policy_loss | 0.0985 |\n",
+ "| std | 0.923 |\n",
+ "| value_loss | 0.00219 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.5 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 273 |\n",
+ "| iterations | 10900 |\n",
+ "| time_elapsed | 797 |\n",
+ "| total_timesteps | 218000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.36 |\n",
+ "| explained_variance | 0.9833134 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 10899 |\n",
+ "| policy_loss | 0.0262 |\n",
+ "| std | 0.924 |\n",
+ "| value_loss | 3.51e-05 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.4 |\n",
+ "| ep_rew_mean | -48.3 |\n",
+ "| time/ | |\n",
+ "| fps | 273 |\n",
+ "| iterations | 11000 |\n",
+ "| time_elapsed | 803 |\n",
+ "| total_timesteps | 220000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.37 |\n",
+ "| explained_variance | 0.80022764 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 10999 |\n",
+ "| policy_loss | 0.0101 |\n",
+ "| std | 0.927 |\n",
+ "| value_loss | 1.26e-05 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.4 |\n",
+ "| ep_rew_mean | -48.4 |\n",
+ "| time/ | |\n",
+ "| fps | 274 |\n",
+ "| iterations | 11100 |\n",
+ "| time_elapsed | 809 |\n",
+ "| total_timesteps | 222000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.39 |\n",
+ "| explained_variance | 0.86710656 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 11099 |\n",
+ "| policy_loss | -0.00484 |\n",
+ "| std | 0.932 |\n",
+ "| value_loss | 5.19e-06 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.6 |\n",
+ "| ep_rew_mean | -48.6 |\n",
+ "| time/ | |\n",
+ "| fps | 273 |\n",
+ "| iterations | 11200 |\n",
+ "| time_elapsed | 819 |\n",
+ "| total_timesteps | 224000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.38 |\n",
+ "| explained_variance | 0.9757535 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 11199 |\n",
+ "| policy_loss | 0.00393 |\n",
+ "| std | 0.928 |\n",
+ "| value_loss | 4.17e-06 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.9 |\n",
+ "| ep_rew_mean | -47.8 |\n",
+ "| time/ | |\n",
+ "| fps | 273 |\n",
+ "| iterations | 11300 |\n",
+ "| time_elapsed | 826 |\n",
+ "| total_timesteps | 226000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.37 |\n",
+ "| explained_variance | 0.9639573 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 11299 |\n",
+ "| policy_loss | -0.00267 |\n",
+ "| std | 0.928 |\n",
+ "| value_loss | 2.23e-06 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.7 |\n",
+ "| ep_rew_mean | -47.6 |\n",
+ "| time/ | |\n",
+ "| fps | 273 |\n",
+ "| iterations | 11400 |\n",
+ "| time_elapsed | 833 |\n",
+ "| total_timesteps | 228000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.38 |\n",
+ "| explained_variance | -2.9878726 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 11399 |\n",
+ "| policy_loss | -0.00966 |\n",
+ "| std | 0.929 |\n",
+ "| value_loss | 1.41e-05 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.7 |\n",
+ "| ep_rew_mean | -48.6 |\n",
+ "| time/ | |\n",
+ "| fps | 273 |\n",
+ "| iterations | 11500 |\n",
+ "| time_elapsed | 839 |\n",
+ "| total_timesteps | 230000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.37 |\n",
+ "| explained_variance | 0.61145973 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 11499 |\n",
+ "| policy_loss | 0.000605 |\n",
+ "| std | 0.927 |\n",
+ "| value_loss | 1.67e-06 |\n",
+ "--------------------------------------\n",
+ "---------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.2 |\n",
+ "| ep_rew_mean | -48.2 |\n",
+ "| time/ | |\n",
+ "| fps | 274 |\n",
+ "| iterations | 11600 |\n",
+ "| time_elapsed | 846 |\n",
+ "| total_timesteps | 232000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.34 |\n",
+ "| explained_variance | -0.48529482 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 11599 |\n",
+ "| policy_loss | -0.00852 |\n",
+ "| std | 0.92 |\n",
+ "| value_loss | 0.000263 |\n",
+ "---------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.1 |\n",
+ "| ep_rew_mean | -47 |\n",
+ "| time/ | |\n",
+ "| fps | 273 |\n",
+ "| iterations | 11700 |\n",
+ "| time_elapsed | 856 |\n",
+ "| total_timesteps | 234000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.32 |\n",
+ "| explained_variance | 0.5111707 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 11699 |\n",
+ "| policy_loss | 0.0447 |\n",
+ "| std | 0.916 |\n",
+ "| value_loss | 0.00013 |\n",
+ "-------------------------------------\n",
+ "---------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 46.1 |\n",
+ "| ep_rew_mean | -46.1 |\n",
+ "| time/ | |\n",
+ "| fps | 273 |\n",
+ "| iterations | 11800 |\n",
+ "| time_elapsed | 863 |\n",
+ "| total_timesteps | 236000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.33 |\n",
+ "| explained_variance | -0.15370154 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 11799 |\n",
+ "| policy_loss | -0.0161 |\n",
+ "| std | 0.918 |\n",
+ "| value_loss | 2.98e-05 |\n",
+ "---------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.1 |\n",
+ "| ep_rew_mean | -48.1 |\n",
+ "| time/ | |\n",
+ "| fps | 273 |\n",
+ "| iterations | 11900 |\n",
+ "| time_elapsed | 869 |\n",
+ "| total_timesteps | 238000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.31 |\n",
+ "| explained_variance | 0.80145425 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 11899 |\n",
+ "| policy_loss | -0.00846 |\n",
+ "| std | 0.914 |\n",
+ "| value_loss | 1.04e-05 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.1 |\n",
+ "| ep_rew_mean | -48.1 |\n",
+ "| time/ | |\n",
+ "| fps | 273 |\n",
+ "| iterations | 12000 |\n",
+ "| time_elapsed | 875 |\n",
+ "| total_timesteps | 240000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.31 |\n",
+ "| explained_variance | 0.7291146 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 11999 |\n",
+ "| policy_loss | 0.0131 |\n",
+ "| std | 0.914 |\n",
+ "| value_loss | 1.66e-05 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.1 |\n",
+ "| ep_rew_mean | -48.1 |\n",
+ "| time/ | |\n",
+ "| fps | 274 |\n",
+ "| iterations | 12100 |\n",
+ "| time_elapsed | 882 |\n",
+ "| total_timesteps | 242000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.31 |\n",
+ "| explained_variance | 0.9325699 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 12099 |\n",
+ "| policy_loss | -0.00958 |\n",
+ "| std | 0.913 |\n",
+ "| value_loss | 9.12e-06 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.6 |\n",
+ "| ep_rew_mean | -47.6 |\n",
+ "| time/ | |\n",
+ "| fps | 273 |\n",
+ "| iterations | 12200 |\n",
+ "| time_elapsed | 892 |\n",
+ "| total_timesteps | 244000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.29 |\n",
+ "| explained_variance | 0.9233826 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 12199 |\n",
+ "| policy_loss | 0.00747 |\n",
+ "| std | 0.908 |\n",
+ "| value_loss | 5.35e-06 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.1 |\n",
+ "| ep_rew_mean | -48.1 |\n",
+ "| time/ | |\n",
+ "| fps | 273 |\n",
+ "| iterations | 12300 |\n",
+ "| time_elapsed | 898 |\n",
+ "| total_timesteps | 246000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.28 |\n",
+ "| explained_variance | 0.89778614 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 12299 |\n",
+ "| policy_loss | -0.00221 |\n",
+ "| std | 0.906 |\n",
+ "| value_loss | 4.25e-06 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -49 |\n",
+ "| time/ | |\n",
+ "| fps | 274 |\n",
+ "| iterations | 12400 |\n",
+ "| time_elapsed | 905 |\n",
+ "| total_timesteps | 248000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.26 |\n",
+ "| explained_variance | 0.8887966 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 12399 |\n",
+ "| policy_loss | 0.00342 |\n",
+ "| std | 0.902 |\n",
+ "| value_loss | 9.93e-06 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 46.9 |\n",
+ "| ep_rew_mean | -46.9 |\n",
+ "| time/ | |\n",
+ "| fps | 274 |\n",
+ "| iterations | 12500 |\n",
+ "| time_elapsed | 911 |\n",
+ "| total_timesteps | 250000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.22 |\n",
+ "| explained_variance | 0.66576827 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 12499 |\n",
+ "| policy_loss | -0.0226 |\n",
+ "| std | 0.894 |\n",
+ "| value_loss | 3.2e-05 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 46.9 |\n",
+ "| ep_rew_mean | -46.9 |\n",
+ "| time/ | |\n",
+ "| fps | 274 |\n",
+ "| iterations | 12600 |\n",
+ "| time_elapsed | 917 |\n",
+ "| total_timesteps | 252000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.22 |\n",
+ "| explained_variance | 0.7417493 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 12599 |\n",
+ "| policy_loss | -0.00908 |\n",
+ "| std | 0.893 |\n",
+ "| value_loss | 0.000403 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 46.9 |\n",
+ "| ep_rew_mean | -46.9 |\n",
+ "| time/ | |\n",
+ "| fps | 273 |\n",
+ "| iterations | 12700 |\n",
+ "| time_elapsed | 927 |\n",
+ "| total_timesteps | 254000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.21 |\n",
+ "| explained_variance | -0.4401511 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 12699 |\n",
+ "| policy_loss | -0.00291 |\n",
+ "| std | 0.891 |\n",
+ "| value_loss | 0.000273 |\n",
+ "--------------------------------------\n",
+ "---------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.1 |\n",
+ "| ep_rew_mean | -47 |\n",
+ "| time/ | |\n",
+ "| fps | 274 |\n",
+ "| iterations | 12800 |\n",
+ "| time_elapsed | 933 |\n",
+ "| total_timesteps | 256000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.19 |\n",
+ "| explained_variance | 0.049697876 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 12799 |\n",
+ "| policy_loss | -0.0232 |\n",
+ "| std | 0.887 |\n",
+ "| value_loss | 0.00016 |\n",
+ "---------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48 |\n",
+ "| ep_rew_mean | -48 |\n",
+ "| time/ | |\n",
+ "| fps | 274 |\n",
+ "| iterations | 12900 |\n",
+ "| time_elapsed | 940 |\n",
+ "| total_timesteps | 258000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.19 |\n",
+ "| explained_variance | -1.4899552 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 12899 |\n",
+ "| policy_loss | 0.0311 |\n",
+ "| std | 0.887 |\n",
+ "| value_loss | 8.18e-05 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48 |\n",
+ "| ep_rew_mean | -48 |\n",
+ "| time/ | |\n",
+ "| fps | 274 |\n",
+ "| iterations | 13000 |\n",
+ "| time_elapsed | 947 |\n",
+ "| total_timesteps | 260000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.17 |\n",
+ "| explained_variance | 0.8485774 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 12999 |\n",
+ "| policy_loss | -0.0228 |\n",
+ "| std | 0.881 |\n",
+ "| value_loss | 0.000253 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.5 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 274 |\n",
+ "| iterations | 13100 |\n",
+ "| time_elapsed | 953 |\n",
+ "| total_timesteps | 262000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.17 |\n",
+ "| explained_variance | -19.859615 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 13099 |\n",
+ "| policy_loss | -0.113 |\n",
+ "| std | 0.882 |\n",
+ "| value_loss | 0.00415 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49.5 |\n",
+ "| ep_rew_mean | -49.5 |\n",
+ "| time/ | |\n",
+ "| fps | 274 |\n",
+ "| iterations | 13200 |\n",
+ "| time_elapsed | 963 |\n",
+ "| total_timesteps | 264000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.2 |\n",
+ "| explained_variance | 0.9869409 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 13199 |\n",
+ "| policy_loss | -0.0141 |\n",
+ "| std | 0.888 |\n",
+ "| value_loss | 1.01e-05 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 50 |\n",
+ "| ep_rew_mean | -50 |\n",
+ "| time/ | |\n",
+ "| fps | 274 |\n",
+ "| iterations | 13300 |\n",
+ "| time_elapsed | 969 |\n",
+ "| total_timesteps | 266000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.2 |\n",
+ "| explained_variance | 0.91975826 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 13299 |\n",
+ "| policy_loss | 0.00825 |\n",
+ "| std | 0.889 |\n",
+ "| value_loss | 1.69e-05 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.3 |\n",
+ "| ep_rew_mean | -48.3 |\n",
+ "| time/ | |\n",
+ "| fps | 274 |\n",
+ "| iterations | 13400 |\n",
+ "| time_elapsed | 976 |\n",
+ "| total_timesteps | 268000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.19 |\n",
+ "| explained_variance | 0.88386124 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 13399 |\n",
+ "| policy_loss | -0.0196 |\n",
+ "| std | 0.887 |\n",
+ "| value_loss | 2.02e-05 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.8 |\n",
+ "| ep_rew_mean | -47.8 |\n",
+ "| time/ | |\n",
+ "| fps | 274 |\n",
+ "| iterations | 13500 |\n",
+ "| time_elapsed | 982 |\n",
+ "| total_timesteps | 270000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.21 |\n",
+ "| explained_variance | 0.88700855 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 13499 |\n",
+ "| policy_loss | -0.0174 |\n",
+ "| std | 0.892 |\n",
+ "| value_loss | 1.85e-05 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.7 |\n",
+ "| ep_rew_mean | -47.6 |\n",
+ "| time/ | |\n",
+ "| fps | 274 |\n",
+ "| iterations | 13600 |\n",
+ "| time_elapsed | 989 |\n",
+ "| total_timesteps | 272000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.2 |\n",
+ "| explained_variance | 0.9246665 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 13599 |\n",
+ "| policy_loss | -0.0265 |\n",
+ "| std | 0.889 |\n",
+ "| value_loss | 3.59e-05 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.7 |\n",
+ "| ep_rew_mean | -48.6 |\n",
+ "| time/ | |\n",
+ "| fps | 274 |\n",
+ "| iterations | 13700 |\n",
+ "| time_elapsed | 999 |\n",
+ "| total_timesteps | 274000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.2 |\n",
+ "| explained_variance | 0.90511894 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 13699 |\n",
+ "| policy_loss | -0.0152 |\n",
+ "| std | 0.889 |\n",
+ "| value_loss | 1.5e-05 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.5 |\n",
+ "| ep_rew_mean | -47.5 |\n",
+ "| time/ | |\n",
+ "| fps | 274 |\n",
+ "| iterations | 13800 |\n",
+ "| time_elapsed | 1006 |\n",
+ "| total_timesteps | 276000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.21 |\n",
+ "| explained_variance | 0.96453655 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 13799 |\n",
+ "| policy_loss | 0.00467 |\n",
+ "| std | 0.89 |\n",
+ "| value_loss | 6.43e-06 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.5 |\n",
+ "| ep_rew_mean | -47.5 |\n",
+ "| time/ | |\n",
+ "| fps | 274 |\n",
+ "| iterations | 13900 |\n",
+ "| time_elapsed | 1012 |\n",
+ "| total_timesteps | 278000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.21 |\n",
+ "| explained_variance | 0.9376099 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 13899 |\n",
+ "| policy_loss | -0.00328 |\n",
+ "| std | 0.892 |\n",
+ "| value_loss | 1e-06 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.5 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 274 |\n",
+ "| iterations | 14000 |\n",
+ "| time_elapsed | 1019 |\n",
+ "| total_timesteps | 280000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.2 |\n",
+ "| explained_variance | 0.9059786 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 13999 |\n",
+ "| policy_loss | 0.00602 |\n",
+ "| std | 0.889 |\n",
+ "| value_loss | 7.84e-06 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.5 |\n",
+ "| ep_rew_mean | -47.5 |\n",
+ "| time/ | |\n",
+ "| fps | 274 |\n",
+ "| iterations | 14100 |\n",
+ "| time_elapsed | 1025 |\n",
+ "| total_timesteps | 282000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.17 |\n",
+ "| explained_variance | 0.72370136 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 14099 |\n",
+ "| policy_loss | -0.016 |\n",
+ "| std | 0.883 |\n",
+ "| value_loss | 2.04e-05 |\n",
+ "--------------------------------------\n",
+ "------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.5 |\n",
+ "| ep_rew_mean | -47.5 |\n",
+ "| time/ | |\n",
+ "| fps | 274 |\n",
+ "| iterations | 14200 |\n",
+ "| time_elapsed | 1035 |\n",
+ "| total_timesteps | 284000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.18 |\n",
+ "| explained_variance | 0.92715 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 14199 |\n",
+ "| policy_loss | 0.00173 |\n",
+ "| std | 0.885 |\n",
+ "| value_loss | 3e-05 |\n",
+ "------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48 |\n",
+ "| ep_rew_mean | -48 |\n",
+ "| time/ | |\n",
+ "| fps | 274 |\n",
+ "| iterations | 14300 |\n",
+ "| time_elapsed | 1042 |\n",
+ "| total_timesteps | 286000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.19 |\n",
+ "| explained_variance | 0.86559874 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 14299 |\n",
+ "| policy_loss | -0.29 |\n",
+ "| std | 0.885 |\n",
+ "| value_loss | 0.00812 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48 |\n",
+ "| ep_rew_mean | -48 |\n",
+ "| time/ | |\n",
+ "| fps | 274 |\n",
+ "| iterations | 14400 |\n",
+ "| time_elapsed | 1048 |\n",
+ "| total_timesteps | 288000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.18 |\n",
+ "| explained_variance | 0.95699525 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 14399 |\n",
+ "| policy_loss | -0.0155 |\n",
+ "| std | 0.884 |\n",
+ "| value_loss | 1.59e-05 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 46.7 |\n",
+ "| ep_rew_mean | -46.6 |\n",
+ "| time/ | |\n",
+ "| fps | 274 |\n",
+ "| iterations | 14500 |\n",
+ "| time_elapsed | 1055 |\n",
+ "| total_timesteps | 290000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.15 |\n",
+ "| explained_variance | 0.27279484 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 14499 |\n",
+ "| policy_loss | -0.0854 |\n",
+ "| std | 0.878 |\n",
+ "| value_loss | 0.000576 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.7 |\n",
+ "| ep_rew_mean | -47.6 |\n",
+ "| time/ | |\n",
+ "| fps | 274 |\n",
+ "| iterations | 14600 |\n",
+ "| time_elapsed | 1062 |\n",
+ "| total_timesteps | 292000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -5.15 |\n",
+ "| explained_variance | 0.96234167 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 14599 |\n",
+ "| policy_loss | -0.00485 |\n",
+ "| std | 0.878 |\n",
+ "| value_loss | 5.17e-06 |\n",
+ "--------------------------------------\n",
+ "[... training log truncated: iterations 14700-20800 omitted (total_timesteps 294,000-416,000, fps ~271-275, ep_rew_mean oscillating between -45.6 and -50) ...]\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.5 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 274 |\n",
+ "| iterations | 20900 |\n",
+ "| time_elapsed | 1522 |\n",
+ "| total_timesteps | 418000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.91 |\n",
+ "| explained_variance | 0.9828419 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 20899 |\n",
+ "| policy_loss | 0.0151 |\n",
+ "| std | 0.827 |\n",
+ "| value_loss | 3.95e-05 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -49 |\n",
+ "| time/ | |\n",
+ "| fps | 274 |\n",
+ "| iterations | 21000 |\n",
+ "| time_elapsed | 1528 |\n",
+ "| total_timesteps | 420000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.9 |\n",
+ "| explained_variance | 0.9770458 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 20999 |\n",
+ "| policy_loss | -0.00239 |\n",
+ "| std | 0.825 |\n",
+ "| value_loss | 7.19e-07 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -49 |\n",
+ "| time/ | |\n",
+ "| fps | 274 |\n",
+ "| iterations | 21100 |\n",
+ "| time_elapsed | 1538 |\n",
+ "| total_timesteps | 422000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.9 |\n",
+ "| explained_variance | 0.92313987 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 21099 |\n",
+ "| policy_loss | 0.00894 |\n",
+ "| std | 0.826 |\n",
+ "| value_loss | 9.61e-06 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.4 |\n",
+ "| ep_rew_mean | -48.3 |\n",
+ "| time/ | |\n",
+ "| fps | 274 |\n",
+ "| iterations | 21200 |\n",
+ "| time_elapsed | 1544 |\n",
+ "| total_timesteps | 424000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.89 |\n",
+ "| explained_variance | 0.32365882 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 21199 |\n",
+ "| policy_loss | -0.0463 |\n",
+ "| std | 0.822 |\n",
+ "| value_loss | 9.98e-05 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.2 |\n",
+ "| ep_rew_mean | -48.2 |\n",
+ "| time/ | |\n",
+ "| fps | 274 |\n",
+ "| iterations | 21300 |\n",
+ "| time_elapsed | 1551 |\n",
+ "| total_timesteps | 426000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.88 |\n",
+ "| explained_variance | 0.7403059 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 21299 |\n",
+ "| policy_loss | 0.00897 |\n",
+ "| std | 0.821 |\n",
+ "| value_loss | 9.72e-06 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.9 |\n",
+ "| ep_rew_mean | -47.8 |\n",
+ "| time/ | |\n",
+ "| fps | 274 |\n",
+ "| iterations | 21400 |\n",
+ "| time_elapsed | 1557 |\n",
+ "| total_timesteps | 428000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.87 |\n",
+ "| explained_variance | 0.8968396 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 21399 |\n",
+ "| policy_loss | -0.00422 |\n",
+ "| std | 0.819 |\n",
+ "| value_loss | 5.15e-06 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48 |\n",
+ "| ep_rew_mean | -48 |\n",
+ "| time/ | |\n",
+ "| fps | 275 |\n",
+ "| iterations | 21500 |\n",
+ "| time_elapsed | 1563 |\n",
+ "| total_timesteps | 430000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.87 |\n",
+ "| explained_variance | 0.9448255 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 21499 |\n",
+ "| policy_loss | -0.00374 |\n",
+ "| std | 0.818 |\n",
+ "| value_loss | 1.92e-06 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.1 |\n",
+ "| ep_rew_mean | -48 |\n",
+ "| time/ | |\n",
+ "| fps | 274 |\n",
+ "| iterations | 21600 |\n",
+ "| time_elapsed | 1572 |\n",
+ "| total_timesteps | 432000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.87 |\n",
+ "| explained_variance | 0.850035 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 21599 |\n",
+ "| policy_loss | -9.31e-05 |\n",
+ "| std | 0.818 |\n",
+ "| value_loss | 6.13e-06 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.6 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 274 |\n",
+ "| iterations | 21700 |\n",
+ "| time_elapsed | 1578 |\n",
+ "| total_timesteps | 434000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.84 |\n",
+ "| explained_variance | 0.48841304 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 21699 |\n",
+ "| policy_loss | -0.0312 |\n",
+ "| std | 0.812 |\n",
+ "| value_loss | 6.78e-05 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.9 |\n",
+ "| ep_rew_mean | -48.9 |\n",
+ "| time/ | |\n",
+ "| fps | 275 |\n",
+ "| iterations | 21800 |\n",
+ "| time_elapsed | 1584 |\n",
+ "| total_timesteps | 436000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.84 |\n",
+ "| explained_variance | 0.97507805 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 21799 |\n",
+ "| policy_loss | -0.00284 |\n",
+ "| std | 0.812 |\n",
+ "| value_loss | 3.46e-06 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.5 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 275 |\n",
+ "| iterations | 21900 |\n",
+ "| time_elapsed | 1589 |\n",
+ "| total_timesteps | 438000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.84 |\n",
+ "| explained_variance | 0.68833864 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 21899 |\n",
+ "| policy_loss | 0.0115 |\n",
+ "| std | 0.813 |\n",
+ "| value_loss | 9.05e-06 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -49 |\n",
+ "| time/ | |\n",
+ "| fps | 275 |\n",
+ "| iterations | 22000 |\n",
+ "| time_elapsed | 1595 |\n",
+ "| total_timesteps | 440000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.83 |\n",
+ "| explained_variance | 0.98591065 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 21999 |\n",
+ "| policy_loss | 0.00962 |\n",
+ "| std | 0.811 |\n",
+ "| value_loss | 4.56e-06 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.8 |\n",
+ "| ep_rew_mean | -48.7 |\n",
+ "| time/ | |\n",
+ "| fps | 275 |\n",
+ "| iterations | 22100 |\n",
+ "| time_elapsed | 1605 |\n",
+ "| total_timesteps | 442000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.83 |\n",
+ "| explained_variance | 0.82283175 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 22099 |\n",
+ "| policy_loss | 0.0108 |\n",
+ "| std | 0.811 |\n",
+ "| value_loss | 1.54e-05 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.8 |\n",
+ "| ep_rew_mean | -47.7 |\n",
+ "| time/ | |\n",
+ "| fps | 275 |\n",
+ "| iterations | 22200 |\n",
+ "| time_elapsed | 1611 |\n",
+ "| total_timesteps | 444000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.8 |\n",
+ "| explained_variance | 0.59894145 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 22199 |\n",
+ "| policy_loss | 0.0302 |\n",
+ "| std | 0.804 |\n",
+ "| value_loss | 0.000191 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48 |\n",
+ "| ep_rew_mean | -48 |\n",
+ "| time/ | |\n",
+ "| fps | 275 |\n",
+ "| iterations | 22300 |\n",
+ "| time_elapsed | 1617 |\n",
+ "| total_timesteps | 446000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.78 |\n",
+ "| explained_variance | 0.9134196 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 22299 |\n",
+ "| policy_loss | 0.00497 |\n",
+ "| std | 0.802 |\n",
+ "| value_loss | 4.45e-06 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.8 |\n",
+ "| ep_rew_mean | -48.8 |\n",
+ "| time/ | |\n",
+ "| fps | 275 |\n",
+ "| iterations | 22400 |\n",
+ "| time_elapsed | 1623 |\n",
+ "| total_timesteps | 448000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.79 |\n",
+ "| explained_variance | 0.9829938 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 22399 |\n",
+ "| policy_loss | -0.0215 |\n",
+ "| std | 0.803 |\n",
+ "| value_loss | 2.6e-05 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.5 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 276 |\n",
+ "| iterations | 22500 |\n",
+ "| time_elapsed | 1629 |\n",
+ "| total_timesteps | 450000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.78 |\n",
+ "| explained_variance | 0.9305882 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 22499 |\n",
+ "| policy_loss | -0.000147 |\n",
+ "| std | 0.802 |\n",
+ "| value_loss | 7.5e-06 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.4 |\n",
+ "| ep_rew_mean | -48.3 |\n",
+ "| time/ | |\n",
+ "| fps | 276 |\n",
+ "| iterations | 22600 |\n",
+ "| time_elapsed | 1635 |\n",
+ "| total_timesteps | 452000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.77 |\n",
+ "| explained_variance | 0.35571432 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 22599 |\n",
+ "| policy_loss | -0.00716 |\n",
+ "| std | 0.799 |\n",
+ "| value_loss | 2.19e-05 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.6 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 275 |\n",
+ "| iterations | 22700 |\n",
+ "| time_elapsed | 1645 |\n",
+ "| total_timesteps | 454000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.77 |\n",
+ "| explained_variance | 0.9946183 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 22699 |\n",
+ "| policy_loss | 0.00128 |\n",
+ "| std | 0.799 |\n",
+ "| value_loss | 1.1e-06 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48 |\n",
+ "| ep_rew_mean | -47.9 |\n",
+ "| time/ | |\n",
+ "| fps | 276 |\n",
+ "| iterations | 22800 |\n",
+ "| time_elapsed | 1651 |\n",
+ "| total_timesteps | 456000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.76 |\n",
+ "| explained_variance | 0.9850599 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 22799 |\n",
+ "| policy_loss | 0.00015 |\n",
+ "| std | 0.798 |\n",
+ "| value_loss | 3.45e-06 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49.4 |\n",
+ "| ep_rew_mean | -49.4 |\n",
+ "| time/ | |\n",
+ "| fps | 276 |\n",
+ "| iterations | 22900 |\n",
+ "| time_elapsed | 1657 |\n",
+ "| total_timesteps | 458000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.75 |\n",
+ "| explained_variance | -1.0031595 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 22899 |\n",
+ "| policy_loss | 0.000758 |\n",
+ "| std | 0.796 |\n",
+ "| value_loss | 8.14e-06 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49.2 |\n",
+ "| ep_rew_mean | -49.1 |\n",
+ "| time/ | |\n",
+ "| fps | 276 |\n",
+ "| iterations | 23000 |\n",
+ "| time_elapsed | 1663 |\n",
+ "| total_timesteps | 460000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.75 |\n",
+ "| explained_variance | 0.1282289 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 22999 |\n",
+ "| policy_loss | -0.0544 |\n",
+ "| std | 0.795 |\n",
+ "| value_loss | 0.000587 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.7 |\n",
+ "| ep_rew_mean | -47.7 |\n",
+ "| time/ | |\n",
+ "| fps | 276 |\n",
+ "| iterations | 23100 |\n",
+ "| time_elapsed | 1670 |\n",
+ "| total_timesteps | 462000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.75 |\n",
+ "| explained_variance | 0.84313476 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 23099 |\n",
+ "| policy_loss | -0.0114 |\n",
+ "| std | 0.795 |\n",
+ "| value_loss | 8.02e-06 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 46.5 |\n",
+ "| ep_rew_mean | -46.4 |\n",
+ "| time/ | |\n",
+ "| fps | 276 |\n",
+ "| iterations | 23200 |\n",
+ "| time_elapsed | 1679 |\n",
+ "| total_timesteps | 464000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.74 |\n",
+ "| explained_variance | 0.71710217 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 23199 |\n",
+ "| policy_loss | -0.0132 |\n",
+ "| std | 0.793 |\n",
+ "| value_loss | 0.000272 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 45.3 |\n",
+ "| ep_rew_mean | -45.2 |\n",
+ "| time/ | |\n",
+ "| fps | 276 |\n",
+ "| iterations | 23300 |\n",
+ "| time_elapsed | 1685 |\n",
+ "| total_timesteps | 466000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.73 |\n",
+ "| explained_variance | 0.9658966 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 23299 |\n",
+ "| policy_loss | -0.00875 |\n",
+ "| std | 0.792 |\n",
+ "| value_loss | 1.17e-05 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.1 |\n",
+ "| ep_rew_mean | -47 |\n",
+ "| time/ | |\n",
+ "| fps | 276 |\n",
+ "| iterations | 23400 |\n",
+ "| time_elapsed | 1691 |\n",
+ "| total_timesteps | 468000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.72 |\n",
+ "| explained_variance | 0.98442066 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 23399 |\n",
+ "| policy_loss | 0.000132 |\n",
+ "| std | 0.791 |\n",
+ "| value_loss | 1.35e-06 |\n",
+ "--------------------------------------\n",
+ "---------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49.1 |\n",
+ "| ep_rew_mean | -49.1 |\n",
+ "| time/ | |\n",
+ "| fps | 276 |\n",
+ "| iterations | 23500 |\n",
+ "| time_elapsed | 1697 |\n",
+ "| total_timesteps | 470000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.71 |\n",
+ "| explained_variance | -0.20414686 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 23499 |\n",
+ "| policy_loss | 0.0214 |\n",
+ "| std | 0.788 |\n",
+ "| value_loss | 4.31e-05 |\n",
+ "---------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.1 |\n",
+ "| ep_rew_mean | -48 |\n",
+ "| time/ | |\n",
+ "| fps | 277 |\n",
+ "| iterations | 23600 |\n",
+ "| time_elapsed | 1703 |\n",
+ "| total_timesteps | 472000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.68 |\n",
+ "| explained_variance | 0.8207843 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 23599 |\n",
+ "| policy_loss | -0.0187 |\n",
+ "| std | 0.781 |\n",
+ "| value_loss | 4.95e-05 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.8 |\n",
+ "| ep_rew_mean | -47.7 |\n",
+ "| time/ | |\n",
+ "| fps | 276 |\n",
+ "| iterations | 23700 |\n",
+ "| time_elapsed | 1713 |\n",
+ "| total_timesteps | 474000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.67 |\n",
+ "| explained_variance | 0.9646553 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 23699 |\n",
+ "| policy_loss | -0.00441 |\n",
+ "| std | 0.78 |\n",
+ "| value_loss | 4.12e-06 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 46.6 |\n",
+ "| ep_rew_mean | -46.5 |\n",
+ "| time/ | |\n",
+ "| fps | 276 |\n",
+ "| iterations | 23800 |\n",
+ "| time_elapsed | 1719 |\n",
+ "| total_timesteps | 476000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.64 |\n",
+ "| explained_variance | 0.29265028 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 23799 |\n",
+ "| policy_loss | -0.0266 |\n",
+ "| std | 0.775 |\n",
+ "| value_loss | 8.52e-05 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.8 |\n",
+ "| ep_rew_mean | -47.8 |\n",
+ "| time/ | |\n",
+ "| fps | 276 |\n",
+ "| iterations | 23900 |\n",
+ "| time_elapsed | 1725 |\n",
+ "| total_timesteps | 478000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.63 |\n",
+ "| explained_variance | 0.53161466 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 23899 |\n",
+ "| policy_loss | 0.0194 |\n",
+ "| std | 0.773 |\n",
+ "| value_loss | 0.00012 |\n",
+ "--------------------------------------\n",
+ "---------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.7 |\n",
+ "| ep_rew_mean | -48.7 |\n",
+ "| time/ | |\n",
+ "| fps | 277 |\n",
+ "| iterations | 24000 |\n",
+ "| time_elapsed | 1731 |\n",
+ "| total_timesteps | 480000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.61 |\n",
+ "| explained_variance | -0.45772743 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 23999 |\n",
+ "| policy_loss | 0.0221 |\n",
+ "| std | 0.769 |\n",
+ "| value_loss | 0.000243 |\n",
+ "---------------------------------------\n",
+ "------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -48.9 |\n",
+ "| time/ | |\n",
+ "| fps | 277 |\n",
+ "| iterations | 24100 |\n",
+ "| time_elapsed | 1737 |\n",
+ "| total_timesteps | 482000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.6 |\n",
+ "| explained_variance | 0.984889 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 24099 |\n",
+ "| policy_loss | 0.00232 |\n",
+ "| std | 0.767 |\n",
+ "| value_loss | 2.45e-06 |\n",
+ "------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49.4 |\n",
+ "| ep_rew_mean | -49.4 |\n",
+ "| time/ | |\n",
+ "| fps | 276 |\n",
+ "| iterations | 24200 |\n",
+ "| time_elapsed | 1747 |\n",
+ "| total_timesteps | 484000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.6 |\n",
+ "| explained_variance | 0.38117242 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 24199 |\n",
+ "| policy_loss | -0.00147 |\n",
+ "| std | 0.766 |\n",
+ "| value_loss | 2.92e-05 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49.4 |\n",
+ "| ep_rew_mean | -49.4 |\n",
+ "| time/ | |\n",
+ "| fps | 277 |\n",
+ "| iterations | 24300 |\n",
+ "| time_elapsed | 1753 |\n",
+ "| total_timesteps | 486000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.57 |\n",
+ "| explained_variance | 0.8432429 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 24299 |\n",
+ "| policy_loss | -0.0173 |\n",
+ "| std | 0.761 |\n",
+ "| value_loss | 7.87e-05 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49.3 |\n",
+ "| ep_rew_mean | -49.3 |\n",
+ "| time/ | |\n",
+ "| fps | 277 |\n",
+ "| iterations | 24400 |\n",
+ "| time_elapsed | 1759 |\n",
+ "| total_timesteps | 488000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.55 |\n",
+ "| explained_variance | 0.74071956 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 24399 |\n",
+ "| policy_loss | -0.000541 |\n",
+ "| std | 0.757 |\n",
+ "| value_loss | 5.11e-06 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.4 |\n",
+ "| ep_rew_mean | -48.3 |\n",
+ "| time/ | |\n",
+ "| fps | 277 |\n",
+ "| iterations | 24500 |\n",
+ "| time_elapsed | 1765 |\n",
+ "| total_timesteps | 490000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.53 |\n",
+ "| explained_variance | 0.93212646 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 24499 |\n",
+ "| policy_loss | -0.0138 |\n",
+ "| std | 0.752 |\n",
+ "| value_loss | 1.72e-05 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.1 |\n",
+ "| ep_rew_mean | -48 |\n",
+ "| time/ | |\n",
+ "| fps | 277 |\n",
+ "| iterations | 24600 |\n",
+ "| time_elapsed | 1771 |\n",
+ "| total_timesteps | 492000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.49 |\n",
+ "| explained_variance | 0.83804965 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 24599 |\n",
+ "| policy_loss | -0.0364 |\n",
+ "| std | 0.746 |\n",
+ "| value_loss | 7.15e-05 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.2 |\n",
+ "| ep_rew_mean | -48.1 |\n",
+ "| time/ | |\n",
+ "| fps | 277 |\n",
+ "| iterations | 24700 |\n",
+ "| time_elapsed | 1777 |\n",
+ "| total_timesteps | 494000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.5 |\n",
+ "| explained_variance | 0.99318516 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 24699 |\n",
+ "| policy_loss | -0.00357 |\n",
+ "| std | 0.746 |\n",
+ "| value_loss | 3.81e-06 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.5 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 277 |\n",
+ "| iterations | 24800 |\n",
+ "| time_elapsed | 1787 |\n",
+ "| total_timesteps | 496000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.49 |\n",
+ "| explained_variance | 0.9355085 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 24799 |\n",
+ "| policy_loss | -0.000438 |\n",
+ "| std | 0.745 |\n",
+ "| value_loss | 1.37e-06 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 50 |\n",
+ "| ep_rew_mean | -50 |\n",
+ "| time/ | |\n",
+ "| fps | 277 |\n",
+ "| iterations | 24900 |\n",
+ "| time_elapsed | 1793 |\n",
+ "| total_timesteps | 498000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.48 |\n",
+ "| explained_variance | 0.9021735 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 24899 |\n",
+ "| policy_loss | -0.000913 |\n",
+ "| std | 0.744 |\n",
+ "| value_loss | 2.17e-06 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 50 |\n",
+ "| ep_rew_mean | -50 |\n",
+ "| time/ | |\n",
+ "| fps | 277 |\n",
+ "| iterations | 25000 |\n",
+ "| time_elapsed | 1799 |\n",
+ "| total_timesteps | 500000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.47 |\n",
+ "| explained_variance | 0.9590338 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 24999 |\n",
+ "| policy_loss | -0.000665 |\n",
+ "| std | 0.742 |\n",
+ "| value_loss | 8.6e-07 |\n",
+ "-------------------------------------\n",
+ "---------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 50 |\n",
+ "| ep_rew_mean | -50 |\n",
+ "| time/ | |\n",
+ "| fps | 278 |\n",
+ "| iterations | 25100 |\n",
+ "| time_elapsed | 1805 |\n",
+ "| total_timesteps | 502000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.46 |\n",
+ "| explained_variance | -0.09868252 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 25099 |\n",
+ "| policy_loss | 0.00935 |\n",
+ "| std | 0.74 |\n",
+ "| value_loss | 3.3e-05 |\n",
+ "---------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49.8 |\n",
+ "| ep_rew_mean | -49.7 |\n",
+ "| time/ | |\n",
+ "| fps | 278 |\n",
+ "| iterations | 25200 |\n",
+ "| time_elapsed | 1811 |\n",
+ "| total_timesteps | 504000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.45 |\n",
+ "| explained_variance | 0.9393065 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 25199 |\n",
+ "| policy_loss | -0.00298 |\n",
+ "| std | 0.739 |\n",
+ "| value_loss | 2.64e-05 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.5 |\n",
+ "| ep_rew_mean | -48.4 |\n",
+ "| time/ | |\n",
+ "| fps | 277 |\n",
+ "| iterations | 25300 |\n",
+ "| time_elapsed | 1820 |\n",
+ "| total_timesteps | 506000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.46 |\n",
+ "| explained_variance | 0.9661807 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 25299 |\n",
+ "| policy_loss | -0.00921 |\n",
+ "| std | 0.739 |\n",
+ "| value_loss | 5.25e-06 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.2 |\n",
+ "| ep_rew_mean | -47.1 |\n",
+ "| time/ | |\n",
+ "| fps | 278 |\n",
+ "| iterations | 25400 |\n",
+ "| time_elapsed | 1826 |\n",
+ "| total_timesteps | 508000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.45 |\n",
+ "| explained_variance | 0.98033226 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 25399 |\n",
+ "| policy_loss | -0.0115 |\n",
+ "| std | 0.738 |\n",
+ "| value_loss | 1.33e-05 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.9 |\n",
+ "| ep_rew_mean | -47.9 |\n",
+ "| time/ | |\n",
+ "| fps | 278 |\n",
+ "| iterations | 25500 |\n",
+ "| time_elapsed | 1832 |\n",
+ "| total_timesteps | 510000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.45 |\n",
+ "| explained_variance | 0.98172903 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 25499 |\n",
+ "| policy_loss | -0.00525 |\n",
+ "| std | 0.737 |\n",
+ "| value_loss | 4.09e-06 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49.5 |\n",
+ "| ep_rew_mean | -49.5 |\n",
+ "| time/ | |\n",
+ "| fps | 278 |\n",
+ "| iterations | 25600 |\n",
+ "| time_elapsed | 1838 |\n",
+ "| total_timesteps | 512000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.44 |\n",
+ "| explained_variance | 0.9630763 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 25599 |\n",
+ "| policy_loss | 0.00446 |\n",
+ "| std | 0.737 |\n",
+ "| value_loss | 3.84e-06 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49.5 |\n",
+ "| ep_rew_mean | -49.5 |\n",
+ "| time/ | |\n",
+ "| fps | 278 |\n",
+ "| iterations | 25700 |\n",
+ "| time_elapsed | 1844 |\n",
+ "| total_timesteps | 514000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.42 |\n",
+ "| explained_variance | 0.74551255 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 25699 |\n",
+ "| policy_loss | -0.00379 |\n",
+ "| std | 0.733 |\n",
+ "| value_loss | 4.06e-06 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48 |\n",
+ "| ep_rew_mean | -48 |\n",
+ "| time/ | |\n",
+ "| fps | 278 |\n",
+ "| iterations | 25800 |\n",
+ "| time_elapsed | 1851 |\n",
+ "| total_timesteps | 516000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.42 |\n",
+ "| explained_variance | 0.88611174 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 25799 |\n",
+ "| policy_loss | -0.00438 |\n",
+ "| std | 0.733 |\n",
+ "| value_loss | 4.34e-06 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 46.2 |\n",
+ "| ep_rew_mean | -46.2 |\n",
+ "| time/ | |\n",
+ "| fps | 278 |\n",
+ "| iterations | 25900 |\n",
+ "| time_elapsed | 1860 |\n",
+ "| total_timesteps | 518000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.4 |\n",
+ "| explained_variance | 0.97400296 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 25899 |\n",
+ "| policy_loss | -0.0016 |\n",
+ "| std | 0.73 |\n",
+ "| value_loss | 3.13e-06 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.4 |\n",
+ "| ep_rew_mean | -47.4 |\n",
+ "| time/ | |\n",
+ "| fps | 278 |\n",
+ "| iterations | 26000 |\n",
+ "| time_elapsed | 1866 |\n",
+ "| total_timesteps | 520000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.43 |\n",
+ "| explained_variance | 0.9903519 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 25999 |\n",
+ "| policy_loss | -0.00792 |\n",
+ "| std | 0.734 |\n",
+ "| value_loss | 4.45e-06 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.1 |\n",
+ "| ep_rew_mean | -47 |\n",
+ "| time/ | |\n",
+ "| fps | 278 |\n",
+ "| iterations | 26100 |\n",
+ "| time_elapsed | 1872 |\n",
+ "| total_timesteps | 522000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.41 |\n",
+ "| explained_variance | 0.96013033 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 26099 |\n",
+ "| policy_loss | -0.00215 |\n",
+ "| std | 0.733 |\n",
+ "| value_loss | 2.28e-06 |\n",
+ "--------------------------------------\n",
+ "... [training log truncated: iterations 26200-32400 omitted; metrics remained flat, with ep_rew_mean between roughly -47 and -49, fps ~ 280, entropy_loss drifting from -4.42 to -3.98, and std decreasing from 0.733 to 0.657] ...\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.5 |\n",
+ "| ep_rew_mean | -47.5 |\n",
+ "| time/ | |\n",
+ "| fps | 281 |\n",
+ "| iterations | 32500 |\n",
+ "| time_elapsed | 2311 |\n",
+ "| total_timesteps | 650000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.98 |\n",
+ "| explained_variance | 0.9863495 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 32499 |\n",
+ "| policy_loss | 0.000939 |\n",
+ "| std | 0.657 |\n",
+ "| value_loss | 4.68e-07 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.5 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 32600 |\n",
+ "| time_elapsed | 2321 |\n",
+ "| total_timesteps | 652000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.98 |\n",
+ "| explained_variance | 0.6925158 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 32599 |\n",
+ "| policy_loss | 0.00332 |\n",
+ "| std | 0.657 |\n",
+ "| value_loss | 1.72e-06 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -49 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 32700 |\n",
+ "| time_elapsed | 2327 |\n",
+ "| total_timesteps | 654000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4 |\n",
+ "| explained_variance | 0.9290561 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 32699 |\n",
+ "| policy_loss | 0.00131 |\n",
+ "| std | 0.66 |\n",
+ "| value_loss | 1.3e-06 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -49 |\n",
+ "| time/ | |\n",
+ "| fps | 281 |\n",
+ "| iterations | 32800 |\n",
+ "| time_elapsed | 2334 |\n",
+ "| total_timesteps | 656000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.01 |\n",
+ "| explained_variance | 0.96650255 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 32799 |\n",
+ "| policy_loss | 0.000317 |\n",
+ "| std | 0.661 |\n",
+ "| value_loss | 2.75e-07 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -49 |\n",
+ "| time/ | |\n",
+ "| fps | 281 |\n",
+ "| iterations | 32900 |\n",
+ "| time_elapsed | 2340 |\n",
+ "| total_timesteps | 658000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.99 |\n",
+ "| explained_variance | 0.92857796 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 32899 |\n",
+ "| policy_loss | -0.000708 |\n",
+ "| std | 0.658 |\n",
+ "| value_loss | 9.73e-07 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.1 |\n",
+ "| ep_rew_mean | -48 |\n",
+ "| time/ | |\n",
+ "| fps | 281 |\n",
+ "| iterations | 33000 |\n",
+ "| time_elapsed | 2347 |\n",
+ "| total_timesteps | 660000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.99 |\n",
+ "| explained_variance | 0.81251585 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 32999 |\n",
+ "| policy_loss | -0.000237 |\n",
+ "| std | 0.659 |\n",
+ "| value_loss | 2.52e-06 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.6 |\n",
+ "| ep_rew_mean | -47.5 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 33100 |\n",
+ "| time_elapsed | 2357 |\n",
+ "| total_timesteps | 662000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4 |\n",
+ "| explained_variance | 0.8767034 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 33099 |\n",
+ "| policy_loss | -0.00235 |\n",
+ "| std | 0.661 |\n",
+ "| value_loss | 2.81e-05 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48 |\n",
+ "| ep_rew_mean | -48 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 33200 |\n",
+ "| time_elapsed | 2363 |\n",
+ "| total_timesteps | 664000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.01 |\n",
+ "| explained_variance | 0.9719703 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 33199 |\n",
+ "| policy_loss | 0.000635 |\n",
+ "| std | 0.661 |\n",
+ "| value_loss | 2.4e-06 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.7 |\n",
+ "| ep_rew_mean | -47.6 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 33300 |\n",
+ "| time_elapsed | 2370 |\n",
+ "| total_timesteps | 666000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.02 |\n",
+ "| explained_variance | 0.9966027 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 33299 |\n",
+ "| policy_loss | 0.00241 |\n",
+ "| std | 0.664 |\n",
+ "| value_loss | 6.73e-07 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.2 |\n",
+ "| ep_rew_mean | -48.1 |\n",
+ "| time/ | |\n",
+ "| fps | 281 |\n",
+ "| iterations | 33400 |\n",
+ "| time_elapsed | 2377 |\n",
+ "| total_timesteps | 668000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.03 |\n",
+ "| explained_variance | 0.9685441 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 33399 |\n",
+ "| policy_loss | -0.000859 |\n",
+ "| std | 0.665 |\n",
+ "| value_loss | 9.94e-07 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.7 |\n",
+ "| ep_rew_mean | -48.6 |\n",
+ "| time/ | |\n",
+ "| fps | 281 |\n",
+ "| iterations | 33500 |\n",
+ "| time_elapsed | 2384 |\n",
+ "| total_timesteps | 670000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.02 |\n",
+ "| explained_variance | 0.46576625 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 33499 |\n",
+ "| policy_loss | -0.00206 |\n",
+ "| std | 0.664 |\n",
+ "| value_loss | 5.21e-06 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.5 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 33600 |\n",
+ "| time_elapsed | 2394 |\n",
+ "| total_timesteps | 672000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.02 |\n",
+ "| explained_variance | 0.79574704 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 33599 |\n",
+ "| policy_loss | 0.00271 |\n",
+ "| std | 0.663 |\n",
+ "| value_loss | 1.14e-06 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48 |\n",
+ "| ep_rew_mean | -48 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 33700 |\n",
+ "| time_elapsed | 2401 |\n",
+ "| total_timesteps | 674000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.03 |\n",
+ "| explained_variance | 0.9673645 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 33699 |\n",
+ "| policy_loss | -0.00301 |\n",
+ "| std | 0.665 |\n",
+ "| value_loss | 1.64e-06 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.5 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 33800 |\n",
+ "| time_elapsed | 2408 |\n",
+ "| total_timesteps | 676000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.03 |\n",
+ "| explained_variance | 0.9559499 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 33799 |\n",
+ "| policy_loss | 0.00404 |\n",
+ "| std | 0.665 |\n",
+ "| value_loss | 2.61e-06 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49.5 |\n",
+ "| ep_rew_mean | -49.5 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 33900 |\n",
+ "| time_elapsed | 2415 |\n",
+ "| total_timesteps | 678000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.03 |\n",
+ "| explained_variance | 0.8689276 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 33899 |\n",
+ "| policy_loss | -0.00229 |\n",
+ "| std | 0.665 |\n",
+ "| value_loss | 3.25e-06 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.8 |\n",
+ "| ep_rew_mean | -48.7 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 34000 |\n",
+ "| time_elapsed | 2421 |\n",
+ "| total_timesteps | 680000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.04 |\n",
+ "| explained_variance | 0.92665327 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 33999 |\n",
+ "| policy_loss | -8.24e-05 |\n",
+ "| std | 0.666 |\n",
+ "| value_loss | 7.25e-07 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.8 |\n",
+ "| ep_rew_mean | -48.7 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 34100 |\n",
+ "| time_elapsed | 2432 |\n",
+ "| total_timesteps | 682000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.04 |\n",
+ "| explained_variance | 0.9745406 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 34099 |\n",
+ "| policy_loss | -0.00209 |\n",
+ "| std | 0.666 |\n",
+ "| value_loss | 1.28e-06 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -49 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 34200 |\n",
+ "| time_elapsed | 2438 |\n",
+ "| total_timesteps | 684000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.02 |\n",
+ "| explained_variance | 0.8974001 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 34199 |\n",
+ "| policy_loss | -0.00629 |\n",
+ "| std | 0.662 |\n",
+ "| value_loss | 6.24e-06 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -49 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 34300 |\n",
+ "| time_elapsed | 2445 |\n",
+ "| total_timesteps | 686000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -4.02 |\n",
+ "| explained_variance | 0.9367453 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 34299 |\n",
+ "| policy_loss | -0.000219 |\n",
+ "| std | 0.663 |\n",
+ "| value_loss | 3.84e-07 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -49 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 34400 |\n",
+ "| time_elapsed | 2452 |\n",
+ "| total_timesteps | 688000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.99 |\n",
+ "| explained_variance | 0.9830403 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 34399 |\n",
+ "| policy_loss | -0.000379 |\n",
+ "| std | 0.658 |\n",
+ "| value_loss | 8.19e-07 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48 |\n",
+ "| ep_rew_mean | -48 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 34500 |\n",
+ "| time_elapsed | 2458 |\n",
+ "| total_timesteps | 690000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.96 |\n",
+ "| explained_variance | 0.7310099 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 34499 |\n",
+ "| policy_loss | -0.013 |\n",
+ "| std | 0.652 |\n",
+ "| value_loss | 2.52e-05 |\n",
+ "-------------------------------------\n",
+ "------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.3 |\n",
+ "| ep_rew_mean | -48.2 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 34600 |\n",
+ "| time_elapsed | 2468 |\n",
+ "| total_timesteps | 692000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.94 |\n",
+ "| explained_variance | 0.975126 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 34599 |\n",
+ "| policy_loss | 0.00412 |\n",
+ "| std | 0.65 |\n",
+ "| value_loss | 3.43e-06 |\n",
+ "------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.9 |\n",
+ "| ep_rew_mean | -48.9 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 34700 |\n",
+ "| time_elapsed | 2475 |\n",
+ "| total_timesteps | 694000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.95 |\n",
+ "| explained_variance | -0.6563065 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 34699 |\n",
+ "| policy_loss | 0.00318 |\n",
+ "| std | 0.651 |\n",
+ "| value_loss | 5.93e-06 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.4 |\n",
+ "| ep_rew_mean | -48.4 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 34800 |\n",
+ "| time_elapsed | 2481 |\n",
+ "| total_timesteps | 696000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.92 |\n",
+ "| explained_variance | 0.97628003 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 34799 |\n",
+ "| policy_loss | 0.000984 |\n",
+ "| std | 0.647 |\n",
+ "| value_loss | 1.03e-06 |\n",
+ "--------------------------------------\n",
+ "------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -49 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 34900 |\n",
+ "| time_elapsed | 2488 |\n",
+ "| total_timesteps | 698000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.92 |\n",
+ "| explained_variance | 0.857736 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 34899 |\n",
+ "| policy_loss | 0.000848 |\n",
+ "| std | 0.646 |\n",
+ "| value_loss | 8.62e-07 |\n",
+ "------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -49 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 35000 |\n",
+ "| time_elapsed | 2494 |\n",
+ "| total_timesteps | 700000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.9 |\n",
+ "| explained_variance | 0.9739769 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 34999 |\n",
+ "| policy_loss | 8.95e-05 |\n",
+ "| std | 0.643 |\n",
+ "| value_loss | 6.12e-07 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.5 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 35100 |\n",
+ "| time_elapsed | 2505 |\n",
+ "| total_timesteps | 702000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.91 |\n",
+ "| explained_variance | 0.9625768 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 35099 |\n",
+ "| policy_loss | 0.000541 |\n",
+ "| std | 0.645 |\n",
+ "| value_loss | 2.43e-07 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.5 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 35200 |\n",
+ "| time_elapsed | 2511 |\n",
+ "| total_timesteps | 704000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.9 |\n",
+ "| explained_variance | 0.76877356 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 35199 |\n",
+ "| policy_loss | -0.00575 |\n",
+ "| std | 0.642 |\n",
+ "| value_loss | 1.08e-05 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.5 |\n",
+ "| ep_rew_mean | -47.5 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 35300 |\n",
+ "| time_elapsed | 2517 |\n",
+ "| total_timesteps | 706000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.89 |\n",
+ "| explained_variance | 0.84682435 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 35299 |\n",
+ "| policy_loss | -0.0021 |\n",
+ "| std | 0.641 |\n",
+ "| value_loss | 3.52e-06 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -49 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 35400 |\n",
+ "| time_elapsed | 2524 |\n",
+ "| total_timesteps | 708000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.89 |\n",
+ "| explained_variance | 0.7140837 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 35399 |\n",
+ "| policy_loss | -0.00864 |\n",
+ "| std | 0.642 |\n",
+ "| value_loss | 1.03e-05 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49.5 |\n",
+ "| ep_rew_mean | -49.5 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 35500 |\n",
+ "| time_elapsed | 2530 |\n",
+ "| total_timesteps | 710000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.9 |\n",
+ "| explained_variance | 0.9013965 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 35499 |\n",
+ "| policy_loss | -0.00133 |\n",
+ "| std | 0.644 |\n",
+ "| value_loss | 5.98e-07 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49.5 |\n",
+ "| ep_rew_mean | -49.5 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 35600 |\n",
+ "| time_elapsed | 2541 |\n",
+ "| total_timesteps | 712000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.91 |\n",
+ "| explained_variance | 0.91648865 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 35599 |\n",
+ "| policy_loss | -0.00166 |\n",
+ "| std | 0.644 |\n",
+ "| value_loss | 8.45e-07 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -49 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 35700 |\n",
+ "| time_elapsed | 2547 |\n",
+ "| total_timesteps | 714000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.89 |\n",
+ "| explained_variance | 0.78630555 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 35699 |\n",
+ "| policy_loss | -0.0023 |\n",
+ "| std | 0.642 |\n",
+ "| value_loss | 3.13e-06 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.5 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 35800 |\n",
+ "| time_elapsed | 2554 |\n",
+ "| total_timesteps | 716000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.89 |\n",
+ "| explained_variance | 0.98644364 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 35799 |\n",
+ "| policy_loss | -0.00116 |\n",
+ "| std | 0.642 |\n",
+ "| value_loss | 5.94e-07 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48 |\n",
+ "| ep_rew_mean | -48 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 35900 |\n",
+ "| time_elapsed | 2561 |\n",
+ "| total_timesteps | 718000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.88 |\n",
+ "| explained_variance | 0.9824021 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 35899 |\n",
+ "| policy_loss | -0.00393 |\n",
+ "| std | 0.639 |\n",
+ "| value_loss | 4.01e-06 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.5 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 36000 |\n",
+ "| time_elapsed | 2572 |\n",
+ "| total_timesteps | 720000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.87 |\n",
+ "| explained_variance | 0.9410251 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 35999 |\n",
+ "| policy_loss | -0.000505 |\n",
+ "| std | 0.638 |\n",
+ "| value_loss | 2.79e-07 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49.5 |\n",
+ "| ep_rew_mean | -49.5 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 36100 |\n",
+ "| time_elapsed | 2578 |\n",
+ "| total_timesteps | 722000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.87 |\n",
+ "| explained_variance | 0.9754824 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 36099 |\n",
+ "| policy_loss | 0.000673 |\n",
+ "| std | 0.638 |\n",
+ "| value_loss | 3.42e-07 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49.5 |\n",
+ "| ep_rew_mean | -49.5 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 36200 |\n",
+ "| time_elapsed | 2585 |\n",
+ "| total_timesteps | 724000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.86 |\n",
+ "| explained_variance | 0.84805125 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 36199 |\n",
+ "| policy_loss | -0.00034 |\n",
+ "| std | 0.636 |\n",
+ "| value_loss | 2.15e-06 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49.3 |\n",
+ "| ep_rew_mean | -49.3 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 36300 |\n",
+ "| time_elapsed | 2592 |\n",
+ "| total_timesteps | 726000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.86 |\n",
+ "| explained_variance | 0.98801094 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 36299 |\n",
+ "| policy_loss | -0.00244 |\n",
+ "| std | 0.637 |\n",
+ "| value_loss | 7.71e-07 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.8 |\n",
+ "| ep_rew_mean | -48.8 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 36400 |\n",
+ "| time_elapsed | 2598 |\n",
+ "| total_timesteps | 728000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.86 |\n",
+ "| explained_variance | 0.64739573 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 36399 |\n",
+ "| policy_loss | -0.00118 |\n",
+ "| std | 0.636 |\n",
+ "| value_loss | 4.25e-07 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.8 |\n",
+ "| ep_rew_mean | -47.8 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 36500 |\n",
+ "| time_elapsed | 2608 |\n",
+ "| total_timesteps | 730000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.84 |\n",
+ "| explained_variance | 0.9897441 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 36499 |\n",
+ "| policy_loss | 0.00103 |\n",
+ "| std | 0.633 |\n",
+ "| value_loss | 8.12e-07 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.9 |\n",
+ "| ep_rew_mean | -47.8 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 36600 |\n",
+ "| time_elapsed | 2615 |\n",
+ "| total_timesteps | 732000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.84 |\n",
+ "| explained_variance | 0.98654985 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 36599 |\n",
+ "| policy_loss | -0.00401 |\n",
+ "| std | 0.634 |\n",
+ "| value_loss | 2.09e-06 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.9 |\n",
+ "| ep_rew_mean | -47.8 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 36700 |\n",
+ "| time_elapsed | 2622 |\n",
+ "| total_timesteps | 734000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.84 |\n",
+ "| explained_variance | 0.98241895 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 36699 |\n",
+ "| policy_loss | -0.00083 |\n",
+ "| std | 0.633 |\n",
+ "| value_loss | 1.19e-06 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.4 |\n",
+ "| ep_rew_mean | -48.3 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 36800 |\n",
+ "| time_elapsed | 2629 |\n",
+ "| total_timesteps | 736000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.83 |\n",
+ "| explained_variance | 0.8045112 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 36799 |\n",
+ "| policy_loss | 0.00678 |\n",
+ "| std | 0.632 |\n",
+ "| value_loss | 8.17e-06 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49.5 |\n",
+ "| ep_rew_mean | -49.5 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 36900 |\n",
+ "| time_elapsed | 2636 |\n",
+ "| total_timesteps | 738000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.83 |\n",
+ "| explained_variance | 0.6432221 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 36899 |\n",
+ "| policy_loss | -0.00117 |\n",
+ "| std | 0.632 |\n",
+ "| value_loss | 2.46e-06 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.5 |\n",
+ "| ep_rew_mean | -47.5 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 37000 |\n",
+ "| time_elapsed | 2646 |\n",
+ "| total_timesteps | 740000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.81 |\n",
+ "| explained_variance | 0.89308023 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 36999 |\n",
+ "| policy_loss | -0.00348 |\n",
+ "| std | 0.629 |\n",
+ "| value_loss | 2.19e-05 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.5 |\n",
+ "| ep_rew_mean | -47.5 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 37100 |\n",
+ "| time_elapsed | 2653 |\n",
+ "| total_timesteps | 742000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.81 |\n",
+ "| explained_variance | 0.97850627 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 37099 |\n",
+ "| policy_loss | -0.000425 |\n",
+ "| std | 0.629 |\n",
+ "| value_loss | 4.43e-06 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.3 |\n",
+ "| ep_rew_mean | -47.3 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 37200 |\n",
+ "| time_elapsed | 2660 |\n",
+ "| total_timesteps | 744000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.78 |\n",
+ "| explained_variance | 0.9655469 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 37199 |\n",
+ "| policy_loss | 7.71e-05 |\n",
+ "| std | 0.624 |\n",
+ "| value_loss | 1.54e-05 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 46.9 |\n",
+ "| ep_rew_mean | -46.8 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 37300 |\n",
+ "| time_elapsed | 2666 |\n",
+ "| total_timesteps | 746000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.78 |\n",
+ "| explained_variance | 0.92692417 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 37299 |\n",
+ "| policy_loss | -0.00408 |\n",
+ "| std | 0.624 |\n",
+ "| value_loss | 5.12e-06 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.6 |\n",
+ "| ep_rew_mean | -47.5 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 37400 |\n",
+ "| time_elapsed | 2673 |\n",
+ "| total_timesteps | 748000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.78 |\n",
+ "| explained_variance | 0.85534066 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 37399 |\n",
+ "| policy_loss | -0.00534 |\n",
+ "| std | 0.624 |\n",
+ "| value_loss | 6.73e-06 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.5 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 37500 |\n",
+ "| time_elapsed | 2684 |\n",
+ "| total_timesteps | 750000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.77 |\n",
+ "| explained_variance | 0.91903675 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 37499 |\n",
+ "| policy_loss | -0.00187 |\n",
+ "| std | 0.623 |\n",
+ "| value_loss | 2.31e-06 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48 |\n",
+ "| ep_rew_mean | -48 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 37600 |\n",
+ "| time_elapsed | 2690 |\n",
+ "| total_timesteps | 752000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.76 |\n",
+ "| explained_variance | 0.9927211 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 37599 |\n",
+ "| policy_loss | 0.00225 |\n",
+ "| std | 0.62 |\n",
+ "| value_loss | 1.23e-06 |\n",
+ "-------------------------------------\n",
+ "------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.5 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 37700 |\n",
+ "| time_elapsed | 2697 |\n",
+ "| total_timesteps | 754000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.76 |\n",
+ "| explained_variance | 0.961677 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 37699 |\n",
+ "| policy_loss | 0.00138 |\n",
+ "| std | 0.621 |\n",
+ "| value_loss | 1.04e-06 |\n",
+ "------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49.4 |\n",
+ "| ep_rew_mean | -49.4 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 37800 |\n",
+ "| time_elapsed | 2703 |\n",
+ "| total_timesteps | 756000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.76 |\n",
+ "| explained_variance | 0.8840703 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 37799 |\n",
+ "| policy_loss | -0.00133 |\n",
+ "| std | 0.621 |\n",
+ "| value_loss | 5.31e-06 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49.4 |\n",
+ "| ep_rew_mean | -49.4 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 37900 |\n",
+ "| time_elapsed | 2710 |\n",
+ "| total_timesteps | 758000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.76 |\n",
+ "| explained_variance | 0.9751732 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 37899 |\n",
+ "| policy_loss | 0.0017 |\n",
+ "| std | 0.621 |\n",
+ "| value_loss | 8.42e-07 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.9 |\n",
+ "| ep_rew_mean | -48.9 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 38000 |\n",
+ "| time_elapsed | 2720 |\n",
+ "| total_timesteps | 760000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.76 |\n",
+ "| explained_variance | 0.90713525 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 37999 |\n",
+ "| policy_loss | -0.000299 |\n",
+ "| std | 0.62 |\n",
+ "| value_loss | 2.18e-06 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -49 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 38100 |\n",
+ "| time_elapsed | 2727 |\n",
+ "| total_timesteps | 762000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.75 |\n",
+ "| explained_variance | 0.97773933 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 38099 |\n",
+ "| policy_loss | 0.00071 |\n",
+ "| std | 0.62 |\n",
+ "| value_loss | 7.3e-07 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 46.8 |\n",
+ "| ep_rew_mean | -46.8 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 38200 |\n",
+ "| time_elapsed | 2733 |\n",
+ "| total_timesteps | 764000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.74 |\n",
+ "| explained_variance | 0.85500395 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 38199 |\n",
+ "| policy_loss | -0.0188 |\n",
+ "| std | 0.616 |\n",
+ "| value_loss | 0.000115 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.3 |\n",
+ "| ep_rew_mean | -47.3 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 38300 |\n",
+ "| time_elapsed | 2740 |\n",
+ "| total_timesteps | 766000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.73 |\n",
+ "| explained_variance | 0.9707148 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 38299 |\n",
+ "| policy_loss | -0.00259 |\n",
+ "| std | 0.616 |\n",
+ "| value_loss | 7.38e-06 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.1 |\n",
+ "| ep_rew_mean | -47 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 38400 |\n",
+ "| time_elapsed | 2751 |\n",
+ "| total_timesteps | 768000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.73 |\n",
+ "| explained_variance | 0.9092056 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 38399 |\n",
+ "| policy_loss | -0.00333 |\n",
+ "| std | 0.615 |\n",
+ "| value_loss | 4.96e-06 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.5 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 38500 |\n",
+ "| time_elapsed | 2757 |\n",
+ "| total_timesteps | 770000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.73 |\n",
+ "| explained_variance | 0.98466456 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 38499 |\n",
+ "| policy_loss | 0.0108 |\n",
+ "| std | 0.616 |\n",
+ "| value_loss | 1.65e-05 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.5 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 38600 |\n",
+ "| time_elapsed | 2764 |\n",
+ "| total_timesteps | 772000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.73 |\n",
+ "| explained_variance | 0.4182393 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 38599 |\n",
+ "| policy_loss | 0.0304 |\n",
+ "| std | 0.615 |\n",
+ "| value_loss | 0.000218 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49.1 |\n",
+ "| ep_rew_mean | -49.1 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 38700 |\n",
+ "| time_elapsed | 2771 |\n",
+ "| total_timesteps | 774000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.72 |\n",
+ "| explained_variance | 0.86738527 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 38699 |\n",
+ "| policy_loss | -0.00272 |\n",
+ "| std | 0.615 |\n",
+ "| value_loss | 3.13e-06 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.7 |\n",
+ "| ep_rew_mean | -47.6 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 38800 |\n",
+ "| time_elapsed | 2777 |\n",
+ "| total_timesteps | 776000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.69 |\n",
+ "| explained_variance | 0.92607296 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 38799 |\n",
+ "| policy_loss | -0.00675 |\n",
+ "| std | 0.61 |\n",
+ "| value_loss | 1.55e-05 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.2 |\n",
+ "| ep_rew_mean | -47.2 |\n",
+ "| time/ | |\n",
+ "| fps | 278 |\n",
+ "| iterations | 38900 |\n",
+ "| time_elapsed | 2788 |\n",
+ "| total_timesteps | 778000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.68 |\n",
+ "| explained_variance | 0.3575865 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 38899 |\n",
+ "| policy_loss | 0.105 |\n",
+ "| std | 0.61 |\n",
+ "| value_loss | 0.00223 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.3 |\n",
+ "| ep_rew_mean | -47.2 |\n",
+ "| time/ | |\n",
+ "| fps | 278 |\n",
+ "| iterations | 39000 |\n",
+ "| time_elapsed | 2795 |\n",
+ "| total_timesteps | 780000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.69 |\n",
+ "| explained_variance | 0.9598239 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 38999 |\n",
+ "| policy_loss | -0.00267 |\n",
+ "| std | 0.61 |\n",
+ "| value_loss | 5.62e-06 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.6 |\n",
+ "| ep_rew_mean | -48.6 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 39100 |\n",
+ "| time_elapsed | 2802 |\n",
+ "| total_timesteps | 782000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.69 |\n",
+ "| explained_variance | 0.8572078 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 39099 |\n",
+ "| policy_loss | 0.00538 |\n",
+ "| std | 0.61 |\n",
+ "| value_loss | 5.86e-06 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49.5 |\n",
+ "| ep_rew_mean | -49.5 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 39200 |\n",
+ "| time_elapsed | 2809 |\n",
+ "| total_timesteps | 784000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.68 |\n",
+ "| explained_variance | 0.86322427 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 39199 |\n",
+ "| policy_loss | -0.000962 |\n",
+ "| std | 0.61 |\n",
+ "| value_loss | 1.02e-05 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49.5 |\n",
+ "| ep_rew_mean | -49.5 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 39300 |\n",
+ "| time_elapsed | 2815 |\n",
+ "| total_timesteps | 786000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.67 |\n",
+ "| explained_variance | 0.98482674 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 39299 |\n",
+ "| policy_loss | -0.00257 |\n",
+ "| std | 0.607 |\n",
+ "| value_loss | 2.04e-06 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.2 |\n",
+ "| ep_rew_mean | -47.1 |\n",
+ "| time/ | |\n",
+ "| fps | 278 |\n",
+ "| iterations | 39400 |\n",
+ "| time_elapsed | 2826 |\n",
+ "| total_timesteps | 788000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.64 |\n",
+ "| explained_variance | 0.98814607 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 39399 |\n",
+ "| policy_loss | -0.0132 |\n",
+ "| std | 0.603 |\n",
+ "| value_loss | 1.93e-05 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 45.7 |\n",
+ "| ep_rew_mean | -45.6 |\n",
+ "| time/ | |\n",
+ "| fps | 278 |\n",
+ "| iterations | 39500 |\n",
+ "| time_elapsed | 2832 |\n",
+ "| total_timesteps | 790000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.63 |\n",
+ "| explained_variance | 0.75104976 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 39499 |\n",
+ "| policy_loss | -0.0149 |\n",
+ "| std | 0.601 |\n",
+ "| value_loss | 4.16e-05 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.2 |\n",
+ "| ep_rew_mean | -47.1 |\n",
+ "| time/ | |\n",
+ "| fps | 278 |\n",
+ "| iterations | 39600 |\n",
+ "| time_elapsed | 2839 |\n",
+ "| total_timesteps | 792000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.61 |\n",
+ "| explained_variance | 0.9826381 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 39599 |\n",
+ "| policy_loss | -0.00721 |\n",
+ "| std | 0.599 |\n",
+ "| value_loss | 8.65e-06 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 46.9 |\n",
+ "| ep_rew_mean | -46.9 |\n",
+ "| time/ | |\n",
+ "| fps | 278 |\n",
+ "| iterations | 39700 |\n",
+ "| time_elapsed | 2846 |\n",
+ "| total_timesteps | 794000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.59 |\n",
+ "| explained_variance | 0.91662145 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 39699 |\n",
+ "| policy_loss | 0.00448 |\n",
+ "| std | 0.596 |\n",
+ "| value_loss | 5.06e-06 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.1 |\n",
+ "| ep_rew_mean | -48.1 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 39800 |\n",
+ "| time_elapsed | 2852 |\n",
+ "| total_timesteps | 796000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.59 |\n",
+ "| explained_variance | 0.97679144 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 39799 |\n",
+ "| policy_loss | 0.00058 |\n",
+ "| std | 0.595 |\n",
+ "| value_loss | 1.8e-06 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.1 |\n",
+ "| ep_rew_mean | -48.1 |\n",
+ "| time/ | |\n",
+ "| fps | 278 |\n",
+ "| iterations | 39900 |\n",
+ "| time_elapsed | 2863 |\n",
+ "| total_timesteps | 798000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.58 |\n",
+ "| explained_variance | 0.99432683 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 39899 |\n",
+ "| policy_loss | -0.00551 |\n",
+ "| std | 0.593 |\n",
+ "| value_loss | 2.67e-06 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.3 |\n",
+ "| ep_rew_mean | -48.2 |\n",
+ "| time/ | |\n",
+ "| fps | 278 |\n",
+ "| iterations | 40000 |\n",
+ "| time_elapsed | 2870 |\n",
+ "| total_timesteps | 800000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.58 |\n",
+ "| explained_variance | 0.98825186 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 39999 |\n",
+ "| policy_loss | -0.002 |\n",
+ "| std | 0.594 |\n",
+ "| value_loss | 1.8e-06 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.4 |\n",
+ "| ep_rew_mean | -48.3 |\n",
+ "| time/ | |\n",
+ "| fps | 278 |\n",
+ "| iterations | 40100 |\n",
+ "| time_elapsed | 2876 |\n",
+ "| total_timesteps | 802000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.57 |\n",
+ "| explained_variance | 0.06861681 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 40099 |\n",
+ "| policy_loss | 0.0233 |\n",
+ "| std | 0.592 |\n",
+ "| value_loss | 8.39e-05 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.3 |\n",
+ "| ep_rew_mean | -48.3 |\n",
+ "| time/ | |\n",
+ "| fps | 278 |\n",
+ "| iterations | 40200 |\n",
+ "| time_elapsed | 2883 |\n",
+ "| total_timesteps | 804000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.57 |\n",
+ "| explained_variance | 0.9904497 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 40199 |\n",
+ "| policy_loss | -0.00605 |\n",
+ "| std | 0.592 |\n",
+ "| value_loss | 8.11e-06 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49.1 |\n",
+ "| ep_rew_mean | -49.1 |\n",
+ "| time/ | |\n",
+ "| fps | 278 |\n",
+ "| iterations | 40300 |\n",
+ "| time_elapsed | 2889 |\n",
+ "| total_timesteps | 806000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.55 |\n",
+ "| explained_variance | 0.95933515 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 40299 |\n",
+ "| policy_loss | 0.00178 |\n",
+ "| std | 0.591 |\n",
+ "| value_loss | 6.86e-07 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -49 |\n",
+ "| time/ | |\n",
+ "| fps | 278 |\n",
+ "| iterations | 40400 |\n",
+ "| time_elapsed | 2899 |\n",
+ "| total_timesteps | 808000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.56 |\n",
+ "| explained_variance | 0.97932297 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 40399 |\n",
+ "| policy_loss | 0.0017 |\n",
+ "| std | 0.591 |\n",
+ "| value_loss | 1.05e-06 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.6 |\n",
+ "| ep_rew_mean | -47.5 |\n",
+ "| time/ | |\n",
+ "| fps | 278 |\n",
+ "| iterations | 40500 |\n",
+ "| time_elapsed | 2905 |\n",
+ "| total_timesteps | 810000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.57 |\n",
+ "| explained_variance | 0.9931614 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 40499 |\n",
+ "| policy_loss | 9.6e-05 |\n",
+ "| std | 0.593 |\n",
+ "| value_loss | 2.44e-06 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.6 |\n",
+ "| ep_rew_mean | -47.5 |\n",
+ "| time/ | |\n",
+ "| fps | 278 |\n",
+ "| iterations | 40600 |\n",
+ "| time_elapsed | 2911 |\n",
+ "| total_timesteps | 812000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.58 |\n",
+ "| explained_variance | 0.72751546 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 40599 |\n",
+ "| policy_loss | 0.00225 |\n",
+ "| std | 0.595 |\n",
+ "| value_loss | 4.93e-05 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.6 |\n",
+ "| ep_rew_mean | -47.5 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 40700 |\n",
+ "| time_elapsed | 2917 |\n",
+ "| total_timesteps | 814000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.58 |\n",
+ "| explained_variance | 0.9609206 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 40699 |\n",
+ "| policy_loss | 0.00484 |\n",
+ "| std | 0.595 |\n",
+ "| value_loss | 1.85e-05 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48 |\n",
+ "| ep_rew_mean | -48 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 40800 |\n",
+ "| time_elapsed | 2923 |\n",
+ "| total_timesteps | 816000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.59 |\n",
+ "| explained_variance | 0.9776916 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 40799 |\n",
+ "| policy_loss | -0.00179 |\n",
+ "| std | 0.596 |\n",
+ "| value_loss | 6.04e-07 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48 |\n",
+ "| ep_rew_mean | -48 |\n",
+ "| time/ | |\n",
+ "| fps | 278 |\n",
+ "| iterations | 40900 |\n",
+ "| time_elapsed | 2932 |\n",
+ "| total_timesteps | 818000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.59 |\n",
+ "| explained_variance | 0.95068985 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 40899 |\n",
+ "| policy_loss | 0.00321 |\n",
+ "| std | 0.596 |\n",
+ "| value_loss | 4.92e-06 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -49 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 41000 |\n",
+ "| time_elapsed | 2938 |\n",
+ "| total_timesteps | 820000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.59 |\n",
+ "| explained_variance | 0.94147617 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 40999 |\n",
+ "| policy_loss | 0.00213 |\n",
+ "| std | 0.596 |\n",
+ "| value_loss | 3.03e-06 |\n",
+ "--------------------------------------\n",
+ "---------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 50 |\n",
+ "| ep_rew_mean | -50 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 41100 |\n",
+ "| time_elapsed | 2944 |\n",
+ "| total_timesteps | 822000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.57 |\n",
+ "| explained_variance | -0.12963355 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 41099 |\n",
+ "| policy_loss | 0.0255 |\n",
+ "| std | 0.593 |\n",
+ "| value_loss | 0.000291 |\n",
+ "---------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 50 |\n",
+ "| ep_rew_mean | -50 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 41200 |\n",
+ "| time_elapsed | 2950 |\n",
+ "| total_timesteps | 824000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.57 |\n",
+ "| explained_variance | 0.5466497 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 41199 |\n",
+ "| policy_loss | -0.000548 |\n",
+ "| std | 0.593 |\n",
+ "| value_loss | 1.48e-06 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -49 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 41300 |\n",
+ "| time_elapsed | 2956 |\n",
+ "| total_timesteps | 826000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.57 |\n",
+ "| explained_variance | 0.9527324 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 41299 |\n",
+ "| policy_loss | 0.000149 |\n",
+ "| std | 0.593 |\n",
+ "| value_loss | 1.14e-06 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.5 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 41400 |\n",
+ "| time_elapsed | 2966 |\n",
+ "| total_timesteps | 828000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.53 |\n",
+ "| explained_variance | 0.8947157 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 41399 |\n",
+ "| policy_loss | -0.000977 |\n",
+ "| std | 0.588 |\n",
+ "| value_loss | 3.34e-06 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -49 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 41500 |\n",
+ "| time_elapsed | 2972 |\n",
+ "| total_timesteps | 830000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.52 |\n",
+ "| explained_variance | 0.9557045 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 41499 |\n",
+ "| policy_loss | -0.00129 |\n",
+ "| std | 0.586 |\n",
+ "| value_loss | 2.02e-06 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.8 |\n",
+ "| ep_rew_mean | -48.8 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 41600 |\n",
+ "| time_elapsed | 2978 |\n",
+ "| total_timesteps | 832000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.51 |\n",
+ "| explained_variance | 0.86566734 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 41599 |\n",
+ "| policy_loss | -0.00558 |\n",
+ "| std | 0.585 |\n",
+ "| value_loss | 8.27e-06 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.8 |\n",
+ "| ep_rew_mean | -47.8 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 41700 |\n",
+ "| time_elapsed | 2984 |\n",
+ "| total_timesteps | 834000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.5 |\n",
+ "| explained_variance | 0.9876392 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 41699 |\n",
+ "| policy_loss | -0.00139 |\n",
+ "| std | 0.583 |\n",
+ "| value_loss | 1.51e-06 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.8 |\n",
+ "| ep_rew_mean | -47.8 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 41800 |\n",
+ "| time_elapsed | 2990 |\n",
+ "| total_timesteps | 836000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.5 |\n",
+ "| explained_variance | 0.94977826 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 41799 |\n",
+ "| policy_loss | -0.00302 |\n",
+ "| std | 0.583 |\n",
+ "| value_loss | 4.13e-06 |\n",
+ "--------------------------------------\n",
+ "---------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -49 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 41900 |\n",
+ "| time_elapsed | 2996 |\n",
+ "| total_timesteps | 838000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.5 |\n",
+ "| explained_variance | -0.17895353 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 41899 |\n",
+ "| policy_loss | -0.000211 |\n",
+ "| std | 0.583 |\n",
+ "| value_loss | 8.77e-06 |\n",
+ "---------------------------------------\n",
+ "------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49.5 |\n",
+ "| ep_rew_mean | -49.5 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 42000 |\n",
+ "| time_elapsed | 3005 |\n",
+ "| total_timesteps | 840000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.5 |\n",
+ "| explained_variance | -16.5786 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 41999 |\n",
+ "| policy_loss | -0.00635 |\n",
+ "| std | 0.582 |\n",
+ "| value_loss | 1.32e-05 |\n",
+ "------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -49 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 42100 |\n",
+ "| time_elapsed | 3011 |\n",
+ "| total_timesteps | 842000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.51 |\n",
+ "| explained_variance | 0.9696886 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 42099 |\n",
+ "| policy_loss | -0.000619 |\n",
+ "| std | 0.585 |\n",
+ "| value_loss | 9.09e-07 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.9 |\n",
+ "| ep_rew_mean | -47.8 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 42200 |\n",
+ "| time_elapsed | 3017 |\n",
+ "| total_timesteps | 844000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.5 |\n",
+ "| explained_variance | 0.9301016 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 42199 |\n",
+ "| policy_loss | -0.00234 |\n",
+ "| std | 0.583 |\n",
+ "| value_loss | 2.62e-06 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.7 |\n",
+ "| ep_rew_mean | -47.7 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 42300 |\n",
+ "| time_elapsed | 3023 |\n",
+ "| total_timesteps | 846000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.5 |\n",
+ "| explained_variance | 0.5389772 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 42299 |\n",
+ "| policy_loss | 0.00977 |\n",
+ "| std | 0.583 |\n",
+ "| value_loss | 2.91e-05 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49.4 |\n",
+ "| ep_rew_mean | -49.4 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 42400 |\n",
+ "| time_elapsed | 3029 |\n",
+ "| total_timesteps | 848000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.5 |\n",
+ "| explained_variance | 0.97215235 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 42399 |\n",
+ "| policy_loss | 0.000407 |\n",
+ "| std | 0.583 |\n",
+ "| value_loss | 4.69e-07 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48 |\n",
+ "| ep_rew_mean | -47.9 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 42500 |\n",
+ "| time_elapsed | 3039 |\n",
+ "| total_timesteps | 850000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.49 |\n",
+ "| explained_variance | 0.98842454 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 42499 |\n",
+ "| policy_loss | -0.00272 |\n",
+ "| std | 0.581 |\n",
+ "| value_loss | 3.38e-06 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 46.7 |\n",
+ "| ep_rew_mean | -46.6 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 42600 |\n",
+ "| time_elapsed | 3045 |\n",
+ "| total_timesteps | 852000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.46 |\n",
+ "| explained_variance | 0.96813923 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 42599 |\n",
+ "| policy_loss | -0.000448 |\n",
+ "| std | 0.577 |\n",
+ "| value_loss | 1.41e-05 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.1 |\n",
+ "| ep_rew_mean | -47.1 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 42700 |\n",
+ "| time_elapsed | 3051 |\n",
+ "| total_timesteps | 854000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.46 |\n",
+ "| explained_variance | 0.98248726 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 42699 |\n",
+ "| policy_loss | -0.000134 |\n",
+ "| std | 0.577 |\n",
+ "| value_loss | 4.33e-06 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.5 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 42800 |\n",
+ "| time_elapsed | 3058 |\n",
+ "| total_timesteps | 856000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.48 |\n",
+ "| explained_variance | 0.39102143 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 42799 |\n",
+ "| policy_loss | 0.013 |\n",
+ "| std | 0.579 |\n",
+ "| value_loss | 0.000239 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -49 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 42900 |\n",
+ "| time_elapsed | 3063 |\n",
+ "| total_timesteps | 858000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.48 |\n",
+ "| explained_variance | 0.9779079 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 42899 |\n",
+ "| policy_loss | 0.000379 |\n",
+ "| std | 0.58 |\n",
+ "| value_loss | 1.16e-06 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.1 |\n",
+ "| ep_rew_mean | -47 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 43000 |\n",
+ "| time_elapsed | 3073 |\n",
+ "| total_timesteps | 860000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.44 |\n",
+ "| explained_variance | 0.61227715 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 42999 |\n",
+ "| policy_loss | 0.0209 |\n",
+ "| std | 0.575 |\n",
+ "| value_loss | 7.89e-05 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.1 |\n",
+ "| ep_rew_mean | -47 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 43100 |\n",
+ "| time_elapsed | 3079 |\n",
+ "| total_timesteps | 862000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.42 |\n",
+ "| explained_variance | 0.9107274 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 43099 |\n",
+ "| policy_loss | 0.00381 |\n",
+ "| std | 0.571 |\n",
+ "| value_loss | 1.34e-05 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.1 |\n",
+ "| ep_rew_mean | -47 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 43200 |\n",
+ "| time_elapsed | 3085 |\n",
+ "| total_timesteps | 864000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.42 |\n",
+ "| explained_variance | 0.8499373 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 43199 |\n",
+ "| policy_loss | 8.45e-05 |\n",
+ "| std | 0.572 |\n",
+ "| value_loss | 4.13e-06 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48 |\n",
+ "| ep_rew_mean | -48 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 43300 |\n",
+ "| time_elapsed | 3091 |\n",
+ "| total_timesteps | 866000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.41 |\n",
+ "| explained_variance | 0.9820314 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 43299 |\n",
+ "| policy_loss | -0.000137 |\n",
+ "| std | 0.57 |\n",
+ "| value_loss | 1.03e-06 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.5 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 43400 |\n",
+ "| time_elapsed | 3098 |\n",
+ "| total_timesteps | 868000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.42 |\n",
+ "| explained_variance | 0.9914655 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 43399 |\n",
+ "| policy_loss | 0.00113 |\n",
+ "| std | 0.571 |\n",
+ "| value_loss | 1.02e-06 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.5 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 43500 |\n",
+ "| time_elapsed | 3104 |\n",
+ "| total_timesteps | 870000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.39 |\n",
+ "| explained_variance | 0.9956513 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 43499 |\n",
+ "| policy_loss | -0.00204 |\n",
+ "| std | 0.568 |\n",
+ "| value_loss | 1.7e-06 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -49 |\n",
+ "| time/ | |\n",
+ "| fps | 279 |\n",
+ "| iterations | 43600 |\n",
+ "| time_elapsed | 3115 |\n",
+ "| total_timesteps | 872000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.41 |\n",
+ "| explained_variance | 0.3942523 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 43599 |\n",
+ "| policy_loss | 0.00775 |\n",
+ "| std | 0.57 |\n",
+ "| value_loss | 1.4e-05 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -49 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 43700 |\n",
+ "| time_elapsed | 3121 |\n",
+ "| total_timesteps | 874000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.42 |\n",
+ "| explained_variance | 0.9855038 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 43699 |\n",
+ "| policy_loss | 0.00744 |\n",
+ "| std | 0.571 |\n",
+ "| value_loss | 6.56e-06 |\n",
+ "-------------------------------------\n",
+ "---------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.2 |\n",
+ "| ep_rew_mean | -47.2 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 43800 |\n",
+ "| time_elapsed | 3127 |\n",
+ "| total_timesteps | 876000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.41 |\n",
+ "| explained_variance | 0.008309901 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 43799 |\n",
+ "| policy_loss | 1.97 |\n",
+ "| std | 0.572 |\n",
+ "| value_loss | 3.85 |\n",
+ "---------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.7 |\n",
+ "| ep_rew_mean | -47.7 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 43900 |\n",
+ "| time_elapsed | 3133 |\n",
+ "| total_timesteps | 878000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.41 |\n",
+ "| explained_variance | 0.77352124 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 43899 |\n",
+ "| policy_loss | -0.00231 |\n",
+ "| std | 0.571 |\n",
+ "| value_loss | 7.97e-06 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.7 |\n",
+ "| ep_rew_mean | -48.7 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 44000 |\n",
+ "| time_elapsed | 3139 |\n",
+ "| total_timesteps | 880000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.42 |\n",
+ "| explained_variance | 0.27089834 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 43999 |\n",
+ "| policy_loss | 0.0011 |\n",
+ "| std | 0.572 |\n",
+ "| value_loss | 3.99e-06 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49.5 |\n",
+ "| ep_rew_mean | -49.5 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 44100 |\n",
+ "| time_elapsed | 3148 |\n",
+ "| total_timesteps | 882000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.41 |\n",
+ "| explained_variance | 0.8952299 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 44099 |\n",
+ "| policy_loss | -0.000954 |\n",
+ "| std | 0.571 |\n",
+ "| value_loss | 1.77e-06 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -49 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 44200 |\n",
+ "| time_elapsed | 3154 |\n",
+ "| total_timesteps | 884000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.39 |\n",
+ "| explained_variance | 0.99033046 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 44199 |\n",
+ "| policy_loss | 0.00178 |\n",
+ "| std | 0.568 |\n",
+ "| value_loss | 1.12e-06 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.5 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 44300 |\n",
+ "| time_elapsed | 3160 |\n",
+ "| total_timesteps | 886000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.39 |\n",
+ "| explained_variance | 0.8583008 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 44299 |\n",
+ "| policy_loss | -0.0018 |\n",
+ "| std | 0.568 |\n",
+ "| value_loss | 2.16e-06 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.3 |\n",
+ "| ep_rew_mean | -48.3 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 44400 |\n",
+ "| time_elapsed | 3167 |\n",
+ "| total_timesteps | 888000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.37 |\n",
+ "| explained_variance | 0.99091053 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 44399 |\n",
+ "| policy_loss | 0.00132 |\n",
+ "| std | 0.564 |\n",
+ "| value_loss | 3.64e-06 |\n",
+ "--------------------------------------\n",
+ "---------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.9 |\n",
+ "| ep_rew_mean | -47.8 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 44500 |\n",
+ "| time_elapsed | 3173 |\n",
+ "| total_timesteps | 890000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.37 |\n",
+ "| explained_variance | 0.002827227 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 44499 |\n",
+ "| policy_loss | 1.55 |\n",
+ "| std | 0.565 |\n",
+ "| value_loss | 3.89 |\n",
+ "---------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.1 |\n",
+ "| ep_rew_mean | -47.1 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 44600 |\n",
+ "| time_elapsed | 3182 |\n",
+ "| total_timesteps | 892000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.37 |\n",
+ "| explained_variance | 0.06923187 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 44599 |\n",
+ "| policy_loss | -0.0339 |\n",
+ "| std | 0.565 |\n",
+ "| value_loss | 0.000171 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.3 |\n",
+ "| ep_rew_mean | -47.3 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 44700 |\n",
+ "| time_elapsed | 3188 |\n",
+ "| total_timesteps | 894000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.37 |\n",
+ "| explained_variance | -21.586582 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 44699 |\n",
+ "| policy_loss | -0.0592 |\n",
+ "| std | 0.567 |\n",
+ "| value_loss | 0.00248 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.8 |\n",
+ "| ep_rew_mean | -47.8 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 44800 |\n",
+ "| time_elapsed | 3194 |\n",
+ "| total_timesteps | 896000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.39 |\n",
+ "| explained_variance | 0.9866412 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 44799 |\n",
+ "| policy_loss | 0.000868 |\n",
+ "| std | 0.568 |\n",
+ "| value_loss | 1.84e-06 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49.5 |\n",
+ "| ep_rew_mean | -49.5 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 44900 |\n",
+ "| time_elapsed | 3200 |\n",
+ "| total_timesteps | 898000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.39 |\n",
+ "| explained_variance | 0.9051938 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 44899 |\n",
+ "| policy_loss | 0.00217 |\n",
+ "| std | 0.567 |\n",
+ "| value_loss | 7.65e-07 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 50 |\n",
+ "| ep_rew_mean | -50 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 45000 |\n",
+ "| time_elapsed | 3206 |\n",
+ "| total_timesteps | 900000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.39 |\n",
+ "| explained_variance | 0.6483333 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 44999 |\n",
+ "| policy_loss | 0.000492 |\n",
+ "| std | 0.568 |\n",
+ "| value_loss | 8.77e-06 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.6 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 45100 |\n",
+ "| time_elapsed | 3216 |\n",
+ "| total_timesteps | 902000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.36 |\n",
+ "| explained_variance | 0.9926092 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 45099 |\n",
+ "| policy_loss | 0.000675 |\n",
+ "| std | 0.563 |\n",
+ "| value_loss | 6.19e-07 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.1 |\n",
+ "| ep_rew_mean | -48 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 45200 |\n",
+ "| time_elapsed | 3223 |\n",
+ "| total_timesteps | 904000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.36 |\n",
+ "| explained_variance | 0.99412566 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 45199 |\n",
+ "| policy_loss | 0.00288 |\n",
+ "| std | 0.564 |\n",
+ "| value_loss | 1.07e-06 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.2 |\n",
+ "| ep_rew_mean | -48.1 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 45300 |\n",
+ "| time_elapsed | 3229 |\n",
+ "| total_timesteps | 906000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.35 |\n",
+ "| explained_variance | 0.98885244 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 45299 |\n",
+ "| policy_loss | 0.00442 |\n",
+ "| std | 0.562 |\n",
+ "| value_loss | 2.03e-06 |\n",
+ "--------------------------------------\n",
+ "------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.1 |\n",
+ "| ep_rew_mean | -48.1 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 45400 |\n",
+ "| time_elapsed | 3235 |\n",
+ "| total_timesteps | 908000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.33 |\n",
+ "| explained_variance | 0.94177 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 45399 |\n",
+ "| policy_loss | -0.00149 |\n",
+ "| std | 0.559 |\n",
+ "| value_loss | 4.01e-06 |\n",
+ "------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.1 |\n",
+ "| ep_rew_mean | -47 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 45500 |\n",
+ "| time_elapsed | 3241 |\n",
+ "| total_timesteps | 910000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.33 |\n",
+ "| explained_variance | 0.9495938 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 45499 |\n",
+ "| policy_loss | -0.00533 |\n",
+ "| std | 0.558 |\n",
+ "| value_loss | 1.06e-05 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.1 |\n",
+ "| ep_rew_mean | -47 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 45600 |\n",
+ "| time_elapsed | 3248 |\n",
+ "| total_timesteps | 912000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.33 |\n",
+ "| explained_variance | 0.91553783 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 45599 |\n",
+ "| policy_loss | -0.000339 |\n",
+ "| std | 0.558 |\n",
+ "| value_loss | 3.84e-06 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48 |\n",
+ "| ep_rew_mean | -48 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 45700 |\n",
+ "| time_elapsed | 3257 |\n",
+ "| total_timesteps | 914000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.32 |\n",
+ "| explained_variance | 0.39803714 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 45699 |\n",
+ "| policy_loss | 0.0227 |\n",
+ "| std | 0.558 |\n",
+ "| value_loss | 0.000166 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -49 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 45800 |\n",
+ "| time_elapsed | 3264 |\n",
+ "| total_timesteps | 916000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.33 |\n",
+ "| explained_variance | 0.84038657 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 45799 |\n",
+ "| policy_loss | -0.00389 |\n",
+ "| std | 0.558 |\n",
+ "| value_loss | 1.08e-05 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -49 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 45900 |\n",
+ "| time_elapsed | 3270 |\n",
+ "| total_timesteps | 918000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.33 |\n",
+ "| explained_variance | 0.9584281 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 45899 |\n",
+ "| policy_loss | -0.00227 |\n",
+ "| std | 0.559 |\n",
+ "| value_loss | 1.63e-06 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49.5 |\n",
+ "| ep_rew_mean | -49.5 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 46000 |\n",
+ "| time_elapsed | 3276 |\n",
+ "| total_timesteps | 920000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.33 |\n",
+ "| explained_variance | 0.83016056 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 45999 |\n",
+ "| policy_loss | 0.00417 |\n",
+ "| std | 0.559 |\n",
+ "| value_loss | 3.25e-06 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -49 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 46100 |\n",
+ "| time_elapsed | 3283 |\n",
+ "| total_timesteps | 922000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.32 |\n",
+ "| explained_variance | 0.6802498 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 46099 |\n",
+ "| policy_loss | 0.00223 |\n",
+ "| std | 0.558 |\n",
+ "| value_loss | 3.24e-06 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -49 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 46200 |\n",
+ "| time_elapsed | 3292 |\n",
+ "| total_timesteps | 924000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.32 |\n",
+ "| explained_variance | 0.99392444 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 46199 |\n",
+ "| policy_loss | 0.0017 |\n",
+ "| std | 0.557 |\n",
+ "| value_loss | 9.97e-07 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.2 |\n",
+ "| ep_rew_mean | -48.1 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 46300 |\n",
+ "| time_elapsed | 3299 |\n",
+ "| total_timesteps | 926000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.32 |\n",
+ "| explained_variance | 0.11546546 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 46299 |\n",
+ "| policy_loss | -0.01 |\n",
+ "| std | 0.558 |\n",
+ "| value_loss | 8.48e-05 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.7 |\n",
+ "| ep_rew_mean | -48.6 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 46400 |\n",
+ "| time_elapsed | 3305 |\n",
+ "| total_timesteps | 928000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.35 |\n",
+ "| explained_variance | 0.9007289 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 46399 |\n",
+ "| policy_loss | 0.000824 |\n",
+ "| std | 0.561 |\n",
+ "| value_loss | 4.22e-07 |\n",
+ "-------------------------------------\n",
+ "---------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49.2 |\n",
+ "| ep_rew_mean | -49.1 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 46500 |\n",
+ "| time_elapsed | 3311 |\n",
+ "| total_timesteps | 930000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.35 |\n",
+ "| explained_variance | -0.48946536 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 46499 |\n",
+ "| policy_loss | 0.000401 |\n",
+ "| std | 0.561 |\n",
+ "| value_loss | 1.02e-06 |\n",
+ "---------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49.5 |\n",
+ "| ep_rew_mean | -49.5 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 46600 |\n",
+ "| time_elapsed | 3318 |\n",
+ "| total_timesteps | 932000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.35 |\n",
+ "| explained_variance | 0.9060283 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 46599 |\n",
+ "| policy_loss | 0.000685 |\n",
+ "| std | 0.561 |\n",
+ "| value_loss | 9.11e-06 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 50 |\n",
+ "| ep_rew_mean | -50 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 46700 |\n",
+ "| time_elapsed | 3327 |\n",
+ "| total_timesteps | 934000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.33 |\n",
+ "| explained_variance | 0.6672769 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 46699 |\n",
+ "| policy_loss | -0.00052 |\n",
+ "| std | 0.559 |\n",
+ "| value_loss | 1.51e-06 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -49 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 46800 |\n",
+ "| time_elapsed | 3334 |\n",
+ "| total_timesteps | 936000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.31 |\n",
+ "| explained_variance | 0.7833716 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 46799 |\n",
+ "| policy_loss | 0.000618 |\n",
+ "| std | 0.556 |\n",
+ "| value_loss | 9.54e-07 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.5 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 46900 |\n",
+ "| time_elapsed | 3340 |\n",
+ "| total_timesteps | 938000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.29 |\n",
+ "| explained_variance | 0.8197125 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 46899 |\n",
+ "| policy_loss | -0.00295 |\n",
+ "| std | 0.554 |\n",
+ "| value_loss | 2.3e-05 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.5 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 47000 |\n",
+ "| time_elapsed | 3347 |\n",
+ "| total_timesteps | 940000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.29 |\n",
+ "| explained_variance | 0.98894083 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 46999 |\n",
+ "| policy_loss | -8.07e-05 |\n",
+ "| std | 0.553 |\n",
+ "| value_loss | 1.47e-06 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -49 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 47100 |\n",
+ "| time_elapsed | 3353 |\n",
+ "| total_timesteps | 942000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.29 |\n",
+ "| explained_variance | 0.9561706 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 47099 |\n",
+ "| policy_loss | -0.00176 |\n",
+ "| std | 0.553 |\n",
+ "| value_loss | 3.15e-06 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49.7 |\n",
+ "| ep_rew_mean | -49.7 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 47200 |\n",
+ "| time_elapsed | 3362 |\n",
+ "| total_timesteps | 944000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.3 |\n",
+ "| explained_variance | 0.97445196 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 47199 |\n",
+ "| policy_loss | -0.00119 |\n",
+ "| std | 0.555 |\n",
+ "| value_loss | 2.77e-06 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.3 |\n",
+ "| ep_rew_mean | -47.2 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 47300 |\n",
+ "| time_elapsed | 3369 |\n",
+ "| total_timesteps | 946000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.27 |\n",
+ "| explained_variance | 0.75822085 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 47299 |\n",
+ "| policy_loss | 0.00987 |\n",
+ "| std | 0.551 |\n",
+ "| value_loss | 0.000115 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 46.6 |\n",
+ "| ep_rew_mean | -46.6 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 47400 |\n",
+ "| time_elapsed | 3375 |\n",
+ "| total_timesteps | 948000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.26 |\n",
+ "| explained_variance | 0.9148364 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 47399 |\n",
+ "| policy_loss | 0.00772 |\n",
+ "| std | 0.549 |\n",
+ "| value_loss | 8.81e-06 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.1 |\n",
+ "| ep_rew_mean | -47 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 47500 |\n",
+ "| time_elapsed | 3381 |\n",
+ "| total_timesteps | 950000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.27 |\n",
+ "| explained_variance | 0.96930254 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 47499 |\n",
+ "| policy_loss | -9.47e-06 |\n",
+ "| std | 0.551 |\n",
+ "| value_loss | 1.26e-06 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.5 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 281 |\n",
+ "| iterations | 47600 |\n",
+ "| time_elapsed | 3387 |\n",
+ "| total_timesteps | 952000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.27 |\n",
+ "| explained_variance | 0.7950085 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 47599 |\n",
+ "| policy_loss | -0.00454 |\n",
+ "| std | 0.55 |\n",
+ "| value_loss | 8.08e-06 |\n",
+ "-------------------------------------\n",
+ "------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.5 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 47700 |\n",
+ "| time_elapsed | 3397 |\n",
+ "| total_timesteps | 954000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.27 |\n",
+ "| explained_variance | 0.781541 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 47699 |\n",
+ "| policy_loss | 0.000465 |\n",
+ "| std | 0.551 |\n",
+ "| value_loss | 2.91e-06 |\n",
+ "------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49.5 |\n",
+ "| ep_rew_mean | -49.5 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 47800 |\n",
+ "| time_elapsed | 3403 |\n",
+ "| total_timesteps | 956000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.26 |\n",
+ "| explained_variance | 0.9763136 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 47799 |\n",
+ "| policy_loss | 0.000196 |\n",
+ "| std | 0.55 |\n",
+ "| value_loss | 1.38e-06 |\n",
+ "-------------------------------------\n",
+ "------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49.5 |\n",
+ "| ep_rew_mean | -49.5 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 47900 |\n",
+ "| time_elapsed | 3410 |\n",
+ "| total_timesteps | 958000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.28 |\n",
+ "| explained_variance | 0.961331 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 47899 |\n",
+ "| policy_loss | -0.00369 |\n",
+ "| std | 0.552 |\n",
+ "| value_loss | 3.65e-06 |\n",
+ "------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -49 |\n",
+ "| time/ | |\n",
+ "| fps | 281 |\n",
+ "| iterations | 48000 |\n",
+ "| time_elapsed | 3416 |\n",
+ "| total_timesteps | 960000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.26 |\n",
+ "| explained_variance | 0.94155157 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 47999 |\n",
+ "| policy_loss | -0.00508 |\n",
+ "| std | 0.55 |\n",
+ "| value_loss | 6.79e-06 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -49 |\n",
+ "| time/ | |\n",
+ "| fps | 281 |\n",
+ "| iterations | 48100 |\n",
+ "| time_elapsed | 3422 |\n",
+ "| total_timesteps | 962000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.27 |\n",
+ "| explained_variance | 0.95676875 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 48099 |\n",
+ "| policy_loss | 0.000654 |\n",
+ "| std | 0.551 |\n",
+ "| value_loss | 1.65e-06 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.5 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 48200 |\n",
+ "| time_elapsed | 3431 |\n",
+ "| total_timesteps | 964000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.26 |\n",
+ "| explained_variance | 0.94901574 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 48199 |\n",
+ "| policy_loss | 3.81e-06 |\n",
+ "| std | 0.55 |\n",
+ "| value_loss | 3.41e-07 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.1 |\n",
+ "| ep_rew_mean | -47 |\n",
+ "| time/ | |\n",
+ "| fps | 280 |\n",
+ "| iterations | 48300 |\n",
+ "| time_elapsed | 3437 |\n",
+ "| total_timesteps | 966000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.26 |\n",
+ "| explained_variance | 0.9897378 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 48299 |\n",
+ "| policy_loss | 0.00123 |\n",
+ "| std | 0.55 |\n",
+ "| value_loss | 2.62e-07 |\n",
+ "-------------------------------------\n",
+ "------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.5 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 281 |\n",
+ "| iterations | 48400 |\n",
+ "| time_elapsed | 3443 |\n",
+ "| total_timesteps | 968000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.27 |\n",
+ "| explained_variance | 0.967348 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 48399 |\n",
+ "| policy_loss | 0.00125 |\n",
+ "| std | 0.552 |\n",
+ "| value_loss | 4.61e-07 |\n",
+ "------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -49 |\n",
+ "| time/ | |\n",
+ "| fps | 281 |\n",
+ "| iterations | 48500 |\n",
+ "| time_elapsed | 3450 |\n",
+ "| total_timesteps | 970000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.25 |\n",
+ "| explained_variance | 0.9456974 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 48499 |\n",
+ "| policy_loss | -0.00673 |\n",
+ "| std | 0.548 |\n",
+ "| value_loss | 5.76e-06 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -49 |\n",
+ "| time/ | |\n",
+ "| fps | 281 |\n",
+ "| iterations | 48600 |\n",
+ "| time_elapsed | 3456 |\n",
+ "| total_timesteps | 972000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.24 |\n",
+ "| explained_variance | 0.81393284 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 48599 |\n",
+ "| policy_loss | 0.0018 |\n",
+ "| std | 0.548 |\n",
+ "| value_loss | 4.43e-07 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 47.9 |\n",
+ "| ep_rew_mean | -47.9 |\n",
+ "| time/ | |\n",
+ "| fps | 281 |\n",
+ "| iterations | 48700 |\n",
+ "| time_elapsed | 3462 |\n",
+ "| total_timesteps | 974000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.23 |\n",
+ "| explained_variance | 0.97745365 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 48699 |\n",
+ "| policy_loss | 0.00484 |\n",
+ "| std | 0.547 |\n",
+ "| value_loss | 3.19e-06 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.5 |\n",
+ "| ep_rew_mean | -48.4 |\n",
+ "| time/ | |\n",
+ "| fps | 281 |\n",
+ "| iterations | 48800 |\n",
+ "| time_elapsed | 3472 |\n",
+ "| total_timesteps | 976000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.23 |\n",
+ "| explained_variance | 0.9833449 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 48799 |\n",
+ "| policy_loss | 0.00261 |\n",
+ "| std | 0.546 |\n",
+ "| value_loss | 3.5e-06 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -48.9 |\n",
+ "| time/ | |\n",
+ "| fps | 281 |\n",
+ "| iterations | 48900 |\n",
+ "| time_elapsed | 3478 |\n",
+ "| total_timesteps | 978000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.23 |\n",
+ "| explained_variance | 0.96366274 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 48899 |\n",
+ "| policy_loss | 0.000591 |\n",
+ "| std | 0.546 |\n",
+ "| value_loss | 4.17e-07 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.6 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 281 |\n",
+ "| iterations | 49000 |\n",
+ "| time_elapsed | 3483 |\n",
+ "| total_timesteps | 980000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.22 |\n",
+ "| explained_variance | 0.9828003 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 48999 |\n",
+ "| policy_loss | -0.00217 |\n",
+ "| std | 0.545 |\n",
+ "| value_loss | 1.52e-06 |\n",
+ "-------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.5 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 281 |\n",
+ "| iterations | 49100 |\n",
+ "| time_elapsed | 3489 |\n",
+ "| total_timesteps | 982000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.23 |\n",
+ "| explained_variance | 0.9969808 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 49099 |\n",
+ "| policy_loss | 7.52e-05 |\n",
+ "| std | 0.546 |\n",
+ "| value_loss | 1.92e-07 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -49 |\n",
+ "| time/ | |\n",
+ "| fps | 281 |\n",
+ "| iterations | 49200 |\n",
+ "| time_elapsed | 3495 |\n",
+ "| total_timesteps | 984000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.22 |\n",
+ "| explained_variance | 0.98948133 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 49199 |\n",
+ "| policy_loss | -0.00204 |\n",
+ "| std | 0.545 |\n",
+ "| value_loss | 8.27e-07 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.1 |\n",
+ "| ep_rew_mean | -48.1 |\n",
+ "| time/ | |\n",
+ "| fps | 281 |\n",
+ "| iterations | 49300 |\n",
+ "| time_elapsed | 3505 |\n",
+ "| total_timesteps | 986000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.21 |\n",
+ "| explained_variance | 0.97018635 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 49299 |\n",
+ "| policy_loss | -0.000292 |\n",
+ "| std | 0.544 |\n",
+ "| value_loss | 4.47e-07 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.1 |\n",
+ "| ep_rew_mean | -48.1 |\n",
+ "| time/ | |\n",
+ "| fps | 281 |\n",
+ "| iterations | 49400 |\n",
+ "| time_elapsed | 3511 |\n",
+ "| total_timesteps | 988000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.21 |\n",
+ "| explained_variance | 0.9661637 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 49399 |\n",
+ "| policy_loss | -0.00206 |\n",
+ "| std | 0.544 |\n",
+ "| value_loss | 9.4e-06 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49.5 |\n",
+ "| ep_rew_mean | -49.5 |\n",
+ "| time/ | |\n",
+ "| fps | 281 |\n",
+ "| iterations | 49500 |\n",
+ "| time_elapsed | 3518 |\n",
+ "| total_timesteps | 990000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.2 |\n",
+ "| explained_variance | 0.82379425 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 49499 |\n",
+ "| policy_loss | 0.00211 |\n",
+ "| std | 0.543 |\n",
+ "| value_loss | 2.47e-06 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49 |\n",
+ "| ep_rew_mean | -49 |\n",
+ "| time/ | |\n",
+ "| fps | 281 |\n",
+ "| iterations | 49600 |\n",
+ "| time_elapsed | 3524 |\n",
+ "| total_timesteps | 992000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.19 |\n",
+ "| explained_variance | 0.99219644 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 49599 |\n",
+ "| policy_loss | -0.00165 |\n",
+ "| std | 0.542 |\n",
+ "| value_loss | 4.1e-07 |\n",
+ "--------------------------------------\n",
+ "-------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 49.5 |\n",
+ "| ep_rew_mean | -49.5 |\n",
+ "| time/ | |\n",
+ "| fps | 281 |\n",
+ "| iterations | 49700 |\n",
+ "| time_elapsed | 3530 |\n",
+ "| total_timesteps | 994000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.2 |\n",
+ "| explained_variance | 0.9896941 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 49699 |\n",
+ "| policy_loss | 0.000546 |\n",
+ "| std | 0.543 |\n",
+ "| value_loss | 1.62e-07 |\n",
+ "-------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48.5 |\n",
+ "| ep_rew_mean | -48.5 |\n",
+ "| time/ | |\n",
+ "| fps | 281 |\n",
+ "| iterations | 49800 |\n",
+ "| time_elapsed | 3540 |\n",
+ "| total_timesteps | 996000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.18 |\n",
+ "| explained_variance | 0.99164146 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 49799 |\n",
+ "| policy_loss | 0.000225 |\n",
+ "| std | 0.54 |\n",
+ "| value_loss | 4.13e-07 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48 |\n",
+ "| ep_rew_mean | -48 |\n",
+ "| time/ | |\n",
+ "| fps | 281 |\n",
+ "| iterations | 49900 |\n",
+ "| time_elapsed | 3545 |\n",
+ "| total_timesteps | 998000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.18 |\n",
+ "| explained_variance | 0.92336273 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 49899 |\n",
+ "| policy_loss | -0.00245 |\n",
+ "| std | 0.54 |\n",
+ "| value_loss | 8.85e-06 |\n",
+ "--------------------------------------\n",
+ "--------------------------------------\n",
+ "| rollout/ | |\n",
+ "| ep_len_mean | 48 |\n",
+ "| ep_rew_mean | -48 |\n",
+ "| time/ | |\n",
+ "| fps | 281 |\n",
+ "| iterations | 50000 |\n",
+ "| time_elapsed | 3551 |\n",
+ "| total_timesteps | 1000000 |\n",
+ "| train/ | |\n",
+ "| entropy_loss | -3.18 |\n",
+ "| explained_variance | 0.95652837 |\n",
+ "| learning_rate | 0.0007 |\n",
+ "| n_updates | 49999 |\n",
+ "| policy_loss | -0.00401 |\n",
+ "| std | 0.54 |\n",
+ "| value_loss | 3.22e-06 |\n",
+ "--------------------------------------\n",
+ "argv[0]=--background_color_red=0.8745098114013672\n",
+ "argv[1]=--background_color_green=0.21176470816135406\n",
+ "argv[2]=--background_color_blue=0.1764705926179886\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/home/mique/Desktop/Code/deep-rl-class/notebooks/unit6/venv-u6/lib/python3.10/site-packages/stable_baselines3/common/evaluation.py:67: UserWarning: Evaluation environment is not wrapped with a ``Monitor`` wrapper. This may result in reporting modified episode lengths and rewards, if other wrappers happen to modify these. Consider wrapping environment first with ``Monitor`` wrapper.\n",
+ " warnings.warn(\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Mean reward = -45.00 +/- 15.00\n",
+ "\u001b[38;5;4mℹ This function will save, evaluate, generate a video of your agent,\n",
+ "create a model card and push everything to the hub. It might take up to 1min.\n",
+ "This is a work in progress: if you encounter a bug, please open an issue.\u001b[0m\n",
+ "Saving video to /tmp/tmppn3lzgfu/-step-0-to-step-1000.mp4\n",
+ "MoviePy - Building video /tmp/tmppn3lzgfu/-step-0-to-step-1000.mp4.\n",
+ "MoviePy - Writing video /tmp/tmppn3lzgfu/-step-0-to-step-1000.mp4\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ " \r"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "MoviePy - Done !\n",
+ "MoviePy - video ready /tmp/tmppn3lzgfu/-step-0-to-step-1000.mp4\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "ffmpeg version 6.1.1-3ubuntu5 Copyright (c) 2000-2023 the FFmpeg developers\n",
+ " built with gcc 13 (Ubuntu 13.2.0-23ubuntu3)\n",
+ " configuration: --prefix=/usr --extra-version=3ubuntu5 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --disable-omx --enable-gnutls --enable-libaom --enable-libass --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libglslang --enable-libgme --enable-libgsm --enable-libharfbuzz --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-openal --enable-opencl --enable-opengl --disable-sndio --enable-libvpl --disable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-ladspa --enable-libbluray --enable-libjack --enable-libpulse --enable-librabbitmq --enable-librist --enable-libsrt --enable-libssh --enable-libsvtav1 --enable-libx264 --enable-libzmq --enable-libzvbi --enable-lv2 --enable-sdl2 --enable-libplacebo --enable-librav1e --enable-pocketsphinx --enable-librsvg --enable-libjxl --enable-shared\n",
+ " libavutil 58. 29.100 / 58. 29.100\n",
+ " libavcodec 60. 31.102 / 60. 31.102\n",
+ " libavformat 60. 16.100 / 60. 16.100\n",
+ " libavdevice 60. 3.100 / 60. 3.100\n",
+ " libavfilter 9. 12.100 / 9. 12.100\n",
+ " libswscale 7. 5.100 / 7. 5.100\n",
+ " libswresample 4. 12.100 / 4. 12.100\n",
+ " libpostproc 57. 3.100 / 57. 3.100\n",
+ "Input #0, mov,mp4,m4a,3gp,3g2,mj2, from '/tmp/tmppn3lzgfu/-step-0-to-step-1000.mp4':\n",
+ " Metadata:\n",
+ " major_brand : isom\n",
+ " minor_version : 512\n",
+ " compatible_brands: isomiso2avc1mp41\n",
+ " encoder : Lavf61.1.100\n",
+ " Duration: 00:00:40.00, start: 0.000000, bitrate: 190 kb/s\n",
+ " Stream #0:0[0x1](und): Video: h264 (High) (avc1 / 0x31637661), yuv420p(progressive), 720x480, 187 kb/s, 25 fps, 25 tbr, 12800 tbn (default)\n",
+ " Metadata:\n",
+ " handler_name : VideoHandler\n",
+ " vendor_id : [0][0][0][0]\n",
+ " encoder : Lavc61.3.100 libx264\n",
+ "Stream mapping:\n",
+ " Stream #0:0 -> #0:0 (h264 (native) -> h264 (libx264))\n",
+ "Press [q] to stop, [?] for help\n",
+ "[libx264 @ 0x5615de034a80] using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2\n",
+ "[libx264 @ 0x5615de034a80] profile High, level 3.0, 4:2:0, 8-bit\n",
+ "[libx264 @ 0x5615de034a80] 264 - core 164 r3108 31e19f9 - H.264/MPEG-4 AVC codec - Copyleft 2003-2023 - http://www.videolan.org/x264.html - options: cabac=1 ref=3 deblock=1:0:0 analyse=0x3:0x113 me=hex subme=7 psy=1 psy_rd=1.00:0.00 mixed_ref=1 me_range=16 chroma_me=1 trellis=1 8x8dct=1 cqm=0 deadzone=21,11 fast_pskip=1 chroma_qp_offset=-2 threads=15 lookahead_threads=2 sliced_threads=0 nr=0 decimate=1 interlaced=0 bluray_compat=0 constrained_intra=0 bframes=3 b_pyramid=2 b_adapt=1 b_bias=0 direct=1 weightb=1 open_gop=0 weightp=2 keyint=250 keyint_min=25 scenecut=40 intra_refresh=0 rc_lookahead=40 rc=crf mbtree=1 crf=23.0 qcomp=0.60 qpmin=0 qpmax=69 qpstep=4 ip_ratio=1.40 aq=1:1.00\n",
+ "Output #0, mp4, to '/tmp/tmp2wmkgvgp/replay.mp4':\n",
+ " Metadata:\n",
+ " major_brand : isom\n",
+ " minor_version : 512\n",
+ " compatible_brands: isomiso2avc1mp41\n",
+ " encoder : Lavf60.16.100\n",
+ " Stream #0:0(und): Video: h264 (avc1 / 0x31637661), yuv420p(progressive), 720x480, q=2-31, 25 fps, 12800 tbn (default)\n",
+ " Metadata:\n",
+ " handler_name : VideoHandler\n",
+ " vendor_id : [0][0][0][0]\n",
+ " encoder : Lavc60.31.102 libx264\n",
+ " Side data:\n",
+ " cpb: bitrate max/min/avg: 0/0/0 buffer size: 0 vbv_delay: N/A\n",
+ "[out#0/mp4 @ 0x5615ddfb0140] video:896kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 1.371167%\n",
+ "frame= 1000 fps=740 q=-1.0 Lsize= 908kB time=00:00:39.88 bitrate= 186.5kbits/s speed=29.5x \n",
+ "[libx264 @ 0x5615de034a80] frame I:4 Avg QP:17.50 size: 7558\n",
+ "[libx264 @ 0x5615de034a80] frame P:287 Avg QP:25.06 size: 1464\n",
+ "[libx264 @ 0x5615de034a80] frame B:709 Avg QP:25.16 size: 657\n",
+ "[libx264 @ 0x5615de034a80] consecutive B-frames: 2.6% 5.0% 10.8% 81.6%\n",
+ "[libx264 @ 0x5615de034a80] mb I I16..4: 3.1% 79.9% 17.0%\n",
+ "[libx264 @ 0x5615de034a80] mb P I16..4: 0.2% 1.6% 2.0% P16..4: 2.4% 1.4% 0.7% 0.0% 0.0% skip:91.7%\n",
+ "[libx264 @ 0x5615de034a80] mb B I16..4: 0.1% 0.2% 0.3% B16..8: 3.9% 1.3% 0.5% direct: 0.2% skip:93.5% L0:55.0% L1:42.9% BI: 2.2%\n",
+ "[libx264 @ 0x5615de034a80] 8x8 transform intra:46.3% inter:10.9%\n",
+ "[libx264 @ 0x5615de034a80] coded y,uvDC,uvAC intra: 32.2% 3.7% 0.9% inter: 0.9% 0.0% 0.0%\n",
+ "[libx264 @ 0x5615de034a80] i16 v,h,dc,p: 54% 24% 18% 4%\n",
+ "[libx264 @ 0x5615de034a80] i8 v,h,dc,ddl,ddr,vr,hd,vl,hu: 41% 12% 44% 1% 1% 0% 1% 0% 1%\n",
+ "[libx264 @ 0x5615de034a80] i4 v,h,dc,ddl,ddr,vr,hd,vl,hu: 25% 19% 28% 4% 5% 5% 7% 3% 5%\n",
+ "[libx264 @ 0x5615de034a80] i8c dc,h,v,p: 93% 3% 4% 0%\n",
+ "[libx264 @ 0x5615de034a80] Weighted P-Frames: Y:0.0% UV:0.0%\n",
+ "[libx264 @ 0x5615de034a80] ref P L0: 48.4% 5.0% 27.9% 18.8%\n",
+ "[libx264 @ 0x5615de034a80] ref B L0: 78.2% 14.5% 7.3%\n",
+ "[libx264 @ 0x5615de034a80] ref B L1: 96.4% 3.6%\n",
+ "[libx264 @ 0x5615de034a80] kb/s:183.28\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\u001b[38;5;4mℹ Pushing repo turbo-maikol/a2c-PandaPickAndPlace-v3 to the Hugging\n",
+ "Face Hub\u001b[0m\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Processing Files (0 / 0) : | | 0.00B / 0.00B \n",
+ "Processing Files (1 / 1) : 0%| | 1.26kB / 1.17MB, ???B/s \n",
+ "Processing Files (1 / 6) : 47%|████▋ | 545kB / 1.17MB, 680kB/s \n",
+ "Processing Files (1 / 6) : 93%|█████████▎| 1.09MB / 1.17MB, 1.09MB/s \n",
+ "Processing Files (6 / 6) : 100%|██████████| 1.17MB / 1.17MB, 837kB/s \n",
+ "Processing Files (6 / 6) : 100%|██████████| 1.17MB / 1.17MB, 586kB/s \n",
+ "New Data Upload : 100%|██████████| 1.17MB / 1.17MB, 586kB/s \n",
+ " ...ckAndPlace-v3/pytorch_variables.pth: 100%|██████████| 1.26kB / 1.26kB \n",
+ " ...ickAndPlace-v3/policy.optimizer.pth: 100%|██████████| 55.8kB / 55.8kB \n",
+ " ...a2c-PandaPickAndPlace-v3/policy.pth: 100%|██████████| 53.7kB / 53.7kB \n",
+ " ...mkgvgp/a2c-PandaPickAndPlace-v3.zip: 100%|██████████| 129kB / 129kB \n",
+ " /tmp/tmp2wmkgvgp/replay.mp4 : 100%|██████████| 930kB / 930kB \n",
+ " /tmp/tmp2wmkgvgp/vec_normalize.pkl : 100%|██████████| 2.95kB / 2.95kB \n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\u001b[38;5;4mℹ Your model is pushed to the Hub. You can view your model here:\n",
+ "https://huggingface.co/turbo-maikol/a2c-PandaPickAndPlace-v3/tree/main/\u001b[0m\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "CommitInfo(commit_url='https://huggingface.co/turbo-maikol/a2c-PandaPickAndPlace-v3/commit/457722bba273248332eadc56aa52d5aad99a7844', commit_message='Initial commit', commit_description='', oid='457722bba273248332eadc56aa52d5aad99a7844', pr_url=None, repo_url=RepoUrl('https://huggingface.co/turbo-maikol/a2c-PandaPickAndPlace-v3', endpoint='https://huggingface.co', repo_type='model', repo_id='turbo-maikol/a2c-PandaPickAndPlace-v3'), pr_revision=None, pr_num=None)"
+ ]
+ },
+ "execution_count": 32,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
],
- "metadata": {
- "id": "G3xy3Nf3c2O1"
- }
+ "source": [
+ "# 1 2 3\n",
+ "env_id_new = \"PandaPickAndPlace-v3\"\n",
+ "env_new = make_vec_env(env_id_new, n_envs=4)\n",
+ "env_new = VecNormalize(env_new, norm_obs=True, norm_reward=True, clip_obs=10)\n",
+ "# 4\n",
+ "model_new = A2C(\"MultiInputPolicy\", env_new, verbose=1) # Create the A2C model and try to find the best parameters\n",
+ "# 5\n",
+ "model_new.learn(1_000_000)\n",
+ "# 6\n",
+ "model_name_new = f\"new-{env_id_new}\"\n",
+ "model_new.save(model_name_new)\n",
+ "env_new.save(\"vec_normalize_new.pkl\")\n",
+ "\n",
+ "\n",
+ "# 7\n",
+ "from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize\n",
+ "# Load the saved statistics\n",
+ "eval_env_new = DummyVecEnv([lambda: gym.make(f\"{env_id_new}\")])\n",
+ "eval_env_new = VecNormalize.load(\"vec_normalize_new.pkl\", eval_env_new)\n",
+ "# We need to override the render_mode\n",
+ "eval_env_new.render_mode = \"rgb_array\"\n",
+ "# do not update them at test time\n",
+ "eval_env_new.training = False\n",
+ "# reward normalization is not needed at test time\n",
+ "eval_env_new.norm_reward = False\n",
+ "# Load the agent\n",
+ "model = A2C.load(model_name_new)\n",
+ "\n",
+ "mean_reward, std_reward = evaluate_policy(model, eval_env_new)\n",
+ "\n",
+ "print(f\"Mean reward = {mean_reward:.2f} +/- {std_reward:.2f}\")\n",
+ "\n",
+ "\n",
+ "# 8\n",
+ "package_to_hub(\n",
+ " model=model,\n",
+ " model_name=f\"a2c-{env_id_new}\",\n",
+ " model_architecture=\"A2C\",\n",
+ " env_id=env_id_new,\n",
+ " eval_env=eval_env_new,\n",
+ " repo_id=f\"turbo-maikol/a2c-{env_id_new}\", # Change the username\n",
+ " commit_message=\"Initial commit\",\n",
+ ")"
+ ]
},
{
"cell_type": "markdown",
- "source": [
- "### Solution (optional)"
- ],
"metadata": {
"id": "sKGbFXZq9ikN"
- }
+ },
+ "source": [
+ "### Solution (optional)"
+ ]
},
{
"cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "J-cC-Feg9iMm"
+ },
+ "outputs": [],
"source": [
"# 1 - 2\n",
"env_id = \"PandaPickAndPlace-v3\"\n",
@@ -735,15 +19577,15 @@
" verbose=1)\n",
"# 5\n",
"model.learn(1_000_000)"
- ],
- "metadata": {
- "id": "J-cC-Feg9iMm"
- },
- "execution_count": null,
- "outputs": []
+ ]
},
{
"cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "-UnlKLmpg80p"
+ },
+ "outputs": [],
"source": [
"# 6\n",
"model_name = \"a2c-PandaPickAndPlace-v3\";\n",
@@ -779,22 +19621,48 @@
" repo_id=f\"ThomasSimonini/a2c-{env_id}\", # TODO: Change the username\n",
" commit_message=\"Initial commit\",\n",
")"
- ],
- "metadata": {
- "id": "-UnlKLmpg80p"
- },
- "execution_count": null,
- "outputs": []
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "usatLaZ8dM4P"
+ },
"source": [
"See you on Unit 7! 🔥\n",
"## Keep learning, stay awesome 🤗"
+ ]
+ }
+ ],
+ "metadata": {
+ "accelerator": "GPU",
+ "colab": {
+ "collapsed_sections": [
+ "tF42HvI7-gs5"
],
- "metadata": {
- "id": "usatLaZ8dM4P"
- }
+ "include_colab_link": true,
+ "private_outputs": true,
+ "provenance": []
+ },
+ "gpuClass": "standard",
+ "kernelspec": {
+ "display_name": "venv-u6",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.18"
}
- ]
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
}
diff --git a/notebooks/unit8/unit8_part1.ipynb b/notebooks/unit8/unit8_part1.ipynb
index 653385b..3586798 100644
--- a/notebooks/unit8/unit8_part1.ipynb
+++ b/notebooks/unit8/unit8_part1.ipynb
@@ -3,8 +3,8 @@
{
"cell_type": "markdown",
"metadata": {
- "id": "view-in-github",
- "colab_type": "text"
+ "colab_type": "text",
+ "id": "view-in-github"
},
"source": [
"
"
@@ -60,6 +60,9 @@
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "T6lIPYFghhYL"
+ },
"source": [
"## Objectives of this notebook 🏆\n",
"\n",
@@ -69,13 +72,13 @@
"- Be able to **push your trained agent and the code to the Hub** with a nice video replay and an evaluation score 🔥.\n",
"\n",
"\n"
- ],
- "metadata": {
- "id": "T6lIPYFghhYL"
- }
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "Wp-rD6Fuhq31"
+ },
"source": [
"## This notebook is from the Deep Reinforcement Learning Course\n",
"
\n",
@@ -90,82 +93,79 @@
"\n",
"\n",
"The best way to keep in touch is to join our discord server to exchange with the community and with us 👉🏻 https://discord.gg/ydHrjt3WP5"
- ],
- "metadata": {
- "id": "Wp-rD6Fuhq31"
- }
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "rasqqGQlhujA"
+ },
"source": [
"## Prerequisites 🏗️\n",
"Before diving into the notebook, you need to:\n",
"\n",
"🔲 📚 Study [PPO by reading Unit 8](https://huggingface.co/deep-rl-course/unit8/introduction) 🤗 "
- ],
- "metadata": {
- "id": "rasqqGQlhujA"
- }
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "PUFfMGOih3CW"
+ },
"source": [
"To validate this hands-on for the [certification process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process), you need to push one model, we don't ask for a minimal result but we **advise you to try different hyperparameters settings to get better results**.\n",
"\n",
"If you don't find your model, **go to the bottom of the page and click on the refresh button**\n",
"\n",
"For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process"
- ],
- "metadata": {
- "id": "PUFfMGOih3CW"
- }
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "PU4FVzaoM6fC"
+ },
"source": [
"## Set the GPU 💪\n",
"- To **accelerate the agent's training, we'll use a GPU**. To do that, go to `Runtime > Change Runtime type`\n",
"\n",
"
"
- ],
- "metadata": {
- "id": "PU4FVzaoM6fC"
- }
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "KV0NyFdQM9ZG"
+ },
"source": [
"- `Hardware Accelerator > GPU`\n",
"\n",
"
"
- ],
- "metadata": {
- "id": "KV0NyFdQM9ZG"
- }
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "bTpYcVZVMzUI"
+ },
"source": [
"## Create a virtual display 🔽\n",
"\n",
"During the notebook, we'll need to generate a replay video. To do so, with colab, **we need to have a virtual screen to be able to render the environment** (and thus record the frames). \n",
"\n",
"Hence the following cell will install the librairies and create and run a virtual screen 🖥"
- ],
- "metadata": {
- "id": "bTpYcVZVMzUI"
- }
+ ]
},
{
"cell_type": "code",
- "source": [
- "!pip install setuptools==65.5.0"
- ],
+ "execution_count": null,
"metadata": {
"id": "Fd731S8-NuJA"
},
- "execution_count": null,
- "outputs": []
+ "outputs": [],
+ "source": [
+ "!pip install setuptools==65.5.0"
+ ]
},
{
"cell_type": "code",
@@ -186,18 +186,18 @@
},
{
"cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "ww5PQH1gNLI4"
+ },
+ "outputs": [],
"source": [
"# Virtual display\n",
"from pyvirtualdisplay import Display\n",
"\n",
"virtual_display = Display(visible=0, size=(1400, 900))\n",
"virtual_display.start()"
- ],
- "metadata": {
- "id": "ww5PQH1gNLI4"
- },
- "execution_count": null,
- "outputs": []
+ ]
},
{
"cell_type": "markdown",
@@ -211,17 +211,14 @@
},
{
"cell_type": "code",
- "source": [
- "!pip install gym==0.22\n",
- "!pip install imageio-ffmpeg\n",
- "!pip install huggingface_hub\n",
- "!pip install gym[box2d]==0.22"
- ],
+ "execution_count": null,
"metadata": {
"id": "9xZQFTPcsKUK"
},
- "execution_count": null,
- "outputs": []
+ "outputs": [],
+ "source": [
+ "!pip install gym==0.22 imageio-ffmpeg huggingface_hub gym[box2d]==0.22"
+ ]
},
{
"cell_type": "markdown",
@@ -266,7 +263,17 @@
},
"outputs": [],
"source": [
- "### Your code here:"
+ "### Your code here:\n",
+ "# from ppo import ...  # import your PPO implementation here"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "\n",
+ "# Executed cells to upload my model to Hugging Face"
]
},
{
@@ -307,7 +314,10 @@
"import imageio\n",
"\n",
"from wasabi import Printer\n",
- "msg = Printer()"
+ "msg = Printer()\n",
+ "\n",
+ "%load_ext autoreload\n",
+ "%autoreload 2"
]
},
{
@@ -319,18 +329,6 @@
"- Add new argument in `parse_args()` function to define the repo-id where we want to push the model."
]
},
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "iHQiqQEFn0QH"
- },
- "outputs": [],
- "source": [
- "# Adding HuggingFace argument\n",
- "parser.add_argument(\"--repo-id\", type=str, default=\"ThomasSimonini/ppo-CartPole-v1\", help=\"id of the model repository from the Hugging Face Hub {username/repo_name}\")"
- ]
- },
{
"cell_type": "markdown",
"metadata": {
@@ -452,17 +450,17 @@
" \"\"\"\n",
" episode_rewards = []\n",
" for episode in range(n_eval_episodes):\n",
- " state = env.reset()\n",
+ " state, _ = env.reset()\n",
" step = 0\n",
" done = False\n",
" total_rewards_ep = 0\n",
" \n",
" while done is False:\n",
" state = torch.Tensor(state).to(device)\n",
- " action, _, _, _ = policy.get_action_and_value(state)\n",
- " new_state, reward, done, info = env.step(action.cpu().numpy())\n",
+ " action, _, _, _ = policy.get_action_value(state)\n",
+ " new_state, reward, term, trunc, info = env.step(action.cpu().numpy())\n",
" total_rewards_ep += reward \n",
- " if done:\n",
+ " if trunc or term:\n",
" break\n",
" state = new_state\n",
" episode_rewards.append(total_rewards_ep)\n",
@@ -474,16 +472,16 @@
"\n",
"def record_video(env, policy, out_directory, fps=30):\n",
" images = [] \n",
- " done = False\n",
- " state = env.reset()\n",
- " img = env.render(mode='rgb_array')\n",
+ " trunc, term = False, False\n",
+ " state, _ = env.reset()\n",
+ " img = env.render()\n",
" images.append(img)\n",
- " while not done:\n",
+ " while not (trunc or term):\n",
" state = torch.Tensor(state).to(device)\n",
" # Take the action (index) that have the maximum expected future reward given that state\n",
- " action, _, _, _ = policy.get_action_and_value(state)\n",
- " state, reward, done, info = env.step(action.cpu().numpy()) # We directly put next_state = state for recording logic\n",
- " img = env.render(mode='rgb_array')\n",
+ " action, _, _, _ = policy.get_action_value(state)\n",
+ " state, reward, term, trunc, info = env.step(action.cpu().numpy()) # We directly put next_state = state for recording logic\n",
+ " img = env.render()\n",
" images.append(img)\n",
" imageio.mimsave(out_directory, [np.array(img) for i, img in enumerate(images)], fps=fps)\n",
"\n",
@@ -603,6 +601,36 @@
"- Finally, we call this function at the end of the PPO training"
]
},
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "args_repo_id = \"turbo-maikol/rl-course-unit8-ppo-LunarLander-v2\"\n",
+ "args_env_id = \"LunarLander-v3\"\n",
+ "run_name = \"LunarLander-HF\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from src.utils.model_utils import load_agent\n",
+ "from src.config import Configuration\n",
+ "\n",
+ "CONFIG = Configuration(\n",
+ " MODELS=\"../../rl-module/models\",\n",
+ " exp_name=\"lunar-lander-hf-V2\",\n",
+ " env_id = args_env_id\n",
+ ")\n",
+ "agent = load_agent(CONFIG)\n",
+ "\n",
+ "device = CONFIG.device"
+ ]
+ },
{
"cell_type": "code",
"execution_count": null,
@@ -611,17 +639,26 @@
},
"outputs": [],
"source": [
+ "import gymnasium as gym\n",
+ "import torch\n",
"# Create the evaluation environment\n",
- "eval_env = gym.make(args.env_id)\n",
+ "eval_env = gym.make(args_env_id, render_mode=\"rgb_array\")\n",
"\n",
- "package_to_hub(repo_id = args.repo_id,\n",
+ "package_to_hub(repo_id = args_repo_id,\n",
" model = agent, # The model we want to save\n",
- " hyperparameters = args,\n",
- " eval_env = gym.make(args.env_id),\n",
+ " hyperparameters = CONFIG,\n",
+ " eval_env = eval_env,\n",
" logs= f\"runs/{run_name}\",\n",
" )"
]
},
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "----"
+ ]
+ },
{
"cell_type": "markdown",
"metadata": {
@@ -647,7 +684,7 @@
"import time\n",
"from distutils.util import strtobool\n",
"\n",
- "import gym\n",
+ "import gymnasium as gym\n",
"import numpy as np\n",
"import torch\n",
"import torch.nn as nn\n",
@@ -840,7 +877,7 @@
" \n",
" while done is False:\n",
" state = torch.Tensor(state).to(device)\n",
- " action, _, _, _ = policy.get_action_and_value(state)\n",
+ " action, _, _, _ = policy.get_action_value(state)\n",
" new_state, reward, done, info = env.step(action.cpu().numpy())\n",
" total_rewards_ep += reward \n",
" if done:\n",
@@ -862,7 +899,7 @@
" while not done:\n",
" state = torch.Tensor(state).to(device)\n",
" # Take the action (index) that have the maximum expected future reward given that state\n",
- " action, _, _, _ = policy.get_action_and_value(state)\n",
+ " action, _, _, _ = policy.get_action_value(state)\n",
" state, reward, done, info = env.step(action.cpu().numpy()) # We directly put next_state = state for recording logic\n",
" img = env.render(mode='rgb_array')\n",
" images.append(img)\n",
@@ -1013,7 +1050,7 @@
" def get_value(self, x):\n",
" return self.critic(x)\n",
"\n",
- " def get_action_and_value(self, x, action=None):\n",
+ " def get_action_value(self, x, action=None):\n",
" logits = self.actor(x)\n",
" probs = Categorical(logits=logits)\n",
" if action is None:\n",
@@ -1023,7 +1060,7 @@
"\n",
"if __name__ == \"__main__\":\n",
" args = parse_args()\n",
- " run_name = f\"{args.env_id}__{args.exp_name}__{args.seed}__{int(time.time())}\"\n",
+ " run_name = f\"{args_env_id}__{args.exp_name}__{args.seed}__{int(time.time())}\"\n",
" if args.track:\n",
" import wandb\n",
"\n",
@@ -1052,7 +1089,7 @@
"\n",
" # env setup\n",
" envs = gym.vector.SyncVectorEnv(\n",
- " [make_env(args.env_id, args.seed + i, i, args.capture_video, run_name) for i in range(args.num_envs)]\n",
+ " [make_env(args_env_id, args.seed + i, i, args.capture_video, run_name) for i in range(args.num_envs)]\n",
" )\n",
" assert isinstance(envs.single_action_space, gym.spaces.Discrete), \"only discrete action space is supported\"\n",
"\n",
@@ -1088,7 +1125,7 @@
"\n",
" # ALGO LOGIC: action logic\n",
" with torch.no_grad():\n",
- " action, logprob, _, value = agent.get_action_and_value(next_obs)\n",
+ " action, logprob, _, value = agent.get_action_value(next_obs)\n",
" values[step] = value.flatten()\n",
" actions[step] = action\n",
" logprobs[step] = logprob\n",
@@ -1150,7 +1187,7 @@
" end = start + args.minibatch_size\n",
" mb_inds = b_inds[start:end]\n",
"\n",
- " _, newlogprob, entropy, newvalue = agent.get_action_and_value(b_obs[mb_inds], b_actions.long()[mb_inds])\n",
+ " _, newlogprob, entropy, newvalue = agent.get_action_value(b_obs[mb_inds], b_actions.long()[mb_inds])\n",
" logratio = newlogprob - b_logprobs[mb_inds]\n",
" ratio = logratio.exp()\n",
"\n",
@@ -1216,12 +1253,12 @@
" writer.close()\n",
"\n",
" # Create the evaluation environment\n",
- " eval_env = gym.make(args.env_id)\n",
+ " eval_env = gym.make(args_env_id)\n",
"\n",
- " package_to_hub(repo_id = args.repo_id,\n",
+ " package_to_hub(repo_id = args_repo_id,\n",
" model = agent, # The model we want to save\n",
" hyperparameters = args,\n",
- " eval_env = gym.make(args.env_id),\n",
+ " eval_env = gym.make(args_env_id),\n",
" logs= f\"runs/{run_name}\",\n",
" )\n",
" "
@@ -1290,21 +1327,21 @@
},
{
"cell_type": "markdown",
- "source": [
- "
"
- ],
"metadata": {
"id": "Sq0My0LOjPYR"
- }
+ },
+ "source": [
+ "
"
+ ]
},
{
"cell_type": "markdown",
- "source": [
- "
"
- ],
"metadata": {
"id": "A8C-Q5ZyjUe3"
- }
+ },
+ "source": [
+ "
"
+ ]
},
{
"cell_type": "markdown",
@@ -1319,14 +1356,14 @@
},
{
"cell_type": "code",
- "source": [
- "!python ppo.py --env-id=\"LunarLander-v2\" --repo-id=\"YOUR_REPO_ID\" --total-timesteps=50000"
- ],
+ "execution_count": null,
"metadata": {
"id": "KXLih6mKseBs"
},
- "execution_count": null,
- "outputs": []
+ "outputs": [],
+ "source": [
+ "!python ppo.py --env-id=\"LunarLander-v2\" --repo-id=\"YOUR_REPO_ID\" --total-timesteps=50000"
+ ]
},
{
"cell_type": "markdown",
@@ -1350,22 +1387,32 @@
}
],
"metadata": {
+ "accelerator": "GPU",
"colab": {
- "private_outputs": true,
- "provenance": [],
"history_visible": true,
- "include_colab_link": true
+ "include_colab_link": true,
+ "private_outputs": true,
+ "provenance": []
},
"gpuClass": "standard",
"kernelspec": {
- "display_name": "Python 3",
+ "display_name": "venv",
+ "language": "python",
"name": "python3"
},
"language_info": {
- "name": "python"
- },
- "accelerator": "GPU"
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.18"
+ }
},
"nbformat": 4,
"nbformat_minor": 0
-}
\ No newline at end of file
+}
diff --git a/notebooks/unit8/unit8_part2.ipynb b/notebooks/unit8/unit8_part2.ipynb
index 7c38b10..59eb35b 100644
--- a/notebooks/unit8/unit8_part2.ipynb
+++ b/notebooks/unit8/unit8_part2.ipynb
@@ -3,8 +3,8 @@
{
"cell_type": "markdown",
"metadata": {
- "id": "view-in-github",
- "colab_type": "text"
+ "colab_type": "text",
+ "id": "view-in-github"
},
"source": [
"
"
@@ -244,21 +244,9 @@
"source": [
"# install python libraries\n",
"# thanks toinsson\n",
- "!pip install faster-fifo==1.4.2\n",
- "!pip install vizdoom"
+ "!pip install faster-fifo==1.4.2 vizdoom sample-factory==2.1.1"
]
},
- {
- "cell_type": "code",
- "source": [
- "!pip install sample-factory==2.1.1"
- ],
- "metadata": {
- "id": "alxUt7Au-O8e"
- },
- "execution_count": null,
- "outputs": []
- },
{
"cell_type": "markdown",
"metadata": {
@@ -270,7 +258,7 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 1,
"metadata": {
"id": "bCgZbeiavcDU"
},
@@ -358,11 +346,210 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 3,
"metadata": {
"id": "y_TeicMvyKHP"
},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "\u001b[33m[2025-08-29 19:52:59,093][32845] Environment doom_basic already registered, overwriting...\u001b[0m\n",
+ "\u001b[33m[2025-08-29 19:52:59,095][32845] Environment doom_two_colors_easy already registered, overwriting...\u001b[0m\n",
+ "\u001b[33m[2025-08-29 19:52:59,096][32845] Environment doom_two_colors_hard already registered, overwriting...\u001b[0m\n",
+ "\u001b[33m[2025-08-29 19:52:59,098][32845] Environment doom_dm already registered, overwriting...\u001b[0m\n",
+ "\u001b[33m[2025-08-29 19:52:59,098][32845] Environment doom_dwango5 already registered, overwriting...\u001b[0m\n",
+ "\u001b[33m[2025-08-29 19:52:59,099][32845] Environment doom_my_way_home_flat_actions already registered, overwriting...\u001b[0m\n",
+ "\u001b[33m[2025-08-29 19:52:59,100][32845] Environment doom_defend_the_center_flat_actions already registered, overwriting...\u001b[0m\n",
+ "\u001b[33m[2025-08-29 19:52:59,100][32845] Environment doom_my_way_home already registered, overwriting...\u001b[0m\n",
+ "\u001b[33m[2025-08-29 19:52:59,101][32845] Environment doom_deadly_corridor already registered, overwriting...\u001b[0m\n",
+ "\u001b[33m[2025-08-29 19:52:59,102][32845] Environment doom_defend_the_center already registered, overwriting...\u001b[0m\n",
+ "\u001b[33m[2025-08-29 19:52:59,103][32845] Environment doom_defend_the_line already registered, overwriting...\u001b[0m\n",
+ "\u001b[33m[2025-08-29 19:52:59,104][32845] Environment doom_health_gathering already registered, overwriting...\u001b[0m\n",
+ "\u001b[33m[2025-08-29 19:52:59,104][32845] Environment doom_health_gathering_supreme already registered, overwriting...\u001b[0m\n",
+ "\u001b[33m[2025-08-29 19:52:59,105][32845] Environment doom_battle already registered, overwriting...\u001b[0m\n",
+ "\u001b[33m[2025-08-29 19:52:59,106][32845] Environment doom_battle2 already registered, overwriting...\u001b[0m\n",
+ "\u001b[33m[2025-08-29 19:52:59,106][32845] Environment doom_duel_bots already registered, overwriting...\u001b[0m\n",
+ "\u001b[33m[2025-08-29 19:52:59,107][32845] Environment doom_deathmatch_bots already registered, overwriting...\u001b[0m\n",
+ "\u001b[33m[2025-08-29 19:52:59,107][32845] Environment doom_duel already registered, overwriting...\u001b[0m\n",
+ "\u001b[33m[2025-08-29 19:52:59,108][32845] Environment doom_deathmatch_full already registered, overwriting...\u001b[0m\n",
+ "\u001b[33m[2025-08-29 19:52:59,109][32845] Environment doom_benchmark already registered, overwriting...\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:52:59,109][32845] register_encoder_factory: \u001b[0m\n",
+ "\u001b[33m[2025-08-29 19:52:59,191][32845] Loading existing experiment configuration from /home/mique/Desktop/Code/deep-rl-class/notebooks/unit8/train_dir/default_experiment/config.json\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:52:59,209][32845] Experiment dir /home/mique/Desktop/Code/deep-rl-class/notebooks/unit8/train_dir/default_experiment already exists!\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:52:59,224][32845] Resuming existing experiment from /home/mique/Desktop/Code/deep-rl-class/notebooks/unit8/train_dir/default_experiment...\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:52:59,225][32845] Weights and Biases integration disabled\u001b[0m\n",
+ "\u001b[37m\u001b[1m[2025-08-29 19:52:59,235][32845] Environment var CUDA_VISIBLE_DEVICES is 0\n",
+ "\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:53:01,426][43033] Doom resolution: 160x120, resize resolution: (128, 72)\u001b[0m\n",
+ "/home/mique/Desktop/Code/deep-rl-class/notebooks/unit8/venv-u82/lib/python3.12/site-packages/gymnasium/core.py:311: UserWarning: \u001b[33mWARN: env.num_agents to get variables from other wrappers is deprecated and will be removed in v1.0, to get this variable you can do `env.unwrapped.num_agents` for environment variables or `env.get_wrapper_attr('num_agents')` that will search the reminding wrappers.\u001b[0m\n",
+ " logger.warn(\n",
+ "/home/mique/Desktop/Code/deep-rl-class/notebooks/unit8/venv-u82/lib/python3.12/site-packages/gymnasium/core.py:311: UserWarning: \u001b[33mWARN: env.is_multiagent to get variables from other wrappers is deprecated and will be removed in v1.0, to get this variable you can do `env.unwrapped.is_multiagent` for environment variables or `env.get_wrapper_attr('is_multiagent')` that will search the reminding wrappers.\u001b[0m\n",
+ " logger.warn(\n",
+ "\u001b[36m[2025-08-29 19:53:01,428][43033] Env info: EnvInfo(obs_space=Dict('obs': Box(0, 255, (3, 72, 128), uint8)), action_space=Discrete(5), num_agents=1, gpu_actions=False, gpu_observations=True, action_splits=None, all_discrete=None, frameskip=4, reward_shaping_scheme=None, env_info_protocol_version=1)\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:53:01,760][32845] Starting experiment with the following configuration:\n",
+ "help=False\n",
+ "algo=APPO\n",
+ "env=doom_health_gathering_supreme\n",
+ "experiment=default_experiment\n",
+ "train_dir=/home/mique/Desktop/Code/deep-rl-class/notebooks/unit8/train_dir\n",
+ "restart_behavior=resume\n",
+ "device=gpu\n",
+ "seed=None\n",
+ "num_policies=1\n",
+ "async_rl=True\n",
+ "serial_mode=False\n",
+ "batched_sampling=False\n",
+ "num_batches_to_accumulate=2\n",
+ "worker_num_splits=2\n",
+ "policy_workers_per_policy=1\n",
+ "max_policy_lag=1000\n",
+ "num_workers=10\n",
+ "num_envs_per_worker=8\n",
+ "batch_size=16384\n",
+ "num_batches_per_epoch=1\n",
+ "num_epochs=1\n",
+ "rollout=64\n",
+ "recurrence=32\n",
+ "shuffle_minibatches=False\n",
+ "gamma=0.99\n",
+ "reward_scale=1.0\n",
+ "reward_clip=1000.0\n",
+ "value_bootstrap=False\n",
+ "normalize_returns=True\n",
+ "exploration_loss_coeff=0.001\n",
+ "value_loss_coeff=0.5\n",
+ "kl_loss_coeff=0.0\n",
+ "exploration_loss=symmetric_kl\n",
+ "gae_lambda=0.95\n",
+ "ppo_clip_ratio=0.2\n",
+ "ppo_clip_value=0.2\n",
+ "with_vtrace=False\n",
+ "vtrace_rho=1.0\n",
+ "vtrace_c=1.0\n",
+ "optimizer=adam\n",
+ "adam_eps=1e-06\n",
+ "adam_beta1=0.9\n",
+ "adam_beta2=0.999\n",
+ "max_grad_norm=4.0\n",
+ "learning_rate=0.0002\n",
+ "lr_schedule=constant\n",
+ "lr_schedule_kl_threshold=0.008\n",
+ "lr_adaptive_min=1e-06\n",
+ "lr_adaptive_max=0.01\n",
+ "obs_subtract_mean=0.0\n",
+ "obs_scale=255.0\n",
+ "normalize_input=True\n",
+ "normalize_input_keys=None\n",
+ "decorrelate_experience_max_seconds=0\n",
+ "decorrelate_envs_on_one_worker=True\n",
+ "actor_worker_gpus=[]\n",
+ "set_workers_cpu_affinity=True\n",
+ "force_envs_single_thread=False\n",
+ "default_niceness=0\n",
+ "log_to_file=True\n",
+ "experiment_summaries_interval=10\n",
+ "flush_summaries_interval=30\n",
+ "stats_avg=100\n",
+ "summaries_use_frameskip=True\n",
+ "heartbeat_interval=20\n",
+ "heartbeat_reporting_interval=600\n",
+ "train_for_env_steps=30000000\n",
+ "train_for_seconds=10000000000\n",
+ "save_every_sec=120\n",
+ "keep_checkpoints=2\n",
+ "load_checkpoint_kind=latest\n",
+ "save_milestones_sec=-1\n",
+ "save_best_every_sec=5\n",
+ "save_best_metric=reward\n",
+ "save_best_after=100000\n",
+ "benchmark=False\n",
+ "encoder_mlp_layers=[512, 512]\n",
+ "encoder_conv_architecture=convnet_simple\n",
+ "encoder_conv_mlp_layers=[512]\n",
+ "use_rnn=True\n",
+ "rnn_size=512\n",
+ "rnn_type=gru\n",
+ "rnn_num_layers=1\n",
+ "decoder_mlp_layers=[]\n",
+ "nonlinearity=elu\n",
+ "policy_initialization=orthogonal\n",
+ "policy_init_gain=1.0\n",
+ "actor_critic_share_weights=True\n",
+ "adaptive_stddev=True\n",
+ "continuous_tanh_scale=0.0\n",
+ "initial_stddev=1.0\n",
+ "use_env_info_cache=False\n",
+ "env_gpu_actions=False\n",
+ "env_gpu_observations=True\n",
+ "env_frameskip=4\n",
+ "env_framestack=1\n",
+ "pixel_format=CHW\n",
+ "use_record_episode_statistics=False\n",
+ "with_wandb=False\n",
+ "wandb_user=None\n",
+ "wandb_project=sample_factory\n",
+ "wandb_group=None\n",
+ "wandb_job_type=SF\n",
+ "wandb_tags=[]\n",
+ "with_pbt=False\n",
+ "pbt_mix_policies_in_one_env=True\n",
+ "pbt_period_env_steps=5000000\n",
+ "pbt_start_mutation=20000000\n",
+ "pbt_replace_fraction=0.3\n",
+ "pbt_mutation_rate=0.15\n",
+ "pbt_replace_reward_gap=0.1\n",
+ "pbt_replace_reward_gap_absolute=1e-06\n",
+ "pbt_optimize_gamma=False\n",
+ "pbt_target_objective=true_objective\n",
+ "pbt_perturb_min=1.1\n",
+ "pbt_perturb_max=1.5\n",
+ "num_agents=-1\n",
+ "num_humans=0\n",
+ "num_bots=-1\n",
+ "start_bot_difficulty=None\n",
+ "timelimit=None\n",
+ "res_w=128\n",
+ "res_h=72\n",
+ "wide_aspect_ratio=False\n",
+ "eval_env_frameskip=1\n",
+ "fps=35\n",
+ "command_line=--env=doom_health_gathering_supreme --num_workers=8 --num_envs_per_worker=4 --train_for_env_steps=4000000\n",
+ "cli_args={'env': 'doom_health_gathering_supreme', 'num_workers': 8, 'num_envs_per_worker': 4, 'train_for_env_steps': 4000000}\n",
+ "git_hash=f8ed470f837e96d11b86d84cc03d9d0be1dc0042\n",
+ "git_repo_name=git@github.com:huggingface/deep-rl-class.git\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:53:01,762][32845] Saving configuration to /home/mique/Desktop/Code/deep-rl-class/notebooks/unit8/train_dir/default_experiment/config.json...\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:53:01,831][32845] Rollout worker 0 uses device cpu\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:53:01,832][32845] Rollout worker 1 uses device cpu\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:53:01,832][32845] Rollout worker 2 uses device cpu\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:53:01,833][32845] Rollout worker 3 uses device cpu\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:53:01,833][32845] Rollout worker 4 uses device cpu\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:53:01,834][32845] Rollout worker 5 uses device cpu\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:53:01,836][32845] Rollout worker 6 uses device cpu\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:53:01,836][32845] Rollout worker 7 uses device cpu\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:53:01,837][32845] Rollout worker 8 uses device cpu\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:53:01,837][32845] Rollout worker 9 uses device cpu\u001b[0m\n"
+ ]
+ },
+ {
+ "ename": "KeyboardInterrupt",
+ "evalue": "",
+ "output_type": "error",
+ "traceback": [
+ "\u001b[31m---------------------------------------------------------------------------\u001b[39m",
+ "\u001b[31mKeyboardInterrupt\u001b[39m Traceback (most recent call last)",
+ "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[3]\u001b[39m\u001b[32m, line 31\u001b[39m\n\u001b[32m 6\u001b[39m env = \u001b[33m\"\u001b[39m\u001b[33mdoom_health_gathering_supreme\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m 7\u001b[39m cfg = parse_vizdoom_cfg(argv=[\n\u001b[32m 8\u001b[39m \u001b[33mf\u001b[39m\u001b[33m\"\u001b[39m\u001b[33m--env=\u001b[39m\u001b[38;5;132;01m{\u001b[39;00menv\u001b[38;5;132;01m}\u001b[39;00m\u001b[33m\"\u001b[39m,\n\u001b[32m 9\u001b[39m \n\u001b[32m (...)\u001b[39m\u001b[32m 28\u001b[39m \n\u001b[32m 29\u001b[39m ])\n\u001b[32m---> \u001b[39m\u001b[32m31\u001b[39m status = \u001b[43mrun_rl\u001b[49m\u001b[43m(\u001b[49m\u001b[43mcfg\u001b[49m\u001b[43m)\u001b[49m\n",
+ "\u001b[36mFile \u001b[39m\u001b[32m~/Desktop/Code/deep-rl-class/notebooks/unit8/venv-u82/lib/python3.12/site-packages/sample_factory/train.py:37\u001b[39m, in \u001b[36mrun_rl\u001b[39m\u001b[34m(cfg)\u001b[39m\n\u001b[32m 32\u001b[39m cfg, runner = make_runner(cfg)\n\u001b[32m 34\u001b[39m \u001b[38;5;66;03m# here we can register additional message or summary handlers\u001b[39;00m\n\u001b[32m 35\u001b[39m \u001b[38;5;66;03m# see sf_examples/dmlab/train_dmlab.py for example\u001b[39;00m\n\u001b[32m---> \u001b[39m\u001b[32m37\u001b[39m status = \u001b[43mrunner\u001b[49m\u001b[43m.\u001b[49m\u001b[43minit\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 38\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m status == ExperimentStatus.SUCCESS:\n\u001b[32m 39\u001b[39m status = runner.run()\n",
+ "\u001b[36mFile \u001b[39m\u001b[32m~/Desktop/Code/deep-rl-class/notebooks/unit8/venv-u82/lib/python3.12/site-packages/sample_factory/algo/runners/runner_parallel.py:21\u001b[39m, in \u001b[36mParallelRunner.init\u001b[39m\u001b[34m(self)\u001b[39m\n\u001b[32m 20\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34minit\u001b[39m(\u001b[38;5;28mself\u001b[39m) -> StatusCode:\n\u001b[32m---> \u001b[39m\u001b[32m21\u001b[39m status = \u001b[38;5;28;43msuper\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[43m.\u001b[49m\u001b[43minit\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 22\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m status != ExperimentStatus.SUCCESS:\n\u001b[32m 23\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m status\n",
+ "\u001b[36mFile \u001b[39m\u001b[32m~/Desktop/Code/deep-rl-class/notebooks/unit8/venv-u82/lib/python3.12/site-packages/sample_factory/algo/runners/runner.py:557\u001b[39m, in \u001b[36mRunner.init\u001b[39m\u001b[34m(self)\u001b[39m\n\u001b[32m 554\u001b[39m \u001b[38;5;28mself\u001b[39m._save_cfg()\n\u001b[32m 555\u001b[39m save_git_diff(experiment_dir(\u001b[38;5;28mself\u001b[39m.cfg))\n\u001b[32m--> \u001b[39m\u001b[32m557\u001b[39m \u001b[38;5;28mself\u001b[39m.buffer_mgr = \u001b[43mBufferMgr\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mcfg\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43menv_info\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 559\u001b[39m \u001b[38;5;28mself\u001b[39m._observers_call(AlgoObserver.on_init, \u001b[38;5;28mself\u001b[39m)\n\u001b[32m 561\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m ExperimentStatus.SUCCESS\n",
+ "\u001b[36mFile \u001b[39m\u001b[32m~/Desktop/Code/deep-rl-class/notebooks/unit8/venv-u82/lib/python3.12/site-packages/sample_factory/algo/utils/shared_buffers.py:215\u001b[39m, in \u001b[36mBufferMgr.__init__\u001b[39m\u001b[34m(self, cfg, env_info)\u001b[39m\n\u001b[32m 208\u001b[39m num_buffers = \u001b[38;5;28mmax\u001b[39m(\n\u001b[32m 209\u001b[39m num_buffers,\n\u001b[32m 210\u001b[39m \u001b[38;5;28mself\u001b[39m.max_batches_to_accumulate * \u001b[38;5;28mself\u001b[39m.trajectories_per_training_iteration * cfg.num_policies,\n\u001b[32m 211\u001b[39m )\n\u001b[32m 213\u001b[39m \u001b[38;5;28mself\u001b[39m.traj_buffer_queues[device] = get_queue(cfg.serial_mode)\n\u001b[32m--> \u001b[39m\u001b[32m215\u001b[39m \u001b[38;5;28mself\u001b[39m.traj_tensors_torch[device] = \u001b[43malloc_trajectory_tensors\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 216\u001b[39m \u001b[43m \u001b[49m\u001b[43menv_info\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 217\u001b[39m \u001b[43m \u001b[49m\u001b[43mnum_buffers\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 218\u001b[39m \u001b[43m \u001b[49m\u001b[43mcfg\u001b[49m\u001b[43m.\u001b[49m\u001b[43mrollout\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 219\u001b[39m \u001b[43m \u001b[49m\u001b[43mrnn_size\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 220\u001b[39m \u001b[43m \u001b[49m\u001b[43mdevice\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 221\u001b[39m \u001b[43m \u001b[49m\u001b[43mshare\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 222\u001b[39m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 223\u001b[39m \u001b[38;5;28mself\u001b[39m.policy_output_tensors_torch[device], output_names, output_sizes = alloc_policy_output_tensors(\n\u001b[32m 224\u001b[39m cfg, env_info, rnn_size, device, share\n\u001b[32m 225\u001b[39m )\n\u001b[32m 226\u001b[39m \u001b[38;5;28mself\u001b[39m.output_names, \u001b[38;5;28mself\u001b[39m.output_sizes = output_names, output_sizes\n",
+ "\u001b[36mFile \u001b[39m\u001b[32m~/Desktop/Code/deep-rl-class/notebooks/unit8/venv-u82/lib/python3.12/site-packages/sample_factory/algo/utils/shared_buffers.py:91\u001b[39m, in \u001b[36malloc_trajectory_tensors\u001b[39m\u001b[34m(env_info, num_traj, rollout, rnn_size, device, share)\u001b[39m\n\u001b[32m 89\u001b[39m \u001b[38;5;66;03m# we need to allocate an extra rollout step here to calculate the value estimates for the last step\u001b[39;00m\n\u001b[32m 90\u001b[39m \u001b[38;5;28;01mfor\u001b[39;00m space_name, space \u001b[38;5;129;01min\u001b[39;00m obs_space.spaces.items():\n\u001b[32m---> \u001b[39m\u001b[32m91\u001b[39m tensors[\u001b[33m\"\u001b[39m\u001b[33mobs\u001b[39m\u001b[33m\"\u001b[39m][space_name] = \u001b[43minit_tensor\u001b[49m\u001b[43m(\u001b[49m\u001b[43m[\u001b[49m\u001b[43mnum_traj\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mrollout\u001b[49m\u001b[43m \u001b[49m\u001b[43m+\u001b[49m\u001b[43m \u001b[49m\u001b[32;43m1\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mspace\u001b[49m\u001b[43m.\u001b[49m\u001b[43mdtype\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mspace\u001b[49m\u001b[43m.\u001b[49m\u001b[43mshape\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdevice\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mshare\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 92\u001b[39m tensors[\u001b[33m\"\u001b[39m\u001b[33mrnn_states\u001b[39m\u001b[33m\"\u001b[39m] = init_tensor([num_traj, rollout + \u001b[32m1\u001b[39m], torch.float32, [rnn_size], device, share)\n\u001b[32m 94\u001b[39m num_actions, num_action_distribution_parameters = action_info(env_info)\n",
+ "\u001b[36mFile \u001b[39m\u001b[32m~/Desktop/Code/deep-rl-class/notebooks/unit8/venv-u82/lib/python3.12/site-packages/sample_factory/algo/utils/shared_buffers.py:43\u001b[39m, in \u001b[36minit_tensor\u001b[39m\u001b[34m(leading_dimensions, tensor_type, tensor_shape, device, share)\u001b[39m\n\u001b[32m 40\u001b[39m tensor_shape = [x \u001b[38;5;28;01mfor\u001b[39;00m x \u001b[38;5;129;01min\u001b[39;00m tensor_shape \u001b[38;5;28;01mif\u001b[39;00m x]\n\u001b[32m 42\u001b[39m final_shape = leading_dimensions + \u001b[38;5;28mlist\u001b[39m(tensor_shape)\n\u001b[32m---> \u001b[39m\u001b[32m43\u001b[39m t = \u001b[43mtorch\u001b[49m\u001b[43m.\u001b[49m\u001b[43mzeros\u001b[49m\u001b[43m(\u001b[49m\u001b[43mfinal_shape\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdtype\u001b[49m\u001b[43m=\u001b[49m\u001b[43mtensor_type\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 45\u001b[39m \u001b[38;5;66;03m# fill with magic values to make it easy to spot if we ever use unintialized data\u001b[39;00m\n\u001b[32m 46\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m t.is_floating_point():\n",
+ "\u001b[31mKeyboardInterrupt\u001b[39m: "
+ ]
+ }
+ ],
"source": [
"## Start the training, this should take around 15 minutes\n",
"register_vizdoom_components()\n",
@@ -370,7 +557,29 @@
"# The scenario we train on today is health gathering\n",
"# other scenarios include \"doom_basic\", \"doom_two_colors_easy\", \"doom_dm\", \"doom_dwango5\", \"doom_my_way_home\", \"doom_deadly_corridor\", \"doom_defend_the_center\", \"doom_defend_the_line\"\n",
"env = \"doom_health_gathering_supreme\"\n",
- "cfg = parse_vizdoom_cfg(argv=[f\"--env={env}\", \"--num_workers=8\", \"--num_envs_per_worker=4\", \"--train_for_env_steps=4000000\"])\n",
+ "cfg = parse_vizdoom_cfg(argv=[\n",
+ " f\"--env={env}\",\n",
+ "\n",
+ " # Parallelism / speed\n",
+ " \"--num_workers=10\", # more CPU workers if you have cores\n",
+ " \"--num_envs_per_worker=8\", # more envs per worker (GPU permitting)\n",
+ "\n",
+ " # Training length\n",
+ " \"--train_for_env_steps=30000000\", # 20M steps → better convergence\n",
+ "\n",
+ " # Rollouts\n",
+ " \"--rollout=64\", # longer rollouts = better advantage estimates\n",
+ "\n",
+ " # PPO / optimizer\n",
+ " \"--batch_size=16384\", # bigger batch for more stable updates\n",
+ " \"--learning_rate=0.0002\", # slightly higher than doom default\n",
+ " \"--ppo_clip_ratio=0.2\", # more conservative clipping\n",
+ "\n",
+ " # Model / memory\n",
+ " \"--recurrence=32\", # add LSTM memory (important for Doom)\n",
+ " \"--use_rnn=True\",\n",
+ "\n",
+ "])\n",
"\n",
"status = run_rl(cfg)"
]
@@ -386,11 +595,184 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import torch\n",
+ "import numpy \n",
+ "torch.serialization.add_safe_globals([\n",
+ " numpy.core.multiarray.scalar,\n",
+ " numpy.dtype,\n",
+ " numpy.dtypes.Float64DType\n",
+ "])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
"metadata": {
"id": "MGSA4Kg5_i0j"
},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "\u001b[33m[2025-08-29 19:09:28,003][15827] Loading existing experiment configuration from /home/mique/Desktop/Code/deep-rl-class/notebooks/unit8/train_dir/default_experiment/config.json\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:28,004][15827] Overriding arg 'num_workers' with value 1 passed from command line\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:28,004][15827] Adding new argument 'no_render'=True that is not in the saved config file!\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:28,005][15827] Adding new argument 'save_video'=True that is not in the saved config file!\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:28,006][15827] Adding new argument 'video_frames'=1000000000.0 that is not in the saved config file!\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:28,006][15827] Adding new argument 'video_name'=None that is not in the saved config file!\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:28,007][15827] Adding new argument 'max_num_frames'=1000000000.0 that is not in the saved config file!\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:28,007][15827] Adding new argument 'max_num_episodes'=10 that is not in the saved config file!\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:28,008][15827] Adding new argument 'push_to_hub'=False that is not in the saved config file!\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:28,008][15827] Adding new argument 'hf_repository'=None that is not in the saved config file!\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:28,009][15827] Adding new argument 'policy_index'=0 that is not in the saved config file!\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:28,010][15827] Adding new argument 'eval_deterministic'=False that is not in the saved config file!\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:28,011][15827] Adding new argument 'train_script'=None that is not in the saved config file!\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:28,011][15827] Adding new argument 'enjoy_script'=None that is not in the saved config file!\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:28,012][15827] Using frameskip 1 and render_action_repeat=4 for evaluation\u001b[0m\n",
+ "/home/mique/Desktop/Code/deep-rl-class/notebooks/unit8/venv-u82/lib/python3.12/site-packages/gymnasium/core.py:311: UserWarning: \u001b[33mWARN: env.num_agents to get variables from other wrappers is deprecated and will be removed in v1.0, to get this variable you can do `env.unwrapped.num_agents` for environment variables or `env.get_wrapper_attr('num_agents')` that will search the reminding wrappers.\u001b[0m\n",
+ " logger.warn(\n",
+ "/home/mique/Desktop/Code/deep-rl-class/notebooks/unit8/venv-u82/lib/python3.12/site-packages/gymnasium/core.py:311: UserWarning: \u001b[33mWARN: env.is_multiagent to get variables from other wrappers is deprecated and will be removed in v1.0, to get this variable you can do `env.unwrapped.is_multiagent` for environment variables or `env.get_wrapper_attr('is_multiagent')` that will search the reminding wrappers.\u001b[0m\n",
+ " logger.warn(\n",
+ "\u001b[36m[2025-08-29 19:09:28,068][15827] RunningMeanStd input shape: (3, 72, 128)\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:28,070][15827] RunningMeanStd input shape: (1,)\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:28,078][15827] ConvEncoder: input_channels=3\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:28,110][15827] Conv encoder output size: 512\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:28,111][15827] Policy head output size: 512\u001b[0m\n",
+ "\u001b[33m[2025-08-29 19:09:28,147][15827] Loading state from checkpoint /home/mique/Desktop/Code/deep-rl-class/notebooks/unit8/train_dir/default_experiment/checkpoint_p0/checkpoint_000004884_20004864.pth...\u001b[0m\n",
+ "[W][05112.308343] pw.conf | [ conf.c: 1031 try_load_conf()] can't load config client-rt.conf: No such file or directory\n",
+ "[E][05112.308453] pw.conf | [ conf.c: 1060 pw_conf_load_conf_for_context()] can't load config client-rt.conf: No such file or directory\n",
+ "[ALSOFT] (EE) Failed to create PipeWire event context (errno: 2)\n",
+ "\u001b[36m[2025-08-29 19:09:28,678][15827] Num frames 100...\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:28,901][15827] Num frames 200...\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:29,082][15827] Num frames 300...\u001b[0m\n",
+ "\u001b[37m\u001b[1m[2025-08-29 19:09:29,271][15827] Avg episode rewards: #0: 3.840, true rewards: #0: 3.840\u001b[0m\n",
+ "\u001b[37m\u001b[1m[2025-08-29 19:09:29,273][15827] Avg episode reward: 3.840, avg true_objective: 3.840\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:29,303][15827] Num frames 400...\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:29,488][15827] Num frames 500...\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:29,669][15827] Num frames 600...\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:29,866][15827] Num frames 700...\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:30,068][15827] Num frames 800...\u001b[0m\n",
+ "\u001b[37m\u001b[1m[2025-08-29 19:09:30,279][15827] Avg episode rewards: #0: 5.320, true rewards: #0: 4.320\u001b[0m\n",
+ "\u001b[37m\u001b[1m[2025-08-29 19:09:30,281][15827] Avg episode reward: 5.320, avg true_objective: 4.320\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:30,350][15827] Num frames 900...\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:30,519][15827] Num frames 1000...\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:30,721][15827] Num frames 1100...\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:30,932][15827] Num frames 1200...\u001b[0m\n",
+ "\u001b[37m\u001b[1m[2025-08-29 19:09:31,091][15827] Avg episode rewards: #0: 4.827, true rewards: #0: 4.160\u001b[0m\n",
+ "\u001b[37m\u001b[1m[2025-08-29 19:09:31,093][15827] Avg episode reward: 4.827, avg true_objective: 4.160\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:31,207][15827] Num frames 1300...\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:31,409][15827] Num frames 1400...\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:31,677][15827] Num frames 1500...\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:31,882][15827] Num frames 1600...\u001b[0m\n",
+ "\u001b[37m\u001b[1m[2025-08-29 19:09:32,000][15827] Avg episode rewards: #0: 4.580, true rewards: #0: 4.080\u001b[0m\n",
+ "\u001b[37m\u001b[1m[2025-08-29 19:09:32,002][15827] Avg episode reward: 4.580, avg true_objective: 4.080\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:32,154][15827] Num frames 1700...\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:32,364][15827] Num frames 1800...\u001b[0m\n",
+ "\u001b[37m\u001b[1m[2025-08-29 19:09:32,596][15827] Avg episode rewards: #0: 4.176, true rewards: #0: 3.776\u001b[0m\n",
+ "\u001b[37m\u001b[1m[2025-08-29 19:09:32,597][15827] Avg episode reward: 4.176, avg true_objective: 3.776\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:32,628][15827] Num frames 1900...\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:32,853][15827] Num frames 2000...\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:33,054][15827] Num frames 2100...\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:33,264][15827] Num frames 2200...\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:33,433][15827] Num frames 2300...\u001b[0m\n",
+ "\u001b[37m\u001b[1m[2025-08-29 19:09:33,602][15827] Avg episode rewards: #0: 4.393, true rewards: #0: 3.893\u001b[0m\n",
+ "\u001b[37m\u001b[1m[2025-08-29 19:09:33,603][15827] Avg episode reward: 4.393, avg true_objective: 3.893\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:33,741][15827] Num frames 2400...\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:33,951][15827] Num frames 2500...\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:34,199][15827] Num frames 2600...\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:34,376][15827] Num frames 2700...\u001b[0m\n",
+ "\u001b[37m\u001b[1m[2025-08-29 19:09:34,540][15827] Avg episode rewards: #0: 4.549, true rewards: #0: 3.977\u001b[0m\n",
+ "\u001b[37m\u001b[1m[2025-08-29 19:09:34,541][15827] Avg episode reward: 4.549, avg true_objective: 3.977\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:34,566][15827] Num frames 2800...\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:34,788][15827] Num frames 2900...\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:34,990][15827] Num frames 3000...\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:35,103][15827] Num frames 3100...\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:35,292][15827] Num frames 3200...\u001b[0m\n",
+ "\u001b[37m\u001b[1m[2025-08-29 19:09:35,394][15827] Avg episode rewards: #0: 4.665, true rewards: #0: 4.040\u001b[0m\n",
+ "\u001b[37m\u001b[1m[2025-08-29 19:09:35,396][15827] Avg episode reward: 4.665, avg true_objective: 4.040\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:35,502][15827] Num frames 3300...\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:35,645][15827] Num frames 3400...\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:35,752][15827] Num frames 3500...\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:35,878][15827] Num frames 3600...\u001b[0m\n",
+ "\u001b[37m\u001b[1m[2025-08-29 19:09:35,951][15827] Avg episode rewards: #0: 4.573, true rewards: #0: 4.018\u001b[0m\n",
+ "\u001b[37m\u001b[1m[2025-08-29 19:09:35,952][15827] Avg episode reward: 4.573, avg true_objective: 4.018\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:36,061][15827] Num frames 3700...\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:36,168][15827] Num frames 3800...\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:36,298][15827] Num frames 3900...\u001b[0m\n",
+ "\u001b[36m[2025-08-29 19:09:36,417][15827] Num frames 4000...\u001b[0m\n",
+ "\u001b[37m\u001b[1m[2025-08-29 19:09:36,468][15827] Avg episode rewards: #0: 4.500, true rewards: #0: 4.000\u001b[0m\n",
+ "\u001b[37m\u001b[1m[2025-08-29 19:09:36,469][15827] Avg episode reward: 4.500, avg true_objective: 4.000\u001b[0m\n",
+ "ffmpeg version 6.1.1-3ubuntu5 Copyright (c) 2000-2023 the FFmpeg developers\n",
+ " built with gcc 13 (Ubuntu 13.2.0-23ubuntu3)\n",
+ " configuration: --prefix=/usr --extra-version=3ubuntu5 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --disable-omx --enable-gnutls --enable-libaom --enable-libass --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libglslang --enable-libgme --enable-libgsm --enable-libharfbuzz --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-openal --enable-opencl --enable-opengl --disable-sndio --enable-libvpl --disable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-ladspa --enable-libbluray --enable-libjack --enable-libpulse --enable-librabbitmq --enable-librist --enable-libsrt --enable-libssh --enable-libsvtav1 --enable-libx264 --enable-libzmq --enable-libzvbi --enable-lv2 --enable-sdl2 --enable-libplacebo --enable-librav1e --enable-pocketsphinx --enable-librsvg --enable-libjxl --enable-shared\n",
+ " libavutil 58. 29.100 / 58. 29.100\n",
+ " libavcodec 60. 31.102 / 60. 31.102\n",
+ " libavformat 60. 16.100 / 60. 16.100\n",
+ " libavdevice 60. 3.100 / 60. 3.100\n",
+ " libavfilter 9. 12.100 / 9. 12.100\n",
+ " libswscale 7. 5.100 / 7. 5.100\n",
+ " libswresample 4. 12.100 / 4. 12.100\n",
+ " libpostproc 57. 3.100 / 57. 3.100\n",
+ "Input #0, mov,mp4,m4a,3gp,3g2,mj2, from '/tmp/sf2_mique/replay.mp4':\n",
+ " Metadata:\n",
+ " major_brand : isom\n",
+ " minor_version : 512\n",
+ " compatible_brands: isomiso2mp41\n",
+ " encoder : Lavf59.27.100\n",
+ " Duration: 00:01:54.57, start: 0.000000, bitrate: 1373 kb/s\n",
+ " Stream #0:0[0x1](und): Video: mpeg4 (Simple Profile) (mp4v / 0x7634706D), yuv420p, 240x180 [SAR 1:1 DAR 4:3], 1372 kb/s, 35 fps, 35 tbr, 17920 tbn (default)\n",
+ " Metadata:\n",
+ " handler_name : VideoHandler\n",
+ " vendor_id : [0][0][0][0]\n",
+ "Stream mapping:\n",
+ " Stream #0:0 -> #0:0 (mpeg4 (native) -> h264 (libx264))\n",
+ "Press [q] to stop, [?] for help\n",
+ "[libx264 @ 0x55ba4d6002c0] using SAR=1/1\n",
+ "[libx264 @ 0x55ba4d6002c0] using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2\n",
+ "[libx264 @ 0x55ba4d6002c0] profile High, level 1.3, 4:2:0, 8-bit\n",
+ "[libx264 @ 0x55ba4d6002c0] 264 - core 164 r3108 31e19f9 - H.264/MPEG-4 AVC codec - Copyleft 2003-2023 - http://www.videolan.org/x264.html - options: cabac=1 ref=3 deblock=1:0:0 analyse=0x3:0x113 me=hex subme=7 psy=1 psy_rd=1.00:0.00 mixed_ref=1 me_range=16 chroma_me=1 trellis=1 8x8dct=1 cqm=0 deadzone=21,11 fast_pskip=1 chroma_qp_offset=-2 threads=6 lookahead_threads=1 sliced_threads=0 nr=0 decimate=1 interlaced=0 bluray_compat=0 constrained_intra=0 bframes=3 b_pyramid=2 b_adapt=1 b_bias=0 direct=1 weightb=1 open_gop=0 weightp=2 keyint=250 keyint_min=25 scenecut=40 intra_refresh=0 rc_lookahead=40 rc=crf mbtree=1 crf=23.0 qcomp=0.60 qpmin=0 qpmax=69 qpstep=4 ip_ratio=1.40 aq=1:1.00\n",
+ "Output #0, mp4, to '/home/mique/Desktop/Code/deep-rl-class/notebooks/unit8/train_dir/default_experiment/replay.mp4':\n",
+ " Metadata:\n",
+ " major_brand : isom\n",
+ " minor_version : 512\n",
+ " compatible_brands: isomiso2mp41\n",
+ " encoder : Lavf60.16.100\n",
+ " Stream #0:0(und): Video: h264 (avc1 / 0x31637661), yuv420p(progressive), 240x180 [SAR 1:1 DAR 4:3], q=2-31, 35 fps, 17920 tbn (default)\n",
+ " Metadata:\n",
+ " handler_name : VideoHandler\n",
+ " vendor_id : [0][0][0][0]\n",
+ " encoder : Lavc60.31.102 libx264\n",
+ " Side data:\n",
+ " cpb: bitrate max/min/avg: 0/0/0 buffer size: 0 vbv_delay: N/A\n",
+ "[out#0/mp4 @ 0x55ba4d5e8500] video:5518kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.778290%\n",
+ "frame= 4010 fps=1218 q=-1.0 Lsize= 5561kB time=00:01:54.48 bitrate= 397.9kbits/s speed=34.8x \n",
+ "[libx264 @ 0x55ba4d6002c0] frame I:27 Avg QP:22.74 size: 5791\n",
+ "[libx264 @ 0x55ba4d6002c0] frame P:1674 Avg QP:25.95 size: 1916\n",
+ "[libx264 @ 0x55ba4d6002c0] frame B:2309 Avg QP:28.24 size: 990\n",
+ "[libx264 @ 0x55ba4d6002c0] consecutive B-frames: 19.7% 7.2% 9.8% 63.2%\n",
+ "[libx264 @ 0x55ba4d6002c0] mb I I16..4: 13.1% 76.4% 10.4%\n",
+ "[libx264 @ 0x55ba4d6002c0] mb P I16..4: 2.7% 9.7% 3.2% P16..4: 41.7% 24.5% 10.6% 0.0% 0.0% skip: 7.7%\n",
+ "[libx264 @ 0x55ba4d6002c0] mb B I16..4: 0.2% 1.8% 1.2% B16..8: 45.8% 14.1% 3.3% direct: 6.6% skip:27.1% L0:51.0% L1:38.8% BI:10.2%\n",
+ "[libx264 @ 0x55ba4d6002c0] 8x8 transform intra:62.0% inter:65.3%\n",
+ "[libx264 @ 0x55ba4d6002c0] coded y,uvDC,uvAC intra: 60.9% 71.5% 40.1% inter: 35.6% 12.3% 2.4%\n",
+ "[libx264 @ 0x55ba4d6002c0] i16 v,h,dc,p: 62% 5% 32% 1%\n",
+ "[libx264 @ 0x55ba4d6002c0] i8 v,h,dc,ddl,ddr,vr,hd,vl,hu: 27% 9% 34% 4% 4% 4% 6% 5% 7%\n",
+ "[libx264 @ 0x55ba4d6002c0] i4 v,h,dc,ddl,ddr,vr,hd,vl,hu: 66% 5% 10% 3% 4% 3% 4% 2% 3%\n",
+ "[libx264 @ 0x55ba4d6002c0] i8c dc,h,v,p: 56% 18% 23% 2%\n",
+ "[libx264 @ 0x55ba4d6002c0] Weighted P-Frames: Y:9.7% UV:0.7%\n",
+ "[libx264 @ 0x55ba4d6002c0] ref P L0: 61.8% 15.0% 13.9% 8.1% 1.2%\n",
+ "[libx264 @ 0x55ba4d6002c0] ref B L0: 85.0% 11.8% 3.2%\n",
+ "[libx264 @ 0x55ba4d6002c0] ref B L1: 95.5% 4.5%\n",
+ "[libx264 @ 0x55ba4d6002c0] kb/s:394.53\n",
+ "\u001b[36m[2025-08-29 19:09:41,440][15827] Replay video saved to /home/mique/Desktop/Code/deep-rl-class/notebooks/unit8/train_dir/default_experiment/replay.mp4!\u001b[0m\n"
+ ]
+ }
+ ],
"source": [
"from sample_factory.enjoy import enjoy\n",
"cfg = parse_vizdoom_cfg(argv=[f\"--env={env}\", \"--num_workers=1\", \"--save_video\", \"--no_render\", \"--max_num_episodes=10\"], evaluation=True)\n",
@@ -417,7 +799,7 @@
"from base64 import b64encode\n",
"from IPython.display import HTML\n",
"\n",
- "mp4 = open('/content/train_dir/default_experiment/replay.mp4','rb').read()\n",
+ "mp4 = open('train_dir/default_experiment/replay.mp4','rb').read()\n",
"data_url = \"data:video/mp4;base64,\" + b64encode(mp4).decode()\n",
"HTML(\"\"\"\n",
"