Q-Learning Agent for Taxi-v3 🚖

This model uses the Q-Learning algorithm to solve the classic Gymnasium Taxi-v3 environment.

Environment description 🚕

The goal in the Taxi-v3 environment is to pick up a passenger and drop them off at a specific destination on a 5x5 grid. The actions and rewards are listed below, followed by a minimal interaction sketch.

  • Actions:

    • 0: Move south
    • 1: Move north
    • 2: Move east
    • 3: Move west
    • 4: Pick up the passenger
    • 5: Drop off the passenger
  • Rewards:

    • +20 for dropping the passenger off at the correct destination
    • -10 for illegal pickup or drop-off attempts
    • -1 for every other time step
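
As a quick sanity check, the state and action encoding and the reward signal can be inspected directly with Gymnasium. A minimal sketch, independent of the training code further below (the seed is arbitrary):

import gymnasium as gym

env = gym.make("Taxi-v3")
state, info = env.reset(seed=42)

print(env.observation_space.n)  # 500 discrete states
print(env.action_space.n)       # 6 discrete actions (0-5, as listed above)

# One random step; the reward follows the scheme above (-1, -10, or +20)
action = env.action_space.sample()
new_state, reward, terminated, truncated, info = env.step(action)
print(action, reward)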

Results 📊

Metric                      Value
Episodes                    50,000
Mean reward                 7.54
Std reward                  2.74
Final result (mean - std)   4.80

Hyperparameters 🛠️

  • Learning rate (α): 0.7
  • Gamma (γ): 0.99
  • Initial epsilon: 1.0
  • Minimum epsilon: 0.05
  • Epsilon decay rate: 0.005 (exponential decay, sketched below)
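
The training loop below decays epsilon exponentially from the initial value toward the minimum as episodes progress; a minimal sketch of the schedule using the hyperparameters above:

import numpy as np

max_epsilon, min_epsilon, decay_rate = 1.0, 0.05, 0.005

for episode in (0, 100, 500, 1000, 5000, 50000):
    epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
    print(f"episode {episode:>6}: epsilon = {epsilon:.3f}")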

Installation and usage 🚀

In a notebook environment such as Google Colab, install the Python dependencies and the system packages needed for rendering (the apt-get lines assume a Debian-based system):

!pip install gymnasium pygame numpy imageio huggingface_hub pyvirtualdisplay
!apt-get update
!apt-get install -y python3-opengl ffmpeg xvfb
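
pyvirtualdisplay and xvfb are installed above but not used in the listing below; on a headless machine the usual pattern is to start a virtual display before rendering frames. A minimal sketch, assuming the standard pyvirtualdisplay API (the display size is arbitrary):

from pyvirtualdisplay import Display

# Start a virtual display so rendering works on a headless machine
virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()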

Full code 📄

import numpy as np
import gymnasium as gym
import random
from tqdm.notebook import tqdm
import pickle
from huggingface_hub import notebook_login

# Log in to Hugging Face
notebook_login()

# Create the Taxi-v3 environment
env = gym.make("Taxi-v3", render_mode="rgb_array")

# Initialize the Q-table
state_space = env.observation_space.n
action_space = env.action_space.n
Qtable = np.zeros((state_space, action_space))

# Hyperparameters
n_training_episodes = 50000
learning_rate = 0.7
gamma = 0.99
max_steps = 99

# Exploration parameters
max_epsilon = 1.0
min_epsilon = 0.05
decay_rate = 0.005

# Evaluation seeds (do not modify)
eval_seed = [16,54,165,177,191,191,120,80,149,178,48,38,6,125,174,73,50,172,100,148,
             146,6,25,40,68,148,49,167,9,97,164,176,61,7,54,55,161,131,184,51,170,
             12,120,113,95,126,51,98,36,135,54,82,45,95,89,59,95,124,9,113,58,85,
             51,134,121,169,105,21,30,11,50,65,12,43,82,145,152,97,106,55,31,85,38,
             112,102,168,123,97,21,83,158,26,80,63,5,81,32,11,28,148]

# Policies
def greedy_policy(Qtable, state):
    # Exploitation: pick the action with the highest Q-value for this state
    return np.argmax(Qtable[state])

def epsilon_greedy_policy(Qtable, state, epsilon):
    # With probability 1 - epsilon exploit the Q-table, otherwise explore randomly
    if random.uniform(0, 1) > epsilon:
        action = greedy_policy(Qtable, state)
    else:
        action = env.action_space.sample()
    return action

# Train the agent
def train_agent():
    for episode in tqdm(range(n_training_episodes)):
        epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
        state, info = env.reset()
        terminated, truncated = False, False
        
        for step in range(max_steps):
            action = epsilon_greedy_policy(Qtable, state, epsilon)
            new_state, reward, terminated, truncated, info = env.step(action)

            # Q-learning (TD) update: Q(s,a) <- Q(s,a) + lr * [r + gamma * max_a' Q(s',a') - Q(s,a)]
            Qtable[state][action] += learning_rate * (
                reward + gamma * np.max(Qtable[new_state]) - Qtable[state][action]
            )

            if terminated or truncated:
                break

            state = new_state

train_agent()

# Evaluate the agent
def evaluate_agent():
    episode_rewards = []
    for seed in tqdm(eval_seed):
        state, info = env.reset(seed=seed)
        total_reward = 0

        for step in range(max_steps):
            action = greedy_policy(Qtable, state)
            new_state, reward, terminated, truncated, info = env.step(action)
            total_reward += reward

            if terminated or truncated:
                break
            state = new_state
        episode_rewards.append(total_reward)

    mean_reward = np.mean(episode_rewards)
    std_reward = np.std(episode_rewards)
    print(f"Mean reward: {mean_reward:.2f}, Std reward: {std_reward:.2f}, Result: {mean_reward - std_reward:.2f}")

evaluate_agent()
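
The listing above logs in with notebook_login and imports pickle, but it never saves or publishes the trained Q-table, and imageio is installed yet unused. A hedged sketch of one way to finish the workflow; the file names and the repo id are placeholders, and HfApi.upload_file assumes the target repo already exists (it can be created with api.create_repo):

import pickle
import imageio
from huggingface_hub import HfApi

# Save the trained Q-table locally (file name is arbitrary)
with open("q-learning-taxi-v3.pkl", "wb") as f:
    pickle.dump(Qtable, f)

# Upload it to the Hugging Face Hub (repo id is a placeholder)
api = HfApi()
api.upload_file(
    path_or_fileobj="q-learning-taxi-v3.pkl",
    path_in_repo="q-learning-taxi-v3.pkl",
    repo_id="<your-username>/q-Taxi-v3",
)

# Record a short replay of the greedy policy as a GIF
frames = []
state, info = env.reset(seed=eval_seed[0])
for step in range(max_steps):
    frames.append(env.render())
    action = greedy_policy(Qtable, state)
    state, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break
imageio.mimsave("replay.gif", frames)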

Author ✨

Developed by cparedes.
