# Q-Learning Agent for Taxi-v3 🚖
This model uses the Q-Learning algorithm to solve the classic Gymnasium Taxi-v3 environment.
## Environment description 🚕
In Taxi-v3 the agent must pick up a passenger and drop them off at the correct destination on a 5x5 grid (the snippet after the lists below shows how to inspect the corresponding state and action spaces).
Actions:
- 0: Move south
- 1: Move north
- 2: Move east
- 3: Move west
- 4: Pick up the passenger
- 5: Drop off the passenger
Rewards:
- +20 for dropping the passenger off at the correct destination
- -10 for illegal pickup or drop-off attempts
- -1 for every other step
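As a quick sanity check, both spaces can be inspected directly: Taxi-v3 exposes 500 discrete states (taxi position x passenger location x destination) and the 6 discrete actions listed above.

```python
import gymnasium as gym

env = gym.make("Taxi-v3")
print(env.observation_space.n)  # 500 discrete states (taxi row x taxi column x passenger location x destination)
print(env.action_space.n)       # 6 discrete actions, matching the list above
```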
## Results 📊

| Metric | Value |
|---|---|
| Training episodes | 50,000 |
| Mean reward | 7.54 |
| Std reward | 2.74 |
| Final result (mean - std) | 4.80 |
## Hyperparameters 🛠️
- Learning rate (α): 0.7
- Gamma (γ): 0.99
- Initial epsilon: 1.0
- Minimum epsilon: 0.05
- Epsilon decay rate: 0.005 (exponential schedule, sketched below)
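These hyperparameters drive the standard tabular Q-Learning update, which is exactly what the training loop in the full code below applies at every step:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

The exploration rate decays exponentially over the episodes. A minimal sketch of how epsilon evolves, using the same formula and constants as the training code below:

```python
import numpy as np

max_epsilon, min_epsilon, decay_rate = 1.0, 0.05, 0.005

for episode in (0, 100, 500, 1000, 5000):
    epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
    print(episode, round(epsilon, 3))  # roughly 1.0, 0.626, 0.128, 0.056, 0.05
```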
## Installation and usage 🚀
```bash
!pip install gymnasium pygame numpy imageio huggingface_hub pyvirtualdisplay
!apt-get update
!apt-get install -y python3-opengl ffmpeg xvfb
```
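xvfb and pyvirtualdisplay are installed because rendering needs a display server on a headless machine. A minimal sketch of starting a virtual display before creating the environment, assuming a Colab-style session:

```python
from pyvirtualdisplay import Display

# Start a virtual display so render_mode="rgb_array" works without a physical screen
virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()
```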
## Full code 📄
```python
import numpy as np
import gymnasium as gym
import random
from tqdm.notebook import tqdm
import pickle
from huggingface_hub import notebook_login

# Log in to Hugging Face
notebook_login()

# Create the Taxi-v3 environment
env = gym.make("Taxi-v3", render_mode="rgb_array")

# Initialize the Q-table with zeros (one row per state, one column per action)
state_space = env.observation_space.n
action_space = env.action_space.n
Qtable = np.zeros((state_space, action_space))
# Hyperparameters
n_training_episodes = 50000
learning_rate = 0.7
gamma = 0.99
max_steps = 99

# Exploration parameters
max_epsilon = 1.0
min_epsilon = 0.05
decay_rate = 0.005

# Evaluation seeds (do not modify)
eval_seed = [16,54,165,177,191,191,120,80,149,178,48,38,6,125,174,73,50,172,100,148,
146,6,25,40,68,148,49,167,9,97,164,176,61,7,54,55,161,131,184,51,170,
12,120,113,95,126,51,98,36,135,54,82,45,95,89,59,95,124,9,113,58,85,
51,134,121,169,105,21,30,11,50,65,12,43,82,145,152,97,106,55,31,85,38,
112,102,168,123,97,21,83,158,26,80,63,5,81,32,11,28,148]
# Policies
def greedy_policy(Qtable, state):
    # Exploitation: take the action with the highest Q-value for this state
    return np.argmax(Qtable[state])

def epsilon_greedy_policy(Qtable, state, epsilon):
    # Exploration/exploitation trade-off: act greedily with probability 1 - epsilon
    if random.uniform(0, 1) > epsilon:
        action = greedy_policy(Qtable, state)
    else:
        action = env.action_space.sample()
    return action
# Train the agent
def train_agent():
    for episode in tqdm(range(n_training_episodes)):
        # Decay epsilon exponentially over the episodes
        epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
        state, info = env.reset()
        terminated, truncated = False, False
        for step in range(max_steps):
            action = epsilon_greedy_policy(Qtable, state, epsilon)
            new_state, reward, terminated, truncated, info = env.step(action)
            # Q-Learning update: Q(s,a) += lr * (r + gamma * max_a' Q(s',a') - Q(s,a))
            Qtable[state][action] += learning_rate * (
                reward + gamma * np.max(Qtable[new_state]) - Qtable[state][action]
            )
            if terminated or truncated:
                break
            state = new_state

train_agent()
# Evaluate the agent
def evaluate_agent():
    episode_rewards = []
    for seed in tqdm(eval_seed):
        state, info = env.reset(seed=seed)
        total_reward = 0
        for step in range(max_steps):
            # Act greedily (no exploration) during evaluation
            action = greedy_policy(Qtable, state)
            new_state, reward, terminated, truncated, info = env.step(action)
            total_reward += reward
            if terminated or truncated:
                break
            state = new_state
        episode_rewards.append(total_reward)
    mean_reward = np.mean(episode_rewards)
    std_reward = np.std(episode_rewards)
    print(f"Mean reward: {mean_reward:.2f}, Std reward: {std_reward:.2f}, Result: {mean_reward - std_reward:.2f}")

evaluate_agent()
```
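pickle and huggingface_hub are imported above but the snippet stops after evaluation. A minimal sketch of persisting the trained Q-table and, optionally, uploading it to the Hub; the file name and repo id are placeholders, not part of the original code:

```python
import pickle
from huggingface_hub import HfApi

# Save the trained Q-table to disk
with open("q_learning_taxi_v3.pkl", "wb") as f:
    pickle.dump(Qtable, f)

# Optionally upload the file to an existing Hugging Face repo (placeholder repo id)
api = HfApi()
api.upload_file(
    path_or_fileobj="q_learning_taxi_v3.pkl",
    path_in_repo="q_learning_taxi_v3.pkl",
    repo_id="<username>/q-learning-taxi-v3",
)
```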
## Author ✨
Developed by cparedes.