# FrozenLake Q-Learning Agent - Model Documentation

## Model Overview
This is a tabular Q-Learning agent trained to solve the FrozenLake-v1 environment (4x4, non-slippery variant) from Gymnasium, the maintained successor to OpenAI Gym. The agent learns to navigate a frozen lake from the start tile to the goal tile while avoiding holes in the ice, using Q-learning with epsilon-greedy exploration.
## Environment Details
- Environment: FrozenLake-v1
- Type: Discrete, Grid World
- Grid Size: 4x4 (16 states)
- Action Space: 4 discrete actions (Left, Down, Right, Up)
- Observation Space: 16 discrete states (0-15)
- Objective: Navigate from start (S) to goal (G) while avoiding holes (H)
- Variant: Non-slippery (`is_slippery=False`), so transitions are deterministic; a short environment-creation sketch follows this list
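As referenced above, the specifications can be sanity-checked by creating the environment directly. This is a minimal sketch, assuming Gymnasium is installed, with `is_slippery=False` to match the non-slippery variant this agent targets:

```python
import gymnasium as gym

env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)
print(env.observation_space)  # Discrete(16)
print(env.action_space)       # Discrete(4)
env.close()
```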
### Environment Layout
```
SFFF
FHFH
FFFH
HFFG
```
Where (a short rollout illustrating the terminal reward semantics follows this legend):
- S = Start position (State 0)
- F = Frozen surface (safe)
- H = Hole (terminal, reward = 0)
- G = Goal (terminal, reward = 1)
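The sketch below steps from the start into the hole at state 5 to show the terminal semantics: the episode ends with `terminated=True` and reward 0. It is illustrative only and not part of the original card:

```python
import gymnasium as gym

env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)
state, _ = env.reset()
state, reward, terminated, truncated, _ = env.step(2)  # Right: state 0 -> 1
state, reward, terminated, truncated, _ = env.step(1)  # Down: state 1 -> 5, a hole
print(state, reward, terminated)  # 5 0.0 True
env.close()
```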
## Training Configuration
### Q-Learning Hyperparameters
- Learning Rate (α): 0.005
- Discount Factor (γ): 0.95
- Maximum Epsilon: 1.0 (100% exploration initially)
- Minimum Epsilon: 0.05 (5% exploration finally)
- Decay Rate: 0.0005 (exponential epsilon decay per episode; see the schedule sketch after this list)
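These exploration settings follow the exponential decay schedule used in the reference training code further below. A quick sketch of how epsilon evolves over episodes:

```python
import numpy as np

max_epsilon, min_epsilon, decay_rate = 1.0, 0.05, 0.0005

for episode in [0, 1_000, 5_000, 10_000, 100_000]:
    epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
    print(f"episode {episode:>6}: epsilon = {epsilon:.3f}")
```

With a decay rate of 0.0005, exploration effectively reaches the 5% floor within a few tens of thousands of the 1,000,000 training episodes.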
### Training Parameters
- Training Episodes: 1,000,000
- Maximum Steps per Episode: 99
- Evaluation Episodes: 100
- Algorithm: Tabular Q-Learning with an ε-greedy behavior policy (the update rule is written out below)
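The per-step update is the standard tabular Q-learning rule (it matches the update in the reference training code below), with α = 0.005 and γ = 0.95:

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[\, r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \,\big]
$$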
## Learned Q-Table Analysis
The final Q-table represents the learned action values for each state-action pair:
### Q-Table Structure
State | Action 0 (Left) | Action 1 (Down) | Action 2 (Right) | Action 3 (Up) | Best Action
------|-----------------|-----------------|------------------|---------------|------------
0 | 0.735 | 0.774 | 0.774 | 0.735 | Down/Right
1 | 0.735 | 0.000 | 0.815 | 0.774 | Right
2 | 0.774 | 0.857 | 0.774 | 0.815 | Down
3 | 0.815 | 0.000 | 0.774 | 0.774 | Left
4 | 0.774 | 0.815 | 0.000 | 0.735 | Down
5 | 0.000 | 0.000 | 0.000 | 0.000 | Any (Hole)
6 | 0.000 | 0.903 | 0.000 | 0.815 | Down
7 | 0.000 | 0.000 | 0.000 | 0.000 | Any (Hole)
8 | 0.815 | 0.000 | 0.857 | 0.774 | Right
9 | 0.815 | 0.903 | 0.903 | 0.000 | Down/Right
10 | 0.857 | 0.950 | 0.000 | 0.857 | Down
11 | 0.000 | 0.000 | 0.000 | 0.000 | Any (Hole)
12 | 0.000 | 0.000 | 0.000 | 0.000 | Any (Hole)
13 | 0.000 | 0.903 | 0.950 | 0.857 | Right
14 | 0.903 | 0.950 | 1.000 | 0.903 | Right
15 | 0.000 | 0.000 | 0.000 | 0.000 | Goal State
### Key Insights from Q-Table
- Goal-Adjacent States: State 14 (adjacent to goal) has the highest Q-value (1.0) for moving right to the goal
- Hole States: States 5, 7, 11, 12 have zero Q-values (terminal hole states)
- Value Propagation: Q-values shrink geometrically with distance from the goal (by a factor of γ = 0.95 per step), showing correct value propagation; see the numeric check after this list
- Optimal Policy: The agent learned to navigate around holes toward the goal
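Since the non-slippery environment is deterministic and the only non-zero reward is the 1 at the goal, the optimal value of a state k steps from the goal is γ^(k-1). A quick check against the (rounded) table values along the optimal path:

```python
gamma = 0.95
# Q-values read from the table above, ordered by distance to the goal
# (state 14 is 1 step away, state 0 is 6 steps away).
table_values = [1.000, 0.950, 0.903, 0.857, 0.815, 0.774]

for k, q in enumerate(table_values, start=1):
    print(f"{k} step(s) to goal: table={q:.3f}, gamma^{k - 1}={gamma ** (k - 1):.3f}")
```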
## Optimal Policy Extraction
Based on the Q-table, the greedy (optimal) policy is (a programmatic extraction sketch follows the list):
- State 0: Move Down or Right (0.774)
- State 1: Move Right (0.815)
- State 2: Move Down (0.857)
- State 3: Move Left (0.815)
- State 4: Move Down (0.815)
- State 6: Move Down (0.903)
- State 8: Move Right (0.857)
- State 9: Move Down or Right (0.903)
- State 10: Move Down (0.950)
- State 13: Move Right (0.950)
- State 14: Move Right (1.000) → Goal!
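As noted above, this policy can be read off programmatically. A small sketch, assuming `qtable` is the loaded 16x4 array (ties, e.g. in states 0 and 9, resolve to the first maximal action under `np.argmax`):

```python
import numpy as np

action_names = ["Left", "Down", "Right", "Up"]
greedy_policy = np.argmax(qtable, axis=1)  # best action index for each of the 16 states

for state, action in enumerate(greedy_policy):
    print(f"State {state:>2}: {action_names[action]}")
```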
## Usage Instructions
### Loading the Model
```python
import gymnasium as gym

# load_from_hub is a helper defined in the training notebook (e.g. the Hugging Face
# Deep RL course), not a library import; it downloads and unpickles the saved model dict.
model = load_from_hub(repo_id="Adilbai/q-FrozenLake-v1-4x4-noSlippery", filename="q-learning.pkl")

# Don't forget to check if you need to add additional attributes (is_slippery=False etc)
env = gym.make(model["env_id"], is_slippery=False)
```
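The snippets in this card call `run_episode`, which is not defined here. Below is a minimal sketch consistent with how it is used (a greedy rollout that returns the episode reward and step count); the signature, the unused `render` flag, and the `"qtable"` key are assumptions:

```python
import numpy as np

qtable = model["qtable"]  # assumed key name for the Q-table in the pickled dict

def run_episode(env, qtable, render=False, max_steps=99):
    """Run one episode following the greedy policy; return (total_reward, steps)."""
    state, _ = env.reset()
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        action = int(np.argmax(qtable[state]))  # greedy action from the Q-table
        state, reward, terminated, truncated, _ = env.step(action)
        total_reward += reward
        steps += 1
        if terminated or truncated:
            break
    # With render_mode='human', Gymnasium renders on each step; the render flag is kept
    # only to match the call sites in this card.
    return total_reward, steps
```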
### Create Environment and Test
```python
# The non-slippery flag matches the environment this agent was trained on.
env = gym.make('FrozenLake-v1', is_slippery=False, render_mode='human')
reward, steps = run_episode(env, qtable, render=True)
print(f"Episode reward: {reward}, Steps: {steps}")
env.close()
```
### Performance Evaluation
```python
import gymnasium as gym
import numpy as np

def evaluate_policy(qtable, n_episodes=100):
    """Evaluate the learned greedy policy over several episodes."""
    # Non-slippery to match the environment the agent was trained on.
    env = gym.make('FrozenLake-v1', is_slippery=False)
    rewards = []
    steps_list = []
    for _ in range(n_episodes):
        reward, steps = run_episode(env, qtable)
        rewards.append(reward)
        steps_list.append(steps)
    env.close()
    success_rate = np.mean(rewards) * 100
    avg_steps = np.mean(steps_list)
    print(f"Evaluation Results ({n_episodes} episodes):")
    print(f"Success Rate: {success_rate:.1f}%")
    print(f"Average Steps: {avg_steps:.1f}")
    print(f"Average Reward: {np.mean(rewards):.3f}")
    return success_rate, avg_steps

# Evaluate the model
success_rate, avg_steps = evaluate_policy(qtable)
```
### Visualizing the Policy
```python
import numpy as np

def visualize_policy(qtable):
    """Print the greedy policy as a 4x4 grid of arrows."""
    action_names = ['←', '↓', '→', '↑']
    policy_grid = np.zeros((4, 4), dtype=object)
    for state in range(16):
        row, col = state // 4, state % 4
        if state in [5, 7, 11, 12]:  # Holes
            policy_grid[row, col] = 'H'
        elif state == 15:  # Goal
            policy_grid[row, col] = 'G'
        else:
            best_action = np.argmax(qtable[state])
            policy_grid[row, col] = action_names[best_action]
    print("Learned Policy:")
    print("S = Start, G = Goal, H = Hole")
    for row in policy_grid:
        print(' '.join(f'{cell:>2}' for cell in row))

visualize_policy(qtable)
```
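Given the Q-table above (with ties resolved to the first maximal action), the printed grid should look roughly like this:

```
Learned Policy:
S = Start, G = Goal, H = Hole
 ↓  →  ↓  ←
 ↓  H  ↓  H
 →  ↓  ↓  H
 H  →  →  G
```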
### Training from Scratch (Reference)
```python
import numpy as np

def train_qlearning(env, n_episodes=1000000, learning_rate=0.005,
                    gamma=0.95, max_epsilon=1.0, min_epsilon=0.05,
                    decay_rate=0.0005):
    """Train a tabular Q-learning agent (for reference)."""
    qtable = np.zeros((env.observation_space.n, env.action_space.n))
    for episode in range(n_episodes):
        state, _ = env.reset()
        # Exponential epsilon decay
        epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
        for step in range(99):  # max_steps
            # Epsilon-greedy action selection
            if np.random.random() < epsilon:
                action = env.action_space.sample()
            else:
                action = np.argmax(qtable[state])
            # Take action
            next_state, reward, terminated, truncated, _ = env.step(action)
            # Q-learning update
            qtable[state, action] = qtable[state, action] + learning_rate * (
                reward + gamma * np.max(qtable[next_state]) - qtable[state, action]
            )
            if terminated or truncated:
                break
            state = next_state
    return qtable
```
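A hedged end-to-end usage sketch for the reference trainer; the environment settings follow the model's repo id (4x4, non-slippery), and the save filename is purely illustrative:

```python
import gymnasium as gym
import numpy as np

env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)
# 1,000,000 episodes is what the card reports; on this deterministic map it takes a while.
qtable = train_qlearning(env, n_episodes=1_000_000)
env.close()

np.save("q_frozenlake_4x4_noslippery.npy", qtable)  # illustrative filename
```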
## Model Performance Characteristics
### Strengths
- Optimal Solution: Successfully learned to navigate to the goal
- Robust Policy: High Q-values near goal indicate reliable pathfinding
- Hole Avoidance: Properly learned to avoid terminal hole states
- Value Propagation: Correct value propagation from goal to start
### Limitations
- Environment Specific: Only works for FrozenLake-v1 4x4 grid
- Tabular Method: Doesn't generalize to larger or different environments
- Deterministic Variant Only: The policy was learned on the non-slippery environment; it does not account for the stochastic (slippery) dynamics of the default FrozenLake-v1 and may perform poorly there
### Expected Performance
Based on the Q-table values, the agent should achieve:
- Success Rate: ~100% on the non-slippery 4x4 map (consistent with the self-reported mean reward of 1.00 +/- 0.00)
- Average Steps: 6 per successful episode (the greedy policy follows a shortest path on this map)
- Convergence: Stable policy after the reported 1,000,000 training episodes
This Q-learning agent represents a well-trained tabular reinforcement learning solution for the classic FrozenLake navigation problem.
## Evaluation Results
- mean_reward on FrozenLake-v1-4x4-no_slippery (self-reported): 1.00 +/- 0.00