# FrozenLake Q-Learning Agent - Model Documentation

## Model Overview
This is a tabular Q-Learning agent trained to solve the FrozenLake-v1 environment (4x4, non-slippery variant) from Gymnasium, the maintained successor to OpenAI Gym. The agent learns to navigate a frozen lake from the start tile to the goal tile while avoiding holes in the ice, using Q-learning with epsilon-greedy exploration.
## Environment Details
- Environment: FrozenLake-v1
- Type: Discrete, Grid World
- Grid Size: 4x4 (16 states)
- Action Space: 4 discrete actions (Left, Down, Right, Up)
- Observation Space: 16 discrete states (0-15)
- Objective: Navigate from start (S) to goal (G) while avoiding holes (H)
- Variant: Non-slippery (`is_slippery=False`), so transitions are deterministic; a short environment-creation sketch follows this list
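As referenced above, the specifications can be sanity-checked by creating the environment directly. This is a minimal sketch, assuming Gymnasium is installed, with `is_slippery=False` to match the non-slippery variant this agent targets:

```python
import gymnasium as gym

env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)
print(env.observation_space)  # Discrete(16)
print(env.action_space)       # Discrete(4)
env.close()
```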
### Environment Layout
```
SFFF
FHFH
FFFH
HFFG
```
Where (a short rollout illustrating the terminal reward semantics follows this legend):
- S = Start position (State 0)
- F = Frozen surface (safe)
- H = Hole (terminal, reward = 0)
- G = Goal (terminal, reward = 1)
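The sketch below steps from the start into the hole at state 5 to show the terminal semantics: the episode ends with `terminated=True` and reward 0. It is illustrative only and not part of the original card:

```python
import gymnasium as gym

env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)
state, _ = env.reset()
state, reward, terminated, truncated, _ = env.step(2)  # Right: state 0 -> 1
state, reward, terminated, truncated, _ = env.step(1)  # Down: state 1 -> 5, a hole
print(state, reward, terminated)  # 5 0.0 True
env.close()
```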
## Training Configuration
### Q-Learning Hyperparameters
- Learning Rate (α): 0.005
- Discount Factor (γ): 0.95
- Maximum Epsilon: 1.0 (100% exploration initially)
- Minimum Epsilon: 0.05 (5% exploration finally)
- Decay Rate: 0.0005 (exponential epsilon decay per episode; see the schedule sketch after this list)
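These exploration settings follow the exponential decay schedule used in the reference training code further below. A quick sketch of how epsilon evolves over episodes:

```python
import numpy as np

max_epsilon, min_epsilon, decay_rate = 1.0, 0.05, 0.0005

for episode in [0, 1_000, 5_000, 10_000, 100_000]:
    epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
    print(f"episode {episode:>6}: epsilon = {epsilon:.3f}")
```

With a decay rate of 0.0005, exploration effectively reaches the 5% floor within a few tens of thousands of the 1,000,000 training episodes.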
### Training Parameters
- Training Episodes: 1,000,000
- Maximum Steps per Episode: 99
- Evaluation Episodes: 100
- Algorithm: Tabular Q-Learning with an ε-greedy behavior policy (the update rule is written out below)
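The per-step update is the standard tabular Q-learning rule (it matches the update in the reference training code below), with α = 0.005 and γ = 0.95:

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[\, r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \,\big]
$$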
## Learned Q-Table Analysis
The final Q-table represents the learned action values for each state-action pair:
### Q-Table Structure
State | Action 0 (Left) | Action 1 (Down) | Action 2 (Right) | Action 3 (Up) | Best Action
------|-----------------|-----------------|------------------|---------------|------------
0 | 0.735 | 0.774 | 0.774 | 0.735 | Down/Right
1 | 0.735 | 0.000 | 0.815 | 0.774 | Right
2 | 0.774 | 0.857 | 0.774 | 0.815 | Down
3 | 0.815 | 0.000 | 0.774 | 0.774 | Left
4 | 0.774 | 0.815 | 0.000 | 0.735 | Down
5 | 0.000 | 0.000 | 0.000 | 0.000 | Any (Hole)
6 | 0.000 | 0.903 | 0.000 | 0.815 | Down
7 | 0.000 | 0.000 | 0.000 | 0.000 | Any (Hole)
8 | 0.815 | 0.000 | 0.857 | 0.774 | Right
9 | 0.815 | 0.903 | 0.903 | 0.000 | Down/Right
10 | 0.857 | 0.950 | 0.000 | 0.857 | Down
11 | 0.000 | 0.000 | 0.000 | 0.000 | Any (Hole)
12 | 0.000 | 0.000 | 0.000 | 0.000 | Any (Hole)
13 | 0.000 | 0.903 | 0.950 | 0.857 | Right
14 | 0.903 | 0.950 | 1.000 | 0.903 | Right
15 | 0.000 | 0.000 | 0.000 | 0.000 | Goal State
### Key Insights from Q-Table
- Goal-Adjacent States: State 14 (adjacent to goal) has the highest Q-value (1.0) for moving right to the goal
- Hole States: States 5, 7, 11, 12 have zero Q-values (terminal hole states)
- Value Propagation: Q-values shrink geometrically with distance from the goal (by a factor of γ = 0.95 per step), showing correct value propagation; see the numeric check after this list
- Optimal Policy: The agent learned to navigate around holes toward the goal
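Since the non-slippery environment is deterministic and the only non-zero reward is the 1 at the goal, the optimal value of a state k steps from the goal is γ^(k-1). A quick check against the (rounded) table values along the optimal path:

```python
gamma = 0.95
# Q-values read from the table above, ordered by distance to the goal
# (state 14 is 1 step away, state 0 is 6 steps away).
table_values = [1.000, 0.950, 0.903, 0.857, 0.815, 0.774]

for k, q in enumerate(table_values, start=1):
    print(f"{k} step(s) to goal: table={q:.3f}, gamma^{k - 1}={gamma ** (k - 1):.3f}")
```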
## Optimal Policy Extraction
Based on the Q-table, the greedy (optimal) policy is (a programmatic extraction sketch follows the list):
- State 0: Move Down or Right (0.774)
- State 1: Move Right (0.815)
- State 2: Move Down (0.857)
- State 3: Move Left (0.815)
- State 4: Move Down (0.815)
- State 6: Move Down (0.903)
- State 8: Move Right (0.857)
- State 9: Move Down or Right (0.903)
- State 10: Move Down (0.950)
- State 13: Move Right (0.950)
- State 14: Move Right (1.000) → Goal!
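As noted above, this policy can be read off programmatically. A small sketch, assuming `qtable` is the loaded 16x4 array (ties, e.g. in states 0 and 9, resolve to the first maximal action under `np.argmax`):

```python
import numpy as np

action_names = ["Left", "Down", "Right", "Up"]
greedy_policy = np.argmax(qtable, axis=1)  # best action index for each of the 16 states

for state, action in enumerate(greedy_policy):
    print(f"State {state:>2}: {action_names[action]}")
```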
## Usage Instructions
### Loading the Model
```python
import gymnasium as gym

# load_from_hub is a helper defined in the training notebook (e.g. the Hugging Face
# Deep RL course), not a library import; it downloads and unpickles the saved model dict.
model = load_from_hub(repo_id="Adilbai/q-FrozenLake-v1-4x4-noSlippery", filename="q-learning.pkl")

# Don't forget to check if you need to add additional attributes (is_slippery=False etc)
env = gym.make(model["env_id"], is_slippery=False)
```
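The snippets in this card call `run_episode`, which is not defined here. Below is a minimal sketch consistent with how it is used (a greedy rollout that returns the episode reward and step count); the signature, the unused `render` flag, and the `"qtable"` key are assumptions:

```python
import numpy as np

qtable = model["qtable"]  # assumed key name for the Q-table in the pickled dict

def run_episode(env, qtable, render=False, max_steps=99):
    """Run one episode following the greedy policy; return (total_reward, steps)."""
    state, _ = env.reset()
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        action = int(np.argmax(qtable[state]))  # greedy action from the Q-table
        state, reward, terminated, truncated, _ = env.step(action)
        total_reward += reward
        steps += 1
        if terminated or truncated:
            break
    # With render_mode='human', Gymnasium renders on each step; the render flag is kept
    # only to match the call sites in this card.
    return total_reward, steps
```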
### Create Environment and Test
```python
# The non-slippery flag matches the environment this agent was trained on.
env = gym.make('FrozenLake-v1', is_slippery=False, render_mode='human')
reward, steps = run_episode(env, qtable, render=True)
print(f"Episode reward: {reward}, Steps: {steps}")
env.close()
```
### Performance Evaluation
```python
import gymnasium as gym
import numpy as np

def evaluate_policy(qtable, n_episodes=100):
    """Evaluate the learned greedy policy over several episodes."""
    # Non-slippery to match the environment the agent was trained on.
    env = gym.make('FrozenLake-v1', is_slippery=False)
    rewards = []
    steps_list = []
    for _ in range(n_episodes):
        reward, steps = run_episode(env, qtable)
        rewards.append(reward)
        steps_list.append(steps)
    env.close()
    success_rate = np.mean(rewards) * 100
    avg_steps = np.mean(steps_list)
    print(f"Evaluation Results ({n_episodes} episodes):")
    print(f"Success Rate: {success_rate:.1f}%")
    print(f"Average Steps: {avg_steps:.1f}")
    print(f"Average Reward: {np.mean(rewards):.3f}")
    return success_rate, avg_steps

# Evaluate the model
success_rate, avg_steps = evaluate_policy(qtable)
```
### Visualizing the Policy
```python
import numpy as np

def visualize_policy(qtable):
    """Print the greedy policy as a 4x4 grid of arrows."""
    action_names = ['←', '↓', '→', '↑']
    policy_grid = np.zeros((4, 4), dtype=object)
    for state in range(16):
        row, col = state // 4, state % 4
        if state in [5, 7, 11, 12]:  # Holes
            policy_grid[row, col] = 'H'
        elif state == 15:  # Goal
            policy_grid[row, col] = 'G'
        else:
            best_action = np.argmax(qtable[state])
            policy_grid[row, col] = action_names[best_action]
    print("Learned Policy:")
    print("S = Start, G = Goal, H = Hole")
    for row in policy_grid:
        print(' '.join(f'{cell:>2}' for cell in row))

visualize_policy(qtable)
```
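Given the Q-table above (with ties resolved to the first maximal action), the printed grid should look roughly like this:

```
Learned Policy:
S = Start, G = Goal, H = Hole
 ↓  →  ↓  ←
 ↓  H  ↓  H
 →  ↓  ↓  H
 H  →  →  G
```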
### Training from Scratch (Reference)
```python
import numpy as np

def train_qlearning(env, n_episodes=1000000, learning_rate=0.005,
                    gamma=0.95, max_epsilon=1.0, min_epsilon=0.05,
                    decay_rate=0.0005):
    """Train a tabular Q-learning agent (for reference)."""
    qtable = np.zeros((env.observation_space.n, env.action_space.n))
    for episode in range(n_episodes):
        state, _ = env.reset()
        # Exponential epsilon decay
        epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
        for step in range(99):  # max_steps
            # Epsilon-greedy action selection
            if np.random.random() < epsilon:
                action = env.action_space.sample()
            else:
                action = np.argmax(qtable[state])
            # Take action
            next_state, reward, terminated, truncated, _ = env.step(action)
            # Q-learning update
            qtable[state, action] = qtable[state, action] + learning_rate * (
                reward + gamma * np.max(qtable[next_state]) - qtable[state, action]
            )
            if terminated or truncated:
                break
            state = next_state
    return qtable
```
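A hedged end-to-end usage sketch for the reference trainer; the environment settings follow the model's repo id (4x4, non-slippery), and the save filename is purely illustrative:

```python
import gymnasium as gym
import numpy as np

env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)
# 1,000,000 episodes is what the card reports; on this deterministic map it takes a while.
qtable = train_qlearning(env, n_episodes=1_000_000)
env.close()

np.save("q_frozenlake_4x4_noslippery.npy", qtable)  # illustrative filename
```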
## Model Performance Characteristics
### Strengths
- Optimal Solution: Successfully learned to navigate to the goal
- Robust Policy: High Q-values near goal indicate reliable pathfinding
- Hole Avoidance: Properly learned to avoid terminal hole states
- Value Propagation: Correct value propagation from goal to start
### Limitations
- Environment Specific: Only works for FrozenLake-v1 4x4 grid
- Tabular Method: Doesn't generalize to larger or different environments
- Deterministic Variant Only: The policy was learned on the non-slippery environment; it does not account for the stochastic (slippery) dynamics of the default FrozenLake-v1 and may perform poorly there
### Expected Performance
Based on the Q-table values, the agent should achieve:
- Success Rate: ~100% on the non-slippery 4x4 map (consistent with the self-reported mean reward of 1.00 +/- 0.00)
- Average Steps: 6 per successful episode (the greedy policy follows a shortest path on this map)
- Convergence: Stable policy after the reported 1,000,000 training episodes
This Q-learning agent represents a well-trained tabular reinforcement learning solution for the classic FrozenLake navigation problem.
## Evaluation Results
- mean_reward on FrozenLake-v1-4x4-no_slippery (self-reported): 1.00 +/- 0.00