{ "cells": [ { "cell_type": "code", "execution_count": 1, "id": "7d4e1012-3ee7-4482-b536-a5740f73a074", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2025-08-02 09:46:05,919 - INFO - ๐ŸŽฏ TD Learning Implementation - Complete with Logging\n", "2025-08-02 09:46:05,920 - INFO - Environment: 5 states (0 to 4)\n", "2025-08-02 09:46:05,920 - INFO - Goal: Reach terminal state 4 for reward +10\n", "2025-08-02 09:46:05,920 - INFO - \n", "๐Ÿš€ Starting TD Learning Training for 100 episodes\n", "2025-08-02 09:46:05,920 - INFO - Parameters - Alpha: 0.1, Gamma: 0.9\n", "2025-08-02 09:46:05,920 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,920 - INFO - TD Update - State: 0, Reward: -1, Target: -1.000, TD Error: -1.000, New V(s): -0.100\n", "2025-08-02 09:46:05,921 - INFO - TD Update - State: 0, Reward: 0, Target: 0.000, TD Error: 0.100, New V(s): -0.090\n", "2025-08-02 09:46:05,921 - INFO - TD Update - State: 1, Reward: 0, Target: 0.000, TD Error: 0.000, New V(s): 0.000\n", "2025-08-02 09:46:05,921 - INFO - TD Update - State: 2, Reward: -1, Target: -1.000, TD Error: -1.000, New V(s): -0.100\n", "2025-08-02 09:46:05,921 - INFO - TD Update - State: 2, Reward: 0, Target: 0.000, TD Error: 0.100, New V(s): -0.090\n", "2025-08-02 09:46:05,921 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 10.000, New V(s): 1.000\n", "2025-08-02 09:46:05,921 - INFO - Episode Complete - Total Reward: 8, Avg TD Error: 2.033\n", "2025-08-02 09:46:05,922 - INFO - \n", "๐Ÿ“Š Episode 0 Summary:\n", "2025-08-02 09:46:05,922 - INFO - Value Function: [-0.09 0. -0.09 1. 0. ]\n", "2025-08-02 09:46:05,922 - INFO - Recent Avg Reward: 8.00\n", "2025-08-02 09:46:05,922 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,922 - INFO - TD Update - State: 0, Reward: 0, Target: 0.000, TD Error: 0.090, New V(s): -0.081\n", "2025-08-02 09:46:05,922 - INFO - TD Update - State: 1, Reward: -1, Target: -1.000, TD Error: -1.000, New V(s): -0.100\n", "2025-08-02 09:46:05,922 - INFO - TD Update - State: 1, Reward: 0, Target: -0.081, TD Error: 0.019, New V(s): -0.098\n", "2025-08-02 09:46:05,923 - INFO - TD Update - State: 2, Reward: 0, Target: 0.900, TD Error: 0.990, New V(s): 0.009\n", "2025-08-02 09:46:05,923 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 9.000, New V(s): 1.900\n", "2025-08-02 09:46:05,923 - INFO - Episode Complete - Total Reward: 9, Avg TD Error: 2.220\n", "2025-08-02 09:46:05,923 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,923 - INFO - TD Update - State: 0, Reward: 0, Target: -0.088, TD Error: -0.007, New V(s): -0.082\n", "2025-08-02 09:46:05,923 - INFO - TD Update - State: 1, Reward: -1, Target: -1.088, TD Error: -0.990, New V(s): -0.197\n", "2025-08-02 09:46:05,924 - INFO - TD Update - State: 1, Reward: 0, Target: 0.008, TD Error: 0.205, New V(s): -0.177\n", "2025-08-02 09:46:05,924 - INFO - TD Update - State: 2, Reward: -1, Target: -0.992, TD Error: -1.001, New V(s): -0.091\n", "2025-08-02 09:46:05,924 - INFO - TD Update - State: 2, Reward: -1, Target: -1.082, TD Error: -0.991, New V(s): -0.190\n", "2025-08-02 09:46:05,924 - INFO - TD Update - State: 2, Reward: 0, Target: 1.710, TD Error: 1.900, New V(s): -0.000\n", "2025-08-02 09:46:05,924 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 8.100, New V(s): 2.710\n", "2025-08-02 09:46:05,924 - INFO - Episode Complete - Total Reward: 7, Avg TD Error: 1.885\n", "2025-08-02 09:46:05,924 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,925 - INFO - TD Update - State: 0, Reward: -1, Target: -1.074, TD Error: -0.992, New V(s): -0.181\n", "2025-08-02 09:46:05,925 - INFO - TD Update - State: 0, Reward: -1, Target: -1.163, TD Error: -0.982, New V(s): -0.279\n", "2025-08-02 09:46:05,925 - INFO - TD Update - State: 0, Reward: 0, Target: -0.159, TD Error: 0.120, New V(s): -0.267\n", "2025-08-02 09:46:05,925 - INFO - TD Update - State: 1, Reward: -1, Target: -1.159, TD Error: -0.982, New V(s): -0.275\n", "2025-08-02 09:46:05,925 - INFO - TD Update - State: 1, Reward: -1, Target: -1.247, TD Error: -0.973, New V(s): -0.372\n", "2025-08-02 09:46:05,925 - INFO - TD Update - State: 1, Reward: -1, Target: -1.335, TD Error: -0.963, New V(s): -0.468\n", "2025-08-02 09:46:05,925 - INFO - TD Update - State: 1, Reward: -1, Target: -1.422, TD Error: -0.953, New V(s): -0.564\n", "2025-08-02 09:46:05,926 - INFO - TD Update - State: 1, Reward: 0, Target: -0.000, TD Error: 0.564, New V(s): -0.507\n", "2025-08-02 09:46:05,926 - INFO - TD Update - State: 2, Reward: 0, Target: 2.439, TD Error: 2.439, New V(s): 0.244\n", "2025-08-02 09:46:05,926 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 7.290, New V(s): 3.439\n", "2025-08-02 09:46:05,926 - INFO - Episode Complete - Total Reward: 4, Avg TD Error: 1.626\n", "2025-08-02 09:46:05,926 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,926 - INFO - TD Update - State: 0, Reward: -1, Target: -1.240, TD Error: -0.973, New V(s): -0.364\n", "2025-08-02 09:46:05,927 - INFO - TD Update - State: 0, Reward: 0, Target: -0.457, TD Error: -0.092, New V(s): -0.374\n", "2025-08-02 09:46:05,927 - INFO - TD Update - State: 1, Reward: 0, Target: 0.219, TD Error: 0.727, New V(s): -0.435\n", "2025-08-02 09:46:05,927 - INFO - TD Update - State: 2, Reward: 0, Target: 3.095, TD Error: 2.851, New V(s): 0.529\n", "2025-08-02 09:46:05,927 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 6.561, New V(s): 4.095\n", "2025-08-02 09:46:05,927 - INFO - Episode Complete - Total Reward: 9, Avg TD Error: 2.241\n", "2025-08-02 09:46:05,927 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,927 - INFO - TD Update - State: 0, Reward: -1, Target: -1.336, TD Error: -0.963, New V(s): -0.470\n", "2025-08-02 09:46:05,928 - INFO - TD Update - State: 0, Reward: 0, Target: -0.391, TD Error: 0.079, New V(s): -0.462\n", "2025-08-02 09:46:05,928 - INFO - TD Update - State: 1, Reward: -1, Target: -1.391, TD Error: -0.957, New V(s): -0.530\n", "2025-08-02 09:46:05,928 - INFO - TD Update - State: 1, Reward: 0, Target: 0.476, TD Error: 1.006, New V(s): -0.430\n", "2025-08-02 09:46:05,928 - INFO - TD Update - State: 2, Reward: -1, Target: -0.524, TD Error: -1.053, New V(s): 0.424\n", "2025-08-02 09:46:05,928 - INFO - TD Update - State: 2, Reward: -1, Target: -0.619, TD Error: -1.042, New V(s): 0.319\n", "2025-08-02 09:46:05,928 - INFO - TD Update - State: 2, Reward: 0, Target: 3.686, TD Error: 3.366, New V(s): 0.656\n", "2025-08-02 09:46:05,929 - INFO - TD Update - State: 3, Reward: -1, Target: 2.686, TD Error: -1.410, New V(s): 3.954\n", "2025-08-02 09:46:05,929 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 6.046, New V(s): 4.559\n", "2025-08-02 09:46:05,929 - INFO - Episode Complete - Total Reward: 5, Avg TD Error: 1.769\n", "2025-08-02 09:46:05,929 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,930 - INFO - TD Update - State: 0, Reward: 0, Target: -0.387, TD Error: 0.075, New V(s): -0.454\n", "2025-08-02 09:46:05,930 - INFO - TD Update - State: 1, Reward: 0, Target: 0.590, TD Error: 1.020, New V(s): -0.328\n", "2025-08-02 09:46:05,930 - INFO - TD Update - State: 2, Reward: 0, Target: 4.103, TD Error: 3.447, New V(s): 1.001\n", "2025-08-02 09:46:05,930 - INFO - TD Update - State: 3, Reward: -1, Target: 3.103, TD Error: -1.456, New V(s): 4.413\n", "2025-08-02 09:46:05,930 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 5.587, New V(s): 4.972\n", "2025-08-02 09:46:05,931 - INFO - Episode Complete - Total Reward: 9, Avg TD Error: 2.317\n", "2025-08-02 09:46:05,931 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,931 - INFO - TD Update - State: 0, Reward: -1, Target: -1.409, TD Error: -0.955, New V(s): -0.550\n", "2025-08-02 09:46:05,931 - INFO - TD Update - State: 0, Reward: 0, Target: -0.295, TD Error: 0.255, New V(s): -0.524\n", "2025-08-02 09:46:05,931 - INFO - TD Update - State: 1, Reward: 0, Target: 0.901, TD Error: 1.228, New V(s): -0.205\n", "2025-08-02 09:46:05,931 - INFO - TD Update - State: 2, Reward: 0, Target: 4.475, TD Error: 3.474, New V(s): 1.348\n", "2025-08-02 09:46:05,932 - INFO - TD Update - State: 3, Reward: -1, Target: 3.475, TD Error: -1.497, New V(s): 4.822\n", "2025-08-02 09:46:05,932 - INFO - TD Update - State: 3, Reward: -1, Target: 3.340, TD Error: -1.482, New V(s): 4.674\n", "2025-08-02 09:46:05,932 - INFO - TD Update - State: 3, Reward: -1, Target: 3.207, TD Error: -1.467, New V(s): 4.527\n", "2025-08-02 09:46:05,932 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 5.473, New V(s): 5.074\n", "2025-08-02 09:46:05,932 - INFO - Episode Complete - Total Reward: 6, Avg TD Error: 1.979\n", "2025-08-02 09:46:05,933 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,933 - INFO - TD Update - State: 0, Reward: -1, Target: -1.472, TD Error: -0.948, New V(s): -0.619\n", "2025-08-02 09:46:05,933 - INFO - TD Update - State: 0, Reward: 0, Target: -0.184, TD Error: 0.435, New V(s): -0.576\n", "2025-08-02 09:46:05,933 - INFO - TD Update - State: 1, Reward: -1, Target: -1.184, TD Error: -0.980, New V(s): -0.303\n", "2025-08-02 09:46:05,933 - INFO - TD Update - State: 1, Reward: -1, Target: -1.273, TD Error: -0.970, New V(s): -0.400\n", "2025-08-02 09:46:05,933 - INFO - TD Update - State: 1, Reward: -1, Target: -1.360, TD Error: -0.960, New V(s): -0.496\n", "2025-08-02 09:46:05,934 - INFO - TD Update - State: 1, Reward: -1, Target: -1.446, TD Error: -0.950, New V(s): -0.591\n", "2025-08-02 09:46:05,934 - INFO - TD Update - State: 1, Reward: 0, Target: 1.213, TD Error: 1.804, New V(s): -0.410\n", "2025-08-02 09:46:05,934 - INFO - TD Update - State: 2, Reward: 0, Target: 4.567, TD Error: 3.219, New V(s): 1.670\n", "2025-08-02 09:46:05,934 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 4.926, New V(s): 5.567\n", "2025-08-02 09:46:05,934 - INFO - Episode Complete - Total Reward: 5, Avg TD Error: 1.688\n", "2025-08-02 09:46:05,934 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,934 - INFO - TD Update - State: 0, Reward: 0, Target: -0.369, TD Error: 0.206, New V(s): -0.555\n", "2025-08-02 09:46:05,935 - INFO - TD Update - State: 1, Reward: -1, Target: -1.369, TD Error: -0.959, New V(s): -0.506\n", "2025-08-02 09:46:05,935 - INFO - TD Update - State: 1, Reward: 0, Target: 1.503, TD Error: 2.009, New V(s): -0.305\n", "2025-08-02 09:46:05,935 - INFO - TD Update - State: 2, Reward: 0, Target: 5.010, TD Error: 3.340, New V(s): 2.004\n", "2025-08-02 09:46:05,935 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 4.433, New V(s): 6.010\n", "2025-08-02 09:46:05,936 - INFO - Episode Complete - Total Reward: 9, Avg TD Error: 2.190\n", "2025-08-02 09:46:05,936 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,936 - INFO - TD Update - State: 0, Reward: -1, Target: -1.500, TD Error: -0.944, New V(s): -0.650\n", "2025-08-02 09:46:05,936 - INFO - TD Update - State: 0, Reward: -1, Target: -1.585, TD Error: -0.935, New V(s): -0.743\n", "2025-08-02 09:46:05,936 - INFO - TD Update - State: 0, Reward: -1, Target: -1.669, TD Error: -0.926, New V(s): -0.836\n", "2025-08-02 09:46:05,937 - INFO - TD Update - State: 0, Reward: -1, Target: -1.752, TD Error: -0.916, New V(s): -0.927\n", "2025-08-02 09:46:05,937 - INFO - TD Update - State: 0, Reward: 0, Target: -0.275, TD Error: 0.652, New V(s): -0.862\n", "2025-08-02 09:46:05,937 - INFO - TD Update - State: 1, Reward: -1, Target: -1.275, TD Error: -0.969, New V(s): -0.402\n", "2025-08-02 09:46:05,937 - INFO - TD Update - State: 1, Reward: -1, Target: -1.362, TD Error: -0.960, New V(s): -0.498\n", "2025-08-02 09:46:05,937 - INFO - TD Update - State: 1, Reward: 0, Target: 1.804, TD Error: 2.302, New V(s): -0.268\n", "2025-08-02 09:46:05,937 - INFO - TD Update - State: 2, Reward: 0, Target: 5.409, TD Error: 3.405, New V(s): 2.345\n", "2025-08-02 09:46:05,938 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 3.990, New V(s): 6.409\n", "2025-08-02 09:46:05,938 - INFO - Episode Complete - Total Reward: 4, Avg TD Error: 1.600\n", "2025-08-02 09:46:05,938 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,938 - INFO - TD Update - State: 0, Reward: 0, Target: -0.241, TD Error: 0.621, New V(s): -0.800\n", "2025-08-02 09:46:05,938 - INFO - TD Update - State: 1, Reward: -1, Target: -1.241, TD Error: -0.973, New V(s): -0.365\n", "2025-08-02 09:46:05,938 - INFO - TD Update - State: 1, Reward: -1, Target: -1.329, TD Error: -0.963, New V(s): -0.462\n", "2025-08-02 09:46:05,939 - INFO - TD Update - State: 1, Reward: 0, Target: 2.110, TD Error: 2.572, New V(s): -0.205\n", "2025-08-02 09:46:05,939 - INFO - TD Update - State: 2, Reward: 0, Target: 5.768, TD Error: 3.424, New V(s): 2.687\n", "2025-08-02 09:46:05,939 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 3.591, New V(s): 6.768\n", "2025-08-02 09:46:05,939 - INFO - Episode Complete - Total Reward: 8, Avg TD Error: 2.024\n", "2025-08-02 09:46:05,939 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,939 - INFO - TD Update - State: 0, Reward: -1, Target: -1.720, TD Error: -0.920, New V(s): -0.892\n", "2025-08-02 09:46:05,939 - INFO - TD Update - State: 0, Reward: 0, Target: -0.184, TD Error: 0.708, New V(s): -0.821\n", "2025-08-02 09:46:05,940 - INFO - TD Update - State: 1, Reward: -1, Target: -1.184, TD Error: -0.980, New V(s): -0.303\n", "2025-08-02 09:46:05,940 - INFO - TD Update - State: 1, Reward: -1, Target: -1.272, TD Error: -0.970, New V(s): -0.400\n", "2025-08-02 09:46:05,940 - INFO - TD Update - State: 1, Reward: -1, Target: -1.360, TD Error: -0.960, New V(s): -0.496\n", "2025-08-02 09:46:05,940 - INFO - TD Update - State: 1, Reward: -1, Target: -1.446, TD Error: -0.950, New V(s): -0.591\n", "2025-08-02 09:46:05,940 - INFO - TD Update - State: 1, Reward: -1, Target: -1.532, TD Error: -0.941, New V(s): -0.685\n", "2025-08-02 09:46:05,940 - INFO - TD Update - State: 1, Reward: -1, Target: -1.616, TD Error: -0.932, New V(s): -0.778\n", "2025-08-02 09:46:05,940 - INFO - TD Update - State: 1, Reward: 0, Target: 2.418, TD Error: 3.196, New V(s): -0.458\n", "2025-08-02 09:46:05,941 - INFO - TD Update - State: 2, Reward: -1, Target: 1.418, TD Error: -1.269, New V(s): 2.560\n", "2025-08-02 09:46:05,941 - INFO - TD Update - State: 2, Reward: -1, Target: 1.304, TD Error: -1.256, New V(s): 2.434\n", "2025-08-02 09:46:05,941 - INFO - TD Update - State: 2, Reward: 0, Target: 6.092, TD Error: 3.657, New V(s): 2.800\n", "2025-08-02 09:46:05,941 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 3.232, New V(s): 7.092\n", "2025-08-02 09:46:05,941 - INFO - Episode Complete - Total Reward: 1, Avg TD Error: 1.536\n", "2025-08-02 09:46:05,941 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,942 - INFO - TD Update - State: 0, Reward: 0, Target: -0.412, TD Error: 0.409, New V(s): -0.780\n", "2025-08-02 09:46:05,942 - INFO - TD Update - State: 1, Reward: 0, Target: 2.520, TD Error: 2.978, New V(s): -0.160\n", "2025-08-02 09:46:05,942 - INFO - TD Update - State: 2, Reward: 0, Target: 6.382, TD Error: 3.582, New V(s): 3.158\n", "2025-08-02 09:46:05,942 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 2.908, New V(s): 7.382\n", "2025-08-02 09:46:05,942 - INFO - Episode Complete - Total Reward: 10, Avg TD Error: 2.469\n", "2025-08-02 09:46:05,942 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,942 - INFO - TD Update - State: 0, Reward: -1, Target: -1.702, TD Error: -0.922, New V(s): -0.872\n", "2025-08-02 09:46:05,943 - INFO - TD Update - State: 0, Reward: -1, Target: -1.785, TD Error: -0.913, New V(s): -0.964\n", "2025-08-02 09:46:05,943 - INFO - TD Update - State: 0, Reward: 0, Target: -0.144, TD Error: 0.819, New V(s): -0.882\n", "2025-08-02 09:46:05,943 - INFO - TD Update - State: 1, Reward: -1, Target: -1.144, TD Error: -0.984, New V(s): -0.259\n", "2025-08-02 09:46:05,943 - INFO - TD Update - State: 1, Reward: -1, Target: -1.233, TD Error: -0.974, New V(s): -0.356\n", "2025-08-02 09:46:05,943 - INFO - TD Update - State: 1, Reward: 0, Target: 2.843, TD Error: 3.199, New V(s): -0.036\n", "2025-08-02 09:46:05,943 - INFO - TD Update - State: 2, Reward: 0, Target: 6.644, TD Error: 3.486, New V(s): 3.507\n", "2025-08-02 09:46:05,944 - INFO - TD Update - State: 3, Reward: -1, Target: 5.644, TD Error: -1.738, New V(s): 7.209\n", "2025-08-02 09:46:05,944 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 2.791, New V(s): 7.488\n", "2025-08-02 09:46:05,944 - INFO - Episode Complete - Total Reward: 5, Avg TD Error: 1.758\n", "2025-08-02 09:46:05,944 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,944 - INFO - TD Update - State: 0, Reward: -1, Target: -1.794, TD Error: -0.912, New V(s): -0.973\n", "2025-08-02 09:46:05,944 - INFO - TD Update - State: 0, Reward: 0, Target: -0.033, TD Error: 0.940, New V(s): -0.879\n", "2025-08-02 09:46:05,945 - INFO - TD Update - State: 1, Reward: -1, Target: -1.033, TD Error: -0.996, New V(s): -0.136\n", "2025-08-02 09:46:05,945 - INFO - TD Update - State: 1, Reward: -1, Target: -1.122, TD Error: -0.986, New V(s): -0.235\n", "2025-08-02 09:46:05,945 - INFO - TD Update - State: 1, Reward: -1, Target: -1.211, TD Error: -0.977, New V(s): -0.332\n", "2025-08-02 09:46:05,945 - INFO - TD Update - State: 1, Reward: 0, Target: 3.156, TD Error: 3.488, New V(s): 0.017\n", "2025-08-02 09:46:05,945 - INFO - TD Update - State: 2, Reward: -1, Target: 2.156, TD Error: -1.351, New V(s): 3.372\n", "2025-08-02 09:46:05,945 - INFO - TD Update - State: 2, Reward: 0, Target: 6.739, TD Error: 3.367, New V(s): 3.709\n", "2025-08-02 09:46:05,946 - INFO - TD Update - State: 3, Reward: -1, Target: 5.739, TD Error: -1.749, New V(s): 7.313\n", "2025-08-02 09:46:05,946 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 2.687, New V(s): 7.582\n", "2025-08-02 09:46:05,946 - INFO - Episode Complete - Total Reward: 4, Avg TD Error: 1.745\n", "2025-08-02 09:46:05,946 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,946 - INFO - TD Update - State: 0, Reward: 0, Target: 0.015, TD Error: 0.894, New V(s): -0.790\n", "2025-08-02 09:46:05,946 - INFO - TD Update - State: 1, Reward: 0, Target: 3.338, TD Error: 3.321, New V(s): 0.349\n", "2025-08-02 09:46:05,946 - INFO - TD Update - State: 2, Reward: -1, Target: 2.338, TD Error: -1.371, New V(s): 3.571\n", "2025-08-02 09:46:05,947 - INFO - TD Update - State: 2, Reward: -1, Target: 2.214, TD Error: -1.357, New V(s): 3.436\n", "2025-08-02 09:46:05,947 - INFO - TD Update - State: 2, Reward: -1, Target: 2.092, TD Error: -1.344, New V(s): 3.301\n", "2025-08-02 09:46:05,947 - INFO - TD Update - State: 2, Reward: -1, Target: 1.971, TD Error: -1.330, New V(s): 3.168\n", "2025-08-02 09:46:05,947 - INFO - TD Update - State: 2, Reward: -1, Target: 1.852, TD Error: -1.317, New V(s): 3.037\n", "2025-08-02 09:46:05,947 - INFO - TD Update - State: 2, Reward: 0, Target: 6.823, TD Error: 3.787, New V(s): 3.415\n", "2025-08-02 09:46:05,947 - INFO - TD Update - State: 3, Reward: -1, Target: 5.823, TD Error: -1.758, New V(s): 7.406\n", "2025-08-02 09:46:05,948 - INFO - TD Update - State: 3, Reward: -1, Target: 5.665, TD Error: -1.741, New V(s): 7.232\n", "2025-08-02 09:46:05,948 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 2.768, New V(s): 7.508\n", "2025-08-02 09:46:05,948 - INFO - Episode Complete - Total Reward: 3, Avg TD Error: 1.908\n", "2025-08-02 09:46:05,948 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,948 - INFO - TD Update - State: 0, Reward: -1, Target: -1.711, TD Error: -0.921, New V(s): -0.882\n", "2025-08-02 09:46:05,948 - INFO - TD Update - State: 0, Reward: 0, Target: 0.314, TD Error: 1.196, New V(s): -0.762\n", "2025-08-02 09:46:05,949 - INFO - TD Update - State: 1, Reward: 0, Target: 3.074, TD Error: 2.725, New V(s): 0.621\n", "2025-08-02 09:46:05,949 - INFO - TD Update - State: 2, Reward: -1, Target: 2.074, TD Error: -1.342, New V(s): 3.281\n", "2025-08-02 09:46:05,949 - INFO - TD Update - State: 2, Reward: 0, Target: 6.758, TD Error: 3.476, New V(s): 3.629\n", "2025-08-02 09:46:05,949 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 2.492, New V(s): 7.758\n", "2025-08-02 09:46:05,949 - INFO - Episode Complete - Total Reward: 8, Avg TD Error: 2.025\n", "2025-08-02 09:46:05,949 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,949 - INFO - TD Update - State: 0, Reward: 0, Target: 0.559, TD Error: 1.321, New V(s): -0.630\n", "2025-08-02 09:46:05,950 - INFO - TD Update - State: 1, Reward: -1, Target: -0.441, TD Error: -1.062, New V(s): 0.515\n", "2025-08-02 09:46:05,950 - INFO - TD Update - State: 1, Reward: 0, Target: 3.266, TD Error: 2.751, New V(s): 0.790\n", "2025-08-02 09:46:05,950 - INFO - TD Update - State: 2, Reward: 0, Target: 6.982, TD Error: 3.353, New V(s): 3.964\n", "2025-08-02 09:46:05,950 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 2.242, New V(s): 7.982\n", "2025-08-02 09:46:05,950 - INFO - Episode Complete - Total Reward: 9, Avg TD Error: 2.146\n", "2025-08-02 09:46:05,950 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,951 - INFO - TD Update - State: 0, Reward: -1, Target: -1.567, TD Error: -0.937, New V(s): -0.724\n", "2025-08-02 09:46:05,951 - INFO - TD Update - State: 0, Reward: -1, Target: -1.651, TD Error: -0.928, New V(s): -0.816\n", "2025-08-02 09:46:05,951 - INFO - TD Update - State: 0, Reward: -1, Target: -1.735, TD Error: -0.918, New V(s): -0.908\n", "2025-08-02 09:46:05,951 - INFO - TD Update - State: 0, Reward: -1, Target: -1.817, TD Error: -0.909, New V(s): -0.999\n", "2025-08-02 09:46:05,951 - INFO - TD Update - State: 0, Reward: -1, Target: -1.899, TD Error: -0.900, New V(s): -1.089\n", "2025-08-02 09:46:05,951 - INFO - TD Update - State: 0, Reward: -1, Target: -1.980, TD Error: -0.891, New V(s): -1.178\n", "2025-08-02 09:46:05,951 - INFO - TD Update - State: 0, Reward: 0, Target: 0.711, TD Error: 1.889, New V(s): -0.989\n", "2025-08-02 09:46:05,952 - INFO - TD Update - State: 1, Reward: 0, Target: 3.568, TD Error: 2.778, New V(s): 1.068\n", "2025-08-02 09:46:05,952 - INFO - TD Update - State: 2, Reward: 0, Target: 7.184, TD Error: 3.220, New V(s): 4.286\n", "2025-08-02 09:46:05,952 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 2.018, New V(s): 8.184\n", "2025-08-02 09:46:05,952 - INFO - Episode Complete - Total Reward: 4, Avg TD Error: 1.539\n", "2025-08-02 09:46:05,952 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,953 - INFO - TD Update - State: 0, Reward: -1, Target: -1.890, TD Error: -0.901, New V(s): -1.079\n", "2025-08-02 09:46:05,953 - INFO - TD Update - State: 0, Reward: 0, Target: 0.961, TD Error: 2.041, New V(s): -0.875\n", "2025-08-02 09:46:05,953 - INFO - TD Update - State: 1, Reward: -1, Target: -0.039, TD Error: -1.107, New V(s): 0.957\n", "2025-08-02 09:46:05,953 - INFO - TD Update - State: 1, Reward: 0, Target: 3.858, TD Error: 2.900, New V(s): 1.247\n", "2025-08-02 09:46:05,953 - INFO - TD Update - State: 2, Reward: 0, Target: 7.365, TD Error: 3.079, New V(s): 4.594\n", "2025-08-02 09:46:05,953 - INFO - TD Update - State: 3, Reward: -1, Target: 6.365, TD Error: -1.818, New V(s): 8.002\n", "2025-08-02 09:46:05,954 - INFO - TD Update - State: 3, Reward: -1, Target: 6.202, TD Error: -1.800, New V(s): 7.822\n", "2025-08-02 09:46:05,954 - INFO - TD Update - State: 3, Reward: -1, Target: 6.040, TD Error: -1.782, New V(s): 7.644\n", "2025-08-02 09:46:05,954 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 2.356, New V(s): 7.879\n", "2025-08-02 09:46:05,954 - INFO - Episode Complete - Total Reward: 5, Avg TD Error: 1.976\n", "2025-08-02 09:46:05,954 - INFO - \n", "๐Ÿ“Š Episode 20 Summary:\n", "2025-08-02 09:46:05,955 - INFO - Value Function: [-0.87543578 1.24722903 4.59404087 7.87925002 0. ]\n", "2025-08-02 09:46:05,955 - INFO - Recent Avg Reward: 5.70\n", "2025-08-02 09:46:05,955 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,955 - INFO - TD Update - State: 0, Reward: 0, Target: 1.123, TD Error: 1.998, New V(s): -0.676\n", "2025-08-02 09:46:05,955 - INFO - TD Update - State: 1, Reward: -1, Target: 0.123, TD Error: -1.125, New V(s): 1.135\n", "2025-08-02 09:46:05,955 - INFO - TD Update - State: 1, Reward: -1, Target: 0.021, TD Error: -1.113, New V(s): 1.023\n", "2025-08-02 09:46:05,955 - INFO - TD Update - State: 1, Reward: 0, Target: 4.135, TD Error: 3.111, New V(s): 1.335\n", "2025-08-02 09:46:05,956 - INFO - TD Update - State: 2, Reward: -1, Target: 3.135, TD Error: -1.459, New V(s): 4.448\n", "2025-08-02 09:46:05,956 - INFO - TD Update - State: 2, Reward: -1, Target: 3.003, TD Error: -1.445, New V(s): 4.304\n", "2025-08-02 09:46:05,956 - INFO - TD Update - State: 2, Reward: -1, Target: 2.873, TD Error: -1.430, New V(s): 4.161\n", "2025-08-02 09:46:05,956 - INFO - TD Update - State: 2, Reward: 0, Target: 7.091, TD Error: 2.931, New V(s): 4.454\n", "2025-08-02 09:46:05,956 - INFO - TD Update - State: 3, Reward: -1, Target: 6.091, TD Error: -1.788, New V(s): 7.700\n", "2025-08-02 09:46:05,956 - INFO - TD Update - State: 3, Reward: -1, Target: 5.930, TD Error: -1.770, New V(s): 7.523\n", "2025-08-02 09:46:05,957 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 2.477, New V(s): 7.771\n", "2025-08-02 09:46:05,957 - INFO - Episode Complete - Total Reward: 3, Avg TD Error: 1.877\n", "2025-08-02 09:46:05,957 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,957 - INFO - TD Update - State: 0, Reward: -1, Target: -1.608, TD Error: -0.932, New V(s): -0.769\n", "2025-08-02 09:46:05,957 - INFO - TD Update - State: 0, Reward: 0, Target: 1.201, TD Error: 1.970, New V(s): -0.572\n", "2025-08-02 09:46:05,957 - INFO - TD Update - State: 1, Reward: 0, Target: 4.008, TD Error: 2.674, New V(s): 1.602\n", "2025-08-02 09:46:05,957 - INFO - TD Update - State: 2, Reward: -1, Target: 3.008, TD Error: -1.445, New V(s): 4.309\n", "2025-08-02 09:46:05,958 - INFO - TD Update - State: 2, Reward: 0, Target: 6.994, TD Error: 2.685, New V(s): 4.578\n", "2025-08-02 09:46:05,958 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 2.229, New V(s): 7.994\n", "2025-08-02 09:46:05,958 - INFO - Episode Complete - Total Reward: 8, Avg TD Error: 1.989\n", "2025-08-02 09:46:05,958 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,958 - INFO - TD Update - State: 0, Reward: -1, Target: -1.515, TD Error: -0.943, New V(s): -0.666\n", "2025-08-02 09:46:05,958 - INFO - TD Update - State: 0, Reward: -1, Target: -1.600, TD Error: -0.933, New V(s): -0.760\n", "2025-08-02 09:46:05,959 - INFO - TD Update - State: 0, Reward: -1, Target: -1.684, TD Error: -0.924, New V(s): -0.852\n", "2025-08-02 09:46:05,959 - INFO - TD Update - State: 0, Reward: 0, Target: 1.442, TD Error: 2.294, New V(s): -0.623\n", "2025-08-02 09:46:05,959 - INFO - TD Update - State: 1, Reward: -1, Target: 0.442, TD Error: -1.160, New V(s): 1.486\n", "2025-08-02 09:46:05,959 - INFO - TD Update - State: 1, Reward: -1, Target: 0.337, TD Error: -1.149, New V(s): 1.371\n", "2025-08-02 09:46:05,959 - INFO - TD Update - State: 1, Reward: 0, Target: 4.120, TD Error: 2.749, New V(s): 1.646\n", "2025-08-02 09:46:05,959 - INFO - TD Update - State: 2, Reward: -1, Target: 3.120, TD Error: -1.458, New V(s): 4.432\n", "2025-08-02 09:46:05,959 - INFO - TD Update - State: 2, Reward: -1, Target: 2.989, TD Error: -1.443, New V(s): 4.288\n", "2025-08-02 09:46:05,960 - INFO - TD Update - State: 2, Reward: -1, Target: 2.859, TD Error: -1.429, New V(s): 4.145\n", "2025-08-02 09:46:05,960 - INFO - TD Update - State: 2, Reward: -1, Target: 2.730, TD Error: -1.414, New V(s): 4.003\n", "2025-08-02 09:46:05,960 - INFO - TD Update - State: 2, Reward: 0, Target: 7.195, TD Error: 3.191, New V(s): 4.322\n", "2025-08-02 09:46:05,960 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 2.006, New V(s): 8.195\n", "2025-08-02 09:46:05,960 - INFO - Episode Complete - Total Reward: 1, Avg TD Error: 1.623\n", "2025-08-02 09:46:05,960 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,961 - INFO - TD Update - State: 0, Reward: -1, Target: -1.560, TD Error: -0.938, New V(s): -0.716\n", "2025-08-02 09:46:05,961 - INFO - TD Update - State: 0, Reward: -1, Target: -1.645, TD Error: -0.928, New V(s): -0.809\n", "2025-08-02 09:46:05,961 - INFO - TD Update - State: 0, Reward: -1, Target: -1.728, TD Error: -0.919, New V(s): -0.901\n", "2025-08-02 09:46:05,961 - INFO - TD Update - State: 0, Reward: -1, Target: -1.811, TD Error: -0.910, New V(s): -0.992\n", "2025-08-02 09:46:05,961 - INFO - TD Update - State: 0, Reward: 0, Target: 1.481, TD Error: 2.473, New V(s): -0.745\n", "2025-08-02 09:46:05,961 - INFO - TD Update - State: 1, Reward: 0, Target: 3.890, TD Error: 2.244, New V(s): 1.870\n", "2025-08-02 09:46:05,961 - INFO - TD Update - State: 2, Reward: 0, Target: 7.375, TD Error: 3.053, New V(s): 4.628\n", "2025-08-02 09:46:05,962 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.805, New V(s): 8.375\n", "2025-08-02 09:46:05,962 - INFO - Episode Complete - Total Reward: 6, Avg TD Error: 1.659\n", "2025-08-02 09:46:05,962 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,962 - INFO - TD Update - State: 0, Reward: -1, Target: -1.670, TD Error: -0.926, New V(s): -0.837\n", "2025-08-02 09:46:05,962 - INFO - TD Update - State: 0, Reward: -1, Target: -1.754, TD Error: -0.916, New V(s): -0.929\n", "2025-08-02 09:46:05,962 - INFO - TD Update - State: 0, Reward: -1, Target: -1.836, TD Error: -0.907, New V(s): -1.020\n", "2025-08-02 09:46:05,963 - INFO - TD Update - State: 0, Reward: -1, Target: -1.918, TD Error: -0.898, New V(s): -1.109\n", "2025-08-02 09:46:05,963 - INFO - TD Update - State: 0, Reward: -1, Target: -1.998, TD Error: -0.889, New V(s): -1.198\n", "2025-08-02 09:46:05,963 - INFO - TD Update - State: 0, Reward: 0, Target: 1.683, TD Error: 2.882, New V(s): -0.910\n", "2025-08-02 09:46:05,963 - INFO - TD Update - State: 1, Reward: 0, Target: 4.165, TD Error: 2.295, New V(s): 2.100\n", "2025-08-02 09:46:05,963 - INFO - TD Update - State: 2, Reward: -1, Target: 3.165, TD Error: -1.463, New V(s): 4.481\n", "2025-08-02 09:46:05,964 - INFO - TD Update - State: 2, Reward: -1, Target: 3.033, TD Error: -1.448, New V(s): 4.337\n", "2025-08-02 09:46:05,964 - INFO - TD Update - State: 2, Reward: 0, Target: 7.538, TD Error: 3.201, New V(s): 4.657\n", "2025-08-02 09:46:05,964 - INFO - TD Update - State: 3, Reward: -1, Target: 6.538, TD Error: -1.838, New V(s): 8.191\n", "2025-08-02 09:46:05,964 - INFO - TD Update - State: 3, Reward: -1, Target: 6.372, TD Error: -1.819, New V(s): 8.009\n", "2025-08-02 09:46:05,964 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.991, New V(s): 8.209\n", "2025-08-02 09:46:05,964 - INFO - Episode Complete - Total Reward: 1, Avg TD Error: 1.652\n", "2025-08-02 09:46:05,965 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,965 - INFO - TD Update - State: 0, Reward: 0, Target: 1.890, TD Error: 2.800, New V(s): -0.630\n", "2025-08-02 09:46:05,965 - INFO - TD Update - State: 1, Reward: 0, Target: 4.191, TD Error: 2.091, New V(s): 2.309\n", "2025-08-02 09:46:05,965 - INFO - TD Update - State: 2, Reward: -1, Target: 3.191, TD Error: -1.466, New V(s): 4.510\n", "2025-08-02 09:46:05,965 - INFO - TD Update - State: 2, Reward: 0, Target: 7.388, TD Error: 2.878, New V(s): 4.798\n", "2025-08-02 09:46:05,966 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.791, New V(s): 8.388\n", "2025-08-02 09:46:05,966 - INFO - Episode Complete - Total Reward: 9, Avg TD Error: 2.205\n", "2025-08-02 09:46:05,966 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,966 - INFO - TD Update - State: 0, Reward: 0, Target: 2.078, TD Error: 2.708, New V(s): -0.359\n", "2025-08-02 09:46:05,966 - INFO - TD Update - State: 1, Reward: 0, Target: 4.318, TD Error: 2.009, New V(s): 2.510\n", "2025-08-02 09:46:05,966 - INFO - TD Update - State: 2, Reward: -1, Target: 3.318, TD Error: -1.480, New V(s): 4.650\n", "2025-08-02 09:46:05,966 - INFO - TD Update - State: 2, Reward: 0, Target: 7.549, TD Error: 2.899, New V(s): 4.940\n", "2025-08-02 09:46:05,967 - INFO - TD Update - State: 3, Reward: -1, Target: 6.549, TD Error: -1.839, New V(s): 8.204\n", "2025-08-02 09:46:05,967 - INFO - TD Update - State: 3, Reward: -1, Target: 6.383, TD Error: -1.820, New V(s): 8.022\n", "2025-08-02 09:46:05,967 - INFO - TD Update - State: 3, Reward: -1, Target: 6.220, TD Error: -1.802, New V(s): 7.842\n", "2025-08-02 09:46:05,967 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 2.158, New V(s): 8.057\n", "2025-08-02 09:46:05,967 - INFO - Episode Complete - Total Reward: 6, Avg TD Error: 2.089\n", "2025-08-02 09:46:05,967 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,968 - INFO - TD Update - State: 0, Reward: 0, Target: 2.259, TD Error: 2.618, New V(s): -0.098\n", "2025-08-02 09:46:05,968 - INFO - TD Update - State: 1, Reward: -1, Target: 1.259, TD Error: -1.251, New V(s): 2.385\n", "2025-08-02 09:46:05,968 - INFO - TD Update - State: 1, Reward: -1, Target: 1.146, TD Error: -1.238, New V(s): 2.261\n", "2025-08-02 09:46:05,968 - INFO - TD Update - State: 1, Reward: 0, Target: 4.446, TD Error: 2.185, New V(s): 2.479\n", "2025-08-02 09:46:05,968 - INFO - TD Update - State: 2, Reward: 0, Target: 7.252, TD Error: 2.312, New V(s): 5.171\n", "2025-08-02 09:46:05,968 - INFO - TD Update - State: 3, Reward: -1, Target: 6.252, TD Error: -1.806, New V(s): 7.877\n", "2025-08-02 09:46:05,968 - INFO - TD Update - State: 3, Reward: -1, Target: 6.089, TD Error: -1.788, New V(s): 7.698\n", "2025-08-02 09:46:05,969 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 2.302, New V(s): 7.928\n", "2025-08-02 09:46:05,969 - INFO - Episode Complete - Total Reward: 6, Avg TD Error: 1.937\n", "2025-08-02 09:46:05,969 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,969 - INFO - TD Update - State: 0, Reward: 0, Target: 2.231, TD Error: 2.329, New V(s): 0.135\n", "2025-08-02 09:46:05,969 - INFO - TD Update - State: 1, Reward: 0, Target: 4.654, TD Error: 2.174, New V(s): 2.697\n", "2025-08-02 09:46:05,969 - INFO - TD Update - State: 2, Reward: -1, Target: 3.654, TD Error: -1.517, New V(s): 5.019\n", "2025-08-02 09:46:05,969 - INFO - TD Update - State: 2, Reward: 0, Target: 7.135, TD Error: 2.116, New V(s): 5.231\n", "2025-08-02 09:46:05,970 - INFO - TD Update - State: 3, Reward: -1, Target: 6.135, TD Error: -1.793, New V(s): 7.749\n", "2025-08-02 09:46:05,970 - INFO - TD Update - State: 3, Reward: -1, Target: 5.974, TD Error: -1.775, New V(s): 7.571\n", "2025-08-02 09:46:05,970 - INFO - TD Update - State: 3, Reward: -1, Target: 5.814, TD Error: -1.757, New V(s): 7.396\n", "2025-08-02 09:46:05,970 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 2.604, New V(s): 7.656\n", "2025-08-02 09:46:05,970 - INFO - Episode Complete - Total Reward: 6, Avg TD Error: 2.008\n", "2025-08-02 09:46:05,970 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,970 - INFO - TD Update - State: 0, Reward: 0, Target: 2.427, TD Error: 2.292, New V(s): 0.365\n", "2025-08-02 09:46:05,971 - INFO - TD Update - State: 1, Reward: 0, Target: 4.708, TD Error: 2.011, New V(s): 2.898\n", "2025-08-02 09:46:05,971 - INFO - TD Update - State: 2, Reward: -1, Target: 3.708, TD Error: -1.523, New V(s): 5.079\n", "2025-08-02 09:46:05,971 - INFO - TD Update - State: 2, Reward: -1, Target: 3.571, TD Error: -1.508, New V(s): 4.928\n", "2025-08-02 09:46:05,971 - INFO - TD Update - State: 2, Reward: -1, Target: 3.435, TD Error: -1.493, New V(s): 4.778\n", "2025-08-02 09:46:05,971 - INFO - TD Update - State: 2, Reward: 0, Target: 6.891, TD Error: 2.112, New V(s): 4.990\n", "2025-08-02 09:46:05,971 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 2.344, New V(s): 7.891\n", "2025-08-02 09:46:05,971 - INFO - Episode Complete - Total Reward: 7, Avg TD Error: 1.897\n", "2025-08-02 09:46:05,972 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,972 - INFO - TD Update - State: 0, Reward: 0, Target: 2.608, TD Error: 2.244, New V(s): 0.589\n", "2025-08-02 09:46:05,972 - INFO - TD Update - State: 1, Reward: 0, Target: 4.491, TD Error: 1.593, New V(s): 3.057\n", "2025-08-02 09:46:05,972 - INFO - TD Update - State: 2, Reward: 0, Target: 7.102, TD Error: 2.112, New V(s): 5.201\n", "2025-08-02 09:46:05,973 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 2.109, New V(s): 8.102\n", "2025-08-02 09:46:05,973 - INFO - Episode Complete - Total Reward: 10, Avg TD Error: 2.014\n", "2025-08-02 09:46:05,973 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,973 - INFO - TD Update - State: 0, Reward: 0, Target: 2.751, TD Error: 2.163, New V(s): 0.805\n", "2025-08-02 09:46:05,973 - INFO - TD Update - State: 1, Reward: -1, Target: 1.751, TD Error: -1.306, New V(s): 2.927\n", "2025-08-02 09:46:05,973 - INFO - TD Update - State: 1, Reward: 0, Target: 4.681, TD Error: 1.754, New V(s): 3.102\n", "2025-08-02 09:46:05,974 - INFO - TD Update - State: 2, Reward: 0, Target: 7.291, TD Error: 2.090, New V(s): 5.410\n", "2025-08-02 09:46:05,974 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.898, New V(s): 8.291\n", "2025-08-02 09:46:05,974 - INFO - Episode Complete - Total Reward: 9, Avg TD Error: 1.842\n", "2025-08-02 09:46:05,974 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,974 - INFO - TD Update - State: 0, Reward: 0, Target: 2.792, TD Error: 1.987, New V(s): 1.004\n", "2025-08-02 09:46:05,974 - INFO - TD Update - State: 1, Reward: 0, Target: 4.869, TD Error: 1.767, New V(s): 3.279\n", "2025-08-02 09:46:05,974 - INFO - TD Update - State: 2, Reward: 0, Target: 7.462, TD Error: 2.052, New V(s): 5.615\n", "2025-08-02 09:46:05,975 - INFO - TD Update - State: 3, Reward: -1, Target: 6.462, TD Error: -1.829, New V(s): 8.108\n", "2025-08-02 09:46:05,975 - INFO - TD Update - State: 3, Reward: -1, Target: 6.298, TD Error: -1.811, New V(s): 7.927\n", "2025-08-02 09:46:05,975 - INFO - TD Update - State: 3, Reward: -1, Target: 6.135, TD Error: -1.793, New V(s): 7.748\n", "2025-08-02 09:46:05,975 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 2.252, New V(s): 7.973\n", "2025-08-02 09:46:05,975 - INFO - Episode Complete - Total Reward: 7, Avg TD Error: 1.927\n", "2025-08-02 09:46:05,975 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,976 - INFO - TD Update - State: 0, Reward: 0, Target: 2.951, TD Error: 1.947, New V(s): 1.199\n", "2025-08-02 09:46:05,976 - INFO - TD Update - State: 1, Reward: 0, Target: 5.054, TD Error: 1.775, New V(s): 3.456\n", "2025-08-02 09:46:05,976 - INFO - TD Update - State: 2, Reward: 0, Target: 7.176, TD Error: 1.561, New V(s): 5.771\n", "2025-08-02 09:46:05,976 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 2.027, New V(s): 8.176\n", "2025-08-02 09:46:05,976 - INFO - Episode Complete - Total Reward: 10, Avg TD Error: 1.827\n", "2025-08-02 09:46:05,976 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,976 - INFO - TD Update - State: 0, Reward: -1, Target: 0.079, TD Error: -1.120, New V(s): 1.087\n", "2025-08-02 09:46:05,977 - INFO - TD Update - State: 0, Reward: 0, Target: 3.111, TD Error: 2.024, New V(s): 1.289\n", "2025-08-02 09:46:05,977 - INFO - TD Update - State: 1, Reward: -1, Target: 2.111, TD Error: -1.346, New V(s): 3.322\n", "2025-08-02 09:46:05,977 - INFO - TD Update - State: 1, Reward: 0, Target: 5.194, TD Error: 1.872, New V(s): 3.509\n", "2025-08-02 09:46:05,977 - INFO - TD Update - State: 2, Reward: -1, Target: 4.194, TD Error: -1.577, New V(s): 5.614\n", "2025-08-02 09:46:05,977 - INFO - TD Update - State: 2, Reward: 0, Target: 7.358, TD Error: 1.745, New V(s): 5.788\n", "2025-08-02 09:46:05,977 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.824, New V(s): 8.358\n", "2025-08-02 09:46:05,977 - INFO - Episode Complete - Total Reward: 7, Avg TD Error: 1.644\n", "2025-08-02 09:46:05,978 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,978 - INFO - TD Update - State: 0, Reward: -1, Target: 0.160, TD Error: -1.129, New V(s): 1.176\n", "2025-08-02 09:46:05,978 - INFO - TD Update - State: 0, Reward: 0, Target: 3.158, TD Error: 1.982, New V(s): 1.374\n", "2025-08-02 09:46:05,978 - INFO - TD Update - State: 1, Reward: 0, Target: 5.209, TD Error: 1.700, New V(s): 3.679\n", "2025-08-02 09:46:05,978 - INFO - TD Update - State: 2, Reward: 0, Target: 7.523, TD Error: 1.735, New V(s): 5.961\n", "2025-08-02 09:46:05,978 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.642, New V(s): 8.523\n", "2025-08-02 09:46:05,979 - INFO - Episode Complete - Total Reward: 9, Avg TD Error: 1.637\n", "2025-08-02 09:46:05,979 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,979 - INFO - TD Update - State: 0, Reward: 0, Target: 3.311, TD Error: 1.937, New V(s): 1.568\n", "2025-08-02 09:46:05,979 - INFO - TD Update - State: 1, Reward: 0, Target: 5.365, TD Error: 1.686, New V(s): 3.848\n", "2025-08-02 09:46:05,979 - INFO - TD Update - State: 2, Reward: -1, Target: 4.365, TD Error: -1.596, New V(s): 5.802\n", "2025-08-02 09:46:05,979 - INFO - TD Update - State: 2, Reward: 0, Target: 7.670, TD Error: 1.868, New V(s): 5.989\n", "2025-08-02 09:46:05,979 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.477, New V(s): 8.670\n", "2025-08-02 09:46:05,980 - INFO - Episode Complete - Total Reward: 9, Avg TD Error: 1.713\n", "2025-08-02 09:46:05,980 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,980 - INFO - TD Update - State: 0, Reward: -1, Target: 0.411, TD Error: -1.157, New V(s): 1.452\n", "2025-08-02 09:46:05,980 - INFO - TD Update - State: 0, Reward: -1, Target: 0.307, TD Error: -1.145, New V(s): 1.338\n", "2025-08-02 09:46:05,980 - INFO - TD Update - State: 0, Reward: 0, Target: 3.463, TD Error: 2.125, New V(s): 1.550\n", "2025-08-02 09:46:05,980 - INFO - TD Update - State: 1, Reward: 0, Target: 5.390, TD Error: 1.542, New V(s): 4.002\n", "2025-08-02 09:46:05,981 - INFO - TD Update - State: 2, Reward: 0, Target: 7.803, TD Error: 1.815, New V(s): 6.170\n", "2025-08-02 09:46:05,981 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.330, New V(s): 8.803\n", "2025-08-02 09:46:05,981 - INFO - Episode Complete - Total Reward: 8, Avg TD Error: 1.519\n", "2025-08-02 09:46:05,981 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,982 - INFO - TD Update - State: 0, Reward: -1, Target: 0.395, TD Error: -1.155, New V(s): 1.435\n", "2025-08-02 09:46:05,982 - INFO - TD Update - State: 0, Reward: -1, Target: 0.291, TD Error: -1.143, New V(s): 1.320\n", "2025-08-02 09:46:05,982 - INFO - TD Update - State: 0, Reward: -1, Target: 0.188, TD Error: -1.132, New V(s): 1.207\n", "2025-08-02 09:46:05,982 - INFO - TD Update - State: 0, Reward: 0, Target: 3.602, TD Error: 2.394, New V(s): 1.447\n", "2025-08-02 09:46:05,982 - INFO - TD Update - State: 1, Reward: 0, Target: 5.553, TD Error: 1.551, New V(s): 4.157\n", "2025-08-02 09:46:05,982 - INFO - TD Update - State: 2, Reward: -1, Target: 4.553, TD Error: -1.617, New V(s): 6.008\n", "2025-08-02 09:46:05,982 - INFO - TD Update - State: 2, Reward: -1, Target: 4.408, TD Error: -1.601, New V(s): 5.848\n", "2025-08-02 09:46:05,983 - INFO - TD Update - State: 2, Reward: -1, Target: 4.264, TD Error: -1.585, New V(s): 5.690\n", "2025-08-02 09:46:05,983 - INFO - TD Update - State: 2, Reward: 0, Target: 7.923, TD Error: 2.233, New V(s): 5.913\n", "2025-08-02 09:46:05,983 - INFO - TD Update - State: 3, Reward: -1, Target: 6.923, TD Error: -1.880, New V(s): 8.615\n", "2025-08-02 09:46:05,983 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.385, New V(s): 8.754\n", "2025-08-02 09:46:05,983 - INFO - Episode Complete - Total Reward: 3, Avg TD Error: 1.607\n", "2025-08-02 09:46:05,983 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,984 - INFO - TD Update - State: 0, Reward: -1, Target: 0.302, TD Error: -1.145, New V(s): 1.332\n", "2025-08-02 09:46:05,984 - INFO - TD Update - State: 0, Reward: 0, Target: 3.741, TD Error: 2.409, New V(s): 1.573\n", "2025-08-02 09:46:05,984 - INFO - TD Update - State: 1, Reward: 0, Target: 5.322, TD Error: 1.165, New V(s): 4.273\n", "2025-08-02 09:46:05,984 - INFO - TD Update - State: 2, Reward: 0, Target: 7.878, TD Error: 1.965, New V(s): 6.110\n", "2025-08-02 09:46:05,984 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.246, New V(s): 8.878\n", "2025-08-02 09:46:05,984 - INFO - Episode Complete - Total Reward: 9, Avg TD Error: 1.586\n", "2025-08-02 09:46:05,984 - INFO - \n", "๐Ÿ“Š Episode 40 Summary:\n", "2025-08-02 09:46:05,985 - INFO - Value Function: [1.57306672 4.27341598 6.10968602 8.87831698 0. ]\n", "2025-08-02 09:46:05,985 - INFO - Recent Avg Reward: 8.10\n", "2025-08-02 09:46:05,985 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,985 - INFO - TD Update - State: 0, Reward: 0, Target: 3.846, TD Error: 2.273, New V(s): 1.800\n", "2025-08-02 09:46:05,986 - INFO - TD Update - State: 1, Reward: 0, Target: 5.499, TD Error: 1.225, New V(s): 4.396\n", "2025-08-02 09:46:05,986 - INFO - TD Update - State: 2, Reward: 0, Target: 7.990, TD Error: 1.881, New V(s): 6.298\n", "2025-08-02 09:46:05,986 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.122, New V(s): 8.990\n", "2025-08-02 09:46:05,986 - INFO - Episode Complete - Total Reward: 10, Avg TD Error: 1.625\n", "2025-08-02 09:46:05,986 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,986 - INFO - TD Update - State: 0, Reward: 0, Target: 3.956, TD Error: 2.156, New V(s): 2.016\n", "2025-08-02 09:46:05,987 - INFO - TD Update - State: 1, Reward: 0, Target: 5.668, TD Error: 1.272, New V(s): 4.523\n", "2025-08-02 09:46:05,987 - INFO - TD Update - State: 2, Reward: 0, Target: 8.091, TD Error: 1.794, New V(s): 6.477\n", "2025-08-02 09:46:05,987 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.010, New V(s): 9.091\n", "2025-08-02 09:46:05,987 - INFO - Episode Complete - Total Reward: 10, Avg TD Error: 1.558\n", "2025-08-02 09:46:05,987 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,987 - INFO - TD Update - State: 0, Reward: 0, Target: 4.071, TD Error: 2.055, New V(s): 2.221\n", "2025-08-02 09:46:05,988 - INFO - TD Update - State: 1, Reward: 0, Target: 5.829, TD Error: 1.306, New V(s): 4.654\n", "2025-08-02 09:46:05,988 - INFO - TD Update - State: 2, Reward: 0, Target: 8.182, TD Error: 1.705, New V(s): 6.648\n", "2025-08-02 09:46:05,988 - INFO - TD Update - State: 3, Reward: -1, Target: 7.182, TD Error: -1.909, New V(s): 8.901\n", "2025-08-02 09:46:05,988 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.099, New V(s): 9.010\n", "2025-08-02 09:46:05,988 - INFO - Episode Complete - Total Reward: 9, Avg TD Error: 1.615\n", "2025-08-02 09:46:05,988 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,988 - INFO - TD Update - State: 0, Reward: 0, Target: 4.188, TD Error: 1.967, New V(s): 2.418\n", "2025-08-02 09:46:05,989 - INFO - TD Update - State: 1, Reward: -1, Target: 3.188, TD Error: -1.465, New V(s): 4.507\n", "2025-08-02 09:46:05,989 - INFO - TD Update - State: 1, Reward: -1, Target: 3.057, TD Error: -1.451, New V(s): 4.362\n", "2025-08-02 09:46:05,989 - INFO - TD Update - State: 1, Reward: -1, Target: 2.926, TD Error: -1.436, New V(s): 4.219\n", "2025-08-02 09:46:05,989 - INFO - TD Update - State: 1, Reward: 0, Target: 5.983, TD Error: 1.764, New V(s): 4.395\n", "2025-08-02 09:46:05,989 - INFO - TD Update - State: 2, Reward: -1, Target: 4.983, TD Error: -1.665, New V(s): 6.481\n", "2025-08-02 09:46:05,989 - INFO - TD Update - State: 2, Reward: -1, Target: 4.833, TD Error: -1.648, New V(s): 6.316\n", "2025-08-02 09:46:05,990 - INFO - TD Update - State: 2, Reward: -1, Target: 4.685, TD Error: -1.632, New V(s): 6.153\n", "2025-08-02 09:46:05,990 - INFO - TD Update - State: 2, Reward: 0, Target: 8.109, TD Error: 1.956, New V(s): 6.349\n", "2025-08-02 09:46:05,990 - INFO - TD Update - State: 3, Reward: -1, Target: 7.109, TD Error: -1.901, New V(s): 8.820\n", "2025-08-02 09:46:05,990 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.180, New V(s): 8.938\n", "2025-08-02 09:46:05,990 - INFO - Episode Complete - Total Reward: 3, Avg TD Error: 1.642\n", "2025-08-02 09:46:05,990 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,991 - INFO - TD Update - State: 0, Reward: -1, Target: 1.176, TD Error: -1.242, New V(s): 2.294\n", "2025-08-02 09:46:05,991 - INFO - TD Update - State: 0, Reward: 0, Target: 3.955, TD Error: 1.662, New V(s): 2.460\n", "2025-08-02 09:46:05,991 - INFO - TD Update - State: 1, Reward: 0, Target: 5.714, TD Error: 1.319, New V(s): 4.527\n", "2025-08-02 09:46:05,991 - INFO - TD Update - State: 2, Reward: -1, Target: 4.714, TD Error: -1.635, New V(s): 6.185\n", "2025-08-02 09:46:05,991 - INFO - TD Update - State: 2, Reward: 0, Target: 8.044, TD Error: 1.859, New V(s): 6.371\n", "2025-08-02 09:46:05,991 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.062, New V(s): 9.044\n", "2025-08-02 09:46:05,992 - INFO - Episode Complete - Total Reward: 8, Avg TD Error: 1.463\n", "2025-08-02 09:46:05,992 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,992 - INFO - TD Update - State: 0, Reward: 0, Target: 4.074, TD Error: 1.614, New V(s): 2.622\n", "2025-08-02 09:46:05,992 - INFO - TD Update - State: 1, Reward: 0, Target: 5.734, TD Error: 1.207, New V(s): 4.648\n", "2025-08-02 09:46:05,992 - INFO - TD Update - State: 2, Reward: 0, Target: 8.140, TD Error: 1.769, New V(s): 6.548\n", "2025-08-02 09:46:05,992 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 0.956, New V(s): 9.140\n", "2025-08-02 09:46:05,992 - INFO - Episode Complete - Total Reward: 10, Avg TD Error: 1.386\n", "2025-08-02 09:46:05,993 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,993 - INFO - TD Update - State: 0, Reward: -1, Target: 1.359, TD Error: -1.262, New V(s): 2.495\n", "2025-08-02 09:46:05,993 - INFO - TD Update - State: 0, Reward: -1, Target: 1.246, TD Error: -1.250, New V(s): 2.370\n", "2025-08-02 09:46:05,993 - INFO - TD Update - State: 0, Reward: -1, Target: 1.133, TD Error: -1.237, New V(s): 2.247\n", "2025-08-02 09:46:05,993 - INFO - TD Update - State: 0, Reward: -1, Target: 1.022, TD Error: -1.225, New V(s): 2.124\n", "2025-08-02 09:46:05,993 - INFO - TD Update - State: 0, Reward: -1, Target: 0.912, TD Error: -1.212, New V(s): 2.003\n", "2025-08-02 09:46:05,993 - INFO - TD Update - State: 0, Reward: 0, Target: 4.183, TD Error: 2.180, New V(s): 2.221\n", "2025-08-02 09:46:05,994 - INFO - TD Update - State: 1, Reward: -1, Target: 3.183, TD Error: -1.465, New V(s): 4.501\n", "2025-08-02 09:46:05,994 - INFO - TD Update - State: 1, Reward: 0, Target: 5.893, TD Error: 1.392, New V(s): 4.640\n", "2025-08-02 09:46:05,994 - INFO - TD Update - State: 2, Reward: 0, Target: 8.226, TD Error: 1.678, New V(s): 6.716\n", "2025-08-02 09:46:05,994 - INFO - TD Update - State: 3, Reward: -1, Target: 7.226, TD Error: -1.914, New V(s): 8.949\n", "2025-08-02 09:46:05,994 - INFO - TD Update - State: 3, Reward: -1, Target: 7.054, TD Error: -1.895, New V(s): 8.759\n", "2025-08-02 09:46:05,994 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.241, New V(s): 8.883\n", "2025-08-02 09:46:05,994 - INFO - Episode Complete - Total Reward: 2, Avg TD Error: 1.496\n", "2025-08-02 09:46:05,995 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,995 - INFO - TD Update - State: 0, Reward: -1, Target: 0.999, TD Error: -1.222, New V(s): 2.099\n", "2025-08-02 09:46:05,995 - INFO - TD Update - State: 0, Reward: 0, Target: 4.176, TD Error: 2.078, New V(s): 2.306\n", "2025-08-02 09:46:05,995 - INFO - TD Update - State: 1, Reward: -1, Target: 3.176, TD Error: -1.464, New V(s): 4.494\n", "2025-08-02 09:46:05,995 - INFO - TD Update - State: 1, Reward: -1, Target: 3.045, TD Error: -1.449, New V(s): 4.349\n", "2025-08-02 09:46:05,995 - INFO - TD Update - State: 1, Reward: -1, Target: 2.914, TD Error: -1.435, New V(s): 4.206\n", "2025-08-02 09:46:05,995 - INFO - TD Update - State: 1, Reward: -1, Target: 2.785, TD Error: -1.421, New V(s): 4.063\n", "2025-08-02 09:46:05,995 - INFO - TD Update - State: 1, Reward: 0, Target: 6.044, TD Error: 1.981, New V(s): 4.262\n", "2025-08-02 09:46:05,996 - INFO - TD Update - State: 2, Reward: -1, Target: 5.044, TD Error: -1.672, New V(s): 6.549\n", "2025-08-02 09:46:05,996 - INFO - TD Update - State: 2, Reward: 0, Target: 7.995, TD Error: 1.446, New V(s): 6.693\n", "2025-08-02 09:46:05,996 - INFO - TD Update - State: 3, Reward: -1, Target: 6.995, TD Error: -1.888, New V(s): 8.694\n", "2025-08-02 09:46:05,996 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.306, New V(s): 8.825\n", "2025-08-02 09:46:05,996 - INFO - Episode Complete - Total Reward: 3, Avg TD Error: 1.578\n", "2025-08-02 09:46:05,996 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,997 - INFO - TD Update - State: 0, Reward: 0, Target: 3.835, TD Error: 1.529, New V(s): 2.459\n", "2025-08-02 09:46:05,997 - INFO - TD Update - State: 1, Reward: 0, Target: 6.024, TD Error: 1.762, New V(s): 4.438\n", "2025-08-02 09:46:05,997 - INFO - TD Update - State: 2, Reward: 0, Target: 7.942, TD Error: 1.249, New V(s): 6.818\n", "2025-08-02 09:46:05,997 - INFO - TD Update - State: 3, Reward: -1, Target: 6.942, TD Error: -1.882, New V(s): 8.637\n", "2025-08-02 09:46:05,997 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.363, New V(s): 8.773\n", "2025-08-02 09:46:05,997 - INFO - Episode Complete - Total Reward: 9, Avg TD Error: 1.557\n", "2025-08-02 09:46:05,998 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,998 - INFO - TD Update - State: 0, Reward: 0, Target: 3.994, TD Error: 1.535, New V(s): 2.613\n", "2025-08-02 09:46:05,998 - INFO - TD Update - State: 1, Reward: 0, Target: 6.136, TD Error: 1.699, New V(s): 4.608\n", "2025-08-02 09:46:05,998 - INFO - TD Update - State: 2, Reward: -1, Target: 5.136, TD Error: -1.682, New V(s): 6.650\n", "2025-08-02 09:46:05,998 - INFO - TD Update - State: 2, Reward: -1, Target: 4.985, TD Error: -1.665, New V(s): 6.484\n", "2025-08-02 09:46:05,998 - INFO - TD Update - State: 2, Reward: 0, Target: 7.896, TD Error: 1.412, New V(s): 6.625\n", "2025-08-02 09:46:05,998 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.227, New V(s): 8.896\n", "2025-08-02 09:46:05,999 - INFO - Episode Complete - Total Reward: 8, Avg TD Error: 1.537\n", "2025-08-02 09:46:05,999 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:05,999 - INFO - TD Update - State: 0, Reward: 0, Target: 4.147, TD Error: 1.534, New V(s): 2.766\n", "2025-08-02 09:46:05,999 - INFO - TD Update - State: 1, Reward: 0, Target: 5.962, TD Error: 1.355, New V(s): 4.743\n", "2025-08-02 09:46:05,999 - INFO - TD Update - State: 2, Reward: 0, Target: 8.006, TD Error: 1.381, New V(s): 6.763\n", "2025-08-02 09:46:05,999 - INFO - TD Update - State: 3, Reward: -1, Target: 7.006, TD Error: -1.890, New V(s): 8.707\n", "2025-08-02 09:46:06,000 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.293, New V(s): 8.836\n", "2025-08-02 09:46:06,000 - INFO - Episode Complete - Total Reward: 9, Avg TD Error: 1.491\n", "2025-08-02 09:46:06,000 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,000 - INFO - TD Update - State: 0, Reward: 0, Target: 4.269, TD Error: 1.503, New V(s): 2.917\n", "2025-08-02 09:46:06,000 - INFO - TD Update - State: 1, Reward: 0, Target: 6.087, TD Error: 1.344, New V(s): 4.877\n", "2025-08-02 09:46:06,000 - INFO - TD Update - State: 2, Reward: -1, Target: 5.087, TD Error: -1.676, New V(s): 6.595\n", "2025-08-02 09:46:06,000 - INFO - TD Update - State: 2, Reward: -1, Target: 4.936, TD Error: -1.660, New V(s): 6.429\n", "2025-08-02 09:46:06,001 - INFO - TD Update - State: 2, Reward: 0, Target: 7.952, TD Error: 1.523, New V(s): 6.582\n", "2025-08-02 09:46:06,001 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.164, New V(s): 8.952\n", "2025-08-02 09:46:06,001 - INFO - Episode Complete - Total Reward: 8, Avg TD Error: 1.478\n", "2025-08-02 09:46:06,001 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,001 - INFO - TD Update - State: 0, Reward: 0, Target: 4.390, TD Error: 1.473, New V(s): 3.064\n", "2025-08-02 09:46:06,001 - INFO - TD Update - State: 1, Reward: -1, Target: 3.390, TD Error: -1.488, New V(s): 4.729\n", "2025-08-02 09:46:06,001 - INFO - TD Update - State: 1, Reward: -1, Target: 3.256, TD Error: -1.473, New V(s): 4.581\n", "2025-08-02 09:46:06,002 - INFO - TD Update - State: 1, Reward: 0, Target: 5.924, TD Error: 1.342, New V(s): 4.716\n", "2025-08-02 09:46:06,002 - INFO - TD Update - State: 2, Reward: 0, Target: 8.057, TD Error: 1.476, New V(s): 6.729\n", "2025-08-02 09:46:06,002 - INFO - TD Update - State: 3, Reward: -1, Target: 7.057, TD Error: -1.895, New V(s): 8.763\n", "2025-08-02 09:46:06,002 - INFO - TD Update - State: 3, Reward: -1, Target: 6.887, TD Error: -1.876, New V(s): 8.575\n", "2025-08-02 09:46:06,002 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.425, New V(s): 8.718\n", "2025-08-02 09:46:06,002 - INFO - Episode Complete - Total Reward: 6, Avg TD Error: 1.556\n", "2025-08-02 09:46:06,002 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,003 - INFO - TD Update - State: 0, Reward: -1, Target: 1.757, TD Error: -1.306, New V(s): 2.933\n", "2025-08-02 09:46:06,003 - INFO - TD Update - State: 0, Reward: 0, Target: 4.244, TD Error: 1.311, New V(s): 3.064\n", "2025-08-02 09:46:06,003 - INFO - TD Update - State: 1, Reward: 0, Target: 6.056, TD Error: 1.341, New V(s): 4.850\n", "2025-08-02 09:46:06,004 - INFO - TD Update - State: 2, Reward: 0, Target: 7.846, TD Error: 1.117, New V(s): 6.841\n", "2025-08-02 09:46:06,005 - INFO - TD Update - State: 3, Reward: -1, Target: 6.846, TD Error: -1.872, New V(s): 8.531\n", "2025-08-02 09:46:06,005 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.469, New V(s): 8.678\n", "2025-08-02 09:46:06,005 - INFO - Episode Complete - Total Reward: 8, Avg TD Error: 1.403\n", "2025-08-02 09:46:06,006 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,007 - INFO - TD Update - State: 0, Reward: 0, Target: 4.365, TD Error: 1.300, New V(s): 3.194\n", "2025-08-02 09:46:06,007 - INFO - TD Update - State: 1, Reward: 0, Target: 6.157, TD Error: 1.307, New V(s): 4.980\n", "2025-08-02 09:46:06,007 - INFO - TD Update - State: 2, Reward: -1, Target: 5.157, TD Error: -1.684, New V(s): 6.673\n", "2025-08-02 09:46:06,007 - INFO - TD Update - State: 2, Reward: -1, Target: 5.005, TD Error: -1.667, New V(s): 6.506\n", "2025-08-02 09:46:06,008 - INFO - TD Update - State: 2, Reward: -1, Target: 4.855, TD Error: -1.651, New V(s): 6.341\n", "2025-08-02 09:46:06,008 - INFO - TD Update - State: 2, Reward: 0, Target: 7.810, TD Error: 1.469, New V(s): 6.488\n", "2025-08-02 09:46:06,008 - INFO - TD Update - State: 3, Reward: -1, Target: 6.810, TD Error: -1.868, New V(s): 8.491\n", "2025-08-02 09:46:06,008 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.509, New V(s): 8.642\n", "2025-08-02 09:46:06,009 - INFO - Episode Complete - Total Reward: 6, Avg TD Error: 1.557\n", "2025-08-02 09:46:06,009 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,009 - INFO - TD Update - State: 0, Reward: 0, Target: 4.482, TD Error: 1.288, New V(s): 3.323\n", "2025-08-02 09:46:06,010 - INFO - TD Update - State: 1, Reward: -1, Target: 3.482, TD Error: -1.498, New V(s): 4.831\n", "2025-08-02 09:46:06,010 - INFO - TD Update - State: 1, Reward: 0, Target: 5.839, TD Error: 1.008, New V(s): 4.931\n", "2025-08-02 09:46:06,010 - INFO - TD Update - State: 2, Reward: 0, Target: 7.778, TD Error: 1.290, New V(s): 6.617\n", "2025-08-02 09:46:06,011 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.358, New V(s): 8.778\n", "2025-08-02 09:46:06,011 - INFO - Episode Complete - Total Reward: 9, Avg TD Error: 1.289\n", "2025-08-02 09:46:06,011 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,011 - INFO - TD Update - State: 0, Reward: 0, Target: 4.438, TD Error: 1.115, New V(s): 3.435\n", "2025-08-02 09:46:06,011 - INFO - TD Update - State: 1, Reward: 0, Target: 5.955, TD Error: 1.024, New V(s): 5.034\n", "2025-08-02 09:46:06,011 - INFO - TD Update - State: 2, Reward: 0, Target: 7.900, TD Error: 1.283, New V(s): 6.745\n", "2025-08-02 09:46:06,011 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.222, New V(s): 8.900\n", "2025-08-02 09:46:06,012 - INFO - Episode Complete - Total Reward: 10, Avg TD Error: 1.161\n", "2025-08-02 09:46:06,012 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,012 - INFO - TD Update - State: 0, Reward: 0, Target: 4.530, TD Error: 1.096, New V(s): 3.544\n", "2025-08-02 09:46:06,012 - INFO - TD Update - State: 1, Reward: -1, Target: 3.530, TD Error: -1.503, New V(s): 4.883\n", "2025-08-02 09:46:06,012 - INFO - TD Update - State: 1, Reward: 0, Target: 6.070, TD Error: 1.187, New V(s): 5.002\n", "2025-08-02 09:46:06,013 - INFO - TD Update - State: 2, Reward: -1, Target: 5.070, TD Error: -1.674, New V(s): 6.577\n", "2025-08-02 09:46:06,013 - INFO - TD Update - State: 2, Reward: 0, Target: 8.010, TD Error: 1.432, New V(s): 6.721\n", "2025-08-02 09:46:06,013 - INFO - TD Update - State: 3, Reward: -1, Target: 7.010, TD Error: -1.890, New V(s): 8.711\n", "2025-08-02 09:46:06,013 - INFO - TD Update - State: 3, Reward: -1, Target: 6.840, TD Error: -1.871, New V(s): 8.524\n", "2025-08-02 09:46:06,013 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.476, New V(s): 8.671\n", "2025-08-02 09:46:06,013 - INFO - Episode Complete - Total Reward: 6, Avg TD Error: 1.516\n", "2025-08-02 09:46:06,013 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,014 - INFO - TD Update - State: 0, Reward: -1, Target: 2.190, TD Error: -1.354, New V(s): 3.409\n", "2025-08-02 09:46:06,014 - INFO - TD Update - State: 0, Reward: -1, Target: 2.068, TD Error: -1.341, New V(s): 3.275\n", "2025-08-02 09:46:06,014 - INFO - TD Update - State: 0, Reward: -1, Target: 1.947, TD Error: -1.327, New V(s): 3.142\n", "2025-08-02 09:46:06,014 - INFO - TD Update - State: 0, Reward: 0, Target: 4.502, TD Error: 1.360, New V(s): 3.278\n", "2025-08-02 09:46:06,015 - INFO - TD Update - State: 1, Reward: 0, Target: 6.049, TD Error: 1.047, New V(s): 5.107\n", "2025-08-02 09:46:06,015 - INFO - TD Update - State: 2, Reward: -1, Target: 5.049, TD Error: -1.672, New V(s): 6.554\n", "2025-08-02 09:46:06,015 - INFO - TD Update - State: 2, Reward: -1, Target: 4.898, TD Error: -1.655, New V(s): 6.388\n", "2025-08-02 09:46:06,015 - INFO - TD Update - State: 2, Reward: -1, Target: 4.749, TD Error: -1.639, New V(s): 6.224\n", "2025-08-02 09:46:06,015 - INFO - TD Update - State: 2, Reward: -1, Target: 4.602, TD Error: -1.622, New V(s): 6.062\n", "2025-08-02 09:46:06,015 - INFO - TD Update - State: 2, Reward: 0, Target: 7.804, TD Error: 1.742, New V(s): 6.236\n", "2025-08-02 09:46:06,015 - INFO - TD Update - State: 3, Reward: -1, Target: 6.804, TD Error: -1.867, New V(s): 8.485\n", "2025-08-02 09:46:06,016 - INFO - TD Update - State: 3, Reward: -1, Target: 6.636, TD Error: -1.848, New V(s): 8.300\n", "2025-08-02 09:46:06,016 - INFO - TD Update - State: 3, Reward: -1, Target: 6.470, TD Error: -1.830, New V(s): 8.117\n", "2025-08-02 09:46:06,016 - INFO - TD Update - State: 3, Reward: -1, Target: 6.305, TD Error: -1.812, New V(s): 7.936\n", "2025-08-02 09:46:06,016 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 2.064, New V(s): 8.142\n", "2025-08-02 09:46:06,016 - INFO - Episode Complete - Total Reward: -1, Avg TD Error: 1.612\n", "2025-08-02 09:46:06,016 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,016 - INFO - TD Update - State: 0, Reward: 0, Target: 4.596, TD Error: 1.318, New V(s): 3.410\n", "2025-08-02 09:46:06,017 - INFO - TD Update - State: 1, Reward: -1, Target: 3.596, TD Error: -1.511, New V(s): 4.956\n", "2025-08-02 09:46:06,017 - INFO - TD Update - State: 1, Reward: -1, Target: 3.460, TD Error: -1.496, New V(s): 4.806\n", "2025-08-02 09:46:06,017 - INFO - TD Update - State: 1, Reward: -1, Target: 3.326, TD Error: -1.481, New V(s): 4.658\n", "2025-08-02 09:46:06,017 - INFO - TD Update - State: 1, Reward: -1, Target: 3.192, TD Error: -1.466, New V(s): 4.512\n", "2025-08-02 09:46:06,018 - INFO - TD Update - State: 1, Reward: 0, Target: 5.612, TD Error: 1.101, New V(s): 4.622\n", "2025-08-02 09:46:06,018 - INFO - TD Update - State: 2, Reward: -1, Target: 4.612, TD Error: -1.624, New V(s): 6.074\n", "2025-08-02 09:46:06,018 - INFO - TD Update - State: 2, Reward: 0, Target: 7.328, TD Error: 1.254, New V(s): 6.199\n", "2025-08-02 09:46:06,018 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.858, New V(s): 8.328\n", "2025-08-02 09:46:06,018 - INFO - Episode Complete - Total Reward: 5, Avg TD Error: 1.456\n", "2025-08-02 09:46:06,018 - INFO - \n", "๐Ÿ“Š Episode 60 Summary:\n", "2025-08-02 09:46:06,019 - INFO - Value Function: [3.40975648 4.62162398 6.19914033 8.32782678 0. ]\n", "2025-08-02 09:46:06,019 - INFO - Recent Avg Reward: 6.60\n", "2025-08-02 09:46:06,019 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,019 - INFO - TD Update - State: 0, Reward: -1, Target: 2.069, TD Error: -1.341, New V(s): 3.276\n", "2025-08-02 09:46:06,019 - INFO - TD Update - State: 0, Reward: -1, Target: 1.948, TD Error: -1.328, New V(s): 3.143\n", "2025-08-02 09:46:06,020 - INFO - TD Update - State: 0, Reward: -1, Target: 1.829, TD Error: -1.314, New V(s): 3.011\n", "2025-08-02 09:46:06,020 - INFO - TD Update - State: 0, Reward: 0, Target: 4.159, TD Error: 1.148, New V(s): 3.126\n", "2025-08-02 09:46:06,020 - INFO - TD Update - State: 1, Reward: 0, Target: 5.579, TD Error: 0.958, New V(s): 4.717\n", "2025-08-02 09:46:06,020 - INFO - TD Update - State: 2, Reward: -1, Target: 4.579, TD Error: -1.620, New V(s): 6.037\n", "2025-08-02 09:46:06,020 - INFO - TD Update - State: 2, Reward: 0, Target: 7.495, TD Error: 1.458, New V(s): 6.183\n", "2025-08-02 09:46:06,020 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.672, New V(s): 8.495\n", "2025-08-02 09:46:06,020 - INFO - Episode Complete - Total Reward: 6, Avg TD Error: 1.355\n", "2025-08-02 09:46:06,021 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,021 - INFO - TD Update - State: 0, Reward: 0, Target: 4.246, TD Error: 1.119, New V(s): 3.238\n", "2025-08-02 09:46:06,021 - INFO - TD Update - State: 1, Reward: 0, Target: 5.565, TD Error: 0.847, New V(s): 4.802\n", "2025-08-02 09:46:06,021 - INFO - TD Update - State: 2, Reward: -1, Target: 4.565, TD Error: -1.618, New V(s): 6.021\n", "2025-08-02 09:46:06,021 - INFO - TD Update - State: 2, Reward: 0, Target: 7.646, TD Error: 1.624, New V(s): 6.184\n", "2025-08-02 09:46:06,021 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.505, New V(s): 8.646\n", "2025-08-02 09:46:06,022 - INFO - Episode Complete - Total Reward: 9, Avg TD Error: 1.343\n", "2025-08-02 09:46:06,022 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,022 - INFO - TD Update - State: 0, Reward: 0, Target: 4.322, TD Error: 1.084, New V(s): 3.347\n", "2025-08-02 09:46:06,022 - INFO - TD Update - State: 1, Reward: 0, Target: 5.565, TD Error: 0.763, New V(s): 4.878\n", "2025-08-02 09:46:06,022 - INFO - TD Update - State: 2, Reward: -1, Target: 4.565, TD Error: -1.618, New V(s): 6.022\n", "2025-08-02 09:46:06,023 - INFO - TD Update - State: 2, Reward: -1, Target: 4.420, TD Error: -1.602, New V(s): 5.861\n", "2025-08-02 09:46:06,023 - INFO - TD Update - State: 2, Reward: -1, Target: 4.275, TD Error: -1.586, New V(s): 5.703\n", "2025-08-02 09:46:06,023 - INFO - TD Update - State: 2, Reward: 0, Target: 7.781, TD Error: 2.078, New V(s): 5.911\n", "2025-08-02 09:46:06,023 - INFO - TD Update - State: 3, Reward: -1, Target: 6.781, TD Error: -1.865, New V(s): 8.459\n", "2025-08-02 09:46:06,023 - INFO - TD Update - State: 3, Reward: -1, Target: 6.613, TD Error: -1.846, New V(s): 8.274\n", "2025-08-02 09:46:06,024 - INFO - TD Update - State: 3, Reward: -1, Target: 6.447, TD Error: -1.827, New V(s): 8.092\n", "2025-08-02 09:46:06,024 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.908, New V(s): 8.283\n", "2025-08-02 09:46:06,024 - INFO - Episode Complete - Total Reward: 4, Avg TD Error: 1.618\n", "2025-08-02 09:46:06,024 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,024 - INFO - TD Update - State: 0, Reward: 0, Target: 4.391, TD Error: 1.044, New V(s): 3.451\n", "2025-08-02 09:46:06,024 - INFO - TD Update - State: 1, Reward: 0, Target: 5.320, TD Error: 0.441, New V(s): 4.923\n", "2025-08-02 09:46:06,025 - INFO - TD Update - State: 2, Reward: -1, Target: 4.320, TD Error: -1.591, New V(s): 5.752\n", "2025-08-02 09:46:06,025 - INFO - TD Update - State: 2, Reward: -1, Target: 4.176, TD Error: -1.575, New V(s): 5.594\n", "2025-08-02 09:46:06,025 - INFO - TD Update - State: 2, Reward: 0, Target: 7.454, TD Error: 1.860, New V(s): 5.780\n", "2025-08-02 09:46:06,025 - INFO - TD Update - State: 3, Reward: -1, Target: 6.454, TD Error: -1.828, New V(s): 8.100\n", "2025-08-02 09:46:06,025 - INFO - TD Update - State: 3, Reward: -1, Target: 6.290, TD Error: -1.810, New V(s): 7.919\n", "2025-08-02 09:46:06,025 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 2.081, New V(s): 8.127\n", "2025-08-02 09:46:06,026 - INFO - Episode Complete - Total Reward: 6, Avg TD Error: 1.529\n", "2025-08-02 09:46:06,026 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,026 - INFO - TD Update - State: 0, Reward: -1, Target: 2.106, TD Error: -1.345, New V(s): 3.316\n", "2025-08-02 09:46:06,026 - INFO - TD Update - State: 0, Reward: 0, Target: 4.430, TD Error: 1.114, New V(s): 3.428\n", "2025-08-02 09:46:06,026 - INFO - TD Update - State: 1, Reward: -1, Target: 3.430, TD Error: -1.492, New V(s): 4.773\n", "2025-08-02 09:46:06,026 - INFO - TD Update - State: 1, Reward: -1, Target: 3.296, TD Error: -1.477, New V(s): 4.626\n", "2025-08-02 09:46:06,026 - INFO - TD Update - State: 1, Reward: 0, Target: 5.202, TD Error: 0.577, New V(s): 4.683\n", "2025-08-02 09:46:06,027 - INFO - TD Update - State: 2, Reward: 0, Target: 7.314, TD Error: 1.534, New V(s): 5.934\n", "2025-08-02 09:46:06,027 - INFO - TD Update - State: 3, Reward: -1, Target: 6.314, TD Error: -1.813, New V(s): 7.946\n", "2025-08-02 09:46:06,027 - INFO - TD Update - State: 3, Reward: -1, Target: 6.151, TD Error: -1.795, New V(s): 7.766\n", "2025-08-02 09:46:06,027 - INFO - TD Update - State: 3, Reward: -1, Target: 5.990, TD Error: -1.777, New V(s): 7.588\n", "2025-08-02 09:46:06,027 - INFO - TD Update - State: 3, Reward: -1, Target: 5.830, TD Error: -1.759, New V(s): 7.413\n", "2025-08-02 09:46:06,027 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 2.587, New V(s): 7.671\n", "2025-08-02 09:46:06,027 - INFO - Episode Complete - Total Reward: 3, Avg TD Error: 1.570\n", "2025-08-02 09:46:06,028 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,028 - INFO - TD Update - State: 0, Reward: 0, Target: 4.215, TD Error: 0.787, New V(s): 3.507\n", "2025-08-02 09:46:06,028 - INFO - TD Update - State: 1, Reward: 0, Target: 5.340, TD Error: 0.657, New V(s): 4.749\n", "2025-08-02 09:46:06,028 - INFO - TD Update - State: 2, Reward: 0, Target: 6.904, TD Error: 0.971, New V(s): 6.031\n", "2025-08-02 09:46:06,028 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 2.329, New V(s): 7.904\n", "2025-08-02 09:46:06,028 - INFO - Episode Complete - Total Reward: 10, Avg TD Error: 1.186\n", "2025-08-02 09:46:06,074 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,075 - INFO - TD Update - State: 0, Reward: -1, Target: 2.156, TD Error: -1.351, New V(s): 3.371\n", "2025-08-02 09:46:06,075 - INFO - TD Update - State: 0, Reward: 0, Target: 4.274, TD Error: 0.903, New V(s): 3.462\n", "2025-08-02 09:46:06,075 - INFO - TD Update - State: 1, Reward: 0, Target: 5.428, TD Error: 0.679, New V(s): 4.817\n", "2025-08-02 09:46:06,075 - INFO - TD Update - State: 2, Reward: 0, Target: 7.114, TD Error: 1.083, New V(s): 6.139\n", "2025-08-02 09:46:06,075 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 2.096, New V(s): 8.114\n", "2025-08-02 09:46:06,076 - INFO - Episode Complete - Total Reward: 9, Avg TD Error: 1.222\n", "2025-08-02 09:46:06,076 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,076 - INFO - TD Update - State: 0, Reward: -1, Target: 2.116, TD Error: -1.346, New V(s): 3.327\n", "2025-08-02 09:46:06,076 - INFO - TD Update - State: 0, Reward: 0, Target: 4.335, TD Error: 1.008, New V(s): 3.428\n", "2025-08-02 09:46:06,076 - INFO - TD Update - State: 1, Reward: 0, Target: 5.525, TD Error: 0.708, New V(s): 4.888\n", "2025-08-02 09:46:06,077 - INFO - TD Update - State: 2, Reward: 0, Target: 7.302, TD Error: 1.164, New V(s): 6.255\n", "2025-08-02 09:46:06,077 - INFO - TD Update - State: 3, Reward: -1, Target: 6.302, TD Error: -1.811, New V(s): 7.933\n", "2025-08-02 09:46:06,077 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 2.067, New V(s): 8.139\n", "2025-08-02 09:46:06,077 - INFO - Episode Complete - Total Reward: 8, Avg TD Error: 1.351\n", "2025-08-02 09:46:06,077 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,078 - INFO - TD Update - State: 0, Reward: 0, Target: 4.399, TD Error: 0.971, New V(s): 3.525\n", "2025-08-02 09:46:06,078 - INFO - TD Update - State: 1, Reward: 0, Target: 5.630, TD Error: 0.742, New V(s): 4.962\n", "2025-08-02 09:46:06,078 - INFO - TD Update - State: 2, Reward: 0, Target: 7.325, TD Error: 1.070, New V(s): 6.362\n", "2025-08-02 09:46:06,078 - INFO - TD Update - State: 3, Reward: -1, Target: 6.325, TD Error: -1.814, New V(s): 7.958\n", "2025-08-02 09:46:06,078 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 2.042, New V(s): 8.162\n", "2025-08-02 09:46:06,079 - INFO - Episode Complete - Total Reward: 9, Avg TD Error: 1.328\n", "2025-08-02 09:46:06,079 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,079 - INFO - TD Update - State: 0, Reward: 0, Target: 4.466, TD Error: 0.941, New V(s): 3.619\n", "2025-08-02 09:46:06,079 - INFO - TD Update - State: 1, Reward: -1, Target: 3.466, TD Error: -1.496, New V(s): 4.812\n", "2025-08-02 09:46:06,079 - INFO - TD Update - State: 1, Reward: -1, Target: 3.331, TD Error: -1.481, New V(s): 4.664\n", "2025-08-02 09:46:06,080 - INFO - TD Update - State: 1, Reward: -1, Target: 3.198, TD Error: -1.466, New V(s): 4.517\n", "2025-08-02 09:46:06,080 - INFO - TD Update - State: 1, Reward: 0, Target: 5.726, TD Error: 1.209, New V(s): 4.638\n", "2025-08-02 09:46:06,080 - INFO - TD Update - State: 2, Reward: 0, Target: 7.346, TD Error: 0.984, New V(s): 6.461\n", "2025-08-02 09:46:06,080 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.838, New V(s): 8.346\n", "2025-08-02 09:46:06,080 - INFO - Episode Complete - Total Reward: 7, Avg TD Error: 1.345\n", "2025-08-02 09:46:06,080 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,081 - INFO - TD Update - State: 0, Reward: 0, Target: 4.174, TD Error: 0.555, New V(s): 3.675\n", "2025-08-02 09:46:06,081 - INFO - TD Update - State: 1, Reward: 0, Target: 5.815, TD Error: 1.176, New V(s): 4.756\n", "2025-08-02 09:46:06,081 - INFO - TD Update - State: 2, Reward: -1, Target: 4.815, TD Error: -1.646, New V(s): 6.296\n", "2025-08-02 09:46:06,081 - INFO - TD Update - State: 2, Reward: -1, Target: 4.666, TD Error: -1.630, New V(s): 6.133\n", "2025-08-02 09:46:06,082 - INFO - TD Update - State: 2, Reward: -1, Target: 4.520, TD Error: -1.613, New V(s): 5.972\n", "2025-08-02 09:46:06,082 - INFO - TD Update - State: 2, Reward: -1, Target: 4.375, TD Error: -1.597, New V(s): 5.812\n", "2025-08-02 09:46:06,082 - INFO - TD Update - State: 2, Reward: -1, Target: 4.231, TD Error: -1.581, New V(s): 5.654\n", "2025-08-02 09:46:06,082 - INFO - TD Update - State: 2, Reward: -1, Target: 4.089, TD Error: -1.565, New V(s): 5.497\n", "2025-08-02 09:46:06,082 - INFO - TD Update - State: 2, Reward: 0, Target: 7.511, TD Error: 2.014, New V(s): 5.699\n", "2025-08-02 09:46:06,082 - INFO - TD Update - State: 3, Reward: -1, Target: 6.511, TD Error: -1.835, New V(s): 8.163\n", "2025-08-02 09:46:06,083 - INFO - TD Update - State: 3, Reward: -1, Target: 6.346, TD Error: -1.816, New V(s): 7.981\n", "2025-08-02 09:46:06,083 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 2.019, New V(s): 8.183\n", "2025-08-02 09:46:06,083 - INFO - Episode Complete - Total Reward: 2, Avg TD Error: 1.587\n", "2025-08-02 09:46:06,083 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,083 - INFO - TD Update - State: 0, Reward: -1, Target: 2.307, TD Error: -1.367, New V(s): 3.538\n", "2025-08-02 09:46:06,083 - INFO - TD Update - State: 0, Reward: -1, Target: 2.184, TD Error: -1.354, New V(s): 3.402\n", "2025-08-02 09:46:06,084 - INFO - TD Update - State: 0, Reward: -1, Target: 2.062, TD Error: -1.340, New V(s): 3.268\n", "2025-08-02 09:46:06,084 - INFO - TD Update - State: 0, Reward: 0, Target: 4.280, TD Error: 1.012, New V(s): 3.370\n", "2025-08-02 09:46:06,084 - INFO - TD Update - State: 1, Reward: -1, Target: 3.280, TD Error: -1.476, New V(s): 4.608\n", "2025-08-02 09:46:06,084 - INFO - TD Update - State: 1, Reward: 0, Target: 5.129, TD Error: 0.521, New V(s): 4.660\n", "2025-08-02 09:46:06,084 - INFO - TD Update - State: 2, Reward: -1, Target: 4.129, TD Error: -1.570, New V(s): 5.542\n", "2025-08-02 09:46:06,085 - INFO - TD Update - State: 2, Reward: 0, Target: 7.365, TD Error: 1.823, New V(s): 5.724\n", "2025-08-02 09:46:06,085 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.817, New V(s): 8.365\n", "2025-08-02 09:46:06,085 - INFO - Episode Complete - Total Reward: 5, Avg TD Error: 1.364\n", "2025-08-02 09:46:06,085 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,085 - INFO - TD Update - State: 0, Reward: 0, Target: 4.194, TD Error: 0.825, New V(s): 3.452\n", "2025-08-02 09:46:06,086 - INFO - TD Update - State: 1, Reward: 0, Target: 5.152, TD Error: 0.491, New V(s): 4.710\n", "2025-08-02 09:46:06,086 - INFO - TD Update - State: 2, Reward: 0, Target: 7.528, TD Error: 1.804, New V(s): 5.904\n", "2025-08-02 09:46:06,086 - INFO - TD Update - State: 3, Reward: -1, Target: 6.528, TD Error: -1.836, New V(s): 8.181\n", "2025-08-02 09:46:06,086 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.819, New V(s): 8.363\n", "2025-08-02 09:46:06,086 - INFO - Episode Complete - Total Reward: 9, Avg TD Error: 1.355\n", "2025-08-02 09:46:06,087 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,087 - INFO - TD Update - State: 0, Reward: 0, Target: 4.239, TD Error: 0.786, New V(s): 3.531\n", "2025-08-02 09:46:06,087 - INFO - TD Update - State: 1, Reward: -1, Target: 3.239, TD Error: -1.471, New V(s): 4.562\n", "2025-08-02 09:46:06,087 - INFO - TD Update - State: 1, Reward: -1, Target: 3.106, TD Error: -1.456, New V(s): 4.417\n", "2025-08-02 09:46:06,087 - INFO - TD Update - State: 1, Reward: 0, Target: 5.314, TD Error: 0.897, New V(s): 4.507\n", "2025-08-02 09:46:06,088 - INFO - TD Update - State: 2, Reward: -1, Target: 4.314, TD Error: -1.590, New V(s): 5.745\n", "2025-08-02 09:46:06,088 - INFO - TD Update - State: 2, Reward: -1, Target: 4.171, TD Error: -1.575, New V(s): 5.588\n", "2025-08-02 09:46:06,088 - INFO - TD Update - State: 2, Reward: 0, Target: 7.527, TD Error: 1.939, New V(s): 5.782\n", "2025-08-02 09:46:06,088 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.637, New V(s): 8.527\n", "2025-08-02 09:46:06,088 - INFO - Episode Complete - Total Reward: 6, Avg TD Error: 1.419\n", "2025-08-02 09:46:06,088 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,089 - INFO - TD Update - State: 0, Reward: -1, Target: 2.178, TD Error: -1.353, New V(s): 3.395\n", "2025-08-02 09:46:06,089 - INFO - TD Update - State: 0, Reward: -1, Target: 2.056, TD Error: -1.340, New V(s): 3.262\n", "2025-08-02 09:46:06,089 - INFO - TD Update - State: 0, Reward: -1, Target: 1.935, TD Error: -1.326, New V(s): 3.129\n", "2025-08-02 09:46:06,089 - INFO - TD Update - State: 0, Reward: 0, Target: 4.056, TD Error: 0.927, New V(s): 3.222\n", "2025-08-02 09:46:06,089 - INFO - TD Update - State: 1, Reward: 0, Target: 5.204, TD Error: 0.697, New V(s): 4.576\n", "2025-08-02 09:46:06,089 - INFO - TD Update - State: 2, Reward: 0, Target: 7.674, TD Error: 1.892, New V(s): 5.971\n", "2025-08-02 09:46:06,090 - INFO - TD Update - State: 3, Reward: -1, Target: 6.674, TD Error: -1.853, New V(s): 8.341\n", "2025-08-02 09:46:06,090 - INFO - TD Update - State: 3, Reward: -1, Target: 6.507, TD Error: -1.834, New V(s): 8.158\n", "2025-08-02 09:46:06,090 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.842, New V(s): 8.342\n", "2025-08-02 09:46:06,090 - INFO - Episode Complete - Total Reward: 5, Avg TD Error: 1.452\n", "2025-08-02 09:46:06,090 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,091 - INFO - TD Update - State: 0, Reward: 0, Target: 4.119, TD Error: 0.897, New V(s): 3.311\n", "2025-08-02 09:46:06,091 - INFO - TD Update - State: 1, Reward: -1, Target: 3.119, TD Error: -1.458, New V(s): 4.430\n", "2025-08-02 09:46:06,091 - INFO - TD Update - State: 1, Reward: -1, Target: 2.987, TD Error: -1.443, New V(s): 4.286\n", "2025-08-02 09:46:06,091 - INFO - TD Update - State: 1, Reward: -1, Target: 2.858, TD Error: -1.429, New V(s): 4.143\n", "2025-08-02 09:46:06,091 - INFO - TD Update - State: 1, Reward: -1, Target: 2.729, TD Error: -1.414, New V(s): 4.002\n", "2025-08-02 09:46:06,091 - INFO - TD Update - State: 1, Reward: -1, Target: 2.602, TD Error: -1.400, New V(s): 3.862\n", "2025-08-02 09:46:06,092 - INFO - TD Update - State: 1, Reward: -1, Target: 2.476, TD Error: -1.386, New V(s): 3.723\n", "2025-08-02 09:46:06,092 - INFO - TD Update - State: 1, Reward: -1, Target: 2.351, TD Error: -1.372, New V(s): 3.586\n", "2025-08-02 09:46:06,092 - INFO - TD Update - State: 1, Reward: -1, Target: 2.227, TD Error: -1.359, New V(s): 3.450\n", "2025-08-02 09:46:06,092 - INFO - TD Update - State: 1, Reward: 0, Target: 5.374, TD Error: 1.924, New V(s): 3.643\n", "2025-08-02 09:46:06,092 - INFO - TD Update - State: 2, Reward: 0, Target: 7.508, TD Error: 1.537, New V(s): 6.125\n", "2025-08-02 09:46:06,093 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.658, New V(s): 8.508\n", "2025-08-02 09:46:06,093 - INFO - Episode Complete - Total Reward: 2, Avg TD Error: 1.440\n", "2025-08-02 09:46:06,093 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,093 - INFO - TD Update - State: 0, Reward: 0, Target: 3.278, TD Error: -0.033, New V(s): 3.308\n", "2025-08-02 09:46:06,093 - INFO - TD Update - State: 1, Reward: 0, Target: 5.512, TD Error: 1.870, New V(s): 3.830\n", "2025-08-02 09:46:06,093 - INFO - TD Update - State: 2, Reward: 0, Target: 7.657, TD Error: 1.532, New V(s): 6.278\n", "2025-08-02 09:46:06,093 - INFO - TD Update - State: 3, Reward: -1, Target: 6.657, TD Error: -1.851, New V(s): 8.323\n", "2025-08-02 09:46:06,094 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.677, New V(s): 8.490\n", "2025-08-02 09:46:06,094 - INFO - Episode Complete - Total Reward: 9, Avg TD Error: 1.393\n", "2025-08-02 09:46:06,094 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,094 - INFO - TD Update - State: 0, Reward: -1, Target: 1.977, TD Error: -1.331, New V(s): 3.175\n", "2025-08-02 09:46:06,094 - INFO - TD Update - State: 0, Reward: 0, Target: 3.447, TD Error: 0.272, New V(s): 3.202\n", "2025-08-02 09:46:06,094 - INFO - TD Update - State: 1, Reward: -1, Target: 2.447, TD Error: -1.383, New V(s): 3.691\n", "2025-08-02 09:46:06,095 - INFO - TD Update - State: 1, Reward: -1, Target: 2.322, TD Error: -1.369, New V(s): 3.554\n", "2025-08-02 09:46:06,095 - INFO - TD Update - State: 1, Reward: 0, Target: 5.650, TD Error: 2.096, New V(s): 3.764\n", "2025-08-02 09:46:06,095 - INFO - TD Update - State: 2, Reward: 0, Target: 7.641, TD Error: 1.364, New V(s): 6.414\n", "2025-08-02 09:46:06,095 - INFO - TD Update - State: 3, Reward: -1, Target: 6.641, TD Error: -1.849, New V(s): 8.306\n", "2025-08-02 09:46:06,095 - INFO - TD Update - State: 3, Reward: -1, Target: 6.475, TD Error: -1.831, New V(s): 8.123\n", "2025-08-02 09:46:06,095 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.877, New V(s): 8.310\n", "2025-08-02 09:46:06,096 - INFO - Episode Complete - Total Reward: 5, Avg TD Error: 1.486\n", "2025-08-02 09:46:06,096 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,096 - INFO - TD Update - State: 0, Reward: -1, Target: 1.882, TD Error: -1.320, New V(s): 3.070\n", "2025-08-02 09:46:06,096 - INFO - TD Update - State: 0, Reward: 0, Target: 3.387, TD Error: 0.317, New V(s): 3.102\n", "2025-08-02 09:46:06,096 - INFO - TD Update - State: 1, Reward: 0, Target: 5.773, TD Error: 2.009, New V(s): 3.965\n", "2025-08-02 09:46:06,096 - INFO - TD Update - State: 2, Reward: -1, Target: 4.773, TD Error: -1.641, New V(s): 6.250\n", "2025-08-02 09:46:06,097 - INFO - TD Update - State: 2, Reward: 0, Target: 7.479, TD Error: 1.229, New V(s): 6.373\n", "2025-08-02 09:46:06,097 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.690, New V(s): 8.479\n", "2025-08-02 09:46:06,097 - INFO - Episode Complete - Total Reward: 8, Avg TD Error: 1.368\n", "2025-08-02 09:46:06,097 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,097 - INFO - TD Update - State: 0, Reward: 0, Target: 3.568, TD Error: 0.466, New V(s): 3.148\n", "2025-08-02 09:46:06,097 - INFO - TD Update - State: 1, Reward: 0, Target: 5.736, TD Error: 1.771, New V(s): 4.142\n", "2025-08-02 09:46:06,098 - INFO - TD Update - State: 2, Reward: 0, Target: 7.631, TD Error: 1.258, New V(s): 6.499\n", "2025-08-02 09:46:06,098 - INFO - TD Update - State: 3, Reward: -1, Target: 6.631, TD Error: -1.848, New V(s): 8.294\n", "2025-08-02 09:46:06,098 - INFO - TD Update - State: 3, Reward: -1, Target: 6.465, TD Error: -1.829, New V(s): 8.112\n", "2025-08-02 09:46:06,098 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.888, New V(s): 8.300\n", "2025-08-02 09:46:06,098 - INFO - Episode Complete - Total Reward: 8, Avg TD Error: 1.510\n", "2025-08-02 09:46:06,098 - INFO - \n", "๐Ÿ“Š Episode 80 Summary:\n", "2025-08-02 09:46:06,099 - INFO - Value Function: [3.14845117 4.14187375 6.49887922 8.30035909 0. ]\n", "2025-08-02 09:46:06,099 - INFO - Recent Avg Reward: 5.90\n", "2025-08-02 09:46:06,099 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,099 - INFO - TD Update - State: 0, Reward: 0, Target: 3.728, TD Error: 0.579, New V(s): 3.206\n", "2025-08-02 09:46:06,099 - INFO - TD Update - State: 1, Reward: -1, Target: 2.728, TD Error: -1.414, New V(s): 4.000\n", "2025-08-02 09:46:06,099 - INFO - TD Update - State: 1, Reward: -1, Target: 2.600, TD Error: -1.400, New V(s): 3.860\n", "2025-08-02 09:46:06,100 - INFO - TD Update - State: 1, Reward: -1, Target: 2.474, TD Error: -1.386, New V(s): 3.722\n", "2025-08-02 09:46:06,100 - INFO - TD Update - State: 1, Reward: -1, Target: 2.350, TD Error: -1.372, New V(s): 3.585\n", "2025-08-02 09:46:06,100 - INFO - TD Update - State: 1, Reward: -1, Target: 2.226, TD Error: -1.358, New V(s): 3.449\n", "2025-08-02 09:46:06,100 - INFO - TD Update - State: 1, Reward: 0, Target: 5.849, TD Error: 2.400, New V(s): 3.689\n", "2025-08-02 09:46:06,100 - INFO - TD Update - State: 2, Reward: 0, Target: 7.470, TD Error: 0.971, New V(s): 6.596\n", "2025-08-02 09:46:06,100 - INFO - TD Update - State: 3, Reward: -1, Target: 6.470, TD Error: -1.830, New V(s): 8.117\n", "2025-08-02 09:46:06,101 - INFO - TD Update - State: 3, Reward: -1, Target: 6.306, TD Error: -1.812, New V(s): 7.936\n", "2025-08-02 09:46:06,101 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 2.064, New V(s): 8.143\n", "2025-08-02 09:46:06,101 - INFO - Episode Complete - Total Reward: 3, Avg TD Error: 1.508\n", "2025-08-02 09:46:06,101 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,101 - INFO - TD Update - State: 0, Reward: -1, Target: 1.886, TD Error: -1.321, New V(s): 3.074\n", "2025-08-02 09:46:06,101 - INFO - TD Update - State: 0, Reward: -1, Target: 1.767, TD Error: -1.307, New V(s): 2.944\n", "2025-08-02 09:46:06,101 - INFO - TD Update - State: 0, Reward: 0, Target: 3.320, TD Error: 0.376, New V(s): 2.981\n", "2025-08-02 09:46:06,102 - INFO - TD Update - State: 1, Reward: -1, Target: 2.320, TD Error: -1.369, New V(s): 3.552\n", "2025-08-02 09:46:06,102 - INFO - TD Update - State: 1, Reward: 0, Target: 5.936, TD Error: 2.385, New V(s): 3.790\n", "2025-08-02 09:46:06,102 - INFO - TD Update - State: 2, Reward: -1, Target: 4.936, TD Error: -1.660, New V(s): 6.430\n", "2025-08-02 09:46:06,104 - INFO - TD Update - State: 2, Reward: 0, Target: 7.328, TD Error: 0.898, New V(s): 6.520\n", "2025-08-02 09:46:06,104 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.857, New V(s): 8.328\n", "2025-08-02 09:46:06,104 - INFO - Episode Complete - Total Reward: 6, Avg TD Error: 1.397\n", "2025-08-02 09:46:06,105 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,105 - INFO - TD Update - State: 0, Reward: 0, Target: 3.411, TD Error: 0.430, New V(s): 3.024\n", "2025-08-02 09:46:06,105 - INFO - TD Update - State: 1, Reward: 0, Target: 5.868, TD Error: 2.078, New V(s): 3.998\n", "2025-08-02 09:46:06,105 - INFO - TD Update - State: 2, Reward: 0, Target: 7.495, TD Error: 0.976, New V(s): 6.617\n", "2025-08-02 09:46:06,105 - INFO - TD Update - State: 3, Reward: -1, Target: 6.495, TD Error: -1.833, New V(s): 8.145\n", "2025-08-02 09:46:06,106 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.855, New V(s): 8.331\n", "2025-08-02 09:46:06,106 - INFO - Episode Complete - Total Reward: 9, Avg TD Error: 1.434\n", "2025-08-02 09:46:06,106 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,106 - INFO - TD Update - State: 0, Reward: -1, Target: 1.722, TD Error: -1.302, New V(s): 2.894\n", "2025-08-02 09:46:06,107 - INFO - TD Update - State: 0, Reward: 0, Target: 3.598, TD Error: 0.704, New V(s): 2.964\n", "2025-08-02 09:46:06,107 - INFO - TD Update - State: 1, Reward: -1, Target: 2.598, TD Error: -1.400, New V(s): 3.858\n", "2025-08-02 09:46:06,107 - INFO - TD Update - State: 1, Reward: 0, Target: 5.956, TD Error: 2.098, New V(s): 4.068\n", "2025-08-02 09:46:06,107 - INFO - TD Update - State: 2, Reward: -1, Target: 4.956, TD Error: -1.662, New V(s): 6.451\n", "2025-08-02 09:46:06,108 - INFO - TD Update - State: 2, Reward: 0, Target: 7.497, TD Error: 1.046, New V(s): 6.556\n", "2025-08-02 09:46:06,108 - INFO - TD Update - State: 3, Reward: -1, Target: 6.497, TD Error: -1.833, New V(s): 8.147\n", "2025-08-02 09:46:06,108 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.853, New V(s): 8.332\n", "2025-08-02 09:46:06,108 - INFO - Episode Complete - Total Reward: 6, Avg TD Error: 1.487\n", "2025-08-02 09:46:06,109 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,109 - INFO - TD Update - State: 0, Reward: -1, Target: 1.668, TD Error: -1.296, New V(s): 2.835\n", "2025-08-02 09:46:06,109 - INFO - TD Update - State: 0, Reward: -1, Target: 1.551, TD Error: -1.283, New V(s): 2.706\n", "2025-08-02 09:46:06,110 - INFO - TD Update - State: 0, Reward: 0, Target: 3.661, TD Error: 0.955, New V(s): 2.802\n", "2025-08-02 09:46:06,110 - INFO - TD Update - State: 1, Reward: -1, Target: 2.661, TD Error: -1.407, New V(s): 3.927\n", "2025-08-02 09:46:06,110 - INFO - TD Update - State: 1, Reward: -1, Target: 2.534, TD Error: -1.393, New V(s): 3.788\n", "2025-08-02 09:46:06,110 - INFO - TD Update - State: 1, Reward: 0, Target: 5.900, TD Error: 2.112, New V(s): 3.999\n", "2025-08-02 09:46:06,111 - INFO - TD Update - State: 2, Reward: -1, Target: 4.900, TD Error: -1.656, New V(s): 6.390\n", "2025-08-02 09:46:06,111 - INFO - TD Update - State: 2, Reward: 0, Target: 7.499, TD Error: 1.109, New V(s): 6.501\n", "2025-08-02 09:46:06,111 - INFO - TD Update - State: 3, Reward: -1, Target: 6.499, TD Error: -1.833, New V(s): 8.149\n", "2025-08-02 09:46:06,111 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.851, New V(s): 8.334\n", "2025-08-02 09:46:06,111 - INFO - Episode Complete - Total Reward: 4, Avg TD Error: 1.490\n", "2025-08-02 09:46:06,112 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,112 - INFO - TD Update - State: 0, Reward: -1, Target: 1.522, TD Error: -1.280, New V(s): 2.674\n", "2025-08-02 09:46:06,112 - INFO - TD Update - State: 0, Reward: -1, Target: 1.406, TD Error: -1.267, New V(s): 2.547\n", "2025-08-02 09:46:06,112 - INFO - TD Update - State: 0, Reward: -1, Target: 1.292, TD Error: -1.255, New V(s): 2.422\n", "2025-08-02 09:46:06,112 - INFO - TD Update - State: 0, Reward: -1, Target: 1.179, TD Error: -1.242, New V(s): 2.297\n", "2025-08-02 09:46:06,112 - INFO - TD Update - State: 0, Reward: -1, Target: 1.068, TD Error: -1.230, New V(s): 2.174\n", "2025-08-02 09:46:06,113 - INFO - TD Update - State: 0, Reward: -1, Target: 0.957, TD Error: -1.217, New V(s): 2.053\n", "2025-08-02 09:46:06,113 - INFO - TD Update - State: 0, Reward: -1, Target: 0.847, TD Error: -1.205, New V(s): 1.932\n", "2025-08-02 09:46:06,113 - INFO - TD Update - State: 0, Reward: -1, Target: 0.739, TD Error: -1.193, New V(s): 1.813\n", "2025-08-02 09:46:06,113 - INFO - TD Update - State: 0, Reward: -1, Target: 0.632, TD Error: -1.181, New V(s): 1.695\n", "2025-08-02 09:46:06,113 - INFO - TD Update - State: 0, Reward: 0, Target: 3.599, TD Error: 1.905, New V(s): 1.885\n", "2025-08-02 09:46:06,114 - INFO - TD Update - State: 1, Reward: 0, Target: 5.851, TD Error: 1.852, New V(s): 4.184\n", "2025-08-02 09:46:06,114 - INFO - TD Update - State: 2, Reward: -1, Target: 4.851, TD Error: -1.650, New V(s): 6.336\n", "2025-08-02 09:46:06,114 - INFO - TD Update - State: 2, Reward: -1, Target: 4.703, TD Error: -1.634, New V(s): 6.173\n", "2025-08-02 09:46:06,114 - INFO - TD Update - State: 2, Reward: -1, Target: 4.556, TD Error: -1.617, New V(s): 6.011\n", "2025-08-02 09:46:06,114 - INFO - TD Update - State: 2, Reward: 0, Target: 7.501, TD Error: 1.490, New V(s): 6.160\n", "2025-08-02 09:46:06,115 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.666, New V(s): 8.501\n", "2025-08-02 09:46:06,115 - INFO - Episode Complete - Total Reward: -2, Avg TD Error: 1.430\n", "2025-08-02 09:46:06,115 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,115 - INFO - TD Update - State: 0, Reward: 0, Target: 3.766, TD Error: 1.881, New V(s): 2.073\n", "2025-08-02 09:46:06,116 - INFO - TD Update - State: 1, Reward: 0, Target: 5.544, TD Error: 1.360, New V(s): 4.320\n", "2025-08-02 09:46:06,116 - INFO - TD Update - State: 2, Reward: -1, Target: 4.544, TD Error: -1.616, New V(s): 5.998\n", "2025-08-02 09:46:06,116 - INFO - TD Update - State: 2, Reward: -1, Target: 4.399, TD Error: -1.600, New V(s): 5.839\n", "2025-08-02 09:46:06,116 - INFO - TD Update - State: 2, Reward: -1, Target: 4.255, TD Error: -1.584, New V(s): 5.680\n", "2025-08-02 09:46:06,116 - INFO - TD Update - State: 2, Reward: 0, Target: 7.651, TD Error: 1.971, New V(s): 5.877\n", "2025-08-02 09:46:06,116 - INFO - TD Update - State: 3, Reward: -1, Target: 6.651, TD Error: -1.850, New V(s): 8.316\n", "2025-08-02 09:46:06,117 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.684, New V(s): 8.484\n", "2025-08-02 09:46:06,117 - INFO - Episode Complete - Total Reward: 6, Avg TD Error: 1.693\n", "2025-08-02 09:46:06,117 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,117 - INFO - TD Update - State: 0, Reward: -1, Target: 0.866, TD Error: -1.207, New V(s): 1.953\n", "2025-08-02 09:46:06,117 - INFO - TD Update - State: 0, Reward: -1, Target: 0.757, TD Error: -1.195, New V(s): 1.833\n", "2025-08-02 09:46:06,117 - INFO - TD Update - State: 0, Reward: 0, Target: 3.888, TD Error: 2.055, New V(s): 2.039\n", "2025-08-02 09:46:06,117 - INFO - TD Update - State: 1, Reward: -1, Target: 2.888, TD Error: -1.432, New V(s): 4.177\n", "2025-08-02 09:46:06,118 - INFO - TD Update - State: 1, Reward: 0, Target: 5.289, TD Error: 1.112, New V(s): 4.288\n", "2025-08-02 09:46:06,118 - INFO - TD Update - State: 2, Reward: 0, Target: 7.636, TD Error: 1.759, New V(s): 6.053\n", "2025-08-02 09:46:06,118 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.516, New V(s): 8.636\n", "2025-08-02 09:46:06,118 - INFO - Episode Complete - Total Reward: 7, Avg TD Error: 1.468\n", "2025-08-02 09:46:06,118 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,118 - INFO - TD Update - State: 0, Reward: 0, Target: 3.860, TD Error: 1.821, New V(s): 2.221\n", "2025-08-02 09:46:06,118 - INFO - TD Update - State: 1, Reward: 0, Target: 5.448, TD Error: 1.159, New V(s): 4.404\n", "2025-08-02 09:46:06,119 - INFO - TD Update - State: 2, Reward: 0, Target: 7.772, TD Error: 1.719, New V(s): 6.225\n", "2025-08-02 09:46:06,119 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.364, New V(s): 8.772\n", "2025-08-02 09:46:06,119 - INFO - Episode Complete - Total Reward: 10, Avg TD Error: 1.516\n", "2025-08-02 09:46:06,119 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,119 - INFO - TD Update - State: 0, Reward: -1, Target: 0.999, TD Error: -1.222, New V(s): 2.098\n", "2025-08-02 09:46:06,119 - INFO - TD Update - State: 0, Reward: -1, Target: 0.889, TD Error: -1.210, New V(s): 1.977\n", "2025-08-02 09:46:06,119 - INFO - TD Update - State: 0, Reward: -1, Target: 0.780, TD Error: -1.198, New V(s): 1.858\n", "2025-08-02 09:46:06,120 - INFO - TD Update - State: 0, Reward: 0, Target: 3.964, TD Error: 2.106, New V(s): 2.068\n", "2025-08-02 09:46:06,120 - INFO - TD Update - State: 1, Reward: -1, Target: 2.964, TD Error: -1.440, New V(s): 4.260\n", "2025-08-02 09:46:06,120 - INFO - TD Update - State: 1, Reward: 0, Target: 5.602, TD Error: 1.342, New V(s): 4.394\n", "2025-08-02 09:46:06,120 - INFO - TD Update - State: 2, Reward: -1, Target: 4.602, TD Error: -1.622, New V(s): 6.063\n", "2025-08-02 09:46:06,120 - INFO - TD Update - State: 2, Reward: 0, Target: 7.895, TD Error: 1.832, New V(s): 6.246\n", "2025-08-02 09:46:06,120 - INFO - TD Update - State: 3, Reward: -1, Target: 6.895, TD Error: -1.877, New V(s): 8.585\n", "2025-08-02 09:46:06,120 - INFO - TD Update - State: 3, Reward: -1, Target: 6.726, TD Error: -1.858, New V(s): 8.399\n", "2025-08-02 09:46:06,120 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.601, New V(s): 8.559\n", "2025-08-02 09:46:06,121 - INFO - Episode Complete - Total Reward: 3, Avg TD Error: 1.574\n", "2025-08-02 09:46:06,121 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,121 - INFO - TD Update - State: 0, Reward: -1, Target: 0.861, TD Error: -1.207, New V(s): 1.948\n", "2025-08-02 09:46:06,121 - INFO - TD Update - State: 0, Reward: 0, Target: 3.955, TD Error: 2.007, New V(s): 2.148\n", "2025-08-02 09:46:06,121 - INFO - TD Update - State: 1, Reward: -1, Target: 2.955, TD Error: -1.439, New V(s): 4.251\n", "2025-08-02 09:46:06,121 - INFO - TD Update - State: 1, Reward: 0, Target: 5.621, TD Error: 1.371, New V(s): 4.388\n", "2025-08-02 09:46:06,122 - INFO - TD Update - State: 2, Reward: -1, Target: 4.621, TD Error: -1.625, New V(s): 6.083\n", "2025-08-02 09:46:06,122 - INFO - TD Update - State: 2, Reward: -1, Target: 4.475, TD Error: -1.608, New V(s): 5.923\n", "2025-08-02 09:46:06,122 - INFO - TD Update - State: 2, Reward: 0, Target: 7.703, TD Error: 1.780, New V(s): 6.101\n", "2025-08-02 09:46:06,122 - INFO - TD Update - State: 3, Reward: -1, Target: 6.703, TD Error: -1.856, New V(s): 8.373\n", "2025-08-02 09:46:06,122 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.627, New V(s): 8.536\n", "2025-08-02 09:46:06,122 - INFO - Episode Complete - Total Reward: 5, Avg TD Error: 1.613\n", "2025-08-02 09:46:06,122 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,123 - INFO - TD Update - State: 0, Reward: 0, Target: 3.949, TD Error: 1.801, New V(s): 2.328\n", "2025-08-02 09:46:06,123 - INFO - TD Update - State: 1, Reward: 0, Target: 5.491, TD Error: 1.103, New V(s): 4.498\n", "2025-08-02 09:46:06,123 - INFO - TD Update - State: 2, Reward: -1, Target: 4.491, TD Error: -1.610, New V(s): 5.940\n", "2025-08-02 09:46:06,123 - INFO - TD Update - State: 2, Reward: 0, Target: 7.682, TD Error: 1.743, New V(s): 6.114\n", "2025-08-02 09:46:06,123 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.464, New V(s): 8.682\n", "2025-08-02 09:46:06,123 - INFO - Episode Complete - Total Reward: 9, Avg TD Error: 1.544\n", "2025-08-02 09:46:06,123 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,124 - INFO - TD Update - State: 0, Reward: -1, Target: 1.096, TD Error: -1.233, New V(s): 2.205\n", "2025-08-02 09:46:06,124 - INFO - TD Update - State: 0, Reward: 0, Target: 4.048, TD Error: 1.843, New V(s): 2.389\n", "2025-08-02 09:46:06,124 - INFO - TD Update - State: 1, Reward: 0, Target: 5.503, TD Error: 1.005, New V(s): 4.598\n", "2025-08-02 09:46:06,124 - INFO - TD Update - State: 2, Reward: -1, Target: 4.503, TD Error: -1.611, New V(s): 5.953\n", "2025-08-02 09:46:06,124 - INFO - TD Update - State: 2, Reward: 0, Target: 7.814, TD Error: 1.861, New V(s): 6.139\n", "2025-08-02 09:46:06,124 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.318, New V(s): 8.814\n", "2025-08-02 09:46:06,125 - INFO - Episode Complete - Total Reward: 8, Avg TD Error: 1.478\n", "2025-08-02 09:46:06,125 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,125 - INFO - TD Update - State: 0, Reward: 0, Target: 4.139, TD Error: 1.749, New V(s): 2.564\n", "2025-08-02 09:46:06,125 - INFO - TD Update - State: 1, Reward: 0, Target: 5.525, TD Error: 0.927, New V(s): 4.691\n", "2025-08-02 09:46:06,125 - INFO - TD Update - State: 2, Reward: -1, Target: 4.525, TD Error: -1.614, New V(s): 5.978\n", "2025-08-02 09:46:06,125 - INFO - TD Update - State: 2, Reward: 0, Target: 7.933, TD Error: 1.955, New V(s): 6.173\n", "2025-08-02 09:46:06,125 - INFO - TD Update - State: 3, Reward: -1, Target: 6.933, TD Error: -1.881, New V(s): 8.626\n", "2025-08-02 09:46:06,125 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.374, New V(s): 8.763\n", "2025-08-02 09:46:06,126 - INFO - Episode Complete - Total Reward: 8, Avg TD Error: 1.583\n", "2025-08-02 09:46:06,126 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,126 - INFO - TD Update - State: 0, Reward: 0, Target: 4.222, TD Error: 1.658, New V(s): 2.730\n", "2025-08-02 09:46:06,126 - INFO - TD Update - State: 1, Reward: 0, Target: 5.556, TD Error: 0.865, New V(s): 4.778\n", "2025-08-02 09:46:06,126 - INFO - TD Update - State: 2, Reward: 0, Target: 7.887, TD Error: 1.714, New V(s): 6.344\n", "2025-08-02 09:46:06,126 - INFO - TD Update - State: 3, Reward: -1, Target: 6.887, TD Error: -1.876, New V(s): 8.576\n", "2025-08-02 09:46:06,126 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.424, New V(s): 8.718\n", "2025-08-02 09:46:06,127 - INFO - Episode Complete - Total Reward: 9, Avg TD Error: 1.507\n", "2025-08-02 09:46:06,127 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,127 - INFO - TD Update - State: 0, Reward: -1, Target: 1.457, TD Error: -1.273, New V(s): 2.603\n", "2025-08-02 09:46:06,127 - INFO - TD Update - State: 0, Reward: -1, Target: 1.343, TD Error: -1.260, New V(s): 2.477\n", "2025-08-02 09:46:06,127 - INFO - TD Update - State: 0, Reward: -1, Target: 1.229, TD Error: -1.248, New V(s): 2.352\n", "2025-08-02 09:46:06,127 - INFO - TD Update - State: 0, Reward: -1, Target: 1.117, TD Error: -1.235, New V(s): 2.228\n", "2025-08-02 09:46:06,127 - INFO - TD Update - State: 0, Reward: 0, Target: 4.300, TD Error: 2.071, New V(s): 2.436\n", "2025-08-02 09:46:06,128 - INFO - TD Update - State: 1, Reward: -1, Target: 3.300, TD Error: -1.478, New V(s): 4.630\n", "2025-08-02 09:46:06,128 - INFO - TD Update - State: 1, Reward: 0, Target: 5.710, TD Error: 1.080, New V(s): 4.738\n", "2025-08-02 09:46:06,128 - INFO - TD Update - State: 2, Reward: 0, Target: 7.846, TD Error: 1.502, New V(s): 6.495\n", "2025-08-02 09:46:06,128 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.282, New V(s): 8.846\n", "2025-08-02 09:46:06,128 - INFO - Episode Complete - Total Reward: 5, Avg TD Error: 1.381\n", "2025-08-02 09:46:06,128 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,129 - INFO - TD Update - State: 0, Reward: -1, Target: 1.192, TD Error: -1.244, New V(s): 2.311\n", "2025-08-02 09:46:06,129 - INFO - TD Update - State: 0, Reward: -1, Target: 1.080, TD Error: -1.231, New V(s): 2.188\n", "2025-08-02 09:46:06,129 - INFO - TD Update - State: 0, Reward: 0, Target: 4.264, TD Error: 2.076, New V(s): 2.396\n", "2025-08-02 09:46:06,129 - INFO - TD Update - State: 1, Reward: -1, Target: 3.264, TD Error: -1.474, New V(s): 4.590\n", "2025-08-02 09:46:06,129 - INFO - TD Update - State: 1, Reward: 0, Target: 5.845, TD Error: 1.255, New V(s): 4.716\n", "2025-08-02 09:46:06,129 - INFO - TD Update - State: 2, Reward: 0, Target: 7.962, TD Error: 1.467, New V(s): 6.641\n", "2025-08-02 09:46:06,129 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.154, New V(s): 8.962\n", "2025-08-02 09:46:06,130 - INFO - Episode Complete - Total Reward: 7, Avg TD Error: 1.414\n", "2025-08-02 09:46:06,130 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,130 - INFO - TD Update - State: 0, Reward: -1, Target: 1.156, TD Error: -1.240, New V(s): 2.272\n", "2025-08-02 09:46:06,130 - INFO - TD Update - State: 0, Reward: 0, Target: 4.244, TD Error: 1.973, New V(s): 2.469\n", "2025-08-02 09:46:06,130 - INFO - TD Update - State: 1, Reward: -1, Target: 3.244, TD Error: -1.472, New V(s): 4.569\n", "2025-08-02 09:46:06,130 - INFO - TD Update - State: 1, Reward: 0, Target: 5.977, TD Error: 1.408, New V(s): 4.710\n", "2025-08-02 09:46:06,130 - INFO - TD Update - State: 2, Reward: 0, Target: 8.066, TD Error: 1.424, New V(s): 6.784\n", "2025-08-02 09:46:06,131 - INFO - TD Update - State: 3, Reward: -1, Target: 7.066, TD Error: -1.896, New V(s): 8.772\n", "2025-08-02 09:46:06,131 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.228, New V(s): 8.895\n", "2025-08-02 09:46:06,131 - INFO - Episode Complete - Total Reward: 7, Avg TD Error: 1.520\n", "2025-08-02 09:46:06,131 - INFO - \n", "=== Starting New Episode ===\n", "2025-08-02 09:46:06,131 - INFO - TD Update - State: 0, Reward: -1, Target: 1.222, TD Error: -1.247, New V(s): 2.344\n", "2025-08-02 09:46:06,131 - INFO - TD Update - State: 0, Reward: -1, Target: 1.110, TD Error: -1.234, New V(s): 2.221\n", "2025-08-02 09:46:06,131 - INFO - TD Update - State: 0, Reward: 0, Target: 4.239, TD Error: 2.018, New V(s): 2.423\n", "2025-08-02 09:46:06,132 - INFO - TD Update - State: 1, Reward: 0, Target: 6.105, TD Error: 1.396, New V(s): 4.849\n", "2025-08-02 09:46:06,132 - INFO - TD Update - State: 2, Reward: 0, Target: 8.005, TD Error: 1.222, New V(s): 6.906\n", "2025-08-02 09:46:06,132 - INFO - TD Update - State: 3, Reward: -1, Target: 7.005, TD Error: -1.889, New V(s): 8.706\n", "2025-08-02 09:46:06,132 - INFO - TD Update - State: 3, Reward: -1, Target: 6.835, TD Error: -1.871, New V(s): 8.519\n", "2025-08-02 09:46:06,132 - INFO - TD Update - State: 3, Reward: 10, Target: 10.000, TD Error: 1.481, New V(s): 8.667\n", "2025-08-02 09:46:06,132 - INFO - Episode Complete - Total Reward: 6, Avg TD Error: 1.545\n", "2025-08-02 09:46:06,132 - INFO - \n", "โœ… Training Complete!\n", "2025-08-02 09:46:06,133 - INFO - Final Value Function: [2.42265579 4.84913975 6.90591591 8.66697041 0. ]\n", "2025-08-02 09:46:06,134 - INFO - ๐Ÿ’พ Results saved to td_learning_20250802_094606.json\n", "2025-08-02 09:46:06,559 - INFO - ๐Ÿ“ˆ Plots saved to td_learning_plots_20250802_094606.png\n" ] }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "2025-08-02 09:46:06,655 - INFO - \n", "๐ŸŽ‰ Experiment Complete!\n", "2025-08-02 09:46:06,655 - INFO - ๐Ÿ“ Files saved: td_learning_20250802_094606.json, td_learning_plots_20250802_094606.png\n" ] } ], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import random\n", "import logging\n", "from collections import defaultdict\n", "import json\n", "import os\n", "from datetime import datetime\n", "\n", "logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')\n", "logger = logging.getLogger(__name__)\n", "\n", "class TDLearningEnvironment:\n", " def __init__(self, num_states=5):\n", " self.num_states = num_states\n", " self.current_state = 0\n", " self.terminal_state = num_states - 1\n", " self.reset()\n", " \n", " def reset(self):\n", " self.current_state = 0\n", " return self.current_state\n", " \n", " def step(self, action):\n", " if self.current_state == self.terminal_state:\n", " return self.current_state, 0, True\n", " \n", " reward = 0\n", " if action == 1:\n", " self.current_state += 1\n", " if self.current_state == self.terminal_state:\n", " reward = 10\n", " else:\n", " reward = -1\n", " \n", " done = self.current_state == self.terminal_state\n", " return self.current_state, reward, done\n", "\n", "class TDLearningAgent:\n", " def __init__(self, num_states, alpha=0.1, gamma=0.9):\n", " self.num_states = num_states\n", " self.alpha = alpha\n", " self.gamma = gamma\n", " self.V = np.zeros(num_states)\n", " self.episode_rewards = []\n", " self.value_history = []\n", " self.td_errors = []\n", " self.training_metrics = {\n", " 'episodes': [],\n", " 'total_rewards': [],\n", " 'avg_td_error': [],\n", " 'convergence_rate': []\n", " }\n", " \n", " def get_action(self, state):\n", " return random.choice([0, 1])\n", " \n", " def td_update(self, state, reward, next_state, done):\n", " if done:\n", " target = reward\n", " else:\n", " target = reward + self.gamma * self.V[next_state]\n", " \n", " td_error = target - self.V[state]\n", " self.V[state] += self.alpha * td_error\n", " \n", " self.td_errors.append(abs(td_error))\n", " \n", " logger.info(f\"TD Update - State: {state}, Reward: {reward}, Target: {target:.3f}, \"\n", " f\"TD Error: {td_error:.3f}, New V(s): {self.V[state]:.3f}\")\n", " \n", " return td_error\n", " \n", " def train_episode(self, env):\n", " state = env.reset()\n", " total_reward = 0\n", " episode_td_errors = []\n", " episode_states = []\n", " \n", " logger.info(f\"\\n=== Starting New Episode ===\")\n", " \n", " while True:\n", " action = self.get_action(state)\n", " next_state, reward, done = env.step(action)\n", " \n", " td_error = self.td_update(state, reward, next_state, done)\n", " episode_td_errors.append(abs(td_error))\n", " episode_states.append(state)\n", " \n", " total_reward += reward\n", " \n", " if done:\n", " break\n", " \n", " state = next_state\n", " \n", " self.episode_rewards.append(total_reward)\n", " self.value_history.append(self.V.copy())\n", " \n", " avg_td_error = np.mean(episode_td_errors) if episode_td_errors else 0\n", " logger.info(f\"Episode Complete - Total Reward: {total_reward}, \"\n", " f\"Avg TD Error: {avg_td_error:.3f}\")\n", " \n", " return total_reward, avg_td_error, episode_states\n", " \n", " def train(self, env, num_episodes=100):\n", " logger.info(f\"\\n๐Ÿš€ Starting TD Learning Training for {num_episodes} episodes\")\n", " logger.info(f\"Parameters - Alpha: {self.alpha}, Gamma: {self.gamma}\")\n", " \n", " for episode in range(num_episodes):\n", " total_reward, avg_td_error, states = self.train_episode(env)\n", " \n", " self.training_metrics['episodes'].append(episode)\n", " self.training_metrics['total_rewards'].append(total_reward)\n", " self.training_metrics['avg_td_error'].append(avg_td_error)\n", " \n", " convergence_rate = np.std(self.V) if len(self.value_history) > 1 else 0\n", " self.training_metrics['convergence_rate'].append(convergence_rate)\n", " \n", " if episode % 20 == 0:\n", " logger.info(f\"\\n๐Ÿ“Š Episode {episode} Summary:\")\n", " logger.info(f\"Value Function: {self.V}\")\n", " logger.info(f\"Recent Avg Reward: {np.mean(self.episode_rewards[-10:]):.2f}\")\n", " \n", " logger.info(f\"\\nโœ… Training Complete!\")\n", " logger.info(f\"Final Value Function: {self.V}\")\n", " \n", " def save_results(self, filename_prefix=\"td_learning\"):\n", " timestamp = datetime.now().strftime(\"%Y%m%d_%H%M%S\")\n", " \n", " results = {\n", " 'parameters': {\n", " 'alpha': self.alpha,\n", " 'gamma': self.gamma,\n", " 'num_states': self.num_states\n", " },\n", " 'final_values': self.V.tolist(),\n", " 'training_metrics': self.training_metrics,\n", " 'episode_rewards': self.episode_rewards\n", " }\n", " \n", " filename = f\"{filename_prefix}_{timestamp}.json\"\n", " with open(filename, 'w') as f:\n", " json.dump(results, f, indent=2)\n", " \n", " logger.info(f\"๐Ÿ’พ Results saved to {filename}\")\n", " return filename\n", " \n", " def visualize_training(self):\n", " fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 10))\n", " \n", " ax1.plot(self.training_metrics['episodes'], self.training_metrics['total_rewards'])\n", " ax1.set_title('Episode Rewards Over Time')\n", " ax1.set_xlabel('Episode')\n", " ax1.set_ylabel('Total Reward')\n", " ax1.grid(True)\n", " \n", " if len(self.value_history) > 0:\n", " value_array = np.array(self.value_history)\n", " for state in range(self.num_states):\n", " ax2.plot(range(len(self.value_history)), value_array[:, state], \n", " label=f'State {state}', linewidth=2)\n", " ax2.set_title('Value Function Evolution')\n", " ax2.set_xlabel('Episode')\n", " ax2.set_ylabel('Value V(s)')\n", " ax2.legend()\n", " ax2.grid(True)\n", " \n", " ax3.plot(self.training_metrics['episodes'], self.training_metrics['avg_td_error'])\n", " ax3.set_title('Average TD Error Over Time')\n", " ax3.set_xlabel('Episode')\n", " ax3.set_ylabel('TD Error')\n", " ax3.grid(True)\n", " \n", " final_values = self.V\n", " bars = ax4.bar(range(len(final_values)), final_values, \n", " color=['lightblue' if i != len(final_values)-1 else 'gold' \n", " for i in range(len(final_values))])\n", " ax4.set_title('Final Value Function')\n", " ax4.set_xlabel('State')\n", " ax4.set_ylabel('Value V(s)')\n", " ax4.grid(True, alpha=0.3)\n", " \n", " for i, v in enumerate(final_values):\n", " ax4.text(i, v + 0.1, f'{v:.2f}', ha='center', va='bottom', fontweight='bold')\n", " \n", " plt.tight_layout()\n", " \n", " timestamp = datetime.now().strftime(\"%Y%m%d_%H%M%S\")\n", " plot_filename = f\"td_learning_plots_{timestamp}.png\"\n", " plt.savefig(plot_filename, dpi=300, bbox_inches='tight')\n", " logger.info(f\"๐Ÿ“ˆ Plots saved to {plot_filename}\")\n", " \n", " plt.show()\n", " return plot_filename\n", "\n", "def run_td_learning_experiment():\n", " logger.info(\"๐ŸŽฏ TD Learning Implementation - Complete with Logging\")\n", " \n", " env = TDLearningEnvironment(num_states=5)\n", " agent = TDLearningAgent(num_states=5, alpha=0.1, gamma=0.9)\n", " \n", " logger.info(f\"Environment: {env.num_states} states (0 to {env.num_states-1})\")\n", " logger.info(f\"Goal: Reach terminal state {env.terminal_state} for reward +10\")\n", " \n", " agent.train(env, num_episodes=100)\n", " \n", " results_file = agent.save_results()\n", " plot_file = agent.visualize_training()\n", " \n", " logger.info(f\"\\n๐ŸŽ‰ Experiment Complete!\")\n", " logger.info(f\"๐Ÿ“ Files saved: {results_file}, {plot_file}\")\n", " \n", " return agent, env, results_file, plot_file\n", "\n", "if __name__ == \"__main__\":\n", " agent, env, results_file, plot_file = run_td_learning_experiment()" ] }, { "cell_type": "code", "execution_count": null, "id": "6f90a7a5-2341-4df8-bc4f-a28c3f8296a7", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import random\n", "import logging\n", "from collections import defaultdict\n", "import json\n", "import os\n", "from datetime import datetime\n", "\n", "# Configure logging to show detailed information about the training process\n", "# This helps us track exactly what the algorithm is learning at each step\n", "logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')\n", "logger = logging.getLogger(__name__)\n", "\n", "class TDLearningEnvironment:\n", " \"\"\"\n", " Simple environment for demonstrating TD Learning\n", " - States: 0, 1, 2, 3, 4 (where 4 is terminal)\n", " - Actions: 0 (stay/wrong), 1 (move forward/right)\n", " - Rewards: +10 for reaching terminal, -1 for wrong moves, 0 otherwise\n", " \"\"\"\n", " def __init__(self, num_states=5):\n", " self.num_states = num_states # Total number of states in our environment\n", " self.current_state = 0 # Agent always starts at state 0\n", " self.terminal_state = num_states - 1 # Last state (4) is the goal\n", " self.reset()\n", " \n", " def reset(self):\n", " \"\"\"Reset environment to starting state - called at beginning of each episode\"\"\"\n", " self.current_state = 0\n", " return self.current_state\n", " \n", " def step(self, action):\n", " \"\"\"\n", " Execute one step in the environment\n", " Args:\n", " action: 0 (stay/wrong move) or 1 (move forward)\n", " Returns:\n", " next_state, reward, done\n", " \"\"\"\n", " # If already at terminal state, no more moves possible\n", " if self.current_state == self.terminal_state:\n", " return self.current_state, 0, True\n", " \n", " reward = 0\n", " \n", " # Action 1 = move forward toward goal\n", " if action == 1:\n", " self.current_state += 1\n", " # Give big reward (+10) when reaching the terminal state\n", " if self.current_state == self.terminal_state:\n", " reward = 10\n", " else:\n", " # Action 0 = wrong move, give negative reward\n", " reward = -1\n", " \n", " # Episode is done when we reach the terminal state\n", " done = self.current_state == self.terminal_state\n", " return self.current_state, reward, done\n", "\n", "class TDLearningAgent:\n", " \"\"\"\n", " TD Learning Agent that learns state values V(s) using the TD(0) algorithm\n", " Key insight: Updates value estimates using bootstrapping - learning from partial experience\n", " \"\"\"\n", " def __init__(self, num_states, alpha=0.1, gamma=0.9):\n", " self.num_states = num_states # Number of states in the environment\n", " self.alpha = alpha # Learning rate: how fast we update our estimates\n", " self.gamma = gamma # Discount factor: how much we value future rewards\n", " \n", " # Initialize all state values to zero - these are our V(s) estimates\n", " self.V = np.zeros(num_states)\n", " \n", " # Track training progress for analysis and visualization\n", " self.episode_rewards = [] # Total reward per episode\n", " self.value_history = [] # V(s) values after each episode\n", " self.td_errors = [] # All TD errors during training\n", " \n", " # Metrics for tracking learning progress\n", " self.training_metrics = {\n", " 'episodes': [], # Episode numbers\n", " 'total_rewards': [], # Cumulative reward per episode\n", " 'avg_td_error': [], # Average TD error per episode\n", " 'convergence_rate': [] # How much V(s) values are changing\n", " }\n", " \n", " def get_action(self, state):\n", " \"\"\"\n", " Simple random policy for action selection\n", " In a real scenario, this could be epsilon-greedy or policy-based\n", " \"\"\"\n", " return random.choice([0, 1])\n", " \n", " def td_update(self, state, reward, next_state, done):\n", " \"\"\"\n", " The core TD(0) update rule: V(s) โ† V(s) + ฮฑ[r + ฮณV(s') - V(s)]\n", " This is the heart of temporal difference learning!\n", " \n", " Args:\n", " state: Current state s_t\n", " reward: Reward r_{t+1} received after taking action\n", " next_state: Next state s_{t+1}\n", " done: Whether episode is finished\n", " \n", " Returns:\n", " td_error: The temporal difference error ฮด\n", " \"\"\"\n", " # Calculate the TD target\n", " if done:\n", " # If episode is done, there's no next state value to consider\n", " target = reward\n", " else:\n", " # TD target = immediate reward + discounted future value\n", " # This is the \"bootstrapping\" - we use our current estimate of V(s')\n", " target = reward + self.gamma * self.V[next_state]\n", " \n", " # Calculate TD error: ฮด = target - current_estimate\n", " # This tells us how wrong our current value estimate was\n", " td_error = target - self.V[state]\n", " \n", " # Update the value function using the TD error\n", " # ฮฑ controls how much we trust this new experience vs our old estimate\n", " self.V[state] += self.alpha * td_error\n", " \n", " # Store TD error for analysis\n", " self.td_errors.append(abs(td_error))\n", " \n", " # Log the update for detailed tracking\n", " logger.info(f\"TD Update - State: {state}, Reward: {reward}, Target: {target:.3f}, \"\n", " f\"TD Error: {td_error:.3f}, New V(s): {self.V[state]:.3f}\")\n", " \n", " return td_error\n", " \n", " def train_episode(self, env):\n", " \"\"\"\n", " Run one complete episode of TD learning\n", " An episode goes from start state to terminal state (or max steps)\n", " \"\"\"\n", " state = env.reset() # Start at initial state\n", " total_reward = 0 # Track cumulative reward for this episode\n", " episode_td_errors = [] # Track TD errors for this episode\n", " episode_states = [] # Track which states we visited\n", " \n", " logger.info(f\"\\n=== Starting New Episode ===\")\n", " \n", " # Continue until episode is done\n", " while True:\n", " # Choose an action (random policy in this simple example)\n", " action = self.get_action(state)\n", " \n", " # Take action in environment and observe results\n", " next_state, reward, done = env.step(action)\n", " \n", " # *** THIS IS THE KEY STEP: TD UPDATE ***\n", " # Update our value function using the TD(0) rule\n", " td_error = self.td_update(state, reward, next_state, done)\n", " \n", " # Track statistics for analysis\n", " episode_td_errors.append(abs(td_error))\n", " episode_states.append(state)\n", " total_reward += reward\n", " \n", " # If episode is finished, break out of loop\n", " if done:\n", " break\n", " \n", " # Move to next state for next iteration\n", " state = next_state\n", " \n", " # Store episode results for tracking progress\n", " self.episode_rewards.append(total_reward)\n", " self.value_history.append(self.V.copy()) # Save snapshot of current V(s)\n", " \n", " # Calculate average TD error for this episode\n", " avg_td_error = np.mean(episode_td_errors) if episode_td_errors else 0\n", " \n", " logger.info(f\"Episode Complete - Total Reward: {total_reward}, \"\n", " f\"Avg TD Error: {avg_td_error:.3f}\")\n", " \n", " return total_reward, avg_td_error, episode_states\n", " \n", " def train(self, env, num_episodes=100):\n", " \"\"\"\n", " Train the TD learning agent for multiple episodes\n", " Each episode provides more experience to improve our value estimates\n", " \"\"\"\n", " logger.info(f\"\\n๐Ÿš€ Starting TD Learning Training for {num_episodes} episodes\")\n", " logger.info(f\"Parameters - Alpha: {self.alpha}, Gamma: {self.gamma}\")\n", " \n", " # Run the specified number of training episodes\n", " for episode in range(num_episodes):\n", " # Train for one episode and get results\n", " total_reward, avg_td_error, states = self.train_episode(env)\n", " \n", " # Store metrics for later analysis and plotting\n", " self.training_metrics['episodes'].append(episode)\n", " self.training_metrics['total_rewards'].append(total_reward)\n", " self.training_metrics['avg_td_error'].append(avg_td_error)\n", " \n", " # Calculate convergence rate (how much V(s) is still changing)\n", " convergence_rate = np.std(self.V) if len(self.value_history) > 1 else 0\n", " self.training_metrics['convergence_rate'].append(convergence_rate)\n", " \n", " # Print progress every 20 episodes\n", " if episode % 20 == 0:\n", " logger.info(f\"\\n๐Ÿ“Š Episode {episode} Summary:\")\n", " logger.info(f\"Value Function: {self.V}\")\n", " logger.info(f\"Recent Avg Reward: {np.mean(self.episode_rewards[-10:]):.2f}\")\n", " \n", " logger.info(f\"\\nโœ… Training Complete!\")\n", " logger.info(f\"Final Value Function: {self.V}\")\n", " \n", " def save_results(self, filename_prefix=\"td_learning\"):\n", " \"\"\"\n", " Save all training results to JSON file for later analysis\n", " This includes parameters, final values, and all training metrics\n", " \"\"\"\n", " timestamp = datetime.now().strftime(\"%Y%m%d_%H%M%S\")\n", " \n", " # Package all results into a dictionary\n", " results = {\n", " 'parameters': {\n", " 'alpha': self.alpha,\n", " 'gamma': self.gamma,\n", " 'num_states': self.num_states\n", " },\n", " 'final_values': self.V.tolist(), # Final learned V(s) values\n", " 'training_metrics': self.training_metrics, # All episode data\n", " 'episode_rewards': self.episode_rewards # Reward progression\n", " }\n", " \n", " # Save to timestamped JSON file\n", " filename = f\"{filename_prefix}_{timestamp}.json\"\n", " with open(filename, 'w') as f:\n", " json.dump(results, f, indent=2)\n", " \n", " logger.info(f\"๐Ÿ’พ Results saved to {filename}\")\n", " return filename\n", " \n", " def visualize_training(self):\n", " \"\"\"\n", " Create comprehensive visualizations of the TD learning process\n", " Shows how the algorithm learned over time\n", " \"\"\"\n", " # Create 2x2 subplot layout for multiple visualizations\n", " fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 10))\n", " \n", " # Plot 1: Episode rewards over time - shows learning progress\n", " ax1.plot(self.training_metrics['episodes'], self.training_metrics['total_rewards'])\n", " ax1.set_title('Episode Rewards Over Time')\n", " ax1.set_xlabel('Episode')\n", " ax1.set_ylabel('Total Reward')\n", " ax1.grid(True)\n", " \n", " # Plot 2: Value function evolution - shows how V(s) changed during learning\n", " if len(self.value_history) > 0:\n", " value_array = np.array(self.value_history)\n", " # Plot each state's value over time\n", " for state in range(self.num_states):\n", " ax2.plot(range(len(self.value_history)), value_array[:, state], \n", " label=f'State {state}', linewidth=2)\n", " ax2.set_title('Value Function Evolution')\n", " ax2.set_xlabel('Episode')\n", " ax2.set_ylabel('Value V(s)')\n", " ax2.legend()\n", " ax2.grid(True)\n", " \n", " # Plot 3: TD error over time - shows convergence\n", " ax3.plot(self.training_metrics['episodes'], self.training_metrics['avg_td_error'])\n", " ax3.set_title('Average TD Error Over Time')\n", " ax3.set_xlabel('Episode')\n", " ax3.set_ylabel('TD Error')\n", " ax3.grid(True)\n", " \n", " # Plot 4: Final value function - shows what the agent learned\n", " final_values = self.V\n", " bars = ax4.bar(range(len(final_values)), final_values, \n", " color=['lightblue' if i != len(final_values)-1 else 'gold' \n", " for i in range(len(final_values))])\n", " ax4.set_title('Final Value Function')\n", " ax4.set_xlabel('State')\n", " ax4.set_ylabel('Value V(s)')\n", " ax4.grid(True, alpha=0.3)\n", " \n", " # Add value labels on top of bars\n", " for i, v in enumerate(final_values):\n", " ax4.text(i, v + 0.1, f'{v:.2f}', ha='center', va='bottom', fontweight='bold')\n", " \n", " plt.tight_layout()\n", " \n", " # Save plot with timestamp\n", " timestamp = datetime.now().strftime(\"%Y%m%d_%H%M%S\")\n", " plot_filename = f\"td_learning_plots_{timestamp}.png\"\n", " plt.savefig(plot_filename, dpi=300, bbox_inches='tight')\n", " logger.info(f\"๐Ÿ“ˆ Plots saved to {plot_filename}\")\n", " \n", " plt.show()\n", " return plot_filename\n", "\n", "def run_td_learning_experiment():\n", " \"\"\"\n", " Main function to run the complete TD Learning experiment\n", " This ties together the environment and agent for a full demonstration\n", " \"\"\"\n", " logger.info(\"๐ŸŽฏ TD Learning Implementation - Complete with Logging\")\n", " \n", " # Create environment and agent\n", " env = TDLearningEnvironment(num_states=5) # 5-state environment\n", " agent = TDLearningAgent(num_states=5, alpha=0.1, gamma=0.9) # TD learning agent\n", " \n", " logger.info(f\"Environment: {env.num_states} states (0 to {env.num_states-1})\")\n", " logger.info(f\"Goal: Reach terminal state {env.terminal_state} for reward +10\")\n", " \n", " # Run the training process\n", " agent.train(env, num_episodes=100)\n", " \n", " # Save results and create visualizations\n", " results_file = agent.save_results()\n", " plot_file = agent.visualize_training()\n", " \n", " logger.info(f\"\\n๐ŸŽ‰ Experiment Complete!\")\n", " logger.info(f\"๐Ÿ“ Files saved: {results_file}, {plot_file}\")\n", " \n", " return agent, env, results_file, plot_file\n", "\n", "# Run the experiment when script is executed\n", "if __name__ == \"__main__\":\n", " agent, env, results_file, plot_file = run_td_learning_experiment()" ] }, { "cell_type": "markdown", "id": "c8566b53-0ebd-46de-b501-9d2a4ce87ceb", "metadata": {}, "source": [ "# PHASE 4: TRAINING FLOW\n", "\n", "Let me walk you through **exactly how TD Learning learns step-by-step** using simple analogies and the actual numbers from your run!\n", "\n", "--------------------------------------------\n", "## ๐ŸŽฏ **The Big Picture: Learning Restaurant Quality**\n", "--------------------------------------------\n", "\n", "Think of TD Learning like **rating restaurants on a street**:\n", "- **State 0**: Far from the best restaurant (your starting point)\n", "- **State 1-3**: Getting closer to the amazing restaurant\n", "- **State 4**: The amazing restaurant (+10 reward)\n", "\n", "The agent learns: *\"How good is each location, assuming I'll walk optimally from there?\"*\n", "\n", "--------------------------------------------\n", "## ๐Ÿ”„ **Step-by-Step Training Flow**\n", "--------------------------------------------\n", "\n", "### **STEP 1: Initialize Everything to Zero**\n", "```\n", "V(0) = 0.0 V(1) = 0.0 V(2) = 0.0 V(3) = 0.0 V(4) = 0.0\n", "```\n", "**Analogy**: *\"I have no idea how good any restaurant location is yet\"*\n", "\n", "---\n", "\n", "### **STEP 2: First Experience (Episode 1)**\n", "\n", "**๐Ÿšถ Agent's Journey:**\n", "```\n", "State 0 โ†’ (wrong action) โ†’ State 0, reward = -1\n", "State 0 โ†’ (right action) โ†’ State 1, reward = 0 \n", "State 1 โ†’ (right action) โ†’ State 2, reward = 0\n", "State 2 โ†’ (wrong action) โ†’ State 2, reward = -1\n", "State 2 โ†’ (right action) โ†’ State 3, reward = 0\n", "State 3 โ†’ (right action) โ†’ State 4, reward = +10 ๐ŸŽ‰\n", "```\n", "\n", "**๐Ÿง  TD Updates (The Learning Magic):**\n", "\n", "**Update 1**: State 0 gets wrong action\n", "```\n", "TD Target = reward + ฮณ ร— V(next_state) = -1 + 0.9 ร— 0.0 = -1.0\n", "TD Error = target - current = -1.0 - 0.0 = -1.0\n", "V(0) = 0.0 + 0.1 ร— (-1.0) = -0.1\n", "```\n", "*\"Being at State 0 and making wrong moves is bad!\"*\n", "\n", "**Update 2**: State 0 โ†’ State 1 (right move)\n", "```\n", "TD Target = 0 + 0.9 ร— 0.0 = 0.0\n", "TD Error = 0.0 - (-0.1) = 0.1\n", "V(0) = -0.1 + 0.1 ร— 0.1 = -0.09\n", "```\n", "*\"State 0 is slightly better when I move right\"*\n", "\n", "**Update 6**: State 3 โ†’ State 4 (BIG REWARD!)\n", "```\n", "TD Target = 10 + 0.9 ร— 0.0 = 10.0\n", "TD Error = 10.0 - 0.0 = 10.0\n", "V(3) = 0.0 + 0.1 ร— 10.0 = 1.0\n", "```\n", "*\"WOW! State 3 is really valuable - it leads to the prize!\"*\n", "\n", "**After Episode 1:**\n", "```\n", "V(0) = -0.09 V(1) = 0.0 V(2) = -0.09 V(3) = 1.0 V(4) = 0.0\n", "```\n", "\n", "---\n", "\n", "### **STEP 3: Information Spreads Backwards (Episode 2)**\n", "\n", "**๐Ÿ”„ The Magic of Bootstrapping:**\n", "\n", "When agent reaches State 2 โ†’ State 3:\n", "```\n", "TD Target = 0 + 0.9 ร— V(3) = 0 + 0.9 ร— 1.0 = 0.9\n", "TD Error = 0.9 - (-0.09) = 0.99\n", "V(2) = -0.09 + 0.1 ร— 0.99 = 0.009\n", "```\n", "*\"State 2 is valuable because it can reach the valuable State 3!\"*\n", "\n", "**Analogy**: Like hearing *\"The restaurant at location 3 is amazing!\"* so you start valuing location 2 because it's close to location 3.\n", "\n", "---\n", "\n", "### **STEP 4: Values Propagate Through the Chain**\n", "\n", "**Episodes 3-10: Watch the pattern!**\n", "```\n", "Episode 1: V = [-0.09, 0.00, -0.09, 1.00, 0.00]\n", "Episode 2: V = [-0.08, -0.10, 0.01, 1.90, 0.00]\n", "Episode 5: V = [-0.55, -0.41, 1.67, 4.97, 0.00]\n", "Episode 10: V = [-0.86, -0.27, 2.00, 5.57, 0.00]\n", "```\n", "\n", "**What's Happening:**\n", "- **State 3** learns fastest (closest to reward)\n", "- **State 2** learns it's good because it reaches State 3\n", "- **State 1** learns it's good because it reaches State 2\n", "- **State 0** slowly learns it's the starting point\n", "\n", "---\n", "\n", "### **STEP 5: Convergence (Episodes 50-100)**\n", "\n", "**Final Values from Your Run:**\n", "```\n", "Final: V = [2.42, 4.85, 6.91, 8.67, 0.00]\n", "```\n", "\n", "**Perfect Learning! ๐ŸŽฏ**\n", "- **Higher states = Higher values** (closer to reward)\n", "- **State 3**: 8.67 (almost as good as the +10 reward)\n", "- **State 2**: 6.91 (good because leads to State 3)\n", "- **State 1**: 4.85 (decent because 2 steps from reward)\n", "- **State 0**: 2.42 (starting point, but can reach goal)\n", "\n", "---\n", "\n", "### **STEP 6: How TD Error Shows Learning**\n", "\n", "**Early Episodes**: Large TD errors (2.0+)\n", "```\n", "\"I thought State 2 was worth 0, but I just learned it leads to State 3 worth 5!\"\n", "```\n", "\n", "**Later Episodes**: Small TD errors (1.4-)\n", "```\n", "\"I thought State 2 was worth 6.8, and I just learned it's worth 6.9 - close!\"\n", "```\n", "\n", "**Analogy**: Like a GPS that starts with wrong estimates but gets more accurate with each trip!\n", "\n", "--------------------------------------------\n", "## ๐Ÿ”ฅ **The \"Aha!\" Moments During Training**\n", "--------------------------------------------\n", "\n", "### **Moment 1: First Reward Discovery**\n", "```\n", "Episode 1: \"Holy cow! State 3 โ†’ +10 reward! State 3 is VALUABLE!\"\n", "```\n", "\n", "### **Moment 2: Backward Propagation**\n", "```\n", "Episode 2-5: \"Wait, State 2 is valuable too because it leads to State 3!\"\n", "```\n", "\n", "### **Moment 3: Chain Reaction**\n", "```\n", "Episode 10-20: \"State 1 is good because it leads to State 2, which leads to State 3!\"\n", "```\n", "\n", "### **Moment 4: Convergence**\n", "```\n", "Episode 80+: \"I've got it! Each state's value reflects how close it is to the goal!\"\n", "```\n", "\n", "--------------------------------------------\n", "## ๐Ÿ“Š **Numbers Changing Through Iterations**\n", "--------------------------------------------\n", "\n", "**Watch V(3) learn (closest to reward):**\n", "```\n", "Episode 1: V(3) = 1.00 (first discovery)\n", "Episode 5: V(3) = 4.97 (learning it's really good)\n", "Episode 20: V(3) = 7.77 (almost perfect)\n", "Episode 100: V(3) = 8.67 (converged!)\n", "```\n", "\n", "**Watch V(0) learn (furthest from reward):**\n", "```\n", "Episode 1: V(0) = -0.09 (seems bad at first)\n", "Episode 20: V(0) = 1.57 (realizing it can reach goal)\n", "Episode 50: V(0) = 3.15 (getting more optimistic)\n", "Episode 100: V(0) = 2.42 (settled at realistic value)\n", "```\n", "\n", "--------------------------------------------\n", "## ๐ŸŽจ **Visual Walkthrough**\n", "--------------------------------------------\n", "\n", "```\n", "Episode 1: [0] โ†’ [0] โ†’ [1] โ†’ [2] โ†’ [3] โ†’ [4] ๐Ÿ’ฐ\n", "Values: -0.09 0.0 -0.09 1.0 0.0\n", "\n", "Episode 50: [0] โ†’ [1] โ†’ [2] โ†’ [3] โ†’ [4] ๐Ÿ’ฐ \n", "Values: 3.15 4.27 6.11 8.88 0.0\n", "\n", "Episode 100: [0] โ†’ [1] โ†’ [2] โ†’ [3] โ†’ [4] ๐Ÿ’ฐ\n", "Values: 2.42 4.85 6.91 8.67 0.0\n", " โ†‘ โ†‘ โ†‘ โ†‘ โ†‘\n", " Start Better Good Great Goal!\n", "```\n", "\n", "--------------------------------------------\n", "## ๐Ÿงช **Why This Works (The Magic Explained)**\n", "--------------------------------------------\n", "\n", "**Traditional Learning**: *\"Wait until I finish the whole journey, then update everything\"*\n", "\n", "**TD Learning**: *\"Update my beliefs immediately based on what I just experienced + what I currently believe about the future\"*\n", "\n", "**The Bootstrap Formula:**\n", "```\n", "New Belief = Old Belief + Learning_Rate ร— (Reality - Old Belief)\n", " = Old Belief + ฮฑ ร— (reward + ฮณ ร— future_estimate - Old Belief)\n", "```\n", "\n", "**Why It's Powerful:**\n", "1. **Learns online** - no waiting for episode to end\n", "2. **Uses current knowledge** - bootstraps from existing estimates \n", "3. **Balances old vs new** - learning rate controls the blend\n", "4. **Propagates value backwards** - good states make previous states good\n", "\n", "**The Result**: The agent learns *\"How good is it to be in each state?\"* which is exactly what we want for decision making! ๐ŸŽฏ\n" ] }, { "cell_type": "code", "execution_count": null, "id": "f6e9f145-5a14-45d6-8a33-0ec83d4b7cce", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.13" } }, "nbformat": 4, "nbformat_minor": 5 }