Upload 25 files
- .gitattributes +3 -0
- Enhanced Environemnt Simulation.gif +0 -0
- README.md +230 -3
- Report.pdf +3 -0
- images/all_algos.png +3 -0
- images/all_algos_testing.png +3 -0
- images/desk.png +0 -0
- images/enhanced-environment.png +0 -0
- images/human.png +0 -0
- images/room.png +0 -0
- images/rover_desk_collision.png +0 -0
- images/rover_dest.png +0 -0
- images/rover_human_collision.png +0 -0
- images/rover_moving.png +0 -0
- images/rover_room_collision - Copy.png +0 -0
- images/rover_rover_collision.png +0 -0
- images/rover_start.png +0 -0
- images/target.png +0 -0
- images/trained_agents.gif +0 -0
- rx_rover_A2C.ipynb +0 -0
- rx_rover_DDQN.ipynb +0 -0
- rx_rover_DQN.ipynb +0 -0
- rx_rover_Double_Q_Learning.ipynb +0 -0
- rx_rover_Enhanced_Environment.ipynb +278 -0
- rx_rover_PPO.ipynb +0 -0
- rx_rover_Q_Learning.ipynb +0 -0
.gitattributes
CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+images/all_algos_testing.png filter=lfs diff=lfs merge=lfs -text
+images/all_algos.png filter=lfs diff=lfs merge=lfs -text
+Report.pdf filter=lfs diff=lfs merge=lfs -text
Enhanced Environemnt Simulation.gif
ADDED
README.md
CHANGED
@@ -1,3 +1,230 @@
# RxRovers: Roaming for Rapid Relief

<span style="color: blue;">Dynamic Obstacles and Path Optimization</span>


## Project Overview

"RxRovers: Roaming for Rapid Relief" aims to integrate reinforcement learning (RL) into the healthcare domain, focusing on optimizing medical supply delivery within hospitals. The project deploys autonomous agents, RxRovers, that navigate hospital corridors efficiently and avoid both dynamic and static obstacles so that medicines are distributed on time. Faster, more reliable delivery can in turn improve patient care and outcomes.

## Team Members

- Charvi Kusuma [GitHub](https://github.com/kcharvi)
- Tarun Reddi [GitHub](https://github.com/REDDITARUN)

### Real-World Application

The RxRovers project addresses a real-world healthcare problem, the efficient delivery of medical supplies within hospital environments, by simulating that delivery process and applying RL to improve it.

### Complex Navigation Challenges

The project's challenges mirror real-world navigation complexities in hospital settings:

- **Dynamic and Static Obstacles**: Represent the unpredictable conditions of a hospital environment.
- **Path Planning**: Optimize routes to ensure timely and efficient delivery of medical supplies.

### Adaptable Environment

The project's environment can be modified to reflect various real-world hospital layouts, demonstrating its adaptability to different healthcare settings.

## Objectives

1. Develop RL agents capable of autonomously navigating hospital environments while delivering medical supplies.
2. Optimize path planning strategies to ensure timely and efficient delivery of medicines while avoiding obstacles such as equipment, humans, and environmental constraints.
3. Enhance the visual representation and user experience of the simulated hospital environment to improve engagement and realism.
4. Conduct a comparative analysis of various reinforcement learning algorithms to identify the best approach for optimizing medical supply delivery within hospital environments.

## Environment

### Simulated Hospital Layouts

The project creates a hospital environment that mirrors real-world scenarios. The environment includes:

- **Hospital Corridors**: Multiple corridors, rooms, and operation desks.
- **Obstacles**: Dynamic obstacles (humans, other rovers) and static obstacles (rooms, walls).
- **RxRovers**: Autonomous agents that navigate these environments, avoid obstacles, and reach designated destinations.

### Initial Stage

- **Grid**: A 9×9 grid on which the rovers navigate.
- **Setup**: Initialized the action and observation spaces and set the grid size, starting points, destinations, and parameters such as rewards and penalties for actions and events.

### Refined Version

- **Grid Size**: 15×15 grid representing the hospital environment.
- **Rovers**: Two agents represented by blue squares.
- **Targets**: Yellow target destinations for delivering medicine.
- **Actions**: Move down, up, left, right, or stay still (5 possible actions).
- **Operation Desks**: Dark green squares representing starting points.
- **Rooms**: Black squares indicating static obstacles.
- **Human**: Purple square representing a moving human obstacle.
- **Observation Space**: Positions of the two Rovers and the human (six grid coordinates); the rooms, operation desks, and grid boundary are fixed and enforced by the environment.
- **Rewards**:
  - +30 for moving closer to the target.
  - +100 for reaching the destination.
  - -15 for collisions or moving out of grid bounds.
  - -5 for waiting near the human obstacle.
  - -20 for any move, including staying still, that does not bring the rover closer to its target.
- **Termination**: An episode ends when both Rovers reach their targets or the maximum number of time steps (20) is reached.

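The reward rules above can be condensed into a single per-rover function. This is a minimal sketch: `rover_reward` is a hypothetical helper, not part of the repository. It assumes Euclidean distance to the target (as `np.linalg.norm` in `rx_rover_Enhanced_Environment.ipynb`) and that a rover next to the human waits instead of moving. It uses plain Python so it runs without NumPy or Gym.

```python
import math

GRID_SIZE = 15
# Reward values from the list above.
CLOSER, FARTHER, REACHED, BLOCKED, NEAR_HUMAN = 30, -20, 100, -15, -5


def rover_reward(pos, move, target, occupied, human_adjacent):
    """Return (reward, new_position) for one rover and one action.

    Hypothetical helper: pos, move, and target are (x, y) tuples; occupied is a
    set of cells the rover may not enter (rooms, desks, the other rover, the human).
    """
    if human_adjacent:
        # Next to the human the rover waits in place and pays a small penalty.
        return NEAR_HUMAN, pos
    new = (pos[0] + move[0], pos[1] + move[1])
    if not (0 <= new[0] < GRID_SIZE and 0 <= new[1] < GRID_SIZE) or new in occupied:
        # Out-of-bounds moves and collisions are rejected; the rover stays put.
        return BLOCKED, pos
    # Any move that does not reduce the distance (including waiting) is penalised.
    reward = CLOSER if math.dist(new, target) < math.dist(pos, target) else FARTHER
    if new == target:
        reward += REACHED
    return reward, new


print(rover_reward((6, 4), (0, 1), (5, 10), set(), False))  # (30, (6, 5))
print(rover_reward((5, 9), (0, 1), (5, 10), set(), False))  # (130, (5, 10))
```

The notebook applies these rules to each rover in turn and collects the results in a length-2 reward array, so reaching a target in one move yields 30 + 100 = 130 for that rover.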

## Algorithms

### Q-Learning (QL)

Implemented Q-Learning for a rover agent navigating the grid environment.

- **Settings**:
  - State Representation: Positions of the two rovers and the human.
  - Action Space: Move up, down, left, right, or stay in place.
  - Rewards: Based on interactions (moving closer to the target, collisions, reaching the target).
  - Done Flag: The episode terminates when both rovers reach their targets or the maximum number of time steps is reached.
- **Hyperparameters**:
  - Alpha (Learning Rate): 1e-4
  - Gamma (Discount Factor): 0.9
  - Epsilon (Exploration Rate): 0.5
  - Epsilon Decay: 0.995
  - Epsilon Minimum: 0.01
- **Training Phase**:
  - Total Rewards per Episode
  - Epsilon Decay Curve
- **Evaluation Phase**:
  - Evaluation Rewards per Episode

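A minimal tabular sketch of the Q-learning update with the hyperparameters listed above. It assumes a dictionary-backed Q-table keyed by the observed state tuple (both rover positions and the human position) and a multiplicative epsilon decay applied once per episode; `select_action`, `q_update`, and `decay_epsilon` are illustrative names, not functions from `rx_rover_Q_Learning.ipynb`.

```python
import random
from collections import defaultdict

N_ACTIONS = 5                      # down, up, left, right, wait
ALPHA, GAMMA = 1e-4, 0.9           # learning rate and discount factor from above
EPS_START, EPS_DECAY, EPS_MIN = 0.5, 0.995, 0.01

# Lazily allocated Q-table: unseen states start with all-zero action values.
Q = defaultdict(lambda: [0.0] * N_ACTIONS)


def select_action(state, epsilon):
    """Epsilon-greedy: random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(N_ACTIONS)
    values = Q[state]
    return values.index(max(values))


def q_update(state, action, reward, next_state, done):
    """One-step Q-learning: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = 0.0 if done else max(Q[next_state])
    td_error = reward + GAMMA * best_next - Q[state][action]
    Q[state][action] += ALPHA * td_error


def decay_epsilon(epsilon):
    """Multiplicative per-episode decay, floored at EPS_MIN."""
    return max(EPS_MIN, epsilon * EPS_DECAY)


# Worked example: rover 0 moves up (action 1) from its start cell and earns +30.
s = (6, 4, 10, 4, 7, 8)            # (rover0 x, y, rover1 x, y, human x, y)
s_next = (6, 5, 10, 4, 7, 9)
q_update(s, 1, 30.0, s_next, done=False)
print(Q[s][1])                     # approximately 0.003 (= 1e-4 * 30)
```

With alpha = 1e-4, each update moves a Q-value only 0.01% of the way toward its target, so a learning rate this small generally needs many visits per state-action pair before the greedy policy changes.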
### Double Q-Learning (DQL)

Implemented Double Q-Learning to mitigate overestimation bias.

- **Settings**: Same as Q-Learning.
- **Hyperparameters**: Same as Q-Learning.
- **Training Phase**:
  - Total Rewards per Episode
- **Evaluation Phase**:
  - Evaluation Rewards

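Double Q-learning keeps two Q-tables. On each step one table, chosen at random, is updated using its own greedy action for the next state but the other table's estimate of that action's value; this decoupling is what reduces the overestimation bias. The sketch below is illustrative only and reuses the Q-learning hyperparameters; the names `QA`, `QB`, and `double_q_update` are assumptions, not taken from `rx_rover_Double_Q_Learning.ipynb`.

```python
import random
from collections import defaultdict

N_ACTIONS = 5
ALPHA, GAMMA = 1e-4, 0.9           # same hyperparameters as Q-learning

QA = defaultdict(lambda: [0.0] * N_ACTIONS)
QB = defaultdict(lambda: [0.0] * N_ACTIONS)


def argmax(values):
    return values.index(max(values))


def double_q_update(s, a, r, s_next, done, update_a=None):
    """Update one of the two tables; update_a=None picks it with a fair coin flip."""
    if update_a is None:
        update_a = random.random() < 0.5
    learner, evaluator = (QA, QB) if update_a else (QB, QA)
    if done:
        target = r
    else:
        a_star = argmax(learner[s_next])                # selection: learner's greedy action
        target = r + GAMMA * evaluator[s_next][a_star]  # evaluation: the other table
    learner[s][a] += ALPHA * (target - learner[s][a])


def greedy_action(s):
    """Behave greedily with respect to the sum of both tables."""
    return argmax([qa + qb for qa, qb in zip(QA[s], QB[s])])


# Worked example with hand-set values for the next state "t".
QA["t"] = [0.0, 5.0, 0.0, 0.0, 0.0]
QB["t"] = [0.0, 2.0, 9.0, 0.0, 0.0]
double_q_update("s", 0, 1.0, "t", done=False, update_a=True)
# QA picks action 1 in "t", but QB values it at 2.0, so the target is 1.0 + 0.9 * 2.0.
print(QA["s"][0], greedy_action("t"))
```

Plain Q-learning on `QA` alone would have bootstrapped from 5.0, the same table's maximum; using the second table to score the chosen action avoids that self-reinforcing upward bias.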
### Deep Q Network (DQN)

Used a neural network to approximate the Q-values.

- **Settings**:
  - Neural Network Architecture: Two hidden layers with ReLU activation.
  - Replay Memory: Stores past experiences for experience replay.
  - Select Action Function: Epsilon-greedy strategy.
  - Optimize Model Function: Gradient descent on the Q-network.
- **Hyperparameters**:
  - Number of Episodes: 1000
  - Target Network Update Frequency: 10 episodes
  - Batch Size: 256
  - Discount Factor: 0.9
  - Learning Rate: 0.001
  - Epsilon Initial Value: 1
  - Epsilon Final Value: 0.05
  - Epsilon Decay: 10000
  - Maximum Time Steps per Episode: 30
- **Training Phase**:
  - Total Rewards per Episode
  - Epsilon Decay Curve
- **Evaluation Phase**:
  - Evaluation Rewards

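Two pieces of the DQN settings can be sketched without a deep-learning framework: the epsilon schedule and the replay memory. The sketch assumes that "Epsilon Decay: 10000" is the time constant, in environment steps, of an exponential schedule (the convention used in the PyTorch DQN tutorial); the `Transition` fields and the memory capacity are also assumptions rather than values from `rx_rover_DQN.ipynb`.

```python
import math
import random
from collections import deque, namedtuple

EPS_START, EPS_END, EPS_DECAY = 1.0, 0.05, 10_000  # values listed above
BATCH_SIZE = 256

Transition = namedtuple("Transition", ("state", "action", "next_state", "reward", "done"))


def epsilon_at(steps_done):
    """Exponential decay from EPS_START toward EPS_END with time constant EPS_DECAY steps."""
    return EPS_END + (EPS_START - EPS_END) * math.exp(-steps_done / EPS_DECAY)


class ReplayMemory:
    """Fixed-capacity FIFO buffer; the oldest transitions are dropped when it is full."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    def push(self, *args):
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        # Uniform sampling without replacement decorrelates consecutive steps.
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)


memory = ReplayMemory(capacity=10_000)
for t in range(300):                       # dummy transitions for illustration
    memory.push(t, t % 5, t + 1, 0.0, False)

# Only optimise once a full batch is available.
batch = memory.sample(BATCH_SIZE) if len(memory) >= BATCH_SIZE else []
print(len(batch), round(epsilon_at(30_000), 3))  # 256 0.097
```

Under this assumption, 1000 episodes of at most 30 steps (30,000 steps in total) would still leave epsilon near 0.1 at the end of training, so the decay constant shapes exploration as much as the final value does.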
### Double Deep Q Network (DDQN)

Addresses overestimation bias by decoupling action selection (by the online network) from action evaluation (by the target network).

- **Settings**:
  - Optimize Model Function: Gradient descent with gradient clipping.
- **Hyperparameters**:
  - Batch Size: 256
  - Gamma: 0.9
  - Learning Rate: 0.001
  - Epsilon Start: 1
  - Epsilon End: 0.05
  - Epsilon Decay: 10,000
  - Number of Episodes: 1000
  - Target Update: Every 10 episodes
- **Training Phase**:
  - Total Rewards per Episode
  - Epsilon Decay Curve
- **Evaluation Phase**:
  - Evaluation Rewards

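The difference between the DQN and DDQN targets fits in a few lines. In this framework-free sketch, per-sample lists of Q-values stand in for the online (policy) and target network outputs on the next states; the function names are hypothetical. In PyTorch, gradient clipping is usually done with `torch.nn.utils.clip_grad_norm_` or `torch.nn.utils.clip_grad_value_`; which one `rx_rover_DDQN.ipynb` uses is not stated here.

```python
GAMMA = 0.9  # discount factor listed above


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def dqn_targets(rewards, dones, q_target_next, gamma=GAMMA):
    """Vanilla DQN: the target network both selects and evaluates the next action."""
    return [r if d else r + gamma * max(q) for r, d, q in zip(rewards, dones, q_target_next)]


def ddqn_targets(rewards, dones, q_online_next, q_target_next, gamma=GAMMA):
    """Double DQN: the online network selects a*, the target network evaluates it."""
    targets = []
    for r, d, q_on, q_tg in zip(rewards, dones, q_online_next, q_target_next):
        if d:
            targets.append(r)  # terminal transition: no bootstrapping
        else:
            targets.append(r + gamma * q_tg[argmax(q_on)])
    return targets


rewards, dones = [1.0, 0.0], [False, True]
q_online_next = [[1.0, 3.0, 2.0], [0.0, 0.0, 0.0]]
q_target_next = [[5.0, 0.5, 4.0], [9.0, 9.0, 9.0]]
print(dqn_targets(rewards, dones, q_target_next))                  # approx [5.5, 0.0]
print(ddqn_targets(rewards, dones, q_online_next, q_target_next))  # approx [1.45, 0.0]
```

Here DQN bootstraps from 5.0, the target network's own maximum, while DDQN bootstraps from 0.5 because the online network prefers a different action. When the two networks disagree, the DDQN target is less prone to the upward bias of taking a max over noisy estimates.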
### Proximal Policy Optimization (PPO)

A robust and efficient policy-gradient algorithm developed by OpenAI.

- **Settings**:
  - Neural Network Architecture: Actor and critic heads.
  - Training Loop: States, actions, rewards, values, and log-probabilities are collected in a rollout buffer.
  - Loss Functions: Policy and value losses.
- **Hyperparameters**:
  - Total Timesteps: 50,000
  - Gamma: 0.99
  - Lambda: 0.95
  - Epsilon: 0.2
  - Epochs: 3
  - Batch Size: 64
  - Learning Rate: 0.001
- **Training Phase**:
  - Total Rewards per Episode
- **Evaluation Phase**:
  - Evaluation Rewards

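The sketch below illustrates two computations commonly used in PPO: Generalized Advantage Estimation (GAE) over a rollout and the clipped surrogate policy loss. It assumes that Lambda (0.95) is the GAE parameter and Epsilon (0.2) is the clipping range, which is the usual meaning of these hyperparameters in PPO but is not confirmed above; the function names are illustrative and not taken from `rx_rover_PPO.ipynb`.

```python
import math

GAMMA, LAM, CLIP_EPS = 0.99, 0.95, 0.2  # Gamma, Lambda, Epsilon from the list above


def compute_gae(rewards, values, dones, last_value, gamma=GAMMA, lam=LAM):
    """Generalized Advantage Estimation over one rollout.

    values[t] is V(s_t); last_value is V(s_T), used to bootstrap the final step.
    Returns (advantages, returns) with returns[t] = advantages[t] + values[t].
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        mask = 0.0 if dones[t] else 1.0          # stop bootstrapping at episode ends
        delta = rewards[t] + gamma * next_value * mask - values[t]
        running = delta + gamma * lam * mask * running
        advantages[t] = running
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


def clipped_policy_loss(new_logp, old_logp, advantages, eps=CLIP_EPS):
    """PPO-clip loss to minimise: -mean(min(ratio * A, clip(ratio, 1-eps, 1+eps) * A))."""
    total = 0.0
    for nl, ol, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(nl - ol)                # pi_new(a|s) / pi_old(a|s)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)


adv, ret = compute_gae([0.0, 1.0], [0.0, 0.0], [False, True], last_value=0.0)
print(adv, ret)  # advantages approx [0.9405, 1.0]
# Doubling an action's probability with a positive advantage earns at most 1 + eps.
print(clipped_policy_loss([math.log(2.0)], [0.0], [1.0]))  # approx -1.2
```

Because of the `min`, the clip only limits how much the objective can gain from moving the policy; for a negative advantage the unclipped, more pessimistic term is kept.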
### Advantage Actor-Critic (A2C)

Combines a policy-based method (the actor) with a value-based method (the critic).

- **Settings**:
  - Actor Network: Learns a policy π(a|s).
  - Critic Network: Learns the value function V(s).
  - Advantage: Measures how much better the chosen action turned out than the critic's estimate V(s).
  - Policy and Value Function Updates: Gradient descent.
- **Training Phase**:
  - Total Rewards per Episode
- **Evaluation Phase**:
  - Evaluation Rewards

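A one-step advantage actor-critic update for a single transition can be sketched as follows. No A2C hyperparameters are listed, so the discount factor (0.99) and entropy coefficient (0.01) are assumptions, and `a2c_losses` is an illustrative name rather than a function from `rx_rover_A2C.ipynb`. The entropy term corresponds to the entropy bonus mentioned in the comparison table.

```python
import math

GAMMA = 0.99          # assumption: no A2C discount factor is listed
ENTROPY_COEF = 0.01   # assumption: weight of the entropy bonus


def softmax(logits):
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def a2c_losses(logits, action, reward, value, next_value, done,
               gamma=GAMMA, entropy_coef=ENTROPY_COEF):
    """Return (actor_loss, critic_loss, advantage) for one transition.

    advantage   = r + gamma * V(s') * (1 - done) - V(s)
    actor_loss  = -log pi(a|s) * advantage - entropy_coef * H(pi(.|s))
    critic_loss = advantage ** 2
    In an autograd framework the advantage is detached in the actor loss so that
    only the critic learns from the value error.
    """
    probs = softmax(logits)
    target = reward if done else reward + gamma * next_value
    advantage = target - value
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    actor_loss = -math.log(probs[action]) * advantage - entropy_coef * entropy
    critic_loss = advantage ** 2
    return actor_loss, critic_loss, advantage


# Uniform policy over the 5 actions; the rover moved closer to its target (+30).
actor, critic, adv = a2c_losses([0.0] * 5, action=1, reward=30.0,
                                value=10.0, next_value=20.0, done=False)
print(round(adv, 2), round(critic, 2))  # 39.8 1584.04
```

A positive advantage lowers the actor loss when the chosen action's probability rises, while the squared advantage pulls V(s) toward the one-step target.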
**Total Rewards Plot during Training:**


**Evaluation over 10 Time Steps:**


## Comparison

Comparison of the reinforcement learning algorithms and their performance:

| Aspect | Q-Learning | Double Q-Learning | DQN | DDQN | PPO | A2C |
| ------------------------- | ----------------------- | ----------------------- | -------------------- | ------------------------------ | ----------------------------- | ----------------------------- |
| Algorithm Type | Model-free, Value-based | Model-free, Value-based | Value-based, Deep NN | Value-based, Deep NN | Actor-Critic, Policy Gradient | Actor-Critic, Policy Gradient |
| Exploration-Exploitation | Epsilon-greedy | Epsilon-greedy | Epsilon-greedy | Epsilon-greedy | Stochastic policy | Stochastic policy |
| Stability & Learning Rate | No target network | No target network | Target network | Double updates, target network | Adaptive, Gradient Clipping | Adaptive, Entropy Bonus |
| Convergence | Slow | Moderate | Fast | Faster | Fast | Moderate |
| Memory Requirements | Low | Low | High | High | Moderate | Moderate |
| Adaptability | Moderate | Moderate | Low to Moderate | Low to Moderate | High | Moderate |
| Generalization | Low to Moderate | Low to Moderate | Low | Low | High | Moderate |

## Challenges Addressed

1. **Navigating a Dynamic World**: Rovers learned to avoid the moving human obstacle, driven by the collision and proximity penalties in the reward function.
2. **Custom Paths for Diverse Layouts**: The environment can be adapted to various hospital settings.
3. **Learning Optimal Path**: Rovers were trained to take the shortest route to delivery rooms while avoiding obstacles.

## Bonus: Real-World Application

1. **Healthcare System Integration**: Enhances the delivery system within hospital settings, benefiting patient care and outcomes.
2. **Simulating Hospital Layouts**: Reflects real-world scenarios with hospital corridors, obstacles, and RxRovers.
3. **Complex Navigation Challenges**: Optimizes path planning and avoids dynamic and static obstacles.
4. **Adaptable Environment**: Can be modified for different healthcare settings.

## References

1. [Deep Reinforcement Learning DQN for Multi-Agent Environment](https://medium.com/yellowme/deep-reinforcement-learning-dqn-for-multi-agent-environment-5f4fae1a9ff5)
2. [Warehouse Robot Path Planning](https://github.com/LyapunovJingci/Warehouse_Robot_Path_Planning), for path planning and optimization with obstacle avoidance.
3. [Reinforcement Learning (DQN) Tutorial | PyTorch Tutorials](https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html)
4. [Deep Q Learning (DQN) using PyTorch](https://medium.com/@vignesh.g1609/deep-q-learning-dqn-using-pytorch-a31f02a910ac)

---
Report.pdf
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a28e62aa4cfc411d65a8b745759d810d14b0674110e044bfbd672f0f991f26ea
+size 3280745
images/all_algos.png
ADDED
images/all_algos_testing.png
ADDED
images/desk.png
ADDED
images/enhanced-environment.png
ADDED
images/human.png
ADDED
images/room.png
ADDED
images/rover_desk_collision.png
ADDED
images/rover_dest.png
ADDED
images/rover_human_collision.png
ADDED
images/rover_moving.png
ADDED
images/rover_room_collision - Copy.png
ADDED
images/rover_rover_collision.png
ADDED
images/rover_start.png
ADDED
images/target.png
ADDED
images/trained_agents.gif
ADDED
rx_rover_A2C.ipynb
ADDED
The diff for this file is too large to render.
rx_rover_DDQN.ipynb
ADDED
The diff for this file is too large to render.
rx_rover_DQN.ipynb
ADDED
The diff for this file is too large to render.
rx_rover_Double_Q_Learning.ipynb
ADDED
The diff for this file is too large to render.
rx_rover_Enhanced_Environment.ipynb
ADDED
@@ -0,0 +1,278 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Environment - Enhanced"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import gym\n",
    "from gym import spaces\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "import os\n",
    "import random\n",
    "import imageio\n",
    "from tqdm import tqdm\n",
    "from itertools import count\n",
    "from collections import namedtuple, deque\n",
    "\n",
    "import torch\n",
    "import torch.nn as nn\n",
    "import torch.optim as optim\n",
    "import matplotlib.image as mpimg"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "class RoverGridEnv(gym.Env):\n",
    "    metadata={'render.modes': ['human']}\n",
    "    def __init__(self, max_ts=20):\n",
    "        super(RoverGridEnv,self).__init__()\n",
    "        self.max_ts=max_ts # Maximum time steps per episode (default 20).\n",
    "        self.grid_size=(15,15)\n",
    "        self.action_space=spaces.Discrete(5)\n",
    "        self.observation_space=spaces.MultiDiscrete([15,15,15,15,15,15])\n",
    "        self.start_positions=np.array([[6,4],[10,4]]) # rover start cells, next to the desks\n",
    "        self.rover_positions=self.start_positions.copy()\n",
    "        self.operation_desks=np.array([[6,3],[10,3]])\n",
    "        self.rooms=np.array([[4,7],[4,10],[4,13],[8,7],[8,10],[8,13],[12,7],[12,10],[12,13]])\n",
    "        self.human_position=np.array([8,9])\n",
    "        self.targets=np.array([[5,10],[9,13]])\n",
    "        self.actions=[(0,-1),(0,1),(-1,0),(1,0),(0,0)] # Down,Up,Left,Right,Wait\n",
    "        self.rover_done=[False,False]\n",
    "        self.reset()\n",
    "\n",
    "    def seed(self,seed=None):\n",
    "        np.random.seed(seed)\n",
    "        random.seed(seed)\n",
    "\n",
    "    def reset(self):\n",
    "        self.current_step=0\n",
    "        self.rover_positions=self.start_positions.copy() # copy: step() mutates positions in place\n",
    "        self.rover_done=[False,False]\n",
    "        self.human_position=np.array([7,8])\n",
    "        return self._get_obs()\n",
    "\n",
    "    def _get_obs(self):\n",
    "        return np.concatenate((self.rover_positions.flatten(),self.human_position))\n",
    "\n",
    "    def step(self,actions):\n",
    "        rewards=np.zeros(2)\n",
    "        done=[False,False]\n",
    "        info={'message': ''}\n",
    "        for i,action in enumerate(actions):\n",
    "            if self.rover_done[i]:\n",
    "                done[i]=True\n",
    "                continue\n",
    "            prev_distance=np.linalg.norm(self.targets[i]-self.rover_positions[i])\n",
    "            if self._is_human_adjacent(self.rover_positions[i]):\n",
    "                rewards[i] -= 5 # wait next to the human\n",
    "            else:\n",
    "                delta=np.array(self.actions[action])\n",
    "                new_position=self.rover_positions[i]+delta\n",
    "                if self._out_of_bounds(new_position):\n",
    "                    rewards[i] -= 15\n",
    "                    continue\n",
    "                if self._collision(new_position,i):\n",
    "                    rewards[i] -= 15\n",
    "                    continue\n",
    "                self.rover_positions[i]=new_position\n",
    "                new_distance=np.linalg.norm(self.targets[i]-new_position)\n",
    "                if new_distance < prev_distance:\n",
    "                    rewards[i]+=30\n",
    "                else:\n",
    "                    rewards[i] -= 20\n",
    "                if np.array_equal(new_position,self.targets[i]):\n",
    "                    rewards[i]+=100\n",
    "                    self.rover_done[i]=True\n",
    "                    done[i]=True\n",
    "\n",
    "        # move human randomly\n",
    "        self._move_human()\n",
    "        self.current_step+=1\n",
    "        all_done=all(done) or self.current_step >= self.max_ts\n",
    "        if all_done and not all(done): # if the maximum number of steps is reached but not all targets were reached\n",
    "            info['message']='Maximum number of time steps reached'\n",
    "        return self._get_obs(),rewards,all_done,info\n",
    "\n",
    "    def _is_human_adjacent(self,position):\n",
    "        for delta in [(1,1),(1,-1),(-1,1),(-1,-1)]: # diagonal neighbours only\n",
    "            adjacent_position=position+np.array(delta)\n",
    "            if np.array_equal(adjacent_position,self.human_position):\n",
    "                return True\n",
    "        return False\n",
    "\n",
    "    def _out_of_bounds(self,position):\n",
    "        return not (0 <= position[0] < self.grid_size[0] and 0 <= position[1] < self.grid_size[1])\n",
    "\n",
    "    def _collision(self,new_position,rover_index):\n",
    "        if any(np.array_equal(new_position,pos) for pos in np.delete(self.rover_positions,rover_index,axis=0)):\n",
    "            return True # Collision with the other rover\n",
    "        if any(np.array_equal(new_position,pos) for pos in self.rooms):\n",
    "            return True # Collision with a room\n",
    "        if any(np.array_equal(new_position,pos) for pos in self.operation_desks):\n",
    "            return True # Collision with an operation desk\n",
    "        if np.array_equal(new_position,self.human_position):\n",
    "            return True # Collision with the human\n",
    "        return False\n",
    "\n",
    "    def _move_human(self):\n",
    "        valid_moves=[move for move in self.actions if not self._out_of_bounds(self.human_position+np.array(move))]\n",
    "        self.human_position+=np.array(valid_moves[np.random.choice(len(valid_moves))])\n",
    "\n",
    "    # Earlier square-based renderer, kept for reference (needs `from matplotlib.patches import Rectangle`).\n",
    "    # def render(self,mode='human',save_path=None):\n",
    "    #     fig,ax=plt.subplots(figsize=(7,7))\n",
    "    #     ax.set_xlim(0,self.grid_size[0])\n",
    "    #     ax.set_ylim(0,self.grid_size[1])\n",
    "    #     ax.set_xticks(np.arange(0,15,1))\n",
    "    #     ax.set_yticks(np.arange(0,15,1))\n",
    "    #     ax.grid(which='both')\n",
    "\n",
    "    #     # draw elements\n",
    "    #     for pos in self.rover_positions:\n",
    "    #         ax.add_patch(Rectangle((pos[0]-0.5,pos[1]-0.5),1,1,color='blue'))\n",
    "    #     for pos in self.operation_desks:\n",
    "    #         ax.add_patch(Rectangle((pos[0]-0.5,pos[1]-0.5),1,1,color='darkgreen'))\n",
    "    #     for pos in self.rooms:\n",
    "    #         ax.add_patch(Rectangle((pos[0]-0.5,pos[1]-0.5),1,1,color='black'))\n",
    "    #     ax.add_patch(Rectangle((self.human_position[0]-0.5,self.human_position[1]-0.5),1,1,color='purple'))\n",
    "    #     for pos in self.targets:\n",
    "    #         ax.add_patch(Rectangle((pos[0]-0.5,pos[1]-0.5),1,1,color='yellow',alpha=0.5))\n",
    "\n",
    "    #     if save_path is not None:\n",
    "    #         plt.savefig(save_path)\n",
    "    #         plt.close()\n",
    "\n",
    "    # def close(self):\n",
    "    #     plt.close()\n",
    "\n",
    "    def render(self, mode='human', save_path=None):\n",
    "        fig, ax=plt.subplots(figsize=(7,7))\n",
    "        ax.set_xlim(0, self.grid_size[0])\n",
    "        ax.set_ylim(0, self.grid_size[1])\n",
    "\n",
    "        rover_start_img_path='images/rover_moving.png'\n",
    "        rover_moving_img_path='images/rover_moving.png'\n",
    "        rover_dest_img_path='images/rover_dest.png'\n",
    "        rover_human_collision_path='images/rover_human_collision.png'\n",
    "        rover_room_collision_path='images/rover_room_collision.png' # committed as 'rover_room_collision - Copy.png'\n",
    "        rover_desk_collision_path='images/rover_desk_collision.png'\n",
    "        rover_rover_collision_path='images/rover_rover_collision.png'\n",
    "        desk_img_path='images/desk.png'\n",
    "        room_img_path='images/room.png'\n",
    "        human_img_path='images/human.png'\n",
    "        target_img_path='images/target.png'\n",
    "\n",
    "        for i, pos in enumerate(self.rover_positions):\n",
    "            if self.rover_done[i]:\n",
    "                rover_img=mpimg.imread(rover_dest_img_path)\n",
    "            elif np.array_equal(pos, self.start_positions[i]): # still on its start cell\n",
    "                rover_img=mpimg.imread(rover_start_img_path)\n",
    "            elif self._is_human_adjacent(pos):\n",
    "                rover_img=mpimg.imread(rover_human_collision_path)\n",
    "            elif self._collision(pos, i):\n",
    "                # distinct loop names so that `pos` and `i` are not shadowed\n",
    "                if any(np.array_equal(pos, room) for room in self.rooms):\n",
    "                    rover_img=mpimg.imread(rover_room_collision_path)\n",
    "                elif any(np.array_equal(pos, desk) for desk in self.operation_desks):\n",
    "                    rover_img=mpimg.imread(rover_desk_collision_path)\n",
    "                elif any(np.array_equal(pos, other) for j, other in enumerate(self.rover_positions) if j != i):\n",
    "                    rover_img=mpimg.imread(rover_rover_collision_path)\n",
    "                else: # remaining case: the human stepped onto the rover's cell\n",
    "                    rover_img=mpimg.imread(rover_human_collision_path)\n",
    "            else:\n",
    "                rover_img=mpimg.imread(rover_moving_img_path)\n",
    "            ax.imshow(rover_img, extent=(pos[0]-0.5, pos[0]+0.5, pos[1]-0.5, pos[1]+0.5))\n",
    "\n",
    "        for pos in self.operation_desks:\n",
    "            desk_img=mpimg.imread(desk_img_path)\n",
    "            ax.imshow(desk_img, extent=(pos[0]-0.5, pos[0]+0.5, pos[1]-0.5, pos[1]+0.5))\n",
    "\n",
    "        for pos in self.rooms:\n",
    "            room_img=mpimg.imread(room_img_path)\n",
    "            ax.imshow(room_img, extent=(pos[0]-0.5, pos[0]+0.5, pos[1]-0.5, pos[1]+0.5))\n",
    "\n",
    "        human_img=mpimg.imread(human_img_path)\n",
    "        ax.imshow(human_img, extent=(self.human_position[0]-0.5, self.human_position[0]+0.5, self.human_position[1]-0.5, self.human_position[1]+0.5))\n",
    "\n",
    "        for pos in self.targets:\n",
    "            target_img=mpimg.imread(target_img_path)\n",
    "            ax.imshow(target_img, extent=(pos[0]-0.5, pos[0]+0.5, pos[1]-0.5, pos[1]+0.5))\n",
    "\n",
    "        if save_path is not None:\n",
    "            plt.savefig(save_path)\n",
    "            plt.close()\n",
    "\n",
    "    def close(self):\n",
    "        plt.close()\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Initial Setup\n"
     ]
    },
    {
     "data": {
      "image/png": "",
      "text/plain": [
       "<Figure size 700x700 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "env=RoverGridEnv()\n",
    "print(\"Initial Setup\")\n",
    "observation=env.reset()\n",
    "env.render()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
rx_rover_PPO.ipynb
ADDED
The diff for this file is too large to render.
rx_rover_Q_Learning.ipynb
ADDED
The diff for this file is too large to render.