This paper explores the deployment and challenges of Reinforcement Learning (RL) and Deep Reinforcement Learning (DRL) for precision angle seeking in robotic control, across both simulated and physical environments. We introduce the Angular Positioning Seeker (APS) environment, built on Raspberry Pi 4B+ platforms, to rigorously evaluate RL algorithms under conditions that closely mimic real-world deployment. This benchmark highlights the subtleties that distinguish RL on physical hardware from its simulated proxies, fostering more nuanced testing protocols and further advances in robotic intelligence.
In the methodology, we developed the Angular Positioning Seeker (APS) environment with OpenAI Gym, tailoring it to angular positioning tasks and implementing step-function logic that prioritizes angle-seeking behavior. We combined a Raspberry Pi 4B+ single-board computer with an STM32F103ZET6 microcontroller to construct the physical apparatus, ensuring a robust evaluation of RL algorithms. The DQN algorithm and its variants, Double DQN and Dueling DQN, were applied in both simulated and physical settings. System performance was measured with reward functions designed to minimize deviation from the target angle, and detailed pseudocode is provided for reproducibility; a sketch of the environment structure appears below.
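To make the environment design concrete, the following is a minimal sketch of how such a Gym environment with angle-seeking step logic could look. The class name `APSEnv`, the action increments, and all constants are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np
import gym
from gym import spaces

class APSEnv(gym.Env):
    """Illustrative Angular Positioning Seeker environment (a hypothetical
    re-implementation; names and constants are assumptions)."""

    def __init__(self, target_angle=0.0, max_steps=200):
        super().__init__()
        self.target_angle = target_angle              # desired angle (degrees)
        self.max_steps = max_steps
        # Discrete angular increments the actuator can apply (degrees)
        self.increments = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
        self.action_space = spaces.Discrete(len(self.increments))
        # Observation: current angle and its deviation from the target
        self.observation_space = spaces.Box(-180.0, 180.0, shape=(2,), dtype=np.float32)

    def reset(self):
        self.angle = np.random.uniform(-90.0, 90.0)
        self.steps = 0
        return self._obs()

    def step(self, action):
        # Apply the chosen angular increment and keep the angle in range
        self.angle = np.clip(self.angle + self.increments[action], -180.0, 180.0)
        self.steps += 1
        deviation = abs(self.angle - self.target_angle)
        # Reward prioritizes angle seeking: smaller deviation, higher reward
        reward = -deviation
        done = deviation < 0.5 or self.steps >= self.max_steps
        return self._obs(), reward, done, {}

    def _obs(self):
        return np.array([self.angle, self.angle - self.target_angle], dtype=np.float32)
```

The negative-deviation reward is one simple choice consistent with the stated goal of minimizing deviation from the target angle; shaped or sparse variants would slot into the same `step` structure.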
Differential Performance of Reinforcement Learning Algorithms in Real and Simulated Environments. (1) DQN converges to a reward of around 200 in the real environment, albeit with wider fluctuations, and exhibits a transient drop to -700 in simulation at episode 65. (2) Double DQN achieves a stable learning curve in simulation, while its real-world performance reaches higher reward peaks followed by larger oscillations. (3) Dueling DQN rapidly attains high convergence in simulation, with real-world trials displaying more pronounced and frequent reward fluctuations.
Trajectories of Q-value Convergence Across Environments for Reinforcement Learning Algorithms. (a) \& (d) DQN converges swiftly in simulation, with Q-value oscillations reaching 150, whereas real-world convergence is gradual, with values settling around -30, indicating adaptability. (b) \& (e) Double DQN shows reduced oscillations in simulation, with spikes shrinking to 60, but underperforms in the real setting, with Q-values converging around -50, indicating less predictability. (c) \& (f) Dueling DQN maintains the lowest variability in simulation, with spikes near 20, yet also converges around -50 in the real world, paralleling Double DQN and leaving room for improvement in adaptability.
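The contrast between DQN and Double DQN traced in these curves stems from how their temporal-difference targets are formed. As a point of reference, here is a minimal PyTorch sketch of the two target rules; the function and tensor names are assumptions for illustration, not the authors' code.

```python
import torch

def td_targets(batch, online_net, target_net, gamma=0.99, double=False):
    """Illustrative TD-target computation distinguishing DQN from Double DQN.

    `batch` is assumed to hold float tensors (states, next_states),
    a long tensor (actions), and float tensors (rewards, dones).
    """
    states, actions, rewards, next_states, dones = batch
    with torch.no_grad():
        if double:
            # Double DQN: the online net selects the action,
            # the target net evaluates it, reducing overestimation
            next_actions = online_net(next_states).argmax(dim=1, keepdim=True)
            next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        else:
            # Vanilla DQN: the target net both selects and evaluates
            next_q = target_net(next_states).max(dim=1).values
        return rewards + gamma * (1.0 - dones) * next_q
```

Decoupling action selection from evaluation is what typically damps the Q-value spikes Double DQN shows in simulation, consistent with the curves described above.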
Histogram Analysis of Action Space Exploration in Reinforcement Learning. The figure illustrates the exploration behavior of DQN, Double DQN, and Dueling DQN, validated by fitting normal distributions to the action histograms and quantified by the area of each histogram lying outside the fitted curve, reflecting each algorithm's approach to exploring the action space.
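One way such an "area outside the fitted curve" metric could be computed is sketched below: fit a normal distribution to the logged actions, then sum the histogram mass that exceeds the fitted density. The function name and binning are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from scipy.stats import norm

def exploration_outside_area(actions, bins=20):
    """Fit a normal distribution to the visited-action histogram and
    return the total probability mass lying above the fitted curve
    (an illustrative take on the exploration metric)."""
    counts, edges = np.histogram(actions, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    mu, sigma = norm.fit(actions)                 # maximum-likelihood fit
    expected = norm.pdf(centers, mu, sigma)
    widths = np.diff(edges)
    # Mass of each bin that exceeds the fitted density
    excess = np.clip(counts - expected, 0.0, None) * widths
    return excess.sum()
```

A larger returned value indicates a more heavy-tailed or multimodal action distribution, i.e., exploration that departs further from a simple Gaussian pattern.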
Task 1: Balancing process. The top half of the video shows the physical pendulum setup; the bottom half displays the reward curve.
Task 2: Training the APS on a compact Raspberry Pi to balance itself across a variety of target angles.
The heatmaps juxtapose the action-convergence profiles of DQN and Double DQN, visually depicting the precision and stability of each policy in maintaining the APS's upright state.
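Such a heatmap can be produced by tracking per-action selection frequencies over windows of training steps; the sketch below is one plausible construction, with all names and the window size being assumptions rather than the authors' plotting code.

```python
import numpy as np
import matplotlib.pyplot as plt

def action_convergence_heatmap(action_log, n_actions, window=10):
    """Plot per-action selection frequency over successive training windows,
    an illustrative way to visualize policy convergence as a heatmap."""
    n_windows = len(action_log) // window
    freq = np.zeros((n_actions, n_windows))
    for w in range(n_windows):
        chunk = action_log[w * window:(w + 1) * window]
        for a in chunk:
            freq[a, w] += 1.0 / window
    plt.imshow(freq, aspect="auto", origin="lower", cmap="viridis")
    plt.xlabel("Training window")
    plt.ylabel("Action index")
    plt.colorbar(label="Selection frequency")
    plt.show()
```

A converged policy appears as a single bright band (one action dominating), while a still-exploring policy spreads its frequency mass across several rows.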
The primary contributions of this work are as follows: