Precision Angle Seeking in Robots: A Reinforcement Learning Approach in Simulation and Reality

Abstract

This paper explores the deployment and challenges of Reinforcement Learning (RL) and Deep Reinforcement Learning (DRL) for precision angle seeking in robotic control across both simulated and physical environments. We introduce the Angular Positioning Seeker (APS) environment, leveraging Raspberry Pi 4B+ platforms to rigorously evaluate RL algorithms in scenarios that closely mimic real-world conditions. This benchmark highlights the subtleties distinguishing RL implementations in physical realms from simulated proxies, fostering advancements and more nuanced testing protocols within the domain of robotic intelligence.


Method

In our methodology, we developed the Angular Positioning Seeker (APS) environment on top of OpenAI Gym, tailored to angular positioning tasks, and implemented step-function logic that prioritizes angle-seeking behavior. We used Raspberry Pi 4B+ boards and STM32F103ZET6 microcontrollers to construct a physical apparatus for robust evaluation of RL algorithms. The DQN algorithm and its variants, including Double DQN and Dueling DQN, were applied in both simulated and physical settings. System performance was measured with reward functions designed to minimize deviation from target angles, and detailed pseudocode is provided for reproducibility.
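The step-and-reward logic described above can be sketched as follows. This is a minimal, self-contained illustration with a Gym-style reset/step interface; the class name, action encoding, step size, and termination threshold are illustrative, not the exact APS implementation:

```python
import numpy as np

class AngularPositioningSeekerEnv:
    """Illustrative sketch of an angle-seeking environment with a
    Gym-style reset/step interface. Names and constants here are
    hypothetical, not the exact APS implementation."""

    def __init__(self, target_angle=0.0, step_size=1.0, max_steps=200):
        self.target_angle = target_angle
        self.step_size = step_size
        self.max_steps = max_steps
        self.n_actions = 3                    # 0: -step, 1: hold, 2: +step

    def reset(self):
        self.angle = float(np.random.uniform(-90.0, 90.0))
        self.steps = 0
        return np.array([self.angle], dtype=np.float32)

    def step(self, action):
        self.angle += (action - 1) * self.step_size   # map {0,1,2} -> {-1,0,+1}
        self.angle = float(np.clip(self.angle, -180.0, 180.0))
        self.steps += 1
        deviation = abs(self.angle - self.target_angle)
        reward = -deviation                   # reward penalizes deviation from target
        done = deviation < 0.5 or self.steps >= self.max_steps
        return np.array([self.angle], dtype=np.float32), reward, done, {}
```

The negative-deviation reward mirrors the design goal stated above: the return is maximized exactly when the agent holds the target angle.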

Circuit Diagram with Pi4B+
Physical Connectivity Diagram


Differential Performance of Reinforcement Learning Algorithms in Real and Simulated Environments. (1) DQN converges to a reward of around 200 in real environments, despite wider fluctuations, and shows a transient drop to -700 in simulation at episode 65. (2) Double DQN achieves a stable learning curve in simulation, while its real-world performance shows higher reward peaks followed by larger oscillations. (3) Dueling DQN rapidly attains high convergence in simulation, with real-world trials displaying more pronounced and frequent reward fluctuations.
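Episode-reward curves like these are typically smoothed before plotting so that convergence trends are visible through the noise. A simple moving-average helper of the kind commonly used (illustrative, not our exact plotting code):

```python
import numpy as np

def smooth_rewards(rewards, window=10):
    """Moving average over episode rewards, as commonly used to
    visualize RL learning curves. Early episodes with fewer than
    `window` predecessors are averaged over what is available."""
    rewards = np.asarray(rewards, dtype=float)
    prefix = np.cumsum(np.insert(rewards, 0, 0.0))   # prefix sums, prefix[i] = sum of first i rewards
    out = np.empty_like(rewards)
    for i in range(len(rewards)):
        lo = max(0, i - window + 1)                  # start of the (possibly truncated) window
        out[i] = (prefix[i + 1] - prefix[lo]) / (i + 1 - lo)
    return out
```

For example, `smooth_rewards([1, 2, 3], window=2)` yields `[1.0, 1.5, 2.5]`.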

Trajectories of Q-value Convergence Across Environments for Reinforcement Learning Algorithms. (a) and (d) DQN shows swift convergence in simulations with Q-value oscillations up to 150, whereas real-world convergence is gradual with values settling around -30, demonstrating adaptability. (b) and (e) Double DQN offers reduced simulation oscillations with spikes decreasing to 60, but underperforms in real settings with Q-values converging around -50, indicating less predictability. (c) and (f) Dueling DQN maintains the lowest variability in simulations with spikes near 20 yet converges around -50 in real-world settings, paralleling Double DQN and revealing room for improvement in adaptability.

Histogram Analysis of Action Space Exploration in Reinforcement Learning. The figure demonstrates the algorithmic exploration process of DQN, Double DQN, and Dueling DQN, validated by a fitting of normal distributions and quantified by areas outside the expected curves, reflecting each algorithm's approach to exploring the action space.
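The area-outside-the-fitted-curve measure can be sketched as follows, assuming the normal distribution is fitted from the sample mean and standard deviation; the function name and binning are illustrative, not the exact analysis code:

```python
import numpy as np

def exploration_mismatch(actions, bins=20):
    """Quantify how far an empirical action histogram deviates from a
    fitted normal distribution: the total probability mass by which
    the empirical density exceeds the fitted curve across bins."""
    actions = np.asarray(actions, dtype=float)
    mu, sigma = actions.mean(), actions.std()
    counts, edges = np.histogram(actions, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    widths = np.diff(edges)
    # normal pdf evaluated at the bin centers
    fitted = np.exp(-0.5 * ((centers - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    # mass where the empirical density sits above the fitted curve
    return float(np.sum(np.clip(counts - fitted, 0.0, None) * widths))
```

A near-normal exploration pattern yields a value close to zero, while a flatter or multi-modal action distribution produces a larger excess area, matching the qualitative reading of the figure.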

Experiments

Evaluation on open-loop and closed-loop tasks


Task 1: Balancing process. The top half of the video shows the physical pendulum setup; the bottom half displays the reward curve.

Task 2: Using the compact yet powerful Raspberry Pi, the APS is trained to balance itself across a variety of angles.
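A minimal training loop in this spirit can be sketched with tabular Q-learning over discretized angle bins standing in for the DQN actually run on the Raspberry Pi; every constant below is illustrative:

```python
import numpy as np

# Hypothetical minimal training loop: tabular Q-learning on discretized
# angle bins stands in for the DQN used on the physical platform.
N_ANGLES, N_ACTIONS = 37, 3          # angles -90..90 degrees in 5-degree bins
ALPHA, GAMMA, EPS = 0.1, 0.95, 0.1   # learning rate, discount, exploration rate
TARGET_BIN = N_ANGLES // 2           # target angle: 0 degrees

rng = np.random.default_rng(0)
Q = np.zeros((N_ANGLES, N_ACTIONS))

for episode in range(500):
    s = int(rng.integers(N_ANGLES))                       # random start angle bin
    for _ in range(100):
        # epsilon-greedy action selection
        a = int(rng.integers(N_ACTIONS)) if rng.random() < EPS else int(Q[s].argmax())
        s2 = int(np.clip(s + (a - 1), 0, N_ANGLES - 1))   # move one bin down/hold/up
        r = -abs(s2 - TARGET_BIN)                         # penalize deviation from target
        Q[s, a] += ALPHA * (r + GAMMA * Q[s2].max() - Q[s, a])
        s = s2
        if s == TARGET_BIN:                               # episode ends at the target angle
            break
```

After training, the greedy policy in the bins adjacent to the target steers toward it, which is the tabular analogue of the balancing behavior shown in the video.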

State Visitation Visualization


The heatmaps juxtapose the action convergence profiles of DQN and Double DQN, visually depicting their respective policies' precision and stability in maintaining the APS's upright state.



Contribution

The primary contributions of this work encompass:

  • APS Environment Development. Created a specialized environment, Angular Positioning Seeker (APS), based on OpenAI Gym for angular positioning tasks, addressing key practical challenges.
  • Simulation-to-Reality Gap Analysis. Systematically analyzed differences in RL algorithm performance between simulated and physical environments, highlighting real-world factors like hardware noise and dynamic changes.
  • Physical Experiment Platform. Built a physical experimental platform with Raspberry Pi 4B+ and STM32F103ZET6 microcontrollers for validating RL algorithms in real-world conditions, serving as a practical benchmark.
  • Extensive Algorithm Implementation. Implemented and validated a range of RL and control algorithms on our physical setup, including REINFORCE, Actor-Critic, TRPO, PPO, DDPG, SAC, Behavior Cloning, GAIL, PETS, and MBPO. These implementations are available on our GitHub repository, demonstrating the platform's versatility.
  • Detailed Performance Evaluation. Conducted exhaustive evaluations of multiple algorithms in the APS environment, revealing their strengths and weaknesses in terms of learning efficiency, convergence speed, and stability.
  • Open Source Resources. Provided all experimental source codes and documentation to ensure reproducibility and community sharing, fostering further research and application of RL algorithms in real-world scenarios.