Reinforcement Learning for Path Planning of Free-Floating Space Robotic Manipulator with Collision Avoidance and Observation Noise

Authors: Al Ali, Zhu · Year: 2024 · Venue: Frontiers in Control Engineering, 5:1394668
Raw: md

Summary

The paper trains a Deep Deterministic Policy Gradient (DDPG) agent to plan joint-torque paths that drive the end-effector (EE) of a 6-DOF space manipulator to a target pose for capture, while avoiding self-collision and an external obstacle. The central contributions are an engineered scalar reward function that (i) uses a quaternion orientation error rather than Euler angles to avoid representation singularities, (ii) adds a joint-velocity penalty for path smoothing, and (iii) injects white observation noise during training to harden the policy against realistic sensor noise. The system is explicitly modeled in the free-floating regime (uncontrolled base) with zero total linear and angular momentum.

Key Claims

Quaternion orientation error preserves rotation-sequence information that the Euler-angle error (Eq. 22) loses; using Euler angles enlarges the search space and “may even lead to divergence” in training.
The trained agent reaches a minimum position error Δd = 0.058 m and orientation error Δθ = 4.85° in Case 1 (no velocity constraint).
Adding the joint-velocity penalty (Case 2) roughly halves the squared joint-velocity sum along the path (peaks ~80 vs ~160 in Case 1) and smooths the first ~3 s of the trajectory; final errors Δd = 0.088 m, Δθ = 6.61°.
Training with observation noise (Case 3, SNR 10:1) yields Δd = 0.048 m, Δθ = 7.79°, with more oscillatory torques. Critically: the noise-trained agent (Case 3) still succeeds in a noise-free environment, whereas the noise-free agent (Case 2) fails in a noisy environment.
A 7.45% base-arm mass ratio (Orbital Express mission) was deliberately chosen so the free-floating coupling effect is non-negligible; most flown missions have <1% ratios where coupling is weak.
State dimension is 32, action dimension is 6 (joint torques); actor/critic are feed-forward nets with hidden layers of 400 and 300 neurons.

Method

The model is a free-floating n-DOF manipulator on an uncontrolled base (B0 = link 0, 6-DOF). EE position and orientation are propagated joint-by-joint (Eqs. 1, 2), differentiated to the velocity form

$\dot{\mathbf{x}}_e = \mathbf{J}_b \dot{\mathbf{x}}_b + \mathbf{J}_m \dot{\boldsymbol{\Phi}} \tag{5}$

with base state $\dot{x}_{b} = [v_{0}, ω_{0}]^{T}$ and EE state $\dot{x}_{e} = [v_{e}, ω_{e}]^{T}$ . Conservation of (zero) linear and angular momentum gives

$\begin{bmatrix} P \\ L \end{bmatrix} = \mathbf{M}_b \dot{\mathbf{x}}_b + \mathbf{M}_m \dot{\boldsymbol{\Phi}} = \mathbf{0} \tag{8}$

Eliminating the base velocity yields the generalized Jacobian (Umetani-Yoshida form):

$\dot{\mathbf{x}}_e = (\mathbf{J}_m - \mathbf{J}_b \mathbf{M}_b^{-1} \mathbf{M}_m)\dot{\boldsymbol{\Phi}} = \mathbf{J}_g \dot{\boldsymbol{\Phi}} \tag{9}$

which depends on base attitude, joint angles, and mass properties (this is the locus of dynamic singularities the paper flags). Dynamics are Lagrangian (Eq. 10) with the coupling block $M_{bm}$ between base and manipulator; the base row of the generalized force is zero (no base actuation), confirming the free-floating assumption.

RL layer: state $s$ (Eq. 11) stacks joint angles/rates, EE pose/twist, target pose/twist, obstacle position, and the scalar distance/orientation errors $d, ϑ$ . Action $a$ = 6 joint torques. The reward is built incrementally to the final form

$R_t = -K_d d - K_\vartheta \vartheta + r_d + r_\vartheta - p_l - p_{Ob} - p_j \sum(\dot{\varphi})^2 \tag{28}$

with the quaternion orientation error $\tilde{ξ} = ξ_{T} \times ξ_{e}^{*}$ and $ϑ = 2 arccos [Re (\tilde{ξ})]$ (Eqs. 23-24). Bonus rewards $r_{d}, r_{ϑ}$ fire only inside the capture region $(d < d_{0}, ϑ < ϑ_{0})$ ; penalties $p_{l}$ (link/target proximity), $p_{O b}$ (obstacle proximity), and $p_{j}$ (velocity smoothing). Trained in MATLAB/Simulink with DDPG (replay buffer, target nets soft-updated by $τ$ ).

Regime note: This is squarely free-FLOATING (base pose reacts passively via momentum conservation), the opposite of the thesis’s free-FLYING fully-actuated base. The paper’s own taxonomy (Dubowsky & Papadopoulos 1993) and its zero-base-force generalized-force vector make this explicit.

Relevance to thesis

For a free-flying manipulator the base is actuated, so the momentum-conservation constraint (Eq. 8) and the generalized Jacobian (Eq. 9) do not apply in this form — the dynamic-singularity pathology that motivates much of this paper is largely a free-floating artifact. The transferable content is (a) the reward-shaping recipe for pose alignment + collision/obstacle avoidance + velocity smoothing, which is regime-agnostic, and (b) the demonstration that injecting observation noise during training is necessary for robustness, directly relevant to the thesis’s risk/uncertainty layer. The quaternion-vs-Euler argument for orientation error is also a clean, citable justification for singularity-free attitude error metrics.

Connections

Topics: generalized_jacobian · dynamic_singularity · free_floating_dynamics · reinforcement_learning_planning

Key Equations / Quotes

“Free-flying manipulators are characterized by the active control of position and orientation (attitude) of the base spacecraft through the use of thrusters… Conversely, free-floating manipulator systems operate with a base spacecraft whose pose is not actively controlled.” (Sec. 1)

“the orientation difference in such a form loses the information of rotation sequence of Euler angles. This increases the search space in the learning process and may even lead to divergence.” (Sec. 4.3)

“the agent trained in Case 3 (noise on observations) still manages to achieve the desired task when implemented in a noise-free environment. Conversely, the agent trained in Case 2 (no noise on observations) fails its task in a noisy environment.” (Sec. 5.2.3)

Generalized Jacobian: $\dot{x}_{e} = (J_{m} - J_{b} M_{b}^{- 1} M_{m}) \dot{Φ} = J_{g} \dot{Φ}$ (Eq. 9).

Noise model: $Signal_{o u t} = Signal_{in} [1 + R_{n}]$ , $R_{n} \in [- 0.1, 0.1]$ (SNR 10:1).

Open Questions

The noise model states $R_{n} \in [- 0.1, 0.1]$ with “Mean = 0 and Variance = 1” — the stated variance is inconsistent with a $[- 0.1, 0.1]$ support (a uniform on that interval has variance ~3.3e-3). Likely a transcription/notation error; the effective perturbation is ±10%.
No comparison against optimization-based planners (PSO, sliding mode) on the same task, so the claimed advantage over classical methods is asserted, not benchmarked.
Generalization is shown only for a single fixed target/obstacle configuration; robustness to varied initial/target poses (a stated motivation for RL over re-optimization) is not quantified.
The reward bonus terms $r_{d}, r_{ϑ}$ are discontinuous step functions; their effect on training stability vs. a smooth shaping potential is not analyzed.

Quartz 5

Explorer

alali2024reinforcement