Redundant Space Manipulator Autonomous Guidance for In-Orbit Servicing via Deep Reinforcement Learning

Authors: D’Ambrosio, Capra, Brandonisio, Silvestrini, Lavagna · Year: 2024 · Venue: Aerospace (MDPI), 11(5), 341

Summary

The paper presents a Deep Reinforcement Learning (DRL) guidance law for the motion-synchronization phase of an in-orbit servicing (IOS) mission. A Proximal Policy Optimization (PPO) agent generates desired joint rates for a 7-DoF redundant manipulator mounted on a free-flying (fully-actuated 6-DoF base) chaser, so that the end effector tracks a grasping point on a tumbling, uncooperative target in both position and attitude. The agent’s joint-rate commands are integrated and passed to a model-based feedback-linearization controller with PD inner loops. The central contribution is to bypass the ill-posed inverse kinematics and real-time optimization burden of redundant manipulators by learning a guidance policy that, after offline training, requires little to no online optimization.

Key Claims

After training (3500 episodes, 420 s each, ~4.9 M timesteps), the agent achieves a 100% success rate keeping end-effector errors below thresholds of position r̃_ax, r̃_tx < 5 cm and attitude θ̃ < 5 deg, sustained consecutively for at least t_min = 30 s.
Success is defined more strictly than typical literature: the EE must remain in tolerance for ≥30 s consecutively, not merely enter the tolerance once. Average first-convergence time was 103 s (max 219 s); average consecutive in-threshold time was 312 s (74% of an episode).
The agent generalizes without retraining: under a base/target spin-rate synchronization error ω_err ∈ [−0.5, 0.5] deg/s (taken from the COMRADE requirement ||ω_0 − ω_T|| < 0.5 deg/s), which forces the EE to track a moving point, the success rate drops only to 94%.
The agent generalizes to a target up to 28% larger than in training: when tested with a 150 cm radius (vs the 50 cm training radius), no successful episodes occurred beyond a 64.5 cm radius, and re-randomizing the goal within 64.5 cm restored a 100% success rate.
Sensitivity analysis: the EE positioning threshold is the performance bottleneck; varying the attitude threshold and t_min had negligible effect on success rate. Residual error is attributed to the PD controller’s nonzero steady-state error and the 0.3 s agent sample time.
Controlling all 7 joints (with full position+attitude goal) is claimed as an improvement over prior work that either neglected attitude [17] or controlled only 6 of 7 joints [7].
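
The sustained-success criterion in the claims above can be made concrete with a short check. This is a minimal sketch, not the authors' code: the function name, the argument layout (error components in metres, attitude error in degrees), and the choice to measure a run as sample count times the 0.3 s sample time are assumptions made for illustration.

```python
import math

def sustained_success(r_ax, r_tx, theta_deg, dt=0.3,
                      pos_tol=0.05, att_tol=5.0, t_min=30.0):
    """Check the sustained in-tolerance criterion on sampled error histories.

    r_ax, r_tx : position error components in metres (hypothetical layout)
    theta_deg  : attitude error in degrees
    Returns (success, first_convergence_time_s, longest_run_s).
    """
    n_min = math.ceil(t_min / dt - 1e-9)   # samples needed to cover t_min
    run = longest = 0
    first = None
    for k, (a, t, th) in enumerate(zip(r_ax, r_tx, theta_deg)):
        inside = abs(a) < pos_tol and abs(t) < pos_tol and abs(th) < att_tol
        if inside:
            if first is None:
                first = k * dt             # first time all errors are in tolerance
            run += 1
            longest = max(longest, run)
        else:
            run = 0                        # leaving tolerance resets the run
    return longest >= n_min, first, longest * dt

# 3 s outside tolerance, then 36 s inside: counts as a success
errs = [0.2] * 10 + [0.01] * 120
angs = [20.0] * 10 + [1.0] * 120
ok, t_first, t_run = sustained_success(errs, errs, angs)
```

Under this definition, an episode that enters tolerance repeatedly but never stays for 30 s is a failure, which is what distinguishes the criterion from a first-entry test.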

Method

The multibody system is explicitly modeled as free-flying (base actively controlled in translation and rotation), in contrast to the free-floating regime where the base is uncontrolled — the paper draws this distinction directly (Sec. 2.1, citing [10]). Dynamics use a direct-path, Newton–Euler formulation with the base center of mass as the translational reference point; kinematics/dynamics computed via the SPART toolkit [11].

Equation of motion (1):
$H(q)\ddot{q} + C(q,\dot{q})\dot{q} = \tau$
with H ∈ ℝ^{(6+N)×(6+N)} the symmetric positive-definite Generalized Inertia Matrix (GIM), C ∈ ℝ^{(6+N)×(6+N)} the Convective Inertia Matrix (CIM), τ ∈ ℝ^{(6+N)} generalized joint-space forces. Generalized coordinates (2): q = [r_0, R_0, q_m]^T = [q_0, q_m]^T, with r_0 base position (inertial), R_0 base orientation as a quaternion, q_m the joint angles. For N=7 the state stacks 6-DoF base + 7-DoF arm = 13.

Target dynamics (3): pure rotation in principal axes via Euler’s equation, I ω̇_T + ω_T × (I ω_T) = M, with M = 0 (torque-free tumble), placed at the LVLH origin.
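
The torque-free tumble in (3) can be propagated numerically as a sanity check. The sketch below is illustrative only: the principal inertias, initial rates, and the 0.1 s integration step are made-up values, and the classical Runge-Kutta integrator is a choice made here, not something the paper specifies. With M = 0, rotational kinetic energy and the angular momentum magnitude should stay constant.

```python
def omega_dot(w, inertia):
    """Torque-free Euler equations in principal axes (M = 0)."""
    wx, wy, wz = w
    ix, iy, iz = inertia
    return ((iy - iz) * wy * wz / ix,
            (iz - ix) * wz * wx / iy,
            (ix - iy) * wx * wy / iz)

def rk4_step(w, inertia, dt):
    """One classical Runge-Kutta step for the angular velocity."""
    def add(a, b, s):
        return tuple(x + s * y for x, y in zip(a, b))
    k1 = omega_dot(w, inertia)
    k2 = omega_dot(add(w, k1, dt / 2), inertia)
    k3 = omega_dot(add(w, k2, dt / 2), inertia)
    k4 = omega_dot(add(w, k3, dt), inertia)
    return tuple(x + dt / 6 * (a + 2 * b + 2 * c + d)
                 for x, a, b, c, d in zip(w, k1, k2, k3, k4))

def kinetic_energy(w, inertia):
    return 0.5 * sum(i * x * x for i, x in zip(inertia, w))

def momentum_norm(w, inertia):
    return sum((i * x) ** 2 for i, x in zip(inertia, w)) ** 0.5

# Hypothetical principal inertias [kg m^2] and initial rates [rad/s]
inertia = (120.0, 90.0, 60.0)
w0 = (0.02, 0.05, 0.01)
w = w0
for _ in range(4200):            # 420 s at an assumed 0.1 s integration step
    w = rk4_step(w, inertia, 0.1)
```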

Control (8) — coupled, model-based feedback-linearization with two PD loops (base and manipulator):
$\tau = \begin{Bmatrix}\tau_0\\\tau_m\end{Bmatrix} = H\begin{Bmatrix}PD(q_0^*-q_0)\\PD(q_m^*-q_m)\end{Bmatrix} + C\dot{q}$
PD gains (Table 1): base P=0.4, D=0.3; manipulator P=2.5, D=1.25. The base is held at the synchronized state; the manipulator follows the agent’s commands.
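
To see why (8) is called feedback linearization, substitute it into (1): the C term cancels and H inverts, leaving the acceleration equal to the PD output. The sketch below checks this on a toy 2-DoF stand-in for the (6+N)-dimensional system. The matrices and states are made-up numbers, and the PD form (proportional gain on position error plus derivative gain on rate error) is an assumption, since the note does not spell it out.

```python
def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def solve2(A, b):
    """Solve a 2x2 linear system by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - b[0] * A[1][0]) / det]

def pd(q_ref, q, dq_ref, dq, kp, kd):
    """Assumed PD form: kp * position error + kd * rate error."""
    return [kp * (r - x) + kd * (dr - dx)
            for r, x, dr, dx in zip(q_ref, q, dq_ref, dq)]

# Toy values, not taken from the paper
H = [[2.0, 0.3], [0.3, 1.0]]       # symmetric positive-definite inertia
C = [[0.1, -0.2], [0.2, 0.05]]     # convective term
q, dq = [0.1, -0.4], [0.05, 0.02]
q_ref, dq_ref = [0.5, 0.0], [0.0, 0.0]

u = pd(q_ref, q, dq_ref, dq, kp=2.5, kd=1.25)                # manipulator gains, Table 1
tau = [h + c for h, c in zip(matvec(H, u), matvec(C, dq))]   # control law (8)

# Plant response from (1): ddq = H^{-1} (tau - C dq), which should equal u
ddq = solve2(H, [t - c for t, c in zip(tau, matvec(C, dq))])
```

Under this assumed PD form and a constant reference, each manipulator error channel obeys ë + 1.25 ė + 2.5 e = 0, i.e. a natural frequency of about 1.58 rad/s and a damping ratio of about 0.40. The cancellation relies on H and C being known exactly, which is the model-based part of the controller.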

RL layer: PPO (on-policy, model-free, actor–critic). Clipped surrogate objective (5)–(6) with probability ratio p_k(θ)=π_θ(a_k|s_k)/π_{θ_old}(a_k|s_k), advantage (7) A(s_k,a_k)=[Σ_{j=k}^T γ^{j-k} R(a_j,s_j)] − V(s_k). Observation o ∈ ℝ^{32} (9): o=[q_m, q̇_m, r̃, D̃CM, ṽ, ω̃]^T (joint angles/rates plus EE pose/twist errors in the body frame; 7+7+3+9+3+3 = 32 entries). Action a ∈ ℝ^{7} (10): desired joint rates a = q̇_m^*, integrated to give q_m^*. Reward shaping via an Artificial Potential Field (11): U_k = −r̃ + 10/(1+r̃_ax) + 10/(1+r̃_tx) + 10/(1+θ̃), with reward (12)–(13) R_k = ΔU if ΔU≥0 else 1.5 ΔU (the 1.5× penalty discourages motion along equipotential surfaces), plus a +0.01 sparse bonus when all errors are below threshold simultaneously.
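
The reward shaping in (11)–(13) is compact enough to sketch directly. The function names, argument order, and the sample error values below are hypothetical; the note does not state the units used inside U_k, so the inputs are treated as non-negative error magnitudes in whatever units the paper uses.

```python
def potential(r, r_ax, r_tx, theta):
    """Artificial potential field U_k of Eq. (11); inputs are error magnitudes."""
    return -r + 10.0 / (1.0 + r_ax) + 10.0 / (1.0 + r_tx) + 10.0 / (1.0 + theta)

def shaped_reward(u_prev, u_curr, in_tolerance=False, penalty=1.5, bonus=0.01):
    """Reward of Eqs. (12)-(13): potential gain, with losses scaled by 1.5."""
    delta = u_curr - u_prev
    reward = delta if delta >= 0.0 else penalty * delta
    if in_tolerance:
        reward += bonus            # sparse bonus when all errors are below threshold
    return reward

# Illustrative error values only
u_far = potential(1.0, 0.8, 0.6, 0.5)
u_near = potential(0.5, 0.4, 0.3, 0.25)
gain = shaped_reward(u_far, u_near)    # approaching the goal
loss = shaped_reward(u_near, u_far)    # retreating by the same amount
```

Because losses are scaled by 1.5, a round trip that leaves and returns to the same potential level nets a negative reward, so wandering is penalized rather than neutral.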

Hyperparameters: clip ε=0.2, discount γ=0.99, entropy weight w=0.01, mini-batch 128, 4 epochs, learning rate 1×10^{-5}, tanh activations, two 3×300 FNNs (actor outputs 14 = mean+std of each joint rate; critic outputs 1). Sample time 0.3 s. Episodes terminate only on a singular manipulator configuration. The navigation block is omitted: full state is assumed known.
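
The clip range ε = 0.2 and discount γ = 0.99 listed above enter the PPO update through the clipped surrogate and the discounted return. The sketch below uses the standard per-sample PPO form; it is not taken from the authors' implementation, and the function names are illustrative.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO objective: min(p * A, clip(p, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

def discounted_return(rewards, gamma=0.99):
    """Sum of gamma^(j-k) * R_j from the first listed step to the episode end."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def advantage(rewards, value_estimate, gamma=0.99):
    """Advantage as in (7): discounted return minus the critic's value V(s_k)."""
    return discounted_return(rewards, gamma) - value_estimate

# A large ratio with positive advantage is capped at (1 + eps) * A
capped = clipped_surrogate(1.5, 1.0)
# A small ratio with negative advantage takes the more pessimistic clipped term
floored = clipped_surrogate(0.5, -1.0)
```

The clipping removes any incentive to move the new policy's probability ratio outside [0.8, 1.2] in a single update. At the 0.3 s sample time, a 420 s episode is 1400 steps, consistent with the roughly 4.9 M timesteps quoted for 3500 episodes.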

Relevance to thesis

This is a directly on-regime reference: it shares the free-flying, fully-actuated 6-DoF base assumption central to the thesis and explicitly contrasts it against free-floating work. It is a useful datapoint on learning-based guidance for redundancy resolution, which sidesteps the ill-posed inverse kinematics that plagues redundant arms, and on coordinated base/manipulator control. The reward APF, the strict (sustained) success criterion, and the COMRADE-derived synchronization-error budget are concrete benchmarks. For the risk layer, the paper is a cautionary example: it omits navigation/estimation uncertainty, relies on a PD loop with nonzero steady-state error, and offers only empirical robustness (94% under spin-rate error) with no formal guarantees or chance-constrained treatment, which is exactly the gap a risk-aware planner would address.

Connections

Topics: generalized_inertia_matrix · redundancy_resolution · coordinated_base_manipulator_control · reinforcement_learning_guidance

Key Equations / Quotes

“in the scope of this study the multibody system is described as free-flying, signifying that the spacecraft is actively controlled in both translation and rotation, in contrast to the free-floating scenario” (p. 3, Sec. 2.1)

$H(q)\ddot{q} + C(q,\dot{q})\dot{q} = \tau \tag{1}$

$\tau = \begin{Bmatrix}\tau_0\\\tau_m\end{Bmatrix} = H\begin{Bmatrix}PD(q_0^*-q_0)\\PD(q_m^*-q_m)\end{Bmatrix} + C\dot{q} \tag{8}$

$U_k = -\tilde{r} + \frac{10}{1+\tilde{r}_{ax}} + \frac{10}{1+\tilde{r}_{tx}} + \frac{10}{1+\tilde{\theta}} \tag{11}$

“Simulations were only terminated if the manipulator’s configuration became singular, to prevent the DRL algorithm from breaking down due to mathematical issues.” (p. 7)

Open Questions

How would performance change with a controller guaranteeing null steady-state error (the cited positioning bottleneck), or with a smaller agent sample time tied to the navigation filter rate?
Can the empirical robustness (94% under spin-rate error, generalization to 28% larger targets) be made into a formal guarantee, or paired with chance constraints / risk metrics under navigation uncertainty?
Singular configurations only terminate episodes. How does the learned policy behave near (dynamic/kinematic) singularities, and can redundancy be exploited explicitly for singularity avoidance?
Does the learned policy transfer under Sim2Real domain gaps, joint/link flexibility (omitted here), and a real navigation filter in the loop?
