Redundant Space Manipulator Autonomous Guidance for In-Orbit Servicing via Deep Reinforcement Learning
Authors: D’Ambrosio, Capra, Brandonisio, Silvestrini, Lavagna · Year: 2024 · Venue: Aerospace (MDPI), 11(5), 341
Raw: md
Summary
The paper presents a Deep Reinforcement Learning (DRL) guidance law for the motion-synchronization phase of an in-orbit servicing (IOS) mission. A Proximal Policy Optimization (PPO) agent generates desired joint rates for a 7-DoF redundant manipulator mounted on a free-flying (fully-actuated 6-DoF base) chaser, so that the end effector tracks a grasping point on a tumbling, uncooperative target in both position and attitude. The agent’s joint-rate commands are integrated and passed to a model-based feedback-linearization controller with PD inner loops. The central contribution is to bypass the ill-posed inverse kinematics and real-time optimization burden of redundant manipulators by learning a guidance policy that, after offline training, requires little to no online optimization.
Key Claims
- After training (3500 episodes, 420 s each, ~4.9 M timesteps), the agent achieves a 100% success rate, keeping end-effector (EE) errors below the position thresholds `r̃_ax, r̃_tx < 5 cm` and the attitude threshold `θ̃ < 5 deg`, sustained consecutively for at least `t_min = 30 s`.
- Success is defined more strictly than typical literature: the EE must remain in tolerance for ≥30 s consecutively, not merely enter the tolerance once. Average first-convergence time was 103 s (max 219 s); average consecutive in-threshold time was 312 s (74% of an episode).
- The agent generalizes without retraining: under base/target spin-rate synchronization error `ω_err ∈ [−0.5, 0.5] deg/s` (taken from the COMRADE requirement `||ω_0 − ω_T|| < 0.5 deg/s`), which requires the EE to track a moving point, the success rate drops only to 94%.
- The agent generalizes to a target up to 28% larger (150 cm vs 50 cm training radius); no successful episodes fell below a 64.5 cm radius, and re-randomizing the goal within 64.5 cm restored a 100% success rate.
- Sensitivity analysis: the EE positioning threshold is the performance bottleneck; varying the attitude threshold and `t_min` had negligible effect on success rate. Residual error is attributed to the PD controller’s nonzero steady-state error and the 0.3 s agent sample time.
- Controlling all 7 joints (with full position+attitude goal) is claimed as an improvement over prior work that either neglected attitude [17] or controlled only 6 of 7 joints [7].
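The sustained-success criterion from the claims above can be made concrete with a short sketch. The function names, the per-step error lists, and the unit convention (metres for position, degrees for attitude) are illustrative assumptions rather than the paper's code; only the 5 cm, 5 deg, 30 s, and 0.3 s values come from the summary.

```python
def longest_in_tolerance_run(pos_err_ax, pos_err_tx, att_err, dt=0.3,
                             pos_tol=0.05, att_tol=5.0):
    """Longest consecutive time (s) with all three errors below threshold.

    Assumed units: position errors in metres, attitude error in degrees.
    """
    longest = current = 0
    for e_ax, e_tx, e_att in zip(pos_err_ax, pos_err_tx, att_err):
        if e_ax < pos_tol and e_tx < pos_tol and e_att < att_tol:
            current += 1                    # still inside the tolerance window
            longest = max(longest, current)
        else:
            current = 0                     # any violation resets the counter
    return longest * dt


def is_success(pos_err_ax, pos_err_tx, att_err, t_min=30.0, dt=0.3,
               pos_tol=0.05, att_tol=5.0):
    """True if the errors stay in tolerance for at least t_min seconds in a row."""
    run = longest_in_tolerance_run(pos_err_ax, pos_err_tx, att_err,
                                   dt=dt, pos_tol=pos_tol, att_tol=att_tol)
    return run >= t_min - 1e-9              # small margin for float rounding
```

Under this reading, an episode with two 24 s in-tolerance windows separated by a single violation fails, even though the total in-tolerance time is 48 s. That is the difference from a criterion that only checks whether the tolerance was entered once.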
Method
The multibody system is explicitly modeled as free-flying (base actively controlled in translation and rotation), in contrast to the free-floating regime where the base is uncontrolled; the paper draws this distinction directly (Sec. 2.1, citing [10]). The dynamics use a direct-path Newton–Euler formulation with the base center of mass as the translational reference point; kinematics and dynamics are computed via the SPART toolkit [11].
Equation of motion (1): H(q) q̈ + C(q, q̇) q̇ = τ,
with H ∈ ℝ^{(6+N)×(6+N)} the symmetric positive-definite Generalized Inertia Matrix (GIM), C ∈ ℝ^{(6+N)×(6+N)} the Convective Inertia Matrix (CIM), and τ ∈ ℝ^{(6+N)} the generalized joint-space forces. Generalized coordinates (2): q = [r_0, R_0, q_m]^T = [q_0, q_m]^T, with r_0 the base position (inertial), R_0 the base orientation as a quaternion, and q_m the joint angles. For N = 7, the system has 6 + 7 = 13 degrees of freedom (6-DoF base plus 7-DoF arm).
Target dynamics (3): pure rotation in principal axes via Euler’s equation, I ω̇_T + ω_T × (I ω_T) = M, with M = 0 (torque-free tumble), placed at the LVLH origin.
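As a rough illustration of the torque-free tumble in (3), the sketch below integrates Euler's equations in principal axes with a fixed-step RK4 scheme. The inertia values, initial rates, and function names are hypothetical, and the paper's own integrator is not stated here; the 0.3 s step is reused from the agent sample time only for convenience.

```python
def euler_rates(omega, inertia):
    """Torque-free Euler equations in principal axes: I*dw/dt = -(w x I*w)."""
    wx, wy, wz = omega
    ix, iy, iz = inertia
    return (
        (iy - iz) * wy * wz / ix,
        (iz - ix) * wz * wx / iy,
        (ix - iy) * wx * wy / iz,
    )


def rk4_step(omega, inertia, dt):
    """One classical Runge-Kutta step for the body angular rates."""
    def shifted(base, slope, scale):
        return tuple(b + scale * s for b, s in zip(base, slope))

    k1 = euler_rates(omega, inertia)
    k2 = euler_rates(shifted(omega, k1, dt / 2), inertia)
    k3 = euler_rates(shifted(omega, k2, dt / 2), inertia)
    k4 = euler_rates(shifted(omega, k3, dt), inertia)
    return tuple(w + dt / 6 * (a + 2 * b + 2 * c + d)
                 for w, a, b, c, d in zip(omega, k1, k2, k3, k4))


def propagate(omega0, inertia, dt, steps):
    """Propagate the tumble for a given number of fixed steps."""
    omega = tuple(omega0)
    for _ in range(steps):
        omega = rk4_step(omega, inertia, dt)
    return omega


def kinetic_energy(omega, inertia):
    """Rotational kinetic energy, conserved when M = 0."""
    return 0.5 * sum(i * w * w for i, w in zip(inertia, omega))


def momentum_norm(omega, inertia):
    """Magnitude of the angular momentum, also conserved when M = 0."""
    return sum((i * w) ** 2 for i, w in zip(inertia, omega)) ** 0.5
```

With M = 0, both the rotational kinetic energy and the angular-momentum magnitude are conserved, which gives a simple sanity check on any propagated tumble.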
Control (8): coupled, model-based feedback linearization with two PD loops (base and manipulator), τ = [τ_0, τ_m]^T = H [PD(q_0^* − q_0), PD(q_m^* − q_m)]^T + C q̇.
PD gains (Table 1): base P=0.4, D=0.3; manipulator P=2.5, D=1.25. The base is held at the synchronized state; the manipulator follows the agent’s commands.
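The structure of the control law (8) can be sketched in a few lines of NumPy. This is a minimal illustration under several assumptions: the PD term is taken as `kp * error + kd * error_rate`, the base attitude error is assumed to be already reduced to a 3-vector (the quaternion error computation is not shown), and `H` and `C` are supplied by the caller (in the paper they come from SPART). The function names are hypothetical; the gains are those of Table 1.

```python
import numpy as np

N_BASE = 6  # base degrees of freedom (3 translation + 3 rotation)


def commanded_acceleration(q_err, qd_err, kp_base=0.4, kd_base=0.3,
                           kp_arm=2.5, kd_arm=1.25):
    """Stacked PD output: base gains on the first 6 entries, arm gains on the rest."""
    q_err = np.asarray(q_err, dtype=float)
    qd_err = np.asarray(qd_err, dtype=float)
    is_base = np.arange(q_err.size) < N_BASE
    kp = np.where(is_base, kp_base, kp_arm)
    kd = np.where(is_base, kd_base, kd_arm)
    return kp * q_err + kd * qd_err


def control_torque(H, C, qd, q_err, qd_err):
    """Feedback-linearizing torque: tau = H * PD(errors) + C * qd."""
    return H @ commanded_acceleration(q_err, qd_err) + C @ qd
```

Substituting this τ into H q̈ + C q̇ = τ cancels the C q̇ term and leaves q̈ equal to the PD output, so, with an exact model, each coordinate behaves like a decoupled second-order system.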
RL layer: PPO (on-policy, model-free, actor–critic). The clipped surrogate objective (5)–(6) uses the probability ratio p_k(θ) = π_θ(a_k|s_k)/π_{θ_old}(a_k|s_k) and the advantage (7) A(s_k, a_k) = [Σ_{j=k}^T γ^{j−k} R(a_j, s_j)] − V(s_k). Observation o ∈ ℝ^{32} (9): o = [q_m, q̇_m, r̃, D̃CM, ṽ, ω̃]^T (joint angles and rates plus EE pose and twist errors in the body frame). Action a ∈ ℝ^{7} (10): desired joint rates a = q̇_m^*, integrated to give q_m^*. Reward shaping uses an Artificial Potential Field (11): U_k = −r̃ + 10/(1+r̃_ax) + 10/(1+r̃_tx) + 10/(1+θ̃), with reward (12)–(13) R_k = ΔU if ΔU ≥ 0, else 1.5 ΔU (the 1.5× penalty discourages motion along equipotential surfaces), plus a +0.01 sparse bonus when all errors are below threshold simultaneously.
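The potential-based reward in (11)–(13) is compact enough to write out directly. In this sketch ΔU is assumed to mean U_k − U_{k−1}, the error units are left to the caller because they are not stated in this summary, and the function names are illustrative.

```python
def potential(r, r_ax, r_tx, theta):
    """Artificial potential field (11); it grows as the EE errors shrink."""
    return -r + 10.0 / (1.0 + r_ax) + 10.0 / (1.0 + r_tx) + 10.0 / (1.0 + theta)


def reward(u_prev, u_curr, in_threshold, penalty=1.5, bonus=0.01):
    """Reward (12)-(13): the potential change, with decreases weighted 1.5x.

    A small sparse bonus is added when all errors are inside tolerance.
    """
    delta = u_curr - u_prev                  # assumed definition of Delta U
    shaped = delta if delta >= 0 else penalty * delta
    return shaped + (bonus if in_threshold else 0.0)
```

Because a decrease in potential costs 1.5 times what the same increase earns, a step forward followed by an equal step back nets a negative return, so oscillating around a point is penalized.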
Hyperparameters: clip ε=0.2, discount γ=0.99, entropy weight w=0.01, mini-batch 128, 4 epochs, learning rate 1×10^{-5}, tanh activations, two 3×300 FNNs (actor outputs 14 = mean+std of each joint rate; critic outputs 1). Sample time 0.3 s. Episodes terminate only on a singular manipulator configuration. The navigation block is omitted: full state is assumed known.
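To show what the clip ε = 0.2 does in practice, here is a minimal sketch of the standard PPO clipped surrogate, written from the usual PPO-Clip definition rather than from the paper's code. The function name and the use of log-probabilities as inputs are assumptions; the entropy term and the critic loss are omitted.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Mean PPO-Clip objective over a batch (to be maximized)."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)            # pi_theta / pi_theta_old
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        # Take the pessimistic (smaller) of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

With ε = 0.2, a sample whose probability ratio has already moved to 1.5 contributes at most 1.2 times its positive advantage, which removes the incentive to push the policy further in that direction during the same update.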
Relevance to thesis
This is a directly on-regime reference: it shares the free-flying, fully-actuated 6-DoF base assumption central to the thesis and explicitly contrasts it against free-floating work. It is a useful datapoint on learning-based guidance for redundancy resolution — sidestepping the ill-posed inverse kinematics that plague redundant arms — and on coordinated base/manipulator control. The reward APF, the strict (sustained) success criterion, and the COMRADE-derived synchronization-error budget are concrete benchmarks. For the risk layer, the paper is a cautionary example: it omits navigation/estimation uncertainty, relies on a PD loop with nonzero steady-state error, and offers only empirical robustness (94% under spin-rate error) with no formal guarantees or chance-constrained treatment — exactly the gap a risk-aware planner would address.
Connections
Topics: generalized_inertia_matrix · redundancy_resolution · coordinated_base_manipulator_control · reinforcement_learning_guidance
Key Equations / Quotes
“in the scope of this study the multibody system is described as free-flying, signifying that the spacecraft is actively controlled in both translation and rotation, in contrast to the free-floating scenario” (p. 3, Sec. 2.1)
$$H(q)\ddot{q} + C(q,\dot{q})\dot{q} = \tau \tag{1}$$
$$\tau = \begin{Bmatrix}\tau_0\\\tau_m\end{Bmatrix} = H\begin{Bmatrix}PD(q_0^*-q_0)\\PD(q_m^*-q_m)\end{Bmatrix} + C\dot{q} \tag{8}$$
$$U_k = -\tilde{r} + \frac{10}{1+\tilde{r}_{ax}} + \frac{10}{1+\tilde{r}_{tx}} + \frac{10}{1+\tilde{\theta}} \tag{11}$$
“Simulations were only terminated if the manipulator’s configuration became singular, to prevent the DRL algorithm from breaking down due to mathematical issues.” (p. 7)
Open Questions
- How would performance change with a controller guaranteeing null steady-state error (the cited positioning bottleneck), or with a smaller agent sample time tied to the navigation filter rate?
- Can the empirical robustness (94% under spin-rate error, generalization to 28% larger targets) be made into a formal guarantee, or paired with chance constraints / risk metrics under navigation uncertainty?
- Singular configurations only terminate episodes — how does the learned policy behave near (dynamic/kinematic) singularities, and can redundancy be exploited explicitly for singularity avoidance?
- Does the trained policy transfer across Sim2Real domain gaps, with joint/link flexibility (omitted here), and with a real navigation filter in the loop?