BIGAI Peking University
SPHERE: Mitigating the Loss of Spectral Plasticity in Mixture-of-Experts for Deep Reinforcement Learning
Lirui Luo, Guoxi Zhang, Hongming Xu, Cong Fang, Qing Li
Peking University · State Key Laboratory of General Artificial Intelligence, BIGAI
A practical Parseval penalty that keeps MoE policy updates spectrally diverse during continual RL.
Accepted at ICML 2026

Motivation

MoE policies scale capacity, but continual RL can still make later tasks hard to learn.
  • Plasticity loss: new gradients lose useful update directions.
  • MoE gap: sparse experts can still collapse spectrally.
  • SPHERE view: track plasticity via eNTK effective rank.
  • +133%: MetaWorld CRL success vs. unregularized Top-K MoE
  • +50%: HumanoidBench CRL success vs. unregularized Top-K MoE
Figure: spectral plasticity teaser.
Core intuition. A collapsed eNTK spectrum filters gradients into a few update directions. A more isotropic spectrum lets new-task gradients move the policy more broadly.

Setup

Protocols. RL trains each task independently. CRL trains one agent through a task sequence, resuming from the previous task checkpoint.
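The two protocols differ only in what each task starts from. A minimal sketch, assuming a hypothetical `init_params()` factory and a hypothetical `train_on_task(params, task)` trainer (both names are illustrative and not from the paper):

```python
from typing import Callable, Dict, List, TypeVar

P = TypeVar("P")  # whatever object holds the agent's parameters


def run_rl(tasks: List[str], init_params: Callable[[], P],
           train_on_task: Callable[[P, str], P]) -> Dict[str, P]:
    """RL protocol: every task starts from a fresh initialization."""
    return {task: train_on_task(init_params(), task) for task in tasks}


def run_crl(tasks: List[str], init_params: Callable[[], P],
            train_on_task: Callable[[P, str], P]) -> Dict[str, P]:
    """CRL protocol: one agent; each task resumes from the previous checkpoint."""
    params = init_params()
    checkpoints: Dict[str, P] = {}
    for task in tasks:
        params = train_on_task(params, task)  # continue from the last checkpoint
        checkpoints[task] = params
    return checkpoints
```

Under CRL the checkpoint for a later task carries everything learned on earlier tasks, which is where plasticity loss can show up.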

Benchmarks. MetaWorld CW10 manipulation and HumanoidBench H1 control. Main metrics are final success and spectral plasticity $r_e(K)$.
Top-K MoE · DS-MoE · PPO · CRL

Spectral Plasticity

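The empirical NTK $K=JJ^\top$, where $J$ is the Jacobian of the network outputs with respect to the parameters, maps output gradients to functional changes. To first order in the step size $\eta$: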
$\Delta f = -\eta\,K\nabla_f L$
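This update rule follows from linearizing one gradient step. As a short derivation, assuming plain gradient descent on parameters $\theta$ with $J=\partial f/\partial\theta$ evaluated on the current batch (the symbols $\theta$, $\Delta\theta$, and $u_i$ are introduced here for illustration):
$\Delta\theta = -\eta\,\nabla_\theta L = -\eta\,J^\top\nabla_f L$
$\Delta f \approx J\,\Delta\theta = -\eta\,JJ^\top\nabla_f L = -\eta\,K\nabla_f L$
Written in the eigenbasis $K=\sum_i\lambda_i u_iu_i^\top$, this reads $\Delta f \approx -\eta\sum_i\lambda_i\,(u_i^\top\nabla_f L)\,u_i$, so each gradient component is scaled by its eigenvalue $\lambda_i$.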
If $K$ concentrates on a few eigen-directions, many gradient components are attenuated. We quantify update breadth by the spectral-entropy effective rank, where $\lambda_i$ are the eigenvalues of $K$:
$r_e(K)=\exp\!\left(-\sum_i p_i\log p_i\right)$,   $p_i=\lambda_i/\sum_j\lambda_j$
Collapsed spectrum: low $r_e(K)$ · narrow updates
Isotropic spectrum: high $r_e(K)$ · broad updates
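The effective rank can be computed directly from a Jacobian when the problem is small enough. The following NumPy sketch is illustrative only: the function names, the `eps` cutoff, and the toy Jacobians are assumptions, not the paper's evaluation code.

```python
import numpy as np


def empirical_ntk(J):
    """eNTK K = J J^T for a Jacobian J of shape (outputs, parameters)."""
    return J @ J.T


def effective_rank(K, eps=1e-12):
    """Spectral-entropy effective rank r_e(K) = exp(-sum_i p_i log p_i)."""
    eigvals = np.linalg.eigvalsh(K)          # K is symmetric PSD
    eigvals = np.clip(eigvals, 0.0, None)    # remove tiny negative round-off
    p = eigvals / eigvals.sum()              # normalized spectrum p_i
    p = p[p > eps]                           # treat 0 * log 0 as 0
    return float(np.exp(-(p * np.log(p)).sum()))


# Isotropic toy case: orthonormal rows give K = I, so r_e equals the dimension.
J_iso = np.eye(4, 10)
# Collapsed toy case: all rows are parallel, so K has a single nonzero eigenvalue.
J_col = np.outer(np.ones(4), np.arange(1.0, 11.0))

r_iso = effective_rank(empirical_ntk(J_iso))
r_col = effective_rank(empirical_ntk(J_col))
```

In this toy setting `r_iso` sits at the maximum possible value (the number of outputs) and `r_col` sits at the minimum of one, matching the isotropic and collapsed cases above.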
Figure: RL vs. CRL architecture comparison.
Plasticity loss is not only a dense-network issue. Continual training reduces success across dense PPO and multiple MoE architectures.
Figure: eNTK effective rank during CRL.
Spectral plasticity decays in baselines. SPHERE maintains higher $r_e(K)$ over HumanoidBench CRL while PPO, Top-K MoE, Dense-MoE, and DS-MoE drop toward lower-rank update geometry.

SPHERE Method

Directly forming $K$ is expensive. SPHERE instead regularizes a cheaper object: the Gram matrix of the routing-weighted expert features at the last hidden expert layer.
Pipeline: MoE forward (weighted expert features) → feature Gram ($A^{exp}_{last}$) → penalty (contract spectrum)
$\mathcal L_{SPHERE}(A)=\left\|A-\frac{\operatorname{Tr}(A)}{m}I\right\|_F^2$
$\mathcal L = \mathcal L_{PPO}+\lambda^e\mathcal L_{SPHERE}(A^{exp}_{last})$
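Here $m$ is the dimension of $A$, and $\lambda^e$ is the penalty weight (not to be confused with the eigenvalues $\lambda_i$ of $K$). The penalty suppresses expert-feature anisotropy and improves a tractable spectral-plasticity proxy.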
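One way to see why this penalty contracts the spectrum: assuming $A$ is a symmetric $m\times m$ Gram matrix with eigenvalues $\mu_1,\dots,\mu_m$ (the symbols $\mu_i$ and $\bar\mu$ are introduced here for illustration), the squared Frobenius norm of a symmetric matrix equals the sum of its squared eigenvalues, so
$\mathcal L_{SPHERE}(A)=\sum_{i=1}^{m}\left(\mu_i-\bar\mu\right)^2$,   $\bar\mu=\frac{1}{m}\sum_{i=1}^{m}\mu_i=\frac{\operatorname{Tr}(A)}{m}$
Under that assumption the penalty is $m$ times the variance of the eigenvalues of $A$. It vanishes exactly when all eigenvalues are equal, that is, when $A$ is a multiple of the identity.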
Not just load balancing. Expert-load balancing does not prevent the same spectral decay seen in Top-K MoE.
Proxy validation. Expert feature isotropy co-varies with $r_e(K)$ across rollout batches.
Design choices that matter
  • Regularize the actor MoE.
  • Use the last hidden expert layer.
  • Concatenate weighted expert features to penalize cross-expert correlations.
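The design choices above can be sketched in NumPy. This is an illustration under stated assumptions, not the authors' implementation: the tensor shapes, the batch-mean normalization of the Gram matrix, and the names `weighted_expert_gram`, `sphere_penalty`, `total_loss`, and `coef` are all hypothetical. In training, the same computation would be written in an autodiff framework so the penalty can be backpropagated through the actor.

```python
import numpy as np


def weighted_expert_gram(features, gates):
    """Gram matrix of concatenated routing-weighted expert features.

    features: (batch, n_experts, d) last-hidden-layer expert activations.
    gates:    (batch, n_experts) routing weights (zero for unselected experts).
    Returns an (n_experts * d, n_experts * d) matrix. The batch-mean
    normalization is an assumption of this sketch.
    """
    batch = features.shape[0]
    weighted = gates[:, :, None] * features   # scale each expert's features
    z = weighted.reshape(batch, -1)           # concatenate across experts
    return z.T @ z / batch                    # includes cross-expert blocks


def sphere_penalty(A):
    """|| A - (Tr(A)/m) I ||_F^2 for an m x m matrix A."""
    m = A.shape[0]
    target = (np.trace(A) / m) * np.eye(m)
    return float(np.sum((A - target) ** 2))


def total_loss(ppo_loss, A, coef):
    """L = L_PPO + coef * L_SPHERE(A); coef plays the role of lambda^e."""
    return ppo_loss + coef * sphere_penalty(A)


# Toy usage with random features and uniform routing over 4 experts.
rng = np.random.default_rng(0)
features = rng.normal(size=(256, 4, 8))
gates = np.full((256, 4), 0.25)
A = weighted_expert_gram(features, gates)
penalty = sphere_penalty(A)
```

Because the features are concatenated before the Gram matrix is formed, the off-diagonal blocks of `A` hold correlations between different experts, so two experts that learn the same feature are penalized as well.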
MetaWorld. Under CRL, SPHERE raises average success from 0.21 (Top-K MoE) to 0.49 and narrows the RL–CRL gap.
HumanoidBench. SPHERE improves average success over Top-K MoE under both RL and CRL, with a 50% CRL gain.
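The headline +133% figure for MetaWorld is consistent with the CRL averages quoted above. As a quick check, reading the gain as the relative improvement over the Top-K MoE baseline:
$\frac{0.49-0.21}{0.21}=\frac{0.28}{0.21}\approx 1.33$, i.e. about $+133\%$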
Figure: qualitative spectral collapse case.
Qualitative evidence. Without SPHERE, states quickly concentrate on one dominant singular direction. With SPHERE, multiple components remain active across the continual task sequence.

Takeaways

  • Spectral collapse explains one plasticity bottleneck: later gradients get fewer functional update directions.
  • SPHERE is lightweight: add a feature-Gram Parseval penalty to PPO instead of forming the eNTK.
  • Empirically robust: gains hold across MetaWorld and HumanoidBench and are backed by routing, penalty-placement, spectral-baseline, and statistical checks.
Boundary: spectral plasticity is one specific eNTK-effective-rank view of plasticity loss, not a complete taxonomy of all plasticity mechanisms.