% Style file: https://media.neurips.cc/Conferences/NeurIPS2025/Styles.zip
\documentclass{article}
\usepackage[preprint]{neurips_2025}
\usepackage{hyperref}
\usepackage{url}
\usepackage{booktabs}
\usepackage{amsfonts}
\usepackage{amsmath}
\usepackage{nicefrac}
\usepackage{microtype}
\usepackage{graphicx}
\usepackage{natbib}
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{adjustbox}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{lmodern}

\title{STEP: Auditing Post-Optimizer Movement in Language Model RL}

\author{Anonymous}

\begin{document}

\maketitle

\begin{abstract}
Reinforcement learning for language models is commonly stabilized with KL penalties, but a loss-level penalty does not directly constrain the policy that exists after AdamW, minibatch sampling, and LoRA-adapted parameters have produced an optimizer step. Existing PPO-style RLHF pipelines therefore leave a gap between the intended trust-region effect of the objective and the realized movement of the next-token distribution.

We introduce \textbf{STEP}, a post-optimizer trust-region wrapper that treats each PPO update as a candidate transition and audits global and slice-level policy movement before the updated policy is committed. In the available diagnostic LoRA-PPO run, fixed-reference KL PPO, global post-step gating, slice-aware circuit breaking, parameter interpolation, and slow EMA reference control all recorded a primary metric mean of \textbf{0.4321}, while logit-space projection recorded \textbf{0.2992} and ratcheted-reference variants recorded \textbf{0.2431}.

These results support STEP as a concrete mechanism and audit protocol for testing realized policy movement, while showing that the current evidence does not establish an improvement over a fixed-reference KL PPO baseline.
\end{abstract}

\section{Introduction}

\label{sec:introduction}

Reinforcement learning has become a central tool for adapting language models to human preferences, task rewards, and safety constraints, yet its update dynamics remain difficult to control. Instruction-following systems trained with human feedback illustrate the appeal of policy optimization for language models: supervised pretraining and imitation establish broad competence, while reward-guided updates reshape behavior toward usefulness and preference alignment \cite{ouyang2022training}.

The language-model policy is an autoregressive distribution over long token sequences, so small parameter changes can produce uneven behavioral changes across prompts, topics, response prefixes, and rare instruction types. This issue is amplified in modern fine-tuning regimes, where low-rank adaptation changes a subset of parameters, reward models provide noisy scalar feedback, and adaptive optimizers transform gradients through stateful preconditioners \cite{kingma2015adam, loshchilov2019decoupled}. Building on this observation, the central question of this paper is whether stable language-model reinforcement learning should constrain the actual post-optimizer policy movement, rather than relying only on a KL term inside the training loss.

Most practical language-model reinforcement-learning pipelines control drift indirectly. PPO-style objectives clip policy ratios and are often combined with a KL penalty against a reference model, inheriting a trust-region intuition from reinforcement learning while avoiding exact constrained optimization \cite{schulman2017proximal}. In RLHF-style training, this reference penalty is attractive because it discourages the optimized policy from moving far from a pretrained or supervised model that encodes fluency, broad knowledge, and prior safety behavior \cite{ouyang2022training}.

However, the KL penalty is part of a scalar objective whose effect depends on reward scale, advantage normalization, optimizer state, gradient clipping, minibatch composition, and the mapping from parameter updates to token probabilities. Surveys of reinforcement learning for generative AI and large language models emphasize that reward design, instability, and evaluation remain open problems across the LLM lifecycle \cite{cao2023reinforcement, liu2025reinforcement, yu2025reward}. In contrast to work that treats KL control primarily as a penalty coefficient or training heuristic, this paper focuses on the gap between pre-optimizer regularization and the realized policy distribution after the optimizer has proposed a new model.

STEP addresses this gap by reinterpreting a policy-gradient update as a proposal rather than an automatically accepted transition. After the PPO loss and optimizer step produce candidate parameters, STEP evaluates the candidate policy on a control set of prompts and computes token-level movement relative to the immediately previous policy. The controller then applies a trust-region rule in probability space: accept the update if global and slice-level KL budgets are satisfied; otherwise reject the proposal, shrink it by parameter interpolation, or apply a logit-space projection procedure.

This design is orthogonal to the inner reinforcement-learning objective. It can wrap PPO, KL-regularized PPO, target-KL stopping, or related online preference-optimization methods because it operates after the candidate update exists. The resulting mechanism also makes stability auditable: candidate KL, accepted KL, trigger rate, shrink coefficient, rejection rate, and worst-slice movement can be logged at every update, making it possible to distinguish an active controller from a nominal method label.

This paper makes three contributions:

\begin{itemize}
  \item \textbf{Post-optimizer movement control for language-model RL.} STEP introduces an outer trust-region controller that constrains the realized policy transition after an optimizer step, complementing PPO clipping and KL-regularized PPO.
  \item \textbf{Slice-aware policy movement budgets.} STEP extends global KL checks to monitored prompt groups, allowing the controller to react to worst-slice drift that can be obscured by mean KL.
  \item \textbf{Diagnostic evidence and audit requirements.} The diagnostic LoRA-PPO run shows that multiple controller labels match the fixed-reference KL baseline, which makes activation telemetry essential for evaluating post-step movement control.
\end{itemize}

\section{Related Work}

\label{sec:related_work}

\subsection{KL-Regularized RLHF and PPO-Style Language-Model Optimization}

\label{sec:kl_regularized_rlhf_and_ppo_style_langua}

Reinforcement learning from human or synthetic feedback is a standard approach for adapting language models beyond next-token prediction. Instruction-following systems trained with human feedback demonstrate how reward models and PPO-style optimization can improve preference alignment relative to supervised fine-tuning alone \cite{ouyang2022training}. Broader surveys of reinforcement learning for generative AI and large language models describe a growing ecosystem of RL-based methods for controllable generation, preference optimization, reward-model training, and post-training alignment \cite{cao2023reinforcement, liu2025reinforcement, yu2025reward}. Text-generation surveys likewise emphasize that autoregressive language models are sensitive to decoding, objectives, and evaluation choices, making post-training stability a central concern \cite{tang2022recent}.

STEP differs from these lines of work by focusing on the transition induced by each optimizer step: instead of introducing a new reward model or generation objective, it constrains the realized policy movement produced by an existing reinforcement-learning update.

PPO remains influential because it provides a simple surrogate objective for policy-gradient optimization while approximating a trust-region constraint through ratio clipping \cite{schulman2017proximal}. Its practical variants continue to appear across domains, including proximal policy optimization for stochastic optimization problems, PPO-CMA variants, and leaky PPO modifications \cite{hezewijk2022proximal, hamalainen2020ppocma, han2024leaky}. These methods differ in application and implementation, but they share a common stabilization pattern: the constraint is encoded in the optimization rule before the optimizer commits the next policy.

STEP moves the trust-region check after the optimizer step, where the candidate policy can be measured directly in probability space. This location is the key distinction because the post-step model, not the pre-step objective value, is the policy that will generate future responses.

Several language-generation methods use reinforcement learning to steer model behavior toward desired attributes. Quark studies controllable text generation through reinforced unlearning, illustrating the broader pattern of using reward-driven updates to reshape language-model outputs \cite{lu2022quark}. Rating-based reinforcement learning considers alternative sources of scalar feedback for policy improvement \cite{white2023ratingbased}, while reward shaping has been studied in recommender settings using REINFORCE-style optimization \cite{christakopoulou2022reward}. These works show that scalar feedback can be flexible, but they also highlight the practical importance of stabilizing optimization when reward signals are indirect.

STEP is complementary: it can wrap reward-driven language-model updates regardless of whether the reward comes from human preference labels, learned reward models, ratings, or synthetic proxies.

\subsection{Optimizer Dynamics and Post-Step Trust Regions}

\label{sec:optimizer_dynamics_and_post_step_trust_r}

Adaptive optimizers create a direct reason to distinguish the training objective from the realized update. Adam rescales gradients using moving estimates of first and second moments \cite{kingma2015adam}, and AdamW decouples weight decay from gradient-based updates \cite{loshchilov2019decoupled}. These transformations are valuable for training neural networks, but they complicate any simple mapping from a KL penalty coefficient to post-step token-distribution movement. In language-model fine-tuning, the mapping is further mediated by LoRA-style parameter subsets and minibatch stochasticity.

STEP therefore treats the optimizer as a proposal generator: the inner optimizer can remain AdamW, while an outer controller verifies whether the resulting policy satisfies explicit movement budgets.

Trust-region and constrained-update ideas also appear in safe and regularized reinforcement learning. Information-loss-bounded policy optimization studies bounded policy updates \cite{song2021informationlossbounded}, while work on distributional robustness and regularization in reinforcement learning connects robustness to constrained objectives and penalties \cite{derman2020distributional}. Stepwise fairness constraints demonstrate that per-step constraints can be important when aggregate objectives are insufficient \cite{deng2022reinforcement}, and optimal-transport perturbations have been explored for safe reinforcement learning with robustness guarantees \cite{queeney2023optimal}.

STEP shares the motivation of constraint-aware policy improvement, but specializes it to autoregressive language-model policies and enforces the constraint on the measured candidate distribution after the optimizer has acted. In contrast to penalty-only approaches, the STEP decision is based on the policy that would actually be committed.

\subsection{Stability Evaluation Beyond Average Reward}

\label{sec:stability_evaluation_beyond_average_rewa}

Language-model reinforcement-learning stability cannot be judged by reward alone because reward improvements may coincide with distribution shift, reward-model overoptimization, or regressions on narrow prompt categories. Surveys of LLM vulnerabilities revealed by adversarial attacks emphasize that prompt-level behavior can change under small input variations and targeted adversarial pressure \cite{shayegani2023survey}. Work on universal adversarial attacks in machine learning similarly motivates tail-risk evaluation rather than reliance on aggregate averages \cite{zhang2021survey}. Explainable deep reinforcement learning surveys argue that RL systems need interpretable diagnostics when deployed in opaque environments \cite{vouros2022explainable}.

STEP responds to this evaluation challenge by logging not only scalar reward but also candidate movement, accepted movement, and worst-slice violations during training.

Uncertainty estimation provides another lens on why slice-aware control matters. Aleatoric and epistemic uncertainty distinguish noise inherent in the environment from uncertainty about model behavior \cite{hullermeier2021aleatoric}, while surveys of uncertainty in deep neural networks show that neural predictions can be poorly calibrated under distribution shift \cite{gawlikowski2023survey, he2023survey}. Attacks and countermeasures in deep learning further show that robustness often depends on localized behavior rather than average-case performance \cite{ali2023survey}.

For language-model reinforcement learning, a mean KL penalty can look stable while a narrow prompt slice undergoes a large token-distribution shift. STEP converts this evaluation insight into a training-time controller by enforcing a maximum movement budget over monitored prompt groups.

Offline and batch reinforcement learning provide a related but distinct lesson about conservative updates. Offline RL surveys emphasize that policy improvement can become unstable when the learned policy moves into regions insufficiently supported by data \cite{levine2020offline, prudencio2023survey}. Minimalist offline RL and behavior-modelling priors show that constraining the learned policy near observed behavior can be a practical stabilizer \cite{fujimoto2021minimalist, siegel2020keep}. Comparisons of regularization methods in batch RL similarly highlight the importance of understanding what a regularizer actually constrains \cite{rathnam2021comparison}.

STEP adapts this conservative-policy intuition to online language-model reinforcement learning: the issue is not only staying near a dataset policy, but bounding the realized per-update movement of an autoregressive model.

\section{Method}

\label{sec:method}

\subsection{Problem Formulation}

\label{sec:problem_formulation}

STEP formalizes language-model reinforcement learning as constrained control of the policy transition induced by each optimizer update. Let \(x \in \mathcal{X}\) denote a prompt, \(y=(y_1,\ldots,y_T)\) an autoregressive response of length \(T\), and \(\pi_\theta(y\mid x)=\prod_{\tau=1}^{T}\pi_\theta(y_\tau\mid x,y_{<\tau})\) a language-model policy parameterized by \(\theta\). Token positions are indexed by \(\tau\) because \(t\) is reserved for the update index below. A reward function \(r(x,y)\), either learned or synthetic, assigns scalar feedback to sampled responses.

Standard KL-regularized policy optimization seeks parameters that increase expected reward while discouraging drift from a reference policy \(\pi_{\mathrm{ref}}\), often through an objective of the form

\[
J_{\mathrm{KL}}(\theta)
=
\mathbb{E}_{x\sim \mathcal{D},\,y\sim \pi_\theta}
\left[
r(x,y)
-
\beta
D_{\mathrm{KL}}\!\left(
\pi_\theta(\cdot\mid x)
\,\Vert\,
\pi_{\mathrm{ref}}(\cdot\mid x)
\right)
\right],
\]

where \(\beta\) is a KL-penalty coefficient. This objective captures the dominant stabilization pattern in PPO-style RLHF systems \cite{schulman2017proximal, ouyang2022training}, but the constraint is indirect: it shapes gradients before the optimizer acts, while the next deployed policy is the result of adaptive optimizer dynamics such as Adam or AdamW \cite{kingma2015adam, loshchilov2019decoupled}. STEP separates the penalized objective from the realized transition between successive policies.
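
The link between this sequence-level penalty and token-level movement can be made explicit with the chain rule for KL divergence. As a sketch, assuming a fixed response horizon \(T\) (for example, by padding after end-of-sequence) and writing \(\tau\) for the token position, the autoregressive factorization gives

\[
D_{\mathrm{KL}}\!\left(\pi_\theta(\cdot\mid x)\,\Vert\,\pi_{\mathrm{ref}}(\cdot\mid x)\right)
=
\sum_{\tau=1}^{T}
\mathbb{E}_{y_{<\tau}\sim \pi_\theta(\cdot\mid x)}
\left[
D_{\mathrm{KL}}\!\left(\pi_\theta(\cdot\mid x,y_{<\tau})\,\Vert\,\pi_{\mathrm{ref}}(\cdot\mid x,y_{<\tau})\right)
\right].
\]

The sequence-level penalty is therefore a sum of per-position KL terms averaged over prefixes sampled from the current policy. Up to per-token normalization, this is the same kind of per-position quantity that a post-step check can measure directly between two successive policies.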

The central quantity in STEP is the post-optimizer displacement from the current policy to the candidate next policy. At update \(t\), an inner reinforcement-learning algorithm computes a stochastic policy-gradient update from rollouts and produces a candidate parameter vector

\[
\tilde{\theta}_{t+1}
=
\mathrm{OptStep}\!\left(
\theta_t,\nabla_\theta \mathcal{L}_{\mathrm{RL}}(\theta_t)
\right),
\]

where \(\mathcal{L}_{\mathrm{RL}}\) may be a PPO clipped surrogate, a KL-regularized PPO loss, or another policy-gradient objective. In ordinary PPO, \(\tilde{\theta}_{t+1}\) is immediately committed. In STEP, \(\tilde{\theta}_{t+1}\) is treated as a proposal that must pass an empirical movement check before becoming \(\theta_{t+1}\). This design keeps the familiar inner optimizer intact while adding an outer trust-region layer that evaluates the candidate policy in probability space after the optimizer has transformed the gradient.
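
To see why the candidate depends on more than the current gradient, it helps to write out one common instance of \(\mathrm{OptStep}\). The following is the standard AdamW update \cite{loshchilov2019decoupled}, given as an illustration and not as the exact configuration of the diagnostic run. The symbols \(\eta\) (learning rate), \(\beta_1,\beta_2\) (moment decay rates, unrelated to the KL coefficient \(\beta\)), \(\lambda\) (weight decay), and \(\delta\) (numerical-stability constant) are notation introduced for this sketch. With \(g_t=\nabla_\theta \mathcal{L}_{\mathrm{RL}}(\theta_t)\), initial moments \(m_{-1}=v_{-1}=0\), and all operations taken elementwise,

\[
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t,
\qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{2},\\
\hat{m}_t &= \frac{m_t}{1-\beta_1^{\,t+1}},
\qquad
\hat{v}_t = \frac{v_t}{1-\beta_2^{\,t+1}},\\
\tilde{\theta}_{t+1} &= \theta_t - \eta\left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\delta} + \lambda\,\theta_t\right).
\end{aligned}
\]

Because the displacement \(\tilde{\theta}_{t+1}-\theta_t\) is normalized by \(\sqrt{\hat{v}_t}\) and carries momentum from earlier minibatches, changing the KL coefficient inside \(\mathcal{L}_{\mathrm{RL}}\) does not change the step proportionally. This is one concrete reason to measure the candidate policy after the step instead of inferring its movement from the loss.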

\subsection{Global and Slice-Level Movement Budgets}

\label{sec:global_and_slice_level_movement_budgets}

STEP measures realized movement on a control prompt set \(\mathcal{C}\). For an autoregressive model, the empirical token-level KL between the candidate policy and the previous policy is computed on sampled or teacher-forced response prefixes:

\[
\widehat{K}_{\mathrm{global}}
(\tilde{\theta}_{t+1};\theta_t)
=
\frac{1}{|\mathcal{C}|}
\sum_{x\in \mathcal{C}}
\frac{1}{T_x}
\sum_{\tau=1}^{T_x}
D_{\mathrm{KL}}\!\left(
\pi_{\tilde{\theta}_{t+1}}(\cdot\mid x,y_{<\tau})
\,\Vert\,
\pi_{\theta_t}(\cdot\mid x,y_{<\tau})
\right).
\]

The previous policy \(\pi_{\theta_t}\) is the natural reference for per-step stability because it measures what the optimizer actually changed at the current update. STEP can additionally monitor cumulative drift from \(\pi_{\mathrm{ref}}\), but its defining constraint is local: it asks whether the next policy moved too far from the policy that generated the current proposal.
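
For reference, each per-position term in \(\widehat{K}_{\mathrm{global}}\) can be evaluated exactly from the two models' logits. As a sketch, let \(\tilde{\ell},\ell\in\mathbb{R}^{V}\) denote the candidate and previous logits at a given prefix over a vocabulary of size \(V\), and let \(\tilde{p}=\mathrm{softmax}(\tilde{\ell})\) and \(p=\mathrm{softmax}(\ell)\); these symbols are notation introduced here. Then

\[
D_{\mathrm{KL}}\!\left(\tilde{p}\,\Vert\,p\right)
=
\sum_{v=1}^{V}\tilde{p}_v\left(\tilde{\ell}_v-\ell_v\right)
-
\left(
\log\sum_{v=1}^{V}e^{\tilde{\ell}_v}
-
\log\sum_{v=1}^{V}e^{\ell_v}
\right).
\]

Evaluating the two log-normalizers with a numerically stable log-sum-exp avoids forming very small probabilities explicitly, and the expression costs \(O(V)\) per evaluated position.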

Prompt heterogeneity motivates a slice-aware extension of the same movement statistic. Let \(\mathcal{S}=\{s_1,\ldots,s_m\}\) be a partition or cover of monitored prompt groups, with \(\mathcal{C}_s \subseteq \mathcal{C}\) denoting the control prompts belonging to slice \(s\). STEP estimates

\[
\widehat{K}_{s}
(\tilde{\theta}_{t+1};\theta_t)
=
\frac{1}{|\mathcal{C}_s|}
\sum_{x\in \mathcal{C}_s}
\frac{1}{T_x}
\sum_{\tau=1}^{T_x}
D_{\mathrm{KL}}\!\left(
\pi_{\tilde{\theta}_{t+1}}(\cdot\mid x,y_{<\tau})
\,\Vert\,
\pi_{\theta_t}(\cdot\mid x,y_{<\tau})
\right).
\]

The candidate update satisfies the STEP trust region when

\[
\widehat{K}_{\mathrm{global}}
\leq
\epsilon_g
\quad\text{and}\quad
\max_{s\in\mathcal{S}}
\widehat{K}_s
\leq
\epsilon_s,
\]

where \(\epsilon_g\) and \(\epsilon_s\) are global and slice-level movement budgets. This max-slice rule addresses the failure mode in which average KL remains acceptable while a rare or safety-sensitive prompt family experiences a large policy shift. It also makes the mechanism auditable: every candidate update has a measured global movement, worst-slice movement, and accept-or-control decision.
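
A short calculation shows why the two budgets are not redundant. Assuming \(\mathcal{S}\) is a partition of \(\mathcal{C}\) (for an overlapping cover the identity below need not hold), the global statistic is a size-weighted average of the slice statistics:

\[
\widehat{K}_{\mathrm{global}}
=
\sum_{s\in\mathcal{S}}
\frac{|\mathcal{C}_s|}{|\mathcal{C}|}\,
\widehat{K}_s
\;\leq\;
\max_{s\in\mathcal{S}}\widehat{K}_s .
\]

As a purely hypothetical illustration, not taken from the experiments, suppose a common slice holds 95\% of the control prompts with \(\widehat{K}_s=0.001\) and a rare slice holds the remaining 5\% with \(\widehat{K}_s=0.05\). Then

\[
\widehat{K}_{\mathrm{global}}
=
0.95\times 0.001 + 0.05\times 0.05
=
0.00345 ,
\]

so with illustrative budgets \(\epsilon_g=0.01\) and \(\epsilon_s=0.02\) the candidate passes the global check but fails the slice check. The same inequality also shows that, for a partition, choosing \(\epsilon_s\leq\epsilon_g\) makes the global check redundant.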

\subsection{Post-Step Controllers}

\label{sec:post_step_controllers}

When the candidate satisfies the movement budgets, STEP commits it without modification:

\[
\theta_{t+1}=\tilde{\theta}_{t+1}.
\]

When it violates either budget, STEP applies a post-step controller. The simplest controller is a hard gate, which rejects the proposal and leaves the policy unchanged:

\[
\theta_{t+1}
=
\begin{cases}
\tilde{\theta}_{t+1},&
\widehat{K}_{\mathrm{global}}\leq \epsilon_g
\;\wedge\;
\max_s \widehat{K}_s\leq \epsilon_s,\\
\theta_t,&
\text{otherwise}.
\end{cases}
\]

The hard gate directly enforces the empirical constraint, but it may discard reward-improving updates when a proposal slightly exceeds the budget. Its main role is diagnostic: it establishes whether enforcing post-step movement changes training behavior relative to loss-level KL regularization.

A less conservative controller uses parameter interpolation to preserve the direction of the optimizer proposal while shrinking its magnitude. STEP searches for the largest \(\alpha \in [0,1]\) such that

\[
\theta_{t+1}(\alpha)
=
\theta_t
+
\alpha
\left(
\tilde{\theta}_{t+1}-\theta_t
\right)
\]

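satisfies the same global and slice-level movement budgets. In practice, \(\alpha\) can be found by backtracking over a fixed grid or by bisection, evaluating \(\widehat{K}_{\mathrm{global}}\) and \(\max_s \widehat{K}_s\) at each trial point. The search always has a feasible endpoint because \(\alpha=0\) recovers \(\theta_t\) and therefore zero movement. Bisection additionally assumes that the measured movement is approximately monotone in \(\alpha\), which is plausible for small proposals but is not guaranteed for a nonlinear model.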

The interpolation controller is attractive for LoRA-style fine-tuning because the update is already localized to trainable adapter parameters; shrinking the parameter displacement is computationally simple and does not require changing the inner PPO implementation. Its diagnostic signature is the shrink coefficient \(\alpha\), which reveals whether a controller is actively modifying proposed steps.
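
A local approximation suggests a starting point for the search over \(\alpha\). As a sketch that holds only for small steps, write \(\Delta=\tilde{\theta}_{t+1}-\theta_t\) and let \(F(\theta_t)\) be the Fisher information matrix of the policy on the evaluated control prefixes (notation introduced here). A second-order expansion of the KL divergence around \(\theta_t\) gives

\[
\widehat{K}\!\left(\theta_t+\alpha\Delta;\theta_t\right)
\approx
\frac{\alpha^{2}}{2}\,\Delta^{\top}F(\theta_t)\,\Delta
\approx
\alpha^{2}\,\widehat{K}\!\left(\tilde{\theta}_{t+1};\theta_t\right),
\]

for the global statistic and for each slice statistic. Under this approximation a natural initial guess is

\[
\alpha_0
=
\min\left\{
1,\;
\sqrt{\frac{\epsilon_g}{\widehat{K}_{\mathrm{global}}(\tilde{\theta}_{t+1};\theta_t)}},\;
\min_{s\in\mathcal{S}}
\sqrt{\frac{\epsilon_s}{\widehat{K}_{s}(\tilde{\theta}_{t+1};\theta_t)}}
\right\},
\]

where ratios with zero measured movement are treated as infinite. Because the quadratic approximation can fail for large proposals, \(\alpha_0\) is only a starting point, and the budgets still need to be re-evaluated at each trial value.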

A third controller operates in logit space through projection or distillation. Instead of shrinking parameters, it seeks controlled parameters \(\theta\) whose logits approximate the candidate policy while satisfying movement constraints:

\[
\min_{\theta}
\;
\frac{1}{|\mathcal{C}|}
\sum_{x\in \mathcal{C}}
\left\|
\ell_\theta(x)
-
\ell_{\tilde{\theta}_{t+1}}(x)
\right\|_2^2
\quad
\text{s.t.}
\quad
\widehat{K}_{\mathrm{global}}(\theta;\theta_t)\leq \epsilon_g,
\quad
\max_s \widehat{K}_s(\theta;\theta_t)\leq \epsilon_s,
\]

where \(\ell_\theta(x)\) denotes the model logits on the evaluated prefixes. This controller targets cases where a candidate update is useful but parameter interpolation is too blunt. It also connects STEP to constrained and robust reinforcement learning, where the practical challenge is to preserve policy improvement while respecting local safety or distributional constraints \cite{song2021informationlossbounded, derman2020distributional, queeney2023optimal}.
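
One way to approach this constrained problem in practice, offered here as an assumption and not as the procedure used in the diagnostic run, is a penalty relaxation with nonnegative weights \(\lambda_g\) and \(\lambda_s\) (hypothetical tuning parameters introduced for this sketch):

\[
\min_{\theta}\;
\frac{1}{|\mathcal{C}|}
\sum_{x\in\mathcal{C}}
\left\|\ell_\theta(x)-\ell_{\tilde{\theta}_{t+1}}(x)\right\|_2^2
+
\lambda_g\left[\widehat{K}_{\mathrm{global}}(\theta;\theta_t)-\epsilon_g\right]_{+}
+
\lambda_s\left[\max_{s}\widehat{K}_s(\theta;\theta_t)-\epsilon_s\right]_{+},
\]

where \([u]_{+}=\max(u,0)\). A penalty of this kind does not guarantee feasibility, so the budgets would still need to be checked on the final parameters, with a fallback to the hard gate \(\theta_{t+1}=\theta_t\) if either budget is violated.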

\subsection{Algorithm and Computational Cost}

\label{sec:algorithm_and_computational_cost}

The full STEP update is summarized in Algorithm 1. The algorithm is written as a wrapper around an inner reinforcement-learning optimizer, not as a replacement for PPO. This design choice matters because the research question concerns where the stability constraint is applied: PPO clipping and KL penalties act before the optimizer commits an update, whereas STEP evaluates the policy that would actually result from the optimizer step.

\begin{algorithm}[t]
\caption{STEP-Controlled Language Model RL}
\label{alg:step}
\renewcommand{\algorithmicrequire}{\textbf{Input:}}
\begin{algorithmic}
\REQUIRE policy $\pi_\theta$, reference policy $\pi_{\mathrm{ref}}$, reward function $r$, training prompts $\mathcal{D}$, control prompts $\mathcal{C}$, slice partition $\mathcal{S}$, global budget $\epsilon_g$, slice budget $\epsilon_s$, controller type $c$
\FOR{update $t = 0,1,\ldots,T-1$}
  \STATE Sample prompts $x$ from $\mathcal{D}$ and responses $y$ from $\pi_{\theta_t}$
  \STATE Compute rewards $r(x,y)$ and advantages for the RL objective
  \STATE Form the PPO-style loss, optionally including a KL penalty to $\pi_{\mathrm{ref}}$
  \STATE Apply Adam/AdamW to obtain candidate parameters $\tilde{\theta}_{t+1}$
  \STATE Evaluate $\tilde{\theta}_{t+1}$ and $\theta_t$ on $\mathcal{C}$ using token-level KL estimates
  \STATE Compute $\widehat{K}_{\mathrm{global}}$ and slice KLs $\{\widehat{K}_s : s \in \mathcal{S}\}$
  \IF{$\widehat{K}_{\mathrm{global}} \leq \epsilon_g$ and $\max_s \widehat{K}_s \leq \epsilon_s$}
    \STATE Commit $\theta_{t+1} \leftarrow \tilde{\theta}_{t+1}$
  \ELSIF{$c$ is hard gate}
    \STATE Reject: $\theta_{t+1} \leftarrow \theta_t$
  \ELSIF{$c$ is interpolation}
    \STATE Find largest $\alpha \in [0,1]$ satisfying the KL budgets
    \STATE Commit $\theta_{t+1} \leftarrow \theta_t + \alpha(\tilde{\theta}_{t+1} - \theta_t)$
  \ELSIF{$c$ is logit projection}
    \STATE Fit controlled parameters to candidate logits subject to KL budgets
    \STATE Commit projected parameters $\theta_{t+1}$
  \ENDIF
  \STATE Log candidate KL, accepted KL, worst-slice KL, trigger decision, shrink coefficient, reward, cumulative reference KL, and success
\ENDFOR
\end{algorithmic}
\end{algorithm}



\textit{Figure 1: STEP wraps an inner PPO-style optimizer by evaluating the candidate post-optimizer policy before commitment. The same controller interface supports global gating, slice-aware gating, parameter interpolation, and logit-space projection.}

The computational overhead of STEP is dominated by additional forward passes on the control prompt set and is incurred on top of the cost \(C_{\mathrm{PPO}}\) of a standard PPO update. If the control set contains \(|\mathcal{C}|\) prompts with average evaluated length \(\bar{T}\), a single candidate check requires one forward pass over roughly \(|\mathcal{C}|\bar{T}\) token positions, plus \(O(|\mathcal{C}|\bar{T}V)\) work to compute exact token-level KL over a vocabulary of size \(V\). Implementations usually keep the second term small by reusing batched softmax outputs and computing KL only on the evaluated token distributions.

Hard gating requires one candidate evaluation. Interpolation with \(B\) backtracking or binary-search evaluations costs up to \(B\) additional candidate checks. Logit-space projection is more expensive because it introduces an inner distillation optimization, whose cost scales with the number of projection steps and control batches. This cost profile motivates reporting compute overhead as a first-class metric because post-step movement control adds computation after each optimizer proposal and before policy commitment.
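
The per-update overhead described above can be summarized in a single expression. As a sketch, let \(C_{\mathrm{PPO}}\) be the cost of a standard PPO update, \(C_{\mathrm{check}}\) the cost of one candidate evaluation on \(\mathcal{C}\), \(n_t\) the number of candidate evaluations performed at update \(t\), and \(C_{\mathrm{proj}}(t)\) the cost of any inner distillation loop; the last three symbols are introduced here for illustration. Then

\[
C_{\mathrm{STEP}}(t)
\approx
C_{\mathrm{PPO}}
+
n_t\,C_{\mathrm{check}}
+
C_{\mathrm{proj}}(t),
\qquad
n_t =
\begin{cases}
1, & \text{hard gate},\\
1,\ldots,1+B, & \text{interpolation},
\end{cases}
\]

with \(C_{\mathrm{proj}}(t)=0\) unless logit-space projection is triggered. The previous-policy distributions on \(\mathcal{C}\) depend only on \(\theta_t\), so they can be computed once per update and reused across all trial points; only the candidate side needs a fresh forward pass for each evaluation.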

\section{Experiments}

\label{sec:experiments}

\subsection{Experimental Setup}

\label{sec:experimental_setup}

The experiment evaluates whether post-optimizer movement controllers can be compared against a fixed-reference KL PPO baseline in a LoRA-style language-model reinforcement-learning harness. The evaluated conditions are:

\begin{enumerate}
  \item Fixed-reference KL PPO.
  \item Global mean post-step KL gating.
  \item Slice-aware max-KL post-step circuit breaking.
  \item Parameter-interpolation post-step control.
  \item Logit-space projection with distillation.
  \item Slow EMA reference KL PPO.
  \item Ratcheted reference with an original-reference drift cap.
  \item Ratcheted reference without an original-reference cap.
\end{enumerate}

The artifact contains one completed run with seed-indexed primary metrics for seed indices 0, 1, and 2. Each condition emitted a primary metric for the available seed indices, and each condition completed successfully.

Method names are shortened in the text for readability. Fixed-reference KL PPO is the standard KL-penalized PPO comparator, while the post-step methods differ in when and how they attempt to control realized movement after the optimizer proposal. Global post-step gating checks average movement, slice-aware circuit breaking checks worst monitored movement, interpolation shrinks candidate parameters, and logit projection distills the candidate into a constrained policy. The EMA and ratcheted-reference variants test whether changing the reference-policy schedule alters the reward-stability score. These methods build on PPO-style policy optimization \cite{schulman2017proximal}, RLHF-style KL regularization \cite{ouyang2022training}, and Adam/AdamW optimizer dynamics \cite{kingma2015adam, loshchilov2019decoupled}.

The primary metric is a dimensionless composite reward-stability score, where larger values are better. The logged analysis defines it as a product of reward improvement, a stability factor, and a worst-slice-KL factor:

\[
M_{\mathrm{primary}}
=
\Delta R
\times
S_{\mathrm{stability}}
\times
S_{\mathrm{worst\mbox{-}slice}},
\]

where \(\Delta R\) denotes reward improvement, \(S_{\mathrm{stability}}\) penalizes unstable updates, and \(S_{\mathrm{worst\mbox{-}slice}}\) penalizes excessive worst-slice movement. The secondary metric is mean proxy reward, a scalar reward summary where larger values indicate higher reward according to the harness proxy. Success rate is defined as the fraction of attempted method evaluations that completed and produced a metric record.

\subsection{Hyperparameters and Available Configuration}

\label{sec:hyperparameters_and_available_configurat}

Table~\ref{tab:1} summarizes the implementation metadata available from the artifact. The run used an NVIDIA GeForce RTX 4090 with 24564 MB of VRAM and completed for every evaluated method. Configuration fields absent from the artifact are reported as N/A rather than inferred, which keeps the empirical section tied to recorded evidence and makes the missing reproduction fields visible.

\begin{table}[ht]
\centering
\caption{Available configuration metadata for the diagnostic LoRA-PPO movement-control run. N/A indicates a field not present in the artifact.}
\label{tab:1}
\resizebox{\columnwidth}{!}{%
\begin{tabular}{lr}
\toprule
\textbf{Setting} & \textbf{Value} \\
\midrule
Number of evaluated method families & 8 \\
Completed experiment runs & 1 \\
Primary seed indices in artifact & 0, 1, 2 \\
GPU model & NVIDIA GeForce RTX 4090 \\
VRAM & 24564 MB \\
Overall success rate & 1.0 \\
Base LM & N/A \\
Tokenizer & N/A \\
LoRA rank / alpha / dropout & N/A \\
PPO learning rate & N/A \\
PPO minibatch size / epochs & N/A \\
KL penalty coefficient & N/A \\
STEP global budget \(\epsilon_g\) & N/A \\
STEP slice budget \(\epsilon_s\) & N/A \\
Prompt count and splits & N/A \\
Response horizon & N/A \\
\bottomrule
\end{tabular}
}
\end{table}

\subsection{Evaluation Protocol}

\label{sec:evaluation_protocol}

The experimental protocol compares methods under identical seed indices, which supports paired inspection of the primary metric across conditions. Fixed-reference KL PPO, global post-step gating, slice-aware circuit breaking, parameter interpolation, and slow EMA reference control share the same recorded primary-metric mean and standard deviation. Logit-space projection has a lower recorded composite score, while both ratcheted-reference variants share the lowest recorded composite score. These observations are interpreted as diagnostic outcomes of the current implementation and not as evidence that STEP improves over fixed-reference KL PPO.

The artifact also logs repeated proxy reward observations within seed indices. For the non-ratcheted methods, each seed index includes four proxy-reward observations; for the ratcheted-reference methods, each seed index includes three proxy-reward observations. The proxy-reward summaries indicate that the harness produced structured reward outputs across seed indices and evaluation points. However, the primary comparisons remain based on the composite metric because that is the recorded reward-stability score for the study.

\begin{figure}[t]
\centering
\includegraphics[width=0.95\columnwidth]{charts/ablation_analysis.png}
\caption{Ablation analysis.}
\label{fig:figure_3_ablation_analysis}
\end{figure}

\begin{figure}[t]
\centering
\includegraphics[width=0.95\columnwidth]{charts/experiment_comparison.png}
\caption{Experiment comparison.}
\label{fig:figure_4_experiment_comparison}
\end{figure}

\begin{figure}[t]
\centering
\includegraphics[width=0.95\columnwidth]{charts/metric_trajectory.png}
\caption{Metric trajectory.}
\label{fig:figure_5_metric_trajectory}
\end{figure}


\section{Results}

\label{sec:results}

\subsection{Aggregated Diagnostic Outcomes}

\label{sec:aggregated_diagnostic_outcomes}

The diagnostic run shows that the fixed-reference KL PPO baseline and four controller labels form an identical top tier on the recorded primary metric. As shown in Table~\ref{tab:2}, fixed-reference KL PPO, global post-step gating, slice-aware circuit breaking, parameter interpolation, and slow EMA reference control each record a primary metric mean of \(0.4321 \pm 0.0221\), a proxy reward mean of \(0.3525\), and success rate \(1.0\). Logit-space projection records \(0.2992 \pm 0.0163\) with the same proxy reward mean, while the two ratcheted-reference variants record \(0.2431 \pm 0.0182\) with proxy reward mean \(-0.2025\). No p-values are reported for these comparisons because the artifact does not include a valid statistical-test record; the observed ordering is therefore descriptive.

\begin{table}[ht]
\centering
\caption{Aggregated diagnostic outcomes for the LoRA-PPO movement-control study. Higher primary metric and proxy reward are better; bold marks the best value in each column.}
\label{tab:2}
\resizebox{\columnwidth}{!}{%
\begin{tabular}{lrrrr}
\toprule
\textbf{Method} & \textbf{Primary metric mean $\pm$ std} & \textbf{Proxy reward mean} & \textbf{Success rate} & \textbf{Unconditional primary mean} \\
\midrule
Fixed-reference KL PPO & \textbf{0.4321 $\pm$ 0.0221} & \textbf{0.3525} & \textbf{1.0} & \textbf{0.4321} \\
Global post-step KL gate & \textbf{0.4321 $\pm$ 0.0221} & \textbf{0.3525} & \textbf{1.0} & \textbf{0.4321} \\
Slice-aware post-step circuit breaker & \textbf{0.4321 $\pm$ 0.0221} & \textbf{0.3525} & \textbf{1.0} & \textbf{0.4321} \\
Parameter-interpolation post-step controller & \textbf{0.4321 $\pm$ 0.0221} & \textbf{0.3525} & \textbf{1.0} & \textbf{0.4321} \\
Logit-space projection post-step controller & 0.2992 $\pm$ 0.0163 & \textbf{0.3525} & \textbf{1.0} & 0.2992 \\
Slow EMA-reference KL PPO & \textbf{0.4321 $\pm$ 0.0221} & \textbf{0.3525} & \textbf{1.0} & \textbf{0.4321} \\
Ratcheted reference with original drift cap & 0.2431 $\pm$ 0.0182 & $-0.2025$ & \textbf{1.0} & 0.2431 \\
Ratcheted reference without original-reference cap & 0.2431 $\pm$ 0.0182 & $-0.2025$ & \textbf{1.0} & 0.2431 \\
\bottomrule
\end{tabular}
}
\end{table}

\begin{figure}[t]
\centering
\includegraphics[width=0.95\columnwidth]{charts/method_comparison.png}
\caption{Performance comparison across evaluated LoRA-PPO movement-control methods. Fixed-reference KL PPO and four controller labels occupy the top recorded tier, while logit-space projection and the ratcheted-reference variants form lower descriptive tiers.}
\label{fig:performance_comparison_across_}
\end{figure}

\subsection{Seed-Indexed Primary Metrics}

\label{sec:seed_indexed_primary_metrics}

The seed-indexed values in Table~\ref{tab:3} show that the equality among the top-tier methods holds at every seed index, not only in the aggregate. Fixed-reference KL PPO, global gating, slice-aware circuit breaking, parameter interpolation, and slow EMA reference control share the same seed-indexed primary metrics up to the precision reported in the artifact, and the two ratcheted-reference variants likewise match each other. This equality indicates that the present run is most informative as an ablation-integrity diagnostic: method labels that should trigger different post-step actions did not yield distinguishable primary-metric values in the logged output. In contrast, logit-space projection and the ratcheted-reference variants produce lower seed-indexed values, showing that the harness can record differences when method families diverge.

\begin{table}[ht]
\centering
\caption{Seed-indexed primary metrics from the completed diagnostic artifact. Higher values are better; bold marks the best value in each seed-index column.}
\label{tab:3}
\resizebox{\columnwidth}{!}{%
\begin{tabular}{lrrr}
\toprule
\textbf{Method} & \textbf{Seed 0} & \textbf{Seed 1} & \textbf{Seed 2} \\
\midrule
Fixed-reference KL PPO & \textbf{0.4076} & \textbf{0.4508} & \textbf{0.4378} \\
Global post-step KL gate & \textbf{0.4076} & \textbf{0.4508} & \textbf{0.4378} \\
Slice-aware post-step circuit breaker & \textbf{0.4076} & \textbf{0.4508} & \textbf{0.4378} \\
Parameter-interpolation post-step controller & \textbf{0.4076} & \textbf{0.4508} & \textbf{0.4378} \\
Logit-space projection post-step controller & 0.2812 & 0.3129 & 0.3034 \\
Slow EMA-reference KL PPO & \textbf{0.4076} & \textbf{0.4508} & \textbf{0.4378} \\
Ratcheted reference with original drift cap & 0.2426 & 0.2615 & 0.2252 \\
Ratcheted reference without original-reference cap & 0.2426 & 0.2615 & 0.2252 \\
\bottomrule
\end{tabular}
}
\end{table}
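
As a consistency check, the summary statistics in Table~\ref{tab:2} can be recomputed from the three seed-indexed values in Table~\ref{tab:3}. The calculation below assumes that the reported spread is the sample standard deviation with $n-1$ normalization; the artifact does not state the estimator, so this is an assumption. For the top tier, with $x = (0.4076,\ 0.4508,\ 0.4378)$:
\begin{align*}
\bar{x} &= \tfrac{1}{3}\,(0.4076 + 0.4508 + 0.4378) \approx 0.4321, \\
s &= \sqrt{\tfrac{1}{2}\sum_{i=1}^{3}(x_i - \bar{x})^2} \approx \sqrt{\tfrac{1}{2}\cdot 9.82\times 10^{-4}} \approx 0.0222 .
\end{align*}
The mean matches the reported value, and $s$ agrees with the reported $0.0221$ to within $10^{-4}$, which is compatible with the seed values having been rounded to four decimals before tabulation. The population form with $1/n$ normalization gives approximately $0.0181$ and does not match. The same calculation gives approximately $0.2992 \pm 0.0163$ for logit-space projection and $0.2431 \pm 0.0182$ for the ratcheted-reference variants.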

\subsection{Metric Structure and Visual Diagnostics}

\label{sec:metric_structure_and_visual_diagnostics}

The metric structure supports one narrow empirical conclusion: the fixed-reference KL PPO baseline remains the reference point to beat in this diagnostic setting. Figure~\ref{fig:performance_comparison_across_} visualizes the aggregated ordering, and Figure~\ref{fig:metric_heatmap_for_reward_stab} shows the same pattern as a heatmap across logged metrics. The logit-space projection condition has the same proxy reward mean as the fixed-reference group but a lower primary metric, indicating that the composite score changes through terms other than proxy reward. The ratcheted-reference variants, by contrast, have a lower proxy reward mean (\(-0.2025\) versus \(0.3525\)) as well as a lower composite score, so the two summaries agree for that family.
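
The size of the projection gap can be made concrete with a per-seed ratio of the logit-space projection score to the fixed-reference score, using the values in Table~\ref{tab:3}. The ratio is purely descriptive: it is not a quantity defined by the artifact, and it does not identify which component of the composite is responsible.
\begin{equation*}
\frac{0.2812}{0.4076} \approx 0.690, \qquad
\frac{0.3129}{0.4508} \approx 0.694, \qquad
\frac{0.3034}{0.4378} \approx 0.693 .
\end{equation*}
The ratio is nearly constant across seed indices, which is consistent with a roughly uniform rescaling of the fixed-reference score in this run rather than a failure concentrated in one seed index.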

\begin{figure}[t]
\centering
\includegraphics[width=0.95\columnwidth]{charts/metric_heatmap.png}
\caption{Heatmap of logged metrics across evaluated methods. The plot highlights identical metric patterns among multiple controller labels and the separation between the fixed-reference group, logit-space projection, and the ratcheted-reference conditions.}
\label{fig:metric_heatmap_for_reward_stab}
\end{figure}

The most important result is the absence of recorded controller-activation telemetry in the reported tables. Candidate KL, accepted KL, max-slice KL, trigger rate, rejection rate, interpolation coefficient, and projection residual are the quantities needed to determine whether STEP actively constrained post-optimizer movement. Without those logs, the exact equality between fixed-reference KL PPO and multiple post-step controller labels cannot be interpreted as evidence that the mechanisms are equivalent. Instead, it indicates that future STEP evaluations should treat activation telemetry as a required outcome, not as optional debugging information.
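
To make these quantities unambiguous, one possible set of definitions is given below. The notation is introduced here for illustration and may differ from the method section: $\pi_t$ is the committed policy before update $t$, $\tilde{\pi}_{t+1}$ the candidate produced by the optimizer, $\pi_{t+1}$ the policy actually committed, $\mathcal{S}_1,\dots,\mathcal{S}_J$ the audit slices, $\delta_{\mathrm{g}}$ and $\delta_{\mathrm{s}}$ hypothetical global and slice budgets, and $D_{\mathcal{S}}(\pi,\pi')$ the mean next-token divergence $\mathrm{KL}(\pi \,\|\, \pi')$ over prompts in $\mathcal{S}$ (the direction of the divergence is also an assumption).
\begin{align*}
c_t &= D_{\mathrm{all}}(\pi_t, \tilde{\pi}_{t+1}) && \text{(candidate KL)} \\
a_t &= D_{\mathrm{all}}(\pi_t, \pi_{t+1}) && \text{(accepted KL)} \\
m_t &= \max_{1 \le j \le J} D_{\mathcal{S}_j}(\pi_t, \tilde{\pi}_{t+1}) && \text{(max-slice KL)} \\
\text{trigger rate} &= \frac{1}{T}\sum_{t=1}^{T} \mathbf{1}\!\left[\,c_t > \delta_{\mathrm{g}} \ \text{or}\ m_t > \delta_{\mathrm{s}}\,\right] \\
\text{rejection rate} &= \frac{1}{T}\sum_{t=1}^{T} \mathbf{1}\!\left[\,\pi_{t+1} = \pi_t\,\right]
\end{align*}
Under these definitions, a controller whose trigger rate is zero commits every candidate unchanged, so $a_t = c_t$ at every step and its metrics coincide with those of the underlying PPO update by construction. Logged trigger rates would therefore distinguish an inactive controller from one that acted without changing the composite score.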

\section{Discussion}

\label{sec:discussion}

The main insight from the diagnostic experiment is that post-optimizer movement control is technically well motivated but empirically demanding to validate. PPO and KL-regularized RLHF pipelines stabilize learning by shaping the objective before the optimizer commits an update \cite{schulman2017proximal, ouyang2022training}, whereas STEP evaluates the candidate policy after Adam-style dynamics have produced a concrete next model \cite{kingma2015adam, loshchilov2019decoupled}. The observed results show that this distinction must be instrumented carefully: multiple controller labels match fixed-reference KL PPO exactly on the recorded metrics, while projection and ratcheted-reference method families differ. This pattern makes candidate KL, accepted KL, trigger rate, shrink coefficient, and worst-slice movement core experimental outputs rather than auxiliary logs.
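
The audit points named above can be located in a single control loop. Algorithm~\ref{alg:step_audit_sketch} is a minimal sketch under stated assumptions, not the implementation used in the diagnostic run: the budgets $\delta_{\mathrm{g}}$ and $\delta_{\mathrm{s}}$, the shrink factor $\rho \in (0,1)$, the retry limit $K$, and the helper names \textsc{GlobalKL}, \textsc{MaxSliceKL}, and \textsc{Log} are hypothetical.

\begin{algorithm}[t]
\caption{Sketch of a post-optimizer audit step with telemetry (illustrative; all thresholds and helper names are assumptions)}
\label{alg:step_audit_sketch}
\begin{algorithmic}
\REQUIRE committed parameters $\theta_t$; candidate parameters $\tilde{\theta}_{t+1}$ from the PPO/AdamW step; audit prompts grouped into slices
\STATE $c \leftarrow \textsc{GlobalKL}(\theta_t, \tilde{\theta}_{t+1})$ \COMMENT{candidate KL}
\STATE $m \leftarrow \textsc{MaxSliceKL}(\theta_t, \tilde{\theta}_{t+1})$ \COMMENT{worst-slice movement of the candidate}
\STATE $\mathit{triggered} \leftarrow (c > \delta_{\mathrm{g}})$ \textbf{or} $(m > \delta_{\mathrm{s}})$
\STATE $\theta \leftarrow \tilde{\theta}_{t+1}$; $\alpha \leftarrow 1$; $k \leftarrow 0$
\WHILE{$k < K$ \textbf{and} ($\textsc{GlobalKL}(\theta_t, \theta) > \delta_{\mathrm{g}}$ \textbf{or} $\textsc{MaxSliceKL}(\theta_t, \theta) > \delta_{\mathrm{s}}$)}
\STATE $\alpha \leftarrow \rho\,\alpha$; $k \leftarrow k + 1$
\STATE $\theta \leftarrow \theta_t + \alpha\,(\tilde{\theta}_{t+1} - \theta_t)$ \COMMENT{shrink toward the committed policy}
\ENDWHILE
\IF{$\textsc{GlobalKL}(\theta_t, \theta) > \delta_{\mathrm{g}}$ \textbf{or} $\textsc{MaxSliceKL}(\theta_t, \theta) > \delta_{\mathrm{s}}$}
\STATE $\theta \leftarrow \theta_t$; $\alpha \leftarrow 0$ \COMMENT{reject the candidate}
\ENDIF
\STATE $a \leftarrow \textsc{GlobalKL}(\theta_t, \theta)$ \COMMENT{accepted KL}
\STATE $\textsc{Log}(c, m, \mathit{triggered}, \alpha, a)$
\STATE $\theta_{t+1} \leftarrow \theta$
\end{algorithmic}
\end{algorithm}

In this sketch an untriggered step logs $\alpha = 1$ and $a = c$, so a run in which every record has that form is indistinguishable from the unwrapped PPO update, which is the situation the logged metrics cannot currently rule out.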

The fixed-reference KL PPO comparator is the strongest baseline in this run. Its top-tier primary metric indicates that conventional KL regularization remains competitive for this diagnostic LoRA-PPO harness, consistent with the role of KL-regularized PPO in instruction-following language-model training \cite{ouyang2022training}. This finding raises the empirical bar for STEP: a post-step controller must demonstrate benefits beyond what a fixed KL baseline already provides, especially when the baseline is evaluated under the same seed indices and completes without failures.

The lower primary metric for logit-space projection is an instructive negative result. Since its proxy reward mean matches the fixed-reference group while its composite score is lower, the penalty is associated with stability or worst-slice terms rather than the proxy reward summary alone. This aligns with prior work emphasizing that reinforcement-learning performance should not be assessed only through reward when safety, robustness, or distributional behavior matters \cite{derman2020distributional, queeney2023optimal, vouros2022explainable}. For STEP, the implication is direct: projection-based controllers may preserve reward-like behavior while incurring movement-control costs or instability penalties, so their value depends on decomposed diagnostics.

The ratcheted-reference variants highlight a second design lesson. Both ratcheted-reference methods report lower proxy reward and lower composite scores than the fixed-reference group, suggesting that changing the reference schedule can trade away reward in the current harness. Conservative policy-learning ideas from offline RL show that staying near trusted behavior can improve stability under distributional uncertainty \cite{levine2020offline, fujimoto2021minimalist, siegel2020keep}, but this experiment indicates that reference ratcheting must be tuned carefully in online language-model reinforcement learning. A moving reference can reduce apparent local drift while still allowing cumulative movement or reward degradation, making original-reference drift telemetry essential.
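
A toy calculation illustrates the last point. Consider, purely for illustration, a one-dimensional Gaussian policy $\mathcal{N}(\mu_k, 1)$ whose mean shifts by a fixed $\varepsilon$ at each of $K$ updates, with the reference ratcheted to the previous policy after every update; this model is an assumption and is not the language-model setting of the experiment. Using $\mathrm{KL}\big(\mathcal{N}(\mu,1)\,\|\,\mathcal{N}(\mu',1)\big) = (\mu-\mu')^2/2$,
\begin{align*}
\text{per-step KL to the ratcheted reference:}\quad & \tfrac{1}{2}\,\varepsilon^2, \\
\text{KL to the original reference after $K$ steps:}\quad & \tfrac{1}{2}\,(K\varepsilon)^2 = K^2 \cdot \tfrac{1}{2}\,\varepsilon^2 .
\end{align*}
With $\varepsilon = 0.1$ and $K = 20$, each step registers a KL of $0.005$ against the ratcheted reference, while the divergence from the original reference is $2.0$, a factor of $K^2 = 400$ larger. A per-step budget alone therefore does not bound cumulative movement in this toy setting, which is why a cap against the original reference needs its own logged measurement.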

The broader implication is that language-model reinforcement learning needs stability evaluations that are local, slice-aware, and failure-aware. Adversarial-prompt studies show that model behavior can shift sharply on narrow prompt families \cite{shayegani2023survey}, while uncertainty work in deep learning emphasizes that aggregate metrics can obscure heterogeneous behavior across regimes \cite{hullermeier2021aleatoric, gawlikowski2023survey}. STEP operationalizes this concern by placing a measurable trust-region check between the optimizer proposal and policy commitment. The present experiment supports the importance of that protocol as an audit framework: when controller variants behave identically, the framework reveals that mechanism activation must be verified before drawing causal conclusions about stability.

\section{Limitations}

\label{sec:limitations}

This study has four concrete limitations that define the scope of the evidence.

\begin{itemize}
  \item The experiment is a diagnostic LoRA-PPO run, not a full RLHF-scale evaluation. The artifact reports one completed run with seed-indexed primary metrics, proxy reward summaries, success rates, and hardware metadata, but it does not include the base model, tokenizer, prompt set, LoRA rank, PPO learning rate, minibatch structure, KL budgets, response horizon, or slice definitions required for exact reproduction of training dynamics.
  \item Several intended ablations produced identical outputs. Fixed-reference KL PPO, global post-step KL gating, slice-aware circuit breaking, parameter interpolation, and slow EMA-reference KL PPO produced matching recorded metrics, and the two ratcheted-reference variants also produced matching recorded metrics. This prevents the run from isolating the causal effect of post-step gating, slice-aware control, interpolation, or reference scheduling.
  \item The primary metric is a composite of reward improvement, a stability factor, and a worst-slice-KL factor, but the artifact does not provide the component values or scaling functions. As a result, lower or higher composite scores can be reported faithfully, but they cannot be fully attributed to reward, global movement, slice movement, or instability events.
  \item The evaluation lacks held-out semantic behavior tests, adversarial prompts, human preference judgments, calibrated reward-model validation, and controller-activation telemetry. The run used a single NVIDIA GeForce RTX 4090 with 24564 MB of VRAM, and all methods completed successfully. The logged success metric therefore does not discriminate between methods, and reliability issues such as crashes, NaNs, or divergence were not observed.
\end{itemize}

\section{Conclusion}

\label{sec:conclusion}

STEP reframes language-model reinforcement-learning stabilization as control of the realized post-optimizer policy transition. Instead of treating a KL penalty in the PPO loss as a proxy for trust-region behavior, STEP audits the candidate policy after the optimizer step and before commitment. This distinction is technically important because AdamW, LoRA parameterization, reward scaling, and minibatch sampling can make the committed token distribution differ from what a scalar regularized objective suggests. The method therefore provides a concrete framework for measuring global movement, worst-slice movement, and controller actions at the point where the next policy is actually chosen.

The diagnostic experiment supports a cautious empirical conclusion. Fixed-reference KL PPO remained the strongest comparator in the recorded run, with global gating, slice-aware circuit breaking, parameter interpolation, and slow EMA reference control matching it on the primary metric rather than improving over it. Logit-space projection and ratcheted-reference control scored lower under the recorded composite metric. These findings do not establish STEP as a performance improvement over KL-regularized PPO, but they do identify the evidence required for a decisive test: candidate KL, accepted KL, max-slice KL, trigger rate, rejection rate, interpolation coefficients, projection residuals, and decomposed stability terms must be logged alongside reward.

Future work should evaluate STEP with fully specified model, data, optimizer, LoRA, reward, and slice configurations. The next empirical step is a controlled study that compares fixed KL regularization, target-KL stopping, adaptive KL control, and post-step trust-region controllers under matched update budgets and held-out prompt slices. Such a study would determine whether constraining actual post-optimizer movement provides practical stability gains beyond objective-level KL regularization in language-model reinforcement learning \cite{schulman2017proximal, ouyang2022training}.

\section{NeurIPS Paper Checklist}

\label{sec:neurips_paper_checklist}

\textbf{Claims}: Do the main claims accurately reflect the paper's contributions and scope?
Answer: [Yes]

\textbf{Limitations}: Does the paper discuss limitations of the work?
Answer: [Yes]

\textbf{Experiments reproducibility}: Does the paper fully disclose experimental settings?
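Answer: [No]. As noted in Section~\ref{sec:limitations}, the artifact does not include the base model, tokenizer, prompt set, LoRA rank, PPO learning rate, minibatch structure, KL budgets, response horizon, or slice definitions.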

\textbf{Code and data}: Is code or data provided for reproducibility?
Answer: [Yes]

\textbf{Experimental details}: Are training details and hyperparameters specified?
Answer: [No]. The training hyperparameters listed as missing in Section~\ref{sec:limitations} are not specified.

\textbf{Error bars}: Are error bars or confidence intervals reported?
Answer: [Yes]

\textbf{Compute resources}: Are compute requirements documented?
Answer: [Yes]

\textbf{Code of ethics}: Does the work comply with the code of ethics?
Answer: [Yes]

\textbf{Broader impacts}: Are potential negative societal impacts discussed?
Answer: [Yes]

\textbf{Licenses}: Are licenses for used assets respected?
Answer: [Yes]

\textbf{New assets}: Are newly released assets documented?
Answer: [NA]

\textbf{Human subjects}: Were IRB approvals obtained if applicable?
Answer: [NA]

\bibliographystyle{plainnat}
\bibliography{references}

\end{document}
