
Reinforcement learning for LLMs: SDAR for multi-turn agent training

Deveshi Dabbawala

May 21, 2026

The AI landscape is evolving rapidly, with Large Language Models (LLMs) transforming how intelligent systems operate. However, moving from static single-turn reasoning to dynamic multi-turn agents introduces significant challenges. At GoML, we believe that understanding advanced approaches to reinforcement learning in LLMs is essential for building reliable and scalable AI systems.  

This article explores Self-Distilled Agentic Reinforcement Learning (SDAR), a novel framework that combines reinforcement learning and knowledge distillation to improve multi-turn agent training and autonomous decision-making.

The challenge of multi-turn agent training

Training multi-turn AI agents is far more complex than traditional single-step LLM optimization. As agents interact dynamically with environments, maintaining stable learning and reliable guidance becomes increasingly difficult.

  • Reinforcement Learning (RL) optimizes models using environmental feedback, while On-Policy Self-Distillation (OPSD) provides token-level guidance from a teacher model.
  • Combining RL and OPSD in multi-turn environments often creates instability during training.
  • In dynamic workflows, every agent action affects future observations and decisions.
  • When a student agent deviates from the teacher-guided path, token-level supervision becomes less reliable.
  • This misalignment can lead to performance degradation and unstable optimization.
  • Challenges increase further in skill-conditioned systems using retrieved demonstrations or action templates.
  • Positive teacher signals often represent useful and learnable behaviors.
  • Negative signals may result from imperfect retrieval rather than genuinely incorrect actions.
  • Treating positive and negative signals equally can reduce training stability and efficiency.

SDAR: A principled fusion strategy

SDAR addresses these issues with a simple design principle: reinforcement learning remains the primary optimization objective, while OPSD acts only as a secondary, auxiliary signal. Instead of naively combining the two loss functions, SDAR applies adaptive token-level gating to the distillation term.

The methodology operates through three key innovations:

1. Preserving RL unbiasedness

SDAR keeps the verifier-driven RL policy objective entirely untouched. Preserving the RL advantage signal ensures that environment feedback remains the primary optimization signal and prevents distillation gradients from hijacking the learning process. The OPSD loss functions as a strictly separate auxiliary objective, maintaining the integrity of advantage-based updates.
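The separation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the `aux_coef` weight, and the gate-weighted log-likelihood used as a stand-in for the OPSD term are all hypothetical. The point it shows is that the RL term depends only on advantages and log-probabilities, so changing the distillation side never alters it.

```python
from typing import Sequence, Tuple


def rl_policy_loss(logprobs: Sequence[float], advantages: Sequence[float]) -> float:
    """Policy-gradient surrogate: negative mean of advantage * log-prob."""
    return -sum(a * lp for a, lp in zip(advantages, logprobs)) / len(logprobs)


def gated_distill_loss(logprobs: Sequence[float], gates: Sequence[float]) -> float:
    """Assumed stand-in for the OPSD term: gate-weighted negative log-likelihood.

    The gates are treated as constants (no gradient would flow through them).
    """
    return -sum(g * lp for g, lp in zip(gates, logprobs)) / len(logprobs)


def total_loss(
    logprobs: Sequence[float],
    advantages: Sequence[float],
    gates: Sequence[float],
    aux_coef: float = 0.1,  # hypothetical weight on the auxiliary term
) -> Tuple[float, float, float]:
    """Return (total, rl_term, aux_term); the RL term is computed independently."""
    rl = rl_policy_loss(logprobs, advantages)
    aux = gated_distill_loss(logprobs, gates)
    return rl + aux_coef * aux, rl, aux
```

Returning the two terms separately makes it easy to log them side by side and confirm that the auxiliary term stays small relative to the RL term.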

2. Asymmetric gating via teacher-student gap

At each token position, SDAR computes the teacher-student log-probability gap (Δₜ), measuring how much the privileged teacher disagrees with the student's selected token. This gap feeds directly into a sigmoid gate that produces a smooth, bounded weight between zero and one. In reinforcement learning for LLMs, this asymmetric treatment proves critical.

The elegance lies in the asymmetry: positive gaps (where the teacher assigns higher probability) receive stronger distillation weights, signaling high-confidence endorsements. Negative gaps are attenuated softly rather than suppressed harshly, acknowledging that the teacher's lower probability may reflect noisy privileged context rather than a token that genuinely should be suppressed.

This design embodies a fundamental insight: different tokens merit different levels of supervision, and that difference should flow naturally from the evidence itself.
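A small sketch can make the gate concrete. The exact parameterization is not given in this article, so the code below assumes a plain logistic function of the gap with a hypothetical temperature `beta`; the function names are illustrative as well. Even this simple form has the properties described above: weights are bounded in (0, 1), positive gaps land above 0.5, and negative gaps shrink toward zero without ever flipping sign.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def token_gates(
    teacher_logprobs: Sequence[float],
    student_logprobs: Sequence[float],
    beta: float = 1.0,  # assumed temperature, not taken from the source
) -> List[float]:
    """Gate per token from the gap: log p_teacher - log p_student."""
    return [
        sigmoid(beta * (t - s))
        for t, s in zip(teacher_logprobs, student_logprobs)
    ]


# Illustrative values only: gaps of +1.9, 0.0 and -2.5.
gates = token_gates([-0.1, -1.0, -3.0], [-2.0, -1.0, -0.5])
```

In a training loop the gate would be computed without gradient tracking, so that it acts purely as a weight on the distillation term.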

3. Dynamic self-paced curriculum

Traditional curriculum approaches rely on rigid schedules or heuristic thresholds. SDAR's token-level gating creates an emergent, self-paced curriculum in which supervision intensity adapts dynamically throughout training. As the student policy evolves, the fraction of tokens receiving strong distillation signals naturally increases, reflecting genuine improvements in teacher-student alignment rather than a predetermined schedule.
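One way to observe this emergent curriculum is to log the share of tokens whose gate exceeds some cutoff at each training step. The helper below is a hypothetical monitoring utility, not part of SDAR itself; the 0.5 threshold and the example gate values are made up purely for illustration.

```python
from typing import Sequence


def strong_gate_fraction(gates: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of tokens whose gate is strictly above the threshold."""
    if not gates:
        return 0.0
    return sum(1 for g in gates if g > threshold) / len(gates)


# Made-up gate values standing in for an early and a late training batch.
early_batch = [0.10, 0.30, 0.45, 0.60]
late_batch = [0.40, 0.60, 0.80, 0.90]

early_fraction = strong_gate_fraction(early_batch)  # 1 of 4 tokens above 0.5
late_fraction = strong_gate_fraction(late_batch)    # 3 of 4 tokens above 0.5
```

A rising value of this metric over training would be consistent with the self-paced behavior described above, with no schedule written into the code.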

Empirical validation across benchmarks

SDAR has been evaluated across three diverse environments: ALFWorld (household manipulation tasks), Search-QA (multi-hop question answering), and WebShop (e-commerce navigation). Testing spans three models, Qwen2.5-3B, Qwen2.5-7B, and Qwen3-1.7B, to check that the findings generalize across parameter scales.

The improvements are substantial and consistent:

  • ALFWorld: +9.4% improvement over GRPO baseline (3B model)
  • Search-QA: +7.0% accuracy gains over pure RL
  • WebShop: +10.2% on accuracy metrics (7B model)

Critically, SDAR avoids the catastrophic collapse of naive GRPO+OPSD combinations, in which unbounded distillation gradients overwhelm the RL signal. In smaller models (Qwen3-1.7B), this difference becomes pronounced: SDAR achieves 53.9% on ALFWorld while naive hybrid approaches stall below 42%.

True knowledge internalization

A particularly compelling finding distinguishes SDAR from skill-augmented baselines: the method internalizes privileged knowledge into model parameters rather than creating brittle dependencies on external context. This represents a major advancement in how reinforcement learning for LLMs can effectively transfer knowledge.

Skill-GRPO, which augments prompts with retrieved demonstrations during training, exhibits massive performance cliffs at inference when skills become unavailable. For instance, on ALFWorld-3B, performance drops from 80.5% to 60.2%, below even vanilla GRPO. SDAR, by contrast, requires no external skills at inference yet surpasses skill-augmented baselines, achieving 84.4% versus 80.5%.

This distinction reflects a deeper truth: effective distillation transfers underlying patterns into the policy itself, not dependencies on context availability.

Robustness to retrieval quality

A practical concern in deployed systems is the quality of skill retrieval. SDAR degrades gracefully: even random skill retrieval, which selects demonstrations without any task awareness, still yields a +1.9% gain over the GRPO baseline on ALFWorld. Keyword-matching retrieval provides +4.7%, and UCB-based selection, the strongest strategy, gives +5.6%.

This robustness comes from the filtering ability of the gating module. Poor-quality skills generate noisy signals, but the adaptive gate down-weights tokens with negative gaps while preserving the signal from tokens with positive gaps.
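For readers unfamiliar with UCB, the selection strategy mentioned above can be illustrated with the standard UCB1 rule. How SDAR applies UCB to skills is not detailed in this article, so the class below is a generic sketch: the `UCBSkillSelector` name, the integer skill indices, and the scalar reward passed to `update` are all assumptions.

```python
import math
from typing import List


class UCBSkillSelector:
    """Generic UCB1 bandit over a fixed set of skill indices (illustrative)."""

    def __init__(self, num_skills: int, c: float = 2.0) -> None:
        self.counts: List[int] = [0] * num_skills
        self.totals: List[float] = [0.0] * num_skills
        self.c = c  # exploration constant; 2.0 gives the classic UCB1 bonus

    def select(self) -> int:
        # Try every skill once before relying on the confidence bound.
        for skill, n in enumerate(self.counts):
            if n == 0:
                return skill
        t = sum(self.counts)
        scores = [
            self.totals[i] / self.counts[i]
            + math.sqrt(self.c * math.log(t) / self.counts[i])
            for i in range(len(self.counts))
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, skill: int, reward: float) -> None:
        # Record the observed reward for the skill that was used.
        self.counts[skill] += 1
        self.totals[skill] += reward


# Toy run: skill 1 always pays 1.0, skill 0 always pays 0.0 (made-up rewards).
selector = UCBSkillSelector(num_skills=2)
for _ in range(50):
    chosen = selector.select()
    selector.update(chosen, 1.0 if chosen == 1 else 0.0)
```

The exploration bonus shrinks as a skill is tried more often, so the selector keeps sampling weaker skills occasionally while spending most rounds on the one that has paid off.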

Theoretical foundations

SDAR's design benefits from rigorous theoretical grounding. The gating mechanism ensures gradients remain strictly bounded, preventing the amplification phenomena that plague earlier hybrid approaches. The detached gate acts as a pure confidence weight rather than creating self-referential optimization loops, guaranteeing stable parameter updates.

The sigmoid transformation converts unbounded discrepancies into smooth, monotonic importance weights. This prevents the gradient explosions observed in methods that employ raw gaps as coefficients, particularly during early training when teacher-student misalignment runs highest.
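The boundedness argument can be written out explicitly. The notation below is an assumption made for illustration and is not taken from the source: $y_t$ is the sampled token, $c_t$ the student's context, $c_t^{+}$ the teacher's privileged context, $\beta$ an assumed temperature, and $\operatorname{sg}[\cdot]$ the stop-gradient (detach) operator.

```latex
\begin{aligned}
\Delta_t &= \log \pi_{\text{teacher}}(y_t \mid c_t^{+}) - \log \pi_\theta(y_t \mid c_t), \\
g_t &= \operatorname{sg}\!\left[\sigma(\beta \Delta_t)\right] \in (0, 1), \\
\left\| g_t \, \nabla_\theta \log \pi_\theta(y_t \mid c_t) \right\|
  &\le \left\| \nabla_\theta \log \pi_\theta(y_t \mid c_t) \right\| .
\end{aligned}
```

If the raw gap $\Delta_t$ were used as the coefficient instead of $g_t$, the factor $|\Delta_t|$ would have no upper bound, which is the amplification the sigmoid gate avoids. Because $g_t$ is detached, it contributes no gradient of its own and acts only as a per-token weight.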

Implications for agentic AI

SDAR reflects a maturing approach to training agentic models. As the environments that LLM-based agents must handle grow more complex, hybrid optimization approaches become necessary. However, combining reinforcement learning with other objectives effectively requires a principled approach rather than simply adding loss functions together.

SDAR illustrates this by treating distillation as an additional signal whose strength is gated by evidence.

Conclusion

SDAR stands out as a groundbreaking development in multi-turn agent training as it shows that reinforcement learning for LLMs and knowledge distillation can be combined in a productive manner. The token-level gating technique is simple but extremely effective, allowing for dynamic, data-driven supervision without the limitations associated with hand-crafted schedules.

For companies building their own autonomous agents, SDAR offers valuable lessons: protect primary optimization signals, be wary of secondary objectives, and let carefully designed gating determine supervision strength from the training data.

We at GoML make intelligent systems reach new frontiers through approaches that strike the perfect balance between elegance and practicality. One such technique is the SDAR method that allows for more efficient and adaptable multi-turn AI systems. GoML develops scalable enterprise AI solutions using AI Matic.

https://arxiv.org/html/2510.06303v1