ReSkill: Reconciling Skill Creation with Policy Optimization in Agentic RL

Overview

Skills evolve with the policy, not beside it.

Core idea

Three pieces make each rollout do more work.

RL training with per-turn skill loading

veRL handles distributed RL. ReSkill adds skill-conditioned multi-turn rollouts, loading only the active skill guidance needed for the current turn.
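As a minimal sketch of what per-turn loading could look like, the snippet below selects only the skills whose trigger matches the current observation and adds their guidance to that turn's prompt. The `Skill` class, the keyword-based trigger, and the `max_skills` cap are illustrative assumptions, not ReSkill's or veRL's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    trigger_keywords: tuple  # hypothetical trigger: fires if any keyword appears
    guidance: str


def active_skills(skills, observation, max_skills=2):
    """Return up to max_skills skills whose trigger matches this turn's observation."""
    text = observation.lower()
    matched = [s for s in skills if any(k in text for k in s.trigger_keywords)]
    return matched[:max_skills]


def build_turn_prompt(task, observation, skills):
    """Assemble one turn's prompt, including only the active skill guidance."""
    lines = [f"Task: {task}", f"Observation: {observation}"]
    for s in active_skills(skills, observation):
        lines.append(f"[skill:{s.name}] {s.guidance}")
    return "\n".join(lines)
```

Keeping selection per turn means the prompt grows with the number of matching skills rather than with the size of the whole skill bank.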

RL-in-the-loop skill creation

Rollout experience becomes feedback for diagnosing failures, revising skill triggers, and proposing add, modify, or delete operations during training.
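The add, modify, and delete operations can be pictured as edits that produce a candidate skill bank without touching the one currently in use. The sketch below assumes a bank is a plain mapping from skill name to guidance text; `SkillOp` and `apply_op` are hypothetical names chosen for illustration, not the project's interface.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class SkillOp:
    kind: str                       # "add", "modify", or "delete"
    name: str
    guidance: Optional[str] = None  # required for add and modify


def apply_op(bank: Dict[str, str], op: SkillOp) -> Dict[str, str]:
    """Return a new candidate bank; the input bank is left unchanged."""
    candidate = dict(bank)
    if op.kind in ("add", "modify"):
        if op.guidance is None:
            raise ValueError(f"{op.kind} requires guidance text")
        exists = op.name in candidate
        if op.kind == "add" and exists:
            raise ValueError(f"skill already exists: {op.name}")
        if op.kind == "modify" and not exists:
            raise KeyError(op.name)
        candidate[op.name] = op.guidance
    elif op.kind == "delete":
        del candidate[op.name]  # raises KeyError if the skill is absent
    else:
        raise ValueError(f"unknown operation: {op.kind}")
    return candidate
```

Returning a copy keeps the incumbent bank available, so a proposed operation can be tested against it before being accepted or rejected.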

Skill versioning and sampling

Competing skill banks are tested within GRPO rollout groups. Thompson Sampling controls which versions are explored, accepted, rejected, or pruned.
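To illustrate how Thompson Sampling could spread competing bank versions across one rollout group, here is a small Beta-Bernoulli sketch. It assumes binary task success, a uniform Beta(1, 1) prior, and one independent draw per rollout; `VersionStats`, `sample_version`, and `assign_group` are hypothetical names, and the actual acceptance and pruning rules are not specified in the text above.

```python
import random
from dataclasses import dataclass


@dataclass
class VersionStats:
    successes: int = 0
    failures: int = 0


def sample_version(stats, rng):
    """Draw once from each version's Beta posterior and pick the largest draw."""
    draws = {
        version: rng.betavariate(s.successes + 1, s.failures + 1)
        for version, s in stats.items()
    }
    return max(draws, key=draws.get)


def assign_group(stats, group_size, rng):
    """Choose a skill-bank version for every rollout in one group."""
    return [sample_version(stats, rng) for _ in range(group_size)]


def record(stats, version, success):
    """Update a version's posterior counts with one rollout outcome."""
    if success:
        stats[version].successes += 1
    else:
        stats[version].failures += 1
```

Versions with little evidence have wide posteriors and still get sampled occasionally, while versions that keep failing are drawn less and less often, which is the exploration behavior the paragraph above describes.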

ReSkill pipeline with RL training, skill creation, and skill evolution. — ReSkill uses each rollout for policy gradients, skill-creator evidence, and version comparison, keeping the rollout budget close to standard GRPO.
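The three uses of a rollout group can be sketched in a few lines. The example assumes each rollout is a dict with a scalar `reward` and the `version` of the skill bank it ran with, uses the standard GRPO group-normalized advantage, and treats non-positive reward as failure; the field names and the failure threshold are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def reuse_group(rollouts, eps=1e-6):
    """Derive three signals from one GRPO rollout group."""
    rewards = [r["reward"] for r in rollouts]
    mu, sigma = mean(rewards), pstdev(rewards)

    # 1) Policy gradient signal: group-relative advantages.
    advantages = [(x - mu) / (sigma + eps) for x in rewards]

    # 2) Skill-creator evidence: failed rollouts to diagnose.
    failures = [r for r in rollouts if r["reward"] <= 0.0]

    # 3) Version comparison: mean reward per skill-bank version.
    by_version = defaultdict(list)
    for r in rollouts:
        by_version[r["version"]].append(r["reward"])
    version_means = {v: mean(rs) for v, rs in by_version.items()}

    return advantages, failures, version_means
```

Because all three signals come from the same group, no extra rollouts are needed beyond what GRPO already samples.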

Results

Reconciled skill-policy updates improve both seen and unseen tasks.

Training dynamics for ALFWorld and Search comparing ReSkill to baselines and ablations. — Training dynamics on held-out validation subsets. ReSkill improves over baseline GRPO, static skill use, and decoupled update variants, especially on unseen splits.

Main evaluation

Comprehensive ALFWorld and Search results

Generalization results on ScienceWorld, InterCode-SQL, and WANDS. — Additional benchmark results show the largest gaps on harder or out-of-domain tasks, where skill evolution has more room to discover reusable strategy.

Cross-domain test-time adaptation from ALFWorld to ScienceWorld. — Test-time skill adaptation transfers from ALFWorld to ScienceWorld without policy gradient updates.

Cost analysis comparing ReSkill overhead against GRPO in a two-by-two plot layout. — ReSkill achieves the highest accuracy ratio among all methods while maintaining competitive training and inference overhead.

Skill evolution

The skill library is created, tested, refined, and pruned.

ReSkill tracks skill versions over training rather than treating a skill bank as fixed context. Accepted operations tend to become shorter, more conditional, and more aligned with the action space as the policy improves.

Skill-policy co-evolution curves for ALFWorld and Search with add and delete operations. — Skill-policy co-evolution on ALFWorld and Search. Add and delete events are evaluated inside the training loop instead of being adopted automatically.

Codebase

Build agent RL with configurable skill co-evolution.

ReSkill is an easy-to-configure veRL extension that brings Anthropic-style skill creation into agentic RL training. It exposes controls for skill versioning, sampling, and bundle testing, and lets you customize how skills and the policy co-evolve.

The codebase is under active development as we continue to improve integration, customization, and supported environments.

If ReSkill is useful to your work, please consider starring the repository ⭐
