# Agentic RL

## Architecture

### Core Components

The framework consists of several key components:
- **Agent**: Interacts with the environment by generating actions based on observations and conversation history.
- **Environment**: Represents the task or problem to be solved, processes agent actions, and returns observations, rewards, and termination signals.
- **Tool**: A reusable component that provides specific functionality (e.g., calculation, search) that an agent can invoke.
- **Parser**: Translates between natural-language model responses and structured data such as tool calls, and formats conversation history into model-specific input formats.
- **TrajectoryCollectEngine**: Manages the interaction loop for a single agent-environment pair to produce a complete trajectory.
- **RolloutOrchestrator**: Manages multiple `TrajectoryCollectEngine` instances for parallel trajectory collection.
### Agents

Agents inherit from `ConversationAgentBase`, which provides common functionality
for maintaining conversation history (`chat_completions`) and recording
interaction steps in a `Trajectory`.
- **ModelAgent**: A simple agent for single-turn tasks, where the model's response is treated as the final answer.
- **ToolAgent**: A more complex agent that can parse model responses to detect and invoke tool calls. It uses a `ToolManager` to manage available tools and a `ToolParser` (e.g., `QwenToolParser`, `GeminiToolParser`) to understand model outputs and format tool schemas for the model prompt.
### Environments

Environments inherit from `BaseTaskEnv`, which handles episode lifecycle
management such as enforcing `max_steps`.
- **TaskEnvironment**: Designed for single-turn tasks. The environment terminates after the first agent action and computes a reward on the final response using a provided `reward_fn`.
- **ToolEnvironment**: Designed for multi-turn, tool-using tasks. It receives actions from the `ToolAgent`; if they are tool calls, it uses its internal `ToolManager` to execute them via `execute_calls`. The results of tool execution are returned to the agent as a new observation in `{"tool_outputs": ...}` format. The episode terminates when the agent invokes a special `finish` function or `max_steps` is reached, at which point `reward_fn` is called on the final answer.
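The `ToolEnvironment` step logic above can be sketched as follows. This is a toy illustration, not the actual Tunix class: the `Action` container, the plain-dict tool registry standing in for `ToolManager`, and the reward wiring are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    """Illustrative stand-in for a parsed tool-call action."""
    tool_name: str
    args: dict = field(default_factory=dict)
    call_id: str = "call_0"


class ToyToolEnvironment:
    """Minimal sketch of the multi-turn tool environment described above."""

    def __init__(self, tools, reward_fn, max_steps=8):
        self.tools = tools          # name -> callable; stands in for ToolManager
        self.reward_fn = reward_fn  # scores the final answer at termination
        self.max_steps = max_steps
        self.steps = 0

    def step(self, action: Action):
        """Returns (observation, reward, done) for one agent action."""
        self.steps += 1
        if action.tool_name == "finish":
            # Terminal step: score the final answer with reward_fn.
            return {}, self.reward_fn(action.args.get("answer")), True
        result = self.tools[action.tool_name](**action.args)
        done = self.steps >= self.max_steps
        # Tool results come back keyed by call id, as described in the text.
        obs = {"tool_outputs": {action.call_id: f"Tool returned result: {result}"}}
        return obs, 0.0, done
```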
### Tool Integration

Tools inherit from `BaseTool` and must implement `get_json_schema()` to define
their interface (parameters, description) and either `apply()` (synchronous) or
`apply_async()` (asynchronous) to define their logic. The `ToolManager`
discovers, registers, and executes tools by name, and can execute multiple tool
calls in parallel for efficiency.
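A tool following this contract might look like the sketch below. The `BaseTool` shown here is a local stand-in (not the actual Tunix import), and the exact schema shape expected by the framework is an assumption; only the `get_json_schema()`/`apply()` method names come from the text above.

```python
class BaseTool:
    """Stand-in for the Tunix base class described above."""

    def get_json_schema(self) -> dict:
        raise NotImplementedError

    def apply(self, **kwargs):
        raise NotImplementedError


class CalculatorTool(BaseTool):
    """Adds two integers; the schema advertises the interface to the model."""

    def get_json_schema(self) -> dict:
        # Declares the tool's name, description, and parameters so the
        # parser can format it into the model prompt.
        return {
            "name": "calculator",
            "description": "Add two integers.",
            "parameters": {
                "type": "object",
                "properties": {"a": {"type": "integer"}, "b": {"type": "integer"}},
                "required": ["a", "b"],
            },
        }

    def apply(self, a: int, b: int) -> int:
        return a + b
```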
### Agent/Environment Interaction

## Key Features and Optimizations

### Multi-turn Tool Use
Tunix fully supports multi-turn interactions involving tool use. The typical flow is:

1. The `ToolAgent` sends the conversation history (including the user query and prior tool results) to the LLM.
2. The LLM responds with a tool call, e.g., `calculator(a=1, b=1)`.
3. The `ToolAgent` uses its `ToolParser` to parse this into an `Action` object.
4. The `ToolEnvironment` receives this action, uses its `ToolManager` to execute `calculator`, and receives the result "2".
5. The `ToolEnvironment` returns an observation like `{"tool_outputs": {"call_id_123": "Tool returned result: 2"}}`, reward=1, and done=False.
6. The `ToolAgent` adds the tool result to its history as a `role: tool` message.
7. The loop continues until the agent calls `finish(answer=...)` or `max_steps` is reached.
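The parsing step in this flow can be sketched for the function-call syntax shown above. Real `ToolParser` implementations (`QwenToolParser`, `GeminiToolParser`) are model-specific; this regex-plus-`ast` version is purely an illustrative assumption.

```python
import ast
import re


def parse_tool_call(text: str):
    """Parse a call like 'calculator(a=1, b=1)' into a name and kwargs dict.

    Returns None when the text does not look like a tool call, so the
    caller can treat it as a plain natural-language response instead.
    """
    m = re.fullmatch(r"\s*(\w+)\((.*)\)\s*", text, flags=re.DOTALL)
    if not m:
        return None
    name, arg_src = m.groups()
    # Reuse Python's own parser to read the keyword arguments, and
    # literal_eval to evaluate only constant values (no code execution).
    call = ast.parse(f"f({arg_src})", mode="eval").body
    args = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    return {"tool_name": name, "args": args}
```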
### Asynchronous Rollouts

To accelerate trajectory collection, Tunix supports asynchronous rollouts via
the `RolloutOrchestrator`. It leverages Python's `asyncio` to manage multiple
concurrent agent-environment interactions using `TrajectoryCollectEngine`
instances, with parallelism controlled by `max_concurrency`. The
`run_producers_from_stream` method manages a pool of workers that draw
agent-environment pairs from a stream, run full episodes via `collect()`, and
queue the resulting trajectories. The `yield_batches` method allows a consumer
(such as an RL learner) to receive trajectories as they are generated. This
parallel execution significantly speeds up data collection, especially when
interacting with external models or tools with high latency.
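The semaphore-bounded producer pattern described above can be sketched in plain `asyncio`. This is a simplified model of what the text attributes to `RolloutOrchestrator`, not its actual implementation: the `collect` callable here stands in for `TrajectoryCollectEngine.collect()`, and the function signature is an assumption.

```python
import asyncio


async def run_rollout_pool(pairs, collect, max_concurrency=2):
    """Run one episode per agent-environment pair, at most
    max_concurrency episodes in flight at once, and gather the results.
    """
    sem = asyncio.Semaphore(max_concurrency)
    results: asyncio.Queue = asyncio.Queue()

    async def worker(pair):
        async with sem:                  # cap concurrent episodes
            trajectory = await collect(pair)
            await results.put(trajectory)

    await asyncio.gather(*(worker(p) for p in pairs))
    # Drain the queue; a real consumer would instead stream batches
    # as they become ready.
    return [results.get_nowait() for _ in range(results.qsize())]
```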
Furthermore, Tunix provides a `RolloutSyncLock` to manage concurrency between
rollouts and model weight synchronization in distributed training setups. This
lock ensures that rollouts (`acquire_rollout`) are temporarily paused when a
weight sync (`acquire_weight_sync`) is requested, preventing agents from
generating trajectories with stale parameters.
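The coordination pattern can be sketched with `asyncio` events: many rollouts proceed concurrently, but a pending weight sync blocks new rollouts and waits for in-flight ones to drain. Only the `acquire_rollout`/`acquire_weight_sync` method names come from the text; the internals below are assumptions.

```python
import asyncio


class ToyRolloutSyncLock:
    """Sketch of rollout/weight-sync coordination, not the Tunix class."""

    def __init__(self):
        self._no_sync = asyncio.Event()  # set => no weight sync pending
        self._no_sync.set()
        self._idle = asyncio.Event()     # set => no rollouts in flight
        self._idle.set()
        self._active = 0

    async def acquire_rollout(self):
        await self._no_sync.wait()       # pause while a sync is requested
        self._active += 1
        self._idle.clear()

    def release_rollout(self):
        self._active -= 1
        if self._active == 0:
            self._idle.set()

    async def acquire_weight_sync(self):
        self._no_sync.clear()            # stop new rollouts from starting
        await self._idle.wait()          # wait for in-flight rollouts

    def release_weight_sync(self):
        self._no_sync.set()              # rollouts may resume
```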
### Trajectory Batching and Grouping

Tunix supports batching of agentic trajectories through the `GroupQueueManager`.
This component, used within the `RolloutOrchestrator`, collects `TrajectoryItem`
instances into buckets based on a configurable `group_key` (e.g., prompt ID via
`env.task["group_id"]`) and `episode_id`. Once a bucket reaches a predefined
`group_size` (e.g., `num_generations` in GRPO), it is marked as a "ready group"
and made available for downstream processing by `yield_batches`. This mechanism
is essential for algorithms such as GRPO, which require multiple trajectory
samples per prompt, and it improves efficiency by yielding full groups of
trajectories in batches. The `max_open_buckets` parameter limits memory usage
by capping the number of buckets that can be populated simultaneously.
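The bucketing behavior can be sketched as below. This toy version models only the grouping rule described in the text (a bucket becomes a ready group once it holds `group_size` items); the class name, single-key signature, and synchronous queue are simplifying assumptions, not the actual `GroupQueueManager` API.

```python
from collections import defaultdict


class ToyGroupQueue:
    """Buckets trajectories by group key; emits full groups of group_size."""

    def __init__(self, group_size):
        self.group_size = group_size
        self.buckets = defaultdict(list)  # group_key -> partial bucket
        self.ready = []                   # completed groups, FIFO

    def put(self, group_key, trajectory):
        bucket = self.buckets[group_key]
        bucket.append(trajectory)
        if len(bucket) == self.group_size:
            # Bucket is full: promote it to a ready group.
            self.ready.append(self.buckets.pop(group_key))

    def yield_batches(self):
        """Yield complete groups for downstream processing (e.g., GRPO)."""
        while self.ready:
            yield self.ready.pop(0)
```

With `group_size=2` (playing the role of `num_generations`), two trajectories for the same prompt form one group, while a lone trajectory for another prompt stays buffered until its partner arrives.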