Reward signals for RL code training. Sandbox it, verify it, score it.
Your model writes code. DeepGym runs it in an isolated sandbox, executes tests against it, and returns a structured reward signal -- per-test-case scores, shaped reward components, execution metrics -- that plugs straight into TRL, verl, OpenRLHF, or your own GRPO/DAPO/PPO loop.
DeepSeek-R1 deliberately avoided neural reward models for code because they're susceptible to reward hacking at scale. DAPO, QwQ-32B, and Open-R1 followed the same path: rule-based, execution-verified rewards. That's what DeepGym provides -- deterministic, execution-based scoring with per-test granularity, running in sandboxed containers so untrusted model outputs can't touch your infrastructure.
```
 +-------+      +----------+      +--------------------+
 | Model | ---> | DeepGym  | ---> |      Sandbox       |
 +-------+      +----------+      | (Daytona / local)  |
     ^               |            +--------------------+
     |               |                      |
     |               v                      v
  reward       +-----------+          +----------+
  signal       | RunResult |<---------| Verifier |
     |         +-----------+          +----------+
     |               |                      |
     |               | score: 0.85         | JSON stdout
     |               | passed: false       | per-test cases
     |               | cases: [...]        | reward components
     |               v
     |      +-------------------+
     +------|   Training Loop   |
            | (TRL/verl/ORLHF)  |
            +-------------------+
```
```shell
pip install deepgym
```

More install options:

```shell
# With Daytona sandbox support
pip install deepgym[daytona]

# With HuggingFace Hub integration
pip install deepgym[hf]

# With lm-evaluation-harness
pip install deepgym[lm-eval]

# Everything (dev + daytona + hf + lm-eval)
pip install deepgym[all]

# From source
git clone https://github.com/DeepGym/deepgym.git
cd deepgym
pip install -e ".[all]"
```

```python
from deepgym import DeepGym, load_environment

dg = DeepGym(mode='local')
env = load_environment('coin_change')

solution = '''
def coin_change(coins, amount):
    dp = [float('inf')] * (amount + 1)
    dp[0] = 0
    for coin in coins:
        for x in range(coin, amount + 1):
            dp[x] = min(dp[x], dp[x - coin] + 1)
    return dp[amount] if dp[amount] != float('inf') else -1
'''

result = dg.run(env, model_output=solution)
print(result.score)   # 1.0
print(result.passed)  # True
print(result.cases)   # per-test breakdown: which tests passed, which failed
```

```
  Model           DeepGym             Sandbox              Verifier
    |                |                  |                    |
    | solution code  |                  |                    |
    |--------------->|                  |                    |
    |                | create sandbox   |                    |
    |                |----------------->|                    |
    |                | upload files     |                    |
    |                |----------------->|                    |
    |                |                  | python verifier.py |
    |                |                  |------------------->|
    |                |                  |                    | run tests
    |                |                  |                    | (seeded)
    |                |                  |    JSON stdout     |
    |                |                  |<-------------------|
    |                | stdout + stderr  |                    |
    |                |<-----------------|                    |
    |                | parse JSON       |                    |
    |   RunResult    |                  |                    |
    |<---------------|                  |                    |
    |                |                  |                    |
```
The verifier returns structured JSON: a 0.0-1.0 score, pass/fail, per-test-case breakdown, and optional shaped reward components (correctness, efficiency, style -- whatever you define). The per-test granularity is what makes this useful for training. Binary pass/fail is a sparse signal. Knowing that 12 out of 14 tests passed, and specifically which two failed, gives the optimizer something to work with -- this is the same approach used by CodePRM, PRIME, and Posterior-GRPO, but without needing a separate process reward model.
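To make the density argument concrete, here is a minimal stdlib-only sketch of turning a verifier result with per-test cases into a dense signal. The field names (`score`, `passed`, `cases`) follow the JSON protocol described in this README; the values are illustrative:

```python
import json

# A verifier result in the shape described above; values are illustrative.
raw = '''{
  "score": 0.857,
  "passed": false,
  "cases": [
    {"id": "test_0", "passed": true,  "score": 1.0},
    {"id": "test_1", "passed": false, "score": 0.0},
    {"id": "test_2", "passed": true,  "score": 1.0}
  ]
}'''

result = json.loads(raw)

# Dense signal: a per-test vector plus the identity of the failures,
# instead of a single pass/fail bit.
per_test = {c["id"]: c["score"] for c in result["cases"]}
failed = [c["id"] for c in result["cases"] if not c["passed"]]

print(per_test)  # {'test_0': 1.0, 'test_1': 0.0, 'test_2': 1.0}
print(failed)    # ['test_1']
```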
The field has largely converged here. A Practitioner's Guide to Multi-Turn Agentic RL found execution-based unit tests hit 22% success on SWE-Gym vs 4.2% for sparse binary and 7-9% for model-based judges (including GPT-4.1). DeepSeek-R1, DAPO, and QwQ-32B all use rule-based execution rewards rather than neural reward models.
The catch is infrastructure. You need sandboxed execution (you can't run untrusted model output on your training nodes), deterministic scoring (GRPO computes advantages across completions -- non-determinism breaks this), and structured output (binary pass/fail is too sparse for GRPO/DAPO to learn from). DeepGym handles all three.
- Execution-based verification -- the approach DeepSeek-R1, DAPO, and QwQ-32B converged on, not neural reward models
- Per-test reward signals -- test-case-level scores like CodePRM and PRIME provide, without training a separate PRM
- Shaped reward components -- `reward_components` dict for multi-signal composition (correctness + efficiency + style), similar to Posterior-GRPO's gated reward approach
- Deterministic seeded scoring -- same solution, same score, every time. GRPO and DAPO both require this
- Sandboxed execution via Daytona -- container isolation for untrusted code, same pattern as verl's Sandbox Fusion and DeepSWE's 512-container setup
- Reward hack detection -- 6 adversarial attack strategies. Anthropic's Nov 2025 paper showed reward hacking during RL causes emergent misalignment. Check your verifiers before you train
- 24 built-in environments + 2,350+ importable benchmarks (HumanEval, MBPP, EvalPlus, BigCodeBench)
- Drop-in integrations -- TRL `GRPOTrainer`, verl `compute_score`, OpenRLHF reward server, lm-eval tasks, HF Hub
- Batch scoring -- N completions in parallel with `run_batch()`, async client with semaphore-based concurrency
- Gymnasium API -- `reset()`/`step()` for multi-turn agent training, same interface as Agent-R1 and VerlTool
- REST API -- FastAPI server with async jobs and API key auth
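As an illustration of how a `reward_components` dict composes into a scalar, here is a sketch with arbitrary weights and component names (the weighting scheme is not part of DeepGym's API -- you define it in your training loop):

```python
# Illustrative weights; choose your own for your reward shaping.
WEIGHTS = {'correctness': 0.8, 'efficiency': 0.15, 'style': 0.05}

def combine(components: dict) -> float:
    """Weighted sum over whichever components the verifier emitted.

    Components without a configured weight contribute nothing.
    """
    return sum(WEIGHTS.get(name, 0.0) * value for name, value in components.items())

reward = combine({'correctness': 1.0, 'efficiency': 0.5, 'style': 1.0})
print(round(reward, 3))  # 0.925
```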
```python
from deepgym import DeepGym, load_environment

dg = DeepGym(mode='local')
env = load_environment('two_sum')
result = dg.run(env, model_output='def two_sum(nums, target): ...')

print(result.score)              # 0.85
print(result.passed)             # False
print(result.reward_components)  # {'correctness': 0.85, 'efficiency': 0.9}
```

Generate N completions, score them all, compute advantages:

```python
solutions = [model.generate(prompt) for _ in range(8)]
batch = dg.run_batch(env, solutions, max_parallel=8)

scores = [r.score for r in batch.results]
mean = sum(scores) / len(scores)
std = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5
advantages = [(s - mean) / (std + 1e-8) for s in scores]
```

```python
# TRL: wrap an environment as a GRPOTrainer reward function
from deepgym.integrations.trl import make_trl_reward_fn
from trl import GRPOTrainer

reward_fn = make_trl_reward_fn(env)
trainer = GRPOTrainer(model=model, reward_funcs=[reward_fn])
trainer.train()
```

```python
# verl: expose the environment as a compute_score function
from deepgym.integrations.verl import make_verl_compute_score

compute_score = make_verl_compute_score(env)
# In verl config: custom_reward_function.path = "your_reward_module.py"
```

```python
# OpenRLHF: serve rewards over HTTP
from fastapi import FastAPI
from deepgym.integrations.openrlhf import create_openrlhf_router

app = FastAPI()
app.include_router(create_openrlhf_router(env, dg))
# uvicorn app:app --port 8000
# POST /reward/score {"prompts": [...], "outputs": [...]} -> {"rewards": [...]}
```

```shell
# lm-evaluation-harness: register DeepGym tasks, then evaluate
python -c "from deepgym.integrations.lm_eval import register_deepgym_tasks; register_deepgym_tasks()"
lm_eval --model hf \
    --model_args pretrained=Qwen/Qwen2-0.5B-Instruct \
    --tasks deepgym_coin_change,deepgym_two_sum
```

```python
# HF Hub: share and load environments
from deepgym.integrations.hf import push_environment_to_hub, load_environment_from_hub

push_environment_to_hub(env, repo_id='your-org/deepgym-coin-change', env_name='coin_change')

# load from anywhere
env = load_environment_from_hub('your-org/deepgym-coin-change')
```

Write your own verifier inline. The string becomes the body of a function that receives `(solution_path, test_cases_path=None)`. Return a float, bool, or dict -- the wrapper normalizes it to JSON.
```python
from deepgym import DeepGym, Environment

dg = DeepGym(mode='local')
env = Environment(
    task='Write a function `add(a, b)` that returns the sum of two numbers.',
    verifier_code=(
        'import importlib.util\n'
        'spec = importlib.util.spec_from_file_location("sol", solution_path)\n'
        'mod = importlib.util.module_from_spec(spec)\n'
        'spec.loader.exec_module(mod)\n'
        'cases = [(2, 3, 5), (0, 0, 0), (-1, 1, 0), (100, 200, 300)]\n'
        'passed = sum(1 for a, b, exp in cases if mod.add(a, b) == exp)\n'
        'return passed / len(cases)\n'
    ),
)

result = dg.run(env, model_output='def add(a, b):\n    return a + b\n')
# score: 1.0, passed: True
```

Instead of a single number, you get scores for each individual test case -- useful for denser training signals.
```python
result = dg.run(env, model_output=solution)

for case in result.cases:
    print(f"{case.id}: {'PASS' if case.passed else 'FAIL'} "
          f"(input: {case.input_summary}, expected: {case.expected_summary})")

# or through the reward function
from deepgym.integrations.reward import RewardFunction

reward_fn = RewardFunction(env, max_parallel=8)
per_test = reward_fn.per_test_rewards(solutions)
# [{'test_0': 1.0, 'test_1': 0.0, 'test_2': 1.0, 'overall': 0.67}, ...]
shaped = reward_fn.shaped_rewards(solutions)
# [{'correctness': 0.8, 'efficiency': 0.9}, ...]
```

When you need throughput, use the async client:
```python
import asyncio
from deepgym import AsyncDeepGym, load_environment

async def score_all():
    dg = AsyncDeepGym(mode='daytona')
    envs = ['coin_change', 'two_sum', 'climbing_stairs']
    tasks = [
        dg.run(load_environment(name), solutions[name])
        for name in envs
    ]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    for name, result in zip(envs, results):
        if isinstance(result, Exception):
            print(f'{name}: ERROR')
        else:
            print(f'{name}: {result.score:.2f}')

asyncio.run(score_all())
```

```python
# Gymnasium-style interface for multi-turn agent training
from deepgym.gym import DeepGymEnv

gym_env = DeepGymEnv(environment=env, max_steps=3)
obs = gym_env.reset()
obs, reward, done, info = gym_env.step('def coin_change(coins, amount): ...')
```

Anthropic found that models which learn to reward-hack during RL generalize to alignment faking and sabotage. Lilian Weng's analysis documents models rewriting unit tests, modifying reward-computing code, and gaming complexity metrics. Check your verifiers before training:
```python
from deepgym.adversarial import AdversarialTester

tester = AdversarialTester(dg, pass_threshold=0.5)
report = tester.test(env, strategies=['empty', 'hardcoded', 'trivial', 'overflow'])

print(f'Exploits found: {report.exploits_found}/{report.attacks_run}')
print(f'Robust: {report.is_robust}')
```

```shell
deepgym audit --verifier verifier.py --task "..." --strategies empty hardcoded trivial
```

Six attack strategies: empty/null code, hardcoded outputs, trivial placeholders, numeric overflow (NaN/Inf), pattern matching against test structure, and LLM-generated adversarial code. The auditor also analyzes verifier source for anti-patterns (static inputs, few test cases, no type validation) and assigns a risk score.
Coding (20):
| Difficulty | Environments |
|---|---|
| Easy | fizzbuzz, reverse_string, palindrome_check, anagram_check, valid_parentheses, python_sorting, string_manipulation, two_sum |
| Medium | coin_change, climbing_stairs, house_robber, rotate_array, remove_duplicates, max_subarray, roman_to_integer, matrix_spiral, longest_consecutive, group_anagrams, top_k_frequent, merge_intervals, binary_search |
| Hard | longest_common_subsequence, level_order_traversal |
Computer-use (2): file_organizer, cli_task
Tool-use (2): api_request, data_pipeline
```shell
python scripts/import_humaneval.py     # 164 problems
python scripts/import_mbpp.py          # 500 problems
python scripts/import_evalplus.py      # HumanEval+ (80x more tests) + MBPP+
python scripts/import_bigcodebench.py  # 1,140 problems
```

Verifiers are standalone scripts that print JSON to stdout. No SDK, no imports from DeepGym, any language works. This is deliberate -- same philosophy as DeepSeek-R1's rule-based rewards. Keep the verifier simple and auditable.
```json
{
  "schema_version": "1.0",
  "score": 0.85,
  "passed": true,
  "details": "12/14 tests passed",
  "cases": [
    {"id": "test_0", "passed": true, "score": 1.0, "input_summary": "coins=[1,2,5] amount=11"},
    {"id": "test_1", "passed": false, "score": 0.0, "error": "expected 3, got -1"}
  ],
  "reward_components": {"correctness": 0.85, "efficiency": 0.92},
  "seed": 42
}
```

Three levels of reward signal, depending on how much you want from your verifier:
- Binary -- just `score` and `passed`. Equivalent to what most RLVR setups use.
- Per-test -- add `cases` for test-case-level granularity. The model learns which tests it gets right, not just whether everything passed. This is what PRIME and CodePRM provide through process reward models, but here it comes directly from execution.
- Multi-signal -- add `reward_components` for shaped rewards. Compose correctness, efficiency, and style signals with custom weights, like Posterior-GRPO's format + rule + thinking reward composition.
Simple verifiers that return a float or bool get auto-wrapped to this format. Full spec in the wiki.
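For example, a complete standalone verifier for a hypothetical `add(a, b)` task could look like this -- a plain script with no DeepGym imports, printing the JSON shape above to stdout (the test cases and task are illustrative):

```python
#!/usr/bin/env python3
# Standalone verifier sketch: load the solution file, run fixed test
# cases, print protocol JSON to stdout. Invoked as:
#   python verifier.py /path/to/solution.py
import importlib.util
import json
import sys

def verify(solution_path):
    # Load the model's solution as a module
    spec = importlib.util.spec_from_file_location("sol", solution_path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)

    cases = [(2, 3, 5), (0, 0, 0), (-1, 1, 0)]
    results = []
    for i, (a, b, expected) in enumerate(cases):
        try:
            ok = mod.add(a, b) == expected
        except Exception:
            ok = False  # crashes count as failed tests, not verifier errors
        results.append({"id": f"test_{i}", "passed": ok,
                        "score": 1.0 if ok else 0.0})

    score = sum(r["score"] for r in results) / len(results)
    return {"schema_version": "1.0", "score": score,
            "passed": score == 1.0, "cases": results}

if __name__ == "__main__" and len(sys.argv) > 1:
    print(json.dumps(verify(sys.argv[1])))
```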
```
+------------------------------------------------------------------+
|                       TRAINING FRAMEWORKS                        |
|     TRL (HuggingFace) | verl (ByteDance) | OpenRLHF | Custom     |
+------------------------------------------------------------------+
                                 |
                        completions (code)
                                 |
                                 v
+------------------------------------------------------------------+
|                             DEEPGYM                              |
|                                                                  |
|  +---------------------+   +----------------------------------+  |
|  |    Python Client    |   |       Environment Registry       |  |
|  |   DeepGym (sync)    |   |         24 built-in envs         |  |
|  |    AsyncDeepGym     |   |   HumanEval / MBPP / EvalPlus    |  |
|  +---------------------+   |      BigCodeBench / HF Hub       |  |
|             |              +----------------------------------+  |
|             v                                                    |
|  +---------------------+   +----------------------------------+  |
|  |   Verifier Engine   |   |        Adversarial Tester        |  |
|  |  template wrapping  |   |       6 attack strategies        |  |
|  |    JSON protocol    |   |      reward hack detection       |  |
|  +---------------------+   +----------------------------------+  |
|             |                                                    |
+------------------------------------------------------------------+
                                 |
                                 v
+------------------------------------------------------------------+
|                         EXECUTION LAYER                          |
|                                                                  |
|  +------------------------------+  +-----------------------------+
|  |        LocalExecutor         |  |       DaytonaSandbox        |
|  |  (subprocess, no isolation)  |  | (container, full isolation) |
|  +------------------------------+  +-----------------------------+
|                                                                  |
+------------------------------------------------------------------+
                                 |
                                 v
       RunResult { score, passed, cases, reward_components }
```
Three modes: local (subprocess, no deps, no isolation), daytona (container isolation), auto (tries Daytona, falls back to local). Use local for dev, Daytona for anything untrusted. The same Daytona infrastructure runs 500 sandboxes in parallel for TRL GRPO training with sub-200ms cold starts.
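The auto-mode fallback can be approximated as follows -- an illustrative sketch of the selection logic described above, not DeepGym's actual implementation:

```python
import os

def pick_mode():
    """Prefer the Daytona sandbox when credentials are configured,
    otherwise fall back to local subprocess execution."""
    if os.environ.get("DAYTONA_API_KEY"):
        return "daytona"
    return "local"

# e.g. DeepGym(mode=pick_mode()) when you want the choice to be explicit
```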
You're probably already using one of these training frameworks. DeepGym is the reward layer underneath:
```
+-------------------------------------+
|         Training Framework          |   TRL GRPOTrainer, verl, OpenRLHF,
|        (policy optimization)        |   rLLM, or your own PPO/GRPO loop
+-------------------------------------+
                   |
                   | "score these N completions"
                   v
+-------------------------------------+
|        Reward Infrastructure        |   <-- DeepGym
| (execution, verification, scoring)  |
+-------------------------------------+
                   |
                   | sandbox lifecycle
                   v
+-------------------------------------+
|          Compute Isolation          |   Daytona containers, local subprocess
|    (run untrusted code safely)      |
+-------------------------------------+
```
Other projects in this space: verl uses Sandbox Fusion for code verification. DeepSWE runs 512 Docker containers via rLLM. SWE-Gym and R2E-Gym provide execution-based environments for SWE tasks. DeepGym wraps the same pattern -- sandboxed execution + structured reward output -- into a single pip install with drop-in reward functions for the major frameworks.
```shell
# Run one environment
deepgym run --task task.md --verifier verifier.py --solution solution.py

# Batch eval
deepgym eval --suite medium --solutions-dir ./solutions/ --max-parallel 100

# Audit a verifier
deepgym audit --verifier verifier.py --task "..." --strategies empty hardcoded trivial

# API server (dev)
DEEPGYM_NO_AUTH=true deepgym serve --host 127.0.0.1 --port 8000 --allow-local-exec

# API server (production)
DEEPGYM_API_KEY=your-key DAYTONA_API_KEY=your-key deepgym serve --port 8000
```

Self-hosted (local Docker):
```shell
git clone https://github.com/daytonaio/daytona
cd daytona
docker compose -f docker/docker-compose.yaml up -d
# Dashboard: http://localhost:3000 ([email protected] / password)

export DAYTONA_API_URL=http://localhost:3000
export DAYTONA_API_KEY=your-local-key
```

Daytona Cloud:

- Sign up at app.daytona.io
- Grab your API key from the dashboard

```shell
export DAYTONA_API_KEY=your-cloud-key
```

```shell
pip install -e ".[all]"
pytest            # 227 tests
ruff check src/   # lint
ruff format src/  # format
```

Full docs on the GitHub Wiki:
- Getting Started -- install and first run
- Core API Reference -- classes, methods, models
- Environments -- built-in + importable benchmarks
- Verifier Protocol -- JSON spec, writing verifiers
- Integrations -- TRL, verl, OpenRLHF, lm-eval, HF Hub
- Sandbox Modes -- local vs Daytona vs auto
- Adversarial Testing -- reward hack detection
- Advanced Usage -- Gymnasium API, multi-turn, shaped rewards
- Architecture -- system design, module map
MIT
Runs on Daytona sandboxes