Skip to content

DeepGym/deepgym

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 

Repository files navigation

DeepGym

Reward signals for RL code training. Sandbox it, verify it, score it.

PyPI Python License Wiki


Your model writes code. DeepGym runs it in an isolated sandbox, executes tests against it, and returns a structured reward signal -- per-test-case scores, shaped reward components, execution metrics -- that plugs straight into TRL, verl, OpenRLHF, or your own GRPO/DAPO/PPO loop.

DeepSeek-R1 deliberately avoided neural reward models for code because they're susceptible to reward hacking at scale. DAPO, QwQ-32B, and Open-R1 followed the same path: rule-based, execution-verified rewards. That's what DeepGym provides -- deterministic, execution-based scoring with per-test granularity, running in sandboxed containers so untrusted model outputs can't touch your infrastructure.

                          reward signal
               +------------------------------------+
               |                                    |
               v                                    |
           +-------+     +----------+     +--------------------+
           | Model | --> | DeepGym  | --> |      Sandbox       |
           +-------+     +----------+     | (Daytona / local)  |
               ^              |           +--------------------+
               |              |                    |
               |              v                    v
               |         +-----------+       +----------+
               |         |  RunResult |<-----| Verifier |
               |         +-----------+       +----------+
               |           |                       |
               |           | score: 0.85           | JSON stdout
               |           | passed: false         | per-test cases
               |           | cases: [...]          | reward components
               |           v
           +-------------------+
           |   Training Loop   |
           | (TRL/verl/ORLHF)  |
           +-------------------+

Install

pip install deepgym
More install options
# With Daytona sandbox support
pip install deepgym[daytona]

# With HuggingFace Hub integration
pip install deepgym[hf]

# With lm-evaluation-harness
pip install deepgym[lm-eval]

# Everything (dev + daytona + hf + lm-eval)
pip install deepgym[all]

# From source
git clone https://github.com/DeepGym/deepgym.git
cd deepgym
pip install -e ".[all]"

Quick Start

from deepgym import DeepGym, load_environment

dg = DeepGym(mode='local')
env = load_environment('coin_change')

solution = '''
def coin_change(coins, amount):
    dp = [float('inf')] * (amount + 1)
    dp[0] = 0
    for coin in coins:
        for x in range(coin, amount + 1):
            dp[x] = min(dp[x], dp[x - coin] + 1)
    return dp[amount] if dp[amount] != float('inf') else -1
'''

result = dg.run(env, model_output=solution)
print(result.score)    # 1.0
print(result.passed)   # True
print(result.cases)    # per-test breakdown: which tests passed, which failed

How it works

  Model            DeepGym             Sandbox              Verifier
    |                 |                   |                     |
    |  solution code  |                   |                     |
    |---------------->|                   |                     |
    |                 |  create sandbox   |                     |
    |                 |------------------>|                     |
    |                 |  upload files     |                     |
    |                 |------------------>|                     |
    |                 |                   |  python verifier.py |
    |                 |                   |-------------------->|
    |                 |                   |                     | run tests
    |                 |                   |                     | (seeded)
    |                 |                   |  JSON stdout        |
    |                 |                   |<--------------------|
    |                 |  stdout + stderr  |                     |
    |                 |<------------------|                     |
    |                 |  parse JSON       |                     |
    |  RunResult      |                   |                     |
    |<----------------|                   |                     |
    |                 |                   |                     |

The verifier returns structured JSON: a 0.0-1.0 score, pass/fail, per-test-case breakdown, and optional shaped reward components (correctness, efficiency, style -- whatever you define). The per-test granularity is what makes this useful for training. Binary pass/fail is a sparse signal. Knowing that 12 out of 14 tests passed, and specifically which two failed, gives the optimizer something to work with -- this is the same approach used by CodePRM, PRIME, and Posterior-GRPO, but without needing a separate process reward model.

Why execution-based rewards

The field has largely converged here. A Practitioner's Guide to Multi-Turn Agentic RL found execution-based unit tests hit 22% success on SWE-Gym vs 4.2% for sparse binary and 7-9% for model-based judges (including GPT-4.1). DeepSeek-R1, DAPO, and QwQ-32B all use rule-based execution rewards rather than neural reward models.

The catch is infrastructure. You need sandboxed execution (you can't run untrusted model output on your training nodes), deterministic scoring (GRPO computes advantages across completions -- non-determinism breaks this), and structured output (binary pass/fail is too sparse for GRPO/DAPO to learn from). DeepGym handles all three.

What you get

  • Execution-based verification -- the approach DeepSeek-R1, DAPO, and QwQ-32B converged on, not neural reward models
  • Per-test reward signals -- test-case-level scores like CodePRM and PRIME provide, without training a separate PRM
  • Shaped reward components -- reward_components dict for multi-signal composition (correctness + efficiency + style), similar to Posterior-GRPO's gated reward approach
  • Deterministic seeded scoring -- same solution, same score, every time. GRPO and DAPO both require this
  • Sandboxed execution via Daytona -- container isolation for untrusted code, same pattern as verl's Sandbox Fusion and DeepSWE's 512-container setup
  • Reward hack detection -- 6 adversarial attack strategies. Anthropic's Nov 2025 paper showed reward hacking during RL causes emergent misalignment. Check your verifiers before you train
  • 24 built-in environments + 2,350+ importable benchmarks (HumanEval, MBPP, EvalPlus, BigCodeBench)
  • Drop-in integrations -- TRL GRPOTrainer, verl compute_score, OpenRLHF reward server, lm-eval tasks, HF Hub
  • Batch scoring -- N completions in parallel with run_batch(), async client with semaphore-based concurrency
  • Gymnasium API -- reset() / step() for multi-turn agent training, same interface as Agent-R1 and VerlTool
  • REST API -- FastAPI server with async jobs and API key auth

Usage

Score a single solution

from deepgym import DeepGym, load_environment

dg = DeepGym(mode='local')
env = load_environment('two_sum')

result = dg.run(env, model_output='def two_sum(nums, target): ...')
print(result.score)              # 0.85
print(result.passed)             # False
print(result.reward_components)  # {'correctness': 0.85, 'efficiency': 0.9}

Batch scoring for GRPO

Generate N completions, score them all, compute advantages:

solutions = [model.generate(prompt) for _ in range(8)]
batch = dg.run_batch(env, solutions, max_parallel=8)

scores = [r.score for r in batch.results]
mean = sum(scores) / len(scores)
std = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5
advantages = [(s - mean) / (std + 1e-8) for s in scores]

TRL

from deepgym.integrations.trl import make_trl_reward_fn
from trl import GRPOTrainer

reward_fn = make_trl_reward_fn(env)
trainer = GRPOTrainer(model=model, reward_funcs=[reward_fn])
trainer.train()

verl

from deepgym.integrations.verl import make_verl_compute_score

compute_score = make_verl_compute_score(env)
# In verl config: custom_reward_function.path = "your_reward_module.py"

OpenRLHF

from fastapi import FastAPI
from deepgym.integrations.openrlhf import create_openrlhf_router

app = FastAPI()
app.include_router(create_openrlhf_router(env, dg))
# uvicorn app:app --port 8000
# POST /reward/score {"prompts": [...], "outputs": [...]} -> {"rewards": [...]}

lm-evaluation-harness

python -c "from deepgym.integrations.lm_eval import register_deepgym_tasks; register_deepgym_tasks()"

lm_eval --model hf \
  --model_args pretrained=Qwen/Qwen2-0.5B-Instruct \
  --tasks deepgym_coin_change,deepgym_two_sum

HuggingFace Hub

from deepgym.integrations.hf import push_environment_to_hub, load_environment_from_hub

push_environment_to_hub(env, repo_id='your-org/deepgym-coin-change', env_name='coin_change')

# load from anywhere
env = load_environment_from_hub('your-org/deepgym-coin-change')

Advanced Examples

Custom verifiers

Write your own verifier inline. The string becomes the body of a function that gets (solution_path, test_cases_path=None). Return a float, bool, or dict -- the wrapper normalizes it to JSON.

from deepgym import DeepGym, Environment

dg = DeepGym(mode='local')
env = Environment(
    task='Write a function `add(a, b)` that returns the sum of two numbers.',
    verifier_code=(
        'import importlib.util\n'
        'spec = importlib.util.spec_from_file_location("sol", solution_path)\n'
        'mod = importlib.util.module_from_spec(spec)\n'
        'spec.loader.exec_module(mod)\n'
        'cases = [(2, 3, 5), (0, 0, 0), (-1, 1, 0), (100, 200, 300)]\n'
        'passed = sum(1 for a, b, exp in cases if mod.add(a, b) == exp)\n'
        'return passed / len(cases)\n'
    ),
)

result = dg.run(env, model_output='def add(a, b):\n    return a + b\n')
# score: 1.0, passed: True

Per-test reward shaping

Instead of just a single number, you get scores for each individual test case. Useful for denser training signals.

result = dg.run(env, model_output=solution)

for case in result.cases:
    print(f"{case.id}: {'PASS' if case.passed else 'FAIL'} "
          f"(input: {case.input_summary}, expected: {case.expected_summary})")

# or through the reward function
from deepgym.integrations.reward import RewardFunction
reward_fn = RewardFunction(env, max_parallel=8)

per_test = reward_fn.per_test_rewards(solutions)
# [{'test_0': 1.0, 'test_1': 0.0, 'test_2': 1.0, 'overall': 0.67}, ...]

shaped = reward_fn.shaped_rewards(solutions)
# [{'correctness': 0.8, 'efficiency': 0.9}, ...]

Async batch processing

When you need throughput, use the async client:

import asyncio
from deepgym import AsyncDeepGym, load_environment

async def score_all():
    dg = AsyncDeepGym(mode='daytona')
    envs = ['coin_change', 'two_sum', 'climbing_stairs']

    tasks = [
        dg.run(load_environment(name), solutions[name])
        for name in envs
    ]
    results = await asyncio.gather(*tasks, return_exceptions=True)

    for name, result in zip(envs, results):
        if isinstance(result, Exception):
            print(f'{name}: ERROR')
        else:
            print(f'{name}: {result.score:.2f}')

asyncio.run(score_all())

Gymnasium-style API

from deepgym.gym import DeepGymEnv

gym_env = DeepGymEnv(environment=env, max_steps=3)
obs = gym_env.reset()
obs, reward, done, info = gym_env.step('def coin_change(coins, amount): ...')

Audit verifiers for reward hacking

Anthropic found that models which learn to reward-hack during RL generalize to alignment faking and sabotage. Lilian Weng's analysis documents models rewriting unit tests, modifying reward-computing code, and gaming complexity metrics. Check your verifiers before training:

from deepgym.adversarial import AdversarialTester

tester = AdversarialTester(dg, pass_threshold=0.5)
report = tester.test(env, strategies=['empty', 'hardcoded', 'trivial', 'overflow'])

print(f'Exploits found: {report.exploits_found}/{report.attacks_run}')
print(f'Robust: {report.is_robust}')
deepgym audit --verifier verifier.py --task "..." --strategies empty hardcoded trivial

Six attack strategies: empty/null code, hardcoded outputs, trivial placeholders, numeric overflow (NaN/Inf), pattern matching against test structure, and LLM-generated adversarial code. The auditor also analyzes verifier source for anti-patterns (static inputs, few test cases, no type validation) and assigns a risk score.

Environments

Built-in (24)

Coding (20):

Difficulty Environments
Easy fizzbuzz, reverse_string, palindrome_check, anagram_check, valid_parentheses, python_sorting, string_manipulation, two_sum
Medium coin_change, climbing_stairs, house_robber, rotate_array, remove_duplicates, max_subarray, roman_to_integer, matrix_spiral, longest_consecutive, group_anagrams, top_k_frequent, merge_intervals, binary_search
Hard longest_common_subsequence, level_order_traversal

Computer-use (2): file_organizer, cli_task

Tool-use (2): api_request, data_pipeline

Importable benchmarks (2,350+)

python scripts/import_humaneval.py      # 164 problems
python scripts/import_mbpp.py           # 500 problems
python scripts/import_evalplus.py       # HumanEval+ (80x more tests) + MBPP+
python scripts/import_bigcodebench.py   # 1,140 problems

Verifier protocol

Verifiers are standalone scripts that print JSON to stdout. No SDK, no imports from DeepGym, any language works. This is deliberate -- same philosophy as DeepSeek-R1's rule-based rewards. Keep the verifier simple and auditable.

{
  "schema_version": "1.0",
  "score": 0.85,
  "passed": true,
  "details": "12/14 tests passed",
  "cases": [
    {"id": "test_0", "passed": true, "score": 1.0, "input_summary": "coins=[1,2,5] amount=11"},
    {"id": "test_1", "passed": false, "score": 0.0, "error": "expected 3, got -1"}
  ],
  "reward_components": {"correctness": 0.85, "efficiency": 0.92},
  "seed": 42
}

Three levels of reward signal, depending on how much you want from your verifier:

  1. Binary -- just score and passed. Equivalent to what most RLVR setups use.
  2. Per-test -- add cases for test-case-level granularity. The model learns which tests it gets right, not just whether everything passed. This is what PRIME and CodePRM provide through process reward models, but here it comes directly from execution.
  3. Multi-signal -- add reward_components for shaped rewards. Compose correctness, efficiency, and style signals with custom weights, like Posterior-GRPO's format + rule + thinking reward composition.

Simple verifiers that return a float or bool get auto-wrapped to this format. Full spec in the wiki.

Architecture

+------------------------------------------------------------------+
|                      TRAINING FRAMEWORKS                          |
|   TRL (HuggingFace)  |  verl (ByteDance)  |  OpenRLHF  | Custom |
+------------------------------------------------------------------+
                               |
                      completions (code)
                               |
                               v
+------------------------------------------------------------------+
|                           DEEPGYM                                 |
|                                                                   |
|  +---------------------+    +----------------------------------+  |
|  | Python Client       |    | Environment Registry             |  |
|  |   DeepGym (sync)    |    |   24 built-in envs               |  |
|  |   AsyncDeepGym      |    |   HumanEval / MBPP / EvalPlus   |  |
|  +---------------------+    |   BigCodeBench / HF Hub          |  |
|            |                 +----------------------------------+  |
|            v                                                      |
|  +---------------------+    +----------------------------------+  |
|  | Verifier Engine      |    | Adversarial Tester              |  |
|  |   template wrapping  |    |   6 attack strategies           |  |
|  |   JSON protocol      |    |   reward hack detection         |  |
|  +---------------------+    +----------------------------------+  |
|            |                                                      |
+------------------------------------------------------------------+
             |
             v
+------------------------------------------------------------------+
|                         EXECUTION LAYER                           |
|                                                                   |
|   +-------------------------+  +------------------------------+   |
|   | LocalExecutor           |  | DaytonaSandbox               |   |
|   | (subprocess, no isolation) | (container, full isolation)  |   |
|   +-------------------------+  +------------------------------+   |
|                                                                   |
+------------------------------------------------------------------+
             |
             v
     RunResult { score, passed, cases, reward_components }

Three modes: local (subprocess, no deps, no isolation), daytona (container isolation), auto (tries Daytona, falls back to local). Use local for dev, Daytona for anything untrusted. The same Daytona infrastructure runs 500 sandboxes in parallel for TRL GRPO training with sub-200ms cold starts.

Where DeepGym fits

You're probably using one of these:        DeepGym is this layer:

+-------------------------------------+
| Training Framework                  |    TRL GRPOTrainer, verl, OpenRLHF,
| (policy optimization)               |    rLLM, or your own PPO/GRPO loop
+-------------------------------------+
                  |
                  | "score these N completions"
                  v
+-------------------------------------+
| Reward Infrastructure               |    <-- DeepGym
| (execution, verification, scoring)  |
+-------------------------------------+
                  |
                  | sandbox lifecycle
                  v
+-------------------------------------+
| Compute Isolation                   |    Daytona containers, local subprocess
| (run untrusted code safely)         |
+-------------------------------------+

Other projects in this space: verl uses Sandbox Fusion for code verification. DeepSWE runs 512 Docker containers via rLLM. SWE-Gym and R2E-Gym provide execution-based environments for SWE tasks. DeepGym wraps the same pattern -- sandboxed execution + structured reward output -- into a single pip install with drop-in reward functions for the major frameworks.

CLI

# Run one environment
deepgym run --task task.md --verifier verifier.py --solution solution.py

# Batch eval
deepgym eval --suite medium --solutions-dir ./solutions/ --max-parallel 100

# Audit a verifier
deepgym audit --verifier verifier.py --task "..." --strategies empty hardcoded trivial

# API server (dev)
DEEPGYM_NO_AUTH=true deepgym serve --host 127.0.0.1 --port 8000 --allow-local-exec

# API server (production)
DEEPGYM_API_KEY=your-key DAYTONA_API_KEY=your-key deepgym serve --port 8000

Daytona setup

Self-hosted (local Docker)
git clone https://github.com/daytonaio/daytona
cd daytona
docker compose -f docker/docker-compose.yaml up -d
# Dashboard: http://localhost:3000 ([email protected] / password)
export DAYTONA_API_URL=http://localhost:3000
export DAYTONA_API_KEY=your-local-key
Daytona Cloud
  1. Sign up at app.daytona.io
  2. Grab your API key from the dashboard
export DAYTONA_API_KEY=your-cloud-key

Development

pip install -e ".[all]"
pytest                      # 227 tests
ruff check src/             # lint
ruff format src/            # format

Docs

Full docs on the GitHub Wiki:

License

MIT


Runs on Daytona sandboxes