RL fine-tuning workbench for local LLMs.
Chat with an Ollama model through a logging proxy, collect conversation data in SQLite, then fine-tune with SFT, DPO, or PPO — all from one CLI.
- Proxy — an OpenAI-compatible server (
tuneloop serve) sits in front of Ollama and logs every request/response totuneloop.db - Collect data — chat through the proxy (interactive CLI or any OpenAI-compatible client) to build up training conversations
- Export — convert conversations to SFT or DPO format (JSONL)
- Train — QLoRA fine-tuning with TRL (SFT, DPO, or PPO with a learned reward model)
- Publish — merge adapter, convert to GGUF, register as an Ollama model
- Evaluate — blind LLM-as-judge comparison between base and fine-tuned models
# Install
uv sync # proxy, CLI, export
uv sync --extra train # adds PyTorch, TRL, PEFT, etc.
# Pull a base model
ollama pull qwen2.5:7b
# Start the proxy
tuneloop serve
# Generate training data (in another terminal)
uv run python scripts/generate_poems.py --count 50
# Train
tuneloop train --method sft| Command | Description |
|---|---|
tuneloop serve |
Start the logging proxy (default port 8000) |
tuneloop chat |
Interactive chat through the proxy with streaming |
tuneloop sessions |
List all chat sessions |
tuneloop messages <id> |
Show messages for a session (prefix match) |
tuneloop stats |
Show database statistics |
tuneloop export |
Export conversations to SFT or DPO JSONL |
tuneloop train |
Run QLoRA fine-tuning (SFT, DPO, or PPO) |
tuneloop train-reward-model |
Train a scalar reward model from preference pairs |
tuneloop runs |
List training runs |
tuneloop publish |
Merge adapter → GGUF → Ollama model |
tuneloop judge |
Blind A/B evaluation between two models |
tuneloop experiment |
Run full PPO vs DPO experiment end-to-end |
Run tuneloop <command> --help for detailed options.
The proxy is a FastAPI app that implements the OpenAI /v1/chat/completions endpoint (including streaming). It forwards requests to Ollama's local API and logs both sides of every conversation to a SQLite database (tuneloop.db) via SQLModel. Session tracking uses a custom x-session-id header — any OpenAI-compatible client can generate training data just by pointing at localhost:8000.
Training uses 4-bit QLoRA (nf4, double quantization, bfloat16 compute) with LoRA adapters on all attention and MLP projections. Everything fits on a 24GB GPU.
See docs/experiments.md for the full workflow: data generation, export strategies, training options, PPO setup, publishing, and evaluation — plus notes from debugging PPO's KL divergence under 4-bit quantization.