A comprehensive toolkit for training and running lightweight adapters for GGUF-based language models (ERNIE, Llama, Mistral, Phi-3, etc.) without modifying the base model.
Now with two adapter architectures (external and universal) and automatic type detection.
- Overview
- Installation
- Training (`train-adapter`)
- Inference (`run-inference`)
- Methodology
- Results & Performance
- Troubleshooting
- License
This toolkit implements External Logit Correction, a novel approach for domain adaptation of quantized LLMs. Instead of fine‑tuning the entire model (which is impossible with GGUF), we train a lightweight external adapter that refines the base model's logits. The adapter is:
- Lightweight: Typically 256‑512 dimensions vs. billions in base model
- Fast to train: Hours on consumer GPU vs. days for full fine‑tuning
- Transferable: Adapters trained on small models work on larger family members
- Non‑invasive: Base model remains completely unchanged
The system now supports two adapter architectures:
- External Adapter (original): full‑vocabulary correction, ideal for models where you can afford larger adapter dimensions.
- Universal Adapter (cross‑model): top‑k correction using semantic embeddings, designed to be more transferable across model sizes and quantizations.
Both architectures are auto‑detected during loading, so you never need to specify which one you are using – the toolkit reads the configuration or inspects the state dict.
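Auto-detection works by looking for architecture-specific parameter names in the saved weights. A minimal sketch of the idea in plain Python — the key names (`semantic_proj`, `logit_compressor`) are illustrative, not the toolkit's actual parameter names:

```python
# Sketch of adapter-type auto-detection from a saved state dict.
# Key names here are hypothetical; the real toolkit may use different
# names or read the adapter_type field from config.json instead.
def detect_adapter_type(state_dict_keys):
    """Guess the adapter architecture from its parameter names."""
    if any("semantic" in k for k in state_dict_keys):
        return "universal"   # top-k correction with semantic embeddings
    return "external"        # full-vocabulary correction

print(detect_adapter_type(["semantic_proj.weight", "attn.out.weight"]))   # universal
print(detect_adapter_type(["logit_compressor.weight", "attn.out.weight"]))  # external
```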
```bash
git clone https://github.com/ShotokanOSS/ggufForge.git
cd ggufForge
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```

Standard Installation:

```bash
pip install -U pip
pip install .
```

For CUDA Support (GPU Acceleration):

```bash
CMAKE_ARGS="-DLLAMA_CUDA=on" pip install .
```

For Development with Additional Tools:

```bash
pip install .[dev]
```

Verify the installation:

```bash
train-adapter --help
run-inference --help
```

Train lightweight adapters for GGUF models using streaming datasets.
The training script can automatically continue from an existing adapter repository – just pass its Hugging Face ID as --model.
Basic training (default: universal adapter) on an ERNIE model:

```bash
train-adapter --model unsloth/ERNIE-4.5-21B-A3B-Thinking-GGUF
```

Train an external adapter on Llama 2 with custom parameters:

```bash
train-adapter \
  --model TheBloke/Llama-2-7B-GGUF \
  --filename llama-2-7b.Q4_K_M.gguf \
  --adapter-type external \
  --adapter-dim 512 \
  --steps 5000 \
  --learning-rate 3e-5
```

Continue training an existing adapter from Hugging Face:

```bash
train-adapter --model my-username/my-adapter-repo --resume
```

| Option | Description | Default |
|---|---|---|
| `--model` | Required. HuggingFace repository (base model OR adapter repo) | - |
| `--base-model` | Explicit base model repository (if `--model` is an adapter) | None |
| `--filename` | Specific GGUF filename (if multiple in repo) | Auto-detect |
| `--context-size` | Context window size | 1024 |
| `--adapter-type` | Adapter architecture: `universal` or `external` | `universal` |
| `--adapter-dim` | Adapter hidden dimension | 256 |
| `--heads` | Attention heads in adapter | 8 |
| `--top-k` | Top-k tokens for universal adapter correction | 50 |
| `--semantic-dim` | Semantic embedding dimension (universal only) | 384 |
| Option | Description | Default |
|---|---|---|
| `--steps` | Number of training steps | 14000 |
| `--learning-rate` | Learning rate | 5e-5 |
| `--weight-decay` | Weight decay for AdamW | 0.01 |
| `--accumulation-steps` | Gradient accumulation steps | 32 |
| `--batch-size` | Batch size | 1 |
| `--seed` | Random seed | 42 |
| `--eval-steps` | Evaluate every N steps (0 = disable) | 0 |
| `--log-file` | CSV file to log training metrics | training_log.csv |
| `--save-every` | Save checkpoint every N steps | 1000 |
| Option | Description | Default |
|---|---|---|
| `--dataset` | HuggingFace dataset ID | prithivMLmods/Atlas-Think-Cot-12M |
| `--prompt-col` | Column name for prompts | problem |
| `--output-col` | Column name for responses | solution |
| `--val-samples` | Validation samples | 50 |
| `--max-length` | Maximum text length | None |
| Option | Description | Default |
|---|---|---|
| `--output-dir` | Checkpoint directory | checkpoints |
| `--hf-repo` | HF repo for upload | None |
| `--hf-private` | Make HF repo private | True |
| `--no-upload` | Skip HF upload | False |
| `--hf-token` | Hugging Face token (optional) | None |
| Option | Description |
|---|---|
| `--eval-only` | Only evaluate, no training |
| `--checkpoint` | Load specific checkpoint (overrides auto-load) |
| `--resume` | Resume training from latest checkpoint in `--output-dir` |
| `--verbose` | Detailed output |
| `--gpu-layers` | Number of GPU layers for base model (-1 = all) |
1. Fast universal adapter training on a dense model (Phi-3, 1000 steps):

```bash
train-adapter \
  --model microsoft/Phi-3.5-mini-instruct-GGUF \
  --steps 1000 \
  --learning-rate 5e-5 \
  --adapter-dim 256 \
  --adapter-type universal
```

2. Extended external adapter training on an ultra-low-bit MoE (ERNIE, 14000 steps):

```bash
train-adapter \
  --model unsloth/ERNIE-4.5-21B-A3B-Thinking-GGUF \
  --filename ERNIE-4.5-21B-A3B-Thinking-UD-Q2_K_XL.gguf \
  --adapter-type external \
  --steps 14000 \
  --adapter-dim 512 \
  --output-dir ernie-external-adapter
```

3. Custom dataset training with periodic evaluation:

```bash
train-adapter \
  --model TheBloke/Mistral-7B-Instruct-v0.1-GGUF \
  --dataset my-org/custom-dataset \
  --prompt-col instruction \
  --output-col response \
  --eval-steps 500 \
  --log-file my_log.csv \
  --hf-repo my-org/mistral-universal-adapter
```

4. Evaluation only:

```bash
train-adapter \
  --model unsloth/ERNIE-4.5-21B-A3B-Thinking-GGUF \
  --eval-only \
  --checkpoint checkpoints/adapter_final.pt
```

5. Resume interrupted training:

```bash
train-adapter \
  --model my-org/llama-adapter \
  --resume \
  --steps 20000  # new total steps
```

Run inference with or without adapters, compare models, or chat interactively.
The inference script automatically detects the adapter type (external or universal) from the loaded weights – no extra flags needed.
Single question with adapter (auto-detects type):

```bash
run-inference \
  --mode single \
  --question "What is machine learning?" \
  --adapter true
```

Interactive chat:

```bash
run-inference --mode chat
```

Compare base vs. adapter:

```bash
run-inference \
  --mode compare \
  --question "Explain quantum computing"
```

| Mode | Description | Best For |
|---|---|---|
| `single` | Single question/answer | Quick testing |
| `chat` | Interactive conversation with history & summaries | Dialog tasks |
| `compare` | Compare base vs. adapter side by side | Performance evaluation |
| `interactive` | Full menu system (change prompts, config on the fly) | Exploration |
| Option | Description | Default |
|---|---|---|
| `--adapter-repo` | HF repository for adapter | ShotokanJ/ERNIE-4.5-21B-A3B-Thinking-GGUF-finetune-Atlas-Think-Cot |
| `--base-repo` | HF repository for base model | unsloth/ERNIE-4.5-21B-A3B-Thinking-GGUF |
| `--gguf-filename` | Specific GGUF file | ERNIE-4.5-21B-A3B-Thinking-UD-Q2_K_XL.gguf |
| `--adapter` | Use adapter (true/false) | true |
| `--reasoning` | Use reasoning model (think-tags) | true |
| `--think-tags` | Enable think tags | true |
| `--summary` | Enable automatic response summaries | true |
| `--max-summary-tokens` | Max tokens for summary | 512 |
| Option | Description | Default |
|---|---|---|
| `--temperature` | Sampling temperature | 0.6 |
| `--min-p` | Min-P sampling threshold | 0.05 |
| `--repetition-penalty` | Repetition penalty | 1.1 |
| `--max-tokens` | Maximum new tokens | 6100 |
| `--context-size` | Context window | 8192 |
| `--adapter-window` | Cache window for external adapter | 2048 |
Note: For universal adapters, the top‑k value is fixed at 50 (as used during training). This is not configurable via CLI.
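Min-P sampling keeps only candidate tokens whose probability is at least `min_p` times the probability of the most likely token. A minimal illustration in plain Python (a simplified stand-in, not the toolkit's actual sampler):

```python
def min_p_filter(probs, min_p=0.05):
    """Return indices of tokens whose probability is >= min_p * max probability."""
    threshold = min_p * max(probs)
    return [i for i, p in enumerate(probs) if p >= threshold]

# With min_p = 0.05 the threshold is 0.05 * 0.70 = 0.035,
# so tokens 0, 1, and 2 survive and tokens 3 and 4 are pruned.
probs = [0.70, 0.20, 0.06, 0.03, 0.01]
print(min_p_filter(probs, min_p=0.05))  # → [0, 1, 2]
```

Lower `--min-p` values therefore widen the candidate pool (more diverse), higher values narrow it (more conservative).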
| Option | Description |
|---|---|
| `--question` / `-q` | Question text |
| `--file` / `-f` | Read question from file |
| `--system-prompt` / `-s` | Custom system prompt |
| `--output` / `-o` | Save response to file |
| `--verbose` / `-v` | Detailed output |
| `--no-progress` | Disable progress bar |
| Option | Description | Default |
|---|---|---|
| `--think-start-tag` | Think start tag | `<think>` |
| `--think-end-tag` | Think end tag | `</think>` |
| `--final-start-tag` | Final answer start tag | `<final_answer>` |
| `--final-end-tag` | Final answer end tag | `</final_answer>` |
| `--summary-start-tag` | Summary start tag | `<Summary>` |
| `--summary-end-tag` | Summary end tag | `</Summary>` |
1. Single question with custom parameters (adapter auto-detected):

```bash
run-inference \
  --mode single \
  --question "Explain the theory of relativity" \
  --temperature 0.7 \
  --max-tokens 1000 \
  --min-p 0.1 \
  --output response.txt
```

2. Chat with custom system prompt (adapter disabled):

```bash
run-inference \
  --mode chat \
  --system-prompt "You are a helpful physics tutor. Explain concepts simply." \
  --temperature 0.5 \
  --adapter false
```

3. Compare with file input:

```bash
run-inference \
  --mode compare \
  --file question.txt \
  --max-tokens 500 \
  --output comparison.json
```

4. Interactive menu mode (full control):

```bash
run-inference --mode interactive
```

5. Custom model configuration (different base + adapter):

```bash
run-inference \
  --mode single \
  --base-repo TheBloke/Llama-2-7B-GGUF \
  --gguf-filename llama-2-7b.Q4_K_M.gguf \
  --adapter-repo my-org/llama-universal-adapter \
  --context-size 4096 \
  --question "Tell me a story"
```

The original architecture – a single-block causal transformer that receives token IDs and the full base logits as input, and outputs correction logits for the entire vocabulary.
```
Input: [token_ids, base_logits] → Token Embedding + Logit Compressor → Additive Fusion → LayerNorm →
8-Head Causal Attention → Feed-Forward Network (4× expansion) → Output Head → Corrected Logits
```
- Vocabulary‑aware: operates on the full logit vector, allowing fine‑grained corrections.
- Cache‑compatible: can reuse KV‑cache for efficient generation.
- Best for: models where you can afford the extra memory of full‑vocabulary correction.
A more transferable architecture that only corrects the top‑k tokens of the base logits, using semantic embeddings of the token candidates. This makes the adapter less dependent on the exact vocabulary and more robust to model size changes.
```
Base logits → top-k (tokens + logits) → token IDs → semantic embeddings (via Sentence-Transformer) →
concatenate with logits → projection → Multi-head Self-Attention → FFN → correction scores for top-k tokens
```

The final logits become:

```
final_logits = base_logits + scatter_add(top_k_corrections)
```
- Cross‑model transfer: adapters trained on small models work on larger family members.
- Memory efficient: only processes top‑k tokens (e.g., 50) instead of the full vocabulary.
- Best for: large vocabularies, model families, and ultra‑low‑bit quantizations.
Both architectures are auto‑detected during loading: the toolkit inspects the state dict keys or the saved config.json and selects the correct inference path automatically.
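The scatter-add step of the universal adapter can be sketched in plain Python — a simplified stand-in for the toolkit's tensor implementation, with illustrative names:

```python
def apply_topk_correction(base_logits, k, corrections):
    """Add adapter corrections to the k highest base logits,
    leaving the rest of the vocabulary untouched."""
    # Indices of the top-k tokens by base logit (the adapter's inputs).
    topk_ids = sorted(range(len(base_logits)),
                      key=lambda i: base_logits[i], reverse=True)[:k]
    final = list(base_logits)
    # scatter_add: each correction lands only on its top-k position.
    for idx, corr in zip(topk_ids, corrections):
        final[idx] += corr
    return final

base = [2.0, 0.5, 1.5, -1.0]
print(apply_topk_correction(base, k=2, corrections=[0.5, -0.5]))
# → [2.5, 0.5, 1.0, -1.0]
```

Because only `k` positions (50 by default) receive corrections, the adapter never has to materialize a full-vocabulary output head, which is what makes it memory efficient and vocabulary agnostic.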
| Model Type | Training Steps | GPU Time | Relative Improvement |
|---|---|---|---|
| Dense Q4/Q5 | 1,000 | ~1 hour | 2.7‑4.5% |
| Ultra‑low‑bit MoE | 9,000‑14,000 | ~4‑6 hours | 11‑21% |
Why it's efficient:
- Tiny parameter count: ~1M vs. billions in base model
- Short training: Hours instead of days
- Transfer learning: Train small → use on large
- Streaming data: No dataset download needed
Universal adapters show remarkable transfer capabilities:
| Train On | Use On | Performance Retention |
|---|---|---|
| Phi‑3 3.8B | Phi‑3 14B | 93% of improvement |
| Llama‑3.2 1B | Llama‑3.2 3B | 82% of improvement |
| Gemma‑2 2B | Gemma‑2 9B | 59% of improvement |
Requirements for transfer:
- Same model family
- Compatible vocabulary
- Similar quantization scheme
| Model | Quantization | Base PPL | Adapted PPL | Improvement | Adapter Type | Steps |
|---|---|---|---|---|---|---|
| Phi‑3 3.8B | Q4_K_M | 2.89 | 2.76 | +4.5% | Universal | 1,000 |
| Llama‑3.2 1B | Q4_K_M | 4.37 | 4.20 | +3.9% | Universal | 1,000 |
| ERNIE‑4.5‑21B | UD‑Q2_K_XL | 4.39 | 3.46 | +21.2% | External | 14,000 |
| Qwen3‑30B‑A3B | UD‑IQ1_S | 3.06 | 2.71 | +11.4% | Universal | 9,000 |
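Assuming the improvement column is the relative perplexity reduction, (base − adapted) / base, the table values can be reproduced directly:

```python
def ppl_improvement(base_ppl, adapted_ppl):
    """Relative perplexity reduction, in percent."""
    return 100 * (base_ppl - adapted_ppl) / base_ppl

for name, base, adapted in [
    ("Phi-3 3.8B", 2.89, 2.76),
    ("ERNIE-4.5-21B", 4.39, 3.46),
]:
    print(f"{name}: +{ppl_improvement(base, adapted):.1f}%")
# → Phi-3 3.8B: +4.5%
# → ERNIE-4.5-21B: +21.2%
```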
- Greater degradation, greater improvement: Ultra‑low‑bit models benefit most.
- Rapid convergence: Dense models need only 1,000 steps.
- Universal adapter transfers: Improvements hold across model sizes.
- Consistent gains: Improvements are robust across validation sets.
1. Model loading fails:

```bash
# Check available files
run-inference --base-repo TheBloke/Llama-2-7B-GGUF --verbose

# Specify the exact filename
run-inference --base-repo TheBloke/Llama-2-7B-GGUF --gguf-filename llama-2-7b.Q4_K_M.gguf
```

2. Out of memory:

```bash
# Reduce context size
run-inference --context-size 2048

# Offload fewer layers to the GPU
run-inference --gpu-layers 20

# For training, lower the accumulation steps or adapter dimension
train-adapter --adapter-dim 128 --accumulation-steps 16
```

3. Slow generation:

```bash
# Reduce adapter window (external adapter only)
run-inference --adapter-window 1024

# Use base model only
run-inference --adapter false

# Disable summaries
run-inference --summary false
```

4. Poor quality responses:

```bash
# Adjust temperature
run-inference --temperature 0.3  # More focused
run-inference --temperature 0.9  # More creative

# Adjust min-P
run-inference --min-p 0.01  # More diverse
run-inference --min-p 0.2   # More conservative
```

5. Adapter type not recognized:

The toolkit auto-detects the type from the state dict. If detection fails, ensure the adapter was saved with the correct config.json (including the adapter_type field). You can also inspect the weights manually:

```bash
python -c "import torch; sd = torch.load('adapter_final.pt', map_location='cpu'); print(sd.keys())"
```

Enable verbose mode:

```bash
run-inference --verbose --mode single --question "Test"
```

Check model configuration:

```bash
train-adapter --model unsloth/ERNIE-4.5-21B-A3B-Thinking-GGUF --eval-only --verbose
```

Monitor GPU usage:

```bash
nvidia-smi -l 1  # Linux
# or use --no-progress to reduce overhead during inference
run-inference --no-progress --mode chat
```

This project is licensed under the Apache License 2.0. See LICENSE for details.
Key Points:
- Free for commercial and research use
- Attribution required
- No warranty provided
- Patent rights granted
For the research paper, detailed methodology, and extended results, see the Study Note. Note: transfer works reliably within a single model family; transfer between different model families is still a work in progress.