
Non-record: 10L Int6 QAT + SmearGate + SWA (val_bpb=1.1575) #273

Open

dentity007 wants to merge 1 commit into openai:main from NathanMaine:submission/10L-SmearGate-SWA-NathanMaine

Conversation

@dentity007

Summary

val_bpb = 1.1575 (single seed 1337, self-verified)

Builds on @baudrillardsgh0st's technique stack (PR #194). Contribution: a 10-layer configuration that trades one layer for improved step throughput (9,156 steps vs 7,472 at 11L), informed by systematic analysis across 17 experiments.

  • 10 layers, 512 dim, 8 heads, 4 KV heads, 3x MLP
  • Int6 QAT (STE), per-dim SmearGate, SWA/50, Muon WD=0.038
  • Sliding window eval stride=64, zstd-22
  • 14.73MB artifact (1.27MB headroom)
  • 9,156 steps at 65ms/step on 8×H100
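The PR itself does not reproduce the quantization code here, but the "Int6 QAT (STE)" bullet can be sketched as symmetric 6-bit fake quantization with a straight-through estimator. This is a minimal illustration under assumed conventions (per-tensor absmax scaling, one code dropped to keep the range symmetric); the actual `train_gpt.py` may differ in scaling granularity and range handling.

```python
import torch

class Int6FakeQuant(torch.autograd.Function):
    """Symmetric int6 fake-quantization with a straight-through estimator (STE)."""

    @staticmethod
    def forward(ctx, w):
        # 6-bit symmetric range: [-31, 31] (one code dropped to keep symmetry)
        qmax = 2 ** (6 - 1) - 1  # 31
        scale = w.detach().abs().max().clamp(min=1e-8) / qmax
        q = (w / scale).round().clamp(-qmax, qmax)
        return q * scale  # dequantized values, so downstream ops stay in float

    @staticmethod
    def backward(ctx, grad_out):
        # STE: pretend round() is the identity and pass gradients through
        return grad_out

w = torch.randn(64, 64, requires_grad=True)
w_q = Int6FakeQuant.apply(w)
w_q.sum().backward()
```

Because the backward pass ignores the rounding, the weights keep receiving useful gradients while the forward pass always sees values on the 63-level int6 grid.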

Key finding

10L outperforms 11L under the 10-minute wall-clock constraint. The faster step time (65ms vs 80ms) yields 22% more training steps, more than compensating for the reduced model capacity.
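A back-of-the-envelope check of the claim, using only the rounded figures quoted in this PR (65 ms and 80 ms per step, 9,156 vs 7,472 steps):

```python
# Wall-clock tradeoff cited above: 10L at 65 ms/step vs 11L at 80 ms/step.
budget_s = 600  # 10-minute wall-clock limit

steps_10l, ms_10l = 9156, 65
steps_11l, ms_11l = 7472, 80

extra_steps = steps_10l / steps_11l - 1       # fractional step gain, ~0.22
time_10l = steps_10l * ms_10l / 1000          # ~595 s, inside the budget
time_11l = steps_11l * ms_11l / 1000          # ~598 s, also near the limit
```

Both configurations nearly saturate the 600 s budget; the 10L run simply converts its faster steps into roughly 22% more optimizer updates within the same window.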

Submission checklist

  • val_bpb and submission.json included
  • Artifact under 16MB (14.73MB)
  • Wallclock < 600s on 8×H100
  • Train log included
  • Reproducible train_gpt.py included
  • README with detailed explanation

