nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
React to this comment with an emoji to vote for nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 to be supported by Inference Providers.
(optional) Which providers are you interested in? (Novita, Hyperbolic, Together…)
If anyone wants to test Nemotron on real workloads: Doubleword just made Nemotron 3 Super (120B) FREE during GTC.
Useful for eval pipelines, dataset generation, or large-scale async inference.
You can run it here for free: https://app.doubleword.ai
I've been running Nemotron 3 Super 120B A12B (MoE, 12B active) and wanted to share real serving benchmarks from my POC setup.
Setup: Single node, 16 concurrent agents, 128K context window
Results (POC numbers; a tuned production setup is expected to improve on these by 2x+):
- Single request TTFT: ~2s median
- 16 agents × 8 turns (128 requests): 100% success, TTFT ~5.7s median
- Burst (10 simultaneous) → steady: TTFT ~3.5s median
- Mixed workload (5K + 40K input): TTFT ~5.5s median
- Zero failures across all test scenarios
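For anyone who wants to reproduce this style of measurement, here is a minimal sketch of how median TTFT under concurrency can be collected with asyncio. The `mock_stream` backend is a hypothetical stand-in (not part of my setup); in a real run, `send_request` would open a streaming completion against your Nemotron serving endpoint and yield tokens as they arrive.

```python
import asyncio
import statistics
import time

async def measure_ttft(send_request):
    """Return seconds until the first streamed token arrives."""
    start = time.monotonic()
    stream = send_request()  # must return an async iterator of tokens
    async for _first_token in stream:
        return time.monotonic() - start
    raise RuntimeError("stream ended before any token arrived")

async def run_concurrent(send_request, n_agents):
    """Fire n_agents requests at once and return the median TTFT."""
    ttfts = await asyncio.gather(
        *(measure_ttft(send_request) for _ in range(n_agents))
    )
    return statistics.median(ttfts)

# Hypothetical mock backend standing in for a real streaming endpoint:
# first token after ~50 ms, then a few more tokens.
async def mock_stream():
    await asyncio.sleep(0.05)
    for tok in ["Hello", ",", " world"]:
        yield tok

median_ttft = asyncio.run(run_concurrent(mock_stream, n_agents=16))
print(f"median TTFT over 16 concurrent requests: {median_ttft:.3f}s")
```

Swapping `mock_stream` for a real streaming client call (e.g. an OpenAI-compatible chat completion with `stream=True`) turns this into the single-burst scenario above; the per-turn loop for the 16 agents × 8 turns case is a straightforward extension.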
The model handles multi-agent orchestration surprisingly well. MoE routing with 12B active parameters keeps per-token inference cost low while retaining quality close to a dense model of this size.
I'm currently building this into a service and experimenting with flat-rate inference models.
If anyone is working on Nemotron serving or multi-agent workloads, I'd love to compare notes or share more detailed benchmarks.