nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16

#8661
by tjkim02 - opened

React to this comment with an emoji to vote for nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 to be supported by Inference Providers.

(optional) Which providers are you interested in? (Novita, Hyperbolic, Together…)

If anyone wants to test Nemotron on real workloads - Doubleword just made Nemotron 3 Super (120B) FREE during GTC.

Useful for eval pipelines, dataset generation, or large-scale async inference.

You can run it here for free: https://app.doubleword.ai

I've been running Nemotron 3 Super 120B A12B (MoE, 12B active) and wanted to share real serving benchmarks from my POC setup.

Setup: Single node, 16 concurrent agents, 128K context window

Results (POC; production throughput expected 2x+ higher):

  • Single request TTFT: ~2s median
  • 16 agents × 8 turns (128 requests): 100% success, TTFT ~5.7s median
  • Burst (10 simultaneous) → steady: TTFT ~3.5s median
  • Mixed workload (5K + 40K input): TTFT ~5.5s median
  • Zero failures across all test scenarios
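If you want to reproduce numbers like these against your own endpoint, the core measurement is simple: fire concurrent streamed requests and record time-to-first-token for each. A minimal sketch of that harness is below; `fake_stream` is a stand-in (assumption, not the real API) that you would replace with a streaming call to your OpenAI-compatible endpoint.

```python
import asyncio
import statistics
import time

async def measure_ttft(stream_factory):
    """Return seconds until the first token arrives from one streamed request."""
    start = time.perf_counter()
    async for _token in stream_factory():
        return time.perf_counter() - start
    return float("inf")  # stream ended without producing a token

async def run_agents(stream_factory, n_agents=16, turns=8):
    """Fire n_agents * turns concurrent requests; return median TTFT."""
    tasks = [measure_ttft(stream_factory) for _ in range(n_agents * turns)]
    ttfts = await asyncio.gather(*tasks)
    return statistics.median(ttfts)

# Stand-in stream: swap in a real streaming client call here.
async def fake_stream(delay=0.01):
    await asyncio.sleep(delay)  # simulated time before first token
    for tok in ["hello", "world"]:
        yield tok

median = asyncio.run(run_agents(lambda: fake_stream()))
print(f"median TTFT: {median:.3f}s")
```

Because all requests share one event loop, this measures TTFT under genuine concurrency pressure, which is what matters for multi-agent workloads rather than single-request latency.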

The model handles multi-agent orchestration surprisingly well. MoE with 12B active keeps inference efficient while maintaining 120B-level quality.
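The efficiency claim comes down to simple arithmetic: decode compute scales with *active* parameters, not total. Using the figures from the post (120B total, 12B active) and the common ~2 FLOPs-per-parameter-per-token rule of thumb:

```python
# Rough per-token decode FLOPs scale with active parameters, not total.
# Figures from the post: 120B total, 12B active per token (MoE routing).
total_params = 120e9
active_params = 12e9

flops_per_token_moe = 2 * active_params    # ~2 FLOPs per active param per token
flops_per_token_dense = 2 * total_params   # hypothetical dense 120B for comparison

ratio = flops_per_token_dense / flops_per_token_moe
print(ratio)  # → 10.0
```

So per-token compute is roughly a tenth of an equivalent dense 120B model, which is why single-node serving at 128K context stays practical.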

I'm currently building this into a service and experimenting with flat-rate inference models.

If anyone is working on Nemotron serving or multi-agent workloads, would love to compare notes or share more detailed benchmarks.
