[Image: NVIDIA GB200 NVL72 rack-scale system accelerating OpenAI gpt-oss models for high-throughput inference]

How NVIDIA Achieved 1.5 Million Tokens Per Second: Boosting OpenAI gpt-oss Models from Cloud to Edge

By Zakariae BEN ALLAL · September 15, 2025

What does it take to deploy AI models at an exceptional scale without overstretching your budget or running into latency issues? According to NVIDIA, the key lies in a well-integrated stack that combines their latest Blackwell-era systems with carefully optimized inference software. Recent reports from the company highlight an impressive performance of up to 1.5 million tokens per second (TPS) for OpenAI’s gpt-oss models running on the GB200 NVL72 rack-scale platform, demonstrating the ability to efficiently run these models from the cloud to edge devices.

Why This Matters

As organizations increasingly integrate generative AI into their operations, inference, rather than training, becomes the primary driver of cost and performance constraints. Tokens per second (TPS) is a crucial metric for capacity planning, and reducing latency at scale improves user experience and lowers operational expenses. Sustaining seven-figure TPS on a single rack signals a practical shift: companies can serve large fleets of chat, code, and agent workloads with predictable cost and performance.

NVIDIA’s Announcement

NVIDIA revealed that its GB200 NVL72 system has recorded up to 1.5 million TPS on OpenAI’s gpt-oss models, utilizing a robust inference setup comprising NVIDIA TensorRT-LLM, Triton Inference Server, and NIM microservices. The company highlighted the seamless developer and deployment experience across cloud environments, on-premise data centers, and edge devices, allowing teams to transition models from prototype to production with minimal friction.

These results coincide with the larger rollout of NVIDIA’s Blackwell platform, which integrates cutting-edge hardware like the B200 GPU and Grace CPU with an evolving software ecosystem that focuses on inference efficiency and developer productivity (NVIDIA newsroom).

Quick Overview: GB200 NVL72 and the Blackwell Platform

The GB200 NVL72 represents NVIDIA’s premier rack-scale system for the Blackwell generation, incorporating 36 Grace Blackwell Superchips (GB200) interconnected through NVLink and NVSwitch. This setup effectively configures 72 Blackwell GPUs into a singular, high-bandwidth compute fabric, with Grace CPUs designed to maintain memory coherence and orchestration. This architecture is tailored for extensive, latency-sensitive inference and training tasks requiring rapid inter-GPU communication and optimized memory management (NVIDIA Blackwell overview) (GB200 NVL72).

Key features include:

  • Rack-scale NVLink architecture connecting 72 Blackwell GPUs, enhancing high-throughput and low-latency communication.
  • Coherent memory access from Grace CPUs for improved data movement and overall system efficiency.
  • Optimizations for FP8/FP16 precision, advanced scheduling, and large model contexts.
  • Integration with NVIDIA’s suite of inference software, featuring TensorRT-LLM and Triton Inference Server.

About OpenAI’s gpt-oss Models

NVIDIA’s benchmarks specifically target OpenAI’s gpt-oss models, an openly released (open-weight) model family intended for broad experimentation and deployment. While the exact variants and licenses should be verified from OpenAI’s official resources, NVIDIA’s results point to an active optimization effort tailored to these models (NVIDIA newsroom).

Note: Always consult the latest documentation and repository notes from OpenAI for model specifications, licensing, and recommended deployment configurations. Model names, sizes, and usage rights can change rapidly in open model releases.

Achieving 1.5 Million TPS

Achieving a seven-figure TPS is a system-level accomplishment, encompassing more than just GPU capabilities. NVIDIA highlights several critical elements that contribute to delivering this impressive performance:

1) Model Graph and Kernel Optimizations with TensorRT-LLM

NVIDIA’s TensorRT-LLM compiles model graphs into highly optimized execution plans, merging operations and choosing kernels that are specifically tuned for the underlying GPU architecture. Recent updates introduce advancements in paged attention, decode-stage scheduling, and FP8 calibration that reduce latency and enhance GPU utilization for both small- and medium-sized LLMs (TensorRT-LLM).
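
To make this concrete, here is a minimal sketch using TensorRT-LLM’s high-level Python LLM API. The model identifier is a placeholder, and the exact class and argument names should be checked against the documentation for your installed TensorRT-LLM version:

```python
# Minimal sketch: running a gpt-oss checkpoint through TensorRT-LLM's
# high-level Python API. The model id is a placeholder; verify the exact
# API surface against your installed TensorRT-LLM version.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="openai/gpt-oss-20b")  # placeholder Hugging Face id or local path

# Sampling settings directly affect compute cost per token (see
# "Understanding TPS in Practical Terms" below).
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)

prompts = [
    "Summarize the benefits of FP8 inference in one paragraph.",
    "Write a haiku about rack-scale NVLink.",
]

# generate() batches the prompts and returns one result per prompt.
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```

A script like this is also a convenient place to capture baseline latency and quality numbers before moving the model behind Triton.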

2) Adaptive Batching and Streaming with Triton Inference Server

Triton manages dynamic batching, multi-model serving, and request prioritization, which are essential for handling real-world traffic efficiently. In LLM workloads, adaptive batching can significantly impact GPU utilization, ensuring consistent high throughput during fluctuating loads (Triton Inference Server).
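
Dynamic batching itself is configured server-side (in the model’s config.pbtxt); from a client’s point of view, you simply issue concurrent requests and let Triton coalesce them. A minimal sketch, assuming a Triton instance on localhost:8000 serving a model named gpt_oss through the generate endpoint, with text_input/text_output field names that depend on your backend configuration:

```python
# Minimal sketch: concurrent requests against Triton's generate endpoint so
# the server-side dynamic batcher has work to coalesce. The model name
# ("gpt_oss") and the payload/response fields ("text_input", "max_tokens",
# "text_output") are assumptions that depend on your deployment.
import concurrent.futures
import requests

TRITON_URL = "http://localhost:8000/v2/models/gpt_oss/generate"

def infer(prompt: str) -> str:
    payload = {"text_input": prompt, "max_tokens": 64}
    resp = requests.post(TRITON_URL, json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json().get("text_output", "")

prompts = [f"Question {i}: what is paged attention?" for i in range(32)]

# Concurrent submission lets dynamic batching merge requests server-side;
# per-request queuing delay is bounded by the max_queue_delay setting in
# the model's config.pbtxt.
with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    for answer in pool.map(infer, prompts):
        print(answer[:80])
```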

3) Precision Formats and Quantization

Blackwell supports FP8 and other mixed-precision formats that reduce memory bandwidth pressure and improve Tensor Core efficiency. Coupled with the calibration tools in TensorRT-LLM and careful quantization strategies, teams can lower latency while maintaining the quality required by production service level agreements (SLAs) (Blackwell overview).
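
As an illustration, here is a minimal sketch of requesting an FP8-quantized build through TensorRT-LLM’s LLM API. The checkpoint path is a placeholder, and the QuantConfig/QuantAlgo names and FP8 availability depend on your TensorRT-LLM version and GPU, so verify against the current documentation:

```python
# Minimal sketch: requesting an FP8 engine build via TensorRT-LLM's LLM API.
# Class and enum names reflect the llmapi at the time of writing; verify
# against your installed version. The checkpoint path is a placeholder.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo

quant_config = QuantConfig(quant_algo=QuantAlgo.FP8)

llm = LLM(
    model="/path/to/model-checkpoint",  # placeholder local or HF checkpoint
    quant_config=quant_config,          # FP8 weights/activations where supported
)

# Spot-check output quality after quantization before promoting to production.
outputs = llm.generate(
    ["Explain KV-cache paging in two sentences."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```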

4) NVLink/NVSwitch Scalability within a Rack

The high-bandwidth and low-latency interconnects facilitate larger batch assembly, quicker key-value cache sharing, and efficient multi-GPU decoding pathways. This capability results in elevated aggregate TPS when models serve numerous concurrent sessions.

5) Software Maturity from CUDA to Runtime Services

NVIDIA’s complete software stack—comprising CUDA, cuDNN, NCCL, TensorRT-LLM, Triton, and NIM microservices—has been fine-tuned for optimal LLM inference patterns. This optimization significantly reduces the engineering effort required to harness the maximum available performance.

Understanding TPS in Practical Terms

While tokens per second serves as a useful key metric, it can fluctuate based on workload and traffic patterns. Consider these factors when interpreting any TPS figures:

  • Prompt vs. Decode Mix: Prefill (prompt processing) is compute-heavy, while decode (token generation) leans more on memory bandwidth, so long-prompt workloads bottleneck differently than short, chatty interactions.
  • Batching and Parallelism: Enhanced concurrency can elevate TPS but may impact tail latency. Triton and TensorRT-LLM are designed to effectively balance these trade-offs.
  • Sampling Settings: Parameters like temperature, top-k, and beam search can modify the compute cost per token.
  • Context Length: Longer contexts impose greater demands on memory bandwidth and key-value cache.
  • Model Size and Architecture: Smaller open models can achieve significant TPS, while larger models may prioritize latency and quality over sheer throughput.

NVIDIA’s claim of 1.5 million TPS represents aggregate throughput on a GB200 NVL72 for the targeted gpt-oss models under their testing conditions. Your actual results may vary based on model selection, request patterns, and quality requirements. For cross-vendor benchmarking, it is advisable to consult standardized benchmarks like MLPerf Inference for context regarding efficiency and scaling trends (MLCommons).
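
A back-of-envelope calculation helps translate an aggregate figure like this into capacity terms. All inputs below are illustrative assumptions, not NVIDIA’s benchmark parameters:

```python
# Back-of-envelope sketch: translating aggregate throughput into concurrent
# interactive sessions. Every number here is an illustrative assumption,
# not a measured or published figure.
aggregate_tps = 1_500_000          # claimed aggregate tokens/second per rack
per_user_decode_tps = 40           # tokens/second a user perceives as "fast"
decode_fraction = 0.7              # assumed share of throughput spent on decode vs. prefill

effective_decode_tps = aggregate_tps * decode_fraction
concurrent_sessions = effective_decode_tps / per_user_decode_tps

print(f"Decode budget:       {effective_decode_tps:,.0f} tokens/s")
print(f"Concurrent sessions: {concurrent_sessions:,.0f} at {per_user_decode_tps} tok/s each")
# -> roughly 26,000 sessions under these assumptions; real capacity depends
#    on prompt length, batching, sampling, and SLA headroom.
```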

From Cloud to Edge: One Stack, Multiple Environments

NVIDIA emphasizes that teams can expect a unified developer experience across cloud VMs, on-premise racks, and edge devices. This consistency is increasingly vital as businesses deploy model-powered assistants, retrieval-augmented generation (RAG) pipelines, and autonomous agents closer to data sources or user endpoints.

Cloud and Data Center

  • Utilize GB200 NVL72 for ultra-high-throughput serving of small- to medium-scale LLMs and mixture-of-experts models.
  • Employ traditional GPU instances like H100, H200, and L40S for mainstream inference clusters.
  • Leverage Triton and TensorRT-LLM with NIM microservices for model packaging, routing, and observability (NVIDIA NIM); a minimal client sketch follows below.
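
NIM LLM microservices expose an OpenAI-compatible API, so a standard client library works against them. A minimal sketch, assuming a NIM container reachable at localhost:8000 and a model registered under a placeholder name:

```python
# Minimal sketch: calling a NIM microservice through its OpenAI-compatible
# endpoint. Host, port, model name, and API-key handling are assumptions
# that depend on how the NIM container is deployed.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # local NIM endpoint (assumed)
    api_key="not-used-locally",           # placeholder; hosted endpoints need a real key
)

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",           # model name as registered by the NIM (assumed)
    messages=[{"role": "user", "content": "Give me three common RAG failure modes."}],
    max_tokens=200,
    temperature=0.2,
)
print(response.choices[0].message.content)
```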

On-Prem and Hybrid

  • Deploy the same containerized stack using Helm charts, Kubernetes, and NVIDIA AI Enterprise distributions.
  • Integrate with storage solutions and vector databases for RAG workloads and enterprise-level search capabilities.

Edge Deployments

  • Implement Jetson platforms and RTX-class GPUs for on-device or near-device inference. Smaller gpt-oss variants can operate within stringent power and latency constraints (NVIDIA Jetson).
  • Triton Inference Server supports both GPU and CPU backends, allowing for mixed hardware fleet deployments.
  • Utilize optional compression and quantization techniques to fit models into limited memory while sustaining acceptable quality.

The Developer Workflow: Build Once, Deploy Anywhere

For teams aiming for rapid deployment, a streamlined workflow is crucial to converting projects from pilots into full-fledged services. NVIDIA outlines a pragmatic approach:

  1. Prototype the gpt-oss model using TensorRT-LLM in a notebook to assess latency and quality, capturing calibration data if employing FP8 or other quantization.
  2. Package the serving logic with Triton and NIM microservices, specifying prompts, sampling defaults, and request structures tailored to your application.
  3. Instrument every component: capture traces, request metrics, and token counts so you can see queuing delays and GPU utilization (a minimal metrics sketch follows this list).
  4. Scale horizontally across GPUs and nodes, adjusting batch sizes and concurrency to meet SLA targets without compromising tail latency.
  5. Extend to edge deployments where relevant, reusing the same containers and model artifacts.
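
For step 3, here is a minimal sketch of exporting per-request latency and token counts with prometheus_client; the metric names, port, and helper functions are illustrative stand-ins, not a prescribed schema:

```python
# Minimal sketch for step 3: exporting per-request latency and token-count
# metrics with prometheus_client. Metric names, the port, and the helper
# stubs are illustrative stand-ins for your real serving path.
import time

from prometheus_client import Counter, Histogram, start_http_server

TOKENS_GENERATED = Counter(
    "llm_tokens_generated_total", "Decode tokens produced", ["model"]
)
REQUEST_LATENCY = Histogram(
    "llm_request_latency_seconds", "End-to-end request latency", ["model"]
)

def run_inference(model: str, prompt: str) -> str:
    # Stand-in for your actual Triton / TensorRT-LLM call.
    return "stub completion for: " + prompt

def count_tokens(text: str) -> int:
    # Crude stand-in; use your tokenizer's count in practice.
    return len(text.split())

def serve_request(model: str, prompt: str) -> str:
    start = time.perf_counter()
    completion = run_inference(model, prompt)
    REQUEST_LATENCY.labels(model=model).observe(time.perf_counter() - start)
    TOKENS_GENERATED.labels(model=model).inc(count_tokens(completion))
    return completion

if __name__ == "__main__":
    start_http_server(9400)  # Prometheus scrape target; port is arbitrary
    print(serve_request("gpt-oss-20b", "Hello from the metrics sketch"))
```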

This approach aligns with best practices seen in MLPerf Inference submissions and industry deployments, where end-to-end throughput depends as much on orchestration and serving software as on the underlying hardware (MLPerf Inference).

Cost, Energy, and Sustainability Considerations

Performance metrics hold significance only when they can be translated into better economic outcomes. Here are some of the factors that typically influence the total cost of ownership (TCO) for LLM inference:

  • Throughput per Rack: Higher TPS per system translates into fewer racks required for a specific workload.
  • Energy per Token: Mixed precision, quantization, and efficient kernel strategies help reduce power consumption while maintaining quality.
  • GPU Utilization: Adaptive batching and request shaping minimize periods of idleness and over-provisioning.
  • Operator Productivity: A mature software stack and automation reduce time-to-market and operational burdens.

Blackwell-era systems are designed to enhance performance-per-watt compared to earlier generations, and the software ecosystem is fine-tuned to keep GPUs fully utilized under real traffic conditions. Your actual energy profile will depend on model choices, sampling techniques, and context lengths (Blackwell overview).
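
A rough worked example shows how these levers interact. Every input below is an illustrative placeholder, not a vendor specification or measurement:

```python
# Back-of-envelope sketch of the TCO levers above. All inputs are
# illustrative placeholders, not vendor specifications or measurements.
peak_demand_tps = 6_000_000        # peak aggregate tokens/second you must serve
tps_per_rack = 1_500_000           # sustained tokens/second per rack (assumed)
rack_power_kw = 120                # average rack power draw in kW (assumed)
energy_cost_per_kwh = 0.10         # USD per kWh (assumed)

racks_needed = -(-peak_demand_tps // tps_per_rack)        # ceiling division
joules_per_token = (rack_power_kw * 1_000) / tps_per_rack
tokens_per_rack_day = tps_per_rack * 86_400
energy_cost_per_million_tokens = (
    (rack_power_kw * 24 * energy_cost_per_kwh) / tokens_per_rack_day * 1_000_000
)

print(f"Racks needed:              {racks_needed}")
print(f"Energy per token:          {joules_per_token:.3f} J")
print(f"Energy cost per 1M tokens: ${energy_cost_per_million_tokens:.4f}")
```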

Practical Tips for Deploying gpt-oss at Scale

  • Right-Size the Model: If you are focused on classification, routing, or extraction tasks, consider using a smaller gpt-oss variant that can deliver significantly higher TPS while maintaining satisfactory quality.
  • Separate Prefill and Decode: Establish dedicated pools for prompt processing and token generation to stabilize latency during peak loads.
  • Utilize KV-Cache Paging: Keep memory usage predictable as context length grows. TensorRT-LLM’s paged attention implementations scale better for longer contexts.
  • Tune Batch and Concurrency: Start with conservative batch sizes, gradually increasing them while monitoring tail latency and drop rates.
  • Benchmark with Realistic Prompts: Use retrieval-augmented prompts, tool invocations, and extended contexts that reflect production traffic; a minimal load-test sketch follows this list.
  • Monitor Quality: Use automated evaluation sets to track how quantization or scheduling changes affect task performance.
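
As a starting point for the benchmarking and tuning tips above, here is a minimal async load-test sketch against an OpenAI-compatible chat endpoint (such as one exposed by a NIM deployment). The URL, model name, and prompt mix are assumptions to replace with your own traffic:

```python
# Minimal sketch: a small async load test that measures request rate and
# p95 latency against an OpenAI-compatible chat endpoint. URL, model name,
# and prompts are assumptions; substitute production-like traffic.
import asyncio
import statistics
import time

import httpx

URL = "http://localhost:8000/v1/chat/completions"
MODEL = "openai/gpt-oss-20b"   # placeholder model name
PROMPTS = ["Summarize our returns policy.", "Draft a SQL query for churn."] * 16

async def one_request(client: httpx.AsyncClient, prompt: str) -> float:
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    start = time.perf_counter()
    resp = await client.post(URL, json=payload, timeout=120)
    resp.raise_for_status()
    return time.perf_counter() - start

async def main() -> None:
    async with httpx.AsyncClient() as client:
        t0 = time.perf_counter()
        latencies = await asyncio.gather(*(one_request(client, p) for p in PROMPTS))
        wall = time.perf_counter() - t0
    p95 = statistics.quantiles(latencies, n=20)[18]   # 95th percentile
    print(f"Requests/s: {len(latencies) / wall:.1f}  p95 latency: {p95:.2f}s")

asyncio.run(main())
```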

What to Watch Next

Anticipate rapid advancements in both hardware and software. Inference compilers, attention kernels, and scheduling policies are continually evolving. On the hardware front, closer integration between CPUs, GPUs, and network fabrics is expected to support sustained high throughput for longer contexts and more intricate agent workloads. Furthermore, as open models continue to develop, expect more variations tailored to specific domains and latency objectives.

Bottom Line

NVIDIA’s report of achieving 1.5 million TPS on the GB200 NVL72 serves as a strong indicator that large-scale LLM serving is transitioning into a more predictable engineering domain. With a cohesive technology stack—comprising TensorRT-LLM, Triton, and NIM integrated with Blackwell systems—teams can strive for ambitious throughput targets without compromising quality or latency. If your roadmap includes deploying OpenAI’s gpt-oss models across various environments—including cloud, on-premise, and edge—now is the ideal time to standardize on tools and strategies that enable seamless development and deployment.

FAQs

What does 1.5M TPS represent?

This metric indicates the total tokens per second processed by the GB200 NVL72 system under NVIDIA’s benchmark conditions for OpenAI gpt-oss models. TPS is influenced by model size, batching, sampling, prompt lengths, and traffic patterns, so results may vary.

Can I replicate these results on other hardware?

While you can apply similar software optimizations using TensorRT-LLM and Triton on H100, H200, L40S, and other NVIDIA GPUs, absolute TPS figures may differ. For accurate comparisons, ensure you use similar batching, precision, and model variants.

Are gpt-oss models fully open source?

Refer to OpenAI’s official documentation for the latest licensing and usage terms. Model availability and licensing details may evolve over time.

How can I deploy the same model to the edge?

Utilize the same containerized Triton and TensorRT-LLM setup on Jetson or RTX-class edge GPUs. Apply quantization and use smaller model variants to adhere to stringent memory and power constraints.

Where can I learn more about NVIDIA’s inference stack?

Explore NVIDIA’s documentation and repositories for TensorRT-LLM, Triton Inference Server, and NIM microservices to gain insights into implementation details and reference deployments.

Sources

  1. NVIDIA Newsroom – Delivering 1.5M TPS Inference on NVIDIA GB200 NVL72, NVIDIA Accelerates OpenAI gpt-oss Models From Cloud to Edge
  2. NVIDIA Blog – Introducing the NVIDIA Blackwell Platform
  3. NVIDIA – GB200 NVL72 Product Page
  4. GitHub – NVIDIA TensorRT-LLM
  5. GitHub – NVIDIA Triton Inference Server
  6. NVIDIA – NIM Microservices
  7. MLCommons – Industry Benchmarks and Results
  8. NVIDIA – Jetson Edge AI Platforms
