Microsoft recently announced that its Azure ND GB300 v6 virtual machine has set a new industry record for inference throughput, reaching 1.1 million tokens per second on Meta's Llama 2 70B model. Microsoft CEO Satya Nadella stated on social media: "This achievement is the result of our long-standing partnership with NVIDIA and our expertise in running AI at production scale."
The Azure ND GB300 v6 virtual machine is powered by NVIDIA's Blackwell Ultra GPUs, specifically the NVIDIA GB300 NVL72 system, a rack-scale design that pairs 72 NVIDIA Blackwell Ultra GPUs with 36 NVIDIA Grace CPUs. Optimized for inference workloads, the platform delivers 50% more GPU memory and a 16% higher Thermal Design Power (TDP) than the previous-generation GB200.
To validate the performance improvement, Microsoft ran the Llama 2 70B model (at FP4 precision) on 18 ND GB300 v6 virtual machines within a single NVIDIA GB300 NVL72 domain, using NVIDIA TensorRT-LLM as the inference engine. Microsoft stated: "An Azure ND GB300 v6 cluster with one NVL72 rack achieved a total inference speed of 1.1 million tokens per second." This new record surpasses Microsoft's previous result of 865,000 tokens per second on the NVIDIA GB200 NVL72 rack.
Based on this configuration, each GPU delivered approximately 15,200 tokens per second. Microsoft also published its detailed benchmarking methodology, along with all log files and results. The performance record has been verified by Signal65, an independent performance validation and benchmarking firm.
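As a rough sanity check (my own arithmetic, not taken from Microsoft's or Signal65's published materials), the per-GPU figure follows directly from dividing the reported rack-level throughput by the 72 GPUs in the NVL72 rack:

```python
# Illustrative sanity check only; exact aggregate values and rounding in the
# published results may differ slightly from the round numbers used here.
aggregate_tokens_per_sec = 1_100_000   # reported rack-level throughput
gpus_per_nvl72_rack = 72               # Blackwell Ultra GPUs per GB300 NVL72

per_gpu = aggregate_tokens_per_sec / gpus_per_nvl72_rack
print(f"{per_gpu:,.0f} tokens/s per GPU")   # ~15,278, i.e. roughly 15,200
```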
Russ Fellows, Vice President of Labs at Signal65, noted in a blog post: "This milestone not only breaks the million-tokens-per-second barrier but does so on a platform that meets the dynamic usage and data governance needs of modern enterprises." He added that the Azure ND GB300 v6 delivers a 27% improvement in inference performance over the previous-generation NVIDIA GB200 while its rated power increases by only 17%. Compared with the NVIDIA H100 generation, the GB300 offers nearly 10x the inference performance and almost 2.5x better power efficiency at the rack level.
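The quoted 27% uplift is consistent with the two rack-level results cited above; a quick check (again my own arithmetic, not Signal65's published analysis):

```python
# Verify that the 27% generation-over-generation uplift matches the two
# rack-level throughput figures cited in the article.
gb300_tokens_per_sec = 1_100_000   # GB300 NVL72 result
gb200_tokens_per_sec = 865_000     # prior GB200 NVL72 result

uplift = gb300_tokens_per_sec / gb200_tokens_per_sec - 1
print(f"throughput uplift: {uplift:.1%}")   # ~27.2%
```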