Accelerating LLM inference is an important ML research problem, as auto-regressive token generation is computationally expensive and relatively slow, and improving inference efficiency can reduce latency for users. In addition to ongoing efforts to accelerate inference on Apple silicon, we have recently made significant progress in accelerating LLM inference for the NVIDIA GPUs widely used for production applications across the industry.
Earlier this year, we published and open sourced Recurrent Drafter (ReDrafter), a novel approach to speculative decoding that achieves state-of-the-art performance. ReDrafter uses an RNN draft model, and combines beam search with dynamic tree attention to speed up LLM token generation by up to 3.5 tokens per generation step for open source models, surpassing the performance of prior speculative decoding techniques.
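To make the draft-and-verify idea concrete, here is a minimal, generic speculative decoding sketch in PyTorch. It deliberately omits ReDrafter's RNN draft head, beam search, and dynamic tree attention, drafting a single greedy candidate sequence instead; `target` and `draft` are stand-in callables (any functions mapping token IDs to next-token logits), not ReDrafter's actual interfaces.

```python
import torch

@torch.no_grad()
def speculative_generate(target, draft, prompt, max_new_tokens, k=4):
    """Generic greedy draft-and-verify loop (a simplification of ReDrafter).

    `target` and `draft` map a [1, seq_len] token tensor to
    [1, seq_len, vocab] logits; `prompt` is a [1, seq_len] tensor.
    """
    seq = prompt
    final_len = prompt.shape[-1] + max_new_tokens
    while seq.shape[-1] < final_len:
        # 1) Draft k candidate tokens auto-regressively with the cheap model.
        draft_seq = seq
        for _ in range(k):
            next_tok = draft(draft_seq)[:, -1].argmax(-1, keepdim=True)
            draft_seq = torch.cat([draft_seq, next_tok], dim=-1)

        # 2) Score every drafted position with ONE target forward pass.
        logits = target(draft_seq)
        # The target's greedy prediction at each drafted position.
        verify = logits[:, seq.shape[-1] - 1 : -1].argmax(-1)
        drafted = draft_seq[:, seq.shape[-1]:]

        # 3) Keep the longest drafted prefix the target agrees with, plus the
        #    target's own token right after the last match (empty when all k
        #    drafted tokens are accepted -- a simplification).
        n_ok = int((verify == drafted)[0].long().cumprod(0).sum())
        seq = torch.cat(
            [seq, drafted[:, :n_ok], verify[:, n_ok : n_ok + 1]], dim=-1
        )
    return seq
```

The key property is that a single forward pass of the expensive target model validates several cheaply drafted tokens at once, which is where the per-step token gains come from.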
Productionizing ReDrafter to Accelerate NVIDIA TensorRT-LLM
This research work demonstrated strong results, but its greater impact comes from being applied in production to accelerate LLM inference. To make this advancement production-ready for NVIDIA GPUs, we collaborated with NVIDIA to integrate ReDrafter into the NVIDIA TensorRT-LLM inference acceleration framework.
Although TensorRT-LLM helps quite a few open supply LLMs and the Medusa speculative decoding methodology, ReDrafter’s beam search and tree consideration algorithms depend on operators that had by no means been utilized in earlier functions. To allow the combination of ReDrafter, NVIDIA added new operators or uncovered current ones, which significantly improved TensorRT-LLM’s functionality to accommodate refined fashions and decoding strategies. ML builders utilizing NVIDIA GPUs can now simply profit from ReDrafter’s accelerated token technology for his or her manufacturing LLM functions with TensorRT-LLM.
In benchmarking a tens-of-billions-parameter production model on NVIDIA GPUs, using the NVIDIA TensorRT-LLM inference acceleration framework with ReDrafter, we have seen a 2.7x speed-up in generated tokens per second for greedy decoding (see Figure 1). These benchmark results indicate this technique could significantly reduce the latency users experience, while also using fewer GPUs and consuming less power.
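As a back-of-the-envelope check on how the per-step figure relates to the wall-clock figure: each ReDrafter step does extra work (drafting, beam search, verification), so tokens per second grows less than tokens per step. The overhead factor below is purely a hypothetical illustration, not a measured value from the benchmark.

```python
# Hypothetical arithmetic only -- not the benchmark itself.
baseline_tokens_per_step = 1.0   # plain auto-regressive greedy decoding
redrafter_tokens_per_step = 3.5  # up to 3.5 tokens accepted per step (see above)
step_cost_ratio = 1.3            # assumed: a ReDrafter step costs ~1.3x a baseline step

speedup = (redrafter_tokens_per_step / baseline_tokens_per_step) / step_cost_ratio
print(f"tokens/sec speed-up: {speedup:.1f}x")  # -> 2.7x
```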
For additional detail, see this post on the NVIDIA developer blog.
Conclusion
LLMs are increasingly being used to power production applications, and improving inference efficiency can both reduce computational costs and lower latency for users. With ReDrafter's novel approach to speculative decoding integrated into the NVIDIA TensorRT-LLM framework, developers can now benefit from faster token generation on NVIDIA GPUs for their production LLM applications.
Acknowledgements
Many people contributed to this project, including Aonan Zhang, Xuanyu Zhang, Yunfei Cheng, Chong Wang, Yi Wang, Abhishek Udupa, Dhaval Doshi, and our collaborators at NVIDIA.