Nvidia’s TensorRT 8.0 boasts faster conversational AI performance


Nvidia has released TensorRT 8.0 for Nvidia GPUs, including its Jetson modules. The latest version of the AI inference optimization SDK delivers up to 2x the natural language query performance of v7.0, with 1.2ms latency using BERT. Nvidia announced TensorRT 8.0 at GTC 2021 in April, along with related technologies such as the GUI-based TAO framework, which eases AI model training for GPU-equipped platforms. The TensorRT 8.0 SDK is now available for enabling deep learning inference on all Nvidia GPU products, including the Linux-based Jetson modules.

[Figure: TensorRT 8.0 in the AI inference workflow]

The most significant enhancement in TensorRT 8.0 is the addition of compiler optimizations for transformer-based natural language processing networks such as BERT (Bidirectional Encoder Representations from Transformers). The new release offers up to twice the transformer optimization performance of TensorRT 7.0, with 1.2ms inference latency on BERT-Large, claims Nvidia. As a result, customers can "double or triple their model size to achieve dramatic improvements in accuracy," says the company.
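To make the workflow concrete, here is a minimal sketch of building a TensorRT 8 engine from a BERT model exported to ONNX, using TensorRT's Python API. The file name bert_large.onnx is a placeholder, and an export with dynamic input shapes would additionally need an optimization profile; treat this as an illustration of the API, not Nvidia's reference recipe.

```python
# Minimal sketch: building a TensorRT 8 engine from an ONNX export of BERT.
# The model path "bert_large.onnx" is a placeholder; any ONNX model works.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)

# Transformer models require an explicit-batch network definition.
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("bert_large.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit("ONNX parse failed")

config = builder.create_builder_config()
config.max_workspace_size = 1 << 30       # 1 GiB scratch space for tactic search
config.set_flag(trt.BuilderFlag.FP16)     # mixed precision where supported

# TensorRT 8 returns a serialized engine that can be saved and reloaded.
engine_bytes = builder.build_serialized_network(network, config)
with open("bert_large.engine", "wb") as f:
    f.write(engine_bytes)
```

Note that build_serialized_network is new in TensorRT 8; earlier releases returned a live engine object from build_engine instead.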

A testimonial from Hugging Face claims that with TensorRT 8.0, its Hugging Face Accelerated Inference API has achieved as low as 1ms inference latency on BERT. Hugging Face will release the technology later this year.

Other TensorRT 8.0 improvements include up to 2x the INT8 inference accuracy of TensorRT 7.0 through support for Quantization Aware Training (QAT), claims Nvidia. The new release also adds support for the structured sparsity technology introduced with Nvidia's Ampere-architecture GPUs, such as the A100. With sparsity, developers can accelerate neural networks by pruning weights to reduce computational operations.

[Figure: TensorRT 8.0 framework and hardware support]

TensorRT is used to optimize and deploy neural networks in production, including CNNs, RNNs, and transformers. The software offers optimizations for TensorFlow and PyTorch models and supports ONNX and other interchange formats. After fine-tuning models with the Transfer Learning Toolkit (TLT) in Nvidia TAO, developers can turn to TensorRT to find an optimal balance of size and accuracy for each target system.
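Both features are exposed as builder flags. The following sketch assumes the builder, network, and config objects from the previous example; a QAT model exported with Q/DQ (quantize/dequantize) nodes carries its own scales, so no separate calibration step is shown.

```python
# Minimal sketch: enabling INT8 (for QAT models) and structured sparsity
# on an existing TensorRT 8 builder config. Assumes `builder`, `network`,
# and `config` were created as in the previous example.
import tensorrt as trt

# For a QAT model, quantization scales travel in the ONNX graph as
# Q/DQ nodes, so enabling INT8 kernels is sufficient here.
config.set_flag(trt.BuilderFlag.INT8)

# On Ampere GPUs, let TensorRT use 2:4 structured-sparse tensor cores
# for weights that already satisfy the sparsity pattern.
config.set_flag(trt.BuilderFlag.SPARSE_WEIGHTS)

engine_bytes = builder.build_serialized_network(network, config)
```

The SPARSE_WEIGHTS flag only pays off for layers whose weights match the 2:4 pattern; pruning to that pattern happens during training, not in TensorRT.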

TensorRT applications can be deployed in a wide range of scenarios, from hyperscale data centers to embedded and automotive products. Over the last five years, more than 350,000 developers across 27,500 companies have downloaded TensorRT nearly 2.5 million times, says Nvidia.
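On the target system, a serialized engine is loaded by the lightweight TensorRT runtime rather than the full builder. Below is a minimal sketch of that deployment step, assuming the fixed-shape engine built above with one input and one output binding; the buffer dtypes are illustrative and depend on the actual model.

```python
# Minimal sketch: deserializing and running a saved TensorRT 8 engine.
# Binding order, shapes, and dtypes are engine-specific; this assumes a
# fixed-shape engine with one input (binding 0) and one output (binding 1).
import numpy as np
import pycuda.autoinit   # creates a CUDA context on import
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("bert_large.engine", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Host and device buffers sized from the engine's bindings.
h_in = np.zeros(trt.volume(engine.get_binding_shape(0)), dtype=np.int32)
h_out = np.zeros(trt.volume(engine.get_binding_shape(1)), dtype=np.float32)
d_in = cuda.mem_alloc(h_in.nbytes)
d_out = cuda.mem_alloc(h_out.nbytes)

cuda.memcpy_htod(d_in, h_in)
context.execute_v2([int(d_in), int(d_out)])  # synchronous inference
cuda.memcpy_dtoh(h_out, d_out)
```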

TensorRT 8.0 is available now, free of charge to Nvidia Developer program members. More information may be found in Nvidia's announcement and on the product page.
