NVIDIA TensorRT is a production inference SDK that takes a trained neural network model and produces a highly optimized runtime engine for NVIDIA GPUs. The key insight behind TensorRT is that training-time flexibility — dynamic graph construction, Python loops, gradient tracking — is not needed at inference time. By stripping away all of that and applying hardware-aware optimizations, TensorRT can dramatically increase throughput and reduce latency compared to running the same model in a general-purpose framework like PyTorch or TensorFlow.
TensorRT is not a training framework. It is a compiler and runtime for already-trained models.
When you hand a trained model to TensorRT, it runs through a multi-stage optimization pipeline:
TensorRT parses the model’s computational graph and applies a range of operator-level optimizations, including:

- Layer and tensor fusion — merging sequences such as convolution + bias + activation into a single kernel to reduce memory traffic and kernel-launch overhead.
- Constant folding — precomputing subgraphs whose inputs are all known at build time.
- Elimination of no-op and dead layers — removing operations that have no effect on the output.
TensorRT supports multiple numerical precisions:

- FP32 — the default; numerically matches the trained model.
- FP16 — half precision; roughly halves memory use and enables tensor cores, usually with negligible accuracy loss.
- INT8 — 8-bit integers; the fastest of the three, but requires calibration to preserve accuracy. (Newer releases also support additional formats such as BF16 and FP8 on supporting hardware.)
INT8 calibration is where TensorRT does significant work on the user’s behalf: it runs inference on a calibration dataset, measures activation ranges, and chooses quantization parameters that minimize accuracy loss while retaining the speed benefit of integer arithmetic.
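The core idea behind range calibration can be sketched in a few lines. This is an illustrative max-range symmetric-quantization example, not TensorRT’s actual entropy-based calibrator:

```python
# Sketch of symmetric INT8 calibration: observe activation values,
# pick a clipping range, derive a scale, then quantize/dequantize.

def calibrate_scale(activations, num_bits=8):
    """Return the scale mapping float activations onto the signed int grid."""
    qmax = 2 ** (num_bits - 1) - 1           # 127 for INT8
    amax = max(abs(v) for v in activations)  # observed dynamic range
    return amax / qmax if amax > 0 else 1.0

def quantize(x, scale):
    q = round(x / scale)
    return max(-128, min(127, q))            # clamp to the INT8 range

def dequantize(q, scale):
    return q * scale

# "Calibration dataset": activation samples collected during inference.
samples = [0.02, -1.5, 0.7, 3.9, -2.2]
scale = calibrate_scale(samples)
recovered = [dequantize(quantize(v, scale), scale) for v in samples]
```

Each recovered value differs from its original by at most half a quantization step, which is the accuracy/speed trade the calibrator is tuning.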
TensorRT selects the best CUDA kernel for each operation based on the specific GPU it is targeting and the input tensor shapes. Different GPU architectures — Ampere, Ada Lovelace, Hopper — have different compute capabilities, cache sizes, and tensor core configurations. TensorRT’s profiler tries candidate kernels and picks the fastest for the target device.
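The auto-tuning step amounts to an empirical bake-off: time each candidate implementation on the target machine and keep the fastest. A toy sketch with interchangeable matmul implementations standing in for CUDA kernels:

```python
# Conceptual sketch of kernel auto-tuning: same operation, multiple
# implementations, pick the fastest on this machine. Real TensorRT does
# this per layer with CUDA kernel candidates.
import time

def matmul_naive(a, b):
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def matmul_transposed(a, b):
    # Same result, different memory-access pattern (b traversed by column).
    bt = list(map(list, zip(*b)))
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def pick_fastest(candidates, *args, repeats=3):
    best_name, best_time = None, float("inf")
    for name, fn in candidates.items():
        t0 = time.perf_counter()
        for _ in range(repeats):
            fn(*args)
        elapsed = time.perf_counter() - t0
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
winner = pick_fastest({"naive": matmul_naive, "transposed": matmul_transposed}, a, b)
```

The winner depends on the machine running the benchmark, which is exactly why engines built this way are hardware-specific.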
The result of TensorRT optimization is a serialized engine file (.trt or .engine). This engine is hardware-specific: an engine built for an A100 is not guaranteed to load or run on an RTX 4090. This is a common source of confusion in teams that build once and deploy to multiple GPU types.
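One simple mitigation for the build-once-deploy-everywhere pitfall is to key each engine file to its build target and refuse mismatches. The naming scheme and device-name lookup below are illustrative conventions, not TensorRT APIs (in practice the GPU name would come from something like `torch.cuda.get_device_name` or `nvidia-smi`):

```python
# Encode the build target in the engine filename and check it at load time.

def engine_path_for(model_name, gpu_name, precision):
    """e.g. engine_path_for("resnet50", "A100", "fp16") -> "resnet50.a100.fp16.engine" """
    tag = gpu_name.lower().replace(" ", "-")
    return f"{model_name}.{tag}.{precision}.engine"

def check_engine_matches(path, current_gpu):
    """Refuse to load an engine built for a different GPU target."""
    tag = current_gpu.lower().replace(" ", "-")
    return f".{tag}." in path

path = engine_path_for("resnet50", "A100", "fp16")
```

A deploy script can then build (or fetch) the correct engine per GPU fleet instead of shipping one file everywhere.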
NVIDIA recommends using ONNX as the interchange format for getting models into TensorRT. The typical workflow is:
1. Export the trained model to ONNX (torch.onnx.export, tf2onnx, etc.).
2. Build a TensorRT engine from the ONNX model, using either the builder API or a command-line tool.
3. Serialize the engine and deploy it with the TensorRT runtime.

NVIDIA also provides trtexec, a command-line tool that can take an ONNX model and produce a TensorRT engine in a single command, which is useful for benchmarking and initial exploration.
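A minimal trtexec invocation looks like the following. The file paths are placeholders, and the command requires a TensorRT installation on a machine with the target GPU:

```shell
# Build an FP16 TensorRT engine from an ONNX model in one command.
trtexec --onnx=model.onnx --saveEngine=model.engine --fp16
```

The `--fp16` flag allows the builder to use half-precision kernels where the hardware supports them; dropping it yields an FP32 engine.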
Optimizing large language models with TensorRT is a more involved problem than optimizing a convolutional network. LLMs have dynamic sequence lengths, attention mechanisms, and KV-cache behaviors that require specialized handling.
NVIDIA ships TensorRT-LLM as a dedicated open-source library that brings TensorRT-style optimizations to LLMs. TensorRT-LLM provides:

- Optimized attention kernels and fused operations for transformer architectures.
- Paged KV-cache management for long and variable-length sequences.
- In-flight (continuous) batching, so new requests join a running batch without waiting for it to drain.
- Quantization support, including low-bit weight and activation formats on supporting hardware.
- Multi-GPU execution via tensor and pipeline parallelism.
TensorRT-LLM is the inference engine underneath NVIDIA NIM for LLMs.
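The KV-cache behavior mentioned above is worth making concrete. A minimal sketch of the idea (plain Python lists standing in for GPU tensors, with a toy uniform-average "attention" in place of the real mechanism):

```python
# During autoregressive decoding, keys/values for past tokens are stored
# once and reused, so each step only computes inputs for the newest token.

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def __len__(self):
        return len(self.keys)

def decode_step(cache, new_key, new_value):
    """Append the new token's K/V, then attend over the whole cached history."""
    cache.append(new_key, new_value)
    n = len(cache)
    dim = len(new_value)
    # Toy "attention": uniform average over all cached values.
    return [sum(v[d] for v in cache.values) / n for d in range(dim)]

cache = KVCache()
out1 = decode_step(cache, [0.1], [1.0])   # history: [1.0]
out2 = decode_step(cache, [0.2], [3.0])   # history: [1.0, 3.0]
```

The cache grows with sequence length, which is why production systems like TensorRT-LLM manage it in fixed-size pages rather than one contiguous buffer per request.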
Running inference directly from PyTorch or TensorFlow is straightforward but typically leaves much of the GPU’s available performance unused. In practice, the comparison looks like this:
| Area | PyTorch/TF Native | TensorRT Optimized |
|---|---|---|
| Setup complexity | Low — just run the model | Higher — export, build engine, manage engine files |
| Throughput | Baseline | Typically 2–5× or more, hardware-dependent |
| Latency | Baseline | Lower, especially with FP16/INT8 |
| Flexibility | High — dynamic shapes, any Python logic | More constrained — static or bounded dynamic shapes |
| Hardware portability | Runs on any PyTorch-supported hardware | Engine is GPU-specific; must rebuild per GPU target |
| LLM suitability | Adequate for development | TensorRT-LLM needed for production throughput |
The throughput numbers depend heavily on the model architecture, batch size, sequence length, and GPU. NVIDIA’s published benchmarks should be reproduced on target hardware rather than taken as universal claims.
TensorRT is the right choice when:

- Inference runs on NVIDIA GPUs and throughput or latency is a real constraint.
- The model architecture is stable enough to justify an export-and-build pipeline.
- The deployment hardware is known in advance, so engines can be built per GPU target.
TensorRT is probably not the right first step when:

- The model is still changing rapidly and rebuild overhead would slow iteration.
- Deployment must span non-NVIDIA hardware or unpredictable GPU types.
- Framework-native inference already meets the latency and throughput targets.
TensorRT does not stand alone — it is a component in a wider inference deployment stack:
Trained Model (PyTorch / TF / JAX)
↓
ONNX Export
↓
TensorRT Engine Build (trtexec or API)
↓
TensorRT Runtime ←→ Triton Inference Server
↓
Production Application
For LLMs specifically, TensorRT-LLM replaces the middle steps with a more specialized pipeline that handles the challenges of autoregressive generation at scale.
NVIDIA NIM builds on top of this stack: NIM containers for LLMs use TensorRT-LLM internally, packaging the engine-build and runtime together so teams do not need to run TensorRT tooling manually.
Choosing precision is a practical engineering decision, not just a benchmarking exercise:

- FP32 is the safe default when accuracy is paramount or validation capacity is limited.
- FP16 usually delivers a large speedup with negligible accuracy impact and little engineering effort.
- INT8 offers the biggest gains but requires a calibration dataset and an accuracy-validation pass before it can be trusted in production.
NVIDIA’s precision guidance in the TensorRT documentation recommends a calibration and accuracy-validation workflow before committing to INT8 in production.
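That validation step can be sketched as a simple gate: compare full-precision reference outputs against the quantized path on a validation set, and accept the lower precision only if the error stays under a chosen tolerance. The metric and threshold here are illustrative:

```python
# Accuracy-validation gate before committing to INT8.

def simulate_int8(x, scale):
    """Fake-quantize a value: round to the INT8 grid, then dequantize."""
    q = max(-128, min(127, round(x / scale)))
    return q * scale

def max_abs_error(reference, candidate):
    return max(abs(r - c) for r, c in zip(reference, candidate))

def accept_precision(reference_outputs, quantized_outputs, tolerance):
    """Only deploy the quantized engine if it stays within tolerance."""
    return max_abs_error(reference_outputs, quantized_outputs) <= tolerance

ref = [0.5, -1.25, 2.0, 0.03]         # FP32 reference outputs
scale = 2.0 / 127                     # scale from calibrating on this range
quant = [simulate_int8(x, scale) for x in ref]
ok = accept_precision(ref, quant, tolerance=0.01)
```

In a real pipeline the reference and candidate outputs would come from running the FP32 and INT8 engines on the same validation inputs, and the metric would match the task (top-1 accuracy, perplexity, etc.).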