# Performance of TensorRT-LLM

This document summarizes performance measurements of TensorRT-LLM on H100 (Hopper), L40S (Ada) and A100 (Ampere) GPUs for a few key models.

The data in the following tables is provided as a reference point to help users validate observed performance. It should not be taken as the peak performance that TensorRT-LLM can deliver.

## Methodology

The performance numbers below were collected using the methodology described in the [benchmarks folder](../../benchmarks/).

## High Throughput

The tables below provide reference data at large batch sizes, representing high-throughput tasks.

### H100 GPUs (FP8)

| Model       | Batch Size | TP (1) | Input Length | Output Length | Throughput (out tok/s) |
| :---------- | :--------- | :----- | :----------- | :------------ | ---------------------: |
| GPT-J 6B    | 64         | 1      | 128          | 128           | 10,907                 |
| GPT-J 6B    | 64         | 1      | 128          | 2048          | 6,179                  |
| GPT-J 6B    | 64         | 1      | 2048         | 128           | 2,229                  |
| GPT-J 6B    | 64         | 1      | 2048         | 2048          | 2,980                  |
|             |            |        |              |               |                        |
| LLaMA 7B    | 64         | 1      | 128          | 128           | 9,193                  |
| LLaMA 7B    | 64         | 1      | 128          | 2048          | 5,367                  |
| LLaMA 7B    | 64         | 1      | 2048         | 128           | 2,058                  |
| LLaMA 7B    | 32         | 1      | 2048         | 2048          | 2,230                  |
|             |            |        |              |               |                        |
| LLaMA 70B   | 64         | 4      | 128          | 128           | 3,317                  |
| LLaMA 70B   | 64         | 4      | 128          | 2048          | 2,616                  |
| LLaMA 70B   | 64         | 4      | 2048         | 128           | 843                    |
| LLaMA 70B   | 64         | 4      | 2048         | 2048          | 1,583                  |
|             |            |        |              |               |                        |
| Falcon 180B | 96         | 8      | 128          | 128           | 2,686                  |
| Falcon 180B | 96         | 8      | 128          | 2048          | 2,073                  |
| Falcon 180B | 64         | 8      | 2048         | 128           | 465                    |

### L40S GPUs (FP8)

| Model       | Batch Size | TP (1) | Input Length | Output Length | Throughput (out tok/s) |
| :---------- | :--------- | :----- | :----------- | :------------ | ---------------------: |
| GPT-J 6B    | 64         | 1      | 128          | 128           | 3,630                  |
| GPT-J 6B    | 64         | 1      | 128          | 2048          | 1,859                  |
| GPT-J 6B    | 32         | 1      | 2048         | 128           | 616                    |
| GPT-J 6B    | 32         | 1      | 2048         | 2048          | 757                    |
|             |            |        |              |               |                        |
| LLaMA 7B    | 64         | 1      | 128          | 128           | 3,240                  |
| LLaMA 7B    | 64         | 1      | 128          | 2048          | 1,622                  |
| LLaMA 7B    | 32         | 1      | 2048         | 128           | 581                    |
| LLaMA 7B    | 16         | 1      | 2048         | 2048          | 531                    |

### A100 GPUs (FP16)

| Model       | Batch Size | TP (1) | Input Length | Output Length | Throughput (out tok/s) |
| :---------- | :--------- | :----- | :----------- | :------------ | ---------------------: |
| GPT-J 6B    | 64         | 1      | 128          | 128           | 3,679                  |
| GPT-J 6B    | 32         | 1      | 128          | 2048          | 1,558                  |
| GPT-J 6B    | 32         | 1      | 2048         | 128           | 526                    |
| GPT-J 6B    | 16         | 1      | 2048         | 2048          | 650                    |
|             |            |        |              |               |                        |
| LLaMA 7B    | 64         | 1      | 128          | 128           | 3,486                  |
| LLaMA 7B    | 32         | 1      | 128          | 2048          | 1,459                  |
| LLaMA 7B    | 32         | 1      | 2048         | 128           | 529                    |
| LLaMA 7B    | 16         | 1      | 2048         | 2048          | 592                    |
|             |            |        |              |               |                        |
| LLaMA 70B   | 64         | 4      | 128          | 128           | 1,237                  |
| LLaMA 70B   | 64         | 4      | 128          | 2048          | 1,181                  |
| LLaMA 70B   | 64         | 4      | 2048         | 128           | 272                    |
| LLaMA 70B   | 64         | 4      | 2048         | 2048          | 738                    |
|             |            |        |              |               |                        |
| Falcon 180B | 64         | 8      | 128          | 128           | 929                    |
| Falcon 180B | 64         | 8      | 128          | 2048          | 923                    |
| Falcon 180B | 64         | 8      | 2048         | 128           | 202                    |

(1) TP stands for Tensor Parallelism.

## Low Latency

The tables below provide reference data at batch size 1 for first-token latency, representing the latency perceived by end users in online streaming tasks.
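First-token latency is the time from submitting a request to receiving the first streamed token. The sketch below is a minimal illustration of that definition, not TensorRT-LLM code; `stream_generate` is a hypothetical stand-in for whatever streaming generation API your serving stack exposes.

```python
import time

def first_token_latency_ms(stream_generate, prompt):
    """Milliseconds from request submission to the first streamed token.

    `stream_generate` is a hypothetical callable that yields output
    tokens one at a time; substitute your own streaming entry point.
    """
    start = time.perf_counter()
    for _ in stream_generate(prompt):
        # The first yielded token defines the latency a streaming
        # end user perceives before output starts appearing.
        return (time.perf_counter() - start) * 1000.0
    raise RuntimeError("stream produced no tokens")
```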
### H100 GPUs (FP8)

| Model       | Batch Size | TP (1) | Input Length | 1st Token Latency (ms) |
| :---------- | :--------- | :----- | :----------- | ---------------------: |
| GPT-J 6B    | 1          | 1      | 128          | 7                      |
| GPT-J 6B    | 1          | 1      | 2048         | 29                     |
|             |            |        |              |                        |
| LLaMA 7B    | 1          | 1      | 128          | 7                      |
| LLaMA 7B    | 1          | 1      | 2048         | 36                     |
|             |            |        |              |                        |
| LLaMA 70B   | 1          | 4      | 128          | 26                     |
| LLaMA 70B   | 1          | 4      | 2048         | 109                    |
|             |            |        |              |                        |
| Falcon 180B | 1          | 8      | 128          | 27                     |
| Falcon 180B | 1          | 8      | 2048         | 205                    |

### L40S GPUs (FP8)

| Model       | Batch Size | TP (1) | Input Length | 1st Token Latency (ms) |
| :---------- | :--------- | :----- | :----------- | ---------------------: |
| GPT-J 6B    | 1          | 1      | 128          | 12                     |
| GPT-J 6B    | 1          | 1      | 2048         | 71                     |
|             |            |        |              |                        |
| LLaMA 7B    | 1          | 1      | 128          | 14                     |
| LLaMA 7B    | 1          | 1      | 2048         | 73                     |

### A100 GPUs (FP16)

| Model       | Batch Size | TP (1) | Input Length | 1st Token Latency (ms) |
| :---------- | :--------- | :----- | :----------- | ---------------------: |
| GPT-J 6B    | 1          | 1      | 128          | 12                     |
| GPT-J 6B    | 1          | 1      | 2048         | 129                    |
|             |            |        |              |                        |
| LLaMA 7B    | 1          | 1      | 128          | 16                     |
| LLaMA 7B    | 1          | 1      | 2048         | 133                    |
|             |            |        |              |                        |
| LLaMA 70B   | 1          | 4      | 128          | 47                     |
| LLaMA 70B   | 1          | 4      | 2048         | 377                    |
|             |            |        |              |                        |
| Falcon 180B | 1          | 8      | 128          | 61                     |
| Falcon 180B | 1          | 8      | 2048         | 509                    |

(1) TP stands for Tensor Parallelism.

## Known Issues

The following issues are being addressed to improve the efficiency of TensorRT-LLM.

### Fused Matmul + Gated-SiLU (LLaMA)

There are different possible implementations of Matmul followed by Gated-SiLU. The simplest one uses two Matmul operations and combines their results in a separate CUDA kernel; that is the current implementation in TensorRT-LLM. The next release will include a more efficient implementation that runs a single Matmul, as sketched below.
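To make the two variants concrete, here is a minimal NumPy sketch; the shapes and variable names are illustrative, not TensorRT-LLM code. Because the gate and up projections consume the same input, their weight matrices can be concatenated so that one larger Matmul replaces two.

```python
import numpy as np

def silu(x):
    # SiLU (a.k.a. swish): x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

# Toy shapes for illustration only.
hidden, inter = 64, 256
x = np.random.randn(8, hidden).astype(np.float32)
w_gate = np.random.randn(hidden, inter).astype(np.float32)
w_up = np.random.randn(hidden, inter).astype(np.float32)

# Current approach: two Matmuls, combined by a separate element-wise kernel.
out_two = silu(x @ w_gate) * (x @ w_up)

# Fused approach: concatenate the weights and run a single, larger Matmul,
# then split the result into the gate and up halves.
w_fused = np.concatenate([w_gate, w_up], axis=1)
gate, up = np.split(x @ w_fused, 2, axis=1)
out_one = silu(gate) * up

assert np.allclose(out_two, out_one, atol=1e-4)
```

The fused form launches one large Matmul instead of two smaller ones and reads the input activations once instead of twice; the element-wise combine is unchanged.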