{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["\"在\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Nvidia TensorRT-LLM是一种用于低延迟推理的深度学习模型优化工具。TensorRT-LLM使用Nvidia的TensorRT推理引擎,通过减少模型推理的延迟来提高性能。它还提供了一些优化技术,如层次化的量化和剪枝,以进一步优化模型。TensorRT-LLM旨在帮助开发人员在边缘设备上部署实时的深度学习推理应用程序。\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["TensorRT-LLM为用户提供了一个易于使用的Python API,用于定义大型语言模型(LLMs)并构建包含最先进优化的TensorRT引擎,以在NVIDIA GPU上高效进行推理。\n", "\n", "[TensorRT-LLM Github](https://github.com/NVIDIA/TensorRT-LLM)\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## TensorRT-LLM环境设置\n", "由于TensorRT-LLM是一个用于与本地模型进行进程交互的SDK,因此必须遵循一些环境设置步骤,以确保可以使用TensorRT-LLM设置。请注意,目前需要Nvidia Cuda 12.2或更高版本才能运行TensorRT-LLM。\n", "\n", "在本教程中,我们将展示如何将连接器与GPT2模型一起使用。\n", "为了获得最佳体验,我们建议按照官方[TensorRT-LLM Github](https://github.com/NVIDIA/TensorRT-LLM)上的[安装](https://github.com/NVIDIA/TensorRT-LLM/tree/v0.8.0?tab=readme-ov-file#installation)过程进行操作。\n", "\n", "以下步骤展示了如何为x86_64用户设置您的模型与TensorRT-LLM v0.8.0。\n", "\n", "1. 获取并启动基本的docker镜像环境。\n", "```\n", "docker run --rm --runtime=nvidia --gpus all --entrypoint /bin/bash -it nvidia/cuda:12.1.0-devel-ubuntu22.04\n", "```\n", "\n", "2. 安装依赖项,TensorRT-LLM需要Python 3.10\n", "```\n", "apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev git git-lfs wget\n", "```\n", "3. 安装最新稳定版本(对应发布分支)的TensorRT-LLM。我们使用的是版本0.8.0,但要获取最新版本,请参考[官方发布页面](https://github.com/NVIDIA/TensorRT-LLM/releases)。\n", "```\n", "pip3 install tensorrt_llm==0.8.0 -U --extra-index-url https://pypi.nvidia.com\n", "```\n", "\n", "4. 检查安装\n", "```\n", "python3 -c \"import tensorrt_llm\"\n", "```\n", "上述命令不应产生任何错误。\n", "\n", "5. 对于本示例,我们将使用GPT2。GPT2模型文件需要通过脚本按照[这里](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/gpt#usage)的说明创建。\n", " * 首先,在第1阶段启动的容器内,克隆TensorRT-LLM存储库:\n", " ```\n", " git clone --branch v0.8.0 https://github.com/NVIDIA/TensorRT-LLM.git\n", " ```\n", " * 使用以下命令安装GPT2模型的要求:\n", " ```\n", " cd TensorRT-LLM/examples/gpt/ && pip install -r requirements.txt\n", " ```\n", " * 下载hf gpt2模型\n", " ```\n", " rm -rf gpt2 && git clone https://huggingface.co/gpt2-medium gpt2\n", " cd gpt2\n", " rm pytorch_model.bin model.safetensors\n", " wget -q https://huggingface.co/gpt2-medium/resolve/main/pytorch_model.bin\n", " cd ..\n", " ```\n", " * 将权重从HF Transformers转换为TensorRT-LLM格式\n", " ```\n", " python3 hf_gpt_convert.py -i gpt2 -o ./c-model/gpt2 --tensor-parallelism 1 --storage-type float16\n", " ```\n", " * 构建TensorRT引擎\n", " ```\n", " python3 build.py --model_dir=./c-model/gpt2/1-gpu --use_gpt_attention_plugin --remove_input_padding\n", " ```\n", " \n", "6. 
{"cell_type": "markdown", "metadata": {}, "source": ["## Basic Usage\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["#### Call `complete` with a prompt\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["```python\n", "from llama_index.llms.nvidia_tensorrt import LocalTensorRTLLM\n", "\n", "llm = LocalTensorRTLLM(\n", "    model_path=\"./engine_outputs\",\n", "    engine_name=\"gpt_float16_tp1_rank0.engine\",\n", "    tokenizer_dir=\"gpt2\",\n", "    max_new_tokens=40,\n", ")\n", "\n", "resp = llm.complete(\"Who is Harry Potter?\")\n", "print(str(resp))\n", "```\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Harry Potter is a fictional character created by J.K. Rowling in her first novel, Harry Potter and the Philosopher's Stone. The character is a wizard who lives in a fictional small town.\n"]}], "metadata": {"colab": {"provenance": []}, "kernelspec": {"display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3"}, "vscode": {"interpreter": {"hash": "a0a0263b650d907a3bfe41c0f8d6a63a071b884df3cfdc1579f00cdc1aed6b03"}}}, "nbformat": 4, "nbformat_minor": 0}