{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["\"在\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Nvidia Triton是一个用于部署和推理机器学习模型的开源软件。它提供了一个统一的接口,可以用于部署训练好的模型,并支持多种深度学习框架,包括TensorFlow、PyTorch和ONNX等。Triton还提供了高性能推理的能力,可以在GPU和CPU上进行推理,并支持多个模型同时部署。Nvidia Triton还包括了用于监控和管理模型部署的工具,使得模型的部署和管理变得更加简单和高效。\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["[NVIDIA Triton Inference Server](https://github.com/triton-inference-server/server) 提供了针对CPU和GPU进行优化的云端和边缘推理解决方案。此连接器允许 llama_index 与使用 Triton 部署的 TRT-LLM 模型进行远程交互。\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## 启动Triton推理服务器\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["这个连接器需要一个运行中的Triton推理服务器,配备一个TensorRT-LLM模型。\n", "在这个示例中,我们将使用[Triton命令行界面(Triton CLI)](https://github.com/triton-inference-server/triton_cli)在Triton上部署一个GPT2模型。\n", "\n", "当在主机上使用Triton和相关工具(而不是在Triton容器映像之外)时,可能需要一些额外的依赖项来支持各种工作流程。大多数系统依赖问题可以通过在最新对应的`tritonserver`容器映像内安装并运行CLI来解决,该容器应该安装了所有必要的系统依赖项。\n", "\n", "对于TRT-LLM,你可以使用`nvcr.io/nvidia/tritonserver:{YY.MM}-trtllm-python-py3`映像,其中`YY.MM`对应于`tritonserver`的版本,例如在这个示例中我们使用了24.02版本的容器。要获取可用版本的列表,请参考[Triton推理服务器NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver)。\n", "\n", "要启动容器,请在Linux终端中运行:\n", "\n", "```\n", "docker run -ti --gpus all --network=host --shm-size=1g --ulimit memlock=-1 nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3\n", "```\n", "接下来,我们需要使用以下命令安装依赖项:\n", "```\n", "pip install \\\n", " \"psutil\" \\\n", " \"pynvml>=11.5.0\" \\\n", " \"torch==2.1.2\" \\\n", " \"tensorrt_llm==0.8.0\" --extra-index-url https://pypi.nvidia.com/\n", "```\n", "最后,运行以下命令安装Triton CLI。\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["```\n", "pip install git+https://github.com/triton-inference-server/triton_cli.git\n", "```\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["生成GPT2模型的模型库并启动Triton服务器实例:\n", "```\n", "triton remove -m all\n", "triton import -m gpt2 --backend tensorrtllm\n", "triton start &\n", "```\n", "请注意,默认情况下,Triton开始监听`localhost:8000` HTTP端口和`localhost:8001` GRPC端口。后者将在本示例中使用。\n", "如需任何其他操作指南和问题,请联系[Triton命令行界面(Triton CLI)](https://github.com/triton-inference-server/triton_cli)的问题页面。\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## 安装tritonclient\n", "由于我们将与Triton推理服务器进行交互,因此我们需要[安装](https://github.com/triton-inference-server/client?tab=readme-ov-file#download-using-python-package-installer-pip) `tritonclient` 包。\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["```\n", "pip install tritonclient[all]\n", "```\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["接下来,我们将安装llama索引连接器。\n", "```\n", "pip install llama-index-llms-nvidia-triton\n", "```\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## 基本用法\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["#### 使用提示调用`complete`\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["```python\n", "from llama_index.llms.nvidia_triton import NvidiaTriton\n", "\n", "# 必须运行一个Triton服务器实例。使用您所需的Triton服务器实例的正确URL。\n", "triton_url = \"localhost:8001\"\n", "model_name = \"gpt2\"\n", "resp = NvidiaTriton(server_url=triton_url, model_name=model_name, tokens=32).complete(\"北美洲最高的山是 \")\n", "print(resp)\n", "```\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["你应该期待以下的回复\n", "```\n", "吉萨大金字塔,高约1,000英尺。吉萨大金字塔是北美最高的山。\n", "```\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["#### 使用 prompt 调用 `stream_complete`\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["```python\n", "resp = 
{"cell_type": "markdown", "metadata": {}, "source": ["## Further Examples\n", "For more information on Triton Inference Server, please refer to the [Quickstart](https://github.com/triton-inference-server/server/blob/main/docs/getting_started/quickstart.md#quickstart) guide, the [NVIDIA Developer Triton page](https://developer.nvidia.com/triton-inference-server), and the [GitHub issues](https://github.com/triton-inference-server/server/issues) channel.\n"]}], "metadata": {"colab": {"provenance": []}, "kernelspec": {"display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3"}, "vscode": {"interpreter": {"hash": "a0a0263b650d907a3bfe41c0f8d6a63a071b884df3cfdc1579f00cdc1aed6b03"}}}, "nbformat": 4, "nbformat_minor": 0}