{ "cells": [
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Optimizing OpenAI GPT-OSS Models with NVIDIA TensorRT-LLM" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "This notebook provides a step-by-step guide on how to optimize `gpt-oss` models using NVIDIA's TensorRT-LLM for high-performance inference. TensorRT-LLM provides an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in a performant way.\n", "\n", "TensorRT-LLM supports both models:\n", "- `gpt-oss-20b`\n", "- `gpt-oss-120b`\n", "\n", "In this guide, we will run `gpt-oss-20b`. If you want to try the larger model or need more customization, refer to [this](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md) deployment guide.\n", "\n", "Note: Your input prompts should use the [harmony response](http://cookbook.openai.com/articles/openai-harmony) format for the model to work properly, though the simple examples in this guide do not require it." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Launch on NVIDIA Brev\n", "You can simplify the environment setup by using [NVIDIA Brev](https://developer.nvidia.com/brev). Click the button below to launch this project on a Brev instance with the necessary dependencies pre-configured.\n", "\n", "Once deployed, click the \"Open Notebook\" button to get started with this guide.\n", "\n", "[![Launch on Brev](https://brev-assets.s3.us-west-1.amazonaws.com/nv-lb-dark.svg)](https://brev.nvidia.com/launchable/deploy?launchableID=env-30i1YjHsRWT109HL6eYxLUeHIwF)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Prerequisites" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Hardware\n", "To run the `gpt-oss-20b` model, you will need an NVIDIA GPU with at least 20 GB of VRAM.\n", "\n", "Recommended GPUs: NVIDIA Hopper (e.g., H100, H200), NVIDIA Blackwell (e.g., B100, B200), NVIDIA RTX PRO, NVIDIA RTX 50 Series (e.g., RTX 5090).\n", "\n", "### Software\n", "- CUDA Toolkit 12.8 or later\n", "- Python 3.12 or later" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Installing TensorRT-LLM\n", "\n", "There are multiple ways to install TensorRT-LLM. In this guide, we'll cover using a pre-built Docker container from NVIDIA NGC as well as building from source.\n", "\n", "If you're using NVIDIA Brev, you can skip this section." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Using NVIDIA NGC\n", "\n", "Pull the pre-built [TensorRT-LLM container](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags) for GPT-OSS from [NVIDIA NGC](https://www.nvidia.com/en-us/gpu-cloud/).\n", "This is the easiest way to get started and ensures all dependencies are included.\n", "\n", "```bash\n", "docker pull nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev\n", "docker run --gpus all -it --rm -v $(pwd):/workspace nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev\n", "```\n", "\n", "### Using Docker (Build from Source)\n", "\n", "Alternatively, you can build the TensorRT-LLM container from source.\n", "This approach is useful if you want to modify the source code or use a custom branch.\n", "For detailed instructions, see the [official documentation](https://github.com/NVIDIA/TensorRT-LLM/tree/feat/gpt-oss/docker)." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "TensorRT-LLM will also be available through pip soon." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "> Note on GPU Architecture: The first time you run the model, TensorRT-LLM will build an optimized engine for your specific GPU architecture (e.g., Hopper, Ada, or Blackwell). If you see warnings about your GPU's CUDA capability (e.g., sm_90, sm_120) not being compatible with the PyTorch installation, ensure you have the latest NVIDIA drivers and a CUDA Toolkit version that matches your PyTorch build." ] },
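{ "cell_type": "markdown", "metadata": {}, "source": [ "Before loading the model, you can optionally confirm that your environment sees the GPU and check its compute capability and available VRAM. The cell below is a minimal sketch of such a check; it assumes PyTorch is available in your environment (the NGC container above includes it)." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional environment check (assumes PyTorch is installed, as in the NGC container)\n", "import torch\n", "\n", "print(f\"CUDA available: {torch.cuda.is_available()}\")\n", "if torch.cuda.is_available():\n", "    name = torch.cuda.get_device_name(0)\n", "    major, minor = torch.cuda.get_device_capability(0)\n", "    print(f\"GPU: {name} (compute capability sm_{major}{minor})\")\n", "    free, total = torch.cuda.mem_get_info(0)\n", "    print(f\"VRAM: {total / 1024**3:.1f} GB total, {free / 1024**3:.1f} GB free\")" ] },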
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Verifying the TensorRT-LLM Installation" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from tensorrt_llm import LLM, SamplingParams" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Using the TensorRT-LLM Python API" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "In the next code cells, we will demonstrate how to use the TensorRT-LLM Python API to:\n", "1. Download the specified model weights from Hugging Face (using your `HF_TOKEN` for authentication).\n", "2. Automatically build the TensorRT engine for your GPU architecture if it does not already exist.\n", "3. Load the model and prepare it for inference.\n", "4. Run a simple text generation example to verify everything is working.\n", "\n", "**Note**: The first run may take several minutes as it downloads the model and builds the engine.\n", "Subsequent runs will be much faster, as the engine will be cached." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "llm = LLM(model=\"openai/gpt-oss-20b\")" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "prompts = [\"Hello, my name is\", \"The capital of France is\"]\n", "sampling_params = SamplingParams(temperature=0.8, top_p=0.95)\n", "for output in llm.generate(prompts, sampling_params):\n", "    print(f\"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}\")" ] },
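{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick, informal check of performance, you can time generation over a slightly larger batch of prompts with the same API. The cell below is a minimal sketch that assumes the `llm` object created above and a hypothetical toy workload of repeated prompts; it only measures rough wall-clock throughput and is not a substitute for the proper benchmarking tools linked in the next section." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Rough timing sketch (assumes the `llm` object created in the cells above).\n", "# For rigorous latency/throughput numbers, use the benchmarking tools linked below.\n", "import time\n", "\n", "bench_prompts = [\"Write one sentence about GPUs.\"] * 16  # hypothetical toy workload\n", "params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)\n", "\n", "start = time.perf_counter()\n", "outputs = llm.generate(bench_prompts, params)\n", "elapsed = time.perf_counter() - start\n", "\n", "print(f\"Generated {len(outputs)} completions in {elapsed:.2f} s \"\n", "      f\"({len(outputs) / elapsed:.2f} prompts/s)\")" ] },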
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion and Next Steps\n", "Congratulations! You have successfully optimized and run a large language model using the TensorRT-LLM Python API.\n", "\n", "In this notebook, you have learned how to:\n", "- Set up your environment with the necessary dependencies.\n", "- Use the `tensorrt_llm.LLM` API to download a model from the Hugging Face Hub.\n", "- Automatically build a high-performance TensorRT engine tailored to your GPU.\n", "- Run inference with the optimized model.\n", "\n", "You can explore more advanced features to further improve performance and efficiency:\n", "\n", "- Benchmarking: Try running a [benchmark](https://nvidia.github.io/TensorRT-LLM/performance/performance-tuning-guide/benchmarking-default-performance.html#benchmarking-with-trtllm-bench) to compare the latency and throughput of the TensorRT-LLM engine against the original Hugging Face model. You can do this by iterating over a larger number of prompts and measuring the execution time.\n", "\n", "- Quantization: TensorRT-LLM [supports](https://github.com/NVIDIA/TensorRT-Model-Optimizer) various quantization techniques (such as INT8 or FP8) to reduce model size and accelerate inference with minimal impact on accuracy. This is a powerful feature for deploying models on resource-constrained hardware.\n", "\n", "- Deploy with NVIDIA Dynamo: For production environments, you can deploy your TensorRT-LLM engine using [NVIDIA Dynamo](https://docs.nvidia.com/dynamo/latest/) for robust, scalable, and multi-model serving.\n", "\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.3" } }, "nbformat": 4, "nbformat_minor": 4 }