{ "cells": [
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Optimizing OpenAI GPT-OSS Models with NVIDIA TensorRT-LLM" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "This notebook provides a step-by-step guide on how to optimize `gpt-oss` models using NVIDIA's TensorRT-LLM for high-performance inference. TensorRT-LLM provides an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in a performant way.\n", "\n", "TensorRT-LLM supports both models:\n", "- `gpt-oss-20b`\n", "- `gpt-oss-120b`\n", "\n", "In this guide, we will run `gpt-oss-20b`. If you want to try the larger model or need more customization, refer to [this](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md) deployment guide.\n", "\n", "Note: Your input prompts should use the [harmony response](http://cookbook.openai.com/articles/openai-harmony) format for the model to work properly, though the simple examples in this guide do not require it." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Launch on NVIDIA Brev\n", "You can simplify the environment setup by using [NVIDIA Brev](https://developer.nvidia.com/brev). Click the button below to launch this project on a Brev instance with the necessary dependencies pre-configured.\n", "\n", "Once deployed, click the \"Open Notebook\" button to get started with this guide.\n", "\n", "[![Launch on Brev](https://brev-assets.s3.us-west-1.amazonaws.com/nv-lb-dark.svg)](https://brev.nvidia.com/launchable/deploy?launchableID=env-30i1YjHsRWT109HL6eYxLUeHIwF)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Prerequisites" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Hardware\n", "To run the `gpt-oss-20b` model, you will need an NVIDIA GPU with at least 20 GB of VRAM.\n", "\n", "Recommended GPUs: NVIDIA Hopper (e.g., H100, H200), NVIDIA Blackwell (e.g., B100, B200), NVIDIA RTX PRO, NVIDIA RTX 50 Series (e.g., RTX 5090).\n", "\n", "### Software\n", "- CUDA Toolkit 12.8 or later\n", "- Python 3.12 or later" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Installing TensorRT-LLM\n", "\n", "There are multiple ways to install TensorRT-LLM. In this guide, we'll cover using a pre-built Docker container from NVIDIA NGC as well as building from source.\n", "\n", "If you're using NVIDIA Brev, you can skip this section." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Using NVIDIA NGC\n", "\n", "Pull the pre-built [TensorRT-LLM container](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags) for GPT-OSS from [NVIDIA NGC](https://www.nvidia.com/en-us/gpu-cloud/).\n", "This is the easiest way to get started and ensures all dependencies are included.\n", "\n", "```bash\n", "docker pull nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev\n", "docker run --gpus all -it --rm -v $(pwd):/workspace nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev\n", "```\n", "\n", "### Using Docker (Build from Source)\n", "\n", "Alternatively, you can build the TensorRT-LLM container from source.\n", "This approach is useful if you want to modify the source code or use a custom branch.\n", "For detailed instructions, see the [official documentation](https://github.com/NVIDIA/TensorRT-LLM/tree/feat/gpt-oss/docker)." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "TensorRT-LLM will also be available through pip soon." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "> Note on GPU Architecture: The first time you run the model, TensorRT-LLM will build an optimized engine for your specific GPU architecture (e.g., Hopper, Ada, or Blackwell). If you see warnings about your GPU's CUDA capability (e.g., sm_90, sm_120) not being compatible with the PyTorch installation, ensure you have the latest NVIDIA drivers and a CUDA Toolkit version that matches your PyTorch build." ] },
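{ "cell_type": "markdown", "metadata": {}, "source": [ "Before loading the model, you can optionally confirm that your environment sees the GPU and check its compute capability and available VRAM. The cell below is a minimal sketch of such a check; it assumes PyTorch is available in your environment (the NGC container above includes it)." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional environment check (assumes PyTorch is installed, as in the NGC container)\n", "import torch\n", "\n", "print(f\"CUDA available: {torch.cuda.is_available()}\")\n", "if torch.cuda.is_available():\n", "    name = torch.cuda.get_device_name(0)\n", "    major, minor = torch.cuda.get_device_capability(0)\n", "    print(f\"GPU: {name} (compute capability sm_{major}{minor})\")\n", "    free, total = torch.cuda.mem_get_info(0)\n", "    print(f\"VRAM: {total / 1024**3:.1f} GB total, {free / 1024**3:.1f} GB free\")" ] },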
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Verifying the TensorRT-LLM Installation" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from tensorrt_llm import LLM, SamplingParams" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Using the TensorRT-LLM Python API" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "In the next code cells, we will demonstrate how to use the TensorRT-LLM Python API to:\n", "1. Download the specified model weights from Hugging Face (using your `HF_TOKEN` for authentication).\n", "2. Automatically build the TensorRT engine for your GPU architecture if it does not already exist.\n", "3. Load the model and prepare it for inference.\n", "4. Run a simple text generation example to verify everything is working.\n", "\n", "**Note**: The first run may take several minutes as it downloads the model and builds the engine.\n", "Subsequent runs will be much faster, as the engine will be cached." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "llm = LLM(model=\"openai/gpt-oss-20b\")" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "prompts = [\"Hello, my name is\", \"The capital of France is\"]\n", "sampling_params = SamplingParams(temperature=0.8, top_p=0.95)\n", "for output in llm.generate(prompts, sampling_params):\n", "    print(f\"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}\")" ] },
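{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick, informal check of performance, you can time generation over a slightly larger batch of prompts with the same API. The cell below is a minimal sketch that assumes the `llm` object created above and a hypothetical toy workload of repeated prompts; it only measures rough wall-clock throughput and is not a substitute for the proper benchmarking tools linked in the next section." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Rough timing sketch (assumes the `llm` object created in the cells above).\n", "# For rigorous latency/throughput numbers, use the benchmarking tools linked below.\n", "import time\n", "\n", "bench_prompts = [\"Write one sentence about GPUs.\"] * 16  # hypothetical toy workload\n", "params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)\n", "\n", "start = time.perf_counter()\n", "outputs = llm.generate(bench_prompts, params)\n", "elapsed = time.perf_counter() - start\n", "\n", "print(f\"Generated {len(outputs)} completions in {elapsed:.2f} s \"\n", "      f\"({len(outputs) / elapsed:.2f} prompts/s)\")" ] },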
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion and Next Steps\n", "Congratulations! You have successfully optimized and run a large language model using the TensorRT-LLM Python API.\n", "\n", "In this notebook, you have learned how to:\n", "- Set up your environment with the necessary dependencies.\n", "- Use the `tensorrt_llm.LLM` API to download a model from the Hugging Face Hub.\n", "- Automatically build a high-performance TensorRT engine tailored to your GPU.\n", "- Run inference with the optimized model.\n", "\n", "You can explore more advanced features to further improve performance and efficiency:\n", "\n", "- Benchmarking: Try running a [benchmark](https://nvidia.github.io/TensorRT-LLM/performance/performance-tuning-guide/benchmarking-default-performance.html#benchmarking-with-trtllm-bench) to compare the latency and throughput of the TensorRT-LLM engine against the original Hugging Face model. You can do this by iterating over a larger number of prompts and measuring the execution time.\n", "\n", "- Quantization: TensorRT-LLM [supports](https://github.com/NVIDIA/TensorRT-Model-Optimizer) various quantization techniques (such as INT8 or FP8) to reduce model size and accelerate inference with minimal impact on accuracy. This is a powerful feature for deploying models on resource-constrained hardware.\n", "\n", "- Deploy with NVIDIA Dynamo: For production environments, you can deploy your TensorRT-LLM engine using [NVIDIA Dynamo](https://docs.nvidia.com/dynamo/latest/) for robust, scalable, and multi-model serving.\n", "\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.3" } }, "nbformat": 4, "nbformat_minor": 4 }