{ "cells": [ { "cell_type": "markdown", "id": "e720418d", "metadata": {}, "source": [ "# The FIL Backend for Triton: FAQs and Advanced Features\n", "\n", "## Introduction\n", "\n", "This example notebook focuses on the technical details of deploying tree-based models with the FIL Backend for Triton. It is organized as a series of FAQs followed by example code providing a practical illustration of the corresponding FAQ section.\n", "\n", "The goal of this notebook is to offer information that goes beyond the basics and provide answers to practical questions that may arise when attempting a real-world deployment with the FIL backend. If you are a complete newcomer to the FIL backend and are looking for a short introduction to the basics of what the FIL backend is and how to use it, you are encouraged to check out [this introductory notebook](https://github.com/triton-inference-server/fil_backend/blob/main/notebooks/categorical-fraud-detection/Fraud_Detection_Example.ipynb).\n", "\n", "While we do provide training code for example models, training models is *not* the subject of this notebook, and we will provide little detail on training. Instead, you are encouraged to use your own model(s) and data with this notebook to get a realistic picture of how your model will perform with Triton." ] }, { "cell_type": "markdown", "id": "e9ad97cc", "metadata": {}, "source": [ "\n", "# Table of Contents\n", "* [Introduction](#Introduction)\n", "* [Table of Contents](#Table-of-Contents)\n", "* [Hardware Pre-requisites](#Hardware-Pre-Requisites)\n", "* [Software Pre-requisites](#Software-Pre-Requisites)\n", "* [FAQ 1: What can I deploy with the FIL Backend?](#FAQ-1:-What-can-I-deploy-with-the-FIL-backend?)\n", " - [FAQ 1.1 Can I deploy non-tree Scikit-Learn models like LinearRegression?](#FAQ-1.1-Can-I-deploy-non-tree-Scikit-Learn-models-like-LinearRegression?)\n", " - [FAQ 1.2 Can I deploy Scikit-Learn/cuML Pipelines with the FIL backend?](#FAQ-1.2-Can-I-deploy-Scikit-Learn/cuML-Pipelines-with-the-FIL-backend?)\n", " - [FAQ 1.3 Can I deploy Scikit-Learn/cuML models serialized with Pickle?](#FAQ-1.3-Can-I-deploy-Scikit-Learn/cuML-models-serialized-with-Pickle?)\n", " - [FAQ 1.4 Can I deploy Scikit-Learn/cuML models serialized with Joblib?](#FAQ-1.4-Can-I-deploy-Scikit-Learn/cuML-models-serialized-with-Joblib?)\n", "* [Example 1: Model Serialization](#Example-1:-Model-Serialization)\n", " - [Example 1.1: Serializing an XGBoost model](#Example-1.1:-Serializing-an-XGBoost-model)\n", " - [Example 1.2 Serializing a LightGBM model](#Example-1.2-Serializing-a-LightGBM-model)\n", " - [Example 1.3 Serializing an in-memory Scikit-Learn model](#Example-1.3-Serializing-an-in-memory-Scikit-Learn-model)\n", " - [Example 1.4 Serializing an in-memory cuML model](#Example-1.4-Serializing-an-in-memory-cuML-model)\n", " - [Example 1.5 Converting a pickled Scikit-Learn model](#Example-1.5-Converting-a-pickled-Scikit-Learn-model)\n", " - [Example 1.6 Converting a pickled cuML model](#Example-1.5-Converting-a-pickled-Scikit-Learn-model)\n", "* [FAQ 2: How do I execute models on CPU only? 
 " - [FAQ 2.1: How do I fall back to CPU only if GPUs are not available?](#FAQ-2.1:-How-do-I-fall-back-to-CPU-only-if-GPUs-are-not-available?)\n", "* [Example 2: Generating a configuration file](#Example-2:-Generating-a-configuration-file)\n", "* [FAQ 3: How can I quickly test configuration options?](#FAQ-3:-How-can-I-quickly-test-configuration-options?)\n", "* [Example 3: Launching the Triton server with polling mode](#Example-3:-Launching-the-Triton-server-with-polling-mode)\n", "* [FAQ 4: My models are exhausting Triton's memory. What can I do?](#FAQ-4:-My-models-are-exhausting-Triton's-memory.-What-can-I-do?)\n", " - [FAQ 4.1 How can I decrease the memory consumed by a model?](#FAQ-4.1-How-can-I-decrease-the-memory-consumed-by-a-model?)\n", " - [FAQ 4.2 How do I increase Triton's device memory pool?](#FAQ-4.2-How-do-I-increase-Triton's-device-memory-pool?)\n", "* [Example 4: Configuring Triton for large models](#Example-4:-Configuring-Triton-for-large-models)\n", " - [Example 4.1: Changing `storage_type` to reduce memory consumption](#Example-4.1:-Changing-storage_type-to-reduce-memory-consumption)\n", " - [Example 4.2: Increasing Triton's device memory pool](#$\color{#76b900}{\text{Example-4.2:-Increasing-Triton's-device-memory-pool}}$)\n", "* [FAQ 5: How do I submit an inference request to Triton?](#FAQ-5:-How-do-I-submit-an-inference-request-to-Triton?)\n", " - [FAQ 5.1: How do I submit inference requests through Triton's C API?](#FAQ-5.1:-How-do-I-submit-inference-requests-through-Triton's-C-API?)\n", " - [FAQ 5.2: How do I submit inference requests with categorical variables?](#FAQ-5.2:-How-do-I-submit-inference-requests-with-categorical-variables?)\n", "* [Example 5: Submitting a request with the Triton Python client](#Example-5:-Submitting-a-request-with-the-Triton-Python-client)\n", "* [FAQ 6: How do I return probability scores rather than classes from a classifier?](#FAQ-6:-How-do-I-return-probability-scores-rather-than-classes-from-a-classifier?)\n", "* [Example 6: Using the `predict_proba` option](#Example-6:-Using-the-predict_proba-option)\n", "* [FAQ 7: Does serving my model with Triton change its accuracy?](#FAQ-7:-Does-serving-my-model-with-Triton-change-its-accuracy?)\n", "* [Example 7: Comparing results from Triton and native execution](#Example-7:-Comparing-results-from-Triton-and-native-execution)\n", "* [FAQ 8: How do we measure performance of the FIL backend?](#FAQ-8:-How-do-we-measure-performance-of-the-FIL-backend?)\n", "* [Example 8: Using perf_analyzer to measure throughput and latency](#Example-8:-Using-perf_analyzer-to-measure-throughput-and-latency)\n", "* [FAQ 9: How can we improve performance of models deployed with the FIL backend?](#FAQ-9:-How-can-we-improve-performance-of-models-deployed-with-the-FIL-backend?)\n", " - [FAQ 9.1: Does specifying preferred batch sizes help FIL's performance?](#FAQ-9.1:-Does-specifying-preferred-batch-sizes-help-FIL's-performance?)\n", "* [Example 9: Optimizing model performance](#Example-9:-Optimizing-model-performance)\n", " - [Example 9.1: Minimizing latency](#Example-9.1:-Minimizing-latency)\n", " - [Example 9.2: Maximizing Throughput](#Example-9.2:-Maximizing-Throughput)\n", " - [Example 9.3: Balancing latency and throughput](#Example-9.3:-Balancing-latency-and-throughput)\n", "* [FAQ 10: How fast is the FIL backend relative to alternatives?](#FAQ-10:-How-fast-is-the-FIL-backend-relative-to-alternatives?)\n", " - [FAQ 10.1 How fast is the FIL backend on CPU vs on GPU?](#FAQ-10.1-How-fast-is-the-FIL-backend-on-CPU-vs-on-GPU?)\n",
 " - [FAQ 10.2 How fast is the FIL backend relative to the ONNX backend?](#FAQ-10.2-How-fast-is-the-FIL-backend-relative-to-the-ONNX-backend?)\n", "* [Example 10: Comparing the FIL and ONNX backends](#$\color{#76b900}{\text{Example-10:-Comparing-the-FIL-and-ONNX-backends}}$)\n", "* [FAQ 11: How do I submit many inference requests in parallel?](#FAQ-11:-How-do-I-submit-many-inference-requests-in-parallel?)\n", "* [Example 11: Submitting requests in parallel with the Python client](#Example-11:-Submitting-requests-in-parallel-with-the-Python-client)\n", "* [FAQ 12: How do I retrieve Shapley values for model explainability?](#$\color{#76b900}{\text{FAQ-12:-How-do-I-retrieve-Shapley-values-for-model-explainability?}}$)\n", "* [Example 12: Retrieving Shapley Values](#$\color{#76b900}{\text{Example-12:-Retrieving-Shapley-Values}}$)\n", "* [FAQ 13: How do I serve a learning-to-rank model?](#FAQ-13:-How-do-I-serve-a-learning-to-rank-model?)\n", "* [Cleanup](#Cleanup)\n", "* [Conclusion](#Conclusion)" ] }, { "cell_type": "markdown", "id": "e7224487", "metadata": {}, "source": [ "# Hardware Pre-Requisites\n", "Most of this notebook is designed to run on either CPU or GPU. Sections that will only run on GPU are marked in $\color{#76b900}{\text{green}}$. To guarantee that all cells execute correctly when a GPU is not available, change `USE_GPU` in the following cell to `False`. [^](#Table-of-Contents)" ] }, { "cell_type": "code", "execution_count": null, "id": "58491c57", "metadata": {}, "outputs": [], "source": [ "USE_GPU = True" ] }, { "cell_type": "markdown", "id": "72b761ac", "metadata": {}, "source": [ "# Software Pre-Requisites\n", "\n", "Depending on which model framework you use, you may need a different subset of dependencies. To install *all* dependencies with conda, you can use the following environment file:\n", "\n", "```yaml\n", "---\n", "name: triton_faq_nb\n", "channels:\n", " - conda-forge\n", " - nvidia\n", " - rapidsai\n", "dependencies:\n", " - cudatoolkit=11.4\n", " - cuml=22.04\n", " - joblib\n", " - jupyter\n", " - lightgbm\n", " - numpy\n", " - pandas\n", " - pip\n", " - python=3.8\n", " - scikit-learn\n", " - skl2onnx\n", " - treelite=2.3.0\n", " - pip:\n", " - tritonclient[all]\n", " - xgboost>=1.5,<1.6\n", " - protobuf==3.20.1\n", "```\n", "If you do not wish to install all dependencies, remove the frameworks you do not intend to use from this list. If you do not have access to an NVIDIA GPU, also remove `cuml` and `cudatoolkit`.\n", "\n", "In addition to the above dependencies, the Triton client requires that `libb64` be installed on the system, and Docker must be available to launch the Triton server. [^](#Table-of-Contents)" ] }, { "cell_type": "markdown", "id": "86510373", "metadata": {}, "source": [ "# FAQ 1: What can I deploy with the FIL backend?\n", "The first thing you will need to begin using the FIL backend is a serialized model file. The FIL backend supports **tree-based** models serialized to formats from a variety of frameworks, including the following:\n", "\n", "## XGBoost JSON and binary models\n", "XGBoost uses two serialization formats (JSON and binary), both of which are natively supported by the FIL backend. All XGBoost models except for multi-output regression models are supported.\n",
| FIL Backend Version | Treelite Version |
|---------------------|------------------|
| 21.08               | 1.3.0            |
| 21.09-21.10         | 2.0.0            |
| 21.11-22.02         | 2.1.0            |
| 22.03-22.06         | 2.3.0            |
| 22.07+              | 2.4.0            |
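
To make the XGBoost support described above concrete, here is a minimal sketch of training a small classifier and saving it in XGBoost's native JSON format inside a Triton model repository. The repository path (`model_repository`) and model name (`example_xgb`) are illustrative placeholders rather than names used elsewhere in this notebook; Example 1.1 below walks through serialization in full.

```python
# Minimal sketch: serialize an XGBoost classifier for the FIL backend.
# "model_repository" and "example_xgb" are placeholder names.
import os

import xgboost as xgb
from sklearn.datasets import make_classification

# Toy training data; substitute your own data in a real deployment.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

model = xgb.XGBClassifier(n_estimators=10, max_depth=4)
model.fit(X, y)

# Triton expects <repository>/<model_name>/<version>/<model_file>.
model_dir = os.path.join("model_repository", "example_xgb", "1")
os.makedirs(model_dir, exist_ok=True)

# The ".json" extension selects XGBoost's JSON format; saving to
# "xgboost.model" would instead produce the legacy binary format.
model.save_model(os.path.join(model_dir, "xgboost.json"))
```

Note that Triton also needs a `config.pbtxt` describing the model before it can be served; generating one is the subject of Example 2 below.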