{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Serving LLMs with MLflow: Leveraging Custom PyFunc" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "Download this Notebook" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Introduction\n", "\n", "This tutorial guides you through saving and deploying Large Language Models (LLMs) using a custom `pyfunc` with MLflow, ideal for models not directly supported by MLflow's default transformers flavor.\n", "\n", "### Learning Objectives\n", "\n", "- Understand the need for custom `pyfunc` definitions in specific model scenarios.\n", "- Learn to create a custom `pyfunc` to manage model dependencies and interface data.\n", "- Gain insights into simplifying user interfaces in deployed environments with custom `pyfunc`.\n", "\n", "#### The Challenge with Default Implementations\n", "While MLflow's `transformers` flavor generally handles models from the HuggingFace Transformers library, some models or configurations might not align with this standard approach. In such cases, like ours, where the model cannot utilize the default `pipeline` type, we face a unique challenge of deploying these models using MLflow.\n", "\n", "#### The Power of Custom PyFunc\n", "To address this, MLflow's custom `pyfunc` comes to the rescue. It allows us to:\n", "\n", "- Handle model loading and its dependencies efficiently.\n", "- Customize the inference process to suit specific model requirements.\n", "- Adapt interface data to create a user-friendly environment in deployed applications.\n", "\n", "Our focus will be on the practical application of a custom `pyfunc` to deploy LLMs effectively within MLflow's ecosystem.\n", "\n", "By the end of this tutorial, you'll be equipped with the knowledge to tackle similar challenges in your machine learning projects, leveraging the full potential of MLflow for custom model deployments." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Important Considerations Before Proceeding\n", "\n", "#### Hardware Recommendations\n", "This guide demonstrates the usage of a particularly large and intricate Large Language Model (LLM). Given its complexity:\n", "\n", "- **GPU Requirement**: It's **strongly advised** to run this example on a system with a CUDA-capable GPU that possesses at least 64GB of VRAM.\n", "- **CPU Caution**: While technically feasible, executing the model on a CPU can result in extremely prolonged inference times, potentially taking tens of minutes for a single prediction, even on top-tier CPUs. The final cell of this notebook is deliberately not executed due to the limitations with performance when running this model on a CPU-only system. However, with an appropriately powerful GPU, the total runtime of this notebook is ~8 minutes end to end.\n", "\n", "#### Execution Recommendations\n", "If you're considering running the code in this notebook:\n", "\n", "- **Performance**: For a smoother experience and to truly harness the model's capabilities, use hardware aligned with the model's design.\n", "\n", "- **Dependencies**: Ensure you've installed the recommended dependencies for optimal model performance. 
These are crucial for efficient model loading, initialization, attention computations, and inference processing:\n", "\n", "```bash\n", "pip install xformers==0.0.20 einops==0.6.1 flash-attn==v1.0.3.post0 triton-pre-mlir@git+https://github.com/vchiley/triton.git@triton_pre_mlir#subdirectory=python\n", "```" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/benjamin.wilson/miniconda3/envs/mlflow-dev-env/lib/python3.8/site-packages/pydantic/_internal/_fields.py:128: UserWarning: Field \"model_server_url\" has conflict with protected namespace \"model_\".\n", "\n", "You may be able to resolve this warning by setting `model_config['protected_namespaces'] = ()`.\n", " warnings.warn(\n", "/Users/benjamin.wilson/miniconda3/envs/mlflow-dev-env/lib/python3.8/site-packages/pydantic/_internal/_config.py:317: UserWarning: Valid config keys have changed in V2:\n", "* 'schema_extra' has been renamed to 'json_schema_extra'\n", " warnings.warn(message, UserWarning)\n" ] } ], "source": [ "# Load necessary libraries\n", "\n", "import accelerate\n", "import torch\n", "import transformers\n", "from huggingface_hub import snapshot_download\n", "\n", "import mlflow" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Downloading the Model and Tokenizer\n", "\n", "First, we need to download our model and tokenizer. 
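" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a general sketch, a full-repository download from the Hugging Face Hub uses `huggingface_hub.snapshot_download`. The repository ID below is an assumption for illustration, and the files total many gigabytes, so the call is wrapped in a function rather than executed immediately:\n", "\n", "```python\n", "from huggingface_hub import snapshot_download\n", "\n", "# Assumed repository ID for this tutorial's model; swap in the repository you need.\n", "MODEL_REPO = \"mosaicml/mpt-7b-instruct\"\n", "\n", "\n", "def fetch_snapshot(repo_id=MODEL_REPO):\n", "    # Downloads (or reuses a cached copy of) every file in the repository and\n", "    # returns the local directory holding the weights and tokenizer files.\n", "    return snapshot_download(repo_id=repo_id)\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "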
Here's how we do it:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "d37352760f8f4c6386a58aee7506a0a4", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Fetching 24 files: 0%| | 0/24 [00:00" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# If you are running this tutorial in local mode, leave the next line commented out.\n", "# Otherwise, uncomment the following line and set your tracking uri to your local or remote tracking server.\n", "\n", "# mlflow.set_tracking_uri(\"http://127.0.0.1:8080\")\n", "\n", "mlflow.set_experiment(experiment_name=\"mpt-7b-instruct-evaluation\")" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "a9585d61652740f888474c89bfb0a6ad", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading artifacts: 0%| | 0/24 [00:00