{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Serving LLMs with MLflow: Leveraging Custom PyFunc" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "Download this Notebook" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Introduction\n", "\n", "This tutorial guides you through saving and deploying Large Language Models (LLMs) using a custom `pyfunc` with MLflow, ideal for models not directly supported by MLflow's default transformers flavor.\n", "\n", "### Learning Objectives\n", "\n", "- Understand the need for custom `pyfunc` definitions in specific model scenarios.\n", "- Learn to create a custom `pyfunc` to manage model dependencies and interface data.\n", "- Gain insights into simplifying user interfaces in deployed environments with custom `pyfunc`.\n", "\n", "#### The Challenge with Default Implementations\n", "While MLflow's `transformers` flavor generally handles models from the HuggingFace Transformers library, some models or configurations might not align with this standard approach. In such cases, like ours, where the model cannot utilize the default `pipeline` type, we face a unique challenge of deploying these models using MLflow.\n", "\n", "#### The Power of Custom PyFunc\n", "To address this, MLflow's custom `pyfunc` comes to the rescue. It allows us to:\n", "\n", "- Handle model loading and its dependencies efficiently.\n", "- Customize the inference process to suit specific model requirements.\n", "- Adapt interface data to create a user-friendly environment in deployed applications.\n", "\n", "Our focus will be on the practical application of a custom `pyfunc` to deploy LLMs effectively within MLflow's ecosystem.\n", "\n", "By the end of this tutorial, you'll be equipped with the knowledge to tackle similar challenges in your machine learning projects, leveraging the full potential of MLflow for custom model deployments." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Important Considerations Before Proceeding\n", "\n", "#### Hardware Recommendations\n", "This guide demonstrates the usage of a particularly large and intricate Large Language Model (LLM). Given its complexity:\n", "\n", "- **GPU Requirement**: It's **strongly advised** to run this example on a system with a CUDA-capable GPU that possesses at least 64GB of VRAM.\n", "- **CPU Caution**: While technically feasible, executing the model on a CPU can result in extremely prolonged inference times, potentially taking tens of minutes for a single prediction, even on top-tier CPUs. The final cell of this notebook is deliberately not executed due to the limitations with performance when running this model on a CPU-only system. However, with an appropriately powerful GPU, the total runtime of this notebook is ~8 minutes end to end.\n", "\n", "#### Execution Recommendations\n", "If you're considering running the code in this notebook:\n", "\n", "- **Performance**: For a smoother experience and to truly harness the model's capabilities, use hardware aligned with the model's design.\n", "\n", "- **Dependencies**: Ensure you've installed the recommended dependencies for optimal model performance. 
These are crucial for efficient model loading, initialization, attention computations, and inference processing:\n", "\n", "```bash\n", "pip install xformers==0.0.20 einops==0.6.1 flash-attn==v1.0.3.post0 triton-pre-mlir@git+https://github.com/vchiley/triton.git@triton_pre_mlir#subdirectory=python\n", "```" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/benjamin.wilson/miniconda3/envs/mlflow-dev-env/lib/python3.8/site-packages/pydantic/_internal/_fields.py:128: UserWarning: Field \"model_server_url\" has conflict with protected namespace \"model_\".\n", "\n", "You may be able to resolve this warning by setting `model_config['protected_namespaces'] = ()`.\n", " warnings.warn(\n", "/Users/benjamin.wilson/miniconda3/envs/mlflow-dev-env/lib/python3.8/site-packages/pydantic/_internal/_config.py:317: UserWarning: Valid config keys have changed in V2:\n", "* 'schema_extra' has been renamed to 'json_schema_extra'\n", " warnings.warn(message, UserWarning)\n" ] } ], "source": [ "# Load necessary libraries\n", "\n", "import accelerate\n", "import torch\n", "import transformers\n", "from huggingface_hub import snapshot_download\n", "\n", "import mlflow" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Downloading the Model and Tokenizer\n", "\n", "First, we need to download our model and tokenizer. 
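" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a general sketch, a full-repository download from the Hugging Face Hub uses `huggingface_hub.snapshot_download`. The repository ID below is an assumption for illustration, and the files total many gigabytes, so the call is wrapped in a function rather than executed immediately:\n", "\n", "```python\n", "from huggingface_hub import snapshot_download\n", "\n", "# Assumed repository ID for this tutorial's model; swap in the repository you need.\n", "MODEL_REPO = \"mosaicml/mpt-7b-instruct\"\n", "\n", "\n", "def fetch_snapshot(repo_id=MODEL_REPO):\n", "    # Downloads (or reuses a cached copy of) every file in the repository and\n", "    # returns the local directory holding the weights and tokenizer files.\n", "    return snapshot_download(repo_id=repo_id)\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "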
Here's how we do it:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "d37352760f8f4c6386a58aee7506a0a4", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Fetching 24 files: 0%| | 0/24 [00:00" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# If you are running this tutorial in local mode, leave the next line commented out.\n", "# Otherwise, uncomment the following line and set your tracking uri to your local or remote tracking server.\n", "\n", "# mlflow.set_tracking_uri(\"http://127.0.0.1:8080\")\n", "\n", "mlflow.set_experiment(experiment_name=\"mpt-7b-instruct-evaluation\")" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "a9585d61652740f888474c89bfb0a6ad", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading artifacts: 0%| | 0/24 [00:00