{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## MLflow Signature Playground Notebook\n", "Welcome to the MLflow Signature Playground! This interactive Jupyter notebook is designed to guide you through the foundational concepts of [Model Signatures](https://mlflow.org/docs/latest/model/signatures.html) within the MLflow ecosystem. As you progress through the notebook, you'll gain practical experience with defining, enforcing, and utilizing model signatures—a critical aspect of model management that enhances reproducibility, reliability, and ease of use.\n", "\n", "### Why Model Signatures Matter\n", "In the realm of machine learning, defining the inputs and outputs of models with precision is key to ensuring smooth operations. Model signatures serve as the schema definition for the data your model expects and produces, acting as a blueprint for both model developers and users. This not only clarifies expectations but also facilitates automatic validation checks, streamlining the process from model training to deployment.\n", "\n", "### Signature Enforcement in Action\n", "By exploring the code cells in this notebook, you'll witness firsthand how model signatures can enforce data integrity, prevent common errors, and provide descriptive feedback when discrepancies occur. This is invaluable for maintaining the quality and consistency of model inputs, especially when models are served in production environments.\n", "\n", "### Practical Examples for a Deeper Understanding\n", "The notebook includes a range of examples showcasing different data types and structures, from simple scalars to complex nested dictionaries. 
These examples demonstrate how signatures are inferred, logged, and updated, providing you with a comprehensive understanding of the signature lifecycle.\n", "As you interact with the provided PythonModel instances and invoke their predict methods, you'll learn how to handle various input scenarios, accounting for both required and optional data fields, and how to update existing models to include detailed signatures.\n", "Whether you're a data scientist looking to refine your model management practices or a developer integrating MLflow into your workflow, this notebook is your sandbox for mastering model signatures. Let's dive in and explore the robust capabilities of MLflow signatures!\n", "\n", "> NOTE: Several of the features shown in this notebook are only available in MLflow version 2.10.0 and higher. In particular, support for the `Array` and `Object` types is not available prior to version 2.10.0." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "import mlflow\n", "from mlflow.models.signature import infer_signature, set_signature\n", "\n", "\n", "def report_signature_info(input_data, output_data=None, params=None):\n", "    \"\"\"Infer a signature from the given data and print a short report.\"\"\"\n", "    inferred_signature = infer_signature(input_data, output_data, params)\n", "\n", "    report = f\"\"\"\n", "The input data: \\n\\t{input_data}.\n", "The data is of type: {type(input_data)}.\n", "The inferred signature is:\\n\\n{inferred_signature}\n", "\"\"\"\n", "    print(report)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Scalar Support in MLflow Signatures\n", "In this segment of the tutorial, we explore the critical role of scalar data types in the context of MLflow's model signatures. Scalar types, such as strings, integers, floats, doubles, booleans, and datetimes, are fundamental to defining the schema for a model's input and output. 
Accurate representation of these types is essential for ensuring that models process data correctly, which directly impacts the reliability and accuracy of predictions.\n", "\n", "By examining examples of various scalar types, this section demonstrates how MLflow infers and records the structure and nature of data. We'll see how MLflow signatures cater to different scalar types, ensuring that the data fed into the model matches the expected format. This understanding is crucial for any machine learning practitioner, as it helps in preparing and validating data inputs, leading to smoother model operations and more reliable results.\n", "\n", "Through practical examples, including lists of strings, floats, and other types, we illustrate how MLflow's `infer_signature` function can accurately deduce the data format. This capability is a cornerstone in MLflow's ability to handle diverse data inputs and forms the basis for more complex data structures in machine learning models. By the end of this section, you'll have a clear grasp of how scalar data is represented within MLflow signatures and why this is important for your ML projects.\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "The input data: \n", "\t['a', 'list', 'of', 'strings'].\n", "The data is of type: <class 'list'>.\n", "The inferred signature is:\n", "\n", "inputs: \n", " [string (required)]\n", "outputs: \n", " None\n", "params: \n", " None\n", "\n", "\n" ] } ], "source": [ "# List of strings\n", "\n", "report_signature_info([\"a\", \"list\", \"of\", \"strings\"])" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "The input data: \n", "\t[0.117, 1.99].\n", "The data is of type: <class 'list'>.\n", "The inferred signature is:\n", "\n", "inputs: \n", " [float (required)]\n", "outputs: \n", " None\n", "params: \n", " None\n", "\n", "\n" ] } ], 
"source": [ "# List of floats\n", "\n", "report_signature_info([np.float32(0.117), np.float32(1.99)])" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "The input data: \n", "\t input_data\n", "0 0.117\n", "1 1.990.\n", "The data is of type: .\n", "The inferred signature is:\n", "\n", "inputs: \n", " ['input_data': double (required)]\n", "outputs: \n", " None\n", "params: \n", " None\n", "\n", "\n" ] } ], "source": [ "# Adding a column header to a list of doubles\n", "my_data = pd.DataFrame({\"input_data\": [np.float64(0.117), np.float64(1.99)]})\n", "report_signature_info(my_data)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "The input data: \n", "\t[{'a': 'a1', 'b': 'b1'}, {'a': 'a2', 'b': 'b2'}].\n", "The data is of type: .\n", "The inferred signature is:\n", "\n", "inputs: \n", " ['a': string (required), 'b': string (required)]\n", "outputs: \n", " None\n", "params: \n", " None\n", "\n", "\n" ] } ], "source": [ "# List of Dictionaries\n", "report_signature_info([{\"a\": \"a1\", \"b\": \"b1\"}, {\"a\": \"a2\", \"b\": \"b2\"}])" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "The input data: \n", "\t[['a', 'b', 'c'], ['d', 'e', 'f']].\n", "The data is of type: .\n", "The inferred signature is:\n", "\n", "inputs: \n", " [Array(string) (required)]\n", "outputs: \n", " None\n", "params: \n", " None\n", "\n", "\n" ] } ], "source": [ "# List of Arrays of strings\n", "report_signature_info([[\"a\", \"b\", \"c\"], [\"d\", \"e\", \"f\"]])" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "The input data: \n", "\t[[{'a': 'a', 'b': 'b'}, {'a': 'a', 'b': 'b'}], [{'a': 'a', 'b': 'b'}, {'a': 'a', 'b': 
'b'}]].\n", "The data is of type: <class 'list'>.\n", "The inferred signature is:\n", "\n", "inputs: \n", " [Array({a: string (required), b: string (required)}) (required)]\n", "outputs: \n", " None\n", "params: \n", " None\n", "\n", "\n" ] } ], "source": [ "# List of Arrays of Dictionaries\n", "report_signature_info(\n", "    [[{\"a\": \"a\", \"b\": \"b\"}, {\"a\": \"a\", \"b\": \"b\"}], [{\"a\": \"a\", \"b\": \"b\"}, {\"a\": \"a\", \"b\": \"b\"}]]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Understanding Type Conversion: Int to Long\n", "\n", "In this section of the tutorial, we observe an interesting aspect of type conversion in MLflow's schema inference. When reporting the signature information for a list of integers, you might notice that the inferred data type is `long` instead of `int`. This conversion from int to long is not an error or bug but a valid and intentional type conversion within MLflow's schema inference mechanism.\n", "\n", "#### Why Integers are Inferred as Long\n", "- **Broader Compatibility:** The conversion to `long` ensures compatibility across various platforms and systems. Since the size of an integer (int) can vary depending on the system architecture, using `long` (which has a more consistent size specification) avoids potential discrepancies and data overflow issues.\n", "- **Data Integrity:** By inferring integers as long, MLflow ensures that larger integer values, which might exceed the typical capacity of an int, are accurately represented and handled without data loss or overflow.\n", "- **Consistency in Machine Learning Models:** In many machine learning frameworks, especially those involving larger datasets or computations, long integers are often the standard data type for numerical operations. This standardization in the inferred schema aligns with common practices in the machine learning community." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "The input data: \n", "\t[1, 2, 3].\n", "The data is of type: <class 'list'>.\n", "The inferred signature is:\n", "\n", "inputs: \n", " [long (required)]\n", "outputs: \n", " None\n", "params: \n", " None\n", "\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/Users/benjamin.wilson/repos/mlflow-fork/mlflow/mlflow/types/utils.py:378: UserWarning: Hint: Inferred schema contains integer column(s). Integer columns in Python cannot represent missing values. If your input data contains missing values at inference time, it will be encoded as floats and will cause a schema enforcement error. The best way to avoid this problem is to infer the model schema based on a realistic data sample (training dataset) that includes missing values. Alternatively, you can declare integer columns as doubles (float64) whenever these columns may have missing values. See `Handling Integers With Missing Values `_ for more details.\n", " warnings.warn(\n" ] } ], "source": [ "# List of integers\n", "report_signature_info([1, 2, 3])" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "The input data: \n", "\t[True, False, False, False, True].\n", "The data is of type: <class 'list'>.\n", "The inferred signature is:\n", "\n", "inputs: \n", " [boolean (required)]\n", "outputs: \n", " None\n", "params: \n", " None\n", "\n", "\n" ] } ], "source": [ "# List of Booleans\n", "report_signature_info([True, False, False, False, True])" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "The input data: \n", "\t[numpy.datetime64('2023-12-24T11:59:59'), numpy.datetime64('2023-12-25T00:00:00')].\n", "The data is of type: <class 'list'>.\n", "The inferred signature is:\n", "\n", "inputs: \n", " [datetime 
(required)]\n", "outputs: \n", " None\n", "params: \n", " None\n", "\n", "\n" ] } ], "source": [ "# List of Datetimes\n", "report_signature_info([np.datetime64(\"2023-12-24 11:59:59\"), np.datetime64(\"2023-12-25 00:00:00\")])" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "The input data: \n", "\t[{'a': 'b', 'b': [1, 2, 3], 'c': {'d': [4, 5, 6]}}].\n", "The data is of type: <class 'list'>.\n", "The inferred signature is:\n", "\n", "inputs: \n", " ['a': string (required), 'b': Array(long) (required), 'c': {d: Array(long) (required)} (required)]\n", "outputs: \n", " None\n", "params: \n", " None\n", "\n", "\n" ] } ], "source": [ "# Complex list of Dictionaries\n", "report_signature_info([{\"a\": \"b\", \"b\": [1, 2, 3], \"c\": {\"d\": [4, 5, 6]}}])" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "The input data: \n", "\t a b c f\n", "0 a [a, b, c] {'d': 1, 'e': 0.1} [{'g': 'g'}, {'h': 1}]\n", "1 NaN [a, b] {'d': 2, 'f': 'f'} [{'g': 'g'}].\n", "The data is of type: <class 'pandas.core.frame.DataFrame'>.\n", "The inferred signature is:\n", "\n", "inputs: \n", " ['a': string (optional), 'b': Array(string) (required), 'c': {d: long (required), e: double (optional), f: string (optional)} (required), 'f': Array({g: string (optional), h: long (optional)}) (required)]\n", "outputs: \n", " None\n", "params: \n", " None\n", "\n", "\n" ] } ], "source": [ "# Pandas DF input\n", "\n", "data = [\n", "    {\"a\": \"a\", \"b\": [\"a\", \"b\", \"c\"], \"c\": {\"d\": 1, \"e\": 0.1}, \"f\": [{\"g\": \"g\"}, {\"h\": 1}]},\n", "    {\"b\": [\"a\", \"b\"], \"c\": {\"d\": 2, \"f\": \"f\"}, \"f\": [{\"g\": \"g\"}]},\n", "]\n", "data = pd.DataFrame(data)\n", "\n", "report_signature_info(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Signature Enforcement\n", "\n", "In this part of the tutorial, we focus on the practical application of 
signature enforcement in MLflow. Signature enforcement is a powerful feature that ensures the data provided to a model aligns with the defined input schema. This step is crucial in preventing errors and inconsistencies that can arise from mismatched or incorrectly formatted data.\n", "\n", "Through hands-on examples, we will observe how MLflow enforces the conformity of data to the expected signature at runtime. We'll use the `MyModel` class, a simple Python model, to demonstrate how MLflow checks the compatibility of input data against the model's signature. This process helps in safeguarding the model against incompatible or erroneous inputs, thereby enhancing the robustness and reliability of model predictions.\n", "\n", "This section also highlights the importance of precise data representation in MLflow and the implications it has on model performance. By testing with different types of data, including those that do not conform to the expected schema, we will see how MLflow validates data and provides informative feedback. 
This aspect of signature enforcement is invaluable for debugging data issues and refining model inputs, making it a key skill for anyone involved in deploying machine learning models.\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "class MyModel(mlflow.pyfunc.PythonModel):\n", "    def predict(self, context, model_input, params=None):\n", "        # Echo the input back unchanged so the output schema mirrors the input schema\n", "        return model_input" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "The input data: \n", "\t[{'a': ['a', 'b', 'c'], 'b': 'b', 'c': {'d': 'd'}}, {'a': ['a'], 'c': {'d': 'd', 'e': 'e'}}].\n", "The data is of type: <class 'list'>.\n", "The inferred signature is:\n", "\n", "inputs: \n", " ['a': Array(string) (required), 'b': string (optional), 'c': {d: string (required), e: string (optional)} (required)]\n", "outputs: \n", " None\n", "params: \n", " None\n", "\n", "\n" ] } ], "source": [ "data = [{\"a\": [\"a\", \"b\", \"c\"], \"b\": \"b\", \"c\": {\"d\": \"d\"}}, {\"a\": [\"a\"], \"c\": {\"d\": \"d\", \"e\": \"e\"}}]\n", "\n", "report_signature_info(data)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/benjamin.wilson/miniconda3/envs/mlflow-dev-env/lib/python3.8/site-packages/_distutils_hack/__init__.py:30: UserWarning: Setuptools is replacing distutils.\n", " warnings.warn(\"Setuptools is replacing distutils.\")\n" ] }, { "data": { "text/html": [ "
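<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>a</th>\n", "      <th>b</th>\n", "      <th>c</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>0</th>\n", "      <td>[a, b, c]</td>\n", "      <td>b</td>\n", "      <td>{'d': 'd'}</td>\n", "    </tr>\n", "    <tr>\n", "      <th>1</th>\n", "      <td>[a]</td>\n", "      <td>NaN</td>\n", "      <td>{'d': 'd', 'e': 'e'}</td>\n", "    </tr>\n", "  </tbody>\n", "</table>\n", "</div>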
" ], "text/plain": [ " a b c\n", "0 [a, b, c] b {'d': 'd'}\n", "1 [a] NaN {'d': 'd', 'e': 'e'}" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Generate a prediction that will serve as the model output example for signature inference\n", "model_output = MyModel().predict(context=None, model_input=data)\n", "\n", "with mlflow.start_run():\n", " model_info = mlflow.pyfunc.log_model(\n", " python_model=MyModel(),\n", " artifact_path=\"test_model\",\n", " signature=infer_signature(model_input=data, model_output=model_output),\n", " )\n", "\n", "loaded_model = mlflow.pyfunc.load_model(model_info.model_uri)\n", "prediction = loaded_model.predict(data)\n", "\n", "prediction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can check the inferred signature directly from the logged model information that is returned from the call to `log_model()`" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "inputs: \n", " ['a': Array(string) (required), 'b': string (optional), 'c': {d: string (required), e: string (optional)} (required)]\n", "outputs: \n", " ['a': Array(string) (required), 'b': string (optional), 'c': {d: string (required), e: string (optional)} (required)]\n", "params: \n", " None" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model_info.signature" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also quickly verify that the logged input signature matches the signature inference. While we're at it, we can generate the output signature as well. \n", "\n", "> NOTE: it is recommended to log both the input and output signatures with your models. 
" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "The input data: \n", "\t[{'a': ['a', 'b', 'c'], 'b': 'b', 'c': {'d': 'd'}}, {'a': ['a'], 'c': {'d': 'd', 'e': 'e'}}].\n", "The data is of type: .\n", "The inferred signature is:\n", "\n", "inputs: \n", " ['a': Array(string) (required), 'b': string (optional), 'c': {d: string (required), e: string (optional)} (required)]\n", "outputs: \n", " ['a': Array(string) (required), 'b': string (optional), 'c': {d: string (required), e: string (optional)} (required)]\n", "params: \n", " None\n", "\n", "\n" ] } ], "source": [ "report_signature_info(data, prediction)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
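<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>a</th>\n", "      <th>c</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>0</th>\n", "      <td>[a, b, c]</td>\n", "      <td>{'d': 'd'}</td>\n", "    </tr>\n", "  </tbody>\n", "</table>\n", "</div>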
" ], "text/plain": [ " a c\n", "0 [a, b, c] {'d': 'd'}" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Using the model while not providing an optional input (note the output return structure and the non existent optional columns)\n", "\n", "loaded_model.predict([{\"a\": [\"a\", \"b\", \"c\"], \"c\": {\"d\": \"d\"}}])" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "ename": "MlflowException", "evalue": "Failed to enforce schema of data '[{'b': 'b'}]' with schema '['a': Array(string) (required), 'b': string (optional), 'c': {d: string (required), e: string (optional)} (required)]'. Error: Model is missing inputs ['a', 'c'].", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mMlflowException\u001b[0m Traceback (most recent call last)", "\u001b[0;32m~/repos/mlflow-fork/mlflow/mlflow/pyfunc/__init__.py\u001b[0m in \u001b[0;36mpredict\u001b[0;34m(self, data, params)\u001b[0m\n\u001b[1;32m 469\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 470\u001b[0;31m \u001b[0mdata\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_enforce_schema\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdata\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minput_schema\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 471\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mException\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/repos/mlflow-fork/mlflow/mlflow/models/utils.py\u001b[0m in \u001b[0;36m_enforce_schema\u001b[0;34m(pf_input, input_schema)\u001b[0m\n\u001b[1;32m 939\u001b[0m \u001b[0mmessage\u001b[0m \u001b[0;34m+=\u001b[0m \u001b[0;34mf\" Note that there were extra inputs: 
{extra_cols}\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 940\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mMlflowException\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmessage\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 941\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0minput_schema\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mis_tensor_spec\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mMlflowException\u001b[0m: Model is missing inputs ['a', 'c'].", "\nDuring handling of the above exception, another exception occurred:\n", "\u001b[0;31mMlflowException\u001b[0m Traceback (most recent call last)", "\u001b[0;32m/var/folders/cd/n8n0rm2x53l_s0xv_j_xklb00000gp/T/ipykernel_97464/1628231496.py\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;31m# stating that the required fields \"a\" and \"c\" are missing)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 4\u001b[0;31m \u001b[0mloaded_model\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpredict\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m{\u001b[0m\u001b[0;34m\"b\"\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m\"b\"\u001b[0m\u001b[0;34m}\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m~/repos/mlflow-fork/mlflow/mlflow/pyfunc/__init__.py\u001b[0m in \u001b[0;36mpredict\u001b[0;34m(self, data, params)\u001b[0m\n\u001b[1;32m 471\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mException\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 472\u001b[0m \u001b[0;31m# Include error in message for backwards 
compatibility\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 473\u001b[0;31m raise MlflowException.invalid_parameter_value(\n\u001b[0m\u001b[1;32m 474\u001b[0m \u001b[0;34mf\"Failed to enforce schema of data '{data}' \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 475\u001b[0m \u001b[0;34mf\"with schema '{input_schema}'. \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mMlflowException\u001b[0m: Failed to enforce schema of data '[{'b': 'b'}]' with schema '['a': Array(string) (required), 'b': string (optional), 'c': {d: string (required), e: string (optional)} (required)]'. Error: Model is missing inputs ['a', 'c']." ] } ], "source": [ "# Using the model while omitting the input of required fields (this will raise an Exception from schema enforcement,\n", "# stating that the required fields \"a\" and \"c\" are missing)\n", "\n", "loaded_model.predict([{\"b\": \"b\"}])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Updating Signatures\n", " This section of the tutorial addresses the dynamic nature of data and models, focusing on the crucial task of updating an MLflow model's signature. As datasets evolve and requirements change, it becomes necessary to modify the signature of a model to align with the new data structure or inputs. This ability to update a signature is key to maintaining the accuracy and relevance of your model over time.\n", "\n", "We will demonstrate how to identify when a signature update is needed and walk through the process of creating and applying a new signature to an existing model. This section highlights the flexibility of MLflow in accommodating changes in data formats and structures without the need to re-save the entire model. 
However, for registered models in MLflow, updating the signature requires re-registering the model to reflect the changes in the registered version.\n", "\n", "By walking through these steps, you will learn how to correct a model's signature when the one you defined manually is invalid, and how to add a valid signature to a model that was logged without one.\n" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# Updating an existing model that wasn't saved with a signature\n", "\n", "\n", "class MyTypeCheckerModel(mlflow.pyfunc.PythonModel):\n", "    def predict(self, context, model_input, params=None):\n", "        print(type(model_input))\n", "        print(model_input)\n", "        if not isinstance(model_input, (pd.DataFrame, list)):\n", "            raise ValueError(\"The input must be a list or a pandas DataFrame.\")\n", "        return \"Input is valid.\"\n", "\n", "\n", "with mlflow.start_run():\n", "    model_info = mlflow.pyfunc.log_model(\n", "        python_model=MyTypeCheckerModel(),\n", "        artifact_path=\"test_model\",\n", "    )\n", "\n", "loaded_model = mlflow.pyfunc.load_model(model_info.model_uri)\n", "\n", "loaded_model.metadata.signature" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'list'>\n", "[{'a': 'we are expecting strings', 'b': 'and only strings'}, [1, 2, 3]]\n" ] }, { "data": { "text/plain": [ "'Input is valid.'" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_data = [{\"a\": \"we are expecting strings\", \"b\": \"and only strings\"}, [1, 2, 3]]\n", "loaded_model.predict(test_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The Necessity of Schema Enforcement in MLflow\n", "\n", "In this part of the tutorial, we address a common challenge in machine learning model deployment: the clarity and interpretability of error messages. 
Without schema enforcement, models can often return cryptic or misleading error messages. This occurs because, in the absence of a well-defined schema, the model attempts to process inputs that may not align with its expectations, leading to ambiguous or hard-to-diagnose errors.\n", "\n", "#### Why Schema Enforcement Matters\n", "Schema enforcement acts as a gatekeeper, ensuring that the data fed into a model precisely matches the expected format. This not only reduces the likelihood of runtime errors but also makes any errors that do occur much easier to understand and rectify. Without such enforcement, diagnosing issues becomes a time-consuming and complex task, often requiring deep dives into the model's internal logic.\n", "\n", "#### Updating Model Signature for Clearer Error Messages\n", "To illustrate the value of schema enforcement, we will update the signature of a saved model to match an expected data structure. This process involves defining the expected data structure, using the `infer_signature` function to generate the appropriate signature, and then applying this signature to the model using `set_signature`. 
By doing so, we ensure that any future errors are more informative and aligned with the data structure we anticipate, simplifying troubleshooting and enhancing model reliability.\n" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'list'>\n", "[{'a': 'string', 'b': 'another string'}, {'a': 'string'}]\n" ] } ], "source": [ "expected_data_structure = [{\"a\": \"string\", \"b\": \"another string\"}, {\"a\": \"string\"}]\n", "\n", "signature = infer_signature(expected_data_structure, loaded_model.predict(expected_data_structure))\n", "\n", "set_signature(model_info.model_uri, signature)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "inputs: \n", " ['a': string (required), 'b': string (optional)]\n", "outputs: \n", " [string (required)]\n", "params: \n", " None" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "loaded_with_signature = mlflow.pyfunc.load_model(model_info.model_uri)\n", "\n", "loaded_with_signature.metadata.signature" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n", " a b\n", "0 string another string\n", "1 string NaN\n" ] }, { "data": { "text/plain": [ "'Input is valid.'" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "loaded_with_signature.predict(expected_data_structure)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Validating that schema enforcement will not permit a flawed input\n", "\n", "Now that we've set our signature correctly and updated the model definition, let's ensure that the previous flawed input type will raise a useful error message!" 
] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "ename": "MlflowException", "evalue": "Failed to enforce schema of data '[{'a': 'we are expecting strings', 'b': 'and only strings'}, [1, 2, 3]]' with schema '['a': string (required), 'b': string (optional)]'. Error: 'list' object has no attribute 'keys'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mAttributeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m~/repos/mlflow-fork/mlflow/mlflow/pyfunc/__init__.py\u001b[0m in \u001b[0;36mpredict\u001b[0;34m(self, data, params)\u001b[0m\n\u001b[1;32m 469\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 470\u001b[0;31m \u001b[0mdata\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_enforce_schema\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdata\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minput_schema\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 471\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mException\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/repos/mlflow-fork/mlflow/mlflow/models/utils.py\u001b[0m in \u001b[0;36m_enforce_schema\u001b[0;34m(pf_input, input_schema)\u001b[0m\n\u001b[1;32m 907\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpf_input\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mlist\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mndarray\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mSeries\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 908\u001b[0;31m 
\u001b[0mpf_input\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mDataFrame\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpf_input\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 909\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/miniconda3/envs/mlflow-dev-env/lib/python3.8/site-packages/pandas/core/frame.py\u001b[0m in \u001b[0;36m__init__\u001b[0;34m(self, data, index, columns, dtype, copy)\u001b[0m\n\u001b[1;32m 781\u001b[0m \u001b[0mcolumns\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mensure_index\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 782\u001b[0;31m arrays, columns, index = nested_data_to_arrays(\n\u001b[0m\u001b[1;32m 783\u001b[0m \u001b[0;31m# error: Argument 3 to \"nested_data_to_arrays\" has incompatible\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/miniconda3/envs/mlflow-dev-env/lib/python3.8/site-packages/pandas/core/internals/construction.py\u001b[0m in \u001b[0;36mnested_data_to_arrays\u001b[0;34m(data, columns, index, dtype)\u001b[0m\n\u001b[1;32m 497\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 498\u001b[0;31m \u001b[0marrays\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcolumns\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mto_arrays\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdata\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcolumns\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 499\u001b[0m \u001b[0mcolumns\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mensure_index\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 
"\u001b[0;32m~/miniconda3/envs/mlflow-dev-env/lib/python3.8/site-packages/pandas/core/internals/construction.py\u001b[0m in \u001b[0;36mto_arrays\u001b[0;34m(data, columns, dtype)\u001b[0m\n\u001b[1;32m 831\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdata\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mabc\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mMapping\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 832\u001b[0;31m \u001b[0marr\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcolumns\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_list_of_dict_to_arrays\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdata\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcolumns\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 833\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdata\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mABCSeries\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/miniconda3/envs/mlflow-dev-env/lib/python3.8/site-packages/pandas/core/internals/construction.py\u001b[0m in \u001b[0;36m_list_of_dict_to_arrays\u001b[0;34m(data, columns)\u001b[0m\n\u001b[1;32m 911\u001b[0m \u001b[0msort\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0many\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0md\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdict\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0md\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mdata\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 912\u001b[0;31m \u001b[0mpre_cols\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0mlib\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfast_unique_multiple_list_gen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mgen\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0msort\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0msort\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 913\u001b[0m \u001b[0mcolumns\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mensure_index\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpre_cols\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/miniconda3/envs/mlflow-dev-env/lib/python3.8/site-packages/pandas/_libs/lib.pyx\u001b[0m in \u001b[0;36mpandas._libs.lib.fast_unique_multiple_list_gen\u001b[0;34m()\u001b[0m\n", "\u001b[0;32m~/miniconda3/envs/mlflow-dev-env/lib/python3.8/site-packages/pandas/core/internals/construction.py\u001b[0m in \u001b[0;36m\u001b[0;34m(.0)\u001b[0m\n\u001b[1;32m 909\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mcolumns\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 910\u001b[0;31m \u001b[0mgen\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mlist\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mkeys\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mx\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mdata\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 911\u001b[0m \u001b[0msort\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0many\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0md\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdict\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0md\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mdata\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 
"\u001b[0;31mAttributeError\u001b[0m: 'list' object has no attribute 'keys'", "\nDuring handling of the above exception, another exception occurred:\n", "\u001b[0;31mMlflowException\u001b[0m Traceback (most recent call last)", "\u001b[0;32m/var/folders/cd/n8n0rm2x53l_s0xv_j_xklb00000gp/T/ipykernel_97464/2586525788.py\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mloaded_with_signature\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpredict\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtest_data\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m~/repos/mlflow-fork/mlflow/mlflow/pyfunc/__init__.py\u001b[0m in \u001b[0;36mpredict\u001b[0;34m(self, data, params)\u001b[0m\n\u001b[1;32m 471\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mException\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 472\u001b[0m \u001b[0;31m# Include error in message for backwards compatibility\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 473\u001b[0;31m raise MlflowException.invalid_parameter_value(\n\u001b[0m\u001b[1;32m 474\u001b[0m \u001b[0;34mf\"Failed to enforce schema of data '{data}' \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 475\u001b[0m \u001b[0;34mf\"with schema '{input_schema}'. \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mMlflowException\u001b[0m: Failed to enforce schema of data '[{'a': 'we are expecting strings', 'b': 'and only strings'}, [1, 2, 3]]' with schema '['a': string (required), 'b': string (optional)]'. 
Error: 'list' object has no attribute 'keys'" ] } ], "source": [ "loaded_with_signature.predict(test_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Wrapping Up: Insights and Best Practices from the MLflow Signature Playground\n", "This notebook walked through how model signatures work in MLflow: how they are inferred from example data, how they are enforced at prediction time, and how they can be updated on an existing model.\n", "\n", "Key takeaways include the importance of accurately defining scalar types, the role of signature enforcement in protecting data integrity (as in the final example, where `predict` raised an `MlflowException` because the input did not match the logged schema), and the flexibility MLflow offers for updating an invalid model signature. These concepts are fundamental to deploying and managing models in real-world scenarios.\n", "\n", "Whether you're a data scientist refining your models or a developer integrating machine learning into your applications, understanding and using model signatures is crucial. We hope this tutorial has given you a solid foundation in MLflow signatures that you can apply in your future ML projects." ] } ], "metadata": { "kernelspec": { "display_name": "mlflow-dev-env", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" } }, "nbformat": 4, "nbformat_minor": 2 }