{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to Advanced Semantic Similarity Analysis with Sentence Transformers and MLflow\n", "\n", "This tutorial shows how to compute semantic similarity between sentences with Sentence Transformers, and how to package, log, and serve that logic with MLflow." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Learning Objectives\n", "\n", "- Configure `sentence-transformers` for semantic similarity analysis.\n", "- Implement a custom `PythonModel` in MLflow.\n", "- Log models and manage configurations with MLflow.\n", "- Load the logged model and use it for inference with MLflow.\n", "\n", "#### Unveiling the Power of Sentence Transformers for NLP\n", "\n", "Sentence Transformers are transformer models adapted to produce sentence embeddings that capture semantic meaning, which makes them well suited to semantic search and similarity analysis.\n", "\n", "#### MLflow: Pioneering Flexible Model Management and Deployment\n", "\n", "MLflow's integration with Sentence Transformers adds experiment tracking and model management to NLP projects. In this tutorial you will implement a custom `PythonModel` within MLflow, extending its functionality for requirements specific to your use case.\n", "\n", "By the end of this tutorial, you will have hands-on experience logging and deploying a Sentence Transformers model with MLflow for semantic similarity analysis."
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "env: TOKENIZERS_PARALLELISM=false\n" ] } ], "source": [ "# Disable tokenizers warnings when constructing pipelines\n", "%env TOKENIZERS_PARALLELISM=false\n", "\n", "import warnings\n", "\n", "# Disable a few less-than-useful UserWarnings from setuptools and pydantic\n", "warnings.filterwarnings(\"ignore\", category=UserWarning)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Implementing a Custom SimilarityModel with MLflow\n", "Discover how to create a custom `SimilarityModel` class using MLflow's `PythonModel` to assess semantic similarity between sentences.\n", "\n", "#### Overview of SimilarityModel\n", "\n", "The `SimilarityModel` is a tailored Python class that leverages MLflow's flexible `PythonModel` interface. It is specifically designed to encapsulate the intricacies of computing semantic similarity between sentence pairs using sophisticated sentence embeddings.\n", "\n", "#### Key Components of the Custom Model\n", "\n", "- **Importing Libraries**: Essential libraries from MLflow, data handling, and Sentence Transformers are imported to facilitate model functionality.\n", "- **Custom PythonModel - SimilarityModel**:\n", " - The `load_context` method focuses on efficient and safe model loading, crucial for handling complex models like Sentence Transformers.\n", " - The `predict` method, equipped with input type checking and error handling, ensures that the model delivers accurate cosine similarity scores, reflecting semantic correlations.\n", "\n", "#### Significance of Custom SimilarityModel\n", "\n", "- **Flexibility and Customization**: The model's design allows for specialized handling of inputs and outputs, aligning perfectly with unique requirements of semantic similarity tasks.\n", "- **Robust Error Handling**: Detailed input type checking guarantees a user-friendly experience, preventing common 
input errors and ensuring the predictability of model behavior.\n", "- **Efficient Model Loading**: Initializing the model in `load_context` avoids serializing the Sentence Transformer together with the Python class; the model is instead loaded from its saved artifact when the pyfunc model is loaded.\n", "- **Targeted Functionality**: The custom `predict` method returns cosine similarity scores directly, so callers do not need to handle embeddings themselves.\n", "\n", "This custom `SimilarityModel` shows how MLflow's `PythonModel` can be adapted to task-specific NLP logic, and the same pattern applies to other machine learning projects." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from sentence_transformers import SentenceTransformer, util\n", "\n", "import mlflow\n", "from mlflow.models.signature import infer_signature\n", "from mlflow.pyfunc import PythonModel\n", "\n", "\n", "class SimilarityModel(PythonModel):\n", "    def load_context(self, context):\n", "        \"\"\"Load the model context for inference.\"\"\"\n", "        from sentence_transformers import SentenceTransformer\n", "\n", "        try:\n", "            self.model = SentenceTransformer.load(context.artifacts[\"model_path\"])\n", "        except Exception as e:\n", "            # Chain the original exception so the root cause stays visible\n", "            raise ValueError(f\"Error loading model: {e}\") from e\n", "\n", "    def predict(self, context, model_input, params=None):\n", "        \"\"\"Predict method for comparing similarity between two sentences.\"\"\"\n", "        from sentence_transformers import util\n", "\n", "        if isinstance(model_input, pd.DataFrame):\n", "            if model_input.shape[1] != 2:\n", "                raise ValueError(\"DataFrame input must have exactly two columns.\")\n", "            # Only the first row of the DataFrame is compared\n", "            sentence_1 = model_input.iloc[0, 0]\n", "            sentence_2 = model_input.iloc[0, 1]\n", "        elif isinstance(model_input, dict):\n", "            sentence_1 = model_input.get(\"sentence_1\")\n", "            sentence_2 = model_input.get(\"sentence_2\")\n", "            if sentence_1 is None or sentence_2 is None:\n", "                raise ValueError(\n", "                    \"Both 'sentence_1' and 'sentence_2' must be provided in the input dictionary.\"\n", "                )\n", "        else:\n", "            raise TypeError(\n", "                f\"Unexpected type for model_input: {type(model_input)}. Must be either a Dict or a DataFrame.\"\n", "            )\n", "\n", "        embedding_1 = self.model.encode(sentence_1)\n", "        embedding_2 = self.model.encode(sentence_2)\n", "\n", "        return np.array(util.cos_sim(embedding_1, embedding_2).tolist())" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Preparing the Sentence Transformer Model and Signature\n", "This section covers the steps needed to set up the Sentence Transformer model for logging and deployment with MLflow.\n", "\n", "#### Loading and Saving the Pre-trained Model\n", "\n", "- **Model Initialization**: A pre-trained Sentence Transformer model, `\"all-MiniLM-L6-v2\"`, is loaded for its efficiency in generating high-quality embeddings suitable for diverse NLP tasks.\n", "- **Model Saving**: The model is saved locally to `/tmp/sbert_model` so that MLflow can reference it as an artifact when the custom model is logged.\n", "\n", "#### Preparing Input Example and Artifacts\n", "\n", "- **Input Example Creation**: A DataFrame with sample sentences is prepared, representing typical model inputs and helping to define the model's input format.\n", "- **Defining Artifacts**: The saved model's file path is specified as an artifact in MLflow, which associates the saved model with the MLflow run.\n", "\n", "#### Generating Test Output for Signature\n", "\n", "- **Test Output Calculation**: The cosine similarity between sentence embeddings is computed, providing a practical example of the model's output.\n", "- **Signature Inference**: MLflow's `infer_signature` function is used to generate a signature that captures the expected input and output formats.\n", "\n", "#### Importance of These Steps\n", "\n", "- **Model Readiness**: These preparatory 
steps ensure the model is ready to be logged and deployed with MLflow.\n", "- **Input-Output Contract**: The signature defines the model's expected inputs and outputs, which helps keep its behavior consistent across deployments.\n", "\n", "With the Sentence Transformer model and its signature prepared, we can now log and manage the model in MLflow." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "inputs: \n", " ['sentence_1': string, 'sentence_2': string]\n", "outputs: \n", " [Tensor('float64', (-1, 1))]\n", "params: \n", " None" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Load a pre-trained sentence transformer model\n", "model = SentenceTransformer(\"all-MiniLM-L6-v2\")\n", "\n", "# Create an input example DataFrame\n", "input_example = pd.DataFrame([{\"sentence_1\": \"I like apples\", \"sentence_2\": \"I like oranges\"}])\n", "\n", "# Save the model in the /tmp directory\n", "model_directory = \"/tmp/sbert_model\"\n", "model.save(model_directory)\n", "\n", "# Define artifacts with the absolute path\n", "artifacts = {\"model_path\": model_directory}\n", "\n", "# Generate test output for signature\n", "test_output = np.array(\n", "    util.cos_sim(\n", "        model.encode(input_example[\"sentence_1\"][0]), model.encode(input_example[\"sentence_2\"][0])\n", "    ).tolist()\n", ")\n", "\n", "# Define the signature associated with the model\n", "signature = infer_signature(input_example, test_output)\n", "\n", "# Visualize the signature\n", "signature" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating an experiment\n", "\n", "We create a new MLflow Experiment so that the run we log our model to is recorded under its own descriptively named experiment rather than the default one."
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# If you are running this tutorial in local mode, leave the next line commented out.\n", "# Otherwise, uncomment the following line and set your tracking uri to your local or remote tracking server.\n", "\n", "# mlflow.set_tracking_uri(\"http://127.0.0.1:8080\")\n", "\n", "mlflow.set_experiment(\"Semantic Similarity\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Logging the Custom Model with MLflow\n", "\n", "Learn how to log the custom SimilarityModel with MLflow for effective model management and deployment.\n", "\n", "#### Creating a Path for the PyFunc Model\n", "We establish `pyfunc_path`, a temporary storage location for the Python model. This path is crucial for MLflow to serialize and store the model effectively.\n", "\n", "#### Logging the Model in MLflow\n", "\n", "- **Initiating MLflow Run**: An MLflow run is started, encapsulating all model logging processes within a structured framework.\n", "- **Model Logging Details**: The model is identified as `\"similarity\"`, providing a clear reference for future model retrieval and analysis. An instance of `SimilarityModel` is logged, encapsulating the Sentence Transformer model and similarity prediction logic. An illustrative DataFrame demonstrates the expected model input format, aiding in user comprehension and model usability. The inferred signature, detailing the input-output schema, is included, reinforcing the correct usage of the model. The artifacts dictionary specifies the location of the serialized Sentence Transformer model, crucial for model reconstruction. 
Dependencies like `sentence_transformers` and `numpy` are listed so that the model's environment can be recreated wherever it is deployed.\n", "\n", "#### Significance of Model Logging\n", "\n", "- **Model Tracking and Versioning**: Logging records the model under a run, so it can be tracked and versioned throughout its lifecycle.\n", "- **Reproducibility and Deployment**: The logged model, complete with its dependencies, input example, and signature, can be reproduced and deployed consistently across environments.\n", "\n", "Now that our `SimilarityModel` is logged in MLflow, it is ready for comparative analysis, version management, and deployment for inference." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "8435d539d5604197a47db3e31d7e1923", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading artifacts: 0%| | 0/11 [00:00