{ "cells": [ { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "0a87a4cd-8a01-4e35-8a71-eaf91ed4ddd2", "showTitle": false, "title": "" } }, "source": [ "# LLM Evaluation with MLflow Example Notebook\n", "\n", "In this notebook, we will demonstrate how to evaluate various LLMs and RAG systems with MLflow, leveraging simple metrics such as toxicity, as well as LLM-judged metrics such as relevance, and even custom LLM-judged metrics such as professionalism." ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "cce6412a-2279-4ec1-a344-fa76fec70ee1", "showTitle": false, "title": "" } }, "source": [ "We need to set our OpenAI API key, since we will be using GPT-4 for our LLM-judged metrics.\n", "\n", "In order to set your private key safely, please be sure to either export your key through a command-line terminal for your current instance, or, for a permanent addition to all user-based sessions, configure your favored environment management configuration file (i.e., .bashrc, .zshrc) to have the following entry:\n", "\n", "`OPENAI_API_KEY=`" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "import openai\n", "import pandas as pd\n", "\n", "import mlflow" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "a9bbfc03-793e-4b95-b009-ef30dccd7e7d", "showTitle": false, "title": "" } }, "source": [ "## Basic Question-Answering Evaluation" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "ff253b9e-59e8-40e0-92d8-8f9ef85348fd", "showTitle": false, "title": "" } }, "source": [ "Create a test case of `inputs` that will be passed into the model and `ground_truth` which will be used to compare against the generated output from the model." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "6199fb3f-5951-42fe-891a-2227010b630a", "showTitle": false, "title": "" } }, "outputs": [], "source": [ "eval_df = pd.DataFrame(\n", " {\n", " \"inputs\": [\n", " \"How does useEffect() work?\",\n", " \"What does the static keyword in a function mean?\",\n", " \"What does the 'finally' block in Python do?\",\n", " \"What is the difference between multiprocessing and multithreading?\",\n", " ],\n", " \"ground_truth\": [\n", " \"The useEffect() hook tells React that your component needs to do something after render. React will remember the function you passed (we’ll refer to it as our “effect”), and call it later after performing the DOM updates.\",\n", " \"Static members belongs to the class, rather than a specific instance. This means that only one instance of a static member exists, even if you create multiple objects of the class, or if you don't create any. It will be shared by all objects.\",\n", " \"'Finally' defines a block of code to run when the try... except...else block is final. The finally block will be executed no matter if the try block raises an error or not.\",\n", " \"Multithreading refers to the ability of a processor to execute multiple threads concurrently, where each thread runs a process. 
Whereas multiprocessing refers to the ability of a system to run multiple processors in parallel, where each processor can run one or more threads.\",\n", " ],\n", " }\n", ")" ] },
{ "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "06825224-49bd-452d-8dab-b11ca8130017", "showTitle": false, "title": "" } }, "source": [ "Create a simple OpenAI model that asks gpt-4o-mini to answer the question in two sentences. Call `mlflow.evaluate()` with the model and evaluation dataframe." ] },
{ "cell_type": "code", "execution_count": 13, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "7b67eb6f-c91a-4f9a-ac0d-01fd22b087c8", "showTitle": false, "title": "" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2024/01/04 11:22:24 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.\n", "2024/01/04 11:22:24 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.\n", "2024/01/04 11:22:26 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...\n", "2024/01/04 11:22:26 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: token_count\n", "2024/01/04 11:22:26 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: toxicity\n", "2024/01/04 11:22:26 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: flesch_kincaid_grade_level\n", "2024/01/04 11:22:26 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: ari_grade_level\n", "2024/01/04 11:22:26 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: exact_match\n" ] }, { "data": { "text/plain": [ "{'toxicity/v1/mean': 0.00020710023090941831,\n", " 'toxicity/v1/variance': 3.7077160557159724e-09,\n", " 'toxicity/v1/p90': 0.00027640480257105086,\n", " 'toxicity/v1/ratio': 0.0,\n", " 'flesch_kincaid_grade_level/v1/mean': 14.025,\n", " 'flesch_kincaid_grade_level/v1/variance': 31.066875000000007,\n", " 'flesch_kincaid_grade_level/v1/p90': 20.090000000000003,\n", " 'ari_grade_level/v1/mean': 16.525,\n", " 'ari_grade_level/v1/variance': 39.361875000000005,\n", " 'ari_grade_level/v1/p90': 23.340000000000003,\n", " 'exact_match/v1': 0.0}" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "with mlflow.start_run() as run:\n", " system_prompt = \"Answer the following question in two sentences\"\n", " basic_qa_model = mlflow.openai.log_model(\n", " model=\"gpt-4o-mini\",\n", " task=openai.chat.completions,\n", " artifact_path=\"model\",\n", " messages=[\n", " {\"role\": \"system\", \"content\": system_prompt},\n", " {\"role\": \"user\", \"content\": \"{question}\"},\n", " ],\n", " )\n", " results = mlflow.evaluate(\n", " basic_qa_model.model_uri,\n", " eval_df,\n", " targets=\"ground_truth\", # specify which column corresponds to the expected output\n", " model_type=\"question-answering\", # model type indicates which metrics are relevant for this task\n", " evaluators=\"default\",\n", " )\n", "results.metrics" ] },
{ "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "6d078816-1de1-4a6e-b757-5c9cbe056638", "showTitle": false, "title": "" } }, "source": [ "Inspect the evaluation results table as a dataframe to see row-by-row metrics to further assess model performance." ] }, {
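"cell_type": "markdown", "metadata": {}, "source": [ "As an aside, `mlflow.evaluate()` also logs the per-row results to the run as a table artifact, so the same table can be reloaded later without keeping the in-memory `results` object around. The sketch below assumes the default artifact name `eval_results_table.json` and that your MLflow version provides `mlflow.load_table()`." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Reload the evaluation table that mlflow.evaluate() logged to the run above.\n", "# NOTE: the artifact name \"eval_results_table.json\" is an assumption based on the\n", "# default naming used by mlflow.evaluate(); adjust it if your MLflow version differs.\n", "loaded_table = mlflow.load_table(\"eval_results_table.json\", run_ids=[run.info.run_id])\n", "loaded_table.head()" ] }, {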
"cell_type": "code", "execution_count": 14, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "28688e6c-6a2d-40bd-a737-58cfe70f2e10", "showTitle": false, "title": "" } }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "4c2d4cba3df0416aac71c0ff7b8299f6", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading artifacts: 0%| | 0/1 [00:00\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
inputsground_truthoutputstoken_counttoxicity/v1/scoreflesch_kincaid_grade_level/v1/scoreari_grade_level/v1/score
0How does useEffect() work?The useEffect() hook tells React that your com...useEffect() is a hook in React that allows you...510.00021211.914.1
1What does the static keyword in a function mean?Static members belongs to the class, rather th...The static keyword in a function means that th...480.00014510.712.9
2What does the 'finally' block in Python do?'Finally' defines a block of code to run when ...The 'finally' block in Python is used to defin...550.0003049.911.8
3What is the difference between multiprocessing...Multithreading refers to the ability of a proc...Multiprocessing involves using multiple proces...360.00016723.627.3
\n", "" ], "text/plain": [ " inputs \\\n", "0 How does useEffect() work? \n", "1 What does the static keyword in a function mean? \n", "2 What does the 'finally' block in Python do? \n", "3 What is the difference between multiprocessing... \n", "\n", " ground_truth \\\n", "0 The useEffect() hook tells React that your com... \n", "1 Static members belongs to the class, rather th... \n", "2 'Finally' defines a block of code to run when ... \n", "3 Multithreading refers to the ability of a proc... \n", "\n", " outputs token_count \\\n", "0 useEffect() is a hook in React that allows you... 51 \n", "1 The static keyword in a function means that th... 48 \n", "2 The 'finally' block in Python is used to defin... 55 \n", "3 Multiprocessing involves using multiple proces... 36 \n", "\n", " toxicity/v1/score flesch_kincaid_grade_level/v1/score \\\n", "0 0.000212 11.9 \n", "1 0.000145 10.7 \n", "2 0.000304 9.9 \n", "3 0.000167 23.6 \n", "\n", " ari_grade_level/v1/score \n", "0 14.1 \n", "1 12.9 \n", "2 11.8 \n", "3 27.3 " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results.tables[\"eval_results_table\"]" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "1a7363c9-3b73-4e3f-bf7c-1d6887fb4f9e", "showTitle": false, "title": "" } }, "source": [ "## LLM-judged correctness with OpenAI GPT-4" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "cd23fe79-cfbf-42a7-a3f3-14badfe20db5", "showTitle": false, "title": "" } }, "source": [ "Construct an answer similarity metric using the `answer_similarity()` metric factory function." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "88b35b52-5b8f-4b72-9de8-fec05f01e722", "showTitle": false, "title": "" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "EvaluationMetric(name=answer_similarity, greater_is_better=True, long_name=answer_similarity, version=v1, metric_details=\n", "Task:\n", "You must return the following fields in your response in two lines, one below the other:\n", "score: Your numerical score for the model's answer_similarity based on the rubric\n", "justification: Your reasoning about the model's answer_similarity score\n", "\n", "You are an impartial judge. You will be given an input that was sent to a machine\n", "learning model, and you will be given an output that the model produced. You\n", "may also be given additional information that was used by the model to generate the output.\n", "\n", "Your task is to determine a numerical score called answer_similarity based on the input and output.\n", "A definition of answer_similarity and a grading rubric are provided below.\n", "You must use the grading rubric to determine your score. You must also justify your score.\n", "\n", "Examples could be included below for reference. Make sure to use them as references and to\n", "understand them before completing the task.\n", "\n", "Input:\n", "{input}\n", "\n", "Output:\n", "{output}\n", "\n", "{grading_context_columns}\n", "\n", "Metric definition:\n", "Answer similarity is evaluated on the degree of semantic similarity of the provided output to the provided targets, which is the ground truth. 
Scores can be assigned based on the gradual similarity in meaning and description to the provided targets, where a higher score indicates greater alignment between the provided output and provided targets.\n", "\n", "Grading rubric:\n", "Answer similarity: Below are the details for different scores:\n", "- Score 1: The output has little to no semantic similarity to the provided targets.\n", "- Score 2: The output displays partial semantic similarity to the provided targets on some aspects.\n", "- Score 3: The output has moderate semantic similarity to the provided targets.\n", "- Score 4: The output aligns with the provided targets in most aspects and has substantial semantic similarity.\n", "- Score 5: The output closely aligns with the provided targets in all significant aspects.\n", "\n", "Examples:\n", "\n", "Example Output:\n", "MLflow is an open-source platform for managing machine learning workflows, including experiment tracking, model packaging, versioning, and deployment, simplifying the ML lifecycle.\n", "\n", "Additional information used by the model:\n", "key: targets\n", "value:\n", "MLflow is an open-source platform for managing the end-to-end machine learning (ML) lifecycle. It was developed by Databricks, a company that specializes in big data and machine learning solutions. MLflow is designed to address the challenges that data scientists and machine learning engineers face when developing, training, and deploying machine learning models.\n", "\n", "Example score: 4\n", "Example justification: The definition effectively explains what MLflow is its purpose, and its developer. It could be more concise for a 5-score.\n", " \n", "\n", "You must return the following fields in your response in two lines, one below the other:\n", "score: Your numerical score for the model's answer_similarity based on the rubric\n", "justification: Your reasoning about the model's answer_similarity score\n", "\n", "Do not add additional new lines. Do not add any other fields.\n", " )\n" ] } ], "source": [ "from mlflow.metrics.genai import EvaluationExample, answer_similarity\n", "\n", "# Create an example to describe what answer_similarity means like for this problem.\n", "example = EvaluationExample(\n", " input=\"What is MLflow?\",\n", " output=\"MLflow is an open-source platform for managing machine \"\n", " \"learning workflows, including experiment tracking, model packaging, \"\n", " \"versioning, and deployment, simplifying the ML lifecycle.\",\n", " score=4,\n", " justification=\"The definition effectively explains what MLflow is \"\n", " \"its purpose, and its developer. It could be more concise for a 5-score.\",\n", " grading_context={\n", " \"targets\": \"MLflow is an open-source platform for managing \"\n", " \"the end-to-end machine learning (ML) lifecycle. It was developed by Databricks, \"\n", " \"a company that specializes in big data and machine learning solutions. 
MLflow is \"\n", " \"designed to address the challenges that data scientists and machine learning \"\n", " \"engineers face when developing, training, and deploying machine learning models.\"\n", " },\n", ")\n", "\n", "# Construct the metric using OpenAI GPT-4 as the judge\n", "answer_similarity_metric = answer_similarity(model=\"openai:/gpt-4\", examples=[example])\n", "\n", "print(answer_similarity_metric)" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "d627f7ab-a7e1-430d-9431-9ce4bd810fa7", "showTitle": false, "title": "" } }, "source": [ "Call `mlflow.evaluate()` again but with your new `answer_similarity_metric`" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "cae9d80b-39a2-4e98-ac08-bfa5ba387b8f", "showTitle": false, "title": "" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2024/01/04 11:22:26 WARNING mlflow.pyfunc: Detected one or more mismatches between the model's dependencies and the current Python environment:\n", " - mlflow (current: 2.6.1.dev0, required: mlflow==2.9.2)\n", "To fix the mismatches, call `mlflow.pyfunc.get_model_dependencies(model_uri)` to fetch the model's environment and install dependencies using the resulting environment file.\n", "2024/01/04 11:22:26 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.\n", "2024/01/04 11:22:27 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.\n", "2024/01/04 11:22:28 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "985c080910dd45b7902f72bb4dcb1167", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/1 [00:00\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "with mlflow.start_run() as run:\n", "    results = mlflow.evaluate(\n", "        basic_qa_model.model_uri,\n", "        eval_df,\n", "        targets=\"ground_truth\",\n", "        model_type=\"question-answering\",\n", "        evaluators=\"default\",\n", "        extra_metrics=[answer_similarity_metric],  # include the LLM-judged similarity metric\n", "    )" ] },
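{ "cell_type": "markdown", "metadata": {}, "source": [ "The aggregated scores for the LLM-judged metric are available in `results.metrics`. A minimal sketch of pulling them out, assuming the `<metric>/<version>/<aggregation>` key pattern used by the built-in metrics above:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Print the aggregated scores computed for the LLM-judged answer similarity metric.\n", "# The key names are assumed to follow the \"<metric>/<version>/<aggregation>\" pattern\n", "# seen in the built-in metric results above (e.g. \"answer_similarity/v1/mean\").\n", "for key, value in results.metrics.items():\n", "    if key.startswith(\"answer_similarity\"):\n", "        print(f\"{key}: {value}\")" ] },
{ "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<!-- HTML table rendering omitted; see the text/plain output below for the same data. -->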
\n", "" ], "text/plain": [ " inputs \\\n", "0 How does useEffect() work? \n", "1 What does the static keyword in a function mean? \n", "2 What does the 'finally' block in Python do? \n", "3 What is the difference between multiprocessing... \n", "\n", " ground_truth \\\n", "0 The useEffect() hook tells React that your com... \n", "1 Static members belongs to the class, rather th... \n", "2 'Finally' defines a block of code to run when ... \n", "3 Multithreading refers to the ability of a proc... \n", "\n", " outputs token_count \\\n", "0 useEffect() is a hook in React that allows you... 44 \n", "1 The static keyword in a function means that th... 34 \n", "2 The 'finally' block in Python is used to speci... 47 \n", "3 Multiprocessing involves the execution of mult... 36 \n", "\n", " toxicity/v1/score flesch_kincaid_grade_level/v1/score \\\n", "0 0.000266 8.3 \n", "1 0.000148 13.4 \n", "2 0.000235 10.1 \n", "3 0.000155 21.3 \n", "\n", " ari_grade_level/v1/score answer_similarity/v1/score \\\n", "0 9.9 4 \n", "1 16.0 2 \n", "2 11.1 5 \n", "3 24.4 4 \n", "\n", " answer_similarity/v1/justification \n", "0 The output accurately describes what useEffect... \n", "1 The output provides a correct definition of th... \n", "2 The model's output closely aligns with the pro... \n", "3 The model's output accurately describes the co... " ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results.tables[\"eval_results_table\"]" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "85402663-b9d7-4812-a7d2-32aa5b929687", "showTitle": false, "title": "" } }, "source": [ "## Custom LLM-judged metric for professionalism" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "a8765226-5d95-49e8-88d8-5ba442ea3b9b", "showTitle": false, "title": "" } }, "source": [ "Create a custom metric that will be used to determine professionalism of the model outputs. Use `make_genai_metric` with a metric definition, grading prompt, grading example, and judge model configuration" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "45cca2ec-e06b-4d51-9dde-3cc630df9244", "showTitle": false, "title": "" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "EvaluationMetric(name=professionalism, greater_is_better=True, long_name=professionalism, version=v1, metric_details=\n", "Task:\n", "You must return the following fields in your response in two lines, one below the other:\n", "score: Your numerical score for the model's professionalism based on the rubric\n", "justification: Your reasoning about the model's professionalism score\n", "\n", "You are an impartial judge. You will be given an input that was sent to a machine\n", "learning model, and you will be given an output that the model produced. You\n", "may also be given additional information that was used by the model to generate the output.\n", "\n", "Your task is to determine a numerical score called professionalism based on the input and output.\n", "A definition of professionalism and a grading rubric are provided below.\n", "You must use the grading rubric to determine your score. You must also justify your score.\n", "\n", "Examples could be included below for reference. 
Make sure to use them as references and to\n", "understand them before completing the task.\n", "\n", "Input:\n", "{input}\n", "\n", "Output:\n", "{output}\n", "\n", "{grading_context_columns}\n", "\n", "Metric definition:\n", "Professionalism refers to the use of a formal, respectful, and appropriate style of communication that is tailored to the context and audience. It often involves avoiding overly casual language, slang, or colloquialisms, and instead using clear, concise, and respectful language\n", "\n", "Grading rubric:\n", "Professionalism: If the answer is written using a professional tone, below are the details for different scores: - Score 1: Language is extremely casual, informal, and may include slang or colloquialisms. Not suitable for professional contexts.- Score 2: Language is casual but generally respectful and avoids strong informality or slang. Acceptable in some informal professional settings.- Score 3: Language is balanced and avoids extreme informality or formality. Suitable for most professional contexts. - Score 4: Language is noticeably formal, respectful, and avoids casual elements. Appropriate for business or academic settings. - Score 5: Language is excessively formal, respectful, and avoids casual elements. Appropriate for the most formal settings such as textbooks. \n", "\n", "Examples:\n", "\n", "Example Input:\n", "What is MLflow?\n", "\n", "Example Output:\n", "MLflow is like your friendly neighborhood toolkit for managing your machine learning projects. It helps you track experiments, package your code and models, and collaborate with your team, making the whole ML workflow smoother. It's like your Swiss Army knife for machine learning!\n", "\n", "Example score: 2\n", "Example justification: The response is written in a casual tone. It uses contractions, filler words such as 'like', and exclamation points, which make it sound less professional. \n", " \n", "\n", "You must return the following fields in your response in two lines, one below the other:\n", "score: Your numerical score for the model's professionalism based on the rubric\n", "justification: Your reasoning about the model's professionalism score\n", "\n", "Do not add additional new lines. Do not add any other fields.\n", " )\n" ] } ], "source": [ "from mlflow.metrics.genai import EvaluationExample, make_genai_metric\n", "\n", "professionalism_metric = make_genai_metric(\n", " name=\"professionalism\",\n", " definition=(\n", " \"Professionalism refers to the use of a formal, respectful, and appropriate style of communication that is tailored to the context and audience. It often involves avoiding overly casual language, slang, or colloquialisms, and instead using clear, concise, and respectful language\"\n", " ),\n", " grading_prompt=(\n", " \"Professionalism: If the answer is written using a professional tone, below \"\n", " \"are the details for different scores: \"\n", " \"- Score 1: Language is extremely casual, informal, and may include slang or colloquialisms. Not suitable for professional contexts.\"\n", " \"- Score 2: Language is casual but generally respectful and avoids strong informality or slang. Acceptable in some informal professional settings.\"\n", " \"- Score 3: Language is balanced and avoids extreme informality or formality. Suitable for most professional contexts. \"\n", " \"- Score 4: Language is noticeably formal, respectful, and avoids casual elements. Appropriate for business or academic settings. 
\"\n", " \"- Score 5: Language is excessively formal, respectful, and avoids casual elements. Appropriate for the most formal settings such as textbooks. \"\n", " ),\n", " examples=[\n", " EvaluationExample(\n", " input=\"What is MLflow?\",\n", " output=(\n", " \"MLflow is like your friendly neighborhood toolkit for managing your machine learning projects. It helps you track experiments, package your code and models, and collaborate with your team, making the whole ML workflow smoother. It's like your Swiss Army knife for machine learning!\"\n", " ),\n", " score=2,\n", " justification=(\n", " \"The response is written in a casual tone. It uses contractions, filler words such as 'like', and exclamation points, which make it sound less professional. \"\n", " ),\n", " )\n", " ],\n", " version=\"v1\",\n", " model=\"openai:/gpt-4\",\n", " parameters={\"temperature\": 0.0},\n", " grading_context_columns=[],\n", " aggregations=[\"mean\", \"variance\", \"p90\"],\n", " greater_is_better=True,\n", ")\n", "\n", "print(professionalism_metric)" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "bc615396-b1c1-4302-872d-d19be010382a", "showTitle": false, "title": "" } }, "source": [ "TODO: Try out your new professionalism metric on a sample output to make sure it behaves as you expect" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "0ca7e945-113a-49ac-8324-2f94efa45771", "showTitle": false, "title": "" } }, "source": [ "Call `mlflow.evaluate` with your new professionalism metric. " ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "07bb41ae-c878-4384-b36e-3dfb9b8ac6d9", "showTitle": false, "title": "" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2024/01/04 11:22:45 WARNING mlflow.pyfunc: Detected one or more mismatches between the model's dependencies and the current Python environment:\n", " - mlflow (current: 2.6.1.dev0, required: mlflow==2.9.2)\n", "To fix the mismatches, call `mlflow.pyfunc.get_model_dependencies(model_uri)` to fetch the model's environment and install dependencies using the resulting environment file.\n", "2024/01/04 11:22:45 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.\n", "2024/01/04 11:22:45 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.\n", "2024/01/04 11:22:47 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "c3b0055b318c413e80acbbdbb5cabe3a", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/1 [00:00\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "with mlflow.start_run() as run:\n", "    results = mlflow.evaluate(\n", "        basic_qa_model.model_uri,\n", "        eval_df,\n", "        targets=\"ground_truth\",\n", "        model_type=\"question-answering\",\n", "        evaluators=\"default\",\n", "        extra_metrics=[professionalism_metric],  # score each answer with the custom professionalism metric\n", "    )" ] },
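{ "cell_type": "markdown", "metadata": {}, "source": [ "To see which answers pulled the professionalism score down, filter the per-row results table. A minimal sketch, assuming the `professionalism/v1/score` column name shown in the results below:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Flag rows that the judge scored below 4 on professionalism.\n", "# The \"professionalism/v1/score\" column name is taken from the results table below.\n", "eval_table = results.tables[\"eval_results_table\"]\n", "low_professionalism = eval_table[eval_table[\"professionalism/v1/score\"] < 4]\n", "low_professionalism[[\"inputs\", \"outputs\", \"professionalism/v1/score\"]]" ] },
{ "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<!-- HTML table rendering omitted; see the text/plain output below for the same data. -->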
\n", "" ], "text/plain": [ " inputs \\\n", "0 How does useEffect() work? \n", "1 What does the static keyword in a function mean? \n", "2 What does the 'finally' block in Python do? \n", "3 What is the difference between multiprocessing... \n", "\n", " ground_truth \\\n", "0 The useEffect() hook tells React that your com... \n", "1 Static members belongs to the class, rather th... \n", "2 'Finally' defines a block of code to run when ... \n", "3 Multithreading refers to the ability of a proc... \n", "\n", " outputs token_count \\\n", "0 useEffect() is a hook in React that allows you... 49 \n", "1 The static keyword in a function means that th... 34 \n", "2 The 'finally' block in Python is used to speci... 65 \n", "3 Multiprocessing involves running multiple proc... 33 \n", "\n", " toxicity/v1/score flesch_kincaid_grade_level/v1/score \\\n", "0 0.000191 11.7 \n", "1 0.000146 13.4 \n", "2 0.000310 13.0 \n", "3 0.000162 18.5 \n", "\n", " ari_grade_level/v1/score professionalism/v1/score \\\n", "0 12.5 4 \n", "1 16.7 4 \n", "2 15.4 4 \n", "3 24.3 4 \n", "\n", " professionalism/v1/justification \n", "0 The language used in the response is formal an... \n", "1 The language used in the response is formal an... \n", "2 The language used in the response is formal an... \n", "3 The language used in the response is formal an... " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results.tables[\"eval_results_table\"]" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "52e9f69f-2f43-46ba-bf88-b4aebae741f4", "showTitle": false, "title": "" } }, "source": [ "The professionalism score of the `basic_qa_model` is not very good. Let's try to create a new model that can perform better" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "b4ea81e9-6e91-43e7-8539-8dab7b5f52de", "showTitle": false, "title": "" } }, "source": [ "Call `mlflow.evaluate()` using the new model. Observe that the professionalism score has increased!" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "5b21ef8f-50ef-4229-83c9-cc2251a081e2", "showTitle": false, "title": "" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/ann.zhang/anaconda3/envs/mlflow-dev-env/lib/python3.8/site-packages/_distutils_hack/__init__.py:18: UserWarning: Distutils was imported before Setuptools, but importing Setuptools also replaces the `distutils` module in `sys.modules`. This may lead to undesirable behaviors or errors. To avoid these issues, avoid using distutils directly, ensure that setuptools is installed in the traditional way (e.g. 
not an editable install), and/or make sure that setuptools is always imported before distutils.\n", " warnings.warn(\n", "/Users/ann.zhang/anaconda3/envs/mlflow-dev-env/lib/python3.8/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.\n", " warnings.warn(\"Setuptools is replacing distutils.\")\n", "2024/01/04 11:22:59 WARNING mlflow.pyfunc: Detected one or more mismatches between the model's dependencies and the current Python environment:\n", " - mlflow (current: 2.6.1.dev0, required: mlflow==2.9.2)\n", "To fix the mismatches, call `mlflow.pyfunc.get_model_dependencies(model_uri)` to fetch the model's environment and install dependencies using the resulting environment file.\n", "2024/01/04 11:22:59 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.\n", "2024/01/04 11:22:59 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.\n", "2024/01/04 11:23:07 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "fc20122b14ef42419ccca19ea065fc2a", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/1 [00:00\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "with mlflow.start_run() as run:\n", "    system_prompt = \"Answer the following question using extreme formality.\"\n", "    professional_qa_model = mlflow.openai.log_model(\n", "        model=\"gpt-4o-mini\",\n", "        task=openai.chat.completions,\n", "        artifact_path=\"model\",\n", "        messages=[\n", "            {\"role\": \"system\", \"content\": system_prompt},\n", "            {\"role\": \"user\", \"content\": \"{question}\"},\n", "        ],\n", "    )\n", "    results = mlflow.evaluate(\n", "        professional_qa_model.model_uri,\n", "        eval_df,\n", "        targets=\"ground_truth\",\n", "        model_type=\"question-answering\",\n", "        evaluators=\"default\",\n", "        extra_metrics=[professionalism_metric],\n", "    )" ] },
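{ "cell_type": "markdown", "metadata": {}, "source": [ "To confirm that professionalism actually improved, you can compare the aggregate scores that `mlflow.evaluate()` logged to the evaluation runs. A minimal sketch, assuming the aggregated score is logged under a metric name ending in `professionalism/v1/mean`:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Compare the aggregated professionalism score across the recent evaluation runs.\n", "# The \"metrics.professionalism/v1/mean\" column name is an assumption based on how\n", "# mlflow.evaluate() logs aggregate metrics to the run; adjust if your version differs.\n", "runs = mlflow.search_runs(max_results=5)\n", "score_columns = [c for c in runs.columns if c.endswith(\"professionalism/v1/mean\")]\n", "runs[[\"run_id\", \"start_time\"] + score_columns]" ] },
{ "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<!-- HTML table rendering omitted; see the text/plain output below for the same data. -->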
\n", "" ], "text/plain": [ " inputs \\\n", "0 How does useEffect() work? \n", "1 What does the static keyword in a function mean? \n", "2 What does the 'finally' block in Python do? \n", "3 What is the difference between multiprocessing... \n", "\n", " ground_truth \\\n", "0 The useEffect() hook tells React that your com... \n", "1 Static members belongs to the class, rather th... \n", "2 'Finally' defines a block of code to run when ... \n", "3 Multithreading refers to the ability of a proc... \n", "\n", " outputs token_count \\\n", "0 The useEffect() function operates by allowing ... 135 \n", "1 The static keyword, within the realm of functi... 126 \n", "2 The 'finally' block in the Python programming ... 114 \n", "3 In the realm of computer systems, it is impera... 341 \n", "\n", " toxicity/v1/score flesch_kincaid_grade_level/v1/score \\\n", "0 0.000183 17.5 \n", "1 0.000297 14.0 \n", "2 0.000290 19.3 \n", "3 0.000402 17.6 \n", "\n", " ari_grade_level/v1/score professionalism/v1/score \\\n", "0 19.8 4 \n", "1 15.5 4 \n", "2 22.4 5 \n", "3 20.6 5 \n", "\n", " professionalism/v1/justification \n", "0 The language used in the response is formal an... \n", "1 The language used in the response is formal an... \n", "2 The language used in the response is excessive... \n", "3 The language used in the response is excessive... " ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results.tables[\"eval_results_table\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": {}, "inputWidgets": {}, "nuid": "e44bbe77-433a-4e03-a44e-d17eb6c06820", "showTitle": false, "title": "" } }, "outputs": [], "source": [] } ], "metadata": { "application/vnd.databricks.v1+notebook": { "dashboards": [], "language": "python", "notebookMetadata": { "pythonIndentUnit": 2 }, "notebookName": "LLM Evaluation Examples -- QA", "widgets": {} }, "kernelspec": { "display_name": "mlflow-dev-env", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.17" } }, "nbformat": 4, "nbformat_minor": 0 }