{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluate a Hugging Face LLM with `mlflow.evaluate()`\n",
    "\n",
    "This guide will show how to load a pre-trained Hugging Face pipeline, log it to MLflow, and use `mlflow.evaluate()` to evaluate builtin metrics as well as custom LLM-judged metrics for the model.\n",
    "\n",
    "For detailed information, please read the documentation on [using MLflow evaluate](https://mlflow.org/docs/latest/llms/llm-evaluate/index.html)."
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "<a href=\"https://raw.githubusercontent.com/mlflow/mlflow/master/docs/source/llms/llm-evaluate/notebooks/huggingface-evaluation.ipynb\" class=\"notebook-download-btn\"><i class=\"fas fa-download\"></i>Download this Notebook</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Start MLflow Server\n",
    "\n",
    "You can either:\n",
    "\n",
    "- Start a local tracking server by running `mlflow ui` within the same directory that your notebook is in.\n",
    "- Use a tracking server, as described in [this overview](https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Install necessary dependencies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install -q mlflow transformers torch torchvision evaluate datasets openai tiktoken fastapi rouge_score textstat"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Necessary imports\n",
    "import warnings\n",
    "\n",
    "import pandas as pd\n",
    "from datasets import load_dataset\n",
    "from transformers import pipeline\n",
    "\n",
    "import mlflow\n",
    "from mlflow.metrics.genai import EvaluationExample, answer_correctness, make_genai_metric"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Disable FutureWarnings\n",
    "warnings.filterwarnings(\"ignore\", category=FutureWarning)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load a pretrained Hugging Face pipeline\n",
    "\n",
    "Here we are loading a text generation pipeline, but you can also use either a text summarization or question answering pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "f258c308f1504fb0937f81475dc5fffd",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "mpt_pipeline = pipeline(\"text-generation\", model=\"mosaicml/mpt-7b-chat\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Log the model using MLflow"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We log our pipeline as an MLflow Model, which follows a standard format that lets you save a model in different \"flavors\" that can be understood by different downstream tools. In this case, the model is of the transformers \"flavor\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Successfully registered model 'mpt-7b-chat'.\n",
      "Created version '1' of model 'mpt-7b-chat'.\n"
     ]
    }
   ],
   "source": [
    "mlflow.set_experiment(\"Evaluate Hugging Face Text Pipeline\")\n",
    "\n",
    "# Define the signature\n",
    "signature = mlflow.models.infer_signature(\n",
    "    model_input=\"What are the three primary colors?\",\n",
    "    model_output=\"The three primary colors are red, yellow, and blue.\",\n",
    ")\n",
    "\n",
    "# Log the model using mlflow\n",
    "with mlflow.start_run():\n",
    "    model_info = mlflow.transformers.log_model(\n",
    "        transformers_model=mpt_pipeline,\n",
    "        artifact_path=\"mpt-7b\",\n",
    "        signature=signature,\n",
    "        registered_model_name=\"mpt-7b-chat\",\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load Evaluation Data\n",
    "\n",
    "Load in a dataset from Hugging Face Hub to use for evaluation.\n",
    "\n",
    "The data fields in the dataset below represent:\n",
    "\n",
    "- **instruction**: Describes the task that the model should perform. Each row within the dataset is a unique instruction (task) to be performed.\n",
    "- **input**: Optional contextual information that relates to the task defined in the `instruction` field. For example, for the instruction \"Identify the odd one out\", the `input` contextual guidance is given as the list of items to select an outlier from, \"Twitter, Instagram, Telegram\". \n",
    "- **output**: The answer to the instruction (with the optional `input` context provided) as generated by the original evaluation model (`text-davinci-003` from OpenAI)\n",
    "- **text**: The final total text as a result of applying the `instruction`, `input`, and `output` to the prompt template used, which is sent to the model for fine tuning purposes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>instruction</th>\n",
       "      <th>input</th>\n",
       "      <th>output</th>\n",
       "      <th>text</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Give three tips for staying healthy.</td>\n",
       "      <td></td>\n",
       "      <td>1.Eat a balanced diet and make sure to include...</td>\n",
       "      <td>Below is an instruction that describes a task....</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>What are the three primary colors?</td>\n",
       "      <td></td>\n",
       "      <td>The three primary colors are red, blue, and ye...</td>\n",
       "      <td>Below is an instruction that describes a task....</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Describe the structure of an atom.</td>\n",
       "      <td></td>\n",
       "      <td>An atom is made up of a nucleus, which contain...</td>\n",
       "      <td>Below is an instruction that describes a task....</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>How can we reduce air pollution?</td>\n",
       "      <td></td>\n",
       "      <td>There are a number of ways to reduce air pollu...</td>\n",
       "      <td>Below is an instruction that describes a task....</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Describe a time when you had to make a difficu...</td>\n",
       "      <td></td>\n",
       "      <td>I had to make a difficult decision when I was ...</td>\n",
       "      <td>Below is an instruction that describes a task....</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Identify the odd one out.</td>\n",
       "      <td>Twitter, Instagram, Telegram</td>\n",
       "      <td>Telegram</td>\n",
       "      <td>Below is an instruction that describes a task,...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Explain why the following fraction is equivale...</td>\n",
       "      <td>4/16</td>\n",
       "      <td>The fraction 4/16 is equivalent to 1/4 because...</td>\n",
       "      <td>Below is an instruction that describes a task,...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Write a short story in third person narration ...</td>\n",
       "      <td></td>\n",
       "      <td>John was at a crossroads in his life. He had j...</td>\n",
       "      <td>Below is an instruction that describes a task....</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Render a 3D model of a house</td>\n",
       "      <td></td>\n",
       "      <td>&lt;nooutput&gt; This type of instruction cannot be ...</td>\n",
       "      <td>Below is an instruction that describes a task....</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Evaluate this sentence for spelling and gramma...</td>\n",
       "      <td>He finnished his meal and left the resturant</td>\n",
       "      <td>He finished his meal and left the restaurant.</td>\n",
       "      <td>Below is an instruction that describes a task,...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                         instruction  \\\n",
       "0               Give three tips for staying healthy.   \n",
       "1                 What are the three primary colors?   \n",
       "2                 Describe the structure of an atom.   \n",
       "3                   How can we reduce air pollution?   \n",
       "4  Describe a time when you had to make a difficu...   \n",
       "5                          Identify the odd one out.   \n",
       "6  Explain why the following fraction is equivale...   \n",
       "7  Write a short story in third person narration ...   \n",
       "8                       Render a 3D model of a house   \n",
       "9  Evaluate this sentence for spelling and gramma...   \n",
       "\n",
       "                                          input  \\\n",
       "0                                                 \n",
       "1                                                 \n",
       "2                                                 \n",
       "3                                                 \n",
       "4                                                 \n",
       "5                  Twitter, Instagram, Telegram   \n",
       "6                                          4/16   \n",
       "7                                                 \n",
       "8                                                 \n",
       "9  He finnished his meal and left the resturant   \n",
       "\n",
       "                                              output  \\\n",
       "0  1.Eat a balanced diet and make sure to include...   \n",
       "1  The three primary colors are red, blue, and ye...   \n",
       "2  An atom is made up of a nucleus, which contain...   \n",
       "3  There are a number of ways to reduce air pollu...   \n",
       "4  I had to make a difficult decision when I was ...   \n",
       "5                                           Telegram   \n",
       "6  The fraction 4/16 is equivalent to 1/4 because...   \n",
       "7  John was at a crossroads in his life. He had j...   \n",
       "8  <nooutput> This type of instruction cannot be ...   \n",
       "9      He finished his meal and left the restaurant.   \n",
       "\n",
       "                                                text  \n",
       "0  Below is an instruction that describes a task....  \n",
       "1  Below is an instruction that describes a task....  \n",
       "2  Below is an instruction that describes a task....  \n",
       "3  Below is an instruction that describes a task....  \n",
       "4  Below is an instruction that describes a task....  \n",
       "5  Below is an instruction that describes a task,...  \n",
       "6  Below is an instruction that describes a task,...  \n",
       "7  Below is an instruction that describes a task....  \n",
       "8  Below is an instruction that describes a task....  \n",
       "9  Below is an instruction that describes a task,...  "
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dataset = load_dataset(\"tatsu-lab/alpaca\")\n",
    "eval_df = pd.DataFrame(dataset[\"train\"])\n",
    "eval_df.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Define Metrics\n",
    "\n",
    "Since we are evaluating how well our model can provide an answer to a given instruction, we may want to choose some metrics to help measure this on top of any builtin metrics that `mlflow.evaluate()` gives us.\n",
    "\n",
    "Let's measure how well our model is doing on the following two metrics:\n",
    "\n",
    "- **Is the answer correct?** Let's use the predefined metric `answer_correctness` here.\n",
    "- **Is the answer fluent, clear, and concise?** We will define a custom metric `answer_quality` to measure this.\n",
    "\n",
    "We will need to pass both of these into the `extra_metrics` argument for `mlflow.evaluate()` in order to assess the quality of our model.\n",
    "\n",
    "#### What is an Evaluation Metric?\n",
    "\n",
    "An evaluation metric encapsulates any quantitative or qualitative measure you want to calculate for your model. For each model type, `mlflow.evaluate()` will automatically calculate some set of builtin metrics. Refer [here](https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.evaluate) for which builtin metrics will be calculated for each model type. You can also pass in any other metrics you want to calculate as extra metrics. MLflow provides a set of predefined metrics that you can find [here](https://mlflow.org/docs/latest/python_api/mlflow.metrics.html), or you can define your own custom metrics. In the example here, we will use the combination of predefined metrics `mlflow.metrics.genai.answer_correctness` and a custom metric for the quality evaluation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's load our predefined metrics - in this case we are using `answer_correctness` with GPT-4."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "answer_correctness_metric = answer_correctness(model=\"openai:/gpt-4\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we want to create a custom LLM-judged metric named `answer_quality` using `make_genai_metric()`. We need to define a metric definition and grading rubric, as well as some examples for the LLM judge to use."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The definition explains what \"answer quality\" entails\n",
    "answer_quality_definition = \"\"\"Please evaluate answer quality for the provided output on the following criteria:\n",
    "fluency, clarity, and conciseness. Each of the criteria is defined as follows:\n",
    "  - Fluency measures how naturally and smooth the output reads.\n",
    "  - Clarity measures how understandable the output is.\n",
    "  - Conciseness measures the brevity and efficiency of the output without compromising meaning.\n",
    "The more fluent, clear, and concise a text, the higher the score it deserves.\n",
    "\"\"\"\n",
    "\n",
    "# The grading prompt explains what each possible score means\n",
    "answer_quality_grading_prompt = \"\"\"Answer quality: Below are the details for different scores:\n",
    "  - Score 1: The output is entirely incomprehensible and cannot be read.\n",
    "  - Score 2: The output conveys some meaning, but needs lots of improvement in to improve fluency, clarity, and conciseness.\n",
    "  - Score 3: The output is understandable but still needs improvement.\n",
    "  - Score 4: The output performs well on two of fluency, clarity, and conciseness, but could be improved on one of these criteria.\n",
    "  - Score 5: The output reads smoothly, is easy to understand, and clear. There is no clear way to improve the output on these criteria.\n",
    "\"\"\"\n",
    "\n",
    "# We provide an example of a \"bad\" output\n",
    "example1 = EvaluationExample(\n",
    "    input=\"What is MLflow?\",\n",
    "    output=\"MLflow is an open-source platform. For managing machine learning workflows, it \"\n",
    "    \"including experiment tracking model packaging versioning and deployment as well as a platform \"\n",
    "    \"simplifying for on the ML lifecycle.\",\n",
    "    score=2,\n",
    "    justification=\"The output is difficult to understand and demonstrates extremely low clarity. \"\n",
    "    \"However, it still conveys some meaning so this output deserves a score of 2.\",\n",
    ")\n",
    "\n",
    "# We also provide an example of a \"good\" output\n",
    "example2 = EvaluationExample(\n",
    "    input=\"What is MLflow?\",\n",
    "    output=\"MLflow is an open-source platform for managing machine learning workflows, including \"\n",
    "    \"experiment tracking, model packaging, versioning, and deployment.\",\n",
    "    score=5,\n",
    "    justification=\"The output is easily understandable, clear, and concise. It deserves a score of 5.\",\n",
    ")\n",
    "\n",
    "answer_quality_metric = make_genai_metric(\n",
    "    name=\"answer_quality\",\n",
    "    definition=answer_quality_definition,\n",
    "    grading_prompt=answer_quality_grading_prompt,\n",
    "    version=\"v1\",\n",
    "    examples=[example1, example2],\n",
    "    model=\"openai:/gpt-4\",\n",
    "    greater_is_better=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Evaluate"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We need to set our OpenAI API key, since we are using GPT-4 for our LLM-judged metrics.\n",
    "\n",
    "In order to set your private key safely, please be sure to either export your key through a command-line terminal for your current instance, or, for a permanent addition to all user-based sessions, configure your favored environment management configuration file (i.e., .bashrc, .zshrc) to have the following entry:\n",
    "\n",
    "`OPENAI_API_KEY=<your openai API key>`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we can call `mlflow.evaluate()`. Just to test it out, let's use the first 10 rows of the data. Using the `\"text\"` model type, toxicity and readability metrics are calculated as builtin metrics. We also pass in the two metrics we defined above into the `extra_metrics` parameter to be evaluated."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "d057748e00924cf0a195719c509f03a3",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading artifacts:   0%|          | 0/79 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2023/12/28 11:57:30 INFO mlflow.store.artifact.artifact_repo: The progress bar can be disabled by setting the environment variable MLFLOW_ENABLE_ARTIFACTS_PROGRESS_BAR to false\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "4178d977ace3454a8af3a0e81972e76b",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Loading checkpoint shards:   0%|          | 0/66 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2023/12/28 12:00:25 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.\n",
      "2023/12/28 12:00:25 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.\n",
      "2023/12/28 12:02:23 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...\n",
      "Using default facebook/roberta-hate-speech-dynabench-r4-target checkpoint\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "f26345dbf35a45a69dc9e09c04c391a9",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/1 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "309f7fb9ba764ab7b8fae3bed1ed15d7",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/1 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2023/12/28 12:02:43 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: token_count\n",
      "2023/12/28 12:02:43 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: toxicity\n",
      "2023/12/28 12:02:44 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: flesch_kincaid_grade_level\n",
      "2023/12/28 12:02:44 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: ari_grade_level\n",
      "2023/12/28 12:02:44 INFO mlflow.models.evaluation.default_evaluator: Evaluating metrics: answer_correctness\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "f5ec353b5c394f74bc4035e3afad4726",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/10 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2023/12/28 12:02:53 INFO mlflow.models.evaluation.default_evaluator: Evaluating metrics: answer_quality\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "fe85000e2aad4c40993382a8de039e09",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/10 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "with mlflow.start_run():\n",
    "    results = mlflow.evaluate(\n",
    "        model_info.model_uri,\n",
    "        eval_df.head(10),\n",
    "        evaluators=\"default\",\n",
    "        model_type=\"text\",\n",
    "        targets=\"output\",\n",
    "        extra_metrics=[answer_correctness_metric, answer_quality_metric],\n",
    "        evaluator_config={\"col_mapping\": {\"inputs\": \"instruction\"}},\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### View results"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`results.metrics` is a dictionary with the aggregate values for all the metrics calculated. Refer [here](https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.evaluate) for details on the builtin metrics for each model type."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'toxicity/v1/mean': 0.00809656630299287,\n",
       " 'toxicity/v1/variance': 0.0004603014839856817,\n",
       " 'toxicity/v1/p90': 0.010559113975614286,\n",
       " 'toxicity/v1/ratio': 0.0,\n",
       " 'flesch_kincaid_grade_level/v1/mean': 4.9,\n",
       " 'flesch_kincaid_grade_level/v1/variance': 6.3500000000000005,\n",
       " 'flesch_kincaid_grade_level/v1/p90': 6.829999999999998,\n",
       " 'ari_grade_level/v1/mean': 4.1899999999999995,\n",
       " 'ari_grade_level/v1/variance': 16.6329,\n",
       " 'ari_grade_level/v1/p90': 7.949999999999998,\n",
       " 'answer_correctness/v1/mean': 1.5,\n",
       " 'answer_correctness/v1/variance': 1.45,\n",
       " 'answer_correctness/v1/p90': 2.299999999999999,\n",
       " 'answer_quality/v1/mean': 2.4,\n",
       " 'answer_quality/v1/variance': 1.44,\n",
       " 'answer_quality/v1/p90': 4.1}"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "results.metrics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also view the `eval_results_table`, which shows us the metrics for each row of data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "2d968df3d2614ae49a15236fd1fd12f6",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading artifacts:   0%|          | 0/1 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>instruction</th>\n",
       "      <th>input</th>\n",
       "      <th>text</th>\n",
       "      <th>output</th>\n",
       "      <th>outputs</th>\n",
       "      <th>token_count</th>\n",
       "      <th>toxicity/v1/score</th>\n",
       "      <th>flesch_kincaid_grade_level/v1/score</th>\n",
       "      <th>ari_grade_level/v1/score</th>\n",
       "      <th>answer_correctness/v1/score</th>\n",
       "      <th>answer_correctness/v1/justification</th>\n",
       "      <th>answer_quality/v1/score</th>\n",
       "      <th>answer_quality/v1/justification</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Give three tips for staying healthy.</td>\n",
       "      <td></td>\n",
       "      <td>Below is an instruction that describes a task....</td>\n",
       "      <td>1.Eat a balanced diet and make sure to include...</td>\n",
       "      <td>Give three tips for staying healthy.\\n1. Eat a...</td>\n",
       "      <td>19</td>\n",
       "      <td>0.000446</td>\n",
       "      <td>4.1</td>\n",
       "      <td>4.0</td>\n",
       "      <td>2</td>\n",
       "      <td>The output provided by the model only includes...</td>\n",
       "      <td>3</td>\n",
       "      <td>The output is understandable and fluent but it...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>What are the three primary colors?</td>\n",
       "      <td></td>\n",
       "      <td>Below is an instruction that describes a task....</td>\n",
       "      <td>The three primary colors are red, blue, and ye...</td>\n",
       "      <td>What are the three primary colors?\\nThe three ...</td>\n",
       "      <td>19</td>\n",
       "      <td>0.000217</td>\n",
       "      <td>5.0</td>\n",
       "      <td>4.9</td>\n",
       "      <td>5</td>\n",
       "      <td>The output provided by the model is completely...</td>\n",
       "      <td>5</td>\n",
       "      <td>The model's output is fluent, clear, and conci...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Describe the structure of an atom.</td>\n",
       "      <td></td>\n",
       "      <td>Below is an instruction that describes a task....</td>\n",
       "      <td>An atom is made up of a nucleus, which contain...</td>\n",
       "      <td>Describe the structure of an atom.\\nAn atom is...</td>\n",
       "      <td>18</td>\n",
       "      <td>0.000139</td>\n",
       "      <td>3.1</td>\n",
       "      <td>2.2</td>\n",
       "      <td>1</td>\n",
       "      <td>The output provided by the model is incomplete...</td>\n",
       "      <td>2</td>\n",
       "      <td>The output is incomplete and lacks clarity, ma...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>How can we reduce air pollution?</td>\n",
       "      <td></td>\n",
       "      <td>Below is an instruction that describes a task....</td>\n",
       "      <td>There are a number of ways to reduce air pollu...</td>\n",
       "      <td>How can we reduce air pollution?\\nThere are ma...</td>\n",
       "      <td>18</td>\n",
       "      <td>0.000140</td>\n",
       "      <td>5.0</td>\n",
       "      <td>5.5</td>\n",
       "      <td>1</td>\n",
       "      <td>The output provided by the model is completely...</td>\n",
       "      <td>1</td>\n",
       "      <td>The output is entirely incomprehensible and ca...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Describe a time when you had to make a difficu...</td>\n",
       "      <td></td>\n",
       "      <td>Below is an instruction that describes a task....</td>\n",
       "      <td>I had to make a difficult decision when I was ...</td>\n",
       "      <td>Describe a time when you had to make a difficu...</td>\n",
       "      <td>18</td>\n",
       "      <td>0.000159</td>\n",
       "      <td>5.2</td>\n",
       "      <td>2.9</td>\n",
       "      <td>1</td>\n",
       "      <td>The output provided by the model is completely...</td>\n",
       "      <td>2</td>\n",
       "      <td>The output is incomplete and lacks clarity, ma...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Identify the odd one out.</td>\n",
       "      <td>Twitter, Instagram, Telegram</td>\n",
       "      <td>Below is an instruction that describes a task,...</td>\n",
       "      <td>Telegram</td>\n",
       "      <td>Identify the odd one out.\\n\\n1. A car\\n2. A tr...</td>\n",
       "      <td>18</td>\n",
       "      <td>0.072345</td>\n",
       "      <td>0.1</td>\n",
       "      <td>-5.4</td>\n",
       "      <td>1</td>\n",
       "      <td>The output provided by the model is completely...</td>\n",
       "      <td>2</td>\n",
       "      <td>The output is not clear and lacks fluency. The...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Explain why the following fraction is equivale...</td>\n",
       "      <td>4/16</td>\n",
       "      <td>Below is an instruction that describes a task,...</td>\n",
       "      <td>The fraction 4/16 is equivalent to 1/4 because...</td>\n",
       "      <td>Explain why the following fraction is equivale...</td>\n",
       "      <td>23</td>\n",
       "      <td>0.000320</td>\n",
       "      <td>6.4</td>\n",
       "      <td>7.6</td>\n",
       "      <td>1</td>\n",
       "      <td>The output provided by the model is completely...</td>\n",
       "      <td>2</td>\n",
       "      <td>The output is not clear and does not answer th...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Write a short story in third person narration ...</td>\n",
       "      <td></td>\n",
       "      <td>Below is an instruction that describes a task....</td>\n",
       "      <td>John was at a crossroads in his life. He had j...</td>\n",
       "      <td>Write a short story in third person narration ...</td>\n",
       "      <td>20</td>\n",
       "      <td>0.000247</td>\n",
       "      <td>10.7</td>\n",
       "      <td>11.1</td>\n",
       "      <td>1</td>\n",
       "      <td>The output provided by the model is completely...</td>\n",
       "      <td>1</td>\n",
       "      <td>The output is exactly the same as the input, a...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Render a 3D model of a house</td>\n",
       "      <td></td>\n",
       "      <td>Below is an instruction that describes a task....</td>\n",
       "      <td>&lt;nooutput&gt; This type of instruction cannot be ...</td>\n",
       "      <td>Render a 3D model of a house in Blender - Blen...</td>\n",
       "      <td>19</td>\n",
       "      <td>0.003694</td>\n",
       "      <td>5.2</td>\n",
       "      <td>2.7</td>\n",
       "      <td>1</td>\n",
       "      <td>The output provided by the model is completely...</td>\n",
       "      <td>2</td>\n",
       "      <td>The output is partially understandable but lac...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Evaluate this sentence for spelling and gramma...</td>\n",
       "      <td>He finnished his meal and left the resturant</td>\n",
       "      <td>Below is an instruction that describes a task,...</td>\n",
       "      <td>He finished his meal and left the restaurant.</td>\n",
       "      <td>Evaluate this sentence for spelling and gramma...</td>\n",
       "      <td>18</td>\n",
       "      <td>0.003260</td>\n",
       "      <td>4.2</td>\n",
       "      <td>6.4</td>\n",
       "      <td>1</td>\n",
       "      <td>The output provided by the model is completely...</td>\n",
       "      <td>4</td>\n",
       "      <td>The output is fluent and clear, but it is not ...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                         instruction  \\\n",
       "0               Give three tips for staying healthy.   \n",
       "1                 What are the three primary colors?   \n",
       "2                 Describe the structure of an atom.   \n",
       "3                   How can we reduce air pollution?   \n",
       "4  Describe a time when you had to make a difficu...   \n",
       "5                          Identify the odd one out.   \n",
       "6  Explain why the following fraction is equivale...   \n",
       "7  Write a short story in third person narration ...   \n",
       "8                       Render a 3D model of a house   \n",
       "9  Evaluate this sentence for spelling and gramma...   \n",
       "\n",
       "                                          input  \\\n",
       "0                                                 \n",
       "1                                                 \n",
       "2                                                 \n",
       "3                                                 \n",
       "4                                                 \n",
       "5                  Twitter, Instagram, Telegram   \n",
       "6                                          4/16   \n",
       "7                                                 \n",
       "8                                                 \n",
       "9  He finnished his meal and left the resturant   \n",
       "\n",
       "                                                text  \\\n",
       "0  Below is an instruction that describes a task....   \n",
       "1  Below is an instruction that describes a task....   \n",
       "2  Below is an instruction that describes a task....   \n",
       "3  Below is an instruction that describes a task....   \n",
       "4  Below is an instruction that describes a task....   \n",
       "5  Below is an instruction that describes a task,...   \n",
       "6  Below is an instruction that describes a task,...   \n",
       "7  Below is an instruction that describes a task....   \n",
       "8  Below is an instruction that describes a task....   \n",
       "9  Below is an instruction that describes a task,...   \n",
       "\n",
       "                                              output  \\\n",
       "0  1.Eat a balanced diet and make sure to include...   \n",
       "1  The three primary colors are red, blue, and ye...   \n",
       "2  An atom is made up of a nucleus, which contain...   \n",
       "3  There are a number of ways to reduce air pollu...   \n",
       "4  I had to make a difficult decision when I was ...   \n",
       "5                                           Telegram   \n",
       "6  The fraction 4/16 is equivalent to 1/4 because...   \n",
       "7  John was at a crossroads in his life. He had j...   \n",
       "8  <nooutput> This type of instruction cannot be ...   \n",
       "9      He finished his meal and left the restaurant.   \n",
       "\n",
       "                                             outputs  token_count  \\\n",
       "0  Give three tips for staying healthy.\\n1. Eat a...           19   \n",
       "1  What are the three primary colors?\\nThe three ...           19   \n",
       "2  Describe the structure of an atom.\\nAn atom is...           18   \n",
       "3  How can we reduce air pollution?\\nThere are ma...           18   \n",
       "4  Describe a time when you had to make a difficu...           18   \n",
       "5  Identify the odd one out.\\n\\n1. A car\\n2. A tr...           18   \n",
       "6  Explain why the following fraction is equivale...           23   \n",
       "7  Write a short story in third person narration ...           20   \n",
       "8  Render a 3D model of a house in Blender - Blen...           19   \n",
       "9  Evaluate this sentence for spelling and gramma...           18   \n",
       "\n",
       "   toxicity/v1/score  flesch_kincaid_grade_level/v1/score  \\\n",
       "0           0.000446                                  4.1   \n",
       "1           0.000217                                  5.0   \n",
       "2           0.000139                                  3.1   \n",
       "3           0.000140                                  5.0   \n",
       "4           0.000159                                  5.2   \n",
       "5           0.072345                                  0.1   \n",
       "6           0.000320                                  6.4   \n",
       "7           0.000247                                 10.7   \n",
       "8           0.003694                                  5.2   \n",
       "9           0.003260                                  4.2   \n",
       "\n",
       "   ari_grade_level/v1/score  answer_correctness/v1/score  \\\n",
       "0                       4.0                            2   \n",
       "1                       4.9                            5   \n",
       "2                       2.2                            1   \n",
       "3                       5.5                            1   \n",
       "4                       2.9                            1   \n",
       "5                      -5.4                            1   \n",
       "6                       7.6                            1   \n",
       "7                      11.1                            1   \n",
       "8                       2.7                            1   \n",
       "9                       6.4                            1   \n",
       "\n",
       "                 answer_correctness/v1/justification  answer_quality/v1/score  \\\n",
       "0  The output provided by the model only includes...                        3   \n",
       "1  The output provided by the model is completely...                        5   \n",
       "2  The output provided by the model is incomplete...                        2   \n",
       "3  The output provided by the model is completely...                        1   \n",
       "4  The output provided by the model is completely...                        2   \n",
       "5  The output provided by the model is completely...                        2   \n",
       "6  The output provided by the model is completely...                        2   \n",
       "7  The output provided by the model is completely...                        1   \n",
       "8  The output provided by the model is completely...                        2   \n",
       "9  The output provided by the model is completely...                        4   \n",
       "\n",
       "                     answer_quality/v1/justification  \n",
       "0  The output is understandable and fluent but it...  \n",
       "1  The model's output is fluent, clear, and conci...  \n",
       "2  The output is incomplete and lacks clarity, ma...  \n",
       "3  The output is entirely incomprehensible and ca...  \n",
       "4  The output is incomplete and lacks clarity, ma...  \n",
       "5  The output is not clear and lacks fluency. The...  \n",
       "6  The output is not clear and does not answer th...  \n",
       "7  The output is exactly the same as the input, a...  \n",
       "8  The output is partially understandable but lac...  \n",
       "9  The output is fluent and clear, but it is not ...  "
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "results.tables[\"eval_results_table\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### View results in UI"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we can view our evaluation results in the MLflow UI. We can select our experiment on the left sidebar, which will bring us to the following page. We can see that one run logged our model \"mpt-7b-chat\", and the other run has the dataset we evaluated."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![Evaluation Main](https://i.imgur.com/alymcBq.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We click on the Evaluation tab and hide any irrelevant runs."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![Evaluation Filtering](https://i.imgur.com/sr7R9TL.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now choose what columns we want to group by, as well as which column we want to compare. In the following example, we are looking at the score for answer correctness for each input-output pair, but we could choose any other metric to compare."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![Evaluation Selection](https://i.imgur.com/AeoYMEt.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we get to the following view, where we can see the justification and score for answer correctness for each row."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![Evaluation Comparison](https://i.imgur.com/axsHZxP.png)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.17"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}