{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluate a Hugging Face LLM with `mlflow.evaluate()`\n", "\n", "This guide shows how to load a pretrained Hugging Face pipeline, log it to MLflow, and use `mlflow.evaluate()` to compute builtin metrics as well as custom LLM-judged metrics for the model.\n", "\n", "For detailed information, please read the documentation on [using MLflow evaluate](https://mlflow.org/docs/latest/llms/llm-evaluate/index.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Start MLflow Server\n", "\n", "You can either:\n", "\n", "- Start a local tracking server by running `mlflow ui` from the directory that contains your notebook.\n", "- Use a tracking server, as described in [this overview](https://mlflow.org/docs/latest/getting-started/tracking-server-overview/index.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Install necessary dependencies" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install -q mlflow transformers torch torchvision evaluate datasets openai tiktoken fastapi rouge_score textstat" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Necessary imports\n", "import warnings\n", "\n", "import pandas as pd\n", "from datasets import load_dataset\n", "from transformers import pipeline\n", "\n", "import mlflow\n", "from mlflow.metrics.genai import EvaluationExample, answer_correctness, make_genai_metric" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Disable FutureWarnings\n", "warnings.filterwarnings(\"ignore\", category=FutureWarning)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load a pretrained Hugging Face pipeline\n", "\n", "Here we are loading a text generation pipeline, but you can also use 
either a text summarization or question answering pipeline." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "f258c308f1504fb0937f81475dc5fffd", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Loading checkpoint shards: 0%| | 0/2 [00:00\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
<table>\n",
"<thead><tr><th></th><th>instruction</th><th>input</th><th>output</th><th>text</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>Give three tips for staying healthy.</td><td></td><td>1.Eat a balanced diet and make sure to include...</td><td>Below is an instruction that describes a task....</td></tr>\n",
"<tr><th>1</th><td>What are the three primary colors?</td><td></td><td>The three primary colors are red, blue, and ye...</td><td>Below is an instruction that describes a task....</td></tr>\n",
"<tr><th>2</th><td>Describe the structure of an atom.</td><td></td><td>An atom is made up of a nucleus, which contain...</td><td>Below is an instruction that describes a task....</td></tr>\n",
"<tr><th>3</th><td>How can we reduce air pollution?</td><td></td><td>There are a number of ways to reduce air pollu...</td><td>Below is an instruction that describes a task....</td></tr>\n",
"<tr><th>4</th><td>Describe a time when you had to make a difficu...</td><td></td><td>I had to make a difficult decision when I was ...</td><td>Below is an instruction that describes a task....</td></tr>\n",
"<tr><th>5</th><td>Identify the odd one out.</td><td>Twitter, Instagram, Telegram</td><td>Telegram</td><td>Below is an instruction that describes a task,...</td></tr>\n",
"<tr><th>6</th><td>Explain why the following fraction is equivale...</td><td>4/16</td><td>The fraction 4/16 is equivalent to 1/4 because...</td><td>Below is an instruction that describes a task,...</td></tr>\n",
"<tr><th>7</th><td>Write a short story in third person narration ...</td><td></td><td>John was at a crossroads in his life. He had j...</td><td>Below is an instruction that describes a task....</td></tr>\n",
"<tr><th>8</th><td>Render a 3D model of a house</td><td></td><td>&lt;nooutput&gt; This type of instruction cannot be ...</td><td>Below is an instruction that describes a task....</td></tr>\n",
"<tr><th>9</th><td>Evaluate this sentence for spelling and gramma...</td><td>He finnished his meal and left the resturant</td><td>He finished his meal and left the restaurant.</td><td>Below is an instruction that describes a task,...</td></tr>\n",
"</tbody>\n",
"</table>
\n", "" ], "text/plain": [ " instruction \\\n", "0 Give three tips for staying healthy. \n", "1 What are the three primary colors? \n", "2 Describe the structure of an atom. \n", "3 How can we reduce air pollution? \n", "4 Describe a time when you had to make a difficu... \n", "5 Identify the odd one out. \n", "6 Explain why the following fraction is equivale... \n", "7 Write a short story in third person narration ... \n", "8 Render a 3D model of a house \n", "9 Evaluate this sentence for spelling and gramma... \n", "\n", " input \\\n", "0 \n", "1 \n", "2 \n", "3 \n", "4 \n", "5 Twitter, Instagram, Telegram \n", "6 4/16 \n", "7 \n", "8 \n", "9 He finnished his meal and left the resturant \n", "\n", " output \\\n", "0 1.Eat a balanced diet and make sure to include... \n", "1 The three primary colors are red, blue, and ye... \n", "2 An atom is made up of a nucleus, which contain... \n", "3 There are a number of ways to reduce air pollu... \n", "4 I had to make a difficult decision when I was ... \n", "5 Telegram \n", "6 The fraction 4/16 is equivalent to 1/4 because... \n", "7 John was at a crossroads in his life. He had j... \n", "8 This type of instruction cannot be ... \n", "9 He finished his meal and left the restaurant. \n", "\n", " text \n", "0 Below is an instruction that describes a task.... \n", "1 Below is an instruction that describes a task.... \n", "2 Below is an instruction that describes a task.... \n", "3 Below is an instruction that describes a task.... \n", "4 Below is an instruction that describes a task.... \n", "5 Below is an instruction that describes a task,... \n", "6 Below is an instruction that describes a task,... \n", "7 Below is an instruction that describes a task.... \n", "8 Below is an instruction that describes a task.... \n", "9 Below is an instruction that describes a task,... 
" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset = load_dataset(\"tatsu-lab/alpaca\")\n", "eval_df = pd.DataFrame(dataset[\"train\"])\n", "eval_df.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Define Metrics\n", "\n", "Since we are evaluating how well our model can provide an answer to a given instruction, we may want to choose some metrics to help measure this on top of any builtin metrics that `mlflow.evaluate()` gives us.\n", "\n", "Let's measure how well our model is doing on the following two metrics:\n", "\n", "- **Is the answer correct?** Let's use the predefined metric `answer_correctness` here.\n", "- **Is the answer fluent, clear, and concise?** We will define a custom metric `answer_quality` to measure this.\n", "\n", "We will need to pass both of these into the `extra_metrics` argument for `mlflow.evaluate()` in order to assess the quality of our model.\n", "\n", "#### What is an Evaluation Metric?\n", "\n", "An evaluation metric encapsulates any quantitative or qualitative measure you want to calculate for your model. For each model type, `mlflow.evaluate()` will automatically calculate some set of builtin metrics. Refer [here](https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.evaluate) for which builtin metrics will be calculated for each model type. You can also pass in any other metrics you want to calculate as extra metrics. MLflow provides a set of predefined metrics that you can find [here](https://mlflow.org/docs/latest/python_api/mlflow.metrics.html), or you can define your own custom metrics. In the example here, we will use the combination of predefined metrics `mlflow.metrics.genai.answer_correctness` and a custom metric for the quality evaluation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's load our predefined metrics - in this case we are using `answer_correctness` with GPT-4." 
 ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "answer_correctness_metric = answer_correctness(model=\"openai:/gpt-4\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we want to create a custom LLM-judged metric named `answer_quality` using `make_genai_metric()`. We need to provide a metric definition and a grading rubric, as well as some examples for the LLM judge to use." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# The definition explains what \"answer quality\" entails\n", "answer_quality_definition = \"\"\"Please evaluate answer quality for the provided output on the following criteria:\n", "fluency, clarity, and conciseness. Each of the criteria is defined as follows:\n", " - Fluency measures how naturally and smoothly the output reads.\n", " - Clarity measures how understandable the output is.\n", " - Conciseness measures the brevity and efficiency of the output without compromising meaning.\n", "The more fluent, clear, and concise a text, the higher the score it deserves.\n", "\"\"\"\n", "\n", "# The grading prompt explains what each possible score means\n", "answer_quality_grading_prompt = \"\"\"Answer quality: Below are the details for different scores:\n", " - Score 1: The output is entirely incomprehensible and cannot be read.\n", " - Score 2: The output conveys some meaning, but needs a lot of improvement in fluency, clarity, and conciseness.\n", " - Score 3: The output is understandable but still needs improvement.\n", " - Score 4: The output performs well on two of fluency, clarity, and conciseness, but could be improved on one of these criteria.\n", " - Score 5: The output reads smoothly, is easy to understand, and is concise. 
There is no clear way to improve the output on these criteria.\n", "\"\"\"\n", "\n", "# We provide an example of a \"bad\" output\n", "example1 = EvaluationExample(\n", " input=\"What is MLflow?\",\n", " output=\"MLflow is an open-source platform. For managing machine learning workflows, it \"\n", " \"including experiment tracking model packaging versioning and deployment as well as a platform \"\n", " \"simplifying for on the ML lifecycle.\",\n", " score=2,\n", " justification=\"The output is difficult to understand and demonstrates extremely low clarity. \"\n", " \"However, it still conveys some meaning so this output deserves a score of 2.\",\n", ")\n", "\n", "# We also provide an example of a \"good\" output\n", "example2 = EvaluationExample(\n", " input=\"What is MLflow?\",\n", " output=\"MLflow is an open-source platform for managing machine learning workflows, including \"\n", " \"experiment tracking, model packaging, versioning, and deployment.\",\n", " score=5,\n", " justification=\"The output is easily understandable, clear, and concise. 
It deserves a score of 5.\",\n", ")\n", "\n", "answer_quality_metric = make_genai_metric(\n", " name=\"answer_quality\",\n", " definition=answer_quality_definition,\n", " grading_prompt=answer_quality_grading_prompt,\n", " version=\"v1\",\n", " examples=[example1, example2],\n", " model=\"openai:/gpt-4\",\n", " greater_is_better=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluate" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We need to set our OpenAI API key, since we are using GPT-4 for our LLM-judged metrics.\n", "\n", "To set your private key safely, either export it in the terminal session from which you launch the notebook, or, to make it available in every session, add the following entry to your shell configuration file (e.g., .bashrc or .zshrc):\n", "\n", "`OPENAI_API_KEY=`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we can call `mlflow.evaluate()`. Just to test it out, let's use the first 10 rows of the data. With the `\"text\"` model type, toxicity and readability metrics are calculated as builtin metrics. We also pass the two metrics we defined above in the `extra_metrics` parameter so that they are evaluated as well." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "d057748e00924cf0a195719c509f03a3", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading artifacts: 0%| | 0/79 [00:00\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
<table>\n",
"<thead><tr><th></th><th>instruction</th><th>input</th><th>text</th><th>output</th><th>outputs</th><th>token_count</th><th>toxicity/v1/score</th><th>flesch_kincaid_grade_level/v1/score</th><th>ari_grade_level/v1/score</th><th>answer_correctness/v1/score</th><th>answer_correctness/v1/justification</th><th>answer_quality/v1/score</th><th>answer_quality/v1/justification</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>Give three tips for staying healthy.</td><td></td><td>Below is an instruction that describes a task....</td><td>1.Eat a balanced diet and make sure to include...</td><td>Give three tips for staying healthy.\\n1. Eat a...</td><td>19</td><td>0.000446</td><td>4.1</td><td>4.0</td><td>2</td><td>The output provided by the model only includes...</td><td>3</td><td>The output is understandable and fluent but it...</td></tr>\n",
"<tr><th>1</th><td>What are the three primary colors?</td><td></td><td>Below is an instruction that describes a task....</td><td>The three primary colors are red, blue, and ye...</td><td>What are the three primary colors?\\nThe three ...</td><td>19</td><td>0.000217</td><td>5.0</td><td>4.9</td><td>5</td><td>The output provided by the model is completely...</td><td>5</td><td>The model's output is fluent, clear, and conci...</td></tr>\n",
"<tr><th>2</th><td>Describe the structure of an atom.</td><td></td><td>Below is an instruction that describes a task....</td><td>An atom is made up of a nucleus, which contain...</td><td>Describe the structure of an atom.\\nAn atom is...</td><td>18</td><td>0.000139</td><td>3.1</td><td>2.2</td><td>1</td><td>The output provided by the model is incomplete...</td><td>2</td><td>The output is incomplete and lacks clarity, ma...</td></tr>\n",
"<tr><th>3</th><td>How can we reduce air pollution?</td><td></td><td>Below is an instruction that describes a task....</td><td>There are a number of ways to reduce air pollu...</td><td>How can we reduce air pollution?\\nThere are ma...</td><td>18</td><td>0.000140</td><td>5.0</td><td>5.5</td><td>1</td><td>The output provided by the model is completely...</td><td>1</td><td>The output is entirely incomprehensible and ca...</td></tr>\n",
"<tr><th>4</th><td>Describe a time when you had to make a difficu...</td><td></td><td>Below is an instruction that describes a task....</td><td>I had to make a difficult decision when I was ...</td><td>Describe a time when you had to make a difficu...</td><td>18</td><td>0.000159</td><td>5.2</td><td>2.9</td><td>1</td><td>The output provided by the model is completely...</td><td>2</td><td>The output is incomplete and lacks clarity, ma...</td></tr>\n",
"<tr><th>5</th><td>Identify the odd one out.</td><td>Twitter, Instagram, Telegram</td><td>Below is an instruction that describes a task,...</td><td>Telegram</td><td>Identify the odd one out.\\n\\n1. A car\\n2. A tr...</td><td>18</td><td>0.072345</td><td>0.1</td><td>-5.4</td><td>1</td><td>The output provided by the model is completely...</td><td>2</td><td>The output is not clear and lacks fluency. The...</td></tr>\n",
"<tr><th>6</th><td>Explain why the following fraction is equivale...</td><td>4/16</td><td>Below is an instruction that describes a task,...</td><td>The fraction 4/16 is equivalent to 1/4 because...</td><td>Explain why the following fraction is equivale...</td><td>23</td><td>0.000320</td><td>6.4</td><td>7.6</td><td>1</td><td>The output provided by the model is completely...</td><td>2</td><td>The output is not clear and does not answer th...</td></tr>\n",
"<tr><th>7</th><td>Write a short story in third person narration ...</td><td></td><td>Below is an instruction that describes a task....</td><td>John was at a crossroads in his life. He had j...</td><td>Write a short story in third person narration ...</td><td>20</td><td>0.000247</td><td>10.7</td><td>11.1</td><td>1</td><td>The output provided by the model is completely...</td><td>1</td><td>The output is exactly the same as the input, a...</td></tr>\n",
"<tr><th>8</th><td>Render a 3D model of a house</td><td></td><td>Below is an instruction that describes a task....</td><td>&lt;nooutput&gt; This type of instruction cannot be ...</td><td>Render a 3D model of a house in Blender - Blen...</td><td>19</td><td>0.003694</td><td>5.2</td><td>2.7</td><td>1</td><td>The output provided by the model is completely...</td><td>2</td><td>The output is partially understandable but lac...</td></tr>\n",
"<tr><th>9</th><td>Evaluate this sentence for spelling and gramma...</td><td>He finnished his meal and left the resturant</td><td>Below is an instruction that describes a task,...</td><td>He finished his meal and left the restaurant.</td><td>Evaluate this sentence for spelling and gramma...</td><td>18</td><td>0.003260</td><td>4.2</td><td>6.4</td><td>1</td><td>The output provided by the model is completely...</td><td>4</td><td>The output is fluent and clear, but it is not ...</td></tr>\n",
"</tbody>\n",
"</table>
\n", "" ], "text/plain": [ " instruction \\\n", "0 Give three tips for staying healthy. \n", "1 What are the three primary colors? \n", "2 Describe the structure of an atom. \n", "3 How can we reduce air pollution? \n", "4 Describe a time when you had to make a difficu... \n", "5 Identify the odd one out. \n", "6 Explain why the following fraction is equivale... \n", "7 Write a short story in third person narration ... \n", "8 Render a 3D model of a house \n", "9 Evaluate this sentence for spelling and gramma... \n", "\n", " input \\\n", "0 \n", "1 \n", "2 \n", "3 \n", "4 \n", "5 Twitter, Instagram, Telegram \n", "6 4/16 \n", "7 \n", "8 \n", "9 He finnished his meal and left the resturant \n", "\n", " text \\\n", "0 Below is an instruction that describes a task.... \n", "1 Below is an instruction that describes a task.... \n", "2 Below is an instruction that describes a task.... \n", "3 Below is an instruction that describes a task.... \n", "4 Below is an instruction that describes a task.... \n", "5 Below is an instruction that describes a task,... \n", "6 Below is an instruction that describes a task,... \n", "7 Below is an instruction that describes a task.... \n", "8 Below is an instruction that describes a task.... \n", "9 Below is an instruction that describes a task,... \n", "\n", " output \\\n", "0 1.Eat a balanced diet and make sure to include... \n", "1 The three primary colors are red, blue, and ye... \n", "2 An atom is made up of a nucleus, which contain... \n", "3 There are a number of ways to reduce air pollu... \n", "4 I had to make a difficult decision when I was ... \n", "5 Telegram \n", "6 The fraction 4/16 is equivalent to 1/4 because... \n", "7 John was at a crossroads in his life. He had j... \n", "8 This type of instruction cannot be ... \n", "9 He finished his meal and left the restaurant. \n", "\n", " outputs token_count \\\n", "0 Give three tips for staying healthy.\\n1. Eat a... 19 \n", "1 What are the three primary colors?\\nThe three ... 
19 \n", "2 Describe the structure of an atom.\\nAn atom is... 18 \n", "3 How can we reduce air pollution?\\nThere are ma... 18 \n", "4 Describe a time when you had to make a difficu... 18 \n", "5 Identify the odd one out.\\n\\n1. A car\\n2. A tr... 18 \n", "6 Explain why the following fraction is equivale... 23 \n", "7 Write a short story in third person narration ... 20 \n", "8 Render a 3D model of a house in Blender - Blen... 19 \n", "9 Evaluate this sentence for spelling and gramma... 18 \n", "\n", " toxicity/v1/score flesch_kincaid_grade_level/v1/score \\\n", "0 0.000446 4.1 \n", "1 0.000217 5.0 \n", "2 0.000139 3.1 \n", "3 0.000140 5.0 \n", "4 0.000159 5.2 \n", "5 0.072345 0.1 \n", "6 0.000320 6.4 \n", "7 0.000247 10.7 \n", "8 0.003694 5.2 \n", "9 0.003260 4.2 \n", "\n", " ari_grade_level/v1/score answer_correctness/v1/score \\\n", "0 4.0 2 \n", "1 4.9 5 \n", "2 2.2 1 \n", "3 5.5 1 \n", "4 2.9 1 \n", "5 -5.4 1 \n", "6 7.6 1 \n", "7 11.1 1 \n", "8 2.7 1 \n", "9 6.4 1 \n", "\n", " answer_correctness/v1/justification answer_quality/v1/score \\\n", "0 The output provided by the model only includes... 3 \n", "1 The output provided by the model is completely... 5 \n", "2 The output provided by the model is incomplete... 2 \n", "3 The output provided by the model is completely... 1 \n", "4 The output provided by the model is completely... 2 \n", "5 The output provided by the model is completely... 2 \n", "6 The output provided by the model is completely... 2 \n", "7 The output provided by the model is completely... 1 \n", "8 The output provided by the model is completely... 2 \n", "9 The output provided by the model is completely... 4 \n", "\n", " answer_quality/v1/justification \n", "0 The output is understandable and fluent but it... \n", "1 The model's output is fluent, clear, and conci... \n", "2 The output is incomplete and lacks clarity, ma... \n", "3 The output is entirely incomprehensible and ca... \n", "4 The output is incomplete and lacks clarity, ma... 
 \n", "5 The output is not clear and lacks fluency. The... \n", "6 The output is not clear and does not answer th... \n", "7 The output is exactly the same as the input, a... \n", "8 The output is partially understandable but lac... \n", "9 The output is fluent and clear, but it is not ... " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results.tables[\"eval_results_table\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### View results in UI" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now view our evaluation results in the MLflow UI. Selecting our experiment in the left sidebar brings us to the following page. We can see that one run logged our model \"mpt-7b-chat\", and the other run has the dataset we evaluated." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![Evaluation Main](https://i.imgur.com/alymcBq.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We click on the Evaluation tab and hide any irrelevant runs." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![Evaluation Filtering](https://i.imgur.com/sr7R9TL.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now choose which columns to group by, as well as which column to compare. In the following example, we are looking at the answer correctness score for each input-output pair, but we could choose any other metric to compare." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![Evaluation Selection](https://i.imgur.com/AeoYMEt.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we get to the following view, where we can see the justification and score for answer correctness for each row." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![Evaluation Comparison](https://i.imgur.com/axsHZxP.png)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.17" } }, "nbformat": 4, "nbformat_minor": 2 }