# MLflow 3 and DSPy — Develop and evaluate a *clickbait agent*.

This notebook demonstrates how to implement and evaluate a clickbait detector and rewriter agent using DSPy and MLflow 3's GenAI tools. It was developed as part of the talk "What Comes After Coding: Evaluating Agentic Behaviour" presented at the Madrid Databricks User Group Meetup.

The focus is on practical agent development and rigorous evaluation. Core topics include agent development with DSPy, MLflow instrumentation and the use of custom scorers and built-in judges to assess agentic behavior directly from execution traces.

## Environment Setup and Configuration

Ensure that the configuration settings in this section are reviewed and updated before proceeding.

These values determine where artifacts such as models are stored, which tokens are used to access external tools, and where the datasets are located.

In [0]:
%pip install -U -qqqq dspy "mlflow[databricks]>=3.1.0"
%restart_python

### Global variables

This notebook uses Unity Catalog to register models and other resources, and relies on a model serving endpoint for inference. You need to specify:

- `UC_CATALOG`: the Unity Catalog catalog where resources will be stored.
- `UC_SCHEMA`: the schema within the catalog to use.
- `model_serving_endpoint`: the name of the Databricks model serving endpoint that will handle inference requests. You can use a default one or [create your own](https://docs.databricks.com/aws/en/machine-learning/model-serving/create-manage-serving-endpoints).

In [0]:
UC_CATALOG = ""
UC_SCHEMA = ""

AGENT_PARAMETERS = {
 "model_serving_endpoint": "gemini-2-0-flash-lite"
}

#### Jina AI token

The agent will have access to an external tool that fetches website content from URLs. This tool is implemented using the *Jina AI Reader API*, a service that extracts content from web pages and converts it into clean, LLM-ready markdown text.

A free Jina AI API key can be obtained by following [this tutorial](https://www.youtube.com/watch?v=SLv6tSEKYOg).

The token is managed using [Databricks secrets](https://docs.databricks.com/aws/en/security/secrets/). For testing purposes, you can replace this definition by inlining the token string.

In [0]:
import base64
from databricks.sdk import WorkspaceClient

JINA_AI_TOKEN = base64.b64decode(
 WorkspaceClient().secrets.get_secret("credentials", "jinaai").value
).decode()

# JINA_AI_TOKEN = "YOUR_JINA_AI_TOKEN" # Replace with your Jina AI token

### Clickbait dataset

A corpus of clickbait and non-clickbait title examples will prove useful to evaluate our agent implementation. You can download a dataset with headlines and binary (clickbait or not) classifications from this [Kaggle dataset](https://www.kaggle.com/datasets/amananandrai/clickbait-dataset/data).

The `CORPUS_FILE` variable stores the path to the dataset in CSV format. In this example, the dataset is stored in a Unity Catalog volume.

In [0]:
CORPUS_FILE = ""

In [0]:
assert UC_CATALOG != "", "Please set UC_CATALOG to your catalog name"
assert UC_SCHEMA != "", "Please set UC_SCHEMA to your schema name"
assert "model_serving_endpoint" in AGENT_PARAMETERS and AGENT_PARAMETERS["model_serving_endpoint"] != "", "Please set AGENT_PARAMETERS['model_serving_endpoint'] to your model serving endpoint name"
assert JINA_AI_TOKEN != "", "Please set JINA_AI_TOKEN to your Jina AI token"
assert CORPUS_FILE != "", "Please set CORPUS_FILE to the path of your corpus file"

## Implement the agent

### Tool definitions

The agent has access to one tool for fetching website content:

- `fetch_url_title_and_content`: fetches the title and content from the given URL, using the *Jina AI Reader API*.

In [0]:
import requests
import json
from urllib.parse import urlparse

def fetch_url_title_and_content(url: str) -> dict[str, str]:
    """
    Gets the title and content from a URL.

    Arguments:
        url: The URL to fetch data from

    Returns:
        dict: Dictionary with title and content
    """

    # Default to https when the scheme is missing, then validate the host.
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        url = f"https://{url}"
        parsed = urlparse(url)
    if len(parsed.netloc) == 0:
        raise ValueError(f"Invalid URL: {url}")

    response = requests.get(f"https://r.jina.ai/{url}", headers={
        "Accept": "application/json",
        "Authorization": f"Bearer {JINA_AI_TOKEN}",
        "X-Md-Heading-Style": "setext",
        "X-Base": "final",
        "X-Retain-Images": "none",
        "X-Md-Link-Style": "discarded",
        "X-Timeout": "10",
    }).json()

    return {
        "title": response["data"].get("title", ""),
        "content": response["data"].get("content", ""),
    }
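
The URL normalization at the top of `fetch_url_title_and_content` can be checked without calling the Jina API. The following is a minimal sketch that factors the same logic into a standalone helper; the name `normalize_url` is hypothetical and does not appear in the notebook.

```python
from urllib.parse import urlparse


def normalize_url(url: str) -> str:
    """Return `url` with an https:// scheme added when none is present.

    Hypothetical helper mirroring the checks in fetch_url_title_and_content.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        # No usable scheme: assume https and parse again.
        url = f"https://{url}"
        parsed = urlparse(url)
    if len(parsed.netloc) == 0:
        raise ValueError(f"Invalid URL: {url}")
    return url


print(normalize_url("example.com/article"))  # https://example.com/article
print(normalize_url("http://example.com"))   # http://example.com
```

Keeping this step separate makes it possible to test malformed inputs (such as an empty string) without network access or an API token.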

### DSPy agent

DSPy agents are implemented by defining signatures and modules.

- `Signature`: In DSPy, signatures declaratively define the input and output behavior of your language model tasks. They abstract away the complexities of prompt engineering, allowing DSPy's compiler to optimize how your LM interacts with your data, leading to more robust and performant applications.
- `Module`: DSPy modules are the fundamental building blocks for constructing LM-powered programs. They encapsulate specific prompting techniques (like `ChainOfThought`) and can be easily composed. You can leverage predefined modules for common patterns or define your own custom modules for more specialized tasks, enabling flexible and modular agent design.

Enabling trace logging for executions of the agent with MLflow 3 is as simple as calling `mlflow.dspy.autolog()`.


In [0]:
import dspy
from typing import Any, Generator, Optional
from dataclasses import dataclass
from pprint import pprint
import mlflow
from mlflow.entities import SpanType
from mlflow.pyfunc.model import ChatAgent
import uuid
from mlflow.types.agent import (
 ChatAgentMessage,
 ChatAgentResponse,
 ChatContext,
)

mlflow.dspy.autolog()

dspy.configure(lm=dspy.LM(
    # Single quotes inside the f-string keep this valid on Python versions before 3.12
    model=f"databricks/{AGENT_PARAMETERS['model_serving_endpoint']}",
    max_tokens=16384
))
dspy.settings.configure(track_usage=True)

class TextHandler(dspy.Signature):
    "You are a text handler. You are given a text and you have to return the text split into the title contained in the text and the content. If the text is a URL or tells you to use a URL, you can use tools to fetch the content, which is already split into title and content."
    text: str = dspy.InputField(desc="The text to be split.")
    title: str = dspy.OutputField(desc="The title of the text.")
    content: str = dspy.OutputField(desc="The content of the text.")

class ClickBaitDetector(dspy.Signature):
    """Classify if the text provided is the title of a clickbait article."""
    title: str = dspy.InputField(desc="The title of an article.")
    is_clickbait: bool = dspy.OutputField(desc="Marks if the title is clickbait.")
    reason: str = dspy.OutputField(desc="The reason why the title is clickbait. Only the bullet points.")

class ClickBaitExtractor(dspy.Signature):
    """Obtain an alternative version of the title without clickbait, use the content of the article to answer any question the reader of the original title could have. Make the response short, concise and in the same language as the provided title."""
    title: str = dspy.InputField(desc="the title of an article")
    content: str = dspy.InputField(desc="the content of an article")
    response: str = dspy.OutputField(desc="the response to the clickbait in the language provided")

class ClickBaitTextAnalyzer(dspy.Module):
    def __init__(self, callbacks=None):
        super().__init__(callbacks)
        self.urlTool = dspy.Tool(fetch_url_title_and_content, name="urlTool", desc="A tool to fetch the title and content of a URL.")
        self.textHandler = dspy.ReAct(TextHandler, tools=[self.urlTool], max_iters=1)
        self.clickbait = dspy.ChainOfThought(ClickBaitDetector)
        self.extr = dspy.Predict(ClickBaitExtractor)

    def forward(self, message: str):
        """Detect if the title is clickbait. If it is, extract the response to the clickbait; if it is not, return the reason why it is not clickbait. If the tool fails and the task cannot continue, return the error in the title and put the reason in the content."""

        # Step 1: split the message (or the page behind a URL) into title and content
        titleAndContent = self.textHandler(text=message)
        # Step 2: classify the title
        clickbait = self.clickbait(title=titleAndContent.title)
        if clickbait.is_clickbait:
            # Step 3: rewrite the title using the article content
            response = self.extr(title=titleAndContent.title, content=titleAndContent.content)
            response.original_title = titleAndContent.title
            return response
        else:
            clickbait.original_title = titleAndContent.title
            return clickbait

### Logging a DSPy agent into MLflow

- `set_active_model` sets the active model with the specified name. Traces generated from now on are linked to this model. The model name can include the version; in this notebook the version is not tied to a versioning system. [This page](https://mlflow.org/docs/latest/genai/prompt-version-mgmt/version-tracking/track-application-versions-with-mlflow#step-3-link-traces-to-the-application-version) of the MLflow documentation gives some insight into agent versioning.
- Model hyperparameters can be associated with the active model using `log_model_params`.
- Finally, the agent is instantiated. It is important to leave this step for last, after enabling autologging and setting the active model.

In [0]:
from time import time

mlflow.dspy.autolog()  # the agent is built with DSPy, so the DSPy autologger is the relevant one
active_model_info = mlflow.set_active_model(name=f"clickbait-dspy-{int(time())}")
mlflow.log_model_params(model_id=active_model_info.model_id, params=AGENT_PARAMETERS)
agent = ClickBaitTextAnalyzer()

## Call the agent

With the agent instantiated, it can now be called as a function by passing it a user message.

We can confirm the autologging is working by checking the produced trace in the block results.

In [0]:
agent("is this article clickbait https://www.lavanguardia.com/cribeo/fast-news/20250527/10723991/popular-nombre-masculino-espana-jonathan-coeficiente-intelectual-bajo-mmn.amp.html f?")

It is also possible to call just a component of the agent.

In [0]:
agent.clickbait.predict(
 title="El actor mejor pagado tiene su última pelicula en nuestra plataforma de streaming"
)

## Agent evaluation

Once an agent is implemented it is essential to define a set of evaluation scores that enable both quantitative and qualitative assessments of its behavior. This enables the gathering of meaningful insights about how the model—or specific components of it—are performing.

MLflow 3's new GenAI API introduces `mlflow.genai.evaluate`, which runs scorer functions on existing traces or on the outputs of a function passed via the `predict_fn` parameter. The `predict_fn` can wrap the entire agent or individual components, making it possible to reproduce classic testing paradigms such as *integration tests* and *unit tests* within the agent evaluation workflow.

Evaluation metrics are defined using the `@scorer` decorator. Custom scorer functions can access inputs, outputs, traces, and expectations, giving them full context to evaluate agent behavior.

Custom scorers can typically be written in an implementation-agnostic way, so the agent logic can be replaced (potentially even with a different framework, such as LangGraph) without changing the scorers themselves. The only likely adjustment when switching implementations is updating the `predict_fn`, particularly if the new implementation introduces a different function signature.

Inside a scorer, it's possible to make custom calls to language models—for example, by passing them the output to assess specific properties. This approach resembles *property-based testing*, where evaluations are derived from general behavioral rules rather than hardcoded expected outputs. Scorers that rely on language models for evaluation are known as judges. Databricks provides several [built-in judges](https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/predefined-judge-scorers) for common evaluation scenarios, streamlining the process for standard use cases.
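
Not every property needs a language model. As a minimal sketch of the property-based idea, the checks below test general rules that a rewritten title could be expected to satisfy, independent of any specific expected output. The function name, the chosen properties, and the 120-character limit are illustrative assumptions, not part of the notebook or of MLflow.

```python
def rewrite_properties(original: str, rewritten: str, max_len: int = 120) -> dict[str, bool]:
    """Evaluate simple, deterministic properties of a rewritten title."""
    text = rewritten.strip()
    return {
        "non_empty": len(text) > 0,
        "within_length": len(text) <= max_len,
        "differs_from_original": text.lower() != original.strip().lower(),
        # Covers the English "?" and the Spanish opening mark "¿".
        "no_question": "?" not in text and "¿" not in text,
    }


# Made-up titles, used only to illustrate the call.
result = rewrite_properties(
    "You won't believe what this actor did next!",
    "Actor donates film earnings to charity",
)
print(result)
print(all(result.values()))
```

Inside a function decorated with `@scorer`, each entry could be reported as its own `Feedback`, leaving language-model judges for properties that cannot be checked mechanically.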

### Useful datasets

The following block loads two datasets: a sample of the *Kaggle clickbait dataset* and the articles in `urls_data.json`.

In [0]:
from mlflow.entities import Feedback, Trace
from mlflow.genai.scorers import scorer
import pandas as pd
import random

# Kaggle dataset
corpus = pd.read_csv(CORPUS_FILE)
SAMPLE_SIZE = 10
RANDOM_SEED = 42 # fix the random seed to get the same sample every time
clickbait_sample = corpus[corpus["clickbait"] == 1].sample(n=SAMPLE_SIZE//2, random_state=RANDOM_SEED)
non_clickbait_sample = corpus[corpus["clickbait"] == 0].sample(n=SAMPLE_SIZE//2, random_state=RANDOM_SEED)
corpus_sample = pd.concat([clickbait_sample, non_clickbait_sample], ignore_index=True)

# dataset with url, title and content (20 rows)
url_articles = pd.read_json("./urls_data.json")

### Classification correctness

The `check_clickbait` scorer evaluates whether the agent correctly classifies input titles as clickbait or not.
- The dataset consists of inputs with a `title` field and expectations containing the expected clickbait classification label.
- To isolate and evaluate just the classification logic, the `predict_fn` wraps the agent's `clickbait` module.
- The scorer compares the agent’s output to the expected label, marking incorrect predictions as either false positives or false negatives.

In [0]:
titles_dataset = (
 corpus_sample
 .apply(
 lambda row: {
 "inputs": {"title": row["headline"]},
 "expectations": {"is_clickbait": (row["clickbait"] == 1)},
 },
 axis=1,
 result_type="expand",
 )
 .sample(frac=1, random_state=42)
)

def run_clickbait(title: str):
    return agent.clickbait(title=title).toDict()

@scorer
def check_clickbait(outputs, expectations):

    value = outputs["is_clickbait"] == expectations["is_clickbait"]
    # The rationale must describe the outcome: correct when value is True
    rationale = (
        "The model predicted the correct clickbait status" if value else "The model predicted the wrong clickbait status"
    )

    feedback = [Feedback(
        name="success",
        value=value,
        rationale=rationale
    )]

    if not value:
        if expectations["is_clickbait"]:
            # Expected clickbait but predicted non-clickbait
            feedback.append(Feedback(
                name="false negative",
                value=True))
            feedback.append(Feedback(
                name="false positive",
                value=False))
        else:
            # Expected non-clickbait but predicted clickbait
            feedback.append(Feedback(
                name="false negative",
                value=False))
            feedback.append(Feedback(
                name="false positive",
                value=True))

    return feedback

evaluation = mlflow.genai.evaluate(
 data=titles_dataset,
 scorers=[check_clickbait],
 predict_fn=run_clickbait
)
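
The per-row `success`, `false positive`, and `false negative` feedback can be summarized into aggregate metrics. The sketch below shows the arithmetic on plain Python booleans; `confusion_counts` and `precision_recall` are hypothetical helpers, and the labels are made-up illustrative data rather than results from the agent.

```python
from collections import Counter


def confusion_counts(pairs):
    """Count TP/FP/FN/TN from (expected, predicted) boolean pairs."""
    counts = Counter()
    for expected, predicted in pairs:
        if expected and predicted:
            counts["tp"] += 1
        elif predicted:   # predicted clickbait, expected non-clickbait
            counts["fp"] += 1
        elif expected:    # expected clickbait, predicted non-clickbait
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def precision_recall(counts):
    """Return (precision, recall); a metric is None when its denominator is zero."""
    predicted_pos = counts["tp"] + counts["fp"]
    actual_pos = counts["tp"] + counts["fn"]
    precision = counts["tp"] / predicted_pos if predicted_pos else None
    recall = counts["tp"] / actual_pos if actual_pos else None
    return precision, recall


# Illustrative labels only: (expected, predicted)
pairs = [(True, True), (True, False), (False, False), (False, True), (True, True)]
counts = confusion_counts(pairs)
print(dict(counts))
print(precision_recall(counts))
```

With a sample of only ten titles these ratios are coarse, so they are best read alongside the individual traces.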

### Clickbait response correctness

The `check_clickbait_response` scorer evaluates whether the agent successfully rewrites titles originally classified as clickbait, eliminating the clickbait aspect in the revised version.

- The dataset includes title and content fields as inputs.
- To isolate and evaluate the rewriting logic, the `predict_fn` wraps the agent's `extr` module.
- The scorer then uses the agent's `clickbait` module as a judge, applying it to the rewritten output to determine whether it would still be classified as clickbait.

In [0]:
clickbait_titles_dataset = (
 url_articles
 .apply(
 lambda row: {
 "inputs": {
 "title": row["title"],
 "content": row["content"]
 }
 },
 axis=1,
 result_type="expand",
 )
)

def run_extr(title: str, content: str):
    return agent.extr(title=title, content=content).toDict()

@scorer
def check_clickbait_response(inputs, outputs):
    # run_extr returns the ClickBaitExtractor fields, so the rewrite is under "response"
    output_title = outputs["response"]
    judge_output = agent.clickbait(title=output_title)

    return Feedback(
        name="response is no clickbait",
        value=not judge_output.is_clickbait,
        # ClickBaitDetector exposes its explanation in the `reason` field
        rationale=judge_output.reason,
    )

evaluation = mlflow.genai.evaluate(
 data=clickbait_titles_dataset,
 scorers=[check_clickbait_response],
 predict_fn=run_extr
)

### Tool call correctness

The `check_tool_use` scorer verifies whether the agent invoked the expected tools during execution.

- The dataset consists of user messages (in this case, URLs) as inputs, with expectations specifying that the agent should call the `fetch_url_title_and_content` tool.
- The `predict_fn` wraps the full agent invocation, simulating a user message being processed end-to-end.
- The scorer inspects the execution trace and collects all tool invocations by searching for spans of type `"TOOL"`. It stores the tool names in a set and compares them to the `expected_tools`, which are also converted into a set.

Note: Tool names are stored in sets so that the order of invocation is ignored. A set would normally also hide repeated invocations of the same tool (for example, retries or multiple uses). However, MLflow renames repeated executions of the same tool by appending suffixes such as `_1`, `_2`, and so on, so information about how many times each tool was called is not lost when the names are stored in a set.

In [0]:
tool_use_dataset = url_articles.apply(
 lambda row: {
 "inputs": {"user_message": row["url"]},
 "expectations": {"expected_tools": ["fetch_url_title_and_content"]},
 },
 axis=1,
 result_type="expand",
)

def run_agent(user_message: str):
    return agent(user_message).toDict()

@scorer
def check_tool_use(trace, expectations):
    tool_calls = {span.name for span in trace.search_spans(span_type="TOOL")}

    if not tool_calls:
        return Feedback(value=False, rationale="No tool calls found")

    expected_tools = set(expectations["expected_tools"])

    if expected_tools != tool_calls:
        return Feedback(
            value=False,
            rationale=(
                "Tool calls did not match expectations.\n"
                f"Expected {expected_tools} but got {tool_calls}."
            )
        )
    return Feedback(value=True)

check_tool_use_eval_result = mlflow.genai.evaluate(
 data=tool_use_dataset, scorers=[check_tool_use], predict_fn=run_agent
)
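
If the number of calls per tool matters, the span names can be normalized and counted instead of being compared as a set. This sketch assumes that repeated spans carry a numeric suffix such as `_1` (as described in the note above) and that names may carry a `Tool.` prefix. The prefix, the sample span names, and both helper names are assumptions, so inspect a real trace before relying on them.

```python
import re
from collections import Counter


def normalize_tool_name(span_name: str, prefix: str = "Tool.") -> str:
    """Strip an assumed prefix and a trailing _<n> suffix from a span name.

    Caveat: a tool whose real name ends in _<digits> would be shortened too.
    """
    name = span_name.removeprefix(prefix)
    return re.sub(r"_\d+$", "", name)


def count_tool_calls(span_names) -> Counter:
    """Count how many times each tool was invoked."""
    return Counter(normalize_tool_name(n) for n in span_names)


# Hypothetical span names, not taken from a real trace.
names = ["Tool.urlTool", "Tool.urlTool_1", "Tool.urlTool_2", "Tool.finish"]
print(count_tool_calls(names))
```

A scorer built on these counts could then assert, for example, that the URL tool was called exactly once per message.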

### Predefined judges

MLflow GenAI provides several built-in judges that can be used to evaluate common behavioral expectations without writing custom logic. Two such predefined judges are `Safety` and `Guidelines`.

- The `Safety` scorer evaluates content—whether generated by the application or provided by a user—for harmful, unethical, or inappropriate material.
- The `Guidelines` scorer enables fast and flexible evaluation based on natural language rules, framed as binary pass/fail conditions. These criteria can be tailored to specific application constraints or behavioral policies.

Unlike the earlier examples, which define a `predict_fn` to execute and evaluate an agent function, this approach passes previously recorded traces through the `data` parameter. This is particularly useful when agent outputs have already been logged, for example during experimentation or batch processing, and additional scoring needs to be applied without re-executing the agent.

In [0]:
from mlflow.genai.scorers import Safety, Guidelines

traces = mlflow.search_traces(run_id=check_tool_use_eval_result.run_id)

mlflow.genai.evaluate(
 data=traces,
 scorers=[
 Safety(),
 Guidelines(name="question", guidelines="The response must not contain a question")
 ],
)