{ "cells": [ { "cell_type": "code", "execution_count": null, "id": "c29851a4", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import dask.dataframe as dd" ] }, { "cell_type": "code", "execution_count": null, "id": "37cf1f6d", "metadata": {}, "outputs": [], "source": [ "from fairlearn.metrics import (\n", " MetricFrame,\n", " true_positive_rate,\n", " false_negative_rate,\n", " false_positive_rate,\n", " count\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "9422c741", "metadata": {}, "outputs": [], "source": [ "import matplotlib\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "sns.set()" ] }, { "cell_type": "code", "execution_count": null, "id": "3e302f11", "metadata": {}, "outputs": [], "source": [ "from collections import defaultdict\n", "import logging\n", "\n", "logger = logging.getLogger()\n", "logger.setLevel(logging.CRITICAL)" ] }, { "cell_type": "markdown", "id": "682924f6", "metadata": {}, "source": [ "# Sample Notebook - Face Validation " ] }, { "cell_type": "markdown", "id": "dcf6b1ec", "metadata": {}, "source": [ "This Jupyter notebook walks you through an example of assessing a face validation system for any potential fairness-related disparities. You can either use the provided sample CSV file `face_verify_sample_rand_data.csv` or use your own dataset." ] }, { "cell_type": "code", "execution_count": null, "id": "b31ad634", "metadata": {}, "outputs": [], "source": [ "import zipfile\n", "from raiutils.dataset import fetch_dataset\n", "outdirname = 'responsibleai.12.28.21'\n", "zipfilename = outdirname + '.zip'\n", "\n", "fetch_dataset('https://publictestdatasets.blob.core.windows.net/data/' + zipfilename, zipfilename)\n", "\n", "with zipfile.ZipFile(zipfilename, 'r') as unzip:\n", " unzip.extractall('.')\n", "results_csv = \"face_verify_sample_rand_data.csv\"" ] }, { "cell_type": "code", "execution_count": null, "id": "ca3ce668", "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(results_csv, index_col=0)" ] }, { "cell_type": "code", "execution_count": null, "id": "0a343ec5", "metadata": {}, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "markdown", "id": "5d7e1250", "metadata": {}, "source": [ "Our fairness assessment can be broken down into three tasks:\n", "\n", "1. Idenfity harms and which groups may be harmed.\n", "\n", "2. Define fairness metrics to quantify harms\n", "\n", "3. Compare our quantified harms across the relevant groups." ] }, { "cell_type": "markdown", "id": "a5585c06", "metadata": {}, "source": [ "## 1.) Identify which groups may be harmed and how" ] }, { "cell_type": "markdown", "id": "79492a3e", "metadata": {}, "source": [ "The first step of our fairness assessment is understanding which groups are more likely to be *adversely affected* by our face verification system." ] }, { "cell_type": "markdown", "id": "7edd10f4", "metadata": {}, "source": [ "The work of Joy Buolamwini and Timnit Gebru on *Gender Shades* ([Buolamwini and Gebru, 2018](http://proceedings.mlr.press/v81/buolamwini18a/buolamwini18a.pdf)) showed a performance disparity in the accuracy of commerically available facial recognition systems between darker-skinned women and lighter-skinned men. One key takeaway from this work is the importance of intersectionality when conducting a fairness assessment. For this fairness assessment, we will explore performance disparities disaggregated by `race` and `gender`." 
] }, { "cell_type": "markdown", "id": "bd393f5f", "metadata": {}, "source": [ "Using the terminology recommended by the [Fairlearn User Guide](https://fairlearn.org/v0.7.0/user_guide/fairness_in_machine_learning.html#fairness-of-ai-systems), we are interested in mitigating **quality-of-service harms**. **Quality-of-Service** harms are focused on whether a systems achieves the same level of performance for one person as it does for others, even when no opportunities or resources are withheld. The *Face validation* system produces this harm if it fails to validate faces for members of one demographic group higher compared to other demographic groups." ] }, { "cell_type": "code", "execution_count": null, "id": "9f540e82", "metadata": {}, "outputs": [], "source": [ "sensitive_features = [\"race\", \"gender\"]" ] }, { "cell_type": "code", "execution_count": null, "id": "2b7e22a2", "metadata": {}, "outputs": [], "source": [ "df.groupby(sensitive_features)[\"golden_label\"].mean()" ] }, { "cell_type": "markdown", "id": "9bf3b022", "metadata": {}, "source": [ "The `matching_score` represents the probability the two images represent the same face, according to the vision model. We say two faces *match* if the `matching_score` is greater than a specific threshold, `0.5` by default. Based on your needs, you can increase or decrease this threshold to any value between `0.0` and `1.0`." ] }, { "cell_type": "code", "execution_count": null, "id": "96c772c0", "metadata": {}, "outputs": [], "source": [ "threshold = 0.5\n", "df.loc[:, \"matching_score_binary\"] = df[\"matching_score\"] >= threshold" ] }, { "cell_type": "markdown", "id": "04f3c6be", "metadata": {}, "source": [ "## 2.) Define fairness to quantify harms" ] }, { "cell_type": "markdown", "id": "caa0ff75", "metadata": {}, "source": [ "The second step of our fairness assessment is to translate our fairness-related harms into quantifiable metrics.With face validation, there are two harms we should consider:\n", "\n", "1. *False Positives* where two different faces are considered by the system to be matching. A *false positive* can be extremely dangerous in many cases, such as security authentication. We would not want people to unlock someone else's phone due to a Face ID false positive.\n", "\n", "2. *False Negatives* occur when two pictures of the same person are not considered to be a match by the system. A *false negative* may result in an individual being locked out their account due to a lack of facial verifications. However in many cases, *false negatives* are not nearly as harmful as *false positives*." ] }, { "cell_type": "markdown", "id": "b50183be", "metadata": {}, "source": [ "To assess fairness-related disparities using the `MetricFrame`, we must first specify our *sensitive features* `A` along with our `fairness_metrics`. In this scenario, we will look at three different *fairness metrics*:\n", "- `count`: The number of data points in each demographic category.\n", "- `FNR`: The false negative rate for the group.\n", "- `FPR`: The false positive rate for the group." ] }, { "cell_type": "markdown", "id": "794f4306", "metadata": {}, "source": [ "With our system, we want to keep *false_positives* as low as possible while also not yielding too much disparity in the *false_negative_rate* for each group. For our example, we will look at the system's performance disaggregated by `race` and `gender`." 
] }, { "cell_type": "code", "execution_count": null, "id": "2f293c83", "metadata": {}, "outputs": [], "source": [ "A, Y = df.loc[:, sensitive_features], df.loc[:, \"golden_label\"]\n", "Y_pred = df.loc[:, \"matching_score_binary\"]" ] }, { "cell_type": "code", "execution_count": null, "id": "5c6ac1e0", "metadata": {}, "outputs": [], "source": [ "fairness_metrics = {\n", " \"count\": count,\n", " \"FNR\": false_negative_rate,\n", " \"FPR\": false_positive_rate\n", "}" ] }, { "cell_type": "markdown", "id": "ad0c47db", "metadata": {}, "source": [ "## 3.) Compared quantified harms across different groups" ] }, { "cell_type": "markdown", "id": "cecc32ff", "metadata": {}, "source": [ "In the final step of our fairness assessment, we instantiate our `MetricFrame` by defining the following parameters:\n", "\n", "- *metrics*: The metrics of interest for our fairness assessment.\n", "- *y_true*: The ground truth labels for the ML task\n", "- *y_pred*: The model's predicted labels for the ML tasks\n", "- *sensitive_features*: The set of feature(s) for our fairness assessment" ] }, { "cell_type": "code", "execution_count": null, "id": "8e5023e4", "metadata": {}, "outputs": [], "source": [ "metricframe = MetricFrame(\n", " metrics=fairness_metrics,\n", " y_true=Y,\n", " y_pred=Y_pred,\n", " sensitive_features=A\n", ")" ] }, { "cell_type": "markdown", "id": "76f25bd0", "metadata": {}, "source": [ "With our `MetricFrame`, we can call the `by_group` function to view our `fairness_metrics` dissaggregated by our different demographic groups." ] }, { "cell_type": "code", "execution_count": null, "id": "fe25786d", "metadata": {}, "outputs": [], "source": [ "metricframe.by_group" ] }, { "cell_type": "markdown", "id": "8a191d6b", "metadata": {}, "source": [ "With the `difference` method, we can view the maximal disparity in each metric. We see there is a maximal `false negative rate difference` between `Black female` and `White male` of `0.0177`." ] }, { "cell_type": "code", "execution_count": null, "id": "caf5b439", "metadata": {}, "outputs": [], "source": [ "metricframe.difference()" ] }, { "cell_type": "markdown", "id": "4b4a405a", "metadata": {}, "source": [ "### Applying Different Thresholds" ] }, { "cell_type": "markdown", "id": "23361d07", "metadata": {}, "source": [ "In the previous section, we used a *threshold* of `0.5` to determine the minimum `matching_score` needed for a successful match. In practice, we could choose any *threshold* between 0.0 and 1.0 to get a *false negative rate* and *false positive rate* that is acceptable for the specific task.\n", "\n", "Now, we're going to explore how changing the threshold affects the resultant *false positive rate* and *false negative rate*." ] }, { "cell_type": "code", "execution_count": null, "id": "98bda06a", "metadata": {}, "outputs": [], "source": [ "def update_dictionary_helper(dictionary, results):\n", " for (k, v) in results.items():\n", " dictionary[k].append(v)\n", " return dictionary" ] }, { "cell_type": "markdown", "id": "678962c6", "metadata": {}, "source": [ "The following function iterates through a set of potential thresholds and computes the resultant model predictions at each threshold. The function then creates a `MetricFrame` to compute the disaggregated metrics at this threshold level." 
] }, { "cell_type": "code", "execution_count": null, "id": "0ccbd06b", "metadata": {}, "outputs": [], "source": [ "def compute_group_thresholds_dask(dataframe, metric, A,bins=10):\n", " thresholds = np.linspace(0, 1, bins+1)[1:]\n", " full_dict = defaultdict(list)\n", " for threshold in thresholds:\n", " Y_pred_threshold = dataframe.loc[:, \"matching_score\"] >= threshold\n", " metricframe_threshold = MetricFrame(\n", " metrics={f\"{metric.__name__}\": metric},\n", " y_true= dataframe.loc[:, \"golden_label\"],\n", " y_pred = Y_pred_threshold,\n", " sensitive_features=A\n", " )\n", " results = metricframe_threshold.by_group[metric.__name__].to_dict()\n", " full_dict = update_dictionary_helper(full_dict, results)\n", " return full_dict" ] }, { "cell_type": "markdown", "id": "186be386", "metadata": {}, "source": [ "Using the `plot_thresholds` function, we can visualize the `false_positive_rate` and `false_negative_rate` for the data at each *threshold* level." ] }, { "cell_type": "code", "execution_count": null, "id": "c5881e7a", "metadata": {}, "outputs": [], "source": [ "def plot_thresholds(thresholds, thresholds_dict,metric):\n", " plt.figure(figsize=[12,8])\n", " for (k, vals) in thresholds_dict.items():\n", " plt.plot(thresholds, vals, label=f\"{k}\")\n", " plt.scatter(thresholds, vals, s=20)\n", " plt.xlabel(\"Threshold\")\n", " plt.xticks(thresholds)\n", " plt.ylabel(f\"{metric.__name__}\")\n", " plt.legend(bbox_to_anchor=(1,1), loc=\"upper left\")\n", " plt.grid(b=True, which=\"both\", axis=\"both\", color='gray', linestyle='dashdot', linewidth=1)" ] }, { "cell_type": "code", "execution_count": null, "id": "834f0446", "metadata": {}, "outputs": [], "source": [ "thresholds = np.linspace(0, 1, 11)[1:]\n", "fn_thresholds_dict = compute_group_thresholds_dask(df, false_negative_rate, A)\n", "fp_thresholds_dict = compute_group_thresholds_dask(df, false_positive_rate, A)" ] }, { "cell_type": "markdown", "id": "0a2895ed", "metadata": {}, "source": [ "From the visualization, we see the *false_negative_rate* for all groups increases as the threshold increases. Furthermore, the maximal `false_negative_rate_difference` occurs between *White Female* and *Black Male* when the `threshold` is set to `0.7`." ] }, { "cell_type": "code", "execution_count": null, "id": "f8977b81", "metadata": {}, "outputs": [], "source": [ "plot_thresholds(thresholds, fn_thresholds_dict, false_negative_rate)" ] }, { "cell_type": "code", "execution_count": null, "id": "879970fc", "metadata": {}, "outputs": [], "source": [ "plot_thresholds(thresholds, fp_thresholds_dict, false_positive_rate)" ] }, { "cell_type": "markdown", "id": "32f1da24", "metadata": {}, "source": [ "If it were essential to keep the *false_positive_rate* at 0 for all groups, then according to the plots above, we simply need to choose a *threshold* greater than or equal to 0.5. However increasing the *threshold* above *0.5* in our data also increases the **absolute false negative rate** across all groups as well as the *relative false negative rate difference* between groups." ] }, { "cell_type": "markdown", "id": "ab15fa20", "metadata": {}, "source": [ "### Comparison to Synthetic Disparity" ] }, { "cell_type": "markdown", "id": "b07e294b", "metadata": {}, "source": [ "In our dataset, there isn't a substantial disparity in the `false_negative_rate` between the different demographic groups. In this section, we will introduce a synthetic `race_synth` feature to illustate what the results would look like if a disparity were present. 
, { "cell_type": "markdown", "id": "ab15fa20", "metadata": {}, "source": [ "### Comparison to Synthetic Disparity" ] },
{ "cell_type": "markdown", "id": "b07e294b", "metadata": {}, "source": [ "In our dataset, there isn't a substantial disparity in the `false_negative_rate` between the different demographic groups. In this section, we introduce a synthetic `race_synth` feature to illustrate what the results would look like if a disparity were present. We generate `race_synth` so that the feature is uncorrelated with `gender` and depends entirely on the `golden_label`." ] },
{ "cell_type": "markdown", "id": "a7401cf5", "metadata": {}, "source": [ "If the synthetic `golden_label` is `0`, the `matching_score` is drawn from `Uniform(0, 0.5)`; if it is `1`, the `matching_score` is drawn from `Uniform(0, 1)`. The function `create_disparity` below generates additional rows using this process, which we then concatenate onto the DataFrame." ] },
{ "cell_type": "code", "execution_count": null, "id": "9ccef68d", "metadata": {}, "outputs": [], "source": [ "def create_disparity(dataframe, num_rows=2000):\n", "    n = dataframe.shape[0]\n", "    synth_ground_truth = np.random.randint(low=0, high=2, size=num_rows)\n", "    synth_gender = np.random.choice([\"Male\", \"Female\"], size=num_rows)\n", "    # Label 0 -> Uniform(0, 0.5); label 1 -> Uniform(0, 1)\n", "    synth_match_score = np.random.random(size=num_rows) / (2.0 - synth_ground_truth)\n", "\n", "    new_indices = range(n, n + num_rows)\n", "    src_imgs = [f\"Source_Img_{i}\" for i in new_indices]\n", "    dst_imgs = [f\"Target_Img_{i}\" for i in new_indices]\n", "    synth_rows = pd.DataFrame.from_dict({\n", "        \"source_image\": src_imgs,\n", "        \"target_image\": dst_imgs,\n", "        \"race\": [\"race_synth\"] * num_rows,\n", "        \"gender\": synth_gender,\n", "        \"golden_label\": synth_ground_truth,\n", "        \"matching_score\": synth_match_score\n", "    })\n", "    return synth_rows" ] },
{ "cell_type": "code", "execution_count": null, "id": "eb1283e2", "metadata": {}, "outputs": [], "source": [ "disp = create_disparity(df)" ] },
{ "cell_type": "code", "execution_count": null, "id": "44496362", "metadata": {}, "outputs": [], "source": [ "# ignore_index avoids duplicate index labels between df and disp\n", "synth_df = pd.concat([df, disp], axis=0, ignore_index=True)" ] },
{ "cell_type": "code", "execution_count": null, "id": "603b8c0a", "metadata": {}, "outputs": [], "source": [ "synth_df.loc[:, \"matching_score_binary\"] = synth_df[\"matching_score\"] >= threshold" ] },
{ "cell_type": "markdown", "id": "8ebdcbb2", "metadata": {}, "source": [ "Now we create another `MetricFrame` with the same parameters as the one above." ] },
{ "cell_type": "code", "execution_count": null, "id": "31a455d9", "metadata": {}, "outputs": [], "source": [ "synth_metricframe = MetricFrame(\n", "    metrics=fairness_metrics,\n", "    y_true=synth_df.loc[:, \"golden_label\"],\n", "    y_pred=synth_df.loc[:, \"matching_score_binary\"],\n", "    sensitive_features=synth_df.loc[:, sensitive_features]\n", ")" ] },
{ "cell_type": "markdown", "id": "39165178", "metadata": {}, "source": [ "Now when we inspect `by_group` on this new `MetricFrame`, we can easily see the vast disparity between the `race_synth` groups and the other racial groups." ] },
{ "cell_type": "code", "execution_count": null, "id": "03581522", "metadata": {}, "outputs": [], "source": [ "synth_metricframe.by_group" ] },
{ "cell_type": "code", "execution_count": null, "id": "d59dea89", "metadata": {}, "outputs": [], "source": [ "synth_metricframe.difference()" ] },
{ "cell_type": "code", "execution_count": null, "id": "d5432753", "metadata": {}, "outputs": [], "source": [ "synth_metricframe.by_group.plot(kind=\"bar\", y=\"FNR\", figsize=[12, 8], title=\"FNR by Race and Gender\")" ] }
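, { "cell_type": "markdown", "id": "e5f6a7b8", "metadata": {}, "source": [ "The size of the synthetic disparity follows directly from the construction: for `race_synth`, positive pairs draw their `matching_score` from `Uniform(0, 1)`, so at a threshold of `0.5` roughly half of them fall below the threshold and the expected *false negative rate* is about `0.5`. We can check this against the empirical value:" ] },
{ "cell_type": "code", "execution_count": null, "id": "e5f6a7b9", "metadata": {}, "outputs": [], "source": [ "# Expected FNR for race_synth is ~0.5 at threshold 0.5, since positive\n", "# pairs draw matching_score from Uniform(0, 1).\n", "synth_positives = synth_df[\n", "    (synth_df[\"race\"] == \"race_synth\") & (synth_df[\"golden_label\"] == 1)\n", "]\n", "empirical_fnr = 1.0 - synth_positives[\"matching_score_binary\"].mean()\n", "print(f\"Empirical race_synth FNR: {empirical_fnr:.3f} (expected ~0.5)\")" ] }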
, { "cell_type": "markdown", "id": "e0462d1b", "metadata": {}, "source": [ "## Fairness Assessment Dashboard" ] },
{ "cell_type": "markdown", "id": "3e836ddb", "metadata": {}, "source": [ "With the `raiwidgets` library, we can use the `FairnessDashboard` to visualize the disparities between our different `race` and `gender` demographics. We pass our *sensitive features*, *golden labels*, and *thresholded matching scores* to the dashboard." ] },
{ "cell_type": "code", "execution_count": null, "id": "92ef06a0", "metadata": {}, "outputs": [], "source": [ "from raiwidgets import FairnessDashboard" ] },
{ "cell_type": "markdown", "id": "d916eb9a", "metadata": {}, "source": [ "We instantiate the `FairnessDashboard` by passing in three parameters:\n", "- `sensitive_features`: The set of sensitive features.\n", "- `y_true`: The ground truth labels.\n", "- `y_pred`: The model's predicted labels.\n", "\n", "The `FairnessDashboard` can be accessed either within the Jupyter notebook or by going to the *localhost* URL." ] },
{ "cell_type": "code", "execution_count": null, "id": "52813001", "metadata": {}, "outputs": [], "source": [ "FairnessDashboard(\n", "    sensitive_features=synth_df.loc[:, sensitive_features],\n", "    y_true=synth_df.loc[:, \"golden_label\"],\n", "    y_pred=synth_df.loc[:, \"matching_score_binary\"]\n", ")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.11" } }, "nbformat": 4, "nbformat_minor": 5 }