{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Measuring Fairness of a Data Set\n\n\nThis example illustrates how to find unfair rows in a data set using the\n:func:`fatf.fairness.data.measures.systemic_bias` function and how to check\nwhether each class is distributed equally between values of a selected feature,\ni.e. measuring the Sample Size Disparity (with :func:`fatf.utils.data.tools.group_by_column` function).\n\n

Note

Please note that this example uses a data set that is represented as a\n structured numpy array, which supports mixed data types among columns with\n the features (columns) being index by the feature name rather than by\n consecutive integers.

\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Author: Kacper Sokol \n# License: new BSD\n\nfrom pprint import pprint\nimport numpy as np\n\nimport fatf.utils.data.datasets as fatf_datasets\n\nimport fatf.fairness.data.measures as fatf_dfm\n\nimport fatf.utils.data.tools as fatf_data_tools\n\nprint(__doc__)\n\n# Load data\nhr_data_dict = fatf_datasets.load_health_records()\nhr_X = hr_data_dict['data']\nhr_y = hr_data_dict['target']\nhr_feature_names = hr_data_dict['feature_names']\nhr_class_names = hr_data_dict['target_names']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Systemic Bias\n-------------\n\nBefore we proceed, we need to select which feature are **protected**, i.e.\nwhich ones are illegal to use when generating the prediction.\n\nWe use them to see whether the data set contains rows that differ in some of\nthe protected features and the labels (ground truth) but not in the rest of\nthe features.\n\nThe example presented below is rather naive as we do not have access to a\nmore complicated dataset within the FAT Forensics package. To demonstrate the\nfunctionality of the we indicate all but one feature to be protected, hence\nwe are guaranteed to find quite a few unfair rows in the health records data\nset. This means that \"unfair\" data rows are the ones that have the same value\nof the *diagnosis* feature (with rest of the feature values being\nunimportant) and differ in their target (ground truth) value.\n\nSystematic bias is expressed here as a square matrix (numpy array) of length\nequal to the number of rows in the data array. Each element of this matrix\nis a boolean indicating whether the rows in the data array with a particular\npair of indices (the row and column indices of the boolean matrix) violate\nthe aforementioned fairness criterion.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Select which features should be treated as protected\nprotected_features = [\n 'name', 'email', 'age', 'weight', 'gender', 'zipcode', 'dob'\n]\n\n# Compute the data fairness matrix\ndata_fairness_matrix = fatf_dfm.systemic_bias(hr_X, hr_y, protected_features)\n\n# Check if the data set is unfair (at least one unfair pair of data points)\nis_data_unfair = fatf_dfm.systemic_bias_check(data_fairness_matrix)\n\n# Identify which pairs of indices cause the unfairness\nunfair_pairs_tuple = np.where(data_fairness_matrix)\nunfair_pairs = []\nfor i, j in zip(*unfair_pairs_tuple):\n pair_a, pair_b = (i, j), (j, i)\n if pair_a not in unfair_pairs and pair_b not in unfair_pairs:\n unfair_pairs.append(pair_a)\n\n# Print out whether the fairness condition is violated\nif is_data_unfair:\n unfair_n = len(unfair_pairs)\n unfair_fill = ('is', '') if unfair_n == 1 else ('are', 's')\n print('\\nThere {} {} pair{} of data points that violates the fairness '\n 'criterion.\\n'.format(unfair_fill[0], unfair_n, unfair_fill[1]))\nelse:\n print('The data set is fair.\\n')\n\n# Show the first pair of violating rows\npprint(hr_X[[unfair_pairs[0][0], unfair_pairs[0][1]]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sample Size Disparity\n---------------------\n\nThe measure of Sample Size Disparity can be achieved by calling the\n:func:`fatf.utils.data.tools.group_by_column` grouping function and counting\nthe number of instances in each group. By doing that for the *target vector*\n(ground truth) we can see whether the classes in our data set are balanced\nfor each sub-group defined by a specified set of values for that feature.\n\nIn the example below we will check whether there are roughly the same number\nof data points collected for *males* and *females*. Then we will see whether\nthe class distribution (*fail* and *success*) for these two sub-populations\nis similar.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Group the data based on the unique values of the 'gender' column\ngrouping_column = 'gender'\ngrouping_indices, grouping_names = fatf_data_tools.group_by_column(\n hr_X, grouping_column, treat_as_categorical=True)\n\n# Print out the data distribution for the grouping\nprint('The grouping based on the *{}* feature has the '\n 'following distribution:'.format(grouping_column))\nfor grouping_name, grouping_idx in zip(grouping_names, grouping_indices):\n print(' * \"{}\" grouping has {} instances.'.format(\n grouping_name, len(grouping_idx)))\n\n# Get the class distribution for each sub-grouping\ngrouping_class_distribution = dict()\nfor grouping_name, grouping_idx in zip(grouping_names, grouping_indices):\n sg_y = hr_y[grouping_idx]\n sg_classes, sg_counts = np.unique(sg_y, return_counts=True)\n\n grouping_class_distribution[grouping_name] = dict()\n for sg_class, sg_count in zip(sg_classes, sg_counts):\n sg_class_name = hr_class_names[sg_class]\n\n grouping_class_distribution[grouping_name][sg_class_name] = sg_count\n\n# Print out the class distribution per sub-group\nprint('\\nThe class distribution per sub-population:')\nfor grouping_name, class_distribution in grouping_class_distribution.items():\n print(' * For the \"{}\" grouping the classes are distributed as '\n 'follows:'.format(grouping_name))\n for class_name, class_count in class_distribution.items():\n print(' - The class *{}* has {} data points.'.format(\n class_name, class_count))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.16" } }, "nbformat": 4, "nbformat_minor": 0 }