{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Matching with Custom Backends\n", "\n", "When performing matching on a sample set, we may want to use non-standard distance measurements or faster implementations. The default behavior is to use the scikit-learn `NearestNeighbors` object but that can be overriden using the `knn_backend` keyword argument when initializing a `Matching`, `PropensityMatching` or `MatchingTransformer` object. \n", "\n", "In this notebook, we show how to use the `faiss`-based backend which we have provided in the `causallib.contrib` module. This leads to a speed-up of 5x or more on the full Lalonde dataset as shown here.\n", "\n", "We also show how to use a custom metric function, by implementing a log odds ratio of the propensity score on the level of the distance metric and matching thereon. \n", "\n", "## Performance of Matching with Faiss vs Sklearn\n", "To see the speedup, we load the augmented lalonde dataset as in the lalonde notebook. The next few cells are the same as we did there." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(22106, 10)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
trainingageeducationblackhispanicmarriedno_degreere74re75re78
168270.026.013.00.00.00.00.058.77850.1290331.03226
54120.027.012.00.00.01.00.016297.18013429.2100019562.14000
153990.026.012.00.00.00.00.05217.5273174.2420025564.67000
130770.038.016.00.00.01.00.023713.0109178.9840018814.41000
21890.055.08.00.00.01.01.00.0000.000000.00000
\n", "
" ], "text/plain": [ " training age education black hispanic married no_degree \\\n", "16827 0.0 26.0 13.0 0.0 0.0 0.0 0.0 \n", "5412 0.0 27.0 12.0 0.0 0.0 1.0 0.0 \n", "15399 0.0 26.0 12.0 0.0 0.0 0.0 0.0 \n", "13077 0.0 38.0 16.0 0.0 0.0 1.0 0.0 \n", "2189 0.0 55.0 8.0 0.0 0.0 1.0 1.0 \n", "\n", " re74 re75 re78 \n", "16827 58.778 50.12903 31.03226 \n", "5412 16297.180 13429.21000 19562.14000 \n", "15399 5217.527 3174.24200 25564.67000 \n", "13077 23713.010 9178.98400 18814.41000 \n", "2189 0.000 0.00000 0.00000 " ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "columns = [\"training\", # Treatment assignment indicator\n", " \"age\", # Age of participant\n", " \"education\", # Years of education\n", " \"black\", # Indicate whether individual is black\n", " \"hispanic\", # Indicate whether individual is hispanic\n", " \"married\", # Indicate whether individual is married\n", " \"no_degree\", # Indicate if individual has no high-school diploma\n", " \"re74\", # Real earnings in 1974, prior to study participation\n", " \"re75\", # Real earnings in 1975, prior to study participation\n", " \"re78\"] # Real earnings in 1978, after study end\n", "\n", "#treated = pd.read_csv(\"http://www.nber.org/~rdehejia/data/nswre74_treated.txt\", \n", "# delim_whitespace=True, header=None, names=columns)\n", "#control = pd.read_csv(\"http://www.nber.org/~rdehejia/data/nswre74_control.txt\",\n", "# delim_whitespace=True, header=None, names=columns)\n", "file_names = [\"http://www.nber.org/~rdehejia/data/nswre74_treated.txt\",\n", " \"http://www.nber.org/~rdehejia/data/nswre74_control.txt\",\n", " \"http://www.nber.org/~rdehejia/data/psid_controls.txt\",\n", " \"http://www.nber.org/~rdehejia/data/psid2_controls.txt\",\n", " \"http://www.nber.org/~rdehejia/data/psid3_controls.txt\",\n", " \"http://www.nber.org/~rdehejia/data/cps_controls.txt\",\n", " \"http://www.nber.org/~rdehejia/data/cps2_controls.txt\",\n", " \"http://www.nber.org/~rdehejia/data/cps3_controls.txt\"]\n", "files = [pd.read_csv(file_name, delim_whitespace=True, header=None, names=columns) for file_name in file_names]\n", "lalonde = pd.concat(files, ignore_index=True)\n", "lalonde = lalonde.sample(frac=1.0, random_state=42) # Shuffle\n", "\n", "print(lalonde.shape)\n", "lalonde.head()" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The dataset contains 22106 people, out of which 185 received training\n" ] } ], "source": [ "print(f'The dataset contains {lalonde.shape[0]} people, out of which {lalonde[\"training\"].sum():.0f} received training')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
trainingageeducationblackhispanicmarriedno_degreere74re75re78re74=0re75=0
168270.026.013.00.00.00.00.058.77850.1290331.0322600
54120.027.012.00.00.01.00.016297.18013429.2100019562.1400000
153990.026.012.00.00.00.00.05217.5273174.2420025564.6700000
130770.038.016.00.00.01.00.023713.0109178.9840018814.4100000
21890.055.08.00.00.01.01.00.0000.000000.0000011
\n", "
" ], "text/plain": [ " training age education black hispanic married no_degree \\\n", "16827 0.0 26.0 13.0 0.0 0.0 0.0 0.0 \n", "5412 0.0 27.0 12.0 0.0 0.0 1.0 0.0 \n", "15399 0.0 26.0 12.0 0.0 0.0 0.0 0.0 \n", "13077 0.0 38.0 16.0 0.0 0.0 1.0 0.0 \n", "2189 0.0 55.0 8.0 0.0 0.0 1.0 1.0 \n", "\n", " re74 re75 re78 re74=0 re75=0 \n", "16827 58.778 50.12903 31.03226 0 0 \n", "5412 16297.180 13429.21000 19562.14000 0 0 \n", "15399 5217.527 3174.24200 25564.67000 0 0 \n", "13077 23713.010 9178.98400 18814.41000 0 0 \n", "2189 0.000 0.00000 0.00000 1 1 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lalonde = lalonde.join((lalonde[[\"re74\", \"re75\"]] == 0).astype(int), rsuffix=(\"=0\"))\n", "lalonde.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((22106, 10), (22106,), (22106,))" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = lalonde.pop(\"training\")\n", "y = lalonde.pop(\"re78\")\n", "X = lalonde\n", "X.shape, a.shape, y.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the lalonde matching notebook, we saw before that the full Lalonde dataset is rather slow to execute matching on. This can be sped up substantially if we use the `faiss` backend which is found in the `contrib` module. It will use GPU acceleration if available, falling back on CPU if not. The timings below were generated using CPU only (Intel i7-9750H). Use of this backend requires the installation of the `faiss-gpu` or `faiss-cpu` package from pypi." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from causallib.estimation import Matching\n", "from causallib.contrib.faissknn import FaissNearestNeighbors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `\"sklearn\"` backend is not optimized for speed and takes approximately 2 minutes to fit and match.\n", "\n", " ⚠️**WARNING**⚠️: the `%%timeit` blocks may take a long time to execute because they run several trials. If you want to run this notebook for any purpose other than timing the difference in backends you may want to comment the first line of the following cells out." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1min 51s ± 13.2 s per loop (mean ± std. dev. of 3 runs, 2 loops each)\n" ] } ], "source": [ "%%timeit -n 2 -r 3 \n", "m = Matching(knn_backend=\"sklearn\")\n", "m.fit(X,a,y)\n", "y_potential_outcomes = m.estimate_population_outcome(X,a)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `knn_backend` argument can be a callable returning a `NearestNeighbors`-like object or an object directly. If it is an object, that object will be copied and fit for each treatment value in the data. Here we use the class name as a callable:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "20.3 s ± 1.51 s per loop (mean ± std. dev. of 3 runs, 2 loops each)\n" ] } ], "source": [ "%%timeit -n 2 -r 3 \n", "m = Matching(knn_backend=FaissNearestNeighbors)\n", "m.fit(X,a,y)\n", "y_potential_outcomes = m.estimate_population_outcome(X,a)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we use an instance of the `FaissNearestNeighbors` class. To see the supported options, see the documentation for `FaissNearestNeighbors`." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "18.3 s ± 97.2 ms per loop (mean ± std. dev. of 3 runs, 2 loops each)\n" ] } ], "source": [ "%%timeit -n 2 -r 3 \n", "m = Matching(knn_backend=FaissNearestNeighbors(index_type=\"ivfflat\"))\n", "m.fit(X,a,y)\n", "y_potential_outcomes = m.estimate_population_outcome(X,a)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Custom metric function: Log odds ratio for propensity comparison\n", "\n", "When comparing the difference between two samples using their propensity scores, a raw difference may be misleading. This is because the difference between 0.01 and 0.05 is much more meaningful than a difference between 0.51 and 0.55. In _Causal Inference for Statistics, Social, and Biomedical Sciences_ section 18.5, Imbens and Rubin recommend taking the \"log odds ratio\" $$l(x) = ln( x / (1 - x) )$$ and comparing differences in propensity scores on that scale. This is not the default behavior of `Matching` but it is easy to implement." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(0.01,0.05): original distance: 0.04 logodds distance: 1.65\n", "(0.51,0.55): original distance: 0.04 logodds distance: 0.16\n" ] } ], "source": [ "def logodds(x):\n", " return np.log( x / (1 - x))\n", "def logodds_distance(x,y):\n", " return np.abs(logodds(x) - logodds(y))\n", "def check_difference(x,y):\n", " print(\"({x:.2f},{y:.2f}): original distance: {d1:.2f} logodds distance: {d2:.2f}\"\n", " .format_map({\"x\":x,\"y\":y,\"d1\":np.abs(x-y),\"d2\":logodds_distance(x,y)}))\n", "check_difference(0.01,0.05)\n", "check_difference(0.51,0.55)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is useful for matching because matching a sample with propensity 0.51 to a sample with propensity 0.55 may represent a better match than matching a sample of 0.01 with 0.05.\n", "\n", "We implement this by passing the `logodds_distance` function to `NearestNeighbors` and using that as the `knn_backend` for `PropensityMatching`." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "from causallib.estimation import PropensityMatching\n", "from sklearn.neighbors import NearestNeighbors\n", "from sklearn.linear_model import LogisticRegression\n", "\n", "logodds_knn = NearestNeighbors(metric=logodds_distance)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For speed and ease of comparison, we will load the NHEFS data that is provided with `causallib`." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "from causallib.datasets import load_nhefs\n", "\n", "data_nhefs = load_nhefs(augment=False,onehot=False)\n", "X_nhefs, a_nhefs, y_nhefs = data_nhefs.X, data_nhefs.a, data_nhefs.y" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can estimate the population outcome doing propensity score matching with the log odds ratio distance, where the propensity model is logistic regression." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1.802614\n", "1 4.560399\n", "dtype: float64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pm_nhefs_log = PropensityMatching(\n", " learner=LogisticRegression(solver=\"liblinear\"),\n", " knn_backend=logodds_knn,\n", " ).fit(X_nhefs, a_nhefs, y_nhefs)\n", "pm_nhefs_log.estimate_population_outcome(X_nhefs, a_nhefs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For comparison, we fit the same data with a standard Euclidean metric." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1.802614\n", "1 4.548021\n", "dtype: float64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pm_nhefs_lin = PropensityMatching(\n", " learner=LogisticRegression(solver=\"liblinear\"),\n", " knn_backend=\"sklearn\"\n", " ).fit(X_nhefs,a_nhefs,y_nhefs)\n", "pm_nhefs_lin.estimate_population_outcome(X_nhefs,a_nhefs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could similarly use the log odds ratio for the full lalonde dataset with a caliper to address the imbalances discussed in the lalonde matching notebook:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0 6340.629027\n", "1.0 7201.100363\n", "dtype: float64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pm_lalonde_log = PropensityMatching(\n", " learner=LogisticRegression(solver=\"liblinear\"),\n", " knn_backend=logodds_knn,\n", " caliper=0.01,\n", " ).fit(X, a, y)\n", "pm_lalonde_log.estimate_population_outcome(X, a)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `FaissNearestNeighbors` object does not currently support metrics other than Mahalanobis and Euclidean. \n", "\n", "**NOTE** : In practice, alternative metrics that can be expressed as transforms on the covariates such as the log odds ratio should be in a `propensity_transform` object, not inside the metric function. This would substantially speed up run time because custom functions are slower than the built-in metrics for `NearestNeighbors`, and are not supported by the `FaissNearestNeighbors` backend." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }