{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Before we begin, we will change a few settings to make the notebook look a bit prettier." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "# Understanding VERTIGO\n", "In this notebook we provide a step-by-step explanation of how VERTIGO works and how we implemented it. \n", "\n", "For ease of use, you can find an object-oriented version of VERTIGO in [`vertigo.py`](./vertigo.py). For a (simulated) distributed scenario where we actually use it, see the notebook [`demo_vertigo_local`](https://nbviewer.jupyter.org/github/IKNL/vertigo/blob/master/scripts/demo_vertigo_local.ipynb). For a real-life distributed implementation of VERTIGO using our [privacy preserving DL infrastructure](https://github.com/IKNL/ppDLI), see our repository, `d_vertigo`.\n", "\n", "All right. Let's get started." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preliminaries\n", "Display plots." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import packages." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import sys\n", "import os\n", "import pathlib\n", "import pandas as pd\n", "import numpy as np\n", "import numpy.matlib as npm\n", "import seaborn as sns\n", "import warnings\n", "import matplotlib as mpl\n", "from matplotlib import pyplot as plt\n", "import statsmodels as sm\n", "from sklearn.model_selection import train_test_split\n", "\n", "import auxiliaries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set (default) plotting parameters." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "mpl.rcParams['font.sans-serif'] = \"Calibri\"\n", "mpl.rcParams['font.family'] = \"sans-serif\"\n", "sns.set(font_scale=1.75)\n", "sns.set_style('ticks')\n", "plt.rc('axes.spines', top=False, right=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set up paths." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Set up working path\n", "os.chdir(os.getcwd())\n", "\n", "PATH_DATA = pathlib.Path(r'../data/')\n", "PATH_RESULTS = pathlib.Path(r'../results/')\n", "\n", "if not PATH_DATA.exists():\n", "    raise ValueError(\"Data directory not found.\")\n", "    \n", "if not PATH_RESULTS.exists():\n", "    PATH_RESULTS.mkdir()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set up execution parameters." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# Test/train partitioning.\n", "TEST_PROP = 0.2\n", "SEED = 31416 # For reproducibility.\n", "\n", "\n", "# VERTIGO execution parameters.\n", "LAMBDA_ = np.array([1000])\n", "TOLERANCE = 1e-8\n", "ITERATIONS = 500\n", "\n", "# We know alpha needs to be bounded between 0 and 1 (see Fig. 4, 3.ii).\n", "# Thus we will initialize it at 0.5.\n", "ALPHA_INIT_VAL = 0.5 " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The data\n", "We will be using the [Breast Cancer Wisconsin Diagnostic Data Set](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29). The data consists of real-valued features computed from a digitized image of a breast biopsy. 
The features describe the characteristics of the cell nuclei present in the image: \n", "* `radius` - mean of distances from center to points on the perimeter\n", "* `texture` - standard deviation of gray-scale values\n", "* `perimeter`\n", "* `area`\n", "* `smoothness` - local variation in radii lengths\n", "* `compactness` - perimeter^2 / area - 1.0\n", "* `concavity` - severity of concave portions of the contour\n", "* `concave points` - number of concave portions of the contour\n", "* `symmetry`\n", "* `fractal dimension` - \"coastline approximation\" - 1\n", "\n", "The mean, standard error, and \"worst\" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For the sake of simplicity, we will only consider the mean values of the features, resulting in 10 features. The output variable is `diagnosis`, which states whether the biopsy result was a benign (`B`) or a malignant (`M`) tumor." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
          diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  smoothness_mean  compactness_mean  concavity_mean  concave points_mean  symmetry_mean  fractal_dimension_mean
id
842302            M        17.99         10.38          122.80     1001.0          0.11840           0.27760          0.3001              0.14710         0.2419                 0.07871
842517            M        20.57         17.77          132.90     1326.0          0.08474           0.07864          0.0869              0.07017         0.1812                 0.05667
84300903          M        19.69         21.25          130.00     1203.0          0.10960           0.15990          0.1974              0.12790         0.2069                 0.05999
84348301          M        11.42         20.38           77.58      386.1          0.14250           0.28390          0.2414              0.10520         0.2597                 0.09744
84358402          M        20.29         14.34          135.10     1297.0          0.10030           0.13280          0.1980              0.10430         0.1809                 0.05883
\n", "
" ], "text/plain": [ " diagnosis radius_mean ... symmetry_mean fractal_dimension_mean\n", "id ... \n", "842302 M 17.99 ... 0.2419 0.07871\n", "842517 M 20.57 ... 0.1812 0.05667\n", "84300903 M 19.69 ... 0.2069 0.05999\n", "84348301 M 11.42 ... 0.2597 0.09744\n", "84358402 M 20.29 ... 0.1809 0.05883\n", "\n", "[5 rows x 11 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv(PATH_DATA/'data.csv', index_col=0, usecols=range(0, 12))\n", "df.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how in this case, all variables are continuous. If the data had categorical values, they would need to be encoded, preferably using [one-hot encoding](https://en.wikipedia.org/wiki/One-hot)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data pre-processing\n", "Extract the features only" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "column_names = list(df.columns)[1:]\n", "X_centralized = df[column_names]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to avoid having features that are dominant just because of their range, we will normalize them." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "X_centralized = (X_centralized - np.min(X_centralized))/(np.max(X_centralized) - np.min(X_centralized))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Re-structuring of the data\n", "For the sake of this example, we will simulate two different parties. Each of these parties will have half of the features." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
          radius_mean  texture_mean  perimeter_mean  area_mean  smoothness_mean
id
842302       0.521037      0.022658        0.545989   0.363733         0.593753
842517       0.643144      0.272574        0.615783   0.501591         0.289880
84300903     0.601496      0.390260        0.595743   0.449417         0.514309
84348301     0.210090      0.360839        0.233501   0.102906         0.811321
84358402     0.629893      0.156578        0.630986   0.489290         0.430351
\n", "
" ], "text/plain": [ " radius_mean texture_mean perimeter_mean area_mean smoothness_mean\n", "id \n", "842302 0.521037 0.022658 0.545989 0.363733 0.593753\n", "842517 0.643144 0.272574 0.615783 0.501591 0.289880\n", "84300903 0.601496 0.390260 0.595743 0.449417 0.514309\n", "84348301 0.210090 0.360839 0.233501 0.102906 0.811321\n", "84358402 0.629893 0.156578 0.630986 0.489290 0.430351" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X1 = X_centralized[column_names[0:round(len(column_names)/2)]]\n", "X1.head(5)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
          compactness_mean  concavity_mean  concave points_mean  symmetry_mean  fractal_dimension_mean
id
842302            0.792037        0.703140             0.731113       0.686364                0.605518
842517            0.181768        0.203608             0.348757       0.379798                0.141323
84300903          0.431017        0.462512             0.635686       0.509596                0.211247
84348301          0.811361        0.565604             0.522863       0.776263                1.000000
84358402          0.347893        0.463918             0.518390       0.378283                0.186816
\n", "
" ], "text/plain": [ " compactness_mean ... fractal_dimension_mean\n", "id ... \n", "842302 0.792037 ... 0.605518\n", "842517 0.181768 ... 0.141323\n", "84300903 0.431017 ... 0.211247\n", "84348301 0.811361 ... 1.000000\n", "84358402 0.347893 ... 0.186816\n", "\n", "[5 rows x 5 columns]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X2 = X_centralized[column_names[round(len(column_names)/2):]]\n", "X2.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Output\n", "VERTIGO uses an output of -1 or 1." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "y = df[['diagnosis']]\n", "y = y['diagnosis'].map(lambda x: 1 if x=='M' else -1)\n", "y = pd.DataFrame(y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Class balancing\n", "This is particularly important for any logistic regression, including VERTIGO. If a class is underrepresented, the model will apparently perform well, but it will likely just predict one class all the time. Thus we need to balance it. We will do so using downsampling." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "X1_balanced, y_balanced = auxiliaries.downsample(X1, y, random_state=0)\n", "X2_balanced, _ = auxiliaries.downsample(X2, y, random_state=0)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig, (ax1, ax2) = plt.subplots(1, 2, figsize=[10, 5])\n", "sns.countplot(x='diagnosis', data=y, palette='husl', ax=ax1)\n", "sns.countplot(x='diagnosis', data=y_balanced, palette='husl', ax=ax2)\n", "ax1.set_title(\"Original\", weight='bold')\n", "ax2.set_title(\"Balanced (downsampling)\", weight='bold')\n", "ax1.set_ylabel(\"No. of patients\", weight='bold')\n", "ax2.set_ylabel(\"\")\n", "fig.savefig(PATH_RESULTS/('class_proportion.pdf'), dpi=1000, bbox_inches='tight')\n", "plt.show();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data partition\n", "Partition data into training and testing sets." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "X1_train, X1_test, y_train, y_test = train_test_split(X1_balanced, y_balanced, test_size=TEST_PROP, random_state=SEED)\n", "X2_train, X2_test, _, _ = train_test_split(X2_balanced, y_balanced, test_size=TEST_PROP, random_state=SEED)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## VERTIGO\n", "### 0. Preparations\n", "Before actually computing VERTIGO, we need to do a few things. These steps are shown here for didactic purposes. **Notice that the object `vertigo.vertigo` already does them internally**. \n", "\n", "First, we need to convert DataFrames to numpy arrays (for mathematical operations). Furthermore, we will rename the variables to match the paper's notation." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "X1 = X1_train.to_numpy()\n", "X2 = X2_train.to_numpy()\n", "\n", "Y = y_train.to_numpy()\n", "Y_diag = np.diag(Y.flatten())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Add column of 1s to serve as constant term in the computation of the residual coefficient." 
] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "X1 = sm.tools.add_constant(X1, prepend=False)\n", "X2 = sm.tools.add_constant(X2, prepend=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compute some variables for ease of use." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "M = X1.shape[0] # Number of patients (same for both parties).\n", "N1 = X1.shape[1] # Number of party 1's columns (features plus the constant).\n", "N2 = X2.shape[1] # Number of party 2's columns (features plus the constant)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Initialize alpha." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "ALPHA_INIT = ALPHA_INIT_VAL * np.ones((M, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lastly, we will define a few useful functions. These are also available as methods of `vertigo.vertigo`." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# Check if matrix is (strictly) diagonally dominant.\n", "def is_diagonally_dominant(X):\n", "    D = np.diag(np.abs(X)) # Find diagonal coefficients\n", "    S = np.sum(np.abs(X), axis=1) - D # Find row sum without diagonal\n", "    if np.all(D > S):\n", "        return True\n", "    else:\n", "        return False\n", "\n", "# Check if matrix is positive definite.\n", "def is_positive_definite(x):\n", "    return np.all(np.linalg.eigvals(x) > 0)\n", "\n", "# Sigmoid function to bound a number between 0 and 1\n", "def sigmoid(x):\n", "    return 1 / (1 + np.exp(-x))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Local\n", "\n", "#### 1.1 Compute local Gram matrix (i.e., linear kernel matrix)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "K1 = np.matmul(X1, X1.T)\n", "K2 = np.matmul(X2, X2.T)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. 
Server\n", "#### 2.1 Compute global Gram matrix $K$" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "K = np.add(K1, K2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.2 Compute fixed Hessian matrix $\\tilde{H}$ (Eq. 10)\n", "The paper states that $c$ must be set to a value that makes $\\tilde{H}$ diagonally dominant. We initialize it as \"the maximum of the elements in the original $H$\" (p. 573). Then, we increase the factor multiplying $c$ in steps of 0.5 until the condition of diagonal dominance is satisfied.\n", "\n", "However, in our experiments we didn't find this condition to be necessary: we got the same results whether or not we adjusted $c$. We leave it to the user to enforce it (see line 14 of the code cell below). \n", "\n", "Depending on the data, computing $\\tilde{H}$ can take a long time." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "not_diagonally_dominant = True\n", "c_factor = 1\n", "\n", "while not_diagonally_dominant:\n", "\n", "    H = (1/LAMBDA_) * np.matmul(np.matmul(Y_diag, K), Y_diag) + np.diag(1/(ALPHA_INIT*(1-ALPHA_INIT)))\n", "    c = c_factor * H.max()\n", "    I = np.identity(len(Y_diag))\n", "\n", "    H_wave = (1/LAMBDA_) * np.matmul(np.matmul(Y_diag, K), Y_diag) + c*I\n", "    not_diagonally_dominant = not is_diagonally_dominant(H_wave)\n", "    c_factor += 0.5\n", "\n", "    # Comment out the break below to enforce that H_wave is diagonally dominant.\n", "    break" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.3 Compute $\\tilde{H}^{-1}$" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "H_wave_1 = np.linalg.inv(H_wave)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. 
Local + server\n", "Compute $\\alpha^*$" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "# We will store the evolution of the alpha coefficients here.\n", "alpha_all = np.zeros((Y.shape[0], ITERATIONS))\n", "\n", "for s in range(0, ITERATIONS):\n", "\n", "    if s==0:\n", "        alpha_old = ALPHA_INIT \n", "    else:\n", "        alpha_old = alpha\n", "\n", "    alpha_all[:, s] = alpha_old.T\n", "    \n", "\n", "    # 3.1 LOCAL\n", "    # Eq. 5\n", "    beta1 = (1/LAMBDA_) * np.sum((npm.repmat(alpha_old, 1, N1) * \n", "                                  npm.repmat(Y, 1, N1) *\n", "                                  X1), 0)\n", "    beta2 = (1/LAMBDA_) * np.sum((npm.repmat(alpha_old, 1, N2) * \n", "                                  npm.repmat(Y, 1, N2) *\n", "                                  X2), 0)\n", "\n", "    # In order for the matrix multiplications to work, Y needs to be\n", "    # transposed (which is not stated in the paper).\n", "    E1 = Y.T * (np.matmul(beta1, X1.T))\n", "    E2 = Y.T * (np.matmul(beta2, X2.T))\n", "\n", "\n", "    # 3.2 SERVER\n", "    E = E1 + E2\n", "\n", "    # Catch incorrect cases (logarithm of a negative number)\n", "    with warnings.catch_warnings():\n", "        warnings.filterwarnings('error')\n", "        try:\n", "            J = E.T + np.log10(alpha_old/(1-alpha_old))\n", "        except Warning as e:\n", "            print('WARNING:', e)\n", "\n", "    \n", "    # 3.3 SERVER\n", "    # Update alpha.\n", "    alpha = alpha_old - np.matmul(H_wave_1, J)\n", "\n", "    # Bound alpha (0 < alpha < 1)\n", "    # This isn't defined in the original paper, but is done to avoid\n", "    # problems when computing J (the logarithm of a negative number is\n", "    # undefined).\n", "    alpha = sigmoid(alpha)\n", "\n", "    # Check for convergence.\n", "    if max(abs(alpha - alpha_old)) < TOLERANCE:\n", "        # Trim alpha_all.\n", "        alpha_all = np.delete(alpha_all, np.s_[s:], axis=1)\n", "        break\n", "\n", "# Final alpha value.\n", "alpha_star = alpha" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4. Local\n", "Compute the $\\beta$ coefficients (and the intercept). This step isn't defined in Fig. 
4, but it is the logical next step." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Party 1 coefficients are [0.02656502 0.01223053 0.02735187 0.02322579 0.01138798]\n", "Party 2 coefficients are [0.02350349 0.03031696 0.03433975 0.00996463 0.00073673]\n", "Intercept is -0.002826607481672041\n" ] } ], "source": [ "coeffs1 = (1/LAMBDA_) * np.sum((npm.repmat(alpha_star, 1, N1) * \n", "                                npm.repmat(Y, 1, N1) *\n", "                                X1[:, :]), 0)\n", "coeffs2 = (1/LAMBDA_) * np.sum((npm.repmat(alpha_star, 1, N2) * \n", "                                npm.repmat(Y, 1, N2) *\n", "                                X2[:, :]), 0)\n", "\n", "beta1 = coeffs1[0:-1]\n", "beta2 = coeffs2[0:-1]\n", "intercept = coeffs1[-1] # Notice how the intercept is the same for both parties.\n", "\n", "print(\"Party 1 coefficients are \", end='', flush=True)\n", "print(beta1)\n", "print(\"Party 2 coefficients are \", end='', flush=True)\n", "print(beta2)\n", "print(\"Intercept is \", end='', flush=True)\n", "print(intercept)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And that's it! These are the coefficients of our model. They have the same order as the input features. They can be found in a `vertigo.vertigo` object as `beta_coef_` and `intercept_`." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (Spyder)", "language": "python3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }