{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## AI for Medicine Course 1 Week 1 lecture exercises"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Patient Overlap and Data Leakage\n",
    "\n",
    "Patient overlap in medical data is a part of a more general problem in machine learning called **data leakage**.  To identify patient overlap in this week's graded assignment, you'll check to see if a patient's ID appears in both the training set and the test set. You should also verify that you don't have patient overlap in the training and validation sets, which is what you'll do here.\n",
    "\n",
    "Below is a simple example showing how you can check for and remove patient overlap in your training and validations sets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import necessary packages\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "import os\n",
    "import seaborn as sns\n",
    "sns.set()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Read in the data from a csv file\n",
    "\n",
    "First, you'll read in your training and validation datasets from csv files. Run the next two cells to read these csvs into `pandas` dataframes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Read csv file containing training data\n",
    "train_df = pd.read_csv(\"nih/train-small.csv\")\n",
    "# Print first 5 rows\n",
    "print(f'There are {train_df.shape[0]} rows and {train_df.shape[1]} columns in the training dataframe')\n",
    "train_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Read csv file containing validation data\n",
    "valid_df = pd.read_csv(\"nih/valid-small.csv\")\n",
    "# Print first 5 rows\n",
    "print(f'There are {valid_df.shape[0]} rows and {valid_df.shape[1]} columns in the validation dataframe')\n",
    "valid_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Extract and compare the PatientId columns from the train and validation sets\n",
    "By running the next four cells you will do the following:\n",
    "1. Extract patient IDs from the train and validation sets\n",
    "2. Convert these arrays of numbers into `set()` datatypes for easy comparison\n",
    "3. Identify patient overlap in the intersection of the two sets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Extract patient id's for the training set\n",
    "ids_train = train_df.PatientId.values\n",
    "# Extract patient id's for the validation set\n",
    "ids_valid = valid_df.PatientId.values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a \"set\" datastructure of the training set id's to identify unique id's\n",
    "ids_train_set = set(ids_train)\n",
    "print(f'There are {len(ids_train_set)} unique Patient IDs in the training set')\n",
    "# Create a \"set\" datastructure of the validation set id's to identify unique id's\n",
    "ids_valid_set = set(ids_valid)\n",
    "print(f'There are {len(ids_valid_set)} unique Patient IDs in the validation set')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Identify patient overlap by looking at the intersection between the sets\n",
    "patient_overlap = list(ids_train_set.intersection(ids_valid_set))\n",
    "n_overlap = len(patient_overlap)\n",
    "print(f'There are {n_overlap} Patient IDs in both the training and validation sets')\n",
    "print('')\n",
    "print(f'These patients are in both the training and validation datasets:')\n",
    "print(f'{patient_overlap}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Identify rows (indices) of overlapping patients and remove from either the train or validation set\n",
    "Run the next two cells to do the following:\n",
    "1. Create lists of the overlapping row numbers in both the training and validation sets. \n",
    "2. Drop the overlapping patient records from the validation set (could also choose to drop from train set)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_overlap_idxs = []\n",
    "valid_overlap_idxs = []\n",
    "for idx in range(n_overlap):\n",
    "    train_overlap_idxs.extend(train_df.index[train_df['PatientId'] == patient_overlap[idx]].tolist())\n",
    "    valid_overlap_idxs.extend(valid_df.index[valid_df['PatientId'] == patient_overlap[idx]].tolist())\n",
    "    \n",
    "print(f'These are the indices of overlapping patients in the training set: ')\n",
    "print(f'{train_overlap_idxs}')\n",
    "print(f'These are the indices of overlapping patients in the validation set: ')\n",
    "print(f'{valid_overlap_idxs}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Drop the overlapping rows from the validation set\n",
    "valid_df.drop(valid_overlap_idxs, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Check that everything worked as planned by rerunning the patient ID comparison between train and validation sets.\n",
    "\n",
    "When you run the next two cells you should see that there are now fewer records in the validation set and that the overlap problem has been removed!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Extract patient id's for the validation set\n",
    "ids_valid = valid_df.PatientId.values\n",
    "# Create a \"set\" datastructure of the validation set id's to identify unique id's\n",
    "ids_valid_set = set(ids_valid)\n",
    "print(f'There are {len(ids_valid_set)} unique Patient IDs in the validation set')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Identify patient overlap by looking at the intersection between the sets\n",
    "patient_overlap = list(ids_train_set.intersection(ids_valid_set))\n",
    "n_overlap = len(patient_overlap)\n",
    "print(f'There are {n_overlap} Patient IDs in both the training and validation sets')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Congratulations! You removed overlapping patients from the validation set! \n",
    "\n",
    "You could have just as well removed them from the training set. \n",
    "\n",
    "Always be sure to check for patient overlap in your train, validation and test sets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}