{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Chest X-Ray Medical Diagnosis with Deep Learning"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "FZYK-0rin5x7"
},
"source": [
"\n",
"__Welcome to the first assignment of course 1!__ \n",
"\n",
"In this assignment, you will explore medical image diagnosis by building a state-of-the-art chest X-ray classifier using Keras. \n",
"\n",
"The assignment will walk through some of the steps of building and evaluating this deep learning classifier model. In particular, you will:\n",
"- Pre-process and prepare a real-world X-ray dataset\n",
"- Use transfer learning to retrain a DenseNet model for X-ray image classification\n",
"- Learn a technique to handle class imbalance\n",
"- Measure diagnostic performance by computing the AUC (Area Under the Curve) for the ROC (Receiver Operating Characteristic) curve\n",
"- Visualize model activity using GradCAMs\n",
"\n",
"In completing this assignment you will learn about the following topics: \n",
"\n",
"- Data preparation\n",
" - Visualizing data\n",
" - Preventing data leakage\n",
"- Model Development\n",
" - Addressing class imbalance\n",
" - Leveraging pre-trained models using transfer learning\n",
"- Evaluation\n",
" - AUC and ROC curves"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Outline\n",
"Use these links to jump to specific sections of this assignment!\n",
"\n",
"- [1. Import Packages and Functions](#1)\n",
"- [2. Load the Datasets](#2)\n",
" - [2.1 Preventing Data Leakage](#2-1)\n",
" - [Exercise 1 - Checking Data Leakage](#Ex-1)\n",
" - [2.2 Preparing Images](#2-2)\n",
"- [3. Model Development](#3)\n",
" - [3.1 Addressing Class Imbalance](#3-1)\n",
" - [Exercise 2 - Computing Class Frequencies](#Ex-2)\n",
" - [Exercise 3 - Weighted Loss](#Ex-3)\n",
" - [3.3 DenseNet121](#3-3)\n",
"- [4. Training [optional]](#4)\n",
" - [4.1 Training on the Larger Dataset](#4-1)\n",
"- [5. Prediction and Evaluation](#5)\n",
" - [5.1 ROC Curve and AUROC](#5-1)\n",
" - [5.2 Visualizing Learning with GradCAM](#5-2)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "XI8PBrk_2Z4V"
},
"source": [
"\n",
"## 1. Import Packages and Functions\n",
"\n",
"We'll make use of the following packages:\n",
"- `numpy` and `pandas` are what we'll use to manipulate our data\n",
"- `matplotlib.pyplot` and `seaborn` will be used to produce plots for visualization\n",
"- `util` contains the locally defined utility functions provided for this assignment\n",
"\n",
"We will also use several modules from the `keras` framework for building deep learning models.\n",
"\n",
"Run the next cell to import all the necessary packages."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "Je3yV0Wnn5x8",
"scrolled": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Using TensorFlow backend.\n"
]
}
],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"\n",
"from keras.preprocessing.image import ImageDataGenerator\n",
"from keras.applications.densenet import DenseNet121\n",
"from keras.layers import Dense, GlobalAveragePooling2D\n",
"from keras.models import Model\n",
"from keras import backend as K\n",
"\n",
"from keras.models import load_model\n",
"\n",
"import util"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "6PMDCWQRn5yA"
},
"source": [
"\n",
"## 2. Load the Datasets\n",
"\n",
"For this assignment, we will be using the [ChestX-ray8 dataset](https://arxiv.org/abs/1705.02315) which contains 108,948 frontal-view X-ray images of 32,717 unique patients. \n",
"- Each image in the data set contains multiple text-mined labels identifying 14 different pathological conditions. \n",
"- These in turn can be used by physicians to diagnose 8 different diseases. \n",
"- We will use this data to develop a single model that will provide binary classification predictions for each of the 14 labeled pathologies. \n",
"- In other words, it will predict 'positive' or 'negative' for each of the pathologies.\n",
" \n",
"You can download the entire dataset for free [here](https://nihcc.app.box.com/v/ChestXray-NIHCC). \n",
"- We have provided a ~1000 image subset of the images for you.\n",
"- These can be accessed in the folder path stored in the `IMAGE_DIR` variable.\n",
"\n",
"The dataset includes a CSV file that provides the labels for each X-ray. \n",
"\n",
"To make your job a bit easier, we have processed the labels for our small sample and generated three new files to get you started. These three files are:\n",
"\n",
"1. `nih/train-small.csv`: 875 images from our dataset to be used for training.\n",
"1. `nih/valid-small.csv`: 109 images from our dataset to be used for validation.\n",
"1. `nih/test.csv`: 420 images from our dataset to be used for testing. \n",
"\n",
"This dataset has been annotated by consensus among four different radiologists for 5 of our 14 pathologies:\n",
"- `Consolidation`\n",
"- `Edema`\n",
"- `Effusion`\n",
"- `Cardiomegaly`\n",
"- `Atelectasis`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Sidebar on meaning of 'class'\n",
"It is worth noting that the word **'class'** is used in multiple ways in these discussions. \n",
"- We sometimes refer to each of the 14 pathological conditions that are labeled in our dataset as a class. \n",
"- But for each of those pathologies we are attempting to predict whether a certain condition is present (i.e. positive result) or absent (i.e. negative result). \n",
" - These two possible labels of 'positive' or 'negative' (or the numerical equivalent of 1 or 0) are also typically referred to as classes. \n",
"- Moreover, we also use the term in reference to software code 'classes' such as `ImageDataGenerator`.\n",
"\n",
"As long as you are aware of these different uses, the term should not cause confusion, since its meaning is usually clear from the context in which it is used."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Read in the data\n",
"Let's open these files using the [pandas](https://pandas.pydata.org/) library."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 224
},
"colab_type": "code",
"id": "5JRSHB7i0t_6",
"outputId": "69830050-af47-4ebc-946d-d411d0cbdf5b"
},
"outputs": [
{
"data": {
"text/html": [
"| | Image | Atelectasis | Cardiomegaly | Consolidation | Edema | Effusion | Emphysema | Fibrosis | Hernia | Infiltration | Mass | Nodule | PatientId | Pleural_Thickening | Pneumonia | Pneumothorax |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | 00008270_015.png | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8270 | 0 | 0 | 0 |\n",
"| 1 | 00029855_001.png | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 29855 | 0 | 0 | 0 |\n",
"| 2 | 00001297_000.png | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1297 | 1 | 0 | 0 |\n",
"| 3 | 00012359_002.png | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12359 | 0 | 0 | 0 |\n",
"| 4 | 00017951_001.png | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 17951 | 0 | 0 | 0 |\n",
"\n",
"df1_patients_unique...[continue your code here] \n", "