{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!-- Autogenerated by `scripts/make_examples.py` -->\n",
    "<table align=\"left\">\n",
    "    <td>\n",
    "        <a target=\"_blank\" href=\"https://colab.research.google.com/github/voxel51/fiftyone-examples/blob/master/examples/chest_xray14.ipynb\">\n",
    "            <img src=\"https://user-images.githubusercontent.com/25985824/104791629-6e618700-5769-11eb-857f-d176b37d2496.png\" height=\"32\" width=\"32\">\n",
    "            Try in Google Colab\n",
    "        </a>\n",
    "    </td>\n",
    "    <td>\n",
    "        <a target=\"_blank\" href=\"https://nbviewer.jupyter.org/github/voxel51/fiftyone-examples/blob/master/examples/chest_xray14.ipynb\">\n",
    "            <img src=\"https://user-images.githubusercontent.com/25985824/104791634-6efa1d80-5769-11eb-8a4c-71d6cb53ccf0.png\" height=\"32\" width=\"32\">\n",
    "            Share via nbviewer\n",
    "        </a>\n",
    "    </td>\n",
    "    <td>\n",
    "        <a target=\"_blank\" href=\"https://github.com/voxel51/fiftyone-examples/blob/master/examples/chest_xray14.ipynb\">\n",
    "            <img src=\"https://user-images.githubusercontent.com/25985824/104791633-6efa1d80-5769-11eb-8ee3-4b2123fe4b66.png\" height=\"32\" width=\"32\">\n",
    "            View on GitHub\n",
    "        </a>\n",
    "    </td>\n",
    "    <td>\n",
    "        <a href=\"https://github.com/voxel51/fiftyone-examples/raw/master/examples/chest_xray14.ipynb\" download>\n",
    "            <img src=\"https://user-images.githubusercontent.com/25985824/104792428-60f9cc00-576c-11eb-95a4-5709d803023a.png\" height=\"32\" width=\"32\">\n",
    "            Download notebook\n",
    "        </a>\n",
    "    </td>\n",
    "</table>\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Load X-ray Data into FiftyOne\n",
    "\n",
    "This notebook walks you through how to load the NIH [ChestX-ray14](https://paperswithcode.com/dataset/chestx-ray8) dataset!\n",
    "\n",
    "First, we'll download the data. Then, we'll load the data into FiftyOne.\n",
    "\n",
    "**Note**: You can also browse this dataset for free at [try.fiftyone.ai](https://try.fiftyone.ai/datasets/chestx-ray14/samples)!"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To run this code, you will need to install the [FiftyOne open source library](https://github.com/voxel51/fiftyone) for dataset curation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install fiftyone"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will import all of the necessary modules:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from glob import glob\n",
    "import os\n",
    "import subprocess\n",
    "import urllib.request\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from PIL import Image\n",
    "from tqdm.notebook import tqdm\n",
    "\n",
    "import fiftyone as fo\n",
    "from fiftyone import ViewField as F"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Downloading Data"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All of the raw data is hosted by the NIH [here](https://nihcc.app.box.com/v/ChestXray-NIHCC).\n",
    "\n",
    "Download the following files:\n",
    "\n",
    "- `Data_Entry_2017.csv`\n",
    "- `BBox_List_2017.csv`\n",
    "- `train_val_list.txt`\n",
    "- `test_list.txt`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run the following cell to batch download the zip files containing the X-ray images:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# URLs for the zip files\n",
    "links = [\n",
    "    'https://nihcc.box.com/shared/static/vfk49d74nhbxq3nqjg0900w5nvkorp5c.gz',\n",
    "    'https://nihcc.box.com/shared/static/i28rlmbvmfjbl8p2n3ril0pptcmcu9d1.gz',\n",
    "    'https://nihcc.box.com/shared/static/f1t00wrtdk94satdfb9olcolqx20z2jp.gz',\n",
    "\t'https://nihcc.box.com/shared/static/0aowwzs5lhjrceb3qp67ahp0rd1l1etg.gz',\n",
    "    'https://nihcc.box.com/shared/static/v5e3goj22zr6h8tzualxfsqlqaygfbsn.gz',\n",
    "\t'https://nihcc.box.com/shared/static/asi7ikud9jwnkrnkj99jnpfkjdes7l6l.gz',\n",
    "\t'https://nihcc.box.com/shared/static/jn1b4mw4n6lnh74ovmcjb8y48h8xj07n.gz',\n",
    "    'https://nihcc.box.com/shared/static/tvpxmn7qyrgl0w8wfh9kqfjskv6nmm1j.gz',\n",
    "\t'https://nihcc.box.com/shared/static/upyy3ml7qdumlgk2rfcvlb9k6gvqq2pj.gz',\n",
    "\t'https://nihcc.box.com/shared/static/l6nilvfa9cg3s28tqv1qc1olm3gnz54p.gz',\n",
    "\t'https://nihcc.box.com/shared/static/hhq8fkdgvcari67vfhs7ppg2w6ni4jze.gz',\n",
    "\t'https://nihcc.box.com/shared/static/ioqwiy20ihqwyr8pf4c24eazhh281pbu.gz'\n",
    "]\n",
    "\n",
    "for idx, link in enumerate(links):\n",
    "    fn = 'images_%02d.tar.gz' % (idx+1)\n",
    "    print('downloading'+fn+'...')\n",
    "    urllib.request.urlretrieve(link, fn)  # download the zip file\n",
    "\n",
    "print(\"Download complete. Please check the checksums\")"
   ]
  },
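  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The download cell asks you to check the checksums. A minimal sketch of one way to do that follows: it computes a digest for each downloaded archive so you can compare it against a reference value. The helper name `md5sum` is hypothetical, and the use of MD5 is an assumption; use whichever hash function your reference checksums were published with."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import hashlib\n",
    "from glob import glob\n",
    "\n",
    "# hypothetical helper: hex MD5 digest of a file, read in 1 MiB chunks\n",
    "# so that large archives are never loaded into memory at once\n",
    "def md5sum(path, chunk_size=1 << 20):\n",
    "    digest = hashlib.md5()\n",
    "    with open(path, 'rb') as f:\n",
    "        for chunk in iter(lambda: f.read(chunk_size), b''):\n",
    "            digest.update(chunk)\n",
    "    return digest.hexdigest()\n",
    "\n",
    "# print one digest per archive for comparison against your reference list\n",
    "for fn in sorted(glob('images_*.tar.gz')):\n",
    "    print(fn, md5sum(fn))"
   ]
  },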
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then unzip these zip files:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "for file in glob('*.tar.gz'):\n",
    "    directory = file.rsplit('.', 2)[0]\n",
    "    os.makedirs(directory, exist_ok=True)\n",
    "    subprocess.run(['tar', '-xzf', file, '-C', directory])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And move all of the images into a common `images` folder:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "os.system(\"mkdir images\")\n",
    "for image_dir in glob('images_*/'):\n",
    "    os.system(f\"mv {image_dir}images/* images/\")\n",
    "    os.system(f\"rm -r {image_dir}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset = fo.Dataset(\"CXR8\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Name:        CXR8\n",
       "Media type:  image\n",
       "Num samples: 112120\n",
       "Persistent:  True\n",
       "Tags:        []\n",
       "Sample fields:\n",
       "    id:               fiftyone.core.fields.ObjectIdField\n",
       "    filepath:         fiftyone.core.fields.StringField\n",
       "    tags:             fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)\n",
       "    metadata:         fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)\n",
       "    patient_id:       fiftyone.core.fields.StringField\n",
       "    view_position:    fiftyone.core.fields.StringField\n",
       "    patient_age:      fiftyone.core.fields.IntField\n",
       "    patient_gender:   fiftyone.core.fields.StringField\n",
       "    follow_up_number: fiftyone.core.fields.IntField\n",
       "    findings:         fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classifications)\n",
       "    detection:        fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detection)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Atelectasis',\n",
       " 'Cardiomegaly',\n",
       " 'Consolidation',\n",
       " 'Edema',\n",
       " 'Effusion',\n",
       " 'Emphysema',\n",
       " 'Fibrosis',\n",
       " 'Hernia',\n",
       " 'Infiltration',\n",
       " 'Mass',\n",
       " 'No Finding',\n",
       " 'Nodule',\n",
       " 'Pleural_Thickening',\n",
       " 'Pneumonia',\n",
       " 'Pneumothorax']"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dataset.distinct(\"findings.classifications.label\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load Data into FiftyOne"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can create a dataset from this image directory:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset = fo.Dataset.from_images_dir(\"images\")\n",
    "dataset.name = \"ChestX-ray14\"\n",
    "dataset.persistent= True"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's add in the split information (\"train\" vs \"test\") as tags:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "dirpath = os.path.dirname(dataset.first().filepath)\n",
    "test_filepaths = [\n",
    "    os.path.join(dirpath, f) for f in test_filenames\n",
    "]\n",
    "\n",
    "train_filepaths = [\n",
    "    os.path.join(dirpath, f) for f in train_filenames\n",
    "\n",
    "for fp in tqdm(train_filepaths):\n",
    "    sample = dataset[fp]\n",
    "    sample.tags.append(\"train\")\n",
    "    sample.save()\n",
    "\n",
    "for fp in tqdm(test_filepaths):\n",
    "    sample = dataset[fp]\n",
    "    sample.tags.append(\"test\")\n",
    "    sample.save()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, let's add in basic attributes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## load as pandas dataframe\n",
    "attributes_df = pd.read_csv(\"Data_Entry_2017_v2020.csv\")\n",
    "\n",
    "## add fields to dataset\n",
    "dataset.add_sample_field(\"follow_up_number\", fo.IntField)\n",
    "dataset.add_sample_field(\"patient_id\", fo.StringField)\n",
    "dataset.add_sample_field(\"view_position\", fo.StringField)\n",
    "dataset.add_sample_field(\"patient_age\", fo.IntField)\n",
    "dataset.add_sample_field(\"patient_gender\", fo.StringField)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## iterate through rows of the dataframe\n",
    "for row in tqdm(attributes_df.iterrows()):\n",
    "    age, gender, view_pos = row[1][['Patient Age', 'Patient Gender', 'View Position']]\n",
    "    pid, fup = row[1][['Patient ID', 'Follow-up #']]\n",
    "    finding = row[1]['Finding Labels'].split('|')\n",
    "    filename = row[1]['Image Index']\n",
    "    fp = os.path.join(dirpath, filename)\n",
    "    classifs = fo.Classifications(\n",
    "        classifications=[\n",
    "            fo.Classification(label=l) for l in finding\n",
    "        ]\n",
    "    )\n",
    "    sample = dataset[fp]\n",
    "    sample['patient_age'] = age\n",
    "    sample[\"patient_gender\"] = gender\n",
    "    sample[\"view_position\"] = view_pos\n",
    "    sample[\"patient_id\"] = str(pid)\n",
    "    sample[\"follow_up_number\"] = int(fup)\n",
    "    sample[\"classifications\"] = classifs\n",
    "    sample.save()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, let's add in the detection bounding boxes. There are less than 1,000 of them:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## compute metadata so we have width and height\n",
    "dataset.compute_metadata()\n",
    "\n",
    "## load the bounding box data\n",
    "bbox_df = pd.read_csv('BBox_List_2017.csv')\n",
    "\n",
    "## create a new field called \"detection\" that contains the bounding box\n",
    "for row in bbox_df.iterrows():\n",
    "    fp = os.path.join(dirpath, row[1][\"Image Index\"])\n",
    "    sample = dataset[fp]\n",
    "    box_w = row[1][\"w\"]\n",
    "    box_h = row[1][\"h]\"]\n",
    "    box_x = row[1][\"Bbox [x\"]\n",
    "    box_y = row[1][\"y\"]\n",
    "    label = row[1][\"Finding Label\"]\n",
    "    image_w, image_h = sample.metadata.width, sample.metadata.height\n",
    "    bounding_box = [box_x/image_w, box_y/image_h, box_w/image_w, box_h/image_h]\n",
    "    sample[\"detection\"] = fo.Detection(\n",
    "        label=label,\n",
    "        bounding_box=bounding_box\n",
    "    )\n",
    "    sample.save()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can visualize the data in the FiftyOne App:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Session launched. Run `session.show()` to open the App in a cell output.\n"
     ]
    }
   ],
   "source": [
    "session = fo.launch_app(dataset, auto=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![chest_xray14](https://user-images.githubusercontent.com/12500356/258531329-9cb9e262-3f3d-4761-949c-96a4f18c9ac4.png)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Depending on what analysis we are performing, it may be helpful to look at the results for each patient individually. We can achieve this by dynamically grouping by `patient_id` and ordering by `follow_up_number`."
   ]
  },
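  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of this grouping is shown here. It assumes the `dataset` and `session` objects created above, and a FiftyOne version whose `group_by()` accepts the `order_by` argument; the variable name `patient_view` is only illustrative."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## one group per patient, with each patient's images ordered by follow-up number\n",
    "patient_view = dataset.group_by(\"patient_id\", order_by=\"follow_up_number\")\n",
    "\n",
    "## show the grouped view in the App session launched above\n",
    "session.view = patient_view"
   ]
  },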
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![chest_xray14_dynam_group](https://user-images.githubusercontent.com/12500356/258531339-deeb78c1-4953-452f-82a3-750b770b9ae3.png)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}