<!-- Autogenerated by `scripts/make_examples.py` -->
<table align="left">
    <td>
        <a target="_blank" href="https://colab.research.google.com/github/voxel51/fiftyone-examples/blob/master/examples/chest_xray14.ipynb">
            <img src="https://user-images.githubusercontent.com/25985824/104791629-6e618700-5769-11eb-857f-d176b37d2496.png" height="32" width="32">
            Try in Google Colab
        </a>
    </td>
    <td>
        <a target="_blank" href="https://nbviewer.jupyter.org/github/voxel51/fiftyone-examples/blob/master/examples/chest_xray14.ipynb">
            <img src="https://user-images.githubusercontent.com/25985824/104791634-6efa1d80-5769-11eb-8a4c-71d6cb53ccf0.png" height="32" width="32">
            Share via nbviewer
        </a>
    </td>
    <td>
        <a target="_blank" href="https://github.com/voxel51/fiftyone-examples/blob/master/examples/chest_xray14.ipynb">
            <img src="https://user-images.githubusercontent.com/25985824/104791633-6efa1d80-5769-11eb-8ee3-4b2123fe4b66.png" height="32" width="32">
            View on GitHub
        </a>
    </td>
    <td>
        <a href="https://github.com/voxel51/fiftyone-examples/raw/master/examples/chest_xray14.ipynb" download>
            <img src="https://user-images.githubusercontent.com/25985824/104792428-60f9cc00-576c-11eb-95a4-5709d803023a.png" height="32" width="32">
            Download notebook
        </a>
    </td>
</table>


# Load X-ray Data into FiftyOne

This notebook walks you through how to load the NIH [ChestX-ray14](https://paperswithcode.com/dataset/chestx-ray8) dataset!

First, we'll download the data. Then, we'll load the data into FiftyOne.

**Note**: You can also browse this dataset for free at [try.fiftyone.ai](https://try.fiftyone.ai/datasets/chestx-ray14/samples)!

## Setup

To run this code, you will need to install the [FiftyOne open source library](https://github.com/voxel51/fiftyone) for dataset curation.

In [None]:
!pip install fiftyone

We will import all of the necessary modules:

In [None]:
from glob import glob
import os
import subprocess
import urllib.request

import numpy as np
import pandas as pd
from PIL import Image
from tqdm.notebook import tqdm

import fiftyone as fo
from fiftyone import ViewField as F

## Downloading Data

All of the raw data is hosted by the NIH [here](https://nihcc.app.box.com/v/ChestXray-NIHCC).

Download the following files:

- `Data_Entry_2017_v2020.csv`
- `BBox_List_2017.csv`
- `train_val_list.txt`
- `test_list.txt`

Run the following cell to batch download the gzipped tar archives containing the X-ray images:

In [1]:
# URLs for the zip files
links = [
    'https://nihcc.box.com/shared/static/vfk49d74nhbxq3nqjg0900w5nvkorp5c.gz',
    'https://nihcc.box.com/shared/static/i28rlmbvmfjbl8p2n3ril0pptcmcu9d1.gz',
    'https://nihcc.box.com/shared/static/f1t00wrtdk94satdfb9olcolqx20z2jp.gz',
    'https://nihcc.box.com/shared/static/0aowwzs5lhjrceb3qp67ahp0rd1l1etg.gz',
    'https://nihcc.box.com/shared/static/v5e3goj22zr6h8tzualxfsqlqaygfbsn.gz',
    'https://nihcc.box.com/shared/static/asi7ikud9jwnkrnkj99jnpfkjdes7l6l.gz',
    'https://nihcc.box.com/shared/static/jn1b4mw4n6lnh74ovmcjb8y48h8xj07n.gz',
    'https://nihcc.box.com/shared/static/tvpxmn7qyrgl0w8wfh9kqfjskv6nmm1j.gz',
    'https://nihcc.box.com/shared/static/upyy3ml7qdumlgk2rfcvlb9k6gvqq2pj.gz',
    'https://nihcc.box.com/shared/static/l6nilvfa9cg3s28tqv1qc1olm3gnz54p.gz',
    'https://nihcc.box.com/shared/static/hhq8fkdgvcari67vfhs7ppg2w6ni4jze.gz',
    'https://nihcc.box.com/shared/static/ioqwiy20ihqwyr8pf4c24eazhh281pbu.gz'
]

for idx, link in enumerate(links):
    fn = 'images_%02d.tar.gz' % (idx+1)
    print('downloading ' + fn + '...')
    urllib.request.urlretrieve(link, fn)  # download the archive

print("Download complete. Please check the checksums")
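
The download cell ends by asking you to check the checksums. A minimal sketch of how that check could look, assuming you have a list of expected MD5 digests for the archives (the helper names `md5_of_file` and `find_mismatches` are hypothetical, and no real digests are shown here):

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so multi-gigabyte archives never sit in memory
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(expected_digests):
    """Return the paths whose MD5 digest differs from the expected one.

    `expected_digests` maps a file path to its expected hex digest.
    """
    return [
        path
        for path, expected in expected_digests.items()
        if md5_of_file(path) != expected.lower()
    ]
```

Calling `find_mismatches({"images_01.tar.gz": "<expected digest>"})` returns an empty list when every file matches, so any returned path is a candidate for re-downloading.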

Then extract the archives:

In [2]:
for file in glob('*.tar.gz'):
    directory = file.rsplit('.', 2)[0]  # e.g. 'images_01.tar.gz' -> 'images_01'
    os.makedirs(directory, exist_ok=True)
    subprocess.run(['tar', '-xzf', file, '-C', directory], check=True)  # raise if extraction fails

And move all of the images into a common `images` folder:

In [None]:
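import shutil

os.makedirs("images", exist_ok=True)
for image_dir in glob('images_*/'):
    ## each extracted archive contains its own `images` subfolder
    for path in glob(os.path.join(image_dir, "images", "*")):
        shutil.move(path, "images")
    shutil.rmtree(image_dir)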

For reference, here is a preview of a fully loaded copy of the dataset, saved under the name `CXR8`:

In [2]:
dataset = fo.load_dataset("CXR8")

In [6]:
dataset

Name:        CXR8
Media type:  image
Num samples: 112120
Persistent:  True
Tags:        []
Sample fields:
    id:               fiftyone.core.fields.ObjectIdField
    filepath:         fiftyone.core.fields.StringField
    tags:             fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:         fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    patient_id:       fiftyone.core.fields.StringField
    view_position:    fiftyone.core.fields.StringField
    patient_age:      fiftyone.core.fields.IntField
    patient_gender:   fiftyone.core.fields.StringField
    follow_up_number: fiftyone.core.fields.IntField
    findings:         fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classifications)
    detection:        fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detection)

In [8]:
dataset.distinct("findings.classifications.label")

['Atelectasis',
 'Cardiomegaly',
 'Consolidation',
 'Edema',
 'Effusion',
 'Emphysema',
 'Fibrosis',
 'Hernia',
 'Infiltration',
 'Mass',
 'No Finding',
 'Nodule',
 'Pleural_Thickening',
 'Pneumonia',
 'Pneumothorax']

## Load Data into FiftyOne

Now we can create a dataset from this image directory:

In [3]:
dataset = fo.Dataset.from_images_dir("images")
dataset.name = "ChestX-ray14"
dataset.persistent = True

Now let's add in the split information ("train" vs "test") as tags:

In [4]:
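## read the split lists downloaded earlier (one filename per line)
with open("train_val_list.txt") as f:
    train_filenames = f.read().split()

with open("test_list.txt") as f:
    test_filenames = f.read().split()

dirpath = os.path.dirname(dataset.first().filepath)
test_filepaths = [
    os.path.join(dirpath, name) for name in test_filenames
]

train_filepaths = [
    os.path.join(dirpath, name) for name in train_filenames
]

for fp in tqdm(train_filepaths):
    sample = dataset[fp]
    sample.tags.append("train")
    sample.save()

for fp in tqdm(test_filepaths):
    sample = dataset[fp]
    sample.tags.append("test")
    sample.save()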

Next, let's add in basic attributes:

In [None]:
## load as pandas dataframe
attributes_df = pd.read_csv("Data_Entry_2017_v2020.csv")

## add fields to dataset
dataset.add_sample_field("follow_up_number", fo.IntField)
dataset.add_sample_field("patient_id", fo.StringField)
dataset.add_sample_field("view_position", fo.StringField)
dataset.add_sample_field("patient_age", fo.IntField)
dataset.add_sample_field("patient_gender", fo.StringField)

In [None]:
## iterate through rows of the dataframe
for row in tqdm(attributes_df.iterrows()):
    age, gender, view_pos = row[1][['Patient Age', 'Patient Gender', 'View Position']]
    pid, fup = row[1][['Patient ID', 'Follow-up #']]
    finding = row[1]['Finding Labels'].split('|')
    filename = row[1]['Image Index']
    fp = os.path.join(dirpath, filename)
    classifs = fo.Classifications(
        classifications=[
            fo.Classification(label=l) for l in finding
        ]
    )
    sample = dataset[fp]
    sample["patient_age"] = int(age)
    sample["patient_gender"] = gender
    sample["view_position"] = view_pos
    sample["patient_id"] = str(pid)
    sample["follow_up_number"] = int(fup)
    ## store under "findings" to match the schema shown in the dataset preview
    sample["findings"] = classifs
    sample.save()
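
The `Finding Labels` column packs every finding for an image into one `|`-separated string, which is why the loop above splits it before building the classifications. A small standalone sketch of that parsing step (the helper name `parse_findings` is hypothetical, and the whitespace stripping is an added precaution rather than something the CSV is known to need):

```python
def parse_findings(finding_labels):
    """Split a 'Finding Labels' string such as 'Mass|Nodule' into a list of labels."""
    # Strip each piece and drop empties so stray separators do not create blank labels
    return [label.strip() for label in finding_labels.split("|") if label.strip()]


print(parse_findings("Cardiomegaly|Effusion"))  # ['Cardiomegaly', 'Effusion']
print(parse_findings("No Finding"))             # ['No Finding']
```

Each element of the returned list becomes one `fo.Classification`, so an image with two findings ends up with two labels in the same field.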

Finally, let's add in the detection bounding boxes. There are fewer than 1,000 of them:

In [None]:
## compute metadata so we have width and height
dataset.compute_metadata()

## load the bounding box data
bbox_df = pd.read_csv('BBox_List_2017.csv')

## create a new field called "detection" that contains the bounding box
for row in bbox_df.iterrows():
    fp = os.path.join(dirpath, row[1]["Image Index"])
    sample = dataset[fp]
    ## the CSV header "Bbox [x,y,w,h]" is split on its commas, hence the odd column names
    box_w = row[1]["w"]
    box_h = row[1]["h]"]
    box_x = row[1]["Bbox [x"]
    box_y = row[1]["y"]
    label = row[1]["Finding Label"]
    image_w, image_h = sample.metadata.width, sample.metadata.height
    bounding_box = [box_x/image_w, box_y/image_h, box_w/image_w, box_h/image_h]
    sample["detection"] = fo.Detection(
        label=label,
        bounding_box=bounding_box
    )
    sample.save()
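
FiftyOne expects bounding boxes as `[top-left-x, top-left-y, width, height]` in relative coordinates between 0 and 1, which is why the loop above divides the pixel values by the image dimensions. The conversion on its own, as a minimal sketch (the helper name `to_relative_box` is hypothetical):

```python
def to_relative_box(x, y, w, h, image_w, image_h):
    """Convert a pixel-space box to relative [x, y, width, height] coordinates."""
    # Horizontal values scale by image width, vertical values by image height
    return [x / image_w, y / image_h, w / image_w, h / image_h]


# A 128x256 pixel box whose top-left corner is at (256, 512) in a 1024x1024 image
print(to_relative_box(256, 512, 128, 256, 1024, 1024))  # [0.25, 0.5, 0.125, 0.25]
```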

Now we can visualize the data in the FiftyOne App:

In [10]:
session = fo.launch_app(dataset, auto=False)

Session launched. Run `session.show()` to open the App in a cell output.


![chest_xray14](https://user-images.githubusercontent.com/12500356/258531329-9cb9e262-3f3d-4761-949c-96a4f18c9ac4.png)

Depending on what analysis we are performing, it may be helpful to look at the results for each patient individually. We can achieve this by dynamically grouping by `patient_id` and ordering by `follow_up_number`.
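
The same grouping can also be built in code rather than through the App. A minimal sketch, assuming a FiftyOne version that supports dynamic grouping via `group_by()` and that the `patient_id` and `follow_up_number` fields were populated as above (the wrapper name `group_by_patient` is hypothetical):

```python
def group_by_patient(dataset):
    """Return a view with one group per patient, ordered by follow-up number."""
    return dataset.group_by("patient_id", order_by="follow_up_number")


# Usage, assuming `dataset` and `session` were created as in the cells above:
# view = group_by_patient(dataset)
# session.view = view
```

Assigning the view to `session.view` should update the App so that each tile represents one patient, with that patient's X-rays ordered by follow-up.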

![chest_xray14_dynam_group](https://user-images.githubusercontent.com/12500356/258531339-deeb78c1-4953-452f-82a3-750b770b9ae3.png)