{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Deep Learning Models -- A collection of various deep learning architectures, models, and tips for TensorFlow and PyTorch in Jupyter Notebooks.\n",
    "- Author: Sebastian Raschka\n",
    "- GitHub Repository: https://github.com/rasbt/deeplearning-models"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Sebastian Raschka \n",
      "\n",
      "CPython 3.7.1\n",
      "IPython 7.2.0\n",
      "\n",
      "torch 1.0.0\n"
     ]
    }
   ],
   "source": [
    "%load_ext watermark\n",
    "%watermark -a 'Sebastian Raschka' -v -p torch"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Model Zoo -- Using PyTorch Dataset Loading Utilities for Custom Datasets (CSV files converted to HDF5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook provides an example for how to load a dataset from an HDF5 file created from a CSV file, using PyTorch's data loading utilities. For a more in-depth discussion, please see the official\n",
    "\n",
    "- [Data Loading and Processing Tutorial](http://pytorch.org/tutorials/beginner/data_loading_tutorial.html)\n",
    "- [torch.utils.data](http://pytorch.org/docs/master/data.html) API documentation\n",
    "\n",
    "An Hierarchical Data Format (HDF) is a convenient way that allows quick access to data instances during minibatch learning if a dataset is too large to fit into memory. The approach outlined in this notebook uses uses the common [HDF5](https://support.hdfgroup.org/HDF5/) format and should be accessible to any programming language or tool with an HDF5 API.\n",
    "\n",
    "**In this example, we are going to use the Iris dataset for illustrative purposes. Let's pretend it's our large training dataset that doesn't fit into memory**.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import h5py\n",
    "import torch\n",
    "from torch.utils.data import Dataset\n",
    "from torch.utils.data import DataLoader"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Converting a CSV file to HDF5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this first step, we are going to process a CSV file (here, Iris) into an HDF5 database:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# suppose this is a large CSV that does not \n",
    "# fit into memory:\n",
    "csv_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'\n",
    "\n",
    "# Get number of lines in the CSV file if it's on your hard drive:\n",
    "#num_lines = subprocess.check_output(['wc', '-l', in_csv])\n",
    "#num_lines = int(nlines.split()[0]) \n",
    "num_lines = 150\n",
    "num_features = 4\n",
    "\n",
    "class_dict = {'Iris-setosa': 0,\n",
    "              'Iris-versicolor': 1,\n",
    "              'Iris-virginica': 2}\n",
    "\n",
    "# use 10,000 or 100,000 or so for large files\n",
    "chunksize = 10\n",
    "\n",
    "# this is your HDF5 database:\n",
    "with h5py.File('iris.h5', 'w') as h5f:\n",
    "    \n",
    "    # use num_features-1 if the csv file has a column header\n",
    "    dset1 = h5f.create_dataset('features',\n",
    "                               shape=(num_lines, num_features),\n",
    "                               compression=None,\n",
    "                               dtype='float32')\n",
    "    dset2 = h5f.create_dataset('labels',\n",
    "                               shape=(num_lines,),\n",
    "                               compression=None,\n",
    "                               dtype='int32')\n",
    "\n",
    "    # change range argument from 0 -> 1 if your csv file contains a column header\n",
    "    for i in range(0, num_lines, chunksize):  \n",
    "\n",
    "        df = pd.read_csv(csv_path,  \n",
    "                header=None,  # no header, define column header manually later\n",
    "                nrows=chunksize, # number of rows to read at each iteration\n",
    "                skiprows=i)   # skip rows that were already read\n",
    "        \n",
    "        df[4] = df[4].map(class_dict)\n",
    "\n",
    "        features = df.values[:, :4]\n",
    "        labels = df.values[:, -1]\n",
    "        \n",
    "        # use i-1 and i-1+10 if csv file has a column header\n",
    "        dset1[i:i+10, :] = features\n",
    "        dset2[i:i+10] = labels[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After creating the database, let's double-check that everything works correctly:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(150, 4)\n",
      "(150,)\n"
     ]
    }
   ],
   "source": [
    "with h5py.File('iris.h5', 'r') as h5f:\n",
    "    print(h5f['features'].shape)\n",
    "    print(h5f['labels'].shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Features of entry no. 99: [5.7 2.8 4.1 1.3]\n",
      "Class label of entry no. 99: 1\n"
     ]
    }
   ],
   "source": [
    "with h5py.File('iris.h5', 'r') as h5f:\n",
    "    print('Features of entry no. 99:', h5f['features'][99])\n",
    "    print('Class label of entry no. 99:', h5f['labels'][99])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Implementing a Custom Dataset Class"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we implement a custom `Dataset` for reading the training examples. The `__getitem__` method will\n",
    "\n",
    "1. read a single training example from HDF5 based on an `index` (more on batching later)\n",
    "2. return a single training example and it's corresponding label\n",
    "\n",
    "Note that we will keep an open connection to the database for efficiency via `self.h5f = h5py.File(h5_path, 'r')` -- you may want to close it when you are done (more on this later)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "class Hdf5Dataset(Dataset):\n",
    "    \"\"\"Custom Dataset for loading entries from HDF5 databases\"\"\"\n",
    "\n",
    "    def __init__(self, h5_path, transform=None):\n",
    "    \n",
    "        self.h5f = h5py.File(h5_path, 'r')\n",
    "        self.num_entries = self.h5f['labels'].shape[0]\n",
    "        self.transform = transform\n",
    "\n",
    "    def __getitem__(self, index):\n",
    "        \n",
    "        features = self.h5f['features'][index]\n",
    "        label = self.h5f['labels'][index]\n",
    "        if self.transform is not None:\n",
    "            features = self.transform(features)\n",
    "        return features, label\n",
    "\n",
    "    def __len__(self):\n",
    "        return self.num_entries"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have created our custom Dataset class, we can initialize a Dataset instance for the training examples using the 'iris.h5' database file. Then, we initialize a `DataLoader` that allows us to read from the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_dataset = Hdf5Dataset(h5_path='iris.h5',\n",
    "                            transform=None)\n",
    "\n",
    "train_loader = DataLoader(dataset=train_dataset,\n",
    "                          batch_size=50,\n",
    "                          shuffle=True,\n",
    "                          num_workers=4) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's it! Now we can iterate over an epoch using the train_loader as an iterator and use the features and labels from the training dataset for model training as shown in the next section"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Iterating Through the Custom Dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch: 1 | Batch index: 0 | Batch size: 50\n",
      "Epoch: 1 | Batch index: 1 | Batch size: 50\n",
      "Epoch: 1 | Batch index: 2 | Batch size: 50\n",
      "Epoch: 2 | Batch index: 0 | Batch size: 50\n",
      "Epoch: 2 | Batch index: 1 | Batch size: 50\n",
      "Epoch: 2 | Batch index: 2 | Batch size: 50\n",
      "Epoch: 3 | Batch index: 0 | Batch size: 50\n",
      "Epoch: 3 | Batch index: 1 | Batch size: 50\n",
      "Epoch: 3 | Batch index: 2 | Batch size: 50\n",
      "Epoch: 4 | Batch index: 0 | Batch size: 50\n",
      "Epoch: 4 | Batch index: 1 | Batch size: 50\n",
      "Epoch: 4 | Batch index: 2 | Batch size: 50\n",
      "Epoch: 5 | Batch index: 0 | Batch size: 50\n",
      "Epoch: 5 | Batch index: 1 | Batch size: 50\n",
      "Epoch: 5 | Batch index: 2 | Batch size: 50\n"
     ]
    }
   ],
   "source": [
    "device = torch.device(\"cuda:0\" if torch.cuda.is_available() else \"cpu\")\n",
    "torch.manual_seed(0)\n",
    "\n",
    "num_epochs = 5\n",
    "for epoch in range(num_epochs):\n",
    "\n",
    "    for batch_idx, (x, y) in enumerate(train_loader):\n",
    "        \n",
    "        print('Epoch:', epoch+1, end='')\n",
    "        print(' | Batch index:', batch_idx, end='')\n",
    "        print(' | Batch size:', y.size()[0])\n",
    "        \n",
    "        x = x.to(device)\n",
    "        y = y.to(device)\n",
    "\n",
    "        # do model training on x and y here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Remember that we kept an open connection to the HDF5 database in the `Hdf5Dataset` (via `self.h5f = h5py.File(h5_path, 'r')`). Once we are done, we may want to close this connection:**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_dataset.h5f.close()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "torch  1.0.0\n",
      "pandas 0.23.4\n",
      "numpy  1.15.4\n",
      "h5py   2.8.0\n",
      "\n"
     ]
    }
   ],
   "source": [
    "%watermark -iv"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  },
  "toc": {
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}