{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Deep Learning Models -- A collection of various deep learning architectures, models, and tips for TensorFlow and PyTorch in Jupyter Notebooks.\n", "- Author: Sebastian Raschka\n", "- GitHub Repository: https://github.com/rasbt/deeplearning-models" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sebastian Raschka \n", "\n", "CPython 3.7.1\n", "IPython 7.2.0\n", "\n", "torch 1.0.0\n" ] } ], "source": [ "%load_ext watermark\n", "%watermark -a 'Sebastian Raschka' -v -p torch" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Model Zoo -- Using PyTorch Dataset Loading Utilities for Custom Datasets (CSV files converted to HDF5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook provides an example of how to load a dataset from an HDF5 file created from a CSV file, using PyTorch's data loading utilities. For a more in-depth discussion, please see the official\n", "\n", "- [Data Loading and Processing Tutorial](http://pytorch.org/tutorials/beginner/data_loading_tutorial.html)\n", "- [torch.utils.data](http://pytorch.org/docs/master/data.html) API documentation\n", "\n", "The Hierarchical Data Format (HDF) is a convenient file format that allows quick access to individual data instances during minibatch learning when a dataset is too large to fit into memory. The approach outlined in this notebook uses the common [HDF5](https://support.hdfgroup.org/HDF5/) format, so the resulting file should be accessible from any programming language or tool with an HDF5 API.\n", "\n", "**In this example, we are going to use the Iris dataset for illustrative purposes. 
Let's pretend it's our large training dataset that doesn't fit into memory**.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import h5py\n", "import torch\n", "from torch.utils.data import Dataset\n", "from torch.utils.data import DataLoader" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Converting a CSV file to HDF5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this first step, we are going to process a CSV file (here, Iris) into an HDF5 database:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# suppose this is a large CSV that does not \n", "# fit into memory:\n", "csv_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'\n", "\n", "# Get number of lines in the CSV file if it's on your hard drive:\n", "#import subprocess\n", "#nlines = subprocess.check_output(['wc', '-l', csv_path])\n", "#num_lines = int(nlines.split()[0]) \n", "num_lines = 150\n", "num_features = 4\n", "\n", "class_dict = {'Iris-setosa': 0,\n", "              'Iris-versicolor': 1,\n", "              'Iris-virginica': 2}\n", "\n", "# use 10,000 or 100,000 or so for large files\n", "chunksize = 10\n", "\n", "# this is your HDF5 database:\n", "with h5py.File('iris.h5', 'w') as h5f:\n", "\n", "    # use num_lines-1 if the csv file has a column header\n", "    dset1 = h5f.create_dataset('features',\n", "                               shape=(num_lines, num_features),\n", "                               compression=None,\n", "                               dtype='float32')\n", "    dset2 = h5f.create_dataset('labels',\n", "                               shape=(num_lines,),\n", "                               compression=None,\n", "                               dtype='int32')\n", "\n", "    # change range argument from 0 -> 1 if your csv file contains a column header\n", "    for i in range(0, num_lines, chunksize):\n", "\n", "        df = pd.read_csv(csv_path,\n", "                         header=None,  # no header, define column header manually later\n", "                         nrows=chunksize,  # number of rows to read at each iteration\n", "                         skiprows=i)  # skip rows that were already read\n", "\n", "        # map the class-name strings to integer labels\n", "        df[4] = df[4].map(class_dict)\n", "\n", "        features = df.values[:, :num_features]\n", "        labels = df.values[:, -1]\n", "\n", "        # use i-1 and i-1+chunksize if csv file has a column header\n", "        dset1[i:i+chunksize, :] = features\n", "        # write every label in the chunk, not just the first one\n", "        dset2[i:i+chunksize] = labels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After creating the database, let's double-check that everything works correctly:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(150, 4)\n", "(150,)\n" ] } ], "source": [ "with h5py.File('iris.h5', 'r') as h5f:\n", "    print(h5f['features'].shape)\n", "    print(h5f['labels'].shape)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Features of entry no. 99: [5.7 2.8 4.1 1.3]\n", "Class label of entry no. 99: 1\n" ] } ], "source": [ "with h5py.File('iris.h5', 'r') as h5f:\n", "    print('Features of entry no. 99:', h5f['features'][99])\n", "    print('Class label of entry no. 99:', h5f['labels'][99])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Implementing a Custom Dataset Class" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we implement a custom `Dataset` for reading the training examples. The `__getitem__` method will\n", "\n", "1. read a single training example from HDF5 based on an `index` (more on batching later)\n", "2. return a single training example and its corresponding label\n", "\n", "Note that we will keep an open connection to the database for efficiency via `self.h5f = h5py.File(h5_path, 'r')` -- you may want to close it when you are done (more on this later)." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "class Hdf5Dataset(Dataset):\n", "    \"\"\"Custom Dataset for loading entries from HDF5 databases\"\"\"\n", "\n", "    def __init__(self, h5_path, transform=None):\n", "\n", "        self.h5f = h5py.File(h5_path, 'r')\n", "        self.num_entries = self.h5f['labels'].shape[0]\n", "        self.transform = transform\n", "\n", "    def __getitem__(self, index):\n", "\n", "        features = self.h5f['features'][index]\n", "        label = self.h5f['labels'][index]\n", "        if self.transform is not None:\n", "            features = self.transform(features)\n", "        return features, label\n", "\n", "    def __len__(self):\n", "        return self.num_entries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have created our custom Dataset class, we can initialize a Dataset instance for the training examples using the 'iris.h5' database file. Then, we initialize a `DataLoader` that allows us to read from the dataset." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "train_dataset = Hdf5Dataset(h5_path='iris.h5',\n", "                            transform=None)\n", "\n", "train_loader = DataLoader(dataset=train_dataset,\n", "                          batch_size=50,\n", "                          shuffle=True,\n", "                          num_workers=4) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's it! 
Now we can iterate over the dataset, one epoch at a time, using `train_loader` as an iterator, and use the features and labels from the training dataset for model training, as shown in the next section." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Iterating Through the Custom Dataset" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch: 1 | Batch index: 0 | Batch size: 50\n", "Epoch: 1 | Batch index: 1 | Batch size: 50\n", "Epoch: 1 | Batch index: 2 | Batch size: 50\n", "Epoch: 2 | Batch index: 0 | Batch size: 50\n", "Epoch: 2 | Batch index: 1 | Batch size: 50\n", "Epoch: 2 | Batch index: 2 | Batch size: 50\n", "Epoch: 3 | Batch index: 0 | Batch size: 50\n", "Epoch: 3 | Batch index: 1 | Batch size: 50\n", "Epoch: 3 | Batch index: 2 | Batch size: 50\n", "Epoch: 4 | Batch index: 0 | Batch size: 50\n", "Epoch: 4 | Batch index: 1 | Batch size: 50\n", "Epoch: 4 | Batch index: 2 | Batch size: 50\n", "Epoch: 5 | Batch index: 0 | Batch size: 50\n", "Epoch: 5 | Batch index: 1 | Batch size: 50\n", "Epoch: 5 | Batch index: 2 | Batch size: 50\n" ] } ], "source": [ "device = torch.device(\"cuda:0\" if torch.cuda.is_available() else \"cpu\")\n", "torch.manual_seed(0)\n", "\n", "num_epochs = 5\n", "for epoch in range(num_epochs):\n", "\n", "    for batch_idx, (x, y) in enumerate(train_loader):\n", "\n", "        print('Epoch:', epoch+1, end='')\n", "        print(' | Batch index:', batch_idx, end='')\n", "        print(' | Batch size:', y.size()[0])\n", "\n", "        x = x.to(device)\n", "        y = y.to(device)\n", "\n", "        # do model training on x and y here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Remember that we kept an open connection to the HDF5 database in the `Hdf5Dataset` (via `self.h5f = h5py.File(h5_path, 'r')`). 
Once we are done, we may want to close this connection:**" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "train_dataset.h5f.close()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "torch 1.0.0\n", "pandas 0.23.4\n", "numpy 1.15.4\n", "h5py 2.8.0\n", "\n" ] } ], "source": [ "%watermark -iv" ] }