{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Deep Learning Models -- A collection of various deep learning architectures, models, and tips for TensorFlow and PyTorch in Jupyter Notebooks.\n",
    "- Author: Sebastian Raschka\n",
    "- GitHub Repository: https://github.com/rasbt/deeplearning-models"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Generating Validation Set Splits"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Many datasets ship with only training and test splits and no validation split. A separate validation set is recommended for repeated model tuning and evaluation, because tuning against the test set overfits to it and makes the final test estimate overly optimistic.\n",
    "\n",
    "Since we sometimes want to rotate the validation set, or merge the training and validation sets at a later stage to obtain more training data, a fixed, separately stored validation set is not always convenient. It is often more practical to split the validation portion off the training set if and when we need it."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Suppose we load the MNIST dataset as follows -- note that there is no validation set pre-specified for MNIST, and the same is true for CIFAR-10/100."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## A Typical Dataset (here: MNIST)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "from torchvision import datasets\n",
    "from torchvision import transforms\n",
    "from torch.utils.data import DataLoader\n",
    "\n",
    "BATCH_SIZE = 64"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "0it [00:00, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to data/MNIST/raw/train-images-idx3-ubyte.gz\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "9920512it [00:02, 4390618.70it/s]                             \n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Extracting data/MNIST/raw/train-images-idx3-ubyte.gz\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "32768it [00:00, 293812.98it/s]                           \n",
      "0it [00:00, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to data/MNIST/raw/train-labels-idx1-ubyte.gz\n",
      "Extracting data/MNIST/raw/train-labels-idx1-ubyte.gz\n",
      "Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to data/MNIST/raw/t10k-images-idx3-ubyte.gz\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "1654784it [00:00, 2762205.03it/s]                            \n",
      "8192it [00:00, 124866.40it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Extracting data/MNIST/raw/t10k-images-idx3-ubyte.gz\n",
      "Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to data/MNIST/raw/t10k-labels-idx1-ubyte.gz\n",
      "Extracting data/MNIST/raw/t10k-labels-idx1-ubyte.gz\n",
      "Processing...\n",
      "Done!\n",
      "Image batch dimensions: torch.Size([64, 1, 28, 28])\n",
      "Image label dimensions: torch.Size([64])\n"
     ]
    }
   ],
   "source": [
    "##########################\n",
    "### MNIST DATASET\n",
    "##########################\n",
    "\n",
    "# Note transforms.ToTensor() scales input images\n",
    "# to 0-1 range\n",
    "train_dataset = datasets.MNIST(root='data', \n",
    "                               train=True, \n",
    "                               transform=transforms.ToTensor(),\n",
    "                               download=True)\n",
    "\n",
    "test_dataset = datasets.MNIST(root='data', \n",
    "                              train=False, \n",
    "                              transform=transforms.ToTensor())\n",
    "\n",
    "\n",
    "train_loader = DataLoader(dataset=train_dataset, \n",
    "                          batch_size=BATCH_SIZE,\n",
    "                          num_workers=4,\n",
    "                          shuffle=True)\n",
    "\n",
    "test_loader = DataLoader(dataset=test_dataset, \n",
    "                         batch_size=BATCH_SIZE,\n",
    "                         num_workers=4,\n",
    "                         shuffle=False)\n",
    "\n",
    "# Checking the dataset\n",
    "for images, labels in train_loader:  \n",
    "    print('Image batch dimensions:', images.shape)\n",
    "    print('Image label dimensions:', labels.shape)\n",
    "    break"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total number of training examples: 60000\n"
     ]
    }
   ],
   "source": [
    "print(f'Total number of training examples: {len(train_dataset)}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Subset Method"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Most of the time, a convenient way to split a training set into a training subset and a validation subset is the `Subset` class. However, note that both subsets wrap the same underlying dataset and therefore share its `transform` (which may not be desired in all cases; for instance, if we want to perform random cropping or rotation for training set augmentation only).\n",
    "\n",
    "Concretely, we will reserve the first 1000 training examples for validation and use the remaining 59000 examples for the new training set. Note that `Subset` itself does not shuffle the data; the shuffling prior to each epoch comes from passing `shuffle=True` to the `DataLoader`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "from torch.utils.data import Subset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "valid_indices = torch.arange(0, 1000)\n",
    "train_indices = torch.arange(1000, 60000)\n",
    "\n",
    "\n",
    "train_and_valid = datasets.MNIST(root='data', \n",
    "                                 train=True, \n",
    "                                 transform=transforms.ToTensor(),\n",
    "                                 download=True)\n",
    "\n",
    "train_dataset = Subset(train_and_valid, train_indices)\n",
    "valid_dataset = Subset(train_and_valid, valid_indices)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_loader = DataLoader(dataset=train_dataset, \n",
    "                          batch_size=BATCH_SIZE,\n",
    "                          num_workers=4,\n",
    "                          shuffle=True)\n",
    "\n",
    "valid_loader = DataLoader(dataset=valid_dataset, \n",
    "                          batch_size=BATCH_SIZE,\n",
    "                          num_workers=4,\n",
    "                          shuffle=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Image batch dimensions: torch.Size([64, 1, 28, 28])\n",
      "Image label dimensions: torch.Size([64])\n"
     ]
    }
   ],
   "source": [
    "# Checking the dataset\n",
    "for images, labels in train_loader:  \n",
    "    print('Image batch dimensions:', images.shape)\n",
    "    print('Image label dimensions:', labels.shape)\n",
    "    break"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor([1, 7, 2, 4, 7, 7, 8, 4, 0, 5])\n",
      "tensor([5, 5, 6, 4, 2, 3, 8, 0, 7, 5])\n"
     ]
    }
   ],
   "source": [
    "# Check that shuffling works properly\n",
    "# i.e., label indices should be in random order.\n",
    "# Also, the label order should be different in the second\n",
    "# epoch.\n",
    "\n",
    "for images, labels in train_loader:  \n",
    "    pass\n",
    "print(labels[:10])\n",
    "\n",
    "for images, labels in train_loader:  \n",
    "    pass\n",
    "print(labels[:10])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor([1, 0, 3, 7, 0, 7, 5, 6, 8, 3])\n",
      "tensor([1, 0, 3, 7, 0, 7, 5, 6, 8, 3])\n"
     ]
    }
   ],
   "source": [
    "# Check that shuffling works properly.\n",
    "# i.e., label indices should be in random order.\n",
    "# Via the fixed random seed, both epochs should return\n",
    "# the same label sequence.\n",
    "\n",
    "torch.manual_seed(123)\n",
    "for images, labels in train_loader:  \n",
    "    pass\n",
    "print(labels[:10])\n",
    "\n",
    "torch.manual_seed(123)\n",
    "for images, labels in train_loader:  \n",
    "    pass\n",
    "print(labels[:10])"
   ]
  },
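  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The introduction mentions rotating the validation set. The following is a minimal sketch of how that could be done with `Subset`, assuming contiguous, equally sized folds. To stay self-contained, it uses a small synthetic `TensorDataset` instead of MNIST; the names `full_dataset`, `NUM_FOLDS`, and `fold_size` are illustrative and not part of the original notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "from torch.utils.data import Subset, TensorDataset\n",
    "\n",
    "# Toy stand-in for a training set: 10 examples with 2 features each.\n",
    "# (Hypothetical data; in this notebook it would be `train_and_valid`.)\n",
    "features = torch.arange(20, dtype=torch.float32).reshape(10, 2)\n",
    "labels = torch.arange(10)\n",
    "full_dataset = TensorDataset(features, labels)\n",
    "\n",
    "NUM_FOLDS = 5  # assumed number of rotations\n",
    "fold_size = len(full_dataset) // NUM_FOLDS\n",
    "all_indices = torch.arange(len(full_dataset))\n",
    "\n",
    "for fold in range(NUM_FOLDS):\n",
    "    start, stop = fold * fold_size, (fold + 1) * fold_size\n",
    "    valid_indices = all_indices[start:stop]\n",
    "    # Everything outside the current validation block is used for training.\n",
    "    train_indices = torch.cat([all_indices[:start], all_indices[stop:]])\n",
    "\n",
    "    train_subset = Subset(full_dataset, train_indices.tolist())\n",
    "    valid_subset = Subset(full_dataset, valid_indices.tolist())\n",
    "\n",
    "    print(f'Fold {fold}: {len(train_subset)} training, '\n",
    "          f'{len(valid_subset)} validation examples')"
   ]
  },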
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## SubsetRandomSampler Method"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compared to `Subset`, the `SubsetRandomSampler` is a more convenient solution if we want to assign different transformations to the training and validation subsets, because each subset can be drawn from its own dataset object with its own `transform`. Similar to the `Subset` example, we will use the first 1000 examples for the validation set and the remaining 59000 examples for training."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "from torch.utils.data import SubsetRandomSampler"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_indices = torch.arange(1000, 60000)\n",
    "valid_indices = torch.arange(0, 1000)\n",
    "\n",
    "\n",
    "train_sampler = SubsetRandomSampler(train_indices)\n",
    "valid_sampler = SubsetRandomSampler(valid_indices)\n",
    "\n",
    "\n",
    "training_transform = transforms.Compose([transforms.Resize((32, 32)),\n",
    "                                         transforms.RandomCrop((28, 28)),\n",
    "                                         transforms.ToTensor()])\n",
    "\n",
    "valid_transform = transforms.Compose([transforms.Resize((32, 32)),\n",
    "                                      transforms.CenterCrop((28, 28)),\n",
    "                                      transforms.ToTensor()])\n",
    "\n",
    "\n",
    "\n",
    "train_dataset = datasets.MNIST(root='data', \n",
    "                               train=True, \n",
    "                               transform=training_transform,\n",
    "                               download=True)\n",
    "\n",
    "# Note that this wraps the same images as train_dataset above;\n",
    "# however, we can now choose a different transform.\n",
    "valid_dataset = datasets.MNIST(root='data', \n",
    "                               train=True, \n",
    "                               transform=valid_transform,\n",
    "                               download=False)\n",
    "\n",
    "test_dataset = datasets.MNIST(root='data', \n",
    "                              train=False, \n",
    "                              transform=valid_transform,\n",
    "                              download=False)\n",
    "\n",
    "train_loader = DataLoader(train_dataset,\n",
    "                          batch_size=BATCH_SIZE,\n",
    "                          num_workers=4,\n",
    "                          sampler=train_sampler)\n",
    "\n",
    "valid_loader = DataLoader(valid_dataset,\n",
    "                          batch_size=BATCH_SIZE,\n",
    "                          num_workers=4,\n",
    "                          sampler=valid_sampler)\n",
    "\n",
    "test_loader = DataLoader(dataset=test_dataset, \n",
    "                         batch_size=BATCH_SIZE,\n",
    "                         num_workers=4,\n",
    "                         shuffle=False)"
   ]
  },
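  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check of the index split above, the following sketch iterates over a `SubsetRandomSampler` directly, without a `DataLoader` or dataset. It assumes only that iterating over a sampler yields dataset indices; the variable names `first_pass` and `second_pass` are illustrative and not part of the original notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "from torch.utils.data import SubsetRandomSampler\n",
    "\n",
    "train_indices = torch.arange(1000, 60000)\n",
    "valid_indices = torch.arange(0, 1000)\n",
    "\n",
    "valid_sampler = SubsetRandomSampler(valid_indices)\n",
    "\n",
    "# Each iteration over the sampler yields the given indices\n",
    "# in a newly drawn random order.\n",
    "first_pass = [int(i) for i in valid_sampler]\n",
    "second_pass = [int(i) for i in valid_sampler]\n",
    "\n",
    "# Every validation index is visited exactly once per pass ...\n",
    "assert sorted(first_pass) == valid_indices.tolist()\n",
    "assert sorted(second_pass) == valid_indices.tolist()\n",
    "\n",
    "# ... and no validation index appears among the training indices.\n",
    "assert set(first_pass).isdisjoint(train_indices.tolist())\n",
    "\n",
    "print('Orders differ between passes:', first_pass != second_pass)"
   ]
  },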
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Image batch dimensions: torch.Size([64, 1, 28, 28])\n",
      "Image label dimensions: torch.Size([64])\n"
     ]
    }
   ],
   "source": [
    "# Checking the dataset\n",
    "for images, labels in train_loader:  \n",
    "    print('Image batch dimensions:', images.shape)\n",
    "    print('Image label dimensions:', labels.shape)\n",
    "    break"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor([5, 7, 4, 9, 1, 7, 4, 1, 6, 7])\n",
      "tensor([8, 2, 0, 7, 1, 3, 2, 6, 0, 4])\n"
     ]
    }
   ],
   "source": [
    "# Check that shuffling works properly\n",
    "# i.e., label indices should be in random order.\n",
    "# Also, the label order should be different in the second\n",
    "# epoch.\n",
    "\n",
    "for images, labels in train_loader:  \n",
    "    pass\n",
    "print(labels[:10])\n",
    "\n",
    "for images, labels in train_loader:  \n",
    "    pass\n",
    "print(labels[:10])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor([1, 0, 3, 7, 0, 7, 5, 6, 8, 3])\n",
      "tensor([1, 0, 3, 7, 0, 7, 5, 6, 8, 3])\n"
     ]
    }
   ],
   "source": [
    "# Check that shuffling works properly.\n",
    "# i.e., label indices should be in random order.\n",
    "# Via the fixed random seed, both epochs should return\n",
    "# the same label sequence.\n",
    "\n",
    "torch.manual_seed(123)\n",
    "for images, labels in train_loader:  \n",
    "    pass\n",
    "print(labels[:10])\n",
    "\n",
    "torch.manual_seed(123)\n",
    "for images, labels in train_loader:  \n",
    "    pass\n",
    "print(labels[:10])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}