{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# How to train your own models using timm? " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this tutorial, we will be looking at the training script of `timm`. There are various features that `timm` has to offer and some of them have been listed below: \n", "\n", "1. Auto Augmentation [paper](https://arxiv.org/abs/1805.09501)\n", "2. Augmix \n", "3. Distributed Training on multiple GPUs\n", "4. Mixed precision training \n", "5. Auxiliary Batch Norm for AdvProp [paper](https://arxiv.org/abs/1911.09665)\n", "6. Synchronized Batch Norm \n", "7. Mixup and Cutmix with an ability to switch between the two & also turn-off augmentation at a certain epoch \n", "\n", "`timm` also supports multiple optimizers & schedulers. In this tutorial we will be only be looking at the above 7 features and look at how you could utilize `timm` to use these features for your own experiments on a custom dataset. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As part of this tutorial, we will first start out with a general introduction to the training script and look at the various key steps that occur inside this script at a high-level. Then, we will look at some of the details of the above 7 features to get a further understanding of `train.py`. " ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "## Training args" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "The training script in `timm` can accept ~100 arguments. You can find more about these by running `python train.py --help`. These arguments are to define Dataset/Model parameters, Optimizer parameters, Learnining Rate scheduler parameters, Augmentation and regularization, Batch Norm parameters, Model exponential moving average parameters, and some miscellaneaous parameters such as `--seed`, `--tta` etc." ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "As part of this tutorial, we will be looking at how the training script makes use of these parameters from a high-level view. This could be beneficial for you to able to run your own experiments on ImageNet or any other custom dataset using `timm`. " ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "### Required args" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "The only argument required by `timm` training script is the path to the training data such as ImageNet which is structured in the following way: \n", "\n", "```\n", "imagenette2-320\n", "├── train\n", "│ ├── n01440764\n", "│ ├── n02102040\n", "│ ├── n02979186\n", "│ ├── n03000684\n", "│ ├── n03028079\n", "│ ├── n03394916\n", "│ ├── n03417042\n", "│ ├── n03425413\n", "│ ├── n03445777\n", "│ └── n03888257\n", "└── val\n", " ├── n01440764\n", " ├── n02102040\n", " ├── n02979186\n", " ├── n03000684\n", " ├── n03028079\n", " ├── n03394916\n", " ├── n03417042\n", " ├── n03425413\n", " ├── n03445777\n", " └── n03888257\n", "```" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "So to start training on this `imagenette2-320` we could just do something like `python train.py `." 
] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "### Default args" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "The various default args, in the training script are setup for you and what get's passed to the training script looks something like this: \n", "\n", "```python \n", "Namespace(aa=None, amp=False, apex_amp=False, aug_splits=0, batch_size=32, bn_eps=None, bn_momentum=None, bn_tf=False, channels_last=False, clip_grad=None, color_jitter=0.4, cooldown_epochs=10, crop_pct=None, cutmix=0.0, cutmix_minmax=None, data_dir='../imagenette2-320', dataset='', decay_epochs=30, decay_rate=0.1, dist_bn='', drop=0.0, drop_block=None, drop_connect=None, drop_path=None, epochs=200, eval_metric='top1', gp=None, hflip=0.5, img_size=None, initial_checkpoint='', input_size=None, interpolation='', jsd=False, local_rank=0, log_interval=50, lr=0.01, lr_cycle_limit=1, lr_cycle_mul=1.0, lr_noise=None, lr_noise_pct=0.67, lr_noise_std=1.0, mean=None, min_lr=1e-05, mixup=0.0, mixup_mode='batch', mixup_off_epoch=0, mixup_prob=1.0, mixup_switch_prob=0.5, model='resnet101', model_ema=False, model_ema_decay=0.9998, model_ema_force_cpu=False, momentum=0.9, native_amp=False, no_aug=False, no_prefetcher=False, no_resume_opt=False, num_classes=None, opt='sgd', opt_betas=None, opt_eps=None, output='', patience_epochs=10, pin_mem=False, pretrained=False, ratio=[0.75, 1.3333333333333333], recount=1, recovery_interval=0, remode='const', reprob=0.0, resplit=False, resume='', save_images=False, scale=[0.08, 1.0], sched='step', seed=42, smoothing=0.1, split_bn=False, start_epoch=None, std=None, sync_bn=False, torchscript=False, train_interpolation='random', train_split='train', tta=0, use_multi_epochs_loader=False, val_split='validation', validation_batch_size_multiplier=1, vflip=0.0, warmup_epochs=3, warmup_lr=0.0001, weight_decay=0.0001, workers=4)\n", "```\n", "\n", "Notice, that `args` is a `Namespace` which means we can set more along the way if needed by doing something like `args.new_variable=\"some_value\"`." ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "To get a one-line introduction of these various arguments, we can just do something like `python train.py --help`." ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "> NOTE: We will be looking at what these parameters in greater detail as part of this tutorial. Do note that some random augmentations are set by default such as `color_jitter`, `hfliip` but there is a parameter `no-aug` in case you wanted to turn of all training data augmentations. Also, the default optimizer `opt` is 'sgd' but it is possible to change that. `timm` offers a vast number of optimizers to train your models with. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The training script in 20 steps" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this section we will look at the various steps from a high level perspective that occur inside the training script. These steps have been outlined below in the correct order:\n", "\n", "1. Setup up distributed training parameters if `args.distributed` is `True`. \n", "2. Setup manual seed for reproducible results. \n", "3. **Create Model**: Create the model to train using `timm.create_model` function. \n", "4. Setup data config based on model's default config. 
In general, the default config of the model looks something like this: \n", "```python\n", "{'url': '', 'num_classes': 1000, 'input_size': (3, 224, 224), 'pool_size': (7, 7), 'crop_pct': 0.875, 'interpolation': 'bicubic', 'mean': (0.485, 0.456, 0.406), 'std': (0.229, 0.224, 0.225), 'first_conv': 'conv1', 'classifier': 'fc'}\n", "```\n", "5. Set up augmentation batch splits; if the number of augmentation batch splits is more than 1, convert all of the model's BatchNorm layers to Split Batch Normalization layers.\n", "\n", "> NOTE: I feel this needs a little more explanation. In general, when we train a model, we apply the data augmentation to the complete batch and then compute the batch norm statistics, such as mean and variance, from this complete batch. But as introduced in this [paper](https://arxiv.org/abs/1911.09665), sometimes it is beneficial to split the data into groups and use separate Batch Normalization layers for each, so that the groups are normalized independently throughout the training process. This is referred to as auxiliary batch norm in the paper and is referred to as `SplitBatchNorm2d` in `timm`. \n", "\n", "> NOTE: Let's assume that the number of augmentation batch splits is set to two. In that case, we split the data into two groups - one without any augmentations (referred to as clean) and one with augmentations. Then we use two separate batch normalization layers to normalize the two groups throughout the training process. \n", "\n", "6. If we are using multiple GPUs for training, then set up either Apex SyncBN or PyTorch native [SyncBatchNorm](https://pytorch.org/docs/stable/generated/torch.nn.SyncBatchNorm.html) to enable Synchronized Batch Normalization. This means that rather than normalizing the data on each individual GPU, we normalize the whole batch at once, spread across multiple GPUs. \n", "\n", "7. Make the model exportable using `torch.jit` if requested. \n", "8. Initialize the optimizer based on the arguments passed to the training script. \n", "9. Set up mixed precision, either using `apex.amp` or native torch amp (`torch.cuda.amp.autocast`). \n", "10. Load model weights if resuming from a model checkpoint.\n", "11. Set up an exponential moving average of the model weights. This is similar to [Stochastic Weight Averaging](https://pytorch.org/blog/pytorch-1.6-now-includes-stochastic-weight-averaging/). \n", "12. Set up distributed training based on the parameters from step 1.\n", "13. Set up the learning rate scheduler.\n", "14. Create the training and validation datasets.\n", "15. Set up Mixup/Cutmix data augmentation.\n", "16. Convert the training dataset to an `AugmixDataset` if the number of augmentation batch splits from step 5 is greater than 1. \n", "17. Create the training and validation data loaders.\n", "> NOTE: Transforms/Augmentations also get created inside the training dataloader.\n", "18. Set up the loss function. \n", "19. Set up model checkpointing and evaluation metrics. \n", "20. Train and validate the model, and store the eval metrics to an output file."
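, "\n", "\n", "The first few of these steps can also be reproduced interactively outside the training script. Below is a minimal sketch of steps 3 and 4, plus an optimizer built with `create_optimizer_v2` (available in recent `timm` versions); the model name, class count and optimizer settings here are illustrative placeholders rather than values taken from `train.py`: \n", "\n", "```python\n", "import timm\n", "from timm.data import resolve_data_config\n", "from timm.optim import create_optimizer_v2\n", "\n", "# Step 3: create the model to train\n", "model = timm.create_model('resnet34', pretrained=False, num_classes=10)\n", "\n", "# Step 4: resolve the data config from the model's default config\n", "data_config = resolve_data_config({}, model=model)\n", "print(data_config)  # input_size, interpolation, mean, std, crop_pct, ...\n", "\n", "# Step 8: initialize the optimizer from explicit keyword arguments\n", "optimizer = create_optimizer_v2(model, opt='sgd', lr=0.01, weight_decay=1e-4, momentum=0.9)\n", "```"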
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Some key `timm` features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Auto-Augment" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#hide\n", "# TODO: reference docs for auto-augment here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To enable auto augmentation during training - \n", "\n", "```python\n", "python train.py ./imagenette2-320 --aa 'v0'\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Augmix\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A brief introduction about augmix has been presented [here](https://fastai.github.io/timmdocs/dataset.html#AugmixDataset). To enable augmix during training, simply do: \n", "\n", "```python \n", "python train.py ./imagenette2-320 --aug-splits 3 --jsd \n", "```\n", "\n", "`timm` also supports augmix with `RandAugment` and `AutoAugment` like so: \n", "\n", "```python \n", "python train.py ./imagenette2-320 --aug-splits 3 --jsd --aa rand-m9-mstd0.5-inc1\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Distributed Training on multiple GPUs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To train models on multiple GPUs, simply replace `python train.py` with `./distributed_train.sh ` like so:\n", "\n", "```python\n", "./distributed_train.sh 4 ./imagenette2-320 --aug-splits 3 --jsd \n", "```\n", "\n", "This trains the model using `AugMix` data augmentation on 4 GPUs.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Mixed precision training" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To enable mixed precision training, simply add the `--amp` flag. `timm` will automatically implement mixed precision training either using `apex` or PyTorch Native mixed precision training. \n", "\n", "```python\n", "python train.py ../imagenette2-320 --aug-splits 3 --jsd --amp \n", "```\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Auxiliary Batch Norm/ `SplitBatchNorm`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#hide\n", "# TODO: refer blog post or tutorial here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the paper, \n", "\n", "```markdown\n", "Batch normalization serves as an essential component for many state-of-the-art computer vision models. Specifically, BN normalizes input features by the mean and variance computed within each mini-batch. **One intrinsic assumption of utilizing BN is that the input features should come from a single or similar distributions.** This normalization behavior could be problematic if the mini-batch contains data from different distributions, there- fore resulting in inaccurate statistics estimation.\n", "\n", "To disentangle this mixture distribution into two simpler ones respectively for the clean and adversarial images, we hereby propose an auxiliary BN to guarantee its normalization statistics are exclusively preformed on the adversarial examples.\n", "```\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To enable split batch norm, \n", "\n", "```python \n", "python train.py ./imagenette2-320 --aug-splits 3 --aa rand-m9-mstd0.5-inc1 --split-bn\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the above command, `timm` now has separate batch normalization layer for each augmentation split. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Synchronized Batch Norm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Synchronized batch norm is only used when training on multiple GPUs. From [papers with code](https://paperswithcode.com/method/syncbn):\n", "\n", "```markdown\n", "Synchronized Batch Normalization (SyncBN) is a type of batch normalization used for multi-GPU training. Standard batch normalization only normalizes the data within each device (GPU). SyncBN normalizes the input within the whole mini-batch.\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To enable, simply add `--sync-bn` flag like so: \n", "\n", "```python\n", "./distributed_train.sh 4 ../imagenette2-320 --aug-splits 3 --jsd --sync-bn \n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Mixup and Cutmix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To enable either mixup or cutmix, simply add the `--mixup` or `--cutmix` flag with alpha value. \n", "Default probability of applying the augmentation is 1.0. If you need to change it, use `--mixup-prob` argument with new value.\n", "\n", "For example, to enable mixup, \n", "```python \n", "train.py ../imagenette2-320 --mixup 0.5\n", "train.py ../imagenette2-320 --mixup 0.5 --mixup-prob 0.7\n", "```\n", "\n", "Or for Cutmix, \n", "```python \n", "train.py ../imagenette2-320 --cutmix 0.5\n", "train.py ../imagenette2-320 --cutmix 0.5 --mixup-prob 0.7\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is also possible to enable both, \n", "\n", "```python \n", "python train.py ../imagenette2-320 --mixup 0.5 --cutmix 0.5 --mixup-switch-prob 0.3\n", "```\n", "\n", "The above command will use either Mixup or Cutmix as data augmentation techniques and apply it to the batch with 50% probability. It will also switch between the two with 30% probability (Mixup - 70%, 30% switch to Cutmix)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is also a parameter to turn off Mixup/Cutmix augmentation at a certail epoch:\n", "\n", "```python \n", "python train.py ../imagenette2-320 --mixup 0.5 --cutmix 0.5 --mixup-switch-prob 0.3 --mixup-off-epoch 10\n", "```\n", "\n", "The above command only applies the Mixup/Cutmix data augmentation for the first 10 epochs. " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.6" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }