{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%reload_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import sys\n", "import csv\n", "import shutil\n", "from pathlib import Path\n", "from IPython.display import Audio, display\n", "import librosa\n", "import librosa.display\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "import torch\n", "import random" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#To have reproducible results with fastai you must also set num_workers=1 in your databunch, and seed=seed\n", "#in split_by_rand_pct\n", "seed = 42\n", "# python RNG\n", "random.seed(seed)\n", "# pytorch RNGs\n", "torch.manual_seed(seed)\n", "torch.backends.cudnn.deterministic = True\n", "if torch.cuda.is_available(): torch.cuda.manual_seed_all(seed)\n", "# numpy RNG\n", "np.random.seed(seed)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from fastai.vision import *\n", "from fastai.metrics import error_rate\n", "\n", "sys.path.append(\"..\")\n", "from audio import *" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook we will use the library on our first real world dataset to create a model that can identify a broad range of sounds (50 classes including pigs, wind, clapping, chainsaws...etc).\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# ESC-50: Dataset for Environmental Sound Classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**ESC-50** is a really nice starting dataset as it is especially clean (fixed-length, hand-labeled, single sample-rate) and well maintained. Many thanks to [Karol Piczak](https://github.com/karoldvl) for maintaining a really great [Github Repo](https://github.com/karoldvl/ESC-50) based around this dataset. The cell below with spectrograms and labels is taken directly from there, but the page itself is really worth viewing as they keep a leaderboard of different benchmarks/results/papers on the dataset which will allow us to see directly how we measure up." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
*(ESC-50 clip preview image)*
\n", "\n", "The **ESC-50 dataset** is a labeled collection of 2000 environmental audio recordings suitable for benchmarking methods of environmental sound classification.\n", "\n", "The dataset consists of 5-second-long recordings organized into 50 semantical classes (with 40 examples per class) loosely arranged into 5 major categories:\n", "\n", "| Animals | Natural soundscapes & water sounds | Human, non-speech sounds | Interior/domestic sounds | Exterior/urban noises |\n", "| :--- | :--- | :--- | :--- | :--- |\n", "| Dog | Rain | Crying baby | Door knock | Helicopter |\n", "| Rooster | Sea waves | Sneezing | Mouse click | Chainsaw |\n", "| Pig | Crackling fire | Clapping | Keyboard typing | Siren |\n", "| Cow | Crickets | Breathing | Door, wood creaks | Car horn |\n", "| Frog | Chirping birds | Coughing | Can opening | Engine |\n", "| Cat | Water drops | Footsteps | Washing machine | Train |\n", "| Hen | Wind | Laughing | Vacuum cleaner | Church bells |\n", "| Insects (flying) | Pouring water | Brushing teeth | Clock alarm | Airplane |\n", "| Sheep | Toilet flush | Snoring | Clock tick | Fireworks |\n", "| Crow | Thunderstorm | Drinking, sipping | Glass breaking | Hand saw |\n", "\n", "\n", "\n", "Clips in this dataset have been manually extracted from public field recordings gathered by the **[Freesound.org project](http://freesound.org/)**. The dataset has been prearranged into 5 folds for comparable cross-validation, making sure that fragments from the same original source file are contained in a single fold.\n", "\n", "A more thorough description of the dataset is available in the original [paper](http://karol.piczak.com/papers/Piczak2015-ESC-Dataset.pdf) with some supplementary materials on GitHub: **[ESC: Dataset for Environmental Sound Classification - paper replication data](https://github.com/karoldvl/paper-2015-esc-dataset)**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Code to download/install dataset here\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "PATH_DATA = Path(\"../data/ESC-50/\")\n", "#! wget -P {PATH_DATA} https://github.com/karoldvl/ESC-50/archive/master.zip" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "PATH_ZIP = PATH_DATA/\"master.zip\"\n", "#! unzip {PATH_ZIP} -d {PATH_DATA}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "PATH_BASE = Path(PATH_DATA/\"ESC-50-master\")\n", "PATH_AUDIO = PATH_BASE/\"audio\"\n", "#PATH_CSV = PATH_BASE/\"meta/esc50.csv\"\n", "PATH_CSV = \"../meta/esc50.csv\"\n", "DF = pd.read_csv(PATH_AUDIO/PATH_CSV)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Leakage in Audio" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When working with audio, you have to be especially careful not to leak data between your training, validation and test sets. For instance, in ESC-50, some of the clips are generated by splitting up files into several shorter clips (e.g. taking 15 seconds of guitar, and splitting it into three 5 second clips). If those files aren't kept together in one fold, and some end up in training and others in the validation set, our model may learn features we aren't interested in, like background noise, or something particular to the microphone that was used, to identify the label. \n", "\n", "Another example is speaker identification. 
If you have users record their voices on their own devices and then train a model, it may learn to identify the quirks of their microphone, or their environment, rather than the unique features of their voice. Thus you might have a model with 99.8% accuracy, but when you test it by using input from one device for all speakers, it may fail to generalize. \n", "\n", "In this case the data is split into 5 folds, and any clips that have been split up from one longer original are all together in the same fold. This strict segregation of data will also allow us to compare results to others knowing that we are all using the same validation set. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1: Training on a single fold" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we set our spectrogram config. Most of the values below are defaults but I've made them explicit just to show the setup. Since the data is so clean, we need almost nothing for our `AudioConfig`, just caching and passing in the `SpectrogramConfig`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sg_cfg= SpectrogramConfig(hop_length=512, n_mels=128, n_fft=1024, top_db=80, f_min=20.0, f_max=22050.0)\n", "cfg = AudioConfig(cache=True, sg_cfg=sg_cfg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can set our validation fold to five and split our files up into Train (folds 1, 2, 3, and 4) and Valid (fold 5). " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "VAL_FOLD = 5\n", "FILES_TRAIN = [f for f in DF.loc[DF['fold'] != VAL_FOLD].filename]\n", "FILES_VALID = [f for f in DF.loc[DF['fold'] == VAL_FOLD].filename]\n", "len(FILES_TRAIN), len(FILES_VALID)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we load up an AudioList with our data, using `split_by_files` to separate valid and training. We don't do any transforms because we will be using mixup, which is made incredibly easy by fastai and works very well for acoustic scene classification.\n", "\n", "In my (brief) experience, SpecAugment (putting bars over the spectrogram to hide info as a form of data augmentation) works better for speech data, and has little impact for scenes, but you should experiment further with this and normal fastai image transforms. 
Most image transforms don't make sense for spectrograms, but some people in our audio thread have reported small gains by using a few limited transforms.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#num_workers = 1 for reproducibility, see https://docs.fast.ai/dev/test.html#getting-reproducible-results\n", "#tfms = get_spectro_transforms(mask_frequency=False, mask_time=False, size=(256,430))\n", "tfms=None\n", "db_audio = (AudioList.from_csv(PATH_AUDIO, \"../meta/esc50.csv\", config=cfg)\n", " .split_by_files(FILES_VALID).label_from_df(\"category\")\n", " .transform(tfms=tfms)\n", " .databunch(bs=16, num_workers = 1))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "db_audio.show_batch()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "db_audio.train_ds[0][0].shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "db_audio.show_batch(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When using mixup on audio, I've found much better results using densenets (121 and 161) but this is without having done a comprehensive search of the available architectures. More work is needed here from the community. Resnets up to resnet50 don't seem to be deep enough as training loss is always higher than validation (I haven't tried lowering other forms of regularization to reduce this)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "learn = audio_learner(db_audio, models.densenet161, metrics=accuracy, callback_fns=ShowGraph, pretrained=True).mixup()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "learn.lr_find(); learn.recorder.plot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "learn.fit_one_cycle(10, slice(1e-2))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "learn.save('ESC50-stage1')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "learn.export('models/895peakacc-stg1')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "interp = ClassificationInterpretation.from_learner(learn)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "interp.plot_confusion_matrix(figsize=(15,15))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a closer look at our most frequently confused classes, we can call the aptly named `most_confused` function. All of these seem like classes that could be hard to differentiate, so we are on the right track. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "interp.most_confused(min_val=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can call `plot_top_losses` to both see and hear the clips that are fooling our model. If you listen it becomes clear just how hard some of these are to distinguish. Others seem like something the model shouldn't be getting wrong, and are a good place to start trying to look for ways to improve the model. 
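\n",
"\n",
"As a quick aside, here is a minimal sketch of how you could listen to any raw clip directly (here, the first file of the validation fold). It only uses the `librosa`, `Audio` and `display` imports from the top of the notebook and assumes `PATH_AUDIO/FILES_VALID[0]` still points at one of the ESC-50 wav files:\n",
"\n",
"```python\n",
"# Illustrative sketch: load one validation clip at its native sample rate and play it inline\n",
"y, sr = librosa.load(str(PATH_AUDIO/FILES_VALID[0]), sr=None)\n",
"display(Audio(y, rate=sr))\n",
"```\n",
"\n",
"Spot-checking clips this way pairs nicely with the confusion matrix and `plot_top_losses` when hunting for mislabeled or genuinely ambiguous examples.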
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "interp.plot_top_losses(9, heatmap=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Training on all folds" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because we have limited data, it is best practice to repeat the process cycling the validation fold to make sure we aren't just overfitting that one fold, we will repeat the process cycling the validation fold. For the sake of time, and because I clearly haven't figured out how to fine-tune mixup yet, and unfreeze/layer_groups might not be working on our learner, we will just do 100 epochs at 1e-2.\n", "\n", "**Side note**: Sometimes you'll see this referred to as \"k-folds validation\" in kaggle competitions. The idea is that you split your data up into k (usually 5 or 10) different folds and then train models cycling the training/validation sets. The point is that by doing this on 5 folds, you'll have 5 different models that can then be used to perform inference, ensembling the various predictions. The cost is high though as you will increase your training time by 5x for a small gain. I also don't fully understand why this is better than just using 100% of your data and no validation once you have a training method you've already validated. If you can explain why, please post in the audio thread and I'll update/credit you here. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "results = []\n", "fold_5_result = learn.validate()\n", "results.append(fold_5_result)\n", "results" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#it is range(1,5), not range(1,6) because we already used fold 5 as validation fold\n", "for VAL_FOLD in range(1,5):\n", " FILES_TRAIN = [f for f in DF.loc[DF['fold'] != VAL_FOLD].filename]\n", " FILES_TEST = [f for f in DF.loc[DF['fold'] == VAL_FOLD].filename]\n", " audio_train = AudioList.from_csv(PATH_AUDIO, PATH_CSV, config=cfg).split_by_files(FILES_TEST).label_from_df(\"category\")\n", " db_audio = audio_train.transform(tfms=None).databunch(bs=16, num_workers=1)\n", " learn = audio_learner(db_audio, models.densenet161, metrics=accuracy, callback_fns=ShowGraph, pretrained=True).mixup()\n", " learn.fit_one_cycle(10, slice(1e-2))\n", " results.append(learn.validate())\n", " print(results)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "results" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pickle\n", "with open('results.pkl', 'wb') as f:\n", " pickle.dump(results, f)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "average_validation = sum([score for _, score in results])/5" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "average_validation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note, the first time I ran this it produced the numbers above, but on loading the exported model and revalidating, it's accuracy on fold 5 was .8750, which would bring the overall average validation down to 0.8865" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": 
"python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.5" } }, "nbformat": 4, "nbformat_minor": 2 }