{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%reload_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sys\n", "sys.path.append(\"..\")\n", "from audio import *" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# FastAI Audio Features Notebook" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook is a fairly comprehensive look at the features of the library. It is meant to be oriented to beginners, and the main point is not to understand every single thing that is happening, but to understand how to copy/use the code on your own datasets so that you can try to solve audio problems in your domain without a huge degree of signals processing expertise. \n", "\n", "This is a long notebook and is best consumed by hopping around and checking out different features that you would like to try using now, and then trying them with your code/data. If you get stuck, reach out on the fastai forums in the [fastai audio thread](https://forums.fast.ai/t/deep-learning-with-audio-thread/38123) or contact us via PM [@baz](https://forums.fast.ai/u/baz/) or [@madeupmasters](https://forums.fast.ai/u/MadeUpMasters/) We also have a telegram group for audio ML. If you would like to join, message us on the forums.\n", "\n", "The code is set up so you can jump from place to place trying things out." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Just load this small speaker recognition dataset first" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "label_pattern = r'_([mf]\\d+)_'\n", "data_url = 'http://www.openslr.org/resources/45/ST-AEDS-20180100_1-OS'\n", "data_folder = datapath4file(url2name(data_url))\n", "if not os.path.exists(data_folder): untar_data(data_url, dest=data_folder)\n", "audios = AudioList.from_folder(data_folder)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Table of Contents\n", "1. [Preprocessing Features](#Preprocessing-Features)\n", " 1. [Resampling](#Resampling)\n", " 2. [Silence Removal](#Silence-Removal)\n", " 1. [Silence Trimming](#trim)\n", " 2. [Split by Silence](#split)\n", " 3. [Remove Silence From Middle](#all)\n", " 3. [Segmentation](#Segmentation)\n", " 4. [Caching](#Caching)\n", "2. [Generating Images from Audio](#Generating-Images-from-Audio)\n", " 1. [Spectrogram Generation](#Spectrogram-Generation)\n", " 1. [duration](#duration)\n", " 2. [max_to_pad](#max-to-pad)\n", " 2. [Spectrogram Configuration and Fine Tuning](#Spectrogram-Configuration-and-Fine-Tuning)\n", " 3. [Mel Frequency Cepstral Coefficients(MFCC)](#Mel-Frequency-Cepstral-Coefficients)\n", " 4. [Delta and Acceleration Stacking](#Delta-and-Acceleration-Stacking)\n", "3. [Transforms](#Transforms)\n", " 1. [Mixup](#Mixup)\n", " 2. [Transform Manager](#Transform-Manager)\n", " 3. [Resizing](#Resizing)\n", " 4. [SpecAugment](#SpecAugment)\n", " 1. [Frequency Masking](#Frequency-Masking)\n", " 2. [Time Masking](#Time-Masking)\n", " 5. [Rolling](#Rolling)\n", "4. 
"4. [Conclusion](#Conclusion)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "audios" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "audios[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Preprocessing Features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Preprocessing options currently consist of resampling, silence removal, and segmenting the clips, in that order. This happens when you label your items (e.g. call `label_from_folder()`). These actions can take several minutes, and possibly longer for large datasets, so we automatically cache the results for you so that the process doesn't have to be repeated. This happens even if `cache_spectro=False` in your config; that setting is for spectrogram caching only. This allows you to try different configurations as quickly as possible. The caching happens at each stage, so if you change your settings for silence removal, the library will not need to repeat the resampling, but will instead pull the resampled files from the cache and resume silence removal." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Resampling" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you have multiple sample rates, you will have to resample to a single sample rate, as your images won't otherwise be comparable (the time axis (x-axis) of the spectrogram will have varied scales). \n", "\n", "Also, you may sometimes want to resample from high sample rates to low sample rates. This will allow you to represent longer durations in the same spectrogram width (compressing the time axis). You can also achieve a similar effect by increasing the hop length of the spectrogram.\n", "\n", "Keep in mind that by downsampling you will be throwing away any frequency information that is above 1/2 your new sample rate. For example, at 16000hz, you will only be able to accurately represent frequencies 0-8000hz, so downsampling from 44100hz to 16000hz, you will lose information for frequencies in the range 8000hz-22050hz. For human voice this is okay; for music, it isn't. For more info, read about the Nyquist theorem." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Resampling is as simple as setting the `resample_to` attribute of your config to the sample rate you want**; let's resample to 8000hz. This can be done after creation, or you can pass `resample_to=8000` as an argument by typing `config = AudioConfig(resample_to=8000)`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "label_pattern = r'_([mf]\\d+)_'\n", "config = AudioConfig()\n", "config.resample_to = 8000\n", "rs_audio = AudioList.from_folder(data_folder, config=config).split_by_rand_pct(.2, seed=4).label_from_re(label_pattern)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rs_audio" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we have the same number of files, but they've all been resampled to 8000hz.\n", "\n", "For efficiency we use a polyphase resampling method instead of an FFT-based one. This will be faster except in rare cases where the greatest common divisor of the old sample rate and the new sample rate is low (< 20). Since we mostly use round numbers, this doesn't tend to happen, but if you suddenly decide to resample to a prime sample rate, you may see your resampling time balloon upward." ] },
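{ "cell_type": "markdown", "metadata": {}, "source": [ "As a rough illustration of why the GCD matters (this is just a sketch using Python's standard library, not the library's resampling code): polyphase resampling upsamples by a factor `up` and downsamples by a factor `down`, where both factors come from dividing the two rates by their greatest common divisor. The smaller the GCD, the larger the factors and the more work the resample has to do." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import math\n", "\n", "def polyphase_factors(orig_sr, new_sr):\n", "    # polyphase resampling upsamples by up, then downsamples by down;\n", "    # both factors shrink as the GCD of the two rates grows\n", "    g = math.gcd(orig_sr, new_sr)\n", "    return new_sr // g, orig_sr // g\n", "\n", "print(polyphase_factors(44100, 8000))  # (80, 441) -> cheap\n", "print(polyphase_factors(44100, 8009))  # prime target rate -> (8009, 44100) -> expensive" ] },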
\n", "\n", "Remember we are caching so while it took ~13 seconds to resample the first time, creating a new LabelList with the same settings and data will be nearly instant" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "config = AudioConfig()\n", "config.resample_to = 8000\n", "rs_audio = AudioList.from_folder(data_folder, config=config).split_by_rand_pct(.2, seed=4).label_from_re(label_pattern)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Silence Removal" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes we have a dataset that is full of lengthy clips with lots of silence. A real world example is a marine biologist trying to identify whale calls in a recording. Most of the recording will be silence, along with some occasional noise that we need to classify (call or not a call). It would be helpful to remove the silence and split the clip into separate files with each noise so that we can build a classifier. \n", "\n", "Another example is trying to guess what is happening in an audio clip (acoustic scene classification), some clips may be 30 seconds long, but with only 10 seconds of actual content in the middle. Since we only grab small time chunks when training, that model will be grabbing silent sections and associating them with the label, thus wasting time and possibly causing underperformance. It would be better if we trimmed the silence from the edges during preprocessing and spent more time training on the content we are interested in. \n", "\n", "By default `config.remove_silence` is set to `None`, but you can choose to set it to `trim`, `split` or `all`. \n", "- `trim` will remove any leading and trailing silence.\n", "- `split` will split the clip into multiple clips at points of silence, removing most of the silence but leaving a bit of padding to keep it smooth.\n", "- `all` will return the same info as `split` but in a single clip, all concatenated together.\n", "\n", "To determine what is considered silence, set `config.silence_threshold` to an int (default is 20, unit is decibels). To determine how much padding to use, set `config.silence_padding` to an int (default is 200, unit is ms). \n", "\n", "Note that we are creating a fresh AudioConfig object each time. It may seem a bit redundant but it allows you to go through the notebook out of order if you wish." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `trim`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "config_trim= AudioConfig(remove_silence = \"trim\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "config_trim.remove_silence, config_trim.silence_padding, config_trim.silence_threshold" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# adjust the silence padding to 100ms and preprocess\n", "config_trim.silence_padding = 100\n", "audio_trim=AudioList.from_folder(data_folder, config=config_trim).split_by_rand_pct(.2, seed=4).label_from_re(label_pattern)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's listen to the first item from our audios, and the first item with silence trimming to hear the difference." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "audios[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "audio_trim.train[0][0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Maybe that's a little too tightly cut and we want to keep a little more of the silence. All we do is set `config.silence_padding` to 500ms and run again" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "config_trim.silence_padding = 500\n", "audio_trim=AudioList.from_folder(data_folder, config=config_trim).split_by_rand_pct(.2, seed=4).label_from_re(label_pattern)\n", "audio_trim.train[0][0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `split`\n", "\n", "Split will find any point where the audio is less than `silence_threshold` db, for more than `2*silence_padding` length of time, and split at that point into separate clips. Since the current dataset we are using is not a good candidate, we will switch briefly to the whale clip. Since it's a long clip with lots of silence, let's adjust the silence padding to be 1000ms so that we make sure to not break up any calls unless they are surrounded by a full second of silence on both sides." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "config_split= AudioConfig(remove_silence = \"split\", silence_padding=1000, silence_threshold=20)\n", "audio_split = AudioList.from_folder('../data/misc/whale', config=config_split).split_none().label_from_func(lambda x: \"whale\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "audio_split" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below you can see how the first clip has been split into 9 separate clips at points where there is 1000ms of sound that measures less than 20db, and the silence at the start and end has been removed. After removing excess silence, we now have about 30 seconds of content rich audio instead of 90 seconds of sparse audio. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Original audio:\")\n", "path_example = Path('../data/misc/whale/Right_whale.wav')\n", "open_audio(path_example).show()\n", "print(\"Split audio with silence removed:\")\n", "for a in audio_split.train:\n", " a[0].show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `all`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Setting `config.remove_silence` to `all` will do the same thing as `split` but will concatenate it all together into one clip. It should be rare that you need to use this setting, but it can be useful in cases like acoustic scene classification where you have some clips with too much silence throughout the middle, and lots of starting/stopping. For instance, with our whale call example above, it may not be important for use to identify the actual calls, but just to determine that this is audio of a whale. In that case, jamming it all together into a smaller clip would be fine. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "config_all= AudioConfig(remove_silence = \"all\", silence_padding=1000, silence_threshold=20)\n", "audio_all = AudioList.from_folder('../data/misc/whale', config=config_all).split_none().label_from_func(lambda x: \"whale\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Audio with outer and middle silence removed:\")\n", "audio_all.train[0][0].show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For one final example, let's lower the padding and get something really compact. I wouldn't recommend this setting for most cases but we think it is good to have the option." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "config_all.silence_padding=100\n", "audio_all = AudioList.from_folder('../data/misc/whale', config=config_all).split_none().label_from_func(lambda x: \"whale\")\n", "audio_all.train[0][0].show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Segmentation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Segmentation will chop your audio clips up into equal intervals for you. For example if you have a 7.2s clip, and would like 1s long clips, it will chop it into 8 1-second-long clips (the last will be padded to be a full second). \n", "\n", "Note: It is recommended instead to set the `duration` attribute of your config to the number of milliseconds you want your spectrograms to be. Only use segmentation if you want the actual underlying audio clips to be chopped into pieces and saved. In most cases this will be slower, consume more memory, and train to a lower accuracy. \n", "\n", "The code below will trim silence at start and end, and then create equal size (500ms) chunks for us so that we can compare to\n", "the example that just removes silence. Note that we now have around 24,000 clips." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "config_segment = AudioConfig(remove_silence = \"trim\", segment_size = 500)\n", "audio_segment = AudioList.from_folder(data_folder, config=config_segment).split_by_rand_pct(.2, seed=4).label_from_re(label_pattern)\n", "audio_segment" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Segmented audio:\")\n", "for i in range(6):\n", " print(audio_segment.train[i][0].path)\n", " audio_segment.train[i][0].show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Caching" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Spectrogram generation is done through TorchAudio, but even optimized for torch, discrete fourier transforms are slow (several ms per item) and are a time bottleneck. Read/write from disk is almost always faster, so we offer the option to cache files (.pt files, saved torch tensors). If you set `cache_spectro = True`, your spectrograms will be saved to `cache_folder` inside of the same folder where your data is stored. You shouldn't need to change this, and the current implementation is a bit rigid, so you can only choose the subfolder of your datafolder where it is located, you cannot currently choose a location outside of that folder. 
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "config.cache_dir" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Files that start with a . are hidden in Linux, so if you're searching in the terminal you may need to type `ls -a`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! ls {config.cache_dir}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The spectrograms are stored in a folder named with the hash of the set of settings you used. The preprocessing is stored in its own folders: rs_8000 is resampling to 8000hz, sh_20-200 is silence removal with a threshold of 20dB and padding of 200ms, and s_500 is segmenting into 500ms chunks." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that for large datasets the cache can potentially hold huge amounts of data (~20-50GB), so if you are working with a large dataset you may want to clear the cache each time you change settings. To check how much disk space the cache is using, call `.cache_size()` on your config object, and you'll get back a tuple containing an int (size in bytes) and a string (representation in MB)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "config.cache_size()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To clear the cache, just call `.clear_cache()` on your config. Every time the library adds a file to the cache, the path of that file is stored in a list in the cache so that it may be safely removed later." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "config.clear_cache()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "config.cache_size()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you get the message `Cache not found, try calling again after creating your AudioList`, it is because the AudioConfig doesn't know where your data is when first initialized; it's only when you create an AudioList and pass in your config that the `cache_dir` can be linked to your data folder. If for whatever reason `clear_cache()` is not removing certain files, you can also clear the directory manually with:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!rm -rf {str(data_folder / '.cache')}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Generating Images from Audio" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using raw audio with deep neural nets is promising, but has had mixed results and is much more expensive to train. The vast majority of models use a spectral extraction of the audio rather than the raw audio itself. The most common is the melspectrogram (if you don't know what a spectrogram or the mel scale is, please see the Intro to Audio notebook)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Spectrogram Generation\n", "\n", "The library will automatically create melspectrograms for you on the fly, transform them, and train. To do this you just need to set `use_spectro=True`. Note that the following code should raise a warning about not being able to collate samples into a batch." ] },
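{ "cell_type": "markdown", "metadata": {}, "source": [ "Before letting the library do this for us, it can help to see roughly what generating one melspectrogram involves. The sketch below does it by hand with torchaudio; the parameter values are illustrative and not necessarily the library's defaults, and it assumes `audios.items` holds file paths (as fastai ItemLists do)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import torchaudio\n", "\n", "# pick any file from the dataset\n", "fname = audios.items[0]\n", "signal, sr = torchaudio.load(str(fname))\n", "\n", "# waveform -> mel-scaled power spectrogram -> decibels\n", "mel = torchaudio.transforms.MelSpectrogram(sample_rate=sr, n_fft=1024, hop_length=256, n_mels=128)\n", "to_db = torchaudio.transforms.AmplitudeToDB(top_db=80)\n", "spec = to_db(mel(signal))\n", "print(spec.shape)  # (channels, n_mels, time_frames)" ] },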
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#use spectrograms (this by default is true, and currently there is no real way to use raw audio)\n", "config_sg = AudioConfig(use_spectro=True)\n", "audios_sg = AudioList.from_folder(data_folder, config=config_sg).split_by_rand_pct(.2, seed=4).label_from_re(label_pattern)\n", "db_sg = audios_sg.databunch(bs=64)\n", "db_sg.show_batch()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see we are generating spectrograms, but they are unequal widths, because of the varying durations of the audio. As mentioned in the Getting Started guide, you can fix this by either setting `duration` (both are how long you want clips to be in ms). Duration should train better in almost all cases so we'll go with that. If your audios are of exactly the same length (rarely the case) you can proceed without setting `duration`/`max_to_pad`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `duration`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Setting duration will generate a full spectrogram, but at train time, grab a random section that is equivalent to `duration` ms of audio. If the entire audio is shorter than the duration specified, it will pad with zeros by default, or will repeat the spectrogram if `pad_mode='repeat'`. Below we set the duration to 10s just to demonstrate the two padding types" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "config_duration = AudioConfig(use_spectro=True, duration=10000, pad_mode=\"zeros\")\n", "audios_duration = AudioList.from_folder(data_folder, config=config_duration ).split_by_rand_pct(.2, seed=4).label_from_re(label_pattern)\n", "db_duration = audios_duration.databunch(bs=64)\n", "db_duration.show_batch(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Same config but using repeat padding" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "config_duration = AudioConfig(use_spectro=True, duration=10000, pad_mode=\"repeat\")\n", "audios_duration = AudioList.from_folder(data_folder, config=config_duration ).split_by_rand_pct(.2, seed=4).label_from_re(label_pattern)\n", "db_duration = audios_duration.databunch(bs=64)\n", "db_duration.show_batch(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Back to 2000 ms duration and zero pad for actual training. \n", "\n", "Note that duration will also tell you which part of the clip you're listening to and seeing, and it will be different every time you look at an item." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "config_duration = AudioConfig(use_spectro=True, duration=2000, pad_mode=\"zeros\")\n", "audios_duration = AudioList.from_folder(data_folder, config=config_duration ).split_by_rand_pct(.2, seed=4).label_from_re(label_pattern)\n", "db_duration = audios_duration.databunch(bs=64)\n", "db_duration.show_batch(2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "learn = audio_learner(db_duration)\n", "learn.lr_find(); learn.recorder.plot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "learn.fit_one_cycle(5, slice(2e-3, 2e-2))\n", "learn.unfreeze()\n", "learn.fit_one_cycle(10, slice(1e-3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `max_to_pad `" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**It is recommended that you use `duration` instead of max_to_pad, so feel free to skip to the [next section](#Spectrogram-Configuration-and-Fine-Tuning)**\n", "\n", "`max_to_pad` is an alternate option that will trim or pad the audio signal (with zeros) to be a fixed length (`max_to_pad` ms long). We don't currently allow repeat padding for signals but may in a future version. If repeat pad is essential, use `duration`. \n", "\n", "`duration` will perform better because instead of taking just the first 2000ms of audio, it will take the equivalent of 2000ms at random from the spectrogram and will not throw away any data. This will be more and more important the lower you set `duration`/`max_to_pad`, or the more variation of audio length in your dataset there is" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "config_max_to_pad = AudioConfig(use_spectro=True, max_to_pad=2000)\n", "audios_max_to_pad = AudioList.from_folder(data_folder, config=config_max_to_pad).split_by_rand_pct(.2, seed=4).label_from_re(label_pattern)\n", "db_max_to_pad = audios_max_to_pad.databunch(bs=64)\n", "db_max_to_pad.show_batch(2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "learn = audio_learner(db_max_to_pad)\n", "learn.fit_one_cycle(5, slice(2e-3, 2e-2))\n", "learn.unfreeze()\n", "learn.fit_one_cycle(10, slice(1e-3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Spectrogram Configuration and Fine Tuning\n", "\n", "If you've explored the config object at all, you may notice it has an `sg_cfg` inside of it. This is where all your sg settings are held, and by adjusting them like any other hyperparameter you can train your models to higher degrees of accuracy. We won't cover every setting here, just enough to get more detailed spectrograms. For a deep-dive on each of these settings, see the **Intro to Audio** guide. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "config_tune = AudioConfig(use_spectro=True, duration=2000)\n", "config_tune.sg_cfg" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can alter these settings and the spectrograms will change. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "Here we make the spectrograms larger by increasing the number of mel bins (taller spectrogram) and decreasing the hop length (wider spectrogram)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sg_cfg_tune = SpectrogramConfig(hop_length=256, n_mels=192)\n", "config_tune.sg_cfg = sg_cfg_tune\n", "audios_tune = AudioList.from_folder(data_folder, config=config_tune).split_by_rand_pct(.2, seed=4).label_from_re(label_pattern)\n", "db_tune = audios_tune.databunch(bs=64)\n", "db_tune.show_batch(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's train and check our results." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "learn = audio_learner(db_tune)\n", "learn.fit_one_cycle(5, slice(2e-3, 2e-2))\n", "learn.unfreeze()\n", "learn.fit_one_cycle(10, slice(1e-3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It looks like these settings might give us slightly better results, but it's hard to tell with such an easy dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Mel-Frequency Cepstral Coefficients" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Mel-frequency cepstral coefficients (MFCCs) are a form of audio feature extraction used in speech/voice recognition. They can train to pretty high accuracy extremely quickly, but given the limited amount of data they provide to the model, they often underperform melspectrograms in our testing. They can sometimes beat melspectrograms, though, and are also worth considering as part of an ensemble.\n", "\n", "Trying them out is fast and easy: just set `config.mfcc = True` and your melspectrogram will be replaced by an MFCC. Note that `use_spectro` should still be True. You can also set the number of coefficients by altering your SpectrogramConfig's `n_mfcc` setting; below, that would mean `config_mfcc.sg_cfg.n_mfcc = 40` or whatever value you want. Your image will then be `n_mfcc` pixels tall, in this case the default of 20." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "config_mfcc = AudioConfig(use_spectro=True, duration=4000, mfcc=True)\n", "audios_mfcc = AudioList.from_folder(data_folder, config=config_mfcc).split_by_rand_pct(.2, seed=4).label_from_re(label_pattern)\n", "db_mfcc = audios_mfcc.databunch(bs=64)\n", "db_mfcc.show_batch()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "learn = audio_learner(db_mfcc)\n", "learn.lr_find(); learn.recorder.plot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "learn.fit_one_cycle(5, slice(3e-3, 3e-2))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "learn.unfreeze()\n", "learn.fit_one_cycle(10, slice(1e-3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Delta and Acceleration Stacking" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Delta and acceleration appending/stacking means you take the 1st derivative (delta) and 2nd derivative (acceleration) of your features and pass them to your model in some way (in our case by stacking them in the 2nd and 3rd channels). This is very common in audio ML papers, and to reproduce it all you need to do is set `delta=True`; the 2nd and 3rd channels of your image will then be the delta and acceleration of your spectrogram/MFCC.\n", "\n", "The library will display the channels to you as separate one-channel images, because combined into a single RGB image they would be nonsensical.\n", "\n", "Also, you may need to lower the batch size when stacking, as 3x the memory will be consumed." ] },
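{ "cell_type": "markdown", "metadata": {}, "source": [ "As a sketch of the idea (using torchaudio, not the library's exact implementation): the delta is just a smoothed finite difference over the time axis, the acceleration is the delta of the delta, and the three are stacked as channels." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import torch\n", "import torchaudio\n", "\n", "compute_deltas = torchaudio.transforms.ComputeDeltas(win_length=5)\n", "\n", "def stack_delta_accel(feat):\n", "    # feat: (n_coeffs, time) -> (3, n_coeffs, time)\n", "    delta = compute_deltas(feat)   # 1st derivative over the time axis\n", "    accel = compute_deltas(delta)  # 2nd derivative (acceleration)\n", "    return torch.stack([feat, delta, accel])\n", "\n", "example = torch.randn(20, 401)           # e.g. a 20-coefficient MFCC\n", "print(stack_delta_accel(example).shape)  # torch.Size([3, 20, 401])" ] },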
\n", "\n", "The library will display the channels to you as separate 1 channel images because combined as 1 RGB image, it will be nonsensical. \n", "\n", "Also you may need to lower batch size when stacking, as 3x the memory will be consumed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "config_mfcc_stack = AudioConfig(mfcc=True, delta=True, duration=4000)\n", "audios_mfcc_stack = AudioList.from_folder(data_folder, config=config_mfcc_stack).split_by_rand_pct(.2, seed=4).label_from_re(label_pattern)\n", "db_mfcc_stack = audios_mfcc_stack.databunch(bs=32)\n", "db_mfcc_stack.show_batch(5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "learn = audio_learner(db_mfcc_stack)\n", "learn.lr_find(); learn.recorder.plot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "learn.fit_one_cycle(5, slice(3e-3, 3e-2))\n", "learn.unfreeze()\n", "learn.fit_one_cycle(10, slice(1e-3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Transforms" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above section covered basic training and the types of features you can extract to feed to an image classifier. Once those spectrograms are generated, you can perform real-time transforms on them in order to prevent overfitting and obtain better results. The normal transforms you would use for an image don't apply to spectrograms. For example, a horizontal flip of a cat is still a cat, but a horizontal flip of a spectrogram represents a different sound (something close to the reverse of it) and for most cases, this would change classification. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Mixup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Mixup isn't a typical transform, and is actually applied directly to the learner. It takes two images of different classifications and combines them into one. Thanks to fastai callbacks, mixup in audio is the easiest thing ever, and it works really well. Just add .mixup when you create your learner and you're all set. Check out the competition notebooks to see it in action. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "config_sg = AudioConfig(use_spectro=True, duration=4000)\n", "db_sg = (AudioList.from_folder(data_folder, config=config_sg).\n", " split_by_rand_pct(.2, seed=4).\n", " label_from_re(label_pattern).\n", " databunch(bs=32))\n", "learn = audio_learner(db_sg).mixup()\n", "learn.lr_find();learn.recorder.plot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "learn.fit_one_cycle(5, slice(3e-3, 3e-2))\n", "learn.unfreeze()\n", "learn.fit_one_cycle(10, slice(1e-3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that our training loss is much higher the first epoch using mixup, but we are still generalizing extremely well. This allows us to keep training longer and eventually reach higher accuracies. The effect is more pronounced on tougher datasets like those in the competition notebook. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Transform Manager" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get our transforms, use `get_spectro_transforms()`. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "It returns two lists of transforms: the first contains transforms to be run on your training set, and the second contains validation set transforms. By default it will perform frequency masking, time masking, and rolling on the training set (described below), but not on the validation set." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tfms = get_spectro_transforms(); tfms" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Resizing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recently [@mnpinto](https://forums.fast.ai/u/mnpinto/summary) discovered that increasing image size using bilinear interpolation, while not adding any new information, can sometimes make models perform better. You can pass a `size` tuple formatted as `(height, width)` to `get_spectro_transforms` to upsample your images and see if they produce better results. The implementation is a bit hacky and we are currently working to use the same method as fastai ImageLists, which involves passing an int or tuple to `.transform()` along with your tfms.\n", "\n", "Currently our images are (128, 250); let's double the height by passing in `size=(256,250)` while also turning off the other transforms." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tfms = get_spectro_transforms(size=(256,250), mask_time=False, mask_frequency=False, roll=False); tfms" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "config_sg = AudioConfig(use_spectro=True, duration=4000)\n", "db_sg = (AudioList.from_folder(data_folder, config=config_sg).\n", "    split_by_rand_pct(.2, seed=4).\n", "    label_from_re(label_pattern).\n", "    transform(tfms).\n", "    databunch(bs=32))\n", "db_sg.show_batch()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SpecAugment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just this year Google released a paper called [SpecAugment](https://arxiv.org/abs/1904.08779) in which they simply block out portions of the spectrogram with the (per-channel) mean values, and it was highly effective, reaching new state-of-the-art results. This idea of performing transforms directly on the spectrogram is especially appealing since we cache spectrograms, thus allowing us to perform real-time augmentations at high throughput. The alternative pipeline involves augmenting the raw audio (which can be computationally expensive) and then having to regenerate new spectrograms every epoch. \n", "\n", "We implemented the two most important components of the paper, frequency masking and time masking, along with some parameters that allow you to customize them. We plan to add time_warping at a later date, but the paper shows it to be the least effective (and most complicated) of the three." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also customize the transforms by passing arguments to `get_spectro_transforms()`; these options are specified in the individual sections below." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Frequency Masking" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Frequency masking just means putting horizontal bars on the spectrogram to hide information from the model, in the hope that it will learn to generalize better. By adding a horizontal bar, you are effectively removing (or masking) information about the range of frequencies that the bar blocks on the spectrogram (e.g. maybe it is masking the info contained in the range 4230hz-7392hz). A rough sketch of the masking idea follows the argument list below.\n", "\n", "Arguments that can be passed to `get_spectro_transforms()` to customize frequency masking:\n", "1. `fmasks=1` - the number of masks to create\n", "2. `num_rows=30` - how many rows it should mask (1 row = 1 pixel)\n", "3. `start_row=None` - do you want it to start at a certain row? If None, it will choose a row at random each time (recommended)\n", "4. `fmask_value=None` - do you want it to mask the spectrogram with a specific value? If None, it will use the mean of the channel.\n", "\n", "Note that you may need to adjust `num_rows` based on how tall your particular image is. If it's 32px tall, masking 30 rows is not going to train well. If it's 460 pixels tall, it probably won't have much effect." ] },
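{ "cell_type": "markdown", "metadata": {}, "source": [ "Under the hood a frequency mask is very simple. The sketch below is our own illustration (not the library's code): it overwrites `num_rows` consecutive rows of each channel with that channel's mean. A time mask is the same idea applied to columns." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import random\n", "import torch\n", "\n", "def freq_mask_(spec, num_rows=30, start_row=None, mask_value=None):\n", "    # spec: (channels, n_mels, time); masks rows in place\n", "    n_mels = spec.shape[1]\n", "    num_rows = min(num_rows, n_mels)\n", "    start = start_row if start_row is not None else random.randint(0, n_mels - num_rows)\n", "    for c in range(spec.shape[0]):\n", "        value = mask_value if mask_value is not None else spec[c].mean()\n", "        spec[c, start:start + num_rows, :] = value\n", "    return spec\n", "\n", "masked = freq_mask_(torch.randn(1, 128, 250), num_rows=30)" ] },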
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tfms = get_spectro_transforms(mask_time=False, mask_freq=True, roll=False); tfms" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "config_sg = AudioConfig(use_spectro=True, duration=4000)\n", "db_sg = (AudioList.from_folder(data_folder, config=config_sg).\n", "    split_by_rand_pct(.2, seed=4).\n", "    label_from_re(label_pattern).\n", "    transform(tfms).\n", "    databunch(bs=32))\n", "db_sg.show_batch()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# use 4 masks of 5 rows each and set the mask_value to be 42\n", "tfms = get_spectro_transforms(mask_time=False, mask_freq=True, roll=False, fmasks=4, num_rows=5, fmask_value=42)\n", "db_sg = (AudioList.from_folder(data_folder, config=config_sg).\n", "    split_by_rand_pct(.2, seed=4).\n", "    label_from_re(label_pattern).\n", "    transform(tfms).\n", "    databunch(bs=32))\n", "db_sg.show_batch()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Time Masking" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Time masking is much the same as frequency masking, except the bars are vertical, thus blocking time info, and the arguments used to customize it have slightly different names:\n", "\n", "1. `tmasks=1` - t instead of f, the number of masks to create\n", "2. `num_cols=30` - how many cols it should mask (1 col = 1 pixel)\n", "3. `start_col=None` - do you want it to start at a certain col? If None, it will choose a col at random each time (recommended)\n", "4. `tmask_value=None` - do you want it to mask the spectrogram with a specific value? If None, it will use the mean of the channel." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# now let's try both time and frequency masking, but tone down the mask sizes a bit\n", "config_sg = AudioConfig(use_spectro=True, duration=4000)\n", "tfms = get_spectro_transforms(mask_time=True, mask_freq=True, roll=False, num_rows=12, num_cols=8); tfms\n", "db_sg = (AudioList.from_folder(data_folder, config=config_sg).\n", "    split_by_rand_pct(.2, seed=4).\n", "    label_from_re(label_pattern).\n", "    transform(tfms).\n", "    databunch(bs=32))\n", "db_sg.show_batch()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Rolling" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rolling a spectrogram just means shifting it along its x-axis and wrapping around from the end to the beginning, so as not to lose any information. Whether this is useful to you depends on your data. Ask yourself if your data would make sense a bit out of order. Traffic noise wrapped around would still be traffic noise, but the word \"hello\" wouldn't be the word \"hello\". Also, if you are using mixup and SpecAugment, you may have confused your model sufficiently that rolling isn't going to help. Still, we've found it to help in some cases, so it is included.\n", "\n", "Note that the green bar you're seeing is not a mask, but is actually the padding at the end of the spectrogram that has now been rolled randomly to somewhere in the middle." ] },
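{ "cell_type": "markdown", "metadata": {}, "source": [ "Rolling is essentially a one-liner on the spectrogram tensor. Here is a minimal sketch (our own, not the library's code; `max_shift_pct` is a made-up parameter for the illustration)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import random\n", "import torch\n", "\n", "def roll_spectro(spec, max_shift_pct=0.5):\n", "    # shift along the time axis by a random amount, wrapping around to the start\n", "    width = spec.shape[-1]\n", "    shift = random.randint(0, int(width * max_shift_pct))\n", "    return torch.roll(spec, shifts=shift, dims=-1)\n", "\n", "rolled = roll_spectro(torch.randn(1, 128, 250))" ] },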
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "config_sg = AudioConfig(use_spectro=True, duration=4000)\n", "tfms = get_spectro_transforms(mask_time=True, mask_freq=True, roll=True, num_rows=14, num_cols=10); tfms\n", "db_sg = (AudioList.from_folder(data_folder, config=config_sg).\n", "    split_by_rand_pct(.2, seed=4).\n", "    label_from_re(label_pattern).\n", "    transform(tfms).\n", "    databunch(bs=32))\n", "db_sg.show_batch()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, let's train our speaker model using SpecAugment and rolling and see how it does. Keep in mind that we are hiding information from the model as it trains, so it may initially seem to be worse (after the same number of epochs), but this type of augmentation will allow you to train longer without overfitting, sometimes reaching higher accuracies than would be possible without it.\n", "\n", "You can see below from `lr_find()` that the curve isn't as steep." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "learn = audio_learner(db_sg)\n", "learn.lr_find(); learn.recorder.plot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "learn.fit_one_cycle(5, slice(2e-3, 2e-2))\n", "learn.unfreeze()\n", "learn.fit_one_cycle(10, slice(1e-3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Conclusion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's it! From here we recommend you go code and try this stuff out on your own dataset. Follow Jeremy's advice: change settings, see what goes in and what comes out, and adjust. Once you've done that and feel comfortable with the settings, check out our Kaggle audio competition notebooks to see how to tune things and get world-class results. \n", "\n", "Also, we would love feedback, bug reports, feature requests, and whatever else you have to offer. We welcome contributors of all skill levels. If you need to get in touch for any reason, please post in the [fastai audio thread](https://forums.fast.ai/t/deep-learning-with-audio-thread/38123) or contact us via PM [@baz](https://forums.fast.ai/u/baz/) or [@madeupmasters](https://forums.fast.ai/u/MadeUpMasters/). Let's build an audio machine learning community!"
] } ], "metadata": { "_draft": { "nbviewer_url": "https://gist.github.com/44b813541fecdcfe05eb80093f67a97b" }, "gist": { "data": { "description": "tutorials/02_Features.ipynb", "public": false }, "id": "44b813541fecdcfe05eb80093f67a97b" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.5" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }