In [None]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

# FastAI Audio

This notebook shows you the fastest way to get started with FastAI audio by demonstrating only the most essential functionality. The `examples` folder includes a number of other notebooks that show more features and teach you about audio in general. If you'd like to follow along in a Colab notebook, please click [here](https://colab.research.google.com/drive/1s0Ouw5PxvrmHdm_gBU0qiA6piOf3VSWO) and copy it into your own Google Drive.

First, import fastai audio. This imports all the dependencies you will need to work with audio.

In [None]:
from audio import *

## AudioItem

Here we create an `AudioItem` by passing a filename (either `str` or `PosixPath`) to `open_audio()`. Displaying the item lets us listen to the clip and see some information about the audio.

In [None]:
path_example = Path('data/misc/whale/Right_whale.wav')
sound = open_audio(path_example)
sound

This clip is about 87.73 seconds long. Audio is a continuous wave that is "sampled" by measuring the amplitude of the wave at regularly spaced points in time. The number of samples taken per second is called the "sample rate" and can be thought of as the resolution of the audio. In our example, the audio was sampled 44100 times per second, so our data is a rank 1 tensor with length 44100 * duration in seconds = 3869019 samples.
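To make the arithmetic concrete, here is a minimal sketch in plain Python (it does not use the library) that samples a sine wave and checks that the number of samples is the sample rate times the duration. The helper `sample_sine` and the 440 Hz tone are illustrative assumptions, not part of fastai audio.

```python
import math

def sample_sine(freq_hz, duration_s, sr):
    """Sample a sine wave of `freq_hz` for `duration_s` seconds at `sr` samples per second."""
    n_samples = int(round(duration_s * sr))
    return [math.sin(2 * math.pi * freq_hz * n / sr) for n in range(n_samples)]

sr = 44100
signal = sample_sine(freq_hz=440, duration_s=0.5, sr=sr)
print(len(signal))        # number of samples in half a second of audio
print(len(signal) / sr)   # duration recovered from length and sample rate

# Going the other way for the whale clip: samples / sample rate = seconds
print(round(3869019 / 44100, 2))
```

The same division works for any clip: once you know the length of the signal and its sample rate, you know how long it lasts.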

If any of this is new to you, definitely check out our **Intro to Audio Notebook** in the `examples` folder.

In [None]:
sound.shape

### Important attributes inside of an AudioItem

In [None]:
# sig means signal: a rank 1 tensor with the amplitudes sampled from the raw sound wave
sound.sig

In [None]:
#sr means sample rate
sound.sr

In [None]:
#path is a reference to the location of the sound file
sound.path

## AudioList and Speaker Recognition Example

We'll work with a fairly small dataset that has 10 speakers, 5 male and 5 female, with the goal of recognizing who is speaking.

We can download the data into our default fastai data directory.

In [None]:
data_url = 'http://www.openslr.org/resources/45/ST-AEDS-20180100_1-OS'
data_folder = datapath4file(url2name(data_url))
untar_data(data_url, dest=data_folder)

We first create an `AudioList`. This extends the fastai `ItemList`, so you can use other methods like `from_csv()` to load your data as well.

In [None]:
audios = AudioList.from_folder(data_folder)

Because audio data can be so variable, we provide a convenience function `.stats()` that displays a list of sample rates and how many files have each sample rate, as well as a plot of the lengths, in seconds, of the audio files in your `AudioList`. You can also specify `prec` to set the number of digits the file lengths are rounded to before plotting the graph (default is 0). Expect it to take about 2 seconds per 5000 files in your dataset; a progress bar is provided.

In [None]:
len_dict = audios.stats(prec=1)

`stats` returns a dictionary with the file lengths and file names, so that you can do with it what you want.

One option is to call `get_outliers`, which returns a sorted list of tuples containing the filename and length of each file that is more than `devs` (float) standard deviations from the mean length. This can be helpful for weeding out bad data.

In [None]:
outliers = get_outliers(len_dict, devs=3)
print("Total Outliers:", len(outliers))
outliers[:10]
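The idea behind this kind of outlier check can be sketched in a few lines of plain Python. This is not the library's implementation: the function name `find_length_outliers`, the assumption that the dictionary maps file names to lengths in seconds, and the use of the population standard deviation are all illustrative choices.

```python
from statistics import mean, pstdev

def find_length_outliers(lengths, devs=3.0):
    """Return (filename, length) pairs more than `devs` standard deviations from the mean length."""
    values = list(lengths.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # every file has the same length, so nothing stands out
    flagged = [(name, secs) for name, secs in lengths.items()
               if abs(secs - mu) > devs * sigma]
    return sorted(flagged, key=lambda pair: pair[1])  # shortest first

# Hypothetical data: twenty 4-second clips and one 40-second clip
toy_lengths = {f"clip_{i:02d}.wav": 4.0 for i in range(20)}
toy_lengths["long_clip.wav"] = 40.0
print(find_length_outliers(toy_lengths, devs=3))
```

A smaller `devs` flags more files, so it is worth listening to a few of the flagged clips before deciding whether to drop them.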

The `stats` method showed us that this dataset has only one sample rate. If you have multiple sample rates, you will need to resample to a single sample rate by setting `resample_to` in the configuration settings. If you want to do any customization, you'll need to pass a config object to the AudioList constructor, so before we go any further, here's how to use it.

## Audio Configuration

All config settings are managed through an `AudioConfig` object. It also contains a `SpectrogramConfig` object that holds settings related to spectrograms and MFCC (mel-frequency cepstral coefficients). The inner config can be changed just like the outer one by nesting, for instance `config.sg_cfg.top_db=80`.

In [None]:
config = AudioConfig()
config
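If the nesting is unfamiliar, the pattern can be sketched with two small dataclasses. The classes `ToyAudioConfig` and `ToySpectrogramConfig` below are hypothetical stand-ins, and their default values are placeholders rather than the library's defaults; only the field names `duration`, `max_to_pad`, `resample_to`, `sg_cfg` and `top_db` come from this notebook.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ToySpectrogramConfig:
    top_db: int = 100  # placeholder default, not the library's value

@dataclass
class ToyAudioConfig:
    duration: Optional[int] = None     # milliseconds
    max_to_pad: Optional[int] = None   # milliseconds
    resample_to: Optional[int] = None  # target sample rate in Hz
    # Each outer config gets its own inner config object
    sg_cfg: ToySpectrogramConfig = field(default_factory=ToySpectrogramConfig)

toy = ToyAudioConfig()
toy.duration = 4000      # outer setting, set directly
toy.sg_cfg.top_db = 80   # inner setting, reached through the nested object
print(toy)
```

Using `default_factory` means two configs never share the same inner object, so changing `top_db` on one does not silently change another.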

As you can see, there are tons of features here, most of which you will not need to adjust to get pretty good results. If you plan on doing a lot of work on audio, or have a dataset with lots of silence or a wide variety of audio lengths, check out our **Features Notebook** in the `examples` folder. It shows when and how to adjust each of these settings.

For now we will only cover the most essential features: `resample_to`, `max_to_pad` and `duration`.

### `duration` and `max_to_pad`

Eventually, our audio will become spectrograms (visual representations of audio that can be passed to an image classifier). As with images, it is important that our spectrograms be the same size so that the GPU can handle them efficiently. Since audio clips rarely have precisely equal length, we give you two options for generating fixed-width spectrograms. Which one is best for you will depend on the nature of your data. If your data varies in length by even a moderate amount, you will want to use `duration`.

1. Specify the `duration` setting of your config. This will compute the spectrogram using the entire clip regardless of length, but at train time will grab random sections that are `duration` milliseconds long. If duration is greater than the length of the clip, it will pad your spectrogram with zeroes to be the same length as the others. 

2. Set the `max_to_pad` attribute of your config (in milliseconds) to be the length you want your audio to be. This will pad or trim the underlying audio, and then generate spectrograms from that resulting audio. It will zero-pad clips that are too short, and trim clips that are too long, throwing away the remaining data. 
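The difference between the two options can be sketched in plain Python. This is an illustration only, not the library's code: the helper names are hypothetical, the 16 kHz sample rate is an assumption, and each spectrogram column is stood in for by a single number to keep the example short.

```python
import random

def ms_to_samples(ms, sr):
    """Convert a length in milliseconds to a number of samples."""
    return int(sr * ms / 1000)

def pad_or_trim(signal, target_len):
    """`max_to_pad`-style: zero-pad short signals, trim long ones and discard the tail."""
    if len(signal) >= target_len:
        return signal[:target_len]
    return signal + [0.0] * (target_len - len(signal))

def random_crop_or_pad(columns, crop_len, rng=random):
    """`duration`-style: take a random window of `crop_len` columns, or zero-pad on the right."""
    if len(columns) <= crop_len:
        return columns + [0.0] * (crop_len - len(columns))
    start = rng.randrange(len(columns) - crop_len + 1)
    return columns[start:start + crop_len]

sr = 16000                         # assumed sample rate
target = ms_to_samples(4000, sr)   # 4000 ms expressed in samples
short_clip = [0.1] * (2 * sr)      # 2-second clip
long_clip = [0.1] * (6 * sr)       # 6-second clip

# Both clips end up with the same length; the long one loses its last 2 seconds
print(len(pad_or_trim(short_clip, target)), len(pad_or_trim(long_clip, target)))

# A random 4-column window from a 10-column "spectrogram"
print(random_crop_or_pad(list(range(10)), 4, random.Random(0)))
```

With the crop-style approach, a different window can be drawn each time an item is requested, so over many epochs the model still sees all of a long clip.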

For this dataset, let's use `duration` so we don't throw away data from the longer clips, and let's use 4000ms (4s).

In [None]:
config.duration = 4000

### `resample_to`

It is also important that all of the data has the same sample rate. If one spectrogram has a sample rate of 44100, and another's is 16000, the x-axis of each spectrogram will represent a different amount of time, and thus they won't be comparable. So if you see more than one sample rate when you call the `.stats()` method above, you will need to set `resample_to` to an int representing the sample rate you wish to use. It is best practice to use common sample rates (44100, 22050, 16000 or 8000) as they will be faster to resample.

For our data, there is no need to resample, but if we did, the code to downsample to 8000 would just be `config.resample_to=8000`.
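A quick calculation shows why mixed sample rates make spectrograms hard to compare. The sketch below assumes a spectrogram that advances by a fixed hop of 256 samples per column; the hop value and the helper names are assumptions for illustration, not settings read from the library.

```python
def seconds_per_column(hop_length, sr):
    """Time covered by one spectrogram column when the hop is `hop_length` samples."""
    return hop_length / sr

def columns_per_clip(duration_s, hop_length, sr):
    """Approximate number of spectrogram columns for a clip, ignoring edge padding."""
    return int(duration_s * sr) // hop_length

hop_length = 256  # assumed hop, in samples
for sr in (44100, 16000):
    # Same 4-second clip, same hop, very different widths and time per column
    print(sr, seconds_per_column(hop_length, sr), columns_per_clip(4.0, hop_length, sr))
```

At 44100 Hz each column covers far less time than at 16000 Hz, so the same four seconds of audio produces a much wider spectrogram unless everything is resampled to one rate first.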

## Creating a databunch

Now we follow the normal fastai datablock API, making sure to pass our config to the `AudioList`.

In [None]:
label_pattern = r'_([mf]\d+)_'
audios = AudioList.from_folder(data_folder, config=config).split_by_rand_pct(.2, seed=4).label_from_re(label_pattern)
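To see what the label pattern captures, you can try it on a few names with Python's `re` module. The file names below are made up for illustration, and the sketch assumes the pattern is searched for anywhere in the file path, with the first capture group used as the label.

```python
import re

label_pattern = r'_([mf]\d+)_'

def label_from_name(filename, pattern=label_pattern):
    """Return the first capture group of `pattern` found in `filename`, or None if absent."""
    match = re.search(pattern, filename)
    return match.group(1) if match else None

# Hypothetical file names: speaker id sits between underscores
for name in ["f0001_us_f0001_00001.wav", "m0003_us_m0003_00042.wav", "noise.wav"]:
    print(name, "->", label_from_name(name))
```

A name that does not contain the pattern yields no label, so it is worth checking a few of your own file names this way before building the databunch.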

Fastai Audio performs on-the-fly data augmentation directly on spectrograms. Try uncommenting the second line and playing around with the transform manager. For more detail, check out the **Features Notebook**.

In [None]:
tfms = None
#tfms = get_spectro_transforms(mask_time=False, mask_freq=True, roll=False, num_rows=12)
db = audios.transform(tfms).databunch(bs=64)
db.show_batch(20)

When audio is longer than the duration you've selected for training, it is cropped at random, and those items will tell you what time portion of the original audio clip the spectrogram and displayed audio represent. It will appear as '2.53s-6.53s of original clip'. Clips that are shorter than `duration` are padded with zeros, which appear as a blue-green bar on the right-hand side of the spectrogram.

# Learner

An audio learner takes a databunch, `base_arch` (optional, defaults to resnet18 for now), and `metrics` (optional, defaults to accuracy) and returns a `cnn_learner`. For now it is just a wrapper, but additional functionality is coming soon.

In [None]:
learn = audio_learner(db)
learn.lr_find()
learn.recorder.plot()

In [None]:
learn.fit_one_cycle(5, slice(2e-3, 2e-2))

In [None]:
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()

# Conclusion 

With 30 seconds of compute, and no preprocessing or fine-tuning, you just created a speaker-recognition system with 99% accuracy.
But this is really just scratching the surface, so please check out our other notebooks in the `examples` folder and see what else is possible.

# Acknowledgements
This library builds on the work of many others. It is of course built on top of fastai, so thank you to **Jeremy, Rachel, Stas, Sylvain** and countless others. It is a fork of https://github.com/zcaceres/fastai-audio and so we owe a lot to **@aamir7117 @marii @simonjhb @ste @ThomM @zachcaceres**. And it is built on top of torchaudio, which helps us do many things much faster than would otherwise be possible. Thanks as well to those who have been active in the [fastai audio thread](https://forums.fast.ai/t/deep-learning-with-audio-thread/38123).

We would also love feedback, bug reports, feature requests, and whatever else you have to offer. We welcome contributors of all skill levels. If you need to get in touch for any reason, please post in the [fastai audio thread](https://forums.fast.ai/t/deep-learning-with-audio-thread/38123) or contact us via PM [@baz](https://forums.fast.ai/u/baz/) or [@madeupmasters](https://forums.fast.ai/u/MadeUpMasters/). Let's build an audio machine learning community!