{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Applying Rocket to Raw Audio\n", "\n", "Note: this notebook is messy, the result of rapidly prototyping Rocket for raw audio without following best practices. It also uses the beta version of fastai v2 for silence removal and preprocessing. If you are interested in experimenting yourself and can't make sense of something here, please reach out to [me](https://forums.fast.ai/u/madeupmasters/summary) by PM, or in the [Deep Learning with Audio](https://forums.fast.ai/t/deep-learning-with-audio-thread) or [Time Series](https://forums.fast.ai/t/time-series-sequential-data-study-group) threads.\n", "\n", "This notebook applies the findings of the recent [Rocket paper](https://arxiv.org/abs/1910.13051) by Angus Dempster, François Petitjean, and Geoffrey I. Webb to 1D raw audio signals for the task of voice recognition. Some of this code is adapted from [Ignacio Oguiza](https://forums.fast.ai/u/oguiza) and his [Time Series Module for fastai v1](https://github.com/timeseriesAI/timeseriesAI).\n", "\n", "Initially the signals were too long to train quickly at a sample rate of 16000: training on 1s clips was going to take ~30-40 minutes for a 3,800-clip, 10-class dataset (a small problem that trains to 99%+ accuracy in 2 minutes with the typical audio pipeline of spectrogram + CNN). Adding a stride to the kernel application sped things up without a drop in accuracy, but still only reached 85% accuracy after 4 minutes of training. Removing silence and doubling the clip length to 2s gave great results (95% accuracy in 6 seconds of training, 98.6% in 20 seconds, 99.2% in 1 minute 20 seconds).\n", "\n", "Unfortunately, so far I have not been able to scale these results to harder problems, such as a 250-speaker dataset. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary of Findings\n", "\n", "To spare you having to go through this whole notebook, here's a summary of the interesting results:\n", "\n", " - 95% accuracy in 6 seconds of training, 98.6% in 20 seconds, 99.2% in 1 minute 20 seconds on a 10-class problem using raw audio and no augmentation\n", " - A stride of 5-7 seems to be an optimal balance between computational cost and accuracy\n", " - Bigger filter sizes help only up to length 7 or 9; beyond that they decrease accuracy while increasing computational cost\n", " - A mix of filter sizes (7, 9, 11) doesn't seem to beat the individual filter sizes 7, 9, or 11 used alone, but more testing is needed on a harder dataset\n", " - Testing tons of individual kernels on random subsets of the data and then selecting the best ones actually results in lower accuracy than the same number of random kernels (this is likely due to increased correlation between the good kernels, making them worse as an ensemble)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%reload_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from local.torch_basics import *\n", "from local.test import *\n", "from local.basics import *\n", "from local.data.all import *\n", "from local.vision.core import *\n", "from local.notebook.showdoc import show_doc\n", "from local.audio.core import *\n", "from local.audio.augment import *\n", "from local.vision.learner import *\n", "from local.vision.models.xresnet import *\n", "from local.metrics import *\n", "from local.callback.schedule import *\n", "import torchaudio\n", "from fastprogress import progress_bar as pb\n", "import time\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.linear_model import RidgeClassifierCV\n", "from rocket import generate_kernels, apply_kernel, apply_kernels" ] }, { 
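"cell_type": "markdown", "metadata": {}, "source": [ "The stride modification described above can be sketched as follows. `apply_kernel_strided` is a hypothetical helper written purely for illustration (it is not part of the `rocket` module, whose imported `apply_kernels` is far faster): it computes the two Rocket features for a single kernel, proportion of positive values (ppv) and max activation, while stepping through the signal `stride` samples at a time:\n", "\n", "```python\n", "import numpy as np\n", "\n", "def apply_kernel_strided(x, weights, bias, dilation=1, stride=5):\n", "    # Slide one Rocket kernel over a 1D signal, stepping by `stride`,\n", "    # and return the two Rocket features: proportion of positive\n", "    # values (ppv) and the maximum activation.\n", "    klen = len(weights)\n", "    span = (klen - 1) * dilation  # input span covered by the kernel\n", "    n_out = (len(x) - span - 1) // stride + 1\n", "    best, positives = -np.inf, 0\n", "    for i in range(n_out):\n", "        start = i * stride\n", "        s = bias\n", "        for j in range(klen):\n", "            s += weights[j] * x[start + j * dilation]\n", "        best = max(best, s)\n", "        positives += s > 0\n", "    return positives / n_out, best\n", "```\n", "\n", "With `stride=5`, only every fifth position is evaluated, giving roughly a fivefold reduction in per-kernel cost." ] }, { 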
"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "PosixPath('/home/jupyter/.fastai/data/ST-AEDS-20180100_1-OS/ST-AEDS-20180100_1-OS')" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "p10speakers = Config()['data_path'] / 'ST-AEDS-20180100_1-OS'\n", "untar_data(URLs.SPEAKERS10, fname=str(p10speakers)+'.tar', dest=p10speakers)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "get_audio = AudioGetter(\"\", recurse=True, folders=None)\n", "files_10 = get_audio(p10speakers)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(#3842) [/home/jupyter/.fastai/data/ST-AEDS-20180100_1-OS/f0004_us_f0004_00446.wav,/home/jupyter/.fastai/data/ST-AEDS-20180100_1-OS/m0002_us_m0002_00128.wav,/home/jupyter/.fastai/data/ST-AEDS-20180100_1-OS/f0003_us_f0003_00279.wav,/home/jupyter/.fastai/data/ST-AEDS-20180100_1-OS/f0001_us_f0001_00168.wav,/home/jupyter/.fastai/data/ST-AEDS-20180100_1-OS/f0005_us_f0005_00286.wav,/home/jupyter/.fastai/data/ST-AEDS-20180100_1-OS/m0005_us_m0005_00282.wav,/home/jupyter/.fastai/data/ST-AEDS-20180100_1-OS/f0005_us_f0005_00432.wav,/home/jupyter/.fastai/data/ST-AEDS-20180100_1-OS/f0005_us_f0005_00054.wav,/home/jupyter/.fastai/data/ST-AEDS-20180100_1-OS/m0004_us_m0004_00110.wav,/home/jupyter/.fastai/data/ST-AEDS-20180100_1-OS/m0003_us_m0003_00180.wav...]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "files_10" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "audio_opener = OpenAudio(files_10)\n", "p10_labeler = lambda x: str(x).split('/')[-1][:5] #grab the label from each file" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "CLIP_LENGTH = 2000" ] }, { "cell_type": "code", "execution_count": null, "metadata": 
{}, "outputs": [ { "data": { "text/html": [ "\n", "