{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# Hey, Jetson!\n", "\n", "### By Brice Walker\n", "\n", "[View the full project on GitHub](https://github.com/bricewalker/Hey-Jetson)\n", "\n", "[Render the notebook with nbviewer](https://nbviewer.jupyter.org/github/bricewalker/Hey-Jetson/blob/master/Speech.ipynb)\n", "\n", "\n", "## Outline\n", "\n", "- [Introduction](#intro)\n", "- [Importing Libraries](#libraries)\n", "- [Importing The Dataset](#data)\n", "- [Acoustic Feature Extraction/Engineering for Speech Recognition](#features)\n", "- [Visualizing The Data](#plotting)\n", " - [Raw Audio](#raw)\n", " - [Spectrograms](#spectograms)\n", " - [Mel-Frequency Cepstral Coefficients](#mfcc)\n", "- [Deep Neural Networks for Acoustic Modeling](#deeplearning)\n", " - [RNN](#rnn)\n", " - [RNN + TimeDistributed Dense](#rnntd)\n", " - [CNN + RNN + TimeDistributed Dense](#cnnrnn)\n", " - [Deeper RNN + TimeDistributed Dense](#deeprnn)\n", " - [Bidirectional RNN + TimeDistributed Dense](#bidirectional)\n", " - [CNN + Deeper Bidirectional RNN + TimeDistributed Dense](#cnndeepbi)\n", " - [Advanced Modeling Techniques](#modeling)\n", " - [Dropout](#dropout)\n", " - [Dilated Convolutions](#dilated)\n", " - [Aggregate Models](#aggregate)\n", " - [Deep Aggregate Model](#deepagg)\n", " - [Attention](#attention)\n", " - [Final Model](#final)\n", "- [Visualizing The Final Model Architecture](#architecture)\n", "- [Comparing Models](#selection)\n", " - [Final Model Performance](#test)\n", " - [Cosine Similarity](#similarity)\n", " - [Word Error Rate](#error_rate)\n", " - [Benchmarking Performance](#benchmark)\n", "- [Conclusion](#conclusion)\n", "\n", "\n", "## Introduction\n", "\n", "This project builds a scalable attention based speech recognition platform in Keras/Tensorflow for inference on the Nvidia Jetson. This real world application of automatic speech recognition was inspired by my career in mental health. 
This project begins a journey towards building a platform for real-time therapeutic intervention inference and feedback. The ultimate intent was to build a tool that can give therapists real-time feedback on the efficacy of their interventions, but this has many applications in mobile, robotics, or other areas where cloud-based deep learning is not desirable.\n", "\n", "This notebook explores three common ways of visualizing/mathematically representing audio for use in machine learning models. This project then walks you through the construction of a series of increasingly complex character-level phonetics sequencing models. For this project, I have chosen Recurrent Neural Networks, as they allow us to harness the power of deep neural networks for time-sequence problems and allow fast training on GPUs compared to other models. I chose character-level phonetics modeling as it provides a more accurate depiction of language and would allow building a system that can pick up on the nuances of human-to-human communication in deeply personal conversations. Additionally, this notebook explores measures of model performance and makes predictions based on the trained models.\n", "\n", "The final production model has a word error rate of roughly 16% and a cosine similarity score of about 79%.\n", "\n", "### Automatic Speech Recognition\n", "Speech recognition models are based on a statistical optimization problem called the fundamental equation of speech recognition. Given a sequence of observations, we look for the most likely character or word sequence. So, using Bayes' theorem, we are looking for the sequence that maximizes the posterior probability of the character sequence given the observations. 
The speech recognition problem is a search over this model for the best character sequence.\n", "\n", "Character level speech recognition can be broken into two parts; the acoustic model, that describes the distribution over acoustic observations, O, given the character sequence, C; and the language model based solely on the character sequence which assigns a probability to every possible character sequence. This sequence to sequence model combines both the acoustic and language models into one neural network, though pretrained acoustic models are available from [kaldi](http://www.kaldi-asr.org/downloads/build/6/trunk/egs/) if you would like to speed up training.\n", "\n", "### Problem Statement\n", "My goal was to build a character-level ASR system using an encoder/decoder based recurrent neural network with an attention mechanism in TensorFlow that can run inference on an Nvidia Jetson with a word error rate of <20%." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Using TensorFlow backend.\n" ] }, { "data": { "text/html": [ "" ], "text/vnd.plotly.v1+html": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/vnd.plotly.v1+html": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Common, File Based, and Math Imports\n", "import pandas as pd\n", "import numpy as np\n", "import collections\n", "import os\n", "from os.path import isdir, join\n", "from pathlib import Path\n", "from subprocess import check_output\n", "import sys\n", "import math\n", "import pickle\n", "from glob import glob\n", "import random\n", "from random import sample\n", "import json\n", "from mpl_toolkits.axes_grid1 import make_axes_locatable\n", "from numpy.lib.stride_tricks import as_strided\n", "from tqdm import tqdm\n", "\n", "# Audio processing\n", "from scipy import signal\n", "from scipy.fftpack import dct\n", "import soundfile\n", 
"import json\n", "from python_speech_features import mfcc\n", "import scipy.io.wavfile as wav\n", "from scipy.fftpack import fft\n", "\n", "# Neural Network\n", "import keras\n", "from keras.utils.generic_utils import get_custom_objects\n", "from keras import backend as K\n", "from keras import regularizers, callbacks\n", "from keras.constraints import max_norm\n", "from keras.models import Model, Sequential, load_model\n", "from keras.layers import Input, Lambda, Dense, Dropout, Flatten, Embedding, merge, Activation, GRUCell, LSTMCell,SimpleRNNCell\n", "from keras.layers import Convolution2D, MaxPooling2D, Convolution1D, Conv1D, SimpleRNN, GRU, LSTM, CuDNNLSTM, CuDNNGRU, Conv2D\n", "from keras.layers.advanced_activations import LeakyReLU, PReLU, ThresholdedReLU, ELU\n", "from keras.layers import LeakyReLU, PReLU, ThresholdedReLU, ELU\n", "from keras.layers import BatchNormalization, TimeDistributed, Bidirectional\n", "from keras.layers import activations, Wrapper\n", "from keras.regularizers import l2\n", "from keras.optimizers import Adam, SGD, RMSprop, Adagrad, Adadelta, Adamax, Nadam\n", "from keras.callbacks import ModelCheckpoint \n", "from keras.utils import np_utils\n", "from keras import constraints, initializers, regularizers\n", "from keras.engine.topology import Layer\n", "import keras.losses\n", "from keras.backend.tensorflow_backend import set_session\n", "from keras.engine import InputSpec\n", "import tensorflow as tf\n", "\n", "# Model metrics\n", "from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer\n", "from sklearn.metrics.pairwise import cosine_similarity\n", "\n", "# Visualization\n", "import IPython.display as ipd\n", "from IPython.display import Markdown, display, Audio\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "import plotly.offline as py\n", "import plotly.graph_objs as go\n", "import plotly.tools as tls\n", "\n", "py.init_notebook_mode(connected=True)\n", "color = sns.color_palette()\n", 
"sns.set_style('darkgrid')\n", "py.init_notebook_mode(connected=True)\n", "%matplotlib inline\n", "\n", "# Setting Random Seeds\n", "np.random.seed(95)\n", "RNG_SEED = 95\n", "\n", "# Suppressing some of Tensorflow's warnings\n", "tf.logging.set_verbosity(tf.logging.ERROR)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[22. 28.]\n", " [49. 64.]]\n" ] } ], "source": [ "# Simple matrix multiplication test to check if tf is using GPU device. \n", "with tf.device('/gpu:0'):\n", " a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')\n", " b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')\n", " c = tf.matmul(a, b)\n", "# If there is an output then it is able to access the device.\n", "with tf.Session() as sess:\n", " print (sess.run(c))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Importing The Dataset\n", "\n", "The primary dataset used is the [LibriSpeech ASR corpus](http://www.openslr.org/12/) which includes 1000 hours of recorded speech. A 960 hour subset of the dataset of audio files was used for model development and training. The dataset consists of 16kHz audio files of spoken English derived from read audiobooks from the LibriVox project. Some issues identified with this data set are the age of some of the works (the Declaration of Independence probably doesn't relate well to modern spoken English), the fact that there is much overlap in words spoken between the books, a lack of 'white noise' and other non-voice noises to help the model differentiate spoken words from background noise, and the fact that this does not include conversational English. 
An overview of the difficulties of working with data such as this can be found [here](https://awni.github.io/speech-recognition/).\n", "\n", "The dataset is prepared using a set of scripts borrowed from [Baidu Research's Deep Speech GitHub Repo](https://github.com/baidu-research/ba-dls-deepspeech). \n", "\n", "The dataset consists of 16kHz audio files between 2-15 seconds long. Using the prepared scripts, the audio files were converted to single channel (mono) WAV/WAVE files (.wav extension) with a 64k bit rate, and a 16kHz sample rate. They were encoded in PCM format, and then cut/padded to an equal length of 10 seconds. The pre-processing techniques used for the text transcriptions include the removal of any punctuation other than apostrophes, and transforming all characters to lowercase. Full instructions on how to download the dataset, convert the .flac files to .wav, and build the corpus are found in the README on the GitHub repository or in the wiki." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train Set Duration Mean: 12.301810444600907\n", "Valid Set Duration Mean: 6.795830092509413\n", "Test Set Duration Mean: 6.958454892966327\n" ] } ], "source": [ "train_corpus = pd.read_json('train_corpus.json', lines=True)\n", "valid_corpus = pd.read_json('valid_corpus.json', lines=True)\n", "test_corpus = pd.read_json('test_corpus.json', lines=True)\n", "train_duration_mean = train_corpus.duration.mean()\n", "valid_duration_mean = valid_corpus.duration.mean()\n", "test_duration_mean = test_corpus.duration.mean()\n", "print('Train Set Duration Mean:', train_duration_mean)\n", "print('Valid Set Duration Mean:', valid_duration_mean)\n", "print('Test Set Duration Mean:', test_duration_mean)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train Set Duration Median: 13.79\n", "Valid Set Duration 
Median: 5.53\n", "Test Set Duration Median: 5.455\n" ] } ], "source": [ "train_duration_median = train_corpus.duration.median()\n", "valid_duration_median = valid_corpus.duration.median()\n", "test_duration_median = test_corpus.duration.median()\n", "print('Train Set Duration Median:', train_duration_median)\n", "print('Valid Set Duration Median:', valid_duration_median)\n", "print('Test Set Duration Median:', test_duration_median)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Defining some initial functions for preparing the dataset" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Function for shuffling data which is important as neural networks make multiple passes through the data\n", "def shuffle_dataset(audio_paths, durations, texts):\n", " p = np.random.permutation(len(audio_paths))\n", " audio_paths = [audio_paths[i] for i in p] \n", " durations = [durations[i] for i in p] \n", " texts = [texts[i] for i in p]\n", " return audio_paths, durations, texts\n", "\n", "# Function for sorting data by duration\n", "def sort_dataset(audio_paths, durations, texts):\n", " p = np.argsort(durations).tolist()\n", " audio_paths = [audio_paths[i] for i in p]\n", " durations = [durations[i] for i in p] \n", " texts = [texts[i] for i in p]\n", " return audio_paths, durations, texts\n", "\n", "# Mapping each character that could be spoken at each time step\n", "char_map_str = \"\"\"\n", "' 0\n", "<SPACE> 1\n", "a 2\n", "b 3\n", "c 4\n", "d 5\n", "e 6\n", "f 7\n", "g 8\n", "h 9\n", "i 10\n", "j 11\n", "k 12\n", "l 13\n", "m 14\n", "n 15\n", "o 16\n", "p 17\n", "q 18\n", "r 19\n", "s 20\n", "t 21\n", "u 22\n", "v 23\n", "w 24\n", "x 25\n", "y 26\n", "z 27\n", "\"\"\"\n", "# This leaves \"blank\" character mapped to number 28\n", "\n", "char_map = {}\n", "index_map = {}\n", "for line in char_map_str.strip().split('\\n'):\n", " ch, index = line.split()\n", " char_map[ch] = int(index)\n", " index_map[int(index)+1] = ch\n", 
"index_map[2] = ' '\n", "\n", "# Function for converting text to an integer sequence\n", "def text_to_int_seq(text):\n", " int_sequence = []\n", " for c in text:\n", " if c == ' ':\n", " ch = char_map['<SPACE>']\n", " else:\n", " ch = char_map[c]\n", " int_sequence.append(ch)\n", " return int_sequence\n", "\n", "# Function for converting an integer sequence to text\n", "def int_seq_to_text(int_sequence):\n", " text = []\n", " for c in int_sequence:\n", " ch = index_map[c]\n", " text.append(ch)\n", " return text\n", "\n", "# Function for calculating feature dimensions.\n", "def calc_feat_dim(window, max_freq):\n", " return int(0.001 * window * max_freq) + 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Defining the primary class for preparing the dataset for visualization and modeling.\n", "\n", "This class provides options for training models on both MFCCs and spectrograms of the data but is set to use spectrograms by default." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "class AudioGenerator():\n", " def __init__(self, step=10, window=20, max_freq=8000, mfcc_dim=13,\n", " minibatch_size=20, desc_file=None, spectrogram=True, max_duration=10.0, \n", " sort_by_duration=False):\n", " # Initializing variables\n", " self.feat_dim = calc_feat_dim(window, max_freq)\n", " self.mfcc_dim = mfcc_dim\n", " self.feats_mean = np.zeros((self.feat_dim,))\n", " self.feats_std = np.ones((self.feat_dim,))\n", " self.rng = random.Random(RNG_SEED)\n", " if desc_file is not None:\n", " self.load_metadata_from_desc_file(desc_file)\n", " self.step = step\n", " self.window = window\n", " self.max_freq = max_freq\n", " self.cur_train_index = 0\n", " self.cur_valid_index = 0\n", " self.cur_test_index = 0\n", " self.max_duration=max_duration\n", " self.minibatch_size = minibatch_size\n", " self.spectrogram = spectrogram\n", " self.sort_by_duration = sort_by_duration\n", "\n", " def get_batch(self, partition):\n", " # Obtain a batch of 
audio files\n", " if partition == 'train':\n", " audio_paths = self.train_audio_paths\n", " cur_index = self.cur_train_index\n", " texts = self.train_texts\n", " elif partition == 'valid':\n", " audio_paths = self.valid_audio_paths\n", " cur_index = self.cur_valid_index\n", " texts = self.valid_texts\n", " elif partition == 'test':\n", " audio_paths = self.test_audio_paths\n", " cur_index = self.cur_test_index\n", " texts = self.test_texts\n", " else:\n", " raise Exception(\"Invalid partition. Must be train/valid/test\")\n", "\n", " features = [self.normalize(self.featurize(a)) for a in \n", " audio_paths[cur_index:cur_index+self.minibatch_size]]\n", "\n", " # Find the longest feature sequence and the longest transcript in the batch\n", " max_length = max([features[i].shape[0] \n", " for i in range(0, self.minibatch_size)])\n", " max_string_length = max([len(texts[cur_index+i]) \n", " for i in range(0, self.minibatch_size)])\n", " \n", " # Initialize arrays (labels are padded with 28, the \"blank\" character)\n", " X_data = np.zeros([self.minibatch_size, max_length, \n", " self.feat_dim*self.spectrogram + self.mfcc_dim*(not self.spectrogram)])\n", " labels = np.ones([self.minibatch_size, max_string_length]) * 28\n", " input_length = np.zeros([self.minibatch_size, 1])\n", " label_length = np.zeros([self.minibatch_size, 1])\n", " \n", " for i in range(0, self.minibatch_size):\n", " # Calculate input_length\n", " feat = features[i]\n", " input_length[i] = feat.shape[0]\n", " X_data[i, :feat.shape[0], :] = feat\n", "\n", " # Calculate label_length\n", " label = np.array(text_to_int_seq(texts[cur_index+i])) \n", " labels[i, :len(label)] = label\n", " label_length[i] = len(label)\n", "\n", " # Output arrays\n", " outputs = {'ctc': np.zeros([self.minibatch_size])}\n", " inputs = {'the_input': X_data, \n", " 'the_labels': labels, \n", " 'input_length': input_length, \n", " 'label_length': label_length \n", " }\n", " return (inputs, outputs)\n", "\n", " def shuffle_dataset_by_partition(self, partition):\n", " # Shuffle the examples within the chosen partition\n", " if partition == 'train':\n", " 
self.train_audio_paths, self.train_durations, self.train_texts = shuffle_dataset(\n", " self.train_audio_paths, self.train_durations, self.train_texts)\n", " elif partition == 'valid':\n", " self.valid_audio_paths, self.valid_durations, self.valid_texts = shuffle_dataset(\n", " self.valid_audio_paths, self.valid_durations, self.valid_texts)\n", " else:\n", " raise Exception(\"Invalid partition. \"\n", " \"Must be train/valid\")\n", "\n", " def sort_dataset_by_duration(self, partition):\n", " # Sort the chosen partition by audio duration\n", " if partition == 'train':\n", " self.train_audio_paths, self.train_durations, self.train_texts = sort_dataset(\n", " self.train_audio_paths, self.train_durations, self.train_texts)\n", " elif partition == 'valid':\n", " self.valid_audio_paths, self.valid_durations, self.valid_texts = sort_dataset(\n", " self.valid_audio_paths, self.valid_durations, self.valid_texts)\n", " elif partition == 'test':\n", " self.test_audio_paths, self.test_durations, self.test_texts = sort_dataset(\n", " self.test_audio_paths, self.test_durations, self.test_texts)\n", " else:\n", " raise Exception(\"Invalid partition. \"\n", " \"Must be train/valid/test\")\n", "\n", " def next_train(self):\n", " # Get a batch of training data\n", " while True:\n", " ret = self.get_batch('train')\n", " self.cur_train_index += self.minibatch_size\n", " if self.cur_train_index >= len(self.train_texts) - self.minibatch_size:\n", " self.cur_train_index = 0\n", " self.shuffle_dataset_by_partition('train')\n", " yield ret \n", "\n", " def next_valid(self):\n", " # Get a batch of validation data\n", " while True:\n", " ret = self.get_batch('valid')\n", " self.cur_valid_index += self.minibatch_size\n", " if self.cur_valid_index >= len(self.valid_texts) - self.minibatch_size:\n", " self.cur_valid_index = 0\n", " self.shuffle_dataset_by_partition('valid')\n", " yield ret\n", "\n", " def next_test(self):\n", " # Get a batch of testing data\n", " while True:\n", " ret = self.get_batch('test')\n", " self.cur_test_index += self.minibatch_size\n", " if self.cur_test_index >= len(self.test_texts) - self.minibatch_size:\n", " self.cur_test_index = 0\n", " yield ret\n", " \n", " # Load 
datasets\n", " def load_train_data(self, desc_file='train_corpus.json'):\n", " self.load_metadata_from_desc_file(desc_file, 'train')\n", " self.fit_train()\n", " if self.sort_by_duration:\n", " self.sort_dataset_by_duration('train')\n", " \n", "\n", " def load_validation_data(self, desc_file='valid_corpus.json'):\n", " self.load_metadata_from_desc_file(desc_file, 'validation')\n", " if self.sort_by_duration:\n", " self.sort_dataset_by_duration('valid')\n", "\n", " def load_test_data(self, desc_file='test_corpus.json'):\n", " self.load_metadata_from_desc_file(desc_file, 'test')\n", " if self.sort_by_duration:\n", " self.sort_dataset_by_duration('test')\n", " \n", " def load_metadata_from_desc_file(self, desc_file, partition):\n", " # Get metadata from json corpus\n", " audio_paths, durations, texts = [], [], []\n", " with open(desc_file) as json_line_file:\n", " for line_num, json_line in enumerate(json_line_file):\n", " try:\n", " spec = json.loads(json_line)\n", " if float(spec['duration']) > self.max_duration:\n", " continue\n", " audio_paths.append(spec['key'])\n", " durations.append(float(spec['duration']))\n", " texts.append(spec['text'])\n", " except Exception as e:\n", " print('Error reading line #{}: {}'\n", " .format(line_num, json_line))\n", " if partition == 'train':\n", " self.train_audio_paths = audio_paths\n", " self.train_durations = durations\n", " self.train_texts = texts\n", " elif partition == 'validation':\n", " self.valid_audio_paths = audio_paths\n", " self.valid_durations = durations\n", " self.valid_texts = texts\n", " elif partition == 'test':\n", " self.test_audio_paths = audio_paths\n", " self.test_durations = durations\n", " self.test_texts = texts\n", " else:\n", " raise Exception(\"Invalid partition. 
\"\n", " \"Must be train/validation/test\")\n", " \n", " def fit_train(self, k_samples=100):\n", " # Estimate descriptive stats for training set based on sample of 100 instances\n", " k_samples = min(k_samples, len(self.train_audio_paths))\n", " samples = self.rng.sample(self.train_audio_paths, k_samples)\n", " feats = [self.featurize(s) for s in samples]\n", " feats = np.vstack(feats)\n", " self.feats_mean = np.mean(feats, axis=0)\n", " self.feats_std = np.std(feats, axis=0)\n", " \n", " def featurize(self, audio_clip):\n", " # Create features from data, either spectrogram or mfcc\n", " if self.spectrogram:\n", " return spectrogram_from_file(\n", " audio_clip, step=self.step, window=self.window,\n", " max_freq=self.max_freq)\n", " else:\n", " (rate, sig) = wav.read(audio_clip)\n", " return mfcc(sig, rate, numcep=self.mfcc_dim)\n", "\n", " def normalize(self, feature, eps=1e-14):\n", " # Scale the data to improve neural network performance and reduce the size of the gradients\n", " return (feature - self.feats_mean) / (self.feats_std + eps)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Acoustic Feature Extraction/Engineering for Speech Recognition\n", "\n", "There are 3 primary methods for extracting features for speech recognition. This includes using raw audio forms, spectrograms, and mfcc's. For this project, I will be creating a character level sequencing model. This allows me to train a model on a data set with a limited vocabulary that can generalize to more unique/rare words better. The downsides are that these models are more computationally expensive, more difficult to interpret/understand, and they are more susceptible to the problems of vanishing or exploding gradients as the sequences can be quite long.\n", "\n", "##### The primary dataset used will not need much cleaning as it is taken from audiobooks that have been preprocessed for background noises. 
This will, of course, lead to reduced performance in distracting environments." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# Defining 3 different ways of converting audio files to spectrograms\n", "\n", "def spectrogram(samples, fft_length=256, sample_rate=2, hop_length=128):\n", "# Create a spectrogram from audio signals\n", " assert not np.iscomplexobj(samples), \"You shall not pass in complex numbers\"\n", " window = np.hanning(fft_length)[:, None]\n", " window_norm = np.sum(window**2) \n", " scale = window_norm * sample_rate\n", " trunc = (len(samples) - fft_length) % hop_length\n", " x = samples[:len(samples) - trunc]\n", " # Reshape to include the overlap\n", " nshape = (fft_length, (len(x) - fft_length) // hop_length + 1)\n", " nstrides = (x.strides[0], x.strides[0] * hop_length)\n", " x = as_strided(x, shape=nshape, strides=nstrides)\n", " # Window stride sanity check\n", " assert np.all(x[:, 1] == samples[hop_length:(hop_length + fft_length)])\n", " # Broadcast window, and then compute fft over columns and square mod\n", " x = np.fft.rfft(x * window, axis=0)\n", " x = np.absolute(x)**2\n", " # Scale 2.0 for everything except dc and fft_length/2\n", " x[1:-1, :] *= (2.0 / scale)\n", " x[(0, -1), :] /= scale\n", " freqs = float(sample_rate) / fft_length * np.arange(x.shape[0])\n", " return x, freqs\n", "\n", "def spectrogram_from_file(filename, step=10, window=20, max_freq=None, eps=1e-14):\n", "# Calculate log(linear spectrogram) from FFT energy\n", " with soundfile.SoundFile(filename) as sound_file:\n", " audio = sound_file.read(dtype='float32')\n", " sample_rate = sound_file.samplerate\n", " if audio.ndim >= 2:\n", " audio = np.mean(audio, 1)\n", " if max_freq is None:\n", " max_freq = sample_rate / 2\n", " if max_freq > sample_rate / 2:\n", " raise ValueError(\"max_freq can not be > than 0.5 of \"\n", " \" sample rate\")\n", " if step > window:\n", " raise ValueError(\"step size can not be > than window 
size\")\n", " hop_length = int(0.001 * step * sample_rate)\n", " fft_length = int(0.001 * window * sample_rate)\n", " pxx, freqs = spectrogram(\n", " audio, fft_length=fft_length, sample_rate=sample_rate,\n", " hop_length=hop_length)\n", " ind = np.where(freqs <= max_freq)[0][-1] + 1\n", " return np.transpose(np.log(pxx[:ind, :] + eps))\n", "\n", "def log_spectrogram_feature(samples, sample_rate, window_size=20, step_size=10, eps=1e-14):\n", " nperseg = int(round(window_size * sample_rate / 1e3))\n", " noverlap = int(round(step_size * sample_rate / 1e3))\n", " freqs, times, spec = signal.spectrogram(samples,\n", " fs=sample_rate,\n", " window='hann',\n", " nperseg=nperseg,\n", " noverlap=noverlap,\n", " detrend=False)\n", " # signal.spectrogram already returns frequencies in Hz, from 0 up to sample_rate / 2\n", " return freqs, times, np.log(spec.T.astype(np.float64) + eps)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Visualizing The Data\n", "\n", "\n", "- [Raw Audio](#raw)\n", "- [Spectrograms](#spectograms)\n", "- [Mel-Frequency Cepstral Coefficients](#mfcc)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "def vis_train_features(index):\n", "# Function for visualizing a single audio file based on index chosen\n", " # Get spectrogram\n", " audio_gen = AudioGenerator(spectrogram=True)\n", " audio_gen.load_train_data()\n", " vis_audio_path = audio_gen.train_audio_paths[index]\n", " vis_spectrogram_feature = audio_gen.normalize(audio_gen.featurize(vis_audio_path))\n", " # Get mfcc\n", " audio_gen = AudioGenerator(spectrogram=False)\n", " audio_gen.load_train_data()\n", " vis_mfcc_feature = audio_gen.normalize(audio_gen.featurize(vis_audio_path))\n", " # Obtain text label\n", " vis_text = audio_gen.train_texts[index]\n", " # Obtain raw audio\n", " sample_rate, samples = wav.read(vis_audio_path)\n", " # Print total number of training examples\n", " print('There are %d total training examples.' 
% len(audio_gen.train_audio_paths))\n", " # Return labels for plotting\n", " return vis_text, vis_mfcc_feature, vis_spectrogram_feature, vis_audio_path, sample_rate, samples" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There are 64220 total training examples.\n" ] } ], "source": [ "# Creating visualisations for audio file at index number 2012\n", "vis_text, vis_mfcc_feature, vis_spectrogram_feature, vis_audio_path, sample_rate, samples, = vis_train_features(index=2012)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Raw Audio\n", "\n", "This method uses the raw wave forms of the audio files and is a 1D vector of the amplitude where X = [x1, x2, x3...]\n", "\n", "This is used by the [Pannous Sequence to Sequence](https://github.com/pannous/tensorflow-speech-recognition) models built in Caffe and TensorFlow." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "def plot_raw_audio(sample_rate, samples):\n", " # Plot the raw audio signal\n", " time = np.arange(0, float(samples.shape[0]), 1) / sample_rate\n", " fig = plt.figure(figsize=(12,5))\n", " ax = fig.add_subplot(111)\n", " ax.plot(time, samples, linewidth=1, alpha=0.7, color='#76b900')\n", " plt.title('Raw Audio Signal')\n", " plt.xlabel('Time')\n", " plt.ylabel('Amplitude')\n", " plt.show()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "data": { "text/markdown": [ "**Audio File Transcription** : in front of the table benches arranged in zigzag form like the circumvallations of a retrenchment formed a succession of bastions and curtains" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Plot the raw audio file\n", "plot_raw_audio(sample_rate, samples)\n", "# Print the transcript corresponding to the audio file\n", "display(Markdown('**Audio File Transcription** : ' + str(vis_text)))\n", "# Play the raw audio file\n", "Audio(vis_audio_path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Spectrograms\n", "\n", "This is what we will use by default for this project. A spectrogram transforms the raw audio waveform into a 2D tensor (using the Fourier transform) where the first dimension corresponds to time (the horizontal axis), and the second dimension corresponds to frequency (the vertical axis). We lose a little bit of information in this conversion process, as we keep only the log of the power of the FFT and discard the phase. This can be written as log |FFT(X)|^2. With a 20 ms window at a 16 kHz sample rate, the FFT covers 320 samples and yields 161 frequency bins between 0 and 8000 Hz, so each feature corresponds to a band about 50 Hz wide. The full transformation process is documented [here](resources/spectrograms.pdf).\n", "\n", "Spectrograms are used in [Baidu's Deep Speech](https://github.com/baidu-research/ba-dls-deepspeech) system." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "def plot_spectrogram_feature(vis_spectrogram_feature):\n", " # Plot a normalized spectrogram\n", " fig = plt.figure(figsize=(12,5))\n", " ax1 = fig.add_subplot(111)\n", " im = ax1.imshow(vis_spectrogram_feature.T, cmap=plt.cm.viridis, aspect='auto', origin='lower')\n", " plt.title('Normalized Log Spectrogram')\n", " plt.ylabel('Frequency')\n", " plt.xlabel('Time (s)')\n", " divider = make_axes_locatable(ax1)\n", " cax = divider.append_axes(\"right\", size=\"5%\", pad=0.05)\n", " plt.colorbar(im, cax=cax)\n", " plt.show()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "data": { "text/markdown": [ "**Shape of the Spectrogram** : (904, 161)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Plot the spectrogram for the selected file\n", "plot_spectrogram_feature(vis_spectrogram_feature)\n", "# Print shape of the spectrogram for the selected file\n", "display(Markdown('**Shape of the Spectrogram** : ' + str(vis_spectrogram_feature.shape)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we have 161 features for each frame, and frequencies run from 0 to 8000 Hz (half of the 16 kHz sample rate), then each feature corresponds to around 50 Hz. Human hearing has a resolution of around 3.6 Hz, so it is much more precise than what this transformation allows. This graph looks a little noisy so let's look at it with a finer grain of detail:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "freqs, times, log_spectrogram = log_spectrogram_feature(samples, sample_rate)\n", "\n", "mean = np.mean(log_spectrogram, axis=0)\n", "std = np.std(log_spectrogram, axis=0)\n", "log_spectrogram = (log_spectrogram - mean) / std\n", "\n", "def plot_log_spectrogram_feature(freqs, times, log_spectrogram):\n", " fig = plt.figure(figsize=(12,5))\n", " ax2 = fig.add_subplot(111)\n", " ax2.imshow(log_spectrogram.T, aspect='auto', origin='lower', cmap=plt.cm.viridis, \n", " extent=[times.min(), times.max(), freqs.min(), freqs.max()])\n", " ax2.set_yticks(freqs[::20])\n", " ax2.set_xticks(times[::20])\n", " ax2.set_title('Normalized Log Spectrogram')\n", " ax2.set_ylabel('Frequency')\n", " ax2.set_xlabel('Time (s)')" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plot_log_spectrogram_feature(freqs, times, log_spectrogram)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Now, let's take a look at it in 3D, where we add the (log) amplitude as a 3rd dimension:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = [go.Surface(z=log_spectrogram.T, colorscale='Viridis')]\n", "layout = go.Layout(\n", "title='3D Spectrogram',\n", "autosize=True,\n", "margin=dict(l=50, r=50, b=50, t=50))\n", "fig = go.Figure(data=data, layout=layout)\n", "py.iplot(fig)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### And as a contour map." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = [go.Contour(z=log_spectrogram.T, colorscale='Viridis')]\n", "layout = go.Layout(\n", "title='Contour Graph',\n", "autosize=True)\n", "fig = go.Figure(data=data, layout=layout)\n", "py.iplot(fig)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Mel-Frequency Cepstral Coefficients\n", "\n", "Like the spectrogram, this turns the audio wave form into a 2D array. This works by mapping the powers of the Fourier transform of the signal, and then taking the discrete cosine transform of the logged mel powers. This produces a 2D array with reduced dimensions when compared to spectrograms, effectively allowing for compression of the spectrogram and speeding up training as we are left with 13 features. The full process for deriving MFCC's from audio is outlined [here](resources/mfccs.pdf).\n", "\n", "This is used in Mozilla's implementation of [Deep Speech](https://github.com/mozilla/DeepSpeech) in TensorFlow." 
] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "def plot_mfcc_feature(vis_mfcc_feature):\n", " # Plot a normalized MFCC feature\n", " fig = plt.figure(figsize=(12,5))\n", " ax = fig.add_subplot(111)\n", " im = ax.imshow(vis_mfcc_feature, cmap=plt.cm.viridis, aspect='auto')\n", " plt.title('Normalized MFCC')\n", " plt.ylabel('Time')\n", " plt.xlabel('MFCC Coefficient')\n", " divider = make_axes_locatable(ax)\n", " cax = divider.append_axes(\"right\", size=\"5%\", pad=0.05)\n", " plt.colorbar(im, cax=cax)\n", " ax.set_xticks(np.arange(0, 13, 2), minor=False);\n", " plt.show()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "data": { "text/markdown": [ "**Shape of the MFCC** : (904, 13)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Plot the MFCC of the selected file\n", "plot_mfcc_feature(vis_mfcc_feature)\n", "# Print the shape of the MFCC of the selected file\n", "display(Markdown('**Shape of the MFCC** : ' + str(vis_mfcc_feature.shape)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Deep Neural Networks for Acoustic Modeling\n", "\n", "- [RNN](#rnn)\n", "- [RNN + TimeDistributed Dense](#rnntd)\n", "- [CNN + RNN + TimeDistributed Dense](#cnnrnn)\n", "- [Deeper RNN + TimeDistributed Dense](#deeprnn)\n", "- [Bidirectional RNN + TimeDistributed Dense](#bidirectional)\n", "- [CNN + Deeper Bidirectional RNN + TimeDistributed Dense](#cnndeepbi)\n", "- [Advanced Modeling Techniques](#modeling)\n", " - [Dropout](#dropout)\n", " - [Dilated Convolutions](#dilated)\n", "- [Aggregate Models](#aggregate)\n", " - [Deep Aggregate Model](#deepagg)\n", "- [Attention](#attention)\n", " - [Final Model](#final)\n", "\n", "The two most common tools for automatic speech recognition are Hidden Markov Models (HMM's), and Deep Neural Networks. For this project, the architecture chosen is a (Recurrent) Deep Neural Network (RNN) as it is easy to implement, and scales well. Though the most effective and sophisticated models implement \"hybrid\" systems or DNN-HMM, this is beyond the scope of this project. While HMM's using weighted finite state transducers are still considered the most powerful speech recognition tools, they were ignored for this program due to their complexity and increased computing requirements. 
HMM's also require the development of an extensive vocabulary of phonemes and graphemes that could not be produced under the time constraints of this project.\n", "\n", "Recurrent neurons are similar to feedforward neurons, except they also have connections pointing backward. At each step in time, each neuron receives an input as well as its own output from the previous time step. Each neuron has two sets of weights, one for the input and one for the output at the last time step. Each layer takes vectors as inputs and outputs a vector. This model works by calculating forward propagation through each time step, t, and then backpropagation through each time step. At each time step, the speaker is assumed to have spoken 1 of 29 possible characters (26 letters, 1 space character, 1 apostrophe, and 1 blank/empty character used to pad short files since inputs will have varying length). The output of this model at each time step will be a list of probabilities for each possible character.\n", "\n", "The RNN is composed of a combined acoustic model and language model. The acoustic model scores sequences of acoustic model labels over a time frame and the language model scores sequences of characters. A decoding graph then maps valid acoustic label sequences to the corresponding character sequences. Speech recognition is a path search algorithm through the decoding graph, where the score of the path is the sum of the score given to it by the decoding graph, and the score given to it by the acoustic model. So, to put it simply, speech recognition is the process of finding the character sequence that maximizes both the language and acoustic model scores.\n", "\n", "In this notebook, I have created several end-to-end RNN's for ASR. I have addressed two common issues with RNN's: exploding gradients, through gradient clipping, and vanishing gradients, through the use of GRU and LSTM cells. 
I have also employed batch normalization and recurrent dropout.\n", "\n", "For more information on the use of deep learning in speech recognition, read [George Dahl's paper](resources/deeplearning.pdf)." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# Custom CTC loss function (discussed below)\n", "def ctc_lambda_func(args):\n", " y_pred, labels, input_length, label_length = args\n", " return K.ctc_batch_cost(labels, y_pred, input_length, label_length)\n", "\n", "def add_ctc_loss(input_to_softmax):\n", " the_labels = Input(name='the_labels', shape=(None,), dtype='float32')\n", " input_lengths = Input(name='input_length', shape=(1,), dtype='int64')\n", " label_lengths = Input(name='label_length', shape=(1,), dtype='int64')\n", " output_lengths = Lambda(input_to_softmax.output_length)(input_lengths)\n", " # CTC loss is implemented in a lambda layer\n", " loss_out = Lambda(ctc_lambda_func, output_shape=(1,), name='ctc')(\n", " [input_to_softmax.output, the_labels, output_lengths, label_lengths])\n", " model = Model(\n", " inputs=[input_to_softmax.input, the_labels, input_lengths, label_lengths], \n", " outputs=loss_out)\n", " return model\n", "\n", "# Function for modifying CNN layers for sequence problems \n", "def cnn_output_length(input_length, filter_size, border_mode, stride,\n", " dilation=1):\n", "# Compute the length of cnn output seq after 1D convolution across time\n", " if input_length is None:\n", " return None\n", " assert border_mode in {'same', 'valid', 'causal'}\n", " dilated_filter_size = filter_size + (filter_size - 1) * (dilation - 1)\n", " if border_mode == 'same':\n", " output_length = input_length\n", " elif border_mode == 'valid':\n", " output_length = input_length - dilated_filter_size + 1\n", " elif border_mode == 'causal':\n", " output_length = input_length\n", " return (output_length + stride - 1) // stride" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Connectionist 
Temporal Classification\n", "\n", "The loss function I am using is a custom implementation of Connectionist Temporal Classification (CTC), a sequence-level objective function that removes a modeling burden of frame-wise cross-entropy, which forces the model to link every frame of input data to a label. CTC's label set includes a \"blank\" symbol, so if a frame of data doesn’t contain any utterance, the CTC system can output \"blank\" to indicate that there isn't enough information to classify an output. This has the added benefits of allowing inputs and outputs of varying length, since short files can be padded with the \"blank\" character, and of allowing us to model words using a character-level classification system. This function only observes the sequence of labels along a path, ignoring the alignment of the labels to the acoustic data.\n", "\n", "More information on CTC can be found in Alex Graves's [paper](resources/ctc.pdf)." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "def train_model(input_to_softmax, \n", " pickle_path,\n", " save_model_path,\n", " train_json='train_corpus.json',\n", " valid_json='valid_corpus.json',\n", " minibatch_size=16, # You will want to change this depending on the GPU you are training on\n", " spectrogram=True,\n", " mfcc_dim=13,\n", " optimizer=Adam(lr=0.0001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False, clipnorm=1, clipvalue=.5),\n", " epochs=30, # You will want to change this depending on the model you are training and data you are using\n", " verbose=1,\n", " sort_by_duration=False,\n", " max_duration=10.0):\n", " \n", " # Obtain batches of data\n", " audio_gen = AudioGenerator(minibatch_size=minibatch_size, \n", " spectrogram=spectrogram, mfcc_dim=mfcc_dim, max_duration=max_duration,\n", " sort_by_duration=sort_by_duration)\n", " # Load the datasets\n", " audio_gen.load_train_data(train_json)\n", " 
audio_gen.load_validation_data(valid_json) \n", " # Calculate steps per epoch\n", " num_train_examples=len(audio_gen.train_audio_paths)\n", " steps_per_epoch = num_train_examples//minibatch_size\n", " # Calculate validation steps\n", " num_valid_samples = len(audio_gen.valid_audio_paths) \n", " validation_steps = num_valid_samples//minibatch_size \n", " # Add custom CTC loss function to the nn\n", " model = add_ctc_loss(input_to_softmax)\n", " # Dummy lambda function for loss since CTC loss is implemented above\n", " model.compile(loss={'ctc': lambda y_true, y_pred: y_pred}, optimizer=optimizer)\n", " # Make initial results/ directory for saving model pickles\n", " if not os.path.exists('results'):\n", " os.makedirs('results')\n", " # Add callbacks\n", " checkpointer = ModelCheckpoint(filepath='results/'+save_model_path, verbose=0)\n", " terminator = callbacks.TerminateOnNaN()\n", " time_machiner = callbacks.History()\n", " logger = callbacks.CSVLogger('training.log')\n", " stopper = callbacks.EarlyStopping(monitor='val_loss', patience=2, verbose=1, mode='auto')\n", " reducer = callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=10, verbose=0, mode='auto', min_delta=0.0001, cooldown=0, min_lr=0)\n", " tensor_boarder = callbacks.TensorBoard(log_dir='./logs', batch_size=16,\n", " write_graph=True, write_grads=True, write_images=True,)\n", " # Fit/train model\n", " hist = model.fit_generator(generator=audio_gen.next_train(), steps_per_epoch=steps_per_epoch,\n", " epochs=epochs, validation_data=audio_gen.next_valid(), validation_steps=validation_steps,\n", " callbacks=[checkpointer, terminator, logger, time_machiner, tensor_boarder, stopper, reducer], verbose=verbose)\n", " # Save model loss\n", " with open('results/'+pickle_path, 'wb') as f:\n", " pickle.dump(hist.history, f)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adam Optimizer\n", "The Adam optimizer was chosen as it has momentum and has been shown to work well in speech 
recognition. \n", "\n", "#### Exporting the model for inference\n", "\n", "The preferred method of exporting Keras models for inference is to use the built-in saver/checkpointer (and this is what is used for the inference engine). This uses h5py, which converts the data to the HDF5 binary data format, letting you store huge amounts of numerical data and easily manipulate that data from NumPy. The models were set to save checkpoints in a .h5 file after each epoch and at the end of training. These are stored in the /results/ directory and will be used by the inference engine in the Flask web app." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "# Creating a TensorFlow session\n", "from keras.backend.tensorflow_backend import set_session\n", "config = tf.ConfigProto()\n", "config.gpu_options.per_process_gpu_memory_fraction = 1.0\n", "set_session(tf.Session(config=config))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Training\n", "This notebook walks through the development of a series of models that successively get more complex. We will train all of the developmental models (models 0-7) for 20 epochs on a 100-hour subset of the data, and then test the aggregate models (models 8 and 9) by training for 30 epochs using both spectrograms and MFCC's on a 460-hour subset before training the final model architecture (model 10) for 30 epochs on the full 960-hour training set.\n", "\n", "\n", "## RNN\n", "\n", "This model explores a simple RNN with 1 layer of Gated Recurrent Units, a simplified type of Long Short-Term Memory recurrent neuron with fewer parameters than typical LSTM's. 
These work via update and reset gates and provide most of the performance of traditional LSTM's at a fraction of the computing costs.\n", "\n", "For more information on the use of recurrent neural networks in speech recognition, read [Alex Graves' paper](resources/rnn.pdf).\n", "\n", "To learn more about GRU's, you can check out [this paper](resources/gru.pdf)." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "def regular_rnn_model(input_dim, output_dim=29):\n", " # Input\n", " input_data = Input(name='the_input', shape=(None, input_dim))\n", " # Recurrent layer\n", " simp_rnn = GRU(output_dim, return_sequences=True, \n", " implementation=2, name='rnn')(input_data)\n", " # Softmax Activation Layer\n", " y_pred = Activation('softmax', name='softmax')(simp_rnn)\n", " # Specifying the model\n", " model = Model(inputs=input_data, outputs=y_pred)\n", " model.output_length = lambda x: x\n", " print(model.summary())\n", " return model" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "_________________________________________________________________\n", "Layer (type) Output Shape Param # \n", "=================================================================\n", "the_input (InputLayer) (None, None, 161) 0 \n", "_________________________________________________________________\n", "rnn (GRU) (None, None, 29) 16617 \n", "_________________________________________________________________\n", "softmax (Activation) (None, None, 29) 0 \n", "=================================================================\n", "Total params: 16,617\n", "Trainable params: 16,617\n", "Non-trainable params: 0\n", "_________________________________________________________________\n", "None\n" ] } ], "source": [ "model_0 = regular_rnn_model(input_dim=161) # 161 for Spectrogram/13 for MFCC" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], 
"source": [ "train_model(input_to_softmax=model_0, \n", " pickle_path='model_0.pickle', \n", " save_model_path='model_0.h5',\n", " spectrogram=True,\n", " ) # True for Spectrogram/False for MFCC" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This model came a long way. Its performance is alright, but let's see if we can improve it with a more complex model.\n", "\n", "\n", "## RNN + TimeDistributed Dense\n", "\n", "This model explores the addition of layers of normal Dense neurons to every temporal slice of an input. This model also uses batch normalization, which normalizes the activations of the layers with a mean close to 0 and standard deviation close to 1.\n", "\n", "This model uses LSTM's. These cells include forget and output gates, which allow more control over the cell's memory by allowing separate control of what is forgotten and what is passed through to the next hidden layer of cells. This will also make it easier to implement 'peepholes' later, which allow the cell to look at both the previous output state and hidden state when making this determination.\n", "\n", "More information on the use of LSTM's in speech recognition can be found in [this paper](resources/lstm.pdf) from Google and for more info on TimeDistributed Layers, check out the [Keras Documentation](https://keras.io/layers/wrappers/)." 
] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "def rnn_tdd_model(input_dim, units, activation, output_dim=29):\n", " # Input\n", " input_data = Input(name='the_input', shape=(None, input_dim))\n", " # Recurrent layer\n", " simp_rnn = LSTM(units, activation=activation,\n", " return_sequences=True, implementation=2, name='rnn')(input_data)\n", " bn_rnn = BatchNormalization()(simp_rnn)\n", " # TimeDistributed Dense layer\n", " time_dense = TimeDistributed(Dense(output_dim))(bn_rnn)\n", " # Softmax activation layer\n", " y_pred = Activation('softmax', name='softmax')(time_dense)\n", " # Specifying the model\n", " model = Model(inputs=input_data, outputs=y_pred)\n", " model.output_length = lambda x: x\n", " print(model.summary())\n", " return model" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "_________________________________________________________________\n", "Layer (type) Output Shape Param # \n", "=================================================================\n", "the_input (InputLayer) (None, None, 161) 0 \n", "_________________________________________________________________\n", "rnn (LSTM) (None, None, 200) 289600 \n", "_________________________________________________________________\n", "batch_normalization_1 (Batch (None, None, 200) 800 \n", "_________________________________________________________________\n", "time_distributed_1 (TimeDist (None, None, 29) 5829 \n", "_________________________________________________________________\n", "softmax (Activation) (None, None, 29) 0 \n", "=================================================================\n", "Total params: 296,229\n", "Trainable params: 295,829\n", "Non-trainable params: 400\n", "_________________________________________________________________\n", "None\n" ] } ], "source": [ "model_1 = rnn_tdd_model(input_dim=161, # 161 for Spectrogram/13 for MFCC\n", " 
units=200,\n", " activation='relu')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_model(input_to_softmax=model_1, \n", " pickle_path='model_1.pickle', \n", " save_model_path='model_1.h5',\n", " spectrogram=True) # True for Spectrogram/False for MFCC" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This model did significantly better, so let's see if deepening the framework can improve our scores.\n", "\n", "\n", "## CNN + RNN + TimeDistributed Dense\n", "\n", "This model explores the addition of a Convolutional Neural Network to the RNN." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "def cnn_rnn_td_model(input_dim, filters, activation, kernel_size, conv_stride,\n", " conv_border_mode, units, output_dim=29):\n", " # Input\n", " input_data = Input(name='the_input', shape=(None, input_dim))\n", " # Convolutional layer\n", " conv_1d = Conv1D(filters, kernel_size, \n", " strides=conv_stride, \n", " padding=conv_border_mode,\n", " activation=activation,\n", " name='conv1d')(input_data)\n", " # Batch normalization\n", " bn_cnn = BatchNormalization(name='bn_conv1d')(conv_1d)\n", " # Recurrent layer\n", " simp_rnn = GRU(units, activation=activation,\n", " return_sequences=True, implementation=2, name='rnn')(bn_cnn)\n", " # Batch Normalization\n", " bn_rnn = BatchNormalization()(simp_rnn)\n", " # TimeDistributed Dense layer\n", " time_dense = TimeDistributed(Dense(output_dim))(bn_rnn)\n", " # Softmax activation layer\n", " y_pred = Activation('softmax', name='softmax')(time_dense)\n", " # Specifying the model\n", " model = Model(inputs=input_data, outputs=y_pred)\n", " model.output_length = lambda x: cnn_output_length(\n", " x, kernel_size, conv_border_mode, conv_stride)\n", " print(model.summary())\n", " return model" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": 
"stream", "text": [ "_________________________________________________________________\n", "Layer (type) Output Shape Param # \n", "=================================================================\n", "the_input (InputLayer) (None, None, 161) 0 \n", "_________________________________________________________________\n", "conv1d (Conv1D) (None, None, 200) 354400 \n", "_________________________________________________________________\n", "bn_conv1d (BatchNormalizatio (None, None, 200) 800 \n", "_________________________________________________________________\n", "rnn (GRU) (None, None, 200) 240600 \n", "_________________________________________________________________\n", "batch_normalization_2 (Batch (None, None, 200) 800 \n", "_________________________________________________________________\n", "time_distributed_2 (TimeDist (None, None, 29) 5829 \n", "_________________________________________________________________\n", "softmax (Activation) (None, None, 29) 0 \n", "=================================================================\n", "Total params: 602,429\n", "Trainable params: 601,629\n", "Non-trainable params: 800\n", "_________________________________________________________________\n", "None\n" ] } ], "source": [ "model_2 = cnn_rnn_td_model(input_dim=161, # 161 for Spectrogram/13 for MFCC\n", " filters=200,\n", " kernel_size=11, \n", " conv_stride=2,\n", " conv_border_mode='valid',\n", " activation='relu',\n", " units=200)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_model(input_to_softmax=model_2, \n", " pickle_path='model_2.pickle', \n", " save_model_path='model_2.h5', \n", " spectrogram=True) # True for Spectrogram/False for MFCC" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Adding a convolution layer greatly improved our score, but what about adding another RNN layer?\n", "\n", "\n", "## Deeper RNN + TimeDistributed Dense\n", "\n", "This model explores deepening of the network 
with additional recurrent layers." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "def deep_rnn_tdd_model(input_dim, units, recur_layers, activation, output_dim=29):\n", " # Input\n", " input_data = Input(name='the_input', shape=(None, input_dim))\n", " # 1st Recurrent layer\n", " simp_rnn = GRU(units, activation=activation, \n", " return_sequences=True, implementation=2, name='rnn_0')(input_data)\n", " # Batch normalization \n", " bn_rnn = BatchNormalization()(simp_rnn)\n", " # Loop for additional layers\n", " for i in range(recur_layers - 1):\n", " name = 'rnn_' + str(i + 1)\n", " simp_rnn = GRU(units, activation=activation, \n", " return_sequences=True, implementation=2, name=name)(bn_rnn)\n", " bn_rnn = BatchNormalization()(simp_rnn)\n", " # TimeDistributed Dense layer\n", " time_dense = TimeDistributed(Dense(output_dim))(bn_rnn)\n", " # Softmax activation layer\n", " y_pred = Activation('softmax', name='softmax')(time_dense)\n", " # Specifying the model\n", " model = Model(inputs=input_data, outputs=y_pred)\n", " model.output_length = lambda x: x\n", " print(model.summary())\n", " return model" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "_________________________________________________________________\n", "Layer (type) Output Shape Param # \n", "=================================================================\n", "the_input (InputLayer) (None, None, 161) 0 \n", "_________________________________________________________________\n", "rnn_0 (GRU) (None, None, 200) 217200 \n", "_________________________________________________________________\n", "batch_normalization_3 (Batch (None, None, 200) 800 \n", "_________________________________________________________________\n", "rnn_1 (GRU) (None, None, 200) 240600 \n", "_________________________________________________________________\n", "batch_normalization_4 (Batch (None, None, 
200) 800 \n", "_________________________________________________________________\n", "time_distributed_3 (TimeDist (None, None, 29) 5829 \n", "_________________________________________________________________\n", "softmax (Activation) (None, None, 29) 0 \n", "=================================================================\n", "Total params: 465,229\n", "Trainable params: 464,429\n", "Non-trainable params: 800\n", "_________________________________________________________________\n", "None\n" ] } ], "source": [ "model_3 = deep_rnn_tdd_model(input_dim=161, units=200, recur_layers=2, activation='relu') # 161 for Spectrogram/13 for MFCC" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_model(input_to_softmax=model_3, \n", " pickle_path='model_3.pickle', \n", " save_model_path='model_3.h5', \n", " spectrogram=True) # True for Spectrogram/False for MFCC" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This one did pretty well, but didn't perform quite as well as the convolution layer.\n", "\n", "\n", "## Bidirectional RNN + TimeDistributed Dense\n", "\n", "This model explores connecting two hidden layers of opposite directions to the same output, making their future input information reachable from the current state. To put it simply, this creates two layers of neurons; 1 that goes through the sequence forward in time and 1 that goes through it backward through time. This allows the output layer to get information from past and future states meaning that it will have knowledge of the letters located before and after the current utterance. 
This can lead to great improvements in performance but comes at a cost of increased latency.\n", "\n", "Inspiration for bidirectional layers came from [this paper](resources/bidirectional.pdf).\n", "\n", "> Note: The original implementation of this model ran into the problem of exploding gradients (which can be recognized by your loss being nan) and clipnorm=1 was added to the Adam optimizer above to clip the gradients and address this issue. This [blog post](https://machinelearningmastery.com/exploding-gradients-in-neural-networks/) gives a great overview of the various approaches for dealing with exploding gradients." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "def brnn_tdd_model(input_dim, units, activation, output_dim=29):\n", " # Input\n", " input_data = Input(name='the_input', shape=(None, input_dim))\n", " # Bidirectional recurrent layer\n", " brnn = Bidirectional(LSTM(units, activation=activation, \n", " return_sequences=True, implementation=2, name='brnn'))(input_data)\n", " # TimeDistributed Dense layer\n", " time_dense = TimeDistributed(Dense(output_dim))(brnn)\n", " # Softmax activation layer\n", " y_pred = Activation('softmax', name='softmax')(time_dense)\n", " # Specifying the model\n", " model = Model(inputs=input_data, outputs=y_pred)\n", " model.output_length = lambda x: x\n", " print(model.summary())\n", " return model" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "_________________________________________________________________\n", "Layer (type) Output Shape Param # \n", "=================================================================\n", "the_input (InputLayer) (None, None, 161) 0 \n", "_________________________________________________________________\n", "bidirectional_1 (Bidirection (None, None, 400) 579200 \n", "_________________________________________________________________\n", "time_distributed_4 
(TimeDist (None, None, 29) 11629 \n", "_________________________________________________________________\n", "softmax (Activation) (None, None, 29) 0 \n", "=================================================================\n", "Total params: 590,829\n", "Trainable params: 590,829\n", "Non-trainable params: 0\n", "_________________________________________________________________\n", "None\n" ] } ], "source": [ "model_4 = brnn_tdd_model(input_dim=161, units=200, activation='relu') # 161 for Spectrogram/13 for MFCC" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_model(input_to_softmax=model_4, \n", " pickle_path='model_4.pickle', \n", " save_model_path='model_4.h5', \n", " spectrogram=True) # True for Spectrogram/False for MFCC" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This also led to some improvements in the model, so let's see if we can combine these techniques for increased performance.\n", "\n", "\n", "## CNN + Deeper Bidirectional RNN + TimeDistributed Dense\n", "\n", "This model combines all of the ideas in the preceding models."
] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "def cnn_deep_brnn_tdd_model(input_dim, filters, activation, kernel_size, conv_stride,\n", " conv_border_mode, recur_layers, units, output_dim=29):\n", " # Input\n", " input_data = Input(name='the_input', shape=(None, input_dim))\n", " # Convolutional layer\n", " conv_1d = Conv1D(filters, kernel_size, \n", " strides=conv_stride, \n", " padding=conv_border_mode,\n", " activation=activation,\n", " name='conv1d')(input_data)\n", " # Batch normalization\n", " bn_cnn = BatchNormalization()(conv_1d)\n", " # Bidirectional recurrent layer\n", " brnn = Bidirectional(GRU(units, activation=activation, \n", " return_sequences=True, name='brnn'))(bn_cnn)\n", " # Batch normalization \n", " bn_rnn = BatchNormalization()(brnn)\n", " # Loop for additional layers\n", " for i in range(recur_layers - 1):\n", " name = 'brnn_' + str(i + 1)\n", " brnn = Bidirectional(GRU(units, activation=activation, \n", " return_sequences=True, implementation=2, name=name))(bn_rnn)\n", " bn_rnn = BatchNormalization()(brnn)\n", " # TimeDistributed Dense layer\n", " time_dense = TimeDistributed(Dense(output_dim))(bn_rnn)\n", " # Softmax activation layer\n", " y_pred = Activation('softmax', name='softmax')(time_dense)\n", " # Specifying the model\n", " model = Model(inputs=input_data, outputs=y_pred)\n", " model.output_length = lambda x: cnn_output_length(\n", " x, kernel_size, conv_border_mode, conv_stride)\n", " print(model.summary())\n", " return model" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "_________________________________________________________________\n", "Layer (type) Output Shape Param # \n", "=================================================================\n", "the_input (InputLayer) (None, None, 161) 0 \n", "_________________________________________________________________\n", "conv1d (Conv1D) (None, 
None, 200) 354400 \n", "_________________________________________________________________\n", "batch_normalization_5 (Batch (None, None, 200) 800 \n", "_________________________________________________________________\n", "bidirectional_2 (Bidirection (None, None, 400) 481200 \n", "_________________________________________________________________\n", "batch_normalization_6 (Batch (None, None, 400) 1600 \n", "_________________________________________________________________\n", "bidirectional_3 (Bidirection (None, None, 400) 721200 \n", "_________________________________________________________________\n", "batch_normalization_7 (Batch (None, None, 400) 1600 \n", "_________________________________________________________________\n", "time_distributed_5 (TimeDist (None, None, 29) 11629 \n", "_________________________________________________________________\n", "softmax (Activation) (None, None, 29) 0 \n", "=================================================================\n", "Total params: 1,572,429\n", "Trainable params: 1,570,429\n", "Non-trainable params: 2,000\n", "_________________________________________________________________\n", "None\n" ] } ], "source": [ "model_5 = cnn_deep_brnn_tdd_model(input_dim=161, # 161 for Spectrogram/13 for MFCC\n", " filters=200,\n", " activation='relu',\n", " kernel_size=11, \n", " conv_stride=2,\n", " conv_border_mode='valid',\n", " recur_layers=2,\n", " units=200)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_model(input_to_softmax=model_5, \n", " pickle_path='model_5.pickle', \n", " save_model_path='model_5.h5', \n", " spectrogram=True) # True for Spectrogram/False for MFCC" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This ASR program scored very well, so let's see if we can squeeze a little more out of our model using a few tricks.\n", "\n", "## Advanced Modeling Techniques\n", "\n", "\n", "### Dropout\n", "\n", "This model adds randomized dropout of 
inputs to the aggregate model to prevent the model from overfitting.\n", "\n", "> Note: The dropout rate is 1%. Anything larger than this led to exploding gradients, so the idea isn't pursued further here. One proposed solution can be found in this [paper](resources/gradientsexplode.pdf), and an example of it in action can be found in this [paper](resources/microsoft2016.pdf)." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "def cnn_deep_brnn_dropout_model(input_dim, filters, activation, kernel_size, conv_stride,\n", " conv_border_mode, recur_layers, units, output_dim=29):\n", " # Input\n", " input_data = Input(name='the_input', shape=(None, input_dim))\n", " # Convolutional layer\n", " conv_1d = Conv1D(filters, kernel_size, \n", " strides=conv_stride, \n", " padding=conv_border_mode,\n", " activation=activation,\n", " name='conv1d')(input_data)\n", " # Batch normalization\n", " bn_cnn = BatchNormalization()(conv_1d)\n", " # Bidirectional recurrent layer\n", " brnn = Bidirectional(GRU(units, activation=activation, \n", " return_sequences=True, implementation=2, recurrent_dropout=0.01, name='brnn'))(bn_cnn)\n", " # Batch normalization \n", " bn_rnn = BatchNormalization()(brnn)\n", " # Loop for additional layers\n", " for i in range(recur_layers - 1):\n", " name = 'brnn_' + str(i + 1)\n", " brnn = Bidirectional(GRU(units, activation=activation, \n", " return_sequences=True, implementation=2, name=name))(bn_rnn)\n", " bn_rnn = BatchNormalization()(brnn)\n", " # TimeDistributed Dense layer\n", " time_dense = TimeDistributed(Dense(output_dim))(bn_rnn)\n", " # Softmax activation layer\n", " y_pred = Activation('softmax', name='softmax')(time_dense)\n", " # Specifying the model\n", " model = Model(inputs=input_data, outputs=y_pred)\n", " model.output_length = lambda x: cnn_output_length(\n", " x, kernel_size, conv_border_mode, conv_stride)\n", " print(model.summary())\n", " return model" ] }, { "cell_type": 
"code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "_________________________________________________________________\n", "Layer (type) Output Shape Param # \n", "=================================================================\n", "the_input (InputLayer) (None, None, 161) 0 \n", "_________________________________________________________________\n", "conv1d (Conv1D) (None, None, 200) 354400 \n", "_________________________________________________________________\n", "batch_normalization_8 (Batch (None, None, 200) 800 \n", "_________________________________________________________________\n", "bidirectional_4 (Bidirection (None, None, 400) 481200 \n", "_________________________________________________________________\n", "batch_normalization_9 (Batch (None, None, 400) 1600 \n", "_________________________________________________________________\n", "bidirectional_5 (Bidirection (None, None, 400) 721200 \n", "_________________________________________________________________\n", "batch_normalization_10 (Batc (None, None, 400) 1600 \n", "_________________________________________________________________\n", "time_distributed_6 (TimeDist (None, None, 29) 11629 \n", "_________________________________________________________________\n", "softmax (Activation) (None, None, 29) 0 \n", "=================================================================\n", "Total params: 1,572,429\n", "Trainable params: 1,570,429\n", "Non-trainable params: 2,000\n", "_________________________________________________________________\n", "None\n" ] } ], "source": [ "model_6 = cnn_deep_brnn_dropout_model(input_dim=161, # 161 for Spectrogram/13 for MFCC\n", " filters=200,\n", " activation='relu',\n", " kernel_size=11, \n", " conv_stride=2,\n", " conv_border_mode='valid',\n", " recur_layers=2,\n", " units=200)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ 
"train_model(input_to_softmax=model_6, \n", " pickle_path='model_6.pickle', \n", " save_model_path='model_6.h5', \n", " spectrogram=True) # True for Spectrogram/False for MFCC" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Despite the small amount of instances dropped, this still lead to improved performance.\n", "\n", "\n", "### Dilated Convolutions\n", "\n", "This model adds dilated CNN's to the aggregate model. Dilation introduces gaps into the CNN's kernels, so that the receptive field must encircle areas rather than simply slide over the window in a systematic way. This means that the convolutional layer can pick up on the global context of what it is looking at while still only having as many weights/inputs as the standard form.\n", "\n", "Inspiration for this technique came from [IBM's Watson Team](resources/dilation.pdf)." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "def cnn_deep_brnn_dilated_model(input_dim, filters, activation, kernel_size, conv_stride,\n", " conv_border_mode, recur_layers, dilation_rate, units, conv_layers, output_dim=29):\n", " input_data = Input(name='the_input', shape=(None, input_dim))\n", " conv_1d = Conv1D(filters, kernel_size, \n", " strides=conv_stride, \n", " padding=conv_border_mode,\n", " activation=activation,\n", " name='conv1d')(input_data)\n", " # Batch normalization\n", " bn_cnn = BatchNormalization()(conv_1d)\n", " for i in range(conv_layers - 1):\n", " conv_1d = Conv1D(filters, kernel_size,\n", " padding='causal',\n", " activation='relu',\n", " dilation_rate=2**i,\n", " name=\"conv_1d_\"+str(i))(bn_cnn)\n", " bn_cnn = BatchNormalization()(conv_1d)\n", " # Bidirectional recurrent layer\n", " brnn = Bidirectional(GRU(units, activation=activation, \n", " return_sequences=True, implementation=2, name='brnn'))(bn_cnn)\n", " # Batch normalization \n", " bn_rnn = BatchNormalization()(brnn)\n", " # Loop for additional layers\n", " for i in range(recur_layers - 
1):\n", " name = 'brnn_' + str(i + 1)\n", " brnn = Bidirectional(GRU(units, activation=activation, \n", " return_sequences=True, implementation=2, name=name))(bn_rnn)\n", " bn_rnn = BatchNormalization()(brnn)\n", " time_dense = TimeDistributed(Dense(output_dim))(bn_rnn)\n", " y_pred = Activation('softmax', name='softmax')(time_dense)\n", " model = Model(inputs=input_data, outputs=y_pred)\n", " model.output_length = lambda x: cnn_output_length(\n", " x, kernel_size, 'causal', 1)\n", " print(model.summary())\n", " return model" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "_________________________________________________________________\n", "Layer (type) Output Shape Param # \n", "=================================================================\n", "the_input (InputLayer) (None, None, 161) 0 \n", "_________________________________________________________________\n", "conv1d (Conv1D) (None, None, 200) 354400 \n", "_________________________________________________________________\n", "batch_normalization_11 (Batc (None, None, 200) 800 \n", "_________________________________________________________________\n", "conv_1d_0 (Conv1D) (None, None, 200) 440200 \n", "_________________________________________________________________\n", "batch_normalization_12 (Batc (None, None, 200) 800 \n", "_________________________________________________________________\n", "bidirectional_6 (Bidirection (None, None, 400) 481200 \n", "_________________________________________________________________\n", "batch_normalization_13 (Batc (None, None, 400) 1600 \n", "_________________________________________________________________\n", "bidirectional_7 (Bidirection (None, None, 400) 721200 \n", "_________________________________________________________________\n", "batch_normalization_14 (Batc (None, None, 400) 1600 \n", "_________________________________________________________________\n", 
"time_distributed_7 (TimeDist (None, None, 29) 11629 \n", "_________________________________________________________________\n", "softmax (Activation) (None, None, 29) 0 \n", "=================================================================\n", "Total params: 2,013,429\n", "Trainable params: 2,011,029\n", "Non-trainable params: 2,400\n", "_________________________________________________________________\n", "None\n" ] } ], "source": [ "model_7 = cnn_deep_brnn_dilated_model(input_dim=161, # 161 for Spectrogram/13 for MFCC\n", " filters=200,\n", " activation='relu',\n", " kernel_size=11, \n", " conv_stride=1,\n", " conv_border_mode='causal',\n", " recur_layers=2,\n", " conv_layers=2,\n", " dilation_rate=2,\n", " units=200)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_model(input_to_softmax=model_7, \n", " pickle_path='model_7.pickle', \n", " save_model_path='model_7.h5', \n", " spectrogram=True) # True for Spectrogram/False for MFCC" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This model also does pretty well (at the cost of increased training time) so let's see if we can combine all of these ideas into one very deep and very powerful speech recognition platform.\n", "\n", "\n", "## Aggregate Models\n", "\n", "The aggregate Keras model is a fine tuned implementation of model_5 (CNN + Deep BRNN + TDD). The model will consist of 1 convolutional layer, 2 GRU layers, and 1 Time Distributed Dense layer. The convolutional layer conducts feature/pattern extraction, while the RNN layers develop predictions on those features. This model won't make use of dropout or dilated convolutions as they both led to gradient explosions in tests. We have also increased the number of neurons in each layer. This model is trained for 30 epochs on a 460 hour subset of the data. This is the model deployed in the [heyjetson.com](https://heyjetson.com) project. 
\n", "\n", "Inspiration for the aggregate architecture came from Baidu's [Deep Speech 2](resources/deepspeech2.pdf) engine.\n", "\n", "#### Training with spectrograms" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "def agg_model(input_dim, filters, activation, kernel_size, conv_stride,\n", " conv_border_mode, recur_layers, units, output_dim=29):\n", " # Input\n", " input_data = Input(name='the_input', shape=(None, input_dim))\n", " # Convolutional layer\n", " conv_1d = Conv1D(filters, kernel_size, \n", " strides=conv_stride, \n", " padding=conv_border_mode,\n", " activation=activation,\n", " name='conv1d')(input_data)\n", " # Batch normalization\n", " bn_cnn = BatchNormalization()(conv_1d)\n", " # Bidirectional recurrent layer\n", " brnn = Bidirectional(GRU(units, activation=activation, \n", " return_sequences=True, name='brnn'))(bn_cnn)\n", " # Batch normalization \n", " bn_rnn = BatchNormalization()(brnn)\n", " # Loop for additional layers\n", " for i in range(recur_layers - 1):\n", " name = 'brnn_' + str(i + 1)\n", " brnn = Bidirectional(GRU(units, activation=activation, \n", " return_sequences=True, implementation=2, name=name))(bn_rnn)\n", " bn_rnn = BatchNormalization()(brnn)\n", " # TimeDistributed Dense layer\n", " time_dense = TimeDistributed(Dense(output_dim))(bn_rnn)\n", " # Softmax activation layer\n", " y_pred = Activation('softmax', name='softmax')(time_dense)\n", " # Specifying the model\n", " model = Model(inputs=input_data, outputs=y_pred)\n", " model.output_length = lambda x: cnn_output_length(\n", " x, kernel_size, conv_border_mode, conv_stride)\n", " print(model.summary())\n", " return model" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "_________________________________________________________________\n", "Layer (type) Output Shape Param # \n", 
"=================================================================\n", "the_input (InputLayer) (None, None, 161) 0 \n", "_________________________________________________________________\n", "conv1d (Conv1D) (None, None, 256) 453632 \n", "_________________________________________________________________\n", "batch_normalization_15 (Batc (None, None, 256) 1024 \n", "_________________________________________________________________\n", "bidirectional_8 (Bidirection (None, None, 512) 787968 \n", "_________________________________________________________________\n", "batch_normalization_16 (Batc (None, None, 512) 2048 \n", "_________________________________________________________________\n", "bidirectional_9 (Bidirection (None, None, 512) 1181184 \n", "_________________________________________________________________\n", "batch_normalization_17 (Batc (None, None, 512) 2048 \n", "_________________________________________________________________\n", "time_distributed_8 (TimeDist (None, None, 29) 14877 \n", "_________________________________________________________________\n", "softmax (Activation) (None, None, 29) 0 \n", "=================================================================\n", "Total params: 2,442,781\n", "Trainable params: 2,440,221\n", "Non-trainable params: 2,560\n", "_________________________________________________________________\n", "None\n" ] } ], "source": [ "model_8 = agg_model(input_dim=161, # 161 for Spectrogram/13 for MFCC\n", " filters=256,\n", " activation='relu',\n", " kernel_size=11, \n", " conv_stride=2,\n", " conv_border_mode='valid',\n", " recur_layers=2,\n", " units=256)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_model(input_to_softmax=model_8, \n", " pickle_path='model_8.pickle', \n", " save_model_path='model_8.h5', \n", " spectrogram=True) # True for Spectrogram/False for MFCC" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Training with MFCC's\n", "Let's 
train this model using MFCC's just to see if there is a difference in performance:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "_________________________________________________________________\n", "Layer (type) Output Shape Param # \n", "=================================================================\n", "the_input (InputLayer) (None, None, 13) 0 \n", "_________________________________________________________________\n", "conv1d (Conv1D) (None, None, 256) 36864 \n", "_________________________________________________________________\n", "batch_normalization_18 (Batc (None, None, 256) 1024 \n", "_________________________________________________________________\n", "bidirectional_10 (Bidirectio (None, None, 512) 787968 \n", "_________________________________________________________________\n", "batch_normalization_19 (Batc (None, None, 512) 2048 \n", "_________________________________________________________________\n", "bidirectional_11 (Bidirectio (None, None, 512) 1181184 \n", "_________________________________________________________________\n", "batch_normalization_20 (Batc (None, None, 512) 2048 \n", "_________________________________________________________________\n", "time_distributed_9 (TimeDist (None, None, 29) 14877 \n", "_________________________________________________________________\n", "softmax (Activation) (None, None, 29) 0 \n", "=================================================================\n", "Total params: 2,026,013\n", "Trainable params: 2,023,453\n", "Non-trainable params: 2,560\n", "_________________________________________________________________\n", "None\n" ] } ], "source": [ "model_9 = agg_model(input_dim=13, # 161 for Spectrogram/13 for MFCC\n", " filters=256,\n", " activation='relu',\n", " kernel_size=11, \n", " conv_stride=2,\n", " conv_border_mode='valid',\n", " recur_layers=2,\n", " units=256)" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_model(input_to_softmax=model_9, \n", " pickle_path='model_9.pickle', \n", " save_model_path='model_9.h5', \n", " spectrogram=False) # True for Spectrogram/False for MFCC" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looks like using MFCC's lead to a model that didn't perform quite as well, but did come with a speed up.\n", "\n", "\n", "### Deep Aggregate Model\n", "\n", "Now we will train a deeper version of this model architecture on the full 960 hour data set. For this, we will reintroduce dilated convolutions and recurrent dropout." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "def hey_jetson(input_dim, filters, activation, kernel_size, conv_stride,\n", " conv_border_mode, recur_layers, dilation_rate, units, conv_layers, output_dim=29):\n", " # Input\n", " input_data = Input(name='the_input', shape=(None, input_dim))\n", " # Convolutional layer\n", " conv_1d = Conv1D(filters, kernel_size, \n", " strides=conv_stride, \n", " padding=conv_border_mode,\n", " activation=activation,\n", " name='conv1d')(input_data)\n", " # Batch normalization\n", " bn_cnn = BatchNormalization()(conv_1d)\n", " for i in range(conv_layers - 1):\n", " conv_1d = Conv1D(filters, kernel_size,\n", " padding=conv_border_mode,\n", " activation=activation,\n", " dilation_rate=2**i,\n", " name=\"conv_1d_\"+str(i))(bn_cnn)\n", " bn_cnn = BatchNormalization()(conv_1d)\n", " # Bidirectional recurrent layer\n", " brnn = Bidirectional(GRU(units, activation=activation, \n", " return_sequences=True, implementation=2, recurrent_dropout=0.01, name='brnn'))(bn_cnn)\n", " # Batch normalization \n", " bn_rnn = BatchNormalization()(brnn)\n", " # Loop for additional layers\n", " for i in range(recur_layers - 1):\n", " name = 'brnn_' + str(i + 1)\n", " brnn = Bidirectional(GRU(units, activation=activation, \n", " return_sequences=True, implementation=2, 
name=name))(bn_rnn)\n", " bn_rnn = BatchNormalization()(brnn)\n", " # TimeDistributed Dense layer\n", " time_distributed_dense = TimeDistributed(Dense(1024))(bn_rnn)\n", " time_dense = TimeDistributed(Dense(output_dim))(time_distributed_dense)\n", " # Softmax activation layer\n", " y_pred = Activation('softmax', name='softmax')(time_dense)\n", " # Specifying the model\n", " model = Model(inputs=input_data, outputs=y_pred)\n", " model.output_length = lambda x: cnn_output_length(\n", " x, kernel_size, conv_border_mode, conv_stride)\n", " print(model.summary())\n", " return model" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "_________________________________________________________________\n", "Layer (type) Output Shape Param # \n", "=================================================================\n", "the_input (InputLayer) (None, None, 161) 0 \n", "_________________________________________________________________\n", "conv1d (Conv1D) (None, None, 256) 206336 \n", "_________________________________________________________________\n", "batch_normalization_1 (Batch (None, None, 256) 1024 \n", "_________________________________________________________________\n", "conv_1d_0 (Conv1D) (None, None, 256) 327936 \n", "_________________________________________________________________\n", "batch_normalization_2 (Batch (None, None, 256) 1024 \n", "_________________________________________________________________\n", "conv_1d_1 (Conv1D) (None, None, 256) 327936 \n", "_________________________________________________________________\n", "batch_normalization_3 (Batch (None, None, 256) 1024 \n", "_________________________________________________________________\n", "bidirectional_1 (Bidirection (None, None, 512) 787968 \n", "_________________________________________________________________\n", "batch_normalization_4 (Batch (None, None, 512) 2048 \n", 
"_________________________________________________________________\n", "bidirectional_2 (Bidirection (None, None, 512) 1181184 \n", "_________________________________________________________________\n", "batch_normalization_5 (Batch (None, None, 512) 2048 \n", "_________________________________________________________________\n", "bidirectional_3 (Bidirection (None, None, 512) 1181184 \n", "_________________________________________________________________\n", "batch_normalization_6 (Batch (None, None, 512) 2048 \n", "_________________________________________________________________\n", "bidirectional_4 (Bidirection (None, None, 512) 1181184 \n", "_________________________________________________________________\n", "batch_normalization_7 (Batch (None, None, 512) 2048 \n", "_________________________________________________________________\n", "bidirectional_5 (Bidirection (None, None, 512) 1181184 \n", "_________________________________________________________________\n", "batch_normalization_8 (Batch (None, None, 512) 2048 \n", "_________________________________________________________________\n", "bidirectional_6 (Bidirection (None, None, 512) 1181184 \n", "_________________________________________________________________\n", "batch_normalization_9 (Batch (None, None, 512) 2048 \n", "_________________________________________________________________\n", "bidirectional_7 (Bidirection (None, None, 512) 1181184 \n", "_________________________________________________________________\n", "batch_normalization_10 (Batc (None, None, 512) 2048 \n", "_________________________________________________________________\n", "time_distributed_1 (TimeDist (None, None, 1024) 525312 \n", "_________________________________________________________________\n", "time_distributed_2 (TimeDist (None, None, 29) 29725 \n", "_________________________________________________________________\n", "softmax (Activation) (None, None, 29) 0 \n", 
"=================================================================\n", "Total params: 9,309,725\n", "Trainable params: 9,301,021\n", "Non-trainable params: 8,704\n", "_________________________________________________________________\n", "None\n" ] } ], "source": [ "model_10 = hey_jetson(input_dim=161, # 161 for Spectrogram/13 for MFCC\n", " filters=256,\n", " activation='relu',\n", " kernel_size=5, \n", " conv_stride=2,\n", " recur_layers=7,\n", " conv_border_mode='causal',\n", " conv_layers=3,\n", " dilation_rate=2,\n", " units=256)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_model(input_to_softmax=model_10, \n", " pickle_path='model_10.pickle', \n", " save_model_path='model_10.h5', \n", " spectrogram=True) # True for Spectrogram/False for MFCC" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Attention\n", "\n", "Now, we will add a single Attention layer to the model that will take as input the output from the encoder(the RNN layers). The decoder portion of the model will include the ability to \"attend\" to different parts of the audio clip at each time step. This lets the model learn what to pay attention to based on the input and what it has predicted the output to be so far. Attention allows the network to refer back to the input sequence by giving the network access to its internal memory, which is the hidden state of the encoder.\n", "\n", "Let's define the class that will make up the Attention layer." 
] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "class Attention(keras.layers.Layer):\n", "\n", " ATTENTION_TYPE_ADD = 'additive'\n", " ATTENTION_TYPE_MUL = 'multiplicative'\n", "\n", " def __init__(self,\n", " units=512,\n", " attention_width=None,\n", " attention_type=ATTENTION_TYPE_MUL,\n", " return_attention=False,\n", " history_only=False,\n", " kernel_initializer='glorot_normal',\n", " bias_initializer='zeros',\n", " kernel_regularizer=None,\n", " bias_regularizer=None,\n", " kernel_constraint=None,\n", " bias_constraint=None,\n", " use_additive_bias=True,\n", " use_attention_bias=True,\n", " attention_activation=None,\n", " attention_regularizer_weight=0.0,\n", " **kwargs):\n", " \"\"\"Layer initialization.\n", "\n", " For additive attention, see: https://arxiv.org/pdf/1806.01264.pdf\n", "\n", " :param units: The dimension of the vectors that used to calculate the attention weights.\n", " :param attention_width: The width of local attention.\n", " :param attention_type: 'additive' or 'multiplicative'.\n", " :param return_attention: Whether to return the attention weights for visualization.\n", " :param history_only: Only use historical pieces of data.\n", " :param kernel_initializer: The initializer for weight matrices.\n", " :param bias_initializer: The initializer for biases.\n", " :param kernel_regularizer: The regularization for weight matrices.\n", " :param bias_regularizer: The regularization for biases.\n", " :param kernel_constraint: The constraint for weight matrices.\n", " :param bias_constraint: The constraint for biases.\n", " :param use_additive_bias: Whether to use bias while calculating the relevance of inputs features\n", " in additive mode.\n", " :param use_attention_bias: Whether to use bias while calculating the weights of attention.\n", " :param attention_activation: The activation used for calculating the weights of attention.\n", " :param attention_regularizer_weight: The weights of 
attention regularizer.\n", " :param kwargs: Parameters for parent class.\n", " \"\"\"\n", " super(Attention, self).__init__(**kwargs)\n", " self.supports_masking = True\n", " self.units = units\n", " self.attention_width = attention_width\n", " self.attention_type = attention_type\n", " self.return_attention = return_attention\n", " self.history_only = history_only\n", " if history_only and attention_width is None:\n", " self.attention_width = int(1e9)\n", "\n", " self.use_additive_bias = use_additive_bias\n", " self.use_attention_bias = use_attention_bias\n", " self.kernel_initializer = keras.initializers.get(kernel_initializer)\n", " self.bias_initializer = keras.initializers.get(bias_initializer)\n", " self.kernel_regularizer = keras.regularizers.get(kernel_regularizer)\n", " self.bias_regularizer = keras.regularizers.get(bias_regularizer)\n", " self.kernel_constraint = keras.constraints.get(kernel_constraint)\n", " self.bias_constraint = keras.constraints.get(bias_constraint)\n", " self.attention_activation = keras.activations.get(attention_activation)\n", " self.attention_regularizer_weight = attention_regularizer_weight\n", " self._backend = keras.backend.backend()\n", "\n", " if attention_type == Attention.ATTENTION_TYPE_ADD:\n", " self.Wx, self.Wt, self.bh = None, None, None\n", " self.Wa, self.ba = None, None\n", " elif attention_type == Attention.ATTENTION_TYPE_MUL:\n", " self.Wa, self.ba = None, None\n", " else:\n", " raise NotImplementedError('No implementation for attention type : ' + attention_type)\n", "\n", " def get_config(self):\n", " config = {\n", " 'units': self.units,\n", " 'attention_width': self.attention_width,\n", " 'attention_type': self.attention_type,\n", " 'return_attention': self.return_attention,\n", " 'history_only': self.history_only,\n", " 'use_additive_bias': self.use_additive_bias,\n", " 'use_attention_bias': self.use_attention_bias,\n", " 'kernel_initializer': keras.regularizers.serialize(self.kernel_initializer),\n", " 
'bias_initializer': keras.regularizers.serialize(self.bias_initializer),\n", " 'kernel_regularizer': keras.regularizers.serialize(self.kernel_regularizer),\n", " 'bias_regularizer': keras.regularizers.serialize(self.bias_regularizer),\n", " 'kernel_constraint': keras.constraints.serialize(self.kernel_constraint),\n", " 'bias_constraint': keras.constraints.serialize(self.bias_constraint),\n", " 'attention_activation': keras.activations.serialize(self.attention_activation),\n", " 'attention_regularizer_weight': self.attention_regularizer_weight,\n", " }\n", " base_config = super(Attention, self).get_config()\n", " return dict(list(base_config.items()) + list(config.items()))\n", "\n", " def build(self, input_shape):\n", " if self.attention_type == Attention.ATTENTION_TYPE_ADD:\n", " self._build_additive_attention(input_shape)\n", " elif self.attention_type == Attention.ATTENTION_TYPE_MUL:\n", " self._build_multiplicative_attention(input_shape)\n", " super(Attention, self).build(input_shape)\n", "\n", " def _build_additive_attention(self, input_shape):\n", " feature_dim = int(input_shape[2])\n", "\n", " self.Wt = self.add_weight(shape=(feature_dim, self.units),\n", " name='{}_Add_Wt'.format(self.name),\n", " initializer=self.kernel_initializer,\n", " regularizer=self.kernel_regularizer,\n", " constraint=self.kernel_constraint)\n", " self.Wx = self.add_weight(shape=(feature_dim, self.units),\n", " name='{}_Add_Wx'.format(self.name),\n", " initializer=self.kernel_initializer,\n", " regularizer=self.kernel_regularizer,\n", " constraint=self.kernel_constraint)\n", " if self.use_additive_bias:\n", " self.bh = self.add_weight(shape=(self.units,),\n", " name='{}_Add_bh'.format(self.name),\n", " initializer=self.bias_initializer,\n", " regularizer=self.bias_regularizer,\n", " constraint=self.bias_constraint)\n", "\n", " self.Wa = self.add_weight(shape=(self.units, 1),\n", " name='{}_Add_Wa'.format(self.name),\n", " initializer=self.kernel_initializer,\n", " 
regularizer=self.kernel_regularizer,\n", " constraint=self.kernel_constraint)\n", " if self.use_attention_bias:\n", " self.ba = self.add_weight(shape=(1,),\n", " name='{}_Add_ba'.format(self.name),\n", " initializer=self.bias_initializer,\n", " regularizer=self.bias_regularizer,\n", " constraint=self.bias_constraint)\n", "\n", " def _build_multiplicative_attention(self, input_shape):\n", " feature_dim = int(input_shape[2])\n", "\n", " self.Wa = self.add_weight(shape=(feature_dim, feature_dim),\n", " name='{}_Mul_Wa'.format(self.name),\n", " initializer=self.kernel_initializer,\n", " regularizer=self.kernel_regularizer,\n", " constraint=self.kernel_constraint)\n", " if self.use_attention_bias:\n", " self.ba = self.add_weight(shape=(1,),\n", " name='{}_Mul_ba'.format(self.name),\n", " initializer=self.bias_initializer,\n", " regularizer=self.bias_regularizer,\n", " constraint=self.bias_constraint)\n", "\n", " def call(self, inputs, mask=None, **kwargs):\n", " input_len = K.shape(inputs)[1]\n", "\n", " if self.attention_type == Attention.ATTENTION_TYPE_ADD:\n", " e = self._call_additive_emission(inputs)\n", " elif self.attention_type == Attention.ATTENTION_TYPE_MUL:\n", " e = self._call_multiplicative_emission(inputs)\n", "\n", " if self.attention_activation is not None:\n", " e = self.attention_activation(e)\n", " e = K.exp(e - K.max(e, axis=-1, keepdims=True))\n", " if self.attention_width is not None:\n", " if self.history_only:\n", " lower = K.arange(0, input_len) - (self.attention_width - 1)\n", " else:\n", " lower = K.arange(0, input_len) - self.attention_width // 2\n", " lower = K.expand_dims(lower, axis=-1)\n", " upper = lower + self.attention_width\n", " indices = K.expand_dims(K.arange(0, input_len), axis=0)\n", " e = e * K.cast(lower <= indices, K.floatx()) * K.cast(indices < upper, K.floatx())\n", " if mask is not None:\n", " mask = K.cast(mask, K.floatx())\n", " mask = K.expand_dims(mask)\n", " e = K.permute_dimensions(K.permute_dimensions(e * mask, (0, 
2, 1)) * mask, (0, 2, 1))\n", "\n", " # a_{t} = \\text{softmax}(e_t)\n", " s = K.sum(e, axis=-1, keepdims=True)\n", " a = e / (s + K.epsilon())\n", "\n", " # l_t = \\sum_{t'} a_{t, t'} x_{t'}\n", " v = K.batch_dot(a, inputs)\n", " if self.attention_regularizer_weight > 0.0:\n", " self.add_loss(self._attention_regularizer(a))\n", "\n", " if self.return_attention:\n", " return [v, a]\n", " return v\n", "\n", " def _call_additive_emission(self, inputs):\n", " input_shape = K.shape(inputs)\n", " batch_size, input_len = input_shape[0], input_shape[1]\n", "\n", " # h_{t, t'} = \\tanh(x_t^T W_t + x_{t'}^T W_x + b_h)\n", " q = K.expand_dims(K.dot(inputs, self.Wt), 2)\n", " k = K.expand_dims(K.dot(inputs, self.Wx), 1)\n", " if self.use_additive_bias:\n", " h = K.tanh(q + k + self.bh)\n", " else:\n", " h = K.tanh(q + k)\n", "\n", " # e_{t, t'} = W_a h_{t, t'} + b_a\n", " if self.use_attention_bias:\n", " e = K.reshape(K.dot(h, self.Wa) + self.ba, (batch_size, input_len, input_len))\n", " else:\n", " e = K.reshape(K.dot(h, self.Wa), (batch_size, input_len, input_len))\n", " return e\n", "\n", " def _call_multiplicative_emission(self, inputs):\n", " # e_{t, t'} = x_t^T W_a x_{t'} + b_a\n", " e = K.batch_dot(K.dot(inputs, self.Wa), K.permute_dimensions(inputs, (0, 2, 1)))\n", " if self.use_attention_bias:\n", " e += self.ba[0]\n", " return e\n", "\n", " def compute_output_shape(self, input_shape):\n", " output_shape = input_shape\n", " if self.return_attention:\n", " attention_shape = (input_shape[0], output_shape[1], input_shape[1])\n", " return [output_shape, attention_shape]\n", " return output_shape\n", "\n", " def compute_mask(self, inputs, mask=None):\n", " if self.return_attention:\n", " return [mask, None]\n", " return mask\n", "\n", " def _attention_regularizer(self, attention):\n", " batch_size = K.cast(K.shape(attention)[0], K.floatx())\n", " input_len = K.shape(attention)[-1]\n", " indices = K.expand_dims(K.arange(0, input_len), axis=0)\n", " diagonal = 
K.expand_dims(K.arange(0, input_len), axis=-1)\n", " eye = K.cast(K.equal(indices, diagonal), K.floatx())\n", " return self.attention_regularizer_weight * K.sum(K.square(K.batch_dot(\n", " attention,\n", " K.permute_dimensions(attention, (0, 2, 1))) - eye)) / batch_size" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Final Model\n", "\n", "Now we will train the final attention-based model architecture on the full 960-hour dataset." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "def keras_model(input_dim, filters, activation, kernel_size, conv_stride,\n", " conv_border_mode, recur_layers, dilation_rate, units, conv_layers, output_dim=29):\n", " # Input\n", " input_data = Input(name='the_input', shape=(None, input_dim))\n", " # Initial Convolutional layer\n", " conv_1d = Conv1D(filters, kernel_size, \n", " strides=conv_stride, \n", " padding=conv_border_mode,\n", " activation=activation,\n", " name='conv1d')(input_data)\n", " # Batch normalization\n", " bn_cnn = BatchNormalization()(conv_1d)\n", " # Loop for additional dilated layers (dilation doubles per layer; the dilation_rate argument is not used here)\n", " for i in range(conv_layers - 1):\n", " conv_1d = Conv1D(filters, kernel_size,\n", " padding=conv_border_mode,\n", " activation=activation,\n", " dilation_rate=2**i,\n", " name=\"conv_1d_\"+str(i))(bn_cnn)\n", " bn_cnn = BatchNormalization()(conv_1d)\n", " # Initial Bidirectional recurrent layer\n", " brnn = Bidirectional(GRU(units, activation=activation, \n", " return_sequences=True, implementation=2, recurrent_dropout=0.02, name='brnn'))(bn_cnn)\n", " # Batch normalization \n", " bn_rnn = BatchNormalization()(brnn)\n", " # Loop for additional layers\n", " for i in range(recur_layers - 1):\n", " name = 'brnn_' + str(i + 1)\n", " brnn = Bidirectional(GRU(units, activation=activation, \n", " return_sequences=True, implementation=2, name=name))(bn_rnn)\n", " bn_rnn = BatchNormalization()(brnn)\n", " # Attention layer\n", " attentive = Attention()(bn_rnn)\n", " # 
TimeDistributed Dense layers\n", " time_distributed_dense = TimeDistributed(Dense(1024))(attentive)\n", " time_dense = TimeDistributed(Dense(output_dim))(time_distributed_dense)\n", " # Softmax activation layer\n", " y_pred = Activation('softmax', name='softmax')(time_dense)\n", " # Specifying the model\n", " model = Model(inputs=input_data, outputs=y_pred)\n", " model.output_length = lambda x: cnn_output_length(\n", " x, kernel_size, conv_border_mode, conv_stride)\n", " print(model.summary())\n", " return model" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "_________________________________________________________________\n", "Layer (type) Output Shape Param # \n", "=================================================================\n", "the_input (InputLayer) (None, None, 161) 0 \n", "_________________________________________________________________\n", "conv1d (Conv1D) (None, None, 256) 206336 \n", "_________________________________________________________________\n", "batch_normalization_1 (Batch (None, None, 256) 1024 \n", "_________________________________________________________________\n", "conv_1d_0 (Conv1D) (None, None, 256) 327936 \n", "_________________________________________________________________\n", "batch_normalization_2 (Batch (None, None, 256) 1024 \n", "_________________________________________________________________\n", "conv_1d_1 (Conv1D) (None, None, 256) 327936 \n", "_________________________________________________________________\n", "batch_normalization_3 (Batch (None, None, 256) 1024 \n", "_________________________________________________________________\n", "bidirectional_1 (Bidirection (None, None, 512) 787968 \n", "_________________________________________________________________\n", "batch_normalization_4 (Batch (None, None, 512) 2048 \n", "_________________________________________________________________\n", "bidirectional_2 (Bidirection (None, 
None, 512) 1181184 \n", "_________________________________________________________________\n", "batch_normalization_5 (Batch (None, None, 512) 2048 \n", "_________________________________________________________________\n", "bidirectional_3 (Bidirection (None, None, 512) 1181184 \n", "_________________________________________________________________\n", "batch_normalization_6 (Batch (None, None, 512) 2048 \n", "_________________________________________________________________\n", "bidirectional_4 (Bidirection (None, None, 512) 1181184 \n", "_________________________________________________________________\n", "batch_normalization_7 (Batch (None, None, 512) 2048 \n", "_________________________________________________________________\n", "bidirectional_5 (Bidirection (None, None, 512) 1181184 \n", "_________________________________________________________________\n", "batch_normalization_8 (Batch (None, None, 512) 2048 \n", "_________________________________________________________________\n", "bidirectional_6 (Bidirection (None, None, 512) 1181184 \n", "_________________________________________________________________\n", "batch_normalization_9 (Batch (None, None, 512) 2048 \n", "_________________________________________________________________\n", "bidirectional_7 (Bidirection (None, None, 512) 1181184 \n", "_________________________________________________________________\n", "batch_normalization_10 (Batc (None, None, 512) 2048 \n", "_________________________________________________________________\n", "attention_1 (Attention) (None, None, 512) 262145 \n", "_________________________________________________________________\n", "time_distributed_1 (TimeDist (None, None, 1024) 525312 \n", "_________________________________________________________________\n", "time_distributed_2 (TimeDist (None, None, 29) 29725 \n", "_________________________________________________________________\n", "softmax (Activation) (None, None, 29) 0 \n", 
"=================================================================\n", "Total params: 9,571,870\n", "Trainable params: 9,563,166\n", "Non-trainable params: 8,704\n", "_________________________________________________________________\n", "None\n" ] } ], "source": [ "hey_jetson = keras_model(input_dim=161, # 161 for Spectrogram/13 for MFCC\n", " filters=256,\n", " activation='relu',\n", " kernel_size=5, \n", " conv_stride=2,\n", " recur_layers=7,\n", " conv_border_mode='causal',\n", " conv_layers=3,\n", " dilation_rate=2,\n", " units=256)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_model(input_to_softmax=hey_jetson, \n", " pickle_path='model_11.pickle', \n", " save_model_path='model_11.h5', \n", " spectrogram=True) # True for Spectrogram/False for MFCC" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Visualizing The Final Model Architecture" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Outputting a graph of the model architecture for inclusion in the app and repo\n", "keras.utils.plot_model(hey_jetson, to_file='./app/static/images/model_11.png', show_shapes=True, show_layer_names=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now you can visualize the model in TensorBoard by typing ```tensorboard --logdir ./logs``` into a terminal in the repository directory and then navigating to [http://localhost:6006](http://localhost:6006) in your browser.\n", "\n", "Training on our production model has finally concluded!\n", "\n", "With each epoch taking around 9 hours using spectrograms, the total training time on an Nvidia GTX1070(8GB) for each model using the final architecture was roughly 11 days.\n", "\n", "\n", "## Comparing Models\n", "\n", "- [Final Model Performance](#test)\n", "- [Cosine Similarity](#similarity)\n", "- [Word Error Rate](#error_rate)\n", "- [Benchmarking Performance](#benchmark)\n", 
"\n", "#### First, let's plot the loss from the saved pickle files for each model:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load saved model pickles\n", "all_pickles = sorted(glob(\"results/*.pickle\"))\n", "# Extract model names\n", "model_names = [item[8:-7] for item in all_pickles]\n", "# Extract loss history\n", "valid_loss = [pickle.load( open( i, \"rb\" ) )['val_loss'] for i in all_pickles]\n", "train_loss = [pickle.load( open( i, \"rb\" ) )['loss'] for i in all_pickles]\n", "# Identify number of epochs each model ran for\n", "num_epochs = [len(valid_loss[i]) for i in range(len(valid_loss))]\n", "\n", "fig = plt.figure(figsize=(16,5))\n", "\n", "# Plot the training loss vs. epochs\n", "ax1 = fig.add_subplot(121)\n", "for i in range(len(all_pickles)):\n", " ax1.plot(np.linspace(1, num_epochs[i], num_epochs[i]), \n", " train_loss[i], label=model_names[i])\n", "ax1.legend() \n", "ax1.set_xlim([1, max(num_epochs)])\n", "plt.xlabel('Epoch')\n", "plt.ylabel('Training Loss')\n", "\n", "# Plot the validation loss vs. epochs\n", "ax2 = fig.add_subplot(122)\n", "for i in range(len(all_pickles)):\n", " ax2.plot(np.linspace(1, num_epochs[i], num_epochs[i]), \n", " valid_loss[i], label=model_names[i])\n", "ax2.legend() \n", "ax2.set_xlim([1, max(num_epochs)])\n", "plt.xlabel('Epoch')\n", "plt.ylabel('Validation Loss')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Final Model Performance\n", "\n", "Language modeling, the component of a speech recognition system that estimates the prior probabilities of spoken sounds, is the system's knowledge of what probable word sequences are. 
This system uses a class-based language model, which lets it narrow its search through the vocabulary of the speech recognizer (the first part of the system): it will rarely see a sentence like \"the dog the ate sand the water\", so it learns that 'the' is unlikely to follow the word 'sand'. We do this by assigning a probability to every possible sentence and then picking the word with the highest prior probability of occurring. On its own, this produces a model that assigns a probability of 0 to anything it has not seen in training; language model smoothing (often called discounting) overcomes this by distributing non-zero probabilities over all possible occurrences in proportion to the unigram probabilities of words. This overcomes the limitations of traditional n-gram based modeling and is made possible by the added dimension of time sequences in the recurrent neural network.\n", "\n", "The best performing model is considered the one that gives the highest probabilities to the words found in a test set, since it wastes less probability on words that do not occur. More information on comparing models can be found in this [paper](resources/comparingmodels.pdf). \n", "\n", "#### Let's check out our model predictions:\n", "\n", "We'll also benchmark how long it takes the model to produce the predictions." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_predictions(index, partition, input_to_softmax, model_path):\n", " # Load the train and test data\n", " data_gen = AudioGenerator(spectrogram = spectrogram)\n", " data_gen.load_train_data()\n", " data_gen.load_validation_data()\n", " data_gen.load_test_data()\n", " # Obtain ground truth transcriptions and audio features \n", " if partition == 'validation':\n", " transcription = data_gen.valid_texts[index]\n", " audio_path = data_gen.valid_audio_paths[index]\n", " data_point = data_gen.normalize(data_gen.featurize(audio_path))\n", " elif partition == 'train':\n", " transcription = data_gen.train_texts[index]\n", " audio_path = data_gen.train_audio_paths[index]\n", " data_point = data_gen.normalize(data_gen.featurize(audio_path))\n", " elif partition == 'test':\n", " transcription = data_gen.test_texts[index]\n", " audio_path = data_gen.test_audio_paths[index]\n", " data_point = data_gen.normalize(data_gen.featurize(audio_path))\n", " else:\n", " raise Exception('Invalid partition! 
Must be \"train\", \"test\", or \"validation\"') \n", " # Obtain predictions\n", " input_to_softmax.load_weights(model_path)\n", " prediction = input_to_softmax.predict(np.expand_dims(data_point, axis=0))\n", " output_length = [input_to_softmax.output_length(data_point.shape[0])] \n", " pred_ints = (K.eval(K.ctc_decode(\n", " prediction, output_length)[0][0])+1).flatten().tolist()\n", " # Display the ground truth transcription and the predicted transcription.\n", " print('True transcription:\\n' + '\\n' + transcription)\n", " print('Predicted transcription:\\n' + '\\n' + ''.join(int_seq_to_text(pred_ints)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%time get_predictions(index=95, partition='train', input_to_softmax=hey_jetson, model_path='./results/model_11.h5')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "%time get_predictions(index=95, partition='validation', input_to_softmax=hey_jetson, model_path='./results/model_11.h5')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%time get_predictions(index=95, partition='test', input_to_softmax=hey_jetson, model_path='./results/model_11.h5')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Now, let's check the aggregate model:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%time get_predictions(index=95, partition='test', input_to_softmax=model_10, model_path='./results/model_10.h5')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Now, let's check the spectrogram model trained on 460 hours of audio:" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True transcription:\n", "\n", "in the absence of a hypodermic syringe the remedy may be given by the rectum\n", "Predicted 
transcription:\n", "\n", "inse absens of the hapademec shaenge sevemety may be gave in vye of recttim\n", "CPU times: user 2.29 s, sys: 89.4 ms, total: 2.37 s\n", "Wall time: 2.48 s\n" ] } ], "source": [ "%time get_predictions(index=95, partition='test', input_to_softmax=model_8, model_path='./results/model_8.h5')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Let's compare the final model's predictions on the test set against our last 3 developmental models and the original RNN model." ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True transcription:\n", "\n", "in the absence of a hypodermic syringe the remedy may be given by the rectum\n", "Predicted transcription:\n", "\n", "en tet fom af e heacedemo fovang the vemat o mo de givven by tha re tem\n", "CPU times: user 2.39 s, sys: 105 ms, total: 2.49 s\n", "Wall time: 2.56 s\n" ] } ], "source": [ "# Dilated Deep CNN + Deep Bidirectional RNN + Time Distributed Dense\n", "%time get_predictions(index=95, partition='test', input_to_softmax=model_7, model_path='./results/model_7.h5')" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True transcription:\n", "\n", "in the absence of a hypodermic syringe the remedy may be given by the rectum\n", "Predicted transcription:\n", "\n", "eensteedsoms of the had edem ofs hemine the veman youmo be given by thevectim\n", "CPU times: user 2.33 s, sys: 105 ms, total: 2.43 s\n", "Wall time: 2.5 s\n" ] } ], "source": [ "# Deep CNN + Deep Bidirectional RNN + Time Distributed Dense w/ Dropout\n", "%time get_predictions(index=95, partition='test', input_to_softmax=model_6, model_path='./results/model_6.h5')" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True transcription:\n", "\n", "in the absence of a 
hypodermic syringe the remedy may be given by the rectum\n", "Predicted transcription:\n", "\n", "insth at srens of he had a demme sivvems thof remmantd ye mobe gaven by the vectame\n", "CPU times: user 2.34 s, sys: 80.1 ms, total: 2.42 s\n", "Wall time: 2.49 s\n" ] } ], "source": [ "# Model_5 on the 100 hour subset\n", "%time get_predictions(index=95, partition='test', input_to_softmax=model_5, model_path='./results/model_5.h5')" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True transcription:\n", "\n", "in the absence of a hypodermic syringe the remedy may be given by the rectum\n", "Predicted transcription:\n", "\n", " \n", "CPU times: user 2.1 s, sys: 88.9 ms, total: 2.19 s\n", "Wall time: 2.21 s\n" ] } ], "source": [ "# Initial RNN model\n", "%time get_predictions(index=95, partition='test', input_to_softmax=model_0, model_path='./results/model_0.h5')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We've come a long way. Our final model comes close to the actual spoken transcription, while the first few models predicted nothing at all, or the same letter for every utterance. Now, let's quantify the final model's performance.\n", "\n", "#### Calculating error rates:\n", "First we need to obtain the ground truth transcriptions and the predicted transcriptions for the validation and test sets. Then we can use several measures to determine accuracy." 
] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "# Function for extracting the ground truth transcriptions from the audio files.\n", "def get_ground_truth(partition):\n", " ground_truth_list = []\n", " data_gen = AudioGenerator(spectrogram = spectrogram)\n", " if partition == 'train':\n", " data_gen.load_train_data()\n", " for i in range(0, 61956):\n", " transcription = data_gen.train_texts[i]\n", " ground_truth_list.append(transcription)\n", " elif partition == 'validation':\n", " data_gen.load_validation_data()\n", " for i in range(0, 4277):\n", " transcription = data_gen.valid_texts[i]\n", " ground_truth_list.append(transcription)\n", " elif partition == 'test':\n", " data_gen.load_test_data()\n", " for i in range(0, 4176):\n", " transcription = data_gen.test_texts[i]\n", " ground_truth_list.append(transcription)\n", " ground_truth = np.asarray(ground_truth_list)\n", " return ground_truth" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['looking about me i saw a gentleman in a neat black dress smiling and his hand extended to me with great cordiality',\n", " 'he must have realized i was a stranger and wished to tender his hospitality to me i accepted it gratefully i clasped his hand he pressed mine',\n", " \"we gazed for a moment silently into each other's eyes\", ...,\n", " 'that penance hath no blame which magdalen found sweet purging our shame self punishment is virtue all men know',\n", " 'heaven help that body which a little mind housed in a head lacking ears tongue and eyes and senseless but for smell can tyrannise',\n", " 'due to thee their praise of maiden pure of teeming motherhood'],\n", " dtype='\n", "#### Cosine Similarity\n", "\n", "This is a measure where a score of 1 means that the two strings are exactly equal and a score of 0 means the prediction transcription contains none of the words in the ground truth label.\n", "\n", "More info 
on this metric can be found at [Wikipedia](https://en.wikipedia.org/wiki/Cosine_similarity)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_sim(partition, input_to_softmax, model_path):\n", " # Getting the cosine similarity using Count Vectorizer\n", " sim_list = []\n", " data_gen = AudioGenerator(spectrogram = spectrogram)\n", " data_gen.load_test_data()\n", " data_gen.load_validation_data()\n", " data_gen.load_train_data()\n", " if partition == 'train':\n", " for i in range(0, 61956):\n", " transcription = data_gen.train_texts[i]\n", " audio_path = data_gen.train_audio_paths[i]\n", " data_point = data_gen.normalize(data_gen.featurize(audio_path))\n", " input_to_softmax.load_weights(model_path)\n", " prediction = input_to_softmax.predict(np.expand_dims(data_point, axis=0))\n", " output_length = [input_to_softmax.output_length(data_point.shape[0])] \n", " pred_ints = (K.eval(K.ctc_decode(\n", " prediction, output_length)[0][0])+1).flatten().tolist()\n", " pred_trans = ''.join(int_seq_to_text(pred_ints))\n", " cv = CountVectorizer()\n", " ground_truth_vec = cv.fit_transform([transcription])\n", " pred_transcription_vec = cv.transform([pred_trans])\n", " sim = cosine_similarity(ground_truth_vec, pred_transcription_vec)\n", " sim_list.append(sim)\n", " if i%2000 == 0: print('Processed {}'.format(i))\n", " \n", " elif partition == 'validation':\n", " for i in range(0, 4277):\n", " transcription = data_gen.valid_texts[i]\n", " audio_path = data_gen.valid_audio_paths[i]\n", " data_point = data_gen.normalize(data_gen.featurize(audio_path))\n", " input_to_softmax.load_weights(model_path)\n", " prediction = input_to_softmax.predict(np.expand_dims(data_point, axis=0))\n", " output_length = [input_to_softmax.output_length(data_point.shape[0])] \n", " pred_ints = (K.eval(K.ctc_decode(\n", " prediction, output_length)[0][0])+1).flatten().tolist()\n", " pred_trans = ''.join(int_seq_to_text(pred_ints))\n", " cv = 
CountVectorizer()\n", " ground_truth_vec = cv.fit_transform([transcription])\n", " pred_transcription_vec = cv.transform([pred_trans])\n", " sim = cosine_similarity(ground_truth_vec, pred_transcription_vec)\n", " sim_list.append(sim)\n", " if i%200 == 0: print('Processed {}'.format(i))\n", " \n", " elif partition == 'test':\n", " for i in range(0, 4176):\n", " transcription = data_gen.test_texts[i]\n", " audio_path = data_gen.test_audio_paths[i]\n", " data_point = data_gen.normalize(data_gen.featurize(audio_path))\n", " input_to_softmax.load_weights(model_path)\n", " prediction = input_to_softmax.predict(np.expand_dims(data_point, axis=0))\n", " output_length = [input_to_softmax.output_length(data_point.shape[0])] \n", " pred_ints = (K.eval(K.ctc_decode(\n", " prediction, output_length)[0][0])+1).flatten().tolist()\n", " pred_trans = ''.join(int_seq_to_text(pred_ints))\n", " cv = CountVectorizer()\n", " ground_truth_vec = cv.fit_transform([transcription])\n", " pred_transcription_vec = cv.transform([pred_trans])\n", " sim = cosine_similarity(ground_truth_vec, pred_transcription_vec)\n", " sim_list.append(sim)\n", " if i%200 == 0: print('Processed {}'.format(i))\n", "\n", " sim_array = np.asarray(sim_list)\n", " return sim_array" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Extracting the validation count vectorizer cosine similarities\n", "valid_sim = get_sim(partition='validation', \n", " input_to_softmax=hey_jetson, model_path='./results/model_11.h5')\n", "valid_sim" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "valid_sim.mean()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Extracting the test count vectorizer cosine similarities\n", "test_sim = get_sim(partition='test', \n", " input_to_softmax=hey_jetson, model_path='./results/model_11.h5')\n", "test_sim" ] }, { "cell_type": "code", "execution_count": 
null, "metadata": {}, "outputs": [], "source": [ "test_sim.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It looks like we have about an 80% similarity between the predictions and ground truth transcriptions in the validation set and a 78% similarity in the test set when using count vectorization, so let's see if TF-IDF vectorization produces different results:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_tfidf_sim(partition, input_to_softmax, model_path):\n", " # Getting the cosine similarity using Tfidf Vectorizer\n", " sim_list = []\n", " data_gen = AudioGenerator(spectrogram = spectrogram)\n", " data_gen.load_test_data()\n", " data_gen.load_validation_data()\n", " data_gen.load_train_data()\n", " if partition == 'train':\n", " for i in range(0, 61956):\n", " transcription = data_gen.train_texts[i]\n", " audio_path = data_gen.train_audio_paths[i]\n", " data_point = data_gen.normalize(data_gen.featurize(audio_path))\n", " input_to_softmax.load_weights(model_path)\n", " prediction = input_to_softmax.predict(np.expand_dims(data_point, axis=0))\n", " output_length = [input_to_softmax.output_length(data_point.shape[0])] \n", " pred_ints = (K.eval(K.ctc_decode(\n", " prediction, output_length)[0][0])+1).flatten().tolist()\n", " pred_trans = ''.join(int_seq_to_text(pred_ints))\n", " tfidf = TfidfVectorizer()\n", " ground_truth_vec = tfidf.fit_transform([transcription])\n", " pred_transcription_vec = tfidf.transform([pred_trans])\n", " sim = cosine_similarity(ground_truth_vec, pred_transcription_vec)\n", " sim_list.append(sim)\n", " if i%2000 == 0: print('Processed {}'.format(i))\n", " \n", " elif partition == 'validation':\n", " for i in range(0, 4277):\n", " transcription = data_gen.valid_texts[i]\n", " audio_path = data_gen.valid_audio_paths[i]\n", " data_point = data_gen.normalize(data_gen.featurize(audio_path))\n", " input_to_softmax.load_weights(model_path)\n", " prediction = 
input_to_softmax.predict(np.expand_dims(data_point, axis=0))\n", " output_length = [input_to_softmax.output_length(data_point.shape[0])] \n", " pred_ints = (K.eval(K.ctc_decode(\n", " prediction, output_length)[0][0])+1).flatten().tolist()\n", " pred_trans = ''.join(int_seq_to_text(pred_ints))\n", " tfidf = TfidfVectorizer()\n", " ground_truth_vec = tfidf.fit_transform([transcription])\n", " pred_transcription_vec = tfidf.transform([pred_trans])\n", " sim = cosine_similarity(ground_truth_vec, pred_transcription_vec)\n", " sim_list.append(sim)\n", " if i%200 == 0: print('Processed {}'.format(i))\n", " \n", " elif partition == 'test':\n", " for i in range(0, 4176):\n", " transcription = data_gen.test_texts[i]\n", " audio_path = data_gen.test_audio_paths[i]\n", " data_point = data_gen.normalize(data_gen.featurize(audio_path))\n", " input_to_softmax.load_weights(model_path)\n", " prediction = input_to_softmax.predict(np.expand_dims(data_point, axis=0))\n", " output_length = [input_to_softmax.output_length(data_point.shape[0])] \n", " pred_ints = (K.eval(K.ctc_decode(\n", " prediction, output_length)[0][0])+1).flatten().tolist()\n", " pred_trans = ''.join(int_seq_to_text(pred_ints))\n", " tfidf = TfidfVectorizer()\n", " ground_truth_vec = tfidf.fit_transform([transcription])\n", " pred_transcription_vec = tfidf.transform([pred_trans])\n", " sim = cosine_similarity(ground_truth_vec, pred_transcription_vec)\n", " sim_list.append(sim)\n", " if i%200 == 0: print('Processed {}'.format(i))\n", "\n", " sim_array = np.asarray(sim_list)\n", " return sim_array" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Extracting the validation tfidf cosine similarities\n", "valid_tfidf_sim = get_tfidf_sim(partition='validation', \n", " input_to_softmax=hey_jetson, model_path='./results/model_11.h5')\n", "valid_tfidf_sim" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "valid_tfidf_sim.mean()" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Extracting the test tfidf cosine similarities\n", "test_tfidf_sim = get_tfidf_sim(partition='test', \n", " input_to_softmax=hey_jetson, model_path='./results/model_11.h5')\n", "test_tfidf_sim" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_tfidf_sim.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It looks like we have about an 80% similarity between the predictions and ground truth transcriptions in the validation set and around a 78% similarity in the test set when using TF-IDF vectorization as well.\n", "\n", "\n", "#### Word Error Rate\n", "\n", "Word error rate is defined as (substitutions + deletions + insertions) / number of words in the ground truth transcription. \n", "\n", "More info on this metric can be found at [Wikipedia](https://en.wikipedia.org/wiki/Word_error_rate)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def wer_calc(ref, pred):\n", " # Calculate the error rate as the edit distance between the ref and pred sequences, as a percentage of len(ref)\n", " d = np.zeros((len(ref) + 1) * (len(pred) + 1), dtype=np.uint16)\n", " d = d.reshape((len(ref) + 1, len(pred) + 1))\n", " for i in range(len(ref) + 1):\n", " for j in range(len(pred) + 1):\n", " if i == 0:\n", " d[0][j] = j\n", " elif j == 0:\n", " d[i][0] = i\n", "\n", " for i in range(1, len(ref) + 1):\n", " for j in range(1, len(pred) + 1):\n", " if ref[i - 1] == pred[j - 1]:\n", " d[i][j] = d[i - 1][j - 1]\n", " else:\n", " substitution = d[i - 1][j - 1] + 1\n", " insertion = d[i][j - 1] + 1\n", " deletion = d[i - 1][j] + 1\n", " d[i][j] = min(substitution, insertion, deletion)\n", " result = float(d[len(ref)][len(pred)]) / len(ref) * 100\n", " return result\n", " \n", "# Function for extracting the predicted transcriptions from the audio files and calculating word error rate on them\n", "def get_wer(partition, input_to_softmax, model_path):\n", " wer_list = 
[]\n", " data_gen = AudioGenerator(spectrogram = spectrogram)\n", " data_gen.load_test_data()\n", " data_gen.load_validation_data()\n", " data_gen.load_train_data()\n", " if partition == 'train':\n", " for i in range(0, 61956):\n", " transcription = data_gen.train_texts[i]\n", " audio_path = data_gen.train_audio_paths[i]\n", " data_point = data_gen.normalize(data_gen.featurize(audio_path))\n", " input_to_softmax.load_weights(model_path)\n", " prediction = input_to_softmax.predict(np.expand_dims(data_point, axis=0))\n", " output_length = [input_to_softmax.output_length(data_point.shape[0])] \n", " pred_ints = (K.eval(K.ctc_decode(\n", " prediction, output_length)[0][0])+1).flatten().tolist()\n", " pred_trans = ''.join(int_seq_to_text(pred_ints))\n", " error_rate = wer_calc(transcription, pred_trans)\n", " wer_list.append(error_rate)\n", " if i%2000 == 0: print('Processed {}'.format(i))\n", " \n", " elif partition == 'validation':\n", " for i in range(0, 4277):\n", " transcription = data_gen.valid_texts[i]\n", " audio_path = data_gen.valid_audio_paths[i]\n", " data_point = data_gen.normalize(data_gen.featurize(audio_path))\n", " input_to_softmax.load_weights(model_path)\n", " prediction = input_to_softmax.predict(np.expand_dims(data_point, axis=0))\n", " output_length = [input_to_softmax.output_length(data_point.shape[0])] \n", " pred_ints = (K.eval(K.ctc_decode(\n", " prediction, output_length)[0][0])+1).flatten().tolist()\n", " pred_trans = ''.join(int_seq_to_text(pred_ints))\n", " error_rate = wer_calc(transcription, pred_trans)\n", " wer_list.append(error_rate)\n", " if i%200 == 0: print('Processed {}'.format(i))\n", " \n", " elif partition == 'test':\n", " for i in range(0, 4176):\n", " transcription = data_gen.test_texts[i]\n", " audio_path = data_gen.test_audio_paths[i]\n", " data_point = data_gen.normalize(data_gen.featurize(audio_path))\n", " input_to_softmax.load_weights(model_path)\n", " prediction = input_to_softmax.predict(np.expand_dims(data_point, 
axis=0))\n", " output_length = [input_to_softmax.output_length(data_point.shape[0])] \n", " pred_ints = (K.eval(K.ctc_decode(\n", " prediction, output_length)[0][0])+1).flatten().tolist()\n", " pred_trans = ''.join(int_seq_to_text(pred_ints))\n", " error_rate = wer_calc(transcription, pred_trans)\n", " wer_list.append(error_rate)\n", " if i%200 == 0: print('Processed {}'.format(i))\n", "\n", " wer_array = np.asarray(wer_list)\n", " return wer_array" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Extracting the validation word error rates\n", "valid_wer = get_wer(partition='validation', \n", " input_to_softmax=hey_jetson, model_path='./results/model_11.h5')\n", "valid_wer" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Calculating the word error rate in the validation set\n", "valid_wer.mean()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Extracting the test word error rates\n", "test_wer = get_wer(partition='test', \n", " input_to_softmax=hey_jetson, model_path='./results/model_11.h5')\n", "test_wer" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Calculating the word error rate in the test set\n", "test_wer.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We did pretty well! 
Our model achieved a word error rate of about 16% in the validation set and 18% in the test set. That is still fairly high compared to some of the models explored in the reference papers, which reflects the scope of the project: I had no pretrained language model, only a single 8 GB GPU, and limited time to explore further.\n", "\n", "\n", "#### Benchmarking Performance\n", "\n", "Let's time inference on the production system running live on the server to get a feel for how quickly we can serve up predictions:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%time get_predictions(index=12, partition='test', input_to_softmax=hey_jetson, model_path='./results/model_11.h5')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%time get_predictions(index=21, partition='test', input_to_softmax=hey_jetson, model_path='./results/model_11.h5')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%time get_predictions(index=2012, partition='test', input_to_softmax=hey_jetson, model_path='./results/model_11.h5')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Conclusion\n", "\n", "This concludes the model construction demo. You have now trained a strong-performing recurrent neural network for speech recognition, from scratch, with a word error rate of <20%. You have built an ASR model ready for deployment in production environments. 
If you would like to do so, instructions for building and deploying this model on the Nvidia Jetson using the Flask RESTful web app framework for Python are included in the [GitHub Repository](https://github.com/bricewalker/Hey-Jetson).\n", "\n", "### Next Steps\n", "\n", "Next steps for this project, and things you can try on your own, include: \n", "- Build a deeper model with more layers.\n", "- Train the model on [audio with background noise](https://www.tensorflow.org/versions/master/tutorials/audio_recognition).\n", "- Train the model on [Mozilla's Common Voice](https://voice.mozilla.org/) dataset to identify the speaker's gender and accent using this [reference project](https://github.com/mozilla/DeepSpeech).\n", "- Train the model on conversational speech, like that found in the [Buckeye Corpus](https://buckeyecorpus.osu.edu/), [Santa Barbara Corpus](http://www.linguistics.ucsb.edu/research/santa-barbara-corpus), or [COSINE Corpus](http://melodi.ee.washington.edu/cosine/).\n", "- Develop a production system for handling speech with sensitive personal information, as in this reference [paper](resources/privateconversations.pdf). \n", "- Store the audio files in an [SQL database](https://www.mysql.com/) so the inference engine can serve them faster and end users can play them back with [HTML5's audio tag](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/audio).\n", "- Store user-recorded audio for online training of the model to improve performance.\n", "- Recreate the model in [TensorFlow](https://www.tensorflow.org/) for [improved performance](https://github.com/tensorflow/tensorflow). 
[Mozilla](https://github.com/mozilla/DeepSpeech) has demonstrated the incredible power of TensorFlow for ASR.\n", "- Train the model using just the raw audio files, like this project from [Pannous](https://github.com/pannous/tensorflow-speech-recognition).\n", "- Train the model to [identify individual speakers](resources/speakeridentification.pdf) like [Google](resources/googlespeaker.pdf) using the [VoxCeleb](http://www.robots.ox.ac.uk/~vgg/data/voxceleb/) dataset.\n", "- Train the model to identify the speaker's level of [emotion](resources/emotionrecognition.pdf). There are many examples on [GitHub](https://github.com/).\n", "- Convert the inference engine to Nvidia's [TensorRT](https://developer.nvidia.com/tensorrt) inference platform using their [Developer Guide](http://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html) and the [RESTful interface](https://devblogs.nvidia.com/tensorrt-container/).\n", "- Train the model on other languages, like [Baidu's Deep Speech 2](resources/deepspeech2.pdf).\n", "- Try out a [transducer model](resources/transducers.pdf), as Baidu is doing in [Deep Speech 3](http://research.baidu.com/deep-speech-3%EF%BC%9Aexploring-neural-transducers-end-end-speech-recognition/).\n", "- Build a more traditional [encoder/decoder](resources/encoderdecoder.pdf) model as outlined by [Lu et al](resources/encoderdecoder2.pdf). 
\n", "- Add other [augmentation methods](https://distill.pub/2016/augmented-rnns/) besides just attention to the model.\n", "- Add [peephole connections](resources/peepholes.pdf) to the [LSTM cells](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/LSTMCell).\n", "- Add a [Hidden Markov Model](resources/hmm.pdf)/[Gaussian Mixture Model](resources/gmm.pdf).\n", "- Use a pretrained language model like this one from [Kaldi](http://www.kaldi-asr.org/downloads/build/6/trunk/egs/).\n", "- Reduce the word error rate to [<10%](https://hacks.mozilla.org/2017/11/a-journey-to-10-word-error-rate/).\n", "- Include [entity extraction](https://towardsdatascience.com/entity-extraction-using-deep-learning-8014acac6bb8) in the model so that it can begin to identify the topic of discussion.\n", "- Implement a [wake-word detection engine](https://github.com/Picovoice/Porcupine).\n", "\n", "### Special Thanks\n", "\n", "I want to thank the following people/organizations for their support and training:\n", "\n", "- The instructional staff, including Charles Rice, Riley Davis, and David Yerrington, at [General Assembly](https://generalassemb.ly/) for their fantastic training in data science and machine/deep learning.\n", "- Andrew Ng with [deeplearning.ai](https://www.deeplearning.ai/), for developing the [Coursera Course on Sequence Models](https://www.coursera.org/learn/nlp-sequence-models), which helped me understand the mathematics behind recurrent neural networks.\n", "- [Microsoft](https://www.microsoft.com/en-us/) for putting together the [edX course on Speech Recognition Systems](https://www.edx.org/course/speech-recognition-and-synthesis), which helped me understand the history of and theory behind speech recognition systems.\n", "- Alexis Cook and the staff at [Udacity](https://www.udacity.com/), [IBM's Watson team](https://www.ibm.com/watson/), and the [Amazon Alexa](https://developer.amazon.com/alexa) team for the course on [Artificial Intelligence on 
Udacity](https://www.udacity.com/course/artificial-intelligence-nanodegree--nd889) which helped me learn how to apply my knowledge on a real world dataset.\n", "- Paolo Prandoni and Martin Vetterli at [École Polytechnique Fédérale de Lausanne](https://www.epfl.ch/) for teaching the course on [Digital Signal Processing on Coursera](https://www.coursera.org/learn/dsp/) that helped me understand the mathematics behind the Fourier transform.\n", "- The staff at [Nvidia](http://www.nvidia.com/page/home.html) who have helped me learn how to run inference on the Jetson.\n", "- The Seattle DSI-3 Cohort at General Assembly for supporting my journey and giving me good constructive feedback in the development phase of this project.\n", "- [Miguel Grinberg](https://blog.miguelgrinberg.com/index) whose book and online tutorial on Flask helped me learn how to deploy web apps in Flask.\n", "- [Jetson Hacks](http://www.jetsonhacks.com/) for providing several tutorials and repos that helped me learn how to develop on the Jetson.\n", "\n", "### Contributions\n", "\n", "If you would like to contribute to this project, please fork and submit a pull request. I am always open to feedback and would love help with this project.\n", "\n", "[Click here to go back to the top of the notebook](#top)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }