{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "Cb4espuLKJiA" }, "source": [ "##### Copyright 2021 The TensorFlow Authors." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "cellView": "form", "execution": { "iopub.execute_input": "2023-10-27T05:52:28.082781Z", "iopub.status.busy": "2023-10-27T05:52:28.082357Z", "iopub.status.idle": "2023-10-27T05:52:28.085988Z", "shell.execute_reply": "2023-10-27T05:52:28.085428Z" }, "id": "DjZQV2njKJ3U" }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "mTL0TERThT6z" }, "source": [ "\n", " \n", " \n", " \n", " \n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "K2madPFAGHb3" }, "source": [ "# Transfer learning with YAMNet for environmental sound classification\n", "\n", "[YAMNet](https://tfhub.dev/google/yamnet/1) is a pre-trained deep neural network that can predict audio events from [521 classes](https://github.com/tensorflow/models/blob/master/research/audioset/yamnet/yamnet_class_map.csv), such as laughter, barking, or a siren. \n", "\n", " In this tutorial you will learn how to:\n", "\n", "- Load and use the YAMNet model for inference.\n", "- Build a new model using the YAMNet embeddings to classify cat and dog sounds.\n", "- Evaluate and export your model.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "5Mdp2TpBh96Y" }, "source": [ "## Import TensorFlow and other libraries\n" ] }, { "cell_type": "markdown", "metadata": { "id": "zCcKYqu_hvKe" }, "source": [ "Start by installing [TensorFlow I/O](https://www.tensorflow.org/io), which will make it easier for you to load audio files off disk." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2023-10-27T05:52:28.089668Z", "iopub.status.busy": "2023-10-27T05:52:28.089092Z", "iopub.status.idle": "2023-10-27T05:52:59.059468Z", "shell.execute_reply": "2023-10-27T05:52:59.058385Z" }, "id": "urBpRWDHTHHU" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. 
This behaviour is the source of the following dependency conflicts.\r\n", "tensorflow-datasets 4.9.3 requires protobuf>=3.20, but you have protobuf 3.19.6 which is incompatible.\r\n", "tensorflow-metadata 1.14.0 requires protobuf<4.21,>=3.20.3, but you have protobuf 3.19.6 which is incompatible.\u001b[0m\u001b[31m\r\n", "\u001b[0m" ] } ], "source": [ "!pip install -q \"tensorflow==2.11.*\"\n", "# tensorflow_io 0.28 is compatible with TensorFlow 2.11\n", "!pip install -q \"tensorflow_io==0.28.*\"" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2023-10-27T05:52:59.064124Z", "iopub.status.busy": "2023-10-27T05:52:59.063530Z", "iopub.status.idle": "2023-10-27T05:53:01.618126Z", "shell.execute_reply": "2023-10-27T05:53:01.617418Z" }, "id": "7l3nqdWVF-kC" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2023-10-27 05:52:59.905244: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2023-10-27 05:53:00.569783: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory\n", "2023-10-27 05:53:00.569881: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory\n", "2023-10-27 05:53:00.569891: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. 
If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.\n" ] } ], "source": [ "import os\n", "\n", "from IPython import display\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "\n", "import tensorflow as tf\n", "import tensorflow_hub as hub\n", "import tensorflow_io as tfio" ] }, { "cell_type": "markdown", "metadata": { "id": "v9ZhybCnt_bM" }, "source": [ "## About YAMNet\n", "\n", "[YAMNet](https://github.com/tensorflow/models/tree/master/research/audioset/yamnet) is a pre-trained neural network that employs the [MobileNetV1](https://arxiv.org/abs/1704.04861) depthwise-separable convolution architecture. It can use an audio waveform as input and make independent predictions for each of the 521 audio events from the [AudioSet](http://g.co/audioset) corpus.\n", "\n", "Internally, the model extracts \"frames\" from the audio signal and processes batches of these frames. This version of the model uses frames that are 0.96 seconds long and extracts one frame every 0.48 seconds.\n", "\n", "The model accepts a 1-D float32 Tensor or NumPy array containing a waveform of arbitrary length, represented as single-channel (mono) 16 kHz samples in the range `[-1.0, +1.0]`. This tutorial contains code to help you convert WAV files into the supported format.\n", "\n", "The model returns three outputs: the class scores, the embeddings (which you will use for transfer learning), and the log mel [spectrogram](https://www.tensorflow.org/tutorials/audio/simple_audio#spectrogram). You can find more details on [the model's TF Hub page](https://tfhub.dev/google/yamnet/1).\n", "\n", "One specific use of YAMNet is as a high-level feature extractor, through its 1,024-dimensional embedding output. You will use the base (YAMNet) model's embeddings as input features and feed them into your shallower model consisting of one hidden `tf.keras.layers.Dense` layer. 
Then, you will train the network on a small amount of data for audio classification _without_ requiring a lot of labeled data and training end-to-end. (This approach is similar to [transfer learning for image classification with TensorFlow Hub](https://www.tensorflow.org/tutorials/images/transfer_learning_with_hub); refer to that tutorial for more information.)\n", "\n", "First, you will test the model and see the results of classifying audio. You will then construct the data pre-processing pipeline.\n", "\n", "### Loading YAMNet from TensorFlow Hub\n", "\n", "You are going to use a pre-trained YAMNet from [TensorFlow Hub](https://tfhub.dev/) to extract the embeddings from the sound files.\n", "\n", "Loading a model from TensorFlow Hub is straightforward: choose the model, copy its URL, and use the `load` function.\n", "\n", "Note: To read the model's documentation, open the model URL in your browser." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2023-10-27T05:53:01.622541Z", "iopub.status.busy": "2023-10-27T05:53:01.621963Z", "iopub.status.idle": "2023-10-27T05:53:06.081346Z", "shell.execute_reply": "2023-10-27T05:53:06.080619Z" }, "id": "06CWkBV5v3gr" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2023-10-27 05:53:02.611471: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory\n", "2023-10-27 05:53:02.611570: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory\n", "2023-10-27 05:53:02.611631: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or 
directory\n", "2023-10-27 05:53:02.611689: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcufft.so.10'; dlerror: libcufft.so.10: cannot open shared object file: No such file or directory\n", "2023-10-27 05:53:02.666048: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory\n", "2023-10-27 05:53:02.666240: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1934] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.\n", "Skipping registering GPU devices...\n" ] } ], "source": [ "yamnet_model_handle = 'https://tfhub.dev/google/yamnet/1'\n", "yamnet_model = hub.load(yamnet_model_handle)" ] }, { "cell_type": "markdown", "metadata": { "id": "GmrPJ0GHw9rr" }, "source": [ "With the model loaded, you can follow the [YAMNet basic usage tutorial](https://www.tensorflow.org/hub/tutorials/yamnet) and download a sample WAV file to run the inference.\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2023-10-27T05:53:06.085803Z", "iopub.status.busy": "2023-10-27T05:53:06.085161Z", "iopub.status.idle": "2023-10-27T05:53:06.211731Z", "shell.execute_reply": "2023-10-27T05:53:06.211123Z" }, "id": "C5i6xktEq00P" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading data from https://storage.googleapis.com/audioset/miaow_16k.wav\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", " 8192/215546 [>.............................] 
- ETA: 0s" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\r", "215546/215546 [==============================] - 0s 0us/step\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "./test_data/miaow_16k.wav\n" ] } ], "source": [ "testing_wav_file_name = tf.keras.utils.get_file('miaow_16k.wav',\n", " 'https://storage.googleapis.com/audioset/miaow_16k.wav',\n", " cache_dir='./',\n", " cache_subdir='test_data')\n", "\n", "print(testing_wav_file_name)" ] }, { "cell_type": "markdown", "metadata": { "id": "mBm9y9iV2U_-" }, "source": [ "You will need a function to load audio files, which will also be used later when working with the training data. (Learn more about reading audio files and their labels in [Simple audio recognition](https://www.tensorflow.org/tutorials/audio/simple_audio#reading_audio_files_and_their_labels).)\n", "\n", "Note: The returned `wav_data` from `load_wav_16k_mono` is already normalized to values in the `[-1.0, 1.0]` range (for more information, go to [YAMNet's documentation on TF Hub](https://tfhub.dev/google/yamnet/1))." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2023-10-27T05:53:06.215420Z", "iopub.status.busy": "2023-10-27T05:53:06.214799Z", "iopub.status.idle": "2023-10-27T05:53:06.219782Z", "shell.execute_reply": "2023-10-27T05:53:06.219118Z" }, "id": "Xwc9Wrdg2EtY" }, "outputs": [], "source": [ "# Utility function for loading audio files and making sure the sample rate is correct.\n", "\n", "@tf.function\n", "def load_wav_16k_mono(filename):\n", " \"\"\" Load a WAV file, convert it to a float tensor, resample to 16 kHz single-channel audio. 
\"\"\"\n", " file_contents = tf.io.read_file(filename)\n", " wav, sample_rate = tf.audio.decode_wav(\n", " file_contents,\n", " desired_channels=1)\n", " wav = tf.squeeze(wav, axis=-1)\n", " sample_rate = tf.cast(sample_rate, dtype=tf.int64)\n", " wav = tfio.audio.resample(wav, rate_in=sample_rate, rate_out=16000)\n", " return wav" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2023-10-27T05:53:06.222945Z", "iopub.status.busy": "2023-10-27T05:53:06.222542Z", "iopub.status.idle": "2023-10-27T05:53:07.139882Z", "shell.execute_reply": "2023-10-27T05:53:07.139110Z" }, "id": "FRqpjkwB0Jjw" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow/python/autograph/pyct/static_analysis/liveness.py:83: Analyzer.lamba_check (from tensorflow.python.autograph.pyct.static_analysis.liveness) is deprecated and will be removed after 2023-09-23.\n", "Instructions for updating:\n", "Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow/python/autograph/pyct/static_analysis/liveness.py:83: Analyzer.lamba_check (from tensorflow.python.autograph.pyct.static_analysis.liveness) is deprecated and will be removed after 2023-09-23.\n", "Instructions for updating:\n", "Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. 
https://github.com/tensorflow/tensorflow/issues/56089\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:Using a while_loop for converting IO>AudioResample cause there is no registered converter for this op.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:tensorflow:Using a while_loop for converting IO>AudioResample cause there is no registered converter for this op.\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "testing_wav_data = load_wav_16k_mono(testing_wav_file_name)\n", "\n", "_ = plt.plot(testing_wav_data)\n", "\n", "# Play the audio file.\n", "display.Audio(testing_wav_data, rate=16000)" ] }, { "cell_type": "markdown", "metadata": { "id": "6z6rqlEz20YB" }, "source": [ "### Load the class mapping\n", "\n", "It's important to load the class names that YAMNet is able to recognize. The mapping file is available at `yamnet_model.class_map_path()` in CSV format." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2023-10-27T05:53:07.144212Z", "iopub.status.busy": "2023-10-27T05:53:07.143667Z", "iopub.status.idle": "2023-10-27T05:53:07.160140Z", "shell.execute_reply": "2023-10-27T05:53:07.159416Z" }, "id": "6Gyj23e_3Mgr" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Speech\n", "Child speech, kid speaking\n", "Conversation\n", "Narration, monologue\n", "Babbling\n", "Speech synthesizer\n", "Shout\n", "Bellow\n", "Whoop\n", "Yell\n", "Children shouting\n", "Screaming\n", "Whispering\n", "Laughter\n", "Baby laughter\n", "Giggle\n", "Snicker\n", "Belly laugh\n", "Chuckle, chortle\n", "Crying, sobbing\n", "...\n" ] } ], "source": [ "class_map_path = yamnet_model.class_map_path().numpy().decode('utf-8')\n", "class_names = list(pd.read_csv(class_map_path)['display_name'])\n", "\n", "for name in class_names[:20]:\n", " print(name)\n", "print('...')" ] }, { "cell_type": "markdown", "metadata": { "id": "5xbycDnT40u0" }, "source": [ "### Run inference\n", "\n", "YAMNet provides frame-level class scores (i.e., 521 scores for every frame). In order to determine clip-level predictions, the scores can be aggregated per-class across frames (e.g., using mean or max aggregation). This is done below by `tf.reduce_mean(scores, axis=0)`. 
Finally, to find the top-scored class at the clip-level, you take the maximum of the 521 aggregated scores.\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2023-10-27T05:53:07.163484Z", "iopub.status.busy": "2023-10-27T05:53:07.162930Z", "iopub.status.idle": "2023-10-27T05:53:07.447375Z", "shell.execute_reply": "2023-10-27T05:53:07.446624Z" }, "id": "NT0otp-A4Y3u" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The main sound is: Animal\n", "The embeddings shape: (13, 1024)\n" ] } ], "source": [ "scores, embeddings, spectrogram = yamnet_model(testing_wav_data)\n", "class_scores = tf.reduce_mean(scores, axis=0)\n", "top_class = tf.math.argmax(class_scores)\n", "inferred_class = class_names[top_class]\n", "\n", "print(f'The main sound is: {inferred_class}')\n", "print(f'The embeddings shape: {embeddings.shape}')" ] }, { "cell_type": "markdown", "metadata": { "id": "YBaLNg5H5IWa" }, "source": [ "Note: The model correctly inferred an animal sound. Your goal in this tutorial is to increase the model's accuracy for specific classes. Also, notice that the model generated 13 embeddings, 1 per frame." ] }, { "cell_type": "markdown", "metadata": { "id": "fmthELBg1A2-" }, "source": [ "## ESC-50 dataset\n", "\n", "The [ESC-50 dataset](https://github.com/karolpiczak/ESC-50#repository-content) ([Piczak, 2015](https://www.karolpiczak.com/papers/Piczak2015-ESC-Dataset.pdf)) is a labeled collection of 2,000 five-second long environmental audio recordings. The dataset consists of 50 classes, with 40 examples per class.\n", "\n", "Download the dataset and extract it. 
\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2023-10-27T05:53:07.451121Z", "iopub.status.busy": "2023-10-27T05:53:07.450832Z", "iopub.status.idle": "2023-10-27T05:54:05.063024Z", "shell.execute_reply": "2023-10-27T05:54:05.062029Z" }, "id": "MWobqK8JmZOU" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading data from https://github.com/karoldvl/ESC-50/archive/master.zip\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", " 8192/Unknown - 0s 0us/step" ] } ], "source": [ "_ = tf.keras.utils.get_file('esc-50.zip',\n", " 'https://github.com/karoldvl/ESC-50/archive/master.zip',\n", " cache_dir='./',\n", " cache_subdir='datasets',\n", " extract=True)" ] }, { "cell_type": "markdown", "metadata": { "id": "qcruxiuX1cO5" }, "source": [ "### Explore the data\n", "\n", "The metadata for each file is specified in the CSV file at `./datasets/ESC-50-master/meta/esc50.csv`, and all the audio files are in `./datasets/ESC-50-master/audio/`.\n", "\n", "You will create a pandas `DataFrame` with the mapping and use it to get a clearer view of the data.\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "execution": { "iopub.execute_input": "2023-10-27T05:54:05.067487Z", "iopub.status.busy": "2023-10-27T05:54:05.067025Z", "iopub.status.idle": "2023-10-27T05:54:05.080694Z", "shell.execute_reply": "2023-10-27T05:54:05.080110Z" }, "id": "jwmLygPrMAbH" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " filename fold target category esc10 src_file take\n", "0 1-100032-A-0.wav 1 0 dog True 100032 A\n", "1 1-100038-A-14.wav 1 14 chirping_birds False 100038 A\n", "2 1-100210-A-36.wav 1 36 vacuum_cleaner False 100210 A\n", "3 1-100210-B-36.wav 1 36 vacuum_cleaner False 100210 B\n", "4 1-101296-A-19.wav 1 19 thunderstorm False 101296 A" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "esc50_csv = './datasets/ESC-50-master/meta/esc50.csv'\n", "base_data_path = './datasets/ESC-50-master/audio/'\n", "\n", "pd_data = pd.read_csv(esc50_csv)\n", "pd_data.head()" ] }, { "cell_type": "markdown", "metadata": { "id": "7d4rHBEQ2QAU" }, "source": [ "### Filter the data\n", "\n", "Now that the data is stored in the `DataFrame`, apply some transformations:\n", "\n", "- Filter out rows and use only the selected classes - `dog` and `cat`. If you want to use any other classes, this is where you can choose them.\n", "- Amend the filename to have the full path. This will make loading easier later.\n", "- Change targets to be within a specific range. In this example, `dog` will remain at `0`, but `cat` will become `1` instead of its original value of `5`." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "execution": { "iopub.execute_input": "2023-10-27T05:54:05.084289Z", "iopub.status.busy": "2023-10-27T05:54:05.083662Z", "iopub.status.idle": "2023-10-27T05:54:05.096047Z", "shell.execute_reply": "2023-10-27T05:54:05.095430Z" }, "id": "tFnEoQjgs14I" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " filename fold target category \\\n", "0 ./datasets/ESC-50-master/audio/1-100032-A-0.wav 1 0 dog \n", "14 ./datasets/ESC-50-master/audio/1-110389-A-0.wav 1 0 dog \n", "157 ./datasets/ESC-50-master/audio/1-30226-A-0.wav 1 0 dog \n", "158 ./datasets/ESC-50-master/audio/1-30344-A-0.wav 1 0 dog \n", "170 ./datasets/ESC-50-master/audio/1-32318-A-0.wav 1 0 dog \n", "175 ./datasets/ESC-50-master/audio/1-34094-A-5.wav 1 1 cat \n", "176 ./datasets/ESC-50-master/audio/1-34094-B-5.wav 1 1 cat \n", "229 ./datasets/ESC-50-master/audio/1-47819-A-5.wav 1 1 cat \n", "230 ./datasets/ESC-50-master/audio/1-47819-B-5.wav 1 1 cat \n", "231 ./datasets/ESC-50-master/audio/1-47819-C-5.wav 1 1 cat \n", "\n", " esc10 src_file take \n", "0 True 100032 A \n", "14 True 110389 A \n", "157 True 30226 A \n", "158 True 30344 A \n", "170 True 32318 A \n", "175 False 34094 A \n", "176 False 34094 B \n", "229 False 47819 A \n", "230 False 47819 B \n", "231 False 47819 C " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_classes = ['dog', 'cat']\n", "map_class_to_id = {'dog':0, 'cat':1}\n", "\n", "filtered_pd = pd_data[pd_data.category.isin(my_classes)]\n", "\n", "class_id = filtered_pd['category'].apply(lambda name: map_class_to_id[name])\n", "filtered_pd = filtered_pd.assign(target=class_id)\n", "\n", "full_path = filtered_pd['filename'].apply(lambda row: os.path.join(base_data_path, row))\n", "filtered_pd = filtered_pd.assign(filename=full_path)\n", "\n", "filtered_pd.head(10)" ] }, { "cell_type": "markdown", "metadata": { "id": "BkDcBS-aJdCz" }, "source": [ "### Load the audio files and retrieve embeddings\n", "\n", "Here you'll apply the `load_wav_16k_mono` and prepare the WAV data for the model.\n", "\n", "When extracting embeddings from the WAV data, you get an array of shape `(N, 1024)` where `N` is the number of frames that YAMNet found (one for every 0.48 seconds of audio)." 
] }, { "cell_type": "markdown", "metadata": { "id": "AKDT5RomaDKO" }, "source": [ "Your model will use each frame as one input. Therefore, you need to create a new column that has one frame per row. You also need to expand the labels and the `fold` column to properly reflect these new rows.\n", "\n", "The expanded `fold` column keeps the original values. You cannot mix frames across folds because, when performing the splits, parts of the same audio clip could end up in different splits, which would make your validation and test steps less effective." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "execution": { "iopub.execute_input": "2023-10-27T05:54:05.099335Z", "iopub.status.busy": "2023-10-27T05:54:05.099114Z", "iopub.status.idle": "2023-10-27T05:54:05.111402Z", "shell.execute_reply": "2023-10-27T05:54:05.110555Z" }, "id": "u5Rq3_PyKLtU" }, "outputs": [ { "data": { "text/plain": [ "(TensorSpec(shape=(), dtype=tf.string, name=None),\n", " TensorSpec(shape=(), dtype=tf.int64, name=None),\n", " TensorSpec(shape=(), dtype=tf.int64, name=None))" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "filenames = filtered_pd['filename']\n", "targets = filtered_pd['target']\n", "folds = filtered_pd['fold']\n", "\n", "main_ds = tf.data.Dataset.from_tensor_slices((filenames, targets, folds))\n", "main_ds.element_spec" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "execution": { "iopub.execute_input": "2023-10-27T05:54:05.114895Z", "iopub.status.busy": "2023-10-27T05:54:05.114289Z", "iopub.status.idle": "2023-10-27T05:54:05.256661Z", "shell.execute_reply": "2023-10-27T05:54:05.256071Z" }, "id": "rsEfovDVAHGY" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:Using a while_loop for converting IO>AudioResample cause there is no registered converter for this op.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:tensorflow:Using a while_loop for 
converting IO>AudioResample cause there is no registered converter for this op.\n" ] }, { "data": { "text/plain": [ "(TensorSpec(shape=, dtype=tf.float32, name=None),\n", " TensorSpec(shape=(), dtype=tf.int64, name=None),\n", " TensorSpec(shape=(), dtype=tf.int64, name=None))" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def load_wav_for_map(filename, label, fold):\n", " return load_wav_16k_mono(filename), label, fold\n", "\n", "main_ds = main_ds.map(load_wav_for_map)\n", "main_ds.element_spec" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "execution": { "iopub.execute_input": "2023-10-27T05:54:05.260019Z", "iopub.status.busy": "2023-10-27T05:54:05.259414Z", "iopub.status.idle": "2023-10-27T05:54:05.432291Z", "shell.execute_reply": "2023-10-27T05:54:05.431670Z" }, "id": "k0tG8DBNAHcE" }, "outputs": [ { "data": { "text/plain": [ "(TensorSpec(shape=(1024,), dtype=tf.float32, name=None),\n", " TensorSpec(shape=(), dtype=tf.int64, name=None),\n", " TensorSpec(shape=(), dtype=tf.int64, name=None))" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# applies the embedding extraction model to a wav data\n", "def extract_embedding(wav_data, label, fold):\n", " ''' run YAMNet to extract embedding from the wav data '''\n", " scores, embeddings, spectrogram = yamnet_model(wav_data)\n", " num_embeddings = tf.shape(embeddings)[0]\n", " return (embeddings,\n", " tf.repeat(label, num_embeddings),\n", " tf.repeat(fold, num_embeddings))\n", "\n", "# extract embedding\n", "main_ds = main_ds.map(extract_embedding).unbatch()\n", "main_ds.element_spec" ] }, { "cell_type": "markdown", "metadata": { "id": "ZdfPIeD0Qedk" }, "source": [ "### Split the data\n", "\n", "You will use the `fold` column to split the dataset into train, validation and test sets.\n", "\n", "ESC-50 is arranged into five uniformly-sized cross-validation `fold`s, such that clips from the same original 
source are always in the same `fold` - find out more in the [ESC: Dataset for Environmental Sound Classification](https://www.karolpiczak.com/papers/Piczak2015-ESC-Dataset.pdf) paper.\n", "\n", "The last step is to remove the `fold` column from the dataset since you're not going to use it during training.\n" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "execution": { "iopub.execute_input": "2023-10-27T05:54:05.435991Z", "iopub.status.busy": "2023-10-27T05:54:05.435532Z", "iopub.status.idle": "2023-10-27T05:54:05.518530Z", "shell.execute_reply": "2023-10-27T05:54:05.517931Z" }, "id": "1ZYvlFiVsffC" }, "outputs": [], "source": [ "cached_ds = main_ds.cache()\n", "train_ds = cached_ds.filter(lambda embedding, label, fold: fold < 4)\n", "val_ds = cached_ds.filter(lambda embedding, label, fold: fold == 4)\n", "test_ds = cached_ds.filter(lambda embedding, label, fold: fold == 5)\n", "\n", "# remove the folds column now that it's not needed anymore\n", "remove_fold_column = lambda embedding, label, fold: (embedding, label)\n", "\n", "train_ds = train_ds.map(remove_fold_column)\n", "val_ds = val_ds.map(remove_fold_column)\n", "test_ds = test_ds.map(remove_fold_column)\n", "\n", "train_ds = train_ds.cache().shuffle(1000).batch(32).prefetch(tf.data.AUTOTUNE)\n", "val_ds = val_ds.cache().batch(32).prefetch(tf.data.AUTOTUNE)\n", "test_ds = test_ds.cache().batch(32).prefetch(tf.data.AUTOTUNE)" ] }, { "cell_type": "markdown", "metadata": { "id": "v5PaMwvtcAIe" }, "source": [ "## Create your model\n", "\n", "You did most of the work!\n", "Next, define a very simple [Sequential](https://www.tensorflow.org/guide/keras/sequential_model) model with one hidden layer and two outputs to recognize cats and dogs from sounds.\n" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "execution": { "iopub.execute_input": "2023-10-27T05:54:05.522080Z", "iopub.status.busy": "2023-10-27T05:54:05.521664Z", "iopub.status.idle": "2023-10-27T05:54:05.767219Z", 
"shell.execute_reply": "2023-10-27T05:54:05.766572Z" }, "id": "JYCE0Fr1GpN3" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model: \"my_model\"\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "_________________________________________________________________\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " Layer (type) Output Shape Param # \n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "=================================================================\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " dense (Dense) (None, 512) 524800 \n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " \n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " dense_1 (Dense) (None, 2) 1026 \n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " \n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "=================================================================\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Total params: 525,826\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Trainable params: 525,826\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Non-trainable params: 0\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "_________________________________________________________________\n" ] } ], "source": [ "my_model = tf.keras.Sequential([\n", " tf.keras.layers.Input(shape=(1024), dtype=tf.float32,\n", " name='input_embedding'),\n", " tf.keras.layers.Dense(512, activation='relu'),\n", " tf.keras.layers.Dense(len(my_classes))\n", "], name='my_model')\n", "\n", "my_model.summary()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "execution": { "iopub.execute_input": "2023-10-27T05:54:05.774050Z", "iopub.status.busy": "2023-10-27T05:54:05.773324Z", "iopub.status.idle": "2023-10-27T05:54:05.786707Z", "shell.execute_reply": "2023-10-27T05:54:05.786068Z" }, "id": "l1qgH35HY0SE" }, 
"outputs": [], "source": [ "my_model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),\n", " optimizer=\"adam\",\n", " metrics=['accuracy'])\n", "\n", "callback = tf.keras.callbacks.EarlyStopping(monitor='loss',\n", " patience=3,\n", " restore_best_weights=True)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "execution": { "iopub.execute_input": "2023-10-27T05:54:05.790079Z", "iopub.status.busy": "2023-10-27T05:54:05.789666Z", "iopub.status.idle": "2023-10-27T05:54:10.905865Z", "shell.execute_reply": "2023-10-27T05:54:10.904756Z" }, "id": "T3sj84eOZ3pk" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/20\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", " 1/Unknown - 4s 4s/step - loss: 0.9121 - accuracy: 0.0625" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\r", " 14/Unknown - 4s 4ms/step - loss: 0.7774 - accuracy: 0.7679" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\r", "15/15 [==============================] - 5s 42ms/step - loss: 0.7971 - accuracy: 0.7750 - val_loss: 1.2405 - val_accuracy: 0.8625\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Epoch 2/20\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", " 1/15 [=>............................] 
- ETA: 0s - loss: 1.0615 - accuracy: 0.8750" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\r", "15/15 [==============================] - ETA: 0s - loss: 0.5425 - accuracy: 0.8875" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\r", "15/15 [==============================] - 0s 5ms/step - loss: 0.5425 - accuracy: 0.8875 - val_loss: 0.2145 - val_accuracy: 0.9187\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Epoch 3/20\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", " 1/15 [=>............................] - ETA: 0s - loss: 0.1802 - accuracy: 0.9062" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\r", "15/15 [==============================] - 0s 5ms/step - loss: 0.2448 - accuracy: 0.9021 - val_loss: 0.2020 - val_accuracy: 0.9125\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Epoch 4/20\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", " 1/15 [=>............................] 
- ETA: 0s - loss: 0.2989 - accuracy: 0.7812" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\r", "15/15 [==============================] - 0s 5ms/step - loss: 0.2630 - accuracy: 0.9000 - val_loss: 0.2601 - val_accuracy: 0.9125\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Epoch 5/20\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", " 1/15 [=>............................] - ETA: 0s - loss: 0.5643 - accuracy: 0.8125" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\r", "15/15 [==============================] - 0s 4ms/step - loss: 0.3721 - accuracy: 0.9146 - val_loss: 0.9568 - val_accuracy: 0.8750\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Epoch 6/20\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", " 1/15 [=>............................] - ETA: 0s - loss: 0.6629 - accuracy: 0.9375" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\r", "15/15 [==============================] - 0s 5ms/step - loss: 0.4284 - accuracy: 0.9167 - val_loss: 0.2955 - val_accuracy: 0.9125\n" ] } ], "source": [ "history = my_model.fit(train_ds,\n", " epochs=20,\n", " validation_data=val_ds,\n", " callbacks=callback)" ] }, { "cell_type": "markdown", "metadata": { "id": "OAbraYKYpdoE" }, "source": [ "Let's run the `evaluate` method on the test data just to be sure there's no overfitting." 
] }, { "cell_type": "code", "execution_count": 20, "metadata": { "execution": { "iopub.execute_input": "2023-10-27T05:54:10.910100Z", "iopub.status.busy": "2023-10-27T05:54:10.909461Z", "iopub.status.idle": "2023-10-27T05:54:11.066169Z", "shell.execute_reply": "2023-10-27T05:54:11.065474Z" }, "id": "H4Nh5nec3Sky" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\r", " 1/Unknown - 0s 127ms/step - loss: 0.0970 - accuracy: 1.0000" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\r", "5/5 [==============================] - 0s 5ms/step - loss: 0.2311 - accuracy: 0.9062\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Loss: 0.23105616867542267\n", "Accuracy: 0.90625\n" ] } ], "source": [ "loss, accuracy = my_model.evaluate(test_ds)\n", "\n", "print(\"Loss: \", loss)\n", "print(\"Accuracy: \", accuracy)" ] }, { "cell_type": "markdown", "metadata": { "id": "cid-qIrIpqHS" }, "source": [ "You did it!" 
] }, { "cell_type": "markdown", "metadata": { "id": "nCKZonrJcXab" }, "source": [ "## Test your model\n", "\n", "Next, try your model on the embeddings of the same audio clip you tested earlier with YAMNet alone.\n" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "execution": { "iopub.execute_input": "2023-10-27T05:54:11.069966Z", "iopub.status.busy": "2023-10-27T05:54:11.069511Z", "iopub.status.idle": "2023-10-27T05:54:11.103911Z", "shell.execute_reply": "2023-10-27T05:54:11.103146Z" }, "id": "79AFpA3_ctCF" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The main sound is: cat\n" ] } ], "source": [ "scores, embeddings, spectrogram = yamnet_model(testing_wav_data)\n", "result = my_model(embeddings).numpy()\n", "\n", "# Average the per-frame logits over the clip, then pick the highest-scoring class.\n", "inferred_class = my_classes[result.mean(axis=0).argmax()]\n", "print(f'The main sound is: {inferred_class}')" ] }, { "cell_type": "markdown", "metadata": { "id": "k2yleeev645r" }, "source": [ "## Save a model that can directly take a WAV file as input\n", "\n", "Your model works when you give it the embeddings as input.\n", "\n", "In a real-world scenario, you'll want to use audio data as a direct input.\n", "\n", "To do that, you will combine YAMNet with your model into a single model that you can export for other applications.\n", "\n", "To make the model's result easier to use, the final layer will be a `reduce_mean` operation. When you use this model for serving (which you will learn about later in the tutorial), you will need the name of the final layer. If you don't define one, TensorFlow generates an auto-incremented name, which makes testing hard because the name changes every time you train the model. A raw TensorFlow operation can't be given a name. 
To address this issue, you'll create a custom layer that applies `reduce_mean` and name it `'classifier'`.\n" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "execution": { "iopub.execute_input": "2023-10-27T05:54:11.107678Z", "iopub.status.busy": "2023-10-27T05:54:11.107073Z", "iopub.status.idle": "2023-10-27T05:54:11.111411Z", "shell.execute_reply": "2023-10-27T05:54:11.110761Z" }, "id": "QUVCI2Suunpw" }, "outputs": [], "source": [ "# Keras layer wrapper around `tf.math.reduce_mean`, so the operation can be given a name.\n", "class ReduceMeanLayer(tf.keras.layers.Layer):\n", "  def __init__(self, axis=0, **kwargs):\n", "    super().__init__(**kwargs)\n", "    self.axis = axis\n", "\n", "  def call(self, inputs):\n", "    return tf.math.reduce_mean(inputs, axis=self.axis)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "execution": { "iopub.execute_input": "2023-10-27T05:54:11.114511Z", "iopub.status.busy": "2023-10-27T05:54:11.114256Z", "iopub.status.idle": "2023-10-27T05:54:20.714211Z", "shell.execute_reply": "2023-10-27T05:54:20.713444Z" }, "id": "zE_Npm0nzlwc" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:Compiled the loaded model, but the compiled metrics have yet to be built. `model.compile_metrics` will be empty until you train or evaluate the model.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:tensorflow:Compiled the loaded model, but the compiled metrics have yet to be built. `model.compile_metrics` will be empty until you train or evaluate the model.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:absl:Found untraced functions such as _update_step_xla while saving (showing 1 of 1). 
These functions will not be directly callable after loading.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Assets written to: ./dogs_and_cats_yamnet/assets\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:tensorflow:Assets written to: ./dogs_and_cats_yamnet/assets\n" ] } ], "source": [ "saved_model_path = './dogs_and_cats_yamnet'\n", "\n", "# The serving model takes a 1-D float32 waveform as its input.\n", "input_segment = tf.keras.layers.Input(shape=(), dtype=tf.float32, name='audio')\n", "embedding_extraction_layer = hub.KerasLayer(yamnet_model_handle,\n", "                                            trainable=False, name='yamnet')\n", "# YAMNet returns (scores, embeddings, spectrogram); only the embeddings are used.\n", "_, embeddings_output, _ = embedding_extraction_layer(input_segment)\n", "serving_outputs = my_model(embeddings_output)\n", "# Average the per-frame outputs into a single prediction for the whole clip.\n", "serving_outputs = ReduceMeanLayer(axis=0, name='classifier')(serving_outputs)\n", "serving_model = tf.keras.Model(input_segment, serving_outputs)\n", "serving_model.save(saved_model_path, include_optimizer=False)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "execution": { "iopub.execute_input": "2023-10-27T05:54:20.718070Z", "iopub.status.busy": "2023-10-27T05:54:20.717809Z", "iopub.status.idle": "2023-10-27T05:54:20.855159Z", "shell.execute_reply": "2023-10-27T05:54:20.854269Z" }, "id": "y-0bY5FMme1C" }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tf.keras.utils.plot_model(serving_model)" ] }, { "cell_type": "markdown", "metadata": { "id": "btHQDN9mqxM_" }, "source": [ "Load your saved model to verify that it works as expected." 
] }, { "cell_type": "code", "execution_count": 25, "metadata": { "execution": { "iopub.execute_input": "2023-10-27T05:54:20.858881Z", "iopub.status.busy": "2023-10-27T05:54:20.858600Z", "iopub.status.idle": "2023-10-27T05:54:25.899633Z", "shell.execute_reply": "2023-10-27T05:54:25.898933Z" }, "id": "KkYVpJS72WWB" }, "outputs": [], "source": [ "reloaded_model = tf.saved_model.load(saved_model_path)" ] }, { "cell_type": "markdown", "metadata": { "id": "4BkmvvNzq49l" }, "source": [ "And for the final test: given some sound data, does your model return the correct result?" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "execution": { "iopub.execute_input": "2023-10-27T05:54:25.903709Z", "iopub.status.busy": "2023-10-27T05:54:25.903453Z", "iopub.status.idle": "2023-10-27T05:54:26.199203Z", "shell.execute_reply": "2023-10-27T05:54:26.198485Z" }, "id": "xeXtD5HO28y-" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The main sound is: cat\n" ] } ], "source": [ "reloaded_results = reloaded_model(testing_wav_data)\n", "cat_or_dog = my_classes[tf.math.argmax(reloaded_results)]\n", "print(f'The main sound is: {cat_or_dog}')" ] }, { "cell_type": "markdown", "metadata": { "id": "ZRrOcBYTUgwn" }, "source": [ "To try your new model in a serving setup, call it through the `serving_default` signature. The result is a dictionary keyed by the name of the output layer, `classifier`." 
] }, { "cell_type": "code", "execution_count": 27, "metadata": { "execution": { "iopub.execute_input": "2023-10-27T05:54:26.203019Z", "iopub.status.busy": "2023-10-27T05:54:26.202384Z", "iopub.status.idle": "2023-10-27T05:54:26.406752Z", "shell.execute_reply": "2023-10-27T05:54:26.406030Z" }, "id": "ycC8zzDSUG2s" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The main sound is: cat\n" ] } ], "source": [ "serving_results = reloaded_model.signatures['serving_default'](testing_wav_data)\n", "cat_or_dog = my_classes[tf.math.argmax(serving_results['classifier'])]\n", "print(f'The main sound is: {cat_or_dog}')\n" ] }, { "cell_type": "markdown", "metadata": { "id": "da7blblCHs8c" }, "source": [ "## (Optional) Some more testing\n", "\n", "The model is ready.\n", "\n", "Let's compare it to YAMNet on the test dataset." ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "execution": { "iopub.execute_input": "2023-10-27T05:54:26.410675Z", "iopub.status.busy": "2023-10-27T05:54:26.410029Z", "iopub.status.idle": "2023-10-27T05:54:26.885973Z", "shell.execute_reply": "2023-10-27T05:54:26.885289Z" }, "id": "vDf5MASIIN1z" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "./datasets/ESC-50-master/audio/5-203128-A-0.wav\n", "WARNING:tensorflow:Using a while_loop for converting IO>AudioResample cause there is no registered converter for this op.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:tensorflow:Using a while_loop for converting IO>AudioResample cause there is no registered converter for this op.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Waveform values: [-5.1828759e-09 1.5151235e-08 -1.1082188e-08 ... 4.9873297e-03\n", " 5.2141696e-03 4.2495923e-03]\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "test_pd = filtered_pd.loc[filtered_pd['fold'] == 5]\n", "row = test_pd.sample(1)\n", "filename = row['filename'].item()\n", "print(filename)\n", "waveform = load_wav_16k_mono(filename)\n", "print(f'Waveform values: {waveform}')\n", "_ = plt.plot(waveform)\n", "\n", "display.Audio(waveform, rate=16000)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "execution": { "iopub.execute_input": "2023-10-27T05:54:26.889257Z", "iopub.status.busy": "2023-10-27T05:54:26.888990Z", "iopub.status.idle": "2023-10-27T05:54:27.190216Z", "shell.execute_reply": "2023-10-27T05:54:27.189409Z" }, "id": "eYUzFxYJIcE1" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[YAMNet] The main sound is: Animal (0.9544865489006042)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[Your model] The main sound is: dog (0.999213457107544)\n" ] } ], "source": [ "# Run the model, check the output.\n", "scores, embeddings, spectrogram = yamnet_model(waveform)\n", "class_scores = tf.reduce_mean(scores, axis=0)\n", "top_class = tf.math.argmax(class_scores)\n", "inferred_class = class_names[top_class]\n", "top_score = class_scores[top_class]\n", "print(f'[YAMNet] The main sound is: {inferred_class} ({top_score})')\n", "\n", "reloaded_results = reloaded_model(waveform)\n", "your_top_class = tf.math.argmax(reloaded_results)\n", "your_inferred_class = my_classes[your_top_class]\n", "class_probabilities = tf.nn.softmax(reloaded_results, axis=-1)\n", "your_top_score = class_probabilities[your_top_class]\n", "print(f'[Your model] The main sound is: {your_inferred_class} ({your_top_score})')" ] }, { "cell_type": "markdown", "metadata": { "id": "g8Tsym8Rq-0V" }, "source": [ "## Next steps\n", "\n", "You have created a model that can classify sounds from dogs or cats. 
With the same idea and a different dataset you can try, for example, building an [acoustic identifier of birds](https://www.kaggle.com/c/birdclef-2021/) based on their singing.\n", "\n", "Share your project with the TensorFlow team on social media!\n" ] } ], "metadata": { "accelerator": "GPU", "colab": { "collapsed_sections": [], "name": "transfer_learning_audio.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.18" } }, "nbformat": 4, "nbformat_minor": 0 }