{
  "cells": [
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "view-in-github"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/Carleslc/AudioToText/blob/master/AudioToText.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "v5hvo8QWN-a9"
      },
      "source": [
        "# 🗣️ [**AudioToText**](https://github.com/Carleslc/AudioToText)\n",
        "\n",
        "[![Donate](https://www.ko-fi.com/img/githubbutton_sm.svg)](https://ko-fi.com/carleslc)\n",
        "\n",
        "### 🛠 [Whisper by OpenAI](https://github.com/openai/whisper)\n"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "U_lylR1xWMxk"
      },
      "source": [
        "## [Step 1] ⚙️ Install the required libraries\n",
        "\n",
        "Click ▶️ button below to install the dependencies for this notebook."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "SJl7HJOeo0-P"
      },
      "outputs": [],
      "source": [
        "#@title { display-mode: \"form\" }\n",
        "import subprocess\n",
        "\n",
        "from sys import platform as sys_platform\n",
        "\n",
        "status, ffmpeg_version = subprocess.getstatusoutput(\"ffmpeg -version\")\n",
        "\n",
        "if status != 0:\n",
        "  from platform import platform\n",
        "\n",
        "  if sys_platform == 'linux' and 'ubuntu' in platform().lower():\n",
        "    !apt install ffmpeg\n",
        "  else:\n",
        "    print(\"Install ffmpeg: https://ffmpeg.org/download.html\")\n",
        "else:\n",
        "  print(ffmpeg_version.split('\\n')[0])\n",
        "\n",
        "  NO_ROOT_WARNING = '|& grep -v \\\"WARNING: Running pip as the \\'root\\' user\"' # running in Colab\n",
        "\n",
        "  !pip install --no-warn-script-location --user --upgrade pip {NO_ROOT_WARNING}\n",
        "  !pip install --root-user-action=ignore git+https://github.com/openai/whisper.git@v20231117 openai==1.9.0 numpy scipy deepl pydub cohere ffmpeg-python torch==2.1.0 tensorflow-probability==0.23.0 typing-extensions==4.9.0"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "5A5bTMB8XmtI"
      },
      "source": [
        "## [Step 2] 📁 Upload your audio files to the Files folder\n",
        "\n",
        "⬅️ Files folder in Google Colab is on the left menu\n",
        "\n",
        "Almost any audio or video file format is [supported](https://gist.github.com/Carleslc/1d6b922c8bf4a7e9627a6970d178b3a6)."
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "pS8OFGobrfJx"
      },
      "source": [
        "## [Step 2.5] 🎙 Record your own audio ⏺\n",
        "\n",
        "This is an **optional** step to record your microphone, useful if you do not have an audio file to upload and want to create one.\n",
        "\n",
        "Run this cell to start recording your microphone.\n",
        "A button will appear to stop the recording when you're done.\n",
        "\n",
        "The recording will be saved as `recording.wav` which you can use in the next step `audio_file`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "BwkH_QCu60qd"
      },
      "outputs": [],
      "source": [
        "# This cell code is a slightly modified version of DotCSV code in the following Colab along with other references:\n",
        "# https://colab.research.google.com/drive/1CvvYPAFemIZdSOt9fhN541esSlZR7Ic6?usp=sharing\n",
        "\n",
        "try:\n",
        "  import io\n",
        "  import ffmpeg\n",
        "  import numpy as np\n",
        "\n",
        "  # Only available in Google Colab\n",
        "  from google.colab.output import eval_js\n",
        "\n",
        "  from IPython.display import HTML, Audio\n",
        "  from scipy.io.wavfile import write, read as wav_read\n",
        "  from base64 import b64decode\n",
        "  from os.path import isfile\n",
        "\n",
        "  AUDIO_HTML = \"\"\"\n",
        "  <script>\n",
        "  var my_div = document.createElement(\"DIV\");\n",
        "  var my_p = document.createElement(\"P\");\n",
        "  var my_btn = document.createElement(\"BUTTON\");\n",
        "  var t = document.createTextNode(\"Starting recording...\");\n",
        "\n",
        "  my_btn.appendChild(t);\n",
        "  my_div.appendChild(my_btn);\n",
        "  document.body.appendChild(my_div);\n",
        "\n",
        "  var base64data = 0;\n",
        "  var reader;\n",
        "  var recorder, gumStream;\n",
        "  var recordButton = my_btn;\n",
        "\n",
        "  var handleSuccess = function(stream) {\n",
        "    gumStream = stream;\n",
        "    var options = {\n",
        "      bitsPerSecond: 16000,\n",
        "      mimeType : 'audio/webm;codecs=opus' //codecs=pcm\n",
        "    };\n",
        "    recorder = new MediaRecorder(stream, options);\n",
        "    //recorder = new MediaRecorder(stream);\n",
        "\n",
        "    recorder.ondataavailable = function(e) {\n",
        "      var url = URL.createObjectURL(e.data);\n",
        "      var preview = document.createElement('audio');\n",
        "      preview.controls = true;\n",
        "      preview.src = url;\n",
        "      document.body.appendChild(preview);\n",
        "\n",
        "      reader = new FileReader();\n",
        "      reader.readAsDataURL(e.data);\n",
        "      reader.onloadend = function() {\n",
        "        base64data = reader.result;\n",
        "        //console.log(\"reader.onloadend: \" + base64data);\n",
        "      }\n",
        "    };\n",
        "    recorder.start();\n",
        "    recordButton.innerText = \"🔴 Recording... press to STOP\";\n",
        "  };\n",
        "\n",
        "  navigator.mediaDevices.getUserMedia({audio: true}).then(handleSuccess);\n",
        "\n",
        "  function toggleRecording() {\n",
        "    if (recorder && recorder.state == \"recording\") {\n",
        "        recorder.stop();\n",
        "        gumStream.getAudioTracks()[0].stop();\n",
        "        recordButton.innerText = \"Saving the recording... please wait!\";\n",
        "    }\n",
        "  }\n",
        "\n",
        "  // https://stackoverflow.com/a/951057\n",
        "  function sleep(ms) {\n",
        "    return new Promise(resolve => setTimeout(resolve, ms));\n",
        "  }\n",
        "\n",
        "  var data = new Promise(resolve => {\n",
        "    recordButton.onclick = () => {\n",
        "      toggleRecording();\n",
        "\n",
        "      sleep(2000).then(() => {\n",
        "        // wait 2000ms for the data to be available...\n",
        "        //console.log(\"resolve data: \" + base64data);\n",
        "        resolve(base64data.toString());\n",
        "      });\n",
        "    }\n",
        "  });\n",
        "\n",
        "  function doneRecording(recording_file) {\n",
        "    my_div.removeChild(recordButton);\n",
        "    my_p.innerText = recording_file;\n",
        "    my_div.appendChild(my_p);\n",
        "  }\n",
        "\n",
        "  </script>\n",
        "  \"\"\"\n",
        "\n",
        "  def get_audio():\n",
        "    display(HTML(AUDIO_HTML))\n",
        "    data = eval_js(\"data\")\n",
        "    binary = b64decode(data.split(',')[1])\n",
        "    \n",
        "    process = (ffmpeg\n",
        "      .input('pipe:0')\n",
        "      .output('pipe:1', format='wav')\n",
        "      .run_async(pipe_stdin=True, pipe_stdout=True, pipe_stderr=True, quiet=True, overwrite_output=True)\n",
        "    )\n",
        "    output, err = process.communicate(input=binary)\n",
        "    \n",
        "    riff_chunk_size = len(output) - 8\n",
        "    # Break up the chunk size into four bytes, held in b.\n",
        "    q = riff_chunk_size\n",
        "    b = []\n",
        "    for i in range(4):\n",
        "      q, r = divmod(q, 256)\n",
        "      b.append(r)\n",
        "\n",
        "    # Replace bytes 4:8 in proc.stdout with the actual size of the RIFF chunk.\n",
        "    riff = output[:4] + bytes(b) + output[8:]\n",
        "\n",
        "    sr, audio = wav_read(io.BytesIO(riff))\n",
        "\n",
        "    return audio, sr\n",
        "\n",
        "  recording_file = \"recording.wav\" #@param {type:\"string\"}\n",
        "\n",
        "  if isfile(recording_file):\n",
        "    print(f\"{recording_file} already exists, if you want to create another recording with the same name, delete it first\")\n",
        "  else:\n",
        "    # record microphone\n",
        "    audio, sr = get_audio()\n",
        "\n",
        "    # write recording\n",
        "    write(recording_file, sr, audio)\n",
        "\n",
        "    eval_js(f'doneRecording(\"{recording_file}\")')\n",
        "except ImportError:\n",
        "  print(\"Recording only available in Google Colab\")"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "-9_I0W3tqTjr"
      },
      "source": [
        "## [Step 3] 👂 Transcribe or Translate\n",
        "\n",
        "3.1. Choose a `task`:\n",
        "  - `Transcribe` speech to text in the same language of the source audio file.\n",
        "  - `Translate to English` speech to text in English.\n",
        "  \n",
        "Translation to other languages is not supported with _Whisper_ by default.\n",
        "You may try to choose the _Transcribe_ task and set your desired `language`, but translation is not guaranteed. However, you can use **_DeepL_** later in the Step 5 to translate the transcription to another language.\n",
        "\n",
        "3.2. Edit the `audio_file` to match your uploaded file name to transcribe.\n",
        "\n",
        "- If you want to transcribe multiple files with the same parameters you must separate their file names with commas `,`\n",
        "\n",
        "3.3. Run this cell and wait for the transcription to complete.\n",
        "\n",
        "  You can try other parameters if the result with default parameters does not suit your needs.\n",
        "\n",
        "  [Available models and languages](https://github.com/openai/whisper#available-models-and-languages)\n",
        "\n",
        "  Setting the `language` to the language of source audio file may provide better results than Auto-Detect.\n",
        "\n",
        "  You can add an optional initial `prompt` to provide context about the audio or encourage a specific writing style, see the [prompting guide](https://platform.openai.com/docs/guides/speech-to-text/prompting).\n",
        "\n",
        "  If the execution takes too long to complete you can choose a smaller model in `use_model`, with an accuracy tradeoff, or use the OpenAI API.\n",
        "\n",
        "  By default the open-source models are used, but you can also use the OpenAI API if the `api_key` parameter is set with your [OpenAI API Key](https://platform.openai.com/account/api-keys), which can improve the inference speed substantially, but it has an associated cost, see [API pricing](https://openai.com/pricing#audio-models).\n",
        "\n",
        "  When using API some options are fixed: **use_model** is ignored (uses _large-v2_) and **coherence_preference** is ignored (uses _More coherence_).\n",
        "  \n",
        "  More parameters are available in the code `options` object."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "opNkn_Lgpat4"
      },
      "outputs": [],
      "source": [
        "import os, subprocess\n",
        "\n",
        "import whisper\n",
        "from whisper.utils import format_timestamp, get_writer, WriteTXT\n",
        "\n",
        "import numpy as np\n",
        "\n",
        "try:\n",
        "  import tensorflow  # required in Colab to avoid protobuf compatibility issues\n",
        "except ImportError:\n",
        "  pass\n",
        "\n",
        "import torch\n",
        "\n",
        "import math\n",
        "\n",
        "from openai import OpenAI\n",
        "\n",
        "# select task\n",
        "\n",
        "task = \"Transcribe\" #@param [\"Transcribe\", \"Translate to English\"]\n",
        "\n",
        "task = \"transcribe\" if task == \"Transcribe\" else \"translate\"\n",
        "\n",
        "# select audio file\n",
        "\n",
        "audio_file = \"recording.wav\" #@param {type:\"string\"}\n",
        "\n",
        "audio_files = list(map(lambda audio_path: audio_path.strip(), audio_file.split(',')))\n",
        "\n",
        "for audio_path in audio_files:\n",
        "  if not os.path.isfile(audio_path):\n",
        "    raise FileNotFoundError(audio_path)\n",
        "\n",
        "# set model\n",
        "\n",
        "use_model = \"large-v2\" #@param [\"tiny\", \"base\", \"small\", \"medium\", \"large-v1\", \"large-v2\"]\n",
        "\n",
        "# select language\n",
        "\n",
        "language = \"Auto-Detect\" #@param [\"Auto-Detect\", \"Afrikaans\", \"Albanian\", \"Amharic\", \"Arabic\", \"Armenian\", \"Assamese\", \"Azerbaijani\", \"Bashkir\", \"Basque\", \"Belarusian\", \"Bengali\", \"Bosnian\", \"Breton\", \"Bulgarian\", \"Burmese\", \"Castilian\", \"Catalan\", \"Chinese\", \"Croatian\", \"Czech\", \"Danish\", \"Dutch\", \"English\", \"Estonian\", \"Faroese\", \"Finnish\", \"Flemish\", \"French\", \"Galician\", \"Georgian\", \"German\", \"Greek\", \"Gujarati\", \"Haitian\", \"Haitian Creole\", \"Hausa\", \"Hawaiian\", \"Hebrew\", \"Hindi\", \"Hungarian\", \"Icelandic\", \"Indonesian\", \"Italian\", \"Japanese\", \"Javanese\", \"Kannada\", \"Kazakh\", \"Khmer\", \"Korean\", \"Lao\", \"Latin\", \"Latvian\", \"Letzeburgesch\", \"Lingala\", \"Lithuanian\", \"Luxembourgish\", \"Macedonian\", \"Malagasy\", \"Malay\", \"Malayalam\", \"Maltese\", \"Maori\", \"Marathi\", \"Moldavian\", \"Moldovan\", \"Mongolian\", \"Myanmar\", \"Nepali\", \"Norwegian\", \"Nynorsk\", \"Occitan\", \"Panjabi\", \"Pashto\", \"Persian\", \"Polish\", \"Portuguese\", \"Punjabi\", \"Pushto\", \"Romanian\", \"Russian\", \"Sanskrit\", \"Serbian\", \"Shona\", \"Sindhi\", \"Sinhala\", \"Sinhalese\", \"Slovak\", \"Slovenian\", \"Somali\", \"Spanish\", \"Sundanese\", \"Swahili\", \"Swedish\", \"Tagalog\", \"Tajik\", \"Tamil\", \"Tatar\", \"Telugu\", \"Thai\", \"Tibetan\", \"Turkish\", \"Turkmen\", \"Ukrainian\", \"Urdu\", \"Uzbek\", \"Valencian\", \"Vietnamese\", \"Welsh\", \"Yiddish\", \"Yoruba\"]\n",
        "\n",
        "# other parameters\n",
        "\n",
        "prompt = \"\" #@param {type:\"string\"}\n",
        "\n",
        "coherence_preference = \"More coherence, but may repeat text\" #@param [\"More coherence, but may repeat text\", \"Less repetitions, but may have less coherence\"]\n",
        "\n",
        "api_key = '' #@param {type:\"string\"}\n",
        "\n",
        "# detect device\n",
        "\n",
        "if api_key:\n",
        "  print(\"Using API\")\n",
        "\n",
        "  from pydub import AudioSegment\n",
        "  from pydub.silence import split_on_silence\n",
        "else:\n",
        "  DEVICE = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
        "\n",
        "  print(f\"Using {'GPU' if DEVICE == 'cuda' else 'CPU ⚠️'}\")\n",
        "\n",
        "  # https://medium.com/analytics-vidhya/the-google-colab-system-specification-check-69d159597417\n",
        "  if DEVICE == \"cuda\":\n",
        "    !nvidia-smi -L\n",
        "  else:\n",
        "    if sys_platform == 'linux':\n",
        "      !lscpu | grep \"Model name\" | awk '{$1=$1};1'\n",
        "    \n",
        "    print(\"Not using GPU can result in a very slow execution\")\n",
        "    print(\"Ensure Hardware accelerator by GPU is enabled in Google Colab: Runtime > Change runtime type\")\n",
        "\n",
        "    if use_model not in ['tiny', 'base', 'small']:\n",
        "      print(\"You may also want to try a smaller model (tiny, base, small)\")\n",
        "\n",
        "# display language\n",
        "\n",
        "WHISPER_LANGUAGES = [k.title() for k in whisper.tokenizer.TO_LANGUAGE_CODE.keys()]\n",
        "\n",
        "if language == \"Auto-Detect\":\n",
        "  language = \"detect\"\n",
        "\n",
        "if language and language != \"detect\" and language not in WHISPER_LANGUAGES:\n",
        "  print(f\"\\nLanguage '{language}' is invalid\")\n",
        "  language = \"detect\"\n",
        "\n",
        "if language and language != \"detect\":\n",
        "  print(f\"\\nLanguage: {language}\")\n",
        "\n",
        "# load model\n",
        "\n",
        "if api_key:\n",
        "  print()\n",
        "else:\n",
        "  MODELS_WITH_ENGLISH_VERSION = [\"tiny\", \"base\", \"small\", \"medium\"]\n",
        "\n",
        "  if language == \"English\" and use_model in MODELS_WITH_ENGLISH_VERSION:\n",
        "    use_model += \".en\"\n",
        "\n",
        "  print(f\"\\nLoading {use_model} model... {os.path.expanduser(f'~/.cache/whisper/{use_model}.pt')}\")\n",
        "\n",
        "  model = whisper.load_model(use_model, device=DEVICE)\n",
        "\n",
        "  print(\n",
        "      f\"Model {use_model} is {'multilingual' if model.is_multilingual else 'English-only'} \"\n",
        "      f\"and has {sum(np.prod(p.shape) for p in model.parameters()):,d} parameters.\\n\"\n",
        "  )\n",
        "\n",
        "# set options\n",
        "\n",
        "## https://github.com/openai/whisper/blob/v20231117/whisper/transcribe.py#L37\n",
        "## https://github.com/openai/whisper/blob/v20231117/whisper/decoding.py#L81\n",
        "options = {\n",
        "    'task': task,\n",
        "    'verbose': True,\n",
        "    'fp16': True,\n",
        "    'best_of': 5,\n",
        "    'beam_size': 5,\n",
        "    'patience': None,\n",
        "    'length_penalty': None,\n",
        "    'suppress_tokens': '-1',\n",
        "    'temperature': (0.0, 0.2, 0.4, 0.6, 0.8, 1.0), # float or tuple\n",
        "    'condition_on_previous_text': coherence_preference == \"More coherence, but may repeat text\",\n",
        "    'initial_prompt': prompt or None,\n",
        "    'word_timestamps': False,\n",
        "}\n",
        "\n",
        "if api_key:\n",
        "  api_client = OpenAI(api_key=api_key)\n",
        "\n",
        "  api_supported_formats = ['mp3', 'mp4', 'mpeg', 'mpga', 'm4a', 'wav', 'webm']\n",
        "  api_max_bytes = 25 * 1024 * 1024 # 25 MB\n",
        "\n",
        "  api_transcribe = api_client.audio.transcriptions if task == 'transcribe' else api_client.audio.translations\n",
        "  api_transcribe = api_transcribe.create\n",
        "  \n",
        "  api_model = 'whisper-1' # large-v2\n",
        "\n",
        "  # https://platform.openai.com/docs/api-reference/audio?lang=python\n",
        "  api_options = {\n",
        "    'response_format': 'verbose_json',\n",
        "  }\n",
        "\n",
        "  if prompt:\n",
        "    api_options['prompt'] = prompt\n",
        "  \n",
        "  api_temperature = options['temperature'][0] if isinstance(options['temperature'], (tuple, list)) else options['temperature']\n",
        "  \n",
        "  if isinstance(api_temperature, (float, int)):\n",
        "    api_options['temperature'] = api_temperature\n",
        "  else:\n",
        "    raise ValueError(\"Invalid temperature type, it must be a float or a tuple of floats\")\n",
        "elif DEVICE == 'cpu':\n",
        "  options['fp16'] = False\n",
        "  torch.set_num_threads(os.cpu_count())\n",
        "\n",
        "# execute task\n",
        "# !whisper \"{audio_file}\" --task {task} --model {use_model} --output_dir {output_dir} --device {DEVICE} --verbose {options['verbose']}\n",
        "\n",
        "if task == \"translate\":\n",
        "  print(\"-- TRANSLATE TO ENGLISH --\")\n",
        "else:\n",
        "  print(\"-- TRANSCRIPTION --\")\n",
        "\n",
        "results = {} # audio_path to result\n",
        "\n",
        "for audio_path in audio_files:\n",
        "  print(f\"\\nProcessing: {audio_path}\\n\")\n",
        "\n",
        "  # detect language\n",
        "  detect_language = not language or language == \"detect\"\n",
        "\n",
        "  if not detect_language:\n",
        "    options['language'] = language\n",
        "    source_language_code = whisper.tokenizer.TO_LANGUAGE_CODE.get(language.lower())\n",
        "  elif not api_key:\n",
        "    # load audio and pad/trim it to fit 30 seconds\n",
        "    audio = whisper.load_audio(audio_path)\n",
        "    audio = whisper.pad_or_trim(audio)\n",
        "\n",
        "    # make log-Mel spectrogram and move to the same device as the model\n",
        "    mel = whisper.log_mel_spectrogram(audio).to(model.device)\n",
        "\n",
        "    # detect the spoken language\n",
        "    _, probs = model.detect_language(mel)\n",
        "\n",
        "    source_language_code = max(probs, key=probs.get)\n",
        "    options['language'] = whisper.tokenizer.LANGUAGES[source_language_code].title()\n",
        "    \n",
        "    print(f\"Detected language: {options['language']}\\n\")\n",
        "\n",
        "  # transcribe\n",
        "  if api_key:\n",
        "    # API\n",
        "    if task == \"transcribe\" and not detect_language:\n",
        "      api_options['language'] = source_language_code\n",
        "    \n",
        "    source_audio_name_path, source_audio_ext = os.path.splitext(audio_path)\n",
        "    source_audio_ext = source_audio_ext[1:]\n",
        "\n",
        "    if source_audio_ext in api_supported_formats:\n",
        "      api_audio_path = audio_path\n",
        "      api_audio_ext = source_audio_ext\n",
        "    else:\n",
        "      ## convert audio file to a supported format\n",
        "      if options['verbose']:\n",
        "        print(f\"API supported formats: {','.join(api_supported_formats)}\")\n",
        "        print(f\"Converting {source_audio_ext} audio to a supported format...\")\n",
        "\n",
        "      api_audio_ext = 'mp3'\n",
        "\n",
        "      api_audio_path = f'{source_audio_name_path}.{api_audio_ext}'\n",
        "\n",
        "      subprocess.run(['ffmpeg', '-i', audio_path, api_audio_path], check=True, capture_output=True)\n",
        "\n",
        "      if options['verbose']:\n",
        "        print(api_audio_path, end='\\n\\n')\n",
        "\n",
        "    ## split audio file in chunks\n",
        "    api_audio_chunks = []\n",
        "\n",
        "    audio_bytes = os.path.getsize(api_audio_path)\n",
        "\n",
        "    if audio_bytes >= api_max_bytes:\n",
        "      if options['verbose']:\n",
        "        print(f\"Audio exceeds API maximum allowed file size.\\nSplitting audio in chunks...\")\n",
        "      \n",
        "      audio_segment_file = AudioSegment.from_file(api_audio_path, api_audio_ext)\n",
        "\n",
        "      min_chunks = math.ceil(audio_bytes / (api_max_bytes / 2))\n",
        "\n",
        "      # print(f\"Min chunks: {min_chunks}\")\n",
        "\n",
        "      max_chunk_milliseconds = int(len(audio_segment_file) // min_chunks)\n",
        "\n",
        "      # print(f\"Max chunk milliseconds: {max_chunk_milliseconds}\")\n",
        "\n",
        "      def add_chunk(api_audio_chunk):\n",
        "        api_audio_chunk_path = f\"{source_audio_name_path}_{len(api_audio_chunks) + 1}.{api_audio_ext}\"\n",
        "        api_audio_chunk.export(api_audio_chunk_path, format=api_audio_ext)\n",
        "        api_audio_chunks.append(api_audio_chunk_path)\n",
        "      \n",
        "      def raw_split(big_chunk):\n",
        "        subchunks = math.ceil(len(big_chunk) / max_chunk_milliseconds)\n",
        "\n",
        "        for subchunk_i in range(subchunks):\n",
        "          chunk_start = max_chunk_milliseconds * subchunk_i\n",
        "          chunk_end = min(max_chunk_milliseconds * (subchunk_i + 1), len(big_chunk))\n",
        "          add_chunk(big_chunk[chunk_start:chunk_end])\n",
        "      \n",
        "      non_silent_chunks = split_on_silence(audio_segment_file,\n",
        "                                           seek_step=5, # ms\n",
        "                                           min_silence_len=1250, # ms\n",
        "                                           silence_thresh=-25, # dB\n",
        "                                           keep_silence=True) # needed to aggregate timestamps\n",
        "\n",
        "      # print(f\"Non silent chunks: {len(non_silent_chunks)}\")\n",
        "      \n",
        "      current_chunk = non_silent_chunks[0] if non_silent_chunks else audio_segment_file\n",
        "\n",
        "      for next_chunk in non_silent_chunks[1:]:\n",
        "        if len(current_chunk) > max_chunk_milliseconds:\n",
        "          raw_split(current_chunk)\n",
        "          current_chunk = next_chunk\n",
        "        elif len(current_chunk) + len(next_chunk) <= max_chunk_milliseconds:\n",
        "          current_chunk += next_chunk\n",
        "        else:\n",
        "          add_chunk(current_chunk)\n",
        "          current_chunk = next_chunk\n",
        "      \n",
        "      if len(current_chunk) > max_chunk_milliseconds:\n",
        "        raw_split(current_chunk)\n",
        "      else:\n",
        "        add_chunk(current_chunk)\n",
        "      \n",
        "      if options['verbose']:\n",
        "        print(f'Total chunks: {len(api_audio_chunks)}\\n')\n",
        "    else:\n",
        "      api_audio_chunks.append(api_audio_path)\n",
        "    \n",
        "    ## process chunks\n",
        "    result = None\n",
        "\n",
        "    for api_audio_chunk_path in api_audio_chunks:\n",
        "      ## API request\n",
        "      with open(api_audio_chunk_path, 'rb') as api_audio_file:\n",
        "        api_result = api_transcribe(model=api_model, file=api_audio_file, **api_options)\n",
        "        api_result = api_result.model_dump() # to dict\n",
        "      \n",
        "      api_segments = api_result['segments']\n",
        "      \n",
        "      if result:\n",
        "        ## update timestamps\n",
        "        last_segment_timestamp = result['segments'][-1]['end'] if result['segments'] else 0\n",
        "\n",
        "        for segment in api_segments:\n",
        "          segment['start'] += last_segment_timestamp\n",
        "          segment['end'] += last_segment_timestamp\n",
        "\n",
        "        ## append new segments\n",
        "        result['segments'].extend(api_segments)\n",
        "        \n",
        "        if 'duration' in result:\n",
        "          result['duration'] += api_result.get('duration', 0)\n",
        "      else:\n",
        "        ## first request\n",
        "        result = api_result\n",
        "        \n",
        "        if detect_language:\n",
        "          print(f\"Detected language: {result['language'].title()}\\n\")\n",
        "    \n",
        "      ## display segments\n",
        "      if options['verbose']:\n",
        "        for segment in api_segments:\n",
        "          print(f\"[{format_timestamp(segment['start'])} --> {format_timestamp(segment['end'])}] {segment['text']}\")\n",
        "  else:\n",
        "    # Open-Source\n",
        "    result = whisper.transcribe(model, audio_path, **options)\n",
        "\n",
        "  # fix results formatting\n",
        "  for segment in result['segments']:\n",
        "    segment['text'] = segment['text'].strip()\n",
        "  \n",
        "  result['text'] = '\\n'.join(map(lambda segment: segment['text'], result['segments']))\n",
        "\n",
        "  # set results for this audio file\n",
        "  results[audio_path] = result"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "hTrxbUivk_h3"
      },
      "source": [
        "## [Step 4] 💾 **Save results**\n",
        "\n",
        "Run this cell to write the transcription as a file output.\n",
        "\n",
        "Results will be available in the **audio_transcription** folder in the formats selected in `output_formats`.\n",
        "\n",
        "If you don't see that folder, you may need to refresh 🔄 the Files folder.\n",
        "\n",
        "Available formats: `txt,vtt,srt,tsv,json`"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "wNsrB45_lCIl"
      },
      "outputs": [],
      "source": [
        "# set output folder\n",
        "output_dir = \"audio_transcription\"\n",
        "\n",
        "# set output formats: https://github.com/openai/whisper/blob/v20231117/whisper/utils.py#L283\n",
        "output_formats = \"txt,vtt,srt\" #@param [\"txt,vtt,srt,tsv,json\", \"txt,vtt,srt\", \"txt,vtt\", \"txt,srt\", \"txt\", \"vtt\", \"srt\", \"tsv\", \"json\"] {allow-input: true}\n",
        "output_formats = output_formats.split(',')\n",
        "\n",
        "from typing import TextIO\n",
        "\n",
        "class WriteText(WriteTXT):\n",
        "\n",
        "  def write_result(self, result: dict, file: TextIO, **kwargs):\n",
        "    print(result['text'], file=file, flush=True)\n",
        "\n",
        "def write_result(result, output_format, output_file_name):\n",
        "  output_format = output_format.strip()\n",
        "\n",
        "  # start captions in non-zero timestamp (some media players does not detect the first caption)\n",
        "  fix_vtt = output_format == 'vtt' and result['segments'] and result['segments'][0].get('start') == 0\n",
        "  \n",
        "  if fix_vtt:\n",
        "    result['segments'][0]['start'] += 1/1000 # +1ms\n",
        "\n",
        "  # write result in the desired format\n",
        "  writer = WriteText(output_dir) if output_format == 'txt' else get_writer(output_format, output_dir)\n",
        "  writer(result, output_file_name)\n",
        "\n",
        "  if fix_vtt:\n",
        "    result['segments'][0]['start'] = 0 # reset change\n",
        "\n",
        "  output_file_path = os.path.join(output_dir, f\"{output_file_name}.{output_format}\")\n",
        "  print(output_file_path)\n",
        "\n",
        "# save results\n",
        "\n",
        "print(\"Writing results...\")\n",
        "\n",
        "os.makedirs(output_dir, exist_ok=True)\n",
        "\n",
        "for audio_path, result in results.items():\n",
        "  print(end='\\n')\n",
        "  \n",
        "  output_file_name = os.path.splitext(os.path.basename(audio_path))[0]\n",
        "\n",
        "  for output_format in output_formats:\n",
        "    write_result(result, output_format, output_file_name)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "ZfkDhNMMvY8s"
      },
      "source": [
        "## [Step 5] 💬 Translate results with DeepL (API key needed)\n",
        "\n",
        "This is an **optional** step to translate the transcription to another language using the **DeepL** API.\n",
        "\n",
        "[Get a DeepL Developer Account API Key](https://www.deepl.com/pro-api)\n",
        "\n",
        "Set the `deepl_api_key` to translate the transcription to a supported language in `deepl_target_language`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "28f7EIP-rez0"
      },
      "outputs": [],
      "source": [
        "import deepl\n",
        "\n",
        "# translation service options (DeepL Developer Account)\n",
        "\n",
        "deepl_api_key = \"\" #@param {type:\"string\"}\n",
        "\n",
        "deepl_target_language = \"\" #@param [\"\", \"Bulgarian\", \"Chinese (simplified)\", \"Czech\", \"Danish\", \"Dutch\", \"English (American)\", \"English (British)\", \"Estonian\", \"Finnish\", \"French\", \"German\", \"Greek\", \"Hungarian\", \"Indonesian\", \"Italian\", \"Japanese\", \"Korean\", \"Latvian\", \"Lithuanian\", \"Norwegian\", \"Polish\", \"Portuguese (Brazilian)\", \"Portuguese (European)\", \"Romanian\", \"Russian\", \"Slovak\", \"Slovenian\", \"Spanish\", \"Swedish\", \"Turkish\", \"Ukrainian\"]\n",
        "\n",
        "deepl_formality = \"default\" #@param [\"default\", \"formal\", \"informal\"]\n",
        "\n",
        "deepl_coherence_preference = \"Share context between lines\" #@param [\"Share context between lines\", \"Translate each line independently\"]\n",
        "deepl_coherence_preference = deepl_coherence_preference == \"Share context between lines\"\n",
        "\n",
        "if not deepl_api_key:\n",
        "  print(\"Required: deepl_api_key\")\n",
        "  print(\"Get a DeepL Developer Account API Key: https://www.deepl.com/pro-api\")\n",
        "\n",
        "if not deepl_target_language:\n",
        "  print(\"Required: deepl_target_language\")\n",
        "elif deepl_target_language == 'English':\n",
        "  deepl_target_language = \"English (British)\"\n",
        "elif deepl_target_language == 'Chinese':\n",
        "  deepl_target_language = \"Chinese (simplified)\"\n",
        "elif deepl_target_language == 'Portuguese':\n",
        "  deepl_target_language = \"Portuguese (European)\"\n",
        "\n",
        "use_deepl_translation = deepl_api_key and deepl_target_language\n",
        "\n",
        "if use_deepl_translation:\n",
        "  if deepl_formality != 'default':\n",
        "    deepl_formality = 'prefer_more' if deepl_formality == 'formal' else 'prefer_less'\n",
        "\n",
        "  translated_results = {} # audio_path to translated results\n",
        "\n",
        "  try:\n",
        "    deepl_translator = deepl.Translator(deepl_api_key)\n",
        "\n",
        "    deepl_source_languages = [lang.code.upper() for lang in deepl_translator.get_source_languages()]\n",
        "    \n",
        "    deepl_target_languages_dict = deepl_translator.get_target_languages()\n",
        "    deepl_target_languages = [lang.name for lang in deepl_target_languages_dict]\n",
        "\n",
        "    deepl_target_language_code = next(lang.code for lang in deepl_target_languages_dict if lang.name == deepl_target_language).upper()\n",
        "    target_language_code = deepl_target_language_code.split('-')[0]\n",
        "    \n",
        "    for audio_path, result in results.items():\n",
        "      deepl_usage = deepl_translator.get_usage()\n",
        "      \n",
        "      if deepl_usage.any_limit_reached:\n",
        "        print(audio_path)\n",
        "        raise deepl.DeepLException(\"Quota for this billing period has been exceeded, message: Quota Exceeded\")\n",
        "      else:\n",
        "        print(audio_path + '\\n')\n",
        "      \n",
        "      # translate results (DeepL)\n",
        "      source_language_code = whisper.tokenizer.TO_LANGUAGE_CODE.get(result['language'].lower()).upper()\n",
        "\n",
        "      if (task == 'translate' and target_language_code != 'EN') or (task == 'transcribe' and source_language_code in deepl_source_languages and source_language_code != target_language_code):\n",
        "        source_lang = source_language_code if task == 'transcribe' else None\n",
        "        translate_from = f\"from {result['language'].title()} [{source_language_code}] \" if source_lang else ''\n",
        "        print(f\"DeepL: Translate results {translate_from}to {deepl_target_language} [{deepl_target_language_code}]\\n\")\n",
        "\n",
        "        segments = result['segments']\n",
        "\n",
        "        translated_results[audio_path] = { 'text': '', 'segments': [], 'language': deepl_target_language }\n",
        "\n",
        "        # segments / request (max 128 KiB / request, so deepl_batch_requests_size is limited to around 1000)\n",
        "        deepl_batch_requests_size = 200 # 200 segments * ~100 bytes / segment = ~20 KB / request  (~15 minutes of speech)\n",
        "        \n",
        "        for batch_segments in [segments[i:i + deepl_batch_requests_size] for i in range(0, len(segments), deepl_batch_requests_size)]:\n",
        "          batch_segments_text = [segment['text'] for segment in batch_segments]\n",
        "\n",
        "          if deepl_coherence_preference:\n",
        "            batch_segments_text = '<br/>'.join(batch_segments_text)\n",
        "\n",
        "          # DeepL request\n",
        "          deepl_results = deepl_translator.translate_text(\n",
        "              text=batch_segments_text,\n",
        "              source_lang=source_lang,\n",
        "              target_lang=deepl_target_language_code,\n",
        "              formality=deepl_formality,\n",
        "              split_sentences='nonewlines',\n",
        "              tag_handling='xml' if deepl_coherence_preference else None,\n",
        "              ignore_tags='br' if deepl_coherence_preference else None, # used to synchronize sentences with whisper lines but without splitting sentences in DeepL\n",
        "              outline_detection=False if deepl_coherence_preference else None\n",
        "          )\n",
        "          \n",
        "          deepl_results_segments = deepl_results.text.split('<br/>') if deepl_coherence_preference else [deepl_result_segment.text for deepl_result_segment in deepl_results]\n",
        "\n",
        "          for j, translated_text in enumerate(deepl_results_segments):\n",
        "            segment = batch_segments[j]\n",
        "\n",
        "            # fix sentence formatting\n",
        "            translated_text = translated_text.lstrip(',.。 ').rstrip()\n",
        "\n",
        "            if not deepl_coherence_preference and translated_text and translated_text[-1] in '.。' and segment['text'][-1] not in '.。':\n",
        "              translated_text = translated_text[:-1]\n",
        "\n",
        "            # add translated segments\n",
        "            translated_results[audio_path]['segments'].append(dict(id=segment['id'], start=segment['start'], end=segment['end'], text=translated_text))\n",
        "\n",
        "            if options['verbose']:\n",
        "              print(f\"[{format_timestamp(segment['start'])} --> {format_timestamp(segment['end'])}] {translated_text}\")\n",
        "        \n",
        "        deepl_usage = deepl_translator.get_usage()\n",
        "        \n",
        "        if deepl_usage.character.valid:\n",
        "          print(f\"\\nDeepL: Character usage: {deepl_usage.character.count} / {deepl_usage.character.limit} ({100*(deepl_usage.character.count/deepl_usage.character.limit):.2f}%)\\n\")\n",
        "      elif source_language_code == target_language_code:\n",
        "        print(f\"Nothing to translate. Results are already in {result['language']}.\")\n",
        "      elif task == 'transcribe' and source_language_code not in deepl_source_languages:\n",
        "        print(f\"DeepL: {result['language']} is not yet supported\")\n",
        "  except deepl.DeepLException as e:\n",
        "    if isinstance(e, deepl.AuthorizationException) and str(e) == \"Authorization failure, check auth_key\":\n",
        "      e = \"Authorization failure, check deepl_api_key\"\n",
        "    print(f\"\\nDeepL: [Error] {e}\\n\")\n",
        "  \n",
        "  # save translated results (if any)\n",
        "\n",
        "  if translated_results:\n",
        "    print(\"Writing translated results...\")\n",
        "\n",
        "    for audio_path, translated_result in translated_results.items():\n",
        "      print(end='\\n')\n",
        "\n",
        "      translated_result['text'] = '\\n'.join(map(lambda translated_segment: translated_segment['text'], translated_result['segments']))\n",
        "      \n",
        "      output_file_name = os.path.splitext(os.path.basename(audio_path))[0]\n",
        "      translated_output_file_name = f\"{output_file_name}_{deepl_target_language}\"\n",
        "\n",
        "      for output_format in output_formats:\n",
        "        write_result(translated_result, output_format, translated_output_file_name)"
      ]
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "include_colab_link": true,
      "provenance": []
    },
    "gpuClass": "standard",
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.9.1"
    },
    "vscode": {
      "interpreter": {
        "hash": "cf5167ffd3d40c187bbdee173a0e169759f2b54f4182487e61a0f59820dcd535"
      }
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}