# Speech-to-Text for large audio files in any language with faster-whisper (Whisper large v3)

- Author: Pierre Guillou
- Date: 04/12/2023
- Blog post: [Speech-to-Text | Quickly get a transcription of a large audio file in any language with "Faster-Whisper"](https://medium.com/@pierre_guillou/speech-to-text-quickly-get-a-transcription-of-a-large-audio-file-in-any-language-with-e4d4d2daf0cd)
- Sources
  - github: https://github.com/guillaumekln/faster-whisper
  - [Whisper large v3](https://huggingface.co/openai/whisper-large-v3)
  - blog: [Making OpenAI Whisper faster](https://github.com/guillaumekln/faster-whisper#faster-whisper-transcription-with-ctranslate2)

In [6]:
# check if there is a GPU
!nvidia-smi

Mon Dec  4 16:01:38 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   31C    P8     8W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               

## About faster-whisper

The faster-whisper project reimplements the OpenAI Whisper model with CTranslate2, a library for efficient inference with Transformer models. CTranslate2 gets its efficiency from techniques such as weight quantization, layer fusion and batch reordering.

According to the faster-whisper README, this implementation is up to 4 times faster than openai/whisper for the same accuracy while using less memory.
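To give an intuition for weight quantization, one of the techniques listed above, here is a minimal sketch of symmetric int8 quantization in plain Python. It is only an illustration under simplifying assumptions (one scale for the whole list, hypothetical helper names `quantize_int8` and `dequantize_int8`); it is not how CTranslate2 implements quantization internally.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale (illustrative only)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float values from the int8 representation."""
    return [q * scale for q in quantized]


# Made-up weights, just to show the round trip
weights = [0.12, -0.98, 0.45, 0.003, 0.77]
quantized, scale = quantize_int8(weights)
restored = dequantize_int8(quantized, scale)

# The rounding error per weight is at most half a quantization step
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized)
print(f"scale={scale:.5f}, max error={max_error:.5f}")
```

Each weight is stored in 8 bits instead of 16 or 32, at the cost of a small rounding error. This is the trade-off behind the `int8` and `int8_float16` compute types used later in this notebook.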

**Method**: we only need to pass the path of the audio file to the library; faster-whisper itself decodes the audio and resamples it to 16 kHz.

## Setup

In [1]:
!pip install -q faster-whisper


In [2]:
# audio library
!pip install -q pydub
import pydub

In [3]:
import pathlib
from pathlib import Path

import pandas as pd

## Path to audio files

### mp3

In [4]:
# path to mp3 audio files
path_to_main = "/content/audio_files/"
path_to_mp3_audio_folder = path_to_main + "mp3_audio_files/"

# path to transcripts
path_to_transcripts_folder = path_to_mp3_audio_folder + "transcripts/"

if not Path(path_to_transcripts_folder).is_dir(): Path(path_to_transcripts_folder).mkdir(parents=True, exist_ok=True)

Upload your mp3 file into the `path_to_mp3_audio_folder` folder.

Here, we use the [audio file](https://github.com/piegu/language-models/blob/master/audio/lesson1_of_RAG_course_with_DeepLearningAI.mp3) of the lesson 1 video of the course [Building and Evaluating Advanced RAG Applications](https://www.deeplearning.ai/short-courses/building-evaluating-advanced-rag/) (DeepLearning.AI).

In [10]:
p = Path(path_to_mp3_audio_folder).glob('**/*')
mp3_audio_files = [x for x in p if x.is_file() and x.suffix.lower() == ".mp3"]  # match on the extension only
len(mp3_audio_files), mp3_audio_files[0]

(1,
 PosixPath('/content/audio_files/mp3_audio_files/introduction_to_RAG_course_with_DeepLearningAI.mp3'))
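The file listing above can be wrapped in a small reusable helper. The sketch below is one possible way to do it; the function name `find_audio_files` and the demo file names are hypothetical. It compares the file extension (`Path.suffix`) rather than searching for a substring in the name, so a file such as `notes.mp3.txt` is not picked up by mistake.

```python
import tempfile
from pathlib import Path


def find_audio_files(folder, suffixes=(".mp3", ".wav")):
    """Return all files under `folder` (recursively) whose extension is in `suffixes`."""
    wanted = {s.lower() for s in suffixes}
    return sorted(
        p for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() in wanted
    )


# Small demo in a temporary folder with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ("talk.mp3", "TALK2.MP3", "notes.mp3.txt"):
        (Path(tmp) / name).touch()
    found = [p.name for p in find_audio_files(tmp, suffixes=(".mp3",))]

print(found)
```

In this notebook, the equivalent call would be `find_audio_files(path_to_mp3_audio_folder, suffixes=(".mp3",))`.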

### wav

If your file is in wav format, the code below converts it to mp3. It is shown only to illustrate the wav-to-mp3 conversion: **(faster) Whisper does not need the mp3 format**.

It works very well (and faster!) with wav files.

In [None]:
# path to wav audio files
path_to_wav_audio_folder = path_to_main + "wav_audio_files/"
if not Path(path_to_wav_audio_folder).is_dir(): Path(path_to_wav_audio_folder).mkdir(parents=True, exist_ok=True)

Upload your wav file into the `path_to_wav_audio_folder` folder.

In [None]:
p = Path(path_to_wav_audio_folder).glob('**/*')
wav_audio_files = [x for x in p if x.is_file() and x.suffix.lower() == ".wav"]  # match on the extension only
print(len(wav_audio_files), wav_audio_files[0])

path_to_audio_file_wav = wav_audio_files[0]

#### (optional) Analysis and playback of the wav audio file

Because `Audio()` from `IPython.display` did not play the wav file in this Jupyter notebook (apparently a bug), we use `pydub` to load and play it instead.

In [None]:
# Analysis of the wav audio file

import wave

# wave.open expects a file name (str) or a file object, so convert the Path first
with wave.open(str(path_to_audio_file_wav), "rb") as obj:
    print("Number of channels", obj.getnchannels())
    print("Sample width", obj.getsampwidth())
    print("Frame rate.", obj.getframerate())
    print("Number of frames", obj.getnframes())
    print("parameters:", obj.getparams())

Number of channels 2
Sample width 2
Frame rate. 44100
Number of frames 12071052
parameters: _wave_params(nchannels=2, sampwidth=2, framerate=44100, nframes=12071052, comptype='NONE', compname='not compressed')
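The values printed above are enough to work out the duration of the recording and how many samples remain once the audio is resampled. The sketch below does the arithmetic, assuming a mono 16 kHz target (the input format Whisper models expect); the helper name `wav_duration_seconds` is hypothetical.

```python
import wave


def wav_duration_seconds(path):
    """Duration of a wav file in seconds: number of frames / frame rate."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


# Same computation with the values reported in the output above
n_frames, frame_rate = 12071052, 44100
duration = n_frames / frame_rate

# Number of samples after resampling to 16 kHz mono (assumed target format)
samples_at_16k = round(duration * 16000)

print(f"{duration:.2f} s, {samples_at_16k} samples at 16 kHz")
```

So this file lasts 273.72 seconds, a little over four and a half minutes.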


In [None]:
# Display and read wav audio file

sound = pydub.AudioSegment.from_wav(path_to_audio_file_wav)
sound = sound.set_frame_rate(16000) # allow a faster display

sound

#### Conversion wav to mp3

In [None]:
# create mp3 folder if does not exist
path_to_main = "/content/audio_files/"
path_to_mp3_audio_folder = path_to_main + "mp3_audio_files/"
if not Path(path_to_mp3_audio_folder).is_dir(): Path(path_to_mp3_audio_folder).mkdir(parents=True, exist_ok=True)

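# path to mp3 audio file (same file name as the wav file, with the .mp3 extension)
# note: Path.replace() renames a file; use with_suffix() to change the extension
path_to_audio_file_mp3 = path_to_mp3_audio_folder + path_to_audio_file_wav.with_suffix(".mp3").name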

# conversion to mp3
sound.export(path_to_audio_file_mp3, format="mp3")
# print(f"frame rate (mp3 audio file): {sound.frame_rate}")
sound

**Note**: the usual way to play an audio file in a notebook is the following code:

```
from IPython.display import Audio
Audio(path_to_audio_file_mp3)
```



## Model (faster) Whisper

In [7]:
import torch
from faster_whisper import WhisperModel

# model_size = "large-v2"
model_size = "large-v3"

# get device
device = "cuda:0" if torch.cuda.is_available() else "cpu"

if device == "cuda:0":
    # Run on GPU with FP16
    model = WhisperModel(model_size, device="cuda", compute_type="float16")
    # or Run on GPU with INT8
    # model = WhisperModel(model_size, device="cuda", compute_type="int8_float16")
else:
    # Run on CPU with INT8
    model = WhisperModel(model_size, device="cpu", compute_type="int8")


In [8]:
# check GPU (cuda) or CPU
device

'cuda:0'
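The device and compute type selection shown in the model cell can be isolated in a small function, which makes the three combinations used in this notebook explicit. This is only a sketch: the function name `select_device_and_compute_type` and the flag `prefer_int8_on_gpu` are hypothetical, and the returned values are exactly the ones used in the cell above.

```python
def select_device_and_compute_type(cuda_available, prefer_int8_on_gpu=False):
    """Return the (device, compute_type) pair to pass to WhisperModel."""
    if cuda_available:
        # FP16 on GPU by default, or INT8 weights with FP16 activations
        return "cuda", ("int8_float16" if prefer_int8_on_gpu else "float16")
    # INT8 on CPU
    return "cpu", "int8"


print(select_device_and_compute_type(True))
print(select_device_and_compute_type(True, prefer_int8_on_gpu=True))
print(select_device_and_compute_type(False))

# Possible usage in this notebook (not executed here):
# device, compute_type = select_device_and_compute_type(torch.cuda.is_available())
# model = WhisperModel(model_size, device=device, compute_type=compute_type)
```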

## Transcript and audio segments

[About VAD filter](https://github.com/guillaumekln/faster-whisper#vad-filter): The library integrates the Silero VAD model to filter out parts of the audio without speech (*vad_filter=True*).

```
segments, _ = model.transcribe("audio.wav", vad_filter=True)
```

In [11]:
%%time

for i, path_to_audio_file in enumerate(mp3_audio_files[:1]):

    # get all audio segments
    # `segments` is a generator: the transcription actually runs when we iterate over it below
    segments, info = model.transcribe(str(path_to_audio_file), beam_size=5, vad_filter=True)

    # get audio language
    # print("Detected language '%s' with probability %f with faster whisper\n" % (info.language, info.language_probability))

    # save audio segments with start and end time, and transcript by audio segment
    start_segments, end_segments, text_segments = list(), list(), list()
    for segment in segments:
        # print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
        start, end, text = segment.start, segment.end, segment.text
        start_segments.append(start)
        end_segments.append(end)
        text_segments.append(text)

    # save transcript into csv
    df = pd.DataFrame()
    df["start"] = start_segments
    df["end"] = end_segments
    df["text"] = text_segments
    path_to_audio_file_transcript = path_to_transcripts_folder + path_to_audio_file.name.replace(".mp3", ".csv").replace(".wav", ".csv")
    df.to_csv(path_to_audio_file_transcript, encoding='utf-8', index=False)

    if i % 2 == 0: print(i)



0
CPU times: user 34.4 s, sys: 1.2 s, total: 35.6 s
Wall time: 37.8 s
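One point worth keeping in mind with the loop above: `model.transcribe` returns the segments as a generator, so the transcription runs while we iterate and the segments can only be read once. The sketch below illustrates this with a stand-in generator; `fake_segments` and `segments_to_rows` are hypothetical names, and the stand-in objects only mimic the `start`, `end` and `text` attributes used in this notebook.

```python
from types import SimpleNamespace


def segments_to_rows(segments):
    """Consume an iterable of segments and return a list of plain dicts."""
    return [
        {"start": s.start, "end": s.end, "text": s.text.strip()}
        for s in segments
    ]


def fake_segments():
    # Hypothetical stand-in for the generator returned by model.transcribe
    yield SimpleNamespace(start=1.97, end=6.75, text=" Retrieval Augmented Generation")
    yield SimpleNamespace(start=6.75, end=13.25, text=" answered questions")


generator = fake_segments()
rows = segments_to_rows(generator)

# The generator is now exhausted: iterating a second time yields nothing
leftover = list(generator)

print(rows)
print(leftover)
```

The list of dicts can be passed directly to `pd.DataFrame(rows)`. Note that this sketch strips the surrounding whitespace of each segment text, which the loop above does not do.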


In [12]:
df

|   | start | end | text |
|---|-------|-----|------|
| 0 | 1.97 | 6.75 | Retrieval Augmented Generation, or RAG, has b... |
| 1 | 6.75 | 13.25 | answered questions over a user's own data. Bu... |
| 2 | 13.25 | 18.91 | RAG system, it costs a lot to have effective ... |
| 3 | 18.91 | 24.05 | relevant context to generate his answer, and ... |
| 4 | 24.05 | 30.21 | to help you efficiently iterate and improve y... |
| 5 | 30.21 | 35.47 | during post-deployment maintenance. This cour... |
| 6 | 35.99 | 41.03 | sentence window retrieval and auto-merging re... |
| 7 | 41.03 | 47.45 | context to the LM than simpler methods. It al... |
| 8 | 47.45 | 53.21 | system with three evaluation metrics, context... |
| 9 | 53.73 | 59.65 | I'm excited to introduce Jerry Liu, co-founde... |
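The `start` and `end` columns are expressed in seconds. For longer recordings, a clock-style timestamp is often easier to read. Below is a minimal sketch of such a conversion; the function name `format_timestamp` and the `HH:MM:SS.mmm` layout are choices made for this example.

```python
def format_timestamp(seconds):
    """Convert a time in seconds to an HH:MM:SS.mmm string."""
    total_ms = round(seconds * 1000)
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


# First and last segment boundaries from the table above
print(format_timestamp(1.97))
print(format_timestamp(59.65))
```

With the DataFrame of this notebook, it could be applied as `df["start"].map(format_timestamp)`.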


## Display transcript

In [None]:
import nltk

# Download the Punkt tokenizer
nltk.download('punkt')

In [43]:
paragraph = ' '.join(df["text"].tolist()).replace("  ", " ")

# Tokenize the paragraph into sentences
sentences = nltk.sent_tokenize(paragraph)
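A single `.replace("  ", " ")` pass only collapses pairs of spaces, so a run of three or more spaces still leaves a double space behind. If that matters for your transcripts, one possible alternative is to split on any whitespace and join again, as in the sketch below (the helper name `join_segments` is hypothetical).

```python
def join_segments(texts):
    """Join segment texts into one paragraph with single spaces between words."""
    return " ".join(" ".join(texts).split())


# Made-up segment texts with irregular spacing
example = [" Hello   world.", " Next  sentence. "]
paragraph_example = join_segments(example)
print(repr(paragraph_example))
```

The result can then be passed to `nltk.sent_tokenize` exactly as in the cell above.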

In [42]:
css = '''
        <style>
            .highlighted {
                color: black;
                font-weight: bold;
                font-size: 120%
            }
        </style>
      '''

from IPython.display import display, HTML
for sentence in sentences:
    display(HTML(f'{css} <p class="highlighted">{sentence}</p>'))
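Transcribed text can contain characters such as `&` or `<` that have a special meaning in HTML. The sketch below shows one way to escape each sentence with the standard `html` module and to build a single HTML string, so that the stylesheet only needs to be emitted once. The function name `sentences_to_html` and the sample sentences are hypothetical.

```python
import html


def sentences_to_html(sentences, css_class="highlighted"):
    """Wrap each sentence in a <p> tag, escaping HTML special characters."""
    paragraphs = [
        f'<p class="{css_class}">{html.escape(sentence)}</p>'
        for sentence in sentences
    ]
    return "\n".join(paragraphs)


# Made-up sentences containing characters that must be escaped
page = sentences_to_html(["Precision & recall matter.", "Use a <b>window</b>?"])
print(page)

# Possible usage in this notebook (not executed here):
# display(HTML(css + sentences_to_html(sentences)))
```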

# END