{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Transcribe audio files with Whisper\n", "\n", "Convert speech to text locally using OpenAI's open-source Whisper model—no API key needed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem\n", "\n", "You have audio or video files that need transcription. Long files are memory-intensive to process at once, so you need to split them into manageable segments.\n", "\n", "| File | Duration | Challenge |\n", "|------|----------|-----------|\n", "| podcast.mp3 | 60 min | Too long to process at once |\n", "| interview.mp4 | 30 min | Need to extract audio first |\n", "| meeting.wav | 2 hours | Must segment for memory efficiency |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Solution\n", "\n", "**What's in this recipe:**\n", "\n", "- Transcribe audio files locally with Whisper (no API key)\n", "- Automatically segment long files\n", "- Extract and transcribe audio from videos\n", "\n", "You create a view with `audio_splitter` to break long files into segments, then add a computed column for transcription. Whisper runs locally on your machine—no API calls needed." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install -qU pixeltable openai-whisper" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load audio files" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata\n", "Converting metadata from version 45 to 46\n", "Created directory 'audio_demo'.\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pixeltable as pxt\n", "from pixeltable.functions import whisper\n", "from pixeltable.functions.audio import audio_splitter\n", "\n", "# Create a fresh directory\n", "pxt.drop_dir('audio_demo', force=True)\n", "pxt.create_dir('audio_demo')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Created table 'files'.\n" ] } ], "source": [ "# Create table for audio files\n", "audio = pxt.create_table('audio_demo/files', {'audio': pxt.Audio})" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Inserted 1 row with 0 errors in 1.05 s (0.95 rows/s)\n" ] }, { "data": { "text/plain": [ "1 row inserted." 
] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Insert a sample audio file (video files also work - audio is extracted automatically)\n", "audio.insert(\n", " [\n", " {\n", " 'audio': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/audio-transcription-demo/Lex-Fridman-Podcast-430-Excerpt-0.mp4'\n", " }\n", " ]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Split into segments\n", "\n", "Create a view that splits audio into 30-second segments with overlap:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Split audio into segments for transcription\n", "segments = pxt.create_view(\n", " 'audio_demo/segments',\n", " audio,\n", " iterator=audio_splitter(\n", " audio.audio,\n", " duration=30.0, # 30-second segments\n", " overlap=2.0, # 2-second overlap for context\n", " min_segment_duration=5.0, # Drop segments shorter than 5 seconds\n", " ),\n", ")" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
segment_start segment_end
0.0000 30.0002
28.0033 58.0034
" ], "text/plain": [ " segment_start segment_end\n", "0 0.0000 30.0002\n", "1 28.0033 58.0034" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# View the segments\n", "segments.select(segments.segment_start, segments.segment_end).collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Transcribe with Whisper\n", "\n", "Add a computed column that transcribes each segment:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Added 2 column values with 0 errors in 3.35 s (0.60 rows/s)\n" ] }, { "data": { "text/plain": [ "2 rows updated." ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Add transcription column (runs locally - no API key needed)\n", "segments.add_computed_column(\n", " transcription=whisper.transcribe(\n", " audio=segments.audio_segment,\n", " model='base.en', # Options: tiny.en, base.en, small.en, medium.en, large\n", " )\n", ")" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Added 2 column values with 0 errors in 0.06 s (31.82 rows/s)\n" ] }, { "data": { "text/plain": [ "2 rows updated." ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Extract just the text\n", "segments.add_computed_column(text=segments.transcription.text)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
segment_start segment_end text
0.0000 30.0002 of experiencing self versus remembering self. I was hoping you can give a simple answer of how we should live life. Based on the fact that our memories could be a source of happiness or could be the primary source of happiness, that an event when experienced bears its fruits the most when it's remembered over and over and over and over.
28.0033 58.0034 over and over and over and over and maybe there is some wisdom in the fact that we can control to some degree how we remember how we evolve our memory of it such that it can maximize the long-term happiness of that repeated experience. Okay, well first I'll say I wish I could take you on the road with me. That was such a great description. Can I be your opening ax? Oh my God, no, I'm going to open for you dude. Otherwise it's like, you know, everybody leaves.
" ], "text/plain": [ " segment_start segment_end \\\n", "0 0.0000 30.0002 \n", "1 28.0033 58.0034 \n", "\n", " text \n", "0 of experiencing self versus remembering self.... \n", "1 over and over and over and over and maybe the... " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# View transcriptions with timestamps\n", "segments.select(\n", " segments.segment_start, segments.segment_end, segments.text\n", ").collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Explanation\n", "\n", "**Whisper models:**\n", "\n", "| Model | Speed | Quality | Best for |\n", "|-------|-------|---------|----------|\n", "| `tiny.en` | Fastest | Basic | Quick tests |\n", "| `base.en` | Fast | Good | General use |\n", "| `small.en` | Medium | Better | Higher accuracy |\n", "| `medium.en` | Slow | Great | Professional quality |\n", "| `large` | Slowest | Best | Maximum accuracy |\n", "\n", "Models ending in `.en` are English-only and faster. Remove `.en` for multilingual support.\n", "\n", "**audio_splitter parameters:**\n", "\n", "| Parameter | Description |\n", "|-----------|-------------|\n", "| `duration` | Duration of each segment in seconds |\n", "| `overlap` | Overlap between segments (helps with word boundaries) |\n", "| `min_segment_duration` | Drop the last segment if shorter than this |\n", "\n", "**Video files work too:**\n", "\n", "When you insert a video file, Pixeltable automatically extracts the audio track." 
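, "\n", "\n", "As a rough sketch (illustrating the parameter semantics above, not the library's actual implementation), each segment starts `duration - overlap` seconds after the previous one, and a trailing segment shorter than `min_segment_duration` is dropped:\n", "\n", "```python\n", "def segment_bounds(total, duration, overlap, min_segment_duration):\n", "    \"\"\"Illustrative segment boundaries for `total` seconds of audio.\"\"\"\n", "    bounds, start = [], 0.0\n", "    step = duration - overlap  # each segment starts this far after the previous\n", "    while start < total:\n", "        end = min(start + duration, total)\n", "        if end - start >= min_segment_duration:  # drop too-short tails\n", "            bounds.append((start, end))\n", "        start += step\n", "    return bounds\n", "\n", "# Matches the two ~30 s segments seen above for a ~58 s clip:\n", "segment_bounds(58.0, duration=30.0, overlap=2.0, min_segment_duration=5.0)\n", "# -> [(0.0, 30.0), (28.0, 58.0)]\n", "```"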
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## See also\n", "\n", "- [Iterators documentation](https://docs.pixeltable.com/platform/iterators)\n", "- [Whisper library](https://github.com/openai/whisper)" ] } ], "metadata": { "kernelspec": { "display_name": "pxt", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.19" } }, "nbformat": 4, "nbformat_minor": 2 }