{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Load media from S3 and other cloud storage\n", "\n", "Import images, videos, and audio files from S3, GCS, HTTP URLs, or local paths into Pixeltable tables." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem\n", "\n", "You have media files stored in cloud storage (S3, GCS) or accessible via HTTP URLs. You need to process these files with AI models without downloading them all upfront.\n", "\n", "| Source | Files | Challenge |\n", "|--------|-------|-----------|\n", "| s3://my-bucket/images/ | 100K images | Too large to download |\n", "| https://cdn.example.com/ | 10K videos | Need lazy loading |\n", "| gs://my-bucket/audio/ | 50K audio files | Process incrementally |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Solution\n", "\n", "**What's in this recipe:**\n", "\n", "- Reference media files by URL (S3, GCS, HTTP, or local paths)\n", "- Automatic caching of remote files on access\n", "- Process files lazily without bulk downloads\n", "\n", "You insert media URLs as references. Pixeltable stores the URLs, then downloads and caches each file automatically the first time you access it through a query or computed column." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install -qU pixeltable boto3" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pixeltable as pxt" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata\n", "Created directory 'cloud_demo'.\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create a fresh directory\n", "pxt.drop_dir('cloud_demo', force=True)\n", "pxt.create_dir('cloud_demo')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load images from HTTP URLs\n", "\n", "Reference images by URL—Pixeltable downloads them on demand:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Created table 'images'.\n" ] } ], "source": [ "# Create a table with image column\n", "images = pxt.create_table('cloud_demo/images', {'image': pxt.Image})" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Inserting rows into `images`: 3 rows [00:00, 767.91 rows/s]\n", "Inserted 3 rows with 0 errors.\n" ] }, { "data": { "text/plain": [ "3 rows inserted, 6 values computed." 
] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Insert images by URL (HTTP)\n", "image_urls = [\n", " 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000036.jpg',\n", " 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000090.jpg',\n", " 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000106.jpg',\n", "]\n", "\n", "images.insert([{'image': url} for url in image_urls])" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
image
\n", " \n", "
\n", " \n", "
\n", " \n", "
" ], "text/plain": [ " image\n", "0 \n", " \n", " \n", " video\n", " \n", " \n", " \n", " \n", "
\n", " \n", "
\n", " \n", " \n", "
\n", " \n", "
\n", " \n", " \n", "" ], "text/plain": [ " video\n", "0 /Users/pjlb/.pixeltable/file_cache/4111331dab9...\n", "1 /Users/pjlb/.pixeltable/file_cache/4111331dab9..." ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# View videos - downloaded and cached on access\n", "videos.collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Add computed columns on remote media\n", "\n", "Process remote media with computed columns—files are fetched automatically:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Added 3 column values with 0 errors.\n", "Added 3 column values with 0 errors.\n" ] }, { "data": { "text/plain": [ "3 rows updated, 6 values computed." ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Add computed columns for image properties\n", "images.add_computed_column(width=images.image.width)\n", "images.add_computed_column(height=images.image.height)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
imagewidthheight
\n", " \n", "
481640
\n", " \n", "
640429
\n", " \n", "
640426
" ], "text/plain": [ " image width height\n", "0 \n", " \n", " \n", " video\n", " original_uri\n", " http_url\n", " \n", " \n", " \n", " \n", "
\n", " \n", "
\n", " s3://multimedia-commons/data/videos/mp4/ffe/ffb/ffeffbef41bbc269810b2a1a888de.mp4\n", " https://s3.us-west-2.amazonaws.com/multimedia-commons/data/videos/mp4/ffe/ffb/ffeffbef41bbc269810b2a1a888de.mp4\n", " \n", " \n", "
\n", " \n", "
\n", " s3://multimedia-commons/data/videos/mp4/ffe/feb/ffefebb41485539f964760e6115fbc44.mp4\n", " https://s3.us-west-2.amazonaws.com/multimedia-commons/data/videos/mp4/ffe/feb/ffefebb41485539f964760e6115fbc44.mp4\n", " \n", " \n", "" ], "text/plain": [ " video \\\n", "0 /Users/pjlb/.pixeltable/file_cache/4111331dab9... \n", "1 /Users/pjlb/.pixeltable/file_cache/4111331dab9... \n", "\n", " original_uri \\\n", "0 s3://multimedia-commons/data/videos/mp4/ffe/ff... \n", "1 s3://multimedia-commons/data/videos/mp4/ffe/fe... \n", "\n", " http_url \n", "0 https://s3.us-west-2.amazonaws.com/multimedia-... \n", "1 https://s3.us-west-2.amazonaws.com/multimedia-... " ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pixeltable.functions as pxtf\n", "\n", "# Generate presigned URLs for videos (1-hour expiration)\n", "videos.select(\n", " videos.video,\n", " original_uri=videos.video.fileurl,\n", " http_url=pxtf.net.presigned_url(videos.video.fileurl, 3600),\n", ").collect()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Added 2 column values with 0 errors.\n" ] }, { "data": { "text/plain": [ "2 rows updated, 4 values computed." 
] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Store presigned URLs as a computed column for API responses\n", "videos.add_computed_column(\n", "    serving_url=pxtf.net.presigned_url(\n", "        videos.video.fileurl, 86400\n", "    )  # 24-hour expiration\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Use cases for presigned URLs:**\n", "\n", "- Serve private media in web applications without exposing credentials\n", "- Generate download links for end users\n", "- Integrate with CDNs or video players that require HTTP URLs\n", "\n", "**Provider limitations:**\n", "\n", "| Provider | Max expiration |\n", "|----------|----------------|\n", "| AWS S3 | 7 days |\n", "| Google Cloud Storage | 7 days |\n", "| Azure Blob Storage | Depends on account policy |\n", "\n", "Note: HTTP/HTTPS URLs pass through unchanged because they are already publicly accessible." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Supported URL formats\n", "\n", "Pixeltable supports multiple URL schemes for media files:\n", "\n", "| Scheme | Example | Credentials |\n", "|--------|---------|-------------|\n", "| `http://` | `http://example.com/image.jpg` | None |\n", "| `https://` | `https://cdn.example.com/video.mp4` | None |\n", "| `s3://` | `s3://bucket/path/file.jpg` | AWS credentials\\* |\n", "| `gs://` | `gs://bucket/path/file.mp4` | GCP credentials\\* |\n", "| `file://` | `file:///path/to/file.jpg` | None |\n", "| (local path) | `/path/to/file.jpg` | None |\n", "\n", "\\*Configure AWS/GCP credentials via environment variables or config files." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Explanation\n", "\n", "**How caching works:**\n", "\n", "1. URLs are stored as references in the table\n", "1. Files are downloaded on first access (query or computed column)\n", "1. Downloaded files are cached in `~/.pixeltable/file_cache/`\n", "1. 
Cache uses LRU eviction when space is needed\n", "\n", "**Benefits of URL-based storage:**\n", "\n", "- **Lazy loading** - Only download files when needed\n", "- **Deduplication** - Same URL is cached once\n", "- **Incremental processing** - Add files without bulk downloads\n", "- **Cloud-native** - Works directly with object storage\n", "\n", "**For private S3 buckets:**\n", "\n", "Configure AWS credentials using standard methods:\n", "\n", "- Environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`)\n", "- AWS credentials file (`~/.aws/credentials`)\n", "- IAM roles (when running on EC2/ECS)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## See also\n", "\n", "- [Upload to S3](https://docs.pixeltable.com/howto/cookbooks/data/data-export-s3) - Store generated media in S3/GCS\n", "- [Import from CSV](https://docs.pixeltable.com/howto/cookbooks/data/data-import-csv) - Load structured data\n", "- [Extract frames from videos](https://docs.pixeltable.com/howto/cookbooks/video/video-extract-frames) - Process video files\n", "- [Analyze images in batch](https://docs.pixeltable.com/howto/cookbooks/images/vision-batch-analysis) - AI vision on images\n", "- [Configure API keys](https://docs.pixeltable.com/howto/cookbooks/core/workflow-api-keys) - Set up credentials" ] } ], "metadata": { "kernelspec": { "display_name": "pixeltable", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.11" } }, "nbformat": 4, "nbformat_minor": 2 }