{ "cells": [ { "cell_type": "raw", "metadata": { "tags": [ "skip-execution" ] }, "source": [ "---\n", "sidebar_label: 'Video Processing'\n", "sidebar_position: 6\n", "description: \"Parallel Video Resizing via File Sharding\"\n", "---" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "tags": [ "skip-execution" ] }, "source": [ "# Video Processing\n", "\n", "\n", "[![stars - badge-generator](https://img.shields.io/github/stars/bacalhau-project/bacalhau?style=social)](https://github.com/bacalhau-project/bacalhau)\n", "\n", "Many data engineering workloads consist of embarrassingly parallel workloads where you want to run a simple execution on a large number of files. In this example tutorial, we will run a simple video filter on a large number of video files.\n", "\n", "## TD;LR\n", "Running video files with Bacalhau" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "tags": [ "skip-execution" ] }, "source": [ "## Prerequisite\n", "\n", "To get started, you need to install the Bacalhau client, see more information [here](https://docs.bacalhau.org/getting-started/installation)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "tags": [ "skip-execution" ] }, "source": [ "## Submit the workload\n", "\n", "To submit a workload to Bacalhau, we will use the `bacalhau docker run` command. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "skip-execution" ] }, "outputs": [], "source": [ "%%bash --out job_id\n", "bacalhau docker run \\\n", " --wait \\\n", " --wait-timeout-secs 100 \\\n", " --id-only \\\n", " -i ipfs://Qmd9CBYpdgCLuCKRtKRRggu24H72ZUrGax5A9EYvrbC72j:/inputs \\\n", " linuxserver/ffmpeg -- \\\n", " bash -c 'find /inputs -iname \"*.mp4\" -printf \"%f\\n\" | xargs -I{} ffmpeg -y -i /inputs/{} -vf \"scale=-1:72,setsar=1:1\" /outputs/scaled_{}'\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The job has been submitted and Bacalhau has printed out the related job id. We store that in an environment variable so that we can reuse it later on." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "skip-execution", "remove_cell" ] }, "outputs": [], "source": [ "%env JOB_ID={job_id}" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The `bacalhau docker run` command allows one to pass input data volume with a `-i ipfs://CID:path` argument just like Docker, except the left-hand side of the argument is a [content identifier (CID)](https://github.com/multiformats/cid). This results in Bacalhau mounting a *data volume* inside the container. By default, Bacalhau mounts the input volume at the path `/inputs` inside the container.\n", "\n", "We created a 72px wide video thumbnails for all the videos in the `inputs` directory. The `outputs` directory will contain the thumbnails for each video. We will shard by 1 video per job, and use the `linuxserver/ffmpeg` container to resize the videos." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ ":::tip\n", "[Bacalhau overwrites the default entrypoint](https://github.com/filecoin-project/bacalhau/blob/v0.2.3/cmd/bacalhau/docker_run.go#L64) so we must run the full command after the `--` argument. 
{ "attachments": {}, "cell_type": "markdown", "metadata": { "tags": [ "skip-execution" ] }, "source": [ "## Checking the State of your Jobs\n", "\n", "- **Job status**: You can check the status of the job using `bacalhau list`." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "bacalhau list --id-filter=${JOB_ID} --no-style" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "When the status says `Published` or `Completed`, the job is done and we can fetch the results." ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "- **Job information**: You can find out more information about your job by using `bacalhau describe`." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "bacalhau describe ${JOB_ID}" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "- **Job download**: You can download your job results directly by using `bacalhau get`. Alternatively, you can choose to create a directory to store your results. In the command below, we create a directory and download our job output into it." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "skip-execution", "remove_output" ] }, "outputs": [], "source": [ "%%bash\n", "mkdir -p ./results # Temporary directory to store the results\n", "bacalhau get --output-dir ./results ${JOB_ID} # Download the results" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "After the download has finished, you should see the resized videos in the `results/outputs` directory." ] },
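{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "For example, you can list the downloaded output like this (the exact filenames depend on the videos stored under the input CID):" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "skip-execution" ] }, "outputs": [], "source": [ "%%bash\n", "# List the downloaded job output; filenames depend on the input videos\n", "ls -lR ./results/outputs" ] },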
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "skip-execution", "remove_cell" ] }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import glob\n", "from IPython.display import Video, display\n", "for file in glob.glob('*.mp4'):\n", " display(Video(filename=file))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "tags": [ "skip-execution" ] }, "source": [ "\n", "\n", "\n", "" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Need Support?\n", "\n", "For questions, feedback, please reach out in our [forum](https://github.com/filecoin-project/bacalhau/discussions)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "9ac03a0a6051494cc606d484d27d20fce22fb7b4d169f583271e11d5ba46a56e" } } }, "nbformat": 4, "nbformat_minor": 2 }