{
"cells": [
{
"cell_type": "raw",
"metadata": {
"tags": [
"skip-execution"
]
},
"source": [
"---\n",
"sidebar_label: 'Video Processing'\n",
"sidebar_position: 6\n",
"description: \"Parallel Video Resizing via File Sharding\"\n",
"---"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"tags": [
"skip-execution"
]
},
"source": [
"# Video Processing\n",
"\n",
"\n",
"[](https://github.com/bacalhau-project/bacalhau)\n",
"\n",
"Many data engineering workloads consist of embarrassingly parallel workloads where you want to run a simple execution on a large number of files. In this example tutorial, we will run a simple video filter on a large number of video files.\n",
"\n",
"## TD;LR\n",
"Running video files with Bacalhau"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"tags": [
"skip-execution"
]
},
"source": [
"## Prerequisite\n",
"\n",
"To get started, you need to install the Bacalhau client, see more information [here](https://docs.bacalhau.org/getting-started/installation)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"tags": [
"skip-execution"
]
},
"source": [
"## Submit the workload\n",
"\n",
"To submit a workload to Bacalhau, we will use the `bacalhau docker run` command. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip-execution"
]
},
"outputs": [],
"source": [
"%%bash --out job_id\n",
"bacalhau docker run \\\n",
" --wait \\\n",
" --wait-timeout-secs 100 \\\n",
" --id-only \\\n",
" -i ipfs://Qmd9CBYpdgCLuCKRtKRRggu24H72ZUrGax5A9EYvrbC72j:/inputs \\\n",
" linuxserver/ffmpeg -- \\\n",
" bash -c 'find /inputs -iname \"*.mp4\" -printf \"%f\\n\" | xargs -I{} ffmpeg -y -i /inputs/{} -vf \"scale=-1:72,setsar=1:1\" /outputs/scaled_{}'\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The job has been submitted and Bacalhau has printed out the related job id. We store that in an environment variable so that we can reuse it later on."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip-execution",
"remove_cell"
]
},
"outputs": [],
"source": [
"%env JOB_ID={job_id}"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The `bacalhau docker run` command allows one to pass input data volume with a `-i ipfs://CID:path` argument just like Docker, except the left-hand side of the argument is a [content identifier (CID)](https://github.com/multiformats/cid). This results in Bacalhau mounting a *data volume* inside the container. By default, Bacalhau mounts the input volume at the path `/inputs` inside the container.\n",
"\n",
"We created a 72px wide video thumbnails for all the videos in the `inputs` directory. The `outputs` directory will contain the thumbnails for each video. We will shard by 1 video per job, and use the `linuxserver/ffmpeg` container to resize the videos."
]
},
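{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The `scale=-1:72` filter in the command fixes the output height at 72 pixels and lets `ffmpeg` derive the width from the input aspect ratio. The cell below is a minimal sketch of that arithmetic. `scaled_width` is a hypothetical helper written for this tutorial, not part of `ffmpeg` or Bacalhau, and the rounding `ffmpeg` applies internally may differ by a pixel."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip-execution"
]
},
"outputs": [],
"source": [
"def scaled_width(in_width, in_height, target_height=72):\n",
"    # Keep the aspect ratio: out_width / target_height == in_width / in_height\n",
"    return round(in_width * target_height / in_height)\n",
"\n",
"# Hypothetical input resolutions, chosen only for illustration\n",
"for width, height in [(1920, 1080), (1280, 720), (640, 480)]:\n",
"    print(f'{width}x{height} -> {scaled_width(width, height)}x72')"
]
},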
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
":::tip\n",
"[Bacalhau overwrites the default entrypoint](https://github.com/filecoin-project/bacalhau/blob/v0.2.3/cmd/bacalhau/docker_run.go#L64) so we must run the full command after the `--` argument. In this line you will list all of the mp4 files in the `/inputs` directory and execute `ffmpeg` against each instance.\n",
":::\n"
]
},
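{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the one-liner after `--` easier to follow, the cell below is a rough Python sketch of the same logic: it finds every `.mp4` file (case-insensitively, like `-iname`) and builds one `ffmpeg` argument list per file. `build_commands` is a hypothetical helper for illustration only. The job itself runs the shell pipeline, and this sketch only prints the commands instead of executing them."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip-execution"
]
},
"outputs": [],
"source": [
"import tempfile\n",
"from pathlib import Path\n",
"\n",
"def build_commands(input_dir, output_dir):\n",
"    # Like find -iname: match the .mp4 extension case-insensitively, keep base names only\n",
"    names = sorted(\n",
"        p.name for p in Path(input_dir).rglob('*')\n",
"        if p.is_file() and p.suffix.lower() == '.mp4'\n",
"    )\n",
"    # Like xargs -I{}: one ffmpeg invocation per file name\n",
"    return [\n",
"        ['ffmpeg', '-y', '-i', f'{input_dir}/{name}',\n",
"         '-vf', 'scale=-1:72,setsar=1:1', f'{output_dir}/scaled_{name}']\n",
"        for name in names\n",
"    ]\n",
"\n",
"# Demo on a throwaway directory with made-up file names, so no real dataset is needed\n",
"with tempfile.TemporaryDirectory() as tmp:\n",
"    for name in ['a.mp4', 'B.MP4', 'notes.txt']:\n",
"        (Path(tmp) / name).touch()\n",
"    for cmd in build_commands(tmp, '/outputs'):\n",
"        print(' '.join(cmd))"
]
},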
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"tags": [
"skip-execution"
]
},
"source": [
"## Checking the State of your Jobs\n",
"\n",
"- **Job status**: You can check the status of the job using `bacalhau list`. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"bacalhau list --id-filter=${JOB_ID} --no-style"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"When it says `Published` or `Completed`, that means the job is done, and we can get the results."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"- **Job information**: You can find out more information about your job by using `bacalhau describe`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"bacalhau describe ${JOB_ID}"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"- **Job download**: You can download your job results directly by using `bacalhau get`. Alternatively, you can choose to create a directory to store your results. In the command below, we created a directory and downloaded our job output to be stored in that directory."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip-execution",
"remove_output"
]
},
"outputs": [],
"source": [
"%%bash\n",
"mkdir -p ./results # Temporary directory to store the results\n",
"bacalhau get --output-dir ./results ${JOB_ID} # Download the results"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"After the download has finished you should see the following contents in results directory."
]
},
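{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check, the downloaded files can be listed from Python. This is a minimal sketch that assumes the `./results` directory used in the download command above. The exact files inside it depend on the Bacalhau version and on the job."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip-execution"
]
},
"outputs": [],
"source": [
"from pathlib import Path\n",
"\n",
"results = Path('results')  # assumed download directory\n",
"if results.is_dir():\n",
"    # Print every downloaded file together with its size in bytes\n",
"    for path in sorted(results.rglob('*')):\n",
"        if path.is_file():\n",
"            print(f'{path} ({path.stat().st_size} bytes)')\n",
"else:\n",
"    print('No results directory found; run the download step first')"
]
},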
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Viewing your Job Output\n",
"\n",
"To view the file, run the following command:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip-execution",
"remove_cell"
]
},
"outputs": [],
"source": [
"%%bash\n",
"# Copy the files to the local directory, to allow the documentation scripts to copy them to the right place\n",
"cp results/outputs/* ./ && rm -rf results/outputs/*\n",
"# Remove any spaces from the filenames\n",
"for f in *\\ *; do mv \"$f\" \"${f// /_}\"; done"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Display the videos\n",
"\n",
"To view the videos, we will use **glob** to return all file paths that match a specific pattern. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip-execution",
"remove_cell"
]
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import glob\n",
"from IPython.display import Video, display\n",
"for file in glob.glob('*.mp4'):\n",
" display(Video(filename=file))"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Need Support?\n",
"\n",
"For questions, feedback, please reach out in our [forum](https://github.com/filecoin-project/bacalhau/discussions)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "9ac03a0a6051494cc606d484d27d20fce22fb7b4d169f583271e11d5ba46a56e"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}