{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# ISB-CGC Community Notebooks\n", "Check out more notebooks at our [Community Notebooks Repository](https://github.com/isb-cgc/Community-Notebooks)!\n", "\n", "```\n", "Title: How to convert 10X BAMs to FASTQ files using dsub\n", "Author: David L Gibbs\n", "Created: 2019-08-07\n", "Purpose: Demonstrate how to make FASTQ files from 10X BAMs\n", "Notes: \n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# How to use dsub to convert 10X BAM files to FASTQs" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "In this example, we'll use DataBiosphere's dsub, which makes it easy to run a job without manually spinning up and shutting down a VM; that is all done automatically.\n", "\n", "https://github.com/DataBiosphere/dsub\n", "\n", "Docs for the genomics pipeline run: https://cloud.google.com/sdk/gcloud/reference/alpha/genomics/pipelines/run\n", "\n", "For this to work, we need to make sure that the Google Genomics API is enabled. To do that, from the main menu in the Cloud Console, select 'APIs & Services'. !~/.local/bin/dsub # hello world test

# using the local provider (--provider local)
# is a faster way to develop the task

! ~/.local/bin/dsub \
 --provider local \
 --logging /tmp/dsub-test/logging/ \
 --output OUT=/tmp/dsub-test/output/out.txt \
 --command 'echo "Hello World" > "${OUT}"' \
 --wait +x bamtofastq;\n", "./bamtofastq ${INPUT_FILE} $(dirname ${OUTPUT_FOLDER})/fastq;\n" ] } ], "source": [ "# dsub can also take a shell script (passed with --script).\n", "\n", "cmd = '''\n", "apt-get update;\n", "apt-get --yes install wget;\n", "wget http://cf.10xgenomics.com/misc/bamtofastq;\n", "chmod +x bamtofastq;\n", "# bamtofastq needs an output directory that does not exist yet,\n", "# so write to a new subdirectory of OUTPUT_FOLDER.\n", "OUTPUT_DIR=\"$OUTPUT_FOLDER/fastq\";\n", "./bamtofastq \"${INPUT_FILE}\" \"${OUTPUT_DIR}\";\n", "'''\n", "\n", "# Write the script to disk; the context manager closes the file for us.\n", "with open('job.sh', 'w') as fout:\n", "    fout.write(cmd)\n", "\n", "!cat job.sh" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "# By default, dsub runs the job in an Ubuntu image,\n", "# which is convenient because bamtofastq is compatible with it." ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Job: job--jupyter--190808-184740-70\n", "Launched job-id: !~/.local/bin/dsub \
 --provider google-v2 \
 --project cgc-05-0180 \
 --zones "us-west1-*" \
 --script job.sh \
 --input INPUT_FILE="gs://cgc_bam_bucket_007/pbmc_1k_protein_v3_possorted_genome_bam.bam" \
 --output-recursive OUTPUT_FOLDER="gs://cgc_output/testout/" \
 --disk-size 200 \
 --logging "gs://cgc_temp_02/testout" \
 --wait
 
 
# Error to watch for (bamtofastq will not write into an existing directory): error creating output directory: \"/mnt/data/output/gs/cruk_data_02\". Does it already exist? " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's it! We can check the output with:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!gsutil ls gs://cgc_output/testout/" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 4 }