{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# ISB-CGC Community Notebooks¶\n", "Check out more notebooks at our [Community Notebooks Repository](https://github.com/isb-cgc/Community-Notebooks)!\n", "\n", "```\n", "Title: How to create convert 10X bams to fastq files using dsub\n", "Author: David L Gibbs\n", "Created: 2019-08-07\n", "Purpose: Demonstrate how to make fastq files from 10X bams\n", "Notes: \n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# How to use dsub to convert 10X bam files to fastqs" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "In this example, we'll be using DataBiosphere's dsub. dsub makes it easy to run a job without having to spin up and shut down a VM. It's all done automatically. \n", "\n", "https://github.com/DataBiosphere/dsub\n", "\n", "Docs for the genomics pipeline run: https://cloud.google.com/sdk/gcloud/reference/alpha/genomics/pipelines/run\n", "\n", "For this to work, we need to make sure that the Google Genomics API is enabled. To do that, from the main menu in the cloud console, select 'APIs & Services'. The API is called: genomics.googleapis.com." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: dsub in ./.local/lib/python2.7/site-packages\n", "Requirement already satisfied: oauth2client in /usr/local/lib/python2.7/dist-packages (from dsub)\n", "Requirement already satisfied: six in /usr/local/lib/python2.7/dist-packages (from dsub)\n", "Requirement already satisfied: python-dateutil in /usr/local/lib/python2.7/dist-packages (from dsub)\n", "Requirement already satisfied: pyyaml in /usr/local/lib/python2.7/dist-packages (from dsub)\n", "Requirement already satisfied: pytz in /usr/local/lib/python2.7/dist-packages (from dsub)\n", "Requirement already satisfied: parameterized in ./.local/lib/python2.7/site-packages (from dsub)\n", "Requirement already satisfied: google-api-python-client in /usr/local/lib/python2.7/dist-packages (from dsub)\n", "Requirement already satisfied: retrying in /usr/local/lib/python2.7/dist-packages (from dsub)\n", "Requirement already satisfied: tabulate in ./.local/lib/python2.7/site-packages (from dsub)\n", "Requirement already satisfied: rsa>=3.1.4 in /usr/local/lib/python2.7/dist-packages (from oauth2client->dsub)\n", "Requirement already satisfied: httplib2>=0.9.1 in /usr/local/lib/python2.7/dist-packages (from oauth2client->dsub)\n", "Requirement already satisfied: pyasn1-modules>=0.0.5 in /usr/local/lib/python2.7/dist-packages (from oauth2client->dsub)\n", "Requirement already satisfied: pyasn1>=0.1.7 in /usr/local/lib/python2.7/dist-packages (from oauth2client->dsub)\n", "Requirement already satisfied: google-auth>=1.4.1 in /usr/local/lib/python2.7/dist-packages (from google-api-python-client->dsub)\n", "Requirement already satisfied: google-auth-httplib2>=0.0.3 in /usr/local/lib/python2.7/dist-packages (from google-api-python-client->dsub)\n", "Requirement already satisfied: uritemplate<4dev,>=3.0.0 in /usr/local/lib/python2.7/dist-packages (from google-api-python-client->dsub)\n", "Requirement already satisfied: cachetools>=2.0.0 in /usr/local/lib/python2.7/dist-packages (from google-auth>=1.4.1->google-api-python-client->dsub)\n" ] } ], "source": [ "\n", "# first to install dsub,\n", "# it's also possible to install it directly from \n", "# github\n", "\n", "!pip install dsub" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Name: dsub\n", "Version: 0.3.2\n", "Summary: A command-line tool that makes it easy to submit and run batch scripts in the cloud\n", "Home-page: https://github.com/DataBiosphere/dsub\n", "Author: Verily\n", "Author-email: UNKNOWN\n", "License: Apache\n", "Location: /home/jupyter/.local/lib/python2.7/site-packages\n", "Requires: oauth2client, six, python-dateutil, pyyaml, pytz, parameterized, google-api-python-client, retrying, tabulate\n" ] } ], "source": [ "# let's see if it's installed OK\n", "\n", "!pip show dsub" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "usage: /home/jupyter/.local/bin/dsub [-h] [--provider PROVIDER]\n", " [--version VERSION] [--unique-job-id]\n", " [--name NAME]\n", " [--tasks [FILE M-N [FILE M-N ...]]]\n", " [--image IMAGE] [--dry-run]\n", " [--command COMMAND] [--script SCRIPT]\n", " [--env [KEY=VALUE [KEY=VALUE ...]]]\n", " [--label [KEY=VALUE [KEY=VALUE ...]]]\n", " [--input [KEY=REMOTE_PATH [KEY=REMOTE_PATH ...]]]\n", " [--input-recursive [KEY=REMOTE_PATH [KEY=REMOTE_PATH ...]]]\n", " [--output [KEY=REMOTE_PATH [KEY=REMOTE_PATH ...]]]\n", " [--output-recursive [KEY=REMOTE_PATH [KEY=REMOTE_PATH ...]]]\n", " [--user USER]\n", " [--user-project USER_PROJECT]\n", " [--mount [KEY=PATH_SPEC [KEY=PATH_SPEC ...]]]\n", " [--wait] [--retries RETRIES]\n", " [--poll-interval POLL_INTERVAL]\n", " [--after AFTER [AFTER ...]] [--skip]\n", " [--min-cores MIN_CORES]\n", " [--min-ram MIN_RAM]\n", " [--disk-size DISK_SIZE]\n", " [--logging LOGGING] [--project PROJECT]\n", " [--boot-disk-size BOOT_DISK_SIZE]\n", " [--preemptible]\n", " [--zones ZONES [ZONES ...]]\n", " [--scopes SCOPES [SCOPES ...]]\n", " [--accelerator-type ACCELERATOR_TYPE]\n", " [--accelerator-count ACCELERATOR_COUNT]\n", " [--keep-alive KEEP_ALIVE]\n", " [--regions REGIONS [REGIONS ...]]\n", " [--machine-type MACHINE_TYPE]\n", " [--cpu-platform CPU_PLATFORM]\n", " [--network NETWORK]\n", " [--subnetwork SUBNETWORK]\n", " [--use-private-address]\n", " [--timeout TIMEOUT]\n", " [--log-interval LOG_INTERVAL] [--ssh]\n", " [--nvidia-driver-version NVIDIA_DRIVER_VERSION]\n", " [--service-account SERVICE_ACCOUNT]\n", " [--disk-type DISK_TYPE]\n", " [--enable-stackdriver-monitoring]\n", "/home/jupyter/.local/bin/dsub: error: argument --project is required\n" ] } ], "source": [ "# pip install software in the /.local/bin directory .. not part of PATH yet\n", "\n", "!~/.local/bin/dsub" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Job: echo--jupyter--190808-173557-030088\n", "Launched job-id: echo--jupyter--190808-173557-030088\n", "To check the status, run:\n", " dstat --provider local --jobs 'echo--jupyter--190808-173557-030088' --users 'jupyter' --status '*'\n", "To cancel the job, run:\n", " ddel --provider local --jobs 'echo--jupyter--190808-173557-030088' --users 'jupyter'\n", "Waiting for job to complete...\n", "Waiting for: echo--jupyter--190808-173557-030088.\n", " echo--jupyter--190808-173557-030088: SUCCESS\n", "echo--jupyter--190808-173557-030088\n" ] } ], "source": [ "# hello world test\n", "\n", "# using the local provider (--provider local)\n", "# is a faster way to develop the task\n", "\n", "! ~/.local/bin/dsub \\\n", " --provider local \\\n", " --logging /tmp/dsub-test/logging/ \\\n", " --output OUT=/tmp/dsub-test/output/out.txt \\\n", " --command 'echo \"Hello World\" > \"${OUT}\"' \\\n", " --wait" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Hello World\n" ] } ], "source": [ "\n", "# and we can check the output\n", "!cat /tmp/dsub-test/output/out.txt" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "apt-get update;\n", "apt-get --yes install wget;\n", "wget http://cf.10xgenomics.com/misc/bamtofastq;\n", "chmod +x bamtofastq;\n", "./bamtofastq ${INPUT_FILE} $(dirname ${OUTPUT_FOLDER})/fastq;\n" ] } ], "source": [ "# dsub can take a shell script..\n", "\n", "cmd = '''\n", "apt-get update;\n", "apt-get --yes install wget;\n", "wget http://cf.10xgenomics.com/misc/bamtofastq;\n", "chmod +x bamtofastq;\n", "OUTPUT_DIR=\"$OUTPUT_FOLDER/fastq\";", "./bamtofastq ${INPUT_FILE} ${OUTPUT_DIR};", "'''\n", "\n", "fout = open('job.sh', 'w')\n", "fout.write(cmd)\n", "fout.close()\n", "\n", "!cat job.sh" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "# default for dsub is for a ubuntu image\n", "# which is great, because bamtofastq is compatible " ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Job: job--jupyter--190808-184740-70\n", "Launched job-id: job--jupyter--190808-184740-70\n", "To check the status, run:\n", " dstat --provider google-v2 --project cgc-05-0180 --jobs 'job--jupyter--190808-184740-70' --users 'jupyter' --status '*'\n", "To cancel the job, run:\n", " ddel --provider google-v2 --project cgc-05-0180 --jobs 'job--jupyter--190808-184740-70' --users 'jupyter'\n", "Waiting for job to complete...\n", "Waiting for: job--jupyter--190808-184740-70.\n", " job--jupyter--190808-184740-70: SUCCESS\n", "job--jupyter--190808-184740-70\n" ] } ], "source": [ "!~/.local/bin/dsub \\\n", " --provider google-v2 \\\n", " --project cgc-05-0180 \\\n", " --zones \"us-west1-*\" \\\n", " --script job.sh \\\n", " --input INPUT_FILE=\"gs://cgc_bam_bucket_007/pbmc_1k_protein_v3_possorted_genome_bam.bam\" \\\n", " --output-recursive OUTPUT_FOLDER=\"gs://cgc_output/testout/\" \\\n", " --disk-size 200 \\\n", " --logging \"gs://cgc_temp_02/testout\" \\\n", " --wait\n", " \n", " \n", "#error: error creating output directory: \"/mnt/data/output/gs/cruk_data_02\". Does it already exist? " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's it! We can check the output with:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!gsutil ls gs://cgc_bam_bucket_007/output" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 4 }