{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# ISB-CGC Community Notebooks¶\n",
    "Check out more notebooks at our [Community Notebooks Repository](https://github.com/isb-cgc/Community-Notebooks)!\n",
    "\n",
    "```\n",
    "Title:   How to convert 10X bams to fastq files\n",
    "Author:  David L Gibbs\n",
    "Created: 2019-08-07\n",
    "Purpose: Demonstrate how to make fastq files from 10X bams\n",
    "Notes:   \n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Using 10X bamtofastq to convert a bam file to fastq files."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this example, we'll be using the Google Genomics Pipelines API. The pipelines API makes it easy to run a job without having to  spin up and shut down a VM. It's all done automatically.\n",
    "\n",
    "The work is uses materials from https://cloud.google.com/genomics/docs/quickstart.\n",
    "\n",
    "Docs for the genomics pipeline run: https://cloud.google.com/sdk/gcloud/reference/alpha/genomics/pipelines/run\n",
    "\n",
    "For this to work, we need to make sure that the Google Genomics API is enabled. To do that, from the main menu in the cloud console, select 'APIs & Services'. The API is called: genomics.googleapis.com."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "In this case, we're going to be using data and compiled software from 10X,\n",
    "which is compatible with Ubuntu. Therefore, we'll select an\n",
    "ubuntu image to run the pipeline.\n",
    "\n",
    "Software can be found here:\n",
    "https://support.10xgenomics.com/docs/bamtofastq\n",
    "\n",
    "Data is found here:\n",
    "https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_1k_protein_v3\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [],
   "source": [
    "# This notebook cell specifies the job parameters\n",
    "# We are using an ubuntu docker image which doesn't \n",
    "# contain the 'wget' program, so we install that, \n",
    "# then use it to get software from 10X, which finally\n",
    "# is used to convert the bam file to fastqs.\n",
    "# everything in the outputPath will be copied back to\n",
    "# cloud storage.\n",
    "\n",
    "# note on the input files..\n",
    "# the ${file} contains the full path.\n",
    "# if there's multiple bam files to process\n",
    "# we could use \n",
    "# for file in $(ls /path/*.bam | sed s/^.*\\\\/\\//) # to get the last one\n",
    "# and we could write those fastq files in /mnt/data/output/{$file}\n",
    "\n",
    "params='''\n",
    "name: bamtofastq\n",
    "description: Run 10X bamtofastq on bams \n",
    "\n",
    "resources:\n",
    "  zones:\n",
    "  - us-west1-b\n",
    "\n",
    "  disks:\n",
    "  - name: datadisk\n",
    "    autoDelete: True\n",
    "\n",
    "    # Within the Docker container, specify a mount point for the disk.\n",
    "    mountPoint: /mnt/data\n",
    "\n",
    "docker:\n",
    "  imageName: ubuntu:19.10\n",
    "\n",
    "  # The Pipelines API does not create the output directory.\n",
    "  cmd: >\n",
    "    mkdir /mnt/data/output &&\n",
    "    find /mnt/data/input &&\n",
    "    for file in $(/bin/ls /mnt/data/input/*.bam); do\n",
    "        apt-get update;\n",
    "        apt-get --yes install wget;\n",
    "        wget http://cf.10xgenomics.com/misc/bamtofastq;\n",
    "        chmod +x bamtofastq;\n",
    "        ./bamtofastq ${file} /mnt/data/output/fastq;\n",
    "        tar -czvf /mnt/data/output/fastq.tar.gz /mnt/data/output/fastq;\n",
    "    done\n",
    "\n",
    "\n",
    "inputParameters:\n",
    "- name: bamFile\n",
    "  description: the sorted bam file\n",
    "  localCopy:\n",
    "    path: input/\n",
    "    disk: datadisk\n",
    "- name: bamIndex\n",
    "  description: bam file index\n",
    "  localCopy:\n",
    "    path: input/\n",
    "    disk: datadisk\n",
    "    \n",
    "outputParameters:\n",
    "- name: outputPath\n",
    "  description: Cloud Storage path for where bamtofastq writes\n",
    "  localCopy:\n",
    "    path: output/*\n",
    "    disk: datadisk    \n",
    "\n",
    "'''\n",
    "\n",
    "fout = open('10X_bamtofastq.yaml','w')\n",
    "fout.write(params)\n",
    "fout.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "\n",
    "See: https://cloud.google.com/sdk/gcloud/reference/alpha/genomics/pipelines/run  for more parameters.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "env: TEMP=gs://cgc_scrnaseq_temp\n"
     ]
    }
   ],
   "source": [
    "# We're going to save this bucket location to our environment.\n",
    "# Not necessary, but is useful in some cases.\n",
    "%set_env TEMP=gs://cgc_scrnaseq_temp"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***\n",
    "The call to the pipelines API takes parameters pointing to the\n",
    "inputs and output dir, locations to save log files, and can take \n",
    "parameters that designate the compute environment, such as disk size\n",
    "and number of processors.\n",
    "***"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Running [operations/EIrt7PTGLRj6xq_K45OovFsg0JGr1fcSKg9wcm9kdWN0aW9uUXVldWU].\n"
     ]
    }
   ],
   "source": [
    "!gcloud alpha genomics pipelines run \\\n",
    "    --pipeline-file 10X_bamtofastq.yaml \\\n",
    "    --inputs bamFile=gs://cgc_bam_bucket_007/pbmc_1k_protein_v3_possorted_genome_bam.bam \\\n",
    "    --inputs bamIndex=gs://cgc_bam_bucket_007/pbmc_1k_protein_v3_possorted_genome_bam.bam.bai \\\n",
    "    --outputs outputPath=gs://cgc_bam_bucket_007/output/ \\\n",
    "    --logging \"${TEMP}/10X_bamtofastq.log\" \\\n",
    "    --disk-size datadisk:200"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***\n",
    "Then you can check the status (describe), or cancel the job using the following command."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Waiting for [operations/EIrt7PTGLRj6xq_K45OovFsg0JGr1fcSKg9wcm9kdWN0aW9uUXVldWU\n",
      "]...done.\n"
     ]
    }
   ],
   "source": [
    "!gcloud alpha genomics operations wait EIrt7PTGLRj6xq_K45OovFsg0JGr1fcSKg9wcm9kdWN0aW9uUXVldWU"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}