# ISB-CGC Community Notebooks
Check out more notebooks at our [Community Notebooks Repository](https://github.com/isb-cgc/Community-Notebooks)!

```
Title: How to convert 10X bams to fastq files
Author: David L Gibbs
Created: 2019-08-07
Purpose: Demonstrate how to make fastq files from 10X bams
Notes: 
```

# Using 10X bamtofastq to convert a bam file to fastq files.

In this example, we'll be using the Google Genomics Pipelines API. The Pipelines API makes it easy to run a job without having to spin up and shut down a VM yourself. That is all done automatically.

This work uses materials from https://cloud.google.com/genomics/docs/quickstart.

Docs for the genomics pipeline run: https://cloud.google.com/sdk/gcloud/reference/alpha/genomics/pipelines/run

For this to work, we need to make sure that the Google Genomics API is enabled. To do that, from the main menu in the cloud console, select 'APIs & Services'. The API is called: genomics.googleapis.com.


In this case, we're going to be using data and compiled software from 10X.
The software is compatible with Ubuntu, so we'll select an
Ubuntu image to run the pipeline.

Software can be found here:
https://support.10xgenomics.com/docs/bamtofastq

Data is found here:
https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_1k_protein_v3


In [56]:
# This notebook cell specifies the job parameters
# We are using an ubuntu docker image which doesn't 
# contain the 'wget' program, so we install that, 
# then use it to get software from 10X, which finally
# is used to convert the bam file to fastqs.
# everything in the outputPath will be copied back to
# cloud storage.

# Note on the input files:
# ${file} contains the full path.
# If there are multiple bam files to process,
# we could use
# for file in $(ls /path/*.bam | sed 's/^.*\///')  # to keep only the file name
# and write each set of fastq files to /mnt/data/output/${file}

params='''
name: bamtofastq
description: Run 10X bamtofastq on bams

resources:
  zones:
  - us-west1-b

  disks:
  - name: datadisk
    autoDelete: True

    # Within the Docker container, specify a mount point for the disk.
    mountPoint: /mnt/data

docker:
  imageName: ubuntu:19.10

  # The Pipelines API does not create the output directory.
  # wget and bamtofastq are installed once, before looping over the bam files.
  cmd: >
    mkdir /mnt/data/output &&
    find /mnt/data/input &&
    apt-get update &&
    apt-get --yes install wget &&
    wget http://cf.10xgenomics.com/misc/bamtofastq &&
    chmod +x bamtofastq &&
    for file in $(/bin/ls /mnt/data/input/*.bam); do
      ./bamtofastq ${file} /mnt/data/output/fastq;
      tar -czvf /mnt/data/output/fastq.tar.gz /mnt/data/output/fastq;
    done


inputParameters:
- name: bamFile
  description: the sorted bam file
  localCopy:
    path: input/
    disk: datadisk
- name: bamIndex
  description: bam file index
  localCopy:
    path: input/
    disk: datadisk

outputParameters:
- name: outputPath
  description: Cloud Storage path for where bamtofastq writes
  localCopy:
    path: output/*
    disk: datadisk

'''

# Write the pipeline definition to disk; the file is closed automatically.
with open('10X_bamtofastq.yaml', 'w') as fout:
    fout.write(params)



See: https://cloud.google.com/sdk/gcloud/reference/alpha/genomics/pipelines/run for more parameters.
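
The comments in the job-parameters cell mention handling several bam files by keeping only each file's name and writing each set of fastq files to its own output directory. A minimal Python sketch of that naming step follows. The helper name `fastq_dir_for` and the second input file are hypothetical, and the `/mnt/data` paths are assumed to match the mount point used in the YAML.

```python
from pathlib import PurePosixPath


def fastq_dir_for(bam_path, output_root="/mnt/data/output"):
    """Return a per-bam output directory named after the bam file (hypothetical helper)."""
    stem = PurePosixPath(bam_path).stem  # "sample.bam" -> "sample"
    return str(PurePosixPath(output_root) / stem)


bams = [
    "/mnt/data/input/pbmc_1k_protein_v3_possorted_genome_bam.bam",
    "/mnt/data/input/another_sample.bam",  # hypothetical second input
]

for bam in bams:
    print(bam, "->", fastq_dir_for(bam))
```

Giving each bam its own directory keeps one conversion from writing into the directory produced by another.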



In [30]:
# We're going to save this bucket location to our environment.
# Not necessary, but is useful in some cases.
%set_env TEMP=gs://cgc_scrnaseq_temp

env: TEMP=gs://cgc_scrnaseq_temp


***
The call to the Pipelines API takes parameters pointing to the
inputs and the output directory and a location for log files. It can also take
parameters that set up the compute environment, such as disk size
and number of processors.
***
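
As an illustration of how those parameters fit together, here is a small Python sketch that assembles the same `gcloud` call as an argument list. The function name `build_run_command` is hypothetical; the flags and bucket paths are the ones used in this notebook. It only builds and prints the command and does not submit a job.

```python
import shlex


def build_run_command(pipeline_file, inputs, outputs, logging, disk_size=None):
    """Assemble the argument list for `gcloud alpha genomics pipelines run`."""
    cmd = ["gcloud", "alpha", "genomics", "pipelines", "run",
           "--pipeline-file", pipeline_file]
    # Each input and output is passed as NAME=VALUE, matching the YAML parameter names.
    for name, uri in inputs.items():
        cmd += ["--inputs", f"{name}={uri}"]
    for name, uri in outputs.items():
        cmd += ["--outputs", f"{name}={uri}"]
    cmd += ["--logging", logging]
    if disk_size:
        cmd += ["--disk-size", disk_size]
    return cmd


bucket = "gs://cgc_bam_bucket_007"
bam = f"{bucket}/pbmc_1k_protein_v3_possorted_genome_bam.bam"

cmd = build_run_command(
    "10X_bamtofastq.yaml",
    inputs={"bamFile": bam, "bamIndex": bam + ".bai"},
    outputs={"outputPath": f"{bucket}/output/"},
    logging="gs://cgc_scrnaseq_temp/10X_bamtofastq.log",
    disk_size="datadisk:200",
)
print(shlex.join(cmd))
```

To actually submit the job from Python, the list could be passed to `subprocess.run(cmd, check=True)`, assuming `gcloud` is installed and authenticated.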

In [57]:
!gcloud alpha genomics pipelines run \
 --pipeline-file 10X_bamtofastq.yaml \
 --inputs bamFile=gs://cgc_bam_bucket_007/pbmc_1k_protein_v3_possorted_genome_bam.bam \
 --inputs bamIndex=gs://cgc_bam_bucket_007/pbmc_1k_protein_v3_possorted_genome_bam.bam.bai \
 --outputs outputPath=gs://cgc_bam_bucket_007/output/ \
 --logging "${TEMP}/10X_bamtofastq.log" \
 --disk-size datadisk:200

Running [operations/EIrt7PTGLRj6xq_K45OovFsg0JGr1fcSKg9wcm9kdWN0aW9uUXVldWU].


***
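You can then wait for the job to finish using the following command. The `describe` and `cancel` subcommands of `gcloud alpha genomics operations` take the same operation ID to check the status or cancel the job.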

In [60]:
!gcloud alpha genomics operations wait EIrt7PTGLRj6xq_K45OovFsg0JGr1fcSKg9wcm9kdWN0aW9uUXVldWU

Waiting for [operations/EIrt7PTGLRj6xq_K45OovFsg0JGr1fcSKg9wcm9kdWN0aW9uUXVldWU
]...done.
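
The operation ID passed to `wait` is the part after `operations/` in the `Running [...]` line printed by the run step. A minimal sketch of pulling it out with a regular expression is shown below. The helper name `operation_id` is hypothetical, and the exact output format is assumed from the line shown in this notebook.

```python
import re


def operation_id(run_output):
    """Return the ID from a 'Running [operations/<id>].' line, or None if absent."""
    match = re.search(r"Running \[operations/([^\]\s]+)\]", run_output)
    return match.group(1) if match else None


line = "Running [operations/EIrt7PTGLRj6xq_K45OovFsg0JGr1fcSKg9wcm9kdWN0aW9uUXVldWU]."
op = operation_id(line)

# Print the follow-up commands for this operation without running them.
for action in ("describe", "wait", "cancel"):
    print(f"gcloud alpha genomics operations {action} {op}")
```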
