## Convert _P. generosa_ GFF to GTF

### Notebook relies on:

- [GffRead](https://github.com/gpertea/gffread)

### Addresses [this GitHub Issue](https://github.com/RobertsLab/resources/issues/1411)

### List computer specs

In [1]:
%%bash
echo "TODAY'S DATE:"
date
echo "------------"
echo ""
#Display operating system info
lsb_release -a
echo ""
echo "------------"
echo "HOSTNAME: "; hostname 
echo ""
echo "------------"
echo "Computer Specs:"
echo ""
lscpu
echo ""
echo "------------"
echo ""
echo "Memory Specs"
echo ""
free -mh

TODAY'S DATE:
Tue 01 Mar 2022 10:33:08 AM PST
------------

Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.3 LTS
Release:	20.04
Codename:	focal

------------
HOSTNAME: 
computer

------------
Computer Specs:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 45 bits physical, 48 bits virtual
CPU(s): 2
On-line CPU(s) list: 0,1
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 2
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 165
Model name: Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz
Stepping: 2
CPU MHz: 2400.008
BogoMIPS: 4800.01
Hypervisor vendor: VMware
Virtualization type: full
L1d cache: 64 KiB
L1i cache: 64 KiB
L2 cache: 512 KiB
L3 cache: 32 MiB
NUMA node0 CPU(s): 0,1
Vulnerability Itlb multihit: KVM: Mitigation: VMX unsupported
Vulnerability L1tf: Mitigation; PTE Inversion
Vulnerability Mds: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown: Mitigation; PTI
Vulnerabilit

No LSB modules are available.


### Set variables
- `%env` sets an environment variable, which is available in bash cells

- An assignment without `%env` creates a Python variable

In [1]:
# Set directories, input/output files
%env data_dir=/home/sam/data/P_generosa/genomes
%env analysis_dir=/home/sam/analyses/20220301-pgen-gff_to_gtf
analysis_dir="20220301-pgen-gff_to_gtf"

# Input file (from the gannet server)
%env gff=Panopea-generosa-v1.0.a4.gff3

# URL of the directory to download the input file from
%env url=https://gannet.fish.washington.edu/Atumefaciens/20191105_swoose_pgen_v074_renaming

# Output file(s)
%env gtf=Panopea-generosa-v1.0.a4.gtf


# Set program locations
%env gffread=/home/sam/programs/gffread-0.12.7.Linux_x86_64/gffread

env: data_dir=/home/sam/data/P_generosa/genomes
env: analysis_dir=/home/sam/analyses/20220301-pgen-gff_to_gtf
env: gff=Panopea-generosa-v1.0.a4.gff3
env: url=https://gannet.fish.washington.edu/Atumefaciens/20191105_swoose_pgen_v074_renaming
env: gtf=Panopea-generosa-v1.0.a4.gtf
env: gffread=/home/sam/programs/gffread-0.12.7.Linux_x86_64/gffread
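
Every later bash cell depends on the variables set above, so it can help to fail early if one is missing (for example, after a kernel restart). The following is a minimal sketch, not part of the original notebook; `require_vars` is a hypothetical helper name.

```shell
# Hypothetical helper: fail early if any named variable is unset or empty.
require_vars() {
  local name value
  for name in "$@"; do
    # Look up the variable whose name is stored in ${name}.
    eval "value=\${${name}:-}"
    if [ -z "${value}" ]; then
      echo "ERROR: required variable '${name}' is not set" >&2
      return 1
    fi
  done
}
```

In a `%%bash` cell this could be called as `require_vars data_dir analysis_dir gff url gtf gffread` before any of the steps below.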


### Create analysis directory

In [2]:
%%bash
# Make analysis and data directories, if they don't exist
mkdir --parents "${analysis_dir}"

mkdir --parents "${data_dir}"

### Download GFF

In [3]:
%%bash
cd "${data_dir}"

# Download with wget.
# Use --quiet option to prevent wget output from printing too many lines to notebook
# Use --continue to prevent re-downloading file if it's already been downloaded.
# Use --no-check-certificate to avoid download error from gannet
wget --quiet \
--continue \
--no-check-certificate \
"${url}/${gff}"

ls -ltrh "${gff}"

-rw-rw-r-- 1 sam sam 454M Mar 19 07:58 Panopea-generosa-v1.0.a4.gff3
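
Because `wget --quiet` suppresses its messages, a truncated download or an HTML error page saved under the expected file name could go unnoticed. As a rough sketch (the function name `is_gff3` is an assumption, not something from the original notebook), the downloaded file can be checked for the GFF3 version line before continuing:

```shell
# Hypothetical helper: succeed only if the file exists, is non-empty,
# and its first line begins with the GFF3 version pragma.
is_gff3() {
  local file="$1"
  # -s is true only for an existing file with a size greater than zero.
  [ -s "${file}" ] || return 1
  case "$(head -n 1 "${file}")" in
    "##gff-version 3"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `is_gff3 "${data_dir}/${gff}" || echo "Download looks wrong" >&2`. This only inspects the first line, so it does not validate the rest of the file.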


### Examine GFF

In [4]:
%%bash
head -n 20 "${data_dir}"/"${gff}"

##gff-version 3
##Generated using GenSAS, Monday 7th of October 2019 04:54:37 AM
##Project Name : Pgenerosa_v074
Scaffold_01	GenSAS_5d9637f372b5d-publish	gene	2	4719	.	+	.	ID=PGEN_.00g000010;Name=PGEN_.00g000010;original_ID=21510-PGEN_.00g234140;Alias=21510-PGEN_.00g234140;original_name=21510-PGEN_.00g234140;Notes=sp|Q86IC9|CAMT1_DICDI [BLAST protein vs protein (blastp) 2.7.1],PF01596.12 [Pfam 1.6]
Scaffold_01	GenSAS_5d9637f372b5d-publish	mRNA	2	4719	.	+	.	ID=PGEN_.00g000010.m01;Name=PGEN_.00g000010.m01;Parent=PGEN_.00g000010;original_ID=21510-PGEN_.00g234140.m01;Alias=21510-PGEN_.00g234140.m01;original_name=21510-PGEN_.00g234140
Scaffold_01	GenSAS_5d9637f372b5d-publish	exon	2	125	.	+	.	ID=PGEN_.00g000010.m01.exon01;Name=PGEN_.00g000010.m01.exon01;Parent=PGEN_.00g000010.m01;original_ID=21510-PGEN_.00g234140.m01.exon1;Alias=21510-PGEN_.00g234140.m01.exon1
Scaffold_01	GenSAS_5d9637f372b5d-publish	CDS	2	125	.	+	0	ID=PGEN_.00g000010.m01.CDS01;Name=PGEN_.00g000010.m01.CDS01;Parent=PGEN_.00g
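
`head` shows only the first few records. To see which feature types the GFF contains overall, column 3 can be tallied with `awk`. This is an illustrative sketch; `count_feature_types` is a hypothetical name.

```shell
# Hypothetical helper: tally the feature types (column 3) in a GFF file.
# Prints "<count><TAB><type>" per type, sorted by type name.
count_feature_types() {
  # Skip comment/pragma lines and anything without the nine GFF columns.
  awk -F'\t' '!/^#/ && NF >= 9 { n[$3]++ }
              END { for (t in n) print n[t] "\t" t }' "$1" |
    sort -k2,2
}
```

Usage would be `count_feature_types "${data_dir}/${gff}"`.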

### Convert GFF to GTF

In [5]:
%%bash
cd "${data_dir}"

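# -E: warn about duplicate transcript IDs and other potential problems in the GFF
# -T: write GTF instead of the default GFF3
# GTF goes to the analysis directory; warnings are captured in a separate stderr file
"${gffread}" -E \
"${data_dir}/${gff}" -T \
1> "${analysis_dir}/${gtf}" \
2> "${analysis_dir}/gffread-gff_to_gtf.stderr"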

### Inspect GTF

In [6]:
%%bash
head "${analysis_dir}/${gtf}"

Scaffold_01	GenSAS_5d9637f372b5d-publish	transcript	2	4719	.	+	.	transcript_id "PGEN_.00g000010.m01"; gene_id "PGEN_.00g000010"
Scaffold_01	GenSAS_5d9637f372b5d-publish	exon	2	125	.	+	.	transcript_id "PGEN_.00g000010.m01"; gene_id "PGEN_.00g000010";
Scaffold_01	GenSAS_5d9637f372b5d-publish	exon	1995	2095	.	+	.	transcript_id "PGEN_.00g000010.m01"; gene_id "PGEN_.00g000010";
Scaffold_01	GenSAS_5d9637f372b5d-publish	exon	3325	3495	.	+	.	transcript_id "PGEN_.00g000010.m01"; gene_id "PGEN_.00g000010";
Scaffold_01	GenSAS_5d9637f372b5d-publish	exon	4651	4719	.	+	.	transcript_id "PGEN_.00g000010.m01"; gene_id "PGEN_.00g000010";
Scaffold_01	GenSAS_5d9637f372b5d-publish	CDS	2	125	.	+	0	transcript_id "PGEN_.00g000010.m01"; gene_id "PGEN_.00g000010";
Scaffold_01	GenSAS_5d9637f372b5d-publish	CDS	1995	2095	.	+	2	transcript_id "PGEN_.00g000010.m01"; gene_id "PGEN_.00g000010";
Scaffold_01	GenSAS_5d9637f372b5d-publish	CDS	3325	3495	.	+	0	transcript_id "PGEN_.00g000010.m01"; gene_id "PGEN_.00g000010";
S
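
The records shown above each carry both a `transcript_id` and a `gene_id` attribute. A quick way to confirm that this holds for the whole file, not just the first lines, is to count records missing either one. The sketch below is an assumption about a useful sanity check, not a step from the original notebook, and `count_missing_ids` is a hypothetical name.

```shell
# Hypothetical helper: count GTF records whose attribute column (9)
# lacks a quoted gene_id or a quoted transcript_id.
count_missing_ids() {
  awk -F'\t' '!/^#/ && NF > 0 &&
              ($9 !~ /gene_id "[^"]+"/ || $9 !~ /transcript_id "[^"]+"/) { n++ }
              END { print n + 0 }' "$1"
}
```

Running `count_missing_ids "${analysis_dir}/${gtf}"` prints the number of such records; a non-zero value would be worth checking against `gffread-gff_to_gtf.stderr`.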

### Generate checksum(s)

In [7]:
%%bash
cd "${analysis_dir}"

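for file in *
do
  # Skip the checksum file itself so re-running the cell doesn't checksum it
  [ "${file}" = "checksums.md5" ] && continue
  md5sum "${file}" | tee --append checksums.md5
done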

c50d7afa83901c90e48fc53a024b51f8 gffread-gff_to_gtf.stderr
a11a9eb85db1b4bed8df58658f6c0ecb Panopea-generosa-v1.0.a4.gtf
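
The `checksums.md5` file is only useful if it is checked later, for instance after copying the analysis directory to another machine. A minimal sketch of that check, using the `-c` (check) mode of `md5sum`; `verify_checksums` is a hypothetical helper name:

```shell
# Hypothetical helper: verify the files listed in checksums.md5.
# Runs in a subshell so the caller's working directory is unchanged.
verify_checksums() {
  local dir="$1"
  ( cd "${dir}" && md5sum -c checksums.md5 )
}
```

`verify_checksums "${analysis_dir}"` prints one status line per file and returns a non-zero status if any file fails to match.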


### Document GffRead program options

In [8]:
%%bash
${gffread} -h

gffread v0.12.7. Usage:
gffread [-g | ] [-s ] 
 [-o ] [-t ] [-r []:- [-R]]
 [--jmatch :-] [--no-pseudo] 
 [-CTVNJMKQAFPGUBHZWTOLE] [-w ] [-x ] [-y ]
 [-j ][--ids | --nids ] [--attrs ] [-i ]
 [--stream] [--bed | --gtf | --tlf] [--table ] [--sort-by ]
 [] 

 Filter, convert or cluster GFF/GTF/BED records, extract the sequence of
 transcripts (exon or CDS) and more.
 By default (i.e. without -O) only transcripts are processed, discarding any
 other non-transcript features. Default output is a simplified GFF3 with only
 the basic attributes.
 
Options:
 --ids discard records/transcripts if their IDs are not listed in 
 --nids discard records/transcripts if their IDs are listed in 
 -i discard transcripts having an intron larger than 
 -l discard transcripts shorter than bases
 -r only show transcripts overlapping coordinate range ..
 (on chromosome/contig , strand if provided)
 -R for -r option, discard all transcripts that are not fully 
 contained within the given range
 --jmatch only ou

CalledProcessError: Command 'b'${gffread} -h\n'' returned non-zero exit status 1.
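
The `CalledProcessError` above arises because `gffread -h` returned exit status 1, and a `%%bash` cell reports any non-zero exit status as an error, even though the help text itself was printed. One way to document a program's options without the cell failing is to record the exit status instead of propagating it. This is a sketch under that assumption; `document_help` is a hypothetical name.

```shell
# Hypothetical helper: print a program's -h output and report its exit
# status, without letting a non-zero status fail the cell.
document_help() {
  local program="$1"
  local status=0
  # Merge stderr into stdout so the help text is shown either way.
  "${program}" -h 2>&1 || status=$?
  echo "(exit status: ${status})"
  return 0
}
```

In this notebook it would be called as `document_help "${gffread}"`.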