# Getting genome assembly data using NCBI Datasets command line tools

The objective of this notebook is to demonstrate how to use the NCBI Datasets command line tools to explore and download genome assembly sequences and metadata.

## Getting started 
First, we'll download the Datasets command line tools and grant them execute permissions.
Datasets has two command line tools:
- The **datasets** tool is used to query and download sequence, annotation and metadata for all domains of life.
- The **dataformat** tool is used to convert metadata downloaded from NCBI Datasets from JSON Lines format to other formats.

In [1]:
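%%bash
printf "Downloading CLI tools...\n"
for app in datasets dataformat
do
    # Fetch the Linux (amd64) binary for this tool and make it executable
    curl --silent --remote-name "https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/${app}"
    chmod +x "${app}"
    # The command substitutions are left unquoted on purpose: du prints "size name",
    # which fills the first two %s slots, and the version fills the third
    printf "[size: %s] %s v%s\n" $(du --human-readable "${app}") $("./${app}" version)
done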

Downloading CLI tools...
[size: 11M] datasets v11.7.0
[size: 13M] dataformat v11.7.0


We'll also download the command line tool [jq](https://stedolan.github.io/jq/) to parse the datasets JSON Lines data reports into a readable format.

In [1]:
%%bash
curl --silent --location --output jq 'https://github.com/stedolan/jq/releases/download/jq-1.6/jq-linux64'
chmod +x jq
printf "Downloaded %s" $(./jq --version)

Downloaded jq-1.6

## Getting help
To get help with the tools or any of their subcommands, specify --help after the command:

In [1]:
!./datasets --help

datasets is a command-line tool that is used to query and download biological sequence data
across all domains of life from NCBI databases.

Refer to NCBI's [command line start](https://www.ncbi.nlm.nih.gov/datasets/docs/command-line-start) documentation for information about getting started with the command-line tools.

Usage
  datasets [command]

Data Retrieval Commands
  summary              print a summary of a gene or genome dataset
  download             download a gene, genome or coronavirus dataset as a zip file
  rehydrate            rehydrate a downloaded, dehydrated dataset

Miscellaneous Commands
  completion           generate autocompletion scripts
  version              print the version of this client and exit
  help                 Help about any command

Flags
  -h, --help   help for datasets

Use datasets help <command> for detailed help about a command.


## Getting genome metadata

To begin, we'll use the datasets summary genome command to explore the available RefSeq genomes for a group of organisms.

Genome summaries can be requested in four ways:

- accession: an NCBI Assembly accession
- organism: an organism or taxonomic group name
- taxid: an NCBI Taxonomy identifier, at any level
- BioProject: an NCBI BioProject accession

In this example, we'll view metadata for all Crustacea genome assemblies using the taxon name. Additionally, we'll limit our search to genomes annotated by NCBI's RefSeq group using the --refseq flag. To make the JSON output easy to read, we'll pipe it through the command line parser jq.

In [1]:
!./datasets summary genome taxon Crustacea --refseq | ./jq .

{
  "assemblies": [
    {
      "assembly": {
        "annotation_metadata": {
          "file": [
            {
              "estimated_size": "8160265",
              "type": "GENOME_GFF"
            },
            {
              "estimated_size": "60912986",
              "type": "GENOME_GBFF"
            },
            {
              "estimated_size": "15723551",
              "type": "RNA_FASTA"
            },
            {
              "estimated_size": "557986

If you just want the count of available RefSeq (GCF) genomes that fall under a particular taxonomic name, use the --refseq flag and set --limit to NONE:

In [1]:
!./datasets summary genome taxon crustacea --refseq --limit NONE

{"total_count":6}
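
If the count is needed later in a script, it can be pulled out of the one-line JSON response. The sketch below is a minimal example that uses sed, so it does not depend on jq. The helper name get_total_count is hypothetical (it is not part of the Datasets tools), and the sample string is copied from the output above rather than fetched live.

```shell
# Hypothetical helper: read JSON on stdin and print the integer
# value of the "total_count" field.
get_total_count() {
    sed -n 's/.*"total_count":[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Sample response copied from the cell output above. In the notebook,
# the real command could be piped in instead:
#   ./datasets summary genome taxon crustacea --refseq --limit NONE | get_total_count
sample='{"total_count":6}'
count=$(printf '%s\n' "$sample" | get_total_count)
printf 'RefSeq genomes found: %s\n' "$count"
```

Keeping the parsing in a small function makes it easy to swap in jq later if more fields are needed.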


## Downloading genome assembly sequence and metadata 
In this section, we'll show you how to download a genome data package for one of the crustacean genomes using the datasets download genome command. Genome data packages can be retrieved in four ways:

- accession: an NCBI Assembly accession
- organism: an organism or taxonomic group name
- taxid: an NCBI Taxonomy identifier, at any level
- BioProject: an NCBI BioProject accession

The default genome data package includes the following data (when available):

- genomic sequence (genomic.fna)
- transcript sequences (rna.fna)
- protein sequences (protein.faa)
- annotation in gff3 format (genomic.gff)
- a data report containing genome assembly and annotation metadata (assembly_data_report.jsonl)
- a sequence report listing the nucleotide sequences that comprise the genome assembly (sequence_report.jsonl)

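In this example, we'll download the Datasets genome package for <em>Penaeus vannamei</em> (Pacific white shrimp). Because we query by taxon name, the package contains every available assembly for the species, including the RefSeq representative genome. For the purposes of this demonstration, we will redirect all messages from the datasets command to datasets.log.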

In [1]:
!./datasets download genome taxon "penaeus vannamei" --filename pacific_white_shrimp.zip >datasets.log 2>&1
!printf "Downloaded:\n%s" "$(du --human-readable pacific_white_shrimp.zip)"

Downloaded:
901M	pacific_white_shrimp.zip
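
Because the command's messages go to datasets.log, a failed download is easy to miss. As a minimal sketch, the hypothetical helper check_package below only confirms that a package file exists and is not empty before later steps use it. It does not validate the zip contents.

```shell
# Hypothetical helper: succeed only if the given package file exists
# and has a size greater than zero.
check_package() {
    pkg=$1
    if [ ! -s "$pkg" ]; then
        printf 'Package missing or empty: %s (see datasets.log)\n' "$pkg" >&2
        return 1
    fi
    printf 'Package looks usable: %s\n' "$pkg"
}

# Usage in the notebook (assumes the download cell above has already run):
#   check_package pacific_white_shrimp.zip || exit 1
```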

## Converting the Datasets assembly data report to tabular format
The Datasets genome assembly data report can be converted to tabular format using the dataformat tool. In this step, we'll use the --help flag to view the data fields available for conversion.

In [1]:
!./dataformat tsv genome --help


Convert Genome Assembly Data Report into TSV format.

Refer to NCBI's [command line start](https://www.ncbi.nlm.nih.gov/datasets/docs/command-line-start) documentation for information about getting started with the command-line tools.

Usage
  dataformat tsv genome [flags]

Examples
  dataformat tsv genome --inputfile human/ncbi_dataset/data/assembly_data_report.jsonl
  dataformat tsv genome --package human.zip

Flags
      --fields strings     comma-separated list of fields
                               - annotinfo-featcount-gene-non-coding
                               - annotinfo-featcount-gene-other
                               - annotinfo-featcount-gene-protein-coding
                               - annotinfo-featcount-gene-pseudogene
                               - annotinfo-featcount-gene-total
                               - annotinfo-name
                               - annotinfo-release-date
                               - annotinfo-report-url
                      

Let's look at the catalog inside the package, using jq to convert the JSON into easy-to-read CSV rows of file path and file type.

In [1]:
!./dataformat catalog --package pacific_white_shrimp.zip 2>/dev/null | ./jq -r '.assemblies[] | .files[] | [.filePath, .fileType] | @csv'

"GCA_003730335.1/GCA_003730335.1_ASM373033v1_genomic.fna","GENOMIC_NUCLEOTIDE_FASTA"
"GCA_003730335.1/sequence_report.jsonl","SEQUENCE_REPORT"
"GCA_003789085.1/GCA_003789085.1_ASM378908v1_genomic.fna","GENOMIC_NUCLEOTIDE_FASTA"
"GCA_003789085.1/genomic.gff","GFF3"
"GCA_003789085.1/protein.faa","PROTEIN_FASTA"
"GCA_003789085.1/sequence_report.jsonl","SEQUENCE_REPORT"
"GCF_003789085.1/GCF_003789085.1_ASM378908v1_genomic.fna","GENOMIC_NUCLEOTIDE_FASTA"
"GCF_003789085.1/genomic.gff","GFF3"
"GCF_003789085.1/protein.faa","PROTEIN_FASTA"
"GCF_003789085.1/rna.fna","RNA_NUCLEOTIDE_FASTA"
"GCF_003789085.1/sequence_report.jsonl","SEQUENCE_REPORT"
"assembly_data_report.jsonl","DATA_REPORT"

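The catalog rows can also be summarized per assembly. The following is a minimal sketch, assuming the CSV layout shown above (quoted file path, then quoted file type). It counts files per accession with awk, and the rows are copied from the output above rather than read from the package. The variable names are illustrative.

```shell
# Catalog rows copied from the cell output above.
catalog_csv=$(cat <<'EOF'
"GCA_003730335.1/GCA_003730335.1_ASM373033v1_genomic.fna","GENOMIC_NUCLEOTIDE_FASTA"
"GCA_003730335.1/sequence_report.jsonl","SEQUENCE_REPORT"
"GCA_003789085.1/GCA_003789085.1_ASM378908v1_genomic.fna","GENOMIC_NUCLEOTIDE_FASTA"
"GCA_003789085.1/genomic.gff","GFF3"
"GCA_003789085.1/protein.faa","PROTEIN_FASTA"
"GCA_003789085.1/sequence_report.jsonl","SEQUENCE_REPORT"
"GCF_003789085.1/GCF_003789085.1_ASM378908v1_genomic.fna","GENOMIC_NUCLEOTIDE_FASTA"
"GCF_003789085.1/genomic.gff","GFF3"
"GCF_003789085.1/protein.faa","PROTEIN_FASTA"
"GCF_003789085.1/rna.fna","RNA_NUCLEOTIDE_FASTA"
"GCF_003789085.1/sequence_report.jsonl","SEQUENCE_REPORT"
"assembly_data_report.jsonl","DATA_REPORT"
EOF
)

# Take the directory part of each path as the accession, strip the leading
# quote, and count rows per accession. Top-level files such as
# assembly_data_report.jsonl have no accession prefix and are skipped.
counts=$(printf '%s\n' "$catalog_csv" |
    awk -F'","' '
        {
            split($1, parts, "/")
            acc = parts[1]
            gsub(/"/, "", acc)
            if (acc ~ /^GC[AF]_/) n[acc]++
        }
        END { for (a in n) print a, n[a] }
    ' | sort)
printf '%s\n' "$counts"
```

In the notebook, the same awk program could be fed directly from the dataformat catalog and jq pipeline above instead of the copied rows.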

Now we'll use the dataformat tool to convert a selected set of data fields, passed with --fields, into TSV format.

In [1]:
!./dataformat tsv genome --package pacific_white_shrimp.zip --fields assminfo-name,assminfo-refseq-assm-accession,assminfo-genbank-assm-accession,assminfo-refseq-category,assmstats-number-of-contigs,assmstats-number-of-scaffolds

Assembly Name	Assembly RefSeq Accession	Assembly GenBank Accession	Assembly Refseq Dategory	Assembly Stats Number of Contigs	Assembly Stats Number of Scaffolds
ASM373033v1	na	GCA_003730335.1	na	19584	19584
ASM378908v1	GCF_003789085.1	GCA_003789085.1	representative genome	33019	4682
ASM378908v1	GCF_003789085.1	GCA_003789085.1	representative genome	33019	4682
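
The table above lists ASM378908v1 twice. As a minimal sketch, the awk filter below keeps the header, keeps only rows that have a RefSeq accession, and drops exact duplicate rows. It runs on a three-column excerpt of the output above, and the excerpt and variable names are illustrative.

```shell
# Three-column excerpt of the TSV output above (tab-separated).
tsv=$(printf '%s\t%s\t%s\n' \
    'Assembly Name' 'Assembly RefSeq Accession' 'Assembly GenBank Accession' \
    'ASM373033v1' 'na' 'GCA_003730335.1' \
    'ASM378908v1' 'GCF_003789085.1' 'GCA_003789085.1' \
    'ASM378908v1' 'GCF_003789085.1' 'GCA_003789085.1')

# Keep the header line (NR == 1); for data rows, require a RefSeq accession
# in column 2 and print each distinct row only once.
refseq_rows=$(printf '%s\n' "$tsv" |
    awk -F'\t' 'NR == 1 || ($2 != "na" && !seen[$0]++)')
printf '%s\n' "$refseq_rows"
```

The same filter could be appended to the dataformat command above with a pipe, as long as the RefSeq accession stays in the second column.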


Next, we can list the first 30 FASTA deflines for the ASM378908v1 RefSeq assembly:

In [1]:
!unzip -q -c pacific_white_shrimp.zip ncbi_dataset/data/GCF_003789085.1/GCF_003789085.1_ASM378908v1_genomic.fna | grep --max-count=30 '^>'

>NW_020868286.1 Penaeus vannamei breed Kehai No.1 unplaced genomic scaffold, ASM378908v1 LVANscaffold_1, whole genome shotgun sequence
>NW_020868287.1 Penaeus vannamei breed Kehai No.1 unplaced genomic scaffold, ASM378908v1 LVANscaffold_10, whole genome shotgun sequence
>NW_020868288.1 Penaeus vannamei breed Kehai No.1 unplaced genomic scaffold, ASM378908v1 LVANscaffold_100, whole genome shotgun sequence
>NW_020868289.1 Penaeus vannamei breed Kehai No.1 unplaced genomic scaffold, ASM378908v1 LVANscaffold_1000, whole genome shotgun sequence
>NW_020868290.1 Penaeus vannamei breed Kehai No.1 unplaced genomic scaffold, ASM378908v1 LVANscaffold_1001, whole genome shotgun sequence
>NW_020868291.1 Penaeus vannamei breed Kehai No.1 unplaced genomic scaffold, ASM378908v1 LVANscaffold_1002, whole genome shotgun sequence
>NW_020868292.1 Penaeus vannamei breed Kehai No.1 unplaced genomic scaffold, ASM378908v1 LVANscaffold_1003, whole genome shotgun sequence
>NW_020868293.1 Penaeus vannamei breed K