---
name: bio-genome-assembly-assembly-qc
description: Assess genome assembly quality using QUAST for contiguity metrics and BUSCO for completeness. Essential for evaluating assembly success and comparing assemblers. Use when evaluating assembly completeness and quality.
tool_type: cli
primary_tool: QUAST
---

# Assembly QC

Evaluate genome assembly quality with contiguity metrics (QUAST) and gene completeness (BUSCO).

## Key Metrics

| Metric | Good Assembly |
|--------|---------------|
| N50 | High (relative to genome) |
| L50 | Low |
| Contigs | Few |
| Misassemblies | 0 (with reference) |
| BUSCO Complete | >95% |
| BUSCO Duplicated | <5% (unless polyploid) |

## QUAST

### Installation

```bash
conda install -c bioconda quast
```

### Basic Usage

```bash
quast.py assembly.fasta -o quast_output
```

### With Reference Genome

```bash
quast.py assembly.fasta -r reference.fasta -o quast_output
```

### Compare Multiple Assemblies

```bash
quast.py assembly1.fa assembly2.fa assembly3.fa -o comparison
```

### Key Options

| Option | Description |
|--------|-------------|
| `-o` | Output directory |
| `-r` | Reference genome |
| `-g` | Gene annotations (GFF) |
| `-t` | Threads |
| `-m` | Min contig length (default: 500) |
| `--large` | For large genomes (>100Mb) |
| `--fragmented` | For highly fragmented assemblies |
| `--scaffolds` | Input is scaffolds (includes N-gaps) |

### With Gene Annotations

```bash
quast.py assembly.fasta -r reference.fasta -g genes.gff -o quast_output
```

### For Large Genomes

```bash
quast.py --large assembly.fasta -o quast_output -t 16
```

### Output Files

```
quast_output/
├── report.txt        # Summary statistics
├── report.html       # Interactive report
├── report.tsv        # Tab-separated stats
├── icarus.html       # Contig viewer
└── aligned_stats/    # If reference provided
```

### Key Output Metrics

| Metric | Description |
|--------|-------------|
| Total length | Sum of contig lengths |
| # contigs | Number of contigs (>= min length) |
| Largest contig | Length of largest contig |
| N50 | 50% of assembly in contigs >= this length |
| N90 | 90% of assembly in contigs >= this length |
| L50 | Number of contigs comprising N50 |
| GC % | GC content |
| # misassemblies | With reference: structural errors |
| Genome fraction | With reference: % of reference covered |

## BUSCO

### Installation

```bash
conda install -c bioconda busco
```

### Basic Usage

```bash
busco -i assembly.fasta -m genome -l bacteria_odb10 -o busco_output
```

### Key Options

| Option | Description |
|--------|-------------|
| `-i` | Input assembly |
| `-m` | Mode: genome, proteins, transcriptome |
| `-l` | Lineage dataset |
| `-o` | Output name |
| `-c` | CPU threads |
| `--auto-lineage` | Auto-detect lineage |
| `--offline` | Use downloaded datasets only |
| `--list-datasets` | List available lineages |

### List Available Lineages

```bash
busco --list-datasets
```

### Common Lineages

| Lineage | Use For |
|---------|---------|
| bacteria_odb10 | Bacteria |
| archaea_odb10 | Archaea |
| eukaryota_odb10 | General eukaryote |
| fungi_odb10 | Fungi |
| metazoa_odb10 | Animals |
| vertebrata_odb10 | Vertebrates |
| mammalia_odb10 | Mammals |
| viridiplantae_odb10 | Plants |
| saccharomycetes_odb10 | Yeasts |

### Auto-Lineage Detection

```bash
busco -i assembly.fasta -m genome --auto-lineage -o busco_output
```

### Output Files

```
busco_output/
├── short_summary.txt           # Quick summary
├── full_table.tsv              # All BUSCO results
├── missing_busco_list.tsv      # Missing genes
└── busco_sequences/            # BUSCO gene sequences
```

### Interpret Results

```
C:98.5%[S:97.0%,D:1.5%],F:0.5%,M:1.0%,n:4085

C - Complete (total)
S - Single-copy
D - Duplicated
F - Fragmented
M - Missing
n - Total BUSCO groups
```

### Quality Thresholds

| Quality | Complete | Missing |
|---------|----------|---------|
| Excellent | >95% | <2% |
| Good | >90% | <5% |
| Acceptable | >80% | <10% |
| Poor | <80% | >10% |

## Complete QC Workflow

```bash
#!/bin/bash
set -euo pipefail

ASSEMBLY=$1
REFERENCE=${2:-}
LINEAGE=${3:-bacteria_odb10}
OUTDIR=${4:-assembly_qc}

mkdir -p $OUTDIR

echo "=== Assembly QC ==="

# QUAST
echo "Running QUAST..."
if [ -n "$REFERENCE" ]; then
    quast.py $ASSEMBLY -r $REFERENCE -o ${OUTDIR}/quast -t 8
else
    quast.py $ASSEMBLY -o ${OUTDIR}/quast -t 8
fi

# BUSCO
echo "Running BUSCO..."
busco -i $ASSEMBLY -m genome -l $LINEAGE -o busco_run -c 8
mv busco_run ${OUTDIR}/busco

# Summary
echo ""
echo "=== QUAST Summary ==="
cat ${OUTDIR}/quast/report.txt

echo ""
echo "=== BUSCO Summary ==="
cat ${OUTDIR}/busco/short_summary*.txt

echo ""
echo "Reports saved to $OUTDIR"
```

## Compare Assemblies

### QUAST Comparison

```bash
quast.py \
    spades_assembly.fa \
    flye_assembly.fa \
    canu_assembly.fa \
    -r reference.fa \
    -l "SPAdes,Flye,Canu" \
    -o assembly_comparison
```

### BUSCO Comparison

```bash
# Run BUSCO on each assembly
for asm in spades.fa flye.fa canu.fa; do
    name=$(basename $asm .fa)
    busco -i $asm -m genome -l bacteria_odb10 -o busco_${name}
done

# Generate comparison plot
generate_plot.py -wd . busco_spades busco_flye busco_canu
```

## Python: Parse QUAST Output

```python
import pandas as pd

def parse_quast(report_tsv):
    '''Parse QUAST report.tsv file.'''
    df = pd.read_csv(report_tsv, sep='\t', index_col=0)
    return df.T

stats = parse_quast('quast_output/report.tsv')
print(f"N50: {stats['N50'].values[0]}")
print(f"Total length: {stats['Total length'].values[0]}")
print(f"# contigs: {stats['# contigs'].values[0]}")
```

## Python: Parse BUSCO Output

```python
import re

def parse_busco_summary(summary_file):
    '''Parse BUSCO short summary.'''
    with open(summary_file) as f:
        text = f.read()

    pattern = r'C:(\d+\.\d+)%\[S:(\d+\.\d+)%,D:(\d+\.\d+)%\],F:(\d+\.\d+)%,M:(\d+\.\d+)%,n:(\d+)'
    match = re.search(pattern, text)

    if match:
        return {
            'complete': float(match.group(1)),
            'single': float(match.group(2)),
            'duplicated': float(match.group(3)),
            'fragmented': float(match.group(4)),
            'missing': float(match.group(5)),
            'total': int(match.group(6))
        }
    return None

result = parse_busco_summary('busco_output/short_summary.txt')
print(f"Complete: {result['complete']}%")
```

## MetaQUAST (Metagenomes)

```bash
metaquast.py metagenome_assembly.fa -o metaquast_output -t 16
```

## Troubleshooting

### Low N50
- Check coverage depth
- Consider longer reads
- Try different assembler

### Low BUSCO Completeness
- Check input read quality
- Verify correct lineage dataset
- May indicate real gene loss (compare to relatives)

### High Duplication in BUSCO
- Normal for polyploids
- May indicate contamination
- Check for collapsed haplotypes

## Related Skills

- short-read-assembly - SPAdes assembly
- long-read-assembly - Flye/Canu assembly
- assembly-polishing - Improve accuracy
- metagenomics - Metagenome analysis