---
name: "salmon-rna-quantification"
description: "Ultra-fast RNA-seq transcript/gene quantification via quasi-mapping (no BAM). Builds a k-mer index from transcriptome FASTA, quantifies in minutes. Outputs TPM/count tables (quant.sf) with optional GC- and sequence-bias correction. Integrates with tximeta/tximport for DESeq2/edgeR. Use STAR when a genome-aligned BAM is needed."
license: "GPL-3.0"
---

# Salmon — Fast RNA-seq Quantification

## Overview

Salmon quantifies transcript abundance from RNA-seq reads using quasi-mapping — matching reads to a k-mer index of the transcriptome without full genome alignment. This makes Salmon 20–50× faster than alignment-based tools while producing accurate TPM and estimated count values. Salmon corrects for sequence-specific bias (`--seqBias`), GC-content bias (`--gcBias`), and fragment length distribution automatically. Output `quant.sf` files integrate directly with `tximeta` (R) or `pydeseq2` (Python) for differential expression analysis. For improved accuracy, decoy-aware indexing uses the full genome to identify spurious quasi-mappings.

## When to Use

- Performing fast RNA-seq quantification when you do not need a genome-aligned BAM file
- Running large-scale RNA-seq studies where alignment speed is a bottleneck (Salmon is 20-50× faster than STAR + featureCounts)
- Computing TPM and estimated counts from bulk RNA-seq for differential expression with DESeq2 or edgeR
- Correcting for GC bias, fragment length, and sequence context bias with `--gcBias --seqBias`
- Estimating transcript-level uncertainty via bootstrap resampling with `--numBootstraps`
- Use **STAR** instead when you need a genome-aligned BAM for downstream tools (variant calling, deeptools, IGV visualization)
- Use **Kallisto** as an alternative for similar speed; Salmon provides better bias correction and decoy-aware indexing

## Prerequisites

- **Software**: Salmon ≥ 1.10 (conda or pre-compiled binary)
- **Reference**: transcriptome FASTA (cDNA sequences, e.g., GENCODE or Ensembl) + genome FASTA for decoy-aware indexing
- **Python packages**: `pandas` for parsing output; `pydeseq2` for differential expression

> **Check before installing**: The tool may already be available in the current environment (e.g., inside a `pixi` / `conda` env). Run `command -v salmon` first and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool via `pixi run salmon` rather than bare `salmon`.

```bash
# Install with conda (recommended)
conda install -c bioconda salmon

# Verify
salmon --version
# salmon 1.10.3

# Or download pre-compiled binary
wget https://github.com/COMBINE-lab/salmon/releases/download/v1.10.0/salmon-1.10.0_linux_x86_64.tar.gz
tar xzvf salmon-1.10.0_linux_x86_64.tar.gz
export PATH="$PWD/salmon-latest_linux_x86_64/bin:$PATH"
```

## Quick Start

```bash
# 1. Build transcriptome index (~5 min)
salmon index -t transcriptome.fa -i salmon_index/ -p 8

# 2. Quantify paired-end reads (~2-5 min per sample)
salmon quant \
    -i salmon_index/ \
    -l A \
    -1 sample_R1.fastq.gz \
    -2 sample_R2.fastq.gz \
    -p 8 \
    --gcBias --validateMappings \
    -o results/sample1/

# Output: results/sample1/quant.sf
head results/sample1/quant.sf
```

## Workflow

### Step 1: Download Transcriptome Reference

Fetch a transcript FASTA from GENCODE or Ensembl (cDNA sequences only — not genome).

```bash
# Human transcriptome from GENCODE (recommended)
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/gencode.v47.transcripts.fa.gz
gunzip gencode.v47.transcripts.fa.gz

# Count transcripts
grep -c "^>" gencode.v47.transcripts.fa
# ~252,000 transcripts

echo "Reference ready."
ls -lh gencode.v47.transcripts.fa
```

### Step 2: Build Salmon Index

Index the transcriptome for quasi-mapping. Add genome decoys for improved accuracy.

```bash
# Standard index (fast, sufficient for most analyses)
salmon index \
    -t gencode.v47.transcripts.fa \
    -i salmon_index/ \
    -p 8
echo "Standard index complete."

# Decoy-aware index (recommended for accuracy — uses full genome as decoy)
# Step 1: create decoy list from genome chromosome names
grep "^>" GRCh38.primary_assembly.genome.fa | cut -d " " -f 1 | sed 's/>//' > decoys.txt

# Step 2: concatenate transcriptome + genome
cat gencode.v47.transcripts.fa GRCh38.primary_assembly.genome.fa > gentrome.fa

# Step 3: build decoy-aware index
salmon index \
    -t gentrome.fa \
    -d decoys.txt \
    -i salmon_decoy_index/ \
    -p 8
echo "Decoy-aware index complete."
```

### Step 3: Quantify Single-End Reads

Run Salmon on single-end FASTQ files.

```bash
# Single-end quantification
salmon quant \
    -i salmon_index/ \
    -l A \
    -r sample1.fastq.gz \
    -p 8 \
    --seqBias \
    --validateMappings \
    -o results/sample1/

echo "Mapping rate: $(grep 'Mapping rate' results/sample1/logs/salmon_quant.log | tail -1)"
echo "Output: results/sample1/quant.sf"
```

### Step 4: Quantify Paired-End Reads with Bias Correction

Run Salmon on paired-end FASTQ files with recommended bias correction flags.

```bash
# Paired-end with GC bias + sequence bias correction
salmon quant \
    -i salmon_decoy_index/ \
    -l A \
    -1 sample1_R1.fastq.gz \
    -2 sample1_R2.fastq.gz \
    -p 8 \
    --gcBias \
    --seqBias \
    --validateMappings \
    --numBootstraps 100 \
    -o results/sample1/

# quant.sf columns: Name, Length, EffectiveLength, TPM, NumReads
head results/sample1/quant.sf
```

### Step 5: Load and Summarize Quantification Output

Parse `quant.sf` to build a gene-level count matrix for differential expression.

```python
import pandas as pd
from pathlib import Path

# Load single-sample output
quant = pd.read_csv("results/sample1/quant.sf", sep="\t")
print(f"Transcripts quantified: {len(quant)}")
print(f"Total estimated reads: {quant['NumReads'].sum():.0f}")
print(f"Transcripts with TPM > 1: {(quant['TPM'] > 1).sum()}")
print(quant.sort_values("TPM", ascending=False).head())

# Build a multi-sample TPM matrix
samples = ["ctrl_1", "ctrl_2", "treat_1", "treat_2"]
tpm_matrix = pd.DataFrame({
    s: pd.read_csv(f"results/{s}/quant.sf", sep="\t").set_index("Name")["TPM"]
    for s in samples
})
print(f"\nTPM matrix: {tpm_matrix.shape}")
tpm_matrix.to_csv("tpm_matrix.tsv", sep="\t")
```

### Step 6: Aggregate to Gene Level and Run DESeq2

Summarize transcript-level estimates to gene level and perform differential expression.

```python
import pandas as pd
import re
from pathlib import Path
from pydeseq2.dds import DeseqDataSet
from pydeseq2.default_inference import DefaultInference
from pydeseq2.ds import DeseqStats

# Aggregate transcript counts to gene level using Ensembl gene IDs
# quant.sf Name format: "ENST00000456328.2|ENSG00000223972.6|..."
def extract_gene_id(transcript_id):
    parts = transcript_id.split("|")
    return parts[1].split(".")[0] if len(parts) > 1 else transcript_id

samples = ["ctrl_1", "ctrl_2", "treat_1", "treat_2"]
count_frames = []
for s in samples:
    df = pd.read_csv(f"results/{s}/quant.sf", sep="\t")
    df["gene_id"] = df["Name"].apply(extract_gene_id)
    gene_counts = df.groupby("gene_id")["NumReads"].sum().round().astype(int)
    count_frames.append(gene_counts.rename(s))

count_matrix = pd.DataFrame(count_frames).fillna(0).astype(int)
metadata = pd.DataFrame({
    "condition": ["control", "control", "treated", "treated"]
}, index=samples)

# Run DESeq2
dds = DeseqDataSet(counts=count_matrix, metadata=metadata,
                   design_factors="condition",
                   inference=DefaultInference(n_cpus=4))
dds.deseq2()

stat_res = DeseqStats(dds, contrast=["condition", "treated", "control"],
                      inference=DefaultInference())
stat_res.summary()
results = stat_res.results_df
print(f"DE genes (padj < 0.05): {(results['padj'] < 0.05).sum()}")
print(results[results['padj'] < 0.05].sort_values('log2FoldChange').head())
```

## Key Parameters

| Parameter | Default | Range/Options | Effect |
|-----------|---------|---------------|--------|
| `-l / --libType` | required | `A` (auto), `SF`, `SR`, `IU`, `IS`, `MS`, `MR` | Library strandedness; `A` auto-detects from first reads |
| `-p / --threads` | `1` | 1–64 | CPU threads; 8–16 is typical |
| `--gcBias` | off | flag | Correct for GC-content bias in fragment selection; recommended for most samples |
| `--seqBias` | off | flag | Correct for sequence-specific bias at read starts; recommended |
| `--validateMappings` | off | flag | Use selective alignment for improved accuracy; slight speed cost |
| `--numBootstraps` | `0` | 0–200 | Bootstrap replicates for uncertainty estimation; enables Sleuth/Swish |
| `--dumpCsvCounts` | off | flag | Dump raw counts to CSV alongside quant.sf |
| `-d / --decoys` | — | file | Decoy sequence list for decoy-aware indexing |
| `--rangeFactorizationBins` | `4` | 1–8 | Bins for range-factorization model; increases accuracy at small speed cost |
| `--skipQuant` | off | flag | Build index and exit; useful for cluster pipelines |

## Common Recipes

### Recipe 1: Batch Quantify All Samples

```bash
#!/bin/bash
# Quantify all paired-end samples with recommended settings
INDEX="salmon_decoy_index"
DATA="data"
OUT="results"
THREADS=12

SAMPLES=(ctrl_1 ctrl_2 treat_1 treat_2)
mkdir -p "$OUT"

for sample in "${SAMPLES[@]}"; do
    echo "Quantifying: $sample"
    salmon quant \
        -i "$INDEX" \
        -l A \
        -1 "$DATA/${sample}_R1.fastq.gz" \
        -2 "$DATA/${sample}_R2.fastq.gz" \
        -p "$THREADS" \
        --gcBias --seqBias --validateMappings \
        -o "$OUT/$sample/"
    echo "Done: $sample — mapping $(grep 'Mapping rate' $OUT/$sample/logs/salmon_quant.log | tail -1)"
done
echo "All samples quantified."
```

### Recipe 2: Add Salmon to a Snakemake Pipeline

```python
# Snakefile — Salmon quantification rule
configfile: "config.yaml"

SAMPLES = config["samples"]

rule all:
    input:
        expand("results/{sample}/quant.sf", sample=SAMPLES)

rule salmon_index:
    input:
        transcriptome = config["transcriptome_fa"]
    output:
        directory("salmon_index")
    threads: 8
    shell:
        "salmon index -t {input.transcriptome} -i {output} -p {threads}"

rule salmon_quant:
    input:
        index = "salmon_index",
        r1 = "data/{sample}_R1.fastq.gz",
        r2 = "data/{sample}_R2.fastq.gz"
    output:
        quant = "results/{sample}/quant.sf"
    params:
        outdir = "results/{sample}"
    threads: 8
    shell:
        """
        salmon quant -i {input.index} -l A \
            -1 {input.r1} -2 {input.r2} \
            -p {threads} --gcBias --seqBias --validateMappings \
            -o {params.outdir}
        """
```

### Recipe 3: Filter Low-Expression Transcripts and Compute Fold Changes

```python
import pandas as pd
import numpy as np

samples = {
    "ctrl_1": "results/ctrl_1/quant.sf",
    "ctrl_2": "results/ctrl_2/quant.sf",
    "treat_1": "results/treat_1/quant.sf",
    "treat_2": "results/treat_2/quant.sf",
}

# Build TPM matrix
tpm = pd.DataFrame({
    name: pd.read_csv(path, sep="\t").set_index("Name")["TPM"]
    for name, path in samples.items()
})

# Filter: keep transcripts with TPM > 1 in at least 2 samples
expressed = (tpm > 1).sum(axis=1) >= 2
tpm_filt = tpm[expressed]
print(f"Expressed transcripts: {expressed.sum()} / {len(tpm)}")

# Simple log2 fold change (treat vs ctrl)
ctrl_mean = tpm_filt[["ctrl_1", "ctrl_2"]].mean(axis=1)
treat_mean = tpm_filt[["treat_1", "treat_2"]].mean(axis=1)
lfc = np.log2(treat_mean + 0.5) - np.log2(ctrl_mean + 0.5)
top_up = lfc.sort_values(ascending=False).head(10)
print("Top upregulated transcripts:")
print(top_up)
```

## Expected Outputs

| Output | Format | Description |
|--------|--------|-------------|
| `quant.sf` | TSV | Transcript-level quantification: Name, Length, EffectiveLength, TPM, NumReads |
| `quant.genes.sf` | TSV | Gene-level quantification (when `--geneMap` provided) |
| `logs/salmon_quant.log` | Text | Detailed log with mapping rate, processed reads, elapsed time |
| `aux_info/meta_info.json` | JSON | Run metadata: library type detected, mapping rate, num processed reads |
| `aux_info/fld.gz` | Binary | Fragment length distribution (paired-end) |
| `bootstrap/` | Binary | Bootstrap count distributions (when `--numBootstraps > 0`) |

## Troubleshooting

| Problem | Cause | Solution |
|---------|-------|----------|
| Mapping rate < 50% | Wrong transcriptome species or assembly mismatch | Verify transcriptome FASTA matches sample organism; use genome-decoy index |
| Library type detection wrong | Ambiguous or mixed-strand library | Specify explicitly: `-l SF` (stranded fwd) or `-l SR` (stranded rev) |
| `quant.sf` all zeros | Index built from wrong reference | Rebuild index with correct transcriptome FASTA |
| Out of memory during indexing | Transcriptome + genome concatenation too large | Use standard index without genome decoy; or increase available RAM |
| Many low-mapping transcripts | No GC/seq bias correction | Add `--gcBias --seqBias --validateMappings`; helps with low-complexity regions |
| `meta_info.json` mapping rate < 30% | Reads are from a different molecule (e.g., rRNA contamination) | Check FastQC overrepresented sequences; verify library preparation |
| Gene-level output missing | `--geneMap` or `-g` not provided | Re-run with `-g gencode.v47.gtf` to get `quant.genes.sf` |
| Bootstrap takes too long | High `--numBootstraps` on slow disk | Reduce to `--numBootstraps 30` for most DE tests; use SSD |

## References

- [Salmon documentation](https://salmon.readthedocs.io/) — official usage guide and FAQ
- [Salmon GitHub: COMBINE-lab/salmon](https://github.com/COMBINE-lab/salmon) — source, releases, and issues
- Patro R et al. (2017) "Salmon provides fast and bias-aware quantification of transcript expression" — *Nature Methods* 14:417-419. [DOI:10.1038/nmeth.4197](https://doi.org/10.1038/nmeth.4197)
- [tximeta R package](https://bioconductor.org/packages/tximeta/) — recommended for importing Salmon output into R/Bioconductor with full provenance