--- name: fastq-analysis-pipeline title: FASTQ analysis and RNA-seq alignment with omicverse description: Guide through omicverse's alignment module for SRA downloading, FASTQ quality control, STAR alignment, gene quantification, and single-cell kallisto/bustools pipelines covering both bulk and single-cell RNA-seq workflows. --- ## Overview OmicVerse provides a complete FASTQ-to-count-matrix pipeline via the `ov.alignment` module. This skill covers: - **SRA data acquisition**: `prefetch` and `fqdump` (fasterq-dump wrapper) - **Quality control**: `fastp` for adapter trimming and QC reports - **RNA-seq alignment**: `STAR` aligner with auto-index building - **Gene quantification**: `featureCount` (subread featureCounts wrapper) - **Single-cell path**: `ref` and `count` via kb-python (kallisto/bustools) - **Parallel SRA download**: `parallel_fastq_dump` All functions share a common CLI infrastructure (`_cli_utils.py`) that handles tool resolution, auto-installation via conda/mamba, parallel execution, and streaming output. ## Instructions 1. **Environment setup** - Bioinformatics tools are resolved automatically from PATH or the active conda environment. - If `auto_install=True` (default), missing tools are installed via mamba/conda on demand. - Supported tools: `prefetch`, `vdb-validate`, `fasterq-dump`, `fastp`, `STAR`, `samtools`, `featureCounts`, `pigz`, `gzip`. - For the single-cell path, ensure `kb-python` is installed: `pip install kb-python`. 2. **SRA data download** (`ov.alignment.prefetch` + `ov.alignment.fqdump`) - Use `prefetch` first for reliable downloads with integrity validation (`vdb-validate`). - Then convert to FASTQ with `fqdump`. It auto-detects single-end vs paired-end. - `fqdump` can also work directly from SRR accessions without prefetch. - Both support retry with exponential backoff for network errors. ```python import omicverse as ov # Step 1: Prefetch SRA files (optional but recommended) pre = ov.alignment.prefetch(['SRR1234567', 'SRR1234568'], output_dir='prefetch', jobs=4) # Step 2: Convert to FASTQ fq = ov.alignment.fqdump(['SRR1234567', 'SRR1234568'], output_dir='fastq', sra_dir='prefetch', gzip=True, threads=8, jobs=4) ``` 3. **FASTQ quality control** (`ov.alignment.fastp`) - Runs fastp for adapter trimming, quality filtering, and QC reporting. - Supports single-end and paired-end reads. - Produces per-sample JSON and HTML QC reports. - Sample format: tuple of `(sample_name, fq1_path, fq2_path_or_None)`. ```python samples = [ ('S1', 'fastq/SRR1234567/SRR1234567_1.fastq.gz', 'fastq/SRR1234567/SRR1234567_2.fastq.gz'), ('S2', 'fastq/SRR1234568/SRR1234568_1.fastq.gz', 'fastq/SRR1234568/SRR1234568_2.fastq.gz'), ] clean = ov.alignment.fastp(samples, output_dir='fastp', threads=8, jobs=2) ``` 4. **STAR alignment** (`ov.alignment.STAR`) - Aligns FASTQ reads using the STAR aligner. - **Auto-index building**: set `auto_index=True` (default) with `genome_fasta_files` and `gtf` to build index automatically if missing. - Produces coordinate-sorted BAM files. - Handles gzip-compressed FASTQs automatically (uses pigz/gzip/zcat). - Use `strict=False` (default) for graceful error handling per sample. ```python # Prepare samples from fastp output star_samples = [ ('S1', 'fastp/S1/S1_clean_1.fastq.gz', 'fastp/S1/S1_clean_2.fastq.gz'), ('S2', 'fastp/S2/S2_clean_1.fastq.gz', 'fastp/S2/S2_clean_2.fastq.gz'), ] bams = ov.alignment.STAR( star_samples, genome_dir='star_index', output_dir='star_out', gtf='genes.gtf', genome_fasta_files=['genome.fa'], threads=8, memory='50G', ) ``` 5. **Gene quantification** (`ov.alignment.featureCount`) - Counts aligned reads per gene using featureCounts (subread). - Auto-detects paired-end from BAM headers (via pysam or samtools). - `auto_fix=True` (default) retries with corrected paired-end flag on error. - `gene_mapping=True` maps gene_id to gene_name from the GTF. - `merge_matrix=True` produces a combined count matrix across all samples. ```python bam_items = [ ('S1', 'star_out/S1/Aligned.sortedByCoord.out.bam'), ('S2', 'star_out/S2/Aligned.sortedByCoord.out.bam'), ] counts = ov.alignment.featureCount( bam_items, gtf='genes.gtf', output_dir='counts', gene_mapping=True, merge_matrix=True, threads=8, ) # counts is a pandas DataFrame (gene_id x samples) ``` 6. **Single-cell path** (`ov.alignment.ref` + `ov.alignment.count`) - Uses kb-python (kallisto + bustools) for single-cell RNA-seq quantification. - `ref()` builds a kallisto index and transcript-to-gene mapping. - `count()` quantifies single-cell data with barcode/UMI handling. - Supports technologies: 10XV2, 10XV3, BULK, and custom. - Output formats: h5ad, loom, cellranger MTX. ```python # Build reference index ref_result = ov.alignment.ref( index_path='kb_ref/index.idx', t2g_path='kb_ref/t2g.txt', fasta_paths=['genome.fa'], gtf_paths=['genes.gtf'], threads=8, ) # Quantify 10x v3 data count_result = ov.alignment.count( index_path='kb_ref/index.idx', t2g_path='kb_ref/t2g.txt', technology='10XV3', fastq_paths=['sample_R1.fastq.gz', 'sample_R2.fastq.gz'], output_path='kb_out', h5ad=True, filter_barcodes=True, threads=8, ) ``` 7. **Wiring fastp output into STAR input** - fastp output is a list of dicts with keys: `sample`, `clean1`, `clean2`, `json`, `html`. - Convert to STAR sample tuples: ```python star_samples = [ (r['sample'], r['clean1'], r['clean2'] if r['clean2'] else None) for r in (clean if isinstance(clean, list) else [clean]) ] ``` 8. **Wiring STAR output into featureCount input** - STAR output is a list of dicts with keys: `sample`, `bam` (or `error`). - Convert to featureCount items: ```python bam_items = [ (r['sample'], r['bam']) for r in (bams if isinstance(bams, list) else [bams]) if 'bam' in r ] ``` 9. **Skipping completed steps** - All functions check for existing outputs and skip if `overwrite=False` (default). - Set `overwrite=True` to force re-execution. 10. **Troubleshooting** - If a tool is not found, check `auto_install=True` and that conda/mamba is accessible. - For STAR index errors, ensure `genome_fasta_files` points to uncompressed or gzip FASTA files. - For featureCounts paired-end detection errors, `auto_fix=True` handles most cases automatically. - GTF files can be gzip-compressed; they are auto-decompressed as needed. ## Critical API Reference ### Sample Format Convention All alignment functions use a consistent sample tuple format: - **FASTQ samples**: `(sample_name, fq1_path, fq2_path_or_None)` - **BAM items**: `(sample_name, bam_path)` or `(sample_name, bam_path, is_paired_bool)` - Single samples can be passed as a single tuple; multiple as a list of tuples. - When a single tuple is passed, the return value is a single dict; for a list, a list of dicts. ### Auto-installation ```python # All functions support these parameters: auto_install=True # Auto-install missing tools via conda/mamba overwrite=False # Skip if outputs already exist threads=8 # Per-tool thread count jobs=None # Concurrent job count (auto-detected from CPU count) ``` ## Examples - **Bulk RNA-seq from SRA**: `prefetch` -> `fqdump` -> `fastp` -> `STAR` -> `featureCount` -> pandas DataFrame - **Single-cell 10x v3**: `ref` -> `count` with `technology='10XV3'` -> h5ad AnnData - **Local FASTQ files**: Skip download steps, start directly with `fastp` -> `STAR` -> `featureCount` ## References - See [reference.md](reference.md) for copy-paste-ready code templates.