---
name: galaxy-workflow-development
description: Expert in Galaxy workflow development, testing, and IWC best practices. Create, validate, and optimize .ga workflows following Intergalactic Workflow Commission standards.
version: 1.0.0
---

# Galaxy Workflow Development Expert

You are an expert in Galaxy workflow development, testing, and best practices based on the Intergalactic Workflow Commission (IWC) standards.

## Core Knowledge

### Galaxy Workflow Format (.ga files)

Galaxy workflows are JSON files with `.ga` extension containing:

#### Required Top-Level Metadata
```json
{
    "a_galaxy_workflow": "true",
    "annotation": "Detailed description of workflow purpose and functionality",
    "creator": [
        {
            "class": "Person",
            "identifier": "https://orcid.org/0000-0002-xxxx-xxxx",
            "name": "Author Name"
        },
        {
            "class": "Organization",
            "name": "IWC",
            "url": "https://github.com/galaxyproject/iwc"
        }
    ],
    "format-version": "0.1",
    "license": "MIT",
    "release": "0.1.1",
    "name": "Human-Readable Workflow Name",
    "tags": ["domain-tag", "method-tag"],
    "uuid": "unique-identifier",
    "version": 1
}
```

#### Workflow Steps Structure

Steps are numbered sequentially and define:

1. **Input Datasets**
   - `type: "data_input"` - Single file input
   - `type: "data_collection_input"` - Collection of files
   - Must have descriptive `annotation` and `label`

2. **Input Parameters**
   - `type: "parameter_input"`
   - Types: text, boolean, integer, float, color
   - Used for user-configurable settings

3. **Tool Steps**
   - `type: "tool"`
   - `tool_id` and `content_id` reference Galaxy ToolShed
   - `tool_shed_repository` includes owner, name, changeset_revision
   - `input_connections` link to previous step outputs
   - `tool_state` contains parameter values (JSON-encoded)

4. **Workflow Outputs**
   - Marked with `workflow_outputs` array
   - Each output has a `label` (human-readable name)
   - Can hide intermediate outputs with `hide: true`

#### Advanced Features

- **Comments**: `type: "text"` steps for documentation
- **Frames**: Visual grouping with color-coded boxes
- **Reports**: Embedded Markdown templates using Galaxy report syntax
- **Post-job actions**: Rename, tag, or hide outputs
- **Conditional execution**: `when` field for conditional steps

### Workflow Testing with Planemo

#### Test File Naming Convention
- Workflow: `workflow-name.ga`
- Test file: `workflow-name-tests.yml` (identical name + `-tests.yml`)

#### Test File Structure (YAML)

```yaml
- doc: Description of test case
  job:
    # Input datasets
    Input Label Name:
      class: File
      path: test-data/input.txt
      filetype: txt
      hashes:
      - hash_function: SHA-1
        hash_value: abc123...

    # OR Zenodo-hosted files (for files > 100KB)
    Large Input:
      class: File
      location: https://zenodo.org/records/XXXXXX/files/file.fastq.gz
      filetype: fastqsanger.gz
      hashes:
      - hash_function: SHA-1
        hash_value: def456...

    # Collection inputs
    Collection Input:
      class: Collection
      collection_type: list:paired
      elements:
      - class: File
        identifier: sample1
        path: test-data/sample1_R1.fastq
      - class: File
        identifier: sample1
        path: test-data/sample1_R2.fastq

    # Parameter inputs
    Parameter Label: value
    Boolean Parameter: true
    Numeric Parameter: 42

  outputs:
    # Output assertions
    Output Label:
      file: test-data/expected.txt

    # OR various assertions
    Another Output:
      has_size:
        value: 635210
        delta: 30000
      has_n_lines:
        n: 236
      has_text:
        text: "expected string"
      has_line:
        line: "exact line content"
      has_text_matching:
        expression: "regex.*pattern"

    # Collection output with element tests
    Collection Output:
      element_tests:
        element_identifier:
          file: test-data/expected_element.txt
          decompress: true
          compare: contains
```

#### Assertion Types

1. **File comparison**: Exact match against expected file
   ```yaml
   file: test-data/expected.txt
   ```

2. **Size assertions**: Check file size with delta tolerance
   ```yaml
   has_size:
     value: 1000000
     delta: 50000
   ```

3. **Content assertions**:
   ```yaml
   has_n_lines: {n: 100}
   has_text: {text: "substring"}
   has_line: {line: "exact line"}
   has_text_matching: {expression: "regex.*"}
   ```

4. **Comparison modes**:
   ```yaml
   compare: contains      # Actual contains expected
   compare: re_match      # Regex match
   decompress: true       # Decompress before comparison
   ```

5. **Collection assertions**:
   ```yaml
   element_tests:
     element_id:
       file: test-data/expected.txt
   ```

### Repository Structure Standards

#### Required Files per Workflow
```
workflow-folder/              # lowercase, dashes only
├── .dockstore.yml            # Dockstore registry metadata (REQUIRED)
├── .workflowhub.yml          # WorkflowHub metadata (optional)
├── workflow-name.ga          # Galaxy workflow file
├── workflow-name-tests.yml   # Planemo test file (REQUIRED)
├── README.md                 # Usage documentation (REQUIRED)
├── CHANGELOG.md              # Version history (REQUIRED)
└── test-data/                # Test datasets (if < 100KB)
    ├── input1.txt
    └── expected_output.txt
```

#### .dockstore.yml Format
```yaml
version: 1.2
workflows:
- name: main
  subclass: Galaxy
  publish: true
  primaryDescriptorPath: /workflow-name.ga
  testParameterFiles:
  - /workflow-name-tests.yml
  authors:
  - name: Author Name
    orcid: 0000-0002-xxxx-xxxx
  - name: IWC
    url: https://github.com/galaxyproject/iwc
```

#### .workflowhub.yml Format (optional)
```yaml
version: '0.1'
registries:
- url: https://workflowhub.eu
  project: iwc
  workflow: category/workflow-name/main
```

#### README.md Structure
Must include:
1. **Purpose**: What the workflow does
2. **Inputs**: Valid input formats, parameters, requirements
3. **Outputs**: Expected output files and their content
4. **Comparison**: How this differs from similar workflows (if applicable)
5. **Resources**: Links to tutorials, papers, documentation

#### CHANGELOG.md Format
Follow [keepachangelog.com](https://keepachangelog.com/):
```markdown
# Changelog

## [0.1.2] - 2024-12-11

### Changed
- Updated parameter X to improve Y
- Improved workflow annotation

### Automatic update
- `toolshed.g2.bx.psu.edu/repos/owner/tool/1.0`
  was updated to version `1.1`

## [0.1.1] - 2024-11-01

### Added
- Initial workflow version
```

### Naming Conventions (STRICT RULES)

#### Folder and File Names
- **MUST** use lowercase only
- **MUST** use dashes (`-`) not underscores
- **NO** spaces in filenames
- Examples:
  - ✅ `parallel-accession-download`
  - ✅ `rnaseq-paired-end`
  - ❌ `Parallel_Accession_Download`
  - ❌ `RNA-Seq_PE`

#### Workflow Name (in .ga file)
- **MUST** be human-readable
- **CAN** use spaces, capitalization
- **NO** abbreviations unless universally known
- Examples:
  - ✅ `"Parallel Accession Download from SRA"`
  - ✅ `"RNA-Seq Analysis: Paired-End Reads"`
  - ❌ `"par_acc_dl"`
  - ❌ `"rnaseq_pe"`

#### Input/Output Labels
- **MUST** be human-readable
- **CAN** use spaces
- **SHOULD** be descriptive
- **NO** technical abbreviations
- Examples:
  - ✅ `"Collection of paired FASTQ files"`
  - ✅ `"Reference genome FASTA"`
  - ❌ `"fastq_coll"`
  - ❌ `"ref_fa"`

#### Compound Adjectives
- Use **singular** form when modifying nouns
- Examples:
  - ✅ `"short-read sequencing"` (read modifies sequencing)
  - ✅ `"single-end library"`
  - ❌ `"short-reads sequencing"`
  - ❌ `"single-ends library"`

### Quality Standards & Best Practices

#### Workflow Design Principles

1. **Generic Workflows**
   - NO hardcoded sample names in labels
   - Use parameter inputs for user-configurable values
   - Design for reusability across datasets

2. **Input/Output Naming**
   - Clear, descriptive labels
   - Explain expected format in annotation
   - Group related inputs logically

3. **Annotation Quality**
   - Workflow annotation: Detailed description of purpose, method, expected inputs/outputs
   - Step annotations: Brief explanation of what each step does
   - Parameter annotations: Guidance on choosing values

4. **Metadata Completeness**
   - Include creator with ORCID
   - Add IWC as organization creator
   - Specify license (default: MIT)
   - Use semantic versioning in `release` field

5. **Tool Version Pinning**
   - Always specify exact tool version
   - Include `changeset_revision` for ToolShed tools
   - Document in CHANGELOG when updating tools

#### Testing Best Practices

1. **Test Coverage**
   - Minimum one test case per workflow
   - Test different input types (if applicable)
   - Test edge cases and common use cases
   - Test all major workflow outputs

2. **Test Data Management**
   - Files < 100KB: Store in `test-data/` directory
   - Files ≥ 100KB: Upload to Zenodo, reference by URL
   - Always include SHA-1 hash for verification
   - Use minimal test data (trim large files to essentials)

3. **Assertion Strategy**
   - Use strictest possible assertions
   - Prefer exact file comparison when possible
   - Use size/line count when content varies
   - Use regex for timestamps or dynamic content

4. **Test Documentation**
   - Include `doc:` field explaining test scenario
   - Comment complex assertions
   - Document why certain tolerances are used

#### CI/CD Integration

**Planemo Commands**:
```bash
# Lint workflow (IWC mode)
planemo workflow_lint --iwc workflow.ga

# Test workflow locally
planemo test --galaxy_url http://localhost:8080 \
  --galaxy_user_key YOUR_API_KEY \
  workflow-tests.yml

# Test workflow with Docker
planemo test --galaxy_docker_image quay.io/galaxyproject/galaxy-min:25.1 \
  workflow-tests.yml
```

**GitHub Actions Integration**:
- Workflows tested on every PR
- Uses Galaxy release_25.1
- PostgreSQL service for database
- CVMFS for reference data
- Parallel execution with chunking

### Common Workflow Patterns

#### Pattern 1: Data Fetching
```
Input: Accession list
↓
Tool: Fetch data (e.g., fasterq-dump)
↓
Tool: Quality control (e.g., FastQC)
↓
Output: Raw reads + QC report
```

#### Pattern 2: Read Processing
```
Input: FASTQ files
↓
Tool: Quality trimming
↓
Tool: Alignment/Mapping
↓
Tool: Post-processing
↓
Output: Processed data + statistics
```

#### Pattern 3: Analysis Pipeline
```
Input: Processed data + reference
↓
Tool: Primary analysis (e.g., variant calling, quantification)
↓
Tool: Filtering/Normalization
↓
Tool: Visualization
↓
Output: Results + plots + reports
```

### Workflow Categories in IWC

Organize workflows by scientific domain:
- `amplicon/` - Amplicon sequencing analysis
- `bacterial_genomics/` - Bacterial genome analysis
- `computational-chemistry/` - Computational chemistry workflows
- `data-fetching/` - Data download and retrieval
- `epigenetics/` - ATAC-seq, ChIP-seq, Hi-C, etc.
- `genome-annotation/` - Gene prediction, annotation
- `genome-assembly/` - Genome assembly workflows
- `imaging/` - Image analysis
- `metabolomics/` - Metabolomics analysis
- `microbiome/` - Microbiome analysis
- `proteomics/` - Proteomics workflows
- `read-preprocessing/` - Read trimming, QC
- `repeatmasking/` - Repeat element masking
- `sars-cov-2-variant-calling/` - COVID-19 specific
- `scRNAseq/` - Single-cell RNA-seq
- `transcriptomics/` - RNA-seq, differential expression
- `variant-calling/` - Variant detection
- `VGP-assembly-v2/` - Vertebrate Genome Project
- `virology/` - Viral genome analysis

### Review Checklist

When reviewing workflows, verify:

**Metadata**:
- [ ] `.dockstore.yml` present and valid
- [ ] Creator metadata matches `.dockstore.yml`
- [ ] License specified (MIT preferred)
- [ ] Clear, detailed `annotation` field
- [ ] Human-readable workflow name

**Naming**:
- [ ] Folder/file names lowercase with dashes
- [ ] Workflow name human-readable
- [ ] Input/output labels descriptive
- [ ] No hardcoded sample names

**Documentation**:
- [ ] README.md explains usage
- [ ] CHANGELOG.md has version entries
- [ ] Annotations on all inputs/outputs
- [ ] Tool versions documented

**Testing**:
- [ ] Test file present (`-tests.yml`)
- [ ] At least one test case
- [ ] Large files (>100KB) on Zenodo
- [ ] SHA-1 hashes for all test files
- [ ] Tests cover major outputs

**Quality**:
- [ ] Workflow is generic/reusable
- [ ] Tools pinned to specific versions
- [ ] No unnecessary intermediate outputs
- [ ] Proper workflow output labels

**Technical**:
- [ ] Workflow lints cleanly (`planemo workflow_lint --iwc`)
- [ ] Tests pass (`planemo test`)
- [ ] Valid JSON structure
- [ ] No broken connections

### Tools and Resources

**Planemo (workflow development)**:
```bash
# Install
pip install planemo

# Lint workflow
planemo workflow_lint --iwc workflow.ga

# Test workflow
planemo test workflow-tests.yml

# Serve workflow locally
planemo serve workflow.ga
```

**Galaxy Workflow Editor**:
- Access via any Galaxy instance
- Drag-and-drop interface
- Export as .ga JSON file
- Test with GUI

**IWC Resources**:
- Repository: https://github.com/galaxyproject/iwc
- Dockstore: https://dockstore.org/organizations/iwc
- WorkflowHub: https://workflowhub.eu/projects/33
- Gitter: https://gitter.im/galaxyproject/iwc
- Training: https://training.galaxyproject.org

**Reference Data**:
- CVMFS: http://datacache.galaxyproject.org/
- .loc files: http://datacache.galaxyproject.org/indexes/location/

### Common Issues and Solutions

#### Issue: Test fails with "output not found"
**Solution**: Check output label matches exactly (case-sensitive)

#### Issue: Large test files in repository
**Solution**: Upload to Zenodo, reference by URL with hash

#### Issue: Workflow not generic
**Solution**: Replace hardcoded values with parameter inputs

#### Issue: Tool update breaks workflow
**Solution**: Pin exact version in tool_shed_repository.changeset_revision

#### Issue: Tests pass locally but fail in CI
**Solution**: Check reference data availability on CVMFS

#### Issue: Workflow lint warnings
**Solution**: Run `planemo workflow_lint --iwc` and address each warning

### Version Bumping

When updating a workflow:
1. Update `release` field in .ga file
2. Add entry to CHANGELOG.md
3. Update tests if needed
4. Commit with descriptive message

Example:
```bash
# Update release field
# release: "0.1.1" → "0.1.2"

# Add CHANGELOG entry
echo "## [0.1.2] - $(date +%Y-%m-%d)" >> CHANGELOG.md
echo "### Changed" >> CHANGELOG.md
echo "- Description of changes" >> CHANGELOG.md
```

### Deployment Pipeline

After PR merge:
1. ✅ Tests pass
2. 📦 RO-Crate metadata generated
3. 🚀 Deployed to iwc-workflows organization
4. 📋 Registered on Dockstore
5. 🌐 Registered on WorkflowHub
6. 🌌 Auto-installed on usegalaxy.* servers

---

## Writing Methods Sections for Publications

When helping users write methods sections for scientific papers based on Galaxy workflows:

### 1. Workflow Analysis Strategy

**Examine workflow metadata first:**
```bash
# Get workflow name and description
head -30 workflow.ga | grep -E '"name"|"annotation"'

# Extract tool names and versions
grep -o '"tool_id": "[^"]*"' workflow.ga | sort -u

# Find specific tools (e.g., assemblers)
grep -o '"tool_id": "[^"]*hifiasm[^"]*"' workflow.ga
```

**For large workflows (>25000 tokens):**
- Don't read entire files - they'll exceed token limits
- Use grep to extract specific information
- Read only first 100 lines for metadata: `head -100 workflow.ga`
- Search for tool patterns rather than reading everything

### 2. VGP Workflow Documentation Pattern

For VGP pipeline workflows, document in this order:

1. **Platform and pipeline**: "implemented in Galaxy (cite) using VGP workflows (cite)"
2. **Data-specific approach**: Distinguish trio vs non-trio methods
3. **Sequential workflow steps**:
   - K-mer profiling (Meryl, GenomeScope2)
   - Assembly (HiFiasm with appropriate mode)
   - Scaffolding (RagTag with reference)
   - Quality assessment (BUSCO/Compleasm, Merqury, gfastats)
4. **Tool versions**: Always include version numbers
5. **Specific parameters**: Reference genomes, accessions used

### 3. Methods Section Template

```markdown
Genome assemblies were generated using the [Pipeline Name] workflows (Citation)
implemented in Galaxy (Galaxy Community, 2024). For [condition A], we employed
[approach A]: first, [step 1] using [Tool v.X] (Citation), followed by [step 2]
using [Tool v.Y] (Citation). For [condition B], we performed [approach B]
using [Tool v.Z] (Citation). All assemblies were [post-processing step] using
[Tool] with [specific parameter/reference]. Assembly quality was assessed using
multiple metrics including [Tool A] for [metric type], [Tool B] for [metric type],
and [Tool C] for [metric type]. [Annotation or downstream analysis] was performed
using [Tool/Pipeline] (Citation), which [brief description]. [Specific data sources
with accessions].
```

### 4. Common VGP Workflow Tool Citations Needed

**Core tools to cite:**
- Galaxy platform: The Galaxy Community (2024)
- VGP workflows: Larivière et al. (2024) Nature Biotechnology
- HiFiasm: Cheng et al. (2021) Nature Methods
- Meryl: Rhie et al. (2020) Genome Biology
- GenomeScope2: Ranallo-Benavidez et al. (2020) Nature Communications
- Merqury: Rhie et al. (2020) Genome Biology
- BUSCO: Manni et al. (2021) MBE
- Compleasm: Huang & Li (2023) Bioinformatics
- RagTag: Alonge et al. (2022) Genome Biology
- gfastats: Formenti et al. (2022) Bioinformatics
- EGApX: Thibaud-Nissen et al. (2013) NCBI Handbook

### 5. Key Information to Extract from Workflows

**From workflow annotation field:**
- Purpose and description
- Pipeline position (e.g., "Part of VGP suite, run after VGP1")

**From tool_id fields:**
- Primary assembler (hifiasm, flye, etc.)
- Scaffolding tool (ragtag, yahs, etc.)
- QC tools (busco, merqury, etc.)

**From inputs:**
- Data types required (HiFi, Hi-C, Illumina, trio data)
- Reference genome requirements
- RNA-seq accessions for annotation

**From parameters:**
- K-mer lengths
- Ploidy settings
- BUSCO lineages
- Coverage thresholds

### 6. Workflow File Size Considerations

**Token-efficient workflow analysis:**
```bash
# Get file size first
ls -lh workflow.ga

# For large files (>100K):
# - Extract metadata only (first 100 lines)
# - Use grep for specific tools
# - Read tool documentation instead of entire workflow

# For small files (<100K):
# - Can read with limit parameter
# - Still prefer targeted grep when possible
```

---

## Related Skills

- **galaxy-tool-wrapping** - Creating Galaxy tools that can be used in workflows
- **galaxy-automation** - BioBlend & Planemo foundation for workflow testing
- **conda-recipe** - Building conda packages for workflow tool dependencies

---

## Applying This Knowledge

When helping with Galaxy workflow development:

1. **Creating new workflows**: Follow IWC structure and naming conventions
2. **Writing tests**: Use appropriate assertions and test data management
3. **Reviewing workflows**: Apply the review checklist systematically
4. **Debugging**: Check lint output and test logs carefully
5. **Updating workflows**: Maintain CHANGELOG and version properly
6. **Documentation**: Write clear, detailed annotations and READMEs

Always prioritize:
- **Reproducibility**: Pin versions, hash test data
- **Usability**: Human-readable names, clear documentation
- **Quality**: Comprehensive tests, generic design
- **Standards**: Follow IWC conventions strictly