---
name: tooluniverse-sequence-retrieval
description: Retrieves biological sequences (DNA, RNA, protein) from NCBI and ENA with gene disambiguation, accession type handling, and comprehensive sequence profiles. Creates detailed reports with sequence metadata, cross-database references, and download options. Use when users need nucleotide sequences, protein sequences, genome data, or mention GenBank, RefSeq, EMBL accessions.
---

# Biological Sequence Retrieval

Retrieve DNA, RNA, and protein sequences with proper disambiguation and cross-database handling.

## Workflow Overview

```
Phase 0: Clarify (if needed)
    ↓
Phase 1: Disambiguate Gene/Organism
    ↓
Phase 2: Search & Retrieve (Internal)
    ↓
Phase 3: Report Sequence Profile
```

---

## Phase 0: Clarification (When Needed)

Ask the user ONLY if:
- Gene name exists in multiple organisms (e.g., "BRCA1" → human or mouse?)
- Sequence type unclear (mRNA, genomic, protein?)
- Strain/isolate matters (e.g., E. coli → K-12, O157:H7, etc.)

Skip clarification for:
- Specific accession numbers (NC_*, NM_*, U*, etc.)
- Clear organism + gene combinations
- Complete genome requests with organism specified

---

## Phase 1: Gene/Organism Disambiguation

### 1.1 Resolve Identifiers

```python
from tooluniverse import ToolUniverse

tu = ToolUniverse()
tu.load_tools()

# Strategy depends on input type
if user_provided_accession:
    # Direct retrieval based on accession type
    accession = user_provided_accession
elif user_provided_gene_and_organism:
    # Search NCBI Nucleotide
    result = tu.tools.NCBI_search_nucleotide(
        operation="search",
        organism=organism,
        gene=gene,
        limit=10
    )
```

### 1.2 Accession Type Decision Tree

**CRITICAL**: Accession prefix determines which tools to use.

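A minimal sketch of that routing in plain Python. The helper name `route_accession` and its return labels are illustrative assumptions, not part of ToolUniverse. It relies only on the fact that RefSeq accessions begin with two letters and an underscore (NC_, NM_, XP_, ...), while GenBank/EMBL accessions contain no underscore.

```python
import re

# RefSeq accessions start with two uppercase letters and an underscore;
# INSDC (GenBank/EMBL) accessions have no underscore.
_REFSEQ_PATTERN = re.compile(r"^[A-Z]{2}_")


def route_accession(accession: str) -> dict:
    """Return which databases can serve an accession (illustrative helper)."""
    acc = accession.strip().upper()
    if not acc:
        raise ValueError("empty accession")
    if _REFSEQ_PATTERN.match(acc):
        return {"type": "RefSeq", "databases": ["NCBI"]}
    return {"type": "GenBank/EMBL", "databases": ["NCBI", "ENA"]}


print(route_accession("NC_000913.3"))  # RefSeq: NCBI only
print(route_accession("U00096.3"))     # GenBank: NCBI or ENA
```

The prefixes most often encountered map to tools as follows:
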
| Prefix | Type | Use With |
|--------|------|----------|
| NC_* | RefSeq chromosome | NCBI only |
| NM_* | RefSeq mRNA | NCBI only |
| NR_* | RefSeq ncRNA | NCBI only |
| NP_* | RefSeq protein | NCBI only |
| XM_* | RefSeq predicted mRNA | NCBI only |
| NZ_* | RefSeq genome (complete or WGS) | NCBI only |
| U*, M*, K*, X* | GenBank | NCBI or ENA |
| CP* | GenBank genome | NCBI or ENA |
| EMBL format | EMBL | ENA preferred |

### 1.3 Identity Resolution Checklist

- [ ] Organism confirmed (scientific name)
- [ ] Gene symbol/name identified
- [ ] Sequence type determined (genomic/mRNA/protein)
- [ ] Strain specified (if relevant)
- [ ] Accession prefix identified → tool selection

---

## Phase 2: Data Retrieval (Internal)

Retrieve silently. Do NOT narrate the search process.

### 2.1 Search for Sequences

```python
# Search NCBI Nucleotide
result = tu.tools.NCBI_search_nucleotide(
    operation="search",
    organism=organism,
    gene=gene,
    strain=strain,        # Optional
    keywords=keywords,    # Optional
    seq_type=seq_type,    # complete_genome, mrna, refseq
    limit=10
)

# Get accession numbers from UIDs
accessions = tu.tools.NCBI_fetch_accessions(
    operation="fetch_accession",
    uids=result["data"]["uids"]
)
```

### 2.2 Retrieve Sequence Data

```python
# Get sequence in desired format
sequence = tu.tools.NCBI_get_sequence(
    operation="fetch_sequence",
    accession=accession,
    format="fasta"  # or "genbank"
)

# GenBank format for annotations
annotations = tu.tools.NCBI_get_sequence(
    operation="fetch_sequence",
    accession=accession,
    format="genbank"
)
```

### 2.3 ENA Alternative (for GenBank/EMBL accessions)

```python
# Only for non-RefSeq accessions!
if not accession.startswith(("NC_", "NM_", "NR_", "NP_", "NZ_", "XM_", "XP_", "XR_")):
    # ENA entry info
    entry = tu.tools.ena_get_entry(accession=accession)

    # ENA FASTA
    fasta = tu.tools.ena_get_sequence_fasta(accession=accession)

    # ENA summary
    summary = tu.tools.ena_get_entry_summary(accession=accession)
```

### Fallback Chains

| Primary | Fallback | Notes |
|---------|----------|-------|
| `NCBI_get_sequence` | ENA tools (GenBank/EMBL accessions only) | Use when NCBI is unavailable |
| `ena_get_entry` | `NCBI_get_sequence` | ENA doesn't have RefSeq |
| `NCBI_search_nucleotide` | Try broader keywords | Use when no results are returned |

**Critical Rule**: Never try ENA tools with RefSeq accessions (NC_, NM_, etc.); they will return 404 errors.

---

## Phase 3: Report Sequence Profile

### Output Structure

Present as a **Sequence Profile Report**. Hide search process.

````markdown
# Sequence Profile: [Gene/Organism]

**Search Summary**
- Query: [gene] in [organism]
- Database: NCBI Nucleotide
- Results: [N] sequences found

---

## Primary Sequence

### [Accession]: [Definition/Title]

| Attribute | Value |
|-----------|-------|
| **Accession** | [accession] |
| **Type** | RefSeq / GenBank |
| **Organism** | [scientific name] |
| **Strain** | [strain if applicable] |
| **Length** | [X,XXX bp / aa] |
| **Molecule** | DNA / mRNA / Protein |
| **Topology** | Linear / Circular |

**Curation Level**: ●●●● RefSeq Reference / ●●●○ RefSeq Predicted / ●●○○ GenBank Validated / ●○○○ GenBank Direct / ○○○○ Third Party

### Sequence Statistics

| Statistic | Value |
|-----------|-------|
| **Length** | [X,XXX] bp |
| **GC Content** | [XX.X]% |
| **Genes** | [N] (if genome) |
| **CDS** | [N] (if annotated) |

### Sequence Preview

```fasta
>[accession] [definition]
ATGCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGA
... [truncated, full sequence in download]
```

### Annotations Summary (from GenBank format)

| Feature | Count | Examples |
|---------|-------|----------|
| CDS | [N] | [gene names] |
| tRNA | [N] | - |
| rRNA | [N] | 16S, 23S |
| Regulatory | [N] | promoters |

---

## Alternative Sequences

Ranked by relevance and curation level:

| Accession | Type | Length | Description | ENA Compatible |
|-----------|------|--------|-------------|----------------|
| NC_000913.3 | RefSeq | 4.6 Mb | E. coli K-12 reference | ✗ |
| U00096.3 | GenBank | 4.6 Mb | E. coli K-12 | ✓ |
| CP001509.3 | GenBank | 4.6 Mb | E. coli DH10B | ✓ |

---

## Cross-Database References

| Database | Accession | Link |
|----------|-----------|------|
| RefSeq | [NC_*] | [NCBI link] |
| GenBank | [U*] | [NCBI link] |
| ENA/EMBL | [same as GenBank] | [ENA link] |
| BioProject | [PRJNA*] | [link] |
| BioSample | [SAMN*] | [link] |

---

## Download Options

### Formats Available

| Format | Description | Use Case |
|--------|-------------|----------|
| FASTA | Sequence only | BLAST, alignment |
| GenBank | Sequence + annotations | Gene analysis |
| GFF3 | Annotations only | Genome browsers |

### Direct Commands

```python
# FASTA format
tu.tools.NCBI_get_sequence(
    operation="fetch_sequence",
    accession="[accession]",
    format="fasta"
)

# GenBank format (with annotations)
tu.tools.NCBI_get_sequence(
    operation="fetch_sequence",
    accession="[accession]",
    format="genbank"
)
```

---

## Related Sequences

### Other Strains/Isolates

| Accession | Strain | Similarity | Notes |
|-----------|--------|------------|-------|
| [acc1] | [strain1] | 99.9% | [notes] |
| [acc2] | [strain2] | 99.5% | [notes] |

### Protein Products (if applicable)

| Protein Accession | Product Name | Length |
|-------------------|--------------|--------|
| [NP_*] | [protein name] | [X] aa |

---

Retrieved: [date]
Database: NCBI Nucleotide
````

---

## Curation Level Tiers (Aligned with Evidence Grading)

### Sequence Curation Levels

| Tier | Symbol | Accession Prefix | Description | Evidence Equivalent |
|------|--------|------------------|-------------|---------------------|
| RefSeq Reference | ●●●● | NC_, NM_, NP_ | NCBI-curated, gold standard | ★★★ |
| RefSeq Predicted | ●●●○ | XM_, XP_, XR_ | Computationally predicted | ★★☆ |
| GenBank Validated | ●●○○ | Various | Submitted, some curation | ★★☆ |
| GenBank Direct | ●○○○ | Various | Direct submission | ★☆☆ |
| Third Party | ○○○○ | TPA_ | Third-party annotation | ★☆☆ |

### Data Reliability Mapping

| Data Type | Reliability | Notes |
|-----------|-------------|-------|
| RefSeq curated sequence | ★★★ | Gold standard for reference |
| RefSeq annotations | ★★★ | Validated gene models |
| GenBank sequence | ★★☆ | Submitted, generally reliable |
| GenBank annotations | ★☆☆ | Submitter-provided, verify |
| Predicted genes (XM_) | ★★☆ | Computational, may lack validation |
| Genome assembly | ★★★-★☆☆ | Depends on assembly quality |

Include in report:

```markdown
**Curation Level**: ●●●● RefSeq Reference (★★★)
- Curated by NCBI RefSeq project
- Regular updates and validation
- Recommended for reference use

**Data Reliability Note**:
- Sequence: ★★★ (experimentally derived)
- Gene annotations: ★★★ (curated models)
- Variant annotations: ★★☆ (computational)
```

---

## Completeness Checklist

Every sequence report MUST include:

### Per Sequence (Required)
- [ ] Accession number
- [ ] Organism (scientific name)
- [ ] Sequence type (DNA/RNA/protein)
- [ ] Length
- [ ] Curation level
- [ ] Database source

### Search Summary (Required)
- [ ] Query parameters
- [ ] Number of results
- [ ] Ranking rationale

### Include Even If Limited
- [ ] Alternative sequences (or "Only one sequence found")
- [ ] Cross-database references (or "No cross-references available")
- [ ] Download instructions

---

## Common Use Cases

### Reference Genome

User: "Get E. coli K-12 complete genome"

```python
result = tu.tools.NCBI_search_nucleotide(
    operation="search",
    organism="Escherichia coli",
    strain="K-12",
    seq_type="complete_genome",
    limit=3
)
# Return NC_000913.3 (RefSeq reference)
```

### Gene Sequence

User: "Find human BRCA1 mRNA"

```python
result = tu.tools.NCBI_search_nucleotide(
    operation="search",
    organism="Homo sapiens",
    gene="BRCA1",
    seq_type="mrna",
    limit=10
)
```

### Specific Accession

User: "Get sequence for NC_045512.2"

→ Direct retrieval with full metadata

### Strain Comparison

User: "Compare E. coli K-12 and O157:H7 genomes"

→ Search both strains, provide comparison table

---

## Error Handling

| Error | Response |
|-------|----------|
| "No search criteria provided" | Add organism, gene, or keywords |
| "ENA 404 error" | Accession is likely RefSeq → use NCBI only |
| "No results found" | Broaden search, check spelling, try synonyms |
| "Sequence too large" | Note size, provide download link instead of preview |
| "API rate limit" | Tools auto-retry; if persistent, wait briefly |

---

## Tool Reference

**NCBI Tools (All Accessions)**

| Tool | Purpose |
|------|---------|
| `NCBI_search_nucleotide` | Search by gene/organism |
| `NCBI_fetch_accessions` | Convert UIDs to accessions |
| `NCBI_get_sequence` | Retrieve sequence data |

**ENA Tools (GenBank/EMBL Only)**

| Tool | Purpose |
|------|---------|
| `ena_get_entry` | Entry metadata |
| `ena_get_sequence_fasta` | FASTA sequence |
| `ena_get_entry_summary` | Summary info |

---

## Search Parameters Reference

**NCBI_search_nucleotide**

| Parameter | Description | Example |
|-----------|-------------|---------|
| `operation` | Always "search" | "search" |
| `organism` | Scientific name | "Homo sapiens" |
| `gene` | Gene symbol | "BRCA1" |
| `strain` | Specific strain | "K-12" |
| `keywords` | Free text | "complete genome" |
| `seq_type` | Sequence type | "complete_genome", "mrna", "refseq" |
| `limit` | Max results | 10 |

**NCBI_get_sequence**

| Parameter | Description | Example |
|-----------|-------------|---------|
| `operation` | Always "fetch_sequence" | "fetch_sequence" |
| `accession` | Accession number | "NC_000913.3" |
| `format` | Output format | "fasta", "genbank" |
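
The `NCBI_get_sequence` parameters above can be checked before a tool call is made. The sketch below is a hypothetical helper, not part of ToolUniverse: the name `build_fetch_kwargs` is an assumption, and it validates only the values documented in this reference (`operation="fetch_sequence"`, `format` of `"fasta"` or `"genbank"`).

```python
ALLOWED_FORMATS = {"fasta", "genbank"}  # formats listed in the table above


def build_fetch_kwargs(accession: str, fmt: str = "fasta") -> dict:
    """Build keyword arguments for NCBI_get_sequence (hypothetical helper)."""
    accession = accession.strip()
    if not accession:
        raise ValueError("accession must not be empty")
    fmt = fmt.lower()
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return {"operation": "fetch_sequence", "accession": accession, "format": fmt}


kwargs = build_fetch_kwargs("NC_000913.3", "GenBank")
print(kwargs)
# With ToolUniverse loaded as `tu`, the call would then be:
# sequence = tu.tools.NCBI_get_sequence(**kwargs)
```

Rejecting an unknown format locally keeps a typo from turning into a tool error after the search phase has already run.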