---
name: ena-database
description: "Access European Nucleotide Archive via API/FTP. Retrieve DNA/RNA sequences, raw reads (FASTQ), genome assemblies by accession, for genomics and bioinformatics pipelines. Supports multiple formats."
---

# ENA Database

## Overview

The European Nucleotide Archive (ENA) is a comprehensive public repository for nucleotide sequence data and associated metadata. Access and query DNA/RNA sequences, raw reads, genome assemblies, and functional annotations through REST APIs and FTP for genomics and bioinformatics pipelines.

## When to Use This Skill

This skill should be used when:

- Retrieving nucleotide sequences or raw sequencing reads by accession
- Searching for samples, studies, or assemblies by metadata criteria
- Downloading FASTQ files or genome assemblies for analysis
- Querying taxonomic information for organisms
- Accessing sequence annotations and functional data
- Integrating ENA data into bioinformatics pipelines
- Performing cross-reference searches to related databases
- Bulk downloading datasets via FTP or Aspera

## Core Capabilities

### 1. Data Types and Structure

ENA organizes data into hierarchical object types:

**Studies/Projects** - Group related data and control release dates. Studies are the primary unit for citing archived data.

**Samples** - Represent units of biomaterial from which sequencing libraries were produced. Samples must be registered before submitting most data types.

**Raw Reads** - Consist of:
- **Experiments**: Metadata about sequencing methods, library preparation, and instrument details
- **Runs**: References to data files containing raw sequencing reads from a single sequencing run

**Assemblies** - Genome, transcriptome, metagenome, or metatranscriptome assemblies at various completion levels.

**Sequences** - Assembled and annotated sequences stored in the EMBL Nucleotide Sequence Database, including coding/non-coding regions and functional annotations.

**Analyses** - Results from computational analyses of sequence data.

**Taxonomy Records** - Taxonomic information including lineage and rank.

### 2. Programmatic Access

ENA provides multiple REST APIs for data access. Consult `references/api_reference.md` for detailed endpoint documentation.

**Key APIs:**

**ENA Portal API** - Advanced search functionality across all ENA data types
- Documentation: https://www.ebi.ac.uk/ena/portal/api/doc
- Use for complex queries and metadata searches

**ENA Browser API** - Direct retrieval of records and metadata
- Documentation: https://www.ebi.ac.uk/ena/browser/api/doc
- Use for downloading specific records by accession
- Returns data in XML format

**ENA Taxonomy REST API** - Query taxonomic information
- Access lineage, rank, and related taxonomic data

**ENA Cross Reference Service** - Access related records from external databases
- Endpoint: https://www.ebi.ac.uk/ena/xref/rest/

**CRAM Reference Registry** - Retrieve reference sequences
- Endpoint: https://www.ebi.ac.uk/ena/cram/
- Query by MD5 or SHA1 checksums

**Rate Limiting**: All APIs have a rate limit of 50 requests per second. Exceeding this returns HTTP 429 (Too Many Requests).

### 3. Searching and Retrieving Data

**Browser-Based Search:**
- Free text search across all fields
- Sequence similarity search (BLAST integration)
- Cross-reference search to find related records
- Advanced search with Rulespace query builder

**Programmatic Queries:**
- Use Portal API for advanced searches at scale
- Filter by data type, date range, taxonomy, or metadata fields
- Download results as tabulated metadata summaries or XML records

**Example API Query Pattern:**

```python
import requests

# Search for samples from a specific study
base_url = "https://www.ebi.ac.uk/ena/portal/api/search"
params = {
    "result": "sample",
    "query": 'study_accession="PRJEB1234"',  # text values are double-quoted
    "format": "json",
    "limit": 100
}

response = requests.get(base_url, params=params, timeout=30)
response.raise_for_status()  # fail early on HTTP errors such as 429
samples = response.json()
```

### 4. Data Retrieval Formats

**Metadata Formats:**
- XML (native ENA format)
- JSON (via Portal API)
- TSV/CSV (tabulated summaries)

**Sequence Data:**
- FASTQ (raw reads)
- BAM/CRAM (aligned reads)
- FASTA (assembled sequences)
- EMBL flat file format (annotated sequences)

**Download Methods:**
- Direct API download (small files)
- FTP for bulk data transfer
- Aspera for high-speed transfer of large datasets
- enaBrowserTools command-line utility for bulk downloads

### 5. Common Use Cases

**Retrieve run metadata by accession:**

```python
# Fetch the run's XML metadata record using Browser API
# (read file locations are listed in the fastq_ftp field of Portal API read_run results)
accession = "ERR123456"
url = f"https://www.ebi.ac.uk/ena/browser/api/xml/{accession}"
```

**Search for all samples in a study:**

```python
# Use Portal API to list samples; text values are double-quoted (%22 is the URL-encoded quote)
study_id = "PRJNA123456"
url = f"https://www.ebi.ac.uk/ena/portal/api/search?result=sample&query=study_accession=%22{study_id}%22&format=tsv"
```

**Find assemblies for a specific organism:**

```python
# Search assemblies by taxonomy; tax_tree() takes an NCBI taxon ID and includes descendant taxa
taxon_id = 562  # Escherichia coli
url = f"https://www.ebi.ac.uk/ena/portal/api/search?result=assembly&query=tax_tree({taxon_id})&format=json"
```

**Get taxonomic lineage:**

```python
# Query taxonomy API
taxon_id = "562"  # E. coli
url = f"https://www.ebi.ac.uk/ena/taxonomy/rest/tax-id/{taxon_id}"
```

### 6. Integration with Analysis Pipelines

**Bulk Download Pattern:**
1. Search for accessions matching criteria using Portal API
2. Extract file URLs from search results
3. Download files via FTP or using enaBrowserTools
4. Process downloaded data in pipeline

**BLAST Integration:**
Integrate with EBI's NCBI BLAST service (REST/SOAP API) for sequence similarity searches against ENA sequences.

### 7. Best Practices

**Rate Limiting:**
- Implement exponential backoff when receiving HTTP 429 responses
- Batch requests when possible to stay within the limit of 50 requests per second
- Use bulk download tools for large datasets instead of iterating API calls

**Data Citation:**
- Always cite using Study/Project accessions when publishing
- Include accession numbers for specific samples, runs, or assemblies used

**API Response Handling:**
- Check HTTP status codes before processing responses
- Parse XML responses using proper XML libraries (not regex)
- Handle pagination for large result sets

**Performance:**
- Use FTP/Aspera for downloading large files (>100MB)
- Prefer TSV/JSON formats over XML when only metadata is needed
- Cache taxonomy lookups locally when processing many records

## Resources

This skill includes detailed reference documentation for working with ENA:

### references/

**api_reference.md** - Comprehensive API endpoint documentation including:
- Detailed parameters for Portal API and Browser API
- Response format specifications
- Advanced query syntax and operators
- Field names for filtering and searching
- Common API patterns and examples

Load this reference when constructing complex API queries, debugging API responses, or looking up specific parameter details.
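
**Example Backoff Pattern:**

The rate-limiting advice under Best Practices (back off exponentially on HTTP 429) could be sketched roughly as follows. This is a minimal illustration, not an official ENA client: `fetch_with_backoff` and `make_portal_request` are hypothetical helper names, the retry count and delays are arbitrary assumptions, and the sketch uses the standard-library `urllib` rather than `requests` so that it has no third-party dependencies.

```python
import time
import urllib.error
import urllib.parse
import urllib.request


def fetch_with_backoff(send, max_retries=5, base_delay=0.5, sleep=time.sleep):
    """Call send() until it returns a non-429 status or retries run out.

    send is a hypothetical zero-argument callable returning (status, body).
    The wait doubles after each 429: base_delay, 2*base_delay, 4*base_delay, ...
    """
    status, body = None, ""
    for attempt in range(max_retries + 1):
        status, body = send()
        if status != 429:
            return status, body
        if attempt < max_retries:
            sleep(base_delay * (2 ** attempt))
    # Retries exhausted: hand the last 429 back to the caller
    return status, body


def make_portal_request(params):
    """Build a send() callable for a Portal API search (assumed helper)."""
    url = ("https://www.ebi.ac.uk/ena/portal/api/search?"
           + urllib.parse.urlencode(params))

    def send():
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.status, resp.read().decode("utf-8")
        except urllib.error.HTTPError as err:
            # urllib raises on 4xx/5xx; surface the status code instead
            return err.code, ""

    return send
```

A call such as `fetch_with_backoff(make_portal_request({"result": "sample", "query": 'study_accession="PRJEB1234"', "format": "tsv"}))` would then retry only on 429 and return any other status unchanged. Passing `sleep` as a parameter keeps the retry logic testable without real delays.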