---
name: string-database
description: "Query STRING API for protein-protein interactions (59M proteins, 20B interactions). Network analysis, GO/KEGG enrichment, interaction discovery, 5000+ species, for systems biology."
---

# STRING Database

## Overview

STRING is a comprehensive database of known and predicted protein-protein interactions covering 59M proteins and 20B+ interactions across 5000+ organisms. Query interaction networks, perform functional enrichment, discover partners via REST API for systems biology and pathway analysis.

## When to Use This Skill

This skill should be used when:
- Retrieving protein-protein interaction networks for single or multiple proteins
- Performing functional enrichment analysis (GO, KEGG, Pfam) on protein lists
- Discovering interaction partners and expanding protein networks
- Testing if proteins form significantly enriched functional modules
- Generating network visualizations with evidence-based coloring
- Analyzing homology and protein family relationships
- Conducting cross-species protein interaction comparisons
- Identifying hub proteins and network connectivity patterns

## Quick Start

The skill provides:
1. Python helper functions (`scripts/string_api.py`) for all STRING REST API operations
2. Comprehensive reference documentation (`references/string_reference.md`) with detailed API specifications

When users request STRING data, determine which operation is needed and use the appropriate function from `scripts/string_api.py`.

## Core Operations

### 1. Identifier Mapping (`string_map_ids`)

Convert gene names, protein names, and external IDs to STRING identifiers.

**When to use**: Starting any STRING analysis, validating protein names, finding canonical identifiers.

**Usage**:

```python
from scripts.string_api import string_map_ids

# Map single protein
result = string_map_ids('TP53', species=9606)

# Map multiple proteins
result = string_map_ids(['TP53', 'BRCA1', 'EGFR', 'MDM2'], species=9606)

# Map with multiple matches per query
result = string_map_ids('p53', species=9606, limit=5)
```

**Parameters**:
- `species`: NCBI taxon ID (9606 = human, 10090 = mouse, 7227 = fly)
- `limit`: Number of matches per identifier (default: 1)
- `echo_query`: Include query term in output (default: 1)

**Best practice**: Always map identifiers first for faster subsequent queries.

### 2. Network Retrieval (`string_network`)

Get protein-protein interaction network data in tabular format.

**When to use**: Building interaction networks, analyzing connectivity, retrieving interaction evidence.

**Usage**:

```python
from scripts.string_api import string_network

# Get network for single protein
network = string_network('9606.ENSP00000269305', species=9606)

# Get network with multiple proteins
proteins = ['9606.ENSP00000269305', '9606.ENSP00000275493']
network = string_network(proteins, required_score=700)

# Expand network with additional interactors
network = string_network('TP53', species=9606, add_nodes=10, required_score=400)

# Physical interactions only
network = string_network('TP53', species=9606, network_type='physical')
```

**Parameters**:
- `required_score`: Confidence threshold (0-1000)
  - 150: low confidence (exploratory)
  - 400: medium confidence (default, standard analysis)
  - 700: high confidence (conservative)
  - 900: highest confidence (very stringent)
- `network_type`: `'functional'` (all evidence, default) or `'physical'` (direct binding only)
- `add_nodes`: Add N most connected proteins (0-10)

**Output columns**: Interaction pairs, confidence scores, and individual evidence scores (neighborhood, fusion, phylogenetic profile, coexpression, experimental, database, text-mining).

### 3. Network Visualization (`string_network_image`)

Generate network visualization as PNG image.

**When to use**: Creating figures, visual exploration, presentations.

**Usage**:

```python
from scripts.string_api import string_network_image

# Get network image
proteins = ['TP53', 'MDM2', 'ATM', 'CHEK2', 'BRCA1']
img_data = string_network_image(proteins, species=9606, required_score=700)

# Save image
with open('network.png', 'wb') as f:
    f.write(img_data)

# Evidence-colored network
img = string_network_image(proteins, species=9606, network_flavor='evidence')

# Confidence-based visualization
img = string_network_image(proteins, species=9606, network_flavor='confidence')

# Actions network (activation/inhibition)
img = string_network_image(proteins, species=9606, network_flavor='actions')
```

**Network flavors**:
- `'evidence'`: Colored lines show evidence types (default)
- `'confidence'`: Line thickness represents confidence
- `'actions'`: Shows activating/inhibiting relationships

### 4. Interaction Partners (`string_interaction_partners`)

Find all proteins that interact with given protein(s).

**When to use**: Discovering novel interactions, finding hub proteins, expanding networks.

**Usage**:

```python
from scripts.string_api import string_interaction_partners

# Get top 10 interactors of TP53
partners = string_interaction_partners('TP53', species=9606, limit=10)

# Get high-confidence interactors
partners = string_interaction_partners('TP53', species=9606,
                                       limit=20, required_score=700)

# Find interactors for multiple proteins
partners = string_interaction_partners(['TP53', 'MDM2'],
                                       species=9606, limit=15)
```

**Parameters**:
- `limit`: Maximum number of partners to return (default: 10)
- `required_score`: Confidence threshold (0-1000)

**Use cases**:
- Hub protein identification
- Network expansion from seed proteins
- Discovering indirect connections

### 5. Functional Enrichment (`string_enrichment`)

Perform enrichment analysis across Gene Ontology, KEGG pathways, Pfam domains, and more.

**When to use**: Interpreting protein lists, pathway analysis, functional characterization, understanding biological processes.

**Usage**:

```python
import io

import pandas as pd

from scripts.string_api import string_enrichment

# Enrichment for a protein list
proteins = ['TP53', 'MDM2', 'ATM', 'CHEK2', 'BRCA1', 'ATR', 'TP73']
enrichment = string_enrichment(proteins, species=9606)

# Parse the tab-separated result and keep significant terms
df = pd.read_csv(io.StringIO(enrichment), sep='\t')
significant = df[df['fdr'] < 0.05]
```

**Enrichment categories**:
- **Gene Ontology**: Biological Process, Molecular Function, Cellular Component
- **KEGG Pathways**: Metabolic and signaling pathways
- **Pfam**: Protein domains
- **InterPro**: Protein families and domains
- **SMART**: Domain architecture
- **UniProt Keywords**: Curated functional keywords

**Output columns**:
- `category`: Annotation database (e.g., "KEGG Pathways", "GO Biological Process")
- `term`: Term identifier
- `description`: Human-readable term description
- `number_of_genes`: Input proteins with this annotation
- `p_value`: Uncorrected enrichment p-value
- `fdr`: False discovery rate (corrected p-value)

**Statistical method**: Fisher's exact test with Benjamini-Hochberg FDR correction.

**Interpretation**: FDR < 0.05 indicates statistically significant enrichment.

### 6. PPI Enrichment (`string_ppi_enrichment`)

Test if a protein network has significantly more interactions than expected by chance.

**When to use**: Validating whether proteins form a functional module, testing network connectivity.

**Usage**:

```python
import json

from scripts.string_api import string_ppi_enrichment

# Test network connectivity
proteins = ['TP53', 'MDM2', 'ATM', 'CHEK2', 'BRCA1']
result = string_ppi_enrichment(proteins, species=9606, required_score=400)

# Parse JSON result
data = json.loads(result)
print(f"Observed edges: {data['number_of_edges']}")
print(f"Expected edges: {data['expected_number_of_edges']}")
print(f"P-value: {data['p_value']}")
```

**Output fields**:
- `number_of_nodes`: Proteins in network
- `number_of_edges`: Observed interactions
- `expected_number_of_edges`: Expected in random network
- `p_value`: Statistical significance

**Interpretation**:
- p-value < 0.05: Network is significantly enriched (proteins likely form a functional module)
- p-value ≥ 0.05: No significant enrichment (proteins may be unrelated)

### 7. Homology Scores (`string_homology`)

Retrieve protein similarity and homology information.

**When to use**: Identifying protein families, paralog analysis, cross-species comparisons.

**Usage**:

```python
from scripts.string_api import string_homology

# Get homology between proteins
proteins = ['TP53', 'TP63', 'TP73']  # p53 family
homology = string_homology(proteins, species=9606)
```

**Use cases**:
- Protein family identification
- Paralog discovery
- Evolutionary analysis

### 8. Version Information (`string_version`)

Get current STRING database version.

**When to use**: Ensuring reproducibility, documenting methods.

**Usage**:

```python
from scripts.string_api import string_version

version = string_version()
print(f"STRING version: {version}")
```

## Common Analysis Workflows

### Workflow 1: Protein List Analysis (Standard Workflow)

**Use case**: Analyze a list of proteins from an experiment (e.g., differential expression, proteomics).

```python
from scripts.string_api import (string_map_ids, string_network,
                                string_enrichment, string_ppi_enrichment,
                                string_network_image)

# Step 1: Map gene names to STRING IDs
gene_list = ['TP53', 'BRCA1', 'ATM', 'CHEK2', 'MDM2', 'ATR', 'BRCA2']
mapping = string_map_ids(gene_list, species=9606)

# Step 2: Get interaction network
network = string_network(gene_list, species=9606, required_score=400)

# Step 3: Test if network is enriched
ppi_result = string_ppi_enrichment(gene_list, species=9606)

# Step 4: Perform functional enrichment
enrichment = string_enrichment(gene_list, species=9606)

# Step 5: Generate network visualization
img = string_network_image(gene_list, species=9606,
                           network_flavor='evidence', required_score=400)
with open('protein_network.png', 'wb') as f:
    f.write(img)

# Step 6: Parse and interpret results
```

### Workflow 2: Single Protein Investigation

**Use case**: Deep dive into one protein's interactions and partners.

```python
from scripts.string_api import (string_map_ids, string_interaction_partners,
                                string_network_image)

# Step 1: Map protein name
protein = 'TP53'
mapping = string_map_ids(protein, species=9606)

# Step 2: Get all interaction partners
partners = string_interaction_partners(protein, species=9606,
                                       limit=20, required_score=700)

# Step 3: Visualize expanded network
img = string_network_image(protein, species=9606, add_nodes=15,
                           network_flavor='confidence', required_score=700)
with open('tp53_network.png', 'wb') as f:
    f.write(img)
```

### Workflow 3: Pathway-Centric Analysis

**Use case**: Identify and visualize proteins in a specific biological pathway.

```python
import io

import pandas as pd

from scripts.string_api import string_enrichment, string_network

# Step 1: Start with known pathway proteins
dna_repair_proteins = ['TP53', 'ATM', 'ATR', 'CHEK1', 'CHEK2',
                       'BRCA1', 'BRCA2', 'RAD51', 'XRCC1']

# Step 2: Get network
network = string_network(dna_repair_proteins, species=9606,
                         required_score=700, add_nodes=5)

# Step 3: Enrichment to confirm pathway annotation
enrichment = string_enrichment(dna_repair_proteins, species=9606)

# Step 4: Parse enrichment for DNA repair pathways
df = pd.read_csv(io.StringIO(enrichment), sep='\t')
dna_repair = df[df['description'].str.contains('DNA repair', case=False)]
```

### Workflow 4: Cross-Species Analysis

**Use case**: Compare protein interactions across different organisms.

```python
from scripts.string_api import string_network

# Human network
human_network = string_network('TP53', species=9606, required_score=700)

# Mouse network
mouse_network = string_network('Trp53', species=10090, required_score=700)

# Yeast network (if ortholog exists)
yeast_network = string_network('gene_name', species=4932, required_score=700)
```

### Workflow 5: Network Expansion and Discovery

**Use case**: Start with seed proteins and discover connected functional modules.

```python
import io

import pandas as pd

from scripts.string_api import (string_interaction_partners, string_network,
                                string_enrichment)

# Step 1: Start with seed protein(s)
seed_proteins = ['TP53']

# Step 2: Get first-degree interactors
partners = string_interaction_partners(seed_proteins, species=9606,
                                       limit=30, required_score=700)

# Step 3: Parse partners to get protein list
df = pd.read_csv(io.StringIO(partners), sep='\t')
all_proteins = list(set(df['preferredName_A'].tolist() +
                        df['preferredName_B'].tolist()))

# Step 4: Perform enrichment on expanded network
enrichment = string_enrichment(all_proteins[:50], species=9606)

# Step 5: Filter for interesting functional modules
enrichment_df = pd.read_csv(io.StringIO(enrichment), sep='\t')
modules = enrichment_df[enrichment_df['fdr'] < 0.001]
```

## Common Species

When specifying species, use NCBI taxon IDs:

| Organism | Common Name | Taxon ID |
|----------|-------------|----------|
| Homo sapiens | Human | 9606 |
| Mus musculus | Mouse | 10090 |
| Rattus norvegicus | Rat | 10116 |
| Drosophila melanogaster | Fruit fly | 7227 |
| Caenorhabditis elegans | C. elegans | 6239 |
| Saccharomyces cerevisiae | Yeast | 4932 |
| Arabidopsis thaliana | Thale cress | 3702 |
| Escherichia coli | E. coli | 511145 |
| Danio rerio | Zebrafish | 7955 |

Full list available at: https://string-db.org/cgi/input?input_page_active_form=organisms

## Understanding Confidence Scores

STRING provides combined confidence scores (0-1000) integrating multiple evidence types:

### Evidence Channels

1. **Neighborhood (nscore)**: Conserved genomic neighborhood across species
2. **Fusion (fscore)**: Gene fusion events
3. **Phylogenetic Profile (pscore)**: Co-occurrence patterns across species
4. **Coexpression (ascore)**: Correlated RNA expression
5. **Experimental (escore)**: Biochemical and genetic experiments
6. **Database (dscore)**: Curated pathway and complex databases
7. **Text-mining (tscore)**: Literature co-occurrence and NLP extraction

### Recommended Thresholds

Choose threshold based on analysis goals:

- **150 (low confidence)**: Exploratory analysis, hypothesis generation
- **400 (medium confidence)**: Standard analysis, balanced sensitivity/specificity
- **700 (high confidence)**: Conservative analysis, high-confidence interactions
- **900 (highest confidence)**: Very stringent, experimental evidence preferred

**Trade-offs**:
- Lower thresholds: More interactions (higher recall, more false positives)
- Higher thresholds: Fewer interactions (higher precision, more false negatives)

## Network Types

### Functional Networks (Default)

Includes all evidence types (experimental, computational, text-mining). Represents proteins that are functionally associated, even without direct physical binding.

**When to use**:
- Pathway analysis
- Functional enrichment studies
- Systems biology
- Most general analyses

### Physical Networks

Only includes evidence for direct physical binding (experimental data and database annotations for physical interactions).

**When to use**:
- Structural biology studies
- Protein complex analysis
- Direct binding validation
- When physical contact is required

## API Best Practices

1. **Always map identifiers first**: Use `string_map_ids()` before other operations for faster queries
2. **Use STRING IDs when possible**: Use format `9606.ENSP00000269305` instead of gene names
3. **Specify species for networks >10 proteins**: Required for accurate results
4. **Respect rate limits**: Wait 1 second between API calls
5. **Use versioned URLs for reproducibility**: Available in reference documentation
6. **Handle errors gracefully**: Check for "Error:" prefix in returned strings
7. **Choose appropriate confidence thresholds**: Match threshold to analysis goals

## Detailed Reference

For comprehensive API documentation, complete parameter lists, output formats, and advanced usage, refer to `references/string_reference.md`. This includes:

- Complete API endpoint specifications
- All supported output formats (TSV, JSON, XML, PSI-MI)
- Advanced features (bulk upload, values/ranks enrichment)
- Error handling and troubleshooting
- Integration with other tools (Cytoscape, R, Python libraries)
- Data license and citation information

## Troubleshooting

**No proteins found**:
- Verify species parameter matches identifiers
- Try mapping identifiers first with `string_map_ids()`
- Check for typos in protein names

**Empty network results**:
- Lower confidence threshold (`required_score`)
- Check if proteins actually interact
- Verify species is correct

**Timeout or slow queries**:
- Reduce number of input proteins
- Use STRING IDs instead of gene names
- Split large queries into batches

**"Species required" error**:
- Add `species` parameter for networks with >10 proteins
- Always include species for consistency

**Results look unexpected**:
- Check STRING version with `string_version()`
- Verify network_type is appropriate (functional vs physical)
- Review confidence threshold selection

## Additional Resources

For proteome-scale analysis or complete species network upload:
- Visit https://string-db.org
- Use "Upload proteome" feature
- STRING will generate complete interaction network and predict functions

For bulk downloads of complete datasets:
- Download page: https://string-db.org/cgi/download
- Includes complete interaction files, protein annotations, and pathway mappings

## Data License

STRING data is freely available under **Creative Commons BY 4.0** license:
- Free for academic and commercial use
- Attribution required when publishing
- Cite latest STRING publication

## Citation

When using STRING in publications, cite the most recent publication from: https://string-db.org/cgi/about
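
## Example: Batched, Rate-Limited Calls

The rate-limit, error-handling, and batching advice from the API Best Practices and Troubleshooting sections can be combined in one small wrapper. The sketch below is illustrative only: `batched`, `run_in_batches`, and `StringApiError` are hypothetical names that are not part of `scripts/string_api.py`, and the default batch size of 100 is an arbitrary assumption. It assumes that each helper takes the protein list as its first argument and reports failures as a string starting with "Error:", as described above.

```python
import time
from typing import Any, Callable, Iterable, Iterator, List


class StringApiError(RuntimeError):
    """Hypothetical error type for results that start with 'Error:'."""


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` holding at most `size` entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_in_batches(func: Callable[..., Any],
                   proteins: Iterable[str],
                   batch_size: int = 100,   # assumed default, tune as needed
                   delay: float = 1.0,      # 1 second between API calls
                   sleep: Callable[[float], Any] = time.sleep,
                   **kwargs: Any) -> List[Any]:
    """Call `func(batch, **kwargs)` once per batch, pausing between calls.

    Raises StringApiError as soon as a call returns an 'Error:' string.
    """
    results = []
    for index, batch in enumerate(batched(list(proteins), batch_size)):
        if index > 0:
            sleep(delay)  # pause only between consecutive calls
        result = func(batch, **kwargs)
        if isinstance(result, str) and result.startswith("Error:"):
            raise StringApiError(result)
        results.append(result)
    return results
```

A call such as `run_in_batches(string_map_ids, gene_list, species=9606)` would then return one result per batch. Batching suits per-protein operations such as identifier mapping. Network retrieval and enrichment depend on the full protein set, so splitting the input changes their results.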