--- name: chembl-database-bioactivity description: Query ChEMBL via Python SDK. Search compounds by structure/properties, retrieve bioactivity (IC50, Ki, EC50), find target inhibitors, run SAR, access drug mechanism/indication data. license: CC-BY-SA-3.0 --- # ChEMBL Database — Bioactivity Queries ## Overview Query the ChEMBL bioactive molecule database (2M+ compounds, 19M+ bioactivity measurements, 13K+ targets) using the `chembl_webresource_client` Python SDK. Covers compound search, target lookup, bioactivity retrieval, structure-based search, and drug information access. ## When to Use - Finding compounds by name, ChEMBL ID, or physicochemical properties - Querying bioactivity data (IC50, Ki, EC50) for specific targets - Performing similarity or substructure searches using SMILES - Retrieving drug mechanisms of action and clinical indications - Identifying inhibitors, agonists, or bioactive molecules for a target - Analyzing structure-activity relationships (SAR) across compound series - Filtering molecules by Lipinski rule-of-5 or other drug-likeness criteria - For general cheminformatics (SMILES manipulation, fingerprints, descriptors) use rdkit-cheminformatics instead ## Prerequisites ```bash uv pip install chembl_webresource_client # Optional: pandas for tabular analysis uv pip install pandas ``` **Rate limiting**: The SDK handles rate limiting internally via automatic retries and caching. No `time.sleep()` needed between queries. For large-scale data retrieval (100K+ records), consider ChEMBL bulk downloads instead of API queries. ## Quick Start ```python from chembl_webresource_client.new_client import new_client # Each entity type has its own client endpoint molecule = new_client.molecule target = new_client.target activity = new_client.activity # Retrieve a molecule by ChEMBL ID aspirin = molecule.get('CHEMBL25') print(f"{aspirin['pref_name']}: MW={aspirin['molecule_properties']['mw_freebase']}") # Search targets by name egfr_targets = target.filter(pref_name__icontains='EGFR', target_type='SINGLE PROTEIN') print(f"Found {len(list(egfr_targets))} EGFR-related targets") # Query bioactivities with filters potent = activity.filter( target_chembl_id='CHEMBL203', # EGFR standard_type='IC50', standard_value__lte=100, # <= 100 nM standard_units='nM' ) ``` ## Key Concepts ### Filter Operators ChEMBL uses Django-style query filters on all endpoints: | Operator | Meaning | Example | |----------|---------|---------| | `__exact` | Exact match (default) | `target_type__exact='SINGLE PROTEIN'` | | `__iexact` | Case-insensitive exact | `pref_name__iexact='aspirin'` | | `__contains` | Substring match | `pref_name__contains='kinase'` | | `__icontains` | Case-insensitive substring | `pref_name__icontains='egfr'` | | `__startswith` | Prefix match | `pref_name__startswith='Epi'` | | `__endswith` | Suffix match | `pref_name__endswith='nib'` | | `__gt` / `__gte` | Greater than (or equal) | `standard_value__gte=10` | | `__lt` / `__lte` | Less than (or equal) | `standard_value__lte=100` | | `__range` | Value in range | `mw_freebase__range=[300, 500]` | | `__in` | Value in list | `target_chembl_id__in=['CHEMBL203', 'CHEMBL240']` | | `__isnull` | Null check | `pchembl_value__isnull=False` | | `__regex` | Regular expression | `pref_name__regex='^EGF.*kinase$'` | | `__search` | Full-text search | `description__search='apoptosis'` | ### Core Endpoints | Endpoint | Access | Description | |----------|--------|-------------| | `molecule` | `new_client.molecule` | Compound structures, properties, synonyms | | `target` | `new_client.target` | Protein and non-protein biological targets | | `activity` | `new_client.activity` | Bioassay measurement results | | `assay` | `new_client.assay` | Experimental assay details | | `drug` | `new_client.drug` | Approved pharmaceutical information | | `mechanism` | `new_client.mechanism` | Drug mechanism of action data | | `drug_indication` | `new_client.drug_indication` | Drug therapeutic indications | | `similarity` | `new_client.similarity` | Tanimoto similarity search | | `substructure` | `new_client.substructure` | Substructure search | | `image` | `new_client.image` | SVG molecular structure images | | `molecule_form` | `new_client.molecule_form` | Parent/salt forms | | `protein_class` | `new_client.protein_class` | Protein classification hierarchy | | `target_component` | `new_client.target_component` | Target component details | | `cell_line` | `new_client.cell_line` | Cell line information | | `tissue` | `new_client.tissue` | Tissue type information | | `compound_structural_alert` | `new_client.compound_structural_alert` | Structural alerts for toxicity | | `document` | `new_client.document` | Literature source references | ### Molecular Properties Properties accessible via `molecule['molecule_properties']`: | Field | Description | |-------|-------------| | `mw_freebase` | Molecular weight (free base) | | `full_mwt` | Full molecular weight (including salts) | | `alogp` | Calculated LogP | | `hba` | Hydrogen bond acceptors | | `hbd` | Hydrogen bond donors | | `psa` | Polar surface area | | `rtb` | Rotatable bonds | | `num_ro5_violations` | Lipinski rule-of-5 violations | | `ro3_pass` | Rule of 3 compliance | | `cx_most_apka` | Most acidic pKa | | `cx_most_bpka` | Most basic pKa | ### Target Information Fields Key fields in target records: | Field | Description | |-------|-------------| | `target_chembl_id` | ChEMBL target identifier | | `pref_name` | Preferred target name | | `target_type` | Type: SINGLE PROTEIN, PROTEIN COMPLEX, ORGANISM | | `organism` | Target organism (e.g., Homo sapiens) | | `tax_id` | NCBI taxonomy ID | | `target_components` | Component details (UniProt accession, etc.) | ### Bioactivity Data Fields Key fields in activity records: | Field | Description | |-------|-------------| | `standard_type` | Activity type: IC50, Ki, Kd, EC50, etc. | | `standard_value` | Numerical activity value | | `standard_units` | Units: nM, uM, etc. | | `pchembl_value` | Normalized activity (-log10 scale, comparable across types) | | `activity_comment` | Activity annotations | | `data_validity_comment` | Data quality flags (check before analysis) | | `potential_duplicate` | Duplicate flag | ## Core API ### 1. Molecule Queries ```python molecule = new_client.molecule # By ChEMBL ID aspirin = molecule.get('CHEMBL25') # By name (case-insensitive) results = molecule.filter(pref_name__icontains='imatinib') # By properties (Lipinski rule-of-5 compliant) drug_like = molecule.filter( molecule_properties__mw_freebase__lte=500, molecule_properties__alogp__lte=5, molecule_properties__hba__lte=10, molecule_properties__hbd__lte=5 ) # By property range mid_weight = molecule.filter( molecule_properties__mw_freebase__range=[300, 500] ) ``` ### 2. Target Queries ```python target = new_client.target # By ChEMBL ID egfr = target.get('CHEMBL203') print(f"{egfr['pref_name']} ({egfr['organism']})") # Search by name and type kinases = target.filter( target_type='SINGLE PROTEIN', pref_name__icontains='kinase' ) # By organism human_targets = target.filter(organism='Homo sapiens') ``` ### 3. Bioactivity Data ```python activity = new_client.activity # Potent inhibitors for a target potent = activity.filter( target_chembl_id='CHEMBL203', standard_type='IC50', standard_value__lte=100, standard_units='nM' ) # All activities for a compound (with pChEMBL values) compound_acts = activity.filter( molecule_chembl_id='CHEMBL25', pchembl_value__isnull=False ) # Multiple activity types ki_data = activity.filter( target_chembl_id='CHEMBL240', standard_type__in=['IC50', 'Ki', 'Kd'] ) ``` ### 4. Structure-Based Search ```python # Similarity search (Tanimoto) similarity = new_client.similarity similar = similarity.filter( smiles='CC(=O)Oc1ccccc1C(=O)O', # aspirin similarity=85 # >=85% similarity ) # Substructure search substructure = new_client.substructure benzimidazoles = substructure.filter(smiles='c1ccc2[nH]cnc2c1') ``` ### 5. Drug and Mechanism Data ```python drug = new_client.drug mechanism = new_client.mechanism drug_indication = new_client.drug_indication # Drug details drug_info = drug.get('CHEMBL25') # Mechanisms of action mechs = mechanism.filter(molecule_chembl_id='CHEMBL941') for m in mechs: print(f"{m['mechanism_of_action']} → {m.get('target_chembl_id')}") # Therapeutic indications indications = drug_indication.filter(molecule_chembl_id='CHEMBL941') for ind in indications: print(f"{ind.get('mesh_heading')} (Phase {ind.get('max_phase_for_ind')})") # SVG molecular image image = new_client.image svg_data = image.get('CHEMBL25') with open('aspirin.svg', 'w') as f: f.write(svg_data) ``` ## Common Workflows ### Workflow 1: Find Inhibitors for a Target ```python from chembl_webresource_client.new_client import new_client import pandas as pd # Step 1: Identify the target targets = new_client.target.filter(pref_name__icontains='BRAF', target_type='SINGLE PROTEIN') target_id = list(targets)[0]['target_chembl_id'] # Step 2: Query potent activities activities = new_client.activity.filter( target_chembl_id=target_id, standard_type='IC50', standard_value__lte=100, standard_units='nM', pchembl_value__isnull=False ) # Step 3: Convert to DataFrame for analysis df = pd.DataFrame(list(activities)) df['standard_value'] = pd.to_numeric(df['standard_value']) print(f"Found {len(df)} potent compounds") print(df[['molecule_chembl_id', 'standard_value', 'pchembl_value']].head(10)) ``` ### Workflow 2: Analyze a Known Drug ```python from chembl_webresource_client.new_client import new_client chembl_id = 'CHEMBL941' # Imatinib # Drug information drug_info = new_client.molecule.get(chembl_id) print(f"Name: {drug_info['pref_name']}") print(f"MW: {drug_info['molecule_properties']['mw_freebase']}") # Mechanisms of action mechs = list(new_client.mechanism.filter(molecule_chembl_id=chembl_id)) for m in mechs: print(f"Mechanism: {m['mechanism_of_action']}") # Indications indications = list(new_client.drug_indication.filter(molecule_chembl_id=chembl_id)) for ind in indications: print(f"Indication: {ind.get('mesh_heading')} (Phase {ind.get('max_phase_for_ind')})") # All bioactivity data activities = list(new_client.activity.filter( molecule_chembl_id=chembl_id, pchembl_value__isnull=False )) print(f"Total bioactivity records: {len(activities)}") ``` ### Workflow 3: SAR Study ```python from chembl_webresource_client.new_client import new_client import pandas as pd # Step 1: Find similar compounds to lead similar = new_client.similarity.filter( smiles='c1ccc2c(c1)cc(nc2N)c3ccc(cc3)NC(=O)c4ccccc4', # lead compound similarity=80 ) analogs = list(similar) # Step 2: Collect activities for each analog records = [] for compound in analogs[:20]: # limit for demo cid = compound['molecule_chembl_id'] acts = list(new_client.activity.filter( molecule_chembl_id=cid, standard_type='IC50', pchembl_value__isnull=False )) for act in acts: records.append({ 'chembl_id': cid, 'target': act.get('target_pref_name'), 'IC50_nM': act.get('standard_value'), 'pchembl': act.get('pchembl_value'), 'mw': compound.get('molecule_properties', {}).get('mw_freebase'), 'alogp': compound.get('molecule_properties', {}).get('alogp') }) # Step 3: Analyze property-activity relationships df = pd.DataFrame(records) if not df.empty: df['IC50_nM'] = pd.to_numeric(df['IC50_nM']) print(df.groupby('target')['IC50_nM'].describe()) ``` ## Common Recipes ### Recipe: Virtual Screening Filter (Lipinski + Activity) ```python from chembl_webresource_client.new_client import new_client candidates = new_client.molecule.filter( molecule_properties__mw_freebase__range=[300, 500], molecule_properties__alogp__lte=5, molecule_properties__hba__lte=10, molecule_properties__hbd__lte=5, molecule_properties__num_ro5_violations=0 ) print(f"Drug-like candidates: {len(list(candidates))}") ``` ### Recipe: Client Configuration ```python from chembl_webresource_client.settings import Settings Settings.Instance().CACHING = True # enable/disable cache Settings.Instance().CACHE_EXPIRE = 86400 # cache duration (seconds) Settings.Instance().TIMEOUT = 30 # request timeout (seconds) Settings.Instance().TOTAL_RETRIES = 3 # retry count on failure ``` ### Recipe: Export Activities to CSV ```python import pandas as pd from chembl_webresource_client.new_client import new_client activities = new_client.activity.filter( target_chembl_id='CHEMBL203', standard_type='IC50', pchembl_value__isnull=False ) df = pd.DataFrame(list(activities)) df.to_csv('egfr_activities.csv', index=False) print(f"Exported {len(df)} records") ``` ## Key Parameters | Parameter | Endpoint | Default | Description | |-----------|----------|---------|-------------| | `similarity` | `similarity.filter()` | — | Tanimoto threshold (0-100), typically 70-90 | | `standard_type` | `activity.filter()` | — | Activity type: IC50, Ki, Kd, EC50 | | `standard_value__lte` | `activity.filter()` | — | Max activity value (nM) | | `pchembl_value` | `activity.filter()` | — | Normalized -log10 activity (>6 = potent) | | `target_type` | `target.filter()` | — | SINGLE PROTEIN, PROTEIN COMPLEX, ORGANISM | | `CACHING` | `Settings` | `True` | Enable HTTP response caching | | `CACHE_EXPIRE` | `Settings` | `86400` | Cache TTL in seconds | | `TIMEOUT` | `Settings` | `30` | HTTP request timeout in seconds | | `TOTAL_RETRIES` | `Settings` | `3` | Auto-retry count on failure | ## Troubleshooting | Problem | Cause | Solution | |---------|-------|----------| | Empty results from filter | No matches or too strict filters | Relax filters; verify IDs exist with `.get()` first | | `KeyError` on molecule properties | Not all molecules have full property data | Use `.get('molecule_properties', {}).get('field')` | | Query returns unexpectedly few results | Lazy evaluation not consumed | Convert to `list()` before checking length | | Slow queries | Large result sets paginated automatically | Add more filters to narrow results; use `__range` | | `404` on `.get()` | Invalid ChEMBL ID | Verify ID format (e.g., CHEMBL25, not 25) | | Stale data | Aggressive caching | Set `Settings.Instance().CACHING = False` or clear cache | | Timeout errors | Server overload or large query | Increase `TIMEOUT`; split into smaller queries | | Mixed units in activity data | Different assays use different units | Filter by `standard_units='nM'` or use `pchembl_value` | | Duplicate activity records | Same measurement from different sources | Check `potential_duplicate` and `data_validity_comment` | ## Best Practices - **Use `pchembl_value`** for cross-study comparisons — it normalizes IC50/Ki/EC50 to a comparable -log10 scale - **Always check `data_validity_comment`** before using activity values — flags data quality issues - **Filter by `standard_units`** to ensure consistent units across results - **Pagination is automatic**: the SDK handles pagination transparently — iterate directly over query results without manual page handling. Convert to `list()` only when you need all results in memory - **Use lazy evaluation**: queries execute only when iterated — convert to `list()` only when needed - **Cache results**: the SDK caches for 24h by default — leverage this for repeated queries - **For bulk data** (>100K records): use ChEMBL FTP downloads rather than API queries ## Related Skills - `rdkit-cheminformatics` — SMILES manipulation, molecular descriptors, fingerprints - `datamol-cheminformatics` — Molecular preprocessing and featurization - `pubchem-compound-search` — Alternative compound database (NIH) ## References - ChEMBL website: https://www.ebi.ac.uk/chembl/ - API documentation: https://www.ebi.ac.uk/chembl/api/data/docs - Python client: https://github.com/chembl/chembl_webresource_client - Interface docs: https://chembl.gitbook.io/chembl-interface-documentation/ - Example notebooks: https://github.com/chembl/notebooks ## Bundled Resources **Self-contained entry** (no `references/` directory). Original total: 662 lines (SKILL.md 389 + api_reference.md 273). Scripts: 279 lines (example_queries.py). **Original file disposition**: - `SKILL.md` (389 lines) → Core API, Workflows, Quick Start. "Common Use Cases" consolidated (rule 7b): Find Kinase Inhibitors → Workflow 1 pattern, Virtual Screening → Recipe, Drug Repurposing → omitted (trivial loop over drug endpoint, not a distinct analytical workflow). "Important Notes" section routed to Best Practices and Troubleshooting (rule 9) - `references/api_reference.md` (273 lines) → Consolidated inline. Filter Operators → Key Concepts table. Core Endpoints listing → Key Concepts table (all endpoints). Molecular Properties → Key Concepts table. Bioactivity Data Fields → Key Concepts table. Target Information Fields → Key Concepts table. Configuration/Settings → Common Recipes. Error handling/rate limiting → Troubleshooting + Best Practices. Response formats (JSON/XML/YAML) → omitted (JSON is default and only format used via Python SDK). Advanced query examples already covered in Core API - `scripts/example_queries.py` (279 lines) → Thin-wrapper functions absorbed into Core API modules: `get_molecule_info` / `search_molecules_by_name` / `find_molecules_by_properties` → Module 1 (Molecule Queries); `get_target_info` / `search_targets_by_name` → Module 2 (Target Queries); `get_bioactivity_data` / `get_compound_bioactivities` → Module 3 (Bioactivity Data); `find_similar_compounds` / `substructure_search` → Module 4 (Structure-Based Search); `get_drug_info` → Module 5 (Drug and Mechanism Data); `find_kinase_inhibitors` → Workflow 1; `export_to_dataframe` → Workflow 1 + Recipe (Export) **Retention**: ~460 lines / 662 original = ~69%. Vendor metadata stripped (rule 13). Agent-behavior section stripped (rule 4).