---
name: "pubchem-compound-search"
description: "Query PubChem (110M+ compounds) via PubChemPy/PUG-REST. Search by name/CID/SMILES, get properties (MW, LogP, TPSA), similarity/substructure search, bioactivity. For local cheminformatics use rdkit; for multi-DB queries use bioservices."
license: "CC-BY-4.0"
---

# PubChem Compound Search

## Overview

PubChem is the world's largest freely available chemical database with 110M+ compounds. This skill covers searching compounds by name, structure, or identifier, retrieving molecular properties, performing similarity/substructure searches, and accessing bioactivity data through PubChemPy (Python wrapper) and PUG-REST API (direct HTTP).

## When to Use

- Looking up a compound by name, CAS number, or SMILES to get its PubChem CID and properties
- Retrieving molecular properties (molecular weight, LogP, TPSA, H-bond counts) for known compounds
- Finding structurally similar compounds via Tanimoto similarity search
- Searching for compounds containing a specific substructure (pharmacophore screening)
- Converting between chemical identifier formats (name ↔ CID ↔ SMILES ↔ InChI)
- Accessing bioactivity screening data (assay results, active/inactive status)
- Batch property comparison across a set of drug candidates
- For local molecular computation (fingerprints, descriptors, 3D conformers), use `rdkit` instead
- For querying multiple databases (UniProt, KEGG, ChEMBL) in one workflow, use `bioservices` instead

## Prerequisites

- **Python packages**: `pubchempy`, `requests` (for direct API), `pandas` (for batch processing)
- **No API key required**: PubChem is freely accessible
- **Rate limits**: Max 5 requests/second, 400 requests/minute

```bash
pip install pubchempy requests pandas
```

## Quick Start

```python
import pubchempy as pcp

# Search by name → get properties
compound = pcp.get_compounds("aspirin", "name")[0]
print(f"CID: {compound.cid}")
print(f"SMILES: {compound.canonical_smiles}")
print(f"MW: {compound.molecular_weight}, LogP: {compound.xlogp}")
print(f"HBD: {compound.h_bond_donor_count}, HBA: {compound.h_bond_acceptor_count}")
```

## Workflow

### Step 1: Compound Search

Search by name, CID, SMILES, InChI, or molecular formula.

```python
import pubchempy as pcp

# By name
compounds = pcp.get_compounds("caffeine", "name")
print(f"Found {len(compounds)} compounds for 'caffeine'")

# By CID (fastest)
compound = pcp.Compound.from_cid(2244)  # Aspirin
print(f"CID 2244 = {compound.iupac_name}")

# By SMILES
compound = pcp.get_compounds("CC(=O)OC1=CC=CC=C1C(=O)O", "smiles")[0]
print(f"SMILES lookup: CID {compound.cid}")

# By molecular formula (returns all matches)
formula_matches = pcp.get_compounds("C9H8O4", "formula")
print(f"Formula C9H8O4 matches: {len(formula_matches)} compounds")
```

### Step 2: Property Retrieval

Get molecular properties for one or more compounds.

```python
import pubchempy as pcp

# Full compound object
compound = pcp.get_compounds("ibuprofen", "name")[0]
print(f"MW: {compound.molecular_weight}")
print(f"LogP: {compound.xlogp}")
print(f"TPSA: {compound.tpsa}")
print(f"Rotatable bonds: {compound.rotatable_bond_count}")

# Selective property retrieval (more efficient for specific needs)
props = pcp.get_properties(
    ["MolecularWeight", "XLogP", "TPSA", "HBondDonorCount"],
    "aspirin", "name"
)
print(props)  # List of dicts
```

### Step 3: Similarity Search

Find structurally similar compounds using Tanimoto coefficient.

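
For intuition, the Tanimoto coefficient between two fingerprints is the number of on-bits they share divided by the number of on-bits set in either. The sketch below computes it on plain Python sets. The bit positions are made up for illustration and are not real PubChem fingerprints, and `tanimoto` is a hypothetical helper, not part of PubChemPy.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) coefficient between two sets of on-bit positions."""
    union = bits_a | bits_b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / len(union)


# Hypothetical on-bit positions, for illustration only
fp_query = {1, 4, 7, 9, 12}
fp_candidate = {1, 4, 7, 9, 15}

score = tanimoto(fp_query, fp_candidate)  # 4 shared bits / 6 bits in the union
print(f"Tanimoto: {score:.2f}")
print(f"Passes Threshold=85? {score * 100 >= 85}")
```

PubChem evaluates the coefficient server-side on its own 2D fingerprints, and `Threshold` expresses the cutoff as a percentage. The search itself looks like this:
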
```python
import pubchempy as pcp

# Get reference compound SMILES
ref = pcp.get_compounds("gefitinib", "name")[0]

# Similarity search (may take 15-30s for async processing)
similar = pcp.get_compounds(
    ref.canonical_smiles, "smiles",
    searchtype="similarity",
    Threshold=85,  # Tanimoto threshold (0-100)
    MaxRecords=50
)
print(f"Found {len(similar)} compounds with ≥85% similarity to gefitinib")
for comp in similar[:5]:
    print(f"  CID {comp.cid}: MW={comp.molecular_weight}")
```

### Step 4: Substructure Search

Find compounds containing a specific structural motif.

```python
import pubchempy as pcp

# Search for sulfonamide-containing compounds
hits = pcp.get_compounds(
    "S(=O)(=O)N", "smiles",
    searchtype="substructure",
    MaxRecords=100
)
print(f"Found {len(hits)} compounds with sulfonamide group")
```

### Step 5: Bioactivity Data Access

Retrieve biological screening results via PUG-REST API.

```python
import requests

cid = 2244  # Aspirin
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/assaysummary/JSON"
response = requests.get(url)
if response.status_code == 200:
    data = response.json()
    rows = data.get("Table", {}).get("Row", [])
    print(f"Aspirin has {len(rows)} bioassay records")
```

### Step 6: Batch Property Comparison

Compare properties across multiple compounds.

```python
import pubchempy as pcp
import pandas as pd
import time

compounds = ["aspirin", "ibuprofen", "naproxen", "celecoxib"]
results = []
for name in compounds:
    comp = pcp.get_compounds(name, "name")[0]
    results.append({
        "Name": name,
        "CID": comp.cid,
        "MW": comp.molecular_weight,
        "LogP": comp.xlogp,
        "TPSA": comp.tpsa,
        "HBD": comp.h_bond_donor_count,
        "HBA": comp.h_bond_acceptor_count,
    })
    time.sleep(0.25)  # Respect rate limits

df = pd.DataFrame(results)
print(df.to_string(index=False))
```

### Step 7: Identifier Format Conversion

Convert between chemical identifier formats.

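
Conversion can go through PUG-REST directly or through PubChemPy. For the direct route, the sketch below only builds the property URL for a name or CID and sends no request. `conversion_url` is a hypothetical helper introduced for illustration; the property names (`InChI`, `InChIKey`, `IUPACName`, `MolecularFormula`) are PUG-REST property names.

```python
from urllib.parse import quote

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def conversion_url(identifier, namespace="name",
                   properties=("InChIKey", "MolecularFormula")):
    """Build a PUG-REST property URL (hypothetical helper; no request is sent)."""
    ident = quote(str(identifier), safe="")  # percent-encode spaces and the like
    prop_list = ",".join(properties)
    return f"{PUG_REST}/compound/{namespace}/{ident}/property/{prop_list}/JSON"


url = conversion_url("caffeine")
print(url)
# To fetch: requests.get(url, timeout=30).json()["PropertyTable"]["Properties"]
```

Requesting only the needed properties avoids pulling the full compound record. The PubChemPy route reads the same identifiers as attributes of a `Compound` object:
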
```python
import pubchempy as pcp

compound = pcp.get_compounds("caffeine", "name")[0]
print(f"CID: {compound.cid}")
print(f"IUPAC: {compound.iupac_name}")
print(f"SMILES: {compound.canonical_smiles}")
print(f"InChI: {compound.inchi}")
print(f"InChIKey: {compound.inchikey}")
print(f"Formula: {compound.molecular_formula}")

# Download structure files; argument order is (outformat, path, identifier, namespace)
pcp.download("SDF", "caffeine.sdf", "caffeine", "name", overwrite=True)
print("Downloaded caffeine.sdf")
```

## Key Parameters

| Parameter | Function | Default | Range / Options | Effect |
|-----------|----------|---------|-----------------|--------|
| `namespace` | `get_compounds` | `"cid"` | `"name"`, `"cid"`, `"smiles"`, `"inchi"`, `"inchikey"`, `"formula"` | Identifier type for search |
| `searchtype` | `get_compounds` | `None` | `"similarity"`, `"substructure"` | Type of structure search |
| `Threshold` | similarity search | `90` | `0`-`100` | Tanimoto similarity cutoff (%) |
| `MaxRecords` | structure search | `None` | `1`-`10000` | Maximum results returned |
| `properties` | `get_properties` | required | See API reference | Which molecular properties to retrieve |
| `record_type` | `download` | `"2d"` | `"2d"`, `"3d"` | Structure dimensionality |

## Common Recipes

### Recipe: Drug-Likeness Screening (Lipinski's Rule of Five)

When to use: Quick heuristic check of whether a compound is likely to be orally bioavailable.

```python
import pubchempy as pcp

def check_lipinski(name):
    comp = pcp.get_compounds(name, "name")[0]
    rules = {
        # float() guards against PubChemPy versions that return MW as a string
        "MW ≤ 500": float(comp.molecular_weight) <= 500,
        "LogP ≤ 5": (comp.xlogp or 0) <= 5,  # a missing XLogP is treated as 0
        "HBD ≤ 5": comp.h_bond_donor_count <= 5,
        "HBA ≤ 10": comp.h_bond_acceptor_count <= 10,
    }
    violations = sum(1 for v in rules.values() if not v)
    return rules, violations

rules, v = check_lipinski("metformin")
print(f"Violations: {v}/4 - {'PASS' if v <= 1 else 'FAIL'}")
for rule, passed in rules.items():
    print(f"  {'✓' if passed else '✗'} {rule}")
```

### Recipe: Get All Synonyms for a Compound

When to use: Finding alternative names, trade names, or CAS numbers.

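
Synonym lists mix trade names, systematic names, and registry numbers. CAS numbers can be picked out by their digit-group shape and check digit. The sketch below does that on a hand-written sample list, not real API output; `is_cas_number` and `extract_cas` are hypothetical helpers, not part of PubChemPy.

```python
import re

# CAS format: 2-7 digits, 2 digits, 1 check digit, separated by hyphens
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_cas_number(text):
    """Return True if text has the CAS format and a valid check digit."""
    match = CAS_PATTERN.match(text.strip())
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weight digits 1, 2, 3, ... from right to left; the check digit is the sum mod 10
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def extract_cas(synonyms):
    """Keep only the entries that look like valid CAS numbers."""
    return [s for s in synonyms if is_cas_number(s)]


# Hand-written sample for illustration; "50-78-3" has a wrong check digit
sample = ["aspirin", "Acetylsalicylic acid", "50-78-2", "50-78-3", "Ecotrin"]
print(extract_cas(sample))  # → ['50-78-2']
```

The synonym list itself comes from `get_synonyms`:
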
```python
import pubchempy as pcp

synonyms = pcp.get_synonyms("aspirin", "name")
if synonyms:
    names = synonyms[0]["Synonym"]
    print(f"Found {len(names)} synonyms for aspirin:")
    for name in names[:10]:
        print(f"  {name}")
```

### Recipe: Download 2D Structure Image

When to use: Generating structure images for reports or presentations.

```python
import requests

cid = 2519  # Caffeine
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/PNG?image_size=large"
response = requests.get(url)
response.raise_for_status()  # avoid writing an error page to the PNG file
with open("caffeine_structure.png", "wb") as f:
    f.write(response.content)
print("Saved caffeine_structure.png")
```

## Expected Outputs

- **Compound search**: `pubchempy.Compound` objects with properties (CID, name, SMILES, MW, etc.)
- **Property retrieval**: List of dictionaries with requested properties
- **Similarity search**: List of `Compound` objects at or above the Tanimoto threshold
- **Bioactivity query**: JSON with assay results (activity outcome, assay ID, target)
- **Structure download**: SDF, JSON, or PNG files

## Troubleshooting

| Problem | Cause | Solution |
|---------|-------|----------|
| `IndexError: list index out of range` | No compounds found for query | Check spelling; try alternative names or CID |
| Request timeout (>30s) | Large similarity/substructure search | Reduce `MaxRecords`; PubChemPy handles async polling automatically |
| Empty property values (`None`) | Property not available for this compound | Check if property exists before use: `if comp.xlogp is not None` |
| `HTTP 503 Service Unavailable` | Rate limit exceeded | Add `time.sleep(0.25)` between requests; max 5 req/sec |
| `BadRequestError` | Invalid SMILES or identifier | Validate SMILES syntax; use canonical SMILES from RDKit |
| Formula search returns too many hits | Common formula shared by many isomers | Use SMILES or InChI for more specific searches |
| Bioactivity API returns empty | Compound has no bioassay data | Not all compounds have been tested; check PubChem web interface |

## References

- [PubChem PUG-REST API](https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest): official REST API docs
- [PubChemPy documentation](https://pubchempy.readthedocs.io/): Python wrapper docs
- [PubChem PUG-REST tutorial](https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest-tutorial): step-by-step guide
- [PubChem database](https://pubchem.ncbi.nlm.nih.gov/): web interface