--- name: rdkit description: "Cheminformatics toolkit for fine-grained molecular control. SMILES/SDF parsing, descriptors (MW, LogP, TPSA), fingerprints, substructure search, 2D/3D generation, similarity, reactions. For standard workflows with simpler interface, use datamol (wrapper around RDKit). Use rdkit for advanced control, custom sanitization, specialized algorithms." --- # RDKit Cheminformatics Toolkit ## Overview RDKit is a comprehensive cheminformatics library providing Python APIs for molecular analysis and manipulation. This skill provides guidance for reading/writing molecular structures, calculating descriptors, fingerprinting, substructure searching, chemical reactions, 2D/3D coordinate generation, and molecular visualization. Use this skill for drug discovery, computational chemistry, and cheminformatics research tasks. ## Core Capabilities ### 1. Molecular I/O and Creation **Reading Molecules:** Read molecular structures from various formats: ```python from rdkit import Chem # From SMILES strings mol = Chem.MolFromSmiles('Cc1ccccc1') # Returns Mol object or None # From MOL files mol = Chem.MolFromMolFile('path/to/file.mol') # From MOL blocks (string data) mol = Chem.MolFromMolBlock(mol_block_string) # From InChI mol = Chem.MolFromInchi('InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H') ``` **Writing Molecules:** Convert molecules to text representations: ```python # To canonical SMILES smiles = Chem.MolToSmiles(mol) # To MOL block mol_block = Chem.MolToMolBlock(mol) # To InChI inchi = Chem.MolToInchi(mol) ``` **Batch Processing:** For processing multiple molecules, use Supplier/Writer objects: ```python # Read SDF files suppl = Chem.SDMolSupplier('molecules.sdf') for mol in suppl: if mol is not None: # Check for parsing errors # Process molecule pass # Read SMILES files suppl = Chem.SmilesMolSupplier('molecules.smi', titleLine=False) # For large files or compressed data with gzip.open('molecules.sdf.gz') as f: suppl = Chem.ForwardSDMolSupplier(f) for mol in suppl: # Process molecule pass # Multithreaded processing for large datasets suppl = Chem.MultithreadedSDMolSupplier('molecules.sdf') # Write molecules to SDF writer = Chem.SDWriter('output.sdf') for mol in molecules: writer.write(mol) writer.close() ``` **Important Notes:** - All `MolFrom*` functions return `None` on failure with error messages - Always check for `None` before processing molecules - Molecules are automatically sanitized on import (validates valence, perceives aromaticity) ### 2. Molecular Sanitization and Validation RDKit automatically sanitizes molecules during parsing, executing 13 steps including valence checking, aromaticity perception, and chirality assignment. **Sanitization Control:** ```python # Disable automatic sanitization mol = Chem.MolFromSmiles('C1=CC=CC=C1', sanitize=False) # Manual sanitization Chem.SanitizeMol(mol) # Detect problems before sanitization problems = Chem.DetectChemistryProblems(mol) for problem in problems: print(problem.GetType(), problem.Message()) # Partial sanitization (skip specific steps) from rdkit.Chem import rdMolStandardize Chem.SanitizeMol(mol, sanitizeOps=Chem.SANITIZE_ALL ^ Chem.SANITIZE_PROPERTIES) ``` **Common Sanitization Issues:** - Atoms with explicit valence exceeding maximum allowed will raise exceptions - Invalid aromatic rings will cause kekulization errors - Radical electrons may not be properly assigned without explicit specification ### 3. Molecular Analysis and Properties **Accessing Molecular Structure:** ```python # Iterate atoms and bonds for atom in mol.GetAtoms(): print(atom.GetSymbol(), atom.GetIdx(), atom.GetDegree()) for bond in mol.GetBonds(): print(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), bond.GetBondType()) # Ring information ring_info = mol.GetRingInfo() ring_info.NumRings() ring_info.AtomRings() # Returns tuples of atom indices # Check if atom is in ring atom = mol.GetAtomWithIdx(0) atom.IsInRing() atom.IsInRingSize(6) # Check for 6-membered rings # Find smallest set of smallest rings (SSSR) from rdkit.Chem import GetSymmSSSR rings = GetSymmSSSR(mol) ``` **Stereochemistry:** ```python # Find chiral centers from rdkit.Chem import FindMolChiralCenters chiral_centers = FindMolChiralCenters(mol, includeUnassigned=True) # Returns list of (atom_idx, chirality) tuples # Assign stereochemistry from 3D coordinates from rdkit.Chem import AssignStereochemistryFrom3D AssignStereochemistryFrom3D(mol) # Check bond stereochemistry bond = mol.GetBondWithIdx(0) stereo = bond.GetStereo() # STEREONONE, STEREOZ, STEREOE, etc. ``` **Fragment Analysis:** ```python # Get disconnected fragments frags = Chem.GetMolFrags(mol, asMols=True) # Fragment on specific bonds from rdkit.Chem import FragmentOnBonds frag_mol = FragmentOnBonds(mol, [bond_idx1, bond_idx2]) # Count ring systems from rdkit.Chem.Scaffolds import MurckoScaffold scaffold = MurckoScaffold.GetScaffoldForMol(mol) ``` ### 4. Molecular Descriptors and Properties **Basic Descriptors:** ```python from rdkit.Chem import Descriptors # Molecular weight mw = Descriptors.MolWt(mol) exact_mw = Descriptors.ExactMolWt(mol) # LogP (lipophilicity) logp = Descriptors.MolLogP(mol) # Topological polar surface area tpsa = Descriptors.TPSA(mol) # Number of hydrogen bond donors/acceptors hbd = Descriptors.NumHDonors(mol) hba = Descriptors.NumHAcceptors(mol) # Number of rotatable bonds rot_bonds = Descriptors.NumRotatableBonds(mol) # Number of aromatic rings aromatic_rings = Descriptors.NumAromaticRings(mol) ``` **Batch Descriptor Calculation:** ```python # Calculate all descriptors at once all_descriptors = Descriptors.CalcMolDescriptors(mol) # Returns dictionary: {'MolWt': 180.16, 'MolLogP': 1.23, ...} # Get list of available descriptor names descriptor_names = [desc[0] for desc in Descriptors._descList] ``` **Lipinski's Rule of Five:** ```python # Check drug-likeness mw = Descriptors.MolWt(mol) <= 500 logp = Descriptors.MolLogP(mol) <= 5 hbd = Descriptors.NumHDonors(mol) <= 5 hba = Descriptors.NumHAcceptors(mol) <= 10 is_drug_like = mw and logp and hbd and hba ``` ### 5. Fingerprints and Molecular Similarity **Fingerprint Types:** ```python from rdkit.Chem import AllChem, RDKFingerprint from rdkit.Chem.AtomPairs import Pairs, Torsions from rdkit.Chem import MACCSkeys # RDKit topological fingerprint fp = Chem.RDKFingerprint(mol) # Morgan fingerprints (circular fingerprints, similar to ECFP) fp = AllChem.GetMorganFingerprint(mol, radius=2) fp_bits = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048) # MACCS keys (166-bit structural key) fp = MACCSkeys.GenMACCSKeys(mol) # Atom pair fingerprints fp = Pairs.GetAtomPairFingerprint(mol) # Topological torsion fingerprints fp = Torsions.GetTopologicalTorsionFingerprint(mol) # Avalon fingerprints (if available) from rdkit.Avalon import pyAvalonTools fp = pyAvalonTools.GetAvalonFP(mol) ``` **Similarity Calculation:** ```python from rdkit import DataStructs # Calculate Tanimoto similarity fp1 = AllChem.GetMorganFingerprintAsBitVect(mol1, radius=2) fp2 = AllChem.GetMorganFingerprintAsBitVect(mol2, radius=2) similarity = DataStructs.TanimotoSimilarity(fp1, fp2) # Calculate similarity for multiple molecules similarities = DataStructs.BulkTanimotoSimilarity(fp1, [fp2, fp3, fp4]) # Other similarity metrics dice = DataStructs.DiceSimilarity(fp1, fp2) cosine = DataStructs.CosineSimilarity(fp1, fp2) ``` **Clustering and Diversity:** ```python # Butina clustering based on fingerprint similarity from rdkit.ML.Cluster import Butina # Calculate distance matrix dists = [] fps = [AllChem.GetMorganFingerprintAsBitVect(mol, 2) for mol in mols] for i in range(len(fps)): sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i]) dists.extend([1-sim for sim in sims]) # Cluster with distance cutoff clusters = Butina.ClusterData(dists, len(fps), distThresh=0.3, isDistData=True) ``` ### 6. Substructure Searching and SMARTS **Basic Substructure Matching:** ```python # Define query using SMARTS query = Chem.MolFromSmarts('[#6]1:[#6]:[#6]:[#6]:[#6]:[#6]:1') # Benzene ring # Check if molecule contains substructure has_match = mol.HasSubstructMatch(query) # Get all matches (returns tuple of tuples with atom indices) matches = mol.GetSubstructMatches(query) # Get only first match match = mol.GetSubstructMatch(query) ``` **Common SMARTS Patterns:** ```python # Primary alcohols primary_alcohol = Chem.MolFromSmarts('[CH2][OH1]') # Carboxylic acids carboxylic_acid = Chem.MolFromSmarts('C(=O)[OH]') # Amides amide = Chem.MolFromSmarts('C(=O)N') # Aromatic heterocycles aromatic_n = Chem.MolFromSmarts('[nR]') # Aromatic nitrogen in ring # Macrocycles (rings > 12 atoms) macrocycle = Chem.MolFromSmarts('[r{12-}]') ``` **Matching Rules:** - Unspecified properties in query match any value in target - Hydrogens are ignored unless explicitly specified - Charged query atom won't match uncharged target atom - Aromatic query atom won't match aliphatic target atom (unless query is generic) ### 7. Chemical Reactions **Reaction SMARTS:** ```python from rdkit.Chem import AllChem # Define reaction using SMARTS: reactants >> products rxn = AllChem.ReactionFromSmarts('[C:1]=[O:2]>>[C:1][O:2]') # Ketone reduction # Apply reaction to molecules reactants = (mol1,) products = rxn.RunReactants(reactants) # Products is tuple of tuples (one tuple per product set) for product_set in products: for product in product_set: # Sanitize product Chem.SanitizeMol(product) ``` **Reaction Features:** - Atom mapping preserves specific atoms between reactants and products - Dummy atoms in products are replaced by corresponding reactant atoms - "Any" bonds inherit bond order from reactants - Chirality preserved unless explicitly changed **Reaction Similarity:** ```python # Generate reaction fingerprints fp = AllChem.CreateDifferenceFingerprintForReaction(rxn) # Compare reactions similarity = DataStructs.TanimotoSimilarity(fp1, fp2) ``` ### 8. 2D and 3D Coordinate Generation **2D Coordinate Generation:** ```python from rdkit.Chem import AllChem # Generate 2D coordinates for depiction AllChem.Compute2DCoords(mol) # Align molecule to template structure template = Chem.MolFromSmiles('c1ccccc1') AllChem.Compute2DCoords(template) AllChem.GenerateDepictionMatching2DStructure(mol, template) ``` **3D Coordinate Generation and Conformers:** ```python # Generate single 3D conformer using ETKDG AllChem.EmbedMolecule(mol, randomSeed=42) # Generate multiple conformers conf_ids = AllChem.EmbedMultipleConfs(mol, numConfs=10, randomSeed=42) # Optimize geometry with force field AllChem.UFFOptimizeMolecule(mol) # UFF force field AllChem.MMFFOptimizeMolecule(mol) # MMFF94 force field # Optimize all conformers for conf_id in conf_ids: AllChem.MMFFOptimizeMolecule(mol, confId=conf_id) # Calculate RMSD between conformers from rdkit.Chem import AllChem rms = AllChem.GetConformerRMS(mol, conf_id1, conf_id2) # Align molecules AllChem.AlignMol(probe_mol, ref_mol) ``` **Constrained Embedding:** ```python # Embed with part of molecule constrained to specific coordinates AllChem.ConstrainedEmbed(mol, core_mol) ``` ### 9. Molecular Visualization **Basic Drawing:** ```python from rdkit.Chem import Draw # Draw single molecule to PIL image img = Draw.MolToImage(mol, size=(300, 300)) img.save('molecule.png') # Draw to file directly Draw.MolToFile(mol, 'molecule.png') # Draw multiple molecules in grid mols = [mol1, mol2, mol3, mol4] img = Draw.MolsToGridImage(mols, molsPerRow=2, subImgSize=(200, 200)) ``` **Highlighting Substructures:** ```python # Highlight substructure match query = Chem.MolFromSmarts('c1ccccc1') match = mol.GetSubstructMatch(query) img = Draw.MolToImage(mol, highlightAtoms=match) # Custom highlight colors highlight_colors = {atom_idx: (1, 0, 0) for atom_idx in match} # Red img = Draw.MolToImage(mol, highlightAtoms=match, highlightAtomColors=highlight_colors) ``` **Customizing Visualization:** ```python from rdkit.Chem.Draw import rdMolDraw2D # Create drawer with custom options drawer = rdMolDraw2D.MolDraw2DCairo(300, 300) opts = drawer.drawOptions() # Customize options opts.addAtomIndices = True opts.addStereoAnnotation = True opts.bondLineWidth = 2 # Draw molecule drawer.DrawMolecule(mol) drawer.FinishDrawing() # Save to file with open('molecule.png', 'wb') as f: f.write(drawer.GetDrawingText()) ``` **Jupyter Notebook Integration:** ```python # Enable inline display in Jupyter from rdkit.Chem.Draw import IPythonConsole # Customize default display IPythonConsole.ipython_useSVG = True # Use SVG instead of PNG IPythonConsole.molSize = (300, 300) # Default size # Molecules now display automatically mol # Shows molecule image ``` **Visualizing Fingerprint Bits:** ```python # Show what molecular features a fingerprint bit represents from rdkit.Chem import Draw # For Morgan fingerprints bit_info = {} fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, bitInfo=bit_info) # Draw environment for specific bit img = Draw.DrawMorganBit(mol, bit_id, bit_info) ``` ### 10. Molecular Modification **Adding/Removing Hydrogens:** ```python # Add explicit hydrogens mol_h = Chem.AddHs(mol) # Remove explicit hydrogens mol = Chem.RemoveHs(mol_h) ``` **Kekulization and Aromaticity:** ```python # Convert aromatic bonds to alternating single/double Chem.Kekulize(mol) # Set aromaticity Chem.SetAromaticity(mol) ``` **Replacing Substructures:** ```python # Replace substructure with another structure query = Chem.MolFromSmarts('c1ccccc1') # Benzene replacement = Chem.MolFromSmiles('C1CCCCC1') # Cyclohexane new_mol = Chem.ReplaceSubstructs(mol, query, replacement)[0] ``` **Neutralizing Charges:** ```python # Remove formal charges by adding/removing hydrogens from rdkit.Chem.MolStandardize import rdMolStandardize # Using Uncharger uncharger = rdMolStandardize.Uncharger() mol_neutral = uncharger.uncharge(mol) ``` ### 11. Working with Molecular Hashes and Standardization **Molecular Hashing:** ```python from rdkit.Chem import rdMolHash # Generate Murcko scaffold hash scaffold_hash = rdMolHash.MolHash(mol, rdMolHash.HashFunction.MurckoScaffold) # Canonical SMILES hash canonical_hash = rdMolHash.MolHash(mol, rdMolHash.HashFunction.CanonicalSmiles) # Regioisomer hash (ignores stereochemistry) regio_hash = rdMolHash.MolHash(mol, rdMolHash.HashFunction.Regioisomer) ``` **Randomized SMILES:** ```python # Generate random SMILES representations (for data augmentation) from rdkit.Chem import MolToRandomSmilesVect random_smiles = MolToRandomSmilesVect(mol, numSmiles=10, randomSeed=42) ``` ### 12. Pharmacophore and 3D Features **Pharmacophore Features:** ```python from rdkit.Chem import ChemicalFeatures from rdkit import RDConfig import os # Load feature factory fdef_path = os.path.join(RDConfig.RDDataDir, 'BaseFeatures.fdef') factory = ChemicalFeatures.BuildFeatureFactory(fdef_path) # Get pharmacophore features features = factory.GetFeaturesForMol(mol) for feat in features: print(feat.GetFamily(), feat.GetType(), feat.GetAtomIds()) ``` ## Common Workflows ### Drug-likeness Analysis ```python from rdkit import Chem from rdkit.Chem import Descriptors def analyze_druglikeness(smiles): mol = Chem.MolFromSmiles(smiles) if mol is None: return None # Calculate Lipinski descriptors results = { 'MW': Descriptors.MolWt(mol), 'LogP': Descriptors.MolLogP(mol), 'HBD': Descriptors.NumHDonors(mol), 'HBA': Descriptors.NumHAcceptors(mol), 'TPSA': Descriptors.TPSA(mol), 'RotBonds': Descriptors.NumRotatableBonds(mol) } # Check Lipinski's Rule of Five results['Lipinski'] = ( results['MW'] <= 500 and results['LogP'] <= 5 and results['HBD'] <= 5 and results['HBA'] <= 10 ) return results ``` ### Similarity Screening ```python from rdkit import Chem from rdkit.Chem import AllChem from rdkit import DataStructs def similarity_screen(query_smiles, database_smiles, threshold=0.7): query_mol = Chem.MolFromSmiles(query_smiles) query_fp = AllChem.GetMorganFingerprintAsBitVect(query_mol, 2) hits = [] for idx, smiles in enumerate(database_smiles): mol = Chem.MolFromSmiles(smiles) if mol: fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2) sim = DataStructs.TanimotoSimilarity(query_fp, fp) if sim >= threshold: hits.append((idx, smiles, sim)) return sorted(hits, key=lambda x: x[2], reverse=True) ``` ### Substructure Filtering ```python from rdkit import Chem def filter_by_substructure(smiles_list, pattern_smarts): query = Chem.MolFromSmarts(pattern_smarts) hits = [] for smiles in smiles_list: mol = Chem.MolFromSmiles(smiles) if mol and mol.HasSubstructMatch(query): hits.append(smiles) return hits ``` ## Best Practices ### Error Handling Always check for `None` when parsing molecules: ```python mol = Chem.MolFromSmiles(smiles) if mol is None: print(f"Failed to parse: {smiles}") continue ``` ### Performance Optimization **Use binary formats for storage:** ```python import pickle # Pickle molecules for fast loading with open('molecules.pkl', 'wb') as f: pickle.dump(mols, f) # Load pickled molecules (much faster than reparsing) with open('molecules.pkl', 'rb') as f: mols = pickle.load(f) ``` **Use bulk operations:** ```python # Calculate fingerprints for all molecules at once fps = [AllChem.GetMorganFingerprintAsBitVect(mol, 2) for mol in mols] # Use bulk similarity calculations similarities = DataStructs.BulkTanimotoSimilarity(fps[0], fps[1:]) ``` ### Thread Safety RDKit operations are generally thread-safe for: - Molecule I/O (SMILES, mol blocks) - Coordinate generation - Fingerprinting and descriptors - Substructure searching - Reactions - Drawing **Not thread-safe:** MolSuppliers when accessed concurrently. ### Memory Management For large datasets: ```python # Use ForwardSDMolSupplier to avoid loading entire file with open('large.sdf') as f: suppl = Chem.ForwardSDMolSupplier(f) for mol in suppl: # Process one molecule at a time pass # Use MultithreadedSDMolSupplier for parallel processing suppl = Chem.MultithreadedSDMolSupplier('large.sdf', numWriterThreads=4) ``` ## Common Pitfalls 1. **Forgetting to check for None:** Always validate molecules after parsing 2. **Sanitization failures:** Use `DetectChemistryProblems()` to debug 3. **Missing hydrogens:** Use `AddHs()` when calculating properties that depend on hydrogen 4. **2D vs 3D:** Generate appropriate coordinates before visualization or 3D analysis 5. **SMARTS matching rules:** Remember that unspecified properties match anything 6. **Thread safety with MolSuppliers:** Don't share supplier objects across threads ## Resources ### references/ This skill includes detailed API reference documentation: - `api_reference.md` - Comprehensive listing of RDKit modules, functions, and classes organized by functionality - `descriptors_reference.md` - Complete list of available molecular descriptors with descriptions - `smarts_patterns.md` - Common SMARTS patterns for functional groups and structural features Load these references when needing specific API details, parameter information, or pattern examples. ### scripts/ Example scripts for common RDKit workflows: - `molecular_properties.py` - Calculate comprehensive molecular properties and descriptors - `similarity_search.py` - Perform fingerprint-based similarity screening - `substructure_filter.py` - Filter molecules by substructure patterns These scripts can be executed directly or used as templates for custom workflows.