--- name: datamol description: Pythonic wrapper around RDKit with simplified interface and sensible defaults. Preferred for standard drug discovery including SMILES parsing, standardization, descriptors, fingerprints, clustering, 3D conformers, parallel processing. Returns native rdkit.Chem.Mol objects. For advanced control or custom parameters, use rdkit directly. license: Apache-2.0 license allowed-tools: Read Write Edit Bash compatibility: Requires Python 3.8+ and datamol (uv pip install). RDKit is installed automatically as a datamol dependency (since 0.12.2). Optional s3fs/gcsfs for cloud I/O via fsspec. metadata: version: "1.0" skill-author: K-Dense Inc. --- # Datamol Cheminformatics Skill ## Overview Datamol is a Python library that provides a lightweight, Pythonic abstraction layer over RDKit for molecular cheminformatics. Simplify complex molecular operations with sensible defaults, efficient parallelization, and modern I/O capabilities. All molecular objects are native `rdkit.Chem.Mol` instances, ensuring full compatibility with the RDKit ecosystem. **Version note:** Examples target **datamol 0.12.x** (PyPI stable: **0.12.5**, June 2024). Since 0.10.0, modules are lazy-loaded by default (set `DATAMOL_DISABLE_LAZY_LOADING=1` to disable). Since 0.12.2, RDKit is a direct PyPI dependency of datamol. Fingerprints use RDKit's `rdFingerprintGenerator` API (0.12.5+). **Key capabilities**: - Molecular format conversion (SMILES, SELFIES, InChI) - Structure standardization and sanitization - Molecular descriptors and fingerprints - 3D conformer generation and analysis - Clustering and diversity selection - Scaffold and fragment analysis - Chemical reaction application - Visualization and alignment - Batch processing with parallelization - Cloud storage support via fsspec ## Installation and Setup Guide users to install datamol: ```bash uv pip install datamol ``` RDKit is installed automatically with datamol. For remote file paths (S3, GCS, HTTP), install the matching fsspec backend: ```bash uv pip install s3fs # AWS S3 uv pip install gcsfs # Google Cloud Storage ``` **Import convention**: ```python import datamol as dm ``` ## Core Workflows ### 1. Basic Molecule Handling **Creating molecules from SMILES**: ```python import datamol as dm # Single molecule mol = dm.to_mol("CCO") # Ethanol # From list of SMILES smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"] mols = [dm.to_mol(smi) for smi in smiles_list] # Error handling mol = dm.to_mol("invalid_smiles") # Returns None if mol is None: print("Failed to parse SMILES") ``` **Converting molecules to SMILES**: ```python # Canonical SMILES smiles = dm.to_smiles(mol) # Isomeric SMILES (includes stereochemistry) smiles = dm.to_smiles(mol, isomeric=True) # Other formats inchi = dm.to_inchi(mol) inchikey = dm.to_inchikey(mol) selfies = dm.to_selfies(mol) ``` **Standardization and sanitization** (always recommend for user-provided molecules): ```python # Sanitize molecule mol = dm.sanitize_mol(mol) # Full standardization (recommended for datasets) mol = dm.standardize_mol( mol, disconnect_metals=True, normalize=True, reionize=True ) # For SMILES strings directly clean_smiles = dm.standardize_smiles(smiles) ``` ### 2. Reading and Writing Molecular Files Refer to `references/io_module.md` for comprehensive I/O documentation. **Reading files**: ```python # SDF files (most common in chemistry) df = dm.read_sdf("compounds.sdf", mol_column='mol') # SMILES files df = dm.read_smi("molecules.smi", smiles_column='smiles', mol_column='mol') # CSV with SMILES column df = dm.read_csv("data.csv", smiles_column="SMILES", mol_column="mol") # Excel files df = dm.read_excel("compounds.xlsx", sheet_name=0, mol_column="mol") # Universal reader/writer (auto-detects format; supports compression) df = dm.open_df("file.sdf") # .sdf, .csv, .xlsx, .parquet, .json, .gz, etc. dm.save_df(df, "output.parquet") ``` **Writing files**: ```python # Save as SDF dm.to_sdf(mols, "output.sdf") # Or from DataFrame dm.to_sdf(df, "output.sdf", mol_column="mol") # Save as SMILES file dm.to_smi(mols, "output.smi") # Excel with rendered molecule images dm.to_xlsx(df, "output.xlsx", mol_columns=["mol"]) ``` **Remote file support** (S3, GCS, HTTP via fsspec): Only use cloud paths when the user explicitly requests them. Confirm the destination before writing. ```python # Read from cloud storage or HTTPS (user-provided URLs only) df = dm.read_sdf("s3://bucket/compounds.sdf") df = dm.read_csv("https://example.com/data.csv") # Write to cloud storage — confirm path with user first dm.to_sdf(mols, "s3://bucket/output.sdf") ``` Cloud backends read credentials from the standard provider environment (for example `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_DEFAULT_REGION`, or `GOOGLE_APPLICATION_CREDENTIALS`). Datamol passes these to fsspec locally; it does not collect or transmit environment variables to third-party endpoints. Scope credential access to the named provider variables only. ### 3. Molecular Descriptors and Properties Refer to `references/descriptors_viz.md` for detailed descriptor documentation. **Computing descriptors for a single molecule**: ```python # Get standard descriptor set descriptors = dm.descriptors.compute_many_descriptors(mol) # Returns: {'mw': 46.07, 'logp': -0.03, 'hbd': 1, 'hba': 1, # 'tpsa': 20.23, 'n_aromatic_atoms': 0, ...} ``` **Batch descriptor computation** (recommended for datasets): ```python # Compute for all molecules in parallel desc_df = dm.descriptors.batch_compute_many_descriptors( mols, n_jobs=-1, # Use all CPU cores progress=True # Show progress bar ) ``` **Specific descriptors**: ```python # Aromaticity n_aromatic = dm.descriptors.n_aromatic_atoms(mol) aromatic_ratio = dm.descriptors.n_aromatic_atoms_proportion(mol) # Stereochemistry n_stereo = dm.descriptors.n_stereo_centers(mol) n_unspec = dm.descriptors.n_stereo_centers_unspecified(mol) # Flexibility n_rigid = dm.descriptors.n_rigid_bonds(mol) ``` **Drug-likeness filtering (Lipinski's Rule of Five)**: ```python # Filter compounds def is_druglike(mol): desc = dm.descriptors.compute_many_descriptors(mol) return ( desc['mw'] <= 500 and desc['logp'] <= 5 and desc['hbd'] <= 5 and desc['hba'] <= 10 ) druglike_mols = [mol for mol in mols if is_druglike(mol)] ``` ### 4. Molecular Fingerprints and Similarity **Generating fingerprints**: Datamol defaults to ECFP6 (`radius=3`, `n_bits=2048`). Pass `radius=2` explicitly for ECFP4. ```python # ECFP4 (common in similarity screening) fp = dm.to_fp(mol, fp_type='ecfp', radius=2, n_bits=2048) # Other fingerprint types fp_maccs = dm.to_fp(mol, fp_type='maccs') fp_topological = dm.to_fp(mol, fp_type='topological') fp_atompair = dm.to_fp(mol, fp_type='atompair') fp_rdkit = dm.to_fp(mol, fp_type='rdkit') ``` **Similarity calculations**: ```python # Pairwise distances within a set distance_matrix = dm.pdist(mols, n_jobs=-1) # Distances between two sets distances = dm.cdist(query_mols, library_mols, n_jobs=-1) # Find most similar molecules (scipy is a PyPI package, not a file in this skill) from scipy.spatial.distance import squareform # third-party library dist_matrix = squareform(dm.pdist(mols)) # Lower distance = higher similarity (Tanimoto distance = 1 - Tanimoto similarity) ``` ### 5. Clustering and Diversity Selection Refer to `references/core_api.md` for clustering details. **Butina clustering**: ```python # Cluster molecules by structural similarity clusters = dm.cluster_mols( mols, cutoff=0.2, # Tanimoto distance threshold (0=identical, 1=completely different) n_jobs=-1 # Parallel processing ) # Each cluster is a list of molecule indices for i, cluster in enumerate(clusters): print(f"Cluster {i}: {len(cluster)} molecules") cluster_mols = [mols[idx] for idx in cluster] ``` **Important**: Butina clustering builds a full distance matrix - suitable for ~1000 molecules, not for 10,000+. **Diversity selection**: ```python # Pick diverse subset diverse_mols = dm.pick_diverse( mols, npick=100 # Select 100 diverse molecules ) # Pick cluster centroids centroids = dm.pick_centroids( mols, npick=50 # Select 50 representative molecules ) ``` ### 6. Scaffold Analysis Refer to `references/fragments_scaffolds.md` for complete scaffold documentation. **Extracting Murcko scaffolds**: ```python # Get Bemis-Murcko scaffold (core structure) scaffold = dm.to_scaffold_murcko(mol) scaffold_smiles = dm.to_smiles(scaffold) ``` **Scaffold-based analysis**: ```python # Group compounds by scaffold from collections import Counter scaffolds = [dm.to_scaffold_murcko(mol) for mol in mols] scaffold_smiles = [dm.to_smiles(s) for s in scaffolds] # Count scaffold frequency scaffold_counts = Counter(scaffold_smiles) most_common = scaffold_counts.most_common(10) # Create scaffold-to-molecules mapping scaffold_groups = {} for mol, scaf_smi in zip(mols, scaffold_smiles): if scaf_smi not in scaffold_groups: scaffold_groups[scaf_smi] = [] scaffold_groups[scaf_smi].append(mol) ``` **Scaffold-based train/test splitting** (for ML): ```python # Ensure train and test sets have different scaffolds scaffold_to_mols = {} for mol, scaf in zip(mols, scaffold_smiles): if scaf not in scaffold_to_mols: scaffold_to_mols[scaf] = [] scaffold_to_mols[scaf].append(mol) # Split scaffolds into train/test import random scaffolds = list(scaffold_to_mols.keys()) random.shuffle(scaffolds) split_idx = int(0.8 * len(scaffolds)) train_scaffolds = scaffolds[:split_idx] test_scaffolds = scaffolds[split_idx:] # Get molecules for each split train_mols = [mol for scaf in train_scaffolds for mol in scaffold_to_mols[scaf]] test_mols = [mol for scaf in test_scaffolds for mol in scaffold_to_mols[scaf]] ``` ### 7. Molecular Fragmentation Refer to `references/fragments_scaffolds.md` for fragmentation details. **BRICS fragmentation** (16 bond types): ```python # Fragment molecule fragments = dm.fragment.brics(mol) # Returns: set of fragment SMILES with attachment points like '[1*]CCN' ``` **RECAP fragmentation** (11 bond types): ```python fragments = dm.fragment.recap(mol) ``` **Fragment analysis**: ```python # Find common fragments across compound library from collections import Counter all_fragments = [] for mol in mols: frags = dm.fragment.brics(mol) all_fragments.extend(frags) fragment_counts = Counter(all_fragments) common_frags = fragment_counts.most_common(20) # Fragment-based scoring def fragment_score(mol, reference_fragments): mol_frags = dm.fragment.brics(mol) overlap = mol_frags.intersection(reference_fragments) return len(overlap) / len(mol_frags) if mol_frags else 0 ``` ### 8. 3D Conformer Generation Refer to `references/conformers_module.md` for detailed conformer documentation. **Generating conformers**: ```python # Generate 3D conformers mol_3d = dm.conformers.generate( mol, n_confs=50, # Number to generate (auto if None) rms_cutoff=0.5, # Filter similar conformers (Ångströms) minimize_energy=True, # Minimize with UFF force field method='ETKDGv3' # Embedding method (recommended) ) # Access conformers n_conformers = mol_3d.GetNumConformers() conf = mol_3d.GetConformer(0) # Get first conformer positions = conf.GetPositions() # Nx3 array of atom coordinates ``` **Conformer clustering**: ```python # Cluster conformers by RMSD clusters = dm.conformers.cluster( mol_3d, rms_cutoff=1.0, centroids=False ) # Get representative conformers centroids = dm.conformers.return_centroids(mol_3d, clusters) ``` **SASA calculation**: ```python # Calculate solvent accessible surface area sasa_values = dm.conformers.sasa(mol_3d, n_jobs=-1) # Access SASA from conformer properties conf = mol_3d.GetConformer(0) sasa = conf.GetDoubleProp('rdkit_free_sasa') ``` ### 9. Visualization Refer to `references/descriptors_viz.md` for visualization documentation. **Basic molecule grid**: ```python # Visualize molecules dm.viz.to_image( mols[:20], legends=[dm.to_smiles(m) for m in mols[:20]], n_cols=5, mol_size=(300, 300) ) # Save to file dm.viz.to_image(mols, outfile="molecules.png") # SVG for publications dm.viz.to_image(mols, outfile="molecules.svg", use_svg=True) ``` **Aligned visualization** (for SAR analysis): ```python # Align molecules by common substructure dm.viz.to_image( similar_mols, align=True, # Enable MCS alignment legends=activity_labels, n_cols=4 ) ``` **Highlighting substructures**: ```python # Highlight specific atoms and bonds dm.viz.to_image( mol, highlight_atom=[0, 1, 2, 3], # Atom indices highlight_bond=[0, 1, 2] # Bond indices ) ``` **Conformer visualization**: ```python # Display multiple conformers dm.viz.conformers( mol_3d, n_confs=10, align_conf=True, n_cols=3 ) ``` ### 10. Chemical Reactions Refer to `references/reactions_data.md` for reactions documentation. **Applying reactions**: ```python from rdkit.Chem import rdChemReactions # Define reaction from SMARTS rxn_smarts = '[C:1](=[O:2])[OH:3]>>[C:1](=[O:2])[Cl:3]' rxn = rdChemReactions.ReactionFromSmarts(rxn_smarts) # Apply to molecule reactant = dm.to_mol("CC(=O)O") # Acetic acid product = dm.reactions.apply_reaction( rxn, (reactant,), sanitize=True ) # Convert to SMILES product_smiles = dm.to_smiles(product) ``` **Batch reaction application**: ```python # Apply reaction to library products = [] for mol in reactant_mols: try: prod = dm.reactions.apply_reaction(rxn, (mol,)) if prod is not None: products.append(prod) except Exception as e: print(f"Reaction failed: {e}") ``` ## Parallelization Datamol includes built-in parallelization for many operations. Use `n_jobs` parameter: - `n_jobs=1`: Sequential (no parallelization) - `n_jobs=-1`: Use all available CPU cores - `n_jobs=4`: Use 4 cores **Functions supporting parallelization**: - `dm.read_sdf(..., n_jobs=-1)` - `dm.descriptors.batch_compute_many_descriptors(..., n_jobs=-1)` - `dm.cluster_mols(..., n_jobs=-1)` - `dm.pdist(..., n_jobs=-1)` - `dm.conformers.sasa(..., n_jobs=-1)` **Progress bars**: Many batch operations support `progress=True` parameter. ## Common Workflows and Patterns ### Complete Pipeline: Data Loading → Filtering → Analysis ```python import datamol as dm import pandas as pd # 1. Load molecules df = dm.read_sdf("compounds.sdf") # 2. Standardize df['mol'] = df['mol'].apply(lambda m: dm.standardize_mol(m) if m else None) df = df[df['mol'].notna()] # Remove failed molecules # 3. Compute descriptors desc_df = dm.descriptors.batch_compute_many_descriptors( df['mol'].tolist(), n_jobs=-1, progress=True ) # 4. Filter by drug-likeness druglike = ( (desc_df['mw'] <= 500) & (desc_df['logp'] <= 5) & (desc_df['hbd'] <= 5) & (desc_df['hba'] <= 10) ) filtered_df = df[druglike] # 5. Cluster and select diverse subset diverse_mols = dm.pick_diverse( filtered_df['mol'].tolist(), npick=100 ) # 6. Visualize results dm.viz.to_image( diverse_mols, legends=[dm.to_smiles(m) for m in diverse_mols], outfile="diverse_compounds.png", n_cols=10 ) ``` ### Structure-Activity Relationship (SAR) Analysis ```python # Group by scaffold scaffolds = [dm.to_scaffold_murcko(mol) for mol in mols] scaffold_smiles = [dm.to_smiles(s) for s in scaffolds] # Create DataFrame with activities sar_df = pd.DataFrame({ 'mol': mols, 'scaffold': scaffold_smiles, 'activity': activities # User-provided activity data }) # Analyze each scaffold series for scaffold, group in sar_df.groupby('scaffold'): if len(group) >= 3: # Need multiple examples print(f"\nScaffold: {scaffold}") print(f"Count: {len(group)}") print(f"Activity range: {group['activity'].min():.2f} - {group['activity'].max():.2f}") # Visualize with activities as legends dm.viz.to_image( group['mol'].tolist(), legends=[f"Activity: {act:.2f}" for act in group['activity']], align=True # Align by common substructure ) ``` ### Virtual Screening Pipeline ```python import numpy as np # 1. Calculate Tanimoto distances between query actives and library distances = dm.cdist(query_actives, library_mols, n_jobs=-1) # 3. Find closest matches (min distance to any query) min_distances = distances.min(axis=0) similarities = 1 - min_distances # Convert distance to similarity # 4. Rank and select top hits top_indices = np.argsort(similarities)[::-1][:100] # Top 100 top_hits = [library_mols[i] for i in top_indices] top_scores = [similarities[i] for i in top_indices] # 5. Visualize hits dm.viz.to_image( top_hits[:20], legends=[f"Sim: {score:.3f}" for score in top_scores[:20]], outfile="screening_hits.png" ) ``` ## Reference Documentation For detailed API documentation, consult these reference files: - **`references/core_api.md`**: Core namespace functions (conversions, standardization, fingerprints, clustering) - **`references/io_module.md`**: File I/O operations (read/write SDF, CSV, Excel, remote files) - **`references/conformers_module.md`**: 3D conformer generation, clustering, SASA calculations - **`references/descriptors_viz.md`**: Molecular descriptors and visualization functions - **`references/fragments_scaffolds.md`**: Scaffold extraction, BRICS/RECAP fragmentation - **`references/reactions_data.md`**: Chemical reactions and toy datasets ## Best Practices 1. **Always standardize molecules** from external sources: ```python mol = dm.standardize_mol(mol, disconnect_metals=True, normalize=True, reionize=True) ``` 2. **Check for None values** after molecule parsing: ```python mol = dm.to_mol(smiles) if mol is None: # Handle invalid SMILES ``` 3. **Use parallel processing** for large datasets: ```python result = dm.operation(..., n_jobs=-1, progress=True) ``` 4. **Use cloud I/O only when requested** — confirm remote write paths; install `s3fs`/`gcsfs` as needed: ```python df = dm.read_sdf("s3://bucket/compounds.sdf") ``` 5. **Use appropriate fingerprints** for similarity: - ECFP (Morgan): General purpose, structural similarity - MACCS: Fast, smaller feature space - Atom pairs: Considers atom pairs and distances 6. **Consider scale limitations**: - Butina clustering: ~1,000 molecules (full distance matrix) - For larger datasets: Use diversity selection or hierarchical methods 7. **Scaffold splitting for ML**: Ensure proper train/test separation by scaffold 8. **Align molecules** when visualizing SAR series ## Error Handling ```python # Safe molecule creation def safe_to_mol(smiles): try: mol = dm.to_mol(smiles) if mol is not None: mol = dm.standardize_mol(mol) return mol except Exception as e: print(f"Failed to process {smiles}: {e}") return None # Safe batch processing valid_mols = [] for smiles in smiles_list: mol = safe_to_mol(smiles) if mol is not None: valid_mols.append(mol) ``` ## Integration with Machine Learning Datamol ships with `scipy` and `scikit-learn` as dependencies. Import them as normal PyPI packages — they are not scripts bundled in this skill. ```python import numpy as np # Feature generation X = np.array([dm.to_fp(mol) for mol in mols]) # Or descriptors desc_df = dm.descriptors.batch_compute_many_descriptors(mols, n_jobs=-1) X = desc_df.values # Train model (scikit-learn PyPI package) from sklearn.ensemble import RandomForestRegressor # third-party library model = RandomForestRegressor() model.fit(X, y_target) # Predict predictions = model.predict(X_test) ``` ## Troubleshooting **Issue**: Molecule parsing fails - **Solution**: Use `dm.standardize_smiles()` first or try `dm.fix_mol()` **Issue**: Memory errors with clustering - **Solution**: Use `dm.pick_diverse()` instead of full clustering for large sets **Issue**: Slow conformer generation - **Solution**: Reduce `n_confs` or increase `rms_cutoff` to generate fewer conformers **Issue**: Remote file access fails - **Solution**: Install the matching fsspec backend (`uv pip install s3fs` or `gcsfs`) and verify only the provider credentials needed for that backend are set (see Remote file support above) ## Additional Resources - **Datamol Documentation**: https://docs.datamol.io/ - **RDKit Documentation**: https://www.rdkit.org/docs/ - **GitHub Repository**: https://github.com/datamol-io/datamol