---
name: deepchem
description: "Deep learning for drug discovery. 60+ models (GCN, GAT, AttentiveFP, MPNN, ChemBERTa, GROVER), 50+ featurizers, MoleculeNet benchmarks, HPO, transfer learning. Unified load-featurize-split-train-evaluate API. For fingerprints use rdkit-cheminformatics; for featurization-only use molfeat."
license: MIT
---

# DeepChem — Deep Learning for Drug Discovery

## Overview

DeepChem is an open-source Python framework providing a unified API for molecular machine learning across drug discovery, materials science, and quantum chemistry. It wraps 60+ model architectures (graph neural networks, transformers, classical ML) with 50+ molecular featurizers and standardized datasets (MoleculeNet), enabling end-to-end workflows from SMILES strings to trained predictive models.

## When to Use

- Predicting molecular properties (solubility, toxicity, binding affinity) from SMILES
- Benchmarking models on MoleculeNet standardized datasets (BBBP, Tox21, ESOL, FreeSolv, etc.)
- Training graph neural networks on molecular graphs (GCN, GAT, AttentiveFP, MPNN, DMPNN)
- Fine-tuning pretrained chemical language models (ChemBERTa, GROVER, MolFormer)
- Running hyperparameter optimization for molecular ML models
- Virtual screening and hit prioritization with trained models
- Materials property prediction from crystal structures (CGCNN, MEGNet)
- Protein-ligand interaction modeling and binding affinity prediction
- For fingerprint-based cheminformatics without deep learning, use `rdkit-cheminformatics` instead
- For featurization only (no model training), use `molfeat-molecular-featurization` instead

## Prerequisites

- **Python packages**: `deepchem` (core), `torch` or `tensorflow` (backend-dependent models)
- **GPU**: Recommended for graph neural networks and transformer models; CPU sufficient for classical ML and fingerprint models
- **Data**: SMILES strings with property labels (CSV), or MoleculeNet datasets (auto-downloaded)

```bash
# Core installation (includes RDKit, scikit-learn, XGBoost)
pip install deepchem

# With PyTorch backend (GNN models)
pip install deepchem[torch]

# With TensorFlow backend (legacy models)
pip install deepchem[tensorflow]

# Full installation (all backends + extras)
pip install deepchem[all]
```

## Quick Start

```python
import deepchem as dc

# Load MoleculeNet dataset with featurization + scaffold split
tasks, datasets, transformers = dc.molnet.load_delaney(featurizer="ECFP")
train, valid, test = datasets

# Train and evaluate a multitask regressor
model = dc.models.MultitaskRegressor(n_tasks=1, n_features=1024, dropouts=0.2)
model.fit(train, nb_epoch=50)

metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)
print(f"Test R2: {model.evaluate(test, [metric])}")  # {'pearson_r2_score': ~0.7}
```

## Core API

### Module 1: Data Loading and Processing

Load molecular data from CSV files or MoleculeNet benchmark datasets.
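Every loader ultimately returns a `dc.data.Dataset` holding features `X`, labels `y`, weights `w`, and molecule `ids`. For quick experiments, one can also be built directly from arrays; a minimal sketch with invented toy values:

```python
import numpy as np
import deepchem as dc

# Toy arrays standing in for featurized molecules (values invented)
X = np.random.rand(4, 16)                    # 4 samples, 16 features
y = np.array([[0.1], [0.5], [0.9], [0.3]])  # one regression task
dataset = dc.data.NumpyDataset(X=X, y=y, ids=[f"mol_{i}" for i in range(4)])
print(dataset.X.shape, dataset.y.shape)      # (4, 16) (4, 1)
```

The file-based loaders below produce the same structure while featurizing during ingestion: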
```python
import deepchem as dc

# Load from CSV (SMILES + property columns)
loader = dc.data.CSVLoader(
    tasks=["measured_log_solubility"],
    feature_field="smiles",
    featurizer=dc.feat.CircularFingerprint(size=2048, radius=3)
)
dataset = loader.create_dataset("solubility_data.csv")
print(f"Samples: {dataset.X.shape[0]}, Features: {dataset.X.shape[1]}")
# Samples: 1128, Features: 2048

# Load from SDF (3D structures)
sdf_loader = dc.data.SDFLoader(
    tasks=["activity"],
    featurizer=dc.feat.CoulombMatrix(max_atoms=50)
)
dataset_3d = sdf_loader.create_dataset("molecules.sdf")
```

```python
# Load MoleculeNet benchmark datasets (auto-download + featurize + split)
# Available: load_delaney, load_bbbp, load_tox21, load_hiv, load_qm7, load_qm9, etc.
tasks, datasets, transformers = dc.molnet.load_tox21(featurizer="ECFP", splitter="scaffold")
train, valid, test = datasets
print(f"Tasks: {len(tasks)}, Train: {len(train)}, Test: {len(test)}")
# Tasks: 12, Train: ~6264, Test: ~631

# Inverse-transform predictions back to original scale
# (assumes a trained `model`, e.g. from Module 3; most useful for
# regression datasets that apply a NormalizationTransformer)
y_pred = model.predict(test)
y_original = transformers[0].untransform(y_pred)
```

### Module 2: Molecular Featurization

Convert molecules to numerical representations for ML. DeepChem provides 50+ featurizers spanning fingerprints, descriptors, graph features, and Coulomb matrices.

```python
import deepchem as dc

smiles = ["CCO", "CC(=O)O", "c1ccccc1", "CC(C)O"]

# Fingerprints (most common for classical ML)
ecfp = dc.feat.CircularFingerprint(size=2048, radius=3)
fp_features = ecfp.featurize(smiles)
print(f"ECFP shape: {fp_features.shape}")  # (4, 2048)

# RDKit descriptors (interpretable physicochemical properties)
rdkit_desc = dc.feat.RDKitDescriptors()
desc_features = rdkit_desc.featurize(smiles)
print(f"Descriptor shape: {desc_features.shape}")  # (4, ~208); count varies with RDKit version

# Graph features (for GNN models — returns ConvMol objects)
graph_feat = dc.feat.ConvMolFeaturizer()
graphs = graph_feat.featurize(smiles)
print(f"Atoms in first mol: {graphs[0].get_num_atoms()}")  # 3

# Mol2Vec embeddings (pretrained word2vec on molecular substructures)
mol2vec = dc.feat.Mol2VecFingerprint()
embeddings = mol2vec.featurize(smiles)
print(f"Mol2Vec shape: {embeddings.shape}")  # (4, 300)
```

### Module 3: Model Training and Evaluation

DeepChem provides `MultitaskRegressor` and `MultitaskClassifier` as general-purpose models, plus specialized architectures for graph and sequence data.
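Any model in this module can log validation scores during `fit` via `dc.models.ValidationCallback`, which also underpins the early-stopping advice under Best Practices below. A minimal sketch, assuming the callback signature of recent DeepChem releases:

```python
import deepchem as dc

tasks, datasets, transformers = dc.molnet.load_delaney(featurizer="ECFP")
train, valid, test = datasets

model = dc.models.MultitaskRegressor(n_tasks=1, n_features=1024)

# Score the validation set every 100 training steps; passing save_dir
# would additionally checkpoint the best model seen so far
callback = dc.models.ValidationCallback(
    valid, interval=100,
    metrics=[dc.metrics.Metric(dc.metrics.rms_score)],
)
model.fit(train, nb_epoch=30, callbacks=[callback])
```

The core regression and classification loops follow: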
```python
import deepchem as dc

# Load dataset
tasks, datasets, transformers = dc.molnet.load_delaney(featurizer="ECFP")
train, valid, test = datasets

# Regression model (fingerprint input)
model = dc.models.MultitaskRegressor(
    n_tasks=1,
    n_features=1024,
    layer_sizes=[1000, 500],
    dropouts=0.25,
    learning_rate=0.001,
    batch_size=64,
)
model.fit(train, nb_epoch=100)

# Evaluate with multiple metrics
metrics = [
    dc.metrics.Metric(dc.metrics.pearson_r2_score),
    dc.metrics.Metric(dc.metrics.mean_absolute_error),
    dc.metrics.Metric(dc.metrics.rms_score),
]
results = model.evaluate(test, metrics)
print(f"R2: {results['pearson_r2_score']:.3f}, MAE: {results['mean_absolute_error']:.3f}")
```

```python
import numpy as np

# Classification model (e.g., Tox21 toxicity prediction)
tasks, datasets, transformers = dc.molnet.load_tox21(featurizer="ECFP")
train, valid, test = datasets

clf = dc.models.MultitaskClassifier(
    n_tasks=len(tasks),
    n_features=1024,
    layer_sizes=[1000, 500],
    dropouts=0.5,
    learning_rate=0.001,
)
clf.fit(train, nb_epoch=50)

# Average ROC-AUC across the 12 tasks
roc_metric = dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean)
print(f"Mean ROC-AUC: {clf.evaluate(test, [roc_metric])}")
```

### Module 4: Graph Neural Networks

GNNs operate directly on molecular graphs (atoms as nodes, bonds as edges), avoiding information loss from fixed fingerprints.

```python
import deepchem as dc

# Load with graph featurizer
tasks, datasets, transformers = dc.molnet.load_delaney(featurizer="GraphConv")
train, valid, test = datasets

# Graph Convolutional Network (Duvenaud et al.)
gcn_model = dc.models.GraphConvModel(
    n_tasks=1,
    mode="regression",
    graph_conv_layers=[64, 64],
    dense_layer_size=256,
    dropout=0.2,
    learning_rate=0.001,
    batch_size=64,
)
gcn_model.fit(train, nb_epoch=100)

metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)
print(f"GCN R2: {gcn_model.evaluate(test, [metric])}")
```

```python
# AttentiveFP (Xiong et al.) — attention-based GNN, strong on molecular properties
tasks, datasets, transformers = dc.molnet.load_delaney(
    featurizer=dc.feat.MolGraphConvFeaturizer(use_edges=True)
)
train, valid, test = datasets

attfp_model = dc.models.AttentiveFPModel(
    n_tasks=1,
    mode="regression",
    num_layers=2,
    graph_feat_size=200,
    num_timesteps=2,
    dropout=0.2,
    learning_rate=0.001,
    batch_size=64,
)
attfp_model.fit(train, nb_epoch=100)

metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)
print(f"AttentiveFP R2: {attfp_model.evaluate(test, [metric])}")
```

### Module 5: Transfer Learning

Fine-tune pretrained chemical language models for downstream tasks with limited data.

```python
import deepchem as dc
from deepchem.models.torch_models import ChemBERTaModel

# ChemBERTa — SMILES-based transformer (pretrained on 77M molecules).
# Note: class names and constructor arguments for the ChemBERTa wrapper vary
# across DeepChem versions, and SmilesTokenizer may require a vocabulary file;
# check the docs for your installed release.
tasks, datasets, transformers = dc.molnet.load_bbbp(featurizer=dc.feat.SmilesTokenizer())
train, valid, test = datasets

chemberta = ChemBERTaModel(
    task="classification",
    n_tasks=1,
    model_dir="chemberta_finetuned/",
)

# Fine-tune on downstream task (BBB permeability)
chemberta.fit(train, nb_epoch=10)

metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
print(f"ChemBERTa ROC-AUC: {chemberta.evaluate(test, [metric])}")
```

### Module 6: Predictions on New Molecules

Run inference on new molecules with a trained model.
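Molecules from external sources often include SMILES that fail to parse; pre-filtering with RDKit (the same check suggested under Troubleshooting) keeps batch inference from failing midway. A minimal sketch with an invented candidate list:

```python
from rdkit import Chem

candidates = ["c1cc(O)ccc1", "C1CC1C(", "CC(=O)Nc1ccc(O)cc1"]  # second is deliberately invalid
valid_smiles = [s for s in candidates if Chem.MolFromSmiles(s) is not None]
print(f"{len(valid_smiles)}/{len(candidates)} SMILES parse cleanly")
```

With a clean list, featurize using the same featurizer the model was trained with and predict: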
```python
import deepchem as dc
import numpy as np

# Assume trained model from Module 3
# Featurize new molecules using same featurizer
featurizer = dc.feat.CircularFingerprint(size=1024, radius=2)
new_smiles = ["c1cc(O)ccc1", "CC(=O)Nc1ccc(O)cc1", "OC(=O)c1ccccc1"]
new_features = featurizer.featurize(new_smiles)

new_dataset = dc.data.NumpyDataset(X=new_features)
predictions = model.predict(new_dataset)
for smi, pred in zip(new_smiles, predictions):
    print(f"{smi}: {pred[0]:.2f}")

# Ensemble predictions from multiple models for robustness
models = [model1, model2, model3]  # trained models
all_preds = np.array([m.predict(new_dataset) for m in models])
ensemble_mean = all_preds.mean(axis=0)
ensemble_std = all_preds.std(axis=0)
print(f"Ensemble prediction: {ensemble_mean[0][0]:.2f} +/- {ensemble_std[0][0]:.2f}")
```

## Key Concepts

### Unified API Pattern

All DeepChem workflows follow a consistent 5-step pattern:

```
Load Data → Featurize → Split → Train → Evaluate
```

- **Load**: `CSVLoader`, `SDFLoader`, or `dc.molnet.load_*()` (auto-loads MoleculeNet datasets)
- **Featurize**: Pass featurizer to loader, or call `featurizer.featurize(smiles)` directly
- **Split**: `ScaffoldSplitter` (recommended for drug discovery), `RandomSplitter`, `ButinaSplitter`
- **Train**: `model.fit(train_dataset, nb_epoch=N)`
- **Evaluate**: `model.evaluate(test_dataset, metrics_list)`

### Model Selection Guide

| Data Type | Model | Key Feature | Use When |
|-----------|-------|-------------|----------|
| SMILES + fingerprints | `MultitaskRegressor` | Fast, baseline | First attempt, small datasets |
| SMILES + fingerprints | `MultitaskClassifier` | Multi-label | Multi-task classification (Tox21) |
| Molecular graphs | `GraphConvModel` | Learned fingerprints | Medium datasets, general properties |
| Molecular graphs | `GATModel` | Attention mechanism | When atom importance matters |
| Molecular graphs | `AttentiveFPModel` | Graph + timestep attention | State-of-art molecular properties |
| Molecular graphs | `MPNNModel` | Message passing | Complex molecular interactions |
| Molecular graphs | `DMPNNModel` | Directed MPNN | Bond-level predictions |
| SMILES strings | `ChemBERTaModel` | Pretrained transformer | Low-data regime, transfer learning |
| SMILES strings | `GROVERModel` | Graph + transformer | Rich molecular representations |
| Crystal structures | `CGCNNModel` | Crystal graph CNN | Materials property prediction |
| Crystal structures | `MEGNetModel` | Graph networks | Materials and molecules |
| Protein-ligand complexes | `AtomicConvModel` | Atomic convolutions | Binding affinity prediction |
| Tabular features | `SklearnModel`, `GBDTModel` | Classical ML | Interpretability, baselines |

### Featurizer Selection Guide

| Featurizer | Class | Output | Best For |
|------------|-------|--------|----------|
| ECFP/Morgan | `CircularFingerprint` | Binary vector (1024-2048) | General QSAR, fast baselines |
| MACCS Keys | `MACCSKeysFingerprint` | 167-bit vector | Substructure filtering |
| RDKit 2D | `RDKitDescriptors` | 200+ descriptors | Interpretable models |
| Mol2Vec | `Mol2VecFingerprint` | 300-dim embedding | Similarity, clustering |
| ConvMol | `ConvMolFeaturizer` | Graph features | `GraphConvModel` input |
| MolGraph | `MolGraphConvFeaturizer` | Node + edge features | `AttentiveFPModel`, `MPNNModel` |
| Weave | `WeaveFeaturizer` | Pair features | `WeaveModel` input |
| Coulomb Matrix | `CoulombMatrix` | Atom-pair distances | QM property prediction |
| SMILES tokens | `SmilesTokenizer` | Token IDs | ChemBERTa, transformer models |
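Whichever featurizer is chosen, the safest way to satisfy a fingerprint model's `n_features` argument (see Troubleshooting: `n_features` mismatch) is to read the width off the featurized data. A minimal sketch, reusing Workflow 1's `bioactivity_data.csv`:

```python
import deepchem as dc

featurizer = dc.feat.CircularFingerprint(size=2048, radius=3)
loader = dc.data.CSVLoader(tasks=["pIC50"], feature_field="smiles", featurizer=featurizer)
dataset = loader.create_dataset("bioactivity_data.csv")

# Let the data dictate the input width instead of hard-coding it
model = dc.models.MultitaskRegressor(n_tasks=1, n_features=dataset.X.shape[1])
```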
### Data Splitting Strategies

| Splitter | Use Case | Why |
|----------|----------|-----|
| `ScaffoldSplitter` | Drug discovery (default) | Tests generalization to new chemotypes |
| `RandomSplitter` | Quick experiments | Baseline, but overestimates performance |
| `ButinaSplitter` | Diversity-based | Clusters by Tanimoto similarity |
| `FingerprintSplitter` | Chemical similarity | Groups structurally similar molecules |
| `MaxMinSplitter` | Maximum diversity test | Extreme generalization test |

## Common Workflows

### Workflow 1: QSAR from CSV Data

**Goal**: Build a property prediction model from a CSV file with SMILES and activity columns.

```python
import deepchem as dc

# Step 1: Load and featurize CSV data
loader = dc.data.CSVLoader(
    tasks=["pIC50"],
    feature_field="smiles",
    featurizer=dc.feat.CircularFingerprint(size=2048, radius=3),
)
dataset = loader.create_dataset("bioactivity_data.csv")

# Step 2: Normalize targets
transformer = dc.trans.NormalizationTransformer(
    transform_y=True, dataset=dataset
)
dataset = transformer.transform(dataset)

# Step 3: Scaffold split (realistic for drug discovery)
splitter = dc.splits.ScaffoldSplitter()
train, valid, test = splitter.train_valid_test_split(dataset)
print(f"Train: {len(train)}, Valid: {len(valid)}, Test: {len(test)}")

# Step 4: Train model
model = dc.models.MultitaskRegressor(
    n_tasks=1,
    n_features=2048,
    layer_sizes=[1000, 500],
    dropouts=0.25,
    learning_rate=0.001,
    batch_size=64,
)
model.fit(train, nb_epoch=100)

# Step 5: Evaluate
metrics = [
    dc.metrics.Metric(dc.metrics.pearson_r2_score),
    dc.metrics.Metric(dc.metrics.mean_absolute_error),
]
results = model.evaluate(test, metrics)
print(f"R2: {results['pearson_r2_score']:.3f}, MAE: {results['mean_absolute_error']:.3f}")
```

### Workflow 2: MoleculeNet Benchmark Comparison

**Goal**: Compare multiple models on a MoleculeNet benchmark dataset.

```python
import deepchem as dc
from sklearn.ensemble import RandomForestClassifier

# Load dataset with graph featurizer for the GNN model
tasks, datasets, transformers = dc.molnet.load_bbbp(
    featurizer="GraphConv", splitter="scaffold"
)
train, valid, test = datasets

metric = dc.metrics.Metric(dc.metrics.roc_auc_score)

# Model 1: Graph Convolutional Network
gcn = dc.models.GraphConvModel(n_tasks=1, mode="classification", dropout=0.2)
gcn.fit(train, nb_epoch=50)
gcn_score = gcn.evaluate(test, [metric])

# Model 2: Random Forest baseline (needs fingerprints, so reload with ECFP)
tasks_fp, datasets_fp, _ = dc.molnet.load_bbbp(featurizer="ECFP", splitter="scaffold")
train_fp, _, test_fp = datasets_fp

rf = dc.models.SklearnModel(
    model=RandomForestClassifier(n_estimators=500),
    model_dir="rf_model/"
)
rf.fit(train_fp)
rf_score = rf.evaluate(test_fp, [metric])

print(f"GCN ROC-AUC: {gcn_score['roc_auc_score']:.3f}")
print(f"RF ROC-AUC: {rf_score['roc_auc_score']:.3f}")
```

### Workflow 3: Transfer Learning Pipeline

**Goal**: Fine-tune a pretrained model on a small dataset.

1. Load pretrained ChemBERTa model (see Module 5 for code)
2. Prepare downstream dataset with `SmilesTokenizer` featurizer
3. Fine-tune with reduced learning rate (`1e-5` to `5e-5`) for 5-15 epochs
4. Evaluate on held-out scaffold split — expect gains over fingerprint baselines when training data < 1000 samples (a baseline sketch follows this list)
5. Save fine-tuned model: `model.save_checkpoint()`
6. See `references/workflows_model_catalog.md` Workflow 1 for complete hyperparameter optimization code
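For step 4, a fingerprint baseline on the same scaffold split gives the comparison point. A minimal sketch, reloading BBBP with ECFP (as in Workflow 2) so the split matches:

```python
import deepchem as dc

# Fingerprint baseline on the same scaffold split as the fine-tuned model
tasks, datasets, _ = dc.molnet.load_bbbp(featurizer="ECFP", splitter="scaffold")
train_fp, valid_fp, test_fp = datasets

baseline = dc.models.MultitaskClassifier(n_tasks=1, n_features=1024, dropouts=0.25)
baseline.fit(train_fp, nb_epoch=30)

roc = dc.metrics.Metric(dc.metrics.roc_auc_score)
print(f"ECFP baseline ROC-AUC: {baseline.evaluate(test_fp, [roc])}")
```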
## Key Parameters

| Parameter | Module | Default | Range / Options | Effect |
|-----------|--------|---------|-----------------|--------|
| `n_features` | MultitaskRegressor/Classifier | Required | Matches featurizer output | Input feature dimension |
| `layer_sizes` | MultitaskRegressor/Classifier | `[1000]` | `[256]` to `[1000, 500, 250]` | Hidden layer dimensions |
| `dropouts` | All neural models | `0.0` | `0.0`-`0.5` | Regularization strength |
| `learning_rate` | All neural models | `0.001` | `1e-5`-`0.01` | Training step size |
| `batch_size` | All neural models | `100` | `16`-`256` | Samples per gradient update |
| `nb_epoch` | `model.fit()` | `10` | `10`-`300` | Training iterations |
| `size` | `CircularFingerprint` | `2048` | `512`-`4096` | Fingerprint bit length |
| `radius` | `CircularFingerprint` | `2` | `2`-`4` | Substructure neighborhood radius |
| `graph_conv_layers` | `GraphConvModel` | `[64, 64]` | `[32]` to `[128, 128, 64]` | Graph convolution widths |
| `num_layers` | `AttentiveFPModel` | `2` | `1`-`5` | GNN message passing depth |
| `graph_feat_size` | `AttentiveFPModel` | `200` | `64`-`512` | Graph feature dimension |
| `splitter` | `dc.molnet.load_*()` | `"scaffold"` | `"scaffold"`, `"random"`, `"butina"` | Data splitting strategy |

## Best Practices

1. **Always use scaffold splitting for drug discovery**: Random splits leak structural information and overestimate performance. Scaffold splits test generalization to novel chemotypes.
2. **Normalize regression targets**: Apply `NormalizationTransformer(transform_y=True)` before training. Remember to `untransform()` predictions for interpretable values.
3. **Start with fingerprint baselines**: Train `MultitaskRegressor` + ECFP first. Only move to GNNs if the fingerprint baseline is insufficient — GNNs need more data and compute.
   ```python
   # Baseline first
   baseline = dc.models.MultitaskRegressor(n_tasks=1, n_features=2048)
   ```
4. **Match featurizer to model**: GNN models require graph featurizers (`ConvMolFeaturizer`, `MolGraphConvFeaturizer`). Fingerprint models need `CircularFingerprint`. Mixing the two causes silent errors.
5. **Anti-pattern -- Do not use random split for drug discovery benchmarks**: Results with `RandomSplitter` are not publishable for molecular property prediction. Reviewers expect scaffold or temporal splits.
6. **Handle missing labels in multi-task datasets**: Tox21 and many bioactivity datasets have missing values. DeepChem handles NaN labels automatically during training (masked loss), but verify with `np.isnan(dataset.y).sum()` (see the sketch after this list).
7. **Use early stopping via validation set**: Monitor validation loss to prevent overfitting, especially with GNN models (see `ValidationCallback` in Module 3).
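A quick check for item 6: in MoleculeNet loaders, missing entries are typically encoded as zero-weight samples rather than NaNs, so inspecting the weight matrix alongside `np.isnan` gives the full picture. A minimal sketch:

```python
import numpy as np
import deepchem as dc

tasks, datasets, _ = dc.molnet.load_tox21(featurizer="ECFP")
train, _, _ = datasets

# Zero-weight entries are masked out of the loss during training
n_missing = int((train.w == 0).sum())
print(f"NaN labels: {np.isnan(train.y).sum()}")
print(f"Zero-weight (masked) labels: {n_missing}/{train.w.size}")
```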
## Common Recipes

### Recipe: Hyperparameter Search

When to use: Optimize model performance before final evaluation.

```python
import deepchem as dc

tasks, datasets, transformers = dc.molnet.load_delaney(featurizer="ECFP")
train, valid, test = datasets

# Define parameter grid
params = {
    "n_features": [1024],
    "layer_sizes": [[500], [1000, 500], [1000, 500, 250]],
    "dropouts": [0.1, 0.25, 0.5],
    "learning_rate": [0.001, 0.0005],
}

optimizer = dc.hyper.GridHyperparamOpt(lambda **p: dc.models.MultitaskRegressor(**p))
metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)

best_model, best_params, all_results = optimizer.hyperparam_search(
    params, train, valid, metric, logdir="hyperparam_logs/"
)
print(f"Best params: {best_params}")
print(f"Best R2: {best_model.evaluate(test, [metric])}")
```

### Recipe: Save and Reload Models

When to use: Deploy trained models or resume training.

```python
# Save model checkpoint
model.save_checkpoint(model_dir="saved_model/")

# Reload model (construct with the same architecture, then restore weights)
loaded_model = dc.models.MultitaskRegressor(n_tasks=1, n_features=2048)
loaded_model.restore(model_dir="saved_model/")
predictions = loaded_model.predict(test)
```

### Recipe: Custom Metric

When to use: Evaluate models with domain-specific metrics.

```python
import deepchem as dc
import numpy as np

def enrichment_factor(y_true, y_pred, top_fraction=0.01):
    """Enrichment factor at top X% of ranked predictions."""
    n = len(y_true)
    n_top = max(int(n * top_fraction), 1)
    top_indices = np.argsort(y_pred.flatten())[-n_top:]
    hits_in_top = y_true.flatten()[top_indices].sum()
    expected = y_true.sum() * top_fraction
    return hits_in_top / expected if expected > 0 else 0.0

ef_metric = dc.metrics.Metric(enrichment_factor, mode="regression")
print(f"EF@1%: {model.evaluate(test, [ef_metric])}")
```

## Troubleshooting

| Problem | Cause | Solution |
|---------|-------|----------|
| `ModuleNotFoundError: torch` | PyTorch not installed | `pip install deepchem[torch]` for GNN models |
| `ValueError: n_features mismatch` | Featurizer output size does not match model `n_features` | Check `dataset.X.shape[1]` and set `n_features` accordingly |
| NaN loss during training | Learning rate too high or unnormalized targets | Apply `NormalizationTransformer`, reduce learning rate to `1e-4` |
| Low scaffold-split performance | Model memorizes scaffolds, not properties | Use more data, try GNN models, or add regularization (dropout 0.3-0.5) |
| `RuntimeError: CUDA out of memory` | Batch size too large for GPU | Reduce `batch_size` (32 or 16), or use CPU for small datasets |
| `FeaturizationError` on some SMILES | Invalid or complex SMILES strings | Pre-filter with RDKit: `Chem.MolFromSmiles(smi) is not None` |
| Model predicts constant values | Targets not normalized or too few epochs | Apply `NormalizationTransformer`, increase `nb_epoch` |
| Slow featurization | Large dataset with expensive featurizer | Use `CircularFingerprint` (fast) or process in shards via `create_dataset(..., shard_size=...)` |

## Bundled Resources

- **references/workflows_model_catalog.md** -- Extended workflows (hyperparameter optimization with full code, MolGAN generative models, materials property prediction with CGCNN/MEGNet, protein-ligand modeling, custom model architecture) plus complete model catalog (60+ models organized by category) and complete featurizer catalog (50+ featurizers). Covers: workflows 4-8 from original, extended model and featurizer inventories, MoleculeNet dataset catalog. Relocated inline: top 3 workflows (QSAR, MoleculeNet benchmark, transfer learning) are in Common Workflows; core model/featurizer tables are in Key Concepts.
  Omitted: detailed installation troubleshooting for TensorFlow 1.x (deprecated) and Docker-specific setup (covered by the official docs).

## Related Skills

- **rdkit-cheminformatics** -- molecular manipulation, fingerprints, substructure search (upstream featurization)
- **molfeat-molecular-featurization** -- 100+ featurizers with scikit-learn API (featurization-only alternative)
- **datamol-cheminformatics** -- Pythonic molecular processing (upstream data prep)
- **pytdc-therapeutics-data-commons** -- curated ADMET/DTI datasets with standardized splits (complementary data source)
- **torch-geometric-graph-neural-networks** -- lower-level PyG for custom GNN architectures (alternative for advanced users)
- **scikit-learn-machine-learning** -- classical ML baselines that DeepChem wraps via `SklearnModel`

## References

- [DeepChem documentation](https://deepchem.readthedocs.io/) -- official API docs and tutorials
- [DeepChem GitHub](https://github.com/deepchem/deepchem) -- source code, examples, issues
- [MoleculeNet benchmark paper](https://doi.org/10.1039/C7SC02664A) -- Wu et al. 2018, benchmark dataset descriptions
- [DeepChem tutorials](https://github.com/deepchem/deepchem/tree/master/examples/tutorials) -- Jupyter notebook tutorials